Efficiently sampling lottery numbers in R

Question

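I want to write a function that draws n lottery tickets, each consisting of 6 numbers from 1 to 45, sampled without replacement. I need this to be efficient, which means no loops or loop-like functions. (I suppose Rcpp would be fine too, but I would prefer a vectorized solution in base R.)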

An inefficient solution:

lottery_inef <- function(n){
  
 t(replicate(n,sample(1:45,6)))
}

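This gives me a matrix in which each row corresponds to one ticket. It becomes slow once I simulate millions of tickets, which is why I am interested in a vectorized solution.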

My idea was:

lottery_ef <- function(n){
  
  m <- matrix(sample(1:45,n*6,replace = TRUE),ncol = 6)
  
  # somehow subset the matrix without a loop to remove all the 
  # rows that have non-unique values as in the lottery we can only draw each number once
}

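For the efficient version, I am a bit lost on how to do the subsetting without a loop or apply(). I would appreciate it if someone could solve this subsetting problem or point me in a completely different direction that leads to a solution.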
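For reference, the subsetting step described above can be sketched without a row-wise loop by comparing every pair of columns at once. This is only a minimal sketch: the name lottery_reject and the oversample argument are hypothetical, and the factor 1.5 is an assumption based on roughly 71% of with-replacement rows (45*44*43*42*41*40 / 45^6) having six distinct numbers.

```r
# Sketch: draw with replacement, then reject rows that contain a repeated number.
lottery_reject <- function(n, oversample = 1.5) {
  k <- ceiling(n * oversample) + 10L          # draw more rows than needed
  m <- matrix(sample.int(45L, k * 6L, replace = TRUE), ncol = 6)

  pairs <- combn(6, 2)                        # the 15 column pairs (2 x 15 matrix)
  # k x 15 logical matrix: TRUE where the two columns of a pair are equal
  same <- m[, pairs[1, ], drop = FALSE] == m[, pairs[2, ], drop = FALSE]
  m <- m[rowSums(same) == 0, , drop = FALSE]  # keep rows with 6 distinct values

  if (nrow(m) < n) {
    # Rare shortfall: top up with a further batch
    m <- rbind(m, lottery_reject(n - nrow(m), oversample))
  }
  m[seq_len(n), , drop = FALSE]
}
```

Because acceptance depends only on the row itself, the kept rows should follow the same distribution as sample(1:45, 6); the cost is that about 29% of the drawn numbers are thrown away.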

Answers

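replicate doesn't actually do very well at this scale. With just-in-time compilation (which R has had for several years now), a for loop can be faster, especially when we can preallocate the data structure exactly. We can also avoid the call to t():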

lottery_inef <- function(n){
 t(replicate(n,sample(1:45,6)))
}

lottery_preall <- function(n){
  m = matrix(NA_integer_,nrow = n,ncol = 6)
  for(i in 1:n) {
    m[i,] = sample.int(45L,size = 6)
  }
  m
}

nn = 1e6
microbenchmark::microbenchmark(
  lottery_inef(nn),lottery_preall(nn),times = 2
)
# Unit: seconds
#                expr      min       lq     mean   median       uq      max neval
#    lottery_inef(nn) 9.400862 9.400862 9.571756 9.571756 9.742649 9.742649     2
#  lottery_preall(nn) 4.948216 4.948216 5.454482 5.454482 5.960749 5.960749     2

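replicate accumulates the results in a list, then has to check the dimensions of every element before it can simplify them to a matrix, and the result still has to be transposed. A preallocated integer matrix skips all of this overhead, which gives a speedup of roughly 2x.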

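We could also compare against something like vapply (a quick test showed vapply to be slightly slower than the loop), but I think that to gain real speed you need to run in parallel. That would be a good option here and could give a speedup nearly equal to the number of cores used.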

sample.int is little more than a call to C code, so Rcpp probably won't do much better. I think parallelization is your best bet for more speed.
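As a rough illustration of the parallel route, the preallocated loop can be split into chunks with the base parallel package. This is a sketch under assumptions, not a benchmarked implementation: draw_chunk and lottery_par are hypothetical names, and mclapply relies on forking, so mc.cores greater than 1 is not supported on Windows (parLapply would be the alternative there).

```r
library(parallel)

# Same preallocated loop as lottery_preall, with seq_len() so n = 0 is safe.
draw_chunk <- function(n) {
  m <- matrix(NA_integer_, nrow = n, ncol = 6)
  for (i in seq_len(n)) {
    m[i, ] <- sample.int(45L, size = 6)
  }
  m
}

lottery_par <- function(n, cores = 2L) {
  # Split n into one chunk size per core; the first chunk takes the remainder.
  sizes <- rep(n %/% cores, cores)
  sizes[1] <- sizes[1] + n %% cores

  # Each forked worker fills its own matrix; the results are stacked at the end.
  chunks <- mclapply(sizes, draw_chunk, mc.cores = cores)
  do.call(rbind, chunks)
}
```

If reproducible results matter, selecting RNGkind("L'Ecuyer-CMRG") before set.seed() is the route described in the parallel package documentation; with the default generator the workers are seeded differently on each run.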

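Since generating every combination of a set this size takes only a few seconds, it may be worth doing that once and then sampling the "tickets" from it. Below, I use sample() to generate one million row indices (both with and without replacement) and then use bracket subsetting on the full set to produce the tickets.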

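If you need to do this often, or at different times, it is best to save the full set of combinations rather than regenerate it each time. Almost all of the processing time goes into generating the full set; after that, selecting "tickets" is fast.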
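One way to keep the full set between sessions is saveRDS()/readRDS(). The sketch below uses a small stand-in (3 numbers out of 10) so that it runs instantly; the file location and the object names are assumptions for illustration only.

```r
# Small stand-in for t(combn(1:45, 6)) so the example runs instantly.
combos <- t(combn(1:10, 3))

# Hypothetical cache location; any writable path works.
path <- file.path(tempdir(), "lotto_all.rds")
saveRDS(combos, path)

# In a later session: load the cached set instead of regenerating it.
combos_cached <- readRDS(path)

# Draw 5 distinct "tickets" by sampling row indices without replacement.
idx <- sample.int(nrow(combos_cached), size = 5)
tickets <- combos_cached[idx, , drop = FALSE]
```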

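The timings show that creating all combinations takes about 6 seconds, drawing one million indices takes about 0.2 seconds, and bracket-subsetting one million rows takes about 0.1 seconds.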

set.seed(2)

tictoc::tic() #included for timing

# All possible lotto combinations as matrix,1 per row
lotto_all <- t(combn(1:45,6))

tictoc::toc() #included for timing
#> 5.899 sec elapsed

# A look at the data:
head(lotto_all)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    2    3    4    5    6
#> [2,]    1    2    3    4    5    7
#> [3,]    1    2    3    4    5    8
#> [4,]    1    2    3    4    5    9
#> [5,]    1    2    3    4    5   10
#> [6,]    1    2    3    4    5   11

# Getting index (row) numbers for our 'tickets' with & without replacement
tictoc::tic()
sample_indices_no_replacement <- sample(1:nrow(lotto_all),size = 1e6,replace = F)
tictoc::toc()
#> 0.178 sec elapsed

sample_indices_w_replacement <- sample(1:nrow(lotto_all),size = 1e6,replace = T)

# The number combinations of our 'tickets'
tictoc::tic()
sample_tickets_no_rep <- lotto_all[sample_indices_no_replacement,]
tictoc::toc()
#> 0.097 sec elapsed

sample_tickets_rep <- lotto_all[sample_indices_w_replacement,]

# A look at the sample tickets:
head(sample_tickets_no_rep)
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    8   12   14   31   34   44
#> [2,]    6   10   16   26   32   36
#> [3,]    3    4   10   15   41   43
#> [4,]    2    3    5   17   33   36
#> [5,]    7   17   24   25   35   40
#> [6,]   32   33   34   36   39   43

# See that there are some duplicates using replacement = T
length(unique(sample_indices_no_replacement))
#> [1] 1000000
length(unique(sample_indices_w_replacement))
#> [1] 941309

Created on 2020-10-27 by the reprex package (v0.3.0)

Since efficiency is the focus, two packages come to mind: arrangements and RcppAlgos*.

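Before we begin, note that with sample we have no control over the uniqueness of the results. Every draw comes from a uniform distribution, so the same permutation can be drawn more than once. Using the functions provided by @Gregor, we can see this: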

set.seed(42)
system.time(a <- lottery_inef(1e6))
 user  system elapsed 
7.640   0.345   7.984

sum(duplicated(a))
[1] 86

set.seed(42)
system.time(b <- lottery_preall(1e6))
 user  system elapsed 
3.673   0.256   3.929

sum(duplicated(b))
[1] 86
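A count in this range is what a birthday-problem estimate suggests. As a rough check (a sketch that assumes each ticket is an independent, uniformly drawn ordered 6-tuple; the variable names are illustrative), the expected number of repeated rows among n draws from N possibilities is about n(n - 1) / (2N):

```r
n <- 1e6
N <- prod(45:40)   # 45*44*43*42*41*40 = 5864443200 ordered draws

# Pairwise (birthday-problem) approximation to the expected repeats
approx_dups <- n * (n - 1) / (2 * N)

# Expected number of distinct rows, computed with log1p/expm1 for accuracy
expected_unique <- -N * expm1(n * log1p(-1 / N))
exact_dups <- n - expected_unique

round(c(approx = approx_dups, exact = exact_dups), 2)
```

Both figures come out at about 85 repeated rows per million tickets, in line with the 86 observed above.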

The arrangements package is much faster, but we still see the same behavior:

set.seed(42)
system.time(arng <- arrangements::permutations(45,6,nsample = 1e6))
  user  system elapsed 
 0.761   0.021   0.785 

sum(duplicated(arng))
[1] 108

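With the RcppAlgos package, as long as the number of requested results is smaller than the total number of possible results (more than 5 billion in our case), the results are guaranteed to be unique: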

RcppAlgos::permuteCount(45,6)
[1] 5864443200

system.time(algosSer <- RcppAlgos::permuteSample(45,6,n = 1e6,seed = 42))
 user  system elapsed 
0.560   0.001   0.561

sum(duplicated(algosSer))
[1] 0
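How permuteSample arrives at unique rows is not shown here. For comparison, a base R sketch can reach the same guarantee by different means: draw, drop the repeated rows, and top up until n unique tickets remain. The names draw_tickets and unique_tickets are hypothetical, and this is a plain rejection approach rather than the package's method.

```r
# Plain preallocated draw, as in lottery_preall above.
draw_tickets <- function(n) {
  m <- matrix(NA_integer_, nrow = n, ncol = 6)
  for (i in seq_len(n)) {
    m[i, ] <- sample.int(45L, size = 6)
  }
  m
}

unique_tickets <- function(n) {
  m <- draw_tickets(n)
  repeat {
    m <- m[!duplicated(m), , drop = FALSE]   # drop repeated rows
    shortfall <- n - nrow(m)
    if (shortfall == 0L) break
    m <- rbind(m, draw_tickets(shortfall))   # redraw only what is missing
  }
  m
}
```

With roughly 85 repeats expected per million draws, the top-up step should rarely need more than one or two passes.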

We can also use multiple threads via the nThreads argument for a further speedup.

system.time(algosPar <- RcppAlgos::permuteSample(45,6,n = 1e6,seed = 42,nThreads = 4))
 user  system elapsed 
0.574   0.001   0.280

## Results are the same as the serial version
identical(algosPar,algosSer)
[1] TRUE

* I am the author of RcppAlgos.
