R: generate a column of random 1s and 0s with constraints

Question

I have a dataset with 500 observations. I would like to randomly generate 1s and 0s under two scenarios.

Current dataset

  Id     Age    Category   
  1      23     1
  2      24     1
  3      21     2
  .      .      .
  .      .      .
  .      .      .
500      27     3

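Scenario 1

The total number of 1s should be 200, and they should be assigned at random. The remaining 300 should be 0.

Scenario 2

The total number of 1s should be 200. The remaining 300 should be 0.
- 40% of the 1s should be in Category 1, i.e. 80 of the 1s should belong to Category 1
- 40% of the 1s should be in Category 2, i.e. 80 of the 1s should belong to Category 2
- 20% of the 1s should be in Category 3, i.e. 40 of the 1s should belong to Category 3

Expected output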

  Id     Age    Category  Indicator  
  1      23     1         1
  2      24     1         0
  3      21     2         1
  .      .      .
  .      .      .
  .      .      .
500      27     3         1

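I know that the function sample(c(0,1),500) will generate 1s, but I don't know how to make it randomly generate exactly 200 1s. I'm also not sure how to randomly generate 80 1s in Category 1, 80 1s in Category 2, and 40 1s in Category 3.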

Answers

Here is a complete worked example.

Suppose your data looks like this:

set.seed(69)

df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

head(df)
#>   id Age Category
#> 1  1  21        2
#> 2  2  22        2
#> 3  3  28        3
#> 4  4  27        2
#> 5  5  27        1
#> 6  6  26        2

You didn't mention how many rows there are in each category, so let's check how many there are in this sample:

table(df$Category)

#>   1   2   3 
#> 153 179 168
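Before assigning fixed quotas of 1s per category, it can help to confirm that every category has enough rows to hold its quota. The following is a minimal sketch, assuming the same simulated data as above; the names `targets`, `available`, and `feasible` are introduced here for illustration only.

```r
# Simulated stand-in data with the same shape as the example data frame.
set.seed(69)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Required number of 1s per category in scenario 2 (taken from the question).
targets <- c("1" = 80, "2" = 80, "3" = 40)

# Rows available per category, in the same order as `targets`.
available <- table(factor(df$Category, levels = names(targets)))

# TRUE for each category that has at least as many rows as its quota.
feasible <- as.vector(available) >= targets
feasible
```

If any entry of `feasible` were FALSE, sampling that many 1s without replacement within the category would not be possible.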

Scenario 1 is straightforward. Create a vector of 500 zeros, then write a 1 at 200 randomly sampled indices of that vector:

df$label <- numeric(nrow(df))
df$label[sample(nrow(df),200)] <- 1

head(df)
#>   id Age Category label
#> 1  1  21        2     1
#> 2  2  22        2     1
#> 3  3  28        3     0
#> 4  4  27        2     0
#> 5  5  27        1     0
#> 6  6  26        2     1

So we have random zeros and ones, and when we count them we get:

table(df$label)
#> 
#>   0   1 
#> 300 200

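Scenario 2 is similar but a little more involved, because we need to perform the same kind of operation groupwise, by category:

df$label <- numeric(nrow(df))
df <- do.call("rbind",lapply(split(df,df$Category),function(d) {
  # Category %/% 3 is 0 for categories 1 and 2 and 1 for category 3,
  # so the share of 1s is 0.4 for categories 1 and 2 and 0.2 for category 3
  n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
  d$label[sample(nrow(d),n_ones)] <- 1 
  d
  }))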

head(df)
#>      id Age Category label
#> 1.5   5  27        1     0
#> 1.10 10  24        1     0
#> 1.13 13  23        1     1
#> 1.19 19  24        1     0
#> 1.26 26  22        1     1
#> 1.27 27  24        1     1

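Since the number of rows in each category isn't neatly divisible by 10, we can't get exactly 40% and 20% (though you may be able to with your own data), but we get as close as possible, as demonstrated here. Note that this code applies the percentages to the rows within each category, so the total number of 1s comes to 61 + 72 + 34 = 167 rather than 200.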

label_table <- table(df$Category,df$label)
label_table   
#>       0   1
#>   1  92  61
#>   2 107  72
#>   3 134  34

apply(label_table,1,function(x) x[2]/sum(x))
#>         1         2         3 
#> 0.3986928 0.4022346 0.2023810
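The counts in the table follow directly from rounding each category size times its rate. A small sketch of that arithmetic, using the category sizes printed earlier; the names `sizes` and `rates` are illustrative and not part of the original answer:

```r
# Category sizes taken from the table(df$Category) output above.
sizes <- c("1" = 153, "2" = 179, "3" = 168)

# Share of rows labelled 1 within each category, as used by the groupwise code.
rates <- c("1" = 0.4, "2" = 0.4, "3" = 0.2)

# 153 * 0.4 = 61.2, 179 * 0.4 = 71.6, 168 * 0.2 = 33.6, each rounded to nearest.
n_ones <- round(sizes * rates)
n_ones        # 61 72 34
sum(n_ones)   # 167
```

To get exactly 80, 80, and 40 as the question asks, the counts would have to be fixed directly instead of derived from a within-category rate.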

Created on 2020-08-12 by the reprex package (v0.3.0)

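Another way to fill in the random values is to create a vector of possible values (80 values of 1 and nrow - 80 values of 0) and then sample from those possible values. This may use a little more memory than setting values by index, but the vector of candidate values is so small that the difference is usually negligible.

set.seed(42)

df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))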

## In Tidyverse

library(tidyverse)

set.seed(42)

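df2 <- df %>%
  group_by(Category) %>%
  mutate(Label = case_when(
    # within each group, shuffle a vector holding the required number of 1s
    Category == 1 ~ sample(
      c(rep(1, 80), rep(0, n() - 80)), n()
    ),
    Category == 2 ~ sample(
      c(rep(1, 80), rep(0, n() - 80)), n()
    ),
    Category == 3 ~ sample(
      c(rep(1, 40), rep(0, n() - 40)), n()
    )
  ))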

table(df2$Category,df2$Label)

#     0   1
# 1  93  80
# 2  82  80
# 3 125  40

## In base

df3 <- df

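df3[df$Category == 1, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 1, ]) - 80)),
  nrow(df[df$Category == 1, ])
)
df3[df$Category == 2, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 2, ]) - 80)),
  nrow(df[df$Category == 2, ])
)
df3[df$Category == 3, "Label"] <- sample(
  c(rep(1, 40), rep(0, nrow(df[df$Category == 3, ]) - 40)),
  nrow(df[df$Category == 3, ])
)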

table(df3$Category,df3$Label)

#     0   1
# 1  93  80
# 2  82  80
# 3 125  40
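The three near-identical assignments could plausibly be collapsed by looking up each category's quota in a named vector. This is only a sketch under assumptions: it uses base R's `ave()` in place of the repeated subsetting, and the names `targets` and `df4` are hypothetical.

```r
set.seed(42)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Quota of 1s per category (assumed from the question: 80, 80, 40).
targets <- c("1" = 80, "2" = 80, "3" = 40)

df4 <- df
# ave() applies FUN to the values of each category and puts the results
# back in the original row order.
df4$Label <- ave(df4$Category, df4$Category, FUN = function(g) {
  k <- targets[[as.character(g[1])]]                   # quota for this category
  sample(rep(c(1, 0), times = c(k, length(g) - k)))    # shuffle k ones among the group
})

# Number of 1s per category.
tapply(df4$Label, df4$Category, sum)
```

Changing the quotas then only requires editing `targets`, at the cost of a slightly less explicit loop.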

For scenario 1, create a vector containing 300 zeros and 200 ones, then sample from it without replacement.

pull_from = c(rep(0,300),rep(1,200))

sample(pull_from,replace = FALSE)

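For scenario 2, I suggest splitting the data into 3 separate chunks by category, repeating the steps above with different numbers of zeros and ones for each chunk, and then recombining the chunks into a single data frame.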
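The split, sample, and recombine steps described above might look like the following sketch. The simulated data, the `n_ones` lookup vector, and the `Indicator` column name (borrowed from the expected output in the question) are assumptions made for illustration.

```r
set.seed(1)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Number of 1s wanted in each category (from the question).
n_ones <- c("1" = 80, "2" = 80, "3" = 40)

# 1. Split into one chunk per category.
chunks <- split(df, df$Category)

# 2. In each chunk, shuffle a vector with the required zeros and ones.
chunks <- lapply(names(chunks), function(key) {
  d <- chunks[[key]]
  k <- n_ones[[key]]
  pull_from <- c(rep(0, nrow(d) - k), rep(1, k))
  d$Indicator <- sample(pull_from, nrow(d), replace = FALSE)
  d
})

# 3. Recombine and restore the original row order.
result <- do.call(rbind, chunks)
result <- result[order(result$id), ]

table(result$Category, result$Indicator)
```

Sorting by `id` at the end is optional; it only undoes the grouping by category that `rbind` leaves behind.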
