R: Selecting item samples while controlling for differences across multiple variables

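Problem description

I have a dataset containing a list of words, each belonging to one of 3 groups (A, H, or V), along with 2 continuous variables (word length and word frequency):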

mydata = structure(list(word = c("elastisch","rutschig","verklebt","dumpf","hallend","formbar","gelb","braun","blond","klebrig","blass","blendend","schlaff","bunt","singend","lauwarm","strahlend","biegsam","durchsichtig","verbal","erleuchtet","schrill","erloschen","dehnbar","beige","farbig","gepunktet","heiser","musikalisch","schweigend","schreiend","schwer","transparent","flackernd","blinkend","stumpf","gedimmt","lautlos","gefleckt","pappig","feucht","stumm","eisig","taub","steif","weich","leise","kalt","fein","laut","warm","still"),group = c("H","H","A","V","A"),length = c(9L,8L,5L,7L,4L,9L,12L,6L,10L,11L,5L),frequency = c(1.114,1.519,1.176,0.903,1.079,2.328,2.305,2.255,1.716,2.199,1.944,1.505,1.724,1.146,1.699,1.255,1.633,1.204,1.591,1.23,1.531,1.041,1.447,1.477,1.544,1.845,3.72,1,0.699,1.756,0.301,1.982,0.477,2.241,2.064,1.431,2.718,2.236,2.651,2.877,3.311,2.838,3.333,2.937,3.435)),class = "data.frame",row.names = c(NA,-52L))

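I now need to select a subsample of 5 items from each group (A, V, and H) so that the differences in length and frequency between the 3 new subsamples are as small as possible, ideally not statistically significant. I usually do this by hand, which takes a lot of time. Is there a way to automate this process? Thanks for any hints or ideas.

Solution

Well, one inelegant brute-force approach, suggested above by Andy Eggers, is to draw random samples until the condition is met, for example: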

library(dplyr) ## provides %>%, group_by() and slice_sample()

max_cycles = 1000 ## after how many cycles the script should stop if no solution found
cycle = 0

repeat {
  
  cycle = cycle + 1
  print(cycle)
  
  subsample = mydata %>% group_by(group) %>% slice_sample(n = 5) %>% ungroup() ## how many items should be selected from each group
  
  res.aov.freq <- aov(frequency ~ group, data = subsample)
  res.aov.freq.p = anova(res.aov.freq)$"Pr(>F)"[1] ## save ANOVA p-value for frequency
  
  res.aov.len <- aov(length ~ group, data = subsample)
  res.aov.len.p = anova(res.aov.len)$"Pr(>F)"[1] ## save ANOVA p-value for length
  
  cond = (res.aov.freq.p > .05) &&
    (res.aov.len.p > .05) ## set required p-values for both variables

  ## isTRUE() treats a missing p-value as "condition not met" instead of raising an error
  if (isTRUE(cond) || cycle >= max_cycles) {
    
    break
    
  }
}
## if the loop stopped at max_cycles, subsample is only the last draw and may not satisfy cond
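
A variation on the loop above, as a minimal sketch: rather than stopping at the first draw where both p-values exceed .05, it scores a fixed number of random draws and keeps the one whose smaller p-value is largest. The toy data frame and the helper names (draw_subsample, anova_p, best_match) are assumptions made for illustration and are not part of the original answer. Base R is used in place of dplyr so that the block runs on its own.

```r
# Hypothetical toy data standing in for mydata (simulated values, not from the post)
set.seed(1)
toy <- data.frame(
  word      = paste0("w", 1:45),
  group     = rep(c("A", "H", "V"), each = 15),
  length    = sample(4:12, 45, replace = TRUE),
  frequency = round(runif(45, 0.3, 3.5), 3)
)

# Draw n rows per group without replacement
draw_subsample <- function(data, n = 5) {
  rows <- unlist(lapply(split(seq_len(nrow(data)), data$group),
                        function(idx) idx[sample.int(length(idx), n)]))
  data[rows, ]
}

# One-way ANOVA p-value for one variable across the groups
anova_p <- function(data, variable) {
  fit <- aov(reformulate("group", response = variable), data = data)
  anova(fit)[["Pr(>F)"]][1]
}

# Score n_draws random subsamples and keep the one whose
# smaller p-value (over frequency and length) is largest
best_match <- function(data, n = 5, n_draws = 500) {
  best <- NULL
  best_score <- -Inf
  for (i in seq_len(n_draws)) {
    candidate <- draw_subsample(data, n)
    score <- min(anova_p(candidate, "frequency"),
                 anova_p(candidate, "length"))
    if (!is.na(score) && score > best_score) {
      best <- candidate
      best_score <- score
    }
  }
  list(sample = best, score = best_score)
}

result <- best_match(toy)
```

Here result$sample holds the selected rows and result$score the smaller of the two p-values. A high p-value only indicates that no difference was detected, so it is worth also checking the group means of the chosen items directly.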

It turns out there is a dedicated R package for this problem (LexOPS): https://jackedtaylor.github.io/LexOPSdocs/index.html