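Find the variables that contribute most to the group variance

Problem description

I have two groups with different numbers of observations. Here is a toy example: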

obs1_1 <- c(1,13,5,2,6,7,2)
obs2_1 <- c(0,1,4,3)
obs3_1 <- c(2,10,8,2)
obs4_1 <- c(1,2)
obs5_1 <- c(0,3)

group1 <- data.frame(obs1_1,obs2_1,obs3_1,obs4_1,obs5_1)
rownames(group1) <- c("var1","var2","var3","var4","var5","var6","var7","var8","var9","var10")

obs1_2 <- c(11,11,2)
obs2_2 <- c(11,1)
obs3_2 <- c(11,2)
obs4_2 <- c(11,2)
obs5_2 <- c(10,3,3)
obs6_2 <- c(10,3)

group2 <- data.frame(obs1_2,obs2_2,obs3_2,obs4_2,obs5_2,obs6_2)
rownames(group2) <- c("var1","var11","var12","var13","var14")

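All values are continuous, and I do not think they are normally distributed. Also, not all variables are present in both groups. I am thinking of using the Wilcoxon rank-sum test to analyze the differences between the groups, but I am confused about how to do this in R because I have so many variables.

How can I find the variables that contribute most to the group variance? In this toy example, variables var1, var3 and var4 differ the most between the groups, so I suppose they are what drives the difference between the groups, together with the variables that appear in only one group.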

Solution

You can combine the two data frames and run a loop like this:

library(dplyr)

# Transposing the dataframes

g1 <- as.data.frame(t(group1))
g2 <- as.data.frame(t(group2))

# Melting to long format, with one variable column and one value column

g1 <- reshape2::melt(g1,measure = colnames(g1))
g1$group <- "g1"

g2 <- reshape2::melt(g2,measure = colnames(g2))
g2$group <- "g2"


# Binding both groups

df <- rbind(g1,g2)

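# Creating an empty list for output

wsr <- list()

for (i in unique(df$variable)) {
  d <- df %>% filter(variable == i)
  # Use if/else rather than ifelse(): ifelse() would keep only the first
  # element of the test result. NA marks variables seen in only one group.
  wsr[[i]] <- if (length(unique(d$group)) == 2) wilcox.test(value ~ group, data = d) else NA
}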

# If you just need the p-value

wsr_p <- data.frame(matrix(nrow = 0,ncol = 2))

for (i in unique(df$variable)) {
  d <- filter(df,variable == i)
  # Run the test only when the variable is present in both groups; otherwise NA
  s <- ifelse(length(unique(d$group)) == 2,wilcox.test(value~group,data = d)[["p.value"]],NA)
  wsr_p <- rbind(wsr_p,data.frame(variable = i,p = s))
}

wsr_p

   variable           p
1      var1 0.006349039
2      var2          NA
3      var3 0.004390260
4      var4 0.007135044
5      var5 0.847768887
6      var6          NA
7      var7 0.156039402
8      var8 0.516063755
9      var9          NA
10    var10          NA
11    var11          NA
12    var12          NA
13    var13          NA
14    var14          NA
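
The same per-variable test can also be written without an explicit loop, and with many variables it may be worth adjusting the p-values for multiple testing before ranking them. The following is only a minimal base R sketch under stated assumptions: the data frame is invented for illustration (it is not the data from the question), `p_one` is a hypothetical helper name, and the Holm adjustment via `p.adjust()` is an addition that the answer above does not use.

```r
# Hypothetical long-format data: one row per (variable, group, observation).
# These numbers are invented for illustration only.
df <- data.frame(
  variable = rep(c("varA", "varB", "varC"), times = c(8, 8, 4)),
  group    = c(rep(c("g1", "g2"), each = 4),
               rep(c("g1", "g2"), each = 4),
               rep("g1", 4)),
  value    = c(1.1, 2.3, 3.2, 4.5, 10.4, 11.6, 12.8, 13.9,  # varA: groups well separated
               1.2, 5.1, 3.3, 7.4, 2.6, 6.5, 4.7, 8.8,      # varB: groups overlap
               1.5, 2.5, 3.5, 4.5)                           # varC: present only in g1
)

# Hypothetical helper: Wilcoxon rank-sum p-value for one variable's rows,
# or NA when the variable was observed in only one group.
p_one <- function(d) {
  if (length(unique(d$group)) < 2) return(NA_real_)
  wilcox.test(value ~ group, data = d)$p.value
}

# Split the rows by variable and test each subset.
p_raw <- vapply(split(df, df$variable), p_one, numeric(1))

res <- data.frame(
  variable = names(p_raw),
  p        = unname(p_raw),
  p_holm   = p.adjust(p_raw, method = "holm")  # NA entries stay NA and are not counted
)

# Smallest p-values first; variables that could not be tested end up last.
res <- res[order(res$p), ]
rownames(res) <- NULL
res
```

Sorting by `p` puts the variables with the strongest evidence of a group difference at the top, while the `NA` rows at the bottom are the variables found in only one group, which the question also wants to flag.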