Problem description
I want to compare nested models using the anova function in R. My dataset:
structure(list(Gene = c("ID-1","ID-1","ID-4","ID-5","ID-6","ID-7","ID-7"),mRNA = c(-0.181385669,-0.059647494,0.104476117,-0.052190978,-0.040484945,0.194226742,-0.501601326,0.102342605,-0.127143845,-0.008523742,-0.102946211,-0.042894028,0.002922923,-0.134394347,-0.214204393,-0.138122686,0.203242361,0.097935502,0.147068146,-0.089430917,0.331565412,-0.034572422,-0.129896329,0.324191,0.470108479,-0.027268223,0.232304713,0.090348708,0.070848402,0.181540708,-0.502255367,-0.267631441,-0.368647839,-0.040910404,-0.003983171,-0.14980589,-0.119449612,-0.309154214,-0.487589361,0.272803506,-0.421733575,-0.467108567,0.024868338,-0.156025729,-0.044680175,-0.206716896,-0.272014193,-0.230499883,-0.238597397,-0.118130949,0.349957464,0.172048587,-0.186226994,0.16113822,-0.293029136,-0.111636253,-0.044189887,0.081555274,-0.048106079,-0.05853566,0.010407814,-0.066981809,-0.09828484,-0.315190986,-0.005102456,0.221556197,0.206584568,0.102649006,-0.011777384,-0.36963487,-0.054853074,-0.230240699,-0.210508323,-0.208889919,-0.050763372,0.023073782,-0.095118984,-0.091076071,-0.330257395,0.102772933,0.247872038,0.216357646,0.126169901,-0.237278842,-0.066908278,0.105082639,NA,-0.050061512,-0.143484352),Time = c(20L,20L,40L,60L,120L,0L,NA),Condition = c("Irradiated","Irradiated","reference","reference")),class = "data.frame",row.names = c(NA,-95L))
My code:
model1 <- lm(mRNA ~ Time,data=GenemRNATimeCondition)
model2 <- lm(mRNA ~ Time + Gene,data=GenemRNATimeCondition)
model3 <- lm(mRNA ~ Time + Gene + Condition,data=GenemRNATimeCondition)
anova_df <- anova(model1,model2,model3)
anova_df[,"model"] <- c("Time","Time+Gene","Time+Gene+Condition")
anova_df
anova(model1,model3)
Fitting model 3 produces this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
When I run
anova_df <- anova(model1,model3)
I get this error:
Error in anova.lmlist(object, ...) :
models were not all fitted to the same size of dataset
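I know that the rows with the reference value in the Condition column have NA in the Time column, but I do not understand why that is a problem (if it is one). I hope you can help me understand this in a straightforward way, ideally from a statistical point of view as well.
Solution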
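The first error tells you that a factor has fewer than two levels left when the model is fitted, either because the variable only ever had one level or because the rows holding its other levels were dropped for containing missing values. For example, if every row for a particular level has a missing value in another model variable, all rows for that level are removed, no term can be estimated for it, and the error is raised. From your description this appears to be what happens here: the reference rows have NA in Time, so once those rows are dropped, Condition is left with a single level.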
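A minimal sketch of how to check this, assuming a small made-up data frame (`toy` is hypothetical and only mimics the structure described in the question, where every reference row has NA in Time). It counts how many distinct values each categorical predictor keeps after the incomplete rows are removed, which is what lm does by default.

```r
# Hypothetical toy data: every "reference" row has Time = NA
toy <- data.frame(
  Gene      = rep(c("ID-1", "ID-2"), each = 4),
  mRNA      = c(-0.18, -0.06, 0.10, -0.05, -0.04, 0.19, -0.50, 0.10),
  Time      = c(20L, 40L, NA, NA, 20L, 40L, NA, NA),
  Condition = rep(c("Irradiated", "Irradiated", "reference", "reference"), 2)
)

# lm() drops incomplete rows by default (na.action = na.omit),
# so look at what is left after doing the same thing by hand
used <- na.omit(toy[, c("mRNA", "Time", "Gene", "Condition")])

# Number of distinct values remaining in each categorical predictor
n_levels <- sapply(used[, c("Gene", "Condition")],
                   function(x) length(unique(x)))
n_levels  # Condition is down to a single level

# Fitting the full model on these data fails; capture the message
err <- tryCatch(
  {
    lm(mRNA ~ Time + Gene + Condition, data = toy)
    NULL
  },
  error = function(e) conditionMessage(e)
)
err
```

Any predictor that shows a count of 1 here cannot have a contrast built for it, so it has to be removed from the formula or the missing values have to be handled first.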
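The second error is related. Each model uses a different set of variables, so a different number of rows is dropped in each one and the models end up being estimated on different subsamples. That is a problem when comparing them: anova compares residual sums of squares, and those are only comparable when they are computed on the same observations.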
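To see the mismatch directly, one option is to compare the number of observations each fitted model actually used. The sketch below assumes a made-up data frame (`toy2`, hypothetical) in which only Condition has missing values, so the smaller model keeps more rows than the larger one.

```r
# Hypothetical toy data: Condition is missing in two rows
toy2 <- data.frame(
  mRNA = c(-0.18, -0.06, 0.10, -0.05, -0.04, 0.19, -0.50, 0.10, -0.13, -0.01),
  Time = c(0, 20, 40, 60, 120, 0, 20, 40, 60, 120),
  Condition = c("Irradiated", "reference", "Irradiated", "reference", NA,
                "Irradiated", "reference", NA, "Irradiated", "reference")
)

m_small <- lm(mRNA ~ Time, data = toy2)              # uses all 10 rows
m_big   <- lm(mRNA ~ Time + Condition, data = toy2)  # drops the 2 NA rows

# Rows actually used by each fit
c(small = nobs(m_small), big = nobs(m_big))

# anova() refuses to compare fits based on different rows
cmp_err <- tryCatch(
  {
    anova(m_small, m_big)
    NULL
  },
  error = function(e) conditionMessage(e)
)
cmp_err
```

If the counts returned by nobs differ, the models are not nested fits on the same data and the F-test is not meaningful.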
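In short, both errors come from the missing values. Deal with them before you continue (for example, by restricting every model to the same complete rows), or use a different approach.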
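One possible way to do this is a complete-case analysis: keep only the rows that are complete for every variable used by the largest model, then fit all models on that subset. The sketch below uses made-up data (`toy3` is hypothetical, and the variable names simply follow the question). Whether discarding incomplete rows is appropriate depends on why the values are missing, so treat this as an illustration rather than a recommendation for your data.

```r
# Hypothetical toy data with NAs scattered over mRNA and Time
toy3 <- data.frame(
  Gene      = rep(c("ID-1", "ID-2"), each = 6),
  Condition = rep(c("Irradiated", "reference"), times = 6),
  Time      = c(0, 20, 40, 60, 120, NA, 0, 20, 40, 60, NA, 120),
  mRNA      = c(-0.18, -0.06, NA, -0.05, -0.04, 0.19,
                -0.50, 0.10, -0.13, -0.01, -0.10, -0.04)
)

# Keep only rows that are complete for every variable of the largest model
vars     <- c("mRNA", "Time", "Gene", "Condition")
complete <- toy3[complete.cases(toy3[, vars]), ]

# All three nested models now see exactly the same rows
fit1 <- lm(mRNA ~ Time, data = complete)
fit2 <- lm(mRNA ~ Time + Gene, data = complete)
fit3 <- lm(mRNA ~ Time + Gene + Condition, data = complete)

# Sequential F-tests between the nested models
cmp <- anova(fit1, fit2, fit3)
cmp
```

This only works if every factor still has at least two levels in the complete rows. If, as described in the question, one level of Condition never has a Time value, Condition cannot be included alongside Time in a complete-case fit.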