A subtle problem with t-tests on multiple columns

Problem description

I have a data frame containing answers to multiple questions (a reproducible example with 2 questions is below):

library(dplyr)
library(tidyr)
set.seed(1)
df <- data.frame(
  UserId = c(rep("A",4), rep("B",4), rep("C",4), rep("D",4)),
  Sex = c(rep("Female",8), rep("Male",4), rep("No_Response",4)),
  Answer_Date = as.Date(c("1990-01-01","1990-02-01","1990-03-01","1990-04-01","1991-02-01","1991-03-01","1991-04-01","1991-05-01",
                          "1992-03-01","1992-04-01","1992-05-01","1992-06-01","1993-07-10","1992-08-10","1993-09-10","1993-10-10")),
  Q1 = sample(1:10, 16, replace = TRUE),
  Q2 = sample(1:10, 16, replace = TRUE)
) %>%
  group_by(UserId) %>%
  mutate(First_Answer_Date = min(Answer_Date)) %>%
  mutate(Last_Answer_Date  = max(Answer_Date)) %>%
  ungroup()

Following the advice at https://sebastiansauer.github.io/multiple-t-tests-with-dplyr/, I run one-sample t-tests on Q1 and Q2 for each sex, against the null hypothesis that the true mean is 0:

questions <- c("Q1","Q2")
df %>%
  select(questions,Sex) %>%
  filter(Sex != "No_Response") %>%
  gather(key = variable,value = value,-Sex) %>%
  group_by(Sex,variable) %>%
  summarize(value = list(value)) %>%
  spread(Sex,value) %>%
  group_by(variable) %>%
  mutate(p_Female = t.test(unlist(Female))$p.value,
         p_Male   = t.test(unlist(Male))$p.value,
         t_Female = t.test(unlist(Female))$statistic,
         t_Male   = t.test(unlist(Male))$statistic) %>%
  mutate(Female = length(unlist(Female)),
         Male   = length(unlist(Male)))

This gives me

# A tibble: 2 x 7
# Groups:   variable [2]
  variable Female  Male  p_Female  p_Male t_Female t_Male
  <chr>     <int> <int>     <dbl>   <dbl>    <dbl>  <dbl>
1 Q1            8     4 0.0000501 0.00137     8.78  11.6 
2 Q2            8     4 0.00217   0.0115      4.71   5.55

So far so good. My trouble starts when I want to run the t-tests only on the answers given on First_Answer_Date.

df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(questions,Sex) %>%
  filter(Sex != "No_Response")

    # A tibble: 3 x 3
         Q1    Q2 Sex   
      <int> <int> <chr> 
    1     9     5 Female
    2     2     5 Female
    3     1     9 Male 

Now there is only one male response and two female responses, and for Q2 both female respondents gave the same answer. If I rerun my t-test code

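df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(questions,Sex) %>%
  filter(Sex != "No_Response") %>%
  gather(key = variable,value = value,-Sex) %>%
  group_by(Sex,variable) %>%
  summarize(value = list(value)) %>%
  spread(Sex,value) %>%
  group_by(variable) %>%
  mutate(p_Female = t.test(unlist(Female))$p.value, p_Male = t.test(unlist(Male))$p.value,
         t_Female = t.test(unlist(Female))$statistic, t_Male = t.test(unlist(Male))$statistic) %>%
  mutate(Female = length(unlist(Female)), Male = length(unlist(Male)))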

Error: Problem with `mutate()` input `p_Female`.
x data are essentially constant
i Input `p_Female` is `t.test(unlist(Female))$p.value`.
i The error occurred in group 2: variable = "Q2".

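The error message I get is logical, but this is a situation I can expect to run into in practice: some subsets may have size 1 or 0, all respondents may give the same answer to some question, and so on. How can I make the code degrade gracefully, so that it simply puts a blank or NA in those cells of the output where, for whatever reason, the result cannot be computed?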

Sincerely,

Thomas Philips

Solution

Perhaps you can use tryCatch to handle the errors, returning NA whenever t.test fails:

library(dplyr)
library(tidyr)

df %>%
  filter(Answer_Date == First_Answer_Date) %>%
  select(questions,Sex) %>%
  filter(Sex != "No_Response") %>%
  pivot_longer(cols = -Sex,names_to = "variable") %>%
  group_by(Sex,variable) %>%
  summarize(value = list(value)) %>%
  pivot_wider(names_from = Sex,values_from = value) %>%
  group_by(variable) %>%
  mutate(p_Female = tryCatch(t.test(unlist(Female))$p.value,   error = function(e) return(NA)),
         p_Male   = tryCatch(t.test(unlist(Male))$p.value,     error = function(e) return(NA)),
         t_Female = tryCatch(t.test(unlist(Female))$statistic, error = function(e) return(NA)),
         t_Male   = tryCatch(t.test(unlist(Male))$statistic,   error = function(e) return(NA))) %>%
  ungroup %>%
  ungroup %>%
  mutate( Female = lengths(Female),Male   = lengths(Male))
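
Since the same tryCatch wrapper is repeated four times above, it may be convenient to factor it into a small helper. The sketch below is one possible way to do that, using base R only; the name safe_t and its what argument are hypothetical and not part of the original answer. It returns NA_real_ whenever t.test raises an error, which covers both the "data are essentially constant" case and groups with fewer than two observations.

```r
# Hypothetical helper (an assumption, not from the original answer):
# run a one-sample t-test and return NA if the test cannot be computed.
safe_t <- function(x, what = c("p.value", "statistic")) {
  what <- match.arg(what)
  x <- unlist(x)  # accept a list-column cell as well as a plain vector
  tryCatch(
    unname(t.test(x)[[what]]),      # drop the "t" name from the statistic
    error = function(e) NA_real_    # any t.test error becomes NA
  )
}

safe_t(c(9, 2))               # two distinct values: a numeric p-value
safe_t(c(5, 5))               # constant data: NA
safe_t(9, what = "statistic") # single observation: NA
safe_t(numeric(0))            # empty group: NA
```

Inside the grouped mutate() above, each tryCatch call could then presumably be written as, for example, p_Female = safe_t(Female, "p.value"). Returning NA_real_ rather than a plain NA keeps the column type numeric in every group.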