Creating a 2x2 contingency table from instances in a data frame

Problem description

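I have a data frame broken down by author gender, the author's role in the project, and an identifier (PMID) (see below).

I need to create a 2x2 contingency table so that I can calculate an odds ratio for the association between being a female first author and having a female senior author. To do this, I need to count the following: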

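  • A: number of times the first author is female and the senior author is female
  • B: number of times the first author is female and the senior author is male
  • C: number of times the first author is male and the senior author is male
  • D: number of times the first author is male and the senior author is female (obviously, a PMID cannot be counted if it has only a senior author or only a first author)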
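
With the cells labelled this way, the sample odds ratio takes the usual cross-product form. Note that under this labelling C and D are swapped relative to the common a, b, c, d layout. This is a sketch of the standard definition, not a formula stated in the original post:

```latex
\mathrm{OR} = \frac{A / B}{D / C} = \frac{A \cdot C}{B \cdot D}
```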

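I have already grouped the table by PMID (see below), so I really just need to figure out how to count each of the instances above. Any help would be greatly appreciated!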


# A tibble: 178,056 x 3
# Groups:   pmid [101,907]
    gender authorship pmid    
    <chr>  <chr>      <chr>   
  1 male   First      18958667
  2 male   Senior     18958667
  3 male   First      18958651
  4 male   First      18751818
  5 male   Senior     18751818
  6 male   First      18751811
  7 male   Senior     18751811
  8 female First      18751810
  9 female Senior     18751810
 10 male   First      18088800
 11 male   Senior     18088800
 12 male   First      17710072
 13 female First      17977216
 14 male   Senior     17762065
 15 male   First      17611457
 16 male   First      17611433
 17 male   First      17532688
 18 male   Senior     17532688
 19 female First      17405310
 20 male   Senior     17386862
 21 female First      17319096
 22 male   Senior     17319096
 23 female First      17300028
 24 male   First      17282480
 25 female First      17177771
 26 male   First      17124681
 27 female First      17093906
 28 female First      17042011
 29 male   Senior     17042011
 30 female First      17042010
 31 male   Senior     17042010
 32 female First      17042006
 33 male   Senior     17042006
 34 female First      17042003
 35 female First      17042002
 36 male   Senior     17042002
 37 male   First      17042001
 38 female First      17041999
 39 male   Senior     17041997
 40 female First      17041995
 41 female First      17041994
 42 female First      17041993
 43 female Senior     17041993
 44 female First      17041992
 45 female Senior     17041992
 46 female First      17041991
 47 male   First      17041990
 48 male   Senior     17041990
 49 male   First      17041989
 50 male   Senior     17041989

Solution

pivot_wider is a good solution for this.
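library(dplyr)
library(tidyr)

# One row per PMID, with the First and Senior genders in separate columns;
# a PMID that lacks one of the two roles gets NA in that column
newdf <- 
   mydf %>% 
   group_by(pmid) %>% 
   pivot_wider(names_from = authorship, values_from = gender)

# table() ignores NA by default, so only complete First/Senior pairs are counted
# (rows = First, columns = Senior)
table(newdf$First, newdf$Senior)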
#>         
#>          female male
#>   female      3    5
#>   male        0    7

chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#> 
#>  Pearson's Chi-squared test with Yates' continuity correction
#> 
#> data:  table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442

newdf %>% 
   filter(!is.na(First) & !is.na(Senior))
#> # A tibble: 15 x 3
#> # Groups:   pmid [15]
#>    pmid     First  Senior
#>    <chr>    <chr>  <chr> 
#>  1 18958667 male   male  
#>  2 18751818 male   male  
#>  3 18751811 male   male  
#>  4 18751810 female female
#>  5 18088800 male   male  
#>  6 17532688 male   male  
#>  7 17319096 female male  
#>  8 17042011 female male  
#>  9 17042010 female male  
#> 10 17042006 female male  
#> 11 17042002 female male  
#> 12 17041993 female female
#> 13 17041992 female female
#> 14 17041990 male   male  
#> 15 17041989 male   male
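
Since the question asks for an odds ratio, here is a minimal base R sketch of how it could be computed from the table printed above. The counts (3, 5, 0, 7) are copied from that output. The 0.5 continuity correction (Haldane-Anscombe) is an assumption added here to cope with the empty cell in this 50-row excerpt, not something the original answer proposes; the names `or_raw`, `or_adj`, and `ft` are illustrative.

```r
# Counts copied from the table above (rows = First, columns = Senior).
# matrix() fills column by column: Senior female = (3, 0), Senior male = (5, 7).
tab <- matrix(c(3, 0, 5, 7), nrow = 2,
              dimnames = list(First  = c("female", "male"),
                              Senior = c("female", "male")))

# Cells in the question's notation.
A <- tab["female", "female"]  # female first, female senior
B <- tab["female", "male"]    # female first, male senior
C <- tab["male", "male"]      # male first, male senior
D <- tab["male", "female"]    # male first, female senior

# Plain cross-product odds ratio; infinite whenever B or D is 0.
or_raw <- (A * C) / (B * D)

# Assumption: add 0.5 to every cell to obtain a finite estimate.
or_adj <- ((A + 0.5) * (C + 0.5)) / ((B + 0.5) * (D + 0.5))

# fisher.test() reports a conditional estimate of the odds ratio with a
# confidence interval, and is usually preferred over chisq.test() when
# expected counts are small.
ft <- fisher.test(tab)
ft$estimate
ft$conf.int
```

On the full 178,056-row data the empty cell will most likely disappear, in which case the plain cross-product and the `fisher.test()` estimate should both be finite.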


Your data

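# The 50 rows shown in the question
mydf <- tibble(
  gender = c(
    "male", "male", "male", "male", "male", "male", "male", "female", "female", "male",
    "male", "male", "female", "male", "male", "male", "male", "male", "female", "male",
    "female", "male", "female", "male", "female", "male", "female", "female", "male", "female",
    "male", "female", "male", "female", "female", "male", "male", "female", "male", "female",
    "female", "female", "female", "female", "female", "female", "male", "male", "male", "male"
  ),
  authorship = c(
    "First", "Senior", "First", "First", "Senior", "First", "Senior", "First", "Senior", "First",
    "Senior", "First", "First", "Senior", "First", "First", "First", "Senior", "First", "Senior",
    "First", "Senior", "First", "First", "First", "First", "First", "First", "Senior", "First",
    "Senior", "First", "Senior", "First", "First", "Senior", "First", "First", "Senior", "First",
    "First", "First", "Senior", "First", "Senior", "First", "First", "Senior", "First", "Senior"
  ),
  pmid = c(
    "18958667", "18958667", "18958651", "18751818", "18751818", "18751811", "18751811", "18751810", "18751810", "18088800",
    "18088800", "17710072", "17977216", "17762065", "17611457", "17611433", "17532688", "17532688", "17405310", "17386862",
    "17319096", "17319096", "17300028", "17282480", "17177771", "17124681", "17093906", "17042011", "17042011", "17042010",
    "17042010", "17042006", "17042006", "17042003", "17042002", "17042002", "17042001", "17041999", "17041997", "17041995",
    "17041994", "17041993", "17041993", "17041992", "17041992", "17041991", "17041990", "17041990", "17041989", "17041989"
  )
)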

A concise, code-golf style solution:

library(tidyr)
library(magrittr)  # provides the %$% exposition pipe used below
contig_table <- mydf %>% 
  spread(authorship, gender) %>% 
  # Only need drop_na() if data is incomplete
  # Given that you have a lot more rows I assume this will not be needed
  drop_na() %$% 
  table(First, Senior)
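
For readers who want to see what the reshaping step does without tidyr, the same pairing can be sketched in base R with `merge()`. This is an illustrative alternative, not part of either answer. `mydf_small` is a hypothetical seven-row excerpt of the question's data, and the suffixes and the name `lv` are choices made for this sketch.

```r
# Hypothetical excerpt of the question's rows, used only for illustration.
mydf_small <- data.frame(
  gender     = c("male", "male", "male", "female", "female", "female", "male"),
  authorship = c("First", "Senior", "First", "First", "Senior", "First", "Senior"),
  pmid       = c("18958667", "18958667", "18958651", "18751810", "18751810",
                 "17319096", "17319096"),
  stringsAsFactors = FALSE
)

# Split into first-author and senior-author rows.
first  <- subset(mydf_small, authorship == "First",  select = c(pmid, gender))
senior <- subset(mydf_small, authorship == "Senior", select = c(pmid, gender))

# Inner join on pmid: PMIDs with only a first or only a senior author drop out.
pairs <- merge(first, senior, by = "pmid", suffixes = c("_first", "_senior"))

# Fixing the factor levels keeps the result 2x2 even when a cell is empty.
lv <- c("female", "male")
tab <- table(First  = factor(pairs$gender_first,  levels = lv),
             Senior = factor(pairs$gender_senior, levels = lv))
tab
```

Setting the factor levels explicitly matters on small subsets: without it, a gender that never occurs in one role would silently shrink the table below 2x2.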