问题描述
我有一个数据框,该数据框按作者的性别,他们在项目中的角色以及标识符(PMID)细分(见下文)。
我需要创建一个2x2列联表,这样我就可以计算出成为女性第一作者与拥有女性资深作者之间的关联的比值比。为此,我需要计算以下内容:
- A:第一作者是女性,资深作者是女性的次数
- B:第一作者是女性,资深作者是男性的次数
- C:第一作者为男性,资深作者为男性的次数
- D:第一作者为男性,资深作者为女性的次数 (很明显,如果只有资深或只有一位 每个PMID的第一作者)
我已经按PMID对表进行了分组(请参见下文),所以我真的只需要弄清楚如何计算上述每个实例。辛苦了,不胜感激!
# A tibble: 178,056 x 3
# Groups: pmid [101,907]
gender authorship pmid
<chr> <chr> <chr>
1 male First 18958667
2 male Senior 18958667
3 male First 18958651
4 male First 18751818
5 male Senior 18751818
6 male First 18751811
7 male Senior 18751811
8 female First 18751810
9 female Senior 18751810
10 male First 18088800
11 male Senior 18088800
12 male First 17710072
13 female First 17977216
14 male Senior 17762065
15 male First 17611457
16 male First 17611433
17 male First 17532688
18 male Senior 17532688
19 female First 17405310
20 male Senior 17386862
21 female First 17319096
22 male Senior 17319096
23 female First 17300028
24 male First 17282480
25 female First 17177771
26 male First 17124681
27 female First 17093906
28 female First 17042011
29 male Senior 17042011
30 female First 17042010
31 male Senior 17042010
32 female First 17042006
33 male Senior 17042006
34 female First 17042003
35 female First 17042002
36 male Senior 17042002
37 male First 17042001
38 female First 17041999
39 male Senior 17041997
40 female First 17041995
41 female First 17041994
42 female First 17041993
43 female Senior 17041993
44 female First 17041992
45 female Senior 17041992
46 female First 17041991
47 male First 17041990
48 male Senior 17041990
49 male First 17041989
50 male Senior 17041989
解决方法
pivot_wider
library(dplyr)
library(tidyr)
newdf <-
mydf %>%
group_by(pmid) %>%
pivot_wider(names_from = authorship,values_from = gender)
table(newdf$First,newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First,newdf$Senior))
#> Warning in chisq.test(table(newdf$First,newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First,newdf$Senior)
#> X-squared = 1.356,df = 1,p-value = 0.2442
newdf %>%
filter(!is.na(First) & !is.na(Senior))
#> # A tibble: 15 x 3
#> # Groups: pmid [15]
#> pmid First Senior
#> <chr> <chr> <chr>
#> 1 18958667 male male
#> 2 18751818 male male
#> 3 18751811 male male
#> 4 18751810 female female
#> 5 18088800 male male
#> 6 17532688 male male
#> 7 17319096 female male
#> 8 17042011 female male
#> 9 17042010 female male
#> 10 17042006 female male
#> 11 17042002 female male
#> 12 17041993 female female
#> 13 17041992 female female
#> 14 17041990 male male
#> 15 17041989 male male
table(newdf$First,p-value = 0.2442
您的数据
mydf <- tibble(
gender = c("male","male","female","male"),authorship = c("First","Senior","First","Senior"),pmid = c("18958667","18958667","18958651","18751818","18751811","18751810","18088800","17710072","17977216","17762065","17611457","17611433","17532688","17405310","17386862","17319096","17300028","17282480","17177771","17124681","17093906","17042011","17042010","17042006","17042003","17042002","17042001","17041999","17041997","17041995","17041994","17041993","17041992","17041991","17041990","17041989","17041989")
)
,
带有高尔夫球编码的简洁解决方案:
library(tidyr)
contig_table <- mydf %>%
spread(authorship,gender) %>%
#Only need drop_na() if data is incomplete
#Given that you have a lot more rows I assume this will not be needed
drop_na() %$%
table(First,Senior)