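Problem description
For my current project, which involves repeated measures, I am working with a long-format dataset for the first time.
I am trying to get descriptive statistics (counts, percentages) at each time point for several categorical variables.
My data: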
library(dplyr)
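questiondata <- structure(list(
  id = c(2, 2, 6, 6, 9, 9, 22, 22, 23, 23, 25, 25, 30, 30, 31, 31, 33, 33, 34, 34),
  time = structure(rep(1:2, times = 10), .Label = c("time1", "time2"), class = "factor"),
  age = c(65, 69.17, 76.75, 81.05, 58.64, 62.71, 59.37, 63.56, 58, 61.69,
          55.78, 59.95, 59.3, 63.36, 60.45, 64.39, 56.3, 60.08, 59.53, 63.84),
  sex = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L),
                  .Label = c("men", "women"), class = "factor"),
  hypert_drug = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
                          .Label = c("no", "yes"), class = "factor")),
  row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))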
This corresponds to the following tibble:
# A tibble: 20 x 5
      id time    age sex   hypert_drug
   <dbl> <fct> <dbl> <fct> <fct>
 1     2 time1  65   men   no
 2     2 time2  69.2 men   yes
 3     6 time1  76.8 women yes
 4     6 time2  81.0 women yes
 5     9 time1  58.6 men   no
 6     9 time2  62.7 men   no
 7    22 time1  59.4 men   no
 8    22 time2  63.6 men   no
 9    23 time1  58   women no
10    23 time2  61.7 women no
11    25 time1  55.8 men   no
12    25 time2  60.0 men   no
13    30 time1  59.3 women no
14    30 time2  63.4 women yes
15    31 time1  60.4 men   yes
16    31 time2  64.4 men   yes
17    33 time1  56.3 men   no
18    33 time2  60.1 men   no
19    34 time1  59.5 women no
20    34 time2  63.8 women no
To get the count of each sex at each time point, I did:
long_dataset %>%
  group_by(time, sex) %>%
  summarize(n_sex = n())
Run on my full dataset, this produces the following output:
`summarise()` has grouped output by 'time'. You can override using the `.groups` argument.
# A tibble: 10 x 3
# Groups:   time [5]
   time  sex   n_sex
   <fct> <fct> <int>
 1 time1 men     398
 2 time1 women   371
 3 time2 men     398
 4 time2 women   371
 5 time3 men     398
 6 time3 women   371
 7 time4 men     804
 8 time4 women   917
 9 time5 men    1202
10 time5 women  1288
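What I would also like is a column with the proportion of men and women at each time point, plus similar columns giving the count and percentage of the variable "hypert_drug" at each time point.
Any ideas? Thanks!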
Solution
Following your example long_dataset, you can simply extend your dplyr chain.
library(dplyr)
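long_dataset <- structure(list(
  id = c(2, 2, 6, 6, 9, 9, 22, 22, 23, 23, 25, 25, 30, 30, 31, 31, 33, 33, 34, 34),
  time = structure(rep(1:2, times = 10), .Label = c("time1", "time2"), class = "factor"),
  age = c(65, 69.17, 76.75, 81.05, 58.64, 62.71, 59.37, 63.56, 58, 61.69,
          55.78, 59.95, 59.3, 63.36, 60.45, 64.39, 56.3, 60.08, 59.53, 63.84),
  sex = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L),
                  .Label = c("men", "women"), class = "factor"),
  hypert_drug = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
                          .Label = c("no", "yes"), class = "factor")),
  row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))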
long_dataset %>%
  dplyr::group_by(time, sex, hypert_drug) %>%
  dplyr::summarise(count = n()) %>%
  dplyr::mutate(count_freq = count / sum(count))
#> # A tibble: 8 x 5
#> # Groups:   time, sex [4]
#>   time  sex   hypert_drug count count_freq
#>   <fct> <fct> <fct>       <int>      <dbl>
#> 1 time1 men   no              5      0.833
#> 2 time1 men   yes             1      0.167
#> 3 time1 women no              3      0.75
#> 4 time1 women yes             1      0.25
#> 5 time2 men   no              4      0.667
#> 6 time2 men   yes             2      0.333
#> 7 time2 women no              2      0.5
#> 8 time2 women yes             2      0.5
Created on 2021-06-28 by the reprex package (v0.3.0)
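One point worth spelling out: after summarise(), dplyr drops the last grouping variable, so the mutate() above computes count_freq within each combination of time and sex. For the proportion of men and women at each time point, one option is to group by time and sex only. The following is a minimal sketch on hypothetical toy data; the toy tibble and the pct_sex column name are illustrative assumptions, not the asker's data.

```r
library(dplyr)

# Hypothetical toy data, not the asker's dataset
toy <- tibble::tibble(
  time = factor(c("time1", "time1", "time1", "time2", "time2", "time2")),
  sex  = factor(c("men", "men", "women", "men", "women", "women"))
)

sex_by_time <- toy %>%
  group_by(time, sex) %>%
  # "drop_last" keeps the result grouped by time only and silences the message
  summarise(n_sex = n(), .groups = "drop_last") %>%
  # sum(n_sex) is therefore taken within each time point
  mutate(pct_sex = 100 * n_sex / sum(n_sex)) %>%
  ungroup()

print(sex_by_time)
```

Setting .groups explicitly also makes it visible in the code which grouping the percentage refers to.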
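Update
I am not sure how to do this in a single dplyr chain, so here it is in three separate steps. Maybe someone else can do it better. I hope I have understood your desired output correctly.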
library(dplyr)
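long_dataset <- structure(list(
  id = c(2, 2, 6, 6, 9, 9, 22, 22, 23, 23, 25, 25, 30, 30, 31, 31, 33, 33, 34, 34),
  time = structure(rep(1:2, times = 10), .Label = c("time1", "time2"), class = "factor"),
  age = c(65, 69.17, 76.75, 81.05, 58.64, 62.71, 59.37, 63.56, 58, 61.69,
          55.78, 59.95, 59.3, 63.36, 60.45, 64.39, 56.3, 60.08, 59.53, 63.84),
  sex = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L),
                  .Label = c("men", "women"), class = "factor"),
  hypert_drug = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L),
                          .Label = c("no", "yes"), class = "factor")),
  row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))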
# counts and proportions of sex within each time point
sex <- long_dataset %>%
  dplyr::group_by(time, sex) %>%
  dplyr::summarise(n_sex = dplyr::n()) %>%
  dplyr::mutate(freq_sex = n_sex / sum(n_sex)) %>%
  dplyr::ungroup()
# counts and proportions of hypert_drug within each time point
drug <- long_dataset %>%
  dplyr::group_by(time, hypert_drug) %>%
  dplyr::summarise(n_drug = dplyr::n()) %>%
  dplyr::mutate(freq_drug = n_drug / sum(n_drug)) %>%
  dplyr::ungroup() %>%
  dplyr::select(-time)
# place the two summaries side by side (rows are matched by position)
dplyr::bind_cols(sex, drug)
#> # A tibble: 4 x 7
#>   time  sex   n_sex freq_sex hypert_drug n_drug freq_drug
#>   <fct> <fct> <int>    <dbl> <fct>        <int>     <dbl>
#> 1 time1 men       6      0.6 no               8       0.8
#> 2 time1 women     4      0.4 yes              2       0.2
#> 3 time2 men       6      0.6 no               6       0.6
#> 4 time2 women     4      0.4 yes              4       0.4
Created on 2021-06-29 by the reprex package (v0.3.0)
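Note that bind_cols() pairs rows by position, so the table above only lines up because sex and hypert_drug both have two levels; the men row and the no row are not otherwise related. A possible alternative, sketched here under the assumption that tidyr is available, stacks the categorical columns first and then counts once. The toy tibble and the column names variable, level, n and pct are hypothetical choices for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column names as long_dataset
toy <- tibble::tibble(
  id = rep(1:4, each = 2),
  time = factor(rep(c("time1", "time2"), times = 4)),
  sex = factor(rep(c("men", "women", "men", "men"), each = 2)),
  hypert_drug = factor(c("no", "yes", "yes", "yes", "no", "no", "no", "no"))
)

summary_long <- toy %>%
  # convert to character so the two columns can share one value column
  mutate(across(c(sex, hypert_drug), as.character)) %>%
  # one row per id, time point and categorical variable
  pivot_longer(c(sex, hypert_drug), names_to = "variable", values_to = "level") %>%
  # count each level of each variable at each time point
  count(time, variable, level, name = "n") %>%
  group_by(time, variable) %>%
  # percentage within each time point and variable
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()

print(summary_long)
```

This layout scales to any number of categorical variables and to variables with different numbers of levels, at the cost of a longer rather than wider result.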