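Problem description
I'm having trouble understanding how to calculate a mean when the data are in long format (that is, each observation has its own row).
For example, I joined a surgery database and a transfusion (receipt of blood products) database on social ID and surgery date. To do this I added a column called "transfused", a binary flag that is 1 if the row contains any transfusion (i.e. if blood, plasma, or platelets = 1, then transfused = 1).
Every surgery and every transfusion is one row in the data, so each id ends up with many rows, and that makes the mean come out wrong.
For instance, take a simple dataset with only 2 actual surgeries, one of which received 50 transfusions and the other none; the combined dataset will have 51 complete rows (the join repeats the surgery's row once per matching transfusion).
In reality 50% of the surgeries needed a transfusion, but in the example above the mean calculation reports that 50/51 of the surgeries were transfused.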
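To make that arithmetic concrete, here is a minimal sketch with made-up data (the ids, dates, and product type are illustrative assumptions, not values from the real databases). It compares the mean over rows with the mean over surgeries for the two-surgery case described above.

```r
library(dplyr)

# Hypothetical toy data: surgery 1 has 50 transfusion rows, surgery 2 has none.
surgeries <- tibble(id = 1:2, date = "2020-01-01")
transfusions <- tibble(id = rep(1L, 50), date = "2020-01-01", type = "Blood")

combined <- surgeries %>%
  left_join(transfusions, by = c("id", "date")) %>%
  mutate(transfused = if_else(is.na(type), 0, 1))

n_rows <- nrow(combined)               # 50 rows for surgery 1 plus 1 row for surgery 2
row_mean <- mean(combined$transfused)  # mean over rows, not over surgeries

# Mean over surgeries: collapse to one flag per surgery first.
surgery_mean <- combined %>%
  group_by(id, date) %>%
  summarise(transfused = max(transfused), .groups = "drop") %>%
  summarise(rate = mean(transfused)) %>%
  pull(rate)
```

Here `row_mean` is 50/51 while `surgery_mean` is 1/2, which is the discrepancy in question.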
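Where am I going wrong? I realize R is doing exactly what I told it to do, but I don't know how to proceed so that the "transfused" flag is set or counted only once per unique ID and date in the final calculation.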
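library(tidyverse)

# Assumed toy values: id 1 has two same-day transfusions, id 2 has one,
# and no other transfusion matches a surgery date.
surgeries <- tibble(
  id = 1:10,
  operation = rep("App", 10),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
           "2020-01-05", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")
)
transfusions <- tibble(
  id = c(1, 1, 2, 3, 4, 8, 8),
  type = c("Blood", "Blood", "plasma", "Platelets", "Blood", "Blood", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-05", "2020-01-05",
           "2020-01-05", "2020-01-05")
)
combined <- surgeries %>%
  left_join(transfusions, by = c("id", "date"))
combined <- combined %>%
  mutate(
    transfused = if_else(
      type == "Blood" | type == "plasma" | type == "Platelets",
      1, 0, missing = 0
    )
  )
aggregate(transfused ~ operation, data = combined, FUN = mean)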
In the example above the desired mean is 2/10, but because every row is one observation it comes out as 3/11.
Solution
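Patient id = 1 had one surgery and two transfusions, so your final combined contains duplicate rows. Since you only care whether the patient was transfused (yes or no) and not how many times, remove the duplicates before calculating: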
combined %>%
  distinct() %>%
  group_by(operation) %>%
  summarize(mean_transfusions = mean(transfused))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 1 x 2
  operation mean_transfusions
  <chr>                 <dbl>
1 App                     0.2
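One caveat worth checking: distinct() with no arguments only drops rows that are identical in every column. As a sketch, assuming a surgery could receive two different product types on the same day (the data below are hypothetical), restricting distinct() to the columns that identify a surgery plus the flag would still collapse those rows:

```r
library(dplyr)

# Hypothetical joined data: surgery 1 received two different products.
combined <- tibble(
  id = c(1, 1, 2, 3),
  operation = "App",
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"),
  type = c("Blood", "Plasma", NA, NA),
  transfused = c(1, 1, 0, 0)
)

# Plain distinct() keeps both rows for id 1 because `type` differs.
rows_plain <- nrow(distinct(combined))

# Keep only the identifying columns and the flag, then deduplicate.
rate <- combined %>%
  distinct(id, operation, date, transfused) %>%
  group_by(operation) %>%
  summarise(mean_transfusions = mean(transfused))
```

With this toy input, plain distinct() leaves all four rows, whereas the column-restricted version counts surgery 1 once and gives a rate of 1/3.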
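
Alternatively: the problem is that left_join creates additional rows, one per matching transfusion. You can fix this by summarising the result down to one row per surgery before taking the mean.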
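library(tidyverse)

# Assumed toy values: id 1 has two same-day transfusions, id 2 has one,
# and no other transfusion matches a surgery date.
surgeries <- tibble(
  id = 1:10,
  operation = rep("App", 10),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
           "2020-01-05", "2020-01-05", "2020-01-06", "2020-01-07", "2020-01-08")
)
transfusions <- tibble(
  id = c(1, 1, 2, 3, 4, 8, 8),
  type = c("Blood", "Blood", "Plasma", "Platelets", "Blood", "Blood", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-05", "2020-01-05",
           "2020-01-05", "2020-01-05")
)
combined <- surgeries %>%
  left_join(transfusions, by = c("id", "date"))
combined <- combined %>%
  mutate(
    transfused = if_else(
      type == "Blood" | type == "Plasma" | type == "Platelets",
      1, 0, missing = 0
    )
  )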
combined %>%
  group_by(id, operation) %>%
  summarise(transfused_right = any(transfused == 1), .groups = "drop") %>%
  group_by(operation) %>%
  summarise(mean_rate = mean(transfused_right))
#> # A tibble: 1 x 2
#> operation mean_rate
#> * <chr> <dbl>
#> 1 App 0.2
Created on 2021-02-04 by the reprex package (v1.0.0)
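Another way to approach this, sketched here as an assumption rather than as part of either answer, is to collapse the transfusions to one row per (id, date) before joining, so the join cannot multiply surgery rows in the first place. The toy tables and the helper name `transfused_keys` below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data in the same shape as the question's tables.
surgeries <- tibble(
  id = 1:4,
  operation = "App",
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03")
)
transfusions <- tibble(
  id = c(1, 1, 2),
  type = c("Blood", "Blood", "Plasma"),
  date = c("2020-01-01", "2020-01-01", "2020-01-05")
)

# One row per transfused (id, date) pair, so the join cannot add rows.
transfused_keys <- transfusions %>%
  distinct(id, date) %>%
  mutate(transfused = 1)

flagged <- surgeries %>%
  left_join(transfused_keys, by = c("id", "date")) %>%
  mutate(transfused = coalesce(transfused, 0))

rates <- flagged %>%
  group_by(operation) %>%
  summarise(mean_rate = mean(transfused))
```

Because `flagged` keeps exactly one row per surgery, a plain mean of the flag is already the per-surgery rate (0.25 for this toy input, where only surgery 1 has a same-day transfusion).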