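Calculating the correct mean only once per ID in long format in R

Problem description

I am having trouble understanding how to calculate a mean when the data is in long format (i.e. each observation has its own row).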

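For example, I joined a database of surgeries with a database of transfusions (blood products received) on social ID and surgery date. To the result I add a column called "transfused", a binary flag that is 1 if the row contains any transfusion (i.e. blood, plasma, or platelets) and 0 otherwise.

Every surgery and every transfusion is one row in the data, so each id ends up with many rows, and the mean comes out wrong.

As a simple example, take a data set with only 2 actual surgeries, one of which received 50 transfusions and one of which received none; in the combined data set I will have 51 complete rows (due to R recycling).

In reality 50% of the surgeries required a transfusion, but in the example above the mean calculation reports that 50/51 of the surgeries were transfused.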
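To make the arithmetic concrete, the two-surgery case can be sketched as below. This is only an illustration: `surgeries_toy`, `transfusions_toy`, and the dates are made-up names and values, not the real data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: two surgeries, one of which received 50 transfusions
surgeries_toy <- tibble(id = c(1, 2), date = c("2020-01-01", "2020-01-01"))
transfusions_toy <- tibble(
  id   = rep(1, 50),
  date = rep("2020-01-01", 50),
  type = rep("Blood", 50)
)

# The join repeats surgery 1 once per matching transfusion; surgery 2 keeps
# a single row with type = NA
combined_toy <- surgeries_toy %>%
  left_join(transfusions_toy, by = c("id", "date")) %>%
  mutate(transfused = if_else(is.na(type), 0, 1))

nrow(combined_toy)                            # 51 rows
naive_mean <- mean(combined_toy$transfused)   # 50/51: every transfusion is counted

# The share that is actually wanted: transfused surgeries / all surgeries
intended <- n_distinct(combined_toy$id[combined_toy$transfused == 1]) /
  nrow(surgeries_toy)                         # 1/2
```

The gap between `naive_mean` and `intended` is exactly the discrepancy described above.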

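Where am I going wrong? I realize R is doing what I tell it to do, but I cannot work out how to proceed so that the "transfused" flag is set or counted only once per unique ID and date in the final calculation.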

library(tidyverse)

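# NOTE: these vectors were truncated in the source. rep("App", 10), the last
# three surgery dates, the last two transfusion types, and every transfusion
# date except the final one are assumed values, chosen so that the join gives
# the 11 rows / 3 transfused rows described in the question.
surgeries <- tibble(
  id        = 1:10,
  operation = rep("App", 10),
  date      = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
                "2020-01-05", "2020-01-05", "2020-01-05", "2020-01-05", "2020-01-05")
)

transfusions <- tibble(
  id   = c(1, 1, 2, 3, 4, 8, 8),
  type = c("Blood", "Blood", "plasma", "Platelets", "Blood", "Blood", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
           "2020-01-04", "2020-01-05")
)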

combined <- surgeries %>% 
  left_join(transfusions,by = c("id","date"))

combined <- combined %>% 
  mutate(
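    # 1 if the row carries any blood product, 0 otherwise (including unmatched rows, where type is NA)
    transfused = if_else((type == "Blood" | type == "plasma" | type == "Platelets"), 1, 0, missing = 0)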
  )

aggregate(combined,by=list(Operation = combined$operation),mean)

In the example above, the desired mean is 2/10, but because every row is treated as one observation it comes out as 3/11.

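Solution

Patient ID = 1 had one surgery and two transfusions, so your final combined contains duplicate rows. Since you only care whether a patient was transfused (yes or no), not how many times, remove the duplicates before calculating: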

combined %>%
  distinct() %>%
  group_by(operation) %>%
  summarize(mean_transfusions = mean(transfused))

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 1 x 2
  operation mean_transfusions
  <chr>                 <dbl>
1 App                     0.2
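One caveat worth sketching: a bare `distinct()` only drops rows that are identical in every column, so it would not collapse a surgery that received two different products on the same day. A possible variant, assuming the flag is constant within each surgery (which holds after the left join), is to deduplicate on the key columns only. The data frame `combined_mixed` below is hypothetical and exists just to show the difference.

```r
library(dplyr)
library(tibble)

# Hypothetical joined data: surgery 1 received two different products,
# so its two rows differ in `type` and a bare distinct() keeps both
combined_mixed <- tibble(
  id         = c(1, 1, 2, 3, 4),
  operation  = "App",
  date       = c("2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"),
  type       = c("Blood", "Plasma", NA, NA, NA),
  transfused = c(1, 1, 0, 0, 0)
)

nrow(distinct(combined_mixed))   # 5: nothing was removed

# Keep one row per surgery by ignoring the product type
per_surgery <- combined_mixed %>%
  distinct(id, date, operation, transfused)

rates <- per_surgery %>%
  group_by(operation) %>%
  summarise(mean_transfusions = mean(transfused))
```

Here `per_surgery` has four rows, one per surgery, so the mean is taken over surgeries rather than over transfusions.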

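The problem is that left_join creates extra rows (one per matching transfusion); you can fix this by summarising the result down to one row per surgery first.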

library(tidyverse)

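# NOTE: these vectors were truncated in the source. rep("App", 10), the last
# three surgery dates, the last two transfusion types, and every transfusion
# date except the final one are assumed values, chosen so that the join gives
# the 11 rows / 3 transfused rows described in the question.
surgeries <- tibble(
  id        = 1:10,
  operation = rep("App", 10),
  date      = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
                "2020-01-05", "2020-01-05", "2020-01-05", "2020-01-05", "2020-01-05")
)

transfusions <- tibble(
  id   = c(1, 1, 2, 3, 4, 8, 8),
  type = c("Blood", "Blood", "Plasma", "Platelets", "Blood", "Blood", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
           "2020-01-04", "2020-01-05")
)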

combined <- surgeries %>% 
  left_join(transfusions,by = c("id","date"))

combined <- combined %>% 
  mutate(
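    # 1 if the row carries any blood product, 0 otherwise (including unmatched rows, where type is NA)
    transfused = if_else((type == "Blood" | type == "Plasma" | type == "Platelets"), 1, 0, missing = 0)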
  )

combined %>% 
  group_by(id,operation) %>% 
  summarise(transfused_right = any(transfused == 1),.groups = "drop") %>%
  group_by(operation) %>% 
  summarise(mean_rate = mean(transfused_right))
#> # A tibble: 1 x 2
#>   operation mean_rate
#> * <chr>         <dbl>
#> 1 App             0.2

Created on 2021-02-04 by the reprex package (v1.0.0)
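Another way to avoid the extra rows, shown here only as a sketch, is to reduce the transfusions to one row per (id, date) before joining, so the join can never multiply surgery rows. The names `surgeries_toy`, `transfusions_toy`, and `transfused_keys` and all values are hypothetical.

```r
library(dplyr)
library(tibble)

# Hypothetical data: surgery 1 has two transfusions on its date,
# surgery 3 has a transfusion on a different date (no match)
surgeries_toy <- tibble(
  id        = c(1, 2, 3, 4),
  operation = "App",
  date      = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03")
)
transfusions_toy <- tibble(
  id   = c(1, 1, 3),
  type = c("Blood", "Plasma", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-05")
)

# One row per transfused (id, date), however many products were given
transfused_keys <- transfusions_toy %>%
  distinct(id, date) %>%
  mutate(transfused = 1)

# Each surgery matches at most one key row, so the row count is unchanged
flagged <- surgeries_toy %>%
  left_join(transfused_keys, by = c("id", "date")) %>%
  mutate(transfused = coalesce(transfused, 0))

rates <- flagged %>%
  group_by(operation) %>%
  summarise(mean_rate = mean(transfused))
```

With this ordering `flagged` keeps exactly one row per surgery, so a plain `mean()` already gives the per-surgery rate. The trade-off is that the product type is dropped, which is fine when only the yes/no flag matters.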