Calculating the correct mean only once per ID in long-format data in R

Problem description

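I am having trouble understanding how to calculate a mean when the data is in long format (i.e. every observation has its own row).

For example, I joined a database of surgeries with a database of transfusions (blood products received) on social ID and surgery date. I then added a column called "transfused", a binary flag that is set to 1 if the row contains any transfusion (i.e. if blood, plasma, or platelets = 1, then transfused = 1).

Every surgery and every transfusion is one row in the data, so each id ends up with many rows, and that makes the mean calculation incorrect.

Take a simple example: a dataset with only 2 actual surgeries, one of which received 50 transfusions and one of which received none. In the combined dataset I will have 51 full rows (because of R recycling).

In reality 50% of the surgeries needed a transfusion, but in the example above the mean calculation reports that 50/51 of the surgeries were transfused.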
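The arithmetic of that toy case can be reproduced with a short sketch. The table names `surgeries_toy` and `transfusions_toy` are hypothetical and exist only for illustration; the point is that the extra rows come from the one-to-many join, and that collapsing to one row per surgery before averaging recovers 1/2.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: surgery 1 received 50 transfusions, surgery 2 none
surgeries_toy    <- tibble(id = 1:2)
transfusions_toy <- tibble(id = rep(1L, 50), type = "Blood")

combined_toy <- surgeries_toy %>%
  left_join(transfusions_toy, by = "id") %>%
  mutate(transfused = if_else(is.na(type), 0, 1))

# 50 rows for id 1 plus 1 row for id 2
n_rows <- nrow(combined_toy)

# Row-level mean: every transfusion row counts once
row_mean <- mean(combined_toy$transfused)

# Collapse to one row per surgery, then average
per_surgery <- combined_toy %>%
  group_by(id) %>%
  summarise(transfused = max(transfused), .groups = "drop")

surgery_mean <- mean(per_surgery$transfused)
```

Here `row_mean` is 50/51 and `surgery_mean` is 1/2, which is the gap described above.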

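Where am I going wrong? I realise R is doing exactly what I tell it to, but I don't know how to proceed so that the "transfused" flag is set, or counted, only once per unique ID and date in the final calculation.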

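library(tidyverse)

# NOTE: the vectors in the post as preserved here were truncated. The values
# below are filled in so that the lengths match and the result reproduces the
# 3/11 vs 2/10 figures discussed in the text.
surgeries <- tibble(
  id        = 1:10,
  operation = rep("App", 10),
  date      = c("2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06",
                "2020-01-07", "2020-01-08", "2020-01-09", "2020-01-10",
                "2020-01-11", "2020-01-12")
)

transfusions <- tibble(
  id   = c(1, 1, 2, 3, 4, 8, 8),
  type = c("Blood", "Blood", "plasma", "Platelets", "Blood", "Blood", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03",
           "2020-01-04", "2020-01-05", "2020-01-05")
)

# One row per surgery/transfusion pair; surgeries without a match get NA in type
combined <- surgeries %>%
  left_join(transfusions, by = c("id", "date"))

# Flag rows that carry any blood product; unmatched rows (NA) become 0
combined <- combined %>%
  mutate(
    transfused = if_else(type == "Blood" | type == "plasma" | type == "Platelets",
                         1, 0, missing = 0)
  )

# Row-level mean of the flag per operation (this is the 3/11 result)
aggregate(transfused ~ operation, data = combined, FUN = mean)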

In the example above, the expected mean is 2/10, but because every row is one observation it comes out as 3/11.

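Solution

Patient ID = 1 had one surgery and two transfusions, so your final combined contains duplicate rows. Since you are only counting whether a patient received a transfusion (yes or no) and not how many, remove the duplicates before calculating: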

combined %>%
  distinct() %>%
  group_by(operation) %>%
  summarize(mean_transfusions = mean(transfused))

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 1 x 2
  operation mean_transfusions
  <chr>                 <dbl>
1 App                     0.2
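One caveat worth checking: `distinct()` with no arguments only drops rows that are identical in every column. That works here because both of patient 1's transfusions have the same type, but it would not collapse a surgery that received two different products. A minimal sketch, assuming a hypothetical joined table `combined_mixed`, restricts `distinct()` to the key columns plus the flag:

```r
library(dplyr)
library(tibble)

# Hypothetical joined data: surgery 1 received two different products
combined_mixed <- tibble(
  id         = c(1, 1, 2, 3),
  operation  = "App",
  date       = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03"),
  type       = c("Blood", "plasma", NA, NA),
  transfused = c(1, 1, 0, 0)
)

# distinct() over all columns keeps both rows for id 1, because type differs
n_all_columns <- nrow(distinct(combined_mixed))

# Restricting to the surgery key plus the flag gives one row per surgery
per_surgery <- combined_mixed %>%
  distinct(id, operation, date, transfused)

rate <- per_surgery %>%
  group_by(operation) %>%
  summarise(mean_transfusions = mean(transfused))
```

With this toy data `n_all_columns` stays at 4, while `per_surgery` has 3 rows and the rate is 1/3.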

The problem is that left_join creates additional rows. You can solve this by summarising the result:

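library(tidyverse)

# NOTE: the vectors in the post as preserved here were truncated. The values
# below are filled in so that the lengths match and the result reproduces the
# 0.2 shown in the output.
surgeries <- tibble(
  id        = 1:10,
  operation = rep("App", 10),
  date      = c("2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06",
                "2020-01-07", "2020-01-08", "2020-01-09", "2020-01-10",
                "2020-01-11", "2020-01-12")
)

transfusions <- tibble(
  id   = c(1, 1, 2, 3, 4, 8, 8),
  type = c("Blood", "Blood", "Plasma", "Platelets", "Blood", "Blood", "Blood"),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03",
           "2020-01-04", "2020-01-05", "2020-01-05")
)

combined <- surgeries %>%
  left_join(transfusions, by = c("id", "date"))

# Flag rows that carry any blood product; unmatched rows (NA) become 0
combined <- combined %>%
  mutate(
    transfused = if_else(type == "Blood" | type == "Plasma" | type == "Platelets",
                         1, 0, missing = 0)
  )

# First collapse to one logical flag per surgery, then average per operation
combined %>%
  group_by(id, operation) %>%
  summarise(transfused_right = any(transfused == 1), .groups = "drop") %>%
  group_by(operation) %>%
  summarise(mean_rate = mean(transfused_right))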
#> # A tibble: 1 x 2
#>   operation mean_rate
#> * <chr>         <dbl>
#> 1 App             0.2
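Another way to avoid the extra rows, sketched here under the assumption that only the yes/no flag is needed downstream, is to reduce the transfusions table to one row per (id, date) before joining. The names `surgeries_small`, `transfusions_small`, and `transfused_flags` are hypothetical and used only for this illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature versions of the two tables
surgeries_small <- tibble(
  id        = 1:4,
  operation = "App",
  date      = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04")
)
transfusions_small <- tibble(
  id   = c(1, 1, 2),
  type = c("Blood", "Blood", "plasma"),
  date = c("2020-01-01", "2020-01-01", "2020-01-02")
)

# One flag row per (id, date) that had at least one transfusion
transfused_flags <- transfusions_small %>%
  distinct(id, date) %>%
  mutate(transfused = 1)

# The join can no longer add rows; surgeries without a match get 0
combined_small <- surgeries_small %>%
  left_join(transfused_flags, by = c("id", "date")) %>%
  mutate(transfused = coalesce(transfused, 0))

rate <- combined_small %>%
  group_by(operation) %>%
  summarise(mean_rate = mean(transfused))
```

Because `combined_small` keeps exactly one row per surgery (4 rows here), the plain mean of the flag is already the per-surgery rate, 0.5 in this toy data. The trade-off is that the per-transfusion detail, such as type, is dropped from the joined table.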

Created on 2021-02-04 by the reprex package (v1.0.0)
