R: how to accumulate values in a list column based on multiple conditions

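Problem description

I have a dataset of patients treated in different hospitals (inpatients only), and some analysis has revealed several inconsistencies. One of them is that the software allowed a patient to be admitted without the previously opened case_id being closed.

To understand it better, consider the sample dataset.

Sample data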

dput(df)

df <- structure(list(case_id = 1:22,
  patient_id = c(1L,1L,1L,1L,1L,1L,1L,2L,2L,2L,1L,3L,3L,3L,4L,4L,5L,5L,6L,7L,8L,8L),
  pack_id = c(12L,62L,59L,68L,77L,86L,20L,55L,86L,72L,7L,54L,75L,26L,21L,12L,49L,35L,51L,31L,10L,54L),
  hosp_id = c(1L,1L,2L,2L,1L,1L,2L,3L,3L,4L,2L,3L,3L,3L,4L,5L,6L,6L,7L,7L,8L,8L),
  admn_date = structure(c(18262,18264,18265,18266,18277,18279,18283,18262,18264,18277,18287,
    18275,18301,18291,18366,18374,18309,18319,18364,18303,18328,18341),class = "Date"),
  discharge_date = structure(c(18275,18276,18271,18275,18288,18280,18286,18275,18276,18288,18291,
    18283,18309,18297,18375,18381,18347,18328,18367,18309,18341,18344),class = "Date")),
  row.names = c(NA,-22L),class = "data.frame")

> df
   case_id patient_id pack_id hosp_id  admn_date discharge_date
1       1          1      12       1 2020-01-01     2020-01-14
2       2          1      62       1 2020-01-03     2020-01-15
3       3          1      59       2 2020-01-04     2020-01-10
4       4          1      68       2 2020-01-05     2020-01-14
5       5          1      77       1 2020-01-16     2020-01-27
6       6          1      86       1 2020-01-18     2020-01-19
7       7          1      20       2 2020-01-22     2020-01-25
8       8          2      55       3 2020-01-01     2020-01-14
9       9          2      86       3 2020-01-03     2020-01-15
10     10          2      72       4 2020-01-16     2020-01-27
11     11          1       7       2 2020-01-26     2020-01-30
12     12          3      54       3 2020-01-14     2020-01-22
13     13          3      75       3 2020-02-09     2020-02-17
14     14          3      26       3 2020-01-30     2020-02-05
15     15          4      21       4 2020-04-14     2020-04-23
16     16          4      12       5 2020-04-22     2020-04-29
17     17          5      49       6 2020-02-17     2020-03-26
18     18          5      35       6 2020-02-27     2020-03-07
19     19          6      51       7 2020-04-12     2020-04-15
20     20          7      31       7 2020-02-11     2020-02-17
21     21          8      10       8 2020-03-07     2020-03-20
22     22          8      54       8 2020-03-20     2020-03-23

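Looking at the data above, the patient with ID 1 was admitted to hospital_1 on 1 January (row 1) and discharged on 14 January. Before that discharge, the patient was admitted to the same hospital again (row 2), and then twice to hospital_2 (rows 3 and 4). All four records were finally closed on 15 January (row 2).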

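I have filtered the records where a patient is admitted several times, whether to several hospitals or to the same one, with the following code.

Code attempt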

library(dplyr)   # arrange(), mutate(), group_by()
library(tidyr)   # pivot_longer()

df_2 <- df %>% arrange(patient_id,admn_date,discharge_date) %>%
  mutate(sort_key = row_number()) %>%
  pivot_longer(c(admn_date,discharge_date),names_to ="activity",values_to ="date",names_pattern = "(.*)_date") %>%
  mutate(activity = factor(activity,ordered = TRUE,levels = c("admn","discharge")),admitted = ifelse(activity == "admn",1,-1)) %>%
  group_by(patient_id) %>%
  arrange(date,sort_key,activity,.by_group = TRUE) %>% 
  mutate (admitted = cumsum(admitted)) %>%
  ungroup()
  
 > df_2
# A tibble: 44 x 8
   case_id patient_id pack_id hosp_id sort_key activity  date       admitted
    <int>      <int>   <int>   <int>    <int> <ord>     <date>        <dbl>
 1      1          1      12       1        1 admn      2020-01-01        1
 2      2          1      62       1        2 admn      2020-01-03        2
 3      3          1      59       2        3 admn      2020-01-04        3
 4      4          1      68       2        4 admn      2020-01-05        4
 5      3          1      59       2        3 discharge 2020-01-10        3
 6      1          1      12       1        1 discharge 2020-01-14        2
 7      4          1      68       2        4 discharge 2020-01-14        1
 8      2          1      62       1        2 discharge 2020-01-15        0
 9      5          1      77       1        5 admn      2020-01-16        1
10      6          1      86       1        6 admn      2020-01-18        2
# ... with 34 more rows

With this code, df_2 %>% filter(admitted >1 & activity == "admn"), I can filter out the inconsistent records in one go.
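As a minimal illustration of the counting idea behind `admitted`, the sketch below runs the same +1/-1 running total on a made-up event stream in base R. The `events` data frame and its values are assumptions for illustration only and are not taken from the data above.

```r
# Toy event stream for one patient, already sorted by date (made-up data).
events <- data.frame(
  case_id  = c(1, 2, 1, 2, 3, 3),
  activity = c("admn", "admn", "discharge", "discharge", "admn", "discharge")
)

# +1 for every admission, -1 for every discharge; the running total is the
# number of cases that are open after each event.
events$admitted <- cumsum(ifelse(events$activity == "admn", 1, -1))

# An admission that leaves more than one case open overlaps an earlier one.
overlapping <- events[events$admitted > 1 & events$activity == "admn", ]
print(overlapping)
```

Only the second admission is flagged here, because it is the only one opened while another case is still open.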

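However, I want to generate a list column in which hosp_ids accumulate whenever activity == 'admn', that is, whenever a new record/case_id is opened without the previous ones being closed, and from which the hosp_id is removed whenever activity == 'discharge'. So basically my desired output for df_2 is something like this.

Desired output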

# A tibble: 44 x 8
   case_id patient_id pack_id hosp_id sort_key activity  date       admitted    open_records
    <int>      <int>   <int>   <int>    <int> <ord>     <date>        <dbl>     <list>
 1      1          1      12       1        1 admn      2020-01-01        1     1
 2      2          1      62       1        2 admn      2020-01-03        2     1,1
 3      3          1      59       2        3 admn      2020-01-04        3     1,1,2
 4      4          1      68       2        4 admn      2020-01-05        4     1,1,2,2
 5      3          1      59       2        3 discharge 2020-01-10        3     1,1,2
 6      1          1      12       1        1 discharge 2020-01-14        2     1,2
 7      4          1      68       2        4 discharge 2020-01-14        1     1
 8      2          1      62       1        2 discharge 2020-01-15        0     <NULL>
 9      5          1      77       1        5 admn      2020-01-16        1     1
10      6          1      86       1        6 admn      2020-01-18        2     1,1
# ... with 34 more rows

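Note: I know that list columns are not displayed in a tibble/data.frame the way I have shown them here for explanation. However, if there is any way to print them like this, I would definitely like to know.

Moreover, if there is a better strategy for storing the hospital IDs in a column than generating a list column, I would like to know that too.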

Answers

If you don't mind using a loop:

library(dplyr)    #for lag()
library(stringi)  #for stri_reverse()

df3 <- df_2
df3$open_records <- NA_character_
df3$hosp_id <- as.character(df3$hosp_id) #makes pasting easier

for(i in 1:nrow(df3)){
  #if re-admn
  if(df3$activity[i] == "admn"){
    df3$open_records[i] <- paste(lag(df3$open_records,default = "")[i],df3$hosp_id[i],sep = ",")
  #we'll handle pretty commas later
  }
  
  #if discharge
  if(df3$activity[i] == "discharge"){
    rev_open <- stri_reverse(df3$open_records[i-1])
    rev_id <- stri_reverse(df3$hosp_id[i])
    df3$open_records[i] <- stri_reverse(sub(rev_id,"",rev_open,fixed = TRUE))
  #sub out one hospital if discharge
  #we reverse the string before removing to get the last hosp_id,
  #then reverse the result back so the stored order is unchanged
  }
  
  #if admitted == 0
  if(df3$admitted[i] == 0) df3$open_records[i] <- NA
  
  #if just starting the group
  if(df3$activity[i] == "admn" & df3$admitted[i] == 1){
    df3$open_records[i] <- df3$hosp_id[i]
  }
}
  
#comma clean
df3$open_records <- gsub("^,*|(?<=,),|,*$","",df3$open_records,perl=TRUE)
df3$open_records <- gsub(",",", ",df3$open_records)

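If your dataset is very large, this may not be the best option. It may also be worth adding a next statement to each if block (if you do that, I think it makes sense to move the start-of-group if block to the top of the loop).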
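The restructuring suggested above could look roughly like the following base-R sketch. It is not the code from this answer: `track_open` is a hypothetical helper, it keeps the open hospital ids in a character vector instead of editing a comma-separated string, and it assumes the rows are already sorted by patient and date as in df_2.

```r
# Hypothetical helper: one pass over the events, with the start-of-group
# check first and `next` used to skip the remaining branches.
track_open <- function(activity, hosp_id, admitted) {
  out  <- character(length(activity))
  open <- character(0)                     # hospital ids currently open
  for (i in seq_along(activity)) {
    if (activity[i] == "admn" && admitted[i] == 1) {
      open   <- as.character(hosp_id[i])   # first open case: start afresh
      out[i] <- paste(open, collapse = ",")
      next
    }
    if (activity[i] == "admn") {
      open   <- c(open, as.character(hosp_id[i]))
      out[i] <- paste(open, collapse = ",")
      next
    }
    # discharge: drop the last matching hospital id, if there is one
    hit <- which(open == as.character(hosp_id[i]))
    if (length(hit) > 0) open <- open[-max(hit)]
    out[i] <- paste(open, collapse = ",")
  }
  out
}

# Made-up events for a single patient
res <- track_open(
  activity = c("admn", "admn", "admn", "discharge", "discharge", "discharge"),
  hosp_id  = c(1, 1, 2, 2, 1, 1),
  admitted = c(1, 2, 3, 2, 1, 0)
)
print(res)
```

Because the ids are compared as whole vector elements, this variant does not need the comma clean-up step and is not confused by multi-digit hospital ids.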

(Source for the comma clean-up: Removing multiple commas and trailing commas using gsub)

Edit, based on the request not to use a loop

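library(tidyverse)

#append hosp_id on admission, remove one matching hosp_id on discharge
paste3 <- function(out,input,activity,sep = ",") {
  if (activity == "admn") {
    paste(out,input,sep = sep)
  } else
    if (activity == "discharge") {
      sub(input,"",out)
    }
}

df4 <- df_2 %>%
  #activity of the next row, as plain text, drives each accumulate2() step
  mutate(temp_act = lead(as.character(activity))) %>%
  mutate(open_records = accumulate2(hosp_id,head(temp_act,-1),paste3)
  ) %>%
  select(-temp_act)


df4$open_records <- gsub("^,*|(?<=,),|,*$","",df4$open_records,perl=TRUE)
df4$open_records <- gsub(",",", ",df4$open_records)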

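I noticed that the same hospital can have a patient admitted several times at once. One thing you may want to consider is concatenating case_id and hosp_id, so that when a discharge occurs you remove the entry that corresponds to the correct case_id rather than the first matching hosp_id. (Replace hosp_id in the code with your new variable.)

This does not come up in your sample data, but if someone has open_records of 2,1,2,2 and is discharged from their third admission, my code removes the first matching 2 rather than the entry that belongs to that third admission, which is probably not what you want.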
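One way to sketch the case-keyed idea in base R is shown below. This is an illustration under assumptions, not code from this answer: `open_by_case` is a hypothetical helper that stores the open cases in a named vector (names are case_ids, values are hosp_ids) and returns a real list, and the input is the made-up 2,1,2,2 situation described above.

```r
# Hypothetical helper: a discharge removes exactly its own case_id.
# Assumes case_id values are unique and rows are sorted by date.
open_by_case <- function(case_id, hosp_id, activity) {
  open <- integer(0)                       # named vector: case_id -> hosp_id
  out  <- vector("list", length(case_id))
  for (i in seq_along(case_id)) {
    key <- as.character(case_id[i])
    if (activity[i] == "admn") {
      open[key] <- hosp_id[i]              # open this case
    } else {
      open <- open[names(open) != key]     # close only this case
    }
    out[[i]] <- unname(open)
  }
  out
}

# Four admissions to hospitals 2, 1, 2, 2, then a discharge of the third case
res <- open_by_case(
  case_id  = c(1, 2, 3, 4, 3),
  hosp_id  = c(2, 1, 2, 2, 2),
  activity = c("admn", "admn", "admn", "admn", "discharge")
)

# Collapse each element to text to inspect the list
shown <- sapply(res, paste, collapse = ",")
print(shown)
```

The returned list can be stored as a list column, and collapsing it with paste as above is one simple way to get a printable character column alongside it.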

Here is a tidyverse solution:

library(dplyr)
library(purrr)

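df_2 %>%
  group_by(patient_id) %>%
  # build "1,1,2," style strings: append on admission, drop one match on discharge
  mutate(open_records = accumulate(2:n(),.init = paste0(hosp_id[1],","),~ if(activity[.y] == "admn") {
                                     paste0(.x,hosp_id[.y],",")
                                   } else {
                                     sub(paste0(hosp_id[.y],","),"",.x)
                                   }),open_records = gsub("([0-9,]*),$","\\1",open_records)) # strip the trailing comma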

# A tibble: 44 x 9
# Groups:   patient_id [8]
   case_id patient_id pack_id hosp_id sort_key activity  date       admitted open_records
     <int>      <int>   <int>   <int>    <int> <ord>     <date>        <dbl> <chr>       
 1       1          1      12       1        1 admn      2020-01-01        1 "1"         
 2       2          1      62       1        2 admn      2020-01-03        2 "1,1"       
 3       3          1      59       2        3 admn      2020-01-04        3 "1,1,2"     
 4       4          1      68       2        4 admn      2020-01-05        4 "1,1,2,2"   
 5       3          1      59       2        3 discharge 2020-01-10        3 "1,1,2"     
 6       1          1      12       1        1 discharge 2020-01-14        2 "1,2"       
 7       4          1      68       2        4 discharge 2020-01-14        1 "1"         
 8       2          1      62       1        2 discharge 2020-01-15        0 ""          
 9       5          1      77       1        5 admn      2020-01-16        1 "1"         
10       6          1      86       1        6 admn      2020-01-18        2 "1,1"       
# ... with 34 more rows
