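Problem description
I have a data frame of patient data with one row per chest X-ray. My columns include a measurement specific to each chest X-ray, the date of the chest X-ray, and several other columns whose values are the same across all rows for a given patient (such as the final outcome).
For example: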
+--------+------------+----------+------------+-------------+-----+-------+---------+
| pat_id | index_date | cxr_date | delta_date | cxr_measure | age | admit | outcome |
+--------+------------+----------+------------+-------------+-----+-------+---------+
| 1 | 1/2/2020 | 1/2/2020 | 0 | 0.1 | 55 | 1 | 0 |
| 1 | 1/2/2020 | 1/3/2020 | 1 | 0.3 | 55 | 1 | 0 |
| 1 | 1/2/2020 | 1/3/2020 | 1 | 0.5 | 55 | 1 | 0 |
| 2 | 2/1/2020 | 2/2/2020 | 1 | 0.2 | 59 | 0 | 0 |
| 2 | 2/1/2020 | 2/3/2020 | 2 | 0.9 | 59 | 0 | 0 |
| 3 | 1/6/2020 | 1/6/2020 | 0 | 0.7 | 66 | 1 | 1 |
+--------+------------+----------+------------+-------------+-----+-------+---------+
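I would like to reformat the table so that there is one row per patient. I think my final table should look like the one below, where each measurement becomes a column named cxr_measure_#, with # being the delta_date. In the real dataset I would have many such columns (# ranging from -5 to +30). If there are two rows/values on the same delta_date, I would ideally like to take their mean.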
+--------+------------+----------------+---------------+---------------+--------------+-----+-------+---------+
| pat_id | index_date | first_cxr_date | cxr_measure_0 | cxr_measure_1 | cxr_measure_2 | age | admit | outcome |
+--------+------------+----------------+---------------+---------------+--------------+-----+-------+---------+
| 1 | 1/2/2020 | 1/2/2020 | 0.1 | 0.4 | NA | 55 | 1 | 0 |
| 2 | 2/1/2020 | 2/2/2020 | NA | 0.2 | 0.9 | 59 | 0 | 0 |
| 3 | 1/6/2020 | 1/6/2020 | 0.7 | NA | NA | 66 | 1 | 1 |
+--------+------------+----------------+---------------+---------------+--------------+-----+-------+---------+
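Is there an easy way to reshape between these two tables? I have played with pivot_longer and pivot_wider, but I am not sure how to (1) get delta_date into the variable names and (2) take the mean when two rows share the same date. I am also curious whether this would be easier in Python (I did most of the data management in Pandas, then some additional data cleaning and analysis in R).
Solution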
To expand on @Dave2e's answer, you can use group_by and then min to get the first_cxr_date for each pat_id, which lets you write a concise functional solution.
library(tibble)
library(dplyr)
library(tidyr)
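# Sample data, matching the table in the question
df <-
  tribble(
    ~pat_id, ~index_date, ~cxr_date, ~delta_date, ~cxr_measure, ~age, ~admit, ~outcome,
    1, '1/2/2020', '1/2/2020', 0, 0.1, 55, 1, 0,
    1, '1/2/2020', '1/3/2020', 1, 0.3, 55, 1, 0,
    1, '1/2/2020', '1/3/2020', 1, 0.5, 55, 1, 0,
    2, '2/1/2020', '2/2/2020', 1, 0.2, 59, 0, 0,
    2, '2/1/2020', '2/3/2020', 2, 0.9, 59, 0, 0,
    3, '1/6/2020', '1/6/2020', 0, 0.7, 66, 1, 1
  )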
df %>%
  group_by(pat_id) %>%
  # first_cxr_date = earliest cxr_date per patient; cxr_date is character here,
  # so min() compares strings (convert with as.Date() on real data)
  mutate(first_cxr_date = min(cxr_date)) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = -c(delta_date, cxr_measure, cxr_date),
    names_from = delta_date,         # column names from delta_date
    values_from = cxr_measure,
    names_prefix = 'cxr_measure_',   # prefix pasted onto the column names
    values_fn = mean                 # combine duplicates with mean
  )
# A tibble: 3 x 9
pat_id index_date age admit outcome first_cxr_date cxr_measure_0 cxr_measure_1 cxr_measure_2
<dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 1/2/2020 55 1 0 1/2/2020 0.1 0.4 NA
2 2 2/1/2020 59 0 0 2/2/2020 NA 0.2 0.9
3 3 1/6/2020 66 1 1 1/6/2020 0.7 NA NA
Here is a hybrid approach: pivot_wider computes the mean of the cxr_measure values, and dplyr's summarize determines the first cxr_date.
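# Sample data, matching the table in the question
df <- structure(list(
  pat_id = c(1L, 1L, 1L, 2L, 2L, 3L),
  index_date = c("1/2/2020", "1/2/2020", "1/2/2020", "2/1/2020", "2/1/2020", "1/6/2020"),
  cxr_date = c("1/2/2020", "1/3/2020", "1/3/2020", "2/2/2020", "2/3/2020", "1/6/2020"),
  delta_date = c(0L, 1L, 1L, 1L, 2L, 0L),
  cxr_measure = c(0.1, 0.3, 0.5, 0.2, 0.9, 0.7),
  age = c(55L, 55L, 55L, 59L, 59L, 66L),
  admit = c(1L, 1L, 1L, 0L, 0L, 1L),
  outcome = c(0L, 0L, 0L, 0L, 0L, 1L)),
  class = "data.frame", row.names = c(NA, -6L))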
library(tidyr)
library(dplyr)
# Widen: one cxr_measure_<delta_date> column per delta_date, averaging duplicates
answer <- pivot_wider(df,
                      id_cols = -c("delta_date", "cxr_measure", "cxr_date"),
                      names_from = "delta_date",
                      values_from = c("cxr_measure"),
                      values_fn = list(cxr_measure = mean),
                      names_glue = 'cxr_measure_{delta_date}')
# Earliest X-ray date per patient, parsed as a real Date
firstdate <- df %>%
  group_by(pat_id) %>%
  summarize(first_cxr_date = min(as.Date(cxr_date, "%m/%d/%Y")))
answer <- left_join(answer, firstdate)
Joining, by = "pat_id"
# A tibble: 3 x 9
pat_id index_date age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2 first_cxr_date
<int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <date>
1 1 1/2/2020 55 1 0 0.1 0.4 NA 2020-01-02
2 2 2/1/2020 59 0 0 NA 0.2 0.9 2020-02-02
3 3 1/6/2020 66 1 1 0.7 NA NA 2020-01-06
I'm sure there is a way to combine all of this into a single function call, but sometimes ugly is just faster.
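Special thanks to dear Mr. @Onyambu, who taught me a valuable point today.
You can also use the following solution. Note .value, which is especially useful with pivot_longer when several column names need to be created from the data. Here it tells pivot_wider that part of the new name is actually the name of the column the values are taken from.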
library(dplyr)
library(tidyr)
df %>%
group_by(pat_id) %>%
mutate(id = row_number()) %>%
  pivot_wider(names_from = delta_date, values_from = cxr_measure, names_glue = "{.value}_{delta_date}") %>%
mutate(across(cxr_measure_0:cxr_measure_2,~ mean(.x,na.rm = TRUE))) %>%
select(-id) %>%
slice_head(n = 1)
# A tibble: 3 x 9
# Groups: pat_id [3]
pat_id index_date cxr_date age admit outcome cxr_measure_0 cxr_measure_1 cxr_measure_2
<int> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 1/2/2020 1/2/2020 55 1 0 0.1 0.4 NaN
2 2 2/1/2020 2/2/2020 59 0 0 NaN 0.2 0.9
3 3 1/6/2020 1/6/2020 66 1 1 0.7 NaN NaN
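On the question's aside about Python: the same reshape can be sketched in pandas. This is not taken from any of the answers above; it is a minimal sketch that assumes the toy data from the question's table, and the names wide, per_patient and result are made up for illustration. pivot_table averages duplicate (pat_id, delta_date) pairs, and a grouped aggregation picks up the earliest cxr_date and the per-patient constants.

```python
import pandas as pd

# Toy data copied from the question's table (one row per chest X-ray).
df = pd.DataFrame({
    "pat_id": [1, 1, 1, 2, 2, 3],
    "index_date": ["1/2/2020", "1/2/2020", "1/2/2020",
                   "2/1/2020", "2/1/2020", "1/6/2020"],
    "cxr_date": ["1/2/2020", "1/3/2020", "1/3/2020",
                 "2/2/2020", "2/3/2020", "1/6/2020"],
    "delta_date": [0, 1, 1, 1, 2, 0],
    "cxr_measure": [0.1, 0.3, 0.5, 0.2, 0.9, 0.7],
    "age": [55, 55, 55, 59, 59, 66],
    "admit": [1, 1, 1, 0, 0, 1],
    "outcome": [0, 0, 0, 0, 0, 1],
})

# Parse dates so that "min" means earliest date, not smallest string.
df["cxr_date"] = pd.to_datetime(df["cxr_date"], format="%m/%d/%Y")

# One column per delta_date; duplicates within a patient/day are averaged.
wide = df.pivot_table(index="pat_id", columns="delta_date",
                      values="cxr_measure", aggfunc="mean")
wide.columns = [f"cxr_measure_{d}" for d in wide.columns]

# Per-patient constants plus the earliest X-ray date.
per_patient = df.groupby("pat_id").agg(
    index_date=("index_date", "first"),
    first_cxr_date=("cxr_date", "min"),
    age=("age", "first"),
    admit=("admit", "first"),
    outcome=("outcome", "first"),
)

# Both frames are indexed by pat_id, so a plain join lines them up.
result = per_patient.join(wide).reset_index()
print(result)
```

Missing patient/day combinations come out as NaN, which plays the role of NA in the R output. Whether this is easier than pivot_wider is a matter of taste; the structure is much the same as the hybrid R answer (pivot, summarize, join).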