Replace an entire column of an R data.table based on a filter condition within a group-by column

Problem description

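I have a dataset obtained from a clinical center that monitors patients with overweight and high blood sugar.

Basically, every patient, identified by the patient_id column, has their initial weight (init_weight) recorded at the time of enrollment (created_at).

Patients are supposed to revisit the clinic regularly (but definitely not within 1 day of enrollment) so that their subsequent weight (subseq_weight) can be recorded, timestamped in the updated_at column.

The diff_days column is just the difference in days between updated_at and created_at, and diff_days_ceil is the ceiling of diff_days, which I created on purpose.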
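As a minimal sketch of how the two derived columns could be computed, assuming created_at and updated_at are parsed as POSIXct in a single time zone (the two-row table and the UTC time zone are assumptions for illustration; the timestamps are taken from the sample data):

```r
library(data.table)

# Hypothetical two-row table; timestamps copied from the sample data.
# The UTC time zone is an assumption, chosen to avoid daylight-saving shifts.
dt <- data.table(
  created_at = as.POSIXct(c("2018-12-23 07:15:57", "2018-12-24 12:13:06"), tz = "UTC"),
  updated_at = as.POSIXct(c("2018-12-23 07:23:44", "2019-04-02 03:48:20"), tz = "UTC")
)

# Elapsed time between the two timestamps, expressed in days.
dt[, diff_days := as.numeric(difftime(updated_at, created_at, units = "days"))]

# Round up to whole days, so anything inside the first day becomes 1.
dt[, diff_days_ceil := ceiling(diff_days)]
```

With this definition, any reading taken less than a day after enrollment ends up with diff_days_ceil equal to 1.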

So the data table looks like this.

> dt
   patient_id init_weight          created_at subseq_weight
1:         24          77 2018-12-23 07:15:57            72
2:         38          99 2018-12-24 12:13:06           107
3:         38          99 2018-12-24 12:13:06           110
4:         38          99 2018-12-24 12:13:06           115
5:         38          99 2018-12-24 12:13:06           118
6:         47          63 2018-12-27 09:53:40            63
7:         47          63 2018-12-27 09:53:40            64
            updated_at diff_days diff_days_ceil
1: 2018-12-23 07:23:44   0.00541              1
2: 2019-04-02 03:48:20  98.64947             99
3: 2019-02-18 12:23:19  56.00709             57
4: 2019-01-12 11:33:15  18.97233             19
5: 2018-12-24 12:17:44   0.00322              1
6: 2019-01-03 19:08:04   7.38500              8
7: 2018-12-27 10:01:48   0.00565              1

Now the problem is that the entries in init_weight are not always correct.

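For each patient, if a subseq_weight is found to have been entered within 1 day of init_weight, then init_weight needs to be replaced by the latest subseq_weight entered within that 1 day.

That means that for each patient_id we need to look for values in the range 0:1 in the diff_days_ceil column. If any are found, all init_weight records for that patient are replaced with the subseq_weight corresponding to the latest updated_at that is still within 1 day of created_at (in other words, the one corresponding to max(diff_days_ceil) within 1 day).

For example, patient_id == 24 satisfies this condition: a subseq_weight was recorded on the same day as init_weight. So init_weight in the first row will be replaced with 72.

The same goes for patient_id == 47: all init_weight entries will be replaced with subseq_weight == 64, which corresponds to row 7.

I tried one approach, but I am not at all sure whether I would lose data.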

dt[diff_days_ceil %in% 0:1, init_weight1 := .SD[diff_days_ceil %in% 0:1, subseq_weight[which.max(diff_days)]], .(patient_id)]
dt[, init_weight1 := nafill(x = init_weight1, type = 'nocb'), .(patient_id)]

In the first part, I create a new column init_weight1 for the records where subseq_weight was entered within 1 day of the init_weight entry.

In the second part, all other cases, i.e. the NAs, are filled per patient_id using the 'next observation carried backward' technique.
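To illustrate what the backward fill does, here is a small sketch on a hypothetical vector (the values are made up for illustration, not taken from a real patient). It shows that nafill with type 'nocb' only fills NAs that have a later non-NA value, so a trailing NA stays NA:

```r
library(data.table)

# Hypothetical per-patient column after the first step:
# only the third record received a value.
x <- c(NA, NA, 118L, NA)

# 'nocb' carries the next observed value backward over earlier NAs.
filled <- nafill(x, type = "nocb")
filled
```

In the sample data the within-1-day record happens to be the last row of each patient, so the fill reaches every row. With a different row order, or for a patient with no reading inside the first day, some init_weight1 values would remain NA.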

I would like an alternative technique that avoids creating NAs and then filling them. Thank you.

Solution

library(data.table)
setDT(df)[diff_days <= 1,init_weight := subseq_weight,by = .(patient_id)]
df

#   patient_id init_weight          created_at subseq_weight          updated_at
#1:         24          72 2018-12-23 07:15:57            72 2018-12-23 07:23:44
#2:         38          99 2018-12-24 12:13:06           107 2019-04-02 03:48:20
#3:         38          99 2018-12-24 12:13:06           110 2019-02-18 12:23:19
#4:         38          99 2018-12-24 12:13:06           115 2019-01-12 11:33:15
#5:         38         118 2018-12-24 12:13:06           118 2018-12-24 12:17:44
#6:         47          63 2018-12-27 09:53:40            63 2019-01-03 19:08:04
#7:         47          64 2018-12-27 09:53:40            64 2018-12-27 10:01:48
#   diff_days diff_days_ceil
#1:   0.00541              1
#2:  98.64947             99
#3:  56.00709             57
#4:  18.97233             19
#5:   0.00322              1
#6:   7.38500              8
#7:   0.00565              1

Data:

df <- structure(list(patient_id = c(24L,38L,38L,38L,38L,47L,47L),
init_weight = c(77L,99L,99L,99L,99L,63L,63L),
created_at = c("2018-12-23 07:15:57","2018-12-24 12:13:06","2018-12-24 12:13:06","2018-12-24 12:13:06","2018-12-24 12:13:06","2018-12-27 09:53:40","2018-12-27 09:53:40"),
subseq_weight = c(72L,107L,110L,115L,118L,63L,64L),
updated_at = c("2018-12-23 07:23:44","2019-04-02 03:48:20","2019-02-18 12:23:19","2019-01-12 11:33:15","2018-12-24 12:17:44","2019-01-03 19:08:04","2018-12-27 10:01:48"),
diff_days = c(0.00541,98.64947,56.00709,18.97233,0.00322,7.385,0.00565),
diff_days_ceil = c(1L,99L,57L,19L,1L,8L,1L)),class = "data.frame",row.names = c(NA,-7L))
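The question asks for every init_weight row of a patient to be replaced, not only the rows recorded within the first day. A minimal sketch of one way to do that in a single grouped assignment, without an intermediate NA column, might look like the following. The table is rebuilt inline from the sample values so the block runs on its own, and the variable name within_day is introduced here for illustration. Note that under the rule as stated, patient 38 also qualifies through row 5.

```r
library(data.table)

# Sample values copied from the question; only the columns needed here.
dt <- data.table(
  patient_id    = c(24L, 38L, 38L, 38L, 38L, 47L, 47L),
  init_weight   = c(77L, 99L, 99L, 99L, 99L, 63L, 63L),
  subseq_weight = c(72L, 107L, 110L, 115L, 118L, 63L, 64L),
  diff_days     = c(0.00541, 98.64947, 56.00709, 18.97233, 0.00322, 7.385, 0.00565)
)

dt[, init_weight := {
  # Rows of this patient recorded within 1 day of enrollment.
  within_day <- diff_days <= 1
  if (any(within_day)) {
    # Latest such reading; the single value is recycled over the whole group.
    subseq_weight[within_day][which.max(diff_days[within_day])]
  } else {
    # No reading inside the first day: keep the recorded initial weight.
    init_weight
  }
}, by = patient_id]

dt
```

Patients without any reading inside the first day keep their original init_weight, so no NA is created at any point.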
