Problem description
library(data.table)
# three periods for each of the three ids, matching the printed table below
id = c(rep(1,3),rep(2,3),rep(3,3))
start = as.Date(rep(c("2014-07-01","2015-03-12","2016-08-13"),3))
end = as.Date(rep(c("2015-03-11","2015-08-12","2018-12-31"),3))
DT = data.table(id,start,end)
DT
id start end
1: 1 2014-07-01 2015-03-11
2: 1 2015-03-12 2015-08-12
3: 1 2016-08-13 2018-12-31
4: 2 2014-07-01 2015-03-11
5: 2 2015-03-12 2015-08-12
6: 2 2016-08-13 2018-12-31
7: 3 2014-07-01 2015-03-11
8: 3 2015-03-12 2015-08-12
9: 3 2016-08-13 2018-12-31
I also have a clinical registry (weight and height) like this:
# 2, 3 and 4 measurements for ids 1, 2 and 3, matching the printed table below
id_clin = c(rep(1,2),rep(2,3),rep(3,4))
date = as.Date(c("2014-10-23","2016-09-01","2017-01-01","2014-08-01","2015-02-01","2017-06-01","2018-03-05","2018-09-01","2018-11-30"))
weight = c(60,65,62,75,68,90,102,104,98)
height = c(160,160,170,175,170,200,200,200,200)
DT_clin = data.table(id_clin,date,weight,height)
DT_clin
id_clin date weight height
1: 1 2014-10-23 60 160
2: 1 2016-09-01 65 160
3: 2 2017-01-01 62 170
4: 2 2014-08-01 75 175
5: 2 2015-02-01 68 170
6: 3 2017-06-01 90 200
7: 3 2018-03-05 102 200
8: 3 2018-09-01 104 200
9: 3 2018-11-30 98 200
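- When a clinical measurement (DT_clin) for an id has a date that falls between the start and end of a period (DT) for the same id, the registry values must be joined to that period.
- If no DT_clin value falls within a DT period, nothing needs to be joined.
- If several values fall within a DT period, I want the mean of the overlapping values.
The desired result looks like this*: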
id start end date date2 weight height
1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
2: 1 2015-03-12 2015-08-12 <NA> <NA> NA NA
3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
5: 2 2015-03-12 2015-08-12 <NA> <NA> NA NA
6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
7: 3 2014-07-01 2015-03-11 <NA> <NA> NA NA
8: 3 2015-03-12 2015-08-12 <NA> <NA> NA NA
9: 3 2016-08-13 2018-12-31 2017-06-01 2018-11-30 98.5 200.0
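Also, if there is a way to apply different operations to different variables, I would be interested to know it (for example, computing the mean of weight and the maximum of height while doing the join).
I tested foverlaps and got good results when only one value falls in a period, but when several values overlap I cannot get what I want: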
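DT_clin[,date2 := date]  # copy of date, so each measurement becomes a zero-length interval (see * below)
setkey(DT,id,start,end)
setkey(DT_clin,id_clin,date,date2)
foverlaps(DT[id == 1,],DT_clin[id_clin == 1,],by.x = c("id","start","end"),by.y = c("id_clin","date","date2"),nomatch = NA)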
Should I use a non-equi join?
Thanks in advance for your help :)
*I copied date to create date2, to fake a time interval
Solution
Use a non-equi join, then aggregate by id, start and end:
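# DT_clin must already contain date2, the copy of date described in the question
# joining on date twice puts start in date and end in date.1; date2 keeps the measurement dates
ans <- DT_clin[DT,on = .(date >= start,date <= end,id_clin = id)]
ans[,.(date = min(date2),date2 = max(date2),weight = mean(weight),height = mean(height)),by = .(id = id_clin,start = date,end = date.1)]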
# id start end date date2 weight height
# 1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
# 2: 1 2015-03-12 2015-08-12 <NA> <NA> NA NA
# 3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
# 4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
# 5: 2 2015-03-12 2015-08-12 <NA> <NA> NA NA
# 6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
# 7: 3 2014-07-01 2015-03-11 <NA> <NA> NA NA
# 8: 3 2015-03-12 2015-08-12 <NA> <NA> NA NA
# 9: 3 2016-08-13 2018-12-31 2017-06-01 2018-11-30 98.5 200.0
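The question also asks about applying a different operation to each variable. The following is only a sketch of one way that might be done, not the answerer's code: it rebuilds the toy data from the tables printed in the question, uses the `i.` and `x.` prefixes in `j` to keep the period bounds and the measurement date (so no `date2` copy is needed), and then takes the mean of weight and the maximum of height per period. The column names `n_obs`, `weight_mean` and `height_max` are made up for illustration.

```r
library(data.table)

# Toy data rebuilt from the tables printed in the question
DT <- data.table(
  id    = rep(1:3, each = 3),
  start = as.Date(rep(c("2014-07-01", "2015-03-12", "2016-08-13"), 3)),
  end   = as.Date(rep(c("2015-03-11", "2015-08-12", "2018-12-31"), 3))
)
DT_clin <- data.table(
  id_clin = rep(1:3, times = c(2, 3, 4)),
  date    = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01",
                      "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01",
                      "2018-11-30")),
  weight  = c(60, 65, 62, 75, 68, 90, 102, 104, 98),
  height  = c(160, 160, 170, 175, 170, 200, 200, 200, 200)
)

# Non-equi join: one row per (period, measurement) pair; periods without
# any measurement are kept with NA in the DT_clin columns.
# i.* refers to DT's columns, x.* to DT_clin's original values.
joined <- DT_clin[DT,
  on = .(id_clin = id, date >= start, date <= end),
  .(id = i.id, start = i.start, end = i.end,
    date = x.date, weight = x.weight, height = x.height)]

# A different summary for each variable, one row per period
res <- joined[,
  .(n_obs       = sum(!is.na(weight)),  # number of measurements in the period
    weight_mean = mean(weight),         # mean of weight
    height_max  = max(height)),         # maximum of height
  by = .(id, start, end)]

print(res)
```

Any function that returns a single value per group can be swapped in for `mean` or `max` in the second step; periods with no measurement stay `NA` because neither call uses `na.rm`.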
Use foverlaps:
library(data.table)
# DT_clin must already contain date2 (the copy of date); the interval columns go last in the key
setkey(DT_clin,id_clin,date,date2)
foverlaps(DT,DT_clin,by.x = c("id","start","end"),by.y = c("id_clin","date","date2"),nomatch = NA)[,.(datemin = min(date),datemax = max(date),weight = mean(weight,na.rm = TRUE),height = mean(height,na.rm = TRUE)),by = .(id,start,end)]
id start end datemin datemax weight height
1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
2: 1 2015-03-12 2015-08-12 <NA> <NA> NaN NaN
3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
5: 2 2015-03-12 2015-08-12 <NA> <NA> NaN NaN
6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
7: 3 2014-07-01 2015-03-11 <NA> <NA> NaN NaN
8: 3 2015-03-12 2015-08-12 <NA> <NA> NaN NaN
9: 3 2016-08-13 2018-12-31 2017-06-01 2018-11-30 98.5 200.0
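The NaN values in this output, where the desired result shows NA, come from calling `mean` with `na.rm = TRUE` on a group that contains only NA. A minimal sketch of that behaviour, with a small wrapper that returns NA instead (the name `mean_or_na` is hypothetical, not part of data.table or base R):

```r
# A period with no matching measurement contributes a single NA
x <- NA_real_

plain   <- mean(x)                # NA: the missing value propagates
removed <- mean(x, na.rm = TRUE)  # NaN: mean of an empty vector after NA removal

# Hypothetical helper: mean that ignores NAs but yields NA for all-NA input
mean_or_na <- function(v) {
  if (all(is.na(v))) NA_real_ else mean(v, na.rm = TRUE)
}

empty_period <- mean_or_na(x)               # NA
some_missing <- mean_or_na(c(75, NA, 68))   # 71.5
```

Using `mean_or_na(weight)` in place of `mean(weight, na.rm = TRUE)` in the aggregation step should give NA for the empty periods while still ignoring isolated missing values.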