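Problem description
I have two large datasets whose only shared column is a numeric timestamp. I would like to merge the data frames on this timestamp, but the data were not collected at exactly matching frequencies, so the merge needs to pair each row with its nearest possible match.
As a simplified example, here is a small dataset with a value column, some events, and an ID: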
library(data.table)
a<-c("150","164","175","183","195","200","205","213")
b<-c("start1","end1","start2","end2","start1","end1","start2","end2")
c<-c("A","A","A","A","B","B","B","B")
(data<-data.table(value = a,event = b,ID = c))
I would like to merge this "data" with this numeric series ("times") on the value column:
(times<-data.frame(value = c(seq(from = 150,to = 213,by = 3))))
so that they are merged on the closest approximate match in the value column, producing this final data frame:
agoal<-c(seq(from = 150,by = 3))
bgoal<-c("start1","","end2")
cgoal<-c("A","B")
(goal<-data.frame(value = agoal,event = bgoal,ID = cgoal))
Is there a way to do this, especially for a very large dataset (so that it does not crash R)?
Solution
data.table provides a rolling join solution.
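library(data.table)
# key both tables on the join column (both must be data.tables)
setkey(data,value)
setkey(times,value)
# for each row of times, take the row of data whose value is nearest
data[times,roll = "nearest"]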
# value event ID
# 1: 150 start1 A
# 2: 153 start1 A
# 3: 156 start1 A
# 4: 159 end1 A
# 5: 162 end1 A
# 6: 165 end1 A
# 7: 168 end1 A
# 8: 171 start2 A
# 9: 174 start2 A
#10: 177 start2 A
#11: 180 end2 A
#12: 183 end2 A
#13: 186 end2 A
#14: 189 end2 A
#15: 192 start1 B
#16: 195 start1 B
#17: 198 end1 B
#18: 201 end1 B
#19: 204 start2 B
#20: 207 start2 B
#21: 210 end2 B
#22: 213 end2 B
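In the result above, the value column holds the timestamps from times, so the original timestamp of the matched data row is no longer visible. A minimal sketch of one way to keep it, assuming the same example data with a numeric value column: copy the timestamp into a helper column before joining. The names matched and distance are hypothetical, introduced only for this illustration.

```r
library(data.table)

# Same example data as in this answer (value must be numeric)
data <- data.table(
  value = c(150, 164, 175, 183, 195, 200, 205, 213),
  event = c("start1", "end1", "start2", "end2",
            "start1", "end1", "start2", "end2"),
  ID    = rep(c("A", "B"), each = 4)
)
times <- data.table(value = seq(from = 150, to = 213, by = 3))

# Keep a copy of the original timestamp: in the joined result the
# join column carries the values from `times`, not from `data`.
data[, matched := value]

# `on =` performs the same rolling join without setting keys first
res <- data[times, on = "value", roll = "nearest"]

# How far each timestamp in `times` is from the row it was matched to
res[, distance := abs(value - matched)]
res
```

With distance available, rows whose nearest match is too far away can be inspected or filtered afterwards instead of being accepted silently.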
Data:
a<-c("150","164","175","183","195","200","205","213")
b<-c("start1","end1","start2","end2","start1","end1","start2","end2")
c<-c("A","A","A","A","B","B","B","B")
data<-data.table(value = as.numeric(a),event = b,ID = c)
times<-data.table(value = c(seq(from = 150,to = 213,by = 3)))
To join on the nearest match without filling the gaps with approximate matches, fuzzyjoin works well!
(end<-fuzzyjoin::difference_left_join(times,data,by = "value",max_dist = 1,distance_col= "distance"))
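To make the behaviour of max_dist concrete, here is a small self-contained sketch of the same call on the example data. It assumes the fuzzyjoin package is installed and that value is numeric. As far as I know, fuzzyjoin suffixes same-named join columns with .x and .y, so the timestamps from times should appear as value.x and those from data as value.y.

```r
library(fuzzyjoin)

# Example data as plain data frames, with a numeric value column
data <- data.frame(
  value = c(150, 164, 175, 183, 195, 200, 205, 213),
  event = c("start1", "end1", "start2", "end2",
            "start1", "end1", "start2", "end2"),
  ID    = rep(c("A", "B"), each = 4)
)
times <- data.frame(value = seq(from = 150, to = 213, by = 3))

# Left join: every row of `times` is kept; a row of `data` is attached
# only when its value lies within max_dist of the `times` value.
end <- difference_left_join(times, data, by = "value",
                            max_dist = 1, distance_col = "distance")

# Rows without a close enough match have NA in the columns from `data`
matched_rows <- end[!is.na(end$event), ]
matched_rows
```

Because the data timestamps in this example are at least 5 apart and max_dist is 1, each row of times matches at most one row of data. With denser data, a single timestamp could match several rows and the result would then contain more rows than times.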