问题描述
如果徒步旅行者正在参观,我有一组小屋的数据表。
library(data.table)
dt <- data.table(time = as.POSIXct(as.Date(10:35,origin = "2020-01-01")),hut= c(1,NA,8,1,4,5,1))
模式是它们将从野外的小屋(例如小屋1)移出(小屋= NA),并在2-5天内返回。这是一个事件。有时他们会去一个新的小屋(例如,小屋4)-这不是一件大事。问题在于,有时它们会意外地放在事件内部的小屋中(如第4行)。因此,这仍然是一个事件。输出应该看起来像这样,但是我不知道如何得到它。实际数据是数十亿行,因此它也应该是有效的,因此data.table:
dt[,event:= c(NA,2,3,NA)]
dt
time hut event
1: 2020-01-11 01:00:00 1 NA
2: 2020-01-12 01:00:00 NA 1
3: 2020-01-13 01:00:00 NA 1
4: 2020-01-14 01:00:00 8 1
5: 2020-01-15 01:00:00 1 NA
6: 2020-01-16 01:00:00 1 NA
7: 2020-01-17 01:00:00 NA 2
8: 2020-01-18 01:00:00 NA 2
9: 2020-01-19 01:00:00 NA 2
10: 2020-01-20 01:00:00 1 NA
11: 2020-01-21 01:00:00 NA NA
12: 2020-01-22 01:00:00 NA NA
13: 2020-01-23 01:00:00 NA NA
14: 2020-01-24 01:00:00 4 NA
15: 2020-01-25 01:00:00 NA 3
16: 2020-01-26 01:00:00 NA 3
17: 2020-01-27 01:00:00 4 NA
18: 2020-01-28 01:00:00 NA 4
19: 2020-01-29 01:00:00 5 4
20: 2020-01-30 01:00:00 NA 4
21: 2020-01-31 01:00:00 NA 4
22: 2020-02-01 01:00:00 4 NA
23: 2020-02-02 01:00:00 NA 5
24: 2020-02-03 01:00:00 4 NA
25: 2020-02-04 01:00:00 NA NA
26: 2020-02-05 01:00:00 1 NA
解决方法
这是使用非等额联接的另一种选择:
dt[,rn := .I]
visits <- dt[!is.na(hut)]
visits[,c("start","end") := .(time + 2L,time + 5L)]
rows <- visits[visits,on=.(hut,time>=start,time<=end),mult="first",nomatch=0L,.(hut,i.time,x.time,i.rn,x.rn)]
dt[rows,on=.(rn>i.rn,rn<x.rn),event := 1L]
dt[,ri := rleid(event)][!is.na(event),event := rleid(ri)]
dt[rn %in% unique(c(rows$i.rn,rows$x.rn)),event := NA_integer_]
dt[,c("ri","rn") := NULL][]
输出:
time hut event
1: 2020-01-11 1 NA
2: 2020-01-12 NA 1
3: 2020-01-13 NA 1
4: 2020-01-14 8 1
5: 2020-01-15 1 NA
6: 2020-01-16 1 NA
7: 2020-01-17 NA 2
8: 2020-01-18 NA 2
9: 2020-01-19 NA 2
10: 2020-01-20 1 NA
11: 2020-01-21 NA NA
12: 2020-01-22 NA NA
13: 2020-01-23 NA NA
14: 2020-01-24 4 NA
15: 2020-01-25 NA 3
16: 2020-01-26 NA 3
17: 2020-01-27 4 NA
18: 2020-01-28 NA 4
19: 2020-01-29 5 4
20: 2020-01-30 NA 4
21: 2020-01-31 NA 4
22: 2020-02-01 4 NA
23: 2020-02-02 NA 5
24: 2020-02-03 4 NA
25: 2020-02-04 NA NA
26: 2020-02-05 1 NA
time hut event
或者,使用滚动联接而不是上面的非等联接:
is <- 2L
intvl <- 5L - is
dt[,c("rn","oned") := .(.I,time + is)]
rows <- dt[dt[!is.na(hut)],time=oned),roll=-intvl,x.rn)]
#the rest of the code from the non-equi join above is needed here as well
数据:
library(data.table)
dt <- data.table(time = as.Date(10:35,origin = "2020-01-01"),hut= c(1,NA,8,1,4,5,1))
,
好吧,这不是很明显,但是让我们尝试..
library(data.table)
dt <- data.table(time = as.POSIXct(as.Date(10:35,origin = "2020-01-01")),1))
library(dplyr)
dt[,last.hut1 := lag(hut,n = 1,order_by = time)]
dt[,last.hut2 := lag(hut,n = 2,last.hut3 := lag(hut,n = 3,last.hut4 := lag(hut,n = 4,last.hut5 := lag(hut,n = 5,next.hut1 := lead(hut,next.hut2 := lead(hut,next.hut3 := lead(hut,next.hut4 := lead(hut,next.hut5 := lead(hut,order_by = time)]
dt[,end.event := case_when((hut == last.hut2 | hut == last.hut3 | hut == last.hut4 | hut == last.hut5)
& (last.hut1 != hut | is.na(last.hut1)) ~ 1,TRUE ~ 0)]
dt[,start.event := case_when((hut == next.hut2 | hut == next.hut3 | hut == next.hut4 | hut == next.hut5)
& (next.hut1 != hut | is.na(next.hut1)) ~ 1,TRUE ~ 0)]
dt[,start.event2 := cumsum(start.event)]
dt[,end.event2 := cumsum(end.event)]
dt[,event := case_when((start.event2 > end.event2) & (start.event == 0) & (end.event == 0) ~ start.event2,TRUE ~ NA_real_)]
dt[,c("last.hut1","last.hut2","last.hut3","last.hut4","last.hut5","next.hut1","next.hut2","next.hut3","next.hut4","next.hut5","start.event","start.event2","end.event","end.event2") := .(NULL,NULL,NULL)]