在事件发生后的时间间隔内分配唯一ID

问题描述

这有点奇怪,我无法在stackoverflow上找到解决方案。我有一个数据集,其中包含一个日期时间列和一个指示事件的值列,例如下面的dat示例。日期时间是每小时,但是请注意,偶尔会出现“错过的”小时(第12和13行之间缺少2小时)。

dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),max(as.POSIXct("2010-04-04 10:00:00 UTC")),by = "hour")[-c(13,14)],event = c(1,rep(NA,9),2,5),3,4,5,NA,6))
> dat
              datetime event
1  2010-04-03 03:00:00     1
2  2010-04-03 04:00:00    NA
3  2010-04-03 05:00:00    NA
4  2010-04-03 06:00:00    NA
5  2010-04-03 07:00:00    NA
6  2010-04-03 08:00:00    NA
7  2010-04-03 09:00:00    NA
8  2010-04-03 10:00:00    NA
9  2010-04-03 11:00:00    NA
10 2010-04-03 12:00:00    NA
11 2010-04-03 13:00:00     2
12 2010-04-03 14:00:00    NA
13 2010-04-03 17:00:00    NA
14 2010-04-03 18:00:00    NA
15 2010-04-03 19:00:00    NA
16 2010-04-03 20:00:00    NA
17 2010-04-03 21:00:00     3
18 2010-04-03 22:00:00     4
19 2010-04-03 23:00:00    NA
20 2010-04-04 00:00:00    NA
21 2010-04-04 01:00:00    NA
22 2010-04-04 02:00:00    NA
23 2010-04-04 03:00:00    NA
24 2010-04-04 04:00:00    NA
25 2010-04-04 05:00:00    NA
26 2010-04-04 06:00:00    NA
27 2010-04-04 07:00:00    NA
28 2010-04-04 08:00:00     5
29 2010-04-04 09:00:00    NA
30 2010-04-04 10:00:00     6

我希望在事件发生后的7小时内,用唯一的标识符来标识每一行,但要注意以下几点(因此称为“奇怪的情况”):

  • 如果后续事件在该事件之前的7个小时内发生,则该后续事件实质上被忽略(即,“事件”编号不等于分配的标识符值),并且
  • 考虑了丢失时间(即,规则基于经过的时间,而不是行数)。

该产品看起来像result

library(dplyr)

result <- dat %>% 
  mutate(id = c(rep(1,8),2),rep(2,6),rep(3,3),rep(4,3)))
> result
              datetime event id
1  2010-04-03 03:00:00     1  1
2  2010-04-03 04:00:00    NA  1
3  2010-04-03 05:00:00    NA  1
4  2010-04-03 06:00:00    NA  1
5  2010-04-03 07:00:00    NA  1
6  2010-04-03 08:00:00    NA  1
7  2010-04-03 09:00:00    NA  1
8  2010-04-03 10:00:00    NA  1
9  2010-04-03 11:00:00    NA NA
10 2010-04-03 12:00:00    NA NA
11 2010-04-03 13:00:00     2  2
12 2010-04-03 14:00:00    NA  2
13 2010-04-03 17:00:00    NA  2
14 2010-04-03 18:00:00    NA  2
15 2010-04-03 19:00:00    NA  2
16 2010-04-03 20:00:00    NA  2
17 2010-04-03 21:00:00     3  3
18 2010-04-03 22:00:00     4  3
19 2010-04-03 23:00:00    NA  3
20 2010-04-04 00:00:00    NA  3
21 2010-04-04 01:00:00    NA  3
22 2010-04-04 02:00:00    NA  3
23 2010-04-04 03:00:00    NA  3
24 2010-04-04 04:00:00    NA  3
25 2010-04-04 05:00:00    NA NA
26 2010-04-04 06:00:00    NA NA
27 2010-04-04 07:00:00    NA NA
28 2010-04-04 08:00:00     5  4
29 2010-04-04 09:00:00    NA  4
30 2010-04-04 10:00:00     6  4

最理想的情况是,这将在dplyr框架中完成。

解决方法

library(lubridate)
library(tidyverse)

dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),max(as.POSIXct("2010-04-04 10:00:00 UTC")),by = "hour")[-c(13,14)],event = c(1,rep(NA,9),2,5),3,4,5,NA,6)) %>% 
  mutate(id = c(rep(1,8),2),rep(2,6),rep(3,3),rep(4,3)))


Events <- dat %>% 
  #Get only the roes with events
  filter(!is.na(event)) %>% 
  #Get the duration of time between events
  mutate(
    EventLag = datetime - lag(datetime)) %>% 
  ## remove events that occurred < 7 hrs after the previous or that are NA (i.e. the first one). but in the real data
  ## I do not suspect your first point would ever be an event...? Maybe this can be removed in the 
  ## real dataset...
  filter(as.numeric(EventLag) > 7| is.na(EventLag)) %>% 
  as.data.frame()

## You now have all of the events that are of interest (i.e. those that occurred outside of the 7 hr buffer)
## Give the events a new ID so there are no gaps
## Join them with the rest of the datetime stamps
Events <- Events %>% 
  mutate(ID = row_number()) %>% 
  dplyr::select(datetime,ID)


## Expand each event by 7 hrs
Events <- Events %>%
  group_by(ID) %>%
  do(data.frame(ID= .$ID,datetime= seq(.$datetime,.$datetime + hours(7),by = '1 hour'),stringsAsFactors=FALSE)) %>% 
  as.data.frame()


## Join with initial data by datettime
DatJoin <- dat %>% 
  left_join(Events,by = "datetime")


DatJoin