Problem description
I have a date column for different ID groups, and each observation has a number of days to be added to its date.
library("data.table")
data <- data.table(ID = c(1,1,2,2,3,3,3),Date =c("01/Sep/2020","11/Sep/2020","01/Sep/2020","08/Sep/2020","01/Aug/2020","04/Aug/2020","10/Aug/2020"),days_to_be_added = c(10,10,10,8,5,5,30))
data[,Date := as.Date(Date,format = "%d/%h/%Y")]
ID Date days_to_be_added
1: 1 2020-09-01 10
2: 1 2020-09-11 10
3: 2 2020-09-01 10
4: 2 2020-09-08 8
5: 3 2020-08-01 5
6: 3 2020-08-04 5
7: 3 2020-08-10 30
For each ID group, I need to build the date interval of every row by adding days_to_be_added to its Date, and then count the number of days those intervals cover. If any dates overlap within a group, they should be counted only once.
Example for ID 2:
3rd row: **1 Sep 2020** to **10 Sep 2020** is 10 days [as days_to_be_added is 10]
4th row: **8 Sep 2020** to **15 Sep 2020** is 8 days [as days_to_be_added is 8]
But the total number of days for ID 2 should be **15 days**, since 8 Sep to 10 Sep overlaps within the ID group and should be counted only once.
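The arithmetic for ID 2 can be checked with a small base R sketch. The variable names `row3`, `row4`, `overlap` and `total_days` are only illustrative and do not appear elsewhere in this post.

```r
# Days covered by each ID 2 row, using the dates from the example above
row3 <- seq(as.Date("2020-09-01"), by = "day", length.out = 10)  # 1 Sep to 10 Sep
row4 <- seq(as.Date("2020-09-08"), by = "day", length.out = 8)   # 8 Sep to 15 Sep

# Days that appear in both rows (8, 9 and 10 Sep)
overlap <- row3[row3 %in% row4]

# Count each calendar day once: 10 + 8 - 3 = 15
total_days <- length(unique(c(row3, row4)))
print(total_days)
```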
**Expected output:**

```
ID Number_of_days
1 20
2 15
3 38
```
**Note:** Rows where **Date** is NA should be ignored.
Solution
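Here is one approach. Use `seq.Date` to generate one row per day for each `Date` within each `ID`, continuing for `days_to_be_added` consecutive days.
Then `Number_of_days` is the number of unique `day` values for each `ID`, so overlapping days are not counted twice.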
data[,.(day = seq.Date(Date,by = 'day',length.out = days_to_be_added)),by = .(ID,1:nrow(data))
][,.(Number_of_days = uniqueN(day)),by = ID][]
Output
ID Number_of_days
1: 1 20
2: 2 15
3: 3 38
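The question also asks that rows with an NA date be ignored, which the code above does not handle explicitly (`seq.Date` needs a valid start date, so an NA would likely make it fail). The following is a minimal sketch of one way to cover that case, assuming it is acceptable to simply drop such rows first. The extra row with a missing date is hypothetical, and `valid`, `row` and `result` are illustrative names only.

```r
library(data.table)

# Sample data from the question, plus one hypothetical row with a missing Date
data <- data.table(
  ID = c(1, 1, 2, 2, 3, 3, 3, 3),
  Date = as.Date(c("2020-09-01", "2020-09-11", "2020-09-01", "2020-09-08",
                   "2020-08-01", "2020-08-04", "2020-08-10", NA)),
  days_to_be_added = c(10, 10, 10, 8, 5, 5, 30, 7)
)

# Drop rows without a usable start date before expanding
valid <- data[!is.na(Date)]

# Expand every remaining row into one row per covered day,
# then count the distinct days within each ID
result <- valid[, .(day = seq(Date, by = "day", length.out = days_to_be_added)),
                by = .(ID, row = seq_len(nrow(valid)))
              ][, .(Number_of_days = uniqueN(day)), by = ID]
print(result)
```

With this sample the per-ID totals stay at 20, 15 and 38, because the NA row contributes no days. An ID whose dates are all NA would not appear in the result at all.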