Create a run-length ID while allowing gaps of a certain length within runs

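Problem description

(I originally posted a question here, but it did not fully cover my problem.)

I have a data frame with a "date" column and precipitation (rainfall):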

  date precip
1    1    0.0
2    2    0.0
3    3   12.4
4    4   10.2
5    5    0.0
6    6   13.6

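I want to create an "event" column with a counter (ID) for each consecutive period of rainfall. A rainfall event can be defined as a consecutive run of days with precipitation greater than, e.g., 0.

If we do not allow any short gaps of zero rain, "event" would look like this, with a counter for the non-zero periods and NA for the periods without rain: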

  date precip event
1    1    0.0    NA
2    2    0.0    NA
3    3   12.4     1
4    4   10.2     1
5    5    0.0    NA
6    6   13.6     2
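
For reference, this gap-free case can be sketched in base R with cumsum(). The helper vectors `wet` and `starts` are names introduced here for illustration only, and the threshold of 0 is the one assumed in the example above:

```r
# Toy data from the example above
d <- data.frame(date = 1:6, precip = c(0, 0, 12.4, 10.2, 0, 13.6))

# TRUE on days with rain
wet <- d$precip > 0

# an event starts on a wet day that is the first row or follows a dry day
starts <- wet & !c(FALSE, head(wet, -1))

# running count of event starts on wet days, NA on dry days
d$event <- ifelse(wet, cumsum(starts), NA_integer_)
d$event
# NA NA 1 1 NA 2
```

This only covers the case without gaps; it does not yet merge runs separated by short dry spells.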

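In addition, I would like to be able to allow short periods without rain, e.g. of size n = 1 day, within each run of non-zero values.

For example, in the data frame above, if we allow 1 day of 0 rain within a consecutive period of rain, e.g. day 5, then days 3 to 6 can be defined as one rainfall event: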

  date precip event
1    1    0.0    NA
2    2    0.0    NA
3    3   12.4     1
4    4   10.2     1
5    5    0.0     1 # <- gap of 1 day with no rain: OK
6    6   13.6     1

A slightly larger toy data set:

structure(list(date = 1:31,precip = c(0,0,12.3999996185303,10.1999998092651,0,13.6000003814697,16.6000003814697,21.5,7.59999990463257,0,0,0,0.699999988079071,0,0,0,5.40000009536743,0,1,35.4000015258789,11.5,16.7000007629395,13.5,13.1000003814697,11.8000001907349,1.70000004768372,0,15.1000003814697,12.8999996185303,3.70000004768372,24.2999992370605)),row.names = c(NA,-31L),class = "data.frame")

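Now I am really stuck. I tried some weird things like the one below (just a beginning), but I don't think I will figure it out on my own. Any help would be greatly appreciated.

# this is far from being helpful, but just to show the direction I was heading...
# the threshold could be 0 to mirror the example above...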

rainfall_event = function(df,daily_thresh = .2,n = 1) {
  for (i in 1:nrow(df)) {
    zero_index = 1
    
    if (df[i,]$precip < daily_thresh) {
      # every time you encounter a value below the threshold count the 0s
      zero_counter = 0
      
      while (df[i,]$precip < daily_thresh) {

        zero_counter = zero_counter + 1
        
        if (i != nrow(df)) {
          i = i + 1
          zero_index = zero_index + 1
        } else{
          break
        }
      }
      
      if (zero_counter > n) {
        df[zero_index:zero_index + zero_counter,][["event"]] = NA
      }
      
    } else{
      event_counter = 1
      
      while (df[i,]$precip > daily_thresh) {

        df[["event"]] = event_counter
        if (i != nrow(rainfall_one_slide)) {
          i = i + 1
        } else{
          break
        }
      }
      
    }
  }
  
}

Solutions

An rle alternative:

# limit of n days with precip = 0 to be allowed in runs of non-zero
n = 1

# rle of precip == 0
r = rle(d$precip == 0)

# replace the values of precip = 0 & length > limit with NA
r$values[r$values & r$lengths > n] = NA

# reconstruct the vector from the updated runs
ir = inverse.rle(r)

# rle of "is NA"
r2 = rle(is.na(ir))

# replace length of NA runs with 0
r2$lengths[r2$values] = 0

# replace values of non-NA runs with a sequence
r2$values[!r2$values] = seq_along(r2$values[!r2$values])

# create event column
d[!is.na(ir),"event"] = inverse.rle(r2)

   date precip event
1     1    0.0    NA
2     2    0.0    NA
3     3   12.4     1
4     4   10.2     1
5     5    0.0     1
6     6   13.6     1
7     7   16.6     1
8     8   21.5     1
9     9    7.6     1
10   10    0.0    NA
11   11    0.0    NA
12   12    0.0    NA
13   13    0.7     2
14   14    0.0    NA
15   15    0.0    NA
16   16    0.0    NA
17   17    5.4     3
18   18    0.0     3
19   19    1.0     3
20   20   35.4     3
21   21   11.5     3
22   22   16.7     3
23   23   13.5     3
24   24   13.1     3
25   25   11.8     3
26   26    1.7     3
27   27    0.0     3
28   28   15.1     3
29   29   12.9     3
30   30    3.7     3
31   31   24.3     3
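
The steps above can be wrapped in a small function so that the gap limit and the rain threshold become arguments. This is only a sketch under stated assumptions: the name `rle_events` and the `thresh` argument are introduced here for illustration and are not part of the answer.

```r
# Hypothetical wrapper around the rle steps shown above; `rle_events` and
# `thresh` are illustrative names, not part of the original answer.
rle_events <- function(precip, n = 1, thresh = 0) {
  # runs of dry days (precip at or below the threshold)
  r <- rle(precip <= thresh)
  # dry runs longer than n separate events: mark them NA
  r$values[r$values & r$lengths > n] <- NA
  ir <- inverse.rle(r)
  # every run of non-NA days is one event; number the runs consecutively
  r2 <- rle(is.na(ir))
  r2$lengths[r2$values] <- 0
  r2$values[!r2$values] <- seq_along(r2$values[!r2$values])
  # NA for days inside long dry runs, event ID otherwise
  event <- rep(NA_integer_, length(precip))
  event[!is.na(ir)] <- inverse.rle(r2)
  event
}

d6 <- data.frame(date = 1:6, precip = c(0, 0, 12.4, 10.2, 0, 13.6))
rle_events(d6$precip, n = 1)
# NA NA 1 1 1 1
rle_events(d6$precip, n = 0)
# NA NA 1 1 NA 2
```

Note that, like the steps it wraps, this sketch also attaches a dry run of at most n days at the very start or end of the series to the neighbouring event.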

Using rleid from data.table:

library(data.table)

f1 <- function(dat,n) {

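 # number the runs of wet/dry days, give each wet run an ID, and fill that ID across the dry rows
 tmp <- as.data.table(dat)[,grp := rleid(precip != 0)
     ][precip != 0,event := .GRP,grp
     ][,event_fill := nafill(nafill(event,'locf'),'nocb')]
 # dry runs of at most n days take the filled ID; then drop the helper columns
 tmp[,event := fifelse(.N <= n & precip == 0,fcoalesce(event,event_fill),event),grp
     ][,c("grp","event_fill") := NULL][]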

 }

Testing:

f1(df1,0)
     date precip event
 1:    1    0.0    NA
 2:    2    0.0    NA
 3:    3   12.4     1
 4:    4   10.2     1
 5:    5    0.0    NA
 6:    6   13.6     2
 7:    7   16.6     2
 8:    8   21.5     2
 9:    9    7.6     2
10:   10    0.0    NA
11:   11    0.0    NA
12:   12    0.0    NA
13:   13    0.7     3
14:   14    0.0    NA
15:   15    0.0    NA
16:   16    0.0    NA
17:   17    5.4     4
18:   18    0.0    NA
19:   19    1.0     5
20:   20   35.4     5
21:   21   11.5     5
22:   22   16.7     5
23:   23   13.5     5
24:   24   13.1     5
25:   25   11.8     5
26:   26    1.7     5
27:   27    0.0    NA
28:   28   15.1     6
29:   29   12.9     6
30:   30    3.7     6
31:   31   24.3     6

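With n = 1 (the gap day takes the ID of the preceding run, while the run after the gap still gets a new ID):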

f1(df1,1)
    date precip event
 1:    1    0.0    NA
 2:    2    0.0    NA
 3:    3   12.4     1
 4:    4   10.2     1
 5:    5    0.0     1
 6:    6   13.6     2
 7:    7   16.6     2
 8:    8   21.5     2
 9:    9    7.6     2
10:   10    0.0    NA
11:   11    0.0    NA
12:   12    0.0    NA
13:   13    0.7     3
14:   14    0.0    NA
15:   15    0.0    NA
16:   16    0.0    NA
17:   17    5.4     4
18:   18    0.0     4
19:   19    1.0     5
20:   20   35.4     5
21:   21   11.5     5
22:   22   16.7     5
23:   23   13.5     5
24:   24   13.1     5
25:   25   11.8     5
26:   26    1.7     5
27:   27    0.0     5
28:   28   15.1     6
29:   29   12.9     6
30:   30    3.7     6
31:   31   24.3     6
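
In the output above, the runs on either side of a filled gap keep different IDs. If a single ID per merged event is wanted, as in the question, one possible data.table variant is sketched below. The function name `f1_merged` and its helper columns (`wet`, `grp`, `run_len`, `long_gap`, `start`) are assumptions made for this illustration and are not part of the answer.

```r
library(data.table)

# Hypothetical variant (not the answer's f1): wet runs separated by dry gaps
# of at most n days share one event ID.
f1_merged <- function(dat, n) {
  tmp <- as.data.table(dat)
  tmp[, wet := precip != 0]
  # length of each run of consecutive wet or dry days
  tmp[, grp := rleid(wet)]
  tmp[, run_len := .N, by = grp]
  # dry days that belong to a gap longer than n separate events
  tmp[, long_gap := !wet & run_len > n]
  # an event starts on the first row after a long gap (or on the first row)
  tmp[, start := !long_gap & shift(long_gap, fill = TRUE)]
  tmp[, event := fifelse(long_gap, NA_integer_, cumsum(start))]
  tmp[, c("wet", "grp", "run_len", "long_gap", "start") := NULL][]
}

d6 <- data.frame(date = 1:6, precip = c(0, 0, 12.4, 10.2, 0, 13.6))
f1_merged(d6, 1)$event
# NA NA 1 1 1 1
f1_merged(d6, 0)$event
# NA NA 1 1 NA 2
```

As with the rle approach, a dry run of at most n days at the very start or end of the series is attached to the neighbouring event in this sketch.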

So, it may never be of interest to anyone, but I think I have a solution too :)

f2 = function(d,n = 1,daily_thresh = .2) {

  # start in the first row
  i = 1

  # no rainfall event seen yet; the first event becomes 1
  event_counter = 0
  
  # initialise the event column to NA so that dry days which never join an event stay NA
  d[["event"]] = NA_real_

  # while still in the dataframe
  while (i <= nrow(d)) {

    # get the current precip value
    precip = d[i,]$precip

    # if its below the threshold --> DRY period starts
    if (precip < daily_thresh) {

      # count unknown number of following dry days of this dry episode
      dry_days = 0

      ### DRY LOOP
      # start from the day with rainfall under the threshold
      for (j in i:nrow(d)) {

        # count the consecutive dry days
        if (d[j,]$precip < daily_thresh) {
          dry_days = dry_days + 1


        } else{

          # hit a rainy day --> leave the dry loop, just decide to which event it belongs
          # if the number of preceding dry days is at most n --> same as last event

          if (dry_days <= n) {

            # set all the days without rainfall but within n to rainfall
            # if its the first event put it to 1
            if(event_counter == 0) event_counter = 1
            d[(j-1):(j-dry_days),][["event"]] = event_counter
            # set the rainy day to the same event
            d[j,][["event"]] = event_counter
            break # get back to wet period

          } else{

            # if the gap was too big --> its a new event
            # set all the preceding dry days to NA (no event)
            d[(j-1):(j-dry_days),][["event"]] = NA
            # set the rainy day to a new rainfall event
            event_counter = event_counter + 1
            d[j,][["event"]] = event_counter
            break # get back to wet period
          }
        }
      }

      # set i to where we stopped in the dry loop
      i = j + 1

    } else{

      # if we initially hit a rainy day, just count on
      # (a series that starts with rain opens event 1)
      if (event_counter == 0) event_counter = 1
      d[i,][["event"]] = event_counter
      i = i + 1

    }
  }
  return(d)
}