How to count a "batch" every 15 minutes

Problem description

1 2021-01-01 12:59:38
2 2021-01-01 14:08:59
3 2021-01-01 14:09:08
4 2021-01-01 14:11:30
5 2021-01-01 14:22:19
6 2021-01-01 14:41:07

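I would like to count the number of entries in each 15-minute window, but on a rolling basis. For example, 12:59 is alone in its 15 minutes, so it returns 1. 14:08 through 14:22 all fall within the same 15 minutes, so that batch returns 4. Finally, 14:41 is on its own in another 15-minute batch.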

I hope this makes sense, and thanks in advance.

Sorry for not including this earlier:

> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707,1609462979.593,1609465088.437,1609476270.88,1609478479.177,1609479667.373,1609493081.887,1609499187.29,1609507506.37,1609510989.533,1609511522.023,1609511894.067,1609512194.773,1609512377.227,1609514474.153),tzone = "UTC",class = c("POSIXct","POSIXt"
)),batch_no = c(1L,2L,3L,4L,5L,6L,7L,8L,9L,10L,11L,12L,13L),batch_size = c(1L,1L,1L)),row.names = c(NA,-15L),class = c("tbl_df","tbl","data.frame"))

New edit: thank you for your work. I am getting an error:

Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "c('integer','numeric')"

This seems odd, because my variable has this class:

> class(df$ClickedDate)
[1] "POSIXct" "POSIXt"

Does this work with mutate, or do I need to convert it?

> dput(df)
structure(list(ClickedDate = structure(c(1609460198.707,"data.frame"))

Thanks in advance.

Solution

The runner package can help here. Use the following strategy:

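library(tidyverse)
library(runner)

df %>%
  mutate(
    # number of rows in the forward-looking window anchored at each row
    b_len = runner::runner(x = ClickedDate, idx = ClickedDate, k = "15 mins", lag = "-14 mins", f = length),
    # index of the last row of the batch that each row belongs to
    b_no = purrr::accumulate(seq_len(length(b_len) - 1), .init = b_len[1], ~ ifelse(.x > .y, .x, .x + b_len[.x + 1])),
    # renumber the batches consecutively from 1
    b_no = as.integer(as.factor(b_no))
  ) %>%
  group_by(b_no) %>%
  mutate(b_len = n())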

# A tibble: 15 x 3
# Groups:   b_no [12]
   ClickedDate         b_len  b_no
   <dttm>              <int> <int>
 1 2021-01-01 00:16:38     1     1
 2 2021-01-01 01:02:59     1     2
 3 2021-01-01 01:38:08     1     3
 4 2021-01-01 04:44:30     1     4
 5 2021-01-01 05:21:19     1     5
 6 2021-01-01 05:41:07     1     6
 7 2021-01-01 09:24:41     1     7
 8 2021-01-01 11:06:27     1     8
 9 2021-01-01 13:25:06     1     9
10 2021-01-01 14:23:09     2    10
11 2021-01-01 14:32:02     2    10
12 2021-01-01 14:38:14     3    11
13 2021-01-01 14:43:14     3    11
14 2021-01-01 14:46:17     3    11
15 2021-01-01 15:21:14     1    12

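Notes:

The lag argument of runner: runner windows look backward (rolling), so I used a negative lag to get a forward-looking time window.
The k argument of runner sets the length of the rolling window.
The b_no column initially identifies the sliding/rolling window that is in use; once the earliest window is used up, the next window is taken.
dense_rank can also be used for the final renumbering (see the alternative below).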
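To make the b_no step less opaque, here is a small sketch of the accumulate logic on its own. The b_len values are hand-computed window counts for the six sample rows in the question (an assumption for illustration, not output from runner), and batch_end is a name introduced only for this example.

```r
library(purrr)

# Hand-computed forward-window counts for the six sample rows in the question:
# row 2 sees rows 2-5 within 15 minutes, row 3 sees rows 3-5, and so on.
b_len <- c(1L, 4L, 3L, 2L, 1L, 1L)

# .x is the index of the last row of the current batch, .y the row just processed.
# While .y is still before the batch end, keep .x; otherwise the next batch
# starts at row .x + 1 and ends at row .x + b_len[.x + 1].
batch_end <- accumulate(
  seq_len(length(b_len) - 1),
  ~ ifelse(.x > .y, .x, .x + b_len[.x + 1]),
  .init = b_len[1]
)
batch_end
# 1 5 5 5 5 6

# Renumber the batch ends consecutively to get batch numbers
b_no <- as.integer(as.factor(batch_end))
b_no
# 1 2 2 2 2 3
```

Every row in a batch carries the index of that batch's last row, which is why a plain renumbering is enough to turn the result into batch numbers.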

Alternative

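df %>%
  mutate(
    b_len = runner::runner(x = ClickedDate, idx = ClickedDate, k = "15 mins", lag = "-14 mins", f = length),
    b_no = purrr::accumulate(seq_len(length(b_len) - 1), .init = b_len[1], ~ ifelse(.x > .y, .x, .x + b_len[.x + 1])),
    # dense_rank renumbers the batches consecutively from 1
    b_no = dense_rank(b_no)
  ) %>%
  group_by(b_no) %>%
  mutate(b_len = n()) %>%
  ungroup()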
# A tibble: 15 x 3
   ClickedDate         b_len  b_no
   <dttm>              <int> <int>
 1 2021-01-01 00:16:38     1     1
 2 2021-01-01 01:02:59     1     2
 3 2021-01-01 01:38:08     1     3
 4 2021-01-01 04:44:30     1     4
 5 2021-01-01 05:21:19     1     5
 6 2021-01-01 05:41:07     1     6
 7 2021-01-01 09:24:41     1     7
 8 2021-01-01 11:06:27     1     8
 9 2021-01-01 13:25:06     1     9
10 2021-01-01 14:23:09     2    10
11 2021-01-01 14:32:02     2    10
12 2021-01-01 14:38:14     3    11
13 2021-01-01 14:43:14     3    11
14 2021-01-01 14:46:17     3    11
15 2021-01-01 15:21:14     1    12
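
As a cross-check that does not depend on runner, the same batching rule can be sketched as a plain loop in base R. This is only an illustration under stated assumptions: assign_batches is a hypothetical helper (not from any package), the timestamps are assumed to be sorted in ascending order, and a row is assumed to belong to a batch when it falls strictly less than 15 minutes after the first row of that batch.

```r
# Greedy batching: a batch starts at the first unassigned timestamp and
# takes every later timestamp less than `width_secs` seconds after it.
# Assumes `times` is already sorted in ascending order.
assign_batches <- function(times, width_secs = 15 * 60) {
  batch_no <- integer(length(times))
  current <- 0L
  batch_start <- NULL
  for (i in seq_along(times)) {
    # Open a new batch when none exists yet or the current window has elapsed
    if (is.null(batch_start) ||
        as.numeric(difftime(times[i], batch_start, units = "secs")) >= width_secs) {
      current <- current + 1L
      batch_start <- times[i]
    }
    batch_no[i] <- current
  }
  batch_no
}

# The six sample timestamps from the question
sample_times <- as.POSIXct(
  c("2021-01-01 12:59:38", "2021-01-01 14:08:59", "2021-01-01 14:09:08",
    "2021-01-01 14:11:30", "2021-01-01 14:22:19", "2021-01-01 14:41:07"),
  tz = "UTC"
)

batch_no <- assign_batches(sample_times)
# Size of the batch each row belongs to
batch_size <- as.integer(ave(batch_no, batch_no, FUN = length))
data.frame(ClickedDate = sample_times, batch_no, batch_size)
```

On the question's sample this gives batches 1, 2, 2, 2, 2, 3 with sizes 1, 4 and 1, which matches the result the question asks for.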

Data used

> df
# A tibble: 15 x 1
   ClickedDate        
   <dttm>             
 1 2021-01-01 00:16:38
 2 2021-01-01 01:02:59
 3 2021-01-01 01:38:08
 4 2021-01-01 04:44:30
 5 2021-01-01 05:21:19
 6 2021-01-01 05:41:07
 7 2021-01-01 09:24:41
 8 2021-01-01 11:06:27
 9 2021-01-01 13:25:06
10 2021-01-01 14:23:09
11 2021-01-01 14:32:02
12 2021-01-01 14:38:14
13 2021-01-01 14:43:14
14 2021-01-01 14:46:17
15 2021-01-01 15:21:14
