Problem description
I have a work order database containing materials reserved for different plants. For each row, I need to find the number of times its material was requested in a time window preceding the current observation.
I tried the following approach, but for a 7-million-row data frame it would take 300 months.
library(dplyr)
library(lubridate)

result <- data.frame()
for (i in 1:nrow(work.orders)) {
  wo.date_f <- work.orders$original_basic_start_date[i] %>% as_date()
  days.back <- 30  # Time window for searching
  mat_f <- work.orders$material[i]
  plant_f <- work.orders$plant[i]
  # Count prior requests for the same material and plant within the window
  total_resb_found <- work.orders %>%
    filter(plant == plant_f,
           material == mat_f,
           as_date(original_basic_start_date) < wo.date_f,
           wo.date_f - as_date(original_basic_start_date) < days.back) %>%
    nrow()
  result[i, 1] <- mat_f
  result[i, 2] <- plant_f
  result[i, 3] <- total_resb_found
  print(paste0(i, " out of ", nrow(work.orders)))
}
So the approach is:
- Look up the date, material, and plant of row 1.
- Filter the main data frame to the material and plant found in the previous step.
- Filter that data frame to dates before the date found in step 1 and after said date minus the time window.
- Count the rows found.
- Record the result in a data frame.
I know this is a brute-force approach, so there is plenty of room for improvement, but I have not come up with a better one yet. Any ideas on a more efficient approach?
Thanks
Edit: added sample data with the desired result (as the Result column)
structure(list(material = c("000000000010339762","000000000010339762","000000000010199498","000000000010339762"),original_basic_start_date = c("20201106","20200702","20200618","20200923","20201205","20201118","20201231","20201022","20200819","20200823","20201106","20200826","20201123","20201111","20200912","20200930","20200916","20200717","20200929","20201016","20200624","20201105","20200620","20200626","20200608","20200712","20200616","20201209","20200904","20210130","20201117","20210311","20200812","20201130","20200816","20200802","20200805","20200705","20200731","20200703","20200926","20200718","20200815","20200609","20200729","20210112","20201121","20210110","20201008","20200902","20200918","20200713","20200828","20200722","20210126","20200906","20210106","20201229","20210117","20210321","20210204","20201005","20210109","20210526","20210203","20201001","20201206","20210221","20210103","20200719","20201019","20200707","20210219","20200710","20210331","20201222","20201027","20210116","20200709","20210120","20210208","20210220","20210205","20201223","20210211","20210303","20210428","20200724","20200831","20210207"),plant = structure(c(16L,34L,16L,21L,20L,10L,15L,25L,37L,17L,32L,39L,19L,27L,36L,33L,47L,20L),.Label = c("B201","B21B","B401","B501","B504","B521","B701","B71A","B991","C602","C603","C690","CS01","CY01","CY02","CY04","CY05","CY06","CY07","CY08","CY09","CY11","CY12","CY13","CY16","CY21","CY30","CY91","CY95","D106","D192","FX01","FX03","FX05","FX06","FX07","FX10","FX1A","FX1C","FX1E","FX21","FX5A","FX5B","FX5C","FX92","FX94","KB01","PA02","PA04","PA05","PA12","PB1A","PB51","PI01","PI03","PI08","PI0A","PI0B","PI0F","PN9A","PN9B","PN9D","PN9E","PP9A","PR90","PR92","PT01","PT02","PT07","PT08","S501","S502","S503","S504","S505","S507","S50I","S516","S517","S593","U201"),class = "factor"),Result = c(23L,1L,18L,0L,2L,4L,5L,3L,6L,22L,24L,7L,9L,11L,8L,14L,12L,30L,28L,29L,31L,13L,4L)),.Names = c("material","original_basic_start_date","plant","Result"),row.names = c(NA,-108L),class = "data.frame")
Solution
Here's a stab at it:
library(data.table)
library(lubridate)

plant.pool <- 1:4
material.pool <- 1:5
date.pool <- seq(as.Date('2010/01/01'), as.Date(now()), by = "day")
n.out <- 7e6

## Generate 7 million records
d <- data.table(
  plant = sample(plant.pool, size = n.out, replace = TRUE),
  material = sample(material.pool, size = n.out, replace = TRUE),
  original_basic_start_date = sample(date.pool, size = n.out, replace = TRUE),
  id = 1:n.out
)[order(original_basic_start_date)]

window.size <- 30

## Do the actual rolling calculation and time it, demonstrating the first 100k rows only
system.time(the.counts <- frollapply(
  d[1:1e5, .(id)], window.size, function(x) {
    d[id %in% x, sum(plant == last(plant) & material == last(material))]
  }
))
As you can see, I deliberately process only the first 100k records and time that.
After a few runs on my ageing workstation (8-9 years old, 8 cores), it handles those 100k records in about 17 seconds. That should extrapolate to roughly 3.3 hours, vastly quicker than 300 months. Also, your machine is probably faster than mine, which makes it even more pleasant. Once you have confirmed you get my speed or better, you can time 100k rows as I did, then kick it off and come back to check around the time it should be done.
data.table is generally a good fit for big jobs that need speeding up. There may be better-suited functions out there (you might even want to look at python or c), although I have not found a solution that can look at both the material and plant columns inside its optimized rolling functions (frollsum and friends).
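That said, data.table's non-equi joins can match on both columns at once. Here is a hedged sketch (my own addition, not part of this answer) that reuses d from above; win.lo and win.hi are helper columns I introduce, and the window means "strictly before the current date and less than 30 days back", matching the question's filters:

## Non-equi self-join: for each row, count prior rows with the same plant
## and material whose date falls in [date - 29, date - 1]
d[, `:=`(win.lo = original_basic_start_date - 29,
         win.hi = original_basic_start_date - 1)]
the.counts2 <- d[d,
                 on = .(plant, material,
                        original_basic_start_date >= win.lo,
                        original_basic_start_date <= win.hi),
                 .(n.prior = sum(!is.na(x.id))),  # unmatched rows count as 0
                 by = .EACHI]$n.prior

Unlike the frollapply version, this windows by calendar days rather than by a fixed number of rows.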
Edit:
You added the data after my first stab at it, but I believe the principle of my approach still holds.
Edit 2:
Also, frollapply does not process the first 1:window.size rows; you will have to handle those yourself (now that you have 300 months minus a few hours to spare, that should be a fairly simple matter).
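For what it's worth, a minimal sketch of one way to handle those first rows (my own addition, assuming the same d and window.size as above):

## Count over the partial windows 1:1, 1:2, ..., 1:(window.size - 1)
head.counts <- sapply(seq_len(window.size - 1), function(i) {
  d[1:i, sum(plant == last(plant) & material == last(material))]
})

These fill the positions that frollapply leaves as NA.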
The problem with your current approach, and what drives the runtime up, is these lines of code:
result[i, 1] <- mat_f
result[i, 2] <- plant_f
result[i, 3] <- total_resb_found
Appending rows one at a time to a matrix or data frame in R is very inefficient and incurs a lot of extra work. That work is quick while the number of rows is small, but once the row count passes a certain threshold, the cost multiplies with every new row. So your estimate of 300 months may even be a bit optimistic :)
This is my personal experience from testing various things, although I have not really quantified it or studied it thoroughly. I am fairly sure it relates to memory allocation, similar to the performance difference between preallocating memory for a 100-element array and defining an empty array and appending one item at a time.
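A minimal illustration of that difference (my own sketch, not from the original answer):

n <- 1e4
## Growing: each iteration reallocates and copies the whole vector
system.time({
  slow <- c()
  for (i in 1:n) slow <- c(slow, i)
})
## Preallocating: writes in place, no repeated copying
system.time({
  fast <- numeric(n)
  for (i in 1:n) fast[i] <- i
})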
Updated approach
After some stress testing and further clarification, here is my updated code for better performance.
library(doParallel)
library(dplyr)
library(lubridate)
library(purrr)
# Parallel on 4 cores
registerDoParallel(4)
# Create a df which is 70,000 copies of your sample data (70000 x 108 = 7.5M records);
# df here is the sample data frame from the question
work.orders <- purrr::map_dfr(seq_len(70000), ~ df)
# To test the data more realistically, a random date in a span of one and a
# half years is generated and assigned to original_basic_start_date
random_dates <- as.Date(
  rdunif(nrow(work.orders),
         as.integer(as.Date("2020-01-01")),
         as.integer(as.Date("2021-06-01"))),
  origin = "1970-01-01"
)
work.orders$original_basic_start_date <- random_dates
work.orders <- work.orders %>% arrange(original_basic_start_date)
# As the same material & plant on the same date will return the same result,
# we only need to calculate it once per combination. Here the original data
# is filtered so that it contains no duplicated date records for a pair of
# material & plant.
work.orders_unique_date <- work.orders %>%
  group_by(material, plant) %>%
  filter(!duplicated(original_basic_start_date)) %>%
  ungroup() %>%
  arrange(original_basic_start_date)
# Get all the dates present in the data
all_date_in_data <- unique(work.orders_unique_date[["original_basic_start_date"]])
# Time window for searching
days.back <- 30
# Time the performance
system.time(
  # For each date, filter the original data frame down to the dates that fall
  # within the defined window, which allows faster calculation
  all_result <- bind_rows(foreach(i_date = all_date_in_data) %do% {
    message(sprintf("%s Process date: %s", Sys.time(), i_date))
    smaller_df <- work.orders %>%
      filter(original_basic_start_date < i_date) %>%
      filter(i_date - original_basic_start_date < days.back)
    unique_item_on_date <- work.orders_unique_date %>%
      filter(original_basic_start_date == i_date)
    # For one date, calculate the result for every pair of material & plant
    # appearing on that date
    result <- bind_rows(foreach(i = 1:nrow(unique_item_on_date)) %dopar% {
      mat_f <- unique_item_on_date$material[i]
      plant_f <- unique_item_on_date$plant[i]
      # Here using the smaller df which is already filtered by the date window
      total_resb_found <- smaller_df %>%
        # Separating conditions into multiple filters should speed it up a bit
        filter(plant == plant_f, material == mat_f) %>%
        nrow()
      tibble(date = i_date, mat_f, plant_f, total_resb_found)
    })
    result
  })
)
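Since all_result is keyed by date, material, and plant rather than by row, one way to map the counts back onto the original rows (my addition, not part of the answer) is a left join:

# Attach each row's count by matching on date, material, and plant
work.orders_with_counts <- work.orders %>%
  left_join(all_result,
            by = c("original_basic_start_date" = "date",
                   "material" = "mat_f",
                   "plant" = "plant_f"))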
Some of the message output from the process above. Each date takes at most about 2 seconds, so if your data has 7 million records spanning 2 years, the whole run should take roughly an hour or less.
2021-03-12 08:58:41 Done Process date: 2020-08-15 in 1.83 seconds
2021-03-12 08:58:42 Done Process date: 2020-08-16 in 1.66 seconds
2021-03-12 08:58:44 Done Process date: 2020-08-17 in 1.93 seconds
2021-03-12 08:58:46 Done Process date: 2020-08-18 in 1.72 seconds
2021-03-12 08:58:48 Done Process date: 2020-08-19 in 1.77 seconds
2021-03-12 08:58:50 Done Process date: 2020-08-20 in 1.74 seconds
2021-03-12 08:58:51 Done Process date: 2020-08-21 in 1.78 seconds
2021-03-12 08:58:53 Done Process date: 2020-08-22 in 1.78 seconds
2021-03-12 08:58:55 Done Process date: 2020-08-23 in 1.73 seconds
2021-03-12 08:58:57 Done Process date: 2020-08-24 in 1.94 seconds
2021-03-12 08:58:59 Done Process date: 2020-08-25 in 1.78 seconds
2021-03-12 08:59:00 Done Process date: 2020-08-26 in 1.78 seconds
2021-03-12 08:59:02 Done Process date: 2020-08-27 in 1.81 seconds
2021-03-12 08:59:04 Done Process date: 2020-08-28 in 1.84 seconds
2021-03-12 08:59:06 Done Process date: 2020-08-29 in 1.83 seconds
2021-03-12 08:59:08 Done Process date: 2020-08-30 in 1.76 seconds
2021-03-12 08:59:10 Done Process date: 2020-08-31 in 1.94 seconds
2021-03-12 08:59:11 Done Process date: 2020-09-01 in 1.78 seconds
2021-03-12 08:59:13 Done Process date: 2020-09-02 in 1.86 seconds
2021-03-12 08:59:15 Done Process date: 2020-09-03 in 1.82 seconds
2021-03-12 08:59:17 Done Process date: 2020-09-04 in 1.80 seconds
2021-03-12 08:59:19 Done Process date: 2020-09-05 in 1.88 seconds
2021-03-12 08:59:20 Done Process date: 2020-09-06 in 1.78 seconds
2021-03-12 08:59:23 Done Process date: 2020-09-07 in 2.08 seconds
2021-03-12 08:59:24 Done Process date: 2020-09-08 in 1.89 seconds
2021-03-12 08:59:26 Done Process date: 2020-09-09 in 1.88 seconds
2021-03-12 08:59:28 Done Process date: 2020-09-10 in 1.86 seconds
2021-03-12 08:59:30 Done Process date: 2020-09-11 in 1.87 seconds
The timing for the first 100 days was 60 seconds:
user system elapsed
93.281 73.571 60.310
Here is a sample of the result:
> tail(all_result,20)
# A tibble: 20 x 4
date mat_f plant_f total_resb_found
<date> <chr> <fct> <int>
1 2020-04-08 000000000010199498 C602 13318
2 2020-04-09 000000000010339762 FX01 66596
3 2020-04-09 000000000010339762 CY04 441597
4 2020-04-09 000000000010199498 CY16 160625
5 2020-04-09 000000000010199498 FX03 13350
6 2020-04-09 000000000010199498 FX10 13418
7 2020-04-09 000000000010339762 CY07 120541
8 2020-04-09 000000000010339762 CY30 80768
9 2020-04-09 000000000010339762 FX10 120076
10 2020-04-09 000000000010339762 FX03 13498
11 2020-04-09 000000000010199498 C602 13448
12 2020-04-09 000000000010339762 FX1C 53672
13 2020-04-09 000000000010339762 CY08 80234
14 2020-04-09 000000000010339762 CY05 120682
15 2020-04-09 000000000010339762 CY09 40493
16 2020-04-09 000000000010339762 FX07 13325
17 2020-04-09 000000000010339762 CY02 40204
18 2020-04-09 000000000010339762 FX05 26671
19 2020-04-09 000000000010339762 C602 13576
20 2020-04-09 000000000010339762 KB01 13331
[Update: added the timing for processing all 7.5 million records]
user system elapsed
1665.785 1096.865 891.837