问题描述

最近我一直在使用 future（以及 future.apply 和 furrr）在 R 中进行一些并行处理，这在大多数情况下都很棒，但我偶然发现了一些我无法解释。这可能是某个地方的错误，但也可能是我的编码草率。如果有人能解释这种行为，我们将不胜感激。

设置

我正在对数据的不同子组进行模拟。对于每个组，我想运行模拟 n 次，然后计算结果的一些汇总统计信息。下面是一些示例代码，用于重现我的基本设置并演示我看到的问题：

library(tidyverse)
library(future)
library(future.apply)

# Helper functions

#' Calls out to `free` to get total system memory used
sys_used <- function() {
  .f <- system2("free","-b",stdout = TRUE)
  as.numeric(unlist(strsplit(.f[2]," +"))[3])
}

#' Write time,and memory usage to log file in CSV format
#' @param .f the file to write to 
#' @param .id identifier for the row to be written
mem_string <- function(.f,.id) {
  .s <- paste(.id,Sys.time(),sys_used(),Sys.getpid(),sep = ",")
  write_lines(.s,.f,append = TRUE)
}

# Inputs
fake_inputs <- 1:16
nsim <- 100
nrows <- 1e6

log_file <- "future_mem_leak_log.csv"
if (fs::file_exists(log_file)) fs::file_delete(log_file)

test_cases <- list(
  list(
    name = "multisession-sequential",plan = list(multisession,sequential)
  ),list(
    name = "sequential-multisession",plan = list(sequential,multisession)
  )
)

# Test code

for (.t in test_cases) {
  plan(.t$plan)
  
  # loop over subsets of the data 
  final_out <- future_lapply(fake_inputs,function(.i) {
    # loop over simulations
    out <- future_lapply(1:nsim,function(.j) {
      # in real life this would be doing simulations,# but here we just create "results" using rnorm()
      res <- data.frame(
        id = rep(.j,nrows),col1 = rnorm(nrows) * .i,col2 = rnorm(nrows) * .i,col3 = rnorm(nrows) * .i,col4 = rnorm(nrows) * .i,col5 = rnorm(nrows) * .i,col6 = rnorm(nrows) * .i
      )

      # write memory usage to file
      mem_string(log_file,.t$name)

      # in real life I would write res to file to read in later,but here we
      # only return head of df so we kNow the returned value isn't filling up memory
      res %>% slice_head(n = 10) 
    })
  })

  # clean up any leftover objects before testing the next plan
  try(rm(final_out))
  try(rm(out))
  try(rm(res))
}

外循环用于测试两种并行化策略：是并行化数据子集还是 100 次模拟。

一些注意事项

我意识到对模拟进行并行化不是理想的设计，而且将数据分块以向每个内核发送 10-20 个模拟会更有效，但这并不是真正的重点.我只是想了解内存中发生了什么。
我也认为 plan(multicore) 在这里可能会更好（尽管我确定它是否会更好）但我更感兴趣的是弄清楚 plan(multisession) 发生了什么

结果

我在 8-vcpu Linux EC2 上运行它（如果人们需要，我可以提供更多规格）并根据结果创建以下图（在底部绘制代码以实现可重复性）：

首先，plan(list(multisession,sequential)) 更快（正如预期的那样，请参阅上面的警告），但我对内存配置文件感到困惑。 plan(list(multisession,sequential)) 的总系统内存使用量保持不变，这是我所期望的，因为我假设 res 对象每次通过循环都会被覆盖。

然而，plan(list(sequential,multisession)) 的内存使用量随着程序运行而稳步增长。似乎每次通过循环创建 res 对象，然后在某个地方的边缘，占用了记忆。在我的真实示例中，它变得足够大，以至于它填满了我的整个 (32GB) 系统内存，并在大约进行到一半时终止了进程。

情节扭曲：只有在嵌套时才会发生

这是真正让我感到困惑的部分！当我将外部 future_lapply 更改为常规 lapply 并设置 plan(multisession) 时，我看不到它！从我对 this "Future: Topologies" vignette 的阅读来看，这应该与 plan(list(sequential,multisession)) 相同，但该图根本没有显示内存增长（实际上，它几乎与上图中的 plan(list(multisession,sequential)) 相同）

注意其他选项

实际上，我最初是用 furrr::future_map_dfr() 发现的，但为了确定这不是 furrr 中的错误，我用 future.apply::future_lapply() 进行了尝试并显示了结果。我尝试仅用 future::future() 对其进行编码并得到非常不同的结果，但很可能是因为我编码的内容实际上并不等效。如果没有 furrr 或 future.apply 提供的抽象层，我在直接使用期货方面没有太多经验。

再次感谢您对此的任何见解。

绘图代码

library(tidyverse)

logDat <- read_csv("future_mem_leak_log.csv",col_names = c("plan","time","sys_used","pid")) %>%
  group_by(plan) %>% 
  mutate(
    start = min(time),time_elapsed = as.numeric(difftime(time,start,units = "secs"))
  )

ggplot(logDat,aes(x = time_elapsed/60,y = sys_used/1e9,group = plan,colour = plan)) + 
  geom_line() + 
  xlab("Time elapsed (in mins)") + ylab("Memory used (in GB)") +
  ggtitle("Memory Usage\n  list(multisession,sequential) vs list(sequential,multisession)")

解决方法

暂无找到可以解决该程序问题的有效方法，小编努力寻找整理中！

如果你已经找到好的解决方法，欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@）

r r r-future

将来不会使用嵌套计划多会话释放内存一些注意事项注意其他选项

问题描述

设置

一些注意事项

结果

情节扭曲：只有在嵌套时才会发生

注意其他选项

绘图代码

解决方法

相关问答

将来不会使用嵌套计划多会话释放内存 一些注意事项注意其他选项

问题描述

设置

一些注意事项

结果

情节扭曲：只有在嵌套时才会发生

注意其他选项

绘图代码

解决方法

相关问答

将来不会使用嵌套计划多会话释放内存一些注意事项注意其他选项