How can I speed up a while loop in R, possibly with dopar?

Problem description

I am trying to process a huge text file with tens of millions of lines. The file contains the results of a convolutional-network analysis of millions of images and looks like this:

 CUDNN_HALF=1 
net.optimized_memory = 0 
mini_batch = 1,batch = 8,time_steps = 1,train = 0 
nms_kind: greedynms (1),beta = 0.600000 
nms_kind: greedynms (1),beta = 0.600000 

 seen 64,trained: 447 K-images (6 Kilo-batches_64) 
Enter Image Path: data/obj1/H001683-19-1-5-OCT2 [x=13390,y=52118,w=256,h=256].png: Predicted in 19.894000 milli-seconds.
tumor: 99%  (left_x:    2   top_y:  160   width:   67   height:   34)
bcell: 98%  (left_x:    6   top_y:   54   width:   32   height:   22)
bcell: 80%  (left_x:   51   top_y:    0   width:   30   height:   16)
bcell: 98%  (left_x:   52   top_y:  198   width:   28   height:   26)
bcell: 98%  (left_x:  150   top_y:  216   width:   35   height:   23)
bcell: 56%  (left_x:  150   top_y:   78   width:   45   height:   30)
bcell: 91%  (left_x:  187   top_y:  132   width:   31   height:   26)
bcell: 96%  (left_x:  219   top_y:  185   width:   20   height:   26)
bcell: 37%  (left_x:  222   top_y:   -0   width:   24   height:    4)
bcell: 98%  (left_x:  241   top_y:  208   width:   15   height:   21)
bcell: 64%  (left_x:  248   top_y:   35   width:    8   height:   35)
 [... a lot of similar lines...] 
Enter Image Path: data/obj1/H001683-19-1-5-OCT2 [x=13390,y=52530,h=256].png: Predicted in 19.195000 milli-seconds.
bcell: 97%  (left_x:   45   top_y:  180   width:   29   height:   24)
bcell: 58%  (left_x:   59   top_y:    1   width:   35   height:   22)
tumor: 98%  (left_x:  105   top_y:  143   width:   99   height:   44)
tumor: 97%  (left_x:  113   top_y:   50   width:   57   height:   40)
bcell: 96%  (left_x:  191   top_y:  194   width:   29   height:   27)
bcell: 99%  (left_x:  201   top_y:  129   width:   34   height:   22)
Enter Image Path: 

Each image is referenced by its file name after "Enter Image Path:", followed by the list of recognized objects. I do not know a priori how many objects each image contains (here one tumor and several bcells); sometimes there are none at all, sometimes hundreds. I first tried to read the whole file:
test11 <- readLines("result.txt")
picsna <- grep(test11, pattern = "Enter Image") # line numbers containing an image file name
lle <- length(picsna)                           # number of images, used by the subsequent script

and then continued with my script. But it turned out that reading the file took several hours, so I came up with the idea of reading the file line by line in a while loop:

require(LaF)
n <- 1
lle <- 0       # number of images (to be used in subsequent code)
picsna <- c()  # vector with the line numbers of the image entries

# read the result file initially (the first bunch of lines
# does not contain image entries)
test11 <- get_lines(file = "result.txt", line_numbers = n)
# as long as the line exists, read the next line and do the following:
while (!is.na(test11)) {
  test11 <- get_lines(file = "result.txt", line_numbers = n + 1)
  # I wanted to know how far the reading had progressed, but had a feeling
  # that print() slowed down the loop:
  # print(n)
  # I found this solution for printing the progress periodically:
  if (n %% 10000 == 0) {
    cat(paste0("iteration: ", n, "\n"))
  }
  # look for an image entry and save its line number (not the iteration number)
  if (grepl(test11, pattern = "Enter Image")) {
    picsna <- c(picsna, n + 1)
    lle <- lle + 1  # increase the number of images
  }
  n <- n + 1
}
# The last line of the file is always incomplete, but it has to be added to
# the vector so that the number of objects of the previous image (calculated
# in a following script, not shown here) comes out right.
if (is.na(test11)) {
  picsna <- c(picsna, n)
  print("The End")
  lle <- lle + 1
}

I measured the runtime of both scripts on a small result file of about 200 lines. The second script was even a bit slower (0.04 s vs. 0.01 s), which puzzled me. I thought about rewriting it as a foreach %dopar% loop, but I do not know how to combine that with readLines or with my while loop, since I do not know in advance how many lines the file contains. I would be grateful if anyone could help me parallelize my script!
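
For reference, the line count does not have to be known in advance: readLines can read a file in fixed-size chunks, each chunk can be searched with the vectorized grep, and the chunks can then be distributed over workers with foreach/%dopar%. The sketch below is a minimal illustration of that idea, not code from the question; the doParallel backend, the core count of 4, and the chunk size of one million lines are all assumptions of mine:

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)  # assumed core count, adjust to your machine

# read the file in chunks of an assumed one million lines;
# readLines() returns fewer (or zero) lines at the end of the file
chunk_size <- 1e6
con <- file("result.txt", open = "r")
chunks <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunks[[length(chunks) + 1]] <- lines
}
close(con)

# search the chunks in parallel; each worker returns the match positions
# within its chunk, shifted by the number of lines preceding that chunk
offsets <- cumsum(c(0, head(lengths(chunks), -1)))
picsna <- foreach(i = seq_along(chunks), .combine = c) %dopar% {
  grep("Enter Image", chunks[[i]], fixed = TRUE) + offsets[i]
}
lle <- length(picsna)

Whether this actually pays off depends on where the time goes: if reading the file dominates, the sequential readLines stays the bottleneck and the parallel search gains little.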

Solution

Thanks @Bas! I tested your suggestion on a Linux machine: for a file with roughly 239 million lines it took less than a minute. By appending > lines.txt I could save the result. Interestingly, my first readLines R script took "only" 29 minutes, which is surprisingly fast compared with my first experience (so my Windows machine probably had some problem unrelated to R at the time).
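
@Bas's actual command is not quoted in this post. Assuming it was a grep-style search along the lines of grep -n "Enter Image" result.txt (an assumption on my part; grep -n prefixes each matching line with its line number), the numbers saved in lines.txt can be pulled back into R like this:

# Assumption: lines.txt was produced by something like
#   grep -n "Enter Image" result.txt > lines.txt
# so every line looks like "1234:Enter Image Path: ..."
matches <- readLines("lines.txt")
picsna  <- as.integer(sub(":.*$", "", matches))  # keep only the leading line number
lle     <- length(picsna)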