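Can't use read.table inside a foreach loop with doSMP

Problem description

I'm trying to parallelize some R code using doSMP/foreach. I have a huge 2D matrix of genetic data: 10,000 observations (rows) and 3 million variables (columns). Because of memory issues I had to split the data into chunks of 1,000 variables. I want to read each file, compute some statistics, and write the results out to a file. That is easy with a for loop, but I want to use foreach to speed things up. Here is what I'm doing: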

# load doSMP,foreach,iterators,codetools
require(doSMP)

# files I'm processing
print(filelist <- system("ls matrix1k.*.txt", intern = TRUE))

#initialize processes
w <- startWorkers(2)
registerDoSMP(w)

# for each file,read into memory,do some stuff,write out results.
foreach (i =  1:length(filelist)) %dopar% {
    print(i)
    file <- filelist[i]
    print(file)
    thisfile <- read.table(file, header = TRUE)
    # here I'll do stuff using that file
    # here i\'ll write out results of the stuff I do above
}

#stop processes
stopWorkers(w)

But this results in an error: Error in { : task 2 failed - "cannot open the connection". When I change %dopar% to %do%, there is no problem at all.

Solution

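I don't think parallel input will speed anything up. The limiting factor is the disk controller, so opening two connections and reading from both does not help: the data still has to pass through that one controller. Unless you have a RAID array with multiple disk controllers, disk I/O is a serial job (too bad). Parallel I/O only works well on a cluster where each machine has its own disk.

Inside the foreach loop, you have to name the packages you want to use. Example (i):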

foreach (i = 1:length(filelist), .packages = "rgdal") %dopar% ......

In your case, you should pass a vector of packages. Example (ii):

package.vector <- c("package.1", "package.2")  # etc.

foreach (i =  1:length(filelist),.packages = package.vector) %dopar% ......

I recommend listing every package you are using.
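The fragments above leave out the loop body, so here is a minimal end-to-end sketch of the same idea. It rests on several assumptions: doParallel stands in for doSMP as the backend, the two demo files and the per-file summary are made up, and "stats" is only a placeholder for whatever packages the real loop body needs. It also passes absolute file paths to the workers, which may help if a worker's working directory differs from the master's; nothing in the question confirms that this is the cause of the error.

```r
# Sketch only: doParallel is swapped in for doSMP (an assumption).
library(foreach)
library(doParallel)

# Create two small demo files in a temporary directory (hypothetical data).
demo_dir <- tempfile("matrix1k_")
dir.create(demo_dir)
for (k in 1:2) {
  write.table(data.frame(a = 1:3 * k, b = 4:6 * k),
              file = file.path(demo_dir, sprintf("matrix1k.%d.txt", k)),
              row.names = FALSE, quote = FALSE)
}

# Absolute paths, so a worker's working directory does not matter.
filelist <- list.files(demo_dir, pattern = "^matrix1k\\..*\\.txt$",
                       full.names = TRUE)

# Start two workers and register them as the %dopar% backend.
cl <- makeCluster(2)
registerDoParallel(cl)

# .packages loads the listed packages on every worker before the body runs.
# "stats" is a placeholder; list the packages your own body uses.
results <- foreach(f = filelist, .packages = "stats",
                   .combine = rbind) %dopar% {
  thisfile <- read.table(f, header = TRUE)
  # Stand-in for the real statistics: row count and mean of column a.
  data.frame(file = basename(f),
             rows = nrow(thisfile),
             mean_a = mean(thisfile$a))
}

stopCluster(cl)
print(results)
```

Iterating over the file paths directly, rather than over an index, keeps the body free of references to objects that exist only on the master. Whether this runs any faster than a plain for loop depends on the disk, as the first answer points out.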
