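Problem description
I'm struggling to access data from inside a joblib Parallel loop.
The basic idea is that in each iteration I want to load data from an h5 file, do some processing on it, and then save the output. The data is too large to fit in memory, so I'm considering an iterative approach.
My basic example looks like this: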
import tables
import numpy as np
from joblib import Parallel, delayed
# open file
h5file = tables.open_file('file.h5', 'r')
# define function which I want to run in parallel
def function(i):
    x = h5file.root.variable[:, i]
    # do something with x, e.g.
    result = np.sum(np.square(x))
    return result
# run in parallel
results = Parallel(n_jobs=-1)(delayed(function)(i) for i in range(100))
# close file
h5file.close()
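However, if I implement it this way, I get the following error: "PicklingError: Could not pickle the task to send it to the workers."
I'm really stuck and would appreciate any help.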
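Solution
I had it wrong: I need to open and close the file inside the function I want to parallelize, so that no open file handle has to be pickled and sent to the workers: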
import tables
import numpy as np
from joblib import Parallel, delayed
# define function which I want to run in parallel
def function(i):
    # open file
    h5file = tables.open_file('file.h5', 'r')
    x = h5file.root.variable[:, i]
    # close file
    h5file.close()
    # do something with x, e.g.
    result = np.sum(np.square(x))
    return result
# run in parallel
results = Parallel(n_jobs=-1)(delayed(function)(i) for i in range(100))
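
A possible refinement of the approach above, offered as a sketch rather than a definitive implementation: opening the file once per column means 100 open/close cycles, so it may be cheaper to give each task a batch of columns and open the file once per batch. The helper names `sum_of_squares` and `run`, and the parameters `n_batches` and `n_columns`, are made up for illustration; only `file.h5` and the `variable` node come from the example above. The `with` block ensures the file is closed even if reading fails.

```python
import numpy as np
import tables
from joblib import Parallel, delayed


def sum_of_squares(path, columns):
    """Open the file inside the worker and process one batch of columns."""
    # The with-block closes the file even if a read raises an exception.
    with tables.open_file(path, "r") as h5file:
        # "variable" is the node name used in the example above.
        variable = h5file.root.variable
        # Read one column at a time so only a single column is in memory.
        return [float(np.sum(np.square(variable[:, i]))) for i in columns]


def run(path, n_columns, n_batches=8, n_jobs=-1):
    """Split the column indices into batches and process them in parallel."""
    batches = np.array_split(np.arange(n_columns), n_batches)
    nested = Parallel(n_jobs=n_jobs)(
        delayed(sum_of_squares)(path, batch.tolist())
        for batch in batches
        if len(batch)  # skip empty batches when n_batches > n_columns
    )
    # Flatten the per-batch lists into one list ordered by column index.
    return [value for batch in nested for value in batch]
```

With the file from the question, this would be called as `results = run("file.h5", n_columns=100)`. Only the file path and plain integer indices are passed to the workers, so nothing unpicklable needs to cross the process boundary.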