Reading from a pytables HDF5 file inside a joblib parallel loop

Problem description

I am struggling to access data from inside a joblib parallel loop.

The basic idea is that in each iteration I want to load data from an HDF5 file, do some processing on it, and save the output. The data are too large to fit in memory, so I am considering an iterative approach.

A minimal example looks like this:

import tables
import numpy as np
from joblib import Parallel, delayed

# open file
h5file = tables.open_file('file.h5', 'r')

# define function which I want to run in parallel
def function(i):
    x = h5file.root.variable[:, i]

    # do something with x, e.g.
    result = np.sum(np.square(x))

    return result

# run in parallel
results = Parallel(n_jobs=-1)(delayed(function)(i) for i in range(100))

# close file
h5file.close()

However, if I implement it this way, I get the following error: "PicklingError: Could not pickle the task to send it to the workers."

I'm clearly missing something here and would be grateful for any help.

Solution

I had it wrong: I need to open and close the file inside the function that I want to parallelize:

import tables
import numpy as np
from joblib import Parallel, delayed


# define function which I want to run in parallel
def function(i):
    # open the file inside the worker, so the file handle is never
    # pickled and sent across process boundaries
    h5file = tables.open_file('file.h5', 'r')

    x = h5file.root.variable[:, i]

    # close file
    h5file.close()

    # do something with x, e.g.
    result = np.sum(np.square(x))

    return result



# run in parallel
results = Parallel(n_jobs=-1)(delayed(function)(i) for i in range(100))
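A minimal self-contained sketch of this pattern is below. It first writes a small demo `file.h5` into a temporary directory (the array shape and the node name `variable` are illustrative assumptions) and then uses a `with` context manager inside the worker, which closes the file even if the processing step raises:

```python
import os
import tempfile

import numpy as np
import tables
from joblib import Parallel, delayed

# Build a small demo file so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'file.h5')
data = np.arange(20.0).reshape(4, 5)  # 4 rows, 5 columns
with tables.open_file(path, 'w') as f:
    f.create_array(f.root, 'variable', data)

def function(i):
    # Each worker opens the file itself; the handle never crosses
    # process boundaries, so there is nothing to pickle.
    with tables.open_file(path, 'r') as f:
        x = f.root.variable[:, i]
    return np.sum(np.square(x))

# one task per column
results = Parallel(n_jobs=2)(delayed(function)(i) for i in range(5))
print(results)
```

Opening the file per task adds some overhead; if that matters, reading in larger column chunks per task amortizes it.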