Freeing memory when reading h5 files

Problem description

I have a bunch of h5 files, each about 200 GB in size. The files are structured as follows:

file1.h5
├image  [float64: 3341 × 126 × 256 × 256]
├pulse  [uint64: 126]
└train  [uint64: 3341]

I wrote the following code to read these files:

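import h5py
import numpy as np
import xarray as xr

def images_from_disk(file_name,pulse_avg=True,train_idx=0,pulse_idx=0):
    """
    Read image data from an h5 file into an xarray DataArray
    """
    hf = h5py.File(file_name,'r')

    if not pulse_avg:
        coords = {'train': np.array(hf.get('train')),'pulse': np.array(hf.get('pulse')),}
        dims = ['train','pulse','slow_scan','fast_scan']
        # np.array(...) copies the whole 'image' dataset into memory
        xarr = xr.DataArray(np.array(hf.get('image')),dims=dims,coords=coords)
        del hf
        return xarr.isel(train=train_idx).isel(pulse=pulse_idx)

    else:
        coords = {'train': np.array(hf.get('train'))}
        # Part of this branch was lost in the source; the mean over the
        # pulse axis below is an assumed reconstruction
        dims = ['train','slow_scan','fast_scan']
        xarr = xr.DataArray(np.array(hf.get('image')).mean(axis=1),dims=dims,coords=coords)
        del hf
        return xarr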

Note that I explicitly delete the hf object used to read the file.

When I read the whole file, memory usage is as expected, since the object is large:

dummy = images_from_disk('file1.h5',pulse_avg=False,train_idx=slice(None),pulse_idx=slice(None))
dummy.nbytes * (2 ** -30)
205.2626953125

Memory in use before reading:

              total        used        free      shared  buff/cache   available
Mem:           754G         41G        697G         18M         15G        710G
Swap:          4.0G        132M        3.9G

Memory in use after reading:

              total        used        free      shared  buff/cache   available
Mem:           754G        247G        491G         18M         15G        504G
Swap:          4.0G        132M        3.9G

However, if I read the same file but keep only a smaller version (two pulses instead of 126), the object is clearly smaller, yet the memory is not freed:

dummy_reduced = images_from_disk('file1.h5',pulse_avg=False,train_idx=slice(None),pulse_idx=slice(None,2))
dummy_reduced.nbytes * (2 ** -30)
3.2626953125

Memory in use before reading:

              total        used        free      shared  buff/cache   available
Mem:           754G         41G        697G         18M         15G        710G
Swap:          4.0G        132M        3.9G

Memory in use after reading:

              total        used        free      shared  buff/cache   available
Mem:           754G        247G        491G         18M         15G        504G
Swap:          4.0G        132M        3.9G

How can I free the memory so that I can concatenate more than three h5 files? The code for the job would look something like this:

test = xr.concat([images_from_disk(file,pulse_avg=False,train_idx=slice(None,10),pulse_idx=slice(None,2)) for file in my_files],pd.Index([int(file.stem[-2:]) for file in my_files],name='module'))

Solution
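
One likely cause is that np.array(hf.get('image')) copies the entire dataset into memory, and isel with slices returns a view of that full array, so the large buffer stays alive for as long as the small result does. Deleting hf only removes the file handle. A minimal sketch of an alternative, assuming the file layout shown in the question, is to index the h5py dataset directly so that only the requested selection is read from disk. The helper name read_selection is hypothetical, and it assumes train_idx and pulse_idx are slices.

```python
import h5py
import numpy as np
import xarray as xr


def read_selection(file_name, train_idx=slice(None), pulse_idx=slice(None)):
    """Read only the requested trains and pulses into a DataArray.

    Hypothetical helper; train_idx and pulse_idx are assumed to be slices.
    """
    # The context manager closes the file even if reading fails.
    with h5py.File(file_name, "r") as hf:
        # Indexing an h5py dataset reads just that selection from disk,
        # so the full 'image' array is never materialised in memory.
        data = hf["image"][train_idx, pulse_idx]
        coords = {
            "train": hf["train"][train_idx],
            "pulse": hf["pulse"][pulse_idx],
        }
    dims = ["train", "pulse", "slow_scan", "fast_scan"]
    return xr.DataArray(data, dims=dims, coords=coords)
```

With a helper like this, each element of the list passed to xr.concat in the question could be read_selection(file, slice(None, 10), slice(None, 2)), which holds only ten trains and two pulses per file in memory.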

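For the pulse-averaged case, the mean still needs every pulse, but it does not need every train at once. The sketch below, again assuming the file layout from the question, averages over the pulse axis one block of trains at a time, so peak memory is bounded by the block size plus the averaged result. The function name read_pulse_average and the chunk parameter are hypothetical.

```python
import h5py
import numpy as np
import xarray as xr


def read_pulse_average(file_name, chunk=16):
    """Average over the pulse axis, reading `chunk` trains at a time.

    Hypothetical helper; `chunk` is an illustrative tuning knob.
    """
    with h5py.File(file_name, "r") as hf:
        image = hf["image"]  # a lazy handle, nothing is read yet
        n_train = image.shape[0]
        out = np.empty((n_train,) + image.shape[2:], dtype=image.dtype)
        for start in range(0, n_train, chunk):
            stop = min(start + chunk, n_train)
            # Only this block of trains is in memory at any one time.
            out[start:stop] = image[start:stop].mean(axis=1)
        coords = {"train": hf["train"][:]}
    dims = ["train", "slow_scan", "fast_scan"]
    return xr.DataArray(out, dims=dims, coords=coords)
```

For the 126 × 256 × 256 float64 images described in the question, a block of 16 trains is roughly 1 GiB, so chunk can be raised or lowered to fit the memory available.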

h5py memory-management python python-xarray