Sorting large text files with PyTables

Problem description

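I have two large input files (> 10 GB each, N x 4). The task is to sort these files by their second column as quickly as possible. Right now I process each file in chunks and save the sorted rows to text files (code below). This works, but I need it to be faster!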

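Is there a faster way? Later on I also have to read the sorted files back in chunks. How can I do that with the PyTables or h5py modules? Other suggestions are welcome too.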

import glob
import os

filename = ['Input-1.txt','Input-2.txt']
savename = ['Sort-1.txt','Sort-2.txt']

chunksize = 10_000_000 # number of lines per chunk

for findex in range(2):
    # no. of lines in each file (not used below; costs one extra pass over the file)
    with open(filename[findex]) as f_count:
        nrows = sum(1 for line in f_count)

    # storing chunk files in /dump
    this_dir = os.path.dirname(__file__)
    path_1 = ["dump/chunk1_{}.tsv","dump/chunk2_{}.tsv"] # chunks in .tsv
    path_2 = ["dump/chunk1_*.tsv","dump/chunk2_*.tsv"]
    path_w = os.path.join(this_dir,path_1[findex])
    path_r = os.path.join(this_dir,path_2[findex])  

    fid = 1
    lines = []

    with open(filename[findex],'r') as f_in:
        # creates chunk file(s)
        f_out = open(path_w.format(fid),'w')
        
        for line_num,line in enumerate(f_in,1):
            # keep appending until you reach chunksize (boundary)
            lines.append(line)
            # enter as line_num reaches chunksize
            if line_num % chunksize == 0:
                # updates list with sorted values
                lines = sorted(lines,key=lambda k: float(k.split(',')[1]))
                f_out.writelines(lines)
                f_out.close()
                lines = []
                fid += 1
                # open next chunk
                f_out = open(path_w.format(fid),'w')

        # last chunk
        if lines:
            lines = sorted(lines,key=lambda k: float(k.split(',')[1]))
            f_out.writelines(lines)
            f_out.close()
            lines = []

    print(f'==> Writing {savename[findex]}')

    from heapq import merge
    chunks = []

    # open every sorted chunk file written above
    for chunk_path in glob.glob(path_r):
        chunks.append(open(chunk_path,'r'))

    #print(filename[findex],savename[findex])
    with open(savename[findex],'w') as f_out:
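        # k-way merge of the sorted chunks, keyed on the second column
        f_out.writelines(merge(*chunks,key=lambda k: float(k.split(',')[1])))

    # close the chunk files once the merge is done
    for chunk in chunks:
        chunk.close()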
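
One possible direction for the "faster" part of the question, offered as a sketch and not a benchmarked answer: parse each chunk into a NumPy array once and sort it with `argsort`, so the text is not sorted through a Python-level key function. This assumes all four columns are numeric and comma-separated, which the question does not state. The function name `sort_chunk` is hypothetical. Whether it actually beats the text-based sort depends on the data and the NumPy version, so it is worth timing on a sample.

```python
import io

import numpy as np


def sort_chunk(lines, key_col=1):
    """Parse comma-separated text lines and return the rows sorted by key_col.

    Assumes every column is numeric (an assumption, not stated in the question).
    """
    # ndmin=2 keeps a 2-D result even when the chunk holds a single row
    data = np.loadtxt(io.StringIO("".join(lines)), delimiter=",", ndmin=2)
    # a stable sort keeps the input order of rows that share the same key
    order = np.argsort(data[:, key_col], kind="stable")
    return data[order]
```

Calling `sort_chunk(lines)` in place of the `sorted(lines, key=...)` call would return a float array, so the chunk would then have to be written in a numeric format (for example with `numpy.save`) instead of with `writelines`.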

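For the second part of the question (reading the sorted data back in chunks), a minimal h5py sketch could look like the following. It assumes the rows arrive already in globally sorted order, for example from the merge step, and that all columns are numeric. The dataset name `"data"` and the helper names `write_sorted_rows` and `iter_chunks` are hypothetical. PyTables offers comparable resizable arrays, which are not shown here.

```python
import h5py
import numpy as np


def write_sorted_rows(h5_path, sorted_blocks, ncols=4):
    """Append already-sorted 2-D blocks of floats to a resizable HDF5 dataset."""
    with h5py.File(h5_path, "w") as f:
        dset = f.create_dataset(
            "data",                  # hypothetical dataset name
            shape=(0, ncols),
            maxshape=(None, ncols),  # unlimited rows; h5py picks a chunk shape
            dtype="f8",
        )
        for block in sorted_blocks:
            block = np.asarray(block, dtype="f8")
            start = dset.shape[0]
            # grow the dataset along axis 0, then fill the new rows
            dset.resize(start + block.shape[0], axis=0)
            dset[start:] = block


def iter_chunks(h5_path, rows_per_read=100_000):
    """Yield consecutive row slices of the dataset as NumPy arrays."""
    with h5py.File(h5_path, "r") as f:
        dset = f["data"]
        for start in range(0, dset.shape[0], rows_per_read):
            # only the requested rows are read from disk
            yield dset[start:start + rows_per_read]
```

A loop such as `for block in iter_chunks("Sort-1.h5"): ...` would then process the sorted table piece by piece without loading the whole file. The file name here is only an example.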
