Azure Databricks command hangs when processing large files (pure Python, 2.5 GB+ file size)

Problem description

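I am converting txt files to XML using pure Python. The txt files range in size from 1 KB to 2.5 GB. The converted output is roughly 5 times the size of the input.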

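The problem is that when processing the larger 2.5 GB files, the first file works, but subsequent processing hangs and the cell stays stuck on "Running command...". Smaller files seem to work fine.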

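I have edited the code to make sure it uses generators and does not hold large lists in memory.
I am processing from dbfs, so connectivity should not be the problem.
A memory check shows it uses only ~200 MB throughout, and that figure does not grow.
A large file takes about 10 minutes to process.
There are no GC warnings or other errors in the logs.
Azure Databricks, pure Python.
The cluster is large enough and only Python is used, so cluster size should not be the problem.
Restarting the cluster is the only way to get things running again.
The stuck command also stops other notebooks on the cluster from working.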
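
The question does not say how the memory check was done. As a minimal sketch, assuming the standard-library `tracemalloc` module is acceptable, Python-level allocations could be sampled as below. The `traced_memory_mb` helper and the sample `batch` list are illustrative names only. Note that `tracemalloc` sees only memory allocated by the Python interpreter, so it would not show usage in the JVM or in the `/dbfs` mount layer.

```python
import tracemalloc


def traced_memory_mb():
    """Return (current, peak) Python-level allocations in MB since tracing began."""
    current, peak = tracemalloc.get_traced_memory()
    return current / 1_000_000, peak / 1_000_000


tracemalloc.start()

# Stand-in for one batch of converted lines (illustrative data only).
batch = ["<row/>"] * 10_000

current_mb, peak_mb = traced_memory_mb()
print(f"current={current_mb:.2f} MB, peak={peak_mb:.2f} MB")

del batch
tracemalloc.stop()
```

Calling the helper once per batch inside the conversion loop would show whether the figure stays flat across files.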

The code outline below has been trimmed for simplicity.

# list of files to convert that are in Azure Blob Storage
text_files = ['file1.txt','file2.txt','file3.txt']

# loop over files and convert them to xml
for file in text_files:
    
    xml_filename = file.replace('.txt','.xml')
    # copy files from blob storage to dbfs
    dbutils.fs.cp(f'dbfs:/mnt/storage_account/projects/xml_converter/input/{file}',f'dbfs:/tmp/temporary/{file}')
    
    # open files and convert to xml
    with open(f'/dbfs/tmp/temporary/{file}','r') as infile,open(f'/dbfs/tmp/temporary/{xml_filename}','a',encoding="utf-8") as outfile:

        # list of strings to join at write time
        to_write = []

        for line in infile:
            # convert to xml
            # code redacted for simplicity

            to_write.append(new_xml)

            # batch the write operations to avoid huge lists
            if len(to_write) > 10_000:

                outfile.write(''.join(to_write))
                to_write = [] # reset the batch

        # do a final write of anything that is in the list
        outfile.write(''.join(to_write))
    
    # copy completed files from dbfs back to blob storage
    dbutils.fs.cp(f'dbfs:/tmp/temporary/{xml_filename}',f"dbfs:/mnt/storage_account/projects/xml_converter/output/{xml_filename}")

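Azure cluster info

I expected this code to run without problems. Memory does not seem to be the issue. The data is in dbfs, so connectivity should not be the issue either. The code uses generators, so little is held in memory. I am at a loss. Any suggestions would be greatly appreciated. Thanks for looking!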
