Looping merge of a df and a .txt file, appending the matched output

Problem description

I am using python3. I have a data frame, nameinternat_country, with 13,527 companies (and 5 columns). I want to merge this list of companies with the 40+ GB file Contact_info.txt (more than 190 million companies and 29 columns) on the variable name_country.

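My desired output is a data frame (called mergeall in my code below) that contains the list of 13,527 companies (from nameinternat_country, my left df) together with the 29 columns from Contact_info.txt for the matched cases. mergeall would have 13,527 rows and 34 columns (5 from nameinternat_country + the 29 original columns from Contact_info.txt). Unmatched cases would show missing values.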

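The problem is that Contact_info.txt is over 40 GB (I cannot load it as a data frame because of memory limits). So I first need to read it in chunks and then merge chunk by chunk. Here is my code (note: I restricted Contact_info.txt to its first 5,000 rows just to make my trials faster):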

    import pandas as pd

    mergeall = pd.DataFrame()  # create df to store merges in chunk below
    ChunkSize = 1000           # num of rows per chunk

    for chunk in pd.read_csv('ORBIS financial/Contact info.txt', sep="\t",
                             nrows=5000, chunksize=ChunkSize):
        # create new (merging) column in txt files
        chunk["name_country"] = chunk["NAME_INTERNAT"] + "," + chunk["Country"]
        mergeall = pd.concat([mergeall,
                              nameinternat_country.merge(chunk, how='left', on='name_country')])

The resulting mergeall data frame has 67,635 rows x 34 columns. Is the mistake in how I build mergeall with pd.concat?

Thank you very much.

Solution
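
The row count in the question is consistent with the loop itself rather than with a fault in pd.concat: 67,635 = 5 x 13,527. With nrows=5000 and 1,000 rows per chunk the loop runs five times, and every left merge returns all rows of nameinternat_country (matched or not), so five full copies get stacked. One possible way around this, offered as a sketch and not as a confirmed fix, is to keep only the matching rows of each chunk and do a single left merge at the end. The function name merge_in_chunks, its parameters, and the toy data are hypothetical and exist only for illustration; the column names NAME_INTERNAT, Country and name_country come from the question.

```python
import io

import pandas as pd


def merge_in_chunks(left, source, key="name_country", chunksize=1000, **read_kwargs):
    """Left-merge `left` with a large tab-separated file that is read in chunks.

    Hypothetical helper: `left` must already contain the `key` column, and the
    file must contain NAME_INTERNAT and Country, as in the question.
    """
    wanted = set(left[key])  # keys to look for in the big file
    matches = []
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize, **read_kwargs):
        # Build the merge key the same way the question does.
        chunk[key] = chunk["NAME_INTERNAT"] + "," + chunk["Country"]
        # Keep only rows whose key occurs in the company list; this stays small.
        matches.append(chunk[chunk[key].isin(wanted)])
    matched = pd.concat(matches, ignore_index=True)
    # A single left merge at the end: companies without a match get NaN.
    return left.merge(matched, how="left", on=key)


# Toy data standing in for nameinternat_country and Contact_info.txt.
left = pd.DataFrame({"name_country": ["Acme,US", "Globex,DE", "Initech,FR"],
                     "id": [1, 2, 3]})
big = ("NAME_INTERNAT\tCountry\tPhone\n"
       "Acme\tUS\t111\nOther\tUS\t222\nGlobex\tDE\t333\nFoo\tIT\t444\n")

result = merge_in_chunks(left, io.StringIO(big), chunksize=2)
print(result)
```

For the real file, the call would presumably look like merge_in_chunks(nameinternat_country, 'ORBIS financial/Contact info.txt'), with nrows=5000 passed through while testing. Note that the result has exactly one row per company only if each name_country value appears at most once in Contact_info.txt; duplicate keys in the file would add rows, and dropping duplicates from the matched rows before the final merge would be one way to handle that.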
