ParserError when parsing a large txt file with Pandas

Problem description

I am trying to parse a large .txt file with Pandas. The file is 1.6 GB. You can download the file here (it is the GeoNames database dump for all countries).

To load and parse the file in Pandas, I followed the answers here and here, and this is what I have in my code:

import pandas as pd

for chunk in pd.read_csv(
    "allCountries.txt", header=None, engine="python", sep=r"\s{1,}", chunksize=1000,
    names=[
        "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
        "feature class", "feature code", "country code", "cc2", "admin1 code",
        "admin2 code", "admin3 code", "admin4 code", "population", "elevation",
        "dem", "timezone", "modification date",
    ],
):
    print(chunk[0])  # just printing out the first row

Running the code above produces the following error:

ParserError: Expected 20 fields in line 1, saw 25. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.

I don't know what is wrong here. Can someone tell me what went wrong and how to fix it?

Solution

Your separator is wrong, because the values in some columns (such as name) contain spaces:

2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 0 2860 Europe/Andorra 2014-11-05

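With a whitespace separator, the spaces inside the name columns split this row into too many fields, so it is parsed incorrectly.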
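To see where the 25 in the error message comes from, the sample row can be split both ways with Python's re module. This is only a minimal sketch: the row is rebuilt here as 19 tab-separated fields, and the positions of the empty fields (cc2, admin2 to admin4, elevation) are an assumption inferred from the 19 column names in the question, since the row as printed above does not show the tabs.

```python
import re

# The sample row rebuilt as 19 tab-separated fields. Which fields are empty is
# an assumption inferred from the 19 column names used in the question.
fields = [
    "2986043", "Pic de Font Blanca", "Pic de Font Blanca",
    "Pic de Font Blanca,Pic du Port", "42.64991", "1.53335",
    "T", "PK", "AD", "", "00", "", "", "", "0", "", "2860",
    "Europe/Andorra", "2014-11-05",
]
row = "\t".join(fields)

by_whitespace = re.split(r"\s{1,}", row)  # the separator used in the question
by_tab = row.split("\t")                  # one field per tab character

print(len(by_whitespace))  # 25
print(len(by_tab))         # 19
```

Splitting on any whitespace yields 25 pieces, matching the "saw 25" in the error, while splitting on tabs yields the 19 fields that the names list expects.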

This code works for me:

for chunk in pd.read_csv(
    "allCountries.txt", header=None, engine="python", sep=r"\t+", chunksize=1000,
    names=[
        "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
        "feature class", "feature code", "country code", "cc2", "admin1 code",
        "admin2 code", "admin3 code", "admin4 code", "population", "elevation",
        "dem", "timezone", "modification date",
    ],
):
    print(chunk)
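
One caveat with a regular-expression separator such as r"\t+": it treats a run of consecutive tabs as a single separator, so empty fields disappear and the values after them shift left. The sketch below uses hypothetical three-column data (not taken from the GeoNames file) to compare it with a plain "\t" separator.

```python
import io
import pandas as pd

# Hypothetical sample: the middle field of the second row is empty.
sample = "1\tfoo\tA\n2\t\tB\n"
names = ["id", "name", "code"]

# Regex separator: the two adjacent tabs are matched as one delimiter.
collapsed = pd.read_csv(
    io.StringIO(sample), sep=r"\t+", engine="python", header=None, names=names
)

# Plain tab separator: every tab ends exactly one field.
exact = pd.read_csv(io.StringIO(sample), sep="\t", header=None, names=names)

print(collapsed)  # "B" ends up in the "name" column of the second row
print(exact)      # "B" stays in the "code" column; "name" is missing (NaN)
```

If the file contains empty fields, a plain "\t" keeps the columns aligned.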

I opened the first 10 lines of the file in LibreOffice, and using tab as the separator works well:

import csv
import pandas as pd

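# The file has no header row, so header=None keeps the first record as data.
# QUOTE_NONE makes pandas treat " as an ordinary character.
for chunk in pd.read_csv(
    "allCountries.txt", sep="\t", header=None, quoting=csv.QUOTE_NONE, chunksize=1000
):
    print(chunk.iloc[0])  # print the first row of each chunk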

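The file also contains the " character, which pandas by default treats as a quote character. That leads to errors, but setting quoting to QUOTE_NONE solves the problem.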
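
The effect of the quoting setting can be tried on a tiny in-memory sample. This is a minimal sketch with made-up data (the rows below are hypothetical, not from the GeoNames file): one field starts with a " that is never closed. The default call is wrapped in try/except because its outcome depends on how the parser handles the unclosed quote.

```python
import csv
import io
import pandas as pd

# Hypothetical sample: the second field of the first row starts with an
# unmatched double quote.
sample = '1\t"Font Blanca\tA\n2\tplain\tB\n'

# Default quoting: the " opens a quoted field that is never closed.
try:
    default = pd.read_csv(io.StringIO(sample), sep="\t", header=None)
    print("default quoting parsed shape:", default.shape)
except pd.errors.ParserError as exc:
    print("default quoting failed:", exc)

# QUOTE_NONE: the " is kept as a literal character in the value.
literal = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, quoting=csv.QUOTE_NONE
)
print(literal)
```

With QUOTE_NONE the sample parses as two rows of three columns, and the quote stays part of the field value.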