Problem description
I have a list of files to import into Pandas DataFrames; each file is at least 100 MB.
from glob import glob
from os import path

# directory in Google Colab that holds the notebook and the CSV files
file_dir = '/content/drive/MyDrive/New York Bike Share'
# collect the matching file names in the directory and sort them
file_names = sorted(glob(path.join(file_dir, '*-citibike-tripdata.csv')))
file_names
['/content/drive/MyDrive/New York Bike Share/201901-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201902-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201903-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201904-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201905-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201906-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201907-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201908-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201909-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201910-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201911-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201912-citibike-tripdata.csv']
I tried to break each file down with the help of the chunksize, usecols, and related parameters of the read_csv method.
import numpy as np
import pandas as pd
from functools import reduce

# positions of the columns to keep, their new names, and memory-friendly dtypes
cols = [0, 1, 4, 5, 6, 8, 9, 10, 12, 13, 14]
col_names = ['duration', 'time_start', 'station_name_start', 'station_latitude_start',
             'station_longitude_start', 'station_name_end', 'station_latitude',
             'station_longitude_end', 'user_type', 'birth_year', 'gender']
col_type = {
    'duration': np.int32, 'station_latitude_start': np.float32, 'station_longitude_start': np.float32,
    'station_latitude': np.float32, 'station_longitude_end': np.float32,
    'user_type': 'category', 'birth_year': 'object', 'gender': 'category'
}

def create_df(file):
    # returns a TextFileReader that lazily yields 100,000-row DataFrame chunks
    t = pd.read_csv(file, chunksize=100_000, usecols=cols, names=col_names,
                    dtype=col_type, parse_dates=['time_start'], header=0)
    return t

def merge_df(ls):
    # fold a list of DataFrames into one by repeated pairwise concatenation
    f = reduce(lambda a, b: pd.concat([a, b], ignore_index=True), ls)
    return f
Combining all the files would produce a 2 GB+ DataFrame, so I experimented with just 3 of the CSV files.
df_list = []
for f in file_names[0:3]:
    for chunk in create_df(f):
        df_list.append(chunk)

df_list[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 duration 100000 non-null int32
1 time_start 100000 non-null datetime64[ns]
2 station_name_start 100000 non-null object
3 station_latitude_start 100000 non-null float32
4 station_longitude_start 100000 non-null float32
5 station_name_end 100000 non-null object
6 station_latitude 100000 non-null float32
7 station_longitude_end 100000 non-null float32
8 user_type 100000 non-null category
9 birth_year 100000 non-null object
10 gender 100000 non-null category
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(3)
memory usage: 5.2+ MB
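The "+" in "5.2+ MB" means pandas only reports a lower bound here: the string contents of the three object columns are not measured. If the real per-chunk footprint matters, it can be checked with a deep measurement (a quick sketch, reusing df_list from above):

# measure the first chunk including the strings held by the object columns
df_list[0].info(memory_usage='deep')
# or, per column, in bytes
df_list[0].memory_usage(deep=True)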
merge_df(df_list).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3238991 entries, 0 to 3238990
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 duration int32
1 time_start datetime64[ns]
2 station_name_start object
3 station_latitude_start float32
4 station_longitude_start float32
5 station_name_end object
6 station_latitude float32
7 station_longitude_end float32
8 user_type category
9 birth_year object
10 gender category
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(3)
memory usage: 166.8+ MB
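As a side note, the reduce inside merge_df re-copies the accumulated data on every pairwise pd.concat. Since pd.concat already accepts a list, the same merged frame can be built in a single pass (a minimal sketch, equivalent in result to merge_df(df_list)):

# one concat over the whole list instead of len(df_list) - 1 pairwise copies
merged = pd.concat(df_list, ignore_index=True)
merged.info()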
I tried to speed up the process of producing the same result with the help of a multiprocessing Pool, but I ran into a TypeError.
from multiprocessing import Pool
pool = Pool(8)
pool.map(merge_df,df_list)
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
Any advice on this error would be appreciated.
Solution
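The TypeError comes from how Pool.map feeds merge_df: it calls merge_df once per element of df_list, so each call receives a single DataFrame instead of the whole list. reduce then iterates over that DataFrame, and iterating a DataFrame yields its column labels (strings), which is why pd.concat ends up being handed a str.

One way to restructure this is to let each worker process read one file and return a finished DataFrame, and to concatenate only once in the parent process. A minimal sketch, assuming create_df and file_names from the question are in scope, and noting that each per-file DataFrame has to be pickled back to the parent, so the speed-up on Colab may be modest:

from multiprocessing import Pool

import pandas as pd

def read_file(file):
    # read one CSV in chunks and stitch the chunks back into one DataFrame
    return pd.concat(create_df(file), ignore_index=True)

with Pool(8) as pool:
    # one DataFrame per file, built in parallel worker processes
    frames = pool.map(read_file, file_names[0:3])

# a single concat in the parent process, once all files have been read
df = pd.concat(frames, ignore_index=True)
df.info()

On platforms that spawn rather than fork worker processes, read_file and create_df would need to live in an importable module and the Pool calls would need an if __name__ == '__main__' guard; on Colab's Linux runtime the snippet above works as written in a notebook cell.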