Problem description
I have a list of files to import into Pandas DataFrames; each file is at least 100 MB.
from glob import glob
from os import path

# directory in Google Colab that holds the notebook and the CSV files
file_dir = '/content/drive/MyDrive/New York Bike Share'
# collect the matching file names in the directory and sort them
file_names = sorted(glob(path.join(file_dir, '*-citibike-tripdata.csv')))
file_names
['/content/drive/MyDrive/New York Bike Share/201901-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201902-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201903-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201904-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201905-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201906-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201907-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201908-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201909-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201910-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201911-citibike-tripdata.csv',
 '/content/drive/MyDrive/New York Bike Share/201912-citibike-tripdata.csv']
I tried to break each file down with the help of the chunksize, usecols, and related parameters of the read_csv method.
import numpy as np
import pandas as pd
from functools import reduce

# positions of the columns to keep, their new names, and memory-friendly dtypes
cols = [0, 1, 4, 5, 6, 8, 9, 10, 12, 13, 14]
col_names = ['duration', 'time_start', 'station_name_start', 'station_latitude_start',
             'station_longitude_start', 'station_name_end', 'station_latitude',
             'station_longitude_end', 'user_type', 'birth_year', 'gender']
col_type = {
    'duration': np.int32, 'station_latitude_start': np.float32, 'station_longitude_start': np.float32,
    'station_latitude': np.float32, 'station_longitude_end': np.float32,
    'user_type': 'category', 'birth_year': 'object', 'gender': 'category'
}

def create_df(file):
    # returns a TextFileReader that lazily yields 100,000-row DataFrame chunks
    t = pd.read_csv(file, chunksize=100_000, usecols=cols, names=col_names,
                    dtype=col_type, parse_dates=['time_start'], header=0)
    return t

def merge_df(ls):
    # fold a list of DataFrames into one by repeated pairwise concatenation
    f = reduce(lambda a, b: pd.concat([a, b], ignore_index=True), ls)
    return f
Combining all the files would produce a 2 GB+ DataFrame, so I experimented with just 3 of the CSV files.
df_list = []
for f in file_names[0:3]:
    for chunk in create_df(f):
        df_list.append(chunk)

df_list[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 duration 100000 non-null int32
1 time_start 100000 non-null datetime64[ns]
2 station_name_start 100000 non-null object
3 station_latitude_start 100000 non-null float32
4 station_longitude_start 100000 non-null float32
5 station_name_end 100000 non-null object
6 station_latitude 100000 non-null float32
7 station_longitude_end 100000 non-null float32
8 user_type 100000 non-null category
9 birth_year 100000 non-null object
10 gender 100000 non-null category
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(3)
memory usage: 5.2+ MB
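The "+" in "5.2+ MB" means pandas only reports a lower bound here: the string contents of the three object columns are not measured. If the real per-chunk footprint matters, it can be checked with a deep measurement (a quick sketch, reusing df_list from above):

# measure the first chunk including the strings held by the object columns
df_list[0].info(memory_usage='deep')
# or, per column, in bytes
df_list[0].memory_usage(deep=True)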
merge_df(df_list).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3238991 entries, 0 to 3238990
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 duration int32
1 time_start datetime64[ns]
2 station_name_start object
3 station_latitude_start float32
4 station_longitude_start float32
5 station_name_end object
6 station_latitude float32
7 station_longitude_end float32
8 user_type category
9 birth_year object
10 gender category
dtypes: category(2), datetime64[ns](1), float32(4), int32(1), object(3)
memory usage: 166.8+ MB
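As a side note, the reduce inside merge_df re-copies the accumulated data on every pairwise pd.concat. Since pd.concat already accepts a list, the same merged frame can be built in a single pass (a minimal sketch, equivalent in result to merge_df(df_list)):

# one concat over the whole list instead of len(df_list) - 1 pairwise copies
merged = pd.concat(df_list, ignore_index=True)
merged.info()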
I tried to speed up the process of producing the same result with the help of a multiprocessing Pool, but I ran into a TypeError.
from multiprocessing import Pool
pool = Pool(8)
pool.map(merge_df,df_list)
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
Any advice on this error would be appreciated.
Solution
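The TypeError comes from how Pool.map feeds merge_df: it calls merge_df once per element of df_list, so each call receives a single DataFrame instead of the whole list. reduce then iterates over that DataFrame, and iterating a DataFrame yields its column labels (strings), which is why pd.concat ends up being handed a str.

One way to restructure this is to let each worker process read one file and return a finished DataFrame, and to concatenate only once in the parent process. A minimal sketch, assuming create_df and file_names from the question are in scope, and noting that each per-file DataFrame has to be pickled back to the parent, so the speed-up on Colab may be modest:

from multiprocessing import Pool

import pandas as pd

def read_file(file):
    # read one CSV in chunks and stitch the chunks back into one DataFrame
    return pd.concat(create_df(file), ignore_index=True)

with Pool(8) as pool:
    # one DataFrame per file, built in parallel worker processes
    frames = pool.map(read_file, file_names[0:3])

# a single concat in the parent process, once all files have been read
df = pd.concat(frames, ignore_index=True)
df.info()

On platforms that spawn rather than fork worker processes, read_file and create_df would need to live in an importable module and the Pool calls would need an if __name__ == '__main__' guard; on Colab's Linux runtime the snippet above works as written in a notebook cell.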