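Problem description
I am working on https://www.kaggle.com/c/ieee-fraud-detection, which comes with a very large dataset. Before doing any machine learning, I want to reduce the size of the dataset by casting every column to the smallest suitable type. Here is the code snippet: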
import numpy as np

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # df is a Dask DataFrame here, so every reduction needs .compute()
    start_mem = df.memory_usage().sum().compute() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            # Each .compute() below runs the task graph separately
            c_min = df[col].min().compute()
            c_max = df[col].max().compute()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range covers the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floats (float16 also reduces precision)
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum().compute() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
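The range checks in the function above can be isolated into small helpers, which makes them easier to test on their own. This is only a sketch: the names smallest_int_dtype, smallest_float_dtype and the allow_float16 flag are hypothetical and not part of the original code. Two things differ on purpose: the bounds are inclusive, and float16 is opt-in, because passing the range check says nothing about precision (float16 keeps only about three significant decimal digits).

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer type whose range contains [c_min, c_max]."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= c_min and c_max <= info.max:
            return candidate
    raise OverflowError("range does not fit in int64")

def smallest_float_dtype(c_min, c_max, allow_float16=False):
    """Return the narrowest float type whose range contains [c_min, c_max].

    NaN bounds (for example from an all-NaN column) fail every comparison,
    so the function falls back to float64 in that case.
    """
    candidates = (np.float16, np.float32) if allow_float16 else (np.float32,)
    for candidate in candidates:
        info = np.finfo(candidate)
        if info.min <= c_min and c_max <= info.max:
            return candidate
    return np.float64

print(smallest_int_dtype(0, 100).__name__)
print(smallest_float_dtype(0.0, 1.0).__name__)
```

The helpers only look at the value range, so whether a narrower float type is acceptable still depends on how much precision the later modelling step needs.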
I called it on the dataset as follows (test_transaction is a simple dataframe):
test_transaction = reduce_mem_usage(test_transaction)
test_transaction.to_csv(base + 'test_transaction.csv',single_file = True)
The problem is that the dataset is so large that this takes forever. So I decided to parallelize it with dask.delayed and wrote the code below:
from dask import delayed

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = delayed(df[col].astype)(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = delayed(df[col].astype)(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = delayed(df[col].astype)(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = delayed(df[col].astype)(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = delayed(df[col].astype)(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = delayed(df[col].astype)(np.float32)
                else:
                    df[col] = delayed(df[col].astype)(np.float64)
        else:
            df[col] = delayed(df[col].astype)('category')
    Sum = 0
    for col in df.columns:
        Sum += delayed(df[col].memory_usage)()
    print(Sum.compute() / 1024**2)
    #end_mem = df.memory_usage().sum()/ 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
I called it like this:
#base = 'E:\Study Material\Python_Machine_AI\Machine Learning\Python_ML_programs\IEEE_Fraud_Detection\\'
test = reduce_mem_usage(test_identity.compute())
#test_identity.to_csv(base + 'test_identity.csv',single_file = True)
I get no result here; the output shows no change:
Memory usage of dataframe is 44.39 MB
44.394248962402344
When I run the following:
test.head()
it displays:
TransactionID id-01 id-02 id-03 id-04 id-05 id-06 id-07 id-08 id-09 ... id-31 id-32 id-33 id-34 id-35 id-36 id-37 id-38 DeviceType DeviceInfo
0 Delayed('astype-039b7a66-2d77-43f2-a258-a412ad... Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b... Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25... Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f... Delayed('astype-26f740dc-2480-48b0-b754-f83434... Delayed('astype-d42d8543-f23f-4829-a6c5-66409c... Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9... Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311... Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3... Delayed('astype-9295cd86-9806-45b0-993f-4ed362... ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e... Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371... Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf... Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411... Delayed('astype-d04e8ed3-7690-47b5-987a-50b667... Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7... Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00... Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295... Delayed('astype-dedcc518-d216-491e-a124-bf2560... Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
1 Delayed('astype-039b7a66-2d77-43f2-a258-a412ad... Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b... Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25... Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f... Delayed('astype-26f740dc-2480-48b0-b754-f83434... Delayed('astype-d42d8543-f23f-4829-a6c5-66409c... Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9... Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311... Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3... Delayed('astype-9295cd86-9806-45b0-993f-4ed362... ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e... Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371... Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf... Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411... Delayed('astype-d04e8ed3-7690-47b5-987a-50b667... Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7... Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00... Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295... Delayed('astype-dedcc518-d216-491e-a124-bf2560... Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
2 Delayed('astype-039b7a66-2d77-43f2-a258-a412ad... Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b... Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25... Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f... Delayed('astype-26f740dc-2480-48b0-b754-f83434... Delayed('astype-d42d8543-f23f-4829-a6c5-66409c... Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9... Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311... Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3... Delayed('astype-9295cd86-9806-45b0-993f-4ed362... ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e... Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371... Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf... Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411... Delayed('astype-d04e8ed3-7690-47b5-987a-50b667... Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7... Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00... Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295... Delayed('astype-dedcc518-d216-491e-a124-bf2560... Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
3 Delayed('astype-039b7a66-2d77-43f2-a258-a412ad... Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b... Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25... Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f... Delayed('astype-26f740dc-2480-48b0-b754-f83434... Delayed('astype-d42d8543-f23f-4829-a6c5-66409c... Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9... Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311... Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3... Delayed('astype-9295cd86-9806-45b0-993f-4ed362... ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e... Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371... Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf... Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411... Delayed('astype-d04e8ed3-7690-47b5-987a-50b667... Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7... Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00... Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295... Delayed('astype-dedcc518-d216-491e-a124-bf2560... Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
4 Delayed('astype-039b7a66-2d77-43f2-a258-a412ad... Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b... Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25... Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f... Delayed('astype-26f740dc-2480-48b0-b754-f83434... Delayed('astype-d42d8543-f23f-4829-a6c5-66409c... Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9... Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311... Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3...
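This means every column was delayed correctly, but I cannot find the right way to call .compute(). I want all columns converted to the right types, and I also want the final memory size displayed. How should I do this?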
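One reading of the output above: delayed(df[col].astype)(...) returns a lazy Delayed object rather than a Series, so assigning it to df[col] stores the placeholder itself in every row and nothing is ever converted. The column then has object dtype, and because memory_usage() without deep=True counts 8 bytes per object pointer (the same as an int64 or float64 value), that likely explains why the reported total looks unchanged. A minimal sketch of one way around it, assuming df is already a pandas DataFrame (as it is after test_identity.compute()): keep the Delayed objects in a dict, evaluate them together with dask.compute, and only then build the new frame. The sample frame and the helper name pick_dtype are hypothetical, and float16 is skipped on purpose.

```python
import dask
import numpy as np
import pandas as pd
from dask import delayed

def pick_dtype(series):
    """Choose a smaller dtype for one pandas Series, or keep the current one."""
    if pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
        return "category"
    lo, hi = series.min(), series.max()
    if series.dtype.kind == "i":
        for candidate in (np.int8, np.int16, np.int32):
            info = np.iinfo(candidate)
            if info.min <= lo and hi <= info.max:
                return candidate
    elif series.dtype.kind == "f":
        info = np.finfo(np.float32)
        if info.min <= lo and hi <= info.max:
            return np.float32
    return series.dtype

def reduce_mem_usage(df):
    start_mem = df.memory_usage(deep=True).sum() / 1024**2
    # Collect the lazy casts; do NOT assign Delayed objects back into the frame.
    tasks = {col: delayed(df[col].astype)(pick_dtype(df[col])) for col in df.columns}
    # A single compute call evaluates every cast and returns real Series.
    results = dask.compute(*tasks.values())
    out = pd.DataFrame(dict(zip(tasks.keys(), results)), index=df.index)
    end_mem = out.memory_usage(deep=True).sum() / 1024**2
    print('Memory usage: {:.4f} MB -> {:.4f} MB'.format(start_mem, end_mem))
    return out

# Hypothetical sample data standing in for test_identity.compute().
sample = pd.DataFrame({
    "id": np.arange(1000, dtype="int64"),
    "score": np.linspace(-1.0, 1.0, 1000),
    "device": ["mobile", "desktop"] * 500,
})
reduced = reduce_mem_usage(sample)
print(reduced.dtypes)
```

For a frame that already fits in memory, a plain df.astype({...}) with the same mapping is probably just as fast, since delayed mostly adds scheduling overhead in that situation.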