How do I apply Dask Delayed to the following data type conversion function?

Problem description

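I am working on the https://www.kaggle.com/c/ieee-fraud-detection problem, which has a huge dataset. Before doing any machine learning, I want to reduce the size of the dataset by setting every attribute to its proper type. Here is the code snippet: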

import numpy as np

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum().compute() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min().compute()
            c_max = df[col].max().compute()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else: df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum().compute()/ 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

I passed the dataset as follows (test_transaction is a simple dataframe):

test_transaction = reduce_mem_usage(test_transaction)
test_transaction.to_csv(base + 'test_transaction.csv',single_file = True)

The problem is that the dataset is so large that this takes forever. So I decided to parallelize it with dask.delayed and wrote the following code:

import numpy as np
from dask import delayed

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = delayed(df[col].astype)(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = delayed(df[col].astype)(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = delayed(df[col].astype)(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = delayed(df[col].astype)(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = delayed(df[col].astype)(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = delayed(df[col].astype)(np.float32)
                else:
                    df[col] = delayed(df[col].astype)(np.float64)
        else: df[col] = delayed(df[col].astype)('category')

    Sum = 0
    for col in df.columns:
        Sum += delayed(df[col].memory_usage)()
     
    print(Sum.compute()/ 1024**2)
    
    #end_mem = df.memory_usage().sum()/ 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

I called it like this:

#base = 'E:\Study Material\Python_Machine_AI\Machine Learning\Python_ML_programs\IEEE_Fraud_Detection\\'
test = reduce_mem_usage(test_identity.compute())
#test_identity.to_csv(base + 'test_identity.csv',single_file = True)

This gave me no result. It shows (no change):

Memory usage of dataframe is 44.39 MB
44.394248962402344

When I run the following:

test.head()

it displays

TransactionID   id-01   id-02   id-03   id-04   id-05   id-06   id-07   id-08   id-09   ... id-31   id-32   id-33   id-34   id-35   id-36   id-37   id-38   DeviceType  DeviceInfo
0   Delayed('astype-039b7a66-2d77-43f2-a258-a412ad...   Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b...   Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25...   Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f...   Delayed('astype-26f740dc-2480-48b0-b754-f83434...   Delayed('astype-d42d8543-f23f-4829-a6c5-66409c...   Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9...   Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311...   Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3...   Delayed('astype-9295cd86-9806-45b0-993f-4ed362...   ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e...   Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371...   Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf...   Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411...   Delayed('astype-d04e8ed3-7690-47b5-987a-50b667...   Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7...   Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00...   Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295...   Delayed('astype-dedcc518-d216-491e-a124-bf2560...   Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
1   Delayed('astype-039b7a66-2d77-43f2-a258-a412ad...   Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b...   Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25...   Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f...   Delayed('astype-26f740dc-2480-48b0-b754-f83434...   Delayed('astype-d42d8543-f23f-4829-a6c5-66409c...   Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9...   Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311...   Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3...   Delayed('astype-9295cd86-9806-45b0-993f-4ed362...   ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e...   Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371...   Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf...   Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411...   Delayed('astype-d04e8ed3-7690-47b5-987a-50b667...   Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7...   Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00...   Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295...   Delayed('astype-dedcc518-d216-491e-a124-bf2560...   Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
2   Delayed('astype-039b7a66-2d77-43f2-a258-a412ad...   Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b...   Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25...   Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f...   Delayed('astype-26f740dc-2480-48b0-b754-f83434...   Delayed('astype-d42d8543-f23f-4829-a6c5-66409c...   Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9...   Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311...   Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3...   Delayed('astype-9295cd86-9806-45b0-993f-4ed362...   ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e...   Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371...   Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf...   Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411...   Delayed('astype-d04e8ed3-7690-47b5-987a-50b667...   Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7...   Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00...   Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295...   Delayed('astype-dedcc518-d216-491e-a124-bf2560...   Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
3   Delayed('astype-039b7a66-2d77-43f2-a258-a412ad...   Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b...   Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25...   Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f...   Delayed('astype-26f740dc-2480-48b0-b754-f83434...   Delayed('astype-d42d8543-f23f-4829-a6c5-66409c...   Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9...   Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311...   Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3...   Delayed('astype-9295cd86-9806-45b0-993f-4ed362...   ... Delayed('astype-a1478260-6344-4855-b7c3-bc317e...   Delayed('astype-13aa3bbd-cab7-421b-b7ec-f78371...   Delayed('astype-be95a488-5174-40f0-9bf7-a36bbf...   Delayed('astype-b3f81da4-b137-4095-a1b0-6a8411...   Delayed('astype-d04e8ed3-7690-47b5-987a-50b667...   Delayed('astype-0a70e20d-015e-4905-a14d-0ea9a7...   Delayed('astype-34e55bf4-20e3-4e96-bc87-25de00...   Delayed('astype-26797d19-0532-4f5b-8d2d-d7f295...   Delayed('astype-dedcc518-d216-491e-a124-bf2560...   Delayed('astype-3d28bd90-3aa3-47d8-a066-963c42...
4   Delayed('astype-039b7a66-2d77-43f2-a258-a412ad...   Delayed('astype-0125ce2c-b588-4a6c-ab32-53176b...   Delayed('astype-5979d7a0-d0e8-44b5-bd1a-28fd25...   Delayed('astype-544f031f-ef72-4d62-9db0-57aa7f...   Delayed('astype-26f740dc-2480-48b0-b754-f83434...   Delayed('astype-d42d8543-f23f-4829-a6c5-66409c...   Delayed('astype-878fd007-6e16-4ae8-bf57-1c8eb9...   Delayed('astype-2670a2b2-60f2-4c52-8ae3-306311...   Delayed('astype-3af3977e-860c-4f8e-b657-2e46e3...
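The `Delayed('astype-...')` values in this output are lazy placeholders, not converted data: `delayed(f)(x)` only records the call, and the work happens when `.compute()` is invoked on the resulting object. A minimal sketch of that behaviour, using a small illustrative Series rather than the real dataset:

```python
import numpy as np
import pandas as pd
from dask import delayed
from dask.delayed import Delayed

s = pd.Series([1, 2, 3], dtype="int64")  # illustrative data, not the real column

# Wrapping the bound method only records the call; nothing is converted yet.
lazy = delayed(s.astype)(np.int8)
print(isinstance(lazy, Delayed))

# compute() runs the recorded astype call and returns a real pandas Series.
result = lazy.compute()
print(result.dtype)

# The original Series is untouched.
print(s.dtype)
```

Storing the placeholder itself in a dataframe column therefore leaves object cells that each hold a `Delayed`, which is presumably also why the reported memory usage did not change.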

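This means every column was delayed correctly, but I can't figure out how to call .compute() the right way. I want all columns converted to the correct type, and I also want the final memory size displayed. What should I do?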
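One possible way to make the delayed version work is sketched below. It is not taken from an accepted answer. The idea is to keep the `Delayed` objects out of the dataframe: build one lazy task per column, run them all with a single `dask.compute`, and assemble a new dataframe from the results. The helper `convert_column` is a hypothetical name, the sketch assumes a pandas dataframe with plain NumPy `int64`, `float64` and `object` columns, and it skips `float16` because that type keeps only about three significant digits.

```python
import dask
import numpy as np
import pandas as pd
from dask import delayed


def convert_column(series):
    """Cast one pandas Series to a narrower dtype when its value range allows it."""
    if pd.api.types.is_object_dtype(series):
        return series.astype("category")
    if pd.api.types.is_signed_integer_dtype(series):
        c_min, c_max = series.min(), series.max()
        for candidate in (np.int8, np.int16, np.int32):
            info = np.iinfo(candidate)
            if info.min <= c_min and c_max <= info.max:
                return series.astype(candidate)
        return series
    if pd.api.types.is_float_dtype(series):
        c_min, c_max = series.min(), series.max()
        info = np.finfo(np.float32)
        # Only the range is checked here; float32 still loses some precision.
        if info.min <= c_min and c_max <= info.max:
            return series.astype(np.float32)
    return series


def reduce_mem_usage(df):
    """Convert every column through dask.delayed and return a new dataframe."""
    start_mem = df.memory_usage().sum() / 1024**2
    # One lazy task per column; nothing runs yet.
    tasks = [delayed(convert_column)(df[col]) for col in df.columns]
    # A single compute call runs all tasks and returns real Series, in column order.
    converted = dask.compute(*tasks)
    result = pd.concat(list(converted), axis=1)
    end_mem = result.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    return result


# Illustrative data only; the real call would pass test_identity.compute().
demo = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64"),
    "amount": pd.Series([100000, 2, 3], dtype="int64"),
    "score": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
    "device": pd.Series(["x", "y", "x"], dtype=object),
})
out = reduce_mem_usage(demo)
print(out.dtypes)
```

Whether this is actually faster than a plain pandas loop is not guaranteed, since the dataframe is already in memory at this point. If the data does not fit in memory comfortably, keeping it as a Dask dataframe and applying the chosen dtypes with a lazy `astype` may be the better route.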
