Strange behavior of dask LabelEncoder

Problem description

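I am seeing strange behavior with the dask_ml LabelEncoder. Here is code that simulates my real data (in both cases I see killed workers and generally erratic behavior):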

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
from dask_ml.preprocessing import LabelEncoder

from dask.distributed import Client
client = Client()

# made up data
np.random.seed(1)
n_rows = 16000000
cities = np.random.randint(0,3800,n_rows)
devicebrand = np.random.randint(0,60,n_rows)
lineitem = np.random.randint(0,20,n_rows)
site = np.random.randint(0,180,n_rows)
placement = np.random.randint(0,350,n_rows)
domain = np.random.randint(0,1600,n_rows)
region = np.random.randint(0,85,n_rows)
dayofweek = np.random.randint(0,8,n_rows)
hour = np.random.randint(0,25,n_rows)
target = np.random.randint(0,2,n_rows)

data = np.vstack((cities,devicebrand,lineitem,site,placement,domain,region,dayofweek,hour,target)).T

df = pd.DataFrame(data,columns=['city','devicebrand','lineitem','site','placement','domain','region','dayofweek','hour','target'])

df = dd.from_pandas(df,chunksize=100000)

#dtypes before
print(df.dtypes)

# region is included so that the three columns described below are all object columns
cast_to_object = ['devicebrand','domain','region']
for c in cast_to_object:
    df[c] = df[c].astype('str')

#dtypes after
print(df.dtypes)
#check shape[0]
print(df.shape[0].compute())

#check number of unique for object columns
for c in [c for c in df.columns if df[c].dtype=='object']:
    print(f'{c} n unique: {df[c].nunique().compute()}')

df.head()

At this point we should have a 16000000 x 10 dataframe with 3 object columns: devicebrand, domain, region.

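This is where the code behaves strangely. Usually it encodes the first 2 columns (devicebrand and domain) and then gets stuck on region, which has only 85 values to encode. It kills workers, freezes, and may crash.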

classes = {}
le = LabelEncoder()
to_encode = [c for c in df.columns if df[c].dtype=='object']

for c in to_encode:
    print(f'encoding: {c}')
    df[c] = le.fit_transform(df[c])
    classes[c] = le.classes_
    print(f'done: {c}')
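
For reference, this is the result I expect for each column. It is a minimal plain-Python sketch of label encoding (sorted unique values mapped to integer codes), not dask_ml's implementation. The helper name fit_transform_labels and the sample values are made up for illustration.

```python
def fit_transform_labels(values):
    """Hypothetical helper: return (classes, codes) for a list of labels.

    classes are the sorted unique values; each code is the index of the
    value in classes. This mirrors the idea behind label encoding only.
    """
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    codes = [index[v] for v in values]
    return classes, codes


# Made-up sample: integers cast to str, as in the columns above.
sample = {
    "devicebrand": ["10", "2", "10", "7"],
    "region": ["3", "3", "84", "0"],
}

# One independent mapping per column, kept in a dict keyed by column name.
classes = {}
encoded = {}
for column, values in sample.items():
    classes[column], encoded[column] = fit_transform_labels(values)
    print(column, classes[column], encoded[column])
```

Because the values are strings, the classes sort lexicographically, so '10' comes before '2'.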

On the other hand, if I run each encoding manually, it works fine.

le = LabelEncoder()
df['devicebrand'] = le.fit_transform(df['devicebrand'])
devicebrand = le.classes_
print(devicebrand.compute())
df.head(3)

le = LabelEncoder()
df['region'] = le.fit_transform(df['region'])
region = le.classes_
print(region.compute())
df.head(3)

le = LabelEncoder()
df['domain'] = le.fit_transform(df['domain'])
domain = le.classes_
print(domain.compute())
df.head(3)

After each one I also check the head and compute the classes to make sure everything is OK.

Any suggestions? Thanks in advance.
