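Problem description
I'm seeing strange behavior when using the dask_ml LabelEncoder. Here is code that simulates my real data (I get killed workers and generally random behavior):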
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split
from dask_ml.preprocessing import LabelEncoder
from dask.distributed import Client
client = Client()
# made up data
np.random.seed(1)
n_rows = 16000000
cities = np.random.randint(0,3800,n_rows)
devicebrand = np.random.randint(0,60,n_rows)
lineitem = np.random.randint(0,20,n_rows)
site = np.random.randint(0,180,n_rows)
placement = np.random.randint(0,350,n_rows)
domain = np.random.randint(0,1600,n_rows)
region = np.random.randint(0,85,n_rows)
dayofweek = np.random.randint(0,8,n_rows)
hour = np.random.randint(0,25,n_rows)
target = np.random.randint(0,2,n_rows)
data = np.vstack((cities,devicebrand,lineitem,site,placement,domain,region,dayofweek,hour,target)).T
df = pd.DataFrame(data,columns=['city','devicebrand','lineitem','site','placement','domain','region','dayofweek','hour','target'])
df = dd.from_pandas(df,chunksize=100000)
#dtypes before
print(df.dtypes)
cast_to_object = ['devicebrand', 'domain', 'region']
for c in cast_to_object:
    df[c] = df[c].astype('str')
# dtypes after
print(df.dtypes)
#check shape[0]
print(df.shape[0].compute())
#check number of unique for object columns
for c in [c for c in df.columns if df[c].dtype == 'object']:
    print(f'{c} n unique: {df[c].nunique().compute()}')
df.head()
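At this point we should have a 16,000,000 x 10 dataframe with 3 object columns: devicebrand, domain, and region.
Now this is the code that behaves strangely. Usually it encodes the first 2 columns (devicebrand and domain) and then gets stuck on region, which has only 85 values to encode. It kills workers, freezes, and may crash.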
classes = {}
le = LabelEncoder()
to_encode = [c for c in df.columns if df[c].dtype=='object']
for c in to_encode:
    print(f'encoding: {c}')
    df[c] = le.fit_transform(df[c])
    classes[c] = le.classes_
    print(f'done: {c}')
On the other hand, if I run each encoding manually, it works fine.
le = LabelEncoder()
df['devicebrand'] = le.fit_transform(df['devicebrand'])
devicebrand = le.classes_
print(devicebrand.compute())
df.head(3)
le = LabelEncoder()
df['region'] = le.fit_transform(df['region'])
region = le.classes_
print(region.compute())
df.head(3)
le = LabelEncoder()
df['domain'] = le.fit_transform(df['domain'])
domain = le.classes_
print(domain.compute())
df.head(3)
After each step I also check the head and compute the classes to make sure everything is fine.
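As a rough illustration of what the classes and encoded values checked above should look like, here is a plain-Python sketch of label encoding. It is a toy, not dask_ml's implementation: it assumes codes are assigned by position in the sorted unique values, as scikit-learn's LabelEncoder does. The function `fit_transform` and the miniature `columns` data are hypothetical and exist only for this example. Because the columns were cast to str, the classes sort lexicographically, so '10' comes before '2'.

```python
# Toy illustration only: NOT dask_ml's implementation.
def fit_transform(values):
    """Return (classes, codes): sorted unique values and each value's index in them."""
    classes = sorted(set(values))
    index = {value: code for code, value in enumerate(classes)}
    codes = [index[value] for value in values]
    return classes, codes

# Hypothetical miniature columns, stored as strings like the cast columns above.
columns = {
    "devicebrand": ["2", "10", "2", "7"],
    "domain": ["1500", "3", "3", "42"],
}

classes = {}
encoded = {}
for name, values in columns.items():
    # Each column gets its own mapping, so nothing is shared between iterations.
    classes[name], encoded[name] = fit_transform(values)

print(classes["devicebrand"])  # ['10', '2', '7']
print(encoded["devicebrand"])  # [1, 0, 1, 2]
```

Comparing a few rows of the real output against a mapping like this is one way to confirm that each column's stored classes belong to that column.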
Any suggestions? Thanks in advance.
Solution
No effective solution to this problem has been found yet.