dask_ml SimpleImputer fails with AttributeError: 'DataFrame' object has no attribute '_data'

Problem

I am reading a csv into a dask DataFrame and then calling SimpleImputer from the dask_ml library. I am running into two different problems.

Problem 1) SimpleImputer on dask fails with FileNotFoundError, even though I am able to read the columns. Code

import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = dd.read_csv('outlier.csv')
X = df.drop('Column_A', axis=1)
print(X.columns)  # Print statement works. It gives me all the rest of the columns
p = SimpleImputer().fit_transform(X)

Output

Error
Traceback (most recent call last):
  File "C:\Users\user\Documents\code\blah.py", line 127, in train_blahblah_model
    p = SimpleImputer().fit_transform(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\sklearn\base.py", line 699, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 53, in fit
    self._fit_frame(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 80, in _fit_frame
    self.statistics_ = pd.Series(dask.compute(avg)[0], index=X.columns)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask\base.py", line 561, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 2681, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1990, in gather
    return self.sync(
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 836, in sync
    return sync(
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 324, in f
    result[0] = yield future
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\tornado\gen.py", line 762, in run
    value = future.result()
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1855, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/bytes/core.py", line 185, in read_block_from_file
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 930, in open
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 117, in _open
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 199, in __init__
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 204, in _open
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/user/Documents/code/outlier.csv'

Problem 2) Reading the csv with pandas and then converting it to a dask DataFrame. Code

import pandas as pd
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = pd.read_csv('outlier.csv', index_col='new')
df = dd.from_pandas(df, npartitions=3)
X = df.drop('Column_A', axis=1)
print(X.columns)  # Print statement works. It gives me all the rest of the columns
p = SimpleImputer().fit_transform(X)

Output: the SimpleImputer().fit_transform(X) line raises an error

AttributeError: 'DataFrame' object has no attribute '_data'

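Note: all of this works with pandas when I call fit_transform on IterativeImputer. The problem only appears when I try to build the model with dask, which I need because I ultimately want to use dask workers to build my model.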

Solution

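This issue is resolved. The problem was that the client and the workers were running different pandas versions. The workers were on 1.0.1. I upgraded pandas to 1.2.3 on both machines and the error went away.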
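
One way to confirm this kind of mismatch before upgrading is to ask every worker which pandas version it has and compare that with the client. The following is only a minimal sketch: the helper names `pandas_version`, `find_mismatches`, and `check_cluster` are made up for illustration, and `client` is assumed to be an already connected `distributed.Client`.

```python
def pandas_version():
    """Return the pandas version of whichever process runs this function."""
    import pandas as pd
    return pd.__version__


def find_mismatches(client_version, worker_versions):
    """Return {worker_address: version} for workers that differ from the client."""
    return {
        address: version
        for address, version in worker_versions.items()
        if version != client_version
    }


def check_cluster(client):
    """Compare the local pandas version with the one on every worker.

    `client` is assumed to be a connected distributed.Client.
    """
    # Client.run executes the function on every worker and returns
    # a dict keyed by worker address.
    return find_mismatches(pandas_version(), client.run(pandas_version))
```

Calling `check_cluster(client)` returns an empty dict when every worker matches the client, and otherwise lists the workers that differ. `Client.get_versions(check=True)` from distributed can also be used to compare package versions across the client, scheduler, and workers.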

See also the question "joblib connection to Dask backend: tornado.iostream.StreamClosedError: Stream is closed" for other possible issues.