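Problem description
I am reading a csv into a dask DataFrame and then calling SimpleImputer from the dask_ml library. I am running into two different problems.
Problem 1) SimpleImputer on dask fails with FileNotFoundError, even though I am able to read the columns. Code: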
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = dd.read_csv('outlier.csv')
X = df.drop('Column_A', axis=1)
print(X.columns)  # Print statement works. It gives me all the rest of the columns
p = SimpleImputer().fit_transform(X)
Output:
Error
Traceback (most recent call last):
File "C:\Users\user\Documents\code\blah.py",line 127,in train_blahblah_model
p = SimpleImputer().fit_transform(X)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\sklearn\base.py",line 699,in fit_transform
return self.fit(X,**fit_params).transform(X)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py",line 53,in fit
self._fit_frame(X)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py",line 80,in _fit_frame
self.statistics_ = pd.Series(dask.compute(avg)[0],index=X.columns)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask\base.py",line 561,in compute
results = schedule(dsk,keys,**kwargs)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py",line 2681,in get
results = self.gather(packed,asynchronous=asynchronous,direct=direct)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py",line 1990,in gather
return self.sync(
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py",line 836,in sync
return sync(
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py",line 340,in sync
raise exc.with_traceback(tb)
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py",line 324,in f
result[0] = yield future
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\tornado\gen.py",line 762,in run
value = future.result()
File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py",line 1855,in _gather
raise exception.with_traceback(traceback)
File "/opt/conda/lib/python3.8/site-packages/dask/bytes/core.py",line 185,in read_block_from_file
File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py",line 102,in __enter__
File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py",line 930,in open
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py",line 117,in _open
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py",line 199,in __init__
File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py",line 204,in _open
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/user/Documents/code/outlier.csv'
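One thing the traceback shows: the last frames run under /opt/conda/lib/python3.8, that is, on a worker, while the path being opened is the client's Windows path. A minimal sketch for checking whether each worker can see the file, assuming a `distributed.Client` is already connected to the cluster. The helper names `path_visibility` and `unreachable_workers` are hypothetical and not part of dask.

```python
import os


def path_visibility(client, path):
    """Ask every worker whether `path` exists on its own filesystem.

    Assumes `client` is a connected distributed.Client. Client.run executes
    the function on each worker and returns {worker_address: result}.
    """
    return client.run(os.path.exists, path)


def unreachable_workers(visibility):
    """Given {worker_address: bool}, list the workers that cannot see the file."""
    return sorted(address for address, ok in visibility.items() if not ok)
```

With a connected client this could be used as `unreachable_workers(path_visibility(client, 'C:/Users/user/Documents/code/outlier.csv'))`. A non-empty result would mean the lazy `read_csv` tasks are being sent to machines that do not have the file.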
Problem 2) Reading the csv with pandas and then converting it to a dask DataFrame. Code:
import pandas as pd
import dask.dataframe as dd
from dask_ml.impute import SimpleImputer

df = pd.read_csv('outlier.csv', index_col='new')
df = dd.from_pandas(df, npartitions=3)
X = df.drop('Column_A', axis=1)
print(X.columns)  # Print statement works. It gives me all the rest of the columns
p = SimpleImputer().fit_transform(X)
Output: error on the SimpleImputer().fit_transform(X) line
AttributeError: 'DataFrame' object has no attribute '_data'
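Note: all of this works with pandas when I use IterativeImputer's fit_transform. The problems appear only when I try to build the model with dask, which I need because I ultimately want to use dask workers to build my model.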
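Solution
This problem is resolved. The cause was a mismatch in pandas versions between the client and the workers. The workers were on 1.0.1. I upgraded pandas to 1.2.3 on both machines and the error went away.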
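A mismatch like this can be spotted before running any model code. The following is a minimal sketch, assuming the version report has the shape returned by `distributed.Client.get_versions()` (top-level keys "client", "scheduler" and "workers", each entry holding a "packages" dict). The helper names `package_versions` and `has_mismatch` are hypothetical.

```python
def package_versions(versions, package="pandas"):
    """Collect the version of `package` seen by the client, scheduler and workers.

    `versions` is assumed to look like the dict returned by
    distributed.Client.get_versions().
    """
    found = {
        "client": versions["client"]["packages"].get(package),
        "scheduler": versions["scheduler"]["packages"].get(package),
    }
    # One entry per worker, keyed by the worker address.
    for address, info in versions["workers"].items():
        found[address] = info["packages"].get(package)
    return found


def has_mismatch(found):
    """True if the collected versions are not all identical."""
    return len(set(found.values())) > 1
```

With a connected client, `package_versions(client.get_versions())` would list the pandas version on each machine, and `has_mismatch` on that result would flag a situation like the one described here (1.0.1 on the workers and a different version on the client).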
See also the question "joblib connection to Dask backend: tornado.iostream.StreamClosedError: Stream is closed" for other possible issues.