Dask-ml StandardScaler memory leak and extreme memory usage

Problem description

I'm running into a problem where dask-ml's StandardScaler leaks memory on a very large (chunked) array and the fit never completes. Here is my code and some information about the data.

from dask_ml.preprocessing import StandardScaler
import dask
from dask.distributed import Client, LocalCluster

import rioxarray

# Local client: 20 workers, 2 threads each, 200 GB memory limit per worker
client = Client(memory_limit='200GB', n_workers=20, threads_per_worker=2, processes=False)
da = rioxarray.open_rasterio(r'H:/DEV/GMS/data/raster_stack01.dat')

# Rechunk to one band per chunk and 5000 x 5000 pixel tiles
da_rechunk = da.chunk({"band": 1, 'x': 5000, 'y': 5000})

From the above I get this:

[screenshot of the da_rechunk array repr]
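For reference, a minimal way to print the same information programmatically (assuming rioxarray returns a (band, y, x) DataArray backed by a dask array):

# Inspect the rechunked DataArray: shape, dtype and dask chunk layout
print(da_rechunk.shape)        # (band, y, x)
print(da_rechunk.dtype)
print(da_rechunk.data.chunks)  # chunk sizes along each dimension
print(f"{da_rechunk.data.nbytes / 1e9:.1f} GB in total")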

Next I try to use StandardScaler:

scaler = StandardScaler()
scaler.fit(da_rechunk)

and I get messages like this:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 130.39 GiB -- Worker memory limit: 186.26 GiB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 130.40 GiB -- Worker memory limit: 186.26 GiB
distributed.utils_perf - WARNING - full garbage collections took 98% cpu time recently (threshold: 10%)
distributed.worker - WARNING - gc.collect() took 3.578s. This is usually a sign that some tasks handle too many Python objects at the same time. Rechunking the work into smaller tasks might help.

On the client dashboard I can see it using more than 4 TB of memory, plus over 60 GB spilled to disk. It hangs after reading all of the chunked xarrays into the workers, and rechunking to (1, 1000, 1000) doesn't help either.
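For scale: assuming float64 pixels (the actual dtype depends on the raster), a single (1, 5000, 5000) chunk is only about 200 MB and a (1, 1000, 1000) chunk about 8 MB, so the individual chunks themselves are not huge:

# Rough per-chunk memory, assuming 8-byte (float64) pixels
chunk_5000 = 1 * 5000 * 5000 * 8   # 200_000_000 bytes, ~200 MB
chunk_1000 = 1 * 1000 * 1000 * 8   # 8_000_000 bytes, ~8 MB
print(chunk_5000 / 1e6, chunk_1000 / 1e6)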

Is StandardScaler in dask-ml meant to handle this kind of use case? Is this a Dask bug, or am I doing something wrong?
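For comparison, dask-ml's scalers follow the scikit-learn convention of a 2D (n_samples, n_features) input, so a rough sketch of the (pixels, bands) layout they normally expect looks like this (the reshape and chunk sizes here are only illustrative assumptions, not a tested fix):

# Flatten the (band, y, x) stack into a 2D (pixels, bands) dask array,
# the layout scikit-learn-style scalers normally expect.
# Note: the reshape may itself trigger an expensive rechunk.
arr = da_rechunk.data                  # underlying dask array, shape (band, y, x)
X = arr.reshape(arr.shape[0], -1).T    # shape (y * x, band)
X = X.rechunk((1_000_000, -1))         # row blocks, all bands in one chunk (illustrative)

scaler = StandardScaler()
scaler.fit(X)                          # per-band mean and variance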

Workaround

No effective workaround for this issue has been found yet.
