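Dask-ML with XGBoost and eval_set requires pre-computed data

Problem description

I am trying to run dask_ml.xgboost with an eval_set so that I can use early stopping and avoid overfitting.

The example below uses a sample dataset that is representative of the data size I am working with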

from dask.distributed import Client
from dask_ml.datasets import make_classification_df
from dask_ml.xgboost import XGBClassifier


if __name__ == "__main__":
    n_train_rows = 4_000
    n_val_rows = 1_000

    client = Client()
    print(client)

    # Generate balanced data for binary classification
    X_train, y_train = make_classification_df(
        n_samples=n_train_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )
    # Assumption: same generator settings as the training split,
    # so that both splits have the same 50 feature columns
    X_val, y_val = make_classification_df(
        n_samples=n_val_rows,
        chunks=100,
        predictability=0.35,
        n_features=50,
        random_state=2,
    )

    clf = XGBClassifier(objective="binary:logistic")

    # Train; every eval_set entry is materialized as pandas via .compute()
    clf.fit(
        X_train,
        y_train,
        eval_metric="error",
        eval_set=[
            (X_train.compute(), y_train.compute()),
            (X_val.compute(), y_val.compute()),
        ],
        early_stopping_rounds=5,
    )

    # Make predictions
    y_pred = clf.predict(X_val).compute()
    assert len(y_pred) == len(y_val)

    client.close()

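X_train, y_train, X_val and y_val are all Dask DataFrames (few rows, but many features, to mimic my use case).

I cannot pass eval_set as a nested list of Dask DataFrames. Instead, the entries must be pandas DataFrames, e.g. eval_set=[(X_val.compute(), y_val.compute())], which is why I need to call .compute() on each of them.

However, when I run the code above (using pandas DataFrames), I get this warning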

<Client: 'tcp://127.0.0.1:12345' processes=4 threads=12,memory=16.49 GB>
/home/username/.../distributed/worker.py:3373: UserWarning: Large object of size 2.16 MB detected in task graph:
  {'dmatrix_kwargs': {},'num_boost_round': 100,'ev ... ing_rounds': 5}
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func,big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func,big_future)  # good
  warnings.warn(
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL connected to the tracker
task NULL got new rank 0
task NULL got new rank 1
task NULL got new rank 2
task NULL got new rank 3
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.
[08:52:41] WARNING: ../src/gbm/gbtree.cc:129: Tree method is automatically selected to be 'approx' for distributed training.

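This code runs to completion and produces predictions. However, the clf.fit(...) line produces this UserWarning.

Additional notes

In my use case, the number of rows in the training and validation splits used in this example reflects the size after sampling from the overall data. Unfortunately, the overall data splits needed to train (plus hyperparameter tuning) the XGBoost model are several orders of magnitude larger in row count. That estimate is based on training and validation learning curves produced with the non-dask_ml version of XGBoost (standard XGBoost, 1), i.e. using from xgboost import XGBClassifier. So I cannot compute the full splits and bring them into memory as pandas DataFrames for distributed XGBoost training, which is also what the dask_ml recommendations advise against.
The number of features used in this example is 50. I arrived at this number (in the real use case) after dropping as many features as possible.
The code is run on a local machine.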

Question

Is there a correct/recommended way to run dask_ml XGBoost with an eval_set that consists of Dask DataFrames?

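Edit

Note that the training split is also passed in eval_set (in addition to the validation split), with the intention of using the output of model training to generate learning curves (see 2).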
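To make the learning-curve intent concrete, here is a minimal sketch of the post-processing I have in mind. It assumes the per-round metrics come back in the nested dict shape that xgboost's scikit-learn wrapper returns from evals_result(), with keys "validation_0" and "validation_1" in eval_set order. Whether dask_ml.xgboost exposes the same shape is an assumption, the numbers are made up, and learning_curve_rows and best_round are hypothetical helpers.

```python
def learning_curve_rows(evals_result, metric, labels):
    """Return one (round, {label: value}) row per boosting round."""
    series = {label: evals_result[key][metric] for key, label in labels.items()}
    n_rounds = min(len(values) for values in series.values())
    return [
        (i, {label: values[i] for label, values in series.items()})
        for i in range(n_rounds)
    ]


def best_round(values):
    """Index of the first minimum; lower is better for an error metric."""
    return min(range(len(values)), key=values.__getitem__)


# Made-up numbers, for illustration only.
example = {
    "validation_0": {"error": [0.30, 0.22, 0.15, 0.10, 0.07]},  # training split
    "validation_1": {"error": [0.32, 0.27, 0.25, 0.26, 0.28]},  # validation split
}

rows = learning_curve_rows(
    example, "error", {"validation_0": "train", "validation_1": "val"}
)
for i, row in rows:
    print(i, row["train"], row["val"])
print("best validation round:", best_round(example["validation_1"]["error"]))
```

The training error keeps falling while the validation error turns upward after its minimum, which is the gap the learning curve is meant to expose.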

dask-distributed dask-ml python