Getting label probabilities for training-set instances from a PySpark ml model

Problem description

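I am training a decision tree as a binary classifier, and I want the probability of each label (0, 1) for every instance, in both the training set and the test set. I plan to use these probabilities to discretize the continuous values in a predictor column, as described here.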

With scikit-learn, predict_proba returns the probabilities for both the training set and the test set:

# Train set
tree_model.predict_proba(X_train.age.to_frame())
# Test set
tree_model.predict_proba(X_test.age.to_frame())
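
To make the discretization idea concrete, here is a minimal, self-contained sketch. The synthetic `age` data, the noise level, and `max_depth=2` are assumptions for illustration only; they are not taken from the question. Every row that lands in the same leaf receives the same probability, so the distinct probability values can serve as discrete bins for the continuous column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration): one continuous predictor, binary label.
rng = np.random.default_rng(0)
age = rng.uniform(18, 80, size=200)
label = (age + rng.normal(0, 10, size=200) > 45).astype(int)
X_train = pd.DataFrame({"age": age})

# A shallow tree keeps the number of leaves (and therefore bins) small.
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(X_train.age.to_frame(), label)

# Column 1 holds P(label = 1); rows in the same leaf share the same value.
proba_train = tree_model.predict_proba(X_train.age.to_frame())[:, 1]

# The distinct probabilities act as the discrete bins for `age`.
X_train["age_bin"] = proba_train
n_bins = X_train["age_bin"].nunique()
print(n_bins)
```

With a depth of 2 the tree has at most four leaves, so `age_bin` takes at most four distinct values.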

But this does not seem to be the case with PySpark:

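from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

# Single-stage pipeline: a decision tree limited to depth 5
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)
pipeline = Pipeline(stages=[dt])
model = pipeline.fit(train_data)
# Score the test set; adds rawPrediction, probability and prediction columns
predictions = model.transform(test_data)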

The probabilities for the test-set instances are written to the predictions DataFrame:

predictions.show(5,truncate=False)
+--------+-----+-------------+---------------------------------------+----------+
|features|label|rawPrediction|probability                            |prediction|
+--------+-----+-------------+---------------------------------------+----------+
|[0.0]   |1.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[0.0]   |1.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[0.0]   |0.0  |[132.0,123.0]|[0.5176470588235295,0.4823529411764706]|0.0       |
|[32.0]  |0.0  |[3.0,1.0]    |[0.75,0.25]                            |0.0       |
+--------+-----+-------------+---------------------------------------+----------+
only showing top 5 rows

How can I get the probabilities for the training-set instances?

Solution


machine-learning pyspark training-data