Merging decision tree rules into a DataFrame by ID and producing an aggregated summary

Problem description

I am fitting a decision tree with PySpark's DecisionTreeRegressor. I print the tree's rules with the code below:

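from pyspark.ml.regression import DecisionTreeRegressor

x = ['Variable_Final']
df = data_vector(main_df, x)  # data_vector: the author's helper (not shown), presumably builds the 'features' column
dt = DecisionTreeRegressor(labelCol=y, featuresCol='features', seed=1234, maxDepth=3)
modeldt = dt.fit(df)
print(modeldt.toDebugString)  # in pyspark.ml, toDebugString is a property, not a method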

A sample of the rules printed for the tree built by the code above:

"DecisionTreeRegressionModel (uid=rfc_6c4ceb92ba78) of depth 3 with 15 nodes 
  If (feature 0 <= 0.33333333)
   If (feature 0 <= 0.25)
    If (feature 0 <= 0.22)
     Predict: 0.22
    Else (feature 0 > 0.22)
     Predict: 0.3
   Else (feature 0 > 0.25)
    If (feature 0 <= 0.3)
     Predict: 0.345
    ...

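My questions are:

How can I run an aggregation by tree or node number (that is, group by node_number) and get the following output:

Is it possible to merge the tree node information into my main DataFrame, so that I have all the variables from the main DataFrame plus the node/tree information (node0/tree0, node1/tree1, ...) and can run further summaries on it?

If I were using SAS, I would do the following to get the result above: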

proc hpsplit data=main_data seed=1234 maxdepth=3;
  model y = x;          /* x here is Variable_Final */
  output out=output_tree;
  id VariableID Qtr;    /* additional variables I want to keep in the output_tree dataset */
run;

proc sql;
  create table alpha as
    select _leaf_,                  /* the leaf/node number */
           count(variableid) as tot,
           mean(y) as mean_y
    from output_tree
    group by 1;
quit;

Solution

No working solution to this problem has been found yet.
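
Until a confirmed answer turns up, one possible direction can be sketched in plain Python: parse the If/Else/Predict lines of the debug string into one rule list per leaf, tag every row with the leaf it falls into, and aggregate per leaf. Everything below is an assumption for illustration: the sample debug string, the sample rows, the helper names parse_leaves and leaf_for, and the leaf numbering (leaves are counted in print order, which is not Spark's own node id). The sketch handles only continuous splits of the form "feature i <= t" and "feature i > t".

```python
import re
from collections import defaultdict

# Hypothetical debug string, laid out as a header line followed by nested rules.
DEBUG_STRING = """DecisionTreeRegressionModel: uid=dtr_example, depth=2, numNodes=5, numFeatures=1
  If (feature 0 <= 0.25)
   If (feature 0 <= 0.22)
    Predict: 0.22
   Else (feature 0 > 0.22)
    Predict: 0.3
  Else (feature 0 > 0.25)
   Predict: 0.345
"""

COND = re.compile(r"^(\s*)(?:If|Else) \(feature (\d+) (<=|>) ([-+0-9.eE]+)\)\s*$", re.IGNORECASE)
LEAF = re.compile(r"^\s*Predict: ([-+0-9.eE]+)\s*$", re.IGNORECASE)


def parse_leaves(debug_string):
    """Return [(leaf_id, conditions, prediction)], leaves numbered in print order."""
    leaves = []
    path = []  # stack of (indent, feature_index, operator, threshold)
    for line in debug_string.splitlines():
        cond = COND.match(line)
        if cond:
            indent = len(cond.group(1))
            # A rule at this indent closes every branch at the same or deeper indent.
            while path and path[-1][0] >= indent:
                path.pop()
            path.append((indent, int(cond.group(2)), cond.group(3), float(cond.group(4))))
            continue
        leaf = LEAF.match(line)
        if leaf:
            conditions = [(f, op, t) for _, f, op, t in path]
            leaves.append((len(leaves), conditions, float(leaf.group(1))))
    return leaves


def leaf_for(features, leaves):
    """Return the id of the first leaf whose conditions all hold for this feature list."""
    for leaf_id, conditions, _ in leaves:
        if all(features[f] <= t if op == "<=" else features[f] > t for f, op, t in conditions):
            return leaf_id
    return None


leaves = parse_leaves(DEBUG_STRING)

# Hypothetical rows: (VariableID, Variable_Final, y)
rows = [("a", 0.10, 1.0), ("b", 0.20, 3.0), ("c", 0.24, 5.0), ("d", 0.40, 7.0)]

# Count and mean of y per leaf, mirroring the SAS "group by _leaf_" query.
totals = defaultdict(lambda: [0, 0.0])
for variable_id, x_value, y_value in rows:
    leaf_id = leaf_for([x_value], leaves)
    totals[leaf_id][0] += 1
    totals[leaf_id][1] += y_value
result = {leaf_id: (n, total / n) for leaf_id, (n, total) in totals.items()}
print(result)
```

To bring this back to the main DataFrame, leaf_for could be wrapped in a UDF over the feature columns and the result grouped with groupBy, mirroring the SAS query. Spark 3.0 and later also appear to provide a leafCol parameter on the tree estimators that writes the predicted leaf index into the transform output. If your version has it, that is likely simpler than parsing the debug string, so check the documentation for your version first.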

apache-spark-ml apache-spark-mllib decision-tree pyspark