How can I merge two PMML files into one file with two outputs?

Problem description

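I developed two models (classification and regression) and exported them to the PMML interchange format with https://github.com/jpmml/jpmml-xgboost. Both models work fine when I invoke them from Python. However, I would like to merge the two into a single file that returns two values: the class probability from the classification model and the predicted value from the regression model.

I have been trying for several hours, but I have not been able to fully understand the PMML specification.

Does anyone have experience with this and can give me a hint on how to combine the files and pass the values through the file? Both models need exactly the same inputs.

Thanks!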

See the two small examples below.

Regression model:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="JPMML-XGBoost" version="1.5-SNAPSHOT"/>
        <Timestamp>2021-07-27T11:55:26Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="mpg" optype="continuous" dataType="float">
            <Value value="NaN" property="missing"/>
        </DataField>
        <DataField name="IDVAR_REINIG" optype="continuous" dataType="float">
            <Value value="NaN" property="missing"/>
        </DataField>
    </DataDictionary>
    <MiningModel functionName="regression" algorithmName="XGBoost (GBTree)" x-mathContext="float">
        <MiningSchema>
            <MiningField name="mpg" usageType="target"/>
            <MiningField name="IDVAR_REINIG"/>
        </MiningSchema>
        <Targets>
            <Target field="mpg" rescaleConstant="0.5"/>
        </Targets>
        <Segmentation multipleModelMethod="sum">
            <Segment id="1">
                <True/>
                <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
                    <MiningSchema>
                        <MiningField name="IDVAR_REINIG"/>
                    </MiningSchema>
                    <Output>
                        <OutputField name="mpg" optype="continuous" dataType="float" isFinalResult="false" rescaleConstant="0.5"/>
                    </Output>
                    <Node score="1.7433707">
                        <True/>
                        <Node score="6.1398296">
                            <SimplePredicate field="IDVAR_REINIG" operator="greaterOrEqual" value="6033.51"/>
                        </Node>
                    </Node>
                </TreeModel>
            </Segment>
        </Segmentation>
    </MiningModel>
</PMML>

Classification model:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
    <Header>
        <Application name="JPMML-XGBoost" version="1.5-SNAPSHOT"/>
        <Timestamp>2021-07-27T11:54:45Z</Timestamp>
    </Header>
    <DataDictionary>
        <DataField name="mpg" optype="categorical" dataType="integer">
            <Value value="0"/>
            <Value value="1"/>
        </DataField>
        <DataField name="IDVAR_REINIG" optype="continuous" dataType="float">
            <Value value="NaN" property="missing"/>
        </DataField>
    </DataDictionary>
    <MiningModel functionName="classification" algorithmName="XGBoost (GBTree)" x-mathContext="float">
        <MiningSchema>
            <MiningField name="mpg" usageType="target"/>
            <MiningField name="IDVAR_REINIG"/>
        </MiningSchema>
        <Segmentation multipleModelMethod="modelChain" missingPredictionTreatment="returnMissing">
            <Segment id="1">
                <True/>
                <MiningModel functionName="regression" x-mathContext="float">
                    <MiningSchema>
                        <MiningField name="IDVAR_REINIG"/>
                    </MiningSchema>
                    <Output>
                        <OutputField name="xgbValue" optype="continuous" dataType="float" isFinalResult="false"/>
                    </Output>
                    <Segmentation multipleModelMethod="sum">
                        <Segment id="1">
                            <True/>
                            <TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
                                <MiningSchema>
                                    <MiningField name="IDVAR_REINIG"/>
                                </MiningSchema>
                                <Node score="0.0070259375">
                                    <True/>
                                    <Node score="-0.030500757">
                                        <SimplePredicate field="IDVAR_REINIG" operator="greaterOrEqual" value="2240.835"/>
                                    </Node>
                                </Node>
                            </TreeModel>
                        </Segment>
                    </Segmentation>
                </MiningModel>
            </Segment>
            <Segment id="2">
                <True/>
                <RegressionModel functionName="classification" normalizationMethod="logit" x-mathContext="float">
                    <MiningSchema>
                        <MiningField name="mpg" usageType="target"/>
                        <MiningField name="xgbValue"/>
                    </MiningSchema>
                    <Output>
                        <OutputField name="probability(0)" optype="continuous" dataType="float" feature="probability" value="0"/>
                        <OutputField name="probability(1)" optype="continuous" dataType="float" feature="probability" value="1"/>
                    </Output>
                    <RegressionTable intercept="0.0" targetCategory="1">
                        <NumericPredictor name="xgbValue" coefficient="1.0"/>
                    </RegressionTable>
                    <RegressionTable intercept="0.0" targetCategory="0"/>
                </RegressionModel>
            </Segment>
        </Segmentation>
    </MiningModel>
</PMML>

Solution

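Create a parent MiningModel element that contains the two existing models as child elements. Insert the classification model first and the regression model second, and execute them as a model chain.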

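By default, this model chain exposes only the result fields of the last child model. However, you can export one or more result fields of the first model into "local variables" and then reflect their values wherever they are needed.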

Sample PMML markup skeleton:

<MiningModel>
  <Segmentation multipleModelMethod="modelChain">
    <Segment id="classification">
      <True/>
      <MiningModel>
        <Output>
          <!-- Export the probability value to evaluation context -->
          <OutputField name="probability(event)" feature="probability" value="event"/>
        </Output>
      </MiningModel>
    </Segment>
    <Segment id="regression">
      <True/>
      <MiningModel>
        <MiningSchema>
          <!-- Import the probability value from the evaluation context -->
          <MiningField name="probability(event)"/>
        </MiningSchema>
        <Output>
          <!-- Re-export the probability value under a different name -->
          <OutputField name="copy(probability(event))" feature="transformedValue">
            <FieldRef field="probability(event)"/>
          </OutputField>
        </Output>
      </MiningModel>
    </Segment>
  </Segmentation>
</MiningModel>
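
The wrapping step can also be scripted. The following is a minimal Python sketch, not a definitive implementation. It assumes that each input document holds exactly one top-level MiningModel (as in the two exports above), that a field name appearing in both files means the same field, and that the parent model's functionName can follow the last model in the chain. The helper names `q`, `unique_by_name`, `chain_models` and `demo_pmml`, the segment ids, and the target name `mpg_class` are made up for illustration. Note that both exports above name their target `mpg` with different types, so one of them would probably need to be renamed before the data dictionaries can be shared.

```python
import copy
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_4"
ET.register_namespace("", NS)  # serialize without an "ns0:" prefix


def q(tag):
    """Qualify a tag name with the PMML 4.4 namespace."""
    return f"{{{NS}}}{tag}"


def unique_by_name(elements):
    """Keep a copy of the first element seen for each distinct name attribute."""
    seen = {}
    for element in elements:
        seen.setdefault(element.get("name"), copy.deepcopy(element))
    return list(seen.values())


def chain_models(classification_pmml, regression_pmml):
    """Wrap two single-model PMML documents in one modelChain MiningModel."""
    roots = [ET.fromstring(classification_pmml), ET.fromstring(regression_pmml)]
    models = [root.find(q("MiningModel")) for root in roots]

    merged = ET.Element(q("PMML"), {"version": "4.4"})
    ET.SubElement(merged, q("Header"))

    # One shared DataDictionary; assumes equal names describe the same field.
    dictionary = ET.SubElement(merged, q("DataDictionary"))
    dictionary.extend(unique_by_name(
        field for root in roots for field in root.find(q("DataDictionary"))))

    # Assumption: the parent takes the functionName of the last chained model.
    parent = ET.SubElement(
        merged, q("MiningModel"), {"functionName": models[-1].get("functionName")})
    schema = ET.SubElement(parent, q("MiningSchema"))
    schema.extend(unique_by_name(
        field for model in models for field in model.find(q("MiningSchema"))))

    segmentation = ET.SubElement(
        parent, q("Segmentation"), {"multipleModelMethod": "modelChain"})
    for segment_id, model in zip(("classification", "regression"), models):
        segment = ET.SubElement(segmentation, q("Segment"), {"id": segment_id})
        ET.SubElement(segment, q("True"))
        segment.append(copy.deepcopy(model))
    return merged


def demo_pmml(target, function_name, optype, data_type):
    """Build a tiny stand-in PMML document (hypothetical, not a real export)."""
    return f'''<PMML xmlns="{NS}" version="4.4">
  <DataDictionary>
    <DataField name="{target}" optype="{optype}" dataType="{data_type}"/>
    <DataField name="IDVAR_REINIG" optype="continuous" dataType="float"/>
  </DataDictionary>
  <MiningModel functionName="{function_name}">
    <MiningSchema>
      <MiningField name="{target}" usageType="target"/>
      <MiningField name="IDVAR_REINIG"/>
    </MiningSchema>
  </MiningModel>
</PMML>'''


merged = chain_models(
    demo_pmml("mpg_class", "classification", "categorical", "integer"),
    demo_pmml("mpg", "regression", "continuous", "float"),
)
print(ET.tostring(merged, encoding="unicode"))
```

The sketch only rearranges elements and does not validate the result against the PMML schema, so the merged file still has to be checked with the PMML engine you use.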

Alternatively, you can use the "segment reference" mechanism to access child model outputs from the parent Output element.

See the description of the OutputField@segmentId attribute in the PMML specification.

Sample PMML markup skeleton:

<MiningModel>
  <Segmentation multipleModelMethod="modelChain">
    <Segment id="classification"/>
    <Segment id="regression"/>
  </Segmentation>
  <Output>
    <!-- Reflect the probability of the "event" class of the classification model -->
    <OutputField name="probability(event)" segmentId="classification" feature="probability" value="event"/>
    <!-- Reflect the predicted value of the regression model -->
    <OutputField name="y" segmentId="regression" feature="predictedValue"/>
  </Output>
</MiningModel>
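
Because a segment reference is resolved by id, a misspelled segment id is an easy mistake to make. The following minimal Python sketch, under stated assumptions, adds the parent Output element to an existing model chain and rejects references to unknown segments. The function name `add_segment_outputs` and the helper `q` are hypothetical. The sketch places Output before Segmentation because the exported models in the question use that order, and the inline `chain` document is a stand-in rather than a real export.

```python
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_4"
ET.register_namespace("", NS)  # serialize without an "ns0:" prefix


def q(tag):
    """Qualify a tag name with the PMML 4.4 namespace."""
    return f"{{{NS}}}{tag}"


def add_segment_outputs(parent_model, output_fields):
    """Insert an Output whose fields reference segments of parent_model.

    output_fields is a list of attribute dicts, each with a "segmentId" key.
    Raises ValueError if a segmentId matches no direct child Segment.
    """
    segmentation = parent_model.find(q("Segmentation"))
    known_ids = {segment.get("id") for segment in segmentation.findall(q("Segment"))}

    output = ET.Element(q("Output"))
    for attributes in output_fields:
        if attributes["segmentId"] not in known_ids:
            raise ValueError(f"unknown segment id: {attributes['segmentId']!r}")
        ET.SubElement(output, q("OutputField"), attributes)

    # Place Output directly before Segmentation, as in the exported models.
    position = list(parent_model).index(segmentation)
    parent_model.insert(position, output)
    return output


# Stand-in for the parent model; real segments would contain the two models.
chain = ET.fromstring(f'''<MiningModel xmlns="{NS}" functionName="regression">
  <MiningSchema/>
  <Segmentation multipleModelMethod="modelChain">
    <Segment id="classification"/>
    <Segment id="regression"/>
  </Segmentation>
</MiningModel>''')

add_segment_outputs(chain, [
    {"name": "probability(event)", "segmentId": "classification",
     "feature": "probability", "value": "event"},
    {"name": "y", "segmentId": "regression", "feature": "predictedValue"},
])
print(ET.tostring(chain, encoding="unicode"))
```

Calling the function with a segmentId such as `regresion` raises ValueError before the document is modified.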