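Problem description
I developed two models (classification and regression) and exported them to the PMML interchange format using https://github.com/jpmml/jpmml-xgboost. Both models work well when I call them from Python. However, I would like to combine the two into a single file that returns two values: the class probabilities of the classification model and the predicted value of the regression model.
I have been trying for hours, but I have not managed to fully understand the PMML specification.
Does anyone have experience with this and can give me a hint on how to combine the files and pass the values through the file? Both models require exactly the same inputs.
Thanks!
See the two small examples below.
Regression model: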
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
<Header>
<Application name="JPMML-XGBoost" version="1.5-SNAPSHOT"/>
<Timestamp>2021-07-27T11:55:26Z</Timestamp>
</Header>
<DataDictionary>
<DataField name="mpg" optype="continuous" dataType="float">
<Value value="NaN" property="missing"/>
</DataField>
<DataField name="IDVAR_REINIG" optype="continuous" dataType="float">
<Value value="NaN" property="missing"/>
</DataField>
</DataDictionary>
<MiningModel functionName="regression" algorithmName="XGBoost (GBTree)" x-mathContext="float">
<MiningSchema>
<MiningField name="mpg" usageType="target"/>
<MiningField name="IDVAR_REINIG"/>
</MiningSchema>
<Targets>
<Target field="mpg" rescaleConstant="0.5"/>
</Targets>
<Segmentation multipleModelMethod="sum">
<Segment id="1">
<True/>
<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
<MiningSchema>
<MiningField name="IDVAR_REINIG"/>
</MiningSchema>
<Output>
<OutputField name="mpg" optype="continuous" dataType="float" isFinalResult="false" rescaleConstant="0.5"/>
</Output>
<Node score="1.7433707">
<True/>
<Node score="6.1398296">
<SimplePredicate field="IDVAR_REINIG" operator="greaterOrEqual" value="6033.51"/>
</Node>
</Node>
</TreeModel>
</Segment>
</Segmentation>
</MiningModel>
</PMML>
Classification model:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.4">
<Header>
<Application name="JPMML-XGBoost" version="1.5-SNAPSHOT"/>
<Timestamp>2021-07-27T11:54:45Z</Timestamp>
</Header>
<DataDictionary>
<DataField name="mpg" optype="categorical" dataType="integer">
<Value value="0"/>
<Value value="1"/>
</DataField>
<DataField name="IDVAR_REINIG" optype="continuous" dataType="float">
<Value value="NaN" property="missing"/>
</DataField>
</DataDictionary>
<MiningModel functionName="classification" algorithmName="XGBoost (GBTree)" x-mathContext="float">
<MiningSchema>
<MiningField name="mpg" usageType="target"/>
<MiningField name="IDVAR_REINIG"/>
</MiningSchema>
<Segmentation multipleModelMethod="modelChain" missingPredictionTreatment="returnMissing">
<Segment id="1">
<True/>
<MiningModel functionName="regression" x-mathContext="float">
<MiningSchema>
<MiningField name="IDVAR_REINIG"/>
</MiningSchema>
<Output>
<OutputField name="xgbValue" optype="continuous" dataType="float" isFinalResult="false"/>
</Output>
<Segmentation multipleModelMethod="sum">
<Segment id="1">
<True/>
<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
<MiningSchema>
<MiningField name="IDVAR_REINIG"/>
</MiningSchema>
<Node score="0.0070259375">
<True/>
<Node score="-0.030500757">
<SimplePredicate field="IDVAR_REINIG" operator="greaterOrEqual" value="2240.835"/>
</Node>
</Node>
</TreeModel>
</Segment>
</Segmentation>
</MiningModel>
</Segment>
<Segment id="2">
<True/>
<RegressionModel functionName="classification" normalizationMethod="logit" x-mathContext="float">
<MiningSchema>
<MiningField name="mpg" usageType="target"/>
<MiningField name="xgbValue"/>
</MiningSchema>
<Output>
<OutputField name="probability(0)" optype="continuous" dataType="float" feature="probability" value="0"/>
<OutputField name="probability(1)" optype="continuous" dataType="float" feature="probability" value="1"/>
</Output>
<RegressionTable intercept="0.0" targetCategory="1">
<NumericPredictor name="xgbValue" coefficient="1.0"/>
</RegressionTable>
<RegressionTable intercept="0.0" targetCategory="0"/>
</RegressionModel>
</Segment>
</Segmentation>
</MiningModel>
</PMML>
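Solution
Create a parent MiningModel element that contains the two existing sub-model elements. Insert the classification model first and the regression model second, and execute them as a model chain.
By default, this model chain exposes only the result fields of the last sub-model. However, you can export one or more result fields of the first model into "local variables" and then reflect their values wherever they are needed.
Sample PMML markup skeleton: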
<MiningModel>
<Segmentation multipleModelMethod="modelChain">
<Segment id="classification">
<True/>
<MiningModel>
<Output>
<!-- Export the probability value to evaluation context -->
<OutputField name="probability(event)" feature="probability" value="event"/>
</Output>
</MiningModel>
</Segment>
<Segment id="regression">
<True/>
<MiningModel>
<MiningSchema>
<!-- Import the probability value from the evaluation context -->
<MiningField name="probability(event)"/>
</MiningSchema>
<Output>
<!-- Re-export the probability value under a different name -->
<OutputField name="copy(probability(event))" feature="transformedValue">
<FieldRef field="probability(event)"/>
</OutputField>
</Output>
</MiningModel>
</Segment>
</Segmentation>
</MiningModel>
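As a rough illustration of the model chain approach, the following Python sketch uses only the standard library `xml.etree.ElementTree` to wrap two already-parsed model elements in a parent `MiningModel`. The function name `chain_models`, the helper `q`, and the segment ids are assumptions made for this example, and it presumes both files use the PMML 4.4 namespace. It does not merge the two `DataDictionary` elements. Both files in the question declare a field named `mpg` with different types, so one of the targets would likely need to be renamed before the documents are combined.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: both exported files use the PMML 4.4 namespace, as in the question.
PMML_NS = "http://www.dmg.org/PMML-4_4"
ET.register_namespace("", PMML_NS)


def q(tag):
    """Return the namespace-qualified tag name (hypothetical helper)."""
    return f"{{{PMML_NS}}}{tag}"


def chain_models(classifier, regressor):
    """Wrap two model elements in a parent MiningModel executed as a model chain.

    Hypothetical helper: the segment ids and the parent functionName are
    choices made for this sketch, not something prescribed by JPMML.
    """
    # The regression model runs last, so the parent is declared as regression here.
    parent = ET.Element(q("MiningModel"), {"functionName": "regression"})

    # The parent MiningSchema lists each active (input) field once;
    # target fields stay inside the child models.
    schema = ET.SubElement(parent, q("MiningSchema"))
    seen = set()
    for model in (classifier, regressor):
        for field in model.find(q("MiningSchema")):
            name = field.get("name")
            if field.get("usageType", "active") == "active" and name not in seen:
                seen.add(name)
                ET.SubElement(schema, q("MiningField"), {"name": name})

    segmentation = ET.SubElement(
        parent, q("Segmentation"), {"multipleModelMethod": "modelChain"}
    )
    # Classification first, regression second, each behind an always-true predicate.
    for seg_id, model in (("classification", classifier), ("regression", regressor)):
        segment = ET.SubElement(segmentation, q("Segment"), {"id": seg_id})
        ET.SubElement(segment, q("True"))
        segment.append(copy.deepcopy(model))  # leave the parsed originals untouched
    return parent
```

The returned element would still need to be placed in a `PMML` document next to a merged `DataDictionary`. The `Output` and `MiningSchema` adjustments inside the sub-models, as shown in the skeleton, are left out of this sketch.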
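Alternatively, you can use the "segment reference" mechanism to access sub-model outputs from the parent Output element.
See the description of the OutputField@segmentId attribute in the PMML specification.
Sample PMML markup skeleton: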
<MiningModel>
<Segmentation multipleModelMethod="modelChain">
<Segment id="classification"/>
<Segment id="regression"/>
</Segmentation>
<Output>
<!-- Reflect the probability of the "event" class of the classification model -->
<OutputField name="probability(event)" segmentId="classification" feature="probability" value="event"/>
<!-- Reflect the predicted value of the regression model -->
<OutputField name="y" segmentId="regression" feature="predictedValue"/>
</Output>
</MiningModel>
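The segment reference variant could be scripted along the following lines. This is a minimal sketch under stated assumptions: the function name `add_segment_outputs`, the helper `q`, and the default event label `"1"` are hypothetical. The `optype` and `dataType` values are copied from the output fields in the question's classification model, and the output name `y` comes from the skeleton. The sketch places `Output` directly after `MiningSchema`, since the PMML schema lists `Output` before `Segmentation`.

```python
import xml.etree.ElementTree as ET

# Assumption: the parent MiningModel lives in the PMML 4.4 namespace.
PMML_NS = "http://www.dmg.org/PMML-4_4"


def q(tag):
    """Return the namespace-qualified tag name (hypothetical helper)."""
    return f"{{{PMML_NS}}}{tag}"


def add_segment_outputs(parent, event="1"):
    """Attach a parent-level Output that reflects results of both segments.

    Hypothetical helper: expects segments with the ids "classification"
    and "regression" inside the parent MiningModel.
    """
    output = ET.Element(q("Output"))
    # Probability of the chosen class, read from the classification segment.
    ET.SubElement(output, q("OutputField"), {
        "name": f"probability({event})",
        "optype": "continuous",
        "dataType": "float",
        "segmentId": "classification",
        "feature": "probability",
        "value": event,
    })
    # Predicted value, read from the regression segment.
    ET.SubElement(output, q("OutputField"), {
        "name": "y",
        "optype": "continuous",
        "dataType": "float",
        "segmentId": "regression",
        "feature": "predictedValue",
    })
    # Insert right after MiningSchema if present, otherwise as the first child.
    schema = parent.find(q("MiningSchema"))
    index = list(parent).index(schema) + 1 if schema is not None else 0
    parent.insert(index, output)
    return output
```

Whether a given evaluator accepts these parent-level fields depends on its support for `segmentId`, so the resulting file should be checked against the engine that will load it.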