na.fill in a model pipeline

Problem description

I am trying to fill null values with 0 inside a pipeline and then export the pipeline to a PMML file.

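My first attempt was to create a custom transformer, but I got an error saying the 'impute_to_zero' object has no attribute '_to_java'. After some research here, it looks like I need to create my own _to_java method, but I am struggling to do that with my code.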

Here is my code:

from pyspark.sql import DataFrame
from pyspark.ml import Transformer

class impute_to_zero(Transformer):
    """
    A custom Transformer which converts all dataframe na to 0
    """

    def __init__(self,df: DataFrame) -> None:
        super(impute_to_zero,self).__init__()

    def _transform(self,df: DataFrame) -> DataFrame:
        df = df.na.fill(0)
        return df

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer,VectorAssembler,SQLTransformer
from pyspark2pmml import PMMLBuilder,toPMMLBytes

# Prepare training data from a list of (id,category,numcol,label) tuples.
training = spark.createDataFrame([
    (0,"abc",3,1.0),(1,"b",None,0.0),(2,"spark",8,1.0),(3,"hadoop",4,0.0)
],["id","category","numcol","label"])

fillna = impute_to_zero(training)
indexer = StringIndexer(inputCol="category",outputCol="categoryIndex")
assembler = VectorAssembler(inputCols=["categoryIndex","numcol"],outputCol="features")
rf = RandomForestClassifier(labelCol="label",featuresCol="features",numTrees=5,maxDepth=3)
pipeline = Pipeline(stages=[fillna,indexer,assembler,rf])

model = pipeline.fit(training)

pmmlBuilder = PMMLBuilder(sc,training,model)
pmmlBuilder.buildFile("/dbfs/tmp/test.pmml")
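
The '_to_java' error most likely arises because a pure-Python Transformer has no JVM counterpart for the exporter to convert. A minimal sketch of one possible workaround, assuming it is acceptable to fill nulls before the pipeline instead of inside it: the helper name fit_with_prefilled_nulls is hypothetical, and the pipeline passed in is assumed to contain only built-in stages (indexer, assembler, rf). The exported PMML would then not contain the fill step, so data would need the same fill applied before scoring.

```python
def fit_with_prefilled_nulls(training, pipeline, fill_value=0):
    """Fill nulls outside the pipeline, then fit the pipeline on the result.

    `training` is expected to behave like a Spark DataFrame (it must expose
    `.na.fill`), and `pipeline` like a pyspark.ml.Pipeline (it must expose
    `.fit`). Both are passed in, so no custom stage is needed in the pipeline.
    """
    # Same call the custom transformer made, just applied up front.
    filled = training.na.fill(fill_value)
    # Only built-in stages take part in fitting (and, later, in the export).
    model = pipeline.fit(filled)
    return filled, model
```

With the objects from the code above, this would be called as `filled, model = fit_with_prefilled_nulls(training, Pipeline(stages=[indexer,assembler,rf]))`, followed by `PMMLBuilder(sc,filled,model)`.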

My second attempt was to use SQLTransformer, but there seems to be a problem with pyspark2pmml. I get an error message saying IllegalArgumentException: Name(s) [numcol] do not match any fields.

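from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer,VectorAssembler,SQLTransformer
from pyspark2pmml import PMMLBuilder

# Prepare training data from a list of (id,category,numcol,label) tuples.
training = spark.createDataFrame([
    (0,"abc",3,1.0),(1,"b",None,0.0),(2,"spark",8,1.0),(3,"hadoop",4,0.0)
],["id","category","numcol","label"])

fillna = SQLTransformer(statement = 
"""select 
  category,case when numcol is null then 0 else numcol end as numcol,label
FROM __THIS__
""")
indexer = StringIndexer(inputCol="category",outputCol="categoryIndex")
assembler = VectorAssembler(inputCols=["categoryIndex","numcol"],outputCol="features")
rf = RandomForestClassifier(labelCol="label",featuresCol="features",numTrees=5,maxDepth=3)
pipeline = Pipeline(stages=[fillna,indexer,assembler,rf])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

pmmlBuilder = PMMLBuilder(sc,training,model)
pmmlBuilder.buildFile("/dbfs/tmp/test.pmml")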
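
For reference, the null-filling statement above can also be generated with COALESCE rather than written by hand. This is only a sketch: the helper name build_fill_statement is hypothetical, and whether this form avoids the pyspark2pmml error is not something I have confirmed.

```python
def build_fill_statement(columns, fill_columns, fill_value=0):
    """Build a SQLTransformer statement that replaces nulls in `fill_columns`.

    `columns` lists every column to keep, in output order; columns that are
    not in `fill_columns` are selected unchanged.
    """
    fill_set = set(fill_columns)
    parts = []
    for name in columns:
        if name in fill_set:
            # COALESCE returns the first non-null argument.
            parts.append(f"COALESCE({name}, {fill_value}) AS {name}")
        else:
            parts.append(name)
    # __THIS__ is the placeholder SQLTransformer uses for the input table.
    return "SELECT " + ", ".join(parts) + " FROM __THIS__"


statement = build_fill_statement(["category", "numcol", "label"], ["numcol"])
print(statement)
# prints: SELECT category, COALESCE(numcol, 0) AS numcol, label FROM __THIS__
```

The resulting string would be passed as `SQLTransformer(statement=statement)` in place of the hand-written query.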
