Loading a Pyspark.ml model from S3 with Pipeline

Problem Description

I am trying to save a trained model to S3 storage and then load it for prediction via the Pipeline package from pyspark.ml. Here is an example of how I save the model:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

# stage_1 to stage_4 are some basic transformations on the data (one-hot encoding, etc.)
# define stage 5: logistic regression model
stage_5 = LogisticRegression(featuresCol='features', labelCol='label')

# set up the pipeline
regression_pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, stage_4, stage_5])

# fit the pipeline to the training data
model = regression_pipeline.fit(dataFrame1)

model_path = "s3://s3-dummy_path-orch/dummy models/pipeline_testing_1.model"
model.save(model_path)

The model saves successfully, and two items are created under the model path above:

  1. stages
  2. metadata

However, when I try to load the model, it gives me the following error:

Traceback (most recent call last):
  File "/tmp/PythonScript_85ff2462_e087_4805_9f50_0c75fc4302e2958379757178872310.py", line 75, in <module>
    pipelineModel = Pipeline.load(model_path)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 362, in load
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 207, in load
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 300, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: 'requirement Failed: Error loading Metadata: Expected class name org.apache.spark.ml.Pipeline but found class name org.apache.spark.ml.PipelineModel'

I am trying to load the model as follows:

from pyspark.ml import Pipeline

## same path used for model.save in the code snippet above
model_path ="s3://s3-dummy_path-orch/dummy models/pipeline_testing_1.model" 

pipelineModel = Pipeline.load(model_path)

How can I fix this?

Solution

If you saved a pipeline model, you should load it back as a PipelineModel, not as a Pipeline. The difference is that a PipelineModel has already been fit to a DataFrame, whereas a Pipeline has not (which is exactly what the error message is complaining about).

from pyspark.ml import PipelineModel

pipelineModel = PipelineModel.load(model_path)