AnalysisException when loading a PipelineModel with Spark 3

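Problem description

I upgraded Spark from 2.4.5 to 3.0.1 and can no longer load PipelineModel objects that use a "DecisionTreeClassifier" stage.

My code loads several PipelineModels. Every model with stages ["CountVectorizer_[uid]", "LinearSVC_[uid]"] loads fine, but models with stages ["CountVectorizer_[uid]", "DecisionTreeClassifier_[uid]"] throw the following exception:

AnalysisException: cannot resolve '`rawCount`' given input columns: [gain,id,impurity,impurityStats,leftChild,prediction,rightChild,split]

Here is the code I am using, followed by the full stack trace: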

from pyspark.ml.pipeline import PipelineModel
PipelineModel.load("/path/to/model")


AnalysisException                         Traceback (most recent call last)
<command-1278858167154148> in <module>
----> 1 RalentModel = PipelineModel.load(MODELES_ATTRIBUTS + "RalentModel_DT")

/databricks/spark/python/pyspark/ml/util.py in load(cls,path)
    368     def load(cls,path):
    369         """Reads an ML instance from the input path,a shortcut of `read().load(path)`."""
--> 370         return cls.read().load(path)
    371 
    372 

/databricks/spark/python/pyspark/ml/pipeline.py in load(self,path)
    289         metadata = DefaultParamsReader.loadMetadata(path,self.sc)
    290         if 'language' not in metadata['paramMap'] or metadata['paramMap']['language'] != 'Python':
--> 291             return JavaMLReader(self.cls).load(path)
    292         else:
    293             uid,stages = PipelineSharedReadWrite.load(metadata,self.sc,path)

/databricks/spark/python/pyspark/ml/util.py in load(self,path)
    318         if not isinstance(path,basestring):
    319             raise TypeError("path should be a basestring,got type %s" % type(path))
--> 320         java_obj = self._jread.load(path)
    321         if not hasattr(self._clazz,"_from_java"):
    322             raise NotImplementedError("This Java ML type cannot be loaded into Python currently: %r"

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self,*args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer,self.gateway_client,self.target_id,self.name)
   1306 
   1307         for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.py in deco(*a,**kw)
    131                 # Hide where the exception came from that shows a non-Pythonic
    132                 # JVM exception message.
--> 133                 raise_from(converted)
    134             else:
    135                 raise

/databricks/spark/python/pyspark/sql/utils.py in raise_from(e)

AnalysisException: cannot resolve '`rawCount`' given input columns: [gain,id,impurity,impurityStats,leftChild,prediction,rightChild,split];

These pipeline models were saved with Spark 2.4.3, and I can load them without problems using Spark 2.4.5.
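To double-check which Spark version actually wrote a given stage, the version recorded in the stage's metadata can be read back. The sketch below is only an illustration: the helper name `saved_spark_version` and the example record are hypothetical, and it assumes the metadata is a single JSON record with a `sparkVersion` field stored under `<stage>/metadata/`.

```python
import json


def saved_spark_version(metadata_line):
    """Return (major, minor) of the Spark version recorded in an ML metadata JSON line."""
    metadata = json.loads(metadata_line)
    major, minor = metadata["sparkVersion"].split(".")[:2]
    return int(major), int(minor)


# Hypothetical metadata record, shaped like the JSON line stored under
# <stage>/metadata/ (the timestamp and paramMap values are made up).
example = json.dumps({
    "class": "org.apache.spark.ml.classification.DecisionTreeClassificationModel",
    "timestamp": 0,
    "sparkVersion": "2.4.3",
    "uid": "DecisionTreeClassifier_4d2a76c565b0",
    "paramMap": {},
})

print(saved_spark_version(example))  # (2, 4)
```

On a cluster, the input line could come from something like `spark.sparkContext.textFile(stage_path + "/metadata").first()`, where `stage_path` is one of the stage directories of the saved pipeline.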

I tried to investigate further by loading each stage separately. Loading the CountVectorizerModel with

from pyspark.ml.feature import CountVectorizerModel
CountVectorizerModel.read().load("/path/to/model/stages/0_CountVectorizer_efce893314a9")

yields a CountVectorizerModel, so that works, but my code fails when I try to load the DecisionTreeClassificationModel:

from pyspark.ml.classification import DecisionTreeClassificationModel
DecisionTreeClassificationModel.read().load("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0")
AnalysisException: cannot resolve '`rawCount`' given input columns: [gain,split];
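The stage-by-stage check above can be generalized so that one failing stage does not hide the others. This is a minimal sketch, not part of the original code: `find_failing_stages`, `load_ok` and `load_broken` are hypothetical names, and the two stand-in loaders only mimic the behaviour described here so the example runs without a cluster.

```python
def find_failing_stages(stage_loaders):
    """Call each loader on its stage path; return {path: error text} for the failures."""
    failures = {}
    for path, load in stage_loaders.items():
        try:
            load(path)
        except Exception as exc:  # keep going so every stage is checked
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in loaders (hypothetical): one succeeds, one fails like the tree stage.
def load_ok(path):
    return "loaded " + path


def load_broken(path):
    raise RuntimeError("cannot resolve '`rawCount`'")


failures = find_failing_stages({
    "/path/to/model/stages/0_CountVectorizer_efce893314a9": load_ok,
    "/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0": load_broken,
})
for path, error in failures.items():
    print(path, "->", error)
```

With PySpark available, the stand-ins would be replaced by the real loaders, for example `CountVectorizerModel.load` and `DecisionTreeClassificationModel.load`.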

Here is the content of the "data" directory of my decision tree classifier:

spark.read.parquet("/path/to/model/stages/1_DecisionTreeClassifier_4d2a76c565b0/data").show()

+---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
| id|prediction|            impurity|impurityStats|                gain|leftChild|rightChild|           split|
+---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
|  0|       0.0|  0.3926234384295062| [90.0,33.0]| 0.16011830963990054|        1|        16|[190,[0.5],-1]|
|  1|       0.0|  0.2672722508516028| [90.0,17.0]| 0.11434106988303855|        2|        15|[512,-1]|
|  2|       0.0|  0.1652892561983472|  [90.0,9.0]| 0.06959547629404085|        3|        14|[583,-1]|
|  3|       0.0| 0.09972299168975082|  [90.0,5.0]|0.026984966852376356|        4|        11|[480,-1]|
|  4|       0.0|0.043933846736523306|  [87.0,2.0]|0.021717299239076976|        5|        10|[555,[1.5],-1]|
|  5|       0.0|0.022469008264462766|  [87.0,1.0]|0.011105371900826402|        6|         7|[833,-1]|
|  6|       0.0|                 0.0|  [86.0,0.0]|                -1.0|       -1|        -1|    [-1,[],-1]|
|  7|       0.0|                 0.5|   [1.0,1.0]|                 0.5|        8|         9|  [0,-1]|
|  8|       0.0|                 0.0|   [1.0,-1]|
|  9|       1.0|                 0.0|   [0.0,1.0]|                -1.0|       -1|        -1|    [-1,-1]|
| 10|       1.0|                 0.0|   [0.0,-1]|
| 11|       0.0|                 0.5|   [3.0,3.0]|                 0.5|       12|        13| [14,-1]|
| 12|       0.0|                 0.0|   [3.0,-1]|
| 13|       1.0|                 0.0|   [0.0,3.0]|                -1.0|       -1|        -1|    [-1,-1]|
| 14|       1.0|                 0.0|   [0.0,4.0]|                -1.0|       -1|        -1|    [-1,-1]|
| 15|       1.0|                 0.0|   [0.0,8.0]|                -1.0|       -1|        -1|    [-1,-1]|
| 16|       1.0|                 0.0|  [0.0,16.0]|                -1.0|       -1|        -1|    [-1,-1]|
+---+----------+--------------------+-------------+--------------------+---------+----------+----------------+
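The saved data has no `rawCount` column, which matches the error message. A small sketch of that comparison follows. Note the assumption: the expected column list is only inferred from the exception text (the saved columns plus `rawCount`), not taken from the Spark source, and `missing_columns` is a hypothetical helper.

```python
def missing_columns(actual_columns, expected_columns):
    """Return the expected columns that are absent from the actual ones, in order."""
    actual = set(actual_columns)
    return [name for name in expected_columns if name not in actual]


# Columns of the saved parquet data, as shown by show() above.
saved = ["id", "prediction", "impurity", "impurityStats",
         "gain", "leftChild", "rightChild", "split"]

# Assumption: the Spark 3 reader needs the saved columns plus rawCount,
# which is what the AnalysisException message suggests.
expected = saved + ["rawCount"]

print(missing_columns(saved, expected))  # ['rawCount']
```

On a cluster, `saved` could be taken from `spark.read.parquet(data_path).columns` instead of being typed in by hand, where `data_path` is the stage's "data" directory.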


apache-spark machine-learning python spark3