ValueError when using ColumnTransformer in an Sklearn Pipeline: custom Spacy class for GloveVectorizer

Problem description

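I have a dataset with several text columns and one target column. I am trying to use GloVe embeddings for my text columns through a custom class built on Spacy, and I want to do this inside a pipeline. However, I get a ValueError. Here is my code: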

data_features = df.copy()[["title","description"]]
train_data,test_data,train_target,test_target = train_test_split(data_features,df['target'],test_size = 0.1)

I created this custom class to use GloVe embeddings. I got the code from this tutorial.

class SpacyVectorTransformer(BaseEstimator,TransformerMixin):
    def __init__(self,nlp):
        self.nlp = nlp
        self.dim = 300

    def fit(self,X,y):
        return self

    def transform(self,X):
        return [self.nlp(text).vector for text in X]

Loading the nlp model:

nlp = spacy.load("en_core_web_sm")

This is the column transformer I want to use in the pipeline:

col_preprocessor = ColumnTransformer(
    [
        ('title_glove', SpacyVectorTransformer(nlp), 'title'),
        ('description_glove', SpacyVectorTransformer(nlp), 'description'),
    ],
    remainder='drop',
    n_jobs=1
)

This is my pipeline:

pipeline_glove = Pipeline([
    ('col_preprocessor',col_preprocessor),('classifier',LogisticRegression())
])

When I run the fit method, I get the following error:

pipeline_glove.fit(train_data,train_target)

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-219-8543ea744205> in <module>
----> 1 pipeline_glove.fit(train_data,train_target)

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self,y,**fit_params)
    328         """
    329         fit_params_steps = self._check_fit_params(**fit_params)
--> 330         Xt = self._fit(X,**fit_params_steps)
    331         with _print_elapsed_time('Pipeline',
    332                                  self._log_message(len(self.steps) - 1)):

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self,**fit_params_steps)
    294                 message_clsname='Pipeline',
    295                 message=self._log_message(step_idx),
--> 296                 **fit_params_steps[name])
    297             # Replace the transformer of the step with the fitted
    298             # transformer. This is necessary when loading the transformer

/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self,*args,**kwargs)
    353 
    354     def __call__(self,**kwargs):
--> 355         return self.func(*args,**kwargs)
    356 
    357     def call_and_shelve(self,**kwargs):

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer,weight,message_clsname,message,**fit_params)
    738     with _print_elapsed_time(message_clsname,message):
    739         if hasattr(transformer,'fit_transform'):
--> 740             res = transformer.fit_transform(X,**fit_params)
    741         else:
    742             res = transformer.fit(X,**fit_params).transform(X)

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self,y)
    549 
    550         self._update_fitted_transformers(transformers)
--> 551         self._validate_output(Xs)
    552 
    553         return self._hstack(list(Xs))

/opt/conda/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _validate_output(self,result)
    410                 raise ValueError(
    411                     "The output of the '{0}' transformer should be 2D (scipy "
--> 412                     "matrix,array,or pandas DataFrame).".format(name))
    413 
    414     def _validate_features(self,n_features,feature_names):

ValueError: The output of the 'title_glove' transformer should be 2D (scipy matrix, array, or pandas DataFrame).

Solution

The error message tells you what you need to fix.

ValueError: The output of the 'title_glove' transformer should be 2D (scipy matrix, array, or pandas DataFrame).

But your current transformer (SpacyVectorTransformer) returns a list. You can fix this by turning the list into, for example, a pandas DataFrame like this:

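import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
        self.dim = 300

    def fit(self, X, y=None):
        # Nothing to learn; y is optional so the transformer also works without a target.
        return self

    def transform(self, X):
        # One row per document, one column per vector component: a 2D result.
        return pd.DataFrame([self.nlp(text).vector for text in X])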
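A DataFrame is only one of the accepted 2D outputs. As a minimal sketch of an alternative, the per-document vectors can be stacked into a 2D NumPy array with np.vstack. To keep the example runnable without downloading a Spacy model, it uses stub_nlp and StubDoc, which are hypothetical stand-ins invented here for nlp and its Doc objects; they are not part of Spacy.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StubDoc:
    """Hypothetical stand-in for a Spacy Doc; it only exposes .vector."""

    def __init__(self, text, dim):
        # Deterministic fake embedding so the sketch runs without a Spacy model.
        self.vector = np.full(dim, float(len(text)), dtype=np.float32)


def stub_nlp(text, dim=4):
    """Hypothetical replacement for calling nlp(text)."""
    return StubDoc(text, dim)


class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Stack the 1-D document vectors into one (n_samples, dim) array.
        return np.vstack([self.nlp(text).vector for text in X])


texts = ["short title", "a somewhat longer title"]

# What the original transform returned: a plain list of 1-D arrays.
as_list = [stub_nlp(t).vector for t in texts]
# What the stacked version returns: a single 2D array.
as_array = SpacyVectorTransformer(stub_nlp).transform(texts)

print(getattr(as_list, "ndim", 0))    # a plain list has no ndim attribute
print(as_array.ndim, as_array.shape)  # dimensionality and shape of the stacked array
```

A plain Python list carries no dimensionality information of its own, which is why it does not count as 2D output, whereas the stacked array does. With a real model you would pass the loaded nlp object instead of stub_nlp.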

Next time, please provide a minimal, reproducible example. The code you posted has no imports and no DataFrame named "df".
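For illustration, a self-contained version of the question could look roughly like the sketch below. The toy DataFrame, its contents, and toy_nlp are assumptions made up for this example: toy_nlp is a hypothetical stand-in that returns an object with a .vector attribute, so the code runs without a Spacy model. With Spacy installed you would pass the loaded nlp object instead.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def toy_nlp(text):
    """Hypothetical stand-in for a Spacy pipeline (not a real embedding)."""
    vec = np.array([len(text), text.count(" ") + 1, text.count("a")], dtype=float)
    return SimpleNamespace(vector=vec)


class SpacyVectorTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return 2D output, as ColumnTransformer requires.
        return pd.DataFrame([self.nlp(text).vector for text in X])


# Toy data invented for this sketch.
df = pd.DataFrame({
    "title": ["cheap phone", "new laptop", "old phone",
              "fast laptop", "big phone", "thin laptop"],
    "description": ["a phone that is cheap", "a laptop that is new",
                    "a phone that is old", "a laptop that is fast",
                    "a phone that is big", "a laptop that is thin"],
    "target": [0, 1, 0, 1, 0, 1],
})

col_preprocessor = ColumnTransformer(
    [
        # A string selector hands the transformer a 1-D Series of texts.
        ("title_glove", SpacyVectorTransformer(toy_nlp), "title"),
        ("description_glove", SpacyVectorTransformer(toy_nlp), "description"),
    ],
    remainder="drop",
)

pipeline_glove = Pipeline([
    ("col_preprocessor", col_preprocessor),
    ("classifier", LogisticRegression()),
])

features = df[["title", "description"]]
pipeline_glove.fit(features, df["target"])

# Inspect the stacked features and the predictions on the training rows.
Xt = pipeline_glove.named_steps["col_preprocessor"].transform(features)
preds = pipeline_glove.predict(features)
print(Xt.shape, preds.shape)
```

Note that each column is selected with a plain string ("title") rather than a list (["title"]). With a string, the transformer receives a 1-D Series, so iterating over X yields the texts themselves; with a list it would receive a DataFrame, and iterating over that yields column names instead.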

machine-learning pandas python scikit-learn spacy