How do I transform new data with a custom vectorizer to predict a class?

Problem description

I searched but could not find a similar question, or perhaps I searched with the wrong keywords.

I want to try two variants of feature extraction:

1. Vectorize the text as a simple bag of words
2. Vectorize the text as a bag of words, combined with additional features

For the first approach, I fit and transform the dataset with this code (it is part of my function; df is the DataFrame and vect is a TfidfVectorizer/CountVectorizer):

    self.X = self.vect.fit_transform(df.Tweet)
    self.X_columns=self.vect.get_feature_names()

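After building the classification model, I can use this code to transform any text I want to predict (vect is the fitted TfidfVectorizer/CountVectorizer, new_df is a DataFrame, and clf is the classifier trained with any algorithm):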

    text_features = vect.transform(new_df.Tweet)  
    predictions = clf.predict(text_features)
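
As a minimal, self-contained sketch of this first approach: the toy tweets, the labels, and the choice of LogisticRegression below are assumptions made up for illustration, not part of the original code. The point it shows is that the vectorizer is fitted once on the training text and then only `transform` is called on new text, so the columns line up.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical, for illustration only).
df = pd.DataFrame({
    "Tweet": ["good movie", "great film", "bad movie", "awful film"],
    "label": [1, 1, 0, 0],
})

vect = CountVectorizer()
X = vect.fit_transform(df.Tweet)          # learns the vocabulary from training text
clf = LogisticRegression().fit(X, df.label)

# New text: reuse the already fitted vectorizer, never refit it here.
new_df = pd.DataFrame({"Tweet": ["good film", "awful movie"]})
text_features = vect.transform(new_df.Tweet)
predictions = clf.predict(text_features)

# Both matrices have one column per vocabulary entry learned during fit.
print(X.shape, text_features.shape)
```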

That part is done, and it works.

For the second case, I did the same thing with a workaround. I looked on Stack Overflow for useful code and ended up with this (sp is the scipy library, df is the DataFrame):

    self.X = sp.sparse.hstack((vect.fit_transform(df.Tweet),df[['feature_1','feature_2','score','sentiment']].values),format='csr')
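    # column names must list the same four extra columns that were stacked into self.X
    self.X_columns=vect.get_feature_names() + df[['feature_1','feature_2','score','sentiment']].columns.tolist()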

It works: the additional features are appended to the CSR matrix.

But the question is how to turn new_df into a matrix of the same form. I don't know where to start.

Solution

My guess is:

    # compute/prepare each additional feature ['feature_1','feature_2','score','sentiment'] for new_df
    ...
    # then stack the same way, but call transform instead of fit_transform,
    # and pass the same extra columns in the same order as during training
    text_features = sp.sparse.hstack((vect.transform(new_df.Tweet),new_df[['feature_1','feature_2','score','sentiment']].values),format='csr')
    predictions = clf.predict(text_features)
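
To check the idea end to end, here is a small runnable sketch. The toy data, the `extra_cols` list name, and the LogisticRegression classifier are assumptions for illustration. Wrapping the dense columns in `csr_matrix` before stacking is a defensive choice of mine, not something the original code does.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Keep the extra columns in one list so training and prediction use the same order.
extra_cols = ["feature_1", "feature_2", "score", "sentiment"]

# Toy training data (hypothetical values).
df = pd.DataFrame({
    "Tweet": ["good movie", "great film", "bad movie", "awful film"],
    "feature_1": [1, 0, 1, 0],
    "feature_2": [3, 2, 5, 4],
    "score": [0.9, 0.8, 0.2, 0.1],
    "sentiment": [1, 1, -1, -1],
    "label": [1, 1, 0, 0],
})

vect = TfidfVectorizer()
X = sp.hstack(
    (vect.fit_transform(df.Tweet), sp.csr_matrix(df[extra_cols].to_numpy(dtype=float))),
    format="csr",
)
clf = LogisticRegression().fit(X, df.label)

# New data must carry the same extra columns, already computed.
new_df = pd.DataFrame({
    "Tweet": ["good film", "awful movie"],
    "feature_1": [1, 0],
    "feature_2": [2, 4],
    "score": [0.7, 0.3],
    "sentiment": [1, -1],
})
text_features = sp.hstack(
    (vect.transform(new_df.Tweet), sp.csr_matrix(new_df[extra_cols].to_numpy(dtype=float))),
    format="csr",
)
predictions = clf.predict(text_features)

# Column count = vocabulary size + number of extra columns, in both matrices.
print(X.shape, text_features.shape)
```

If the extra columns differ between training and prediction (fewer columns or a different order), the classifier either rejects the input because of the width mismatch or silently reads the wrong feature in each position, so keeping a single column list is the safer habit.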

I will update this if the answer turns out to be correct. If you find a better approach or solution, please share.
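
One possible alternative, offered only as a sketch: scikit-learn's `ColumnTransformer` inside a `Pipeline` can do the vectorizing and the column stacking in one fitted object, so `predict` takes the new DataFrame directly. The toy data, the step names, and the classifier below are assumptions for illustration, not taken from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

extra_cols = ["feature_1", "feature_2", "score", "sentiment"]
feature_cols = ["Tweet"] + extra_cols

# Toy training data (hypothetical values).
df = pd.DataFrame({
    "Tweet": ["good movie", "great film", "bad movie", "awful film"],
    "feature_1": [1, 0, 1, 0],
    "feature_2": [3, 2, 5, 4],
    "score": [0.9, 0.8, 0.2, 0.1],
    "sentiment": [1, 1, -1, -1],
    "label": [1, 1, 0, 0],
})

preprocess = ColumnTransformer([
    # A plain string selects the column as 1-D text, which the vectorizer expects.
    ("text", TfidfVectorizer(), "Tweet"),
    # The numeric columns are passed through unchanged and stacked next to the text features.
    ("extra", "passthrough", extra_cols),
])
model = Pipeline([("features", preprocess), ("clf", LogisticRegression())])
model.fit(df[feature_cols], df["label"])

new_df = pd.DataFrame({
    "Tweet": ["good film", "awful movie"],
    "feature_1": [1, 0],
    "feature_2": [2, 4],
    "score": [0.7, 0.3],
    "sentiment": [1, -1],
})
# The pipeline applies transform (not fit_transform) internally at predict time.
predictions = model.predict(new_df[feature_cols])
```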

classification pandas python scikit-learn scipy