Gridsearch for NLP: how to combine CountVec and other features?

Problem description

I am working on a basic NLP project on sentiment analysis, and I would like to use GridSearchCV to optimize my model.

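The code below shows the sample dataframe I am using. "content" is the column to pass to CountVectorizer, "label" is the y column to predict, and feature_1 and feature_2 are additional columns I want to include in the model.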

import pandas as pd

df = pd.DataFrame([
    {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list', 'feature_1': '1', 'feature_2': '34', 'label': 1},
    # The 'label' value for this row is missing in the source, so it is left out here.
    {'content': 'UP today Why doe head hurt badly', 'feature_1': '5', 'feature_2': '142'},
    {'content': 'spray tan fail leg foot Ive scrubbing foot look better ', 'feature_1': '7', 'feature_2': '123', 'label': 0},
])

I am following this Stack Overflow answer: Perform feature selection using pipeline and gridsearch

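import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
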
class CustomFeatureExtractor(BaseEstimator,TransformerMixin):
    def __init__(self,feature_1=True,feature_2=True):
        self.feature_1=feature_1
        self.feature_2=feature_2
        
    def extractor(self,tweet):
        features = []

        if self.feature_2:
            
            features.append(df['feature_2'])

        if self.feature_1:
            features.append(df['feature_1'])
        
          
        return np.array(features)

    def fit(self,raw_docs,y):
        return self

    def transform(self,raw_docs):
        
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))

Below is the grid search pipeline I am trying to fit to the dataframe:

lr = LogisticRegression()

# Pipeline
pipe = Pipeline([('features',FeatureUnion([("vectorizer",CountVectorizer(df['content'])),("extractor",CustomFeatureExtractor())])),('classifier',lr())
                ])
But this raises: TypeError: 'LogisticRegression' object is not callable

Is there a simpler way to do this?

I have already consulted the following threads, but to no avail: "How to combine TFIDF features with other features" and "Perform feature selection using pipeline and gridsearch".

Solution

You cannot call lr(). A LogisticRegression instance is indeed not callable; lr is an object that only exposes methods (such as fit and predict).
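To make the distinction concrete, here is a small check. It is a sketch added for illustration, not part of the original answer. It shows that the LogisticRegression class can be called (to build an instance), while the instance itself cannot.

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

# The class is callable: calling it constructs an estimator instance.
print(callable(LogisticRegression))  # True

# The instance is not callable: it only exposes methods such as fit/predict.
print(callable(lr))  # False

# Calling the instance reproduces the error from the question.
try:
    lr()
except TypeError as exc:
    message = str(exc)
    print(message)  # 'LogisticRegression' object is not callable
```

A Pipeline step expects the estimator object itself, which is why passing lr (and not lr()) is the fix.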

Try this instead (lr without parentheses):

lr = LogisticRegression()
pipe = Pipeline([('features',FeatureUnion([("vectorizer",CountVectorizer(df['content'])),("extractor",CustomFeatureExtractor())])),('classifier',lr)
                ])

The error message should then go away.
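On the question of a simpler way: one option, which is not taken from the answer above and is offered only as a sketch, is scikit-learn's ColumnTransformer. It routes the text column to CountVectorizer and the numeric columns to a separate transformer, so no custom extractor class is needed. Note that CountVectorizer's constructor takes configuration rather than data; the text is supplied when the pipeline is fitted. The assumptions here are loud ones: the toy dataframe is invented for illustration (the three-row sample in the question is too small to cross-validate), feature_1 and feature_2 are stored as integers (the string values in the question would need casting first), StandardScaler is an arbitrary choice for the numeric columns, and the parameter grid values are arbitrary examples.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for illustration only; column names follow the question.
toy = pd.DataFrame({
    "content": [
        "flat tyre pot hole bad day",
        "head hurt badly today",
        "spray tan fail",
        "great day sunny walk",
        "love this new phone",
        "awful traffic again",
        "happy birthday to me",
        "terrible service never again",
    ],
    "feature_1": [1, 5, 7, 2, 3, 6, 4, 8],
    "feature_2": [34, 142, 123, 50, 61, 130, 45, 150],
    "label": [1, 1, 0, 0, 0, 1, 0, 1],
})

features = ColumnTransformer([
    # A string selector yields a 1-D column, which CountVectorizer expects.
    ("vectorizer", CountVectorizer(), "content"),
    # A list selector yields a 2-D block of numeric columns.
    ("numeric", StandardScaler(), ["feature_1", "feature_2"]),
])

pipe = Pipeline([
    ("features", features),
    ("classifier", LogisticRegression()),  # an instance, never lr()
])

# Step names are joined with double underscores to address nested parameters.
param_grid = {
    "features__vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0],
}

feature_columns = ["content", "feature_1", "feature_2"]
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(toy[feature_columns], toy["label"])
print(search.best_params_)
```

Because the whole feature step lives inside the pipeline, the vectorizer vocabulary is refit on each training fold, which keeps the cross-validation scores free of leakage from the held-out fold.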