Hyperparameter search with GridSearch returns parameter values that do not work

Problem description

I am running a hyperparameter search with scikit-learn's GridSearch, using a CountVectorizer and a RandomForestClassifier. The hyperparameter search grid looks like this:

grid = {
    'vectorizer__ngram_range': [(1, 1)],
    'vectorizer__stop_words': [None, german_stop_words],
    'vectorizer__max_df': [0.25, 0.5, 0.75, 1],
    'vectorizer__min_df': [0.01, 0.1, 1, 5, 10],
    'vectorizer__max_features': [None, 100, 1000, 1500],
    'classifier__class_weight': ['balanced', 'balanced_subsample', None],
    'classifier__n_jobs': [-1],
    'classifier__n_estimators': [100, 190, 250]
}
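
Just to give a sense of the search size, the grid above can be enumerated with scikit-learn's ParameterGrid. This is only a quick sketch and assumes grid and german_stop_words are already defined as in the snippet above:

from sklearn.model_selection import ParameterGrid

# Enumerate every combination the grid above produces.
combos = list(ParameterGrid(grid))
print(len(combos))  # 1 * 2 * 4 * 5 * 4 * 3 * 1 * 3 = 2880 combinations
# With cv=5 this means up to 14400 individual fits during the search.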

The grid search runs to completion and gives me a best_params result. I have run it several times and got different results. During the runs I sometimes get errors like this:

  warnings.warn("Estimator fit failed. The score on this train-test"
/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:548: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/model_selection/_validation.py",line 531,in _fit_and_score
    estimator.fit(X_train,y_train,**fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py",line 330,in fit
    Xt = self._fit(X,y,**fit_params_steps)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py",line 292,in _fit
    X,fitted_transformer = fit_transform_one_cached(
  File "/root/complex_semantics/lib/python3.8/site-packages/joblib/memory.py",line 352,in __call__
    return self.func(*args,**kwargs)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/pipeline.py",line 740,in _fit_transform_one
    res = transformer.fit_transform(X,**fit_params)
  File "/root/complex_semantics/lib/python3.8/site-packages/sklearn/feature_extraction/text.py",line 1213,in fit_transform
    raise ValueError(
ValueError: max_df corresponds to < documents than min_df

I assumed this was normal, since some of the values simply don't mix well. But after getting the best parameters and running the model with them a few times, I get an error telling me that the values of max_df and min_df are invalid, because the number of documents selected with max_df is lower than the number selected with min_df.
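
The error itself is easy to reproduce outside of the grid search. Below is a minimal sketch with a tiny made-up corpus (the corpus is only for illustration): CountVectorizer interprets float values of min_df/max_df as proportions of the documents and integer values as absolute document counts, so a fractional max_df can end up corresponding to fewer documents than an integer min_df.

from sklearn.feature_extraction.text import CountVectorizer

# 20 made-up documents, purely for illustration.
docs = ["the cat sat", "the dog sat", "a bird flew", "the fish swam"] * 5

# max_df=0.25 is a fraction       -> at most 0.25 * 20 = 5 documents,
# min_df=10 is an absolute count  -> at least 10 documents,
# so max_df corresponds to fewer documents than min_df.
try:
    CountVectorizer(max_df=0.25, min_df=10).fit(docs)
except ValueError as err:
    print(err)  # max_df corresponds to < documents than min_df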

Why does it run fine during the hyperparameter search and then fail afterwards, on the same dataset?

Any ideas? Is there a way to avoid this?

Here is the GridSearch code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', RandomForestClassifier())])

scoring_function = make_scorer(matthews_corrcoef)
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring=scoring_function, n_jobs=-1, cv=5)
grid_search.fit(X=train_text, y=train_labels)
print("-----------")
print(grid_search.best_score_)
print(grid_search.best_params_)
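
To see which parameter combinations actually failed during the search (their score is set to NaN, as the warning says), the fitted grid_search object can be inspected along these lines. This is just a rough sketch that assumes grid_search has already been fitted as above:

import numpy as np

results = grid_search.cv_results_
# Combinations whose fit failed on at least one fold end up with a NaN mean score.
failed = [p for p, s in zip(results['params'], results['mean_test_score']) if np.isnan(s)]
print(f"{len(failed)} of {len(results['params'])} combinations failed")
for params in failed[:5]:
    print(params)  # e.g. conflicting min_df / max_df settings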

Solution

No working solution to this problem has been found yet.
