How do I correctly select features with SelectFromModel in scikit-learn?

Problem description

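I'm using a very simple Kaggle dataset to understand how SelectFromModel works with logistic regression. The idea is to build a very simple pipeline with some basic data processing (dropping columns + scaling), pass it to feature selection (logreg), and then fit an xgboost model (not included in the code). From reading the documentation, my understanding is that, given X_train and y_train, a logreg model is fitted and the features whose coefficients are greater than or equal to the threshold are selected. In my case, I set the threshold to mean * 1.25.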

What I can't understand is why selector.threshold_ differs from selector.estimator_.coef_.mean()*1.25, which I expected to give the same value. Why is that?

Moving on, I'd like to run GridSearchCV to fine-tune my pipeline's parameters. I usually do it like this:

from sklearn.model_selection import GridSearchCV

params = {}
params['gradientboostingclassifier__learning_rate'] = [0.05,0.1,0.2]
params['selectfrommodel__estimator__C'] = [0.1,1,10]
params['selectfrommodel__estimator__penalty']= ['l1','l2']
params['selectfrommodel__estimator__threshold']=['median','mean','1.25*mean','0.75*mean']

grid = GridSearchCV(pipe,params,cv=5,scoring='recall')
%time grid.fit(X_train,y_train);

Unfortunately, threshold does not appear in the parameter list (pipe.named_steps.selectfrommodel.estimator.get_params().keys()), so for GridSearchCV to work I have to comment out this line:

params['selectfrommodel__estimator__threshold']=['median','0.75*mean']

Is it possible to fine-tune the threshold?

Solution

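Because the importance is based on the mean of the absolute values of the coefficients. If you average the signed values instead, positive and negative coefficients partly cancel, so the mean importance comes out lower.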

I put together an example to demonstrate the behavior:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
X = [[ 0.87,-1.34,0.31 ],[-2.79,-0.02,-0.85 ],[-1.34,-0.48,-2.55 ],[ 1.92,1.48,0.65 ]]
y = [0,1,0,1]  # one label per row of X (4 samples)
selector = SelectFromModel(estimator=LogisticRegression(),threshold="1.25*mean").fit(X,y)
print(selector.estimator_.coef_)
print(selector.threshold_) # 0.6905659148858644
# note here the absolute transformation before the mean
print(abs(selector.estimator_.coef_).mean()*1.25) # 0.6905659148858644

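Also note that feature importance is a result of model training, not something you can define a priori. That is why the numeric value of the threshold is not known in advance: it is only obtained after training.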

@Nikaido's answer is exactly right on the first part: the abs() was missing. That is, abs(selector.estimator_.coef_).mean()*1.25 equals selector.threshold_.
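To make the difference concrete, here is a minimal sketch that reuses the toy data from the example above (with y assumed to be [0, 1, 0, 1], one label per row). It compares the signed and absolute versions of the calculation, and shows that the threshold you pass in stays a string while threshold_ is the number computed after fitting. The variable names signed and absolute are just for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy data; y is an assumption (one label per row of X).
X = [[0.87, -1.34, 0.31], [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55], [1.92, 1.48, 0.65]]
y = [0, 1, 0, 1]

selector = SelectFromModel(LogisticRegression(), threshold="1.25*mean").fit(X, y)
coef = selector.estimator_.coef_

signed = coef.mean() * 1.25            # what the question computed
absolute = np.abs(coef).mean() * 1.25  # what SelectFromModel uses

print(selector.threshold)    # the string that was passed in
print(selector.threshold_)   # the number resolved after fitting
print(signed, absolute)      # only `absolute` matches threshold_
print(selector.get_support())  # features with |coef| >= threshold_
```

The last line prints a boolean mask: a feature is kept when the absolute value of its coefficient is at least threshold_.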

As for the second part, it is indeed possible, and the right way to do it is to change this line:

params['selectfrommodel__estimator__threshold']=['median','mean','1.25*mean','0.75*mean']

to this one:

params['selectfrommodel__threshold']=['median','0.75*mean']

This works because threshold is a parameter of SelectFromModel, not of its estimator. To get the full list of hyperparameters you can tune at each of the two levels, use:

pipe.named_steps.selectfrommodel.get_params().keys() 
pipe.named_steps.selectfrommodel.estimator.get_params().keys()
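
Putting it together, a grid search over the threshold might look like the sketch below. The pipeline from the question isn't shown, so this one is an assumption: it is built with make_pipeline (which gives the lowercase step names used in the parameter keys), runs on synthetic data from make_classification rather than the Kaggle dataset, and uses small, arbitrary grid values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training data (assumption).
X_train, y_train = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# make_pipeline names each step after its lowercased class name.
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression()),
    GradientBoostingClassifier(n_estimators=20, random_state=0),
)

params = {
    # threshold belongs to SelectFromModel itself ...
    "selectfrommodel__threshold": ["median", "mean", "1.25*mean", "0.75*mean"],
    # ... while C belongs to the wrapped LogisticRegression.
    "selectfrommodel__estimator__C": [0.1, 1, 10],
}

grid = GridSearchCV(pipe, params, cv=3, scoring="recall")
grid.fit(X_train, y_train)
print(grid.best_params_)
```

The key point is the nesting of the parameter names: one double underscore reaches into the selectfrommodel step, and a second one (via estimator) reaches into the model it wraps.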
