Problem description
I am using GridSearchCV to do feature selection (SelectKBest) for a linear regression. The results show that 10 features were selected (via .best_params_), but I am not sure how to display which features those are.
The code is pasted below. I am using a pipeline because the next model will also need hyperparameter selection. x_train is a DataFrame with 12 columns that I cannot share due to data restrictions.
cv_folds = KFold(n_splits=5,shuffle=False)
steps = [('feature_selection',SelectKBest(mutual_info_regression,k=3)),('regr',LinearRegression())]
pipe = Pipeline(steps)
search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]
clf = GridSearchCV(pipe,search_space,scoring='neg_mean_squared_error',cv=5,verbose=0)
clf = clf.fit(x_train,y_train)
print(clf.best_params_)
Solution
You can access information about the feature_selection step like this:
<GridSearch_model_variable>.best_estimator_.named_steps[<feature_selection_step>]
So in your case it would be:
print(clf.best_estimator_.named_steps['feature_selection'])
# Output: SelectKBest(k=2, score_func=<function mutual_info_regression at 0x13d37b430>)
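Beyond printing the fitted selector, you can also inspect the score each feature received via the selector's scores_ attribute (standard SelectKBest API). Here is a minimal, self-contained sketch; the column names ('a', 'b', 'c') and the toy data are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Toy data: the target depends on 'a' and 'c' only; 'b' is pure noise
rng = np.random.RandomState(0)
X = pd.DataFrame({'a': rng.rand(200), 'b': rng.rand(200), 'c': rng.rand(200)})
y = 3 * X['a'] + 2 * X['c']

selector = SelectKBest(mutual_info_regression, k=2).fit(X, y)
print(dict(zip(X.columns, selector.scores_)))  # mutual-information score per column
print(X.columns[selector.get_support()])       # the k columns that were kept
```

The scores_ array is in the same order as the input columns, so zipping it with X.columns makes it easy to see why a feature was kept or dropped.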
Next, you can use get_support to obtain a boolean mask of the selected features:
print(clf.best_estimator_.named_steps['feature_selection'].get_support())
# Output: array([ True, False, False, False, False, False, False, False, False, False, False, False,  True])
Now apply this mask to the original columns:
data_columns = X.columns # List of columns in your dataset
# This is the original list of columns
print(data_columns)
# Output: ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']
# Now print the selected columns
print(data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()])
# Output: Index(['CRIM', 'LSTAT'], dtype='object')
So you can see that only 2 of the 13 features were selected (k=2 was the best case on my data).
Here is the full code for the Boston dataset:
import pandas as pd
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest,mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
boston_dataset = load_boston()
X = pd.DataFrame(boston_dataset.data,columns=boston_dataset.feature_names)
y = boston_dataset.target
cv_folds = KFold(n_splits=5,shuffle=False)
steps = [('feature_selection',SelectKBest(mutual_info_regression,k=3)),('regr',LinearRegression())]
pipe = Pipeline(steps)
search_space = [{'feature_selection__k': list(range(1, 14))}]  # the Boston dataset has 13 features
clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=cv_folds, verbose=0)
clf = clf.fit(X,y)
print(clf.best_params_)
data_columns = X.columns
selected_features = data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()]
print(selected_features)
# Output : Index(['CRIM','LSTAT'],dtype='object')