Displaying the selected features after GridSearch

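Question

I am using GridSearchCV to do feature selection (SelectKBest) for a linear regression. The result (from .best_params_) shows that 10 features were selected, but I am not sure how to display which features those are.

The code is pasted below. I am using a pipeline because the next models will also need hyperparameter selection. x_train is a DataFrame with 12 columns; I cannot share it because of data restrictions.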

cv_folds = KFold(n_splits=5,shuffle=False)
steps = [('feature_selection',SelectKBest(mutual_info_regression,k=3)),('regr',LinearRegression())]
pipe = Pipeline(steps)

search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]

clf = GridSearchCV(pipe,search_space,scoring='neg_mean_squared_error',cv=5,verbose=0)
clf = clf.fit(x_train,y_train)

print(clf.best_params_)

Answer

You can access information about the feature_selection step like this:

<GridSearch_model_variable>.best_estimator_.named_steps[<feature_selection_step>]

So, in your case, it would be:

print(clf.best_estimator_.named_steps['feature_selection'])
#Output: SelectKBest(k=8,score_func=<function mutual_info_regression at 0x13d37b430>)

Next, you can use the get_support function to get a boolean mask of the selected features:

print(clf.best_estimator_.named_steps['feature_selection'].get_support())
# Output: array([ True,False,True,True])
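To make the mask concrete, here is a minimal sketch on synthetic data. The data, the variable names, and the use of f_regression (swapped in for mutual_info_regression so the result is deterministic) are assumptions for illustration, not part of the question. It also shows get_support(indices=True), which returns the positions of the selected columns instead of a boolean mask.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only columns 0 and 2 drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 2] + 0.1 * rng.normal(size=200)

# f_regression is used here instead of mutual_info_regression for determinism.
selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)

mask = selector.get_support()                  # one boolean per input column
indices = selector.get_support(indices=True)   # integer positions of kept columns

print(mask)
print(indices)
```

The mask always has one entry per input column, and the number of True entries equals k.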

Now apply this mask to the original columns:

data_columns = X.columns # List of columns in your dataset

# This is the original list of columns
print(data_columns)
# Output: ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT']

# Now print the selected columns
print(data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()])
# Output: ['CRIM','LSTAT']

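So only a subset of the 13 features is kept: the number of True entries in the mask equals the best k reported by best_params_. The outputs in the comments above are samples and their counts do not all agree with each other, so check the values from your own run.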

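Here is the complete code for the Boston dataset (load_boston was removed in scikit-learn 1.2, so this code needs an older scikit-learn release):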

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest,mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

boston_dataset = load_boston()
X = pd.DataFrame(boston_dataset.data,columns=boston_dataset.feature_names)
y = boston_dataset.target

cv_folds = KFold(n_splits=5,shuffle=False)
steps = [('feature_selection',SelectKBest(mutual_info_regression,k=3)),('regr',LinearRegression())]

pipe = Pipeline(steps)

search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]

clf = GridSearchCV(pipe,search_space,scoring='neg_mean_squared_error',cv=cv_folds,verbose=0)
clf = clf.fit(X,y)

print(clf.best_params_)

data_columns = X.columns
selected_features = data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()]

print(selected_features)
# Output : Index(['CRIM','LSTAT'],dtype='object')
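For readers on a scikit-learn release that no longer ships load_boston, the same workflow can be sketched on the bundled diabetes dataset. The choice of load_diabetes, the fixed random_state passed through functools.partial, and the k range derived from the column count are assumptions made for this illustration; the selected features will differ from the Boston output above.

```python
from functools import partial

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Load the features as a DataFrame so column names are available.
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Fix the seed of the mutual information estimate so runs are repeatable.
score_func = partial(mutual_info_regression, random_state=0)

pipe = Pipeline([
    ("feature_selection", SelectKBest(score_func, k=3)),
    ("regr", LinearRegression()),
])

# Try every k from 1 up to the number of columns.
search_space = {"feature_selection__k": list(range(1, X.shape[1] + 1))}

clf = GridSearchCV(pipe, search_space, scoring="neg_mean_squared_error",
                   cv=KFold(n_splits=5, shuffle=False))
clf.fit(X, y)

best_k = clf.best_params_["feature_selection__k"]
mask = clf.best_estimator_.named_steps["feature_selection"].get_support()
selected_features = X.columns[mask]

print(best_k)
print(list(selected_features))
```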
