Brute-force model selection with sklearn

Problem description

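I have a feature matrix (X_train_balanced) and a target vector (y_train_balanced) for a classification task with 3 classes. To perform model selection and hyperparameter tuning, I plan to run sklearn's GridSearchCV on each of the models I want to compare (LR, SVC, RF and KNN).

My idea is then to compare the best result that GridSearchCV produces for each model and pick the best model overall.

I would like to know whether this approach makes sense, and whether the code I wrote for the task is correct.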

Model search space

models = {
    'LogisticRegression': LogisticRegression(),
    'SVM': SVC(),
    'RandomForestClassifier': RandomForestClassifier(),
    'KNN': KNeighborsClassifier(),
}

Hyperparameter search space

hyper = {
    'LogisticRegression': {
        'penalty': ['l2'], 'C': np.logspace(0, 4, 10),
        'solver': ['lbfgs', 'liblinear', 'saga'],
        'class_weight': ['balanced'], 'random_state': [0]},
    'SVM': {
        'C': [0.01, 0.1, 1, 10, 100, 1000],
        'gamma': [1, 0.01, 0.001, 0.0001],
        'kernel': ['rbf', 'linear']},
    'RandomForestClassifier': {
        'max_depth': [2, 3, 4],
        'max_features': [2, 'auto', 'sqrt'],  # 'auto' is only accepted by scikit-learn < 1.3
        'n_estimators': [10, 500]},
    'KNN': {
        'n_neighbors': [5, 15, 20],
        'weights': ['uniform', 'distance']},
}
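Since the search is exhaustive, it can help to know how many fits each grid implies before running it. The following is a minimal sketch, assuming reduced copies of the SVM and KNN grids above (the names example_grids, n_folds and grid_sizes are hypothetical); it uses sklearn's ParameterGrid to count the candidate settings.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical copies of two of the grids from the question
example_grids = {
    'SVM': {
        'C': [0.01, 0.1, 1, 10, 100, 1000],
        'gamma': [1, 0.01, 0.001, 0.0001],
        'kernel': ['rbf', 'linear'],
    },
    'KNN': {
        'n_neighbors': [5, 15, 20],
        'weights': ['uniform', 'distance'],
    },
}

n_folds = 5  # matches cv=5 in the question

# ParameterGrid enumerates every combination, so len() is the number of candidates
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in example_grids.items()}

for name, n_candidates in grid_sizes.items():
    # Each candidate is fitted once per fold (the final refit is not counted here)
    print(name, n_candidates, 'candidates,', n_candidates * n_folds, 'cross-validation fits')
```

For these two grids the counts are 48 and 6 candidates, so the SVM search dominates the run time.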

Cross-validation for each model

for model_name in models.keys():

    # Model selection
    clf = models[model_name]
    params = hyper[model_name]

    # Pipeline (standardization + classifier); note that it is built here,
    # but the grid search below is run on clf, not on pipe
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', clf)])

    # Grid search with 5-fold cross-validation
    grid = GridSearchCV(estimator=clf, param_grid=params, cv=5, return_train_score=True)
    grid.fit(X_train_balanced, y_train_balanced)

    # Cross-validation results for the best parameter setting
    best_param = grid.best_params_
    best_param_test_score_mean = grid.cv_results_['mean_test_score'][grid.best_index_]
    best_param_test_score_std = grid.cv_results_['std_test_score'][grid.best_index_]
    best_param_train_score_mean = grid.cv_results_['mean_train_score'][grid.best_index_]
    best_param_train_score_std = grid.cv_results_['std_train_score'][grid.best_index_]
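In the loop above the pipeline is created but GridSearchCV receives the bare classifier, so the scaler never takes part in the search. If the intent is to standardize inside each cross-validation fold, one possible variant is to pass the pipeline itself and prefix every hyperparameter name with the step name and a double underscore. This is only a sketch: X_demo and y_demo are synthetic stand-ins for X_train_balanced and y_train_balanced, and only the KNN grid is shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (assumption; stands in for the question's training set)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

pipe = Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())])

# Same KNN grid as in the question, without prefixes
knn_params = {'n_neighbors': [5, 15, 20], 'weights': ['uniform', 'distance']}

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
pipe_params = {f'clf__{key}': values for key, values in knn_params.items()}

# The whole pipeline is the estimator, so the scaler is refitted on each training fold
grid = GridSearchCV(estimator=pipe, param_grid=pipe_params, cv=5, return_train_score=True)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
```

With this variant, grid.best_estimator_ is the fitted pipeline (scaler plus classifier) rather than the classifier alone.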

Solution

Your code works as long as you have the required imports:

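import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler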

The approach also makes sense to me.

At the end of your code you can add the lines

print(best_param)
print(grid.best_estimator_)

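to get the best parameters and the best-performing estimator. I ran the code with these changes; for the dataset I tested, for example, the output was: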

{'n_neighbors': 15,'weights': 'uniform'}
KNeighborsClassifier(algorithm='auto',leaf_size=30,metric='minkowski',metric_params=None,n_jobs=None,n_neighbors=15,p=2,weights='uniform')
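To compare the models against each other, as the question proposes, the best score of each search has to be kept rather than overwritten on every iteration. A minimal sketch of that bookkeeping follows, assuming synthetic data and two reduced grids (demo_models, demo_hyper and results are hypothetical names, not part of the original code):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data (assumption; stands in for the question's training set)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

demo_models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
}
demo_hyper = {
    'LogisticRegression': {'C': [0.1, 1, 10]},
    'KNN': {'n_neighbors': [5, 15], 'weights': ['uniform', 'distance']},
}

results = {}
for name, clf in demo_models.items():
    grid = GridSearchCV(estimator=clf, param_grid=demo_hyper[name], cv=5)
    grid.fit(X_demo, y_demo)
    # Keep one entry per model so the searches can be compared afterwards
    results[name] = {
        'best_score': grid.best_score_,        # mean cross-validated score of the best setting
        'best_params': grid.best_params_,
        'best_estimator': grid.best_estimator_,
    }

# Pick the model whose best setting has the highest mean cross-validated score
best_name = max(results, key=lambda n: results[n]['best_score'])
print(best_name, results[best_name]['best_params'], results[best_name]['best_score'])
```

Because the winning score was itself used to make the choice, it tends to be optimistic; evaluating the selected estimator on data that was held out from the whole search gives a fairer estimate.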
