How do I tune the parameters of a RandomForestClassifier inside a pipeline?

Problem description

I am trying to apply RandomForestClassifier() through a pipeline and tune its parameters. This is the dataset I am using: https://www.kaggle.com/gbonesso/enem-2016

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# `enem` is the pandas DataFrame loaded from the dataset linked above.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
rf = RandomForestClassifier()

features = [
    "NU_IDADE", "TP_ESTADO_CIVIL", "NU_NOTA_CN", "NU_NOTA_CH", "NU_NOTA_LC",
    "NU_NOTA_MT", "NU_NOTA_COMP1", "NU_NOTA_COMP2", "NU_NOTA_COMP3",
    "NU_NOTA_COMP4", "NU_NOTA_COMP5", "NU_NOTA_REDACAO",
]

X = enem[features]
y = enem[["IN_TREINEIRO"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42
)

pipeline = make_pipeline(imputer, scaler, rf)

pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

gridsearch = GridSearchCV(
    pipeline, param_grid=pipe_params, cv=3, n_jobs=-1, verbose=1000
)

gridsearch.fit(X_train, y_train)

It seems to work for some parameters, but then I get this error message:

ValueError: Invalid parameter randomforestregressor for estimator Pipeline(steps=[('simpleimputer',SimpleImputer(strategy='median')),('standardscaler',StandardScaler()),('randomforestclassifier',RandomForestClassifier())]). Check the list of available parameters with `estimator.get_params().keys()`.

I also have a second problem: I can't seem to get the CV results. I tried running the following code:

results = pd.DataFrame(gridsearch.cv_results_)
results.sort_values("rank_test_score").head()
score = pipeline.score(X_test,y_test)
score

But I got this error:

AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

Any ideas on how to fix these errors?

Solution

The problem is most likely this dictionary:

pipe_params = {
    "randomforestregressor__n_estimators": [100, 500, 1000],
    "randomforestregressor__max_depth": [1, 5, 10, 25],
    "randomforestregressor__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

The error says that your pipeline has no randomforestregressor parameter. Since you are using RandomForestClassifier, the keys should be:

pipe_params = {
    "randomforestclassifier__n_estimators": [100, 500, 1000],
    "randomforestclassifier__max_depth": [1, 5, 10, 25],
    "randomforestclassifier__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

If you run the suggestion from the error message, pipeline.get_params().keys(), you will see the parameter names the pipeline accepts.
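To make that suggestion concrete, here is a minimal sketch that rebuilds the same three steps as in the question (without the data) and lists the names that can go into a grid. make_pipeline names each step after its lowercased class name, which is where the randomforestclassifier prefix comes from. The variable names step_names and rf_keys are only illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as in the question; make_pipeline derives each step name
# from the lowercased class name.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), RandomForestClassifier()
)

step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Keys that can be used in a parameter grid for the random forest step.
rf_keys = sorted(
    key
    for key in pipeline.get_params()
    if key.startswith("randomforestclassifier__")
)
print(rf_keys)
```

Any key in the grid that is not in pipeline.get_params() triggers the "Invalid parameter" error shown in the question.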

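Nick's answer is absolutely correct and does solve your problem. In your case, you could also build the pipeline with the Pipeline class instead of the make_pipeline function, which lets you name each step yourself. I find this a bit more readable and concise: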

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

Then access the model's parameters using the classifier's step name as a prefix:

param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

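If you want to catch a wrong prefix before starting a long search, one option is to compare the grid keys with what the pipeline accepts. The sketch below is only an illustration: unknown_grid_keys is a hypothetical helper (not part of scikit-learn), and the two grids are made up for the demonstration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def unknown_grid_keys(estimator, param_grid):
    """Return grid keys the estimator does not accept (hypothetical helper)."""
    valid = estimator.get_params().keys()
    return [key for key in param_grid if key not in valid]


pipe = Pipeline([("scaler", StandardScaler()), ("clf", RandomForestClassifier())])

# Keys follow the pattern <step name>__<parameter name>.
good_grid = {"clf__n_estimators": [100, 500], "clf__max_depth": [1, 5]}
# Same mistake as in the question: the prefix names a step that does not exist.
bad_grid = {"randomforestregressor__n_estimators": [100, 500]}

print(unknown_grid_keys(pipe, good_grid))
print(unknown_grid_keys(pipe, bad_grid))
```

An empty list means every key in the grid matches a parameter of the pipeline.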
Here is a complete example based on the iris dataset:

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn import datasets
import numpy as np


# Data preparation
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)

# Build a pipeline object with explicitly named steps
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

# Declare a hyperparameter grid; keys are <step name>__<parameter>
param_grid = {
    "clf__n_estimators": [100, 500, 1000],
    "clf__max_depth": [1, 5, 10, 25],
    "clf__max_features": [*np.arange(0.1, 1.1, 0.1)],
}

# Perform grid search, fit it, and print the score on the test set
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=-1, verbose=1000)
gs.fit(x_train, y_train)
print(gs.score(x_test, y_test))
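As for the second error in the question: cv_results_ only exists after fit has completed, so a search whose fit raised the ValueError never gets that attribute. Also note that GridSearchCV fits clones of the estimator, so the original pipeline object stays unfitted; score the fitted search object instead. Below is a minimal sketch, assuming the iris data and a deliberately small grid chosen only so that it runs quickly:

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid (an assumption for this sketch, not a recommendation).
param_grid = {"clf__n_estimators": [10, 50], "clf__max_depth": [1, 5]}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=3)

# Check whether results exist before the search has been fitted.
print(hasattr(gs, "cv_results_"))

gs.fit(x_train, y_train)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(gs.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "rank_test_score"]].head())

# Score the fitted search (by default it refits the best pipeline on the
# whole training set), not the original pipeline object.
print(gs.best_params_)
print(gs.score(x_test, y_test))
```

With the corrected parameter names, the same pattern should apply to the gridsearch object from the question.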
