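Problem description
I have been using sklearn's Pipeline for a classification problem, combining a preprocessing step (scaling), Logistic Regression, and a cross-validation step (GridSearchCV).
Here is the simplified code: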
# import dependencies
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
# scaler and encoder options
scaler = StandardScaler()  # there are 3 options that I want to try
encoder = OneHotEncoder()  # only one option, no need to GridSearch it
# use ColumnTransformer to apply different preprocesses to numerical and categorical columns
preprocessor = ColumnTransformer(transformers=[
    ('categorical', encoder, cat_columns),
    ('numerical', scaler, num_columns),
])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('log_reg', LogisticRegression()),
])
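What I want is to try different scaling methods (e.g. standard scaling, robust scaling, and so on) and then pick the one that yields the best metric (i.e. accuracy). However, I don't know how to do this with GridSearchCV: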
from sklearn.model_selection import GridSearchCV
# set params combination I want to try
scaler_options = {'numerical': [StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid=scaler_options, cv=5)
# fit the data
grid_cv.fit(X_train, y_train)
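I know the code above won't work, specifically because of what I passed as param_grid. I realize GridSearchCV cannot handle scaler_options as I set it. Why? Because 'numerical' is not a hyperparameter of the pipeline (unlike 'log_reg__C', which is a hyperparameter of LogisticRegression() that GridSearchCV can access). Instead, it is a component of the ColumnTransformer that I nested inside full_pipeline.
So the main question is: how do I get GridSearchCV to test all of my scaler options automatically, given that the scaler is a component of a sub-pipeline (i.e. the ColumnTransformer above)?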
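The naming issue described above can be checked directly. The following is a minimal sketch, assuming hypothetical column names in place of cat_columns and num_columns, that lists the parameter names the pipeline exposes through get_params(); these are the keys GridSearchCV accepts in param_grid.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, standing in for cat_columns / num_columns.
cat_columns = ["color"]
num_columns = ["height", "weight"]

preprocessor = ColumnTransformer(transformers=[
    ("categorical", OneHotEncoder(), cat_columns),
    ("numerical", StandardScaler(), num_columns),
])
full_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("log_reg", LogisticRegression()),
])

# get_params() works on an unfitted pipeline and returns every settable name.
param_names = sorted(full_pipeline.get_params().keys())
print("'numerical' is a valid key:", "numerical" in param_names)
print("'log_reg__C' is a valid key:", "log_reg__C" in param_names)

# Names that mention the numerical slot, shown with their full nested path.
for name in param_names:
    if "numerical" in name:
        print(name)
```

Each nested name is built by joining the step name, the transformer name, and the parameter name with double underscores, which is why a bare 'numerical' key is not recognized.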
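Solution
As you suggested, you can create a class whose __init__() takes one parameter: the scaler you want to use.
# needs: from sklearn.base import BaseEstimator, TransformerMixin, clone
class ScalerSelector(TransformerMixin, BaseEstimator):
    # mixins are listed before BaseEstimator, as scikit-learn recommends
    def __init__(self, scaler=StandardScaler()):
        self.scaler = scaler
    def fit(self, X, y=None):
        # fit a clone so the scaler passed in as a parameter is never mutated
        self.scaler_ = clone(self.scaler).fit(X)
        return self  # fit must return the transformer itself
    def transform(self, X, y=None):
        return self.scaler_.transform(X)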
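You can then specify, in your grid search parameters, which scaler the class should be initialized with.
I wrote the following complete example, which you can run to test it; I hope it helps: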
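# import dependencies
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

class ScalerSelector(TransformerMixin, BaseEstimator):
    # wrapper whose only hyperparameter is the scaler to delegate to
    def __init__(self, scaler=StandardScaler()):
        self.scaler = scaler
    def fit(self, X, y=None):
        # fit a clone so the scaler passed in as a parameter is never mutated
        self.scaler_ = clone(self.scaler).fit(X)
        return self
    def transform(self, X, y=None):
        return self.scaler_.transform(X)
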
# load the data as a DataFrame so columns can be selected by name
dataset = load_breast_cancer()
target = dataset["target"]
data = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
col_names = data.columns.tolist()
# scaler and encoder options
my_scaler = ScalerSelector()
preprocessor = ColumnTransformer(transformers=[
    ('numerical', my_scaler, col_names),
])
# combine the preprocessor with LogisticRegression() using Pipeline
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('log_reg', LogisticRegression()),
])
# set params combination I want to try
# key = <pipeline step>__<ColumnTransformer name>__<ScalerSelector parameter>
scaler_options = {'preprocessor__numerical__scaler': [StandardScaler(), RobustScaler(), MinMaxScaler()]}
# initialize GridSearchCV using full_pipeline as final estimator
grid_cv = GridSearchCV(full_pipeline, param_grid=scaler_options)
# fit the data
grid_cv.fit(data, target)
# best params :
print(grid_cv.best_params_)
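To compare the scalers rather than only read off the winner, the per-candidate scores can be pulled from cv_results_. This is a rough, self-contained sketch that repeats the wrapper idea from above; the max_iter=1000 setting, the accuracy scoring, and the summary layout are illustrative assumptions, not part of the original answer.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler


class ScalerSelector(TransformerMixin, BaseEstimator):
    """Wrapper that delegates to whichever scaler it is given."""

    def __init__(self, scaler=StandardScaler()):
        self.scaler = scaler

    def fit(self, X, y=None):
        # Fit a clone so the parameter object itself stays unfitted.
        self.scaler_ = clone(self.scaler).fit(X)
        return self

    def transform(self, X, y=None):
        return self.scaler_.transform(X)


X, y = load_breast_cancer(return_X_y=True, as_frame=True)

pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("numerical", ScalerSelector(), X.columns.tolist()),
    ])),
    # max_iter raised as an assumption, to reduce convergence warnings.
    ("log_reg", LogisticRegression(max_iter=1000)),
])

param_key = "preprocessor__numerical__scaler"
grid = GridSearchCV(
    pipe,
    param_grid={param_key: [StandardScaler(), RobustScaler(), MinMaxScaler()]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)

# One row per candidate scaler, best rank first.
results = pd.DataFrame(grid.cv_results_)
summary = results[
    ["param_" + param_key, "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
print("best:", grid.best_params_)
```

Looking at std_test_score next to mean_test_score helps judge whether the difference between scalers is larger than the fold-to-fold variation.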
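Alternatively, you can achieve what you want without creating a custom transformer. You can even pass 'passthrough' in param_grid to try the scenario where no scaling at all is applied in that step.
In this example, suppose we want to investigate whether the model does better when a Scaler transformer is applied to the numerical features, num_features.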
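# `selector` is assumed to be make_column_selector, whose signature matches the calls below
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler
# `train` is assumed to be a pandas DataFrame that includes a 'target' column
cat_features = selector(dtype_exclude='number')(train.drop('target', axis=1))
num_features = selector(dtype_include='number')(train.drop('target', axis=1))
cat_preprocessor = Pipeline(steps=[
    ('oh', OneHotEncoder(handle_unknown='ignore')),
    ('ss', StandardScaler()),
])
num_preprocessor = Pipeline(steps=[
    ('pt', PowerTransformer(method='yeo-johnson')),
    ('ss', StandardScaler()),  # Create a place holder for your test here !!!
])
preprocessor = ColumnTransformer(transformers=[
    ('cat', cat_preprocessor, cat_features),
    ('num', num_preprocessor, num_features),
])
model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', RidgeClassifier()),
])
X = train.drop('target', axis=1)
y = train['target']
# Keys follow <pipeline step>__<ColumnTransformer name>__<inner step>.
# As written, this grid targets the 'ss' step of the categorical branch;
# use 'prep__num__ss' to test the numerical branch instead.
param_grid = {
    'prep__cat__ss': ['passthrough', StandardScaler(with_mean=False)],
}
gs = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=-1,
    cv=2,
)
gs.fit(X, y)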
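For a version that runs without the `train` DataFrame, here is a hedged sketch of the same placeholder idea applied to the numerical branch. The synthetic data, the column names, and the choice of candidate scalers are all assumptions made for illustration; only the 'passthrough' mechanism and the step-name addressing come from the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, RobustScaler, StandardScaler

# Synthetic stand-in for `train` (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.lognormal(10, 1, n),
    "city": rng.choice(["A", "B", "C"], n),
})
train["target"] = (train["age"] + rng.normal(0, 5, n) > 40).astype(int)

X = train.drop("target", axis=1)
y = train["target"]

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), selector(dtype_exclude="number")),
    # The 'ss' step is only a placeholder; the grid below replaces it.
    ("num", Pipeline(steps=[("ss", StandardScaler())]), selector(dtype_include="number")),
])
model = Pipeline(steps=[("prep", preprocessor), ("clf", RidgeClassifier())])

# 'passthrough' skips the step entirely, i.e. no scaling of numerical features.
param_grid = {
    "prep__num__ss": ["passthrough", StandardScaler(), RobustScaler(), MinMaxScaler()],
}
gs = GridSearchCV(model, param_grid=param_grid, scoring="roc_auc", cv=2)
gs.fit(X, y)

print(gs.best_params_)
for params, score in zip(gs.cv_results_["params"], gs.cv_results_["mean_test_score"]):
    print(params["prep__num__ss"], round(float(score), 4))
```

Wrapping the scaler in a one-step Pipeline gives it a stable name ('ss') that the grid can address, which is what makes the custom wrapper class unnecessary in this approach.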