How do I correctly use RobustScaler to improve a LinearRegression model?

Problem description

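To improve my linear regression model, I was advised to scale my features with RobustScaler to get better performance. The shapes of my training and validation sets are: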

Train set: (4304,20) (4304,)
Validation set: (1435,20) (1435,)

So I transformed X for both the training and validation sets:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_robust_scaler = scaler.fit_transform(X_train.copy())
X_valid_robust_scaler = scaler.transform(X_valid.copy())
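
For reference, here is a minimal sketch of what the fitted scaler does with its default settings: it subtracts each feature's training median and divides by the training interquartile range, and the validation set reuses those training statistics. The arrays `X_train_demo` and `X_valid_demo` are synthetic stand-ins, not the data from this question.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic stand-ins for X_train / X_valid (assumed data, for illustration only).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_valid_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Fit on the training data only, as in the code above.
scaler_demo = RobustScaler().fit(X_train_demo)

# Default behaviour: subtract the training median, divide by the training IQR.
median = np.median(X_train_demo, axis=0)
q25, q75 = np.percentile(X_train_demo, [25, 75], axis=0)
manual_valid = (X_valid_demo - median) / (q75 - q25)

# Compare the manual computation with the scaler's own transform.
same = np.allclose(scaler_demo.transform(X_valid_demo), manual_valid)
print(same)
```

Because this is an affine transform of each column, it changes the units of the features but not the information they carry.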

Then I fit the model and print its scores with the function print_score():

from sklearn import linear_model

regr_vol_2 = linear_model.LinearRegression()
regr_vol_2.fit(X_train_robust_scaler,y_train)

import numpy as np
import pandas as pd

def rmse(y_pred, y_true):
    # Assumed helper (not shown in the original post): root mean squared error.
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def print_score(m, X_train: pd.DataFrame, X_valid: pd.DataFrame, y_train: pd.Series, y_valid: pd.Series):
    '''Function takes a model and calculates and prints its RMSE values and r²
    scores for train and validation set. Also attaches oob_score for Random
    Forest model.
    Parameters:
    -----------
    (1) m --> given model;
    (2) X_train --> training set of independent features;
    (3) X_valid --> validation set of independent features;
    (4) y_train --> training set of dependent features;
    (5) y_valid --> validation set of dependent features;
    -----------
    Prints scoring values in the following order:
    [training rmse,validation rmse,r² for training set,r² for validation set,oob_score_]
    '''
    # Fixed: the original line dropped y_train and a closing parenthesis.
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)


print_score(regr_vol_2,X_train_robust_scaler,X_valid_robust_scaler,y_train,y_valid)
Output: [training rmse, validation rmse, r² for training set, r² for validation set]
Before: [260.86301672800016,271.8005003802866,0.6184501389479591,0.5976532655109332]
After: [260.8630167262612,271.800437195055,0.6184501389530468,0.5976534525773189]

The two results are practically identical. What am I doing wrong? Should I also apply RobustScaler() to y_train and y_valid? If I do this:

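scaler_y = RobustScaler()
# Convert the Series to a 2-D column array; y_train[:,None] fails on recent pandas versions.
y_train_robust_scaler = scaler_y.fit_transform(y_train.to_numpy().reshape(-1, 1))
y_valid_robust_scaler = scaler_y.transform(y_valid.to_numpy().reshape(-1, 1))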

I get the same result as without it:

Output: [training rmse, validation rmse, r² for training set, r² for validation set]
[260.8630167262612,0.5976534525773189]

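Or should I apply RobustScaler() once to the whole dataset before splitting? If so, how should I do that, given that I impute NaN values after the train/validation split?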

Solution

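Scaling does not affect unpenalized regression. Ordinary least squares with an intercept is invariant to shifting and rescaling the features: the coefficients adjust to compensate, so the predictions, RMSE, and r² stay the same. Scaling can improve solver convergence, but if the model already converges satisfactorily on the raw data, the results will be identical.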
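
The point about unpenalized regression can be checked with a small experiment. The sketch below uses synthetic data (an assumption for illustration, not the dataset from the question) and compares validation predictions with and without RobustScaler, first for plain LinearRegression and then for Ridge, where a penalty is present. The value `alpha=10.0` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import RobustScaler

# Synthetic data with features on very different scales (assumed, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y = X @ np.array([2.0, -0.3, 0.05, 8.0, 1.0]) + rng.normal(scale=3.0, size=500)
X_train, X_valid = X[:400], X[400:]
y_train, y_valid = y[:400], y[400:]

# Fit the scaler on the training set only, then transform both sets.
scaler = RobustScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)

# Unpenalized least squares: predictions agree up to floating-point error.
ols_raw = LinearRegression().fit(X_train, y_train).predict(X_valid)
ols_scaled = LinearRegression().fit(X_train_s, y_train).predict(X_valid_s)
ols_gap = np.max(np.abs(ols_raw - ols_scaled))

# Ridge regression: the penalty depends on coefficient magnitudes,
# so rescaling the features changes the fitted model.
ridge_raw = Ridge(alpha=10.0).fit(X_train, y_train).predict(X_valid)
ridge_scaled = Ridge(alpha=10.0).fit(X_train_s, y_train).predict(X_valid_s)
ridge_gap = np.max(np.abs(ridge_raw - ridge_scaled))

print(f"OLS max prediction difference:   {ols_gap:.2e}")
print(f"Ridge max prediction difference: {ridge_gap:.2e}")
```

The tiny differences in the "Before" and "After" scores in the question are consistent with the first case: they come from floating-point arithmetic, not from a different model. Scaling the features would be expected to matter once a penalty such as ridge or lasso is added.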