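Calculating adjusted R² with GridSearchCV

Problem

I am trying to use GridSearchCV with multiple scoring metrics, one of which is adjusted R². As far as I know, the latter is not implemented in scikit-learn. I would like to confirm whether my approach is a correct way to implement adjusted R².

With scores that are implemented in scikit-learn (MAE and R² in the example below), I can do something like the following (in this dummy example I am ignoring good practices such as feature scaling and a suitable number of iterations for SVR):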

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error

# generate input
X = np.random.normal(75, 10, (1000, 2))
y = np.random.normal(200, 20, 1000)

# perform grid search
params = {"degree": [2, 3], "max_iter": [10]}
grid = GridSearchCV(SVR(), param_grid=params, scoring={"MAE": "neg_mean_absolute_error", "R2": "r2"}, refit="R2")
grid.fit(X, y)

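The example above reports MAE and R² for each cross-validation split and refits the best parameters according to the best R². Following this example, I tried to do the same with a custom scorer: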

def adj_r2(true,pred,p=2):
    '''p is the number of independent variables and n is the sample size'''
    n = true.size
    return 1 - ((1 - r2_score(true,pred)) * (n - 1))/(n-p-1)

scorer = make_scorer(adj_r2)
grid = GridSearchCV(SVR(), param_grid=params, scoring={"MAE": "neg_mean_absolute_error", "adj R2": scorer}, refit="adj R2")
grid.fit(X, y)

#print(grid.cv_results_)

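The code above appears to produce values for the "adj R2" scorer. I have two questions:

Is the approach used above technically correct?
If it is, how can I define p (the number of independent variables) dynamically? As you can see, I had to hard-code a default value when defining the function, but I would like to be able to set p from within GridSearchCV.

Solution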

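First, scikit-learn does not currently provide an adjusted R² score, because the API for metric functions only takes y_true and y_pred. A metric function therefore has no way to measure the dimensions of X.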
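For reference, the quantity computed by the code in this thread, with n the number of samples being scored and p the number of features (columns of X), can be written as:

```latex
\bar{R}^2 = 1 - \frac{(1 - R^2)\,(n - 1)}{n - p - 1}
```

Both n and p come from the shape of X, which is exactly the piece of information a (y_true, y_pred) metric function never receives.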

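We can work around this for the SearchCV classes.

The scorer needs to have the signature (estimator, X, y). This is the kind of callable that make_scorer produces.

Here is a simpler version that wraps r2_score directly.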
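To make the signature concrete, here is a minimal sketch comparing a scorer built by make_scorer with a hand-written callable of the same shape. The toy data, the LinearRegression model, and the name shape_aware_r2 are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, r2_score

# toy data (assumption): 50 samples, 2 features, nearly linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# make_scorer turns metric(y_true, y_pred) into scorer(estimator, X, y)
wrapped = make_scorer(r2_score)

# hypothetical hand-written scorer with the same signature;
# unlike the metric function, it receives X and could inspect X.shape
def shape_aware_r2(estimator, X, y):
    return r2_score(y, estimator.predict(X))

a = wrapped(model, X, y)
b = shape_aware_r2(model, X, y)
print(a, b)  # the two values should agree
```

Because the hand-written callable receives X directly, it is the natural place to read n and p from X.shape.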

def adj_r2(estimator, X, y_true):
    # n samples and p features are read from X, so p is never hard-coded
    n, p = X.shape
    pred = estimator.predict(X)
    return 1 - ((1 - r2_score(y_true, pred)) * (n - 1)) / (n - p - 1)

grid = GridSearchCV(SVR(),param_grid=params,scoring={"MAE": "neg_mean_absolute_error","adj R2": adj_r2},refit="adj R2") 
grid.fit(X,y)
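As a rough, self-contained sketch of how the results could be inspected afterwards: the data, the linear-kernel SVR, and the C grid below are assumptions made to keep the example quick, not part of the original question. In a multi-metric search, cv_results_ stores each scorer under a key of the form "mean_test_<name>".

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def adj_r2(estimator, X, y_true):
    # n and p come from the data passed to the scorer
    n, p = X.shape
    pred = estimator.predict(X)
    return 1 - ((1 - r2_score(y_true, pred)) * (n - 1)) / (n - p - 1)

# toy data (assumption): 200 samples, 2 features, linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

grid = GridSearchCV(
    SVR(kernel="linear"),
    param_grid={"C": [0.1, 1.0]},  # illustrative grid, not from the question
    scoring={"R2": "r2", "adj R2": adj_r2},
    refit="adj R2",
)
grid.fit(X, y)

cv = grid.cv_results_
print(cv["mean_test_R2"])      # plain R2, averaged over folds
print(cv["mean_test_adj R2"])  # adjusted R2, averaged over folds
print(grid.best_params_)       # chosen according to "adj R2"
```

Note that during cross-validation the scorer is called on each validation fold, so n is the fold size rather than the size of the full data set.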
