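Why do implementations of the coefficient of determination R² produce different results?

Problem description

While trying to implement a Python function that computes the coefficient of determination R², I noticed that I get wildly different results depending on which calculation sequence I use.

The Wikipedia page on R² gives a seemingly very clear explanation of how to calculate R². My naive interpretation of what the Wiki page says is as follows: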

import numpy as np

def calcR2_wikipedia(y,yhat):
    # Mean value of the observed data y.
    y_mean = np.mean(y)
    # Total sum of squares.
    SS_tot = np.sum((y - y_mean)**2)
    # Residual sum of squares.
    SS_res = np.sum((y - yhat)**2)
    # Coefficient of determination.
    R2 = 1.0 - (SS_res / SS_tot)
    return R2

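When I try this method with a target vector y and a vector of modeled estimates yhat, this function gives an R² value of -0.00301.

However, the accepted answer to this stackoverflow post discussing how to calculate R² gives the following definition: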

def calcR2_stackOverflow(y,yhat):
    sst = np.sum((y - np.mean(y))**2)
    SSReg = np.sum((yhat - np.mean(y))**2)
    R2 = SSReg/sst
    return R2

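Using this method with the same y and yhat vectors as before, I now get an R² of 0.319.

Moreover, in the same stackoverflow post, many people seem to favor computing R² with the scipy module, like this: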

import scipy.stats
slope,intercept,r_value,p_value,std_err = scipy.stats.linregress(yhat,y)
R2 = r_value**2

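In my case, this gives 0.261.

So my question is: why do R² values computed from seemingly well-accepted sources differ so radically from one another? What is the correct way to compute R² between two vectors?

Solution

Definitions

This is an abuse of notation that often leads to misunderstanding. You are comparing two different coefficients: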

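Coefficient of determination (usually denoted R^2), which can be used for any OLS regression, not only linear regression (OLS is linear with respect to the fit parameters, not the function itself);
Pearson Correlation Coefficient (usually denoted r, or r^2 when squared), which is used for linear regression only.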
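To make the distinction concrete, the two quantities can be written out as below. This is a sketch using the usual textbook notation, which is an assumption here rather than something taken from the posts above: y_i are the observations, \hat{y}_i the predictions, and bars denote sample means.

```latex
% Coefficient of determination (the "Wikipedia" definition)
R^2 = 1 - \frac{SS_{res}}{SS_{tot}},
\qquad
SS_{res} = \sum_i (y_i - \hat{y}_i)^2,
\qquad
SS_{tot} = \sum_i (y_i - \bar{y})^2

% Pearson correlation coefficient between observations and predictions
r = \frac{\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
         {\sqrt{\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{\hat{y}})^2}}

% Only when \hat{y} comes from a least-squares fit that is linear in its
% parameters and includes an intercept does the decomposition
SS_{tot} = SS_{reg} + SS_{res},
\qquad
SS_{reg} = \sum_i (\hat{y}_i - \bar{y})^2
% hold, which gives
R^2 = \frac{SS_{reg}}{SS_{tot}} = r^2
```

For an arbitrary yhat vector the decomposition generally fails, so the three expressions need not agree.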

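If you read the introduction to the coefficient of determination on the Wikipedia page carefully, you will see that this is discussed there. It starts as follows:

There are several definitions of R2 that are only sometimes equivalent.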
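A minimal sketch of what "only sometimes equivalent" can mean in practice. The helper names (r2_residual, r2_explained, r_squared), the synthetic data, and the constant shift of 5.0 are all assumptions made for illustration; they are not taken from the question. The three formulas agree when yhat comes from an OLS fit with an intercept and diverge when the same predictions are shifted by a constant.

```python
import numpy as np

def r2_residual(y, yhat):
    # 1 - SS_res / SS_tot (the "Wikipedia" form from the question)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    ss_res = np.sum((y - yhat) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_explained(y, yhat):
    # SS_reg / SS_tot (the form from the Stack Overflow answer)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    ss_reg = np.sum((yhat - np.mean(y)) ** 2)
    return ss_reg / ss_tot

def r_squared(y, yhat):
    # Squared Pearson correlation between predictions and observations
    return np.corrcoef(yhat, y)[0, 1] ** 2

# Hypothetical data: a noisy straight line
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 201)
y = 1 + 2 * x + rng.standard_normal(x.size)

# Case 1: predictions from an OLS straight-line fit with intercept
slope, intercept = np.polyfit(x, y, 1)
yhat_ols = intercept + slope * x
ols_scores = [f(y, yhat_ols) for f in (r2_residual, r2_explained, r_squared)]

# Case 2: the same predictions shifted by a constant (no longer an OLS fit)
yhat_bad = yhat_ols + 5.0
bad_scores = [f(y, yhat_bad) for f in (r2_residual, r2_explained, r_squared)]

print("OLS fit:    ", ols_scores)  # all three agree
print("shifted fit:", bad_scores)  # all three disagree
```

In the shifted case the residual-based score drops below zero, the explained-variance ratio exceeds one, and the squared correlation does not change at all, which is the kind of disagreement described in the question.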

MCVE

You can confirm that the classical implementations of those scores return the expected results:

import numpy as np
import scipy.stats
from sklearn import metrics

np.random.seed(12345)
x = np.linspace(-3,3,1001)
yh = np.polynomial.polynomial.polyval(x,[1,2])
e = np.random.randn(x.size)
yn = yh + e

Your function calcR2_wikipedia then returns the coefficient of determination (0.9265536406736125), which can be confirmed because it matches the result of sklearn.metrics.r2_score:

metrics.r2_score(yn,yh) # 0.9265536406736125

On the other hand, scipy.stats.linregress returns the correlation coefficient (valid for linear regression only):

slope,intercept,r_value,p_value,std_err = scipy.stats.linregress(yh,yn)
r_value # 0.9625821384210018

You can cross-check this against its definition:

C = np.cov(yh,yn)
C[1,0]/np.sqrt(C[0,0]*C[1,1]) # 0.9625821384210017
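One way to see why r_value**2 can look far better than the coefficient of determination: linregress(yhat, y) fits a new straight line that maps yhat onto y, so the squared correlation equals the R² of those recalibrated predictions, not of yhat itself. The sketch below illustrates this with hypothetical, deliberately miscalibrated predictions; the data and the helper name r2_residual are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def r2_residual(y, yhat):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    ss_res = np.sum((y - yhat) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations and weak, biased predictions
rng = np.random.default_rng(1)
y = rng.standard_normal(500)
yhat = 0.5 * y + rng.standard_normal(500) + 1.0

# linregress fits y ~ intercept + slope * yhat by least squares
fit = stats.linregress(yhat, y)
yhat_recal = fit.intercept + fit.slope * yhat

r2_raw = r2_residual(y, yhat)          # scores the predictions as given
r2_recal = r2_residual(y, yhat_recal)  # scores the recalibrated predictions
r_sq = fit.rvalue ** 2                 # squared Pearson correlation

print(r2_raw, r2_recal, r_sq)
```

Here r2_recal and r_sq coincide, while r2_raw is negative because the raw predictions do worse than simply predicting the mean of y. To judge the predictions exactly as they were produced, the residual-based coefficient of determination is the relevant score.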
