RobustScaler in sklearn seems to compute the wrong result

Problem description

I tried RobustScaler in sklearn and found that its result differs from the formula.

The formula for RobustScaler in sklearn is:

Figure 1. The formula to calculate Robustscaler

I have the following matrix:

Figure 2. The test matrix

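I tested the first data point of feature one (first row, first column). The scaled value should be (1-3)/(5.5-1.5) = -0.5. However, sklearn returns -0.67. Does anyone know where the calculation goes wrong?

The code using sklearn is as follows: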

import numpy as np
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0,75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)

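Solution

From the RobustScaler documentation (emphasis added):

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set.

That is, the median and IQR are computed per column, not over the whole array.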
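A minimal sketch of what "per column" means in practice, assuming the last row of the question's matrix is [2, 1, 1, 1] (which is consistent with the printed result). After fitting, the scaler's `center_` and `scale_` attributes hold one median and one IQR per feature:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Matrix from the question; last row assumed to be [2, 1, 1, 1].
x = np.array([[1, 2, 3, 4],
              [4, 5, 6, 7],
              [7, 8, 9, 10],
              [2, 1, 1, 1]], dtype=float)

scaler = RobustScaler(quantile_range=(25.0, 75.0), with_centering=True).fit(x)

# One value per column (feature), not a single value for the whole array.
print(scaler.center_)  # per-column medians
print(scaler.scale_)   # per-column interquartile ranges

# The same statistics computed directly with NumPy along axis 0.
col_medians = np.median(x, axis=0)
q75, q25 = np.percentile(x, [75, 25], axis=0)
col_iqr = q75 - q25
```

Both pairs of arrays should agree, each with four entries, one per column.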

Having clarified that, let's compute the scaled values of the first column manually:

import numpy as np

x1 = np.array([1,4,7,2]) # your 1st column here

q75,q25 = np.percentile(x1,[75,25])
iqr = q75 - q25

x1_med = np.median(x1)

x1_scaled = (x1-x1_med)/iqr
x1_scaled
# array([-0.66666667,0.33333333,1.33333333,-0.33333333])

which is the same as the first column of your own x_new, as computed by scikit-learn:

# your code verbatim:
from sklearn.preprocessing import RobustScaler
x=[[1,2,3,4],[4,5,6,7],[7,8,9,10],[2,1,1,1]]
scaler = RobustScaler(quantile_range=(25.0,75.0),with_centering=True)
x_new = scaler.fit_transform(x)
print(x_new)
# result
[[-0.66666667 -0.375      -0.35294118 -0.33333333]
 [ 0.33333333  0.375       0.35294118  0.33333333]
 [ 1.33333333  1.125       1.05882353  1.        ]
 [-0.33333333 -0.625      -0.82352941 -1.        ]]

np.all(x1_scaled == x_new[:,0])
# True

The same holds for the remaining columns (features): you need to compute the median and IQR of each column separately before scaling it.
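As a rough sketch, the per-column calculation can be done for every feature at once with NumPy broadcasting and then compared against scikit-learn. The last row is again assumed to be [2, 1, 1, 1], and `np.allclose` is used instead of exact equality to tolerate floating-point rounding:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Matrix from the question; last row assumed to be [2, 1, 1, 1].
x = np.array([[1, 2, 3, 4],
              [4, 5, 6, 7],
              [7, 8, 9, 10],
              [2, 1, 1, 1]], dtype=float)

# Column-wise statistics (axis=0), using NumPy's default linear interpolation.
medians = np.median(x, axis=0)
q75, q25 = np.percentile(x, [75, 25], axis=0)

# Broadcasting subtracts each column's median and divides by its IQR.
x_manual = (x - medians) / (q75 - q25)

# RobustScaler defaults: quantile_range=(25.0, 75.0), with_centering=True.
x_sklearn = RobustScaler().fit_transform(x)

print(np.allclose(x_manual, x_sklearn))
```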

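UPDATE (after comment):

As pointed out in the Wikipedia entry on quartiles:

For discrete distributions, there is no universal agreement on selecting the quartile values

See also the relevant reference, Sample quantiles in statistical packages:

There are a large number of different definitions used for sample quantiles in statistical computer packages

Digging into the documentation of np.percentile used here, you'll see that there are at least five (5) different interpolation methods, and not all of them produce the same result (see also the 4 different methods demonstrated in the Wikipedia entry linked above). Here is a quick demonstration of these methods and their results on the x1 data defined above: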

# Note: NumPy >= 1.22 renames the `interpolation` keyword to `method`.
np.percentile(x1, [75, 25]) # interpolation='linear' by default
# array([4.75, 1.75])

np.percentile(x1, [75, 25], interpolation='lower')
# array([4, 1])

np.percentile(x1, [75, 25], interpolation='higher')
# array([7, 2])

np.percentile(x1, [75, 25], interpolation='midpoint')
# array([5.5, 1.5])

np.percentile(x1, [75, 25], interpolation='nearest')
# array([4, 2])

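Apart from the fact that no two methods produce the same pair of quartiles, it should also be clear that the definition you used in your own calculation corresponds to interpolation='midpoint', while the default NumPy method is interpolation='linear'. As Ben Reiniger correctly points out in the comments below, what the RobustScaler source code actually uses is np.nanpercentile (a variant of the np.percentile I used here that can handle nan values), with its default interpolation='linear' setting.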
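To tie this back to the number in the question, here is a small sketch that scales the first value of x1 under each percentile rule. The helper `quartiles` is a hypothetical name introduced only for this illustration; it tries the `method` keyword (NumPy >= 1.22) and falls back to the older `interpolation` keyword:

```python
import numpy as np

x1 = np.array([1, 4, 7, 2])  # first column of the question's matrix
median = np.median(x1)       # 3.0


def quartiles(values, how):
    """Return (q75, q25) under the given percentile rule (illustrative helper)."""
    try:
        # NumPy >= 1.22 names the keyword `method`.
        return np.percentile(values, [75, 25], method=how)
    except TypeError:
        # Older NumPy releases name it `interpolation`.
        return np.percentile(values, [75, 25], interpolation=how)


scaled_first = {}
for how in ("linear", "lower", "higher", "midpoint", "nearest"):
    q75, q25 = quartiles(x1, how)
    iqr = q75 - q25
    scaled_first[how] = (x1[0] - median) / iqr
    print(f"{how:>8}: IQR = {iqr}, scaled first value = {scaled_first[how]:.4f}")
```

With 'midpoint' the IQR is 4 and the first value scales to -0.5, as in the question; with 'linear' the IQR is 3 and the result is -0.67, as returned by sklearn. Different quartile pairs can still share an IQR: 'lower' also gives an IQR of 3 here.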

