numpy.random.normal gives bad results

Problem description

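I am trying to model a random variable with numpy.random.normal. From this random variable (mean=0, std=1):

I draw many samples of the same size (e.g., m=100)
I compute the standard deviation of each sample
I take the mean of all those standard deviations

Statistical theory, as well as R, tells me that this must converge to the chosen std (i.e. 1). But somehow, with numpy (and scipy.stats), it does not.

The figure produced by this code shows the strange behavior: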

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, tstd

# system setup
m = 100         # number of measurements per sample
sigma = 1       # sensor std

ez = np.arange(1, 6, .05)
sample_sizes = [int(10**e) for e in ez]

# testing normal and std - they seem to work fine
sig_est = []
for n in sample_sizes:
    sample = np.random.normal(0, sigma, n*m)
    sig_est.append(np.std(sample))
plt.plot(ez, sig_est, marker='.', color='b', ls='', label='numpy - no means')

# numpy implementation of problem:
# std of each row of m values, then the mean over the n rows
sig_est = []
for n in sample_sizes:
    sample = np.random.normal(0, sigma, (n, m))
    sigma_est = np.std(sample, axis=1)
    sig_est.append(np.mean(sigma_est))
plt.plot(ez, sig_est, marker='.', color='k', ls='', label='numpy')

# scipy.stats implementation
sig_est = []
for n in sample_sizes:
    sample = norm.rvs(loc=0, scale=sigma, size=(n, m))
    sigma_est = tstd(sample, axis=1)
    sig_est.append(np.mean(sigma_est))
plt.plot(ez, sig_est, marker='.', color='r', ls='', label='scipy.stats')

plt.gca().set(xlabel='Number of samples [log10]')
plt.gca().legend()
plt.gca().grid(color='.9')
plt.show()

[Output figure: estimated std versus log10 of the number of samples]

Any ideas?

Solution

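This is an interesting question because it is not a random number generator problem but a math problem :-) The short answer is that everything works as expected.

The point is that in the first example you are taking ever larger samples of i.i.d. Gaussian values and computing their standard deviation with np.std. This converges to 1, as your plot shows.

In the second plot, each standard deviation is always computed over 100 elements, and you then average those values. That way you are not computing the limit of the standard deviation over many elements; you are estimating the expected value of the standard deviation estimator for 100 elements, and its distance from 1 is the bias of that estimator. As you have found out, this bias is not zero! There are two reasons for this:

The default numpy implementation of the standard deviation is the square root of the variance estimator normalized by 1/n (the maximum-likelihood estimator for a Gaussian). This is not the unbiased estimator of the variance, which is normalized by 1/(n-1) instead. You can get the latter by passing the argument ddof=1 to np.std; see the documentation here: https://numpy.org/doc/stable/reference/generated/numpy.std.html.
...but even if you do that, you will not get zero bias. That is because you are plotting the standard deviation and not the variance: the square root of an unbiased variance estimator is not an unbiased estimator of the standard deviation. To get exactly 1, you should square the result after computing np.std and before taking the mean. You can see this by replacing your line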

sig_est += [np.mean(sigma_est)]  # equivalent to sig_est.append(np.mean(sigma_est))

with

sig_est.append(np.mean(np.std(sample,axis=1,ddof=1)**2))

in the second block of the code; the result then does converge to 1.
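The two effects can be checked side by side with a short simulation. This is only a minimal sketch under assumptions that are not in the question: the seed, the number of rows n = 50_000, and the variable names are arbitrary choices, and np.random.default_rng is used instead of np.random.normal.

```python
import numpy as np

# Assumed for illustration: seed and number of rows are arbitrary.
rng = np.random.default_rng(0)
m = 100        # elements per sample, as in the question
n = 50_000     # number of samples (rows)
sample = rng.normal(0.0, 1.0, size=(n, m))

# Mean of the per-row std with the default 1/n normalization (ddof=0)
mean_std_ddof0 = np.mean(np.std(sample, axis=1))
# Mean of the per-row std with the 1/(n-1) normalization (ddof=1)
mean_std_ddof1 = np.mean(np.std(sample, axis=1, ddof=1))
# Mean of the per-row unbiased variance: this is the quantity that tends to 1
mean_var_ddof1 = np.mean(np.var(sample, axis=1, ddof=1))

print("mean std, ddof=0:", mean_std_ddof0)
print("mean std, ddof=1:", mean_std_ddof1)
print("mean var, ddof=1:", mean_var_ddof1)
```

With these settings both averaged standard deviations should come out slightly below 1 (the ddof=0 one lower than the ddof=1 one), while the averaged variance should sit very close to 1.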

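As for the last implementation, the one using scipy, it seems to use yet another normalization: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tstd.html

They call it "unbiased", but clearly it is not: on the one hand because your plot shows it plainly, and on the other hand because the exact factor needed to get an unbiased estimator of the standard deviation (for a Gaussian) is more complicated than n/(n-1); see here: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation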
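For Gaussian data the exact factor can be written down: the expected value of the ddof=1 sample standard deviation is sigma times sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), the quantity called c4(n) in the linked article. The sketch below computes it and divides the averaged standard deviation by it. The function name c4, the seed, and the number of rows are illustrative assumptions, not part of numpy or scipy.

```python
import math
import numpy as np

def c4(n):
    """E[s] / sigma for the ddof=1 sample std of n i.i.d. Gaussian values."""
    # Use log-gamma to avoid overflow of the gamma function for large n.
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

# Assumed for illustration: seed and number of rows are arbitrary.
rng = np.random.default_rng(1)
m = 100
sample = rng.normal(0.0, 1.0, size=(50_000, m))
s = np.std(sample, axis=1, ddof=1)

print("c4(m):              ", c4(m))
print("mean of s:          ", np.mean(s))
print("mean of s / c4(m):  ", np.mean(s) / c4(m))
```

Dividing by c4(m) removes the remaining bias only under the Gaussian assumption; for other distributions the factor is different.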

numpy python random scipy statistics