How to repeat a command in Python for bootstrap resampling

Problem description

I have a DataFrame (4 data points long) and want to bootstrap it X times.

Example DataFrame:

I came up with this code for bootstrap resampling:

      boot = resample(df,replace=True,n_samples=len(df),random_state=1)
      print('Bootstrap Sample: %s' % boot)

But now I want to repeat this X times. How can I do that?

Desired output for x = 20:

  Sample Nr.    Index A B
      1         0   1 2
                1   1 2
                2   1 2
                3   1 2 
     ...
      20        0   1 2
                1   1 2
                1   1 2
                2   1 2

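Thank you.

Best

Solution

Method 1: Sampling the data in parallel

Calling the DataFrame's sample method n times can be time-consuming, so consider running the sample calls in parallel.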

import multiprocessing
from itertools import repeat

def sample_data(df, replace, random_state):
    '''Generate one sample of size len(df)'''
    return df.sample(replace=replace, n=len(df), random_state=random_state)

def resample_data(df, replace, n_samples, random_state):
    '''Call the sample method n_samples times in parallel'''

    # Run sample_data in a pool of worker processes, one task per sample.
    # repeat(df, n_samples) bounds the zip, so exactly n_samples tasks are created.
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    bootstrap_samples = pool.starmap(sample_data, zip(repeat(df, n_samples), repeat(replace), repeat(random_state)))
    pool.close()
    pool.join()

    return bootstrap_samples

Now, if I want to generate 15 samples, resample_data returns a list containing 15 samples drawn from df.

samples = resample_data(df,True,n_samples=15,random_state=1)

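Note that with a fixed random_state every call draws the same rows, so all samples come out identical. To get different samples, set random_state to None.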
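If the samples should differ from one another but still be reproducible, one option is to give each sample its own integer seed instead of using None. The sketch below is only an illustration: the helper name bootstrap_with_seeds, the base_seed parameter, and the example data are assumptions, not part of the original answer. It draws the samples sequentially; the same list of seeds could be passed to pool.starmap in Method 1. This may also be the safer choice in the parallel case, because forked worker processes can inherit the same global NumPy random state when random_state is None.

```python
import pandas as pd

def bootstrap_with_seeds(df, n_samples, base_seed=0):
    """Hypothetical helper: draw one bootstrap sample per seed.

    Seeds are base_seed, base_seed + 1, ..., so each sample uses a
    different seed while the whole run stays reproducible.
    """
    seeds = [base_seed + i for i in range(n_samples)]
    return [df.sample(n=len(df), replace=True, random_state=s) for s in seeds]

# Made-up example data standing in for the 4-row DataFrame in the question
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

samples = bootstrap_with_seeds(df, n_samples=20, base_seed=1)
again = bootstrap_with_seeds(df, n_samples=20, base_seed=1)
```

Running the helper twice with the same base_seed gives the same list of samples, which makes results easy to reproduce.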

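Method 2: Sampling the data sequentially

Another way to sample the data is with a list comprehension. Since the function sample_data is already defined, it can be called directly inside the comprehension.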

def resample_data_linearly(df, n_samples, random_state, replace=True):
    '''Call sample_data n_samples times, one after another'''
    return [sample_data(df, replace, random_state) for _ in range(n_samples)]

# Generate 10 samples of size len(df)
samples = resample_data_linearly(df, n_samples=10, random_state=1)
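Both methods return a plain list of DataFrames. To get something closer to the table in the question, with a sample number next to the original row index, the list could be combined with pd.concat and its keys argument. This is a minimal sketch under assumptions: the example data is made up, and the level names "Sample Nr." and "Index" are taken from the desired output in the question.

```python
import pandas as pd

# Made-up example data standing in for the 4-row DataFrame in the question
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

n_samples = 20
# One bootstrap sample per seed, so the samples are not all identical
samples = [df.sample(n=len(df), replace=True, random_state=i) for i in range(n_samples)]

# Stack the samples; the outer index level numbers them 1..n_samples,
# the inner level keeps the original row labels that were drawn
combined = pd.concat(
    samples,
    keys=list(range(1, n_samples + 1)),
    names=["Sample Nr.", "Index"],
)

first_sample = combined.loc[1]  # rows of sample number 1
print(combined.head(8))
```

A single combined DataFrame also makes it straightforward to compute a statistic per sample, for example with combined.groupby(level="Sample Nr.").mean().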

dataframe python resampling statistics-bootstrap