Downsampling accelerometer and gyroscope time-series data

Problem description

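I have time-series data from sports activities, recorded at 50 Hz. I now want to downsample it to 20 Hz, because I want to train a model and make predictions at 20 Hz.

Is there an efficient way to do this in Python? I have heard of pandas' resample function, but I am not sure how to use it effectively for my problem. Any code snippet would be very helpful.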

   epoch (ms)              time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
1613977400899   2021-02-22T12:03:20.899            0      -0.336       0.886       0.649
1613977400920   2021-02-22T12:03:20.920        0.021      -0.233       0.799       0.648
1613977400940   2021-02-22T12:03:20.940        0.041      -0.173       0.771       0.629
1613977400961   2021-02-22T12:03:20.961        0.062      -0.132       0.757       0.596
1613977400981   2021-02-22T12:03:20.981        0.082      -0.113       0.724       0.57
1613977401002   2021-02-22T12:03:21.002        0.103      -0.127       0.713       0.538
1613977401021   2021-02-22T12:03:21.021        0.122      -0.175       0.743       0.488
1613977401041   2021-02-22T12:03:21.041        0.142      -0.266       0.775       0.417
1613977401062   2021-02-22T12:03:21.062        0.163      -0.281       0.774       0.402
1613977401082   2021-02-22T12:03:21.082        0.183      -0.212       0.713       0.427
1613977401103   2021-02-22T12:03:21.103        0.204      -0.17        0.649       0.46
1613977401123   2021-02-22T12:03:21.123        0.224      -0.204       0.649       0.524
1613977401144   2021-02-22T12:03:21.144        0.245      -0.313       0.684       0.658
1613977401164   2021-02-22T12:03:21.164        0.265      -0.415       0.727       0.785
1613977401183   2021-02-22T12:03:21.183        0.284      -0.419       0.726       0.82

Solution

The main issue here seems to be that your original sampling interval is only approximately 20 ms (50 Hz), not exactly. So we need to resample in two steps:

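Upsample to 1 ms, which is where we can choose the interpolation to use
Downsample to 50 ms, i.e. 20 Hz (just pick one row out of every 50, which is very simple)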
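
The two steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the asker's data: the step pattern, the sine signal, and the names `raw`, `fine`, and `downsampled` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor log: roughly 50 Hz with 19-21 ms spacing.
# The step pattern and the signal are invented for illustration.
steps_ms = np.tile([21, 20, 21, 20, 21, 19], 10)           # 60 irregular steps
offsets_ms = np.concatenate([[0], np.cumsum(steps_ms)])    # 0 .. 1220 ms
index = pd.Timestamp("2021-02-22 12:03:20.899") + pd.to_timedelta(offsets_ms, unit="ms")
raw = pd.DataFrame({"x": np.sin(offsets_ms / 200.0)}, index=index)

# Step 1: upsample to a regular 1 ms grid and fill the gaps by interpolation.
fine = raw.resample("1ms").interpolate(method="time")

# Step 2: keep one row out of every 50, i.e. one sample every 50 ms (20 Hz).
downsampled = fine.iloc[::50]

print(downsampled.head())
```

Selecting rows with `iloc[::50]` is the literal reading of the second step and keeps the first sample's timestamp as the starting point; `.resample('50ms').last()` is an alternative that aligns the output to round 50 ms boundaries instead.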

First, let's build a time index. You have the time information twice here, so either column can be used:

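>>> import pandas as pd
>>> # Option 1: from the epoch column (milliseconds since the Unix epoch, UTC)
>>> df = df.set_index(pd.to_datetime(df['epoch (ms)'], unit='ms'))
>>> # Option 2: from the text timestamp column (this is the index shown below)
>>> df = df.set_index(pd.to_datetime(df['time (10:00)']))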
>>> df
                            epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                    
2021-02-22 12:03:20.899  1613977400899  2021-02-22T12:03:20.899        0.000      -0.336       0.886       0.649
2021-02-22 12:03:20.920  1613977400920  2021-02-22T12:03:20.920        0.021      -0.233       0.799       0.648
2021-02-22 12:03:20.940  1613977400940  2021-02-22T12:03:20.940        0.041      -0.173       0.771       0.629
2021-02-22 12:03:20.961  1613977400961  2021-02-22T12:03:20.961        0.062      -0.132       0.757       0.596
2021-02-22 12:03:20.981  1613977400981  2021-02-22T12:03:20.981        0.082      -0.113       0.724       0.570
2021-02-22 12:03:21.002  1613977401002  2021-02-22T12:03:21.002        0.103      -0.127       0.713       0.538
2021-02-22 12:03:21.021  1613977401021  2021-02-22T12:03:21.021        0.122      -0.175       0.743       0.488
2021-02-22 12:03:21.041  1613977401041  2021-02-22T12:03:21.041        0.142      -0.266       0.775       0.417
2021-02-22 12:03:21.062  1613977401062  2021-02-22T12:03:21.062        0.163      -0.281       0.774       0.402
2021-02-22 12:03:21.082  1613977401082  2021-02-22T12:03:21.082        0.183      -0.212       0.713       0.427
2021-02-22 12:03:21.103  1613977401103  2021-02-22T12:03:21.103        0.204      -0.170       0.649       0.460
2021-02-22 12:03:21.123  1613977401123  2021-02-22T12:03:21.123        0.224      -0.204       0.649       0.524
2021-02-22 12:03:21.144  1613977401144  2021-02-22T12:03:21.144        0.245      -0.313       0.684       0.658
2021-02-22 12:03:21.164  1613977401164  2021-02-22T12:03:21.164        0.265      -0.415       0.727       0.785
2021-02-22 12:03:21.183  1613977401183  2021-02-22T12:03:21.183        0.284      -0.419       0.726       0.820

(We no longer really need the epoch and time columns, since that information is now in the index.)
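
As a small sketch of the two points above, the snippet below builds both candidate indexes from two rows of the sample data and then drops the redundant columns. The names `from_epoch`, `from_text`, `offset`, and `sensors` are illustrative. In these sample rows the two columns are spaced identically but differ by a constant five hours, presumably a time zone offset, so the choice only affects the absolute timestamps.

```python
import pandas as pd

# Two rows copied from the sample data in the question.
df = pd.DataFrame({
    "epoch (ms)": [1613977400899, 1613977400920],
    "time (10:00)": ["2021-02-22T12:03:20.899", "2021-02-22T12:03:20.920"],
    "elapsed (s)": [0.0, 0.021],
    "x-axis (g)": [-0.336, -0.233],
})

from_epoch = pd.to_datetime(df["epoch (ms)"], unit="ms")   # epoch values are UTC
from_text = pd.to_datetime(df["time (10:00)"])             # time as written in the log

# Constant offset between the two candidate indexes (the spacing is the same).
offset = (from_text - from_epoch).unique()
print(offset)

# Index by the text timestamps, then drop the two now-redundant columns.
sensors = (
    df.set_index(from_text)
      .drop(columns=["epoch (ms)", "time (10:00)"])
)
print(sensors)
```

Dropping the text column before resampling also means `.interpolate()` only ever sees numeric columns.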

Now we can resample:

>>> df.resample('1ms').interpolate().resample('50ms').last()
                           epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                   
2021-02-22 12:03:20.850  1.613977e+12  2021-02-22T12:03:20.899        0.000   -0.336000    0.886000    0.649000
2021-02-22 12:03:20.900  1.613977e+12  2021-02-22T12:03:20.940        0.050   -0.155429    0.765000    0.614857
2021-02-22 12:03:20.950  1.613977e+12  2021-02-22T12:03:20.981        0.100   -0.125000    0.714571    0.542571
2021-02-22 12:03:21.000  1.613977e+12  2021-02-22T12:03:21.041        0.150   -0.271714    0.774619    0.411286
2021-02-22 12:03:21.050  1.613977e+12  2021-02-22T12:03:21.082        0.200   -0.178000    0.661190    0.453714
2021-02-22 12:03:21.100  1.613977e+12  2021-02-22T12:03:21.144        0.250   -0.338500    0.694750    0.689750
2021-02-22 12:03:21.150  1.613977e+12  2021-02-22T12:03:21.183        0.284   -0.419000    0.726000    0.820000
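
One detail of the output above is worth spelling out: `.resample('50ms').last()` labels each bin with its left edge but reports the last value inside it, so the row labelled 12:03:20.900 holds the value interpolated at 12:03:20.949 (hence elapsed = 0.050). The sketch below reproduces this with the first four x-axis samples; the variable names are illustrative.

```python
import pandas as pd

# First four x-axis samples from the question.
idx = pd.to_datetime([
    "2021-02-22T12:03:20.899",
    "2021-02-22T12:03:20.920",
    "2021-02-22T12:03:20.940",
    "2021-02-22T12:03:20.961",
])
x = pd.Series([-0.336, -0.233, -0.173, -0.132], index=idx, name="x-axis (g)")

fine = x.resample("1ms").interpolate()       # regular 1 ms grid, linear fill
binned = fine.resample("50ms").last()        # one value per 50 ms bin

# Each label is the bin's left edge; the value comes from the bin's last millisecond.
for label, value in binned.items():
    print(label.time(), round(value, 6))
```

If the labels need to match the instants the values were taken at, selecting rows from the 1 ms grid (for example `fine.iloc[::50]`) keeps timestamp and value together.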

Note that you can perform different kinds of interpolation by specifying the method passed to .interpolate(). See the documentation:

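method : str, default 'linear'
Interpolation technique to use. One of:

'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
'time': Works on daily and higher resolution data to interpolate given length of interval.
'index', 'values': Use the actual numerical values of the index.
'pad': Fill in NaNs using existing values.
'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'spline', 'barycentric', 'polynomial': Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both 'polynomial' and 'spline' require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima', 'cubicspline': Wrappers around the SciPy interpolation methods of similar names. See Notes.
'from_derivatives': Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the 'piecewise_polynomial' interpolation method in scipy 0.18.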

You can see small differences in the axis values; it is up to you to choose the method that suits you:

>>> df.resample('1ms').interpolate('time').resample('50ms').last()
                           epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                   
2021-02-22 12:03:20.850  1.613977e+12  2021-02-22T12:03:20.899        0.000   -0.336000    0.886000    0.649000
2021-02-22 12:03:20.900  1.613977e+12  2021-02-22T12:03:20.940        0.050   -0.155429    0.765000    0.614857
2021-02-22 12:03:20.950  1.613977e+12  2021-02-22T12:03:20.981        0.100   -0.125000    0.714571    0.542571
2021-02-22 12:03:21.000  1.613977e+12  2021-02-22T12:03:21.041        0.150   -0.271714    0.774619    0.411286
2021-02-22 12:03:21.050  1.613977e+12  2021-02-22T12:03:21.082        0.200   -0.178000    0.661190    0.453714
2021-02-22 12:03:21.100  1.613977e+12  2021-02-22T12:03:21.144        0.250   -0.338500    0.694750    0.689750
2021-02-22 12:03:21.150  1.613977e+12  2021-02-22T12:03:21.183        0.284   -0.419000    0.726000    0.820000
>>> df.resample('1ms').interpolate('cubic').resample('50ms').last()
                           epoch (ms)             time (10:00)  elapsed (s)  x-axis (g)  y-axis (g)  z-axis (g)
time (10:00)                                                                                                   
2021-02-22 12:03:20.850  1.613977e+12  2021-02-22T12:03:20.899        0.000   -0.336000    0.886000    0.649000
2021-02-22 12:03:20.900  1.613977e+12  2021-02-22T12:03:20.940        0.050   -0.153162    0.766266    0.615219
2021-02-22 12:03:20.950  1.613977e+12  2021-02-22T12:03:20.981        0.100   -0.122950    0.711454    0.543581
2021-02-22 12:03:21.000  1.613977e+12  2021-02-22T12:03:21.041        0.150   -0.285487    0.781273    0.403123
2021-02-22 12:03:21.050  1.613977e+12  2021-02-22T12:03:21.082        0.200   -0.172478    0.656944    0.452494
2021-02-22 12:03:21.100  1.613977e+12  2021-02-22T12:03:21.144        0.250   -0.342439    0.695493    0.693425
2021-02-22 12:03:21.150  1.613977e+12  2021-02-22T12:03:21.183        0.284   -0.419000    0.726000    0.820000
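
The 'time' table above is identical to the default linear one, and the following sketch suggests why: after upsampling, the grid is perfectly regular at 1 ms, so treating rows as equally spaced ('linear') and weighting by timestamps ('time') amount to the same thing. The difference only shows up when interpolating directly on the irregular samples. The variable names are illustrative, and the data are the first four x-axis samples from the question.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "2021-02-22T12:03:20.899",
    "2021-02-22T12:03:20.920",
    "2021-02-22T12:03:20.940",
    "2021-02-22T12:03:20.961",
])
x = pd.Series([-0.336, -0.233, -0.173, -0.132], index=idx)

# On the regular 1 ms grid both methods coincide.
linear = x.resample("1ms").interpolate(method="linear")
by_time = x.resample("1ms").interpolate(method="time")
print(np.allclose(linear, by_time))

# Inserting a single target instant into the irregular index shows the difference:
# 'linear' takes the midpoint of its neighbours, 'time' weights by the 9 ms / 12 ms gaps.
target = pd.to_datetime(["2021-02-22T12:03:20.949"])
gappy = x.reindex(x.index.union(target))
direct_linear = gappy.interpolate(method="linear").loc[target[0]]
direct_time = gappy.interpolate(method="time").loc[target[0]]
print(direct_linear, direct_time)
```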
