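Problem description
I have time series data from sports activities. The data was recorded at 50 Hz, but I now want to downsample it to 20 Hz, because I want to train a model and make predictions at 20 Hz.
Is there an efficient way to do this in Python? I have heard of pandas' resample function, but I don't know how to use it effectively for my problem. Any piece of code would be very helpful.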
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
1613977400899 2021-02-22T12:03:20.899 0 -0.336 0.886 0.649
1613977400920 2021-02-22T12:03:20.920 0.021 -0.233 0.799 0.648
1613977400940 2021-02-22T12:03:20.940 0.041 -0.173 0.771 0.629
1613977400961 2021-02-22T12:03:20.961 0.062 -0.132 0.757 0.596
1613977400981 2021-02-22T12:03:20.981 0.082 -0.113 0.724 0.57
1613977401002 2021-02-22T12:03:21.002 0.103 -0.127 0.713 0.538
1613977401021 2021-02-22T12:03:21.021 0.122 -0.175 0.743 0.488
1613977401041 2021-02-22T12:03:21.041 0.142 -0.266 0.775 0.417
1613977401062 2021-02-22T12:03:21.062 0.163 -0.281 0.774 0.402
1613977401082 2021-02-22T12:03:21.082 0.183 -0.212 0.713 0.427
1613977401103 2021-02-22T12:03:21.103 0.204 -0.17 0.649 0.46
1613977401123 2021-02-22T12:03:21.123 0.224 -0.204 0.649 0.524
1613977401144 2021-02-22T12:03:21.144 0.245 -0.313 0.684 0.658
1613977401164 2021-02-22T12:03:21.164 0.265 -0.415 0.727 0.785
1613977401183 2021-02-22T12:03:21.183 0.284 -0.419 0.726 0.82
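Solution
A main problem here seems to be that your original sampling interval is only approximately 20 ms (50 Hz), not exactly. We need to resample in two steps:
- upsample to 1 ms, where we can define the interpolation to use
- downsample to 50 ms, i.e. 20 Hz (just pick one row out of every 50, which is easy)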
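To check how uneven the sampling is, here is a minimal sketch that measures the gaps between consecutive epoch values from the sample rows in the question. The variable names `epoch_ms` and `gaps_ms` are only illustrative.

```python
import numpy as np

# Epoch timestamps (ms) copied from the sample rows in the question.
epoch_ms = np.array([
    1613977400899, 1613977400920, 1613977400940, 1613977400961,
    1613977400981, 1613977401002, 1613977401021, 1613977401041,
    1613977401062, 1613977401082, 1613977401103, 1613977401123,
    1613977401144, 1613977401164, 1613977401183,
])

# Gaps between consecutive samples, in milliseconds.
gaps_ms = np.diff(epoch_ms)

print(gaps_ms.min(), gaps_ms.max())  # smallest and largest gap
print(gaps_ms.mean())                # average gap, close to 20 ms but not exact
```

On these rows the gaps range from 19 to 21 ms, so picking every n-th original row would not land on an exact 50 ms grid.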
First let's build a time index. You have the time information twice here, so either of the two columns can be used:
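>>> # Option 1: index from the epoch column (integer milliseconds)
>>> df = df.set_index(df['epoch (ms)'].astype('datetime64[ms]'))
>>> # Option 2: index from the ISO 8601 time column (used for the output below)
>>> df = df.set_index(pd.to_datetime(df['time (10:00)']))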
>>> df
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.899 1613977400899 2021-02-22T12:03:20.899 0.000 -0.336 0.886 0.649
2021-02-22 12:03:20.920 1613977400920 2021-02-22T12:03:20.920 0.021 -0.233 0.799 0.648
2021-02-22 12:03:20.940 1613977400940 2021-02-22T12:03:20.940 0.041 -0.173 0.771 0.629
2021-02-22 12:03:20.961 1613977400961 2021-02-22T12:03:20.961 0.062 -0.132 0.757 0.596
2021-02-22 12:03:20.981 1613977400981 2021-02-22T12:03:20.981 0.082 -0.113 0.724 0.570
2021-02-22 12:03:21.002 1613977401002 2021-02-22T12:03:21.002 0.103 -0.127 0.713 0.538
2021-02-22 12:03:21.021 1613977401021 2021-02-22T12:03:21.021 0.122 -0.175 0.743 0.488
2021-02-22 12:03:21.041 1613977401041 2021-02-22T12:03:21.041 0.142 -0.266 0.775 0.417
2021-02-22 12:03:21.062 1613977401062 2021-02-22T12:03:21.062 0.163 -0.281 0.774 0.402
2021-02-22 12:03:21.082 1613977401082 2021-02-22T12:03:21.082 0.183 -0.212 0.713 0.427
2021-02-22 12:03:21.103 1613977401103 2021-02-22T12:03:21.103 0.204 -0.170 0.649 0.460
2021-02-22 12:03:21.123 1613977401123 2021-02-22T12:03:21.123 0.224 -0.204 0.649 0.524
2021-02-22 12:03:21.144 1613977401144 2021-02-22T12:03:21.144 0.245 -0.313 0.684 0.658
2021-02-22 12:03:21.164 1613977401164 2021-02-22T12:03:21.164 0.265 -0.415 0.727 0.785
2021-02-22 12:03:21.183 1613977401183 2021-02-22T12:03:21.183 0.284 -0.419 0.726 0.820
(We no longer really need the epoch and time columns now, since that information is in the index.)
Now we can do the resampling:
>>> df.resample('1ms').interpolate().resample('50ms').last()
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.850 1.613977e+12 2021-02-22T12:03:20.899 0.000 -0.336000 0.886000 0.649000
2021-02-22 12:03:20.900 1.613977e+12 2021-02-22T12:03:20.940 0.050 -0.155429 0.765000 0.614857
2021-02-22 12:03:20.950 1.613977e+12 2021-02-22T12:03:20.981 0.100 -0.125000 0.714571 0.542571
2021-02-22 12:03:21.000 1.613977e+12 2021-02-22T12:03:21.041 0.150 -0.271714 0.774619 0.411286
2021-02-22 12:03:21.050 1.613977e+12 2021-02-22T12:03:21.082 0.200 -0.178000 0.661190 0.453714
2021-02-22 12:03:21.100 1.613977e+12 2021-02-22T12:03:21.144 0.250 -0.338500 0.694750 0.689750
2021-02-22 12:03:21.150 1.613977e+12 2021-02-22T12:03:21.183 0.284 -0.419000 0.726000 0.820000
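For reference, here is a self-contained sketch of the two steps on the sample rows, keeping only the numeric axis columns. It assumes the index can be rebuilt from the first timestamp plus the elapsed offsets. The names `dense`, `by_bin` and `by_row` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows from the question: elapsed time in ms, acceleration in g.
elapsed_ms = [0, 21, 41, 62, 82, 103, 122, 142, 163, 183, 204, 224, 245, 265, 284]
x = [-0.336, -0.233, -0.173, -0.132, -0.113, -0.127, -0.175, -0.266,
     -0.281, -0.212, -0.170, -0.204, -0.313, -0.415, -0.419]
y = [0.886, 0.799, 0.771, 0.757, 0.724, 0.713, 0.743, 0.775,
     0.774, 0.713, 0.649, 0.649, 0.684, 0.727, 0.726]
z = [0.649, 0.648, 0.629, 0.596, 0.570, 0.538, 0.488, 0.417,
     0.402, 0.427, 0.460, 0.524, 0.658, 0.785, 0.820]

# Assumption: rebuild the time index from the first timestamp plus offsets.
start = pd.Timestamp("2021-02-22T12:03:20.899")
index = start + pd.to_timedelta(elapsed_ms, unit="ms")
df = pd.DataFrame({"x-axis (g)": x, "y-axis (g)": y, "z-axis (g)": z}, index=index)

# Step 1: upsample to a regular 1 ms grid and fill the gaps linearly.
dense = df.resample("1ms").interpolate()

# Step 2a: keep the last 1 ms row of every 50 ms bin (as in the answer).
by_bin = dense.resample("50ms").last()

# Step 2b: keep every 50th row, counted from the first sample.
by_row = dense.iloc[::50]

print(by_bin)
print(by_row)
```

Both variants pick the same interpolated values for the full bins. The difference is in the labels: `by_bin` is labelled with the start of each 50 ms bin, while `by_row` keeps the timestamp the value was actually taken at. The final row of `by_bin` comes from a partial bin (it is the last recorded sample at 0.284 s), so you may want to drop it before training.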
Note that you can perform different kinds of interpolation through the method argument passed to .interpolate(). See the doc:
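method : str, default 'linear'
Interpolation technique to use. One of:
- 'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
- 'time': Works on daily and higher resolution data to interpolate given length of interval.
- 'index', 'values': Use the actual numerical values of the index.
- 'pad': Fill in NaNs using existing values.
- 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'spline', 'barycentric', 'polynomial': Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both 'polynomial' and 'spline' require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
- 'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima', 'cubicspline': Wrappers around the SciPy interpolation methods of similar names. See Notes.
- 'from_derivatives': Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the 'piecewise_polynomial' interpolation method in scipy 0.18.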
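The two examples below show the effect on the coordinates: on the regular 1 ms grid, 'time' gives the same values as the default 'linear', while 'cubic' differs slightly. It is up to you to choose the method that suits your data: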
>>> df.resample('1ms').interpolate('time').resample('50ms').last()
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.850 1.613977e+12 2021-02-22T12:03:20.899 0.000 -0.336000 0.886000 0.649000
2021-02-22 12:03:20.900 1.613977e+12 2021-02-22T12:03:20.940 0.050 -0.155429 0.765000 0.614857
2021-02-22 12:03:20.950 1.613977e+12 2021-02-22T12:03:20.981 0.100 -0.125000 0.714571 0.542571
2021-02-22 12:03:21.000 1.613977e+12 2021-02-22T12:03:21.041 0.150 -0.271714 0.774619 0.411286
2021-02-22 12:03:21.050 1.613977e+12 2021-02-22T12:03:21.082 0.200 -0.178000 0.661190 0.453714
2021-02-22 12:03:21.100 1.613977e+12 2021-02-22T12:03:21.144 0.250 -0.338500 0.694750 0.689750
2021-02-22 12:03:21.150 1.613977e+12 2021-02-22T12:03:21.183 0.284 -0.419000 0.726000 0.820000
>>> df.resample('1ms').interpolate('cubic').resample('50ms').last()
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.850 1.613977e+12 2021-02-22T12:03:20.899 0.000 -0.336000 0.886000 0.649000
2021-02-22 12:03:20.900 1.613977e+12 2021-02-22T12:03:20.940 0.050 -0.153162 0.766266 0.615219
2021-02-22 12:03:20.950 1.613977e+12 2021-02-22T12:03:20.981 0.100 -0.122950 0.711454 0.543581
2021-02-22 12:03:21.000 1.613977e+12 2021-02-22T12:03:21.041 0.150 -0.285487 0.781273 0.403123
2021-02-22 12:03:21.050 1.613977e+12 2021-02-22T12:03:21.082 0.200 -0.172478 0.656944 0.452494
2021-02-22 12:03:21.100 1.613977e+12 2021-02-22T12:03:21.144 0.250 -0.342439 0.695493 0.693425
2021-02-22 12:03:21.150 1.613977e+12 2021-02-22T12:03:21.183 0.284 -0.419000 0.726000 0.820000
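As an aside, the linear case can also be sketched without the intermediate 1 ms grid, by evaluating the interpolation directly at the 50 ms target times with numpy.interp. This is an alternative to the resample chain above, not what the answer uses, and the names `target_ms` and `x_20hz` are illustrative.

```python
import numpy as np

# Elapsed time (ms) and x-axis values (g) copied from the sample rows.
elapsed_ms = np.array([0, 21, 41, 62, 82, 103, 122, 142,
                       163, 183, 204, 224, 245, 265, 284])
x = np.array([-0.336, -0.233, -0.173, -0.132, -0.113, -0.127, -0.175, -0.266,
              -0.281, -0.212, -0.170, -0.204, -0.313, -0.415, -0.419])

# Target grid: one sample every 50 ms (20 Hz), staying inside the recorded span.
target_ms = np.arange(0, elapsed_ms[-1] + 1, 50)

# Piecewise-linear interpolation of x at the target times.
x_20hz = np.interp(target_ms, elapsed_ms, x)

print(target_ms)
print(x_20hz)
```

The values agree with the x-axis column of the default linear output above, apart from its final partial-bin row, which has no counterpart here.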