How can I wrap the pandas resample method?

Problem description

I have a recurring pandas problem that I would like to solve by wrapping the .resample method. I just don't know how.

Background (not essential)

I have timezone-aware time series, for example:

s = pd.Series([5,19,-4],pd.date_range('2020-10-01',freq='D',periods=3,tz='Europe/Berlin',name='ts_left'))

s

ts_left
2020-10-01 00:00:00+02:00    5
2020-10-02 00:00:00+02:00   19
2020-10-03 00:00:00+02:00   -4
Freq: D,dtype: int64

I want to resample to hours. If I just use s.resample('H').sum(), the final 23 hours are dropped (this is also addressed in this question):

s.resample('H').sum()

ts_left
2020-10-01 00:00:00+02:00    5
2020-10-01 01:00:00+02:00    0
...
2020-10-01 23:00:00+02:00    0
2020-10-02 00:00:00+02:00   19
2020-10-02 01:00:00+02:00    0
...
2020-10-02 23:00:00+02:00    0
2020-10-03 00:00:00+02:00   -4
Freq: H,Length: 49,dtype: int64
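To make the gap concrete, here is a small check of the row counts, written as a sketch rather than part of the original question. It assumes the same series `s` as above and passes the hourly frequency as a `pd.Timedelta` (equivalent to 'H' here), because the 'H' alias is deprecated in recent pandas versions.

```python
import pandas as pd

s = pd.Series(
    [5, 19, -4],
    pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"),
)

hour = pd.Timedelta(hours=1)  # same frequency as 'H' in the text

# resample only spans the first to the last existing timestamp (inclusive)
hourly = s.resample(hour).sum()
n_actual = len(hourly)

# three full days of hourly values is what the question is after
n_wanted = len(s) * 24

print(n_actual, n_wanted)  # 49 72
```

The 23 missing rows are exactly the hours after the last daily timestamp.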

Current "solution"

I have written a custom resample2 function to correct this:

def resample2(df,freq,func):
    if type(df.index) != pd.DatetimeIndex:
        return df.resample(freq).apply(func)
    else: 
        #add one row
        idx = [df.index[-1] + df.index.freq]
        if type(df) == pd.DataFrame:
            df = df.append(pd.DataFrame([[None] * len(df.columns)],idx))
        elif type(df) == pd.Series:
            df = df.append(pd.Series([None],idx))
        df = df.resample(freq).apply(func)
        return df.iloc[:-1] #remove one row
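Note that Series.append and DataFrame.append were removed in pandas 2.0, so the function above only runs on older versions. Below is a minimal sketch of the same idea that pads through reindex instead. The name `resample2_reindex` is made up for this example, and the hourly frequency is passed as a `pd.Timedelta` to sidestep alias differences between pandas versions.

```python
import pandas as pd


def resample2_reindex(obj, freq, func):
    """Variant of resample2 that pads via reindex instead of append (sketch)."""
    if not isinstance(obj.index, pd.DatetimeIndex):
        return obj.resample(freq).apply(func)
    # one extra timestamp, one original period after the last row
    extra = pd.DatetimeIndex([obj.index[-1] + obj.index.freq], name=obj.index.name)
    padded = obj.reindex(index=obj.index.append(extra))  # the new row holds NaN
    # resample, then drop the bin that belongs to the padding row
    return padded.resample(freq).apply(func).iloc[:-1]


s = pd.Series(
    [5, 19, -4],
    pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"),
)
out = resample2_reindex(s, pd.Timedelta(hours=1), "sum")
out_df = resample2_reindex(s.to_frame("x"), pd.Timedelta(hours=1), "sum")
```

Like the original, this relies on the index having a freq set. Because the padding row is NaN, the result comes back as float rather than int.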

This works:

resample2(s,'H',np.sum)

2020-10-01 00:00:00+02:00    5
2020-10-01 01:00:00+02:00    0
...
2020-10-01 23:00:00+02:00    0
2020-10-02 00:00:00+02:00   19
2020-10-02 01:00:00+02:00    0
...
2020-10-02 23:00:00+02:00    0
2020-10-03 00:00:00+02:00   -4
2020-10-03 01:00:00+02:00    0
...
2020-10-03 23:00:00+02:00    0
Freq: H,Length: 72,dtype: int64

But it has 2 problems:

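The usage differs a lot from the standard usage (resample2(s,'H',np.sum) vs. s.resample('H').sum()), and
I can no longer use all the functions I could use before. For example, resample2(s,'H',s.resample.ffill) gives an error.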

Question

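Is there a way to wrap the DataFrame.resample and Series.resample methods so that they keep working as usual, except that one row is appended before resampling and the last row is removed after resampling, as shown in the resample2 function?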

Answer

Problem 1 (the usage differs a lot from the standard usage):

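Short of customizing the pandas package locally, I think what you are doing is close to the best you can do. I am not aware of any parameter of resample that allows this, and I am not sure how to customize existing methods of DataFrame / Series.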

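But there may be a way to turn your function into more of a helper that pre-processes or post-processes the data around the resample. Here is an alternative implementation of your function: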

def allday_resample(df,freq,func):
    df = df.copy()
    begin = df.index.min().floor('D')
    end = df.index.max().ceil('D')
    if end == df.index.max():
        end += pd.offsets.Day(1)

    if begin not in df.index:
        df.loc[begin] = np.nan
    if end not in df.index:
        df.loc[end] = np.nan

    r = df.resample(freq).apply(func)
    return r[(r.index >= begin) &
             (r.index < end)]

This is very similar to your resample2, with a few changes (improvements?):

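It uses df = df.copy(), which makes it explicit that we return a new object rather than modifying the original data passed in (this can be changed)
It handles Series and DataFrame the same way (so no if-else is needed)
It fills in complete days for the start and end dates. I saw that resample2 can produce odd results if your start/end timestamps are not at midnight (this may be moot if your data always falls on midnight). See this example: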

# now starting at 10:00
>>> s = pd.Series([5,19,-4],pd.date_range('2020-10-01 10:00',freq='D',periods=3,tz='Europe/Berlin',name='ts_left'))
>>> resample2(s,'H',np.sum)

2020-10-01 10:00:00+02:00     5
2020-10-01 11:00:00+02:00     5
2020-10-01 12:00:00+02:00     5
2020-10-01 13:00:00+02:00     5
2020-10-01 14:00:00+02:00     5
                             ..
2020-10-04 05:00:00+02:00    -4
2020-10-04 06:00:00+02:00    -4
2020-10-04 07:00:00+02:00    -4
2020-10-04 08:00:00+02:00    -4
2020-10-04 09:00:00+02:00    -4
Freq: H,Length: 72,dtype: object

# missing timestamps for Oct 1st, and timestamps carried over into Oct 4th despite no original data on that day
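For comparison, the day boundaries that allday_resample computes for the same 10:00 series can be checked in isolation. The helper name `day_bounds` is made up for this sketch. Its body mirrors the first lines of allday_resample, and `end` is treated as exclusive, as in the final filter of that function.

```python
import pandas as pd


def day_bounds(index):
    """Return (begin, end) covering whole days; end is exclusive (sketch)."""
    begin = index.min().floor("D")      # midnight at or before the first timestamp
    end = index.max().ceil("D")         # midnight at or after the last timestamp
    if end == index.max():              # last timestamp already sits on midnight
        end += pd.offsets.Day(1)        # so extend by one more day
    return begin, end


tz = "Europe/Berlin"
at_ten = pd.date_range("2020-10-01 10:00", freq="D", periods=3, tz=tz)
at_midnight = pd.date_range("2020-10-01", freq="D", periods=3, tz=tz)

print(day_bounds(at_ten))
print(day_bounds(at_midnight))
```

Both inputs give the same window, from midnight on Oct 1st up to (but not including) midnight on Oct 4th.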

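I call it allday_resample because it ensures that the start date, the end date, and every date in between are filled at the given freq. If you were resampling to minutes but only wanted the data filled out to the hour, this could get more complicated (you would need to pick a hierarchy of time frequency offsets). For now I am assuming you only care about taking daily data and resampling it hourly.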

>>> s = pd.Series([5,19,-4],pd.date_range('2020-10-01',freq='D',periods=3,tz='Europe/Berlin',name='ts_left'))
>>> allday_resample(s,'H',np.sum)
ts_left
2020-10-01 00:00:00+02:00    5.0
2020-10-01 01:00:00+02:00    0.0
2020-10-01 02:00:00+02:00    0.0
2020-10-01 03:00:00+02:00    0.0
2020-10-01 04:00:00+02:00    0.0

2020-10-03 19:00:00+02:00    0.0
2020-10-03 20:00:00+02:00    0.0
2020-10-03 21:00:00+02:00    0.0
2020-10-03 22:00:00+02:00    0.0
2020-10-03 23:00:00+02:00    0.0
Freq: H,dtype: float64

But we can move its steps into a function that edits the data before resampling, so that a plain resample then produces the same output:

def preprocess(df):
    begin = df.index.min().floor('D')
    end = df.index.max().ceil('D')
    if end == df.index.max():
        end += pd.offsets.Day(1) - pd.Timedelta('1s')
    if begin not in df.index:
        df.loc[begin] = np.nan
    if end not in df.index:
        df.loc[end] = np.nan

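Here, the data passed in is modified in place (the function returns nothing). There is also one small extra step: subtracting 1 second (an arbitrarily small increment) from the ceiling of the end date, so that the resample does not include any data on the following day.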
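If modifying the input in place is not wanted, the same preprocessing can be written so that it returns a new object and chains with the ordinary resample call through `.pipe`. This is only a sketch: `pad_to_full_days` is a made-up name, it always ends at the last second of the final day (a small deviation from preprocess above), and it assumes there is no daylight-saving switch on that final day.

```python
import pandas as pd


def pad_to_full_days(obj):
    """Return a copy of obj whose index also covers the first and last day (sketch)."""
    first, last = obj.index.min(), obj.index.max()
    begin = first.floor("D")
    # last second of the final day (assumes no DST switch on that day)
    end = last.floor("D") + pd.Timedelta(days=1) - pd.Timedelta(seconds=1)
    bounds = pd.DatetimeIndex([begin, end], name=obj.index.name)
    # reindex returns a new object; rows for the new labels hold NaN
    return obj.reindex(index=obj.index.union(bounds))


s = pd.Series(
    [5, 19, -4],
    pd.date_range("2020-10-01 10:00", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"),
)
hourly = s.pipe(pad_to_full_days).resample(pd.Timedelta(hours=1)).sum()
```

After the pipe step, any resampler method can be used as usual (.sum(), .ffill(), .mean(), and so on), and `s` itself is left untouched.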

With this function, you can do the following:

>>> preprocess(s)
>>> s.resample('H').sum()

ts_left
2020-10-01 00:00:00+02:00    5.0
2020-10-01 01:00:00+02:00    0.0
2020-10-01 02:00:00+02:00    0.0
2020-10-01 03:00:00+02:00    0.0
2020-10-01 04:00:00+02:00    0.0

2020-10-03 19:00:00+02:00    0.0
2020-10-03 20:00:00+02:00    0.0
2020-10-03 21:00:00+02:00    0.0
2020-10-03 22:00:00+02:00    0.0
2020-10-03 23:00:00+02:00    0.0
Freq: H,dtype: float64
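To get even closer to the standard call style, the padding and trimming from resample2 could also be hidden behind a small delegating object. The class below is a rough sketch, not an established pandas pattern: `PaddedResampler` is a hypothetical name, it assumes the index has a freq set, and it assumes every resampler method that gets called returns a Series or DataFrame whose last row is the padding bin.

```python
import pandas as pd


class PaddedResampler:
    """Pad one period, delegate to the pandas Resampler, trim the last row (sketch)."""

    def __init__(self, obj, rule, **kwargs):
        extra = pd.DatetimeIndex([obj.index[-1] + obj.index.freq], name=obj.index.name)
        padded = obj.reindex(index=obj.index.append(extra))  # padding row holds NaN
        self._resampler = padded.resample(rule, **kwargs)

    def __getattr__(self, name):
        # only reached for names not defined on PaddedResampler itself
        if name == "_resampler":  # guard against recursion before __init__ has run
            raise AttributeError(name)
        attr = getattr(self._resampler, name)
        if not callable(attr):
            return attr

        def trimmed(*args, **kwargs):
            return attr(*args, **kwargs).iloc[:-1]  # drop the padding bin

        return trimmed


s = pd.Series(
    [5, 19, -4],
    pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"),
)
hour = pd.Timedelta(hours=1)
summed = PaddedResampler(s, hour).sum()
filled = PaddedResampler(s, hour).ffill()
```

Since every method call is forwarded to the real resampler, forward filling works the same way as summing.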

Problem 2 (I can no longer use all the functions I could use before):

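This is less of a problem: you can still reach those methods by passing their string names instead of some other function (such as np.sum in your example). So for a forward fill, you can do the following (using resample2 as-is):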

>>> resample2(s,'H','ffill')
2020-10-01 00:00:00+02:00     5
2020-10-01 01:00:00+02:00     5
2020-10-01 02:00:00+02:00     5
2020-10-01 03:00:00+02:00     5
2020-10-01 04:00:00+02:00     5
                             ..
2020-10-03 19:00:00+02:00    -4
2020-10-03 20:00:00+02:00    -4
2020-10-03 21:00:00+02:00    -4
2020-10-03 22:00:00+02:00    -4
2020-10-03 23:00:00+02:00    -4
Freq: H,dtype: object

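From eyeballing it and some brief testing, x.resample().sum() and x.resample().apply('sum') are equivalent. See my question about this here and the answers from others, and see the documentation under Resampler.apply(). Above, where I used np.sum, I could have used 'sum'.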
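The equivalence is easy to check on the example series. This is a quick sanity check under the same assumptions as before (hourly frequency passed as a `pd.Timedelta`), not a proof for every method or dtype.

```python
import pandas as pd

s = pd.Series(
    [5, 19, -4],
    pd.date_range("2020-10-01", freq="D", periods=3, tz="Europe/Berlin", name="ts_left"),
)
hour = pd.Timedelta(hours=1)

by_method = s.resample(hour).sum()        # method call
by_name = s.resample(hour).apply("sum")   # same aggregation via its string name

same = by_method.equals(by_name)
print(same)
```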

pandas pandas-resample
