大熊猫将不均匀的每小时数据重新采样到1D或24h箱中

问题描述

我有每周的每小时FX数据，需要从星期一到星期四12:00 pm和星期五的21:00重新采样到“ 1D”或“ 24hr”箱中，每周总计5天：

Date                 rate
2020-01-02 00:00:00 0.673355
2020-01-02 01:00:00 0.67311
2020-01-02 02:00:00 0.672925
2020-01-02 03:00:00 0.67224
2020-01-02 04:00:00 0.67198
2020-01-02 05:00:00 0.67223
2020-01-02 06:00:00 0.671895
2020-01-02 07:00:00 0.672175
2020-01-02 08:00:00 0.672085
2020-01-02 09:00:00 0.67087
2020-01-02 10:00:00 0.6705800000000001
2020-01-02 11:00:00 0.66884
2020-01-02 12:00:00 0.66946
2020-01-02 13:00:00 0.6701600000000001
2020-01-02 14:00:00 0.67056
2020-01-02 15:00:00 0.67124
2020-01-02 16:00:00 0.6691699999999999
2020-01-02 17:00:00 0.66883
2020-01-02 18:00:00 0.66892
2020-01-02 19:00:00 0.669345
2020-01-02 20:00:00 0.66959
2020-01-02 21:00:00 0.670175
2020-01-02 22:00:00 0.6696300000000001
2020-01-02 23:00:00 0.6698350000000001
2020-01-03 00:00:00 0.66957

因此，一周中各天的小时数是不均匀的，即“星期一” =周一00:00:00到周一12:00:00，“星期二”（以及周三，周四）=即13星期一：00：00到星期二12:00:00，星期五= 13:00:00到21:00:00

在尝试找到解决方案时，我发现现在不推荐使用基数，并且偏移量/原点方法无法按预期运行，这可能是由于每天的行数不均匀所致：

df.rate.resample('24h',offset=12).ohlc()

我花了数小时试图找到解决方法

如何简单地将每个12:00:00时间戳之间的所有数据行装到ohlc（）列中？

所需的输出看起来像这样：

Out[69]: 
                                   open      high       low     close
2020-01-02 00:00:00.0000000  0.673355  0.673355  0.673355  0.673355
2020-01-03 00:00:00.0000000  0.673110  0.673110  0.668830  0.669570
2020-01-04 00:00:00.0000000  0.668280  0.668280  0.664950  0.666395
2020-01-05 00:00:00.0000000  0.666425  0.666425  0.666425  0.666425

解决方法

使用原点和偏移量作为参数，这就是您要查找的内容

df.resample('24h',origin='start_day',offset='13h').ohlc()

以您的示例为例：

                    open        high        low     close
datetime                
2020-01-01 13:00:00 0.673355    0.673355    0.66884 0.66946
2020-01-02 13:00:00 0.670160    0.671240    0.66883 0.66957

由于周期长度不相等，因此IMO必须自己制作映射轮。准确地说，星期一的1.5天长度使得freq='D'无法一次正确进行映射。

手工编写的代码还可以防止在定义明确的时间段之外进行记录。

数据

使用稍有不同的时间戳来演示代码的正确性。这些日子是星期一。到星期五。

import pandas as pd
import numpy as np
from datetime import datetime
import io
from pandas import Timestamp,Timedelta

df = pd.read_csv(io.StringIO("""
                         rate
Date                         
2020-01-06 00:00:00  0.673355
2020-01-06 23:00:00  0.673110
2020-01-07 00:00:00  0.672925
2020-01-07 12:00:00  0.672240
2020-01-07 13:00:00  0.671980
2020-01-07 23:00:00  0.672230
2020-01-08 00:00:00  0.671895
2020-01-08 12:00:00  0.672175
2020-01-08 23:00:00  0.672085
2020-01-09 00:00:00  0.670870
2020-01-09 12:00:00  0.670580
2020-01-09 23:00:00  0.668840
2020-01-10 00:00:00  0.669460
2020-01-10 12:00:00  0.670160
2020-01-10 21:00:00  0.670560
2020-01-10 22:00:00  0.671240
2020-01-10 23:00:00  0.669170
"""),sep=r"\s{2,}",engine="python")

df.set_index(pd.to_datetime(df.index),inplace=True)

代码

def find_day(ts: Timestamp):
    """Find the trading day with irregular length"""

    wd = ts.isoweekday()
    if wd == 1:
        return ts.date()
    elif wd in (2,3,4):
        return ts.date() - Timedelta("1D") if ts.hour <= 12 else ts.date()
    elif wd == 5:
        if ts.hour <= 12:
            return ts.date() - Timedelta("1D")
        elif 13 <= ts.hour <= 21:
            return ts.date()

    # out of range or nulls
    return None

# map the timestamps,and set as new index
df.set_index(pd.DatetimeIndex(df.index.map(find_day)),inplace=True)

# drop invalid values and collect ohlc
ans = df["rate"][df.index.notnull()].resample("D").ohlc()

结果

print(ans)

                open      high       low     close
Date                                              
2020-01-06  0.673355  0.673355  0.672240  0.672240
2020-01-07  0.671980  0.672230  0.671895  0.672175
2020-01-08  0.672085  0.672085  0.670580  0.670580
2020-01-09  0.668840  0.670160  0.668840  0.670160
2020-01-10  0.670560  0.670560  0.670560  0.670560

我最终结合使用了grouby和星期几的日期时间来确定我的具体解决方案

# get idxs of time to rebal (12:00:00)-------------------------------------
df['idx'] = range(len(df)) # get row index
days = [] # identify each row by day of week
for i in range(len(df.index)):
    days.append(df.index[i].date().weekday())
df['day'] = days

dtChgIdx = [] # stores "12:00:00" rows
justDates = df.index.date.tolist() # gets just dates
res = [] # removes duplicate dates
[res.append(x) for x in justDates if x not in res]
justDates = res
grouped_dates = df.groupby(df.index.date) # group entire df by dates

for i in range(len(grouped_dates)):
    tempDf = grouped_dates.get_group(justDates[i]) # look at each grouped dates
    if tempDf['day'][0] == 6:
        continue # skip Sundays
    times = [] # gets just the time portion of index
    for y in range(len(tempDf.index)):
        times.append(str(tempDf.index[y])[-8:])
    tempDf['time'] = times # add time column to df
    tempDf['dayCls'] = np.where(tempDf['time'] == '12:00:00',1,0) # idx "12:00:00" row  
    dtChgIdx.append(tempDf.loc[tempDf['dayCls'] == 1,'idx'][0]) # idx value

pandas pandas-resample resampling time-series