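Problem description
I have per-minute stock data from 2017 to 2019. I want to keep only the data from 9:16 onward each day, so I want to fold any data between 9:00 and 9:16 into the 9:16 row, i.e.:
The 09:16 row should be
- open: the first value in the 9:00-9:16 window, here 116.00
- high: the highest value in the 9:00-9:16 window, here 117.80
- low: the lowest value in the 9:00-9:16 window, here 116.00
- close: the value at 9:16, here 113.00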
open high low close
date
2017-01-02 09:08:00 116.00 116.00 116.00 116.00
2017-01-02 09:16:00 116.10 117.80 117.00 113.00
2017-01-02 09:17:00 115.50 116.20 115.50 116.20
2017-01-02 09:18:00 116.05 116.35 116.00 116.00
2017-01-02 09:19:00 116.00 116.00 115.60 115.75
... ... ... ... ...
2029-12-29 15:56:00 259.35 259.35 259.35 259.35
2019-12-29 15:57:00 260.00 260.00 260.00 260.00
2019-12-29 15:58:00 260.00 260.00 259.35 259.35
2019-12-29 15:59:00 260.00 260.00 260.00 260.00
2019-12-29 16:36:00 259.35 259.35 259.35 259.35
Here is what I tried:
#Get data from/to 9:00 - 9:16 and create only one data item
convertPreTrade = df.between_time("09:00","09:16") #09:00 - 09:16
#combine modified value to original data
df.loc[df.index.strftime("%H:%M") == "09:16",["open","high","low","close"] ] = [convertPreTrade["open"][0],convertPreTrade["high"].max(),convertPreTrade["low"].min(),convertPreTrade['close'][-1] ]
But this does not give me the correct data.
Solution
d = {'date': 'last','open': 'last','high': 'max','low': 'min','close': 'last'}
# df.index = pd.to_datetime(df.index)
s1 = df.between_time('09:00:00','09:16:00')
s2 = s1.reset_index().groupby(s1.index.date).agg(d).set_index('date')
df1 = pd.concat([df.drop(s1.index),s2]).sort_index()
Details:
Use DataFrame.between_time to filter the rows of the dataframe df that fall between 09:00 and 09:16:
print(s1)
open high low close
date
2017-01-02 09:08:00 116.0 116.0 116.0 116.0
2017-01-02 09:16:00 116.1 117.8 117.0 113.0
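As a small check of the filtering step, the sketch below builds a three-row frame (the values are made up for illustration, and `demo` and `window` are hypothetical names) and shows that between_time keeps rows at both endpoints by default, so a 09:16 bar is included while a 09:17 bar is not.

```python
import pandas as pd

# Illustrative data only; not taken from the question's dataset.
idx = pd.to_datetime([
    "2017-01-02 09:00:00",
    "2017-01-02 09:16:00",
    "2017-01-02 09:17:00",
])
demo = pd.DataFrame({"close": [1.0, 2.0, 3.0]}, index=idx)

# Both endpoints are inclusive by default, so 09:00 and 09:16 are kept.
window = demo.between_time("09:00", "09:16")
print(window)
```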
Use DataFrame.groupby to group this filtered dataframe s1 by date and aggregate it using the dictionary d:
print(s2)
open high low close
date
2017-01-02 09:16:00 116.1 117.8 116.0 113.0
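Use DataFrame.drop to remove the rows between 09:00 and 09:16 from the original dataframe df, then use pd.concat to concatenate the result with s2, and finally use DataFrame.sort_index to sort the index: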
print(df1)
open high low close
date
2017-01-02 09:16:00 116.10 117.80 116.00 113.00
2017-01-02 09:17:00 115.50 116.20 115.50 116.20
2017-01-02 09:18:00 116.05 116.35 116.00 116.00
2017-01-02 09:19:00 116.00 116.00 115.60 115.75
2019-12-29 15:57:00 260.00 260.00 260.00 260.00
2019-12-29 15:58:00 260.00 260.00 259.35 259.35
2019-12-29 15:59:00 260.00 260.00 260.00 260.00
2019-12-29 16:36:00 259.35 259.35 259.35 259.35
2029-12-29 15:56:00 259.35 259.35 259.35 259.35
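Note that the dictionary d uses 'last' for open, which is why the 09:16 row shows an open of 116.10 (the 09:16 bar's own open) rather than the 116.00 the question asks for. If the first open in the window is wanted, 'first' can be used instead. A minimal self-contained sketch of the same three steps, using only the first four rows of the sample data:

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-01-02 09:08:00", "2017-01-02 09:16:00",
    "2017-01-02 09:17:00", "2017-01-02 09:18:00",
])
df = pd.DataFrame(
    {
        "open":  [116.00, 116.10, 115.50, 116.05],
        "high":  [116.00, 117.80, 116.20, 116.35],
        "low":   [116.00, 117.00, 115.50, 116.00],
        "close": [116.00, 113.00, 116.20, 116.00],
    },
    index=idx,
)
df.index.name = "date"

# Same aggregation as in the answer, except 'open' takes the first value.
d = {"date": "last", "open": "first", "high": "max", "low": "min", "close": "last"}

s1 = df.between_time("09:00", "09:16")                     # rows in the window
s2 = s1.reset_index().groupby(s1.index.date).agg(d).set_index("date")
df1 = pd.concat([df.drop(s1.index), s2]).sort_index()      # swap window rows for s2
print(df1)
```

With this change the 09:16 row carries open 116.00, high 117.80, low 116.00, and close 113.00.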
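Another approach: extract the rows from 9:00 to 9:16, group the dataframe by year, month, and day, and compute the OHLC values per group. The logic follows your code. Finally, a date column set to 9:16 is added. Since we don't have all the data, some considerations may have been missed, but the basic form stays the same.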
import pandas as pd
import numpy as np
import io
data = '''
date open high low close
"2017-01-02 09:08:00" 116.00 116.00 116.00 116.00
"2017-01-02 09:16:00" 116.10 117.80 117.00 113.00
"2017-01-02 09:17:00" 115.50 116.20 115.50 116.20
"2017-01-02 09:18:00" 116.05 116.35 116.00 116.00
"2017-01-02 09:19:00" 116.00 116.00 115.60 115.75
"2017-01-03 09:08:00" 259.35 259.35 259.35 259.35
"2017-01-03 09:09:00" 260.00 260.00 260.00 260.00
"2017-12-03 09:18:00" 260.00 260.00 259.35 259.35
"2017-12-04 09:05:00" 260.00 260.00 260.00 260.00
"2017-12-04 09:22:00" 259.35 259.35 259.35 259.35
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['date'] = pd.to_datetime(df['date'])
# keep only the rows from 9:00 to 9:16 (inclusive)
df_start = df[(df['date'].dt.hour == 9) & (df['date'].dt.minute <= 16)]
# aggregate OHLC per calendar day
df_new = (df_start.groupby([df_start['date'].dt.year, df_start['date'].dt.month, df_start['date'].dt.day])
          .agg(open_first=('open', lambda x: x.iloc[0]), high_max=('high', 'max'),
               low_min=('low', 'min'), close_shift=('close', lambda x: x.iloc[-1])))
df_new.index.names = ['year', 'month', 'day']
df_new.reset_index(inplace=True)
# label each aggregated row with 09:16 of its day (as a string)
df_new['date'] = df_new['year'].astype(str)+'-'+df_new['month'].astype(str)+'-'+df_new['day'].astype(str)+' 09:16:00'
print(df_new)
year month day open_first high_max low_min close_shift date
0 2017 1 2 116.00 117.8 116.00 113.0 2017-1-2 09:16:00
1 2017 1 3 259.35 260.0 259.35 260.0 2017-1-3 09:16:00
2 2017 12 4 260.00 260.0 260.00 260.0 2017-12-4 09:16:00
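The date column built here is a string such as 2017-1-2 09:16:00. If a real timestamp is preferred, for example to merge the result back into the minute data, one option is to group by the normalized date and then add a 9 h 16 min offset. A minimal sketch on a small subset of the sample rows; `in_window` and `agg_window` are hypothetical names, and note that 'first' and 'last' skip missing values, unlike iloc:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-01-02 09:08:00", "2017-01-02 09:16:00", "2017-01-02 09:17:00",
        "2017-01-03 09:08:00", "2017-01-03 09:09:00",
    ]),
    "open":  [116.00, 116.10, 115.50, 259.35, 260.00],
    "high":  [116.00, 117.80, 116.20, 259.35, 260.00],
    "low":   [116.00, 117.00, 115.50, 259.35, 260.00],
    "close": [116.00, 113.00, 116.20, 259.35, 260.00],
})

# Rows from 09:00 through 09:16 inclusive.
in_window = (df["date"].dt.hour == 9) & (df["date"].dt.minute <= 16)
df_start = df[in_window]

# Group by calendar day (midnight timestamp) and aggregate OHLC.
agg_window = (
    df_start.groupby(df_start["date"].dt.normalize())
    .agg(open=("open", "first"), high=("high", "max"),
         low=("low", "min"), close=("close", "last"))
)

# Shift each day's label from midnight to 09:16.
agg_window.index = agg_window.index + pd.Timedelta(hours=9, minutes=16)
print(agg_window)
```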
Another approach, using @r-beginners's data with a few more rows added:
import pandas as pd
import numpy as np
import io
data = '''
datetime open high low close
"2017-01-02 09:08:00" 116.00 116.00 116.00 116.00
"2017-01-02 09:16:00" 116.10 117.80 117.00 113.00
"2017-01-02 09:17:00" 115.50 116.20 115.50 116.20
"2017-01-02 09:18:00" 116.05 116.35 116.00 116.00
"2017-01-02 09:19:00" 116.00 116.00 115.60 115.75
"2017-01-03 09:08:00" 259.35 259.35 259.35 259.35
"2017-01-03 09:09:00" 260.00 260.00 260.00 260.00
"2017-01-03 09:16:00" 260.00 260.00 260.00 260.00
"2017-01-03 09:17:00" 261.00 261.00 261.00 261.00
"2017-01-03 09:18:00" 262.00 262.00 262.00 262.00
"2017-12-03 09:18:00" 260.00 260.00 259.35 259.35
"2017-12-04 09:05:00" 260.00 260.00 260.00 260.00
"2017-12-04 09:22:00" 259.35 259.35 259.35 259.35
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
The code below performs the whole process. It may not be the best way, but it is quick and dirty:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df['date'] = df.index.date
dates = np.unique(df.index.date)
# first bar at or after 9:16 for each day
first_rows = df.between_time('9:16','00:00').reset_index().groupby('date').first().set_index('datetime')
first_rows['date'] = first_rows.index.date
dffs = []
for d in dates:
    df_day = df[df['date'] == d].sort_index()
    first_bar_of_the_day = first_rows[first_rows['date'] == d].copy()
    if first_bar_of_the_day.empty:
        # no bar at or after 9:16 on this day, so there is nothing to build
        continue
    first_ts = first_bar_of_the_day.index[0]
    # fold every bar up to and including the first bar into that bar
    bars_until_first = df_day.loc[df_day.index <= first_ts]
    first_bar_of_the_day['open'] = bars_until_first['open'].values[0]
    first_bar_of_the_day['high'] = bars_until_first['high'].max()
    first_bar_of_the_day['low'] = bars_until_first['low'].min()
    first_bar_of_the_day['close'] = bars_until_first['close'].values[-1]
    bars_after_first = df_day.loc[df_day.index > first_ts]
    if not bars_after_first.empty:
        dff = pd.concat([first_bar_of_the_day, bars_after_first])
    else:
        dff = first_bar_of_the_day.copy()
    print(dff)
    dffs.append(dff)
combined_df = pd.concat(dffs)
print(combined_df)
The printed results are as follows. dff for the different days:
open high low close date
datetime
2017-01-02 09:16:00 116.00 117.80 116.0 113.00 2017-01-02
2017-01-02 09:17:00 115.50 116.20 115.5 116.20 2017-01-02
2017-01-02 09:18:00 116.05 116.35 116.0 116.00 2017-01-02
2017-01-02 09:19:00 116.00 116.00 115.6 115.75 2017-01-02
open high low close date
datetime
2017-01-03 09:16:00 259.35 260.0 259.35 260.0 2017-01-03
2017-01-03 09:17:00 261.00 261.0 261.00 261.0 2017-01-03
2017-01-03 09:18:00 262.00 262.0 262.00 262.0 2017-01-03
open high low close date
datetime
2017-12-03 09:18:00 260.0 260.0 259.35 259.35 2017-12-03
open high low close date
datetime
2017-12-04 09:22:00 260.0 260.0 259.35 259.35 2017-12-04
combined_df
open high low close date
datetime
2017-01-02 09:16:00 116.00 117.80 116.00 113.00 2017-01-02
2017-01-02 09:17:00 115.50 116.20 115.50 116.20 2017-01-02
2017-01-02 09:18:00 116.05 116.35 116.00 116.00 2017-01-02
2017-01-02 09:19:00 116.00 116.00 115.60 115.75 2017-01-02
2017-01-03 09:16:00 259.35 260.00 259.35 260.00 2017-01-03
2017-01-03 09:17:00 261.00 261.00 261.00 261.00 2017-01-03
2017-01-03 09:18:00 262.00 262.00 262.00 262.00 2017-01-03
2017-12-03 09:18:00 260.00 260.00 259.35 259.35 2017-12-03
2017-12-04 09:22:00 260.00 260.00 259.35 259.35 2017-12-04
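The per-day loop can also be expressed without explicit iteration. The sketch below is one possible alternative, not the code from this answer: it relabels every bar before 09:16 with the timestamp of that day's first bar at or after 09:16, then aggregates the rows that share a label. `cutoff`, `first_after`, and `bar_label` are hypothetical names, and the frame is a small subset of the sample rows.

```python
import pandas as pd

idx = pd.to_datetime([
    "2017-01-03 09:08:00", "2017-01-03 09:09:00", "2017-01-03 09:16:00",
    "2017-01-03 09:17:00", "2017-12-04 09:05:00", "2017-12-04 09:22:00",
])
df = pd.DataFrame(
    {
        "open":  [259.35, 260.00, 260.00, 261.00, 260.00, 259.35],
        "high":  [259.35, 260.00, 260.00, 261.00, 260.00, 259.35],
        "low":   [259.35, 260.00, 260.00, 261.00, 260.00, 259.35],
        "close": [259.35, 260.00, 260.00, 261.00, 260.00, 259.35],
    },
    index=idx,
).sort_index()

ts = df.index.to_series()
day = ts.dt.normalize()
cutoff = day + pd.Timedelta(hours=9, minutes=16)

# First timestamp at or after 09:16 for each day (earlier bars are masked to NaT).
first_after = ts.where(ts >= cutoff).groupby(day).transform("min")

# Bars before the cutoff take the label of that first bar; others keep their own.
# Days with no bar at or after 09:16 get a NaT label and are dropped by groupby.
bar_label = ts.where(ts >= cutoff, first_after)

combined = df.groupby(bar_label).agg(
    {"open": "first", "high": "max", "low": "min", "close": "last"}
)
combined.index.name = "datetime"
print(combined)
```

On these rows the 2017-01-03 09:16 bar becomes 259.35 / 260.00 / 259.35 / 260.00 and the 2017-12-04 09:22 bar becomes 260.00 / 260.00 / 259.35 / 259.35, matching the loop's output for those days.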
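Side note: I'm not sure the way you are cleaning the data is the best one. You might consider ignoring the period before 9:16 each day altogether, or even run an analysis of the volatility in the first 15 minutes to decide.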
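If the bars before 9:16 are simply dropped rather than folded in, between_time is enough on its own. A minimal sketch with made-up values; `regular` is a hypothetical name, and the end time 23:59:59 is an assumption that no bar needs to be kept past that second:

```python
import pandas as pd

# Illustrative data only.
idx = pd.to_datetime([
    "2017-01-02 09:08:00",
    "2017-01-02 09:16:00",
    "2017-01-02 09:17:00",
    "2017-01-02 15:59:00",
])
df = pd.DataFrame({"close": [116.00, 113.00, 116.20, 260.00]}, index=idx)

# Keep only bars from 09:16 onward; the 09:08 bar is discarded.
regular = df.between_time("09:16", "23:59:59")
print(regular)
```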