我有两个数据帧(很长,每个有数百或数千行).其中一个名为df1,包含一个时间序列,间隔为10分钟.例如:
date value 2016-11-24 00:00:00 1759.199951 2016-11-24 00:10:00 992.400024 2016-11-24 00:20:00 1404.800049 2016-11-24 00:30:00 45.799999 2016-11-24 00:40:00 24.299999 2016-11-24 00:50:00 159.899994 2016-11-24 01:00:00 82.499999 2016-11-24 01:10:00 37.400003 2016-11-24 01:20:00 159.899994 ....
而另一个,df2,包含日期时间间隔:
start_date end_date 0 2016-11-23 23:55:32 2016-11-24 00:14:03 1 2016-11-24 01:03:18 2016-11-24 01:07:12 2 2016-11-24 01:11:32 2016-11-24 02:00:00 ...
我需要选择df1中的所有行,这些行“落入”df2中的一个区间.
通过这些示例,结果数据框应为:
date value 2016-11-24 00:00:00 1759.199951 # Fits in row 0 of df2 2016-11-24 00:10:00 992.400024 # Fits in row 0 of df2 2016-11-24 01:00:00 82.499999 # Fits in row 1 of df2 2016-11-24 01:10:00 37.400003 # Fits on row 2 of df2 2016-11-24 01:20:00 159.899994 # Fits in row 2 of df2 ....
解决方法:
这是基于np.searchsorted的变体,似乎比使用intervaltree或merge快一个数量级,假设我的更大的样本数据是正确的.
# Ensure the df2 is sorted (skip if it's already kNown to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
如果df2中的间隔嵌套或重叠,则可能需要修改此值;在那种情况下,我还没有完全考虑过,但它仍然可以工作.
使用间隔树
不完全是纯粹的Pandas解决方案,但您可能需要考虑从df2构建Interval Tree,并根据df1中的间隔查询它以找到重叠的那些.
PyPI上的intervaltree
软件包似乎具有良好的性能和易于使用的语法.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
出于性能原因,我将日期转换为等价的整数.我怀疑intervaltree包是用pd.Timestamp构建的,所以可能有一些中间转换步骤会让事情变慢.
另请注意,虽然包含起点,但intervaltree包中的间隔不包括终点.这就是为什么我在创建树时有[0,1]的原因;我将终点填充一个纳秒,以确保实际包含真正的终点.这也是为什么我可以在查询树时添加pd.offsets.Minute(10)以获得间隔结束,而不是仅添加9m 59s.
date value
0 2016-11-24 00:00:00 1759.199951
1 2016-11-24 00:10:00 992.400024
6 2016-11-24 01:00:00 82.499999
7 2016-11-24 01:10:00 37.400003
8 2016-11-24 01:20:00 159.899994
计时
使用以下设置生成更大的样本数据:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
这为df1和df2产生以下结果:
df1
date value
0 2016-11-24 00:00:00 0.444939
1 2016-11-24 00:10:00 0.407554
2 2016-11-24 00:20:00 0.460148
3 2016-11-24 00:30:00 0.465239
4 2016-11-24 00:40:00 0.462691
...
54995 2017-12-10 21:50:00 0.754123
54996 2017-12-10 22:00:00 0.401820
54997 2017-12-10 22:10:00 0.146284
54998 2017-12-10 22:20:00 0.394759
54999 2017-12-10 22:30:00 0.907233
df2
start_date end_date
0 2016-11-24 00:00:19 2016-11-24 00:41:24
1 2016-11-24 18:22:44 2016-11-24 18:36:44
2 2016-11-25 12:44:44 2016-11-25 13:03:13
3 2016-11-26 07:07:05 2016-11-26 07:49:29
4 2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53
并使用以下函数进行计时:
def root_searchsorted(df1, df2):
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Get the insertion indexes for the endpoints of the intervals from df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
return df1[np.any(cond, axis=0)].drop('date_end', axis=1)
def root_intervaltree(df1, df2):
# Build the Interval Tree.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter the DataFrame.
return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
def ptrj(df1, df2):
# The smallest amount of time - handy when using open intervals:
epsilon = pd.timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# (filling NaN's with -1)
l = edate.asof(df1.date).fillna(-1)
r = sdate.asof(df1.date + (pd.timedelta(10, 'm') - epsilon)).fillna(-1)
# (taking `values` here to skip indexes, which are different)
mask = l.values < r.values
return df1[mask]
def parfait(df1, df2):
df1['key'] = 1
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]
def root_searchsorted_modified(df1, df2):
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Get the insertion indexes for the endpoints of the intervals from df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# ---- further is the MODIFIED code ----
# Filter df1 to only overlapping intervals.
df1.query('(date <= @s1.end_date.values) |\
(date_end <= @s1.end_date.values) |\
(@s1.index.values != @s2.index.values)', inplace=True)
# Drop the extra 'date_end' column.
return df1.drop('date_end', axis=1)
我得到以下时间:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop