LEFT JOIN on a time interval and a timestamp in pandas

Problem description

I have two DataFrames:

Labels:

import numpy as np
import pandas as pd
marker_labels = pd.DataFrame({'cohort_id':[1,1,1],'marker_type':['a','b','a'],'start':['2020-01-2','2020-01-04 05','2020-01-06'],'end':[np.nan,'2020-01-05 16',np.nan]})
marker_labels['start'] = pd.to_datetime(marker_labels['start'])
marker_labels['end'] = pd.to_datetime(marker_labels['end'])
# labels without an end last one day minus one second from their start
marker_labels.loc[marker_labels['end'].isnull(),'end'] = marker_labels.start + pd.Timedelta(days=1) - pd.Timedelta(seconds=1)

and the data:

import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({'hour': {36: Timestamp('2020-01-04 04:00:00'),37: Timestamp('2020-01-04 04:00:00'),38: Timestamp('2020-01-04 04:00:00'),39: Timestamp('2020-01-04 04:00:00'),40: Timestamp('2020-01-04 04:00:00'),41: Timestamp('2020-01-04 04:00:00'),42: Timestamp('2020-01-04 04:00:00'),43: Timestamp('2020-01-04 04:00:00'),44: Timestamp('2020-01-04 04:00:00'),45: Timestamp('2020-01-04 05:00:00'),46: Timestamp('2020-01-04 05:00:00'),47: Timestamp('2020-01-04 05:00:00'),48: Timestamp('2020-01-04 05:00:00'),49: Timestamp('2020-01-04 05:00:00'),50: Timestamp('2020-01-04 05:00:00'),51: Timestamp('2020-01-04 05:00:00'),52: Timestamp('2020-01-04 05:00:00'),53: Timestamp('2020-01-04 05:00:00')},'metrik_0': {36: -0.30098661551885625,37: -0.6402837079024638,38: -2.6953511655638778,39: 0.4036062912674384,40: -0.035627996627399204,41: -0.06510225503176624,42: -1.9745426914329782,43: 1.4112111331287631,44: 0.18641277342651516,45: 0.10780795451690242,46: 0.31822895003286417,47: -1.0804164740649171,48: -1.6676697601556636,49: -1.0354359757914047,50: 1.8570215568670299,51: 0.9055795225472866,52: -0.020539970820695173,53: -0.7975048293123836},'cohort_id': {36: 1,37: 1,38: 1,39: 1,40: 1,41: 1,42: 1,43: 1,44: 1,45: 1,46: 1,47: 1,48: 1,49: 1,50: 1,51: 1,52: 1,53: 1},'device_id': {36: 6,37: 5,38: 11,39: 20,40: 18,42: 14,43: 9,44: 12,45: 9,46: 14,47: 11,48: 20,49: 5,51: 12,52: 6,53: 18}})
df

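I want to perform a LEFT JOIN on the column cohort_id and on the time interval (hour BETWEEN start AND end).

A similar question is:

So far I have tried several approaches, but I have not found a final solution:

First: slow, and the result is not fully exploded or accessible as plain pandas columns: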

def join_on_matching_interval(x):
    result = marker_labels[(marker_labels.cohort_id == x.cohort_id) & (x.hour >= marker_labels.start) & (x.hour <= marker_labels.end)]
    if len(result) == 0:
        result = []
    return result
    
df['marker_labels'] = df.apply(join_on_matching_interval,axis=1)
print(df.shape[0])
#df = df.explode('marker_labels') # this fails to work
df['size'] = df.marker_labels.apply(lambda x: len(x))
df[(df['size'] > 0)].head()

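How can I make the result accessible as columns?

Second: correct columns, but the wrong number of rows (fast):

Following the link I shared above: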
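One possible way to get columns out of the row-wise approach is to have the function return a Series instead of a filtered DataFrame. The following is only a minimal sketch on toy data: the helper name first_matching_label and the frames labels and data are made up for illustration, and it assumes that intervals within a cohort do not overlap, so taking the first hit is enough. It stays as slow as any row-wise apply.

```python
import numpy as np
import pandas as pd

# Illustrative toy data (assumption: not the full frames from the question).
labels = pd.DataFrame({
    "cohort_id": [1, 1],
    "marker_type": ["a", "b"],
    "start": pd.to_datetime(["2020-01-02 00:00:00", "2020-01-04 05:00:00"]),
    "end": pd.to_datetime(["2020-01-02 23:59:59", "2020-01-05 16:00:00"]),
})
data = pd.DataFrame({
    "cohort_id": [1, 1, 2],
    "hour": pd.to_datetime(["2020-01-04 04:00:00",
                            "2020-01-04 05:00:00",
                            "2020-01-04 05:00:00"]),
})

label_cols = ["marker_type", "start", "end"]

def first_matching_label(row):
    """Hypothetical helper: return the first matching label as a Series."""
    hits = labels[(labels.cohort_id == row.cohort_id)
                  & (labels.start <= row.hour)
                  & (row.hour <= labels.end)]
    if hits.empty:
        # No interval contains this timestamp: return missing values.
        return pd.Series([np.nan, pd.NaT, pd.NaT], index=label_cols, dtype=object)
    return hits.iloc[0][label_cols]

# Every call returns a Series with the same index, so apply yields a DataFrame.
matched = data.apply(first_matching_label, axis=1)
result = pd.concat([data, matched[label_cols]], axis=1)
print(result)
```

Each row of data keeps its position, and the label fields appear as ordinary columns that are empty where nothing matched.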

print(len(df))
print(len(marker_labels))
merged_res = df.merge(marker_labels,left_on=['cohort_id'],right_on=['cohort_id'],how='left')
print(len(merged_res)) # the number of rows has increased
merged_res = merged_res[(merged_res.hour.between(merged_res.start,merged_res.end)) | (merged_res.start.isnull())]
print(len(merged_res)) # but now too few rows are left over.
  1. Case 1: no match (handled correctly)
  2. Case 2: full match (handled correctly)
  3. Case 3: partial match (not handled -> the records are dropped)

For case 3 in particular, this means:

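  • I do not want to receive any duplicates
  • All rows from the left side must be kept
  • plus the matches where the time interval and the timestamp overlap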

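How can I include the third case in the condition?

Solution

It sounds like you want to merge, filter with query, and then join the result back onto the original rows: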

tmp = (df.reset_index()
         .merge(marker_labels,on='cohort_id',how='left')
         .query('start <= hour <= end')
         .set_index('index')
         .reindex(df.index)
      )

out = tmp.combine_first(df)
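To make the individual steps easier to follow, here is the same chain split up on a small toy example. This is only a sketch: the frames labels and data and the intermediate names paired, matched and aligned are invented for illustration, and the index values are arbitrary.

```python
import pandas as pd

# Illustrative toy data (assumption: not the full frames from the question).
labels = pd.DataFrame({
    "cohort_id": [1, 1],
    "marker_type": ["a", "b"],
    "start": pd.to_datetime(["2020-01-02 00:00:00", "2020-01-04 05:00:00"]),
    "end": pd.to_datetime(["2020-01-02 23:59:59", "2020-01-05 16:00:00"]),
})
data = pd.DataFrame({
    "cohort_id": [1, 1, 2],
    "hour": pd.to_datetime(["2020-01-04 04:00:00",
                            "2020-01-04 05:00:00",
                            "2020-01-04 05:00:00"]),
}, index=[36, 45, 46])

# Step 1: reset_index keeps the original row labels in a column named "index".
# Step 2: the merge pairs every row with every label of its cohort.
paired = data.reset_index().merge(labels, on="cohort_id", how="left")

# Step 3: keep only the pairs whose timestamp lies inside the interval.
matched = paired.query("start <= hour <= end").set_index("index")

# Step 4: reindex restores the rows that lost all their pairs, as empty rows.
aligned = matched.reindex(data.index)

# Step 5: combine_first refills the original columns of those empty rows.
out = aligned.combine_first(data)
print(out)
```

Row 36 belongs to a cohort with labels but falls outside every interval (case 3), and row 46 belongs to a cohort without any labels (case 1). Both survive with empty label columns, while row 45 receives label b.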

Output:

      cohort_id    device_id  end                  hour                 marker_type      metrik_0  start
--  -----------  -----------  -------------------  -------------------  -------------  ----------  -------------------
36            1            6  NaT                  2020-01-04 04:00:00  nan            -0.300987   NaT
37            1            5  NaT                  2020-01-04 04:00:00  nan            -0.640284   NaT
38            1           11  NaT                  2020-01-04 04:00:00  nan            -2.69535    NaT
39            1           20  NaT                  2020-01-04 04:00:00  nan             0.403606   NaT
40            1           18  NaT                  2020-01-04 04:00:00  nan            -0.035628   NaT
41            1            1  NaT                  2020-01-04 04:00:00  nan            -0.0651023  NaT
42            1           14  NaT                  2020-01-04 04:00:00  nan            -1.97454    NaT
43            1            9  NaT                  2020-01-04 04:00:00  nan             1.41121    NaT
44            1           12  NaT                  2020-01-04 04:00:00  nan             0.186413   NaT
45            1            9  2020-01-05 16:00:00  2020-01-04 05:00:00  b               0.107808   2020-01-04 05:00:00
46            1           14  2020-01-05 16:00:00  2020-01-04 05:00:00  b               0.318229   2020-01-04 05:00:00
47            1           11  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -1.08042    2020-01-04 05:00:00
48            1           20  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -1.66767    2020-01-04 05:00:00
49            1            5  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -1.03544    2020-01-04 05:00:00
50            1            1  2020-01-05 16:00:00  2020-01-04 05:00:00  b               1.85702    2020-01-04 05:00:00
51            1           12  2020-01-05 16:00:00  2020-01-04 05:00:00  b               0.90558    2020-01-04 05:00:00
52            1            6  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -0.02054    2020-01-04 05:00:00
53            1           18  2020-01-05 16:00:00  2020-01-04 05:00:00  b              -0.797505   2020-01-04 05:00:00
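One caveat: in the sample data no timestamp falls into more than one interval. If intervals of the same cohort overlapped, the filtered frame would contain the same original index more than once, and reindexing a frame with duplicate index labels is expected to fail. A hedged sketch of one way to handle this, assuming it is acceptable to keep only the first match per row (the toy frames below are invented for illustration):

```python
import pandas as pd

# Illustrative toy data with two overlapping intervals (assumption).
labels = pd.DataFrame({
    "cohort_id": [1, 1],
    "marker_type": ["a", "b"],
    "start": pd.to_datetime(["2020-01-04 00:00:00", "2020-01-04 05:00:00"]),
    "end": pd.to_datetime(["2020-01-04 23:59:59", "2020-01-05 16:00:00"]),
})
data = pd.DataFrame({
    "cohort_id": [1, 1],
    "hour": pd.to_datetime(["2020-01-04 05:00:00", "2020-01-06 00:00:00"]),
}, index=[45, 46])

matched = (data.reset_index()
               .merge(labels, on="cohort_id", how="left")
               .query("start <= hour <= end")
               .set_index("index"))

# Row 45 matches both intervals; keep only its first match so that
# every original row appears at most once before reindexing.
matched = matched[~matched.index.duplicated(keep="first")]

out = matched.reindex(data.index).combine_first(data)
print(out)
```

Which match counts as the first one depends on the row order produced by the merge, so sort the labels beforehand if a specific priority is needed.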