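Identifying and recording overlapping time intervals with a Pandas DataFrame

Question

I'm stuck on where to even start with a problem I'm trying to solve. I have a DataFrame with 4 columns, and I want to find overlapping times per day and per id. For example, my df looks like this: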

+------+--------------+-----------------------+----------------------+
| id   |   date       |  time_start           |  end_time            |
+--------------------------------------------------------------------+
| 123  |   2019-11-10 |  2019-11-10 08:00:00  |  2019-11-10 08:30:00 |
|      |              |                       |                      |
| 123  |   2019-11-10 |  2019-11-10 08:15:00  |  2019-11-10 08:30:00 |
|      |              |                       |                      |
| 123  |   2019-11-10 |  2019-11-10 08:25:00  |  2019-11-10 08:45:00 |
|      |              |                       |                      |
| 123  |   2019-11-11 |  2019-11-11 08:00:00  |  2019-11-11 08:30:00 |
|      |              |                       |                      |
| 123  |   2019-11-11 |  2019-11-11 08:30:00  |  2019-11-11 09:00:00 |
+------+--------------+-----------------------+----------------------+

import pandas as pd

# Every column needs one value per row (5 rows, matching the table above)
data = {
    'id': ['123', '123', '123', '123', '123'],
    'date': ['2019-11-10', '2019-11-10', '2019-11-10', '2019-11-11', '2019-11-11'],
    'time_start': ['2019-11-10 08:00:00', '2019-11-10 08:15:00', '2019-11-10 08:25:00', '2019-11-11 08:00:00', '2019-11-11 08:30:00'],
    'end_time': ['2019-11-10 08:30:00', '2019-11-10 08:30:00', '2019-11-10 08:45:00', '2019-11-11 08:30:00', '2019-11-11 09:00:00'],
}

df = pd.DataFrame(data)

,id,date,time_start,end_time
0,123,2019-11-10,2019-11-10 08:00:00,2019-11-10 08:30:00
1,123,2019-11-10,2019-11-10 08:15:00,2019-11-10 08:30:00
2,123,2019-11-10,2019-11-10 08:25:00,2019-11-10 08:45:00
3,123,2019-11-11,2019-11-11 08:00:00,2019-11-11 08:30:00
4,123,2019-11-11,2019-11-11 08:30:00,2019-11-11 09:00:00

I would like to see a result similar to the following:

+----+------------+----------------------+---------------------+---------------+-------------------------+-----------------+
|id  | date       |  time_start          | time_end            | overlap_count |  total_minutes_recorded |   actual_minutes|
+--------------------------------------------------------------------------------------------------------------------------+
|123 | 2019-11-10 |  2019-11-10 08:00:00 | 2019-11-10 08:45:00 | 3             |  65                     |   45            |
|    |            |                      |                     |               |                         |                 |
|123 | 2019-11-11 |  2019-11-11 08:00:00 | 2019-11-11 09:00:00 | 0             |  60                     |   60            |
+----+------------+----------------------+---------------------+---------------+-------------------------+-----------------+

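I have looked at other answers that started to give me some insight into how to approach this, for example:

Pandas: Count time interval intersections over a group by

Most of those answers only give me some of the overlapping times, and they take a long time to compute. Are there any tips on how to get started on this problem?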

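Solution

Use groupby to group on date, then define a function that receives the rows for each date as a DataFrame. I have given you get_minutes_recorded. get_overlap_counts is a little more involved: you could solve it by keeping a vector with a 0 for each index, iterating over all rows i, and setting vector[n] = 1 whenever the end_time[i] of row i falls between the start and the end of row n.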

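def function(sub_df):
    overlap_count = get_overlap_counts(sub_df)
    total_minutes_recorded = get_minutes_recorded(sub_df)
    return overlap_count, total_minutes_recorded

def get_overlap_counts(df):
    # Left as a stub: implement the flag-vector idea described above
    pass

def get_minutes_recorded(df):
    # Assumes time_start and end_time were already converted with pd.to_datetime
    return (df['end_time'] - df['time_start']).dt.total_seconds().sum() / 60


df.groupby('date').apply(function)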
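One possible way to fill in the get_overlap_counts stub, following the flag-vector idea described above. This is only a sketch under a few assumptions: the columns are named time_start and end_time and are already datetimes, two intervals that merely touch at an endpoint (08:30 to 08:30) do not count as overlapping, and the count wanted is the number of rows that overlap at least one other row. The pairwise comparison is quadratic in the group size, which is usually fine for per-day groups. The demo frame and the name overlap_counts are made up for illustration.

```python
import numpy as np
import pandas as pd

def get_overlap_counts(sub_df):
    """Count rows whose interval overlaps at least one other row in sub_df."""
    starts = sub_df["time_start"].to_numpy()
    ends = sub_df["end_time"].to_numpy()
    # overlaps[i, j] is True when rows i and j share more than a single endpoint
    overlaps = (starts[:, None] < ends[None, :]) & (starts[None, :] < ends[:, None])
    np.fill_diagonal(overlaps, False)   # a row does not count as overlapping itself
    flags = overlaps.any(axis=1)        # the 0/1 vector, one entry per row
    return int(flags.sum())

# Hypothetical demo data, copied from the question's table
demo = pd.DataFrame({
    "date": ["2019-11-10"] * 3 + ["2019-11-11"] * 2,
    "time_start": pd.to_datetime([
        "2019-11-10 08:00:00", "2019-11-10 08:15:00", "2019-11-10 08:25:00",
        "2019-11-11 08:00:00", "2019-11-11 08:30:00",
    ]),
    "end_time": pd.to_datetime([
        "2019-11-10 08:30:00", "2019-11-10 08:30:00", "2019-11-10 08:45:00",
        "2019-11-11 08:30:00", "2019-11-11 09:00:00",
    ]),
})

# One count per date
overlap_counts = {date: get_overlap_counts(g) for date, g in demo.groupby("date")}
print(overlap_counts)
```

With the sample rows this gives 3 for 2019-11-10 and 0 for 2019-11-11, because the two intervals on the second day only touch at 08:30.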

I don't know how you got the values of overlap_count and total_minutes_recorded in the first row; I think they are wrong.

import numpy as np
import pandas as pd

# Every column needs one value per row (5 rows, as in the question's table)
df = pd.DataFrame({
    'id': [123, 123, 123, 123, 123],
    'date': ['2019-11-10', '2019-11-10', '2019-11-10', '2019-11-11', '2019-11-11'],
    'time_start': ['2019-11-10 08:00:00', '2019-11-10 08:15:00', '2019-11-10 08:25:00', '2019-11-11 08:00:00', '2019-11-11 08:30:00'],
    'end_time': ['2019-11-10 08:30:00', '2019-11-10 08:30:00', '2019-11-10 08:45:00', '2019-11-11 08:30:00', '2019-11-11 09:00:00'],
})
df['date'] = pd.to_datetime(df['date'])
df['time_start'] = pd.to_datetime(df['time_start'])
df['end_time'] = pd.to_datetime(df['end_time'])

# Self-join on id so that every row is paired with every other row
df_temp = df
df = pd.merge(df, df_temp, on='id')
# Keep only the pairs whose start times are exactly one day apart
df = df[
    ((df.time_start_x - df.time_start_y) == np.timedelta64(1, 'D'))
]
df_temp = df[['id', 'date_x', 'time_start_x', 'end_time_x']]
df_temp1 = df[['id', 'date_y', 'time_start_y', 'end_time_y']]
df_temp = df_temp.rename(columns={"date_x": "date", "time_start_x": "time_start", "end_time_x": "end_time"})
df_temp1 = df_temp1.rename(columns={"date_y": "date", "time_start_y": "time_start", "end_time_y": "end_time"})

# Stack both sides of each pair back into one frame
df = pd.concat([df_temp, df_temp1])
df = df[['id', 'date', 'time_start', 'end_time']].sort_values(by='date')

# Duration of each row, stored as a timedelta
df['total_minutes_recorded'] = df['end_time'] - df['time_start']

print(df)
     id       date          time_start            end_time total_minutes_recorded
15  123 2019-11-10 2019-11-10 08:00:00 2019-11-10 08:30:00               00:30:00
15  123 2019-11-11 2019-11-11 08:00:00 2019-11-11 08:30:00               00:30:00
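
For comparison, here is a rough sketch of how the total_minutes_recorded and actual_minutes columns from the question could be computed per id and date. It is not taken from either answer. It assumes that actual_minutes means the length of the union of the intervals (overlapping or touching intervals are merged before measuring) and that total_minutes_recorded is the plain sum of the row durations. The names intervals, merged_minutes and summary are made up for illustration, and overlap_count is left out.

```python
import pandas as pd

# Sample rows from the question, with the time columns parsed as datetimes
intervals = pd.DataFrame({
    "id": [123] * 5,
    "date": ["2019-11-10"] * 3 + ["2019-11-11"] * 2,
    "time_start": pd.to_datetime([
        "2019-11-10 08:00:00", "2019-11-10 08:15:00", "2019-11-10 08:25:00",
        "2019-11-11 08:00:00", "2019-11-11 08:30:00",
    ]),
    "end_time": pd.to_datetime([
        "2019-11-10 08:30:00", "2019-11-10 08:30:00", "2019-11-10 08:45:00",
        "2019-11-11 08:30:00", "2019-11-11 09:00:00",
    ]),
})

def merged_minutes(group):
    """Minutes covered by the union of the intervals in group."""
    total = pd.Timedelta(0)
    cur_start = cur_end = None
    for start, end in sorted(zip(group["time_start"], group["end_time"])):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the current block and open a new one
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current block
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total.total_seconds() / 60

rows = []
for (id_, date), g in intervals.groupby(["id", "date"]):
    durations = g["end_time"] - g["time_start"]
    rows.append({
        "id": id_,
        "date": date,
        "time_start": g["time_start"].min(),
        "time_end": g["end_time"].max(),
        "total_minutes_recorded": durations.dt.total_seconds().sum() / 60,
        "actual_minutes": merged_minutes(g),
    })

summary = pd.DataFrame(rows)
print(summary)
```

On the sample data this gives 65 recorded and 45 actual minutes for 2019-11-10, and 60 and 60 for 2019-11-11, which matches the table in the question.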