Problem description
I have the following DataFrame:
padel start_time end_time duration
38 Padel 10 08:00:00 09:00:00 60
40 Padel 10 10:00:00 11:30:00 90
42 Padel 10 10:30:00 12:00:00 90
44 Padel 10 11:00:00 12:30:00 90
46 Padel 10 11:30:00 13:00:00 90
49 Padel 10 16:00:00 17:30:00 90
51 Padel 10 16:30:00 18:00:00 90
53 Padel 10 17:00:00 18:30:00 90
55 Padel 10 17:30:00 19:00:00 90
57 Padel 10 18:00:00 19:30:00 90
59 Padel 10 18:30:00 20:00:00 90
61 Padel 10 19:00:00 20:30:00 90
63 Padel 10 19:30:00 21:00:00 90
65 Padel 10 20:00:00 21:30:00 90
67 Padel 10 20:30:00 22:00:00 90
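I want to merge the overlapping rows so that each group collapses into its longest continuous time span. My desired output looks like this: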
padel start_time end_time duration
38 Padel 10 08:00:00 09:00:00 60
40 Padel 10 10:00:00 13:00:00 180
49 Padel 10 16:00:00 22:00:00 360
I don't care about the duration column; I can compute that myself. But how would I merge the overlapping time spans? Thanks.
Solution
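- You can create the groups by checking whether a row's start_time is greater than (gt) the end_time of the row above, which you get with shift(). Where that is True, the row does not overlap the previous one, so a new group starts there.
- We use fillna with '24:00:00' because shift() leaves NaN in the first row. No start time can be later than 24:00:00, so the first row compares as False and joins group 0, instead of depending on how a comparison with NaN happens to be handled.
- The comparison returns a boolean Series of True and False (i.e. 1 and 0 respectively), so you can simply take the cumulative sum with cumsum to number the groups.
- This creates a grp object that we can include in the groupby.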
df = df.sort_values(by=['padel','start_time'],ascending=[True,True])
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp,'padel'],as_index=False).agg({'start_time':'first','end_time':'last'})
df['duration'] = ((pd.to_timedelta(df['end_time']) -
pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)
Out[1]:
padel start_time end_time duration
0 Padel 10 08:00:00 09:00:00 60
1 Padel 10 10:00:00 13:00:00 180
2 Padel 10 16:00:00 22:00:00 360
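To make the grouping steps above easier to follow, here is a minimal sketch that prints the intermediate values for four rows taken from the question. The names demo, prev_end and new_span are only illustrative and are not part of the answer's code.

```python
import pandas as pd

# Four rows copied from the question: one isolated slot, two overlapping
# slots, and the first slot of the afternoon block.
demo = pd.DataFrame({
    'start_time': ['08:00:00', '10:00:00', '10:30:00', '16:00:00'],
    'end_time':   ['09:00:00', '11:30:00', '12:00:00', '17:30:00'],
})

# end_time of the row above; the first row has none, so fill it with '24:00:00'
prev_end = demo['end_time'].shift().fillna('24:00:00')

# True where a row starts after the previous row ended, i.e. a new span begins
new_span = demo['start_time'].gt(prev_end)

# The running total of the True values gives one integer label per span
grp = new_span.cumsum()

print(pd.DataFrame({'prev_end': prev_end, 'new_span': new_span, 'grp': grp}))
```

For these rows grp comes out as 0, 1, 1, 2: the 10:30 slot shares a label with the 10:00 slot because it starts before that slot ends. Note that the comparison works on strings here, which is only safe because the times are zero-padded HH:MM:SS.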
Full code including the input DataFrame:
import pandas as pd
df = pd.DataFrame(pd.DataFrame({'padel': {38: 'Padel 10',40: 'Padel 10',42: 'Padel 10',44: 'Padel 10',46: 'Padel 10',49: 'Padel 10',51: 'Padel 10',53: 'Padel 10',55: 'Padel 10',57: 'Padel 10',59: 'Padel 10',61: 'Padel 10',63: 'Padel 10',65: 'Padel 10',67: 'Padel 10'},'start_time': {38: '08:00:00',40: '10:00:00',42: '10:30:00',44: '11:00:00',46: '11:30:00',49: '16:00:00',51: '16:30:00',53: '17:00:00',55: '17:30:00',57: '18:00:00',59: '18:30:00',61: '19:00:00',63: '19:30:00',65: '20:00:00',67: '20:30:00'},'end_time': {38: '09:00:00',40: '11:30:00',42: '12:00:00',44: '12:30:00',46: '13:00:00',49: '17:30:00',51: '18:00:00',53: '18:30:00',55: '19:00:00',57: '19:30:00',59: '20:00:00',61: '20:30:00',63: '21:00:00',65: '21:30:00',67: '22:00:00'},'duration': {38: 60,40: 90,42: 90,44: 90,46: 90,49: 90,51: 90,53: 90,55: 90,57: 90,59: 90,61: 90,63: 90,65: 90,67: 90}}))
df = df.sort_values(by=['padel','start_time'],ascending=[True,True])
grp = df['start_time'].gt(df['end_time'].shift().fillna('24:00:00')).cumsum()
df = df.groupby([grp,'padel'],as_index=False).agg({'start_time':'first','end_time':'last'})
df['duration'] = ((pd.to_timedelta(df['end_time']) -
                   pd.to_timedelta(df['start_time'])).dt.seconds / 60).astype(int)
df
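One thing to be aware of: the code above compares each row only with the row directly before it, which is enough for the data in the question. If a short slot could sit entirely inside a longer one, comparing against the latest end seen so far may be safer. The following is a rough sketch of that variant, not the answer's own method; the function name merge_spans, the helper columns start_s, end_s and span, and the demo rows are all assumptions made for illustration.

```python
import pandas as pd

def merge_spans(df):
    """Merge overlapping start_time/end_time rows per padel (illustrative sketch)."""
    out = df.sort_values(['padel', 'start_time']).copy()
    # Seconds since midnight, so comparisons are numeric rather than string-based
    out['start_s'] = pd.to_timedelta(out['start_time']).dt.total_seconds()
    out['end_s'] = pd.to_timedelta(out['end_time']).dt.total_seconds()
    # Latest end seen so far within each padel, shifted down to the next row
    running_end = out.groupby('padel')['end_s'].cummax()
    prev_end = running_end.groupby(out['padel']).shift()
    # A new span starts on the first row of a padel or after a real gap
    out['span'] = (prev_end.isna() | (out['start_s'] > prev_end)).cumsum()
    res = (out.groupby(['padel', 'span'])
              .agg(start_time=('start_time', 'first'), end_time=('end_time', 'max'))
              .reset_index()
              .drop(columns='span'))
    minutes = (pd.to_timedelta(res['end_time'])
               - pd.to_timedelta(res['start_time'])).dt.total_seconds() // 60
    res['duration'] = minutes.astype(int)
    return res

# Hypothetical rows: the second and third slots lie inside the first one.
demo = pd.DataFrame({
    'padel': ['Padel 10'] * 4,
    'start_time': ['10:00:00', '10:30:00', '12:00:00', '14:00:00'],
    'end_time':   ['13:00:00', '11:00:00', '12:30:00', '15:00:00'],
})
res = merge_spans(demo)
print(res)
```

With the previous-row comparison, the 12:00 slot would start a new group because it begins after the 11:00 end of the row above it; tracking the running maximum keeps it inside the 10:00 to 13:00 span.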
,
# Coerce the start and end times to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
# A row continues the current set when its end_time is exactly 2 hours after the previous row's start_time
key = df.end_time.sub(df.start_time.shift(1)).ne('2h').cumsum()
g = df.groupby(key).tail(1).reset_index()  # Find the last entry in each set of padel
g = g.assign(start_time=df.groupby(key).start_time.head(1).reset_index().loc[:, 'start_time'])  # Set start_time to the first start_time in each set of padel
g = g.iloc[:, :-1].join(df.groupby(key).apply(lambda x: (x['end_time'].max() - x['start_time'].min()).total_seconds() / 60).to_frame('duration').reset_index(drop=True))  # Calculate the duration in minutes
padel start_time end_time duration
0 Padel 10 08:00:00 09:00:00 60
1 Padel 10 10:00:00 13:00:00 180
2 Padel 10 16:00:00 22:00:00 360
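The grouping key in this answer comes with no explanation, so here is a small sketch of what it appears to rely on: with 90-minute slots that start 30 minutes apart, a row's end_time is exactly 2 hours after the previous row's start_time. The demo frame and the date 2024-01-01 are arbitrary placeholders, and pd.Timedelta(hours=2) is used here in place of the string '2h'.

```python
import pandas as pd

# Three consecutive overlapping slots, then the first slot of a later block.
demo = pd.DataFrame({
    'start_time': pd.to_datetime(['2024-01-01 10:00', '2024-01-01 10:30',
                                  '2024-01-01 11:00', '2024-01-01 16:00']),
    'end_time':   pd.to_datetime(['2024-01-01 11:30', '2024-01-01 12:00',
                                  '2024-01-01 12:30', '2024-01-01 17:30']),
})

# Distance from the previous row's start to this row's end (NaT for the first row)
gap = demo['end_time'] - demo['start_time'].shift(1)

# Anything other than exactly 2 hours (including NaT) opens a new set
key = gap.ne(pd.Timedelta(hours=2)).cumsum()

print(pd.DataFrame({'gap': gap, 'key': key}))
```

Because the test is for exact equality, this key only suits data shaped like the question's. Slots of a different length or spacing would be split into separate sets even when they overlap.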
,
I can't think of a simple pandas way to do this, so I would just use a for loop. I haven't tested this code, but something like:
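df = df.sort_values(['padel', 'start_time'])  # assumed sort keys; the original left them unspecified
rows = []         # finished (merged) spans
next_row = None   # span currently being extended
for row in df.to_dict('records'):
    if next_row is None:
        next_row = row
    elif row['padel'] == next_row['padel'] and row['start_time'] <= next_row['end_time']:
        # Overlaps the current span: push its end out if this row ends later
        next_row['end_time'] = max(next_row['end_time'], row['end_time'])
    else:
        # Gap found: store the finished span and start a new one from this row
        rows.append(next_row)
        next_row = row
if next_row is not None:
    rows.append(next_row)
# duration still holds each span's first-row value; recompute it if needed
out_df = pd.DataFrame(rows, columns=df.columns)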