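Problem description
I have a DataFrame with 100k+ rows, and I need to iterate over it and count rows according to the time slot each one falls in. A sample of the DataFrame looks like this: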
  Call Sign           Entry_Time            Exit_Time   Sector
0     EA213  2020-10-01 22:24:00  2020-10-01 22:50:55    north
1     NGF23  2020-10-01 22:32:00  2020-10-01 22:53:00     West
2     USR24  2020-10-01 22:44:00  2020-10-01 23:01:53  Central
3     EF36D  2020-10-01 22:50:55  2020-10-01 23:04:07    north
4     NGF23  2020-10-01 22:53:00  2020-10-01 23:03:54    north
5     USR24  2020-10-01 23:01:53  2020-10-01 23:13:44     West
6     EF36D  2020-10-01 23:04:07  2020-10-01 23:26:48  Central
7     USR24  2020-10-01 23:13:44  2020-10-01 23:28:00  Central
8     OSA26  2020-10-02 15:02:00  2020-10-02 15:09:31     West
9     OSA26  2020-10-02 15:09:31  2020-10-02 15:25:47    north
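To make the sample reproducible, it can be rebuilt roughly as follows. This is only a sketch: the variable names `rows` and `df` are chosen here for illustration, and it assumes both time columns should be parsed as datetimes, which the comparisons in the question's code rely on.

```python
import pandas as pd

# The ten sample rows from the table above.
rows = [
    ("EA213", "2020-10-01 22:24:00", "2020-10-01 22:50:55", "north"),
    ("NGF23", "2020-10-01 22:32:00", "2020-10-01 22:53:00", "West"),
    ("USR24", "2020-10-01 22:44:00", "2020-10-01 23:01:53", "Central"),
    ("EF36D", "2020-10-01 22:50:55", "2020-10-01 23:04:07", "north"),
    ("NGF23", "2020-10-01 22:53:00", "2020-10-01 23:03:54", "north"),
    ("USR24", "2020-10-01 23:01:53", "2020-10-01 23:13:44", "West"),
    ("EF36D", "2020-10-01 23:04:07", "2020-10-01 23:26:48", "Central"),
    ("USR24", "2020-10-01 23:13:44", "2020-10-01 23:28:00", "Central"),
    ("OSA26", "2020-10-02 15:02:00", "2020-10-02 15:09:31", "West"),
    ("OSA26", "2020-10-02 15:09:31", "2020-10-02 15:25:47", "north"),
]
df = pd.DataFrame(rows, columns=["Call Sign", "Entry_Time", "Exit_Time", "Sector"])

# Parse the time columns so they can be compared with slot boundaries.
df["Entry_Time"] = pd.to_datetime(df["Entry_Time"])
df["Exit_Time"] = pd.to_datetime(df["Exit_Time"])
```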
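I need to count a row in a time slot if its entry and exit times fall within that slot's start and end times. To do that, I use the following code.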
from datetime import datetime, timedelta
import pandas as pd

# df is the DataFrame shown above; Entry_Time and Exit_Time must be datetime columns.
startDay = 1
startMonth = 10
startYear = 2020
endDay = 5
endMonth = 10
endYear = 2020
interval = 30  # slot length in minutes
startDate = str(datetime(startYear, startMonth, startDay).date())
endDate = str(datetime(endYear, endMonth, endDay).date())
timeInterval = pd.DataFrame()
sectors = ['West', 'north', 'Central']
# One second before the end date, so the last slot starts at 23:30 on the previous day.
endDateMinus1 = str(datetime(endYear, endMonth, endDay) - timedelta(seconds=1))
# 'min' is the minute frequency alias; 'T' is deprecated in recent pandas releases.
freq = str(interval) + 'min'
timeInterval['Start'] = pd.date_range(start=startDate + ' 00:00:00', end=endDateMinus1, freq=freq)
timeInterval['End'] = pd.date_range(start=startDate + ' 00:' + str(interval) + ':00', end=endDate + ' 00:00:00', freq=freq)
for index, row in timeInterval.iterrows():
    # Select the rows that overlap the current slot.
    startMask = (df['Entry_Time'] >= row.Start) | (df['Exit_Time'] >= row.Start)
    endMask = (df['Entry_Time'] < row.End) | (df['Exit_Time'] < row.End)
    timeInterval.loc[index, 'Total Count'] = df[startMask & endMask].count()['Call Sign']
    for sector in sectors:
        # .copy() so the assignments below do not write into a slice of df.
        filteredDF = df[startMask & endMask & (df['Sector'] == sector)].copy()
        filteredDF[sector + ' Time'] = 0.0
        # filter1: entered before the slot, exited inside it
        filter1 = (filteredDF['Entry_Time'] < row.Start) & (filteredDF['Exit_Time'] <= row.End)
        # filter2: entered before the slot, exited after it
        filter2 = (filteredDF['Entry_Time'] < row.Start) & (filteredDF['Exit_Time'] > row.End)
        # filter3: entered and exited inside the slot
        filter3 = (filteredDF['Entry_Time'] >= row.Start) & (filteredDF['Exit_Time'] <= row.End)
        # filter4: entered inside the slot, exited after it
        filter4 = (filteredDF['Entry_Time'] >= row.Start) & (filteredDF['Exit_Time'] > row.End)
        filteredDF.loc[filter1, sector + ' Time'] = (filteredDF.loc[filter1, 'Exit_Time'] - row.Start).dt.seconds / 60
        filteredDF.loc[filter2, sector + ' Time'] = interval
        filteredDF.loc[filter3, sector + ' Time'] = (filteredDF.loc[filter3, 'Exit_Time'] - filteredDF.loc[filter3, 'Entry_Time']).dt.seconds / 60
        filteredDF.loc[filter4, sector + ' Time'] = (row.End - filteredDF.loc[filter4, 'Entry_Time']).dt.seconds / 60
        timeInterval.loc[index, sector + ' Total Count'] = filteredDF.count()['Call Sign']
        timeInterval.loc[index, sector + ' Total Time (min)'] = float("{:.2f}".format(filteredDF[sector + ' Time'].sum()))
        timeInterval.loc[index, sector + ' Average Time (min)'] = 0 if timeInterval.loc[index, sector + ' Total Count'] == 0 else timeInterval.loc[index, sector + ' Total Time (min)'] / timeInterval.loc[index, sector + ' Total Count']
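The result looks like this (the output table is not reproduced here).
Look closely at the rows that are non-zero; they correspond to the DataFrame shared above.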
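The problem is that as the number of intervals or the number of rows increases, the program takes a very long time to finish. I need to replace the for loop with a different approach, but I am not sure how.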
Solution
No confirmed solution to this problem has been found yet.
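One direction that may be worth trying is to replace the loop over slots with NumPy broadcasting. The four filters in the question all reduce to a single expression, min(Exit_Time, End) - max(Entry_Time, Start), so the time spent in every slot can be computed for all rows at once. The sketch below is not a verified answer from the original thread. The function name `summarize_slots` and its parameters are hypothetical, and it assumes that `Exit_Time` is never earlier than `Entry_Time` and that both columns are already datetimes. The demo data is the first six rows of the sample.

```python
import numpy as np
import pandas as pd

def summarize_slots(df, start, end, interval_min, sectors):
    """Per-slot counts and minutes without a Python loop over slots (hypothetical helper)."""
    edges = pd.date_range(start=start, end=end, freq=f"{interval_min}min")
    slot_start = edges[:-1].to_numpy()[None, :]    # shape (1, n_slots)
    slot_end = edges[1:].to_numpy()[None, :]
    entry = df["Entry_Time"].to_numpy()[:, None]   # shape (n_rows, 1)
    exit_ = df["Exit_Time"].to_numpy()[:, None]

    # Same membership test as the loop, assuming Exit_Time >= Entry_Time:
    # the row exits at or after the slot start and enters before the slot end.
    active = (exit_ >= slot_start) & (entry < slot_end)
    # Overlap of [entry, exit] with [slot_start, slot_end], in minutes.
    overlap = np.minimum(exit_, slot_end) - np.maximum(entry, slot_start)
    minutes = np.where(active, overlap / np.timedelta64(1, "s") / 60.0, 0.0)

    out = pd.DataFrame({"Start": edges[:-1], "End": edges[1:]})
    out["Total Count"] = active.sum(axis=0)
    for sector in sectors:  # loops over a handful of sectors, not over slots
        in_sector = (df["Sector"] == sector).to_numpy()[:, None]
        count = (active & in_sector).sum(axis=0)
        total = (minutes * in_sector).sum(axis=0).round(2)
        out[sector + " Total Count"] = count
        out[sector + " Total Time (min)"] = total
        # Average is 0 where no row was counted, as in the original code.
        out[sector + " Average Time (min)"] = np.divide(
            total, count, out=np.zeros_like(total), where=count > 0
        )
    return out

# Demo on the first six sample rows.
df = pd.DataFrame(
    [
        ("EA213", "2020-10-01 22:24:00", "2020-10-01 22:50:55", "north"),
        ("NGF23", "2020-10-01 22:32:00", "2020-10-01 22:53:00", "West"),
        ("USR24", "2020-10-01 22:44:00", "2020-10-01 23:01:53", "Central"),
        ("EF36D", "2020-10-01 22:50:55", "2020-10-01 23:04:07", "north"),
        ("NGF23", "2020-10-01 22:53:00", "2020-10-01 23:03:54", "north"),
        ("USR24", "2020-10-01 23:01:53", "2020-10-01 23:13:44", "West"),
    ],
    columns=["Call Sign", "Entry_Time", "Exit_Time", "Sector"],
)
df["Entry_Time"] = pd.to_datetime(df["Entry_Time"])
df["Exit_Time"] = pd.to_datetime(df["Exit_Time"])

result = summarize_slots(df, "2020-10-01", "2020-10-05", 30, ["West", "north", "Central"])
print(result[result["Total Count"] > 0])
```

The intermediate arrays have one element per (row, slot) pair, so memory grows with the number of rows times the number of slots. If that becomes too large, the rows could be processed in chunks and the per-chunk counts and totals summed before rounding and averaging.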