Python: select the longest run of consecutive dates in a list

Problem description

I have a Series of lists (actually np.arrays) whose elements are dates.

id
0a0fe3ed-d788-4427-8820-8b7b696a6033    [2019-01-30,2019-01-31,2019-02-01,2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1    [2019-01-30,2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8    [2019-01-29,2019-01-30,2019-02-0...
Name: startDate, dtype: object

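For each element of the Series (that is, each list of dates), I want to keep the longest sublist in which all dates are consecutive. I am struggling to do this in a pythonic (simple and efficient) way. The only approach I can think of uses nested loops: iterate over the Series values (the lists), then over each element of a list. I would store the first date and the number of consecutive days, and overwrite the result from those temporary values whenever a longer run turns up. That seems very inefficient, though. Is there a better way?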

Solution

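Since you mention that you are working with numpy arrays of dates, it makes sense to stick with numpy types rather than converting to built-in ones. I am assuming here that your arrays have dtype 'datetime64[D]'. In that case, you could do something like this:
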
import numpy as np

date_list = np.array(['2005-02-01','2005-02-02','2005-02-03','2005-02-05','2005-02-06','2005-02-07','2005-02-08','2005-02-09','2005-02-11','2005-02-12','2005-02-14','2005-02-15','2005-02-16','2005-02-17','2005-02-19','2005-02-20','2005-02-22','2005-02-23','2005-02-24','2005-02-25','2005-02-26','2005-02-27','2005-02-28'],dtype='datetime64[D]')

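# [i0max, i1max) is the longest run found so far; i0 is the start of the current run
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
    # within a run, entry i1 lies exactly (i1 - i0) days after entry i0
    if date - date_list[i0] != np.timedelta64(i1 - i0, 'D'):
        if i1 - i0 > i1max - i0max:
            i0max, i1max = i0, i1
        i0 = i1

# the final run never hits a break inside the loop, so compare it here
if len(date_list) - i0 > i1max - i0max:
    i0max, i1max = i0, len(date_list)

print(date_list[i0max:i1max])

# output: the seven dates from 2005-02-22 through 2005-02-28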

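Here, i0 and i1 are the start and end indices of the current subarray of consecutive dates, and i0max and i1max are the start and end indices of the longest subarray found so far (end indices are exclusive, as in a slice). The solution relies on the fact that, in a list of consecutive dates, the i-th entry and the 0-th entry differ by exactly i days.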
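
The same idea can also be written without an explicit Python loop. The following is only a sketch, assuming the array is sorted and has dtype 'datetime64[D]'; the function name longest_consecutive_run and the sample data are made up for illustration. It marks every position where the gap to the previous date is not one day, splits the array there, and keeps the longest piece.

```python
import numpy as np

def longest_consecutive_run(dates):
    """Return the longest run of consecutive days in a sorted datetime64[D] array."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    if dates.size == 0:
        return dates
    # Positions where the gap to the previous date is not exactly one day
    breaks = np.flatnonzero(np.diff(dates) != np.timedelta64(1, "D")) + 1
    # Split into runs at those positions; max() keeps the first run on ties
    runs = np.split(dates, breaks)
    return max(runs, key=len)

# Illustrative data, not taken from the question
sample = np.array(
    ["2019-01-29", "2019-01-30", "2019-01-31", "2019-02-05", "2019-02-06"],
    dtype="datetime64[D]",
)
result = longest_consecutive_run(sample)
```

For the sample above, result holds the three dates from 2019-01-29 through 2019-01-31. Whether this beats the explicit loop depends on array size; for short arrays the difference is unlikely to matter.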

Alternatively, you can convert the dates to ordinals, which increase by exactly one from each date to the next consecutive date, i.e. next_date = previous_date + 1.

Then find the longest consecutive subarray.

This takes O(n) time (a single loop), which is as good as it gets, since every date has to be looked at at least once.

Code

from datetime import datetime
def get_consecutive(date_list):
  # convert to ordinals
  v = [datetime.strptime(d,"%Y-%m-%d").toordinal()  for d in date_list]
  consecutive = []
  run = []
  dates = []

  # get consecutive ordinal sequence 
  for i in range(1,len(v) + 1):
    run.append(v[i-1])
    dates.append(date_list[i-1])
    if i == len(v) or v[i-1] + 1 != v[i]:
      if len(consecutive) < len(run):
        consecutive = dates
      dates = []
      run = []

  return consecutive

Output:

date_list = ['2019-01-29','2019-01-30','2019-01-31','2019-02-05']
get_consecutive(date_list)
# the ordinals will be -> v = [737088, 737089, 737090, 737095]
OUTPUT:
['2019-01-29', '2019-01-30', '2019-01-31']

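Now apply the function with df.column.apply(get_consecutive), which gives you the longest run of consecutive dates for each row. If you are using some other data structure, call the function on each list instead. Note that get_consecutive as written expects date strings in the %Y-%m-%d format.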
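
The ordinal idea also allows a more compact variant, sketched here as an alternative rather than as the code above: within a run of consecutive days, the ordinal minus the list index stays constant, so itertools.groupby can cut the list into runs. The names longest_run and dates_by_id, and the sample ids, are hypothetical; the input is assumed to be sorted ISO date strings.

```python
from datetime import date
from itertools import groupby

def longest_run(date_strings):
    """Longest run of consecutive days among sorted ISO date strings."""
    ordinals = [date.fromisoformat(s).toordinal() for s in date_strings]
    best = []
    # Within a run of consecutive days, ordinal minus index is constant
    for _, group in groupby(enumerate(ordinals), key=lambda pair: pair[1] - pair[0]):
        indices = [i for i, _ in group]
        if len(indices) > len(best):
            best = indices
    return [date_strings[i] for i in best]

# Hypothetical ids mapped to sorted date strings
dates_by_id = {
    "a": ["2019-01-29", "2019-01-30", "2019-01-31", "2019-02-05"],
    "b": ["2019-01-30", "2019-02-02", "2019-02-03"],
}
longest_by_id = {key: longest_run(values) for key, values in dates_by_id.items()}
```

Here longest_by_id["a"] contains the three dates from 2019-01-29 through 2019-01-31, and longest_by_id["b"] contains 2019-02-02 and 2019-02-03. The dict comprehension plays the role that apply plays for a pandas column.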

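I would reduce this problem to finding consecutive days in a single list. A few tricks make it more Pythonic. The following script should run as-is. I have documented how it works inline: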

from datetime import timedelta,date

# example input
days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]

# store the longest interval and the current consecutive interval
# as we iterate through a list
longest_interval_index = current_interval_index =  0
longest_interval_length = current_interval_length = 1

# using zip here to reduce the number of indexing operations
# this turns the days list into pairs: [(days[0], days[1]), (days[1], days[2]), ...]
# use enumerate (starting at 1) to get the index of the current day
for i,(previous_day,current_day) in enumerate(zip(days,days[1:]),start=1):
    if current_day - previous_day == timedelta(days=+1):
        # we've found a consecutive day! increase the interval length
        current_interval_length += 1
    else:
        # nope,not a consecutive day! start from this day and start
        # counting from 1
        current_interval_index = i
        current_interval_length = 1
    if current_interval_length > longest_interval_length:
        # we broke the record! record it as the longest interval
        longest_interval_index = current_interval_index
        longest_interval_length = current_interval_length

print("Longest interval index:",longest_interval_index)
print("Longest interval: ",days[longest_interval_index:longest_interval_index + longest_interval_length])

Turning this into a reusable function should be easy enough.
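
One possible way to wrap the script in a function is sketched below. The function name longest_consecutive_days, its return shape (start index plus the run itself), and the example data are assumptions made for illustration; the input is assumed to be a sorted list of datetime.date objects.

```python
from datetime import date, timedelta

def longest_consecutive_days(days):
    """Return (start_index, run) for the longest run of consecutive days."""
    if not days:
        return 0, []
    best_start = current_start = 0
    best_length = current_length = 1
    # Pair each day with its predecessor; i is the index of current_day
    for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
        if current_day - previous_day == timedelta(days=1):
            current_length += 1
        else:
            # Gap found: the current run restarts at this day
            current_start, current_length = i, 1
        if current_length > best_length:
            best_start, best_length = current_start, current_length
    return best_start, days[best_start:best_start + best_length]

# Illustrative input
example = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
           date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8)]
start, run = longest_consecutive_days(example)
```

For this input, start is 2 and run covers 2020-01-04 through 2020-01-06. Because the comparison is strict, the earliest run wins when two runs have the same length.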