Compare dates across dataframes and assign values to another variable

Problem description

I have two dataframes (df and df1), as shown below:

import numpy as np
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
# The strings mix month-first and day-first dates; newer pandas versions may need format='mixed' here
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']

df1 = pd.DataFrame({'person_id': [101,101,101,101,101,101,101,202,202,202,202,202,202,202,202],'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM','19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC','ABC','ABC','ABC','ABC','ABC','ABC','DEF','DEF','DEF','DEF','DEF','DEF','DEF',np.nan]

What I would like to do is:

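a) For each person in df1 whose 'within_id' is not NA, check whether their date_1 falls between (df.start_date - 1) and (df.end_date + 1) for the same person in df, and for the same within_id or enc_id.

Example: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; check whether it lies between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).

Since the comparison with the first row already gives us the result, we do not have to compare date_1 with the remaining records in df. If it does not, we need to keep scanning until we find the interval into which date_1 falls.

b) If a matching date interval is found, assign the corresponding enc_id from df to within_id in df1.

c) If not, assign 'out of range'.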
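
The rule in a) to c) can be sketched for a single person in plain Python, before worrying about pandas. This is only an illustration of the intended logic: find_enc_id, PAD, and intervals_101 are hypothetical names, the dates are written in unambiguous form (read month-first, as in the printed frames further down), and only the first two intervals of person 101 are included.

```python
from datetime import datetime, timedelta

PAD = timedelta(days=1)  # the "- 1" / "+ 1" day padding from the question


def find_enc_id(date_1, intervals, pad=PAD):
    """Return the enc_id of the first padded interval containing date_1.

    intervals is a list of (start_date, end_date, enc_id) tuples for one
    person; 'out of range' is returned when no interval matches.
    """
    for start, end, enc_id in intervals:
        if start - pad <= date_1 <= end + pad:
            return enc_id  # stop scanning at the first hit
    return "out of range"


# First two intervals of person 101: end_date is start_date + 5 days
starts = [datetime(2013, 5, 7, 9, 27), datetime(2013, 9, 8, 11, 21)]
intervals_101 = [
    (s, s + timedelta(days=5), f"ABC{i}") for i, s in enumerate(starts, start=1)
]

print(find_enc_id(datetime(2013, 5, 7, 14, 30), intervals_101))
print(find_enc_id(datetime(2013, 7, 7, 11, 20), intervals_101))
```

The loop stops at the first matching interval, which mirrors the "no need to compare with the remaining records" remark in a).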

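I tried the following:

t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values,'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values,'date_1')
t3 = pd.concat([t1,t2],axis=1)
# Does not work: t3 has no 'person_id_x' or 'date_2' column; this only shows the condition I had in mind
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date']) & (t3['person_id'] == t3['person_id_x']) & (t3['date_2'] >= t3['end_date']), t3['enc_id'], 'out of range')

I expect my output to look like the screenshot below (see also row 14 at the bottom of the screenshot). Since I plan to apply the solution to big data (4-5 million records, with possibly 5,000-6,000 unique person_id values), any efficient and elegant solution would be helpful.

[Screenshot of the expected output]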

Solution

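I used the df and df1 provided above.

  • The basic approach is to iterate over df1 and pull out the matching enc_id value.
  • I added a 'rule' column to show how each value was filled in.

Unfortunately, I could not reproduce the expected result, but the general approach may still be useful.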

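df1['rule'] = 0  # records which branch below handled each row
for t in df1.itertuples():

    # Boolean masks over the rows of df, relative to the current df1 row
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)    # date_1..date_2 lies inside the interval
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)    # starts on/after start_date, ends on/after end_date
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)    # starts on/before start_date, ends on/before end_date
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends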
    
    if (m := person & b).any():
        df1.at[t.Index,'within_id'] = df.loc[m,'enc_id'].values[0]
        df1.at[t.Index,'rule'] += 1
        
    elif (m := person & c).any():
        df1.at[t.Index,'rule'] += 10
        
    elif (m := person & d).any():
        df1.at[t.Index,'rule'] += 100
        
    elif (m := person & e).any():
        df1.at[t.Index,'within_id'] = 'out of range'
        df1.at[t.Index,'rule'] += 1_000
    else:
        df1.at[t.Index,'within_id'] = 'impossible!'
        df1.at[t.Index,'rule'] += 10_000

df1['within_id'] = df1['within_id'].astype('Int64')

The result is:

print(df1)

    person_id              date_1              date_2    within_id  rule
0          11 1961-12-30 00:00:00 1962-01-01 00:00:00  11345678901     1
1          11 1962-01-30 00:00:00 1962-02-01 00:00:00  11345678902     1
2          12 1962-02-28 00:00:00 1962-03-02 00:00:00  34567892101   100
3          12 1989-07-29 00:00:00 1989-07-31 00:00:00  34567892101     1
4          12 1989-09-03 00:00:00 1989-09-05 00:00:00  34567892101    10
5          12 1989-10-02 00:00:00 1989-10-04 00:00:00  34567892103     1
6          12 1989-10-01 00:00:00 1989-10-03 00:00:00  34567892103     1
7          13 1999-03-29 00:00:00 1999-03-31 00:00:00  56432718901     1
8          13 1999-04-20 00:00:00 1999-04-22 00:00:00  56432718901    10
9          13 1999-06-02 00:00:00 1999-06-04 00:00:00  56432718904     1
10         13 1999-06-03 00:00:00 1999-06-05 00:00:00  56432718904     1
11         13 1999-07-29 00:00:00 1999-07-31 00:00:00  56432718905     1
12         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1
13         14 2002-02-03 10:00:00 2002-02-05 10:00:00  24680135791     1

Let's try it this way:

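# Pair every df1 row with all df intervals of the same person and within_id prefix
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),on=['person_id','within_id'],how='left',indicator=True)

# True where date_1 lies inside the interval padded by one day on each side
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),d['end_date'] + pd.Timedelta(days=1))

# Keep the matching pairs (plus rows that had no counterpart in df) and merge them back onto df1
d = df1.merge(d[m | d['_merge'].ne('both')],on=['person_id','date_1'],how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]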

Details:

Left merge the dataframes df1 and df on person_id and within_id:

print(d)
    person_id              date_1 within_id          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00       ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
1         101 2013-07-07 11:20:00       ABC 2013-09-08 11:21:00 2013-09-13 11:21:00   ABC2       both
2         101 2013-07-07 11:20:00       ABC 2014-06-06 08:00:00 2014-06-11 08:00:00   ABC3       both
3         101 2013-07-07 11:20:00       ABC 2014-06-06 05:00:00 2014-06-11 05:00:00   ABC4       both
....
47        202 2012-12-18 10:00:00       DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
48        202 2012-12-18 10:00:00       DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
49        202 2013-12-19 11:00:00       NaN                 NaT                 NaT    NaN  left_only
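
To show what indicator=True contributes to this step, here is a minimal sketch on made-up toy frames. The names left and right and their contents are assumptions for illustration only, not the data from the question. Every left row is paired with all right rows sharing its keys, and the _merge column records whether a counterpart was found.

```python
import pandas as pd

# Toy stand-ins for df1 (left) and df (right); values are illustrative only
left = pd.DataFrame({
    "person_id": [101, 202, 202],
    "within_id": ["ABC", "DEF", None],  # the last row has a missing within_id
})
right = pd.DataFrame({
    "person_id": [101, 101, 202],
    "enc_id": ["ABC1", "ABC2", "DEF1"],
})
# Derive the join key from the first three characters of enc_id
right["within_id"] = right["enc_id"].str[:3]

merged = left.merge(right, on=["person_id", "within_id"], how="left", indicator=True)
print(merged)
```

The row with the missing within_id finds no partner and is flagged left_only, which is what the later mask(...) call relies on to leave such rows as NaN.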

Create a boolean mask m marking the rows where date_1 falls between df.start_date - 1 days and df.end_date + 1 days:

print(m)
0     False
1     False
2     False
3     False
...
47    False
48     True
49    False
dtype: bool
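
As a small check of how the padded comparison behaves at the edges, the sketch below applies Series.between row by row to a toy frame (toy and pad are illustrative names, and the dates are made up around a single interval). By default between includes both endpoints.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_1": pd.to_datetime([
        "2013-05-06 09:27",  # exactly start_date - 1 day
        "2013-05-06 09:26",  # one minute too early
        "2013-05-13 09:27",  # exactly end_date + 1 day
        "2013-05-13 09:28",  # one minute too late
    ]),
    "start_date": pd.to_datetime(["2013-05-07 09:27"] * 4),
})
toy["end_date"] = toy["start_date"] + pd.Timedelta(days=5)

pad = pd.Timedelta(days=1)
# Element-wise comparison; both bounds are inclusive by default
mask = toy["date_1"].between(toy["start_date"] - pad, toy["end_date"] + pad)
print(mask)
```

If the boundaries should be exclusive instead, the inclusive parameter of between is the place to look, depending on the pandas version in use.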

Left merge df1 again, this time with the dataframe d filtered by the mask m, on the columns person_id and date_1:

print(d)

    person_id              date_1 within_id_x within_id_y          start_date            end_date enc_id     _merge
0         101 2013-07-07 11:20:00         ABC         NaN                 NaT                 NaT    NaN        NaN
1         101 2013-05-07 14:30:00         ABC         ABC 2013-05-07 09:27:00 2013-05-12 09:27:00   ABC1       both
2         101 2013-06-07 14:40:00         ABC         NaN                 NaT                 NaT    NaN        NaN
3         101 2014-08-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
4         101 2014-11-06 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
5         101 2013-02-03 12:30:00         ABC         NaN                 NaT                 NaT    NaN        NaN
6         101 2014-06-13 00:00:00         ABC         NaN                 NaT                 NaT    NaN        NaN
7         202 2011-12-11 00:00:00         DEF         DEF 2011-12-11 10:00:00 2011-12-16 10:00:00   DEF1       both
8         202 2012-10-13 07:00:00         DEF         DEF 2012-10-13 00:00:00 2012-10-18 00:00:00   DEF2       both
9         202 2015-12-13 00:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
10        202 2012-12-13 00:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
11        202 2012-12-13 18:30:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
12        202 2011-07-13 10:00:00         DEF         NaN                 NaT                 NaT    NaN        NaN
13        202 2012-12-18 10:00:00         DEF         DEF 2012-12-13 11:45:00 2012-12-18 11:45:00   DEF3       both
14        202 2013-12-19 11:00:00         NaN         NaN                 NaT                 NaT    NaN  left_only

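Populate the within_id column with the values from the enc_id column, use Series.fillna to fill the NaN values with 'out of range' (except for the rows that have no counterpart in df, which stay NaN), and finally filter the columns to get the result.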
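
For the data size mentioned in the question, note that the first merge pairs every df1 row with every interval of the same person and within_id, so the intermediate frame can grow large. One possible alternative, offered only as a sketch and not as part of the answer above, is pd.merge_asof, which picks for each date_1 the interval with the latest padded start at or before it. It assumes all intervals have the same length (as in the question, where end_date is always start_date + 5 days), and for brevity it matches on person_id only and ignores the within_id prefix. The frames encounters and events and the columns lo and hi are hypothetical names with made-up sample rows.

```python
import pandas as pd

pad = pd.Timedelta(days=1)

# Toy stand-in for df: one row per interval
encounters = pd.DataFrame({
    "person_id": [101, 101, 202],
    "start_date": pd.to_datetime(
        ["2013-05-07 09:27", "2013-09-08 11:21", "2011-12-11 10:00"]
    ),
    "enc_id": ["ABC1", "ABC2", "DEF1"],
})
encounters["end_date"] = encounters["start_date"] + pd.Timedelta(days=5)
encounters["lo"] = encounters["start_date"] - pad  # padded lower bound
encounters["hi"] = encounters["end_date"] + pad    # padded upper bound

# Toy stand-in for df1: one row per date to classify
events = pd.DataFrame({
    "person_id": [101, 101, 202, 202],
    "date_1": pd.to_datetime(
        ["2013-05-07 14:30", "2013-07-07 11:20", "2011-12-11 00:00", "2015-12-13 00:00"]
    ),
})

# merge_asof needs both frames sorted by their "on" keys
matched = pd.merge_asof(
    events.sort_values("date_1"),
    encounters.sort_values("lo"),
    left_on="date_1",
    right_on="lo",
    by="person_id",
    direction="backward",  # latest lo that is <= date_1
)

# The candidate interval only counts if date_1 is also within its upper bound
inside = matched["date_1"] <= matched["hi"]
matched["within_id"] = matched["enc_id"].where(inside, "out of range")
print(matched[["person_id", "date_1", "within_id"]])
```

The result comes back ordered by date_1 and not in the original row order, so it would need to be re-sorted or merged back onto df1 if the order matters. With intervals of differing lengths, an earlier but longer interval could contain date_1 and be missed by this approach.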