Problem description
I have two dataframes (df and df1), as shown below:
import numpy as np
import pandas as pd
from datetime import timedelta

# The person_id and within_id lists were cut short in the post; the lengths
# below are inferred from the merge output shown in the answer (assumption).
df = pd.DataFrame({'person_id': [101,101,101,101,202,202,202],'start_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','12/11/2011 10:00:00 AM','13/10/2012 12:00:00 AM','13/12/2012 11:45:00 AM']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = df.start_date + timedelta(days=5)
df['enc_id'] = ['ABC1','ABC2','ABC3','ABC4','DEF1','DEF2','DEF3']
df1 = pd.DataFrame({'person_id': [101]*7 + [202]*8,'date_1':['07/07/2013 11:20:00 AM','05/07/2013 02:30:00 PM','06/07/2013 02:40:00 PM','08/06/2014 12:00:00 AM','11/06/2014 12:00:00 AM','02/03/2013 12:30:00 PM','13/06/2014 12:00:00 AM','12/11/2011 12:00:00 AM','13/10/2012 07:00:00 AM','13/12/2015 12:00:00 AM','13/12/2012 12:00:00 AM','13/12/2012 06:30:00 PM','13/07/2011 10:00:00 AM','18/12/2012 10:00:00 AM','19/12/2013 11:00:00 AM']})
df1['date_1'] = pd.to_datetime(df1['date_1'])
df1['within_id'] = ['ABC']*7 + ['DEF']*7 + [np.nan]
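What I would like to do is:
a) Pick each person from df1 whose 'within_id' is not NA and check whether their date_1 falls between (df.start_date - 1) and (df.end_date + 1) of the same person in df, for the same within_id or enc_id.
Ex: for subject = 101 and within_id = ABC, date_1 is 7/7/2013; check whether it lies between 4/7/2013 (df.start_date - 1) and 11/7/2013 (df.end_date + 1).
Since the first-row comparison already gives us the result, we don't have to compare date_1 with the remaining records in df for subject 101. If it did not, we would need to keep scanning until we find the interval that date_1 falls into.
b) If a date interval is found, assign the corresponding enc_id from df to within_id in df1.
c) If not, assign "out of range".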
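The rule in (a) to (c) can be sketched for a single date as follows. This is only an illustration of the requirement, not the asker's code: the helper name `find_enc_id` and the tiny table with ISO-formatted dates are assumptions made for the example.

```python
import pandas as pd

# Hypothetical encounters for one person (ISO dates avoid day/month ambiguity).
enc = pd.DataFrame({
    "enc_id": ["ABC1", "ABC2"],
    "start_date": pd.to_datetime(["2013-07-05 09:27", "2013-08-09 11:21"]),
})
enc["end_date"] = enc["start_date"] + pd.Timedelta(days=5)


def find_enc_id(date_1, encounters, pad=pd.Timedelta(days=1)):
    """Return the first enc_id whose padded interval contains date_1."""
    for row in encounters.itertuples():
        if row.start_date - pad <= date_1 <= row.end_date + pad:
            return row.enc_id      # stop scanning at the first hit
    return "out of range"          # no interval matched


print(find_enc_id(pd.Timestamp("2013-07-07 11:20"), enc))
print(find_enc_id(pd.Timestamp("2013-12-01"), enc))
```

The first call falls inside the padded interval of ABC1, so the scan stops there; the second date matches no interval and falls through to "out of range".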
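I expect my output to look like the one in the screenshot (see also row 14 at its bottom). Since I intend to apply the solution to big data (4-5 million records, with possibly 5000-6000 unique person_id values), any efficient and elegant solution would be helpful.
I tried the following: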
t1 = df.groupby('person_id').apply(pd.DataFrame.sort_values,'start_date')
t2 = df1.groupby('person_id').apply(pd.DataFrame.sort_values,'date_1')
t3 = pd.concat([t1,t2],axis=1)
# element-wise conditions need & and parentheses (not &&); the fallback value follows rule (c)
t3['within_id'] = np.where((t3['date_1'] >= t3['start_date']) & (t3['person_id'] == t3['person_id_x']) & (t3['date_2'] >= t3['end_date']), t3['enc_id'], 'out of range')
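Solution
I used the df and df1 provided above.
- The basic approach is to iterate over df1 and extract the matching value of enc_id.
- I added a "rule" column to show how each value gets populated.
Unfortunately, I could not reproduce the expected result. Perhaps the general approach will still be useful.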
df1['rule'] = 0
for t in df1.itertuples():
    # rows of df that belong to the same person
    person = (t.person_id == df.person_id)
    b = (t.date_1 >= df.start_date) & (t.date_2 <= df.end_date)
    c = (t.date_1 >= df.start_date) & (t.date_2 >= df.end_date)
    d = (t.date_1 <= df.start_date) & (t.date_2 <= df.end_date)
    e = (t.date_1 <= df.start_date) & (t.date_2 <= df.start_date) # start_date at BOTH ends
    # the := operator requires Python 3.8+
    if (m := person & b).any():
        df1.at[t.Index,'within_id'] = df.loc[m,'enc_id'].values[0]
        df1.at[t.Index,'rule'] += 1
    elif (m := person & c).any():
        df1.at[t.Index,'rule'] += 10
    elif (m := person & d).any():
        df1.at[t.Index,'rule'] += 100
    elif (m := person & e).any():
        df1.at[t.Index,'within_id'] = 'out of range'
        df1.at[t.Index,'rule'] += 1_000
    else:
        df1.at[t.Index,'within_id'] = 'impossible!'
        df1.at[t.Index,'rule'] += 10_000
df1['within_id'] = df1['within_id'].astype('Int64')
The result is:
print(df1)
person_id date_1 date_2 within_id rule
0 11 1961-12-30 00:00:00 1962-01-01 00:00:00 11345678901 1
1 11 1962-01-30 00:00:00 1962-02-01 00:00:00 11345678902 1
2 12 1962-02-28 00:00:00 1962-03-02 00:00:00 34567892101 100
3 12 1989-07-29 00:00:00 1989-07-31 00:00:00 34567892101 1
4 12 1989-09-03 00:00:00 1989-09-05 00:00:00 34567892101 10
5 12 1989-10-02 00:00:00 1989-10-04 00:00:00 34567892103 1
6 12 1989-10-01 00:00:00 1989-10-03 00:00:00 34567892103 1
7 13 1999-03-29 00:00:00 1999-03-31 00:00:00 56432718901 1
8 13 1999-04-20 00:00:00 1999-04-22 00:00:00 56432718901 10
9 13 1999-06-02 00:00:00 1999-06-04 00:00:00 56432718904 1
10 13 1999-06-03 00:00:00 1999-06-05 00:00:00 56432718904 1
11 13 1999-07-29 00:00:00 1999-07-31 00:00:00 56432718905 1
12 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
13 14 2002-02-03 10:00:00 2002-02-05 10:00:00 24680135791 1
Let's do it this way:
d = df1.merge(df.assign(within_id=df['enc_id'].str[:3]),on=['person_id','within_id'],how='left',indicator=True)
m = d['date_1'].between(d['start_date'] - pd.Timedelta(days=1),d['end_date'] + pd.Timedelta(days=1))
d = df1.merge(d[m | d['_merge'].ne('both')], on=['person_id','date_1'], how='left')
d['within_id'] = d['enc_id'].fillna('out of range').mask(d['_merge'].eq('left_only'))
d = d[df1.columns]
Details:
Left merge the dataframes df1 and df on person_id and within_id:
print(d)
person_id date_1 within_id start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
1 101 2013-07-07 11:20:00 ABC 2013-09-08 11:21:00 2013-09-13 11:21:00 ABC2 both
2 101 2013-07-07 11:20:00 ABC 2014-06-06 08:00:00 2014-06-11 08:00:00 ABC3 both
3 101 2013-07-07 11:20:00 ABC 2014-06-06 05:00:00 2014-06-11 10:00:00 DEF1 both
....
47 202 2012-12-18 10:00:00 DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
48 202 2012-12-18 10:00:00 DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
49 202 2013-12-19 11:00:00 NaN NaT NaT NaN left_only
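To make the role of `indicator=True` in this step more concrete, here is a small sketch on made-up data. The frames `left` and `right` and their ISO-formatted dates are assumptions for illustration; only `merge`, `assign` and `.str[:3]` come from the answer.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({
    "person_id": [101, 101, 202],
    "within_id": ["ABC", "ABC", np.nan],
    "date_1": pd.to_datetime(["2013-07-07", "2013-12-01", "2013-12-19"]),
})
right = pd.DataFrame({
    "person_id": [101, 101],
    "enc_id": ["ABC1", "ABC2"],
    "start_date": pd.to_datetime(["2013-07-05", "2013-08-09"]),
})
# Derive the join key from the first three characters of enc_id.
right = right.assign(within_id=right["enc_id"].str[:3])

merged = left.merge(right, on=["person_id", "within_id"],
                    how="left", indicator=True)
print(merged[["person_id", "date_1", "enc_id", "_merge"]])
```

Each row of `left` with a usable key is repeated once per candidate encounter and marked `both`, while the row with a missing `within_id` finds no partner and is marked `left_only`. That marker is what later allows those rows to be kept as NaN instead of being labelled "out of range".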
Create a boolean mask m that is True where date_1 lies between df.start_date - 1 days and df.end_date + 1 days:
print(m)
0 False
1 False
2 False
3 False
...
47 False
48 True
49 False
dtype: bool
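The behaviour of the mask at the interval edges and for unmatched rows can be checked with a minimal sketch like the one below. The three-row frame is invented for the example; `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

d = pd.DataFrame({
    "date_1": pd.to_datetime(["2013-07-04", "2013-07-12", "2013-12-19"]),
    "start_date": pd.to_datetime(["2013-07-05", "2013-07-05", None]),
    "end_date": pd.to_datetime(["2013-07-10", "2013-07-10", None]),
})
one_day = pd.Timedelta(days=1)

# Row 0 sits exactly on the padded lower bound, row 1 is one day past the
# padded upper bound, row 2 has NaT bounds (as an unmatched merge row would).
m = d["date_1"].between(d["start_date"] - one_day, d["end_date"] + one_day)
print(m.tolist())
```

Rows whose bounds are NaT compare as False, which is why the answer combines the mask with the `_merge` indicator instead of relying on the mask alone.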
Left merge df1 again, this time with the dataframe filtered by mask m, on the columns person_id and date_1:
print(d)
person_id date_1 within_id_x within_id_y start_date end_date enc_id _merge
0 101 2013-07-07 11:20:00 ABC NaN NaT NaT NaN NaN
1 101 2013-05-07 14:30:00 ABC ABC 2013-05-07 09:27:00 2013-05-12 09:27:00 ABC1 both
2 101 2013-06-07 14:40:00 ABC NaN NaT NaT NaN NaN
3 101 2014-08-06 00:00:00 ABC NaN NaT NaT NaN NaN
4 101 2014-11-06 00:00:00 ABC NaN NaT NaT NaN NaN
5 101 2013-02-03 12:30:00 ABC NaN NaT NaT NaN NaN
6 101 2014-06-13 00:00:00 ABC NaN NaT NaT NaN NaN
7 202 2011-12-11 00:00:00 DEF DEF 2011-12-11 10:00:00 2011-12-16 10:00:00 DEF1 both
8 202 2012-10-13 07:00:00 DEF DEF 2012-10-13 00:00:00 2012-10-18 00:00:00 DEF2 both
9 202 2015-12-13 00:00:00 DEF NaN NaT NaT NaN NaN
10 202 2012-12-13 00:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
11 202 2012-12-13 18:30:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
12 202 2011-07-13 10:00:00 DEF NaN NaT NaT NaN NaN
13 202 2012-12-18 10:00:00 DEF DEF 2012-12-13 11:45:00 2012-12-18 11:45:00 DEF3 both
14 202 2013-12-19 11:00:00 NaN NaN NaT NaT NaN left_only
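Populate the within_id column from enc_id, using Series.fillna to fill the NaN values with "out of range" (except for rows that had no match in df, which stay NaN), and finally filter the columns to get the result.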
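For reference, the steps above can be gathered into one function and run on a small input. This is a sketch under stated assumptions, not a tuned implementation: the function name `assign_within_id`, the toy frames and their ISO-formatted dates are invented here, and the code assumes, as the answer does, that the first three characters of enc_id equal within_id.

```python
import numpy as np
import pandas as pd


def assign_within_id(df1, df, pad=pd.Timedelta(days=1)):
    """Merge/mask approach from the answer, wrapped in a function."""
    keyed = df.assign(within_id=df["enc_id"].str[:3])
    d = df1.merge(keyed, on=["person_id", "within_id"],
                  how="left", indicator=True)
    m = d["date_1"].between(d["start_date"] - pad, d["end_date"] + pad)
    # Keep in-range candidates plus rows that never had a candidate.
    kept = d[m | d["_merge"].ne("both")]
    out = df1.merge(kept, on=["person_id", "date_1"], how="left")
    # No surviving candidate -> "out of range"; missing within_id -> NaN.
    out["within_id"] = (out["enc_id"].fillna("out of range")
                        .mask(out["_merge"].eq("left_only")))
    return out[df1.columns]


df = pd.DataFrame({
    "person_id": [101, 101, 202],
    "enc_id": ["ABC1", "ABC2", "DEF1"],
    "start_date": pd.to_datetime(["2013-07-05", "2013-08-09", "2011-11-12"]),
})
df["end_date"] = df["start_date"] + pd.Timedelta(days=5)
df1 = pd.DataFrame({
    "person_id": [101, 101, 202, 202],
    "date_1": pd.to_datetime(["2013-07-07", "2013-12-01",
                              "2011-11-12", "2013-12-19"]),
    "within_id": ["ABC", "ABC", "DEF", np.nan],
})

result = assign_within_id(df1, df)
print(result)
```

On this toy input the within_id column comes out as ABC1, out of range, DEF1 and NaN. Note that the first merge builds every (row, candidate encounter) pair per person and within_id, so memory use grows with the number of encounters per person; with millions of rows it may be worth processing the data in batches of person_id values.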