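Problem description
I have two data frames. One holds the search queries of users in an online shop (102377 rows), the other holds the users' clicks from the search results (8004 rows).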
queries:
index term timestamp
...
10 tight 2018-09-27 20:09:23
11 differential pressure 2018-09-27 20:09:30
12 soot pump 2018-09-27 20:09:32
13 gas pressure 2018-09-27 20:09:46
14 case 2018-09-27 20:11:29
15 backpack 2018-09-27 20:18:35
...
clicks
index term timestamp artnr
...
245 soot pump 2018-09-27 20:09:25 9150.0
246 dungarees 2018-09-27 20:10:38 7228.0
247 db23 2018-09-27 20:10:40 7966.0
248 db23 2018-09-27 20:10:55 7971.0
249 sealing blister 2018-09-27 20:12:05 7971.0
250 backpack 2018-09-27 20:18:40 8739.0
...
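What I want to do is add the clicks to the queries. If queries.term equals clicks.term, and the difference between clicks.timestamp and queries.timestamp is at most 10 seconds and at least 0 seconds, then the term in the queries data frame should be replaced with the artnr from the clicks data frame, so that it looks like this: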
queries:
index term timestamp
...
10 tight 2018-09-27 20:09:23
11 differential pressure 2018-09-27 20:09:30
12 9150.0 2018-09-27 20:09:32
13 gas pressure 2018-09-27 20:09:46
14 case 2018-09-27 20:11:29
15 8739.0 2018-09-27 20:18:35
...
My first approach was:
df_Q['term'] = np.where(((((df_CS.timestamp-df_Q.timestamp).dt.total_seconds() <= 10.0) &
(df_CS.timestamp-df_Q.timestamp).dt.total_seconds() >= 0) &
(df_CS.term.str == df_Q.term.str)),df_CS['artnr'],df_CS['term'])
But this only produces the following error:
ValueError: operands could not be broadcast together with shapes (102377,) (8004,) (8004,)
Solution
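The error in the first approach comes from np.where: the condition has one entry per query row (102377), while the two value arguments come from the clicks frame (8004 rows each), and NumPy cannot broadcast arrays of different lengths against each other. A minimal reproduction, using small made-up arrays (the names cond, a, and b are illustrative and not from the question):

```python
import numpy as np

# Condition with 3 entries, standing in for the 102377-row comparison.
cond = np.array([True, False, True])
# Value arrays with 2 entries each, standing in for the 8004-row clicks columns.
a = np.array([10.0, 20.0])
b = np.array([0.0, 0.0])

message = ""
try:
    np.where(cond, a, b)
except ValueError as exc:
    # NumPy reports the mismatching shapes, as in the question.
    message = str(exc)
    print(message)
```

So an element-wise np.where only makes sense once both frames have been aligned row by row, which is what the join below takes care of.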
import pandas as pd

queries = pd.DataFrame({'term': ['tight','differential pressure','soot pump','gas pressure','case','backpack'],'timestamp': ['2018-09-27 20:09:23','2018-09-27 20:09:30','2018-09-27 20:09:32','2018-09-27 20:09:46','2018-09-27 20:11:29','2018-09-27 20:18:35']})
print(queries)
term timestamp
0 tight 2018-09-27 20:09:23
1 differential pressure 2018-09-27 20:09:30
2 soot pump 2018-09-27 20:09:32
3 gas pressure 2018-09-27 20:09:46
4 case 2018-09-27 20:11:29
5 backpack 2018-09-27 20:18:35
clicks = pd.DataFrame({'term': ['soot pump','dungarees','db23','db23','sealing blister','backpack'],'timestamp': ['2018-09-27 20:09:25','2018-09-27 20:10:38','2018-09-27 20:10:40','2018-09-27 20:10:55','2018-09-27 20:12:05','2018-09-27 20:18:40'],'artnr':[9150.0,7228.0,7966.0,7971.0,7971.0,8739.0]})
print(clicks)
term timestamp artnr
0 soot pump 2018-09-27 20:09:25 9150.0
1 dungarees 2018-09-27 20:10:38 7228.0
2 db23 2018-09-27 20:10:40 7966.0
3 db23 2018-09-27 20:10:55 7971.0
4 sealing blister 2018-09-27 20:12:05 7971.0
5 backpack 2018-09-27 20:18:40 8739.0
First, convert the timestamps to datetime and sort both data frames by timestamp:
queries['timestamp'] = pd.to_datetime(queries['timestamp'])
clicks['timestamp'] = pd.to_datetime(clicks['timestamp'])
queries.sort_values('timestamp',ascending=True,inplace=True)
clicks.sort_values('timestamp',inplace=True)
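Sorting matters because pd.merge_asof() expects the "on" column to be sorted in both frames. If you want to double-check this on your own data before joining, a quick sketch is below; the frame and variable names here are made up for illustration:

```python
import pandas as pd

# Hypothetical, deliberately unsorted timestamps.
frame = pd.DataFrame({
    "timestamp": ["2018-09-27 20:18:35", "2018-09-27 20:09:23", "2018-09-27 20:11:29"],
})
frame["timestamp"] = pd.to_datetime(frame["timestamp"])

# is_monotonic_increasing tells whether the column is already in ascending order.
sorted_before = bool(frame["timestamp"].is_monotonic_increasing)

frame = frame.sort_values("timestamp").reset_index(drop=True)
sorted_after = bool(frame["timestamp"].is_monotonic_increasing)

print(sorted_before, sorted_after)
```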
Then use pd.merge_asof() to join on the 'term' column, but only where the 'timestamp' difference is within 10 seconds.
df = pd.merge_asof(
    queries,  # left data
    clicks,  # right data
    on="timestamp",  # column to check the time difference on
    by="term",  # column that must match exactly
    tolerance=pd.Timedelta("10s"),  # maximum allowed time difference
    direction='forward',  # match only clicks at or after the query timestamp
)
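To see how tolerance and direction interact, here is a small sketch on made-up data (the terms "a" and "b", the timestamps, and the artnr values are invented for illustration). It covers three cases: a click within 10 seconds after the query, a click that comes too late, and a click that comes before the query.

```python
import pandas as pd

left = pd.DataFrame({
    "term": ["a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2018-09-27 20:00:00",
        "2018-09-27 20:01:00",
        "2018-09-27 20:02:00",
    ]),
})
right = pd.DataFrame({
    "term": ["a", "a", "b"],
    "timestamp": pd.to_datetime([
        "2018-09-27 20:00:05",  # 5 s after the first "a" query: matches
        "2018-09-27 20:01:30",  # 30 s after the second "a" query: outside tolerance
        "2018-09-27 20:01:55",  # 5 s before the "b" query: ignored by 'forward'
    ]),
    "artnr": [1.0, 2.0, 3.0],
})

demo = pd.merge_asof(
    left,
    right,
    on="timestamp",
    by="term",
    tolerance=pd.Timedelta("10s"),
    direction="forward",
)
print(demo)
```

Only the first row ends up with an artnr; the other two stay NaN.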
If no match is found, the 'artnr' column will be NA. So replace 'term' with 'artnr' wherever 'artnr' is not NA:
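df['term'] = df['term'].astype(object)  # let the column hold both strings and numbers
df.loc[df['artnr'].notna(), 'term'] = df['artnr']  # .loc avoids chained assignment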
print(df)
term timestamp artnr
0 tight 2018-09-27 20:09:23 NaN
1 differential pressure 2018-09-27 20:09:30 NaN
2 soot pump 2018-09-27 20:09:32 NaN
3 gas pressure 2018-09-27 20:09:46 NaN
4 case 2018-09-27 20:11:29 NaN
5 8739 2018-09-27 20:18:35 8739.0
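In the output above, 'soot pump' stays unmatched because its click (20:09:25) comes before the query (20:09:32), and direction='forward' only looks at clicks at or after the query. The desired output in the question does replace that row, though. If clicks shortly before the query should count as well, one option is direction='nearest'. The following is a sketch under that assumption, not necessarily what the question intends; the names merged, matched, and result are made up here:

```python
import pandas as pd

queries = pd.DataFrame({
    "term": ["tight", "differential pressure", "soot pump",
             "gas pressure", "case", "backpack"],
    "timestamp": pd.to_datetime([
        "2018-09-27 20:09:23", "2018-09-27 20:09:30", "2018-09-27 20:09:32",
        "2018-09-27 20:09:46", "2018-09-27 20:11:29", "2018-09-27 20:18:35",
    ]),
})
clicks = pd.DataFrame({
    "term": ["soot pump", "dungarees", "db23", "db23",
             "sealing blister", "backpack"],
    "timestamp": pd.to_datetime([
        "2018-09-27 20:09:25", "2018-09-27 20:10:38", "2018-09-27 20:10:40",
        "2018-09-27 20:10:55", "2018-09-27 20:12:05", "2018-09-27 20:18:40",
    ]),
    "artnr": [9150.0, 7228.0, 7966.0, 7971.0, 7971.0, 8739.0],
})

merged = pd.merge_asof(
    queries.sort_values("timestamp"),
    clicks.sort_values("timestamp"),
    on="timestamp",
    by="term",
    tolerance=pd.Timedelta("10s"),
    direction="nearest",  # assumption: accept clicks up to 10 s before or after
)

# Let 'term' hold mixed values, then overwrite it where a click was found.
merged["term"] = merged["term"].astype(object)
matched = merged["artnr"].notna()
merged.loc[matched, "term"] = merged.loc[matched, "artnr"]

result = merged.drop(columns="artnr")
print(result)
```

With the sample data, both 'soot pump' and 'backpack' are replaced by their artnr, and the other four terms are left as they are.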