Problem description
I want to count the unique customers within a 3-day rolling window, grouped by city.
Input:
df = pd.DataFrame([['1A','Cairo','2020-12-01'],["2A",['1A','2020-12-02'],'2020-12-03'],['3A','Alex',['4A','Giza',['5A',['6A','2020-12-01']],columns=
['customer_id','city','day'])
Expected output:
output = pd.DataFrame([['Alex','2020-12-01',1],['Alex','2020-12-02','2020-12-03',['Cairo',2],['Giza',3]],columns=
['city','day','unique_customers_last3Days'])
What I tried:
df['day'] = pd.to_datetime(df['day'])
df.set_index('day',inplace=True)
df.sort_index(inplace=True)
df.groupby('city').rolling("3D").agg({'customer_id':'nunique'})
But it raises this error:
AttributeError: 'nunique' is not a valid function for 'RollingGroupby' object
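Solution
Set the index of the dataframe to day, then sort the index. Next, factorize the customer_id column to assign a unique integer code to each customer ID, since rolling operations only work on numeric data. Then group the dataframe by city and apply a rolling nunique operation with a window size of 3 days. Optionally, drop the duplicate day values within each city, keeping the last row for each day because only that row's window includes every record from the same day.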
df = df.set_index('day').sort_index()
df['codes'] = df['customer_id'].factorize()[0]
df.groupby('city')\
.rolling('3D')['codes'].apply(pd.Series.nunique)\
.reset_index(name='unique').drop_duplicates(['city','day'],keep='last')
     city        day  unique
0    Alex 2020-12-01     1.0
1    Alex 2020-12-02     1.0
2    Alex 2020-12-03     1.0
4   Cairo 2020-12-01     2.0
5   Cairo 2020-12-02     2.0
6   Cairo 2020-12-03     2.0
7    Giza 2020-12-01     1.0
9    Giza 2020-12-02     2.0
10   Giza 2020-12-03     3.0
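For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. The sample rows are hypothetical (made up for illustration, not the exact rows from the question), and the function name rolling_unique_customers and the integer cast at the end are my own additions.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame(
    [
        ["1A", "Cairo", "2020-12-01"],
        ["2A", "Cairo", "2020-12-01"],
        ["1A", "Cairo", "2020-12-02"],
        ["2A", "Cairo", "2020-12-03"],
        ["3A", "Alex", "2020-12-01"],
        ["3A", "Alex", "2020-12-02"],
        ["3A", "Alex", "2020-12-03"],
        ["4A", "Giza", "2020-12-01"],
        ["5A", "Giza", "2020-12-02"],
        ["4A", "Giza", "2020-12-02"],
        ["6A", "Giza", "2020-12-03"],
    ],
    columns=["customer_id", "city", "day"],
)
df["day"] = pd.to_datetime(df["day"])


def rolling_unique_customers(frame, window="3D"):
    """Count distinct customers per city over a trailing time window.

    Hypothetical helper: assumes columns customer_id, city and a datetime day.
    """
    # A time-based window needs a sorted DatetimeIndex.
    frame = frame.set_index("day").sort_index()
    # Rolling needs numeric data, so map each customer ID to an integer code.
    frame["codes"] = frame["customer_id"].factorize()[0]
    rolled = (
        frame.groupby("city")
        .rolling(window)["codes"]
        .apply(pd.Series.nunique)
    )
    out = rolled.reset_index(name="unique")
    # For repeated days, only the last row's window covers the whole day.
    out = out.drop_duplicates(["city", "day"], keep="last")
    # Counts come back as floats; cast for readability.
    out["unique"] = out["unique"].astype(int)
    return out.reset_index(drop=True)


result = rolling_unique_customers(df)
print(result)
```

Note that a "3D" window ending on 2020-12-03 covers 2020-12-01 through 2020-12-03, because the window is closed on the right and open on the left by default.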