Calculating a rolling 3-day distinct count in pandas?

Problem description

I want to count the unique customers within a 3-day window, grouped by city.

Input:

    df = pd.DataFrame([['1A','Cairo','2020-12-01'],["2A",['1A','2020-12-02'],'2020-12-03'],['3A','Alex',['4A','Giza',['5A',['6A','2020-12-01']],columns=
    ['customer_id','city','day'])

Expected output:

    output = pd.DataFrame([['Alex','2020-12-01',1],['Alex','2020-12-02','2020-12-03',['Cairo',2],['Giza',3]],columns=
    ['city','day','unique_customers_last3Days'])

I tried:

    df['day'] = pd.to_datetime(df['day'])
    df.set_index('day',inplace=True)
    df.sort_index(inplace=True)
    df.groupby('city').rolling("3D").agg({'customer_id':'nunique'})

But it gave me this error:

AttributeError: 'nunique' is not a valid function for 'RollingGroupby' object

Solution

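Set the index of the dataframe to day and sort the index values. Then factorize the customer_id column to assign a unique integer code to each customer ID; this step is needed because rolling apply operates on numeric data, not strings. Next, group the dataframe by city and apply a rolling nunique operation with a window size of 3 days. Optionally, drop the duplicate day values within each city, keeping the last row, so that each (city, day) pair appears once.

    # Make sure day is a datetime column, then index and sort by it
    df['day'] = pd.to_datetime(df['day'])
    df = df.set_index('day').sort_index()
    # Encode each customer ID as an integer code so rolling apply can handle it
    df['codes'] = df['customer_id'].factorize()[0]

    # Distinct codes per city over a 3-day window; keep the last row per (city, day)
    df.groupby('city')\
      .rolling('3D')['codes'].apply(pd.Series.nunique)\
      .reset_index(name='unique').drop_duplicates(['city','day'],keep='last')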

     city        day  unique
0    Alex 2020-12-01     1.0
1    Alex 2020-12-02     1.0
2    Alex 2020-12-03     1.0
4   Cairo 2020-12-01     2.0
5   Cairo 2020-12-02     2.0
6   Cairo 2020-12-03     2.0
7    Giza 2020-12-01     1.0
9    Giza 2020-12-02     2.0
10   Giza 2020-12-03     3.0
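
For anyone who wants to try the approach end to end, here is a minimal self-contained sketch. The sample rows are hypothetical (they are not the input from the question), and the name `result` is introduced only for illustration. It assumes a reasonably recent pandas where a time-based window such as '3D' is supported on a grouped rolling object.

```python
import pandas as pd

# Hypothetical sample data, NOT the question's original input.
df = pd.DataFrame(
    [
        ["1A", "Cairo", "2020-12-01"],
        ["2A", "Cairo", "2020-12-01"],
        ["1A", "Cairo", "2020-12-02"],
        ["3A", "Giza", "2020-12-01"],
        ["4A", "Giza", "2020-12-02"],
        ["5A", "Giza", "2020-12-03"],
        ["6A", "Giza", "2020-12-05"],
    ],
    columns=["customer_id", "city", "day"],
)

# A time-based window needs a sorted DatetimeIndex.
df["day"] = pd.to_datetime(df["day"])
df = df.set_index("day").sort_index()

# Rolling apply works on numbers, so map each customer ID to an integer code.
df["codes"] = df["customer_id"].factorize()[0]

result = (
    df.groupby("city")
    .rolling("3D")["codes"]
    .apply(pd.Series.nunique)  # distinct codes inside each 3-day window
    .reset_index(name="unique")
    # Several rows can share a (city, day); the last one has seen all of them.
    .drop_duplicates(["city", "day"], keep="last")
)

print(result)
```

With this data, Giza on 2020-12-05 should come out as 2 rather than 4, because the default window is open on the left: it covers 2020-12-03 through 2020-12-05 and excludes 2020-12-02.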