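Select the last 2 values in a groupby with a condition

Problem description

For each (user_id, date) group I need to select the row holding the last value, but when the last value in the metric column is 'leave', I need the last two rows (if they exist). My data: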

import pandas as pd
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2], "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ["enter", "stay", "leave", "enter", "leave", "enter"],
    "date": ["2020-01-01", "2020-01-01", "2020-03-01", "2020-01-01", "2020-01-01", "2020-01-02"],
})
#result
    user_id subscription    metric  date
0   1       1               enter   2020-01-01
1   1       1               stay    2020-01-01
2   1       2               leave   2020-03-01
3   2       3               enter   2020-01-01
4   2       4               leave   2020-01-01
5   2       5               enter   2020-01-02

Expected output

    user_id subscription    metric  date
1   1       1               stay    2020-01-01
2   1       2               leave   2020-03-01
3   2       3               enter   2020-01-01 # kept because the last metric in this [user_id, date] group is 'leave'
4   2       4               leave   2020-01-01
5   2       5               enter   2020-01-02

What I tried: drop_duplicates and groupby both give the same result, only the last row of each group

df.drop_duplicates(['user_id','date'],keep='last')
#or
df.groupby(['user_id','date']).tail(1)
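
To make the gap concrete, here is a minimal sketch that rebuilds the six-row frame from the result table above and runs both attempts. Each one keeps rows 1, 2, 4 and 5 and drops row 3, the row that should survive because its group ends in 'leave'. The names by_dedup and by_tail are only illustrative.

```python
import pandas as pd

# Six-row frame matching the printed result table
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ["enter", "stay", "leave", "enter", "leave", "enter"],
    "date": ["2020-01-01", "2020-01-01", "2020-03-01",
             "2020-01-01", "2020-01-01", "2020-01-02"],
})

# Both attempts keep exactly one row per (user_id, date) group
by_dedup = df.drop_duplicates(["user_id", "date"], keep="last")
by_tail = df.groupby(["user_id", "date"]).tail(1)

print(by_dedup.index.tolist())  # [1, 2, 4, 5]
print(by_tail.index.tolist())   # [1, 2, 4, 5]
```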

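Solution

You can use boolean masks: build three conditions that each evaluate to True or False and store them in the variables a, b and c. Then combine them with the OR operator | to keep the rows where a, b or c is True: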

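# a: True where the row's metric is 'leave' (the cumcount key makes every row its own group)
a = df.groupby(['user_id','date',df.groupby(['user_id','date']).cumcount()])['metric'].transform('last') == 'leave'
# b: True where the (user_id, date) group has exactly one row
b = df.groupby(['user_id','date'])['metric'].transform('count') == 1
# c: True where the next row is a 'leave' row and the current row's group has more than one row
c = a.shift(-1, fill_value=False) & ~b
df = df[a | b | c]
print(a,b,c)
df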

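#a groups by the two required keys plus the cumulative count within each group. That third key makes every row its own group, so transform('last') returns each row's own metric and a is True on every 'leave' row. In this data each 'leave' is also the last row of its group.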
0    False
1    False
2     True
3    False
4     True
5    False
Name: metric,dtype: bool

#b if a group has a count of one, keep its only row.
0    False
1    False
2     True
3    False
4    False
5     True
Name: metric,dtype: bool

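#c simply use .shift(-1) to flag the row directly before a 'leave' row. For the condition to hold, the count for the current row's group must be > 1. Note that shift works on row position, not per group: row 1 is flagged because row 2 is a 'leave' row, even though row 2 has a different date.
0    False
1     True
2    False
3     True
4    False
5    False
Name: metric,dtype: bool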

Out[18]: 
   user_id  subscription metric        date
1        1             1   stay  2020-01-01
2        1             2  leave  2020-03-01
3        2             3  enter  2020-01-01
4        2             4  leave  2020-01-01
5        2             5  enter  2020-01-02
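
The masks above depend on row order, so here is a hedged, group-aware variant of the same boolean-mask idea. It is a sketch rather than the answer's original code: the helper name select_last_rows and its parameters are made up for illustration, and it assumes the rows are already in the desired order within each group. It relies on cumcount(ascending=False) to number rows from the end of each group and on transform('last') to broadcast each group's final metric.

```python
import pandas as pd


def select_last_rows(df, keys, metric_col="metric", trigger="leave"):
    """Hypothetical helper: keep each group's last row, plus the
    second-to-last row when the group's last metric equals `trigger`."""
    grouped = df.groupby(keys)
    # 0 for the last row of each group, 1 for the second-to-last, and so on
    pos_from_end = grouped.cumcount(ascending=False)
    # Final metric of each group, repeated on every row of that group
    last_metric = grouped[metric_col].transform("last")
    keep = (pos_from_end == 0) | ((pos_from_end == 1) & (last_metric == trigger))
    return df[keep]


# Six-row frame matching the question's printed table
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ["enter", "stay", "leave", "enter", "leave", "enter"],
    "date": ["2020-01-01", "2020-01-01", "2020-03-01",
             "2020-01-01", "2020-01-01", "2020-01-02"],
})

result = select_last_rows(df, ["user_id", "date"])
print(result.index.tolist())  # [1, 2, 3, 4, 5]
```

Because every condition is computed per group, a 'leave' that is not the final row of its group does not pull in an extra row, and a neighbouring group's 'leave' has no effect.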

Here is one approach, though I suspect it is slow because it iterates over the groups:

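df["date"] = pd.to_datetime(df["date"])

# Helper column: True where metric is 'leave'
df = df.assign(metric_is_leave=df.metric.eq("leave"))

pd.concat(
    [
        # Keep the last two rows if the group's last metric is 'leave', else only the last row;
        # [:, :-1] drops the helper column again
        value.iloc[-2:, :-1] if value.metric_is_leave.iloc[-1] else value.iloc[-1:, :-1]
        for key, value in df.groupby(["user_id", "date"])
    ]
)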




  user_id   subscription    metric  date
1      1        1           stay    2020-01-01
2      1        2          leave    2020-03-01
3      2        3          enter    2020-01-01
4      2        4          leave    2020-01-01
5      2        5          enter    2020-01-02