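Problem description
For each user_id and date I need to select the row with the last value, but when the last value in the metric column is 'leave', I need to select the last two rows (if they exist). My data: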
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ['enter', 'stay', 'leave', 'enter', 'leave', 'enter'],
    "date": ['2020-01-01', '2020-01-01', '2020-03-01', '2020-01-01', '2020-01-01', '2020-01-02']
})
#result
user_id subscription metric date
0 1 1 enter 2020-01-01
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
Expected output:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01 # kept because the last metric in the group [user_id, date] is 'leave'
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
What I tried: drop_duplicates and groupby both give the same result, only the last value:
df.drop_duplicates(['user_id','date'],keep='last')
#or
df.groupby(['user_id','date']).tail(1)
Solution
You can use boolean masks: the variables a, b and c each hold one of three conditions as True or False. Then use the or operator | to keep the rows where a, b or c is True:
a = df.groupby(['user_id', 'date', df.groupby(['user_id', 'date']).cumcount()])['metric'].transform('last') == 'leave'
b = df.groupby(['user_id', 'date'])['metric'].transform('count') == 1
# fill_value=False keeps c boolean; the last row has no following row
c = a.shift(-1, fill_value=False) & ~b
df = df[a | b | c]
print(a, b, c)
df
#a: group by the two required keys plus the cumulative count within each group. Every row then forms its own group, so the mask is True exactly where "metric" is 'leave'.
0 False
1 False
2 True
3 False
4 True
5 False
Name: metric,dtype: bool
#b: if a group contains only one row, keep that row.
0 False
1 False
2 True
3 False
4 False
5 True
Name: metric,dtype: bool
#c: use .shift(-1) to flag the row directly before a 'leave' row. For the condition to hold, that row's group must have a count > 1 (b is False).
0 False
1 True
2 False
3 True
4 False
5 False
Name: metric,dtype: bool
Out[18]:
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02
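Note that c shifts over the whole frame rather than within each group, so the masks above depend on the row order of this sample. A minimal sketch of a variant that works out each row's position inside its own group, assuming the rows are already in the intended order within every [user_id, date] group. The names pos_from_end, last_is_leave, keep and result are made up for illustration:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "subscription": [1, 1, 2, 3, 4, 5],
    "metric": ["enter", "stay", "leave", "enter", "leave", "enter"],
    "date": ["2020-01-01", "2020-01-01", "2020-03-01",
             "2020-01-01", "2020-01-01", "2020-01-02"],
})

g = df.groupby(["user_id", "date"])

# Position counted from the end of each group: 0 = last row, 1 = second to last
pos_from_end = g.cumcount(ascending=False)

# True for every row whose group ends with 'leave'
last_is_leave = g["metric"].transform("last").eq("leave")

# Always keep the last row; keep the second to last only if the group ends with 'leave'
keep = pos_from_end.eq(0) | (pos_from_end.eq(1) & last_is_leave)

result = df[keep]
print(result)
```

This avoids looping over the groups, and a group with a single row is handled by the first condition alone, so no separate count check is needed.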
Here is one approach, though I think it is slow because we are iterating over the groups:
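df["date"] = pd.to_datetime(df["date"])
# helper column: True where metric is 'leave'
df = df.assign(metric_is_leave=df.metric.eq("leave"))
pd.concat(
    [
        # last two rows if the group's last metric is 'leave', otherwise only the last row;
        # the column slice :-1 drops the helper column again
        value.iloc[-2:, :-1] if value.metric_is_leave.iloc[-1] else value.iloc[-1:, :-1]
        for key, value in df.groupby(["user_id", "date"])
    ]
)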
user_id subscription metric date
1 1 1 stay 2020-01-01
2 1 2 leave 2020-03-01
3 2 3 enter 2020-01-01
4 2 4 leave 2020-01-01
5 2 5 enter 2020-01-02