Problem description
I'm having some trouble dropping the right duplicates from a DataFrame.
I have the following example:
import numpy as np
import pandas as pd
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value': [123, '', 324, '', '', '', 321]}
df = pd.DataFrame(data=test)
The output looks like this:
                  date value
0  2012-10-12 10:10:10   123
1  2012-10-12 10:10:10
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
4  2012-11-02 16:08:07
5  2012-12-12 23:45:21
6  2012-12-12 23:45:21   321
The desired output is:
                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
6  2012-12-12 23:45:21   321
However, none of my attempts so far have worked, as shown below:
Attempt 1:
df = df.drop_duplicates(subset='date')
                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07
5  2012-12-12 23:45:21
Attempt 2:
df = df.drop_duplicates(subset='date', keep='last')
                  date value
1  2012-10-12 10:10:10
2  2012-10-19 10:55:10   324
4  2012-11-02 16:08:07
6  2012-12-12 23:45:21   321
Please help me get the desired output. Thanks a lot in advance.
Solution
One approach is to mask the empty strings in the value column, then group by date and aggregate with first():
df['value'].mask(df['value'].eq('')).groupby(df['date']).first().fillna('').reset_index()
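As a self-contained sketch of this first approach, the one-liner can be split into steps. The sample data is an assumption reconstructed from the seven-row DataFrame printed in the question, and the names masked, first_per_date and result are only illustrative.

```python
import pandas as pd

# Sample data (assumption: reconstructed from the printed DataFrame in the question).
df = pd.DataFrame({
    "date": [
        "2012-10-12 10:10:10", "2012-10-12 10:10:10",
        "2012-10-19 10:55:10",
        "2012-11-02 16:08:07", "2012-11-02 16:08:07",
        "2012-12-12 23:45:21", "2012-12-12 23:45:21",
    ],
    "value": [123, "", 324, "", "", "", 321],
})

# Replace empty strings with NaN so that first() can skip them.
masked = df["value"].mask(df["value"].eq(""))

# first() picks the first non-null value per date (null if a date has none).
first_per_date = masked.groupby(df["date"]).first()

# Put the empty-string placeholder back and turn the date index into a column.
result = first_per_date.fillna("").reset_index()
print(result)
```

Note that the group-by result gets a fresh 0..n-1 index after reset_index(), so the original row labels are not preserved with this approach.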
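Alternatively, you can mask the empty strings in the value column and assign the result to a temporary column key, then sort the DataFrame on the date and key columns and call drop_duplicates: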
df['key'] = df['value'].mask(df['value'].eq(''))
df.sort_values(['date','key']).drop_duplicates('date').drop(columns='key')
Result:
date value
0 2012-10-12 10:10:10 123
1 2012-10-19 10:55:10 324
2 2012-11-02 16:08:07
3 2012-12-12 23:45:21 321
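The temporary-key variant can be sketched end to end as follows. The sample data is again an assumption reconstructed from the question's printed DataFrame, and assign() is used here in place of writing the key column into df, so the original frame is left unchanged.

```python
import pandas as pd

# Sample data (assumption: reconstructed from the printed DataFrame in the question).
df = pd.DataFrame({
    "date": [
        "2012-10-12 10:10:10", "2012-10-12 10:10:10",
        "2012-10-19 10:55:10",
        "2012-11-02 16:08:07", "2012-11-02 16:08:07",
        "2012-12-12 23:45:21", "2012-12-12 23:45:21",
    ],
    "value": [123, "", 324, "", "", "", 321],
})

# Temporary sort key: the value itself where present, NaN where it is "".
key = df["value"].mask(df["value"].eq(""))

result = (
    df.assign(key=key)
      .sort_values(["date", "key"])  # NaN keys sort last within each date
      .drop_duplicates("date")       # keep the first row per date
      .drop(columns="key")           # remove the helper column again
)
print(result)
```

Unlike the group-by version, this keeps the original row labels of the surviving rows.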
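Alternatively, this should work (note that the missing values are stored as np.nan here instead of empty strings):
import numpy as np
import pandas as pd
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value': [123, np.nan, 324, np.nan, np.nan, np.nan, 321]}
df = pd.DataFrame(data=test)
df.sort_values(by="value", inplace=True)  # NaN values are sorted last
df = df.drop_duplicates(subset='date')    # keeps the first row for each date
df = df.fillna('')                        # show missing values as empty strings
df.sort_index()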
The output looks like this:
date value
0 2012-10-12 10:10:10 123
2 2012-10-19 10:55:10 324
3 2012-11-02 16:08:07
6 2012-12-12 23:45:21 321
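Another option is to flag the non-empty values, sort on that flag, and keep the last row for each date:
import pandas as pd
# Same sample data as in the question.
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value': [123, '', 324, '', '', '', 321]}
df = pd.DataFrame(data=test)
df["value_not_empty"] = df['value'].map(bool)  # '' maps to False, non-zero numbers to True
df = df.sort_values("value_not_empty")         # empty rows first, non-empty rows last
df = df.drop(columns=["value_not_empty"])
df = df.drop_duplicates('date', keep='last')
df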
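If this is needed in more than one place, the flag-and-sort idea could be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name drop_duplicates_prefer_non_empty and its parameters are hypothetical, the sample data is reconstructed from the question, and a stable sort is used so that rows with the same flag keep their original relative order.

```python
import numpy as np
import pandas as pd


def drop_duplicates_prefer_non_empty(df, key_col, value_col, empty=""):
    """Keep one row per key_col, preferring rows whose value_col is not `empty`.

    Hypothetical helper for illustration; not part of pandas.
    """
    is_empty = df[value_col].eq(empty).to_numpy()
    # Stable positional sort: non-empty rows (False) first, original order otherwise.
    order = np.argsort(is_empty, kind="stable")
    deduped = df.iloc[order].drop_duplicates(key_col, keep="first")
    # Restore the original row order of the surviving rows.
    return deduped.sort_index()


# Sample data (assumption: reconstructed from the printed DataFrame in the question).
df = pd.DataFrame({
    "date": [
        "2012-10-12 10:10:10", "2012-10-12 10:10:10",
        "2012-10-19 10:55:10",
        "2012-11-02 16:08:07", "2012-11-02 16:08:07",
        "2012-12-12 23:45:21", "2012-12-12 23:45:21",
    ],
    "value": [123, "", 324, "", "", "", 321],
})

result = drop_duplicates_prefer_non_empty(df, "date", "value")
print(result)
```

Because the sort is stable and the index is restored at the end, a date whose rows are all empty keeps its first such row.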