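I have a set of data with an irregular number of columns, and I need to use pandas to find the most common value across entries that are split over several columns. As an example, suppose I know which cheeses my coworkers eat for lunch each day: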
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN
The output I want is:
Name Cheese Pct
Evan Gouda .66
John Havarti .4
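I also don't know how many columns will need to be included each time the script runs, only that they all follow the "Cheese" plus index naming format. If John shows up the next day with four cheeses, I will need to add a fourth column, and the analysis script needs to be able to handle that.
Solution: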
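import io
import pandas as pd

# Sample data; Rick has no cheeses recorded at all
data = io.StringIO("""\
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN
4 Rick NaN NaN NaN
""")
# sep=r'\s+' splits on whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+')

def top_cheese(g):
    # Pick up however many "Cheese*" columns exist in this run
    cheese_cols = [col for col in g.columns if col.startswith('Cheese')]
    try:
        # Pool all cheese columns into one Series, then take the
        # most frequent value together with its share of the total
        out = (g[cheese_cols].stack().value_counts(normalize=True)
               .reset_index().iloc[0])
        out.index = ['Cheese', 'Pct']
        return out
    except IndexError:
        # No cheeses recorded for this person
        return pd.Series({'Cheese': 'None', 'Pct': 0})

output = df.groupby('Name').apply(top_cheese)
print(output)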
Output:
       Cheese       Pct
Name
Evan    Gouda  0.666667
John  Havarti  0.400000
Rick     None  0.000000
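As an alternative to the per-group function above, the same result can be sketched by reshaping the data to long format first. This is only a minimal sketch, not part of the original answer: the helper name `top_cheese_long` and its `prefix` parameter are made up for illustration, and it assumes the person column is called `Name` as in the sample data. It also behaves differently for people with no cheeses at all: they are simply left out of the result instead of getting a `'None'` row.

```python
import io
import pandas as pd

data = io.StringIO("""\
Idx Name Cheese1 Cheese2 Cheese3
0 Evan Gouda NaN NaN
1 John Cheddar Havarti Blue
2 Evan Cheddar Gouda NaN
3 John Havarti Swiss NaN
4 Rick NaN NaN NaN
""")
df = pd.read_csv(data, sep=r"\s+")


def top_cheese_long(df, prefix="Cheese"):
    """Hypothetical helper: most common cheese per Name and its share."""
    # Select however many "Cheese*" columns are present in this run
    cheese_cols = [c for c in df.columns if c.startswith(prefix)]

    # One row per (Name, cheese) entry; drop the empty slots
    long = (df.melt(id_vars="Name", value_vars=cheese_cols, value_name="Cheese")
              .dropna(subset=["Cheese"]))

    # Count each cheese per person, then convert counts to shares
    counts = long.groupby(["Name", "Cheese"]).size().rename("n").reset_index()
    counts["Pct"] = counts["n"] / counts.groupby("Name")["n"].transform("sum")

    # Keep the highest-count row for each person
    top = (counts.sort_values(["Name", "n"], ascending=[True, False])
                 .drop_duplicates("Name"))
    return top.set_index("Name")[["Cheese", "Pct"]]


result = top_cheese_long(df)
print(result)
```

With the sample data this gives Gouda for Evan (2 of 3 entries) and Havarti for John (2 of 5 entries), and Rick does not appear. Because the cheese columns are found by prefix, adding a `Cheese4` column needs no code change.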