问题描述
我有如下 df
目标输出
我尝试了下面的代码,但它会得到一列的输出,我不得不添加 for 循环才能得到整个结果
data = {'item':["y1","y2","y3","y4","y5","y6","y7","y8","y9","y10"],'X1': [1,1,7,5,4],'X2': [8,9,10,8,9],'X3': [11,12,13,11,2],}
df = pd.DataFrame(data,columns = ['item','X1','X2','X3'])
# get count of unique values
df['X1'].nunique()
# get max Value
df['X1'].value_counts().idxmax()
# get percentage of max value
df['X1'].value_counts().max()/df['X1'].size
# get Second value of Max Value
(df.nlargest(2,['X1'])['X1']).value_counts().idxmax()
# Get Second Value of %
df['X1'][df['X1']==(df.nlargest(2,['X1'])['X1']).value_counts().idxmax()].size/df['X1'].size
解决方法
您可以为每个测试列以及最大和第二大使用索引创建字典,因为 Series.value_counts
默认排序:
L = []
cols = ['X1','X2','X3']
for c in cols:
u = df[c].nunique()
a = df[c].value_counts()
d = {'No of unique': u,'Highest rep': a.index[0],'% of Highest rep': a.iat[0] / len(df),'Second Highest rep': a.index[1],'Second % of Highest rep': a.iat[1] / len(df)}
L.append(d)
df = pd.DataFrame(L,index=cols)
print (df)
No of unique Highest rep % of Highest rep Second Highest rep \
X1 4 1 0.5 7
X2 3 10 0.4 8
X3 5 11 0.6 13
Second % of Highest rep
X1 0.3
X2 0.4
X3 0.1
更通用的解决方案测试是否存在性最大值:
L = []
cols = ['X1','X3']
for c in cols:
u = df[c].nunique()
a = df[c].value_counts()
if len(a) > 1:
secondmax = a.index[1]
secondperc = a.iat[1] / len(df)
else:
secondmax = np.nan
secondsecondperc = np.nan
d = {'No of unique': u,'Second Highest rep': secondmax,'Second % of Highest rep': secondperc}
L.append(d)
df = pd.DataFrame(L,index=cols)