使用 Python Pandas 搜索最大值和百分比最大值时的优化

问题描述

我有如下 df

我尝试了下面的代码，但它会得到一列的输出，我不得不添加 for 循环才能得到整个结果

那么有没有什么快速的解决方案，因为我有大数据

data = {'item':["y1","y2","y3","y4","y5","y6","y7","y8","y9","y10"],'X1':  [1,1,7,5,4],'X2':  [8,9,10,8,9],'X3':  [11,12,13,11,2],}
df = pd.DataFrame(data,columns = ['item','X1','X2','X3'])
# get count of unique values 
df['X1'].nunique()
# get max Value
df['X1'].value_counts().idxmax()
# get percentage of max value 
df['X1'].value_counts().max()/df['X1'].size
# get Second value of Max Value
(df.nlargest(2,['X1'])['X1']).value_counts().idxmax()
# Get Second Value of % 
df['X1'][df['X1']==(df.nlargest(2,['X1'])['X1']).value_counts().idxmax()].size/df['X1'].size

解决方法

您可以为每个测试列以及最大和第二大使用索引创建字典，因为 Series.value_counts 默认排序：

L = []
cols = ['X1','X2','X3'] 

for c in cols:
    u = df[c].nunique()
    a = df[c].value_counts()
    d = {'No of unique': u,'Highest rep': a.index[0],'% of Highest rep': a.iat[0] / len(df),'Second Highest rep': a.index[1],'Second % of Highest rep': a.iat[1] / len(df)}
    L.append(d)


df = pd.DataFrame(L,index=cols)    
print (df)
    No of unique  Highest rep  % of Highest rep  Second Highest rep  \
X1             4            1               0.5                   7   
X2             3           10               0.4                   8   
X3             5           11               0.6                  13   

    Second % of Highest rep  
X1                      0.3  
X2                      0.4  
X3                      0.1

更通用的解决方案测试是否存在性最大值：

L = []
cols = ['X1','X3'] 

for c in cols:
    u = df[c].nunique()
    a = df[c].value_counts()
    
    if len(a) > 1:
        secondmax = a.index[1]
        secondperc = a.iat[1] / len(df)
    else:
        secondmax = np.nan
        secondsecondperc = np.nan
        
    d = {'No of unique': u,'Second Highest rep': secondmax,'Second % of Highest rep': secondperc}

         
    L.append(d)

df = pd.DataFrame(L,index=cols)

bigdata dataframe pandas pandas python