Binarizing many features based on a condition

Problem description

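I have a pandas DataFrame with hundreds of categorical features (encoded as numbers). I want to keep only the most frequent values in each column. I already know that each column has only 3 or 4 dominant values, but I want to select them automatically. I need two approaches: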

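1) Keep only the 3 most frequent values. Note: no column has only 1, 2, or 3 unique values (there are about 20 unique values per column), so that case does not need to be handled. If several values tie for third place, keep all of them. For example: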

# value_counts() on column 1
1    35
2    23
3    10
4     9
8     8
6     8

# value_counts() on column 2
0    23
2    15
1    15  # two values tied for second place
4     9
5     3
6     2

# result: value_counts() on column 1 afterwards
1        35
2        23
3        10
other    25  # 9 + 8 + 8

# result: value_counts() on column 2 afterwards
0        23
2        15
1        15
4         9
other     5  # 3 + 2

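2) Keep as many values in each column as needed so that the total count of the remaining values is smaller than the count of the last value you decided to keep. For example: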

# value_counts() on column 1
1    35
2    23
3    10
4     3
8     2
6     1

# value_counts() on column 2
0    23
2    15
1     9
4     8
5     3
6     2

# result: value_counts() on column 1 afterwards
1        35
2        23
3        10
other     6  # 3 + 2 + 1

# result: value_counts() on column 2 afterwards
0        23
2        15
1         9
4         8
other     5  # 3 + 2

Please cover both approaches. Thanks.

Solution

Let's try a UDF that follows your logic:

def my_count(s):
    x = s.value_counts()
    if len(x) > 3:
        ret = x.iloc[:3].copy()
        ret.loc['other'] = x.iloc[3:].sum()
    else:
        ret = x
    return ret

df[['col1']].apply(my_count)

Output:

       col1
1        35
2        23
3        10
other     6
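
Note that the UDF above returns aggregated counts and always cuts after three rows. If the goal is to relabel the column itself and keep every value tied at the cut-off, as in the first example of the question, something along these lines might work. This is only a sketch: the name `keep_top_counts`, its `n` and `other` parameters and the "other" label are my own, and it assumes that "top 3" means the three highest distinct counts, which is what the expected result for column 2 suggests.

```python
import pandas as pd

def keep_top_counts(s, n=3, other="other"):
    """Relabel values whose count is below the n-th highest distinct count."""
    counts = s.value_counts()
    # n-th largest distinct count; every value at or above it is kept,
    # so ties are never split (assumption about the intended rule)
    thr = counts.drop_duplicates().nlargest(n).iloc[-1]
    keep = counts[counts >= thr].index
    # cast to object first so the string label can sit next to the numbers
    return s.astype(object).where(s.isin(keep), other)

# column 2 from the first example in the question
col2 = pd.Series([0] * 23 + [2] * 15 + [1] * 15 + [4] * 9 + [5] * 3 + [6] * 2)
print(keep_top_counts(col2).value_counts())
```

For a whole frame, `df.apply(keep_top_counts)` should apply the same rule column by column.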

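I will show the function I would use myself on a two-column dataset. Limitation: in this solution, ties that span the 2nd, 3rd and 4th places at the same time are not collected into a single cell. You may need to customize this behavior further for your purpose.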

Sample data

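There are 2 columns, each with 26 classes. One column is categorical (strings), the other numeric. The sample data was chosen deliberately to show the effect of ties.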

import pandas as pd
import numpy as np

np.random.seed(2)  # reproducibility
df = pd.DataFrame(np.random.randint(65,91,(1000,2)),columns=["str","num"])
df["str"] = list(map(chr,df["str"].values))

print(df)
    str  num
0     I   80
1     N   73
2     W   76
3     S   76
4     I   72
..   ..  ...
995   M   80
996   Q   70
997   P   66
998   I   87
999   F   83
[1000 rows x 2 columns]

The function

def count_top_n(df,n_top):

    # name of output columns
    def gen_cols(ls_str):
        for s in ls_str:
            yield s
            yield f"{s}_counts"

    df_count = pd.DataFrame(np.zeros((n_top+1,df.shape[1]*2),dtype=object),index=range(1,n_top+2),columns=list(gen_cols(df.columns.values)))  # df.shape[1] = #cols
    # process each column
    for i,col in enumerate(df):
        # count
        tmp = df[col].value_counts()
        assert len(tmp) > n_top, f"too few classes: {len(tmp)} <= n_top = {n_top}"

        # case 1: no ties at the 3rd place
        if tmp.iat[n_top - 1] != tmp.iat[n_top]:
            # fill in classes
            df_count.iloc[:n_top,2*i] = tmp[:n_top].index.values
            df_count.iloc[n_top,2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top,2*i+1] = tmp.iloc[:n_top].values  # plain values: no index alignment
            df_count.iloc[n_top,2*i+1] = tmp[n_top:].sum()
        
        # case 2: ties
        else:
            # new termination location
            n_top_new = (tmp >= tmp.iat[n_top]).sum()
            # fill in classes
            df_count.iloc[:n_top-1,2*i] = tmp.iloc[:n_top-1].index.values
            df_count.iloc[n_top-1,2*i] = list(tmp.iloc[n_top-1:n_top_new].index.values)
            df_count.iloc[n_top,2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top-1,2*i+1] = tmp.iloc[:n_top-1].values
            df_count.iloc[n_top-1,2*i+1] = list(tmp.iloc[n_top-1:n_top_new].values)
            df_count.iloc[n_top,2*i+1] = tmp.iloc[n_top_new:].values.sum()

    return df_count

Output:

This produces a human-readable table. Note that the str column has ties at the 2nd, 3rd and 4th places.

print(count_top_n(df,3))
      str str_counts      num num_counts
1       V         52       71         51
2       Q         46       86         47
3  [B,K]    [46,46]  [90,67]    [46,46]
4  (rest)        810   (rest)        810
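
The limitation mentioned earlier (ties spanning several places are not gathered in one cell) could probably be lifted by ranking the counts densely, so that equal counts share one place. The following is a rough sketch for a single column, not a drop-in replacement for `count_top_n`: the name `top_n_with_ties` and the output layout (`place`, `classes`, `count`) are made up for illustration.

```python
import pandas as pd

def top_n_with_ties(s, n_top=3):
    """One row per place; classes with equal counts share a place."""
    counts = s.value_counts()
    # dense rank: equal counts get the same place, and places have no gaps
    place = counts.rank(method="dense", ascending=False).astype(int)
    in_top = place <= n_top
    rows = [
        {"place": p, "classes": list(grp.index), "count": int(grp.iloc[0])}
        for p, grp in counts[in_top].groupby(place[in_top])
    ]
    # everything below the n_top-th place is summed into a single row
    rows.append({"place": "(rest)", "classes": [],
                 "count": int(counts[~in_top].sum())})
    return pd.DataFrame(rows)

# small column with a tie for second place
demo = pd.Series([0] * 23 + [2] * 15 + [1] * 15 + [4] * 9 + [5] * 3 + [6] * 2)
print(top_n_with_ties(demo))
```

With this layout the 2nd place holds both tied classes, the 3rd place is the next distinct count, and the last row carries the total of everything else.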

Use the following function:

def myFilter(col,maxOther = 0):
    unq = col.value_counts()
    if maxOther == 0:    # Return 3 MFV
        thr = unq.unique()[:3][-1]
        otherCnt = unq[unq < thr].sum()
        rv = col[col.isin(unq[unq >= thr].index)]
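    else:    # Drop the least frequent values (LFV) while their total stays below maxOther
        otherCnt = 0
        nDrop = 0    # number of least frequent values dropped so far
        for cnt in unq[::-1]:
            if otherCnt + cnt >= maxOther: break
            otherCnt += cnt
            nDrop += 1
        rv = col[col.isin(unq.index[:unq.size - nDrop])]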
    rv = rv.reset_index(drop=True)
    # print(f'  Trace {col.name}\nunq:\n{unq}\notherCnt: {otherCnt}')
    return rv

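My assumption is that the choice between the two variants:

return the 3 most frequent values (MFV), keeping all values tied at the three highest distinct counts
drop the least frequent ("other") values

is controlled by the maxOther parameter. Its default value of 0 selects the first variant.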

So to test both variants, call:

df.apply(myFilter) (first variant)
df.apply(myFilter, maxOther=10) (second variant).

To see the trace printout, uncomment the print statement in the function.
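
As a side note, the second variant (dropping the least frequent values while their combined count stays below a bound) can also be written without an explicit loop, using a cumulative sum over the counts in ascending order. This is a minimal sketch under that reading of the rule; `drop_rare` and `max_other` are hypothetical names, and the bound of 10 is simply the value used in the call above.

```python
import pandas as pd

def drop_rare(col, max_other):
    """Drop the rarest values while their running total stays below max_other."""
    counts = col.value_counts()              # sorted from most to least frequent
    running = counts.iloc[::-1].cumsum()     # running total, rarest value first
    to_drop = running[running < max_other].index
    return col[~col.isin(to_drop)].reset_index(drop=True)

# the two columns from the second example in the question
col1 = pd.Series([1] * 35 + [2] * 23 + [3] * 10 + [4] * 3 + [8] * 2 + [6] * 1)
col2 = pd.Series([0] * 23 + [2] * 15 + [1] * 9 + [4] * 8 + [5] * 3 + [6] * 2)
print(drop_rare(col1, 10).value_counts())
print(drop_rare(col2, 10).value_counts())
```

Because all counts are positive, the running total only grows, so the dropped values always form a contiguous tail of the frequency table.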

categorical-data dataframe frequency pandas python