How to list the most frequent combinations of columns that contain data

Problem description

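Hello,

I'm working with geological datasets, which are notorious for being messy and inconsistent. What I'd like to do is, for a given number of columns, output the combination of columns that has the most NaN-free rows, along with that row count.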

For example (x marks a missing value):

A B C D E F
2 6 3 7 7 3 
4 5 6 7 5 4 
3 4 x x x x 
4 5 x x x x 
6 7 x x x x
x x x 5 6 7 
x x x 7 5 8

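If I enter 2, it would return a list containing ['A','B'] and 5, because columns A and B have 5 complete rows. If I enter 3, it would return ['D','E','F'] and 4, because they have 4 complete rows. If I enter 5, I would get ['A','B','C','D','F'] and 2, because they have 2 complete rows.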

Thanks in advance!

Solution

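I think this is what you're after. Instead of returning a single list of columns, it returns one or more lists of columns, to cover cases where several combinations tie for the "best" number of non-NA rows.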

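import pandas as pd
from itertools import combinations
from math import nan

def best_combinations(df, n_cols):
    """Return (list of best column tuples, number of complete rows)."""
    best_cols = []
    best_length = 0
    # Try every combination of n_cols columns.
    for cols in combinations(df.columns, n_cols):
        # Keep only the rows with no NaN in the selected columns.
        subdf = df.loc[:, list(cols)].dropna()
        if len(subdf) > best_length:
            # Strictly better: start a new list of winners.
            best_length = len(subdf)
            best_cols = [cols]
        elif (len(subdf) == best_length) and (best_length > 0):
            # Tie with the current best: keep this combination too.
            best_cols.append(cols)
    return best_cols, best_length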

On your dataframe:

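df = pd.DataFrame({
    'A': {0: '2', 1: '4', 2: '3', 3: '4', 4: '6', 5: nan, 6: nan},
    'B': {0: '6', 1: '5', 2: '4', 3: '5', 4: '7', 5: nan, 6: nan},
    'C': {0: '3', 1: '6', 2: nan, 3: nan, 4: nan, 5: nan, 6: nan},
    'D': {0: '7', 1: '7', 2: nan, 3: nan, 4: nan, 5: '5', 6: '7'},
    'E': {0: '7', 1: '5', 2: nan, 3: nan, 4: nan, 5: '6', 6: '5'},
    'F': {0: '3', 1: '4', 2: nan, 3: nan, 4: nan, 5: '7', 6: '8'},
})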

best_combinations(df, 2)
# returns:
# ([('A', 'B')], 5)

best_combinations(df, 3)
# returns:
# ([('D', 'E', 'F')], 4)
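
As a possible refinement, here is a minimal sketch of the same exhaustive search that computes the not-NaN mask once and reuses it for every combination, instead of calling dropna() on a fresh sub-frame each time. The name best_combinations_masked is hypothetical, and the dataframe is rebuilt from the table in the question with np.nan standing in for the "x" cells. It still visits every combination, so the work grows combinatorially with the number of columns.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def best_combinations_masked(df, n_cols):
    """Hypothetical variant: compute the not-NaN mask once and reuse it."""
    present = df.notna().to_numpy()  # bool array, shape (n_rows, n_columns)
    best_cols, best_length = [], 0
    for idx in combinations(range(df.shape[1]), n_cols):
        # A row is complete when every selected column holds a value.
        n_complete = int(present[:, list(idx)].all(axis=1).sum())
        if n_complete > best_length:
            best_cols, best_length = [tuple(df.columns[list(idx)])], n_complete
        elif n_complete == best_length and best_length > 0:
            best_cols.append(tuple(df.columns[list(idx)]))
    return best_cols, best_length


# Same layout as the example table; np.nan replaces the "x" cells.
x = np.nan
df = pd.DataFrame(
    [
        [2, 6, 3, 7, 7, 3],
        [4, 5, 6, 7, 5, 4],
        [3, 4, x, x, x, x],
        [4, 5, x, x, x, x],
        [6, 7, x, x, x, x],
        [x, x, x, 5, 6, 7],
        [x, x, x, 7, 5, 8],
    ],
    columns=list("ABCDEF"),
)

cols_2, n_2 = best_combinations_masked(df, 2)
cols_3, n_3 = best_combinations_masked(df, 3)
cols_5, n_5 = best_combinations_masked(df, 5)
```

With this data, only the first two rows are complete across all six columns, so for 5 columns every combination ties at 2 complete rows. That is the case the list-of-lists return value is meant to handle: ('A', 'B', 'C', 'D', 'F') is one of the tied results rather than the only one.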

