Identifying equality between columns while ignoring NaN

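Problem description

How can I ignore empty/NaN values when checking two columns for equality in pandas?

That is, the check should return True when col2 is the same as col1, even if col2 contains NaN: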

df['col1'].equals(df['col2'])

Solution

Use an intermediate column (Series) that is identical to col2, but with its NaN values set to the corresponding values from col1.

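import pandas as pd
# Both columns must have the same length, otherwise the DataFrame constructor raises a ValueError
df = pd.DataFrame({'col1': [1.,2,3,4,5,6],'col2': [1,None,None,4,5,6]})
df['col1'].equals(df['col2'])  # False: col2 contains NaN where col1 has values
s = df['col2'].fillna(df['col1'])
df['col1'].equals(s)  # True: the NaNs in col2 were filled from col1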

You can also do this with boolean filtering, which treats NA/NaN in the two columns symmetrically:

mask = df['col1'].notna() & df['col2'].notna()
df.loc[mask,'col1'].equals(df.loc[mask,'col2'])
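To see why the symmetric mask matters, here is a minimal sketch on made-up data (the values below are illustrative assumptions, not from the question) where each column has a NaN at a different position. The mask-based check succeeds, while filling only col2 from col1 does not:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: NaNs appear in both columns, at different positions.
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, 4.0],
    "col2": [1.0, 2.0, np.nan, 4.0],
})

# Keep only the rows where both columns hold a value.
mask = df["col1"].notna() & df["col2"].notna()
masked_equal = df.loc[mask, "col1"].equals(df.loc[mask, "col2"])

# One-sided filling only repairs the NaN in col2; the NaN in col1 still differs.
one_sided_equal = df["col1"].equals(df["col2"].fillna(df["col1"]))

print(masked_equal)     # True
print(one_sided_equal)  # False
```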

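I had to dig into this a bit, because some of the answers do not work if you don't know in which column the missing values occur. Let's also see which answer is fastest.

So let's create some test data: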

import pandas as pd
import numpy as np

ser1 = pd.Series(np.random.rand(10_000)) # Generate random column
ser2 = ser1.copy(deep=True)              # Exact copy of values
ser3 = ser1.copy(deep=True)              # Exact copy of values
ser4 = pd.Series(np.random.rand(10_000)) # Different data

# Create independent nans
# ser1 without nans
ser2[np.random.rand(10_000) > 0.8] = np.nan 
ser3[np.random.rand(10_000) > 0.8] = np.nan
ser4[np.random.rand(10_000) > 0.8] = np.nan

The current answers, written as functions that take pd.Series (the column type):

# Sanity check: plain `Series.equals`, which is False as soon as a NaN faces an actual value
def equality_pandas(a,b):
    return a.equals(b)

def equality_filling_one_sided(a,b):
    return a.equals(b.fillna(a))

def equality_filling_dummy_data(a,b):
    return a.fillna(-9999).equals(b.fillna(a).fillna(-9999))

def equality_boolean_mask(a,b):
    mask = a.notna() & b.notna()
    return a[mask].equals(b[mask])

def equality_pure_boolean(a,b):
    # using binary or operator to make it True if isna
    return ((a == b) | a.isna() | b.isna()).all()

Let's define some tests for what we expect from a general comparison function that ignores NaNs and doesn't care whether those NaNs are in the left or the right column:

def tests(equal):
    assert equal(ser1,ser1),"Identity has to be true without nan"
    assert equal(ser2,ser2),"Identity with nans at the same position"
    assert equal(ser1,ser2),"Same data, NaNs only on the right"
    assert equal(ser2,ser1),"Same data, NaNs only on the left"
    assert equal(ser2,ser3),"Same data but different NaNs"
    assert not equal(ser1,ser4),"Different data has to be not equal (NaNs only right)"
    assert not equal(ser2,ser4),"Different data has to be not equal (NaNs in both)"
    print("PASS")

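Running these tests shows that filling in only one direction does not make the comparison commutative (not even with a dummy value, which is usually bad practice anyway). Note that Series.equals, contrary to the rule that np.nan == np.nan is False, returns True when the NaNs are in exactly the same positions!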
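The difference between element-wise `==` and `Series.equals` can be checked on a tiny Series. This is only an illustrative sketch with made-up values; it also shows that `equals` requires matching dtypes, which is worth keeping in mind when one column is integer and the other float:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise comparison: NaN == NaN is False, so not all positions match.
elementwise_all = (s == s).all()

# Series.equals treats NaNs in the same location as equal.
equals_same_nans = s.equals(s.copy())

# Series.equals also compares dtypes: int64 vs float64 is not "equal".
equals_mixed_dtype = pd.Series([1, 2]).equals(pd.Series([1.0, 2.0]))

print(elementwise_all, equals_same_nans, equals_mixed_dtype)
```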

>>> tests(equality_pandas)

          2     assert equal(ser1,ser1),"Identity has to be true without nan"
          3     assert equal(ser2,ser2),"Identity with nans at the same position"
    ----> 4     assert equal(ser1,ser2),"Same data, NaNs only on the right"
          5     assert equal(ser2,ser1),"Same data, NaNs only on the left"
          6     assert equal(ser2,ser3),"Same data but different NaNs"


    AssertionError: Same data, NaNs only on the right


>>> tests(equality_filling_one_sided)

          3     assert equal(ser2,ser2),"Identity with nans at the same position"
          4     assert equal(ser1,ser2),"Same data, NaNs only on the right"
    ----> 5     assert equal(ser2,ser1),"Same data, NaNs only on the left"
          6     assert equal(ser2,ser3),"Same data but different NaNs"
          7     assert not equal(ser1,ser4),"Different data has to be not equal (NaNs only right)"


    AssertionError: Same data, NaNs only on the left


>>> tests(equality_filling_dummy_data)

    ----> 5     assert equal(ser2,ser1),"Same data, NaNs only on the left"


    AssertionError: Same data, NaNs only on the left


>>> tests(equality_boolean_mask)
PASS

>>> tests(equality_pure_boolean)
PASS

Performance

Now let's take a quick look at which method returns its answer fastest:

%%timeit
equality_filling_one_sided(ser1,ser2)

    910 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)


%%timeit
equality_filling_dummy_data(ser1,ser2)

    1.27 ms ± 67.7 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)


%%timeit
equality_boolean_mask(ser1,ser2)

    2.15 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs,100 loops each)


%%timeit
equality_pure_boolean(ser1,ser2)

    1.34 ms ± 32.2 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

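As you can see, the boolean solutions have to compute more intermediate results separately and are therefore slower, even though they are closer to what you would write if you implemented this as optimized C code.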

If you know that only the right column/Series contains NaNs, you can use the equality_filling_one_sided solution for the best performance.
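For comparison, the pure boolean idea can also be written directly on the underlying NumPy arrays, which is roughly the logic a hand-written C loop would follow. This is only a sketch: the name `equality_numpy` is hypothetical, it assumes numeric data that can be cast to float, and it compares values by position while ignoring the index. It was not part of the timings above, so no speed claim is made for it.

```python
import numpy as np
import pandas as pd

def equality_numpy(a: pd.Series, b: pd.Series) -> bool:
    """Positionally compare two numeric Series, ignoring NaNs on either side."""
    x = a.to_numpy(dtype=float)
    y = b.to_numpy(dtype=float)
    if x.shape != y.shape:
        return False
    # A position counts as matching if the values agree or either one is NaN.
    return bool(np.all((x == y) | np.isnan(x) | np.isnan(y)))

# Small made-up Series to exercise the function.
left = pd.Series([1.0, np.nan, 3.0])
right = pd.Series([1.0, 2.0, np.nan])
different = pd.Series([1.0, 2.0, 4.0])

print(equality_numpy(left, right))      # True
print(equality_numpy(right, left))      # True
print(equality_numpy(left, different))  # False
```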

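The ideal commutative solution

So if we want a commutative comparison that ignores NaNs on both the left and the right side, the fastest way is: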

def equality_filling_two_sided(a,b):
    f_a = a.fillna(b)
    f_b = b.fillna(a)
    return f_a.equals(f_b)

>>> tests(equality_filling_two_sided)
PASS


%%timeit
equality_filling_two_sided(ser1,ser2)

    962 µs ± 35.1 µs per loop (mean ± std. dev. of 7 runs,1000 loops each)

This is slightly slower than the one-sided solution, but it satisfies all the requirements.
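A small self-contained check of the two-sided fill on made-up data (the three Series below are illustrative assumptions) shows that the result does not depend on argument order, and that positions where both sides are NaN are still treated as equal because `Series.equals` matches NaNs in the same location:

```python
import numpy as np
import pandas as pd

def equality_filling_two_sided(a, b):
    # Fill each side's NaNs from the other side, then compare.
    f_a = a.fillna(b)
    f_b = b.fillna(a)
    return f_a.equals(f_b)

# Hypothetical data: NaNs at different positions, plus one position that is NaN in all Series.
left = pd.Series([1.0, np.nan, 3.0, np.nan])
right = pd.Series([1.0, 2.0, np.nan, np.nan])
other = pd.Series([1.0, 2.0, 4.0, np.nan])

print(equality_filling_two_sided(left, right))  # True
print(equality_filling_two_sided(right, left))  # True
print(equality_filling_two_sided(left, other))  # False
```

Note that `fillna` with a Series aligns on the index, so this sketch assumes both Series share the same index.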
