Pandas:处理缺失数据时的真值条件是什么 示例输出

问题描述

我有一个创建比率的函数。定义为

def create_ratio(data,num,den):
    if data[num].isnull():
        ratio = -9997
    if data[den].isnull():
        ratio = -9998
    if data[num].isnull() & data[den].isnull():
        ratio = -9999
    else:
        ratio = data[num]/data[den]
    return ratio

我有 Pandas 数据框 (df_credit),其中包括信用卡余额 (cc_bal) 和限制 (cc_limit),我想计算信用卡利用率,即余额超过限制

df_credit['cc_util'] = create_ratio(df_credit,'cc_bal','cc_limit')

我收到以下错误

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-d53809a7690d> in <module>
----> 1 data['ratio_cc_util'] = create_ratio(data,'open_credit_card_credit_limit_nomiss','open_credit_card_credit_limit_nomiss')
      2 data['ratio_cc_util'].hist()

<ipython-input-65-99bc55b184ed> in create_ratio(data,den)
      1 def create_ratio(data,den):
----> 2     if data[num].isnull():
      3         ratio = -9997
      4     if data[den].isnull():
      5         ratio = -9998

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1441     def __nonzero__(self):
   1442         raise ValueError(
-> 1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty,a.bool(),a.item(),a.any() or a.all()."
   1445         )

ValueError: The truth value of a Series is ambiguous. Use a.empty,a.any() or a.all().

这个错误解决方法是什么?谢谢。

解决方法

  • 您在标量和系列之间混合使用,您的函数需要根据其调用上下文返回系列或数组
  • 实现这种条件逻辑的最简单方法是 np.select()
  • 拥有模拟数据,包括满足您的用例的缺失值
df = pd.DataFrame({
        "cc_bal": np.random.uniform(200,1000,200),"cc_limit": np.random.uniform(800,1200,})

df.loc[np.unique(np.random.choice(range(len(df)),30)),"cc_bal"] = None
df.loc[np.unique(np.random.choice(range(len(df)),"cc_limit"] = None


def create_ratio(df,num,den):
    return np.select(
        [
            df[num].isnull() & df[den].isnull(),df[num].isnull(),df[den].isnull(),],[-9999,-9997,-9998],df[num] / df[den],)


df["ratio"] = create_ratio(df,"cc_bal","cc_limit")
df

示例输出

cc_bal cc_limit 比例
0 372.633 981.996 0.379465
1 845.541 1133.69 0.745831
2 449.406 975.903 0.460503
3 209.827 922.829 0.227374
4 237.347 936.654 0.253398
5 351.154 nan -9998
6 nan 873.671 -9997
7 803.396 861.791 0.93224
8 591.136 807.176 0.732352
9 675.397 847.059 0.797344