将 2 df but column 映射到值而不是每个 ID 的值到值

问题描述

我有一个表,其中包含前 3 个原因(表 1)和另一个表,其中包含每个变量所属的类别(表 2)。 我正在尝试将类别箱与表 3 中的原因表相匹配。

Table 1: top 3 reasons,basically which 3 variable was the most important for that Register number

      RegNo    Reason1  Reason2 Reason3
        1111         v3       v2     v6
        2222         v2       v3     v5
        3333         v3       v2     v6
        4444         v3       v6     v2
        5555         v3       v2     v5


Table 2: The category bin for each variable 

    RegNo           v2              v3        v4          v5                 v6
    1111    (0.44,0.64]    (0.0,60.0]     missing    (-inf,3.0]   (102.0,175.0]
    2222    (-inf,0.33]    (0.0,60.0]     welldone   missing       missing
    3333    (0.44,60.0]     rare       missing       missing
    4444    (0.44,60.0]     missing    missing       (20.0,102.0]
    5555     (0.64,inf]    (0.0,60.0]     missing    missing       (-inf,20.0]


Table 3: For each ID,find category bin label and replace reason1,2 & 3 with the labels

       RegNo            Reason1         Reason2         Reason3
        1111         (0.0,60.0]    (0.44,0.64]     (102.0,175.0]
        2222         (-inf,60.0]       missing
        3333         (0.0,0.64]       missing
        4444         (0.0,60.0]    (20.0,102.0]    (0.44,0.64]
        5555         (0.0,60.0]     (0.64,inf]        missing

解决方法

方法

  • 以与 join() 一起使用的方式索引两个数据框
  • 然后它是来自 join 输出的每个原因的 pd.concat()
df1 = pd.read_csv(io.StringIO("""      RegNo    Reason1  Reason2 Reason3
        1111         v3       v2     v6
        2222         v2       v3     v5
        3333         v3       v2     v6
        4444         v3       v6     v2
        5555         v3       v2     v5
"""),sep="\s+")

df2 = pd.read_csv(io.StringIO("""    RegNo           v2              v3        v4          v5                 v6
    1111    (0.44,0.64]    (0.0,60.0]     missing    (-inf,3.0]   (102.0,175.0]
    2222    (-inf,0.33]    (0.0,60.0]     welldone   missing       missing
    3333    (0.44,60.0]     rare       missing       missing
    4444    (0.44,60.0]     missing    missing       (20.0,102.0]
    5555     (0.64,inf]    (0.0,60.0]     missing    missing       (-inf,20.0]
"""),sep="\s\s+",engine="python")

# index main df
df1 = df1.set_index("RegNo")
# reshape and index reasons
dfm = df2.set_index("RegNo").stack()
df3 = pd.concat([
    # add apprpriate column into index,don't want othert columns
    (df1.set_index(c,append="True").loc[:,[]]
     # now it's a straight forward join
     .join(dfm.rename_axis(["RegNo",c]).to_frame())
     # cleanup index and rename columns of joined DF
     .droplevel(1).rename(columns={0:c}))
    for c in ["Reason1","Reason2","Reason3"]],axis=1)
RegNo 原因1 原因2 原因3
1111 (0.0,60.0] (0.44,0.64] (102.0,175.0]
2222 (-inf,0.33] (0.0,60.0] 缺少
3333 (0.0,0.64] 缺少
4444 (0.0,60.0] (20.0,102.0] (0.44,0.64]
5555 (0.0,60.0] (0.64,inf] 缺少