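Convert a pandas Series with list values to a boolean DataFrame

Problem description

I have a Series whose values are lists of different elements. The value counts look like this: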

category                                              count
[Radiometric]                                         76
[Ozone]                                               59
[Aerosol]                                             53
[Cryosphere]                                          31
[Atmospheric State,Cloud Properties]                  29
[Atmospheric State,Radiometric,Surface Properties]    8
[POPs]                                                8
[Atmospheric State,Cloud Properties,Radiometric]      7

I want to create one column per category and mark each row True/False.

For example:

index    Aerosol    Cloud Properties    Radiometric    ......
1        TRUE       FALSE               TRUE
2        FALSE      TRUE                TRUE
3
4

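I managed to get a unique list of these categories from all the items. I can also split the lists into separate columns using the method given in the solution here.

But in my case the lists vary in length and content, so that gives me a DF like the one below: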

    1                   2                   3                  4                 5
25  Reactive Gas        Surface Properties  None               None              None
28  Aerosol             Ozone               Atmospheric State  Cloud Properties  None
59  Surface Properties  Cryosphere          None               None              None
68  Atmospheric State   Cloud Properties    None               None              None
73  Atmospheric State   Radiometric         None               None              None

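Is there a way to convert this to the desired output using pandas or other Python tools? I am currently using pandas.pivot_table, with hints from this solution. I used column 1 as the columns (assuming it contains all the categories), but that gives me a multi-index DF for each column.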

pvt = tmp.pivot_table(index=tmp.index,columns="1",aggfunc="count")

I need help on how to get the boolean matrix/df described above.

Solution

I think you need Series.str.join and Series.str.get_dummies, followed by a cast to boolean:

df1 = df.category.str.join('|').str.get_dummies().astype(bool)
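To make the one-liner above concrete, here is a minimal sketch on a small, made-up DataFrame (the sample rows are an assumption for illustration, not the asker's data). It relies on the fact that Series.str.get_dummies splits on '|' by default, so this approach assumes no category name itself contains that character.

```python
import pandas as pd

# Hypothetical sample data: one list of category names per row.
df = pd.DataFrame({
    "category": [
        ["Radiometric"],
        ["Atmospheric State", "Cloud Properties"],
        ["Atmospheric State", "Radiometric", "Surface Properties"],
    ]
})

# Step 1: join each list into a single "a|b|c" string.
joined = df["category"].str.join("|")

# Step 2: split on "|" into 0/1 indicator columns, then cast to bool.
df1 = joined.str.get_dummies().astype(bool)

print(df1)
```

The resulting columns are the distinct category names, and the original index is kept, so df1 can be joined back to df if needed.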

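Or use MultiLabelBinarizer from scikit-learn:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per class in mlb.classes_;
# passing index=df.index keeps the original row labels
df1 = pd.DataFrame(mlb.fit_transform(df.category), columns=mlb.classes_, index=df.index).astype(bool)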
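As a rough illustration of the MultiLabelBinarizer route, the sketch below runs it on a small made-up Series with a non-default index (the values and the names `s` and `flags` are assumptions for the example). It also passes the original index to the DataFrame constructor, since fit_transform returns a plain array and would otherwise lose the row labels.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample: lists of categories with a non-default index.
s = pd.Series(
    [
        ["Reactive Gas", "Surface Properties"],
        ["Surface Properties", "Cryosphere"],
        ["Atmospheric State"],
    ],
    index=[25, 59, 68],
    name="category",
)

mlb = MultiLabelBinarizer()

# fit_transform yields a 0/1 array; classes_ holds the matching column labels.
flags = pd.DataFrame(
    mlb.fit_transform(s),
    columns=mlb.classes_,
    index=s.index,  # keep the original row labels
).astype(bool)

print(flags)
```

Compared with the string-based approach, this one never joins the labels into a delimited string, so it should not be affected by a separator character appearing inside a category name.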
