Expand the lists in a column so that each list element gets its own column, represented as a binary variable

Problem description

I have a data frame that looks like this:

skill_list                 name               profile                 561 904 468 875 737 402 882...
[561,564,632,859]       Aaron Weidele      wordpress developer      0    0   0   0   0   0   0   
[737,399,882,1086,5...]Abdelrady Tantawy  full stack developer     0    0   0   0   0   0   0   
[904,468,783,1120,8...]Abhijeet A Mulgund machine learning dev...  0    0   0   0   0   0   0   
[468]                      Abhijeet Tiwari    salesforce programmi...  0    0   0   0   0   0   0
[518,466,875,445,402..]Abhimanyu Veer A...machine learning devel...0    0   0   0   0   0   0   

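The skill_list column contains a list of encoded skills for each developer. I want to expand each list in the skill_list column so that every encoded skill is represented in its own column as a binary variable (1 for on, 0 for off). The expected output is: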

skill_list                 name               profile                 561 904 468 875 737 402 882...
[561,564,632,859]       Aaron Weidele      wordpress developer      1    0   0   0   0   0   0   
[737,399,882,1086,5...]Abdelrady Tantawy  full stack developer     0    0   0   0   1   0   1   
[904,468,783,1120,8...]Abhijeet A Mulgund machine learning dev...  0    1   1   0   0   0   0   
[468]                      Abhijeet Tiwari    salesforce programmi...  0    0   1   0   0   0   0
[518,466,875,445,402..]Abhimanyu Veer A...machine learning devel...0    0   0   1   0   1   0   

I tried:

for index,row in df_vector_matrix["skill_list"].items():
    for item in row:
        for col in df_vector_matrix.columns:
            if item == col:
                df_vector_matrix.loc[item,col] = "1"
        else:
            0

I would really appreciate your help!

Solution

You can try MultiLabelBinarizer from sklearn. The example below may help.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()
lb_res = lb.fit_transform(df_vector_matrix['skill_list'])

# convert result into dataframe
res = pd.DataFrame(lb_res,columns=lb.classes_)

# concatenate data result and original dataframe
df_vector_matrix = pd.concat([df_vector_matrix,res],axis=1)
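One detail worth checking: pd.concat with axis=1 aligns rows by index label, and the binarized frame above gets a fresh 0..n-1 index. If df_vector_matrix has any other index (for example after filtering rows), the rows would be misaligned and padded with NaN. A minimal sketch of one way around this, using a small made-up frame (the data and the names df and out are assumptions for illustration, not taken from the question):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame with a non-default index (e.g. left over from filtering).
df = pd.DataFrame(
    {"skill_list": [[561, 859], [737, 882], [468]],
     "name": ["a", "b", "c"]},
    index=[10, 20, 30],
)

lb = MultiLabelBinarizer()
lb_res = lb.fit_transform(df["skill_list"])

# Reuse the original index so pd.concat lines the rows up label by label
# instead of producing NaN-padded rows.
res = pd.DataFrame(lb_res, columns=lb.classes_, index=df.index)
out = pd.concat([df, res], axis=1)
```

Calling df.reset_index(drop=True) before the concat would be another option if the original index labels do not need to be kept.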

Below is a sample data frame in which the col column holds list values.

>>> import pandas as pd
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> d ={'col':[[1,2,3],[2,3,4,5],[2]],'name':['abc','vdf','rt']}
>>> df = pd.DataFrame(d)
>>> df
            col name
0     [1, 2, 3]  abc
1  [2, 3, 4, 5]  vdf
2           [2]   rt
>>> lb = MultiLabelBinarizer()
>>> lb_res = lb.fit_transform(df['col'])
>>> res = pd.DataFrame(lb_res,columns=lb.classes_)
>>> pd.concat([df,res],axis=1)
            col name  1  2  3  4  5
0     [1, 2, 3]  abc  1  1  1  0  0
1  [2, 3, 4, 5]  vdf  0  1  1  1  1
2           [2]   rt  0  1  0  0  0
>>>
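The data frame in the question already contains zero-filled skill columns, so concatenating a second set would leave duplicate column labels. The following is a rough sketch of filling the existing columns in place instead. It assumes the column labels have the same type as the items in the lists (integers here), and the sample data and the name skill_cols are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame that already carries zero-filled skill columns.
df = pd.DataFrame({
    "skill_list": [[561, 859], [737, 882], [468]],
    "name": ["a", "b", "c"],
    561: 0, 468: 0, 737: 0, 882: 0,
})

# Every column other than the descriptive ones is treated as a skill code.
skill_cols = [c for c in df.columns if c not in ("skill_list", "name")]

# Fixing `classes` keeps the output columns in the same order as skill_cols;
# labels outside that list (859 here) are ignored with a warning.
lb = MultiLabelBinarizer(classes=skill_cols)
df[skill_cols] = lb.fit_transform(df["skill_list"])
```

If the column labels were strings such as "561" while the lists hold integers, nothing would match, so the types would need to be made consistent first.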