Problem description
skill_list               name                 profile                    561  904  468  875  737  402  882 ...
[561,564,632,859]        Aaron Weidele        wordpress developer        0    0    0    0    0    0    0
[737,399,882,1086,5...]  Abdelrady Tantawy    full stack developer       0    0    0    0    0    0    0
[904,468,783,1120,8...]  Abhijeet A Mulgund   machine learning dev...    0    0    0    0    0    0    0
[468]                    Abhijeet Tiwari      salesforce programmi...    0    0    0    0    0    0    0
[518,466,875,445,402..]  Abhimanyu Veer A...  machine learning devel...  0    0    0    0    0    0    0
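The skill_list column contains, for each developer, a list of encoded skills. I want to expand every list in the skill_list column so that each encoded skill is represented in its own column as a binary variable (1 for on, 0 for off). The expected output is: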
skill_list   name                 profile                    561  904  468  875  737  402  882 ...
[561,859]    Aaron Weidele        wordpress developer        1    0    0    0    0    0    0
[737,5...]   Abdelrady Tantawy    full stack developer       0    0    0    0    1    0    1
[904,8...]   Abhijeet A Mulgund   machine learning dev...    0    1    1    0    0    0    0
[468]        Abhijeet Tiwari      salesforce programmi...    0    0    1    0    0    0    0
[518,402..]  Abhimanyu Veer A...  machine learning devel...  0    0    0    0    0    1    0
I tried:
for index,row in df_vector_matrix["skill_list"].items():
    for item in row:
        for col in df_vector_matrix.columns:
            if item == col:
                df_vector_matrix.loc[item,col] = "1"
            else:
                0
I would really appreciate your help!
Solution
You can try MultiLabelBinarizer from sklearn.
The example below may help.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()
lb_res = lb.fit_transform(df_vector_matrix['skill_list'])
# convert the result into a dataframe, reusing the original index so rows line up
res = pd.DataFrame(lb_res, columns=lb.classes_, index=df_vector_matrix.index)
# concatenate the result with the original dataframe
df_vector_matrix = pd.concat([df_vector_matrix, res], axis=1)
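One detail worth checking: in the question, the skill columns (561, 468, ...) already exist in the frame and are filled with zeros, so a plain concat would probably leave two columns with the same label. The sketch below is one way to handle that, assuming the placeholder columns can simply be dropped first. The frame contents, the names dev_a/dev_b/dev_c, and the index values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical miniature of the question's frame: two skill columns already
# exist as all-zero placeholders, and the index is deliberately not 0..n-1.
df_vector_matrix = pd.DataFrame(
    {
        "skill_list": [[561, 737], [737, 882], [468]],
        "name": ["dev_a", "dev_b", "dev_c"],
        561: 0,
        468: 0,
    },
    index=[10, 11, 12],
)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df_vector_matrix["skill_list"]),
    columns=mlb.classes_,
    index=df_vector_matrix.index,  # keep rows aligned with the original frame
)

# Drop any placeholder columns that the encoder is about to recreate,
# then attach the freshly encoded 0/1 columns.
base = df_vector_matrix.drop(columns=encoded.columns, errors="ignore")
result = pd.concat([base, encoded], axis=1)
print(result)
```

Passing index= matters here because concat with axis=1 aligns on index labels; without it, a frame whose index is not 0..n-1 would come out with misaligned rows and NaN values.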
Below is an example data frame in which the col column holds list values.
>>> import pandas as pd
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> d ={'col':[[1,2,3],[2,3,4,5],[2]],'name':['abc','vdf','rt']}
>>> df = pd.DataFrame(d)
>>> df
            col name
0     [1, 2, 3]  abc
1  [2, 3, 4, 5]  vdf
2           [2]   rt
>>> lb = MultiLabelBinarizer()
>>> lb_res = lb.fit_transform(df['col'])
>>> res = pd.DataFrame(lb_res,columns=lb.classes_)
>>> pd.concat([df,res],axis=1)
            col name  1  2  3  4  5
0     [1, 2, 3]  abc  1  1  1  0  0
1  [2, 3, 4, 5]  vdf  0  1  1  1  1
2           [2]   rt  0  1  0  0  0
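If adding sklearn as a dependency is not desirable, roughly the same table can be built with pandas alone. This is a different technique from the MultiLabelBinarizer answer above, shown only as a sketch on the same toy data: explode gives one row per list element, get_dummies one-hot encodes those elements, and a groupby on the original index collapses them back to one row per record.

```python
import pandas as pd

df = pd.DataFrame(
    {"col": [[1, 2, 3], [2, 3, 4, 5], [2]], "name": ["abc", "vdf", "rt"]}
)

# One row per list element; the original index label is repeated.
exploded = df["col"].explode()

# One-hot encode each element, then take the max per original row so that
# a skill present anywhere in the list becomes 1 for that row.
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

out = pd.concat([df, dummies], axis=1)
print(out)
```

Both routes should give the same 0/1 columns for this data. The sklearn version has the advantage that a fitted binarizer can be reused to encode new rows with the same column set.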