Problem description
I have the code below to loop over a column of sentences, tag keywords in them, and create new columns of 1s and 0s from those tags. If a keyword is present, it is tagged automatically and given a 1 in a new column named after the tag. If it is absent but another keyword is present, it gets a 0. If a sentence contains no keywords at all, the whole row is dropped.
The code below sort of works, but it still misses keywords, and it tags and outputs 1s and 0s for partial words and for blank cells (rows with no sentence). I'm not sure what's missing. How can I make sure no keywords are missed, and that partial words and blank sentences are not tagged?
pattern = '|'.join(dict_list)
tags_id = (df['description_summary']
.str.extractall(f'({pattern})')[0]
.map(keyword_dict)
.reset_index(name='col')
.assign(value=1)
.pivot_table(index=[df['issue.id'],df['description_summary']],columns='col',values='value',fill_value=0))
Here is essentially the data I am working with in the Excel file:
issue.id description_summary
0 753 Long sentence with keywords ball and hot
1 937 Long sentence with keywords cold,stick,and glove
2
3 598 Long sentence with NO keywords
4 574 Long sentence with keywords very cold and cold
This is what the code currently outputs:
issue.id description_summary Toy Temperature
0 753 Long sentence with keywords ball and hot 1 1
1 937 Long sentence with keywords cold,and glove 1 1
2 1 0
3 598 Long sentence with NO keywords but outputs 1s and 0s 0 1
4 574 Long sentence with keywords very cold and cold 1 1
This is the output I want:
issue.id description_summary Toy Temperature
0 753 Long sentence with keywords ball and hot 1 1
1 937 Long sentence with keywords cold,and glove 1 1
4 574 Long sentence with keywords very cold and cold 0 1
Here is the dictionary of keywords and tags ('keyword': 'tag'):
dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'glove': 'Toy', 'Stick': 'Toy'}
How can I make sure keywords are not missed, and that partial words and blank sentences are not tagged?
Solution
I think your first problem is the map call. If I roughly reconstruct what you are doing up to that point:
>>> import re
>>> pattern = '|'.join(dict_list.keys())
>>> matches = df['description_summary'].str.extractall(f"({pattern})", flags=re.IGNORECASE)[0]
>>> matches
match
0 0 ball
1 hot
1 0 cold
1 stick
2 glove
4 0 very cold
1 cold
Name: 0, dtype: object
>>> matches.map(dict_list)
match
0 0 NaN
1 NaN
1 0 NaN
1 NaN
2 NaN
4 0 NaN
1 NaN
Name: 0, dtype: object
However, if we also force case-insensitivity in the lookup, we get better results:
>>> matches.str.lower().map({kw.lower():tag for kw,tag in dict_list.items()})
match
0 0 Toy
1 Temperature
1 0 Temperature
1 Toy
2 Toy
4 0 Temperature
1 Temperature
Name: 0, dtype: object
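The NaN result above can be reproduced in isolation: Series.map does an exact key lookup, so any case mismatch between the matched text and the dictionary keys yields NaN. A minimal sketch, using a trimmed-down dictionary for illustration:

```python
import pandas as pd

# Trimmed-down keyword dictionary for illustration
dict_list = {'Hot': 'Temperature', 'Ball': 'Toy'}

# extractall with re.IGNORECASE returns the text as it appears in the
# sentence (lowercase here), so it never equals the title-case keys.
matches = pd.Series(['ball', 'hot'])

matches.map(dict_list)  # both rows become NaN: 'ball' != 'Ball'

# Normalizing both sides to lowercase restores the lookup
lowered = {kw.lower(): tag for kw, tag in dict_list.items()}
matches.str.lower().map(lowered)  # maps to 'Toy' and 'Temperature'
```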
The second problem seems to be that pivot_table assigns the wrong rows to the matches, because df and matches have different shapes. We can instead pivot on the first index level and then use it to join back with df:
>>> tags = matches.str.lower().map({kw.lower():tag for kw,tag in dict_list.items()})
>>> tags = tags.rename_axis(['line','match']).reset_index(name='tag').assign(value=1)
>>> tags.pivot_table(index='line',columns='tag',values='value',fill_value=0).join(df[['issue.id','description_summary']])
Temperature Toy issue.id description_summary
line
0 1 1 753.0 Long sentence with keywords ball and hot
1 1 1 937.0 Long sentence with keywords cold,stick,and g...
4 1 0 574.0 Long sentence with keywords very cold and cold
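Put together, the whole first approach runs as one sketch. This rebuilds the sample frame from the question; the blank row and the no-keyword row drop out on their own because they produce no matches:

```python
import re
import pandas as pd

dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'glove': 'Toy', 'Stick': 'Toy'}

df = pd.DataFrame({
    'issue.id': [753, 937, None, 598, 574],
    'description_summary': ['Long sentence with keywords ball and hot',
                            'Long sentence with keywords cold, stick, and glove',
                            None,
                            'Long sentence with NO keywords',
                            'Long sentence with keywords very cold and cold'],
})

pattern = '|'.join(dict_list.keys())
matches = df['description_summary'].str.extractall(f'({pattern})', flags=re.IGNORECASE)[0]

# Lowercase both the matched text and the dictionary keys before mapping
tags = matches.str.lower().map({kw.lower(): tag for kw, tag in dict_list.items()})
tags = tags.rename_axis(['line', 'match']).reset_index(name='tag').assign(value=1)

# Pivot on the first index level, then join back onto the original frame
out = (tags.pivot_table(index='line', columns='tag', values='value', fill_value=0)
           .join(df[['issue.id', 'description_summary']]))
```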
Input data:
>>> df
issue.id description_summary
0 753 Long sentence with keywords ball and hot
1 937 Long sentence with keywords cold,and g...
2 <NA> <NA>
3 598 Long sentence with NO keywords
4 574 Long sentence with keywords very cold and cold
>>> mapping
{'Hot': 'Temperature','Cold': 'Temperature','Very cold': 'Temperature','Ball': 'Toy','Glove': 'Toy','Stick': 'Toy'}
>>> words # words = fr"({'|'.join(mapping.keys())})".lower()
'(hot|cold|very cold|ball|glove|stick)'
I will write up some explanation later (but you can test it line by line):
out = df['description_summary'].str.lower().str.findall(words) \
.explode().str.capitalize() \
    .replace(mapping) \
.pipe(lambda x: x.loc[x.notna()]) \
.str.get_dummies() \
.groupby(level=0) \
.any().astype(int)
Output:
>>> df.merge(out,left_index=True,right_index=True)
issue.id description_summary Temperature Toy
0 753 Long sentence with keywords ball and hot 1 1
1 937 Long sentence with keywords cold,and g... 1 1
4 574 Long sentence with keywords very cold and cold 1 0
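The remaining concern from the question is partial-word matches (e.g. 'stick' inside 'sticker'). A sketch of one way to guard against that, assuming word-boundary semantics are what you want: wrap the alternation in \b anchors, and sort the keywords longest-first so 'very cold' is tried before 'cold':

```python
import re
import pandas as pd

dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'glove': 'Toy', 'Stick': 'Toy'}

# Longest-first so multi-word keywords win over their substrings;
# \b prevents matches inside larger words like 'sticker'.
keys = sorted(dict_list, key=len, reverse=True)
words = (r'\b(' + '|'.join(map(re.escape, keys)) + r')\b').lower()

df = pd.DataFrame({'description_summary': ['The sticker and the ball',
                                           'very cold and cold']})

out = (df['description_summary'].str.lower().str.findall(words)
       .explode()
       .map({kw.lower(): tag for kw, tag in dict_list.items()})
       .pipe(lambda s: s[s.notna()])
       .str.get_dummies()
       .groupby(level=0)
       .any().astype(int))
# 'sticker' is not tagged as Toy, but 'ball' is;
# 'very cold' matches as a whole rather than as 'cold'
```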