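Problem description
I'm trying to extract tags from sentences with the code below, but it returns the keywords instead. What am I missing? How can I output a new column containing all the tags (rather than the keywords), separated by commas?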
s = set(dict_list)
f = lambda x: ','.join(set([y for y in x.split() if y in s]))
# df['tags'] = df['description_summary'].apply(f)
df['tags'] = df['description_summary'].apply(lambda x: ','.join(set(x.split()).intersection(s)))
df
Here is basically the data I'm working with, from an Excel file:
description_summary
0 Long sentence with keywords ball and hot
1 Long sentence with keywords stick,glove,and cold
This is the output I currently get:
description_summary keywords instead of tags
0 Long sentence with keywords ball and hot ball,hot
1 Long sentence with keywords cold,stick,and glove cold,glove
This is the output I want:
description_summary tags
0 Long sentence with keywords ball and hot toy,temperature
1 Long sentence with keywords cold,and glove temperature,toy
Here is the dictionary of keywords and tags ('keywords': 'tags'):
dict_list = {'Hot': 'Temperature','Cold': 'Temperature','Very cold': 'Temperature','Ball': 'Toy','glove': 'Toy','Stick': 'Toy'
}
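Solution
You can use ordinary dictionary indexing to return the associated value (the tag) rather than the key itself (the keyword).
Note that I have edited the dictionary from your question to make it easier to verify that this works, and that you also need to account for case sensitivity.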
import pandas as pd
df = pd.DataFrame({'description_summary':['Long sentence with keywords ball and hot','Long sentence with keywords cold,stick,and glove']})
dict_list = {'Hot': 'Temperature (hot)','Cold': 'Temperature (cold)','Very cold': 'Temperature (very cold)','Ball': 'Toy (ball)','Glove': 'Toy (glove)','Stick': 'Toy (stick)'}
# Lower-case keys and values so that matching is case-insensitive
d_lower = {key.lower():value.lower() for key,value in dict_list.items()}
# Look up the tag (value) for every keyword (key) found in the lower-cased text;
# sorting the set gives a stable, repeatable order
df['tags'] = df['description_summary'].apply(lambda x: ','.join(
    sorted({d_lower[y] for y in d_lower if y in x.lower()})
))
This yields the following for df['tags']:
0 temperature (hot),toy (ball)
1 temperature (cold),toy (glove),toy (stick)
Name: tags, dtype: object
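One caveat with the substring test above (`y in x`) is that a keyword such as 'hot' would also match inside a longer word like 'photo'. As a possible variation, not part of the original answer, here is a minimal sketch that matches whole words with a regular expression and uses the plain tags from the question. The names `keyword_to_tag`, `pattern` and `extract_tags` are made up for this illustration.

```python
import re
import pandas as pd

# Keyword -> tag mapping from the question, keys and values lower-cased
# (hypothetical name, chosen for this sketch).
keyword_to_tag = {
    'hot': 'temperature', 'cold': 'temperature', 'very cold': 'temperature',
    'ball': 'toy', 'glove': 'toy', 'stick': 'toy',
}

# Try longer keywords first so that 'very cold' wins over 'cold'.
# \b anchors each match at word boundaries, so 'hot' does not match in 'photo'.
alternatives = sorted(keyword_to_tag, key=len, reverse=True)
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k) for k in alternatives) + r')\b',
    flags=re.IGNORECASE,
)

def extract_tags(text):
    """Return the comma-separated tags for the keywords found in text."""
    tags = [keyword_to_tag[match.lower()] for match in pattern.findall(text)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return ','.join(dict.fromkeys(tags))

df = pd.DataFrame({'description_summary': [
    'Long sentence with keywords ball and hot',
    'Long sentence with keywords cold,stick,and glove',
]})
df['tags'] = df['description_summary'].apply(extract_tags)
```

With the two sample sentences this should give 'toy,temperature' and 'temperature,toy', which is the shape of the output asked for in the question. Tags appear in the order their keywords first occur in the sentence, rather than in sorted order.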