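How to tag keywords in Python and add the tags to a new column

Problem description

I am trying to extract tags from sentences with the code below, but it returns the keywords instead. What am I missing? How can I output a new column containing all the tags (not the keywords), separated by commas?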

s = set(dict_list)
f = lambda x: ','.join(set([y for y in x.split() if y in s]))
# df['tags'] = df['description_summary'].apply(f)

df['tags'] = df['description_summary'].apply(lambda x: ','.join(set(x.split()).intersection(s)))
df

This is basically the data I'm working with, which comes from an Excel file:

    description_summary

0   Long sentence with keywords ball and hot
1   Long sentence with keywords stick,glove,and cold

This is the current (wrong) output:

     description_summary                                     keywords instead of tags

0    Long sentence with keywords ball and hot                ball,hot
1    Long sentence with keywords cold,stick,and glove      cold,glove

This is the output I want:

     description_summary                                     tags

0    Long sentence with keywords ball and hot                toy,temperature
1    Long sentence with keywords cold,and glove      temperature,toy 

Here is the dictionary of keywords and tags ('keywords': 'tags'):

dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature', 'Ball': 'Toy', 'glove': 'Toy', 'Stick': 'Toy'}

How can I output only the tags (comma-separated) in a new column of the same file?

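Solution

You can use ordinary dictionary indexing to return the value associated with each key, rather than the key itself.

Note that I have edited the dictionary from your question so that it is easier to verify that this works. You also need to take case sensitivity into account: the dictionary keys are capitalized ('Hot', 'Ball') while the text is lowercase.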
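To illustrate the difference on plain Python, here is a minimal sketch. The names `keyword_to_tag` and `words` are made up for this example and are not part of the question's code. Filtering by membership gives back the keys, whereas indexing the dictionary gives the tags.

```python
# Hypothetical two-entry mapping, for illustration only
keyword_to_tag = {'hot': 'temperature', 'ball': 'toy'}

# Lowercase the text first so 'Ball' can match the lowercase key 'ball'
words = 'Long sentence with keywords Ball and hot'.lower().split()

# Membership test only: this returns the keywords themselves
keywords = [w for w in words if w in keyword_to_tag]

# Dictionary indexing: this returns the tag stored under each keyword
tags = [keyword_to_tag[w] for w in words if w in keyword_to_tag]

print(keywords)  # ['ball', 'hot']
print(tags)      # ['toy', 'temperature']
```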

import pandas as pd

df = pd.DataFrame({'description_summary': ['Long sentence with keywords ball and hot', 'Long sentence with keywords cold,stick,and glove']})

dict_list = {'Hot': 'Temperature (hot)', 'Cold': 'Temperature (cold)', 'Very cold': 'Temperature (very cold)', 'Ball': 'Toy (ball)', 'Glove': 'Toy (glove)', 'Stick': 'Toy (stick)'}

# Lowercase keys and values so that matching ignores case
d_lower = {key.lower(): value.lower() for key, value in dict_list.items()}

# Look up the tag of every keyword found in the text (substring match).
# dict.fromkeys drops duplicates while keeping a stable order, unlike set.
df['tags'] = df['description_summary'].apply(lambda x: ','.join(
      dict.fromkeys(d_lower[y] for y in d_lower if y in x.lower())
    ))

This yields the following 'tags' column:

0                   temperature (hot),toy (ball)
1    temperature (cold),toy (glove),toy (stick)
Name: tags, dtype: object
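One caveat: the `y in x` check above is a substring test, so a key such as 'hot' would also match inside a word like 'photo'. The following is a sketch of one possible variant, not part of the original answer, that matches whole words with a regular expression and uses the original dictionary from the question. The helper name `tags_for` and the variable `pattern` are assumptions introduced for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({'description_summary': [
    'Long sentence with keywords ball and hot',
    'Long sentence with keywords cold,stick,and glove',
]})

# Original keyword-to-tag mapping from the question, lowercased
dict_list = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
             'Ball': 'Toy', 'glove': 'Toy', 'Stick': 'Toy'}
d_lower = {k.lower(): v.lower() for k, v in dict_list.items()}

# Try longer keys first so that 'very cold' is preferred over 'cold';
# \b restricts matches to whole words
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in sorted(d_lower, key=len, reverse=True)) + r')\b'
)

def tags_for(text):
    # Keywords in order of appearance, mapped to tags, duplicates removed
    found = pattern.findall(text.lower())
    return ','.join(dict.fromkeys(d_lower[k] for k in found))

df['tags'] = df['description_summary'].apply(tags_for)
print(df['tags'].tolist())  # ['toy,temperature', 'temperature,toy']
```

Because several keywords share a tag here, `dict.fromkeys` collapses repeats, so the second row gives 'temperature,toy' rather than listing 'toy' twice.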