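Python: Tag keywords and create new tag columns of 1s and 0s

Question

I have the code below to iterate over a column of sentences, tag the keywords found in each sentence, and create new columns of 1s and 0s for those tags. If a keyword is present, the sentence is tagged automatically and gets a 1 in the new column named after that tag. If that keyword is absent but another keyword is present, it gets a 0. If the sentence contains no keywords at all, the whole row is dropped.

The code below mostly works, but it still misses keywords, and it outputs 1s and 0s for partial words and for blank cells (rows with no sentence). I'm not sure what is missing. How can I make sure no keywords are missed, and that partial words and blank sentences are not tagged?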

pattern = '|'.join(dict_list)
tags_id = (df['description_summary']
   .str.extractall(f'({pattern})')[0]
   .map(keyword_dict)
   .reset_index(name='col')
   .assign(value=1)
   .pivot_table(index=[df['issue.id'],df['description_summary']],columns='col',values='value',fill_value=0))

Here is basically the data I'm working with, from an Excel file:

    issue.id  description_summary

0   753       Long sentence with keywords ball and hot
1   937       Long sentence with keywords cold, stick, and glove
2   
3   598       Long sentence with NO keywords
4   574       Long sentence with keywords very cold and cold 

This is the current (wrong) output:

    issue.id  description_summary                                     Toy     Temperature 

0    753       Long sentence with keywords ball and hot                1       1
1    937       Long sentence with keywords cold, stick, and glove      1       1
2                                                                      1       0
3    598       Long sentence with NO keywords but outputs 1s and 0s    0       1
4    574       Long sentence with keywords very cold and cold          1       1

This is the output I want:

    issue.id  description_summary                                     Toy     Temperature    

0    753       Long sentence with keywords ball and hot                1       1
1    937       Long sentence with keywords cold, stick, and glove      1       1
4    574       Long sentence with keywords very cold and cold          0       1

Here is the dictionary of keywords and tags ('keywords': 'tags'):

dict_list = {'Hot': 'Temperature','Cold': 'Temperature','Very cold': 'Temperature','Ball': 'Toy','glove': 'Toy','Stick': 'Toy'
 }

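How can I make sure that no keywords are missed, and that partial words and blank sentences are not tagged?

Solution

I think your first problem is the map. If I roughly reconstruct what you are doing up to that point:

>>> import re
>>> pattern = '|'.join(dict_list.keys())
>>> matches = df['description_summary'].str.extractall(f"({pattern})", flags=re.IGNORECASE)[0]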
>>> matches
   match
0  0             ball
   1              hot
1  0             cold
   1            stick
   2            glove
4  0        very cold
   1             cold
Name: 0,dtype: object
>>> matches.map(dict_list)
   match
0  0        NaN
   1        NaN
1  0        NaN
   1        NaN
   2        NaN
4  0        NaN
   1        NaN
Name: 0,dtype: object

However, if we force the lookup to be case-insensitive, we get better results:

>>> matches.str.lower().map({kw.lower():tag for kw,tag in dict_list.items()})
   match
0  0                Toy
   1        Temperature
1  0        Temperature
   1                Toy
   2                Toy
4  0        Temperature
   1        Temperature
Name: 0,dtype: object
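
To make the NaN result easier to reproduce in isolation, here is a minimal sketch. The two-entry `mini_map` and the hand-written `matches` Series are hypothetical stand-ins for the real dictionary and the real `extractall` output; the point is only that `Series.map` with a dict is an exact, case-sensitive lookup.

```python
import pandas as pd

# Hypothetical mini-mapping: keys are capitalised, as in the question's dictionary.
mini_map = {'Hot': 'Temperature', 'Ball': 'Toy'}

# Hypothetical matches, lowercase as they appear in the sentences.
matches = pd.Series(['ball', 'hot'])

# Exact-key lookup: 'ball' != 'Ball', so every value becomes NaN.
exact = matches.map(mini_map)

# Normalise both the matches and the keys to lowercase before the lookup.
lower_map = {kw.lower(): tag for kw, tag in mini_map.items()}
normalised = matches.str.lower().map(lower_map)

print(exact.isna().all())
print(normalised.tolist())
```

Lowercasing both sides is one possible normalisation; any consistent casing would do, as long as the dictionary keys and the matched text agree.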

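The second problem seems to be that pivot_table assigns matches to the wrong rows, because df and matches have different shapes. We can instead pivot on the first index level and then use it to join with df: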

>>> tags = matches.str.lower().map({kw.lower():tag for kw,tag in dict_list.items()})
>>> tags = tags.rename_axis(['line','match']).reset_index(name='tag').assign(value=1)
>>> tags.pivot_table(index='line',columns='tag',values='value',fill_value=0).join(df[['issue.id','description_summary']])
      Temperature  Toy  issue.id                                description_summary
line                                                                               
0               1    1     753.0           Long sentence with keywords ball and hot
1               1    1     937.0  Long sentence with keywords cold, stick, and g...
4               1    0     574.0     Long sentence with keywords very cold and cold
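
The question also asks about partial words, which the snippets above do not address directly. One possible approach, sketched here under assumptions, is to wrap the alternation in `\b` word boundaries and escape each keyword. The sample frame is rebuilt by hand from the question (row 3 is altered to contain "hotel" and "photo" as illustrative partial-word traps), `keyword_tags` uses consistently capitalised keys, and `pd.crosstab` is swapped in for `pivot_table` purely for brevity.

```python
import re
import pandas as pd

# Sample data modelled on the question; row 3 is a made-up partial-word trap.
df = pd.DataFrame({
    'issue.id': [753, 937, None, 598, 574],
    'description_summary': [
        'Long sentence with keywords ball and hot',
        'Long sentence with keywords cold, stick, and glove',
        None,
        'Long sentence with NO keywords, only a hotel photo',
        'Long sentence with keywords very cold and cold',
    ],
})

keyword_tags = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
                'Ball': 'Toy', 'Glove': 'Toy', 'Stick': 'Toy'}
lower_tags = {kw.lower(): tag for kw, tag in keyword_tags.items()}

# Longest keywords first so "very cold" is tried before "cold";
# \b stops "hot" from matching inside "hotel" or "photo".
alternatives = sorted(lower_tags, key=len, reverse=True)
pattern = r'\b(' + '|'.join(re.escape(kw) for kw in alternatives) + r')\b'

# Blank (missing) sentences simply produce no matches.
matches = df['description_summary'].str.extractall(pattern, flags=re.IGNORECASE)[0]
tags = (matches.str.lower().map(lower_tags)
        .rename_axis(['line', 'match'])
        .reset_index(name='tag'))

# Count tags per original row, cap the counts at 1, and keep only rows with a match.
flags = pd.crosstab(tags['line'], tags['tag']).clip(upper=1)
result = df.join(flags, how='inner')
print(result)
```

The inner join is what drops the blank row and the row without keywords, since neither appears in `flags`.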
Alternative solution

Input data:

>>> df
  issue.id                                description_summary
0      753           Long sentence with keywords ball and hot
1      937  Long sentence with keywords cold, stick, and g...
2     <NA>                                               <NA>
3      598                     Long sentence with NO keywords
4      574     Long sentence with keywords very cold and cold

>>> mapping
{'Hot': 'Temperature','Cold': 'Temperature','Very cold': 'Temperature','Ball': 'Toy','Glove': 'Toy','Stick': 'Toy'}

>>> words  # words = fr"({'|'.join(mapping.keys())})".lower()
'(hot|cold|very cold|ball|glove|stick)'

I'll write some explanation later (but you can test it line by line):

out = df['description_summary'].str.lower().str.findall(words) \
                               .explode().str.capitalize() \
                               .replace(mapping) \
                               .pipe(lambda x: x.loc[x.notna()]) \
                               .str.get_dummies() \
                               .groupby(level=0) \
                               .any().astype(int)
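
For readers who want to follow the chain one step at a time, here is a sketch that splits it into named intermediate results. The four-row frame is a hand-built subset of the input data, the variable names `found`, `exploded` and `tagged` are introduced only for illustration, and `dropna()` / `max()` are used as assumed equivalents of the `pipe` filter and `any().astype(int)` steps.

```python
import pandas as pd

# Hand-built subset of the input: one match row, one blank row,
# one row without keywords, one row with overlapping keywords.
df = pd.DataFrame({'description_summary': [
    'Long sentence with keywords ball and hot',
    None,
    'Long sentence with NO keywords',
    'Long sentence with keywords very cold and cold',
]})
mapping = {'Hot': 'Temperature', 'Cold': 'Temperature', 'Very cold': 'Temperature',
           'Ball': 'Toy', 'Glove': 'Toy', 'Stick': 'Toy'}
words = fr"({'|'.join(mapping.keys())})".lower()

# 1. One list of lowercase keyword hits per row (NaN for the blank row).
found = df['description_summary'].str.lower().str.findall(words)

# 2. One row per hit; empty lists and blank rows both become NaN.
exploded = found.explode()

# 3. Restore the capitalisation used by the mapping keys, then swap keyword for tag.
tagged = exploded.str.capitalize().replace(mapping)

# 4. Drop the rows that had no hit at all.
tagged = tagged.dropna()

# 5. One indicator column per tag, collapsed back to one row per sentence.
out = tagged.str.get_dummies().groupby(level=0).max()
print(out)
```

Note that `str.capitalize()` only recovers the dictionary keys because every key is written with a single leading capital; a key such as 'Very Cold' would need a different normalisation.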

Output:

>>> df.merge(out,left_index=True,right_index=True)
  issue.id                                description_summary  Temperature  Toy
0      753           Long sentence with keywords ball and hot            1    1
1      937  Long sentence with keywords cold, stick, and g...            1    1
4      574     Long sentence with keywords very cold and cold            1    0