How can I label text as positive or negative in R based on words in a dictionary?

Question

Suppose I have a data frame of comments (each row is a different comment):

comment
'well done!'
'terrible work'
'quit your job'
'hi'

I also have the following data frame of positive and negative words (i.e., a dictionary):

positive negative
well     terrible
done     quit

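Is there a way in R to use this dictionary to label each comment in the data frame as positive, negative, or neutral, depending on whether it contains more positive or more negative words?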

That is, I would like the output to be a data frame like this:

comment          label
'well done!'     positive
'terrible work'  negative
'quit your job'  negative
'hi'             neutral

Does anyone know how to do this in R?

Answer

Does this work:

library(dplyr)
library(stringr)
comm %>%
  mutate(label = case_when(
    str_detect(comments, str_c(dict$positive, collapse = '|')) ~ 'positive',
    str_detect(comments, str_c(dict$negative, collapse = '|')) ~ 'negative',
    TRUE ~ 'neutral'))
       comments    label
1    well done! positive
2 terrible work negative
3 quit your job negative
4            hi  neutral
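
One caveat with the pattern above: str_detect matches substrings, so well would also match inside swell. A minimal sketch of one way around that, assuming whole-word matching is what is wanted. The helper word_pattern is hypothetical (not part of stringr), and the extra "swell" row is added only for illustration:

```r
library(dplyr)
library(stringr)

# Example data mirroring the question; "swell" is an added illustrative row
comm <- data.frame(
  comments = c("well done!", "terrible work", "quit your job", "hi", "swell"),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  positive = c("well", "done"),
  negative = c("terrible", "quit"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wrap the alternation in word boundaries (\b)
word_pattern <- function(words) {
  str_c("\\b(", str_c(words, collapse = "|"), ")\\b")
}

res <- comm %>%
  mutate(label = case_when(
    str_detect(comments, word_pattern(dict$positive)) ~ "positive",
    str_detect(comments, word_pattern(dict$negative)) ~ "negative",
    TRUE ~ "neutral"
  ))

print(res)
```

With the boundaries in place, "swell" no longer counts as a hit for well and falls through to neutral. Note that case_when still takes the first branch that matches, so a comment containing both positive and negative words is labelled positive here.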

Based on the OP's requirement (compare the number of positive and negative words in each comment):

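comm %>%
  # count how many dictionary words of each polarity occur in each comment
  mutate(p_count = str_count(comments, str_c(dict$positive, collapse = '|')),
         n_count = str_count(comments, str_c(dict$negative, collapse = '|'))) %>%
  # the larger count wins; ties (including 0 vs 0) are neutral
  mutate(label = case_when(p_count > n_count ~ 'positive',
                           p_count < n_count ~ 'negative',
                           TRUE ~ 'neutral')) %>%
  select(comments, label)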
                 comments    label
1              well done! positive
2      terrible well work  neutral
3 quit your job well well positive
4                      hi  neutral
5      terrible quit well negative

New data used:

comm
                 comments
1              well done!
2      terrible well work
3 quit your job well well
4                      hi
5      terrible quit well

dict
# A tibble: 2 x 2
  positive negative
  <chr>    <chr>   
1 well     terrible
2 done     quit
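
The printouts above do not show how the two objects were created. A sketch of one way to build them, assuming comm is a plain data frame and dict a tibble (as the printed headers suggest), followed by the per-comment counts so the tie-breaking is visible:

```r
library(dplyr)
library(stringr)
library(tibble)

# Rebuild the example data shown in the printouts (construction is an assumption)
comm <- data.frame(
  comments = c("well done!", "terrible well work", "quit your job well well",
               "hi", "terrible quit well"),
  stringsAsFactors = FALSE
)
dict <- tibble(
  positive = c("well", "done"),
  negative = c("terrible", "quit")
)

# One regex alternation per polarity, e.g. "well|done"
pos_pat <- str_c(dict$positive, collapse = "|")
neg_pat <- str_c(dict$negative, collapse = "|")

counts <- comm %>%
  mutate(p_count = str_count(comments, pos_pat),
         n_count = str_count(comments, neg_pat),
         label = case_when(p_count > n_count ~ "positive",
                           p_count < n_count ~ "negative",
                           TRUE ~ "neutral"))

print(counts)
```

Keeping p_count and n_count in the result shows why "terrible well work" comes out neutral (one hit on each side) while "quit your job well well" is positive (two positive hits against one negative).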