Problem description
Suppose I have a data frame of comments (each row is a different comment):
comment
'well done!'
'terrible work'
'quit your job'
'hi'
I also have the following data frame of positive and negative words (i.e. a dictionary):
positive negative
well terrible
done quit
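Is there a way in R to use this dictionary to label each comment in the first data frame as positive, negative, or neutral, depending on whether it contains more positive or more negative words?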
comment label
'well done!' positive
'terrible work' negative
'quit your job' negative
'hi' neutral
Does anyone know how to do this in R?
Solution
Does this work:
library(dplyr)
library(stringr)
comm %>%
  mutate(label = case_when(
    str_detect(comments, str_c(dict$positive, collapse = '|')) ~ 'positive',
    str_detect(comments, str_c(dict$negative, collapse = '|')) ~ 'negative',
    TRUE ~ 'neutral'))
comments label
1 well done! positive
2 terrible work negative
3 quit your job negative
4 hi neutral
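One caveat with the pattern above: str_detect matches substrings, so well would also match inside Farewell, and because case_when returns the first branch that matches, a comment containing both kinds of word is labelled positive. A minimal sketch of a stricter variant follows, assuming the same comm and dict layout as in the question; the helper name word_pattern and the extra "Farewell" row are hypothetical, added only for illustration.

```r
library(dplyr)
library(stringr)

# Example data mirroring the question; the "Farewell" row is an added
# illustration and is not part of the original data.
comm <- tibble(comments = c("well done!", "terrible work",
                            "quit your job", "hi", "Farewell"))
dict <- tibble(positive = c("well", "done"),
               negative = c("terrible", "quit"))

# Hypothetical helper: whole-word, case-insensitive alternation
word_pattern <- function(words) {
  regex(str_c("\\b(", str_c(words, collapse = "|"), ")\\b"),
        ignore_case = TRUE)
}

labelled <- comm %>%
  mutate(label = case_when(
    str_detect(comments, word_pattern(dict$positive)) ~ "positive",
    str_detect(comments, word_pattern(dict$negative)) ~ "negative",
    TRUE ~ "neutral"
  ))
```

With the word boundaries in place, "Farewell" is no longer picked up by the positive word well. The first-match behaviour of case_when is unchanged, so this still does not compare how many positive and negative words a comment contains.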
Based on the OP's request, counting the matches instead of only detecting them:
comm %>%
  mutate(p_count = str_count(comments, str_c(dict$positive, collapse = '|')),
         n_count = str_count(comments, str_c(dict$negative, collapse = '|'))) %>%
  mutate(label = case_when(p_count > n_count ~ 'positive',
                           p_count < n_count ~ 'negative',
                           TRUE ~ 'neutral')) %>%
  select(comments, label)
comments label
1 well done! positive
2 terrible well work neutral
3 quit your job well well positive
4 hi neutral
5 terrible quit well negative
New data used:
comm
comments
1 well done!
2 terrible well work
3 quit your job well well
4 hi
5 terrible quit well
dict
# A tibble: 2 x 2
positive negative
<chr> <chr>
1 well terrible
2 done quit
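To reproduce the counting result, the two objects printed above can be built by hand. This is a sketch: the constructor calls are an assumption about how the data was created, since the answer only shows the printed output, and the names pos, neg, and result are introduced here for illustration.

```r
library(dplyr)
library(stringr)

# Rebuild the printed objects (construction method is an assumption)
comm <- data.frame(comments = c("well done!", "terrible well work",
                                "quit your job well well", "hi",
                                "terrible quit well"))
dict <- tibble(positive = c("well", "done"),
               negative = c("terrible", "quit"))

# Collapse each dictionary column into one regex alternation
pos <- str_c(dict$positive, collapse = "|")
neg <- str_c(dict$negative, collapse = "|")

result <- comm %>%
  mutate(p_count = str_count(comments, pos),   # positive matches per comment
         n_count = str_count(comments, neg),   # negative matches per comment
         label = case_when(p_count > n_count ~ "positive",
                           p_count < n_count ~ "negative",
                           TRUE ~ "neutral")) %>%
  select(comments, label)

print(result)
```

Building the two patterns once, outside mutate, keeps the pipeline shorter and avoids repeating the str_c call for every column.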