R: Removing stop words from text without tokenizing or converting the data to a list

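Question

I need to remove stop words from text without tokenizing the object or changing it to a list. I get an error when I use the rm_stopwords function (from the qdap package). Can anyone help?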

test <- data.frame(words = c("hello there,everyone", "the most amazing planet"), id = 1:2)
test$words <- rm_stopwords(test$words, tm::stopwords("english"), separate = F, unlist = T)
# Error in `$<-.data.frame`(`*tmp*`, words, value = c("hello", "everyone", :
#   replacement has 4 rows, data has 2

# I want something like this, where the stop words are removed but the rest of the formatting (e.g. punctuation) remains intact

#            words id
# 1 hello,everyone  1
# 2 amazing planet  2

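Answer

Try this approach, which produces output close to what you want. You can use tidytext to split the text into tokens, filter out the stop words, and then paste the remaining tokens back together by id into a data frame close to the expected one. Here is the code: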

library(tidytext)
library(tidyverse)
# Data
test <- data.frame(words = c("hello there,everyone", "the most amazing planet"), id = 1:2, stringsAsFactors = FALSE)
# Unnest into one token per row, keeping punctuation as separate tokens
l1 <- test %>% unnest_tokens(word, words, strip_punct = FALSE)
# Vector of stop words
vec <- tm::stopwords("english")
# Filter out the stop words
l1 <- l1[!(l1$word %in% vec), ]
# Re-aggregate by id
l2 <- l1 %>% group_by(id) %>% summarise(text = paste0(word, collapse = " "))

Output:

# A tibble: 2 x 2
     id text            
  <int> <chr>           
1     1 hello , everyone
2     2 amazing planet  
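
Because the tokens are pasted back together with a single space, a punctuation token ends up with a space on each side (for example `hello , everyone`). If that matters, a small base R clean-up step can be applied to the re-joined text. The following is only a sketch: `fix_spacing` is a hypothetical helper name, and the set of punctuation marks it handles is an assumption that may need adjusting for your data.

```r
# Hypothetical helper (not part of tidytext or tm): tidy the spacing
# left behind after tokens are pasted back together.
fix_spacing <- function(x) {
  x <- gsub("\\s+([,.;:!?])", "\\1", x)  # drop whitespace before punctuation
  x <- gsub("\\s{2,}", " ", x)           # collapse runs of whitespace
  trimws(x)                              # strip leading/trailing whitespace
}

cleaned <- fix_spacing(c("hello , everyone", "amazing planet"))
cleaned
```

With the tidytext result above, this could be applied as `l2$text <- fix_spacing(l2$text)`.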

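You can build a regular expression from all the stop words, each wrapped in word boundaries, and use gsub to replace the matches with an empty string.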

test$words <- gsub(paste0('\\b', tm::stopwords("english"), '\\b', collapse = '|'), '', test$words)
test
#             words id
#1  hello ,everyone  1
#2   amazing planet  2
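
The regex approach leaves behind the whitespace that surrounded each removed word, and it is case-sensitive. A possible refinement is sketched below under a few assumptions: `remove_stopwords` is a hypothetical helper name, the short `stop_words` vector stands in for `tm::stopwords("english")` so the example runs without extra packages, and the stop words are assumed to contain no regex metacharacters. Sorting the list so that longer entries come first is a precaution so that a contraction such as "i'm" is matched as a whole rather than as "i" followed by a leftover "'m".

```r
# Hypothetical short stop-word list; in practice this could be
# tm::stopwords("english").
stop_words <- c("i", "i'm", "the", "most", "there")

# Put longer entries first so "i'm" is tried before "i".
stop_words <- stop_words[order(nchar(stop_words), decreasing = TRUE)]
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Hypothetical helper: remove stop words, then tidy the whitespace.
remove_stopwords <- function(x, pattern) {
  x <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)  # drop stop words
  x <- gsub("\\s+", " ", x)                                   # collapse whitespace
  trimws(x)                                                   # trim the ends
}

test <- data.frame(
  words = c("hello there,everyone", "The most amazing planet", "I'm here"),
  id = 1:3,
  stringsAsFactors = FALSE
)
test$words <- remove_stopwords(test$words, pattern)
test
```

The column stays a plain character vector with one element per row, so the assignment back into the data frame does not hit the row-count error from the question.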