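Problem description
I need to remove stopwords from text without tokenizing the object or converting it to a list. I get an error when I use the rm_stopwords function. Can anyone help me?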
# rm_stopwords() comes from the qdap package
test <- data.frame(words = c("hello there,everyone", "the most amazing planet"), id = 1:2)
test$words <- rm_stopwords(test$words, tm::stopwords("english"), separate = FALSE, unlist = TRUE)
#Error in `$<-.data.frame`(`*tmp*`, words, value = c("hello", "everyone", :
#  replacement has 4 rows, data has 2
#I want something like this, where the stopwords are removed but the rest of the formatting (e.g. punctuation) remains intact
# words id
#1 hello,everyone 1
#2 amazing planet 2
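Solution
Try this approach, which produces output similar to what you want. You can use tidytext functions to split the text into tokens, filter out the stopwords, and then re-aggregate the remaining tokens into a data frame close to the expected one. Here is the code: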
library(tidytext)
library(tidyverse)
# Data
test <- data.frame(words = c("hello there,everyone", "the most amazing planet"), id = 1:2, stringsAsFactors = FALSE)
# Unnest into one token per row, keeping punctuation as separate tokens
l1 <- test %>% unnest_tokens(word, words, strip_punct = FALSE)
# Vector of stop words
vec <- tm::stopwords("english")
# Filter out the stop words
l1 <- l1[!(l1$word %in% vec), ]
# Re-aggregate by id
l2 <- l1 %>% group_by(id) %>% summarise(text = paste0(word, collapse = ' '))
Output:
# A tibble: 2 x 2
     id text
  <int> <chr>
1     1 hello , everyone
2     2 amazing planet
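The tokenize, filter, and re-aggregate idea can also be sketched in base R. This may be handy if you want to keep every row and column of the original data frame, since the group_by/summarise step returns only id and text and drops any row whose words are all stopwords. The function name filter_tokens and the tiny stop_vec list below are hypothetical, used only for illustration; in practice you would probably pass tm::stopwords("english"). Re-attaching punctuation to the preceding token is a heuristic, not something the tidytext code above does.

```r
# Hypothetical, deliberately tiny stopword list (assumption for this sketch);
# replace with tm::stopwords("english") in real use.
stop_vec <- c("there", "the", "most")

# Hypothetical helper: remove stopwords element by element, so the result
# has the same length as the input vector.
filter_tokens <- function(x, stopwords) {
  # Tokens are runs of letters/digits/apostrophes, or single punctuation marks.
  tokens <- regmatches(x, gregexpr("[[:alnum:]']+|[[:punct:]]", x))
  vapply(tokens, function(tok) {
    # Compare in lower case, but keep the original casing of kept tokens.
    kept <- tok[!(tolower(tok) %in% stopwords)]
    joined <- paste(kept, collapse = " ")
    # Heuristic: drop the space that paste() put before punctuation.
    gsub(" ([[:punct:]])", "\\1", joined)
  }, character(1))
}

test <- data.frame(words = c("hello there,everyone", "the most amazing planet"),
                   id = 1:2, stringsAsFactors = FALSE)
test$words <- filter_tokens(test$words, stop_vec)
```

Because the function returns one string per input element, the assignment back into test$words does not hit the "replacement has 4 rows" error from the question.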
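Alternatively, you can build a single regular expression from all the stopwords, each wrapped in word boundaries, and use gsub to replace every match with an empty string.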
test$words <- gsub(paste0('\\b', tm::stopwords("english"), '\\b', collapse = '|'), '', test$words)
test
# words id
#1 hello,everyone 1
#2 amazing planet 2
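Replacing a stopword with an empty string can leave stray whitespace where the word used to be, for example at the start of a string or just before a comma. The following is a minimal sketch of one way to tidy that up, assuming it is acceptable to collapse repeated spaces and to remove any space in front of punctuation. The wrapper name remove_stopwords and the short stop_vec list are hypothetical; tm::stopwords("english") can be passed instead.

```r
# Hypothetical, deliberately tiny stopword list (assumption for this sketch);
# replace with tm::stopwords("english") in real use.
stop_vec <- c("there", "the", "most")

# Hypothetical wrapper around the gsub approach, plus whitespace cleanup.
remove_stopwords <- function(x, stopwords) {
  # One alternation of all stopwords inside a single pair of word boundaries.
  pattern <- paste0("\\b(?:", paste(stopwords, collapse = "|"), ")\\b")
  out <- gsub(pattern, "", x, perl = TRUE, ignore.case = TRUE)
  # Heuristic: remove whitespace left directly before punctuation.
  out <- gsub("\\s+([[:punct:]])", "\\1", out, perl = TRUE)
  # Squeeze runs of whitespace down to a single space.
  out <- gsub("\\s{2,}", " ", out, perl = TRUE)
  # Drop leading and trailing whitespace.
  trimws(out)
}

test <- data.frame(words = c("hello there,everyone", "the most amazing planet"),
                   id = 1:2, stringsAsFactors = FALSE)
test$words <- remove_stopwords(test$words, stop_vec)
```

The punctuation step is a heuristic and would also turn "hello (world)" into "hello(world)", so it may need narrowing (for example to commas and periods only) depending on the text.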