Is there a way to optimize an R function that cleans text using a custom dictionary?

Problem description

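I have a custom dictionary (loaded as a character vector) that I want to use to clean the content of a dataset (a VCorpus with more than 100,000 elements). For example, I want to use the dictionary to turn [1] into [1*]: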

    #[1] "never give up uouo cbbuk jeez"
    #[1*] "never give up"

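This is because the words "never", "give", and "up" are all in the custom dictionary. I can currently apply the dictionary to a data frame with the following code: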

    # Load tibble, which provides tibble()
    library(tibble)

    # Create a data frame
    DF <- tibble(Text = "never give up uouo cbbuk jeez")
    # Create the custom dictionary
    custom.dictionary <- c("never", "give", "up")
    # Predicate: TRUE for words found in the dictionary
    english.words <- function(x) x %in% custom.dictionary
    # Keep only the dictionary words in each text
    DF$Text1 <- sapply(strsplit(DF$Text, '\\s+'), function(x)
                       paste0(Filter(english.words, x), collapse = ' '))

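However, this operation takes a long time, and I am wondering whether performing the same operation on the VCorpus would make it faster (or whether there is any alternative for this purpose). I tried using the tm_map function, but it only returned a corpus containing a single character. Any suggestions?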

Solution
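
One likely contributor to the slowness is that `Filter` calls its predicate once per word, so `%in%` runs separately for every token. The sketch below is one possible alternative, not a confirmed fix: it assumes the texts are available as a plain character vector and does a single `%in%` lookup over all tokens at once. The names `texts` and `clean_with_dictionary` are illustrative and do not come from the question.

```r
# Hypothetical helper: keep only dictionary words in each text,
# using one vectorized %in% lookup for the whole input.
clean_with_dictionary <- function(texts, dictionary) {
  tokens <- strsplit(texts, "\\s+")
  # Remember which text each token came from
  doc_id <- rep.int(seq_along(tokens), lengths(tokens))
  flat   <- unlist(tokens, use.names = FALSE)
  keep   <- flat %in% dictionary

  # Texts with no dictionary words stay as empty strings
  out  <- character(length(texts))
  kept <- split(flat[keep], doc_id[keep])
  out[as.integer(names(kept))] <- vapply(kept, paste, character(1), collapse = " ")
  out
}

custom.dictionary <- c("never", "give", "up")
texts <- c("never give up uouo cbbuk jeez", "give it up", "zzz")
clean_with_dictionary(texts, custom.dictionary)
# [1] "never give up" "give up"       ""
```

With the data frame from the question, this could be applied as `DF$Text1 <- clean_with_dictionary(DF$Text, custom.dictionary)`. Whether it is fast enough for 100,000+ documents would need to be measured on the real data.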

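If the cleaning has to happen on the VCorpus itself, a plain function passed to `tm_map` receives a document object rather than a string, which may explain the unexpected result described in the question. A minimal sketch, assuming the `tm` package is installed, wraps the cleaning function in `content_transformer` so it operates on each document's text. The helper name `keep_dictionary_words` and the one-document corpus are made up for illustration.

```r
library(tm)

custom.dictionary <- c("never", "give", "up")

# Hypothetical helper: works on the character content of one document
keep_dictionary_words <- function(x, dictionary) {
  tokens <- strsplit(x, "\\s+")
  vapply(tokens,
         function(w) paste(w[w %in% dictionary], collapse = " "),
         character(1))
}

# Small example corpus standing in for the real VCorpus
corpus <- VCorpus(VectorSource("never give up uouo cbbuk jeez"))

# content_transformer() makes the function read and write document content;
# extra arguments to tm_map() are passed on to the function
cleaned <- tm_map(corpus, content_transformer(keep_dictionary_words),
                  dictionary = custom.dictionary)

content(cleaned[[1]])
```

This keeps the result a valid corpus, but it still processes one document at a time, so it is not expected to be faster than a vectorized pass over a character vector.
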

corpus data-cleaning dictionary r