Problem description
I have a dfm that contains words like these:
library("quanteda")  # package version: 2.1.2
dfmat <- dfm(c("hello_text","text_hello","test1_test2","test2_test1","test2_test2_test2","test2_other","other"))
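For example, the tokens "hello_text" and "text_hello" consist of the same parts, just in a different order. How can I keep only one of these variants?
Example output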
dfmat <- dfm(c("hello_text","other"))
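Solution
Split each string at the underscores and sort the parts alphabetically, then use the resulting list to identify duplicates and apply that to the original vector: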
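# Tokens from the question
words <- c("hello_text", "text_hello", "test1_test2", "test2_test1", "test2_test2_test2", "test2_other", "other")
# Split each token at "_" and sort its parts, so that their order no longer matters
words_sorted <- lapply(strsplit(words, "_"), sort)
# Keep only the first token for each sorted combination of parts
words[!duplicated(words_sorted)]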
Returns:
[1] "hello_text" "test1_test2" "test2_test2_test2" "test2_other"
[5] "other"
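
The same idea can be wrapped in a small reusable function. This is only a minimal sketch in base R: the helper name `canonical_key` and its `sep` argument are hypothetical and not part of quanteda. Instead of comparing list elements, it builds one order-insensitive string key per token, which makes the intermediate result easier to inspect.

```r
# Hypothetical helper (not a quanteda function): build an order-insensitive
# key for each token by splitting at `sep`, sorting the parts, and re-joining.
canonical_key <- function(x, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) paste(sort(p), collapse = sep), character(1))
}

words <- c("hello_text", "text_hello", "test1_test2", "test2_test1",
           "test2_test2_test2", "test2_other", "other")

# Tokens that share a key are permutations of the same parts.
keys <- canonical_key(words)

# Keep the first token seen for each key.
kept <- words[!duplicated(keys)]
print(kept)
```

Assuming the tokens are the feature names of the dfm, the same filter could presumably be applied to `featnames(dfmat)`, using the resulting logical vector to subset the columns of the dfm. That step is an assumption and is not shown in the original answer.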