Removing two stopword lists with the quanteda package in R

Problem description

I am using the quanteda package on a corpus built from a data frame, and this is the basic code I use:

library(quanteda)

fmsi_des <- dfm(corpus_des, remove = stopwords("spanish"), verbose = TRUE, remove_punct = TRUE, remove_numbers = TRUE)

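However, I have another stopword list, stored as a data frame called stpw, that I would also like to apply.

I tried:

fmsi_des <- dfm(corpus_des, remove = stopwords("spanish", "stpw"), remove_numbers = TRUE)

Error in stopwords("spanish", "stpw") : unused argument ("stpw")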

Then I created a list, all_stops, containing the "spanish" stopwords plus the stopwords from stpw:

fmsi_des <- dfm(corpus_des, remove = stopwords("all_stops"), verbose = TRUE, remove_punct = TRUE, remove_numbers = TRUE)

Error in stopwords("all_stops") : no stopwords available for 'all_stops'

I also created a txt file with my stopwords and tried reading it with readLines("stpw.txt"), which gave:

Warning message: In readLines("stpw.txt") : incomplete final line found on 'stpw.txt'

Solution

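In this case, knowing what value an R function returns is the key to getting the result you want. Specifically, you need to know what stopwords() returns and what its first argument is.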

stopwords(language = "sp") returns a character vector of Spanish stopwords, using the default source = "snowball" list. (See ?stopwords for details.)
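To make the return value concrete, here is a minimal sketch that inspects it. It assumes quanteda is installed and uses the ISO code "es" for Spanish; the variable name es_stops is only illustrative.

```r
library(quanteda)

# stopwords() returns a plain character vector, not a special object.
# "es" is the ISO 639-1 code for Spanish; "snowball" is the default source.
es_stops <- stopwords(language = "es", source = "snowball")

class(es_stops)   # the type of the returned object
length(es_stops)  # how many stopwords the list contains
head(es_stops)    # the first few entries
```

Because the result is an ordinary character vector, it can be combined with other vectors using base R functions such as c().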

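So if you want to remove the default Spanish list plus your own words, you can concatenate the returned character vector with additional elements. That is what you did when creating all_stops. The error came from then passing the name "all_stops" to stopwords(), which expects a language as its first argument, instead of passing the vector itself.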
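A rough sketch of that concatenation when the custom list lives in a data frame, as in the question. The contents of stpw and its column name word are assumptions made for illustration, since the question does not show them.

```r
library(quanteda)

# Hypothetical stand-in for the asker's `stpw` data frame;
# the column name `word` and its values are assumptions.
stpw <- data.frame(word = c("ejemplo", "palabra"), stringsAsFactors = FALSE)

# Concatenate the default Spanish list with the custom words.
# as.character() guards against the column being a factor;
# unique() drops any words that appear in both lists.
all_stops <- unique(c(stopwords("es"), as.character(stpw$word)))

# all_stops is itself a character vector, so it is passed directly
# as a pattern, never as stopwords("all_stops").
tail(all_stops)
```

If the words come from a text file instead, readLines() also returns a character vector that can be concatenated in the same way.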

So, to remove all_stops (using the workflow recommended as of quanteda v3), you only need the following:

fmsi_des <- corpus_des %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
    tokens_remove(pattern = all_stops) %>%
    dfm()
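For a quick check that both lists are removed, the same steps can be run on a toy sentence. This is only a sketch: the sentence and the extra stopword "ejemplo" are made up, and plain nested calls are used instead of the pipe so that no additional package is needed.

```r
library(quanteda)

# Toy input; the sentence is invented for illustration.
txt <- "el gato come un ejemplo de 3 palabras."

# Default Spanish stopwords plus one hypothetical custom word.
all_stops <- c(stopwords("es"), "ejemplo")

# Tokenize, dropping punctuation and numbers, then remove stopwords.
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks_clean <- tokens_remove(toks, pattern = all_stops)

# Build the document-feature matrix and list the surviving features.
toy_dfm <- dfm(toks_clean)
featnames(toy_dfm)
```

The remaining features should exclude both the built-in stopwords (such as "el" and "de") and the custom one.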
