Quanteda: grouping documents by multiple variables

Problem description

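I would like to group the documents in my dfm by two variables: speaker and week_start. I used to be able to do this with dfm(corpus, groups = c("speaker", "week_start")). That worked fine and grouped the documents by speaker-week.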

However, since the recent update to the quanteda package, I have run into problems. I now create the dfm first and then try to group it. Here is the code:

dfm <- dfm(corpus)
dfm <- dfm_group(dfm,groups = c(speaker,week_start))

However, when I do this, I get the error:

Error: groups must have length ndoc(x)

I also tried putting the docvar names in quotes, but that produces the same error.

Solution

We changed the usage of the groups argument in v3 to make it more standard.

From news(Version >= "3.0", package = "quanteda"):

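We have added non-standard evaluation for the by and groups arguments to access object docvars:

The by argument of the *_sample() functions, and the groups argument of the *_group() functions, now take unquoted document variable (docvar) names directly, similar to the way the subset argument works in the *_subset() functions.
Quoted docvar names no longer work, as these will be evaluated literally.
by = "document" formerly sampled from docid(x), but this functionality is now removed. Instead, use by = docid(x) to replicate it.
For groups, the default is now docid(x), which is now documented more completely. See ?groups and ?docid.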
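
One way to see why c(speaker, week_start) now triggers the length error: once the docvar names are evaluated as vectors, c() concatenates them into a single vector twice as long as the number of documents, whereas interaction() returns one combined value per document. The following is a minimal base R sketch with made-up values for speaker and week_start (both hypothetical, not from the question):

```r
# Hypothetical docvars for three documents (illustrative values only)
speaker    <- c("Smith", "Smith", "Jones")
week_start <- c("2021-01-04", "2021-01-11", "2021-01-04")

# c() concatenates the two vectors end to end: one value per
# document per variable, so the length is 2 * 3 rather than 3
len_c <- length(c(speaker, week_start))

# interaction() combines the variables element-wise into a factor
# with exactly one value per document
len_interaction <- length(interaction(speaker, week_start))

print(c(concatenated = len_c, interaction = len_interaction))
```

Only the second form has the length ndoc(x) that the error message asks for.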

So, to get the previous behavior, you need to use:

groups = interaction(speaker,week_start)

Here is an example:

library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(c(
  "a b c", "a c d", "c d d", "d d e"
), docvars = data.frame(
  var1 = c("a", "a", "b", "b"), var2 = c(1, 2, 1, 1)
)
)
corp %>%
  tokens() %>%
  dfm() %>%
  dfm_group(groups = interaction(var1, var2))
## Document-feature matrix of: 3 documents, 5 features (40.00% sparse) and 2 docvars.
##      features
## docs  a b c d e
##   a.1 1 1 1 0 0
##   b.1 0 0 1 4 1
##   a.2 1 0 1 1 0
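
Applied to the variables from the question, the same pattern might look like the sketch below. The texts and docvar values are invented for illustration, and the object names (corp_q, dfmat, dfmat_grouped, dfmat_pasted) are hypothetical. The paste() variant is shown as a possible alternative, on the assumption that groups accepts any expression that evaluates to one value per document:

```r
library("quanteda")

# Hypothetical corpus: four speeches with speaker and week_start docvars
corp_q <- corpus(
  c("we debate tax", "tax and spend", "health care now", "tax cuts now"),
  docvars = data.frame(
    speaker    = c("Smith", "Smith", "Smith", "Jones"),
    week_start = c("2021-01-04", "2021-01-04", "2021-01-11", "2021-01-04")
  )
)

# Tokenize first, then build the dfm (the v3 workflow)
dfmat <- dfm(tokens(corp_q))

# Group by the combination of both docvars, using unquoted names
dfmat_grouped <- dfm_group(dfmat, groups = interaction(speaker, week_start))

# Alternative sketch: build the combined key with paste() instead
dfmat_pasted <- dfm_group(dfmat, groups = paste(speaker, week_start, sep = "_"))

# Both should contain one row per speaker-week that occurs in the data
ndoc(dfmat_grouped)
ndoc(dfmat_pasted)
```

With these made-up values there are three speaker-week combinations that actually occur, so both grouped objects should have three documents. The main difference between the two variants is how the resulting document names are formed.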
