How do I create interaction features with quanteda?

Question

Consider the following example:

library(quanteda)
library(tidyverse)

tibble(text = c('the dog is growing tall','the grass is growing as well')) %>% 
  corpus() %>% dfm()
Document-feature matrix of: 2 documents, 8 features (31.2% sparse).
       features
docs    the dog is growing tall grass as well
  text1   1   1  1       1    1     0  0    0
  text2   1   0  1       1    0     1  1    1

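I would like to create interactions between dog and the other tokens in each sentence. That is, create the features the-dog, is-dog, growing-dog and tall-dog, and add them to the dfm (on top of the features it already has).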

For example, the-dog would equal 1 if both the and dog are present in the sentence, and 0 otherwise. So the-dog would be 1 for the first sentence and 0 for the second.

Note that I only create interaction terms for sentences that contain dog, so dog-grass is not needed here.
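To make the definition above concrete, here is a minimal base-R sketch that computes the requested indicators directly from the counts printed earlier. The names counts, has_dog, keep and inter are illustrative only, and the matrix is typed in by hand rather than taken from quanteda.

```r
# Toy counts copied from the dfm printed above (rows = documents).
counts <- matrix(
  c(1, 1, 1, 1, 1, 0, 0, 0,
    1, 0, 1, 1, 0, 1, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("text1", "text2"),
    c("the", "dog", "is", "growing", "tall", "grass", "as", "well")
  )
)

# TRUE for documents that contain "dog"
has_dog <- counts[, "dog"] > 0

# Only terms that co-occur with "dog" in at least one document get an interaction
others <- setdiff(colnames(counts), "dog")
keep <- others[colSums(counts[has_dog, others, drop = FALSE]) > 0]

# Indicator: 1 when the term and "dog" are both present in the document
inter <- (counts[, keep, drop = FALSE] > 0 & has_dog) * 1
colnames(inter) <- paste(keep, "dog", sep = "-")
inter
```

Because grass, as and well never appear together with dog, they get no interaction column, which matches the note about dog-grass.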

How can I do this efficiently in quanteda?

Answer

library("quanteda")
## Package version: 2.1.2

toks <- tokens(c(
  "the dog is growing tall", "the grass is growing as well"
))

# now keep just tokens co-occurring with "dog"
toks_dog <- tokens_select(toks, "dog", window = 1e5)

# create the dfm and label other terms as interactions with dog
dfmat_dog <- dfm(toks_dog) %>%
  dfm_remove("dog")
colnames(dfmat_dog) <- paste(featnames(dfmat_dog), "dog", sep = "-")
dfmat_dog
## Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
##        features
## docs    the-dog is-dog growing-dog tall-dog
##   text1       1      1           1        1
##   text2       0      0           0        0

# combine with other features
print(cbind(dfm(toks), dfmat_dog), max_nfeat = -1)
## Document-feature matrix of: 2 documents, 12 features (37.50% sparse) and 0 docvars.
##        features
## docs    the dog is growing tall grass as well the-dog is-dog growing-dog
##   text1   1   1  1       1    1     0  0    0       1      1           1
##   text2   1   0  1       1    0     1  1    1       0      0           0
##        features
## docs    tall-dog
##   text1        1
##   text2        0

Created on 2021-03-18 by the reprex package (v1.0.0)
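The same steps can be wrapped in a small function so they work for any target term. This is only a sketch: add_interactions is a hypothetical helper, not part of quanteda. It assumes the target is a single lowercase word (dfm() lowercases features by default) and that no document is longer than the 1e5-token window. It also adds dfm_weight(scheme = "boolean"), which is not in the answer above, so each interaction stays 0/1 as the question asks even when a term occurs more than once in a document.

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): append "<feature>-<target>"
# indicator columns for the documents that contain `target`.
add_interactions <- function(toks, target, sep = "-") {
  # keep only tokens from documents that contain the target term
  toks_target <- tokens_select(toks, target, valuetype = "fixed", window = 1e5)

  # drop the target itself and turn counts into 0/1 indicators
  dfmat_int <- dfm_remove(dfm(toks_target), target, valuetype = "fixed")
  dfmat_int <- dfm_weight(dfmat_int, scheme = "boolean")

  # label the remaining features as interactions with the target
  colnames(dfmat_int) <- paste(featnames(dfmat_int), target, sep = sep)

  # combine with the original features
  cbind(dfm(toks), dfmat_int)
}

toks <- tokens(c("the dog is growing tall", "the grass is growing as well"))
dfmat <- add_interactions(toks, "dog")
featnames(dfmat)
```

Calling the helper once per target term and combining the results with cbind() would extend this to several targets, at the cost of one pass over the tokens per term.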
