Why do the row and column of a quanteda feature co-occurrence matrix give different results?

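Problem description

I am trying to use quanteda to count how many times different terms co-occur with a specific term (for example Vietnam, "越南") within one quarter.

But when I select a column or a row for that term from the feature co-occurrence matrix, the counts are different.

Can anyone tell me why this happens, or what I am doing wrong? I am worried that the analysis I base on these results is incorrect.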

## Producing the FCM

> corp <- corpus(data_SCS14q4)
> toks <- tokens(corp, remove_punct = TRUE) %>% tokens_remove(ch_stop) %>% tokens_compound(phrase("东 盟"), concatenator = "")
> fcm_14q4 <- fcm(toks, context = "window")

## Taking the row for Vietnam ("越南"):

> mt <- fcm_14q4["越南", ]
> head(mt)

Feature co-occurrence matrix of: 1 by 6 features.
        features
features 印 司令 中国 2050 收复 台湾
    越南  0    0    0    0    0    0

## Taking the column for Vietnam ("越南"):

> mt2 <- fcm_14q4[, "越南"]
> head(mt2)

Feature co-occurrence matrix of: 6 by 1 feature.
        features
features 越南
    印      0
    司令    0
    中国   68
    2050    0
    收复    8
    台湾    4

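Solution

This happens because, by default, fcm() returns only the upper triangle of the co-occurrence matrix (the matrix is symmetric when ordered = FALSE). A row slice therefore contains only the counts stored to the right of the diagonal and a column slice only those stored above it, so the two differ. To make the two index slices equivalent, you need to specify tri = FALSE.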

library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

toks <- tokens(c("a a a b b c", "a a c e", "a c e f g"))

# by default (tri = TRUE), only the upper triangle is returned
fcm(toks, context = "window", window = 2, tri = TRUE)
## Feature co-occurrence matrix of: 6 by 6 features.
##         features
## features a b c e f g
##        a 8 3 3 2 0 0
##        b 0 2 2 0 0 0
##        c 0 0 0 2 1 0
##        e 0 0 0 0 1 1
##        f 0 0 0 0 0 1
##        g 0 0 0 0 0 0

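Setting tri = FALSE makes the matrix symmetric, in which case the two index slices are identical:

# same window settings as above, so the counts are comparable
fcmat2 <- fcm(toks, context = "window", window = 2, tri = FALSE)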
fcmat2
## Feature co-occurrence matrix of: 6 by 6 features.
##         features
## features a b c e f g
##        a 8 3 3 2 0 0
##        b 3 2 2 0 0 0
##        c 3 2 0 2 1 0
##        e 2 0 2 0 1 1
##        f 0 0 1 1 0 1
##        g 0 0 0 1 1 0

fcmat2[,"a"]
## Feature co-occurrence matrix of: 6 by 1 features.
##         features
## features a
##        a 8
##        b 3
##        c 3
##        e 2
##        f 0
##        g 0
t(fcmat2["a",])
## Feature co-occurrence matrix of: 6 by 1 features.
##         features
## features a
##        a 8
##        b 3
##        c 3
##        e 2
##        f 0
##        g 0
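
If the matrix has already been built with tri = TRUE, the same full counts can be recovered by mirroring the upper triangle. The following is only a minimal sketch in base R: it assumes a plain matrix stands in for the fcm object, and the names feats, upper and full are made up for illustration. The counts are copied from the tri = TRUE output above.

```r
# Upper-triangle counts copied from the tri = TRUE output above
# (a plain base R matrix is used here in place of a quanteda fcm object)
feats <- c("a", "b", "c", "e", "f", "g")
upper <- matrix(
  c(8, 3, 3, 2, 0, 0,
    0, 2, 2, 0, 0, 0,
    0, 0, 0, 2, 1, 0,
    0, 0, 0, 0, 1, 1,
    0, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(feats, feats)
)

# Mirror the upper triangle into the lower triangle;
# restore the diagonal so it is not counted twice
full <- upper + t(upper)
diag(full) <- diag(upper)

# Row and column slices for the same feature now agree
row_a <- full["a", ]
col_a <- full[, "a"]
print(identical(row_a, col_a))
```

For a single target term this amounts to adding its row and its column from the triangular matrix and counting the diagonal cell only once.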
