R: How can I filter the values in a dfm_tfidf with the quanteda package?

Problem description

I have a dfm_tfidf and I want to filter out values that fall below a certain threshold.

Code

dfmat2 <-
  matrix(c(1, 1, 2, 1, 0, 0, 1, 1, 0, 0, 2, 3), byrow = TRUE, nrow = 2,
         dimnames = list(docs = c("document1", "document2"),
                         features = c("this", "is", "a", "sample", "another", "example"))) %>%
  as.dfm()


#it works
dfmat2 %>% dfm_trim(min_termfreq = 3)

#it does not work
dfm_tfidf(dfmat2) %>% dfm_trim( min_termfreq = 1)
# "Warning message: In dfm_trim.dfm(., min_termfreq = 1) : dfm has been previously weighted"

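Question: how can I filter the values in a dfm_tfidf?

Thanks

Solution

Here is a function that does this directly on the sparse matrix, keeping a feature only if its largest value in any document reaches an absolute minimum. Be careful, though: absolute tf-idf values are not very meaningful across different dfm objects, so a threshold chosen for one dfm will not carry over to another.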

library("quanteda")
## Package version: 2.1.1

dfmat2 <-
  matrix(c(1, 1, 2, 1, 0, 0, 1, 1, 0, 0, 2, 3), byrow = TRUE, nrow = 2, dimnames = list(
      docs = c("document1", "document2"), features = c(
        "this", "is", "a", "sample", "another", "example"
      )
    )
  ) %>%
  as.dfm()

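# function to trim features based on absolute minimum threshold
# operating directly on sparse matrix
dfm_trimabs <- function(x, min) {
  # group the stored values by feature name and take each feature's maximum
  # (@j holds 0-based column indices, hence the + 1)
  maxvals <- sapply(
    split(x@x, featnames(x)[as(x, "dgTMatrix")@j + 1]), max
  )
  # keep only the features whose maximum reaches the threshold
  dfm_keep(x, names(maxvals)[maxvals >= min])
}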
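
To make the grouping step easier to follow, here is a minimal sketch of the same idea on a plain sparse matrix, assuming only the Matrix package and no quanteda objects. The matrix `m`, the weights in it, and the `keep` vector are illustrative, not part of the answer's code. The sketch reads both the values and the column indices from the triplet form, so the two are aligned by construction.

```r
library(Matrix)

# Illustrative weights (hypothetical data, chosen to resemble the example).
m <- sparseMatrix(
  i = c(1, 1, 2, 2),
  j = c(3, 4, 5, 6),
  x = c(0.60206, 0.30103, 0.60206, 0.90309),
  dims = c(2, 6),
  dimnames = list(
    c("document1", "document2"),
    c("this", "is", "a", "sample", "another", "example")
  )
)

# Triplet form stores one (i, j, x) entry per stored cell; j is 0-based.
tm <- as(m, "TsparseMatrix")

# Group the stored values by feature name and take each group's maximum.
maxvals <- sapply(split(tm@x, colnames(tm)[tm@j + 1]), max)

# Features with no stored values never appear in maxvals, so they drop out.
keep <- names(maxvals)[maxvals >= 0.5]
m[, colnames(m) %in% keep, drop = FALSE]
```

Only the stored entries are touched, so the cost scales with the number of non-zero cells rather than with documents times features.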

Now apply dfm_trimabs to the example, comparing the result before and after trimming:

# before trimming
dfm_tfidf(dfmat2)
## Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
##            features
## docs        this is       a  sample another example
##   document1    0  0 0.60206 0.30103 0       0      
##   document2    0  0 0       0       0.60206 0.90309

# after trimming
dfm_tfidf(dfmat2) %>%
  dfm_trimabs(min = 0.5)
## Document-feature matrix of: 2 documents, 3 features (50.0% sparse).
##            features
## docs              a another example
##   document1 0.60206 0       0      
##   document2 0       0.60206 0.90309
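
As a cross-check on these numbers, the weights and the trimming can be reproduced in base R without quanteda. This is a rough sketch, assuming the weighting shown in the output is raw counts multiplied by log10(N / document frequency). The counts are reconstructed from the tf-idf output; the values for "this" and "is" are assumed to be 1 in both documents (any non-zero count in both documents gives the same zero weight). `trim_abs` is a hypothetical helper name.

```r
# Counts implied by the example: rows are documents, columns are features.
counts <- matrix(
  c(1, 1, 2, 1, 0, 0,
    1, 1, 0, 0, 2, 3),
  byrow = TRUE, nrow = 2,
  dimnames = list(
    docs = c("document1", "document2"),
    features = c("this", "is", "a", "sample", "another", "example")
  )
)

# Inverse document frequency: log10(number of documents / document frequency).
docfreq <- colSums(counts > 0)
idf <- log10(nrow(counts) / docfreq)

# Multiply every column of counts by that column's idf.
tfidf <- sweep(counts, 2, idf, `*`)

# Keep the features whose largest weight in any document reaches the threshold.
trim_abs <- function(m, min) {
  m[, apply(m, 2, max) >= min, drop = FALSE]
}

trimmed <- trim_abs(tfidf, min = 0.5)
colnames(trimmed)
```

The dense version is only practical for small examples; on a real corpus the sparse approach avoids materialising the full matrix.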
