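Problem description
I have a document-term matrix, "mydtm", that I created in R with the "tm" package. I am trying to describe how similar each of the 557 documents in the DTM/corpus is to every other one. I have been trying to do this with a cosine similarity matrix (mydtm_cosine).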
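Solution
Your documents probably share very few words with one another. You may want to reduce the number of terms in the term-document matrix, for example by keeping only the terms that occur in a reasonable share of the documents.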
library(tm)
# Four short example documents
text <- c("term-document matrix is a mathematical matrix",
          "we now have a tidy three-column",
          "cast into a Term-Document Matrix",
          "where the rows represent the text responses, or documents")
corpus <- VCorpus(VectorSource(text))
# Keep terms of any length (the default drops words shorter than 3 characters)
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
# Share of documents in which each term occurs at least once
occurrence <- apply(X = tdm, MARGIN = 1, FUN = function(x) sum(x > 0) / ncol(tdm))
occurrence
#             a          cast     documents          have
#          0.75          0.25          0.25          0.25
#          into            is  mathematical        matrix
#          0.25          0.25          0.25          0.50
#           now            or     represent    responses,
#          0.25          0.25          0.25          0.25
#          rows term-document          text           the
#          0.25          0.50          0.25          0.25
#  three-column          tidy            we         where
#          0.25          0.25          0.25          0.25
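# Inspect the distribution of document frequencies to pick a cut-off
quantile(occurrence, probs = c(0.5, 0.9, 0.99))
#    50%    90%    99%
# 0.2500 0.5000 0.7025
# Keep only the terms that occur in at least half of the documents
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])
tdm_mat
#                Docs
# Terms           1 2 3 4
#   a             1 1 1 0
#   matrix        2 0 1 0
#   term-document 1 0 1 0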
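Then you can compute the cosine measure with the proxy package. Note that dist() compares the rows of the matrix (here, the terms) and returns a distance, i.e. 1 minus the cosine similarity. To compare documents instead, transpose the matrix first with t(tdm_mat).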
library(proxy)
dist(tdm_mat, method = "cosine", upper = TRUE)
#                       a    matrix term-document
# a                       0.2254033     0.1835034
# matrix        0.2254033               0.0513167
# term-document 0.1835034 0.0513167
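Since the question asks about similarity between documents rather than between terms, here is a minimal base R sketch of the document-level computation. It is not part of the original answer: the counts are copied by hand from tdm_mat above so that the snippet runs without tm or proxy, and cosine_sim_cols is a hypothetical helper name introduced only for illustration.

```r
# Counts copied from tdm_mat above: terms in rows, documents in columns.
tdm_mat <- matrix(
  c(1, 1, 1, 0,
    2, 0, 1, 0,
    1, 0, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "matrix", "term-document"),
                  c("1", "2", "3", "4"))
)

# Hypothetical helper (not from tm or proxy):
# cosine similarity between every pair of columns of m.
cosine_sim_cols <- function(m) {
  norms <- sqrt(colSums(m^2))          # Euclidean length of each column
  crossprod(m) / outer(norms, norms)   # dot products divided by norm products
}

# Document-by-document similarity (columns of tdm_mat are documents)
doc_sim <- cosine_sim_cols(tdm_mat)
round(doc_sim, 3)

# Term-by-term distance, for comparison with the dist() output above
term_dist <- 1 - cosine_sim_cols(t(tdm_mat))
round(term_dist, 7)
```

In this toy example, documents 1 and 3 come out as the most similar pair (about 0.94). Document 4 contains none of the retained terms, so its column is all zeros and every similarity involving it is NaN (0 divided by 0). With a real corpus, it may be worth checking for such empty documents after filtering terms and deciding whether to drop them or to lower the cut-off.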