How is similarity computed in the pairwise_similarity function?

Problem description

I am using the tidytext and widyr packages to compute document similarity, like this:

library(janeaustenr)
library(dplyr)
library(tidytext)
library(widyr)  # provides pairwise_similarity()

# Comparing Jane Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word,text) %>%
  anti_join(stop_words,by = "word") %>%
  count(book,word) %>%
  ungroup()

# closest books to each other
closest <- austen_words %>%
  pairwise_similarity(book,word,n) %>%
  arrange(desc(similarity))

closest

# book titles in austen_books() are capitalized, e.g. "Emma"
closest %>%
  filter(item1 == "Emma")

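How is the similarity in the pairwise_similarity function computed?

Some words may not appear in both documents. Do those words count?

Or are they ignored, so that only the words the two documents share are used?

If a word has similar tf-idf scores in two documents, is it treated as similar?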
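To make the questions above concrete, here is a minimal sketch in base R. It assumes that pairwise_similarity computes cosine similarity on the value column it is given, as the widyr documentation describes; in the code above that column is n, the raw word count, not a tf-idf score. The two toy documents and the helper cosine_similarity are hypothetical and are not part of widyr. Under that assumption, a word missing from one document is a zero in that document's vector: it adds nothing to the dot product, but it still enlarges the length of the other document's vector and so lowers the similarity.

```r
# Toy word counts for two hypothetical documents (not taken from the Austen data).
# A word that a document does not contain is represented as 0.
doc_a <- c(love = 3, house = 2, walk = 0)
doc_b <- c(love = 1, house = 0, walk = 4)

# Hypothetical helper: cosine similarity is the dot product of the two vectors
# divided by the product of their Euclidean lengths.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# All words are used: "house" and "walk" contribute 0 to the dot product
# but still count toward the vector lengths.
sim_all <- cosine_similarity(doc_a, doc_b)

# For contrast, keep only the words that both documents contain.
shared <- doc_a > 0 & doc_b > 0
sim_shared <- cosine_similarity(doc_a[shared], doc_b[shared])

print(sim_all)     # 3 / sqrt(13 * 17), roughly 0.20
print(sim_shared)  # 1, because only "love" is left
```

The gap between the two numbers shows why the distinction matters: if unshared words were dropped, these two documents would look identical. To compare documents by tf-idf instead of raw counts, one option would be to compute tf-idf first (for example with tidytext's bind_tf_idf) and pass that column as the value argument.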


Tags: cosine-similarity, r, tf-idf