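Getting term frequency and inverse document frequency when each row is a document

Question

Code example for computing term frequency and inverse document frequency: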

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
    unnest_tokens(word,text) %>%
    count(book,word,sort = TRUE)

total_words <- book_words %>% 
    group_by(book) %>% 
    summarize(total = sum(n))

book_words <- left_join(book_words,total_words)

book_words <- book_words %>%
    bind_tf_idf(word,book,n)

book_words %>%
    select(-total) %>%
    arrange(desc(tf_idf))

My problem is that this example uses multiple books.

My data has a different structure:

dataset1 <- data.frame( anumber = c(1,2,3),text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s,when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries,but also the leap into electronic typesetting,remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages,and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum","Contrary to popular belief,Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC,making it over 2000 years old. Richard McClintock,a Latin professor at Hampden-Sydney College in Virginia,looked up one of the more obscure Latin words,consectetur,from a Lorem Ipsum passage,and going through the cites of the word in classical literature,discovered the undoubtable source."))

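In my dataset1, each row is a unique document. I want the same term frequency and inverse document frequency results as above, but I don't know how to get them with my data. How do I start?

An alternative option is to compute term frequencies like this: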

library(quanteda)
myDfm <- dataset1$text %>%
    corpus() %>%                    
    tokens(remove_punct = TRUE,remove_numbers = TRUE,remove_symbols = TRUE) %>%
    tokens_ngrams(n = 1:2) %>%
    dfm()

How can I use the quanteda package to get the same result as tidytext, that is, a tf-idf score for every word?

What I have tried

number_of_docs <- nrow(myDfm)
term_in_docs <- colSums(myDfm > 0)
idf <- log2(number_of_docs / term_in_docs)

# Compute TF

tf <- as.vector(myDfm)

# Compute TF-IDF
tf_idf <- tf * idf
names(tf_idf) <- colnames(myDfm)
sort(tf_idf,decreasing = T)[1:5]

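Is this a correct way to get tf_idf from quanteda using each word's frequency?

The expected output is the word, its term frequency, and its tf_idf value.

Answer

If I understand the question correctly, you want a single tf-idf for each word across the three documents, in other words, an output data.frame that is unique by word.

The problem is that you cannot do this with tf-idf, because the "idf" part multiplies the term frequency by the log of the inverse document frequency. When you combine the three documents, every term occurs in the single combined document, which means it has a document frequency of 1, equal to the number of documents. The ratio is therefore 1, its log is 0, and so the tf-idf of every word in the combined document is zero. I show this below.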
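Written out as a formula, the argument looks roughly like this. This is a sketch that assumes the default weighting of quanteda's dfm_tfidf() (raw counts for the term frequency and a base-10 logarithm), which is consistent with the value 0.176 = log10(3/2) in the output further down; the symbols N, tf and df are notation introduced here, not part of the original answer.

```latex
\[
\operatorname{tf\text{-}idf}(t, d) \;=\; \operatorname{tf}(t, d)\,\cdot\,\log_{10}\frac{N}{\operatorname{df}(t)}
\]
% N     = number of documents
% df(t) = number of documents that contain term t
% After merging everything into one document: N = 1 and df(t) = 1 for every t, so
\[
\operatorname{tf\text{-}idf}(t, d) \;=\; \operatorname{tf}(t, d)\,\cdot\,\log_{10}\frac{1}{1} \;=\; 0 .
\]
```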

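The tf-idf of the same word differs from document to document. That is why the tidytext example shows each word by book, rather than once for the whole corpus.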
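To make the per-document nature of the score concrete, here is a minimal base R sketch on a small made-up count matrix (the `counts` values and term names are hypothetical, not taken from dataset1). It assumes the count times log10(N / df) weighting, and it keeps the documents-by-terms shape instead of flattening the matrix into one vector.

```r
# Hypothetical document-term counts: 3 documents (rows) x 4 terms (columns).
counts <- matrix(
  c(2, 1, 1, 0,
    1, 1, 0, 1,
    1, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("lorem", "ipsum", "is", "has"))
)

n_docs   <- nrow(counts)          # number of documents
doc_freq <- colSums(counts > 0)   # number of documents containing each term
idf      <- log10(n_docs / doc_freq)

# Multiply every column (term) by that term's idf; the result is still
# one value per document and term, not one value per term.
tf_idf <- sweep(counts, 2, idf, "*")
tf_idf
```

A term that occurs in every document ("lorem", "ipsum") gets an idf of zero, while "is" gets a non-zero score only in the documents where it occurs.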

Here is how to do it by document in quanteda:

library("quanteda",warn.conflicts = FALSE)
## Package version: 2.1.1

myDfm <- dataset1 %>%
  corpus(docid_field = "anumber") %>%
  tokens(remove_punct = TRUE,remove_numbers = TRUE,remove_symbols = TRUE) %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()

myDfm %>%
  dfm_tfidf() %>%
  convert(to = "data.frame") %>%
  dplyr::group_by(doc_id) %>%
  tidyr::gather(key = "word",value = "tf_idf",-doc_id) %>%
  tibble::tibble()
## # A tibble: 744 x 3
##    doc_id word   tf_idf
##    <chr>  <chr>   <dbl>
##  1 1      lorem   0    
##  2 2      lorem   0    
##  3 3      lorem   0    
##  4 1      ipsum   0    
##  5 2      ipsum   0    
##  6 3      ipsum   0    
##  7 1      is      0.176
##  8 2      is      0    
##  9 3      is      0.176
## 10 1      simply  0.176
## # … with 734 more rows
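For comparison, the tidytext pipeline from the question can be applied to one-row-per-document data by using the id column as the document. The following is only a sketch on a shortened, made-up stand-in for dataset1 (the `docs` data frame and `doc_words` are hypothetical names); with the real data, `anumber` would play the same role as `book`. Note that bind_tf_idf() uses proportional term frequencies and a natural log, so its numbers will not match the quanteda defaults above.

```r
library(dplyr)
library(tidytext)

# Shortened, hypothetical stand-in for dataset1: one row per document.
docs <- data.frame(
  anumber = c(1, 2, 3),
  text = c("lorem ipsum is simply dummy text",
           "lorem ipsum has survived five centuries",
           "lorem ipsum is not random text"),
  stringsAsFactors = FALSE  # unnest_tokens() needs character, not factor
)

doc_words <- docs %>%
  unnest_tokens(word, text) %>%       # one row per token
  count(anumber, word, sort = TRUE) %>%  # n = count of word within document
  bind_tf_idf(word, anumber, n) %>%   # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

doc_words
```

The result has one row per document and word, with columns for the count, tf, idf and tf_idf.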

If you combine all documents with dfm_group(), you can see that the tf-idf of every word is zero.

myDfm %>%
  dfm_group(groups = rep(1,ndoc(myDfm))) %>%
  dfm_tfidf() %>%
  convert(to = "data.frame") %>%
  dplyr::select(-doc_id) %>%
  tidyr::gather(key = "word",value = "tf_idf") %>%
  tibble::tibble()
## # A tibble: 247 x 2
##    word     tf_idf
##    <chr>     <dbl>
##  1 lorem         0
##  2 ipsum         0
##  3 is            0
##  4 simply        0
##  5 dummy         0
##  6 text          0
##  7 of            0
##  8 the           0
##  9 printing      0
## 10 and           0
## # … with 237 more rows
