bind_tf_idf() error: Error in tapply(n, documents, sum) : arguments must have same length

Problem description

I am trying to run bind_tf_idf() on the following df. My df has two documents/classes: Y or N.

> test_2
# A tibble: 3,295 x 2
   Class word    
   <fct> <chr>   
 1 Y     nature
 2 Y     great
 3 Y     are     
 4 Y     present 
 5 N     in      
 6 N     weather   
 7 Y     moisture   
 8 N     humidity     
 9 Y     and     
10 Y     pollen
# … with 3,285 more rows
Warning message:
`...` is not empty.

We detected these problematic arguments:
* `needs_dots`

These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?

This is what I am using:

test_2_tf_idf <- test_2 %>%
  bind_tf_idf(word,Class,sum)

But I get this error message:

> test_2_tf_idf <- test_2 %>%
+   bind_tf_idf(word,Class,sum)

Error in tapply(n, documents, sum) : arguments must have same length

What I ultimately want is a table of calculations similar to this one (ignore the "total" column):

#> # A tibble: 40,379 x 7
#>    book              word      n  total     tf   idf tf_idf
#>    <fct>             <chr> <int>  <int>  <dbl> <dbl>  <dbl>
#>  1 Mansfield Park    the    6206 160460 0.0387     0      0
#>  2 Mansfield Park    to     5475 160460 0.0341     0      0
#>  3 Mansfield Park    and    5438 160460 0.0339     0      0
#>  4 emma              to     5239 160996 0.0325     0      0
#>  5 emma              the    5201 160996 0.0323     0      0
#>  6 emma              and    4896 160996 0.0304     0      0
#>  7 Mansfield Park    of     4778 160460 0.0298     0      0
#>  8 Pride & Prejudice the    4331 122204 0.0354     0      0
#>  9 emma              of     4291 160996 0.0267     0      0
#> 10 Pride & Prejudice to     4162 122204 0.0341     0      0
#> # … with 40,369 more rows

Except that in my case the "book" column would be the "Y" or "N" class of each word.

What can I do to fix this error?

Solution

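The fourth argument of tidytext::bind_tf_idf (n, after tbl, term and document) is not a function but a "column containing document-term counts as string or symbol" (see ?tidytext::bind_tf_idf). Passing sum there does not name any column of your data, so no counts are found.

You therefore first have to aggregate your data by Class and word, so that there is one row per word/Class pair with its count, e.g. using dplyr::count: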

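# Small example data in the same shape as the question: one row per word occurrence
test_2 <- structure(list(Class = c(
  "Y","Y","Y","Y","N","N","Y","N","Y","Y"
),word = c(
  "vesicles","exosomes","are","present","in","blood","urine","and","proteins","and"
)),class = "data.frame",row.names = c(
  "1","2","3","4","5","6","7","8","9","10"
))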

library(tidytext)
library(dplyr)

test_2_tf_idf <- test_2 %>%
  count(word,Class) %>%
  bind_tf_idf(word,Class,n)

test_2_tf_idf
#>        word Class n        tf       idf     tf_idf
#> 1       and     N 1 0.3333333 0.0000000 0.00000000
#> 2       and     Y 1 0.1428571 0.0000000 0.00000000
#> 3       are     Y 1 0.1428571 0.6931472 0.09902103
#> 4     blood     N 1 0.3333333 0.6931472 0.23104906
#> 5  exosomes     Y 1 0.1428571 0.6931472 0.09902103
#> 6        in     N 1 0.3333333 0.6931472 0.23104906
#> 7   present     Y 1 0.1428571 0.6931472 0.09902103
#> 8  proteins     Y 1 0.1428571 0.6931472 0.09902103
#> 9     urine     Y 1 0.1428571 0.6931472 0.09902103
#> 10 vesicles     Y 1 0.1428571 0.6931472 0.09902103
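
To check what the tf, idf and tf_idf columns mean, the same numbers can be worked out by hand. The following is only a base R sketch under the usual definitions (tf = count of the word in a class divided by the total count in that class; idf = natural log of the number of classes divided by the number of classes containing the word). It is not the tidytext source code, and the helper names counts, doc_totals and docs_per_word are made up for illustration:

```r
# Same toy data as above: one row per word occurrence.
test_2 <- data.frame(
  Class = c("Y", "Y", "Y", "Y", "N", "N", "Y", "N", "Y", "Y"),
  word  = c("vesicles", "exosomes", "are", "present", "in",
            "blood", "urine", "and", "proteins", "and"),
  stringsAsFactors = FALSE
)

# Step 1: one row per (word, Class) pair with its count n.
# This is the aggregation that dplyr::count(word, Class) performs.
counts <- aggregate(n ~ word + Class, data = cbind(test_2, n = 1L), FUN = sum)

# Step 2: term frequency = n / total number of words in that class.
doc_totals <- tapply(counts$n, counts$Class, sum)
counts$tf <- counts$n / as.numeric(doc_totals[as.character(counts$Class)])

# Step 3: inverse document frequency = log(#classes / #classes containing the word).
n_docs <- length(doc_totals)
docs_per_word <- table(counts$word)
counts$idf <- log(n_docs / as.numeric(docs_per_word[as.character(counts$word)]))

# Step 4: tf-idf is the product of the two.
counts$tf_idf <- counts$tf * counts$idf

# Sort like count() does (by word, then Class) for easier comparison.
counts <- counts[order(counts$word, counts$Class), ]
print(counts)
```

With only two classes, every word that occurs in both (here "and") gets idf = log(2/2) = 0 and therefore tf_idf = 0, which is also why the very common words in the book example from the question show zeros.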