如何计算R中的R1词汇丰富度指数?

问题描述

嗨,我需要编写一个函数来计算 R1,其定义如下:

R1 = 1 - ( F(h) - h*h/2N) )

其中 N 是标记数量,h 是 Hirsch 点,F(h) 是到该点的累积相对频率。使用 quanteda 包我设法计算了赫希点

 a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.","The United States is committed to advancing prosperity,security,and freedom for both Israelis and Palestinians in tangible ways in the immediate term,which is important in its own right,but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.","We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash,with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction."," The US president accused Palestinians of lacking “appreciation or respect.","To create my data I had to chunk each text in an increasing manner.","Therefore,the input is a list of chunked texts within another list.")
a3 <- c("We plan to restart US economic,development,and humanitarian assistance for the Palestinian people,” the secretary of state,Antony Blinken,said in a statement.","The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare,schooling,and sanitation,including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership,President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis,and restoring the aid is part of that.")
txt <-list(a,a1,a2,a3)

    
library(quanteda)
DFMs <- lapply(txt,dfm)
txt_freq <- function(x) textstat_frequency(x,groups = docnames(x),ties_method = "first")
Fs <- lapply(DFMs,txt_freq)

get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank,DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root,range(DATA$rank))$root
}

s_p <- function(x){split(x,x$group)}  
tstat_by <- lapply(Fs,s_p)
h_values <-lapply(tstat_by,vapply,get_h_point,double(1))

要计算 F(h)——直到 h_point 的累积相对频率——放入 R1,我需要两个值;其中一个必须来自Fs$rank,另一个必须来自h_values。考虑第一个原始文本(tstat_by[[1]]tstat_by[[2]]tstat_by[[3]])及其各自的 h_values(h_values[[1]]h_values[[2]]h_values[[3]]):

fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])),n=1)
fh_txt2 <-tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])),n=1)
...

tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])),n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])),n=1)
[1] 0.75

如您所见,分组是相同的——原始字符向量的每个块的文档名都是相同的(text1、text2、text3 等)。我的问题是如何为 fh_txt(s) 编写函数,以便使用 lapply 可以作为计算 R1 的 F(h) 的一个选项。

请注意,我们的目标是写一个函数来计算R1,我在这里放的是这方面已经做了什么。

解决方法

我在下面简化了您的输入,并在 groups 中使用了 textstat_frequency() 参数,而不是您创建 dfm 对象列表的方法。

a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.")
a2 <- c("His statement comes amid an ongoing investigation into the crash,with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.")

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfmat <- c(a,a1,a2) %>%
  tokens() %>%
  dfm()

tstat <- quanteda.textstats::textstat_frequency(dfmat,groups = docnames(dfmat),ties = "first")
tstat_by <- split(tstat,tstat$group)

get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank,DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root,range(DATA$rank))$root
}
h_values <- vapply(tstat_by,get_h_point,double(1))
h_values
##    text1    text2    text3 
## 2.000014 1.500000 2.000024

tstat_by <- lapply(
  names(tstat_by),function(x) subset(tstat_by[[x]],cumsum(rank) <= h_values[[x]])
)

do.call(rbind,tstat_by)
##    feature frequency rank docfreq group
## 1      the         2    1       1 text1
## 29     the         2    1       1 text2
## 48     the         3    1       1 text3

您没有指定您想要的输出,但是有了这个结果,您应该能够使用 lapply() 在列表上或使用例如 dplyr

reprex package (v1.0.0) 于 2021 年 4 月 5 日创建

相关问答

Selenium Web驱动程序和Java。元素在(x,y)点处不可单击。其...
Python-如何使用点“。” 访问字典成员?
Java 字符串是不可变的。到底是什么意思?
Java中的“ final”关键字如何工作?(我仍然可以修改对象。...
“loop:”在Java代码中。这是什么,为什么要编译?
java.lang.ClassNotFoundException:sun.jdbc.odbc.JdbcOdbc...