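R: How to extract elements with the same index from two lists in order to write a function

Problem description

I am trying to write a function that computes the R1 lexical richness measure. The formula is: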

R1 = 1 - ( F(h) - h*h / (2*N) )

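where N is the number of tokens, h is the Hirsch point (h-point), and F(h) is the cumulative relative frequency up to that point. Using the quanteda package, I managed to compute the h-point.

To create my data I had to chunk each text in an increasing manner. Therefore, the input is a list of chunked texts within another list. To avoid nested lists, I converted each list of chunks to a character vector (meaning that each character vector holds several separate texts).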

a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.","The United States is committed to advancing prosperity,security,and freedom for both Israelis and Palestinians in tangible ways in the immediate term,which is important in its own right,but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.","We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash,with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction."," The US president accused Palestinians of lacking “appreciation or respect.","To create my data I had to chunk each text in an increasing manner.","Therefore,the input is a list of chunked texts within another list.")
a3 <- c("We plan to restart US economic,development,and humanitarian assistance for the Palestinian people,” the secretary of state,Antony Blinken,said in a statement.","The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare,schooling,and sanitation,including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership,President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis,and restoring the aid is part of that.")
txt <-list(a,a1,a2,a3)

    
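library(quanteda)
library(quanteda.textstats)  # in quanteda >= 3.0, textstat_frequency() is provided by this package
# tokenize explicitly: recent quanteda versions no longer build a dfm directly from character input
DFMs <- lapply(txt, function(x) dfm(tokens(x)))
txt_freq <- function(x) textstat_frequency(x,groups = docnames(x),ties_method = "first")
Fs <- lapply(DFMs,txt_freq)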

get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank,DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root,range(DATA$rank))$root
}

s_p <- function(x){split(x,x$group)}  
tstat_by <- lapply(Fs,s_p)
h_values <-lapply(tstat_by,vapply,get_h_point,double(1))
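As a quick sanity check of get_h_point, here is a minimal sketch on a made-up rank-frequency table. The numbers in toy are hypothetical and do not come from the texts above; the function is repeated only so that the snippet runs on its own, without quanteda.

```r
# Same definition as above, repeated so the snippet is self-contained
get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)  # linear interpolation of frequency over rank
  fn_root <- function(x) fn_interp(x) - x            # zero where interpolated frequency equals rank
  uniroot(fn_root, range(DATA$rank))$root
}

# Hypothetical rank-frequency table (not taken from the example texts)
toy <- data.frame(rank = 1:5, frequency = c(5, 3, 2, 1, 1))
h_toy <- get_h_point(toy)
h_toy
```

Between ranks 2 and 3 the interpolated frequency falls from 3 to 2, so it crosses the line frequency = rank at 2.5. uniroot returns a numerical approximation, so h_toy should be close to, but not necessarily exactly, 2.5.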

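To compute F(h), the cumulative relative frequency up to the h-point (the sum of the frequencies whose rank is less than or equal to the h-point, divided by the number of frequencies that were summed), and plug it into R1, I need two values: one must come from $frequency in tstat_by, and the other must be the corresponding h-point in h_values.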

fh <- function(X,Y) {subset(tstat_by[[X]][[Y]],rank <= h_values[[X]][[Y]])}

This function extracts the frequencies and ranks up to the h-point. Consider the following:

fh31 <- subset(tstat_by[[3]][["text1"]],rank <= h_values[[3]][["text1"]])   #produces a list within which there are frequencies up to h point.
F1_1 <-sum(fh31$frequency) / length(fh31$frequency)    #the cumulative relative frequency up to h_point
R1_1 <- 1 - ( F1_1 - h_values[[3]][["text1"]] * h_values[[3]][["text1"]] / (2 * sum(tstat_by[[3]][["text1"]]$frequency)) )    #produces the lexical richness value (R1); 2 * N is parenthesized so h*h is divided by 2N

fh32 <- subset(tstat_by[[3]][["text2"]],rank <= h_values[[3]][["text2"]])
F1_2 <-sum(fh32$frequency) / length(fh32$frequency)

fh33 <- subset(tstat_by[[3]][["text3"]],rank <= h_values[[3]][["text3"]])
F1_3 <-sum(fh33$frequency) / length(fh33$frequency)

fh34 <- subset(tstat_by[[3]][["text4"]],rank <= h_values[[3]][["text4"]])
F1_4 <-sum(fh34$frequency) / length(fh34$frequency)

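What I need help with is the X and Y arguments of the function above. How do I define them so that I can use lapply over tstat_by? Note that the goal is to write a function that computes R1; what I have shown here is what has been done so far in that direction.

Solution

It is better to build the names of the list items inside the call to list(). That way you don't end up with a lot of "loose" names in your workspace, which are often the source of confusing error messages.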

txt <- list( 
  a = c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.","The United States is committed to advancing prosperity,security,and freedom for both Israelis and Palestinians in tangible ways in the immediate term,which is important in its own right,but also as a means to advance towards a negotiated two-state solution."),a1 = c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.","We believe that this UN agency for so-called refugees should not exist in its current format."),a2 = c("His statement comes amid an ongoing investigation into the crash,with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction."," The US president accused Palestinians of lacking “appreciation or respect.","To create my data I had to chunk each text in an increasing manner.","Therefore,the input is a list of chunked texts within another list."),a3 = c("We plan to restart US economic,development,and humanitarian assistance for the Palestinian people,” the secretary of state,Antony Blinken,said in a statement.","The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare,schooling,and sanitation,including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership,President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis,and restoring the aid is part of that.")
           )
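Naming the items at construction time matters because lapply() carries the names of its input over to its result, so every structure derived from txt (DFMs, Fs, tstat_by, h_values) can later be indexed by a, a1, and so on. A minimal sketch with hypothetical toy strings, using nchar() as a stand-in for the real processing steps:

```r
# Hypothetical toy input, named inside the list() call
txt_toy <- list(
  a  = c("one two", "three"),
  a1 = c("four five six")
)

# lapply() keeps the names of the list it iterates over
n_chars <- lapply(txt_toy, nchar)
names(n_chars)
```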

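My first attempt did not succeed, apparently because character indexing is more flexible than numeric indexing when assigning, as shown here: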

F <- list() # need to put a name in the workspace before assignments by indexing
for( Ls in seq_along(tstat_by) ){ 
  for (items in seq_along(tstat_by[[ls]])){   # note: lowercase `ls` is base R's ls() function, not the loop variable Ls
    F[[Ls]][[items]] <- #doesn't work (throws error)
      R[[Ls]][[items]] <-    

After struggling with numeric indexing on the LHS, I gave up and immediately succeeded with character indexing.
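One likely reason for the difference, sketched below with hypothetical object names: a nested assignment such as F[[i]][[j]] <- value first has to read F[[i]]. On an empty list, reading a missing element by name returns NULL, but reading a missing element by position is an error. Preallocating the outer list is one way to make the numeric version work; this is an illustration, not the approach used in this answer.

```r
empty <- list()

# Reading a missing element by name from a list returns NULL ...
by_name <- empty[["a"]]

# ... whereas reading a missing element by position raises an error.
by_pos_failed <- tryCatch({ empty[[1]]; FALSE }, error = function(e) TRUE)

# With the outer list preallocated, numeric indexing on the left-hand side works.
F_num <- vector("list", 2)
F_num[[1]][[1]] <- 2.5
F_num[[2]][[1]] <- 4
```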

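If you use character indexing with just a few objects, you avoid creating an unknown number of loose objects and instead keep everything in one (or, in this case, two) structures. We will call them F and R.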

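str(tstat_by)
str(h_values)
F <- list()
R <- list() # need list names in the workspace before indexed assignments
for( Ls in names(tstat_by) ){  
  for (item in names(h_values[[Ls]]) ){  
        #produces a list within which there are frequencies up to h point.
        temp <-  subset(tstat_by[[Ls]][[item]],rank <= h_values[[Ls]][[item]])  
        #the calc cumulative relative frequency up to h_point
      F[[Ls]][item] <- sum(temp$frequency) / length(temp$frequency)  
        #produces the lexical richness value (R1); the sum() term is N, the total token count
      R[[Ls]][[item]] <- 1 - ( F[[Ls]][[item]] - 
                               h_values[[Ls]][[item]] * h_values[[Ls]][[item]] / 
                                      (2 * sum(tstat_by[[Ls]][[item]]$frequency)) )
                      }}

This is not my area of domain expertise, so I don't know whether the magnitude or sign of these results is correct. (One of the basics of writing a good question is specifying what a correct answer looks like.) The values printed below come from the loop as first posted, whose last term indexed tstat_by[Ls][[item]] (which is NULL, so N summed to 0) and multiplied by N instead of dividing by 2N. Those values are therefore simply 1 - F, and the loop above will give different numbers.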

> R
$a
$a$text1
[1] -1

$a$text2
[1] -2.5


$a1
$a1$text1
[1] -1

$a1$text2
[1] 0


$a2
$a2$text1
[1] -1.5

$a2$text2
[1] 0

$a2$text3
[1] -1

$a2$text4
[1] -1


$a3
$a3$text1
[1] -2.5

$a3$text2
[1] -2

$a3$text3
[1] -1.5
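As an alternative to the explicit loop, elements at the same position in two lists can be paired with Map() or mapply(). The following is a minimal sketch, not the method used in this answer: r1_one, make_tab, tstat_toy and h_toy are hypothetical names, the toy numbers are made up, and F(h) follows the definition given in the question (sum of the frequencies up to h divided by how many were summed).

```r
# Hypothetical stand-ins for tstat_by and h_values (made-up numbers)
make_tab <- function(freq) data.frame(rank = seq_along(freq), frequency = freq)

tstat_toy <- list(
  a  = list(text1 = make_tab(c(5, 3, 2, 1, 1)), text2 = make_tab(c(4, 2, 1))),
  a1 = list(text1 = make_tab(c(6, 4, 3, 1)))
)
h_toy <- list(
  a  = c(text1 = 2.5, text2 = 2),
  a1 = c(text1 = 3)
)

# R1 for one frequency table and its h-point
r1_one <- function(tab, h) {
  upto_h <- tab$frequency[tab$rank <= h]   # frequencies up to the h-point
  F_h <- sum(upto_h) / length(upto_h)      # F(h) as defined in the question
  N <- sum(tab$frequency)                  # total number of tokens
  1 - (F_h - h^2 / (2 * N))
}

# Outer Map() pairs the sublists; inner mapply() pairs each table with its h-point
R1 <- Map(function(tabs, hs) mapply(r1_one, tabs, hs), tstat_toy, h_toy)
R1
```

With the real data, the same call would take tstat_by and h_values in place of the toy lists, provided both have the same shape and order, which holds when h_values is computed from tstat_by as in the question.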
