Tokenizing a PDF for quantitative analysis

Problem description

I am having trouble using the unnest_tokens function on a data_frame. I am working with PDF files that I want to compare.

text_path <- "c:/.../text1.pdf"
text_raw <- pdf_text(text_path)
text1df <- data_frame(Zeile = 1:25, text_raw)

So far, so good. But here is my problem:

  unnest_tokens(output = token,input = content) -> text1_long

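Error: Must extract column with a single valid subscript. x Subscript `var` has the wrong type `function`. i It must be numeric or character.

I want to tokenize my PDF files so that I can analyze word frequencies and possibly compare several PDF files using wordclouds.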

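Solution

Here is a simple piece of code. I kept your German words so that you can copy and paste everything. Note that input has to name the column that actually holds the text, which is text_raw here; text1df has no content column.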

library(pdftools)
library(dplyr)
library(stringr)
library(tidytext)

file_location <- "d:/.../my_doc.pdf"
text_raw <- pdf_text(file_location)
# One row per page; my document has 12 pages. seq_along() avoids hard-coding that count.
text1df <- tibble(Zeile = seq_along(text_raw), text_raw = text_raw)

text1df_long <- unnest_tokens(text1df,output = wort,input = text_raw ) %>% 
  filter(str_detect(wort,"[a-z]"))

text1df_long
# A tibble: 4,134 x 2
   Zeile wort       
   <int> <chr>      
 1     1 training   
 2     1 and        
 3     1 development
 4     1 policy     
 5     1 contents   
 6     1 policy     
 7     1 statement  
 8     1 scope      
 9     1 induction  
10     1 training   
# ... with 4,124 more rows
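
From here, the word frequencies you asked about are one count() away. The following is only a minimal sketch, not a tested pipeline for your files: the two character vectors stand in for the output of pdf_text() (one string per page, made-up text), and tokenize_pages, dokument, alle_woerter and wort_freq are hypothetical names I introduced for illustration.

```r
library(dplyr)
library(tidytext)

# Stand-ins for pdf_text() output: one string per page (made-up text)
pages_a <- c("Training and development policy", "Policy statement and scope")
pages_b <- c("Induction training for new staff", "Training records and policy review")

# Hypothetical helper: one row per word, tagged with a document label
tokenize_pages <- function(pages, doc_name) {
  tibble(dokument = doc_name, Zeile = seq_along(pages), text_raw = pages) %>%
    unnest_tokens(output = wort, input = text_raw) %>%
    filter(grepl("[a-z]", wort))
}

# Stack the tokens of both documents into one long table
alle_woerter <- bind_rows(
  tokenize_pages(pages_a, "text1"),
  tokenize_pages(pages_b, "text2")
)

# Word frequencies per document, most frequent first
wort_freq <- alle_woerter %>%
  count(dokument, wort, sort = TRUE)

wort_freq
```

With real files you would replace pages_a and pages_b with the result of pdf_text() for each PDF. Keeping a dokument column means one table holds all documents, so the same counts can be filtered or compared per file before plotting them.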

