Opening a PDF from a web page in R

Problem description

I am trying to practice text analysis using the Federal Reserve's FOMC meeting minutes.

I was able to collect all the links to the corresponding PDF files from the page below. https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm

I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").

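The download succeeds; however, when I click the downloaded file, it reports "There was an error opening this document. The file is damaged and could not be repaired." Is there any way to fix this? Is this the Fed's way of preventing web scraping?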

I have 44 links (PDF files) to download and read into R. Is there a way to do this without actually downloading the files?

Solution

library(stringr)
library(rvest)
library(pdftools)

# Scrape the calendar page with rvest and collect every href
p <- rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% rvest::html_attr("href")

# Drop anchors without an href, keep only the FOMC minutes PDFs,
# and resolve the site-relative paths into absolute URLs
pdfs <- pdfs[!is.na(pdfs)]
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*\\.pdf")]
paths <- xml2::url_absolute(pdfs, "https://www.federalreserve.gov/")

# Extract the text of each PDF directly from its URL
# (one character vector of pages per document)
pdf_data <- lapply(paths, pdftools::pdf_text)
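
On the "damaged file" error from the question: a common cause, assuming the download ran on Windows, is that download.file() wrote the PDF in text mode, which alters the bytes. This is not a scraping countermeasure. The sketch below saves the file in binary mode instead. The helper name fetch_pdf is hypothetical and not part of the original answer.

```r
library(pdftools)

# Hypothetical helper (not from the original answer): save one PDF in binary mode.
fetch_pdf <- function(url, dest = file.path(tempdir(), basename(url))) {
  # mode = "wb" writes the bytes unchanged; text mode can corrupt PDFs on Windows
  utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  dest
}

url <- "https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf"
local_pdf <- fetch_pdf(url)

# pdf_text() returns a character vector with one element per page
pages <- pdftools::pdf_text(local_pdf)
length(pages)
```

If a local copy is not needed, passing the URL straight to pdftools::pdf_text(), as in the answer above, avoids managing files on disk.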
