rvest can't find html_node

Problem description

I have some experience using the rvest package to scrape data I need from the web, but I'm running into trouble with this page:

https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html

If you scroll down a bit, you'll see the section where all the schools are listed.


I want the school, case, and location data. I should note that someone asked for this as a CSV on the NYT GitHub, and they recommended that the data is all in the page and can just be pulled from there. So I figured it could be scraped from this page.

But I can't get it to work. Say I just want to start with a simple selector for the first school. I use the inspector to find the xpath.


I get no results:

library(rvest)

URL <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"
pg <- read_html(URL)

# xpath copied from inspector
xpath_first_school <- '//*[@id="school100663"]'

node_first_school <- html_node(pg, xpath = xpath_first_school)

> node_first_school
{xml_missing}
<NA>

I just get {xml_missing}.

Obviously I have a lot of work to do to generalize this and collect all the school data, but with web scraping I usually try to start simple and specific, then broaden out. Even my simple test isn't working, though. Any ideas?
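
One quick sanity check (a small sketch using the pg object from the code above): search the raw HTML string for the id copied from the inspector. If it never appears, the node is being built client-side by JavaScript, which would explain the {xml_missing} result.

# Does the id from the inspector appear anywhere in the static source?
# If not, the element is rendered by JavaScript after page load, and
# read_html()/html_node() will never see it.
grepl("school100663", as.character(pg), fixed = TRUE)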

Solution

Setting up RSelenium can take some time. First, you have to download chromedriver (https://chromedriver.chromium.org/), choosing the version closest to your current Chrome. Then unzip it into your R working directory.
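
If you're not sure which chromever string to pass to rsDriver(), one option (a sketch, assuming the binman package that RSelenium's tooling relies on is installed) is to list the chromedriver versions already cached locally and pick the one closest to your installed Chrome:

library(binman)

# chromedriver versions currently cached by RSelenium/wdman;
# pick the one closest to your installed Chrome for `chromever`
list_versions("chromedriver")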

I tried a package called decapitated, which can scrape JavaScript-rendered sites, but because this page has a "Show more" button that has to be physically clicked before all the data is displayed, I had to use RSelenium to "click" it and get the page source, and then parse it with rvest.

Code:

library(rvest)
library(tidyverse)
library(RSelenium)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"

# start chromedriver and open the page
driver <- rsDriver(browser = "chrome", chromever = "85.0.4183.87", port = 560L)
remote_driver <- driver[["client"]]
remote_driver$navigate(url)

# click "Show more" so that every school is rendered on the page
showmore <- remote_driver$findElement(using = "xpath", value = "//*[@id=\"showall\"]/p")
showmore$clickElement()

# grab the fully rendered page source
test <- remote_driver$getPageSource()

school <- read_html(test[[1]]) %>%
  html_nodes(xpath = "//*[contains(@id,\"school\")]/div[2]/h2") %>%
  html_text() %>%
  as_tibble()

case <- read_html(test[[1]]) %>%
  html_nodes(xpath = "//*[contains(@id,\"school\")]/div[3]/p") %>%
  html_text() %>%
  as_tibble()

location <- read_html(test[[1]]) %>%
  html_nodes(xpath = "//*[contains(@id,\"school\")]/div[4]/p") %>%
  html_text() %>%
  as_tibble()

# drop the first entry of case and location before binding with school
combined_table <- bind_cols(school, case = case[2:nrow(case), ], location = location[2:nrow(location), ])
names(combined_table) <- c("school", "case", "location")

combined_table %>% view()

Output:

# A tibble: 913 x 3
   school                                      case  location               
   <chr>                                       <chr> <chr>                  
 1 University of Alabama at Birmingham*        972   Birmingham, Ala.       
 2 University of North Carolina at Chapel Hill 835   Chapel Hill, N.C.      
 3 University of Central Florida               727   Orlando, Fla.          
 4 University of Alabama                       568   Tuscaloosa, Ala.       
 5 Auburn University                           557   Auburn, Ala.           
 6 North Carolina State University             509   Raleigh, N.C.          
 7 University of Georgia                       504   Athens, Ga.            
 8 Texas A&M University                        500   College Station, Texas 
 9 University of Texas at Austin               483   Austin, Texas          
10 University of Notre Dame                    473   Notre Dame, Ind.       
# ... with 903 more rows
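
As a small housekeeping addition: once the page source has been captured, the browser and the chromedriver server can be shut down so the port is released:

# close the browser window and stop the chromedriver server
remote_driver$close()
driver$server$stop()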

Hope this works for you!


So I'm going to offer an answer here that violates a very important rule described here, and it's generally an ugly solution. But it is a solution, and it saves us from having to use Selenium.

To use html_nodes on this, we would need to kick off JS actions, which requires Selenium. @KWN's solution seems to work on their machine, but I couldn't get chromedriver running on mine. I got almost there using Docker with Firefox or Chrome, but couldn't get results. So I would check that solution first; if it fails, give this a try. Essentially, this site has the data I need exposed as JSON. So I pull the site text, isolate the JSON with a regex, and parse it with jsonlite.

library(jsonlite)
library(rvest)
library(tidyverse)

url <- "https://www.nytimes.com/interactive/2020/us/covid-college-cases-tracker.html"

html_res <- read_html(url)

# get text
text_res <- html_res %>% 
  html_text(trim = TRUE)

# find the area of interest
data1 <- str_extract_all(text_res, "(?<=var NYTG_schools = ).*(?=;)")[[1]]

# get json into data frame
json_res <- fromJSON(data1)

# did it work?
glimpse(json_res)

Rows: 1,515
Columns: 16
$ ipeds_id    <chr> "100663", "199120", "132903", "100751"...
$ nytname     <chr> "University of Alabama at Birmingham",...
$ shortname   <chr> "U.A.B.", "North Carolina", "Central F...
$ city        <chr> "Birmingham", "Chapel Hill", "Orlando"...
$ state       <chr> "Ala.", "N.C.", "Fla.", "Ala.", "Ala."...
$ county      <chr> "Jefferson", "Orange", "Tusc...
$ fips        <chr> "01073", "37135", "12095", "01125", "0...
$ lat         <dbl> 33.50199, 35.90491, 28.60258, 33.21402...
$ long        <dbl> -86.80644, -79.04691, -81.20223, -87.5...
$ logo        <chr> "https://static01.nyt.com/newsgraphics...
$ infected    <int> 972, 835, 727, 568, 557, 509, 504, 500...
$ death       <int> 0, 1,...
$ dateline    <chr> "n", "n", "n"...
$ ranking     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
$ medicalnote <chr> "y", NA, N...
$ coord       <list> [<847052.5, -406444.3>, <1508445.93,...
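
As a quick usage check, the same school / case / location table from the other answer can be rebuilt from json_res with dplyr, using the column names shown in the glimpse() output above:

# rebuild the school / case / location table from the parsed JSON,
# using the nytname, infected, city and state columns shown above
json_res %>%
  transmute(school = nytname,
            case = infected,
            location = paste(city, state, sep = ", ")) %>%
  arrange(desc(case)) %>%
  as_tibble()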