How do I get rid of an error when scraping the web in R?

Question

I am scraping this website and get an error message saying that tibble columns must have compatible sizes.
What should I do in this case?

library(rvest)
library(tidyverse)

url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
map_dfr(
  .x = url,
  .f = function(x) {
    tibble(
      url = x,
      place = read_html(x) %>%
        html_nodes("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
        html_attr("title"),
      price = read_html(x) %>%
        html_nodes("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
        html_text()
    )
  }
) -> df_zomato

Thanks.

Answer

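The problem is that not every restaurant has a complete record. In this example, the 13th item in the list has no price, so the price vector has 14 items while the place vector has 15. tibble() cannot combine columns of different lengths (apart from recycling length-1 values such as url), hence the error.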
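The mismatch can be reproduced offline. The following is a minimal sketch with made-up place names and prices (they are placeholders, not data from the site); it only assumes that tibble() refuses to combine a 15-element column with a 14-element one.

```r
library(tibble)

# Made-up stand-ins for the scraped vectors
place <- paste("Restaurant", 1:15)            # 15 names
price <- paste0("$", seq(10, 140, by = 10))   # only 14 prices

# Catch the error instead of stopping, and keep its message
result <- tryCatch(
  tibble(place = place, price = price),
  error = function(e) conditionMessage(e)
)
print(result)
```

Because the two vectors are built independently, there is also no way to tell from their lengths alone which restaurant is missing its price.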

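One way to solve this is to find the common parent node of each record and then parse those parent nodes with html_node(). Applied to a set of nodes, html_node() always returns exactly one result per node, even when there is no match, in which case the extracted value is NA.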

library(rvest)
library(dplyr)
library(tibble)

url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page=5"
readpage <- function(url) {
   # read the page once
   page <- read_html(url)

   # parse out the parent nodes (one per restaurant)
   results <- page %>% html_nodes("article.search-result")

   # retrieve the place and price from each parent;
   # html_node() yields NA where a parent has no match
   place <- results %>%
      html_node("a.result-title.hover_feedback.zred.bold.ln24.fontsize0") %>%
      html_attr("title")
   price <- results %>%
      html_node("div.res-cost.clearfix span.col-s-11.col-m-12.pl0") %>%
      html_text()

   # return a tibble/data.frame
   tibble(url, place, price)
}

readpage(url)
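The behaviour the function above relies on can be checked without touching the site. This is a small sketch using an inline HTML fragment; the markup, the class name "price", and the bar names are invented for illustration and are much simpler than the real page.

```r
library(rvest)

# Invented fragment: three records, the second one has no price
html <- '<html><body>
  <article class="search-result">
    <a class="result-title" title="Bar A">Bar A</a><span class="price">$20</span>
  </article>
  <article class="search-result">
    <a class="result-title" title="Bar B">Bar B</a>
  </article>
  <article class="search-result">
    <a class="result-title" title="Bar C">Bar C</a><span class="price">$35</span>
  </article>
</body></html>'
page <- read_html(html)

# html_nodes() on the whole page silently skips the missing price
all_prices <- page %>% html_nodes("span.price") %>% html_text()
length(all_prices)   # 2

# html_node() on each parent keeps one slot per record
results <- page %>% html_nodes("article.search-result")
place <- results %>% html_node("a.result-title") %>% html_attr("title")
price <- results %>% html_node("span.price") %>% html_text()
price                # "$20" NA "$35"
```

Both place and price now have length 3, so they line up row by row and the NA marks exactly which record lacks a price.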

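Also note that the code in your question reads the same page more than once (one read_html() call per column). This is slow and puts extra load on the server, and could be viewed as a "denial of service" attack.
It is better to read the page into memory once and then work with that copy.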

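Update
To answer the question about multiple pages: wrap the function above in lapply(), then bind the returned list of data frames (or tibbles) together.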

# listofurls is a character vector of page URLs
dfs <- lapply(listofurls, readpage)
finalanswer <- bind_rows(dfs)
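As a rough sketch of how the list of URLs might be built and processed politely: the page range 1:3, the helper name scrape_all, the one-second pause, and the stand-in fake_reader are all assumptions for illustration, not part of the original answer. In real use, readpage from above would be passed as the reader.

```r
library(dplyr)
library(tibble)

# Assumed page range; adjust to the pages you actually need
base_url <- "https://www.zomato.com/tr/toronto/drinks-and-nightlife?page="
listofurls <- paste0(base_url, 1:3)

# Hypothetical helper: apply `reader` to each URL with a pause
# between requests, then bind the results into one tibble
scrape_all <- function(urls, reader, pause = 1) {
  dfs <- lapply(urls, function(u) {
    Sys.sleep(pause)   # wait before each request to limit server load
    reader(u)
  })
  bind_rows(dfs)
}

# Stand-in reader so the sketch runs offline; swap in readpage for real use
fake_reader <- function(u) {
  tibble(url = u, place = c("Bar A", "Bar B"), price = c("$20", NA))
}

demo <- scrape_all(listofurls, fake_reader, pause = 0)
print(demo)
```

Passing the reader as an argument keeps the looping and pausing logic separate from the scraping itself, which makes it easy to test without network access.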
