Handling 404s and other bad URLs when reading with read_html in R

Problem

Summary: using tryCatch with R's read_html function to handle errors and bad pages

We are using R's read_html to connect to various NCAA sports websites, and we need to detect when something is wrong with a page. Here are some example URLs that point to bad pages:

 - www.newburynighthawks.com (does not exist)
 - http://www.clarkepride.com/sports/womens-basketball/roster/2020-21 (404 not found)
 - https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19 (not found)
 - www.lambuth.edu/athletics/index.html (does not exist)
 - https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19 (page not found)

Each of these URLs fails in its own way when passed to read_html. To work around this, I wrote a function that uses tryCatch to check the validity of these pages, as follows:

library(httr)   # GET(), timeout()
library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports %>%

check_url_validity <- function(this_url) {
  good_url <- FALSE

  # go to url to check for a rosters page
  result <- tryCatch({
    team_page <- this_url %>% GET(., timeout(2)) %>% read_html
    team_page_title <- team_page %>% html_nodes('title') %>% html_text
    team_page_body <- team_page %>% html_nodes('body') %>% html_text
    # a page counts as "good" only if none of these error phrases appear
    good_page <- !grepl('Page not found', team_page_title) &&
      !grepl('Page Not Found', team_page_title) &&
      !grepl('404', team_page_title) &&
      team_page_title != "" &&
      !grepl('Error 404', team_page_body)

    if (good_page) { good_url <- TRUE }
  }, error = function(e) { NA })

  return(good_url)
}

Testing this function on the URLs listed above gives the following:

these_urls <- c(
  'www.newburynighthawks.com',
  'http://www.clarkepride.com/sports/womens-basketball/roster/2020-21',
  'https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19',
  'www.lambuth.edu/athletics/index.html',
  'https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19'
)

for (this_url in these_urls) {
  print(check_url_validity(this_url))
}

Some of these pages (e.g. http://www.newburynighthawks.com/) are easy for tryCatch to flag as bad, because there is no page at all. Others (e.g. http://www.clarkepride.com/sports/womens-basketball/roster/2020-21) rely on string matching in the body to identify the page as bad. Overall this is a hacky solution: we are dealing with roughly 1,000 different URLs here, and we keep appending conditions to the line of code that decides whether good_page is TRUE or FALSE. At the moment we are up to five conditions, most of which use grepl to string-match phrases like '404' or 'Not Found' in the title and body.

Is there a better way to tell that these pages are bad than string matching on '404' or 'Not Found' in the body?

Answer

Instead of trying to read the page content, the code below uses the httr package to issue a HEAD request. This is faster and returns all the information needed.

library(httr)

check_url_validity <- function(this_url){
  # an error here means the request itself failed (e.g. the host does not exist)
  r <- tryCatch(HEAD(this_url), error = function(e) e)
  if(inherits(r, "error")){
    "does not exist"
  } else {
    http_status(r)$reason
  }
}

lapply(urls_vec,check_url_validity)
#[[1]]
#[1] "does not exist"
#
#[[2]]
#[1] "Not Found"
#
#[[3]]
#[1] "Not Found"
#
#[[4]]
#[1] "does not exist"
#
#[[5]]
#[1] "OK"

To return NA/FALSE/TRUE instead, the function below works along the same lines.

check_url_validity2 <- function(this_url){
  r <- tryCatch(HEAD(this_url), error = function(e) e)
  if(inherits(r, "error")){
    NA                     # request failed: host unreachable or nonexistent
  } else {
    status_code(r) < 300   # TRUE for 2xx responses, FALSE otherwise
  }
}

lapply(urls_vec,check_url_validity2)
#[[1]]
#[1] NA
#
#[[2]]
#[1] FALSE
#
#[[3]]
#[1] FALSE
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] TRUE
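With those TRUE/NA/FALSE results in hand, a natural next step is to only call read_html on the URLs that check out. A minimal sketch along those lines (assuming the httr and rvest packages and a urls_vec like the one in the Data section below; `ok` and `good_pages` are illustrative names, not part of the answer above):

```r
library(httr)
library(rvest)

# keep only URLs whose HEAD request succeeded with a 2xx status;
# isTRUE() treats both NA (unreachable host) and FALSE (4xx/5xx) as bad
ok <- vapply(urls_vec, function(u) isTRUE(check_url_validity2(u)), logical(1))
good_pages <- lapply(urls_vec[ok], read_html)
```

This way the slow, error-prone read_html call only ever runs against pages that have already answered a HEAD request successfully.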

Data

urls_vec <- c(
  "www.newburynighthawks.com",
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21",
  "https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19",
  "www.lambuth.edu/athletics/index.html",
  "https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19"
)