html - Web scraping across multiple pages in R

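I'm working on a web scraping program that searches for a specific wine and returns a list of local wines of that variety. The problem I'm running into is multi-page results. The code below is a basic example of what I'm working with.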

library(rvest)

url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)

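For this particular search there are 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there an easy way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I could build all the URLs by hand, but that seems like overkill.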

Solution

If you want all the information as a data.frame, you can do something like this with purrr::map_df():

library(rvest)
library(purrr)

# "%%" is a literal "%" in sprintf(), so "%%20" becomes the URL-encoded space "%20"
url_base <- "http://www.winemag.com/?s=washington%%20merlot&drink_type=wine&page=%d"

map_df(1:39,function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base,i))

  # one row per review on this page; strip " Points" and "$" so only the numbers remain
  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $rating      (chr) "96", "95", "94", "93", "93"...
## $appellation (chr) "Columbia Valley", "Columbia Valley", "...
## $price       (chr) "140", "70", "20", "40", "135", "50", "60", "3...
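
If you would rather not hard-code the query string, or want to stay in base R, the same loop can be sketched roughly as below. This is only an illustration, not part of the answer above: build_page_urls() and scrape_pages() are hypothetical helper names, the one-second pause is an assumed courtesy delay, and scrape_one stands for whatever function turns a single page URL into a data.frame (for example the read_html()/html_nodes() body shown earlier).

```r
# Hypothetical helper: build one search URL per result page.
# The query is passed through "%s" so the "%20" produced by URLencode()
# is never interpreted as part of the sprintf() format string.
build_page_urls <- function(query, n_pages) {
  sprintf("http://www.winemag.com/?s=%s&drink_type=wine&page=%d",
          utils::URLencode(query, reserved = TRUE),
          seq_len(n_pages))
}

# Hypothetical helper: call scrape_one() on every URL and stack the results.
# scrape_one must return a data.frame with the same columns for every page.
scrape_pages <- function(urls, scrape_one, pause = 1) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(pause)  # assumed delay between requests, to go easy on the server
    scrape_one(u)
  })
  do.call(rbind, pages)
}

urls <- build_page_urls("washington merlot", 39)
```

With those in place, something like wines <- scrape_pages(urls, my_page_scraper) would give one combined data.frame, where my_page_scraper is your own per-page function.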
