html - Web scraping across multiple pages in R

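I am working on a web scraping project that searches for a particular wine and returns a list of local wines of that variety. The problem I am running into is multi-page results. The code below is a basic example of what I am using: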
library(rvest)

url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)

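This particular search returns 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there an easy way to make the code loop through all of the returned pages and compile the results from all 39 pages into a single list? I know I could write out all the URLs by hand, but that seems like overkill.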

Solution

If you want all of the information as a data.frame, you can do something similar with purrr::map_df():
library(rvest)
library(purrr)

# "+" encodes the space in the query; %d is filled in with the page number
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  # fetch and parse result page i
  pg <- read_html(sprintf(url_base, i))

  # one row per review on the page
  data.frame(
    wine = html_text(html_nodes(pg, ".review-listing .title")),
    excerpt = html_text(html_nodes(pg, "div.excerpt")),
    rating = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
    appellation = html_text(html_nodes(pg, "span.appellation")),
    price = gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
    stringsAsFactors = FALSE
  )

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating      (chr) "96", "95", "94", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "...
## $ price       (chr) "140", "70", "20", "40", "135", "50", "60", "3...
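
The same loop-and-combine pattern can also be sketched in base R, which may help if purrr is not available. This is only an illustration under some assumptions: scrape_pages() and fake_parse() are hypothetical names, not part of rvest or purrr. fake_parse() stands in for the real read_html()/html_nodes() calls so that the example runs without network access. Skipping failed pages and the optional delay between requests are design choices added here, not something the original answer does.

```r
# Build one URL per result page. "+" stands in for the space in the query,
# so sprintf() only sees the single %d placeholder.
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"
page_urls <- sprintf(url_base, 1:3)

# scrape_pages() is a hypothetical helper: it calls parse_page() on each URL
# and row-binds the resulting data frames. Pages that raise an error are skipped.
scrape_pages <- function(urls, parse_page, delay = 0) {
  pages <- lapply(urls, function(u) {
    Sys.sleep(delay)  # optional pause between requests
    tryCatch(parse_page(u), error = function(e) NULL)
  })
  # drop failed pages (NULL) before combining
  pages <- Filter(Negate(is.null), pages)
  do.call(rbind, pages)
}

# Stand-in parser for illustration only. A real one would call read_html()
# and html_nodes() on the URL, as in the answer above.
fake_parse <- function(u) {
  if (grepl("page=2$", u)) stop("simulated network failure")
  data.frame(url = u, wine = c("Wine A", "Wine B"), stringsAsFactors = FALSE)
}

wines <- scrape_pages(page_urls, fake_parse)
nrow(wines)
```

Here the simulated failure on page 2 is skipped, so wines holds the two rows from page 1 and the two rows from page 3. For the real site, parse_page would be a function that builds the data.frame from the parsed page, and a non-zero delay would space out the requests.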
