Scraping information from meta and button tags with rvest

Question

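I am trying to scrape the average user rating (out of 5 stars) and the number of ratings from a wine retailer's product page. The average star rating appears to be in a button tag, while the number of ratings is in a meta tag.
Here is the HTML: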

<div class="bv_avgrating_component_container notranslate">
    <button
      type="button"
      class="bv_avgrating"
      aria-expanded="false"
      aria-label="average rating value is 4.5 of 5."
      id="avg-rating-button"
      role="link"
      itemprop="ratingValue"
    >
      4.5
    </button>
  </div>
  <div class="bv_numReviews_component_container">
    <meta itemprop="reviewCount" content="95" />
   &nbsp;
   <button
      type="button"
      class="bv_numReviews_text"
      aria-label="Read 95 Reviews"
      aria-expanded="false"
      id="num-reviews-button"
      role="link"
    >
      (95)
    </button>
  </div>

What I have tried:

library(tidyverse)
library(rvest)

x <- "/wine/red-wine/cabernet-sauvignon/amici-cabernet-sauvignon-napa/p/20095750?s=918&igrules=true"
# Download the page once and reuse it for every attempt below
page <- read_html(paste0("https://www.totalwine.com", x))

# Attempt 1: XPath on the itemprop attribute
ratings <- page %>%
  html_nodes(xpath = '//meta[@itemprop="reviewCount"]') %>%
  html_attr("content") # returns character(empty)

# Attempt 2: every meta tag
ratings <- page %>%
  html_nodes("meta") %>%
  html_attr("content") # returns chr [1:33]

# Attempt 3: meta tags nested inside a div
ratings <- page %>%
  html_nodes("div meta") %>%
  html_attr("content") # returns chr [1:21]

# Attempt 4: CSS attribute selector
ratings <- page %>%
  html_nodes("meta[itemprop=reviewCount]") %>%
  html_attr("content") # returns character(empty)

Ultimately, the two values I want to extract are the 4.5 from the button and the content="95" from the meta tag.

Solution

Open the Network tab of your browser's developer tools and reload the page. You will see that the page loads its data from https://www.totalwine.com/product/api/product/product-detail/v1/getProduct/20095750-1?shoppingMethod=INSTORE_PICKUP&state=US-CA&storeId=918 (this is a JSON file). Use it to get the rating and the review count you need:

data <- jsonlite::fromJSON("https://www.totalwine.com/product/api/product/product-detail/v1/getProduct/20095750-1?shoppingMethod=INSTORE_PICKUP&state=US-CA&storeId=918")
rating <- data$customerAverageRating
reviews_count <- data$customerReviewsCount
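
If you want to guard against the response changing shape, something like the following might help. It is only a sketch: `extract_rating` is a hypothetical helper name, and the mock payload is a stand-in containing just the two fields used above (the real response has many more fields, which I have not reproduced here).

```r
library(jsonlite)

# Hypothetical helper: pull the two fields out of the parsed response and
# fail loudly if either one is missing.
extract_rating <- function(data) {
  fields <- c("customerAverageRating", "customerReviewsCount")
  missing_fields <- setdiff(fields, names(data))
  if (length(missing_fields) > 0) {
    stop("Missing fields in response: ", paste(missing_fields, collapse = ", "))
  }
  list(
    rating = as.numeric(data$customerAverageRating),
    reviews_count = as.integer(data$customerReviewsCount)
  )
}

# Mock payload (an assumption for illustration, not the real API response)
mock <- fromJSON('{"customerAverageRating": 4.5, "customerReviewsCount": 95}')
result <- extract_rating(mock)
```

With the live endpoint you would pass the result of the `jsonlite::fromJSON(...)` call above instead of `mock`.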

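Update: if you are new to web scraping, you may wonder why I do not use rvest at all here. The reason is that this page generates its content with JavaScript, and rvest cannot run JavaScript; it only reads the HTML as the server delivers it, before any scripts have executed.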
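
To illustrate the point, here is a small sketch that runs rvest against a static copy of the markup from the question (trimmed to the relevant attributes). Selectors of this kind should work when the elements are actually present in the HTML that rvest receives, which suggests the empty results above come from the content being injected later by JavaScript. The selectors here are my own choice, not the only ones that would work.

```r
library(rvest)

# Static copy of the markup from the question, so no JavaScript is involved
snippet <- '
<div class="bv_avgrating_component_container notranslate">
  <button type="button" class="bv_avgrating" itemprop="ratingValue">
    4.5
  </button>
</div>
<div class="bv_numReviews_component_container">
  <meta itemprop="reviewCount" content="95" />
  <button type="button" class="bv_numReviews_text">(95)</button>
</div>'

page <- read_html(snippet)

# Average rating: text content of the button, with whitespace trimmed
avg_rating <- page %>%
  html_element("button.bv_avgrating") %>%
  html_text(trim = TRUE) %>%
  as.numeric()

# Review count: the content attribute of the meta tag
review_count <- page %>%
  html_element(xpath = '//meta[@itemprop="reviewCount"]') %>%
  html_attr("content") %>%
  as.integer()
```

On the live page these same calls come back empty because the elements are not in the downloaded HTML, which is why going to the JSON endpoint directly is the simpler route.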