RSelenium: scraping a dynamic table with empty cells

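Question

I am trying to scrape the data from the table on the following website: https://www.iea.org/data-and-statistics/data-tables?country=WORLD

I am using RSelenium, and I do get the information I need. The problem is that the table on the website contains empty cells, and nothing in the text output I end up with marks where they were. As a result, I cannot reproduce the original table in R.

Can you think of a way to scrape the table and reproduce it in R?

Thanks for your support. The original code is provided below.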

library(RSelenium)
library(tidyverse)

driver <- RSelenium::rsDriver(browser = "chrome",chromever =
                                system2(command = "wmic",args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',stdout = TRUE,stderr = TRUE) %>%
                                stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
                                magrittr::extract(!is.na(.)) %>%
                                stringr::str_replace_all(pattern = "\\.",replacement = "\\\\.") %>%
                                paste0("^",.) %>%
                                stringr::str_subset(string =
                                                      binman::list_versions(appname = "chromedriver") %>%
                                                      dplyr::last()) %>%
                                as.numeric_version() %>%
                                max() %>%
                                as.character())

remote_driver <- driver[["client"]] 
remote_driver$navigate("https://www.iea.org/data-and-statistics/data-tables?country=WORLD")

out <- remote_driver$findElement(using = "class",value="m-data-table")

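# getElementText() returns a list holding the element's visible text
data <- out$getElementText()
# Turn line breaks into ";" and split the text into individual tokens
data <- gsub("\n", ";", data)
data <- strsplit(data, ";")
data <- gsub("ktoe", "Ktoe", data[[1]])
# Replace whitespace + lowercase letter with "_" + the uppercased letter
data <- gsub(pattern = "\\s+([a-z])", replacement = "_\\U\\1", x = data, perl = TRUE)
data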

Answer

You probably want to scrape the element's inner HTML rather than its inner text. The HTML keeps the table structure, including the empty cells, so it can be parsed directly into a data frame:

dtab <- out$getElementAttribute("innerHTML")

result <- dtab[[1]] %>%
  
  # convert from html table to data frame
  xml2::read_html() %>%
  rvest::html_table() %>%
  as.data.frame() %>%
  
  # remove the "ktoe" row
  filter(row_number() != 1) %>% 
  
  # convert to ASCII encoding
  mutate(across(everything(),~iconv(.x,"utf-8","ASCII",sub = ""))) %>%
  
  # convert all except first column to integers
  mutate(across(-one_of("Var.1"),as.integer))
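
To see why parsing the HTML helps, here is a minimal sketch on a small, made-up table. The `html` string is a hypothetical stand-in for what `getElementAttribute("innerHTML")` returns, assuming the element wraps a complete `<table>`; the product names and numbers are invented for illustration only.

```r
library(rvest)

# Hypothetical stand-in for the scraped innerHTML (not real IEA data).
# The "Oil" row has an empty cell in the 2017 column.
html <- '<table>
  <tr><th>Product</th><th>2017</th><th>2018</th></tr>
  <tr><td>Coal</td><td>10</td><td>12</td></tr>
  <tr><td>Oil</td><td></td><td>7</td></tr>
</table>'

# Parse the HTML; html_table() returns one data frame per <table> found
doc <- xml2::read_html(html)
tbl <- rvest::html_table(doc)[[1]]

# The empty cell keeps its position in the grid instead of vanishing
print(tbl)
dim(tbl)
```

Because each `<td>` maps to one cell, the "Oil" row should still have three columns, with the empty cell appearing as a missing or empty value rather than shifting the later values to the left.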
