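Problem description
I am trying to scrape the data from the table on the following website: https://www.iea.org/data-and-statistics/data-tables?country=WORLD
I am using RSelenium and I do get the information I need. The problem is that the table on the website contains empty cells, and the text output gives no indication of where they are. As a result, I cannot reproduce the original table in R.
Can you think of a way to scrape the table and reproduce it in R?
Thank you for your support. The original code is provided below.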
library(RSelenium)
library(tidyverse)
# Start Chrome with the installed chromedriver version that matches the local
# Chrome build (Windows only: the Chrome version is read with wmic)
driver <- RSelenium::rsDriver(
  browser = "chrome",
  chromever =
    system2(command = "wmic", args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value', stdout = TRUE, stderr = TRUE) %>%
    stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
    magrittr::extract(!is.na(.)) %>%
    stringr::str_replace_all(pattern = "\\.", replacement = "\\\\.") %>%
    paste0("^", .) %>%
    stringr::str_subset(string =
      binman::list_versions(appname = "chromedriver") %>%
      dplyr::last()) %>%
    as.numeric_version() %>%
    max() %>%
    as.character())
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.iea.org/data-and-statistics/data-tables?country=WORLD")
out <- remote_driver$findElement(using = "class",value="m-data-table")
data <- out$getElementText()
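# split the text output into one element per line
data <- gsub("\n", ";", data)
data <- strsplit(data, ";")
data <- gsub("ktoe", "Ktoe", data[[1]])
# join multi-word labels with "_" and capitalise the following letter
data <- gsub(pattern = "\\s+([a-z])", replacement = "\\_\\U\\1", x = data, perl = TRUE)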
data
Solution
You may want to scrape the element's inner HTML rather than its inner text. The markup keeps the empty cells, so the table can be parsed with its rows and columns intact:
dtab <- out$getElementAttribute("innerHTML")
result <- dtab[[1]] %>%
  # convert from html table to data frame
  xml2::read_html() %>%
  rvest::html_table() %>%
  as.data.frame() %>%
  # remove the "ktoe" row
  filter(row_number() != 1) %>%
  # convert to ASCII encoding
  mutate(across(everything(), ~iconv(.x, "utf-8", "ASCII", sub = ""))) %>%
  # convert all except first column to integers
  mutate(across(-one_of("Var.1"), as.integer))
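To see why parsing the HTML helps, here is a minimal sketch on a made-up table. The HTML string and the column names are hypothetical and not taken from the IEA page, and the sketch assumes the xml2 and rvest packages are installed. The empty cells are still present in the markup as empty strings, and rvest::html_table() should turn them into NA in numeric columns, so every row keeps the same number of columns:

```r
library(xml2)
library(rvest)

# Hypothetical table with two empty cells (not the real IEA markup)
html <- '<table>
  <tr><th>Flow</th><th>Coal</th><th>Oil</th></tr>
  <tr><td>Production</td><td>100</td><td></td></tr>
  <tr><td>Imports</td><td></td><td>50</td></tr>
</table>'

page <- read_html(html)

# Every <td> is still present in the markup, including the empty ones
cells <- xml_text(xml_find_all(page, "//td"))

# html_table() returns a list with one table per <table> element
tab <- as.data.frame(html_table(page)[[1]])
print(tab)
```

With the inner text alone, the two blank cells would simply be missing from the output, which is what makes the rows impossible to realign afterwards.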