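Problem description
I am trying to scrape the following web page: https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020 I am interested in the last table, located under "Stockholm Weather History for ...".
With the code below, I can get the information for the first day of the month, but I don't know how to get it for the following days. If I change the day in the drop-down list, the URL does not change. How can I scrape this table for every day of a month?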
library(tidyverse)
library(rvest)
library(RSelenium)
library(stringr)
library(dplyr)

# Start a Selenium server with a Chrome browser and get the client handle
rD <- rsDriver(browser = "chrome", port = 4234L, chromever = "85.0.4183.83")
remDr <- rD[["client"]]

# Open the page for March 2020
remDr$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")

# Find the elements with class "sticky-wr" and read the text of the first one
webElems <- remDr$findElements(using = "class name", value = "sticky-wr")
s <- webElems[[1]]$getElementText()
s <- as.character(s)
print(s)
Solution
It looks like you can extract the table with rvest alone; RSelenium is not needed here. The table may need some cleaning, though.
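library(rvest)

url <- 'https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020'

# Read the page, parse every table on it, and keep the third one;
# its first row is used as the column names.
url %>%
  read_html() %>%
  html_table() %>%
  .[[3]] %>%
  setNames(.[1,]) -> tmp

# Drop the first row (now used as the names) and the last row
tmp[-c(1,nrow(tmp)),]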
# Time Temp Weather Wind Humidity Barometer Visibility
#2 0:20.Aha 01 Mac 2 °C Light rain. Mostly cloudy. 20 km/h ↑ 93% 988 mbar 5 km
#3 0:50. 2 °C Drizzle. Low clouds. 13 km/h ↑ 93% 988 mbar N/A
#4 1:20. 2 °C Drizzle. Low clouds. 15 km/h ↑ 100% 987 mbar 9 km
#5 1:50. 2 °C Drizzle. Low clouds. 15 km/h ↑ 100% 987 mbar 8 km
#6 2:20. 2 °C Light rain. Low clouds. 19 km/h ↑ 100% 986 mbar 6 km
#7 2:50. 2 °C Light rain. Low clouds. 19 km/h ↑ 100% 985 mbar 4 km
#...
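
As a rough illustration of the cleaning mentioned above, here is a minimal sketch in base R. It does not touch the website: `tmp_sample` is a hand-typed copy of three rows from the output shown above, with simplified columns, and `first_number` is a hypothetical helper name. It assumes each measurement column holds at most one number followed by a unit, and that negative values use a plain ASCII minus sign, which may not hold for the real page.

```r
# Hand-typed sample mimicking three rows of the scraped output (an assumption,
# not live data; the real table may contain extra unnamed columns).
tmp_sample <- data.frame(
  Time       = c("0:20.Aha 01 Mac", "0:50.", "1:20."),
  Temp       = c("2 °C", "2 °C", "2 °C"),
  Weather    = c("Light rain. Mostly cloudy.", "Drizzle. Low clouds.", "Drizzle. Low clouds."),
  Wind       = c("20 km/h", "13 km/h", "15 km/h"),
  Humidity   = c("93%", "93%", "100%"),
  Barometer  = c("988 mbar", "988 mbar", "987 mbar"),
  Visibility = c("5 km", "N/A", "9 km"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first number found in each string,
# or NA when the string contains no number (e.g. "N/A").
first_number <- function(x) {
  m <- regexpr("-?[0-9]+(\\.[0-9]+)?", x)
  out <- rep(NA_real_, length(x))
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

clean <- data.frame(
  # Keep only the leading h:mm part of the Time column
  time          = sub("^([0-9]{1,2}:[0-9]{2}).*$", "\\1", tmp_sample$Time),
  temp_c        = first_number(tmp_sample$Temp),
  weather       = tmp_sample$Weather,
  wind_kmh      = first_number(tmp_sample$Wind),
  humidity_pct  = first_number(tmp_sample$Humidity),
  pressure_mbar = first_number(tmp_sample$Barometer),
  visibility_km = first_number(tmp_sample$Visibility),
  stringsAsFactors = FALSE
)

print(clean)
```

The same helper could be applied to the columns of `tmp` once the header and footer rows have been dropped, provided the column names match those in the sample.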