How to use RSelenium in R

Problem description

I am trying to scrape the links to all the minutes and agendas provided on this site: https://www.charleston-sc.gov/AgendaCenter/

I managed to scrape the section IDs associated with each category (and each category's years) so that I can iterate over the content within each category-year (see below). But I don't know how to scrape the hrefs contained in that content. In particular, since the links to the agendas sit in a dropdown menu under "Download", it seems I need an extra click before I can scrape those hrefs.

How can I scrape the minutes and the agendas (in the Download dropdown) for each table I select? Ideally, I would like a table with the date, the agenda title, the link to the minutes, and the link to the agenda.

I am using RSelenium for this. Please see my code so far below, which lets me click through each category and year, but not much else. Please help!

rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)

t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/',encoding = 'UTF-8')
co <- str_match(t,'aria-label="(.*?)"[ ]href="java')[,2] 
yr <- str_match(t,'id="(.*?)" aria-label')[,2]

df <- data.frame(cbind(co,yr)) %>%
  mutate_all(as.character) %>%
  filter_all(any_vars(!is.na(.))) %>%
  mutate(id = ifelse(grepl('^a0',yr), gsub('a0','',yr), NA)) %>%
  tidyr::fill(c(co,id),.direction='down')%>% drop_na(co)

remDr <- remoteDriver(port=4445L,browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = T)

for (j in unique(df$id)){
  remDr$findElement(using = 'xpath',value = paste0('//*[@id="cat',j,'"]/h2'))$clickElement()
  
  for (k in unique(df[which(df$id==j),'yr'])){
    remDr$findElement(using = 'xpath',value = paste0('//*[@id="',k,'"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
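Inside the inner loop, one way to collect the hrefs would be to find the download links and read their `href` attributes. A sketch follows; the browser part assumes a live Selenium session in `remDr` and cannot run without one, and the CSS selector `a[href*="ViewFile"]` is an assumption based on the site's link structure. The string work afterwards runs on its own:

```r
# Sketch: with a live RSelenium session, collect the download links after a
# category/year has been expanded (requires a running Selenium server):
#   links <- remDr$findElements(using = 'css selector', 'a[href*="ViewFile"]')
#   hrefs <- unlist(lapply(links, function(el) el$getElementAttribute('href')))

# Pure string work, runnable without a browser: keep only the ViewFile links
# and turn relative paths into absolute URLs.
to_absolute <- function(hrefs, base = 'https://www.charleston-sc.gov') {
  hrefs <- hrefs[grepl('ViewFile', hrefs)]
  ifelse(grepl('^http', hrefs), hrefs, paste0(base, hrefs))
}

to_absolute(c('/AgendaCenter/ViewFile/Agenda/_10142017-123', '/AgendaCenter/'))
```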

Solution

Maybe you don't really need to click through all the elements? You can use the fact that all the downloadable links contain ViewFile in their href:

t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/',encoding = 'UTF-8')

viewfile <- str_extract_all(t,'.*ViewFile.*',simplify = T)
viewfile <- viewfile[viewfile!='']

library(data.table) # I use data.table because it's more convenient - but can be done without too
dt.viewfile <- data.table(origStr=viewfile)

# list the elements and patterns we will be looking for:
searchfor <- list(
  Title='name=[^ ]+ title=\"(.+)\" href',Date='<strong>(.+)</strong>',href='href=\"([^\"]+)\"',label= 'aria-label=\"([^\"]+)\"'
)

for (this.i in names(searchfor)){
  this.full <- paste0('.*',searchfor[[this.i]],'.*');
  dt.viewfile[grepl(this.full,origStr),(this.i):=gsub(this.full,'\\1',origStr)]
}
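To see how this extraction works, here is the Title pattern applied to a made-up line of HTML (the line is an illustration, not taken from the site):

```r
# gsub with the full pattern replaces the whole line with the capture group (\1):
pat  <- '.*name=[^ ]+ title="(.+)" href.*'
line <- '<a name=Agenda1 title="City Council Agenda" href="/ViewFile/1">'
gsub(pat, '\\1', line)  # -> "City Council Agenda"
```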

# Clean records:
dt.viewfile[,`:=`(Title=na.omit(Title),Date=na.omit(Date),label=na.omit(label)),by=href]
dt.viewfile[,Date:=gsub('<abbr title=".*">(.*)</abbr>','\\1',Date)]
dt.viewfile <- unique(dt.viewfile[,.(Title,Date,href,label)]); # 690 records

What you get is a table with links to all the downloadable files. Now you can download them with any tool you like, for example download.file() or GET():

dt.viewfile[,full.url:=paste0('https://www.charleston-sc.gov',href)]
dt.viewfile[,filename:=fs::path_sanitize(paste0(Title,' - ',Date),replacement = '_')]

for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
  url <- dt.viewfile[i,full.url]
  destfile <- dt.viewfile[i,filename]
  
  cat('\nDownloading',url,' to ',destfile)
  
  
  fil <- GET(url,write_disk(destfile))
  
  # our destination file doesn't have an extension, we need to get it from the server:
  serverFilename <- gsub("inline;filename=(.*)","\\1",headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)
  
  # Adding the extension to the file we just saved
  file.rename(destfile,paste0(destfile,'.',serverExtension))
  
}
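The header parsing at the end of the loop can be checked in isolation. Here it is with a made-up Content-Disposition value; in the loop the real header comes from `headers(fil)`:

```r
# Pull the filename out of a Content-Disposition header, then take its extension:
cd <- 'inline;filename=agenda_10142017.pdf'  # illustrative value
serverFilename  <- gsub('inline;filename=(.*)', '\\1', cd)
serverExtension <- tools::file_ext(serverFilename)
serverExtension  # -> "pdf"
```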

Now the only remaining problem is that the original web page only shows records from the past 3 years. But instead of clicking View More with RSelenium, we can simply load the page with an earlier date range, like this:

t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017',encoding = 'UTF-8')

Then repeat the rest of the code as needed.
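If you want other windows, the dates in that search URL can be generated rather than typed by hand. A small helper, assuming the query parameters `term`, `CIDs`, `startDate` and `endDate` behave as in the URL above:

```r
# Build the AgendaCenter search URL for an arbitrary date range:
make_search_url <- function(start, end) {
  sprintf('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=%s&endDate=%s',
          format(start, '%m/%d/%Y'), format(end, '%m/%d/%Y'))
}

make_search_url(as.Date('2014-10-14'), as.Date('2017-10-14'))
```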
