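How do I use purrr's map functions in R to put xml_nodesets created with rvest into a tibble?

Problem

I want to scrape a large number of websites. To do this, I first read each site's HTML source and store it as an xml_nodeset. Since I only need the content of the sites, I then extract each site's content from the xml_nodesets. I wrote the following code for this: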

# required packages
library(purrr)
library(dplyr)
library(xml2)
library(rvest)

# URLs of the example sources
test_files <- c(
  "https://en.wikipedia.org/wiki/Web_scraping",
  "https://en.wikipedia.org/wiki/Data_scraping"
)

# read in the HTML sources, storing them as xml_nodesets
test <- test_files %>%
  map(~ xml2::read_html(.x, encoding = "UTF-8"))

# extract the selected nodes (contents)
test_tbl <- test %>%
  map(~ tibble(
    # scrape contents
    test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
  ))

Unfortunately, this produces the following error:

Error: All columns in a tibble must be vectors.
x Column `test_html` is a `xml_nodeset` object.

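I think I understand the gist of the error, but I can't find a way around it. It is also a bit strange, because I was able to run this code without problems in January and it suddenly stopped working. I suspect package updates are the cause, but installing older versions of xml2, rvest, or tibble did not help either. Also, scraping just a single website produces no error: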

test <- read_html("https://en.wikipedia.org/wiki/Web_scraping",encoding = "UTF-8") %>%
  rvest::html_nodes(xpath = '//*[(@id = "toc")]')
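
A note on the single-page case: it likely succeeds because the node set is never handed to tibble(). The minimal sketch below uses a hypothetical inline HTML string in place of a real URL (an assumption, so that it runs offline) and suggests that tibble() itself rejects the xml_nodeset, with or without map():

```r
library(xml2)
library(rvest)
library(tibble)

# Hypothetical inline page standing in for a real URL, so no network is needed
page <- read_html(
  '<html><body><div id="toc"><ul><li>History</li><li>Techniques</li></ul></div></body></html>'
)

# Same XPath as in the question; the result is an xml_nodeset
toc <- html_nodes(page, xpath = '//*[(@id = "toc")]')

# Passing the node set directly to tibble() is the step that is expected to fail,
# even for a single page and without purrr::map(); the error is captured, not thrown
direct <- tryCatch(tibble(test_html = toc), error = function(e) e)
inherits(direct, "error")
```

If `inherits(direct, "error")` is TRUE on your installation, the problem lies in how the tibble column is built rather than in mapping over several pages.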

Do you have any suggestions on how to solve this? Thank you very much!

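Edit: I removed %>% html_text from ...

test_tbl <- test %>%
  map(~ tibble(
    # scrape contents
    test_html = rvest::html_nodes(.x, xpath = '//*[(@id = "toc")]')
  ))

... because the version with html_text does not produce this error. The edited code does, though.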
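
One possible way around the error, offered as a sketch rather than a confirmed fix: tibble columns must be vectors, so the node set can either be wrapped in list() to form a list-column or be reduced to a character vector with html_text(). The inline HTML strings and the names pages_raw, toc_xpath, nodes_tbl, and text_tbl are hypothetical stand-ins for the Wikipedia pages, chosen so the example runs offline:

```r
library(purrr)
library(dplyr)
library(tibble)
library(xml2)
library(rvest)

# Hypothetical inline pages used in place of the Wikipedia URLs (assumption)
pages_raw <- c(
  '<html><body><div id="toc">Contents A</div><p>Body A</p></body></html>',
  '<html><body><div id="toc">Contents B</div><p>Body B</p></body></html>'
)

# Parse each page; read_html() accepts literal HTML as well as URLs
pages <- map(pages_raw, read_html)

toc_xpath <- '//*[(@id = "toc")]'

# Option 1: keep the node sets, wrapped in list() so the column is a list-column
nodes_tbl <- map(
  pages,
  ~ tibble(test_html = list(html_nodes(.x, xpath = toc_xpath)))
)

# Option 2: extract the text, which yields an ordinary character column,
# then stack the per-page tibbles into one
text_tbl <- bind_rows(map(
  pages,
  ~ tibble(test_text = html_text(html_nodes(.x, xpath = toc_xpath)))
))
```

Option 1 keeps the nodes available for further rvest calls (for example via nodes_tbl[[1]]$test_html[[1]]), while Option 2 is simpler if only the text is needed.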

