R: web scraping with rvest when XML nodes are unclosed (here: problem with html_nodes("br"))

Problem description

I extracted part of a web page (the editorial board page at the URL in the code below) with rvest, using the following code:

library('rvest')
webpage <- read_html(url("https://www.tandfonline.com/action/journalInformation?show=editorialBoard&journalCode=ceas20"))
people <- webpage %>%
  html_nodes(xpath='//*[@id="8af55cbd-03a5-4deb-9086-061d8da288d1"]/div/div/div') %>%
  html_nodes(xpath='//p')

The result is stored in an xml_nodeset named people:

> people
{xml_nodeset (11)}
 [1] <p> <b>Editors:</b> <br> Dr Xyz Anceschi - <i>University of Glasgow <a href="http://www.gla.ac.uk/schools/soci ...
 [2] <p> <b>Editorial Board:</b> <br> Dr Xyz Aliyev - <i>University of Glasgow</i> <br> Professor Richard Berry < ...
 [3] <p> <b>Board of Management:</b> <br> Professor Xyz Berry (Chair) <i>- University of Glasgow</i> <br> Profes ...
 [4] <p> <b>National Advisory Board:</b> <br> Dr Xyz Badcock <i>- University of Nottingham</i> <br> Professor Cath ...

Each element of people contains the names of several individuals, each one following a <br> (which is never closed: there is no </br>).

I tried to split out each person with this code, but it does not work:

sapply(people,function(x)
    {
        x %>%
        html_nodes("br") %>%
        html_text()
    }
)

It only gives me a list of empty strings:

[[1]]
 [1] "" ""

[[2]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

[[3]]
 [1] "" "" "" "" ""

[[4]]
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

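I think the problem comes from the fact that <br> is an unclosed node in the xml_nodeset. Could that be the case?

If so, is there anything else I can do to extract each person from people?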

Solution
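One possible approach, offered as a sketch rather than a verified fix for the live page: <br> is a void element in HTML, so it never has children or text of its own, and html_text() on it returns "" whether or not a closing tag exists. The names are text nodes that sit next to each <br> as siblings inside the <p>, so they can be selected with an XPath that takes the first text node following each <br>. The HTML string, the names in it, and the helper extract_names() below are all hypothetical; the snippet only imitates the shape of the output shown in the question.

```r
library(rvest)

# Hypothetical stand-in for the real page: one <p> shaped like the printed nodeset.
html <- paste0(
  "<div><p> <b>Editors:</b> <br> Dr A Example - <i>University of Glasgow</i> ",
  "<br> Professor B Sample <i>- University of Nottingham</i> </p></div>"
)
people <- html_nodes(read_html(html), xpath = "//p")

# <br> is a void element, so its own text is always empty.
br_text <- html_text(html_nodes(people[[1]], "br"))

# Hypothetical helper: take the first text node that follows each <br> in a <p>.
extract_names <- function(p) {
  nodes <- xml2::xml_find_all(p, "./br/following-sibling::text()[1]")
  txt <- trimws(xml2::xml_text(nodes))
  txt <- sub("\\s*-$", "", txt)  # drop a trailing "-" separator, if present
  txt[nzchar(txt)]               # discard whitespace-only text nodes
}

board <- lapply(people, extract_names)
print(board)
```

In this layout the affiliation appears to sit in the <i> element that follows each name, so html_text(html_nodes(p, "i")) could be used to collect it, assuming every person has one. Separately, note that an XPath starting with // searches the whole document even when applied to a nodeset; .//p restricts the search to descendants of the nodes already matched.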
