Problem description
I am scraping this website. I am particularly interested in extracting the contents of the last script node. So far I have tried the following:
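library(rvest)  # read_html(), html_nodes(), html_text(), html_attr(), %>%
library(xml2)   # xml_find_first()

url <- "https://insolvencyinsider.ca/filing/"
ii <- read_html(url)

# Attempt 1: read the script node as text
fwp <- ii %>%
  html_nodes("body") %>%
  xml_find_first(xpath = "/script[15]") %>%
  html_text() # Not text so I wouldn't expect this to work.
#> character (empty)

# Attempt 2: read window.FWP_JSON as if it were an attribute
fwp <- ii %>%
  html_nodes("body") %>%
  xml_find_first(xpath = "/script[15]") %>%
  html_attr("window.FWP_JSON") # Don't think this makes sense since it's not an attribute?
#> chr NA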
Solution
You can match it with a regular expression and then parse the result with jsonlite.
Regex:
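As a minimal sketch of this approach (an assumption, not necessarily the answerer's exact pattern): it supposes the page source contains a single assignment of the form window.FWP_JSON = {...}; and captures the object literal so jsonlite can parse it. The helper name extract_fwp_json, the pattern, and the sample_source string are all hypothetical and only there for illustration.

```r
library(jsonlite)

# Hypothetical helper: find the object assigned to window.FWP_JSON in raw page
# source and parse it as JSON. The pattern is an assumption about how the page
# writes the assignment, not something taken from the site itself.
extract_fwp_json <- function(page_source) {
  # (?s) lets "." span newlines; the lazy group stops at the first "};"
  pattern <- "(?s)window\\.FWP_JSON\\s*=\\s*(\\{.*?\\});"
  m <- regmatches(page_source, regexec(pattern, page_source, perl = TRUE))[[1]]
  if (length(m) < 2) stop("window.FWP_JSON assignment not found")
  # m[1] is the whole match, m[2] is the captured object literal
  fromJSON(m[2], simplifyVector = FALSE)
}

# Small stand-in for the real page so the sketch runs offline.
sample_source <- '<body><script>window.FWP_JSON = {"preload_data": {"total": 3}};</script></body>'

fwp <- extract_fwp_json(sample_source)
str(fwp)
```

For the real page, one option is to read the source as a single string, for example with paste(readLines(url, warn = FALSE), collapse = "\n"), and pass that to the helper. Because the lazy match stops at the first "};", it would cut the object short if that sequence occurs inside a string value, so the pattern may need a more specific terminator for the actual site.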