问题描述
url <- "https://www.sec.gov/Archives/edgar/data/1061165/0001567619-21-010580.txt"
应该与此链接上的信息相同 https://www.sec.gov/Archives/edgar/data/1061165/000156761921010580/xslForm13F_X01/form13fInfoTable.xml
谢谢
解决方法
该文件似乎是两个嵌套的 XML 文件。我们可以使用以下代码将每个组件提取到 list
中:
txt <- readLines("https://www.sec.gov/Archives/edgar/data/1061165/0001567619-21-010580.txt")
grep("</?XML>",txt)
# [1] 46 101 109 719
txt[grep("</?XML>",txt)]
# [1] "<XML>" "</XML>" "<XML>" "</XML>"
对该文件的简要检查通知 grep
,表明一个 XML 文件启动和停止,然后另一个启动/停止。如果我们保持在范围内,我们可以使用
library(xml2)
first <- as_list(read_xml(paste(txt[47:100],collapse = "")))
str(first)
# List of 1
# $ edgarSubmission:List of 2
# ..$ headerData:List of 2
# .. ..$ submissionType:List of 1
# .. .. ..$ : chr "13F-HR"
# .. ..$ filerInfo :List of 4
# .. .. ..$ liveTestFlag :List of 1
# .. .. .. ..$ : chr "LIVE"
# .. .. ..$ flags :List of 3
# .. .. .. ..$ confirmingCopyFlag :List of 1
# .. .. .. .. ..$ : chr "false"
# .. .. .. ..$ returnCopyFlag :List of 1
# .. .. .. .. ..$ : chr "true"
# .. .. .. ..$ overrideInternetFlag:List of 1
# .. .. .. .. ..$ : chr "false"
# .. .. ..$ filer :List of 1
# .. .. .. ..$ credentials:List of 2
# .. .. .. .. ..$ cik:List of 1
# .. .. .. .. .. ..$ : chr "0001061165"
# .. .. .. .. ..$ ccc:List of 1
# .. .. .. .. .. ..$ : chr "XXXXXXXX"
# .. .. ..$ periodOfReport:List of 1
# .. .. .. ..$ : chr "03-31-2021"
# ..$ formData :List of 3
第二批:
second <- as_list(read_xml(paste(txt[110:718],collapse = "")))
str(second)
# List of 1
# $ informationTable:List of 38
# ..$ infoTable:List of 7
# .. ..$ nameOfIssuer :List of 1
# .. .. ..$ : chr "ADOBE SYSTEMS INCORPORATED"
# .. ..$ titleOfClass :List of 1
# .. .. ..$ : chr "COM"
# .. ..$ cusip :List of 1
# .. .. ..$ : chr "00724F101"
# .. ..$ value :List of 1
# .. .. ..$ : chr "1246613"
# .. ..$ shrsOrPrnAmt :List of 2
# .. .. ..$ sshPrnamt :List of 1
# .. .. .. ..$ : chr "2622406"
# .. .. ..$ sshPrnamtType:List of 1
# .. .. .. ..$ : chr "SH"
# .. ..$ investmentDiscretion:List of 1
# .. .. ..$ : chr "SOLE"
# .. ..$ votingAuthority :List of 3
# .. .. ..$ Sole :List of 1
# .. .. .. ..$ : chr "2622406"
# .. .. ..$ Shared:List of 1
# .. .. .. ..$ : chr "0"
# .. .. ..$ None :List of 1
# .. .. .. ..$ : chr "0"
# ..$ infoTable:List of 7
我不确定如何提取正面内容,我希望这是一个足够好的开始。