R xml2 strips the tag prefix

Problem description

Suppose I want to parse a Microsoft 10-Q SEC XBRL file:

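library(xml2)

# Microsoft 10-Q XBRL instance document on SEC EDGAR
url <- "https://www.sec.gov/Archives/edgar/data/789019/000156459021002316/msft-10q_20201231_htm.xml"
xml <- read_xml(url)

# Select the root's child elements named us-gaap:EarningsPerShareBasic
xml_find_all(xml, "./us-gaap:EarningsPerShareBasic")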

# {xml_nodeset (10)}
#  [1] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000099" unitRef="U_iso4217USD_x ...
#  [2] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20191001_20191231" decimals="2" id="F_000100" unitRef="U_iso4217USD_x ...
#  [3] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20200701_20201231" decimals="2" id="F_000101" unitRef="U_iso4217USD_x ...
#  [4] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20190701_20191231" decimals="2" id="F_000102" unitRef="U_iso4217USD_x ...
#  [5] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_us-gaapChangeInAccountingEstimateByTypeAxis_us-gaapServiceLifeMember_ ...
#  [6] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_us-gaapChangeInAccountingEstimateByTypeAxis_us-gaapServiceLifeMember_ ...
#  [7] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000517" unitRef="U_iso4217USD_x ...
#  [8] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20191001_20191231" decimals="2" id="F_000518" unitRef="U_iso4217USD_x ...
#  [9] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20200701_20201231" decimals="2" id="F_000519" unitRef="U_iso4217USD_x ...
# [10] <us-gaap:EarningsPerShareBasic contextRef="C_0000789019_20190701_20191231" decimals="2" id="F_000520" unitRef="U_iso4217USD_x ...

As shown above, most US XBRL tags carry a namespace prefix; here us-gaap: denotes the accounting standard. However, some xml2 functions, for example:

 xml_name(xml_find_all(xml,"./us-gaap:EarningsPerShareBasic"))
 # [1] "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic"
 # [6] "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic" "EarningsPerShareBasic"

 xml_find_first(xml,"./us-gaap:EarningsPerShareBasic")
 # {xml_node}
 # <EarningsPerShareBasic contextRef="C_0000789019_20201001_20201231" decimals="2" id="F_000099" unitRef="U_iso4217USD_xbrlishares">

strip the prefix.
Imagine a case where I want to collect all the tags and search them by name:

nodes <- xml_find_all(xml, "./*")   # every child element of the root
tags <- xml_name(nodes)             # element names, with the prefix stripped
grep("earnings", tags, ignore.case = TRUE, value = TRUE)

Because xml_name(nodes) strips the prefix, grep does not give me the actual tags.

Is there any way to get the exact tag name of a node?
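
One approach that may help, offered as a sketch rather than a definitive answer: xml_name() accepts an optional ns argument, and when it is given the prefix-to-URI map returned by xml_ns(), the names come back qualified with their prefix. The inline document below is a made-up stand-in for the SEC filing so that the example runs without network access; its values, contextRef ids, and namespace URIs are placeholders, not taken from the real file.

```r
library(xml2)

# Hypothetical miniature XBRL-like document (placeholder URIs and values).
doc <- read_xml('
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:us-gaap="http://example.com/us-gaap"
      xmlns:dei="http://example.com/dei">
  <us-gaap:EarningsPerShareBasic contextRef="c1" decimals="2">2.04</us-gaap:EarningsPerShareBasic>
  <us-gaap:EarningsPerShareDiluted contextRef="c1" decimals="2">2.03</us-gaap:EarningsPerShareDiluted>
  <dei:EntityRegistrantName contextRef="c1">Example Corp</dei:EntityRegistrantName>
</xbrl>')

nodes <- xml_find_all(doc, "./*")

# Default behaviour: local names only, the prefix is dropped.
local_names <- xml_name(nodes)

# Passing the namespace map makes xml_name() return prefix:name.
ns <- xml_ns(doc)
tags <- xml_name(nodes, ns = ns)

# The search now returns the qualified tag names.
hits <- grep("earnings", tags, ignore.case = TRUE, value = TRUE)
print(hits)
```

On the real filing the equivalent call would be xml_name(nodes, ns = xml_ns(xml)). Elements that live in a default (unprefixed) namespace are reported with the generated prefix that xml_ns() assigns, such as d1, so those names will not match the source text exactly.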
