Problem description
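This is my first Stack Overflow question, so please let me know if I am doing anything wrong.
I am trying to parse data using the xml2 package and possibly the pandas package. Below you can find an anonymized snapshot of the data.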
<?xml version="1.0" encoding="utf-8"?>
<a xmlns:xsd="http://www.y.org/y1/y2" xmlns:xsi="http://www.y.org/y1/y3" xmlns="http://x.nl/">
<b1>1</b1>
<b2>2019-07-01T10:01:35.312+02:00</b2>
<b3>xxx</b3>
<b4>xxx</b4>
<b5>
<c>
<d1>
</d1>
<d2>xxxx</d2>
<d3>
<e1>
</e1>
<e2>
<ID>1</ID>
<f2>XXXXXXXXXXX</f2>
<event>
<eventType>start</eventType>
<eventValue>true</eventValue>
<timestamp>2019-10-07T13:45:00.00+02.00</timestamp>
</event>
<event>
<eventType>next</eventType>
<eventValue>itm1</eventValue>
<timestamp>2019-10-07T13:46:00.00+02.00</timestamp>
</event>
<event>
<eventType>next</eventType>
<eventValue>itm2</eventValue>
<timestamp>2019-10-07T13:47:00.00+02.00</timestamp>
</event>
<event>
<eventType>next</eventType>
<eventValue>itm3</eventValue>
<timestamp>2019-10-07T13:48:00.00+02.00</timestamp>
</event>
I would like to create something like the table below.
+-----------+------------+------------------------------+
| EventType | EventValue | timestamp                    |
+-----------+------------+------------------------------+
| start     | true       | 2019-10-07T13:45:00.00+02.00 |
| next      | itm1       | 2019-10-07T13:46:00.00+02.00 |
| next      | itm2       | 2019-10-07T13:47:00.00+02.00 |
| next      | itm3       | 2019-10-07T13:48:00.00+02.00 |
+-----------+------------+------------------------------+
I tried the xml_find_all function to find all events, but I always get {xml_nodeset (0)}.
x <- xml_find_all(data,"//event",xml_ns(data))
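Can someone point me in the right direction, and perhaps also give me a hint on how to create a data frame like the one above? That would be amazing.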
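Solution
This XML file declares several namespaces, including a default one (xmlns="http://x.nl/") that xml2 exposes under the prefix d1. An unprefixed XPath such as //event only matches elements that are in no namespace, which is why the node set comes back empty: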
> xml_ns(data)
d1 <-> http://x.nl/
xsd <-> http://www.y.org/y1/y2
xsi <-> http://www.y.org/y1/y3
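To see the effect in isolation, here is a minimal sketch. It assumes a small stand-in document (the `doc` string below is made up for illustration and only mirrors the default namespace of the data in the question):

```r
library(xml2)

# Hypothetical stand-in document with the same default namespace as the question's data
doc <- read_xml('<a xmlns="http://x.nl/"><event><eventType>start</eventType></event></a>')

# Unprefixed names only match elements in no namespace, so nothing is found
n_plain <- length(xml_find_all(doc, "//event"))

# xml2 maps the default namespace to the prefix d1, which can be used in XPath
n_prefixed <- length(xml_find_all(doc, "//d1:event", xml_ns(doc)))

print(c(plain = n_plain, prefixed = n_prefixed))
```

The first query should find no nodes and the second should find the single event element.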
There are two ways to read nodes from it. The simple way is to strip the default namespace:
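library(xml2)
library(magrittr)  # provides %>%
# `data` is the document already loaded with read_xml()
xml_ns_strip(data)  # removes the default namespace, modifying `data` in place
events <- xml_find_all(data, "//event")
df_event <-
  data.frame(
    EventType  = events %>% xml_find_first("./eventType") %>% xml_text(),
    EventValue = events %>% xml_find_first("./eventValue") %>% xml_text(),
    timestamp  = events %>% xml_find_first("./timestamp") %>% xml_text()
  )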
Alternatively, you can keep the namespaces and add the prefix in the XPath to get the nodes:
events <- xml_find_all(data, "//d1:event")  # d1 is the default namespace
df_event <-
  data.frame(
    EventType  = events %>% xml_find_first("./d1:eventType") %>% xml_text(),
    EventValue = events %>% xml_find_first("./d1:eventValue") %>% xml_text(),
    timestamp  = events %>% xml_find_first("./d1:timestamp") %>% xml_text()
  )
Output:
> df_event
  EventType EventValue                    timestamp
1     start       true 2019-10-07T13:45:00.00+02.00
2      next       itm1 2019-10-07T13:46:00.00+02.00
3      next       itm2 2019-10-07T13:47:00.00+02.00
4      next       itm3 2019-10-07T13:48:00.00+02.00
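If you would rather not repeat the same pipeline for every column, the prefixed approach can be written as a loop over the child element names. This is only a sketch under assumptions: the `doc` string is a made-up, shortened stand-in for the real file, and `fields` and `cols` are illustrative names, not part of xml2.

```r
library(xml2)

# Hypothetical, shortened stand-in for the question's file (same default namespace)
doc <- read_xml('<a xmlns="http://x.nl/">
  <event>
    <eventType>start</eventType>
    <eventValue>true</eventValue>
    <timestamp>2019-10-07T13:45:00.00+02.00</timestamp>
  </event>
  <event>
    <eventType>next</eventType>
    <eventValue>itm1</eventValue>
    <timestamp>2019-10-07T13:46:00.00+02.00</timestamp>
  </event>
</a>')

ns <- xml_ns(doc)                              # d1 <-> http://x.nl/
events <- xml_find_all(doc, "//d1:event", ns)  # one node per event

# Child elements to turn into columns (illustrative choice of names)
fields <- c("eventType", "eventValue", "timestamp")

# For each field, take the first matching child of every event and read its text
cols <- lapply(fields, function(f) {
  xml_text(xml_find_first(events, paste0("./d1:", f), ns))
})
names(cols) <- fields

df_event <- as.data.frame(cols, stringsAsFactors = FALSE)
print(df_event)
```

Because xml_find_first returns one result per event, each column has one entry per event, so the columns line up row by row even if a child element is missing in some event.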