XML in R: remove paragraphs but keep the XML class

Problem description

I am trying to remove some paragraphs from an XML document in R, but I want to keep the XML structure/class. Here is some example text and my failed attempts:

library(xml2)
library(magrittr)  # provides the %>% pipe
text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")
xml_find_all(text, './/caption//p') %>% xml_remove() # deletes the text as well
xml_find_all(text, './/caption//p') %>% xml_text() # removes the paragraphs but also the XML structure

What I would like to end up with is the same document with only the paragraph tags inside the caption removed, so that the caption becomes <caption>The main title A sub title</caption> and the opening paragraph is left untouched.

Solution

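It looks like this takes several steps: find the caption node, copy the text of its paragraphs, remove the paragraph nodes, then write the saved text back into the caption.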

library(xml2)
library(magrittr)

text = read_xml("<paper> <caption><p>The main title</p> <p>A sub title</p></caption> <p>The opening paragraph.</p> </paper>")

# find the caption
caption <- xml_find_all(text, './/caption')

# store the existing paragraph text as a single string
replacement <- caption %>% xml_find_all('.//p') %>% xml_text() %>% paste(collapse = " ")

# remove the paragraph nodes (their text goes with them)
caption %>% xml_find_all('.//p') %>% xml_remove()

# write the stored text back into the caption
xml_text(caption) <- replacement
text  # test
    
{xml_document}
<paper>
   [1] <caption>The main title A sub title</caption>
   [2] <p>The opening paragraph.</p>

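If the document contains more than one caption, you will most likely need to get a vector/list of the caption nodes and then loop over them one at a time, since the code above pastes the text of every matched paragraph into a single string.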
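
The loop described above could look roughly like the following. This is only a sketch: the two-caption document and the name doc are made up for illustration and are not from the question, and it assumes each caption holds nothing but <p> children.

```r
library(xml2)

# Hypothetical document with two captions (not from the original question)
doc <- read_xml("<paper><caption><p>First title</p> <p>First sub title</p></caption><p>Body text.</p><caption><p>Second title</p></caption></paper>")

# All caption nodes, as an xml_nodeset
captions <- xml_find_all(doc, ".//caption")

for (i in seq_along(captions)) {
  caption <- captions[[i]]

  # Paragraphs that belong to this caption only
  paragraphs <- xml_find_all(caption, ".//p")

  # Save their text before the nodes are removed
  replacement <- paste(xml_text(paragraphs), collapse = " ")

  # Drop the paragraph nodes, then write the saved text back
  xml_remove(paragraphs)
  xml_text(caption) <- replacement
}

doc  # inspect the result
```

Because xml2 nodes are references into the underlying document, the changes made to each caption inside the loop are applied to doc itself, so nothing needs to be reassembled afterwards.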