Python: get inner XML without whitespace

Problem description

<?xml version="1.0" encoding="UTF-8"?>
<data>
    <head>
        <version>1.0</version>
        <project>hello,world</project>
        <date>2020-08-15</date>
    </head>
    <file name="helloworld.py"/>
    <file name="helloworld.ps1"/>
    <file name="helloworld.bat"/>
</data>

I need to get the data inside the head element, with no whitespace between the elements, like this:

<version>1.0</version><project>hello,world</project><date>2020-08-15</date>

and then hash it. Right now I have to do some string manipulation to get it onto one line:

import re
import xml.etree.ElementTree as ET

root = ET.parse('myfile.xml').getroot()
header = ET.tostring(root[0]).decode('utf-8')
header = re.sub(r'\n', '', header)
header = re.sub(r'>\s+<', '><', header)
header = header.replace('<head>', '')
header = header.replace('</head>', '')
header = header.strip()

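Is there an easier way to do this? A PowerShell XML object has a simple InnerXML property that gives you the XML inside an element, with no whitespace in the string. Does Python have something that makes this easier?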

Solutions

The following uses no external libraries, only core Python:

import xml.etree.ElementTree as ET

tree = ET.parse('input.xml')
head = tree.find('.//head')
# Rebuild each child as <tag>text</tag>; attributes and nested elements are not preserved
combined = ''.join('<{}>{}</{}>'.format(e.tag, e.text or '', e.tag) for e in head)
print(combined)

input.xml

<?xml version="1.0" encoding="UTF-8"?>
<data>
    <head>
        <version>1.0</version>
        <project>hello,world</project>
        <date>2020-08-15</date>
    </head>
    <file name="helloworld.py"/>
    <file name="helloworld.ps1"/>
    <file name="helloworld.bat"/>
</data>

Output

<version>1.0</version><project>hello,world</project><date>2020-08-15</date>
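
The one-liner above rebuilds each child from e.tag and e.text, so it loses attributes and nested elements. As a rough sketch of a more general standard-library variant, the snippet below removes whitespace-only text and tail strings and then serializes each child with ET.tostring. The helper name inner_xml and the inline XML_DOC string are made up for illustration, and the sketch assumes whitespace-only strings are just indentation that can be discarded.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document (assumption: used here instead of reading input.xml)
XML_DOC = """<data>
    <head>
        <version>1.0</version>
        <project>hello,world</project>
        <date>2020-08-15</date>
    </head>
    <file name="helloworld.py"/>
</data>"""


def inner_xml(element):
    """Hypothetical helper: serialize the children of `element` without layout whitespace."""
    # Drop text and tail strings that contain only whitespace (indentation, newlines).
    # Note that this modifies the parsed tree in place.
    for node in element.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    # ET.tostring keeps attributes and nested elements of each child
    children = "".join(ET.tostring(child, encoding="unicode") for child in element)
    return (element.text or "") + children


root = ET.fromstring(XML_DOC)
head_xml = inner_xml(root.find("head"))
print(head_xml)
```

Whitespace inside values such as "hello, world" is left alone, because only strings that consist entirely of whitespace are removed.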

If you can use an external library, BeautifulSoup is very useful.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

Here is an example using your document.

from bs4 import BeautifulSoup as bs

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
 <data>
 <head>
     <version>1.0</version>
     <project>hello,world</project>
     <date>2020-08-15</date>
 </head>
 <file name="helloworld.py"/>
 <file name="helloworld.ps1"/>
 <file name="helloworld.bat"/>
</data>"""

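# Name the parser explicitly so the result does not depend on which parsers are installed
page_soup = bs(xml_doc, 'html.parser')

# Text of head, still containing the original newlines and indentation
page_soup.head.getText()

# The same text with newlines and spaces removed
page_soup.head.getText().strip().replace('\n','').replace(' ','')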

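This returns the text content of the head tag's children with newlines and spaces removed. Note that getText() drops the tags themselves, so the result is the values only rather than the inner XML, and replace(' ','') also removes any spaces inside the values.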
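
Since getText() discards the tags, here is a minimal sketch of one way to keep the markup with BeautifulSoup: decode_contents() returns the inner markup of a tag as a string, and a regex then removes the whitespace between tags. It assumes Python's built-in html.parser, which lowercases tag and attribute names, so it may not suit XML with mixed-case names. The shortened xml_doc is only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the sample document (assumption: XML declaration omitted for brevity)
xml_doc = """<data>
 <head>
     <version>1.0</version>
     <project>hello,world</project>
     <date>2020-08-15</date>
 </head>
 <file name="helloworld.py"/>
</data>"""

soup = BeautifulSoup(xml_doc, "html.parser")

# Inner markup of <head>, still containing the original indentation
inner = soup.head.decode_contents()

# Remove whitespace that sits between a closing '>' and the next '<'
collapsed = re.sub(r">\s+<", "><", inner).strip()
print(collapsed)
```

Only whitespace between tags is removed here, so spaces inside element values are kept.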

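Every approach can have problems. Some also strip meaningful whitespace, and some become awkward when nodes have attributes. So here is a third way, which may well be imperfect too :)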

from simplified_scrapy import SimplifiedDoc,utils
# xml_doc = utils.getFileContent('myfile.xml')
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
 <data>
 <head>
     <version>1.0</version>
     <project>hello,world</project>
     <date>2020-08-15</date>
 </head>
 <file name="helloworld.py"/>
 <file name="helloworld.ps1"/>
 <file name="helloworld.bat"/>
</data>"""

doc = SimplifiedDoc(xml_doc)
headXml = doc.head.html.strip()  # Get the inner markup of head
print(doc.replaceReg(headXml, r'>\s+<', '><'))  # Remove whitespace between tags with a regex

Result:

<version>1.0</version><project>hello,world</project><date>2020-08-15</date>
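
The question's end goal is to hash this string. The sketch below shows that step under two assumptions that the question does not state: SHA-256 as the hash function and UTF-8 as the encoding. The helper name digest is made up for illustration.

```python
import hashlib

# The collapsed inner XML produced by any of the approaches above
inner = "<version>1.0</version><project>hello,world</project><date>2020-08-15</date>"


def digest(text):
    """Hypothetical helper: hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


print(digest(inner))

# The same elements with a newline between tags hash to a different value,
# which is why the whitespace has to be normalized before hashing
pretty = inner.replace("><", ">\n<")
print(digest(pretty) == digest(inner))  # False
```

Whichever approach produces the string, both sides of a hash comparison need to apply the same whitespace rule and the same encoding.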
