Looping through XML nodes and comparing the text of their elements in Python

Problem description

I am working with an XML file, part of which looks like this:

<tblSampleParts>
    <AnID>1</AnID>
    <jriD>11</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>2</AnID>
    <jriD>16</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>2</AnID>
    <jriD>28</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>2</AnID>
    <jriD>29</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>3</AnID>
    <jriD>5</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>4</AnID>
    <jriD>22</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>5</AnID>
    <jriD>12</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>5</AnID>
    <jriD>18</jriD>
</tblSampleParts>
<tblSampleParts>
    <AnID>6</AnID>
    <jriD>6</jriD>
</tblSampleParts>

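What I want to do is loop through the nodes and compare the values of the "AnID" elements. If an "AnID" value appears more than once, I want to print the text of that AnID and of the corresponding jriD. So for the sample above, I would expect the following to be printed: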

    <AnID>2</AnID>
    <jriD>16</jriD>

    <AnID>2</AnID>
    <jriD>28</jriD>

    <AnID>2</AnID>
    <jriD>29</jriD>

    <AnID>5</AnID>
    <jriD>12</jriD>

    <AnID>5</AnID>
    <jriD>18</jriD>

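I tried this myself: I used the int() function to convert the text to integers and tried to loop over all the nodes, but I got errors such as "string indices must be integers".

At the moment I am using the following code to collect and print the AnID and jriD values: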

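import pandas as pd
from lxml import objectify

path = '0458510148.xml'
# objectify.parse accepts a file path, so no open() handle is left unclosed
parsed = objectify.parse(path)
root = parsed.getroot()

data = []
skip_fields = ['tblProjects', 'tblMeasurementPoints']  # not used in this snippet

# Collect the tag/value pairs of the children of every <tblSampleParts> element
for elt in root.tblSampleParts:
    el_data = {}
    for child in elt.iterchildren():
        el_data[child.tag] = child.pyval
    data.append(el_data)

perf = pd.DataFrame(data)
print(perf)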

The result looks like this:

    AnID  jriD
0      1    11
1      2    16
2      2    28
3      3     5
4      4    22
5      5    12
6      6     6
7      7     1
8      8    17
9      9    18
10    10    10
11    10    13
12    10    24
13    11     2
14    11     8
15    11    14
16    11    25
17    12    10
18    13    13
19    14    24

But I don't know how to print only the AnIDs that occur more than once (together with their corresponding jriDs).

Solution

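I don't think there is much reason to convert the values to integers; you can compare the string values just as well.

You could try the following: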

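Create a dictionary
Iterate over each <tblSampleParts>
- Use the string in <AnID> as k and the string in <jriD> as v
- If the key is not in the dictionary, add the key k with the list [v] as its value
- If the key is already in the dictionary, append v to its list
Iterate over each key-value pair in the dictionary
- If the list in the value contains only one element, skip it
- If the list contains more than one element, this is one of the cases you are looking for.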
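
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: it uses the standard-library xml.etree.ElementTree rather than lxml, it works on a shortened copy of the sample wrapped in a made-up <root> element, and the names SAMPLE and group_by_anid are invented for this example.

```python
import xml.etree.ElementTree as ET

# Shortened sample data, wrapped in a <root> element (an assumption) so it parses
SAMPLE = """<root>
<tblSampleParts><AnID>1</AnID><jriD>11</jriD></tblSampleParts>
<tblSampleParts><AnID>2</AnID><jriD>16</jriD></tblSampleParts>
<tblSampleParts><AnID>2</AnID><jriD>28</jriD></tblSampleParts>
<tblSampleParts><AnID>5</AnID><jriD>12</jriD></tblSampleParts>
<tblSampleParts><AnID>5</AnID><jriD>18</jriD></tblSampleParts>
<tblSampleParts><AnID>6</AnID><jriD>6</jriD></tblSampleParts>
</root>"""


def group_by_anid(xml_text):
    """Map each AnID string to the list of jriD strings seen with it."""
    groups = {}
    for part in ET.fromstring(xml_text).iter("tblSampleParts"):
        k = part.findtext("AnID")
        v = part.findtext("jriD")
        if k not in groups:
            groups[k] = [v]      # first time this key is seen
        else:
            groups[k].append(v)  # key already present: extend its list
    return groups


groups = group_by_anid(SAMPLE)
# Keep only the keys whose list holds more than one value
duplicates = {k: vs for k, vs in groups.items() if len(vs) > 1}

for k, vs in duplicates.items():
    for v in vs:
        print(f"<AnID>{k}</AnID>\n<jriD>{v}</jriD>\n")
```

The keys stay strings throughout, so no int() conversion is involved.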

I'm sure there are better, more efficient, and more Pythonic ways to do this. But this should at least work.

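In any case, for the dictionary approach you can use the string "5" as a key just as well as the integer 5.
However, if you insist on converting the strings to integers and keep getting errors, you may want to check which strings are causing those errors.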
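
To find out which strings are responsible, a small check along these lines may help. It is only a sketch: the helper name find_non_integers and the sample values are made up here, and the real values would be the element texts read from your XML.

```python
def find_non_integers(values):
    """Return the items from `values` that int() cannot convert."""
    bad = []
    for text in values:
        try:
            int(text)
        except (TypeError, ValueError):
            # ValueError: not a valid integer literal; TypeError: e.g. None
            bad.append(text)
    return bad


# Made-up values for illustration; real ones would come from the XML elements
texts = ["1", "2", " 16 ", "28a", "", "5"]
problem_values = find_non_integers(texts)

# repr() makes empty strings and stray whitespace visible
print([repr(t) for t in problem_values])
```

Note that int() tolerates surrounding whitespace, so " 16 " converts without an error, while "28a" and the empty string do not.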

It's a bit involved, but it can be done with xpath:

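from lxml import etree

# note: the xml in the question is not well formed; it needs to be wrapped in a root element
ids = """<root>[your xml above]</root>"""
doc = etree.XML(ids)  # parse the string; the xpath calls below query this element

# distinct AnID values
uniq_anids = set(doc.xpath('//AnID/text()'))
# keep only the values that occur more than once (count() returns a float)
targets = [u_a for u_a in uniq_anids if doc.xpath(f'count(//AnID[text()="{u_a}"])') > 1]
for target in targets:
    # print every child of each <tblSampleParts> whose AnID matches the target
    for tsp in doc.xpath(f'//tblSampleParts[./AnID[text()="{target}"]]/*'):
        print(etree.tostring(tsp).decode())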

The output should be the one shown in your question.

OK, I'll give it a shot:

import lxml
from bs4 import BeautifulSoup

sample_data = """
<xml>
    <tblSampleParts>
        <AnID>1</AnID>
        <JrID>11</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>2</AnID>
        <JrID>16</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>2</AnID>
        <JrID>28</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>2</AnID>
        <JrID>29</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>3</AnID>
        <JrID>5</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>4</AnID>
        <JrID>22</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>5</AnID>
        <JrID>12</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>5</AnID>
        <JrID>18</JrID>
    </tblSampleParts>
    <tblSampleParts>
        <AnID>6</AnID>
        <JrID>6</JrID>
    </tblSampleParts>
</xml>
"""

soup = BeautifulSoup(sample_data,'xml')

parts = soup.find_all('tblSampleParts')

AnIDs = []
JrIDs = []
for p in parts:
    an = p.AnID.text
    AnIDs.append(an)
    jr = p.JrID.text
    JrIDs.append(jr)

for i,a in enumerate(AnIDs):
    if AnIDs.count(a) > 1:
        print(f'<AnID>{a}</AnID>\n<JrID>{JrIDs[i]}</JrID>')

This prints:

<AnID>2</AnID>
<JrID>16</JrID>
<AnID>2</AnID>
<JrID>28</JrID>
<AnID>2</AnID>
<JrID>29</JrID>
<AnID>5</AnID>
<JrID>12</JrID>
<AnID>5</AnID>
<JrID>18</JrID>

I guess that is what you wanted, right?
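
As a side note, AnIDs.count(a) rescans the whole list for every entry. If the file is large, counting once with collections.Counter is one possible alternative. The sketch below uses hard-coded lists as stand-ins for the AnIDs and JrIDs lists built in the loop over the parsed parts; the names counts and repeated are made up for this example.

```python
from collections import Counter

# Stand-ins for the lists collected from the parsed XML
AnIDs = ["1", "2", "2", "2", "3", "4", "5", "5", "6"]
JrIDs = ["11", "16", "28", "29", "5", "22", "12", "18", "6"]

counts = Counter(AnIDs)  # one pass: AnID -> number of occurrences

# Keep the (AnID, JrID) pairs whose AnID occurs more than once
repeated = [(a, j) for a, j in zip(AnIDs, JrIDs) if counts[a] > 1]

for a, j in repeated:
    print(f"<AnID>{a}</AnID>\n<JrID>{j}</JrID>")
```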

Update:

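BeautifulSoup does not open a file path or fetch a web page by itself; you have to hand it the content (a string, bytes, or an already opened file object).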

If you have the data locally as a file data.xml, you can do the following (assuming the file is in the same folder as the script, i.e. using a relative path):

with open('data.xml','r') as f:
    contents = f.read()
    soup = BeautifulSoup(contents,'xml')

If you want to use online data, do the following:

import requests

url = "http://some.url.com/data.xml"
req = requests.get(url)
soup = BeautifulSoup(req.content,'xml')

(This should roughly work; I haven't tried it, so you may need to adjust it a bit.)
