Using iterparse or findall

Problem description

I have an XML document (UTF-8 encoded) with the following structure:

<Group id= "123">
    <rule id= "abc" level= "low">
    <identity>some text</identity>
    <element1>text</element1>
</Group>

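Each document has multiple Group elements. The goal is to parse them into a spreadsheet in which each group is one row containing the group ID, the level, and the text of the identity and element1 elements.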

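I have a script using findall() that works when I parse one document at a time, but when I try to parse several documents in one run it tends to fail with the following error: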

 File "c:/Documents/Python Projects/Bulkparse.py",line 86,in parseall
    writer.writerow(data)
  File "C:\Program Files (x86)\Python\lib\encodings\cp1252.py",line 19,in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x9d' in position 1137: character maps to <undefined>

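I looked up the '\x9d' character code; it appears to be some kind of cross icon, and it does not appear in any of my documents, so I am not sure where or why it occurs.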

Example findall() script:

for child in root.findall('Group'):
  data.append(child.attrib['id'])
  num = child.attrib['id']
  for child in root.findall('Group[@id = "%s"]/Rule'% num ):
    data.append(child.attrib['level'])
    # followed by a for loop for each element needed ending with
    writer.writerow(data)

The above works unless I process a large batch, in which case I get the error shown above.

Is findall() simply too inefficient? I tried writing something with iterparse(), but could not find a way to iterate over each child element. For example:

for  event,elem in context:
    if elem.tag ==f"Group" and event == 'end':
        data.append(elem.attrib['id'])
        num = elem.attrib['id']
        for event,elem in context :
            if elem.tag ==f"Rule" and event == 'end':
                data.append(elem.attrib['level'])
                print(data)

This returns the group ID followed by the level of every rule in every group, e.g. [123, low, high, low, low, low, high, ...].

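Would iterparse be better? If so, is there a way to scope the target element tags to the enclosing Group element, as findall() does? Or is there a way to stop the findall() script from throwing that error? Is there a way to clear memory at the end of each document (assuming that would help)? Any help is much appreciated.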

Solution

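Find the document that causes the problem by splitting the document set in half and continuing to search whichever half still triggers the error. Although you say '\x9d' is not in your document set, it must be present there under a different encoding.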
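As a shortcut to bisecting by hand, each file can be checked directly for characters that the output codec cannot represent. The following is only a sketch: the function name `find_unencodable` and its parameters are hypothetical, it assumes the inputs really are UTF-8, and it assumes the target codec is cp1252 because that is the codec named in the traceback.

```python
from pathlib import Path


def find_unencodable(paths, source_encoding="utf-8", target_encoding="cp1252"):
    """Return (path, position, character) for the first character in each
    file that cannot be encoded with target_encoding."""
    problems = []
    for path in paths:
        # Decode the file with the encoding it is assumed to be stored in.
        text = Path(path).read_text(encoding=source_encoding)
        try:
            # Try the same conversion the CSV writer performs on output.
            text.encode(target_encoding)
        except UnicodeEncodeError as err:
            problems.append((str(path), err.start, text[err.start]))
    return problems
```

Calling it on the whole document set and printing `repr()` of each reported character shows which file contains the offending character and where.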

You haven't said what the character encoding of the XML documents is; perhaps change the XML's character encoding to UTF-8?

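If you cannot find an encoding problem, you could switch the export to an XSL transformation that converts the XML to CSV. That may be the better approach anyway.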

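This problem often comes up when reading files that contain unusual characters. One way to address it is to specify the encoding when opening the XML file: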

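with open('myfile.xml', encoding='utf-8') as myfile:
    # etree.XML() expects a string, not a file object, so parse the open file instead
    root = etree.parse(myfile).getroot()  # or however you import lxml and your file
    for child in root.findall('Group'):
        ...  # process each Group as before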
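Note that the traceback in the question is raised inside writer.writerow, that is, while encoding output with cp1252, so the same explicit encoding can also be given to the CSV file. The sketch below makes several assumptions: it uses the standard library's `xml.etree.ElementTree` in place of lxml, the function names `groups_to_rows` and `write_csv` are hypothetical, and the tag names (`Group`, `Rule`, `identity`, `element1`) are taken from the question. It also looks up each group's rules relative to that group rather than querying again from the root.

```python
import csv
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)


def groups_to_rows(xml_path):
    """Yield one row per Group: id, Rule levels, identity text, element1 text."""
    # Parsing from the path lets the parser use the encoding declared in the XML.
    root = ET.parse(xml_path).getroot()
    for group in root.iter("Group"):
        row = [group.get("id")]
        # Search inside this group only, instead of re-querying from the root.
        row.extend(rule.get("level") for rule in group.iter("Rule"))
        # ".//" matches the element at any depth below the group.
        row.append(group.findtext(".//identity", default=""))
        row.append(group.findtext(".//element1", default=""))
        yield row


def write_csv(xml_paths, csv_path):
    # An explicit encoding avoids the platform default (cp1252 in the traceback).
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for xml_path in xml_paths:
            writer.writerows(groups_to_rows(xml_path))
```

If the spreadsheet program expects a byte order mark, `encoding="utf-8-sig"` is a possible alternative for the output file.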

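This resolves most such problems. Still, I have run into this error often enough that I have sometimes resorted to replacing the troublesome characters in the file before processing it, like this: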

# file_text holds the contents of your file as a string
file_text = file_text.replace('\x9d', '+')
# or whatever other character you want to use to represent a cross.
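As for the iterparse part of the question, one possible approach is to react only to the "end" event of each Group. At that point the group's children are fully parsed and can be searched the same way as with findall(). This is a sketch under assumptions: it uses the standard library's `xml.etree.ElementTree` rather than lxml, the function name `iter_group_rows` is hypothetical, and the tag names come from the question.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)


def iter_group_rows(xml_path):
    """Yield [group id, rule level, rule level, ...] for each Group element."""
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "Group":
            # The Group's subtree is complete here, so search within it.
            row = [elem.get("id")]
            row.extend(rule.get("level") for rule in elem.iter("Rule"))
            yield row
            # Free this Group's children; the emptied element itself
            # stays attached to the root.
            elem.clear()
```

Because each row is built from a single Group element, the levels of different groups no longer run together in one list.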

lxml python python-3.x xml-parsing