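Problem description
Hi, I have searched a lot on this topic for Python, without success. I have a file that can be downloaded from the web, named import.xml, and it starts like this: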
<?xml version="1.0" encoding="ISO-8859-1"?>
<articulos><item>
I want to convert it to UTF-8. Any ideas on where I should start?
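Solution
This combines the Python documentation for xml.etree.ElementTree (The ElementTree XML API) with two StackOverflow answers:
- deceze's answer to the question on parsing XML in Python with an encoding other than UTF-8, and
- Tomalak's answer to the question about the error 'xml.etree.ElementTree.Element' object has no attribute 'write'.
Script: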
from xml.etree import ElementTree
myFileIn = 'ISO_8859_1.xml'
myFileOu = 'utf_8.xml'
# open in binary mode so the parser picks the encoding up from the XML declaration
with open(myFileIn, 'rb') as f:
    root = ElementTree.fromstring(f.read())
# an Element has no write() method, so wrap it in an ElementTree first
tree = ElementTree.ElementTree(root)
tree.write(myFileOu, encoding="utf-8", xml_declaration=True)
Tested on the following myFileIn file (data from Wikipedia):
<?xml version="1.0" encoding="ISO-8859-1"?>
<articulos>
<item>
<table>"upper"</table>
<A_> ~A0 ¡~A1 ¢~A2 £~A3 ¤~A4 ¥~A5 ¦~A6 §~A7 ¨~A8 ©~A9 ª~AA «~AB ¬~AC ~AD ®~AE ¯~AF</A_>
<B_>°~B0 ±~B1 ²~B2 ³~B3 ´~B4 µ~B5 ¶~B6 ·~B7 ¸~B8 ¹~B9 º~BA »~BB ¼~BC ½~BD ¾~BE ¿~BF</B_>
<C_>À~C0 Á~C1 Â~C2 Ã~C3 Ä~C4 Å~C5 Æ~C6 Ç~C7 È~C8 É~C9 Ê~CA Ë~CB Ì~CC Í~CD Î~CE Ï~CF</C_>
<D_>Ð~D0 Ñ~D1 Ò~D2 Ó~D3 Ô~D4 Õ~D5 Ö~D6 ×~D7 Ø~D8 Ù~D9 Ú~DA Û~DB Ü~DC Ý~DD Þ~DE ß~DF</D_>
<E_>à~E0 á~E1 â~E2 ã~E3 ä~E4 å~E5 æ~E6 ç~E7 è~E8 é~E9 ê~EA ë~EB ì~EC í~ED î~EE ï~EF</E_>
<F_>ð~F0 ñ~F1 ò~F2 ó~F3 ô~F4 õ~F5 ö~F6 ÷~F7 ø~F8 ù~F9 ú~FA û~FB ü~FC ý~FD þ~FE ÿ~FF</F_>
</item>
</articulos>
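The round trip above can be checked end to end with a small self-contained sketch. It is only an illustration: the helper name convert_to_utf8, the temporary directory, and the shortened sample content are assumptions, not part of the original script. It relies on ElementTree.parse reading the encoding from the XML declaration.

```python
import os
import tempfile
from xml.etree import ElementTree


def convert_to_utf8(src_path, dst_path):
    """Parse src_path (encoding taken from its XML declaration) and rewrite it as UTF-8."""
    tree = ElementTree.parse(src_path)
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)


# Build a small ISO-8859-1 sample file (hypothetical content) in a temp directory.
sample = ('<?xml version="1.0" encoding="ISO-8859-1"?>\n'
          '<articulos><item>ñ é ü ÿ</item></articulos>')
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "ISO_8859_1.xml")
dst = os.path.join(workdir, "utf_8.xml")
with open(src, "wb") as f:
    f.write(sample.encode("iso-8859-1"))

convert_to_utf8(src, dst)

# Read the result back as raw bytes and re-parse it to compare the text.
with open(dst, "rb") as f:
    converted = f.read()
item_text = ElementTree.fromstring(converted).find("item").text
```

If the conversion worked, item_text equals the original string and the output bytes contain the UTF-8 form of each accented character rather than the single-byte ISO-8859-1 form.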
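I would additionally pass the source encoding to the parser explicitly so that special characters are preserved. The serialized bytes should then be written unchanged: running them through .decode('utf8').encode('iso-8859-1') would produce ISO-8859-1 bytes under a declaration that says UTF-8.
from lxml import etree
source = "input.xml"
with open(source, 'rb') as f:
    # tell the parser the input encoding explicitly
    parser = etree.XMLParser(encoding="iso-8859-1")
    tree = etree.parse(f, parser)
# tostring() with encoding="UTF-8" already returns UTF-8 encoded bytes
data = etree.tostring(tree, xml_declaration=True, encoding="UTF-8", pretty_print=True)
with open('output.xml', 'wb') as target:
    target.write(data)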
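As a quick sanity check for either script, one can test whether the bytes of the output actually decode with the encoding named in its XML declaration. The sketch below uses only the standard library. The helper names declared_encoding and bytes_match_declaration are hypothetical, and it assumes the file has no byte order mark and a conventional declaration on its first line.

```python
import re


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or default if none is given."""
    m = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._\-]+)["\']', data)
    return m.group(1).decode("ascii") if m else default


def bytes_match_declaration(data):
    """True if the raw bytes can be decoded with the encoding they declare."""
    try:
        data.decode(declared_encoding(data))
    except (UnicodeDecodeError, LookupError):
        return False
    return True


text = '<?xml version="1.0" encoding="UTF-8"?><item>ñ</item>'
good = text.encode("utf-8")       # bytes agree with the declaration
bad = text.encode("iso-8859-1")   # declaration says UTF-8, bytes are ISO-8859-1

good_ok = bytes_match_declaration(good)
bad_ok = bytes_match_declaration(bad)
```

Here good_ok is True and bad_ok is False, because a lone 0xF1 byte followed by "<" is not valid UTF-8. A passing check does not prove the content is right (ISO-8859-1 accepts any byte sequence), so it is mainly useful for catching a UTF-8 declaration over non-UTF-8 bytes.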