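How to convert XML from ISO-8859-1 to UTF-8 using Python 3.7.7

Question

Hi, I have searched a lot on this topic for Python without success. I have a file, downloaded from the web and named import.xml, that starts like this: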

<?xml version="1.0" encoding="ISO-8859-1"?>
<articulos><item>

I would like to convert it to UTF-8. Any ideas on where I should start?

Answers

Put together from the xml.etree.ElementTree documentation (The ElementTree XML API) and Stack Overflow.

Script:

from xml.etree import ElementTree

myFileIn = 'ISO_8859_1.xml'
myFileOu = 'utf_8.xml'

# Open in binary mode so the parser takes the encoding from the XML declaration.
with open(myFileIn, 'rb') as f:
    root = ElementTree.fromstring(f.read())

# Wrap the root element in a tree and serialize it as UTF-8 with a declaration.
tree = ElementTree.ElementTree(root)
tree.write(myFileOu, encoding="utf-8", xml_declaration=True)

Tested on the following myFileIn file (data from Wikipedia):

<?xml version="1.0" encoding="ISO-8859-1"?>
<articulos>
  <item>
    <table>"upper"</table>
    <A_> ~A0 ¡~A1 ¢~A2 £~A3 ¤~A4 ¥~A5 ¦~A6 §~A7 ¨~A8 ©~A9 ª~AA «~AB ¬~AC  ~AD ®~AE ¯~AF</A_>
    <B_>°~B0 ±~B1 ²~B2 ³~B3 ´~B4 µ~B5 ¶~B6 ·~B7 ¸~B8 ¹~B9 º~BA »~BB ¼~BC ½~BD ¾~BE ¿~BF</B_>
    <C_>À~C0 Á~C1 Â~C2 Ã~C3 Ä~C4 Å~C5 Æ~C6 Ç~C7 È~C8 É~C9 Ê~CA Ë~CB Ì~CC Í~CD Î~CE Ï~CF</C_>
    <D_>Ð~D0 Ñ~D1 Ò~D2 Ó~D3 Ô~D4 Õ~D5 Ö~D6 ×~D7 Ø~D8 Ù~D9 Ú~DA Û~DB Ü~DC Ý~DD Þ~DE ß~DF</D_>
    <E_>à~E0 á~E1 â~E2 ã~E3 ä~E4 å~E5 æ~E6 ç~E7 è~E8 é~E9 ê~EA ë~EB ì~EC í~ED î~EE ï~EF</E_>
    <F_>ð~F0 ñ~F1 ò~F2 ó~F3 ô~F4 õ~F5 ö~F6 ÷~F7 ø~F8 ù~F9 ú~FA û~FB ü~FC ý~FD þ~FE ÿ~FF</F_>
  </item>
</articulos>
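
One way to check that such a conversion worked is to read the output back and compare it with the original text. The following is only a minimal sketch: the helper name convert_to_utf8, the sample document, and the temporary file names are assumptions made for illustration, not part of the original answer.

```python
import os
import tempfile
from xml.etree import ElementTree


def convert_to_utf8(path_in, path_out):
    """Parse path_in (encoding taken from its XML declaration) and rewrite it as UTF-8."""
    with open(path_in, "rb") as f:
        root = ElementTree.fromstring(f.read())
    ElementTree.ElementTree(root).write(
        path_out, encoding="utf-8", xml_declaration=True
    )


# Hypothetical sample document; it is stored on disk as ISO-8859-1 bytes.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<articulos><item>ñ é ü</item></articulos>"
)

with tempfile.TemporaryDirectory() as tmp:
    path_in = os.path.join(tmp, "ISO_8859_1.xml")
    path_out = os.path.join(tmp, "utf_8.xml")
    with open(path_in, "wb") as f:
        f.write(sample.encode("iso-8859-1"))

    convert_to_utf8(path_in, path_out)

    with open(path_out, "rb") as f:
        converted = f.read()

# The first line of the output is the new XML declaration.
first_line = converted.split(b"\n", 1)[0]
# Re-parsing the output should give back the same characters.
item_text = ElementTree.fromstring(converted).find("item").text
print(first_line)
print(item_text)
```

If the declaration on the first line names utf-8 and the re-parsed text still contains the accented characters, the file was transcoded rather than merely relabeled.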

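Alternatively, with lxml you can pass the source encoding to the parser explicitly. Write the serialized bytes out unchanged: re-encoding them to ISO-8859-1 afterwards would make the file contents contradict its UTF-8 declaration and corrupt the special characters.

from lxml import etree

source = "input.xml"

# Force ISO-8859-1 decoding regardless of what the XML declaration says.
parser = etree.XMLParser(encoding="iso-8859-1")
with open(source, 'rb') as f:
    tree = etree.parse(f, parser)

# tostring() with an encoding argument returns bytes already encoded as UTF-8.
data = etree.tostring(tree, xml_declaration=True, encoding="UTF-8", pretty_print=True)

with open('output.xml', 'wb') as target:
    target.write(data)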
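
If the document does not need to be parsed at all, a different technique is to transcode it as plain text and patch the declaration. This is not what either answer above does; it is a rough sketch that assumes the file really is ISO-8859-1 and that its declaration sits at the very start of the file. The function name transcode_xml_text and the file names are hypothetical.

```python
import os
import re
import tempfile

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def transcode_xml_text(path_in, path_out, source_encoding="iso-8859-1"):
    """Re-encode a text file as UTF-8 and update its XML declaration to match."""
    # newline="" keeps the original line endings untouched.
    with open(path_in, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    # Only the declared encoding name changes; the rest of the text is kept.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    with open(path_out, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Hypothetical sample document, stored on disk as ISO-8859-1 bytes.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<articulos><item>ñ</item></articulos>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path_in = os.path.join(tmp, "import.xml")
    path_out = os.path.join(tmp, "import_utf8.xml")
    with open(path_in, "wb") as f:
        f.write(sample.encode("iso-8859-1"))

    transcode_xml_text(path_in, path_out)

    with open(path_out, "rb") as f:
        result = f.read()

print(result.decode("utf-8"))
```

Unlike the parser-based scripts, this keeps comments, whitespace, and attribute quoting byte-for-byte apart from the encoding, but it does not check that the input is well-formed XML.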
