I want to decode to "UTF-8"

Problem description

The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94Now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n

I need to decode all of this as UTF-8, except for the "\n". So I want this output:

Original :The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94Now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n
Decoded : The restoration and rejuvenation of the Willamette Army Base—Now the Willamette Reservist Training Center—is complete.  \n \n

Solution

Your input must be a byte string in order to be decoded. Assuming it is, use bytes.decode():

>>> s = b'The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n'
>>> type(s)
<class 'bytes'>
>>> s2 = s.decode('utf8')
>>> type(s2)
<class 'str'>
>>> s2
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.  \n \n'

The above shows a byte string (class bytes) being decoded to a Unicode string (class str).
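If the bytes might not be valid UTF-8, bytes.decode() raises UnicodeDecodeError by default. Here is a minimal sketch of the errors parameter. The sample byte strings are made up for illustration and are not from the question:

```python
# Made-up sample data: one valid UTF-8 string, one with a stray 0xff byte.
good = b'Base\xe2\x80\x94now'
bad = b'Base\xff now'

text = good.decode('utf-8')  # valid UTF-8 decodes cleanly

try:
    bad.decode('utf-8')      # default errors='strict' raises on 0xff
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for each undecodable byte
replaced = bad.decode('utf-8', errors='replace')

print(repr(text), strict_failed, repr(replaced))
```

Whether to fail loudly or substitute a replacement character depends on how much you trust the source of the bytes.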

rstrip() strips the trailing newlines (along with any other trailing whitespace):

>>> s2.rstrip()
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.'
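Since the question asks to leave the "\n" alone, it may help to see how the argument to rstrip() changes what gets removed. This is a small sketch on a shortened tail of the string, not the full sentence:

```python
s2 = 'is complete.  \n \n'  # shortened tail of the decoded string

all_stripped = s2.rstrip()       # no argument: strips spaces and newlines
newlines_only = s2.rstrip('\n')  # strips only trailing '\n' characters

print(repr(all_stripped))
print(repr(newlines_only))
```

If you want the string exactly as decoded, simply skip rstrip() altogether.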

If your data comes from a file or another stream, you can decode it as you read by specifying the encoding when you open the file/stream:

with open('file.txt', encoding='utf8') as f:
    for line in f:
        # each line keeps its own trailing '\n', so suppress print's extra one
        print(line, end='')

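This decodes the incoming data from UTF-8, so your code only ever deals with strings (str), not byte strings. See the documentation for open() for details.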
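To see the difference between reading bytes and reading decoded text, here is a self-contained sketch that writes a throwaway file and reads it back both ways. The temporary path and sample text are assumptions for illustration only:

```python
import os
import tempfile

text = 'Army Base\u2014now the Training Center\n'  # \u2014 is the em dash

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file.txt')  # throwaway path

    # Write the text encoded as UTF-8.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Binary mode returns raw bytes: the dash is the bytes e2 80 94.
    with open(path, 'rb') as f:
        raw = f.read()

    # Text mode with encoding= decodes while reading.
    with open(path, encoding='utf-8') as f:
        decoded = f.read()

print(raw)
print(decoded == text)
```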

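If the data is already a str whose characters are really the individual UTF-8 bytes, you can fix this particular mojibake case as follows. Encoding as latin1 maps each character back to the byte with the same value, and those bytes can then be decoded as UTF-8: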

>>> s = 'The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n'
>>> s.encode('latin1').decode('utf-8')
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.  \n \n'
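The latin1 round trip raises an exception if the string is not this kind of mojibake, for example if it is already correct. One way to guard against that is a small wrapper. This is only a sketch, and fix_mojibake is a hypothetical helper name, not part of any library:

```python
def fix_mojibake(text):
    """Re-interpret a str whose characters are really UTF-8 bytes.

    Hypothetical helper for illustration; returns the input unchanged
    when the latin1 -> UTF-8 round trip does not apply.
    """
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character is outside latin1, or the recovered bytes
        # are not valid UTF-8: leave the string alone.
        return text


broken = 'Army Base\xe2\x80\x94now'          # mis-decoded em dash
fixed = fix_mojibake(broken)
clean = fix_mojibake('Army Base\u2014now')   # already correct, unchanged

print(repr(fixed), repr(clean))
```

Falling back to the original string keeps already-correct text intact, at the cost of silently passing through anything the round trip cannot repair.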