I want to decode as "UTF-8"

Problem description

The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94Now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n

I need to decode all of this as "UTF-8", except for the "\n". So this is the output I want:

Original :The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94Now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n
Decoded : The restoration and rejuvenation of the Willamette Army Base—Now the Willamette Reservist Training Center—is complete.  \n \n

Solution

Your input must be a byte string in order to be decoded. Assuming it is, use bytes.decode():

>>> s = b'The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n'
>>> type(s)
<class 'bytes'>
>>> s2 = s.decode('utf8')
>>> type(s2)
<class 'str'>
>>> s2
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.  \n \n'

The above shows a byte string (class bytes) being decoded into a Unicode string (class str).

Use rstrip() to strip the trailing whitespace, including the newlines:

>>> s2.rstrip()
'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.'
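Note that rstrip() with no argument removes every trailing whitespace character, not just newlines. A minimal sketch of the difference, using a shortened stand-in for the sentence above (the variable names are only illustrative):

```python
# Shortened stand-in for the decoded sentence, with the same trailing "  \n \n".
s2 = 'is complete.  \n \n'

# No argument: strips all trailing whitespace (spaces and newlines).
stripped_all = s2.rstrip()

# With '\n': strips only trailing newline characters, and stops at the
# first character that is not a newline (here, the space before the last '\n').
stripped_newlines = s2.rstrip('\n')

print(repr(stripped_all))       # 'is complete.'
print(repr(stripped_newlines))  # 'is complete.  \n '
```

If the trailing "\n" characters should be kept, as in the question, simply skip the rstrip() call; decode() leaves them untouched.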

If your data comes from a file or another stream, you can decode it as it is read by specifying the encoding when you open the file/stream:

with open('file.txt', encoding='utf8') as f:
    for line in f:
        print(line, end='')  # each line already ends with '\n'

This decodes the incoming data from UTF-8, so your code only ever handles strings, not byte strings. See open() for details.
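To try decoding on read without an existing file, here is a small self-contained sketch. It assumes a throwaway file named sample.txt in a temporary directory (both hypothetical) that holds the raw UTF-8 bytes:

```python
import os
import tempfile

# Raw UTF-8 bytes, as they might sit on disk (shortened sample text).
raw = b'Army Base\xe2\x80\x94now the Training Center\xe2\x80\x94is complete.\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')  # hypothetical file name

    # Write the bytes unchanged, in binary mode.
    with open(path, 'wb') as f:
        f.write(raw)

    # Read in text mode; open() decodes from UTF-8 as the file is read.
    with open(path, encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]

print(lines)
```

Each element of lines is already a str, with the bytes \xe2\x80\x94 turned into a single dash character.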

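If you already have a str that contains mojibake (UTF-8 bytes that were wrongly decoded as Latin-1), you can fix this particular case as follows: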

s = 'The restoration and rejuvenation of the Willamette Army Base\xe2\x80\x94now the Willamette Reservist Training Center\xe2\x80\x94is complete.  \n \n'
s.encode('latin1').decode('utf-8')

'The restoration and rejuvenation of the Willamette Army Base—now the Willamette Reservist Training Center—is complete.  \n \n'
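This works because Latin-1 maps code points 0 to 255 onto the same byte values, so encode('latin1') recovers the original bytes, which can then be decoded as UTF-8. A minimal sketch of wrapping that round trip in a helper; fix_mojibake is a hypothetical name, and the fallback of returning the text unchanged is an assumption about what you would want when the repair does not apply:

```python
def fix_mojibake(text):
    """Hypothetical helper: repair UTF-8 text that was wrongly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them properly.
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text has characters outside Latin-1, or the bytes are not
        # valid UTF-8, so it was probably not this kind of mojibake.
        return text


broken = 'Army Base\xe2\x80\x94now'
fixed = fix_mojibake(broken)

# Text that is already correct is returned unchanged.
already_ok = fix_mojibake('Army Base\u2014now')

print(fixed == already_ok)  # True
```

The try/except keeps the helper from raising on strings that were decoded correctly in the first place.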
