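Converting Unicode to ASCII in Python 3

Problem description

I have tried many solutions and read through many websites, but I cannot seem to solve this. I have a file that contains message objects. Each message has a 4-byte value for the message type, a 4-byte value for the length, and then the message data, which is ASCII stored as Unicode. When I print to the screen, it looks like ASCII. When I redirect the output to a file, I get Unicode, so something is wrong with the way I am decoding all of this. Here is the Python script: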

import sys
import codecs
import encodings.idna
import unicodedata

def getHeader(fileObj):
    mstype_array = bytearray(4)
    mslen_array = bytearray(4)
    mstype = 0
    mslen = 0
    fileObj.seek(-1,1)
    mstype_array = fileObj.read(4)
    mslen_array = fileObj.read(4)
    mstype = int.from_bytes(mstype_array,byteorder=sys.byteorder)
    mslen = int.from_bytes(mslen_array,byteorder=sys.byteorder)
    return mstype,mslen

def getMessage(fileObj,count):
    str = fileObj.read(count)#.decode("utf-8","strict")
    return str

def getFields(msg):
    msg = codecs.decode(msg,'utf-8')
    fields = msg.split(';')
    return fields

mstype = 0
mslen = 0
with open('../putty.log','rb') as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        if byte == b'\x1D':
            mstype,mslen = getHeader(f)
            print (f"Msg Type: {mstype} Msg Len: {mslen}")
            msg = getMessage(f,mslen)
            print(f"Message: {codecs.decode(msg,'utf-8')}")
            #print(type(msg))
            fields = getFields(msg)
            print("Fields:")
            for field in fields:
                print(field)
        else:
            print (f"Char read: {byte}  {hex(ord(byte))}")

You can use this link to get the file to decode.

Solution

It seems that sys.stdout behaves differently when writing to the console than when writing to a file. The manual (https://docs.python.org/3/library/sys.html#sys.stdout) says this is expected, but only gives details for Windows.
In any case, you are writing unicode to stdout (via print()), which is why you get unicode in the file. You can avoid this by not decoding the message in getFields (so you can replace fields = getFields(msg) with fields = msg.split(b';')) and by writing to stdout with sys.stdout.buffer.write(field+b'\n').
Apparently there are some issues with mixing print() and sys.stdout.buffer.write(), so "Python 3: write binary to stdout respecting buffering" may be worth a read.

tl;dr - try writing bytes without decoding to unicode.
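The bytes-only approach described above could look roughly like the sketch below. The helper name write_fields and its out parameter are hypothetical, introduced only for illustration; out defaults to sys.stdout.buffer, and the demo writes to an in-memory stream so the result can be inspected.

```python
import io
import sys


def write_fields(msg: bytes, out=None) -> int:
    """Split a raw message on b';' and write each field as bytes, one per line.

    `write_fields` and `out` are hypothetical names, not part of the
    original script. `out` must be a binary stream.
    """
    if out is None:
        out = sys.stdout.buffer   # binary layer underneath sys.stdout
    fields = msg.split(b';')      # the data stays bytes: no decode step
    for field in fields:
        out.write(field + b'\n')
    return len(fields)


# Demonstration against an in-memory binary stream instead of stdout.
demo = io.BytesIO()
count = write_fields(b'alpha;beta;gamma', demo)
```

Because nothing is ever decoded to str, the bytes that reach the file are exactly the bytes that were read from the log, whatever encoding the console or the redirect target uses.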

In short, define a custom function and use it everywhere you would call print.

import sys

def ascii_print(txt):
    sys.stdout.buffer.write(txt.encode('ascii',errors='backslashreplace'))

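ASCII is a subset of utf-8. ASCII characters are indistinguishable from the same characters encoded as utf-8. Internally, all Python strings are raw Unicode. However, raw Unicode cannot be read in or written out; it must first be encoded in some encoding. On most systems the default encoding is utf-8, which is the most common encoding standard for Unicode.

If you want to write out using a different encoding, you must specify that encoding. I am assuming that you need the ascii encoding for some reason.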
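As a small illustration of the subset relationship, the sketch below encodes the same text both ways. The sample strings are made up for the example and do not come from the log file in the question.

```python
# Pure-ASCII text produces identical bytes under both encodings.
text = "Msg Type: 29"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A non-ASCII character needs more than one byte in utf-8 ...
accented = "caf\u00e9"                      # "café"
utf8_accented = accented.encode("utf-8")

# ... and cannot be represented in ASCII at all with the default
# 'strict' error handler.
try:
    accented.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

So for input that really is pure ASCII, the choice between the two encodings makes no difference to the bytes written; it only matters once a non-ASCII character shows up.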

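Note what the documentation for print states:

Since printed arguments are converted to text strings, print() cannot be used with binary mode file objects. For these, use file.write(...) instead.

Now, if you are redirecting stdout, you could call write() directly on sys.stdout. However, as the documentation explains there:

To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').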

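So, instead of this line:

print(f"Message: {codecs.decode(msg,'utf-8')}")

you can do this:

ascii_msg = f"Message: {codecs.decode(msg,'utf-8')}".encode('ascii')
sys.stdout.buffer.write(ascii_msg)

Note that I specifically called str.encode on the string and explicitly set the 'ascii' encoding. Also note that I encoded the whole string (including "Message: "), not just the variable passed in (which still needs to be decoded). You then need to write that ASCII-encoded byte string directly to sys.stdout.buffer, as shown on the second line.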

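One problem with this is that the input might contain some non-ASCII characters. As written, a UnicodeError (specifically a UnicodeEncodeError) would be raised and the program would crash. To avoid that, str.encode supports a few different error-handling options:

Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error().

Since the target output is plain text, 'backslashreplace' is probably the best way to keep the output lossless. However, 'ignore' would also work if you do not care about preserving the non-ASCII characters.

ascii_msg = f"Message: {codecs.decode(msg,'utf-8')}".encode('ascii',errors='backslashreplace')
sys.stdout.buffer.write(ascii_msg)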
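To make the difference between the error handlers concrete, here is a small comparison on a made-up sample string (the string is an assumption for illustration, not data from the question):

```python
sample = "na\u00efve"   # "naïve": one non-ASCII character, U+00EF

# Encode the same text to ASCII with each of the built-in handlers
# named in the str.encode documentation.
results = {
    name: sample.encode("ascii", errors=name)
    for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")
```

'ignore' drops the character and 'replace' substitutes a question mark, so both lose information; 'xmlcharrefreplace' and 'backslashreplace' keep the code point in an escaped form that a reader can still recover.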

Yes, you will need to do this for every string you send to print. Defining a custom print function may make the code more readable:

def ascii_print(txt):
    sys.stdout.buffer.write(txt.encode('ascii',errors='backslashreplace'))

Then in your code you can just call it instead of print:

ascii_print(f"Message: {codecs.decode(msg,'utf-8')}")
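One thing to watch for: unlike print, the function above does not append a newline. A possible variant that behaves more like print is sketched below. The sep, end and stream parameters are my own additions for illustration, not part of the answer's function; stream defaults to sys.stdout.buffer.

```python
import io
import sys


def ascii_print(*args, sep=" ", end="\n", stream=None):
    """print()-like helper that always emits ASCII bytes.

    Variant of the answer's ascii_print; `sep`, `end` and `stream` are
    hypothetical extras. `stream` must be a binary stream.
    """
    if stream is None:
        stream = sys.stdout.buffer
    text = sep.join(str(arg) for arg in args) + end
    # Non-ASCII characters become backslash escapes instead of raising.
    stream.write(text.encode("ascii", errors="backslashreplace"))
    # Flush so the bytes are not held back in the binary buffer.
    stream.flush()


# Demonstration against an in-memory binary stream.
buf = io.BytesIO()
ascii_print("Message:", "se\u00f1or", stream=buf)   # "señor"
```

Calling it without stream writes to stdout, so a redirect such as python script.py > out.txt should produce a pure-ASCII file, assuming everything in the script goes through this helper rather than print.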
