Answer
We know the file contains the byte b'\x96', because the error message mentions it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte
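To look at the offending byte in context, the position and the raw bytes can be read off the exception itself. The following is a minimal sketch, not part of the original answer: the helper name `describe_decode_error` and the sample bytes are hypothetical, and the sample merely stands in for the real file contents.

```python
def describe_decode_error(data, encoding="utf-8", context=10):
    """Return (position, offending_byte, nearby_bytes), or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the codec could not decode.
        lo = max(err.start - context, 0)
        hi = err.start + context
        return err.start, data[err.start:err.start + 1], data[lo:hi]
    return None


# Hypothetical sample; for a real file, read it in binary mode instead,
# e.g. open('my_file.csv', 'rb').read().
sample = b"name,city\nPe\x96a,Madrid\n"
result = describe_decode_error(sample)
print(result)
```

Reading the file in binary mode avoids any decoding, so the surrounding bytes can be inspected even when the encoding is still unknown.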
Now we can write a small script to find every encoding in which b'\x96' decodes to ñ:
import pkgutil
import encodings
import os

def all_encodings():
    # Codec modules shipped in the standard library's encodings package
    modnames = set([modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix='')])
    # Canonical codec names that the alias table points to
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

text = b'\x96'
for enc in all_encodings():
    try:
        msg = text.decode(enc)
    except Exception:
        # Skip names that are not text codecs or that cannot decode this byte
        continue
    if msg == 'ñ':
        print('Decoding {t} with {enc} is {m}'.format(t=text, enc=enc, m=msg))
which yields
Decoding b'\x96' with mac_roman is ñ
Decoding b'\x96' with mac_farsi is ñ
Decoding b'\x96' with mac_croatian is ñ
Decoding b'\x96' with mac_arabic is ñ
Decoding b'\x96' with mac_romanian is ñ
Decoding b'\x96' with mac_iceland is ñ
Decoding b'\x96' with mac_turkish is ñ
So try changing
with open('my_file.csv', 'r', newline='') as csvfile:
to use one of these encodings, for example:
with open('my_file.csv', 'r', encoding='mac_roman', newline='') as csvfile:
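Put together with the reading loop from the question, the change might look like the sketch below. It is only an illustration under stated assumptions: `read_rows` is a hypothetical helper, the temporary file with made-up rows stands in for my_file.csv, and mac_roman is just one of the candidates listed above. Which encoding is actually right depends on how the file was produced.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Read a CSV file with an explicit encoding and return its rows as lists."""
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        return list(csv.reader(csvfile, delimiter=",", quotechar="|"))


# Build a small stand-in file containing the raw byte 0x96 (hypothetical data).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"name,city\r\nPe\x96a,Madrid\r\n")
    sample_path = tmp.name

try:
    rows = read_rows(sample_path, "mac_roman")
    for row in rows:
        print(" ".join(row))
finally:
    # Clean up the temporary stand-in file.
    os.remove(sample_path)
```

If the decoded text looks wrong (for example, an unexpected symbol where ñ should be), that suggests trying another candidate from the list.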
Question
I have the following code in Python 3, which is meant to print each line of a csv file.
import csv

with open('my_file.csv', 'r', newline='') as csvfile:
    lines = csv.reader(csvfile, delimiter=',', quotechar='|')
    for line in lines:
        print(' '.join(line))
But when I run it, it gives me this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte
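I looked through the csv file, and it turns out that if I take out a single ñ (a lowercase n with a tilde on top), every line prints fine.
My problem is that I have looked through many different solutions to similar problems, but I still have no idea how to fix this, what to decode or encode, and so on. Simply removing the ñ character from the data is not an option.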