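Problem description
I am trying to read the contents of various files. Some of them may have a docx, pdf, or xlsx extension.
I tried this code:
for path in paths:
    print(open(path, "r", encoding="utf8").read())
But it gave me the following error: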
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-22-db6ea654fe14> in <module>
1 for path in paths:
----> 2 print(open(path,encoding="utf8").read())
~\AppData\Local\Programs\Python\Python38\lib\codecs.py in decode(self,input,final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result,consumed) = self._buffer_decode(data,self.errors,final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte
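Solution
No single function can read the contents of every file type. Formats such as docx, pdf, and xlsx are binary, so decoding them as UTF-8 text fails with a UnicodeDecodeError like the one shown in the question. You will need to handle each extension separately.
There are libraries that can read specific file formats, so I suggest using them.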
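from PyPDF2 import PdfReader  # recent PyPDF2 releases; older ones name this class PdfFileReader

for path in paths:
    lower = path.lower()  # compare extensions case-insensitively
    if lower.endswith(".pdf"):
        with open(path, "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            # Print the text extracted from each page
            for page in reader.pages:
                print(page.extract_text())
    elif lower.endswith(".docx"):
        pass  # handle the docx case
    elif lower.endswith(".xlsx"):
        pass  # handle the Excel case
    else:  # Default: treat the file as UTF-8 text
        try:
            with open(path, "r", encoding="utf8") as text_file:
                print(text_file.read())
        except (UnicodeDecodeError, OSError):
            print(f"Could not read file {path}")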
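The docx branch is left open in the code above. As one possible way to fill it using only the standard library, the sketch below relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The names read_docx_text, read_plain_text, READERS, and read_any are hypothetical helpers made up for this illustration, not part of any library. It ignores headers, footers, and footnotes, and text boxes nested inside a paragraph may be repeated, so a dedicated library such as python-docx is likely the better choice for real documents.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # The text of a paragraph is split across one or more <w:t> elements.
        runs = (node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def read_plain_text(path):
    """Fallback reader: treat the file as UTF-8 text."""
    return Path(path).read_text(encoding="utf8")


# Hypothetical dispatch table: file extension -> reader function.
# Readers for .pdf and .xlsx could be registered here in the same way.
READERS = {".docx": read_docx_text}


def read_any(path):
    """Pick a reader by extension, falling back to plain text."""
    reader = READERS.get(Path(path).suffix.lower(), read_plain_text)
    return reader(path)
```

With this in place, the loop body reduces to print(read_any(path)), and supporting another format means adding one entry to READERS rather than another elif branch.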