Reading files with any extension using Python

Problem description

I am trying to read the contents of a variety of files. Some of them may have a .docx, .pdf, or .xlsx extension.

I tried this code:

for path in paths:
    print(open(path,"r",encoding="utf8").read())

But it gave me the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-22-db6ea654fe14> in <module>
      1 for path in paths:
----> 2     print(open(path,encoding="utf8").read())

~\AppData\Local\Programs\Python\Python38\lib\codecs.py in decode(self,input,final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result,consumed) = self._buffer_decode(data,self.errors,final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte

Solution

No single function can read files of every extension and expose their contents. You will need to handle each extension separately.
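To illustrate why one text-mode read cannot cover every case, here is a minimal sketch that opens a file in binary mode and looks at its first bytes. PDF files start with `%PDF`, and .docx and .xlsx files are zip archives that normally start with `PK\x03\x04`, so neither is valid UTF-8 text. The names `SIGNATURES` and `sniff_format` are made up for this example and are not part of any library.

```python
# Leading bytes ("magic numbers") of the formats mentioned in the question.
SIGNATURES = {
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip container (docx and xlsx are zip archives)",
}


def sniff_format(path):
    """Return a rough format guess based on the first bytes of the file."""
    # Binary mode returns raw bytes, so no UnicodeDecodeError can occur here.
    with open(path, "rb") as f:
        head = f.read(4)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown (possibly plain text)"
```

A check like this only tells you what kind of container you have; extracting readable text still needs a parser written for that format.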

There are libraries that can read specific file formats for you, so I suggest using them.

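from PyPDF2 import PdfReader  # PyPDF2 >= 2.0; older releases used PdfFileReader

for path in paths:
    if path.endswith(".pdf"):
        # PDFs are binary, so open in "rb" mode and let the library parse them
        with open(path, "rb") as pdf_file:
            reader = PdfReader(pdf_file)
            for page in reader.pages:
                print(page.extract_text())
    elif path.endswith(".docx"):
        pass  # handle the docx case, e.g. with python-docx
    elif path.endswith(".xlsx"):
        pass  # handle the Excel case, e.g. with openpyxl
    else:  # default: treat the file as UTF-8 text
        try:
            with open(path, "r", encoding="utf8") as text_file:
                print(text_file.read())
        except (UnicodeDecodeError, OSError):
            print(f"Could not read file {path}")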
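As the number of formats grows, the if/elif chain can be replaced by a lookup table keyed on the extension. The following is only a sketch of that idea using the standard library: `read_any`, `HANDLERS`, `read_text`, and `read_binary_placeholder` are hypothetical names, and the placeholder stands in for a real parser such as the PDF branch above.

```python
from pathlib import Path


def read_text(path):
    """Fallback: treat the file as UTF-8 text."""
    return Path(path).read_text(encoding="utf8")


def read_binary_placeholder(path):
    """Hypothetical stand-in for a format-specific reader (PDF, docx, xlsx)."""
    size = Path(path).stat().st_size
    return f"<{Path(path).suffix} file, {size} bytes; plug in a real parser here>"


# Map each extension to the function that knows how to read it.
HANDLERS = {
    ".pdf": read_binary_placeholder,
    ".docx": read_binary_placeholder,
    ".xlsx": read_binary_placeholder,
}


def read_any(path):
    """Pick a handler from the lower-cased extension, defaulting to text."""
    handler = HANDLERS.get(Path(path).suffix.lower(), read_text)
    try:
        return handler(path)
    except (UnicodeDecodeError, OSError) as exc:
        return f"Could not read file {path}: {exc}"
```

With this layout, supporting a new format means adding one entry to `HANDLERS` rather than another branch, and `Path(path).suffix.lower()` also matches upper-case extensions such as `.PDF`.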
