Why do I get this error in Python PDFMiner: TypeError: can only concatenate str (not "bytes") to str

Problem description

I'm new to Python and am trying to convert a PDF to a text file with PDFMiner. Every time I run my code, I get this error: TypeError: can only concatenate str (not "bytes") to str

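I'm confused because the error message seems to indicate that the error originates in a file inside the pdfminer package itself. I know there are other questions about this error message here, but I haven't been able to use them to solve my problem. That is probably mostly because I don't understand what their code is doing and I'm a novice, but perhaps also because my problem seems to come from a file that is specific to PDFMiner.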

I am running the following code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from io import StringIO
from pdfminer.pdfpage import PDFPage

def get_pdf_file_content(path_to_pdf):
    resource_manager = PDFResourceManager(caching=True)
    out_text = StringIO()
    laParams = LAParams()
    text_converter = TextConverter(resource_manager,out_text,laparams= laParams)
    fp = open(path_to_pdf,'rb')
    interpreter = PDFPageInterpreter(resource_manager,text_converter)
    for page in PDFPage.get_pages(fp,pagenos=set(),maxpages=0,password="",caching= True,check_extractable= True):
        interpreter.process_page(page)

    text = out_text.getvalue()

    fp.close()
    text_converter.close()
    out_text.close()

    return text

path_to_pdf = "C:\\files\\raw\\AZO - CALLSTREET REPORT  AutoZone,Inc.(AZO),Q1 2002 Earnings Call,5-December-2001 10 00 AM ET - 05-Dec-01.pdf"
print(get_pdf_file_content(path_to_pdf))

I get this error message:

  File "<stdin>",line 1,in <module>
  File "<stdin>",line 8,in get_pdf_file_content
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfpage.py",line 122,in get_pages
    doc = PDFDocument(parser,password=password,caching=caching)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 575,in __init__
    self._initialize_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 599,in _initialize_password
    handler = factory(docid,param,password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 300,in __init__
    self.init()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 307,in init
    self.init_key()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 320,in init_key
    self.key = self.authenticate(self.password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 368,in authenticate
    key = self.authenticate_user_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 374,in authenticate_user_password
    key = self.compute_encryption_key(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py",line 351,in compute_encryption_key
    password = (password + self.PASSWORD_PADDING)[:32]  # 1
TypeError: can only concatenate str (not "bytes") to str

Solution

You have two options here:

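1) You can pass the password as bytes, so that you end up with:

for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password=b"", caching=True, check_extractable=True):
    interpreter.process_page(page)

(Note the b in front of the quotes that define your password; it makes the value a bytes literal.)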
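
To see why the b prefix matters, the failing line from the last traceback frame can be reproduced without pdfminer. This is only a sketch: PADDING is a stand-in for the library's PASSWORD_PADDING constant (its real value differs), and pad_password is a hypothetical function that mimics the concatenate-and-slice step.

```python
# Stand-in for the library's padding constant; the real value differs.
PADDING = b"\x00" * 32


def pad_password(password):
    """Mimic the failing line: concatenate, then keep the first 32 items."""
    return (password + PADDING)[:32]


# A str password cannot be joined to a bytes constant, so this raises TypeError.
try:
    pad_password("")
    str_error = None
except TypeError as exc:
    str_error = str(exc)

# A bytes password concatenates cleanly.
padded = pad_password(b"")
```

After running this, str_error holds the TypeError message, while padded is a 32-byte value.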

2) You can drop the argument

The password parameter is optional (it has a default value), so if you don't need it you can remove it. You will end up with:

for page in PDFPage.get_pages(fp, check_extractable=True):
    interpreter.process_page(page)
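
If some of your PDFs need a password and others don't, both options can be combined in a small helper. The sketch below is an assumption, not part of PDFMiner: password_kwargs is a hypothetical name, and the latin-1 encoding is a guess that may need adjusting for your files.

```python
def password_kwargs(password=None):
    """Build the password keyword argument (hypothetical helper).

    None or an empty value: omit the argument so the library default applies.
    str: encode to bytes. bytes: pass through unchanged.
    """
    if not password:
        return {}
    if isinstance(password, str):
        # The encoding is an assumption; adjust it if your passwords need another one.
        password = password.encode("latin-1")
    return {"password": password}
```

It would be used as PDFPage.get_pages(fp, check_extractable=True, **password_kwargs(pw)), where pw is your own variable. Which password type is expected depends on the PDFMiner release you have installed, so check that before relying on this.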

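I ran into this problem before. I set the password to bytes and made sure the data passed to the parser is bytes as well (the file is opened in "rb" mode), and with that I can convert multiple PDFs into multiple text files. Here is my code: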

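from pathlib import Path

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Placeholder folders (not given in the original post); point these at your own directories.
PDFS_FOLDER = "pdfs/"
TEXTS_FOLDER = "texts/"  # must end with a separator because it is concatenated below


def main():
    for path in Path(PDFS_FOLDER).glob("*.pdf"):
        with path.open("rb") as file:
            parser = PDFParser(file)
            # Pass the password as bytes to avoid the str/bytes TypeError.
            document = PDFDocument(parser, b"")
            if not document.is_extractable:
                continue

            manager = PDFResourceManager()
            params = LAParams()
            device = PDFPageAggregator(manager, laparams=params)
            interpreter = PDFPageInterpreter(manager, device)

            # Collect the text of every text box and text line on each page.
            text = ""
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
                for obj in device.get_result():
                    if isinstance(obj, (LTTextBox, LTTextLine)):
                        text += obj.get_text()

        # Write one .txt file per PDF, named after the PDF.
        with open(TEXTS_FOLDER + "{}.txt".format(path.stem), "w") as file:
            file.write(text)
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())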
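
The output file name above is built by string concatenation, which only works when TEXTS_FOLDER ends with a path separator. As an optional alternative, here is a minimal sketch that builds the same name with pathlib; output_path is a hypothetical helper, not part of the answer's code.

```python
from pathlib import Path


def output_path(texts_folder, pdf_path):
    """Return <texts_folder>/<pdf stem>.txt for a given PDF path."""
    return Path(texts_folder) / (Path(pdf_path).stem + ".txt")
```

With this, open(output_path(TEXTS_FOLDER, path), "w") works whether or not the folder name has a trailing separator.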
