Why do I get this error in Python PDFMiner: TypeError: can only concatenate str (not "bytes") to str

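Problem description

I am new to Python and am trying to convert a PDF to a txt file with PDFMiner. Every time I run my code I get this error: TypeError: can only concatenate str (not "bytes") to str

I am confused because the error message seems to say the error is raised from a file inside the pdfminer package. I know there are other questions here about this error message, but I could not use them to solve my problem. That is probably mostly because I am a beginner and do not understand what their code is doing, but perhaps also because my problem seems to come from a file that belongs to PDFMiner itself.

I am running the following code: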

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter
from io import StringIO
from pdfminer.pdfpage import PDFPage

def get_pdf_file_content(path_to_pdf):
    resource_manager = PDFResourceManager(caching=True)
    out_text = StringIO()
    laParams = LAParams()
    text_converter = TextConverter(resource_manager, out_text, laparams=laParams)
    fp = open(path_to_pdf, 'rb')
    interpreter = PDFPageInterpreter(resource_manager, text_converter)
    for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password="", caching=True, check_extractable=True):
        interpreter.process_page(page)

    text = out_text.getvalue()

    fp.close()
    text_converter.close()
    out_text.close()

    return text

path_to_pdf = "C:\\files\\raw\\AZO - CALLSTREET REPORT  AutoZone,Inc.(AZO),Q1 2002 Earnings Call,5-December-2001 10 00 AM ET - 05-Dec-01.pdf"
print(get_pdf_file_content(path_to_pdf))

I get this error message:

  File "<stdin>", line 1, in <module>
  File "<stdin>", line 8, in get_pdf_file_content
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfpage.py", line 122, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 599, in _initialize_password
    handler = factory(docid, param, password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 300, in __init__
    self.init()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 307, in init
    self.init_key()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 320, in init_key
    self.key = self.authenticate(self.password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 368, in authenticate
    key = self.authenticate_user_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 374, in authenticate_user_password
    key = self.compute_encryption_key(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 351, in compute_encryption_key
    password = (password + self.PASSWORD_PADDING)[:32]  # 1
TypeError: can only concatenate str (not "bytes") to str

Solution

You have two options here:

1) You can pass the password as bytes, which gives you:

for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password=b"", caching=True, check_extractable=True):
    interpreter.process_page(page)

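(Note the b in front of the quotes that define your password; it makes the value a bytes literal.)

2) You can drop the argument

The password parameter is not required (it has a default value), so if you do not need it you can remove it. You will end up with: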

for page in PDFPage.get_pages(fp, check_extractable=True):
    interpreter.process_page(page)
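
To see why both options work, it may help to reproduce the failing line from the traceback, `password = (password + self.PASSWORD_PADDING)[:32]`, without pdfminer at all. The sketch below is only an illustration: `PADDING` is a 32-byte placeholder standing in for pdfminer's bytes constant (not its real value), and `pad_password` and `ensure_bytes` are hypothetical helper names, not part of the library.

```python
# Placeholder for pdfminer's PASSWORD_PADDING: 32 arbitrary bytes, NOT the real constant.
PADDING = bytes(range(32))


def pad_password(password):
    """Mimic the failing line: append the padding, then keep the first 32 items."""
    return (password + PADDING)[:32]


def ensure_bytes(password, encoding="latin-1"):
    """Hypothetical helper: accept str or bytes and always return bytes."""
    if isinstance(password, str):
        return password.encode(encoding)
    return password


# A str password cannot be concatenated with a bytes padding, so this raises TypeError.
try:
    pad_password("")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# A bytes password concatenates cleanly and is truncated to 32 bytes.
padded = pad_password(b"")
print(len(padded))

# Converting a str password first avoids the error as well.
padded_secret = pad_password(ensure_bytes("secret"))
print(padded_secret[:6])
```

Omitting the argument presumably works for the same reason: in this pdfminer version the default password would have to be a bytes value for the concatenation to succeed.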

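I ran into this problem before. I set the password to bytes, and opened the file passed to the parser in binary mode, and with that I can convert multiple PDFs into multiple txt files. Here is my code: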

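    from pathlib import Path

    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    # PDFS_FOLDER and TEXTS_FOLDER are assumed to be defined elsewhere as folder
    # path strings; TEXTS_FOLDER must end with a path separator.


    def main():
        for path in Path(PDFS_FOLDER).glob("*.pdf"):
            with path.open("rb") as file:
                parser = PDFParser(file)
                # Pass the password as bytes, not str
                document = PDFDocument(parser, b"")
                if not document.is_extractable:
                    continue

                manager = PDFResourceManager()
                params = LAParams()

                device = PDFPageAggregator(manager, laparams=params)
                interpreter = PDFPageInterpreter(manager, device)

                text = ""

                for page in PDFPage.create_pages(document):
                    interpreter.process_page(page)
                    # Collect the text of every text box or text line on the page
                    for obj in device.get_result():
                        if isinstance(obj, (LTTextBox, LTTextLine)):
                            text += obj.get_text()

            with open(TEXTS_FOLDER + "{}.txt".format(path.stem), "w") as file:
                file.write(text)
        return 0


    if __name__ == "__main__":
        import sys
        sys.exit(main())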
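
One detail of the script above: the output name is built with `TEXTS_FOLDER + "{}.txt".format(path.stem)`, which only lands inside the folder if `TEXTS_FOLDER` ends with a path separator. A minimal sketch of an alternative using `pathlib`, where `text_path_for` is a hypothetical helper name and the example paths are made up for illustration:

```python
from pathlib import Path


def text_path_for(pdf_path, texts_folder):
    """Return <texts_folder>/<pdf stem>.txt for the given PDF path."""
    # Path's "/" operator inserts the separator, so no trailing slash is needed.
    return Path(texts_folder) / (Path(pdf_path).stem + ".txt")


example = text_path_for("reports/q1_call.pdf", "texts")
print(example.name)
```

The returned `Path` can be passed straight to `open(...)` in place of the concatenated string.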

pdf pdfminer python python-3.x