Problem description
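I'm new to Python and am trying to use PDFMiner to convert a PDF to a text file. Every time I run my code, I get this error: TypeError: can only concatenate str (not "bytes") to str
I'm confused, because the traceback seems to indicate that the error is raised in a file inside the pdfminer package. I know there are other questions here about this error message, but I haven't been able to use them to solve my problem. That is probably mainly because I don't understand what their code is doing and I'm a beginner, but perhaps also because my problem appears to come from a file that belongs to PDFMiner itself.
I'm running the following code: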
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def get_pdf_file_content(path_to_pdf):
    resource_manager = PDFResourceManager(caching=True)
    out_text = StringIO()  # an instance, not the StringIO class itself
    laParams = LAParams()
    text_converter = TextConverter(resource_manager, out_text, laparams=laParams)
    fp = open(path_to_pdf, 'rb')
    interpreter = PDFPageInterpreter(resource_manager, text_converter)
    for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password="", caching=True, check_extractable=True):  # password="" is a str
        interpreter.process_page(page)
    text = out_text.getvalue()
    fp.close()
    text_converter.close()
    out_text.close()
    return text

path_to_pdf = "C:\\files\\raw\\AZO - CALLSTREET REPORT AutoZone,Inc.(AZO),Q1 2002 Earnings Call,5-December-2001 10 00 AM ET - 05-Dec-01.pdf"
print(get_pdf_file_content(path_to_pdf))
I get this error message:
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 8, in get_pdf_file_content
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfpage.py", line 122, in get_pages
    doc = PDFDocument(parser,password=password,caching=caching)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 575, in __init__
    self._initialize_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 599, in _initialize_password
    handler = factory(docid,param,password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 300, in __init__
    self.init()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 307, in init
    self.init_key()
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 320, in init_key
    self.key = self.authenticate(self.password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 368, in authenticate
    key = self.authenticate_user_password(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 374, in authenticate_user_password
    key = self.compute_encryption_key(password)
  File "C:\text_analysis\project\lib\site-packages\pdfminer\pdfdocument.py", line 351, in compute_encryption_key
    password = (password + self.PASSWORD_PADDING)[:32] # 1
TypeError: can only concatenate str (not "bytes") to str
Solution
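You have two options here:
1) You can pass the password as bytes, which gives you:
for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password=b"", caching=True, check_extractable=True):
    interpreter.process_page(page)
(Note the b in front of the quotes, which makes the password a bytes literal instead of a str.)
2) You can drop the argument altogether.
The password parameter is optional (it has a default value), so if you don't need it you can leave it out. You end up with:
for page in PDFPage.get_pages(fp, check_extractable=True):
    interpreter.process_page(page)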
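
To see why the b prefix matters, here is a minimal standalone sketch of the expression on the last line of the traceback. It does not import pdfminer: PASSWORD_PADDING below is only a stand-in bytes value (the real constant differs), and as_bytes and try_pad are hypothetical helpers, with latin-1 chosen as an assumed encoding.

```python
# Stand-in for pdfminer's bytes padding constant; the real value differs.
PASSWORD_PADDING = bytes(32)


def pad_password(password):
    """Mimic the failing line: (password + PASSWORD_PADDING)[:32]."""
    return (password + PASSWORD_PADDING)[:32]


def as_bytes(password, encoding="latin-1"):
    """Hypothetical helper: accept str or bytes and always return bytes."""
    if isinstance(password, str):
        return password.encode(encoding)
    return password


def try_pad(password):
    """Return the padded bytes, or the TypeError message if padding fails."""
    try:
        return pad_password(password)
    except TypeError as exc:
        return "TypeError: {}".format(exc)


print(try_pad(""))                        # str + bytes cannot be concatenated
print(len(try_pad(b"")))                  # bytes + bytes works
print(len(try_pad(as_bytes("secret"))))   # a str converted to bytes first
```

Both options in this answer work for the same reason: the value that reaches the concatenation is bytes, either because you pass b"" yourself or because the default is used.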
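
I ran into this problem before. I set the password to bytes and passed the data to the parser as bytes (the file is opened in binary mode), and this converts multiple PDFs into multiple text files for me. Here is my code: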
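import sys
from pathlib import Path

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# PDFS_FOLDER and TEXTS_FOLDER are not defined in the original post;
# these values are placeholders (TEXTS_FOLDER must end with a separator).
PDFS_FOLDER = "pdfs/"
TEXTS_FOLDER = "texts/"

def main():
    for path in Path(PDFS_FOLDER).glob("*.pdf"):
        with path.open("rb") as file:
            parser = PDFParser(file)
            document = PDFDocument(parser, b"")  # password passed as bytes
            if not document.is_extractable:
                continue
            manager = PDFResourceManager()
            params = LAParams()
            device = PDFPageAggregator(manager, laparams=params)
            interpreter = PDFPageInterpreter(manager, device)
            text = ""
            for page in PDFPage.create_pages(document):
                interpreter.process_page(page)
                # Collect the text of every text box or text line on the page.
                for obj in device.get_result():
                    if isinstance(obj, (LTTextBox, LTTextLine)):
                        text += obj.get_text()
        # Write one .txt file per PDF, named after the PDF's stem.
        with open(TEXTS_FOLDER + "{}.txt".format(path.stem), "w") as out_file:
            out_file.write(text)
    return 0

if __name__ == "__main__":
    sys.exit(main())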
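
The folder loop above can also be separated from the PDF parsing, which makes it easy to try without any PDFs at hand. The following is only a sketch under assumptions: convert_folder and its extract_text parameter are hypothetical names, not part of pdfminer, and UTF-8 is an assumed output encoding. extract_text would wrap the pdfminer calls and return None for documents that should be skipped.

```python
from pathlib import Path


def convert_folder(pdf_dir, txt_dir, extract_text):
    """Write one .txt file per .pdf in pdf_dir and return the paths written.

    extract_text is a hypothetical callable that takes a binary file object
    and returns a str, or None when the document should be skipped.
    """
    out_dir = Path(txt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Open in binary mode, as the parser expects bytes.
        with pdf_path.open("rb") as pdf_file:
            text = extract_text(pdf_file)
        if text is None:
            continue
        # Build the output path with pathlib instead of string concatenation.
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Joining paths with pathlib avoids depending on whether the output folder name ends with a separator.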