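Problem description
I am trying to extract text from a merged PDF file and convert it to a txt file using PDFMiner, but I get a PDFInterpreter error: unknown operator "QQ". Here is the code: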
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def get_pdf_file_content(path_to_pdf):
    # Resource manager caches shared resources (fonts, images) across pages.
    resource_manager = PDFResourceManager(caching=True)
    out_text = StringIO()
    la_params = LAParams()
    # The converter writes the text of each processed page into out_text.
    text_converter = TextConverter(resource_manager, out_text, laparams=la_params)
    interpreter = PDFPageInterpreter(resource_manager, text_converter)
    # The with block closes the file even if a page fails to parse.
    with open(path_to_pdf, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password="",
                                      caching=True, check_extractable=True):
            interpreter.process_page(page)
    text = out_text.getvalue()
    text_converter.close()
    out_text.close()
    return text


path_to_pdf = 'merged.pdf'
print(get_pdf_file_content(path_to_pdf))
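The code above only prints the extracted text, while the goal stated in the question is a txt file. A minimal sketch of that last step follows. The helper name `save_text` and the output name `merged.txt` are assumptions for illustration and are not part of PDFMiner.

```python
from pathlib import Path


def save_text(text, out_path, encoding="utf-8"):
    """Write extracted text to out_path; return the number of characters written.

    Hypothetical helper for illustration, not a PDFMiner function.
    """
    out_file = Path(out_path)
    # newline="" keeps line endings exactly as the extractor produced them.
    with out_file.open("w", encoding=encoding, newline="") as fh:
        return fh.write(text)
```

It could then be called as `save_text(get_pdf_file_content('merged.pdf'), 'merged.txt')` once the extraction itself succeeds.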
Solution
I am a Windows user, so I don't know PDFMiner and I'm not used to working in a shell, but you can try this online converter: https://pdftotext.com/. It worked fine for me.