How do I extract text from a merged PDF file and convert it to a txt file?

Problem description

I am trying to extract text from a merged PDF file and convert it to a txt file using PDFMiner, but I get a PDFInterpreter error: Unknown operator 'QQ'. Here is the code:

    from io import StringIO

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def get_pdf_file_content(path_to_pdf):
        resource_manager = PDFResourceManager(caching=True)
        out_text = StringIO()
        la_params = LAParams()
        # The converter writes the extracted text into the in-memory buffer.
        text_converter = TextConverter(resource_manager, out_text, laparams=la_params)
        interpreter = PDFPageInterpreter(resource_manager, text_converter)
        # Open the PDF in binary mode; the with block closes it even on error.
        with open(path_to_pdf, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos=set(), maxpages=0, password="",
                                          caching=True, check_extractable=True):
                interpreter.process_page(page)
        text = out_text.getvalue()
        text_converter.close()
        out_text.close()
        return text


    path_to_pdf = 'merged.pdf'
    print(get_pdf_file_content(path_to_pdf))

Solution

I am a Windows user, so I am not familiar with PDFMiner and I am not used to working in a shell, but you could try this online converter: https://pdftotext.com/. It worked fine for me.
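
If you would rather stay in Python than use an online converter, here is a minimal sketch of the "convert to txt" step. It assumes the pdfminer.six package is installed and uses its `extract_text` helper from `pdfminer.high_level`. The function name `pdf_to_txt` and the `extract` parameter are made up for illustration and are not part of PDFMiner. I have not checked whether this avoids the unknown operator error on your particular merged file.

```python
from pathlib import Path


def pdf_to_txt(path_to_pdf, path_to_txt, extract=None):
    """Write the text of a PDF to a UTF-8 txt file and return the text.

    `extract` is a hypothetical hook: any callable that takes a PDF path
    and returns a string. When omitted, pdfminer.six's extract_text is used.
    """
    if extract is None:
        # Imported lazily so pdfminer.six is only required for the default path.
        from pdfminer.high_level import extract_text as extract
    text = extract(path_to_pdf)
    # Save the extracted text next to wherever the caller asked for it.
    Path(path_to_txt).write_text(text, encoding="utf-8")
    return text
```

With the file from the question, the call would be `pdf_to_txt('merged.pdf', 'merged.txt')`. The `extract` hook is only there so another extractor can be swapped in if PDFMiner keeps failing on the merged file.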