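Python PDF-to-text packages produce incorrect results

Problem description

I am trying to convert multiple PDF files in a folder to txt files.

I used pdfminer3 and pdfplumber, and then compared the results with those from a PDF-to-TXT website.

Here is the code using pdfminer3: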

import io
import os

from pdfminer3.converter import TextConverter
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

directory = 'mydirectory'
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        pathname = os.path.join(directory, filename)

        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO()
        converter = TextConverter(resource_manager, fake_file_handle)
        page_interpreter = PDFPageInterpreter(resource_manager, converter)

        with open(pathname, 'rb') as fh:
            for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
                page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()

        # close open handles
        converter.close()
        fake_file_handle.close()

        # write the extracted text next to the source PDF
        txtname = pathname.replace('.pdf', '.txt')
        with open(txtname, "a") as text_file:
            print(text, file=text_file)

Here is my code using pdfplumber:

import os
import pdfplumber

directory = "directory"

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        all_text = ''
        pathname = os.path.join(directory, filename)
        # output path: same name with a .txt extension
        txtname = pathname.replace('.pdf', '.txt')

        with pdfplumber.open(pathname) as pdf:
            for pdf_page in pdf.pages:
                # extract_text() can return None for a page without text
                single_page_text = pdf_page.extract_text() or ''
                all_text = all_text + '\n' + single_page_text
        with open(txtname, "w") as text_file:
            text_file.write(all_text)

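I compared the results with the PDF file and with the output of the PDF-to-TXT website.

Original PDF

Result from the PDF-to-TXT website

As you can see, the converter on the website scans the PDF document from the top-left corner to the bottom-right corner, so it writes the file as "About this report", "CEO message", and so on. This is the result I expect.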
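The top-left to bottom-right order described here can be sketched in plain Python. This is only an illustration of the idea, not how the website is known to work: the `Word` tuples, the `reading_order` helper, the `line_tolerance` value, and the sample coordinates are all assumptions made up for the example.

```python
from typing import List, Tuple

# Each word is (text, x0, top), with `top` measured downward from the top
# of the page. These tuples are hypothetical stand-ins for whatever a PDF
# library reports; they are not taken from a real file.
Word = Tuple[str, float, float]


def reading_order(words: List[Word], line_tolerance: float = 3.0) -> str:
    """Group words into lines by vertical position, then read left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Stay on the current line while the word is vertically close to
        # the first word of that line; otherwise start a new line.
        if lines and abs(word[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
        for line in lines
    )


sample = [
    ("Message", 340.0, 50.5),
    ("About", 40.0, 50.0),
    ("CEO", 300.0, 50.2),
    ("this", 80.0, 50.0),
    ("report", 110.0, 50.1),
    ("Body", 40.0, 80.0),
]
print(reading_order(sample))
```

With the sample above, the two headings that sit at roughly the same height end up on one line, left one first, followed by the body text on the next line.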

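Result from pdfminer3

And this is the result I get with pdfminer3. I don't know how this package scans a PDF document. I assume it scans the document from the top-right corner to the bottom-left corner, so it writes the body of the "CEO message" paragraph first. However, the heading "CEO message" appears after the end of the paragraph, and I don't know what is happening.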
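One thing that may be worth trying: in the pdfminer3 code above, `TextConverter` is created without a `laparams` argument. As far as I know, pdfminer only runs its layout analysis when an `LAParams` object is supplied, and otherwise emits text in the order it appears in the PDF's content stream. The sketch below passes a default `LAParams()`; the function name `pdf_to_text_with_layout` is hypothetical, and whether this gives the expected order for this particular report is an assumption that has not been checked.

```python
import io


def pdf_to_text_with_layout(pdf_path: str) -> str:
    """Extract text with pdfminer3, with layout analysis switched on."""
    # Imported lazily so this module loads even where pdfminer3 is missing.
    from pdfminer3.converter import TextConverter
    from pdfminer3.layout import LAParams
    from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer3.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    buffer = io.StringIO()
    # The only change from the loop above: pass an LAParams instance.
    converter = TextConverter(resource_manager, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, converter)
    try:
        with open(pdf_path, "rb") as fh:
            for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
                interpreter.process_page(page)
        return buffer.getvalue()
    finally:
        # Release the converter and the in-memory buffer in every case.
        converter.close()
        buffer.close()
```

If the default grouping still misplaces headings, the fields of `LAParams` (for example `line_margin` and `boxes_flow`) are the usual knobs to experiment with.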

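Result from pdfplumber

Well, the result is better than pdfminer3, but there are still a few errors. If you look at the fifth sentence, it reads "About this CEO Message". It should be "About this report", followed by "CEO message". I assume this package does not scan and read sentence by sentence, but instead scans from top to bottom and returns whatever letters are at the top.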
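If the mixing comes from two side-by-side columns, one possible workaround is to crop each page into vertical strips and extract them one at a time. The sketch below assumes equal-width columns and a page whose origin is at (0, 0); the helpers `column_boxes` and `extract_by_columns` are hypothetical names, and only `pdfplumber.open`, `page.width`, `page.height`, `page.crop` and `extract_text` are pdfplumber calls.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def column_boxes(width: float, height: float, n_columns: int = 2) -> List[BBox]:
    """Split a page of the given size into equal-width vertical strips."""
    step = width / n_columns
    return [(i * step, 0.0, (i + 1) * step, height) for i in range(n_columns)]


def extract_by_columns(pdf_path: str, n_columns: int = 2) -> str:
    """Extract text one column at a time so neighbouring columns do not mix."""
    import pdfplumber  # imported here so column_boxes works without it

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Assumes the page's bounding box starts at (0, 0).
            for box in column_boxes(page.width, page.height, n_columns):
                # extract_text() can return None for an empty region.
                parts.append(page.crop(box).extract_text() or "")
    return "\n".join(parts)


print(column_boxes(600, 800))
```

This reads the left column fully before the right one, which is a different order from the website's row-by-row output, so it only helps if column-wise text is acceptable.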

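Is there any package I can use to produce results close to those from the PDF-to-TXT website? Thank you for taking the time to read and answer this question.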


pdf pdfminer pdftotext python