pdfminer: PDF to text in Python 3: iterating over each page in a for loop

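Problem description

I am trying to remove the header and footer from every page of a PDF file before converting it to text. I am fairly new to Python and am looking for help. I found that print(outputresult.splitlines()[3:-3]) partially works, but it only removes the first three and the last three lines of the complete text output (here 'outputresult' is the name of the output). I think I need to loop over each page first to get the result I want, but I don't understand where to add the code.

Example: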

header_line_1
header_line_2

text line 1 
text line 2
.....
.....
.....
footer_line_1
footer_line_2

Desired output:

text line 1 
text line 2
.....
.....

Note: I want this to happen for every page in the PDF.
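
The behaviour described in the question (the slice trims only the start and end of the whole document) and a page-by-page alternative can be sketched as follows. This is a minimal sketch, not a verified fix. It assumes the extracted text separates pages with a form feed character ("\f"), which pdfminer's TextConverter normally writes at the end of each page, and that every page has the same number of header and footer lines. The helper name strip_header_footer and the sample strings are hypothetical and used only for illustration.

```python
def strip_header_footer(text, n_header, n_footer, page_sep="\f"):
    """Drop the first n_header and last n_footer lines of every page.

    Assumes pages in `text` are separated by `page_sep` (form feed).
    """
    cleaned_pages = []
    for page in text.split(page_sep):
        lines = page.splitlines()
        if not lines:
            # Skip the empty chunk left after the final separator.
            continue
        # Compute the end index explicitly so n_footer=0 keeps all lines.
        end = max(len(lines) - n_footer, n_header)
        cleaned_pages.append("\n".join(lines[n_header:end]))
    return page_sep.join(cleaned_pages)


# Two hypothetical pages, each with one header line and one footer line.
sample = "head\nbody 1\nfoot\n\fhead\nbody 2\nfoot\n\f"

# Slicing the whole string only trims the start and end of the document,
# so the footer of page 1 and the header of page 2 are still present.
whole = sample.splitlines()[1:-1]

# Slicing page by page trims every page.
per_page = strip_header_footer(sample, n_header=1, n_footer=1)
```

Blank lines in the extracted text count as lines too, so the header and footer counts would probably need tuning against the real output of the PDF in question.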

Below is the code I am using:

from io import BytesIO, StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams(
        char_margin=200,
        word_margin=10,    # default 0.2
        line_margin=10,    # default 0.3
        line_overlap=0.5,  # default 0.5
        all_texts=True,
    )

    # The converter writes the text of every processed page into retstr.
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr


if __name__ == "__main__":
    with open("location/file1.pdf", 'rb') as scrape:  # for local files
        pdfFile = BytesIO(scrape.read())
    outputresult = readPDF(pdfFile)
    print(outputresult)
    pdfFile.close()

Thanks

Solution

No confirmed solution has been found for this problem yet.
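
One possible direction, offered as an unverified sketch rather than a confirmed answer, is to do the trimming inside the page loop itself: read the buffer after each process_page call, strip that page's lines, then empty the buffer before the next page. The names strip_lines and read_pdf_pages are hypothetical, the default counts of 3 simply echo the [3:-3] slice from the question, and the sketch assumes pdfminer.six with default LAParams and a text (StringIO) output buffer.

```python
from io import StringIO


def strip_lines(page_text, n_header, n_footer):
    """Drop the first n_header and last n_footer lines of one page."""
    lines = page_text.splitlines()
    # Compute the end index explicitly so n_footer=0 keeps all lines.
    end = max(len(lines) - n_footer, n_header)
    return "\n".join(lines[n_header:end])


def read_pdf_pages(pdf_file, n_header=3, n_footer=3):
    """Return a list with the trimmed text of each page of an open PDF file."""
    # Imported here so strip_lines stays usable without pdfminer installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    pages = []
    for page in PDFPage.get_pages(pdf_file):
        interpreter.process_page(page)
        # Everything written since the last reset belongs to this page;
        # the trailing form feed is the converter's page separator.
        page_text = retstr.getvalue().rstrip("\f")
        pages.append(strip_lines(page_text, n_header, n_footer))
        # Empty the buffer so the next page starts from scratch.
        retstr.seek(0)
        retstr.truncate(0)

    device.close()
    retstr.close()
    return pages
```

With the file from the question this might be called as pages = read_pdf_pages(open("location/file1.pdf", "rb")), followed by "\n".join(pages) to get a single string. Blank lines emitted by the layout analysis count as lines, so the header and footer counts may need adjusting for a given document.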

pdfminer pdftotext python-3.x