How to iterate .extract_text() in pdfplumber

Problem description

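I am trying to build a tool that extracts the text from every page of a PDF file. So far, pdfplumber is the only library that returns readable text. The pdfplumber examples (e.g. https://github.com/jsvine/pdfplumber) show text being extracted one page at a time, so I did the following to capture several pages: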

import pdfplumber

with pdfplumber.open(file) as pdf:
    p1 = pdf.pages[0]
    p2 = pdf.pages[1]
    p3 = pdf.pages[2]

    p1_text = p1.extract_text()
    p2_text = p2.extract_text()
    p3_text = p3.extract_text()

    print(p1_text, p2_text, p3_text)

My PDF has 17 pages. I would like to know whether I can loop over a list (i.e. 0-16) to generate p1, p2, p3 ... p17 (the first block under the with statement).

I generated the necessary list as follows:

file = '/Users/Guy/Coding/Crossref/sample.pdf'

from PyPDF2 import PdfFileReader
pdf = PdfFileReader(open(file,'rb'))
total_pages = pdf.getNumPages()

total_pages_range = list(range(1,total_pages))

But I can't seem to combine the two.

Any help would be greatly appreciated; I'm just starting out with Python. Thanks.

Solution

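The pdfplumber.PDF class has a .pages attribute, which is a list containing one pdfplumber.Page instance per loaded page. So if your PDF has n pages, pdf.pages holds n Page objects, and you can iterate over all of them directly: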

import pdfplumber

with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        print(page.extract_text())
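
If you want to keep the text for later instead of printing it straight away, the same loop can collect one string per page. The following is only a minimal sketch: extract_all_text and main are hypothetical helper names, not part of pdfplumber, and the `or ""` fallback assumes that a page without a text layer may give back None or an empty string. No separate page count from PyPDF2 is needed, since len(pdf.pages) already gives it. Note also that range(1, total_pages) starts at 1 and stops before total_pages, so used as indices into pdf.pages it would skip the first page; range(len(pdf.pages)) covers 0 through 16 for a 17-page file.

```python
import sys


def extract_all_text(pages):
    """Return one string per page, in page order (hypothetical helper)."""
    texts = []
    for page in pages:
        # Assumption: a page without extractable text may give None or "",
        # so both are normalised to an empty string.
        texts.append(page.extract_text() or "")
    return texts


def main(path):
    # Imported here so the helper above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # enumerate(..., start=1) gives human-friendly page numbers (1, 2, 3, ...).
        for number, text in enumerate(extract_all_text(pdf.pages), start=1):
            print(f"--- page {number} ---")
            print(text)


if __name__ == "__main__":
    main(sys.argv[1])
```

Saved as a script (the file name is up to you), this would be run with the PDF path as its only argument. Inside the with block, extract_all_text(pdf.pages)[0] then plays the role of p1_text, [1] of p2_text, and so on.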
