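How can I accurately extract a table with PDFPlumber?

Problem description

I am self-taught and currently working on a personal project. The PDF I want to scrape is at https://www3.ntu.edu.sg/oad2/website_files/IGP/NTU_IGP.pdf.

The table I am trying to extract is on the second page of that PDF. I tried extracting it with pdfplumber's extract_tables(), but the extracted data is not what I expected.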

from pathlib import Path

import pdfplumber
import requests

URL_NTU = 'https://www3.ntu.edu.sg/oad2/website_files/IGP/NTU_IGP.pdf'
filename = Path('NTU_IGP.pdf')

# Download the PDF and save it locally.
response = requests.get(URL_NTU)
response.raise_for_status()
filename.write_bytes(response.content)

# pdfplumber alone is enough here; the unused PyPDF2 reader has been dropped.
with pdfplumber.open(filename) as pdf:
    second_page = pdf.pages[1]  # pages are zero-indexed
    third_page = pdf.pages[2]
    ntu_course_list = []
    print(second_page.extract_tables())

The output I get is

[[['NTU Programmes',None,'','Representative Grade',''],[None,'Profile',None],'3H2/1H1','10th','90th','percentile',['','Lee Kong Chian School of Medicine','']],[['','College of Engineering','College of Science','']]]

But I expected something like [['Medicine*','AAA/A','AAA/A'],['Renaissance Engineering*','AAA/A']...]

Any help or advice would be greatly appreciated. Thanks.
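
One direction that may be worth trying, offered as a sketch rather than a confirmed fix: extract_tables() accepts a table_settings dictionary, and its "vertical_strategy" and "horizontal_strategy" keys can be switched from the default line-based detection to "text", which infers cell boundaries from how the words are aligned. Whether that suits this particular PDF is an assumption; it has not been verified against NTU_IGP.pdf. The None and empty-string cells seen in the output above can then be filtered out in plain Python. The helper names clean_table and extract_clean_tables below are hypothetical and exist only for illustration.

```python
def clean_table(raw_table):
    """Drop None/blank cells, and drop rows that end up empty."""
    cleaned = []
    for row in raw_table:
        cells = [cell.strip() for cell in row
                 if isinstance(cell, str) and cell.strip()]
        if cells:
            cleaned.append(cells)
    return cleaned


def extract_clean_tables(pdf_path, page_index=1, table_settings=None):
    """Extract every table on one page and clean each of them."""
    # Imported lazily so clean_table() can be used without pdfplumber.
    import pdfplumber

    if table_settings is None:
        # Assumed starting point, not a verified setting for this PDF:
        # detect columns and rows from text alignment instead of drawn lines.
        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        }
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        raw_tables = page.extract_tables(table_settings=table_settings)
    return [clean_table(table) for table in raw_tables]
```

Calling extract_clean_tables('NTU_IGP.pdf', page_index=1) would run this on the second page. If the rows still come out merged or split, the strategy values and tolerances in table_settings are the first thing to adjust.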


pdf pdf-scraping python-3.x