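Problem description
I want to extract the text of an annotation (for example, the highlighted text of a hyperlink) from its position. I can already get the position and the URL with PDFMiner, as shown in the code below. Is it possible to pass this position to the layout object and get the text?
Here is the code I use for this.
The first part is a function called parse_annotations, which parses the annotations on each page.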
import sys

from pdfminer import pdftypes


def parse_annotations(page):
    """Return the rectangles and targets of the link annotations on a page."""
    positions = []
    urls = []
    for annot in pdftypes.resolve1(page.annots):
        if isinstance(annot, pdftypes.PDFObjRef):
            annotationDict = annot.resolve()
            # Skip over any annotations that are not links
            # (pdfminer.six renders the PDF name /Link as "/'Link'")
            if str(annotationDict["Subtype"]) != "/'Link'":
                continue
            destID = 0
            position = annotationDict["Rect"]
            uriDict = "None"
            if any(k in annotationDict for k in {"Dest", "D"}):
                # Read whichever key is present; this assumes an explicit
                # destination array whose first entry is a page reference
                dest = annotationDict.get("Dest", annotationDict.get("D"))
                destID = dest[0].objid
                url = "Cross reference"
            elif "A" in annotationDict:
                # Key A may hold a PDFObjRef; resolve1 returns other values unchanged
                uriDict = pdftypes.resolve1(annotationDict["A"])
                if "D" in uriDict:
                    destID = uriDict["D"][0].objid
                # Check the action type stored under key S
                if str(uriDict["S"]) == "/'GoTo'":
                    url = "Cross reference"
                elif str(uriDict["S"]) == "/'URI'":
                    uri = uriDict["URI"]
                    # The URI is usually a byte string; decode it instead of
                    # stripping the b'...' wrapper from its repr
                    url = uri.decode("utf-8", "replace") if isinstance(uri, bytes) else str(uri)
                else:
                    # Skip if key S in uriDict is neither URI nor GoTo
                    continue
            else:
                sys.stderr.write("Warning: unknown key in annotationDict: %s\n" % annotationDict)
                # No target could be determined, so do not record this annotation
                continue
            # print(annot, '\n', annotationDict, destID, position, uriDict, url, '\n')
            print(position, '\n')
            positions.append(position)
            urls.append(url)
        else:
            sys.stderr.write("Warning: unknown annotation: %s\n" % annot)
    return positions, urls
A sample PDF file is available at the link below.
https://www2.ed.gov/about/offices/list/ocr/docs/20200512-qa-psi-covid-19.pdf
Next, create a document object with PDFMiner and loop over the pages of the PDF.
from io import StringIO

from pdfminer.converter import PDFPageAggregator, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

manager = PDFResourceManager()
output = StringIO()
codec = 'utf-8'
laparams = LAParams()
# TextConverter writes plain text to `output`; PDFPageAggregator builds layout objects
converter = TextConverter(manager, output, codec=codec, laparams=laparams)
device = PDFPageAggregator(manager, laparams=laparams)
interpreter = PDFPageInterpreter(manager, device)
page_interpreter = PDFPageInterpreter(manager, converter)

filename = '20200512-qa-psi-covid-19.pdf'
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    document = PDFDocument(parser)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    for pageNumber, page in enumerate(PDFPage.create_pages(document)):
        print("\n================ PageNumber ", pageNumber + 1, "===================\n")
        # Plain text of the page; reset the buffer so pages do not accumulate
        page_interpreter.process_page(page)
        raw_text = output.getvalue()
        output.truncate(0)
        output.seek(0)
        # Layout objects (text boxes, lines, figures) of the same page
        interpreter.process_page(page)
        layout = device.get_result()
        if page.annots:
            positions, urls = parse_annotations(page)
        for obj in layout:
            print('Object name and position %s \t %s \n' % (obj.__class__.__name__, obj.bbox))

converter.close()
output.close()
device.close()
Thanks in advance.
Solution
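One possible approach, offered as a minimal sketch rather than a confirmed answer: walk the layout tree returned by device.get_result() and keep the characters whose centre falls inside the annotation's Rect. The helper names normalize_rect and text_in_rect and the margin parameter are hypothetical and not part of PDFMiner. The sketch relies on duck typing: it assumes containers such as LTPage, LTTextBox and LTTextLine are iterable, that LTChar exposes bbox and get_text(), and that LTAnno (the spaces and newlines inserted by layout analysis) has get_text() but no bbox. It also assumes the Rect entries are plain numbers and that the page is unrotated with a MediaBox starting at (0, 0), so that annotation and layout coordinates line up.

```python
def normalize_rect(rect):
    """Return (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1."""
    xa, ya, xb, yb = (float(v) for v in rect)
    return min(xa, xb), min(ya, yb), max(xa, xb), max(ya, yb)


def text_in_rect(layout, rect, margin=1.0):
    """Collect the text of characters whose centre lies inside rect.

    layout is any iterable tree of layout objects; margin (an assumption)
    widens the rectangle slightly to tolerate rounding in the PDF.
    """
    x0, y0, x1, y1 = normalize_rect(rect)
    x0, y0, x1, y1 = x0 - margin, y0 - margin, x1 + margin, y1 + margin
    pieces = []
    last_inside = False  # was the most recent real character inside rect?

    def walk(obj):
        nonlocal last_inside
        if hasattr(obj, "__iter__"):
            # Container (page, text box, text line, figure): descend into it
            for child in obj:
                walk(child)
        elif hasattr(obj, "get_text"):
            bbox = getattr(obj, "bbox", None)
            if bbox is None:
                # Virtual space/newline without a position: keep it only
                # when it directly follows a character we kept
                if last_inside:
                    pieces.append(obj.get_text())
            else:
                cx = (bbox[0] + bbox[2]) / 2
                cy = (bbox[1] + bbox[3]) / 2
                last_inside = x0 <= cx <= x1 and y0 <= cy <= y1
                if last_inside:
                    pieces.append(obj.get_text())
        # Anything else (lines, rectangles, images) carries no text

    walk(layout)
    # Collapse runs of whitespace and trim the ends
    return " ".join("".join(pieces).split())
```

Inside the page loop from the question this could be called as text_in_rect(layout, rect) for each rect, url pair from zip(positions, urls). Testing the centre of each character rather than its full box is a deliberate choice: link rectangles often clip the edges of glyphs, and a centre test tolerates that without picking up neighbouring words.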