Extracting sentences from a PDF in Python with the Tika module?

Problem description

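This is the logic I am trying to implement:

Read the document page by page (so the page number can be retrieved later)
Append a full stop to each line so that it counts as a "sentence"
Check whether a keyword occurs on the page
If it does, extract only the first sentence found and save the page number
Move on to the next item in the list and repeat
If the word is not found anywhere in the document, print 'N/A'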
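
The steps above can be sketched as a small function. This is only an illustration of the described logic, not the asker's code: the name first_sentences is hypothetical, the pages are assumed to be available already as a list of plain-text strings, and both line breaks and full stops are assumed to mark sentence boundaries. The keyword loop is the outer loop, so each keyword yields exactly one result.

```python
import re


def first_sentences(pages, keywords):
    """For each keyword, return (keyword, sentence, page_number) for the first
    sentence that contains it, or (keyword, "N/A", None) if it never appears.

    `pages` is a list of page texts; page numbers are 1-based.
    """
    results = []
    for keyword in keywords:  # outer loop: one result per keyword
        found = None
        for page_number, page_text in enumerate(pages, start=1):
            # Assumption: line breaks and full stops both end a sentence.
            for sentence in re.split(r"[.\n]+", page_text):
                if keyword in sentence:
                    found = (keyword, sentence.strip(), page_number)
                    break
            if found:
                break  # stop at the first hit for this keyword
        results.append(found or (keyword, "N/A", None))
    return results


# Made-up sample pages, purely for demonstration.
pages = [
    "intro text\nnothing here",
    "systems are so varied currently.\nsystems have this in common",
    "biometric data is there",
]
for keyword, sentence, page in first_sentences(pages, ["systems", "biometric", "puppies"]):
    print(sentence, page if page is not None else "")
```

Because the search for a keyword ends at its first match, a keyword that occurs on several pages is reported only once, and a keyword that never occurs is reported as N/A once.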

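Problem: the output seems to contain only sentences for the first word in the list, taken from across the whole document, instead of stopping at the first match. It also never moves on to the second item in the keyword list. Could someone please help?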

Example of the current, incorrect output:

systems are so varied currently. 3
systems have this in common. 4
systems are the best  7

Desired output

systems are so varied currently. 3
biometric data is there 9
technology is the best 10
silver jewellery is present 15
puppies are cute 29

Code:

keywords= ['systems','biometric','technology','silver','puppies']

import tika
from tika import parser

my_file="mypdf.pdf"

count=0
lst=[]

raw_xml = parser.from_file(my_file,xmlContent=True)
body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
body_without_tag = body.replace("<p>","").replace("</p>","").replace("<div>","").replace("</div>","").replace("<p />","")
text_pages = body_without_tag.split("""<div class="page">""")[1:]

text=str(text_pages)

for line in text.split('\\n'):
    if 4 <= len(line) <= 50:
        line=line+'.'
        line= line.strip('\\n')
        line=str(line)
        
    for j in keywords: 
        for i in line.split('.'): 
            if j in i:
                lst.append((i.split('.')[0]))
                print(j,i.split('.')[0],count)
                count= count+1
                break
    else:  
        lst.append('N/A')
        continue

Solution

No working solution has been posted for this problem yet.
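
Pending a confirmed answer, one thing that may help is keeping the page boundaries intact. In the code above, text=str(text_pages) flattens the list of pages into a single string, and count is incremented on every match, so it is a match counter, not a page number. The sketch below keeps one string per page instead. The helper names split_pages and load_pages are hypothetical, and the sketch assumes, as the original code does, that Tika's XHTML output wraps each page in <div class="page"> inside a plain <body> tag.

```python
import html
import re

PAGE_MARKER = '<div class="page">'


def split_pages(xhtml):
    """Split Tika XHTML into a list of plain-text pages, one string per page."""
    body = xhtml.split("<body>", 1)[-1].split("</body>", 1)[0]
    pages = []
    for chunk in body.split(PAGE_MARKER)[1:]:
        text = re.sub(r"<[^>]+>", "", chunk)  # drop the remaining tags
        pages.append(html.unescape(text).strip())
    return pages


def load_pages(path):
    """Parse a PDF with Tika and return its pages (needs the tika package and Java)."""
    from tika import parser  # imported lazily so split_pages works without Tika

    parsed = parser.from_file(path, xmlContent=True)
    return split_pages(parsed["content"] or "")


# Hand-written XHTML standing in for real Tika output.
sample = (
    '<html><body><div class="page"><p>systems are varied</p></div>'
    '<div class="page"><p>biometric data &amp; more</p></div></body></html>'
)
print(split_pages(sample))
```

With the pages kept as a list, the page number of a match is simply the position of the page in that list plus one, and the search can loop over the keywords first and over the pages second.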

apache-tika for-loop pdf python
