Extracting text from PDF files in an S3 bucket with Python

Problem description

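My AWS S3 bucket contains files in several formats, such as pdf, doc, rtf, odt, and png, and I need to extract text from them. I have already managed to get a list of the contents with their paths. Depending on the file type, I will use a different library to extract the text. Since there may be thousands of files, I need to extract the text directly from S3 rather than downloading them.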

filespath = [
    'https://abc.s3.ap-south-1.amazonaws.com/DocumentOnPATest',
    'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf',
    'https://abc.s3.ap-south-1.amazonaws.com/receipt.png',
    'https://abc.s3.ap-south-1.amazonaws.com/sample.rtf',
    'https://abc.s3.ap-south-1.amazonaws.com/sample1.odt',
]

bucketname = 'abc'

I tried the following, but it gives me an error:

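import pathlib
import PyPDF2

for path in filespath:
    ext = pathlib.Path(path).suffix
    if ext == '.pdf':
        # This is the failing call: the URL string is treated as a local file path
        pdf_file = PyPDF2.PdfFileReader(path)
        print(pdf_file.extractText())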

This is the error I get:

  File "F:\Projects\FileExtractor\fileextracts3.py", line 28, in <module>
    pdf_file=PyPDF2.PdfFileReader(path)

  File "C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1081, in __init__
    fileobj = open(stream,'rb')

OSError: [Errno 22] Invalid argument: 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf'

Please point me in the right direction. Thanks.

Solution
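
The traceback shows that PyPDF2.PdfFileReader passes a string argument to open(), so an HTTPS URL is rejected as an invalid local path. One possible way around this, offered as a minimal sketch rather than a verified fix for this exact setup, is to read the object through boto3 and hand PyPDF2 an in-memory io.BytesIO buffer instead of a path. The bytes still travel over the network, but nothing is written to disk. The helper names below (split_s3_url, fetch_object, extract_pdf_text, pdf_text_from_url) are illustrative and not part of any library. The sketch assumes virtual-hosted-style URLs like those in the question, AWS credentials that can read the bucket, and the legacy PyPDF2 1.x API that appears in the traceback; newer releases rename these calls to PdfReader, reader.pages, and page.extract_text().

```python
import io
from urllib.parse import urlparse, unquote


def split_s3_url(url):
    """Split a virtual-hosted-style S3 URL into (bucket, key)."""
    parsed = urlparse(url)
    # Assumption: the host looks like "<bucket>.s3.<region>.amazonaws.com"
    bucket = parsed.netloc.split(".s3.", 1)[0]
    key = unquote(parsed.path.lstrip("/"))
    return bucket, key


def fetch_object(s3_client, bucket, key):
    """Read an S3 object fully into an in-memory, seekable buffer."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return io.BytesIO(response["Body"].read())


def extract_pdf_text(buffer):
    """Join the extracted text of every page of a PDF held in memory."""
    import PyPDF2  # imported lazily; legacy 1.x API, as used in the question

    reader = PyPDF2.PdfFileReader(buffer)
    pages = (reader.getPage(i).extractText() for i in range(reader.getNumPages()))
    return "\n".join(pages)


def pdf_text_from_url(s3_client, url):
    """Fetch one PDF by its S3 URL and return its text without touching disk."""
    bucket, key = split_s3_url(url)
    return extract_pdf_text(fetch_object(s3_client, bucket, key))


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   print(pdf_text_from_url(s3, filespath[1]))
```

Passing the client in as a parameter means a single boto3 client can be reused across thousands of files, and the fetch step can be exercised with a stub object when no AWS access is available.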

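
Because the question mentions choosing a different library per file type, one way to organize that choice is a small registry keyed by file suffix. This is only a sketch: EXTRACTORS, register, suffix_of, and pick_extractor are hypothetical names, and the single ".txt" extractor is a placeholder so the example runs; real extractors for pdf, rtf, odt, or png would be registered the same way using whichever libraries are chosen. Note that the first entry in filespath has no suffix at all, so a suffix-based lookup returns nothing for it and some other signal (for example the object's stored content type) would be needed.

```python
import pathlib
from urllib.parse import urlparse

# Hypothetical registry: suffix -> function that takes raw bytes and returns text.
EXTRACTORS = {}


def register(*suffixes):
    """Decorator that registers an extractor for one or more file suffixes."""
    def decorator(func):
        for suffix in suffixes:
            EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(data):
    # Placeholder extractor, included only so the sketch runs end to end.
    return data.decode("utf-8", errors="replace")


def suffix_of(url):
    """Return the lower-cased file suffix of a URL's path ('' if there is none)."""
    return pathlib.PurePosixPath(urlparse(url).path).suffix.lower()


def pick_extractor(url):
    """Return the extractor registered for this URL's suffix, or None."""
    return EXTRACTORS.get(suffix_of(url))
```

Parsing the URL first and taking the suffix from its path component avoids depending on how the local operating system's path rules happen to treat a full URL string.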

amazon-s3 python python-pdfreader