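How can I read a Persian PDF and extract its content?

Problem description

I am trying to read this Persian PDF, but the extracted text is not decoded properly. I also tried utf-16 and utf-32, and neither produced readable output. What I want is the Persian dates in the table. I tried other libraries as well, but none of them extracted usable text.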

    import io

    import requests
    from PyPDF2 import PdfFileReader  # legacy PyPDF2 (< 3.0) API, as in the question

    urlpdf = "https://www.codal.ir/Reports/DownloadFile.aspx?id=LG5QhAhMbfl2DrQQQaQQQ%2bkR9nMQ%3d%3d"
    response = requests.get(urlpdf, verify=False, timeout=5)

    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        txt = f"""
        Author: {information.author}
        Creator: {information.creator}
        Producer: {information.producer}
        Subject: {information.subject}
        Title: {information.title}
        Number of pages: {number_of_pages}
        """
        # Print the metadata of the PDF
        print(txt)

        # numpage is the zero-based index of the page to extract
        numpage = 0
        page = pdf.getPage(numpage)
        page_content = page.extractText() + "\n"

        # Save the extracted page text as UTF-8, then print it
        with open("extract.txt", "w", encoding="UTF-8") as g:
            g.write(page_content)
        print(page_content)

Solution

No working solution to this problem has been found yet.
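
One thing that may be worth trying, offered as a sketch rather than a confirmed fix: extracted Persian text sometimes comes back as Arabic presentation forms (the U+FB50 to U+FEFF range) and with Persian digits, which makes it hard to search. The snippet below uses only the standard library to fold presentation forms back to base letters with NFKC normalization, convert Persian and Arabic-Indic digits to ASCII, and pull out dates. The helper names `normalize_persian` and `find_jalali_dates` are hypothetical, and the `YYYY/MM/DD` date layout is an assumption about the table, not something the question confirms. If the PDF's fonts carry no usable Unicode mapping, no post-processing of this kind will help, and OCR is probably the only option.

```python
import re
import unicodedata

# Map Persian (U+06F0..U+06F9) and Arabic-Indic (U+0660..U+0669) digits to ASCII.
_DIGITS = {}
for _base in (0x06F0, 0x0660):
    for _i in range(10):
        _DIGITS[_base + _i] = str(_i)

# Assumed date layout: YYYY/MM/DD, not embedded in a longer run of digits.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})/(\d{1,2})/(\d{1,2})(?!\d)")


def normalize_persian(text):
    """Fold presentation forms to base letters and convert digits to ASCII."""
    return unicodedata.normalize("NFKC", text).translate(_DIGITS)


def find_jalali_dates(text):
    """Return date-like strings found in text, zero-padded as YYYY/MM/DD."""
    dates = []
    for year, month, day in _DATE_RE.findall(normalize_persian(text)):
        # Loose range check only; this does not validate the Jalali calendar.
        if 1 <= int(month) <= 12 and 1 <= int(day) <= 31:
            dates.append(f"{year}/{int(month):02d}/{int(day):02d}")
    return dates


if __name__ == "__main__":
    # Sample string written with escapes: Persian digits for 1400/12/29.
    sample = "\u06f1\u06f4\u06f0\u06f0/\u06f1\u06f2/\u06f2\u06f9"
    print(find_jalali_dates(sample))
```

In the question's script this would be applied to `page_content` after extraction. It assumes the extractor returns characters in logical order; if the text comes out in visual (reversed) order, the pattern would need adjusting.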


pdf-scraping python python-3.x