How do we decode binary streams inside a PDF using Python?

Problem description

I have a PDF file that contains a binary stream:

5 0 obj
<< /Length 4760 /Filter [ /ASCII85Decode /FlateDecode ]
 >>
stream
<<LOTS OF BINARY DATA IS HERE!!>>
endstream

How do I decode the binary stream?

A similar question was asked here:

How do we decompress FlateDecode Objects in PDF in Python?

However, my stream has two filters, /ASCII85Decode and /FlateDecode, not just /FlateDecode.

In addition, I keep getting the following error message:

Error -3 while decompressing data: incorrect header check
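
Going by the /Filter array above, a likely cause is that the stream bytes are still ASCII85 text when they reach zlib, so zlib does not find a valid header. The sketch below reproduces that situation on synthetic data rather than on the PDF in question; the names `original`, `encoded`, and `ascii85_body` are made up for illustration.

```python
import base64
import zlib

original = b"BT /F1 12 Tf (Hello) Tj ET"

# A PDF writer applies the filters in reverse array order when encoding:
# Flate first, then ASCII85. PDF marks the end of ASCII85 data with "~>".
encoded = base64.a85encode(zlib.compress(original)) + b"~>"

# Passing the ASCII85 text straight to zlib fails, as in the question.
try:
    zlib.decompress(encoded)
    error_message = None
except zlib.error as exc:
    error_message = str(exc)

# Decoding in array order (ASCII85 first, then Flate) recovers the data.
ascii85_body = encoded[: encoded.index(b"~>")]
decoded = zlib.decompress(base64.a85decode(ascii85_body))
```

The filters in a /Filter array are listed in the order they must be applied when decoding, which is why ASCII85 comes before Flate here.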

Here is one of my attempts at decoding the PDF:

import re
import zlib
import io

def sani_path(ugly_path):
    """ 
        sanitize an ugly file-path 
        converts Windows-style file paths to Unix-style paths and vice versa.
        changes slashes to back-slashes or back-slashes to slashes
    """
    # IMPLEMENTATION NOTES:
    #     VARIABLE NAMES: 
    #         `wip`.... `work in progress`
    #
    #     OTHER NOTES:
    #          `repr()` converts new-line char in the middle of the path into
    #          a backslash character and a letter "n"
    import pathlib
    wip = str(ugly_path)
    wip = wip.strip()
    wip = repr(wip)[1:-1]
    wip = pathlib.Path(wip)        
    pretty_path = wip      
    return pretty_path

def decompress_pdf_from_path(xpath):
    ###########################################
    ipath = sani_path(xpath)
    del xpath
    ############################################
    with open(str(ipath), 'rb') as pdf_file:
        pdf = pdf_file.read()
    stream = re.compile(b'stream(.*?)endstream', re.S)
    outputs = []
    for s in stream.findall(pdf):
        s = s.strip(b'\r\n')
        try:
            outputs.append(zlib.decompress(s).decode('UTF-8'))
        except (zlib.error, UnicodeDecodeError) as exc:
            # Record the failing input and the error instead of aborting.
            out_stream = io.StringIO()
            print(
                "INPUT STRING: ", s, type(exc), str(exc), sep="\n", file=out_stream
            )
            outputs.append(out_stream.getvalue())
    # Return after the loop so every stream is reported, not just the first.
    return "\n".join(outputs)


read_file_path_str = 'C:/Users/username/Desktop/test.pdf'
decompressed_pdf = decompress_pdf_from_path(read_file_path_str)
print(decompressed_pdf)
print('length ',len(decompressed_pdf))

Solution
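
One possible approach, offered as a minimal sketch rather than a complete PDF parser: read each object's /Filter entry and apply the listed filters in order. The helper names (`ascii85_decode`, `decode_stream`, `iter_decoded_streams`) are hypothetical. The sketch assumes objects end with `endobj`, that stream data does not itself contain the bytes `endobj` or `endstream`, and that no /DecodeParms predictor is in use. Only the two filters from the question are handled.

```python
import base64
import re
import zlib

# Dictionary, then the "stream" keyword, then the data up to "endstream".
STREAM_RE = re.compile(rb"<<(.*)>>\s*stream\r?\n(.*?)\r?\n?endstream", re.S)
# /Filter is either a single name or an array of names.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")


def ascii85_decode(data):
    """Decode PDF-style ASCII85 data, which ends with the marker '~>'."""
    data = data.strip()
    if data.startswith(b"<~"):  # optional Adobe-style start marker
        data = data[2:]
    end = data.find(b"~>")
    if end != -1:
        data = data[:end]
    return base64.a85decode(data)


# Only the filters from the question; others would need their own decoders.
DECODERS = {b"ASCII85Decode": ascii85_decode, b"FlateDecode": zlib.decompress}


def decode_stream(raw, filters):
    """Apply the filters in the order they appear in the /Filter entry."""
    for name in filters:
        if name not in DECODERS:
            raise ValueError("unsupported filter: " + name.decode("ascii"))
        raw = DECODERS[name](raw)
    return raw


def iter_decoded_streams(pdf_bytes):
    """Yield the decoded bytes of every stream found in pdf_bytes."""
    for chunk in pdf_bytes.split(b"endobj"):
        match = STREAM_RE.search(chunk)
        if match is None:
            continue  # object without a stream
        dictionary, raw = match.groups()
        filter_match = FILTER_RE.search(dictionary)
        filters = re.findall(rb"/(\w+)", filter_match.group(1)) if filter_match else []
        yield decode_stream(raw, filters)
```

To try it on a file, read the bytes with `pathlib.Path(path).read_bytes()` and loop over `iter_decoded_streams(...)`. The results are bytes; decoding them as UTF-8 only makes sense for streams that actually hold text. For anything beyond a quick inspection, a dedicated PDF library is likely the safer choice, since regular expressions cannot handle every valid PDF layout.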
