How do I use the Python pdf2image module (i.e. poppler) on Google Cloud Functions?

Problem description

I am trying to convert a PDF to JPEG on Google Cloud Functions using the Python module pdf2image, but I don't know how to resolve the errors "No such file or directory: 'pdfinfo'" and "Unable to get page count. Is poppler installed and in PATH?" that I get in the Cloud Function.

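The error is very similar to the one in this question. pdf2image is a wrapper around poppler's "pdftoppm" and "pdftocairo". But how do I install the poppler package on Google Cloud Functions and add it to PATH? I couldn't find any reference on this. Is it even possible? If not, what can be done instead?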

There is also this question, but it didn't help.

The code is shown below. The entry point is process_image.

import requests
from pdf2image import convert_from_path

def process_image(event,context):
    # Download sample pdf file
    url = 'https://www.a@R_404[email protected]/support/products/enterprise/kNowledgecenter/media/c4611_sample_explain.pdf'
    r = requests.get(url,allow_redirects=True)
    open('/tmp/sample.pdf','wb').write(r.content)

    # The error occurs on this line
    pages = convert_from_path('/tmp/sample.pdf')

    # Save pages to /tmp
    for idx,page in enumerate(pages):
        output_file_path = f"/tmp/{str(idx)}.jpg"
        page.save(output_file_path,'JPEG')
        # To be saved to cloud storage

requirements.txt:

requests==2.25.1
pdf2image==1.14.0

This is the error I get:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py",line 441,in pdfinfo_from_path
    proc = Popen(command,env=env,stdout=PIPE,stderr=PIPE)
  File "/opt/python3.8/lib/python3.8/subprocess.py",line 858,in __init__
    self._execute_child(args,executable,preexec_fn,close_fds,
  File "/opt/python3.8/lib/python3.8/subprocess.py",line 1706,in _execute_child
    raise child_exception_type(errno_num,err_msg,err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py",line 2447,in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py",line 1952,in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py",line 1821,in handle_user_exception
    reraise(exc_type,exc_value,tb)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/_compat.py",line 39,in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py",line 1950,in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py",line 1936,in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py",line 149,in view_func
    function(data,context)
  File "/workspace/main.py",line 11,in process_image
    pages = convert_from_path('/tmp/sample.pdf')
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py",line 97,in convert_from_path
    page_count = pdfinfo_from_path(pdf_path,userpw,poppler_path=poppler_path)["Pages"]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py",line 467,in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Thanks in advance for your help.

Solution

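This error occurs because the poppler binaries (such as pdfinfo and pdftoppm) are not available in the Cloud Functions runtime, and you cannot install them there: in serverless products such as Cloud Functions the filesystem is read-only apart from /tmp.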

You may want to try the approach described in the other thread, or consider using GCP Compute Engine, which gives you access to the whole system.

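Cloud Functions does not support installing custom system-level packages (even though it supports third-party libraries for the relevant programming languages through package managers such as npm and pip). As shown at https://cloud.google.com/functions/docs/reference/system-packages, there is no "poppler" package.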
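To confirm which command-line tools a given runtime actually provides, a quick check from inside the function can help. The following is a minimal sketch using only the standard library; the helper name missing_binaries is made up for illustration and is not part of pdf2image or any Google library.

```python
import shutil


def missing_binaries(names):
    """Return the subset of `names` that cannot be found on PATH."""
    # shutil.which returns None when no executable with that name is found
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pdfinfo and pdftoppm belong to poppler; gs is Ghostscript.
    print(missing_binaries(["pdfinfo", "pdftoppm", "gs"]))
```

If pdfinfo shows up in the printed list, pdf2image will fail with the error from the question regardless of what is listed in requirements.txt.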

However, you can still use the other pre-installed packages. Ghostscript can be used to convert a PDF to images.

First, save the PDF file inside the Cloud Function (for example, by downloading it from Cloud Storage). You only have disk write access to /tmp (https://cloud.google.com/functions/docs/concepts/exec#file_system).
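Since /tmp is the only writable location, and files left there are, as far as I know, kept in memory and count against the function's memory limit, it may be worth cleaning up after each invocation. Below is a minimal sketch, assuming the default temporary directory resolves to /tmp as it normally does on Linux; with_scratch_dir and demo are hypothetical names used only for this example.

```python
import os
import tempfile


def with_scratch_dir(work):
    """Call work(scratch_dir) and delete the directory afterwards."""
    # TemporaryDirectory creates a fresh directory under the default
    # temp location and removes it, with its contents, on exit.
    with tempfile.TemporaryDirectory() as scratch_dir:
        return work(scratch_dir)


def demo(scratch_dir):
    """Write a placeholder file and report whether it exists."""
    pdf_path = os.path.join(scratch_dir, "sample.pdf")
    with open(pdf_path, "wb") as fh:
        fh.write(b"%PDF-1.4\n")  # placeholder bytes, not a complete PDF
    return pdf_path, os.path.exists(pdf_path)
```

Calling with_scratch_dir(demo) returns the path and True, and the file is gone once the call returns.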

An example terminal command that converts a PDF to JPEG:

gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=jpeg -dJPEGQ=100 -r300 -sOutputFile=output/file/path input/file/path

Example code that runs the command from Python:

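import logging
import os
import subprocess

from google.cloud import storage

# download the file from google cloud storage
gcs = storage.Client(project=os.environ['GCP_PROJECT'])
bucket = gcs.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(input_file_path)

# run ghostscript; the arguments are passed as a list so that paths containing
# spaces are not split and no literal quote characters end up in the file name
cmd = ['gs', '-dSAFER', '-dNOPAUSE', '-dBATCH', '-sDEVICE=jpeg', '-dJPEGQ=100', '-r300',
       f'-sOutputFile={output_file_path}', input_file_path]
p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
stdout, stderr = p.communicate()
error = stderr.decode('utf8')
# treat a non-zero exit status or any stderr output as a failure
if p.returncode != 0 or error:
    logging.error(error)
    return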
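For PDFs with more than one page, Ghostscript can write one file per page when the output file name contains a printf-style page counter such as %03d. The following is a minimal sketch of that idea. The function names, the page-%03d.jpg naming scheme and the default values are assumptions made for illustration, and pdf_to_jpegs only works where the gs binary is actually on PATH.

```python
import glob
import os
import subprocess


def build_gs_command(input_path, output_dir, dpi=300, quality=100):
    """Build a Ghostscript argument list that writes one JPEG per page."""
    # %03d is expanded by Ghostscript to the page number (001, 002, ...)
    output_pattern = os.path.join(output_dir, "page-%03d.jpg")
    return [
        "gs", "-dSAFER", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg", f"-dJPEGQ={quality}", f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        input_path,
    ]


def pdf_to_jpegs(input_path, output_dir):
    """Run Ghostscript and return the generated JPEG paths in page order."""
    cmd = build_gs_command(input_path, output_dir)
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(cmd, check=True, capture_output=True)
    # zero-padded page numbers make the lexical sort match page order
    return sorted(glob.glob(os.path.join(output_dir, "page-*.jpg")))
```

Inside the Cloud Function, both input_path and output_dir would have to be under /tmp, and the returned files could then be uploaded to Cloud Storage.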

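Note: you might be tempted to use the ImageMagick package instead, which itself uses Ghostscript. However, as described in Can't load PDF with Wand/ImageMagick in Google Cloud Function, ImageMagick's PDF reading is disabled at the time of writing (2021-07-12) because of a security vulnerability in Ghostscript. The solution given there is essentially just another way of running Ghostscript.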

Reference: https://www.the-swamp.info/blog/google-cloud-functions-system-packages/
