pdf: How can I batch-list PDFs that contain annotations? qpdf? pdfinfo?

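Problem description

When I printed a PDF that I had annotated in Okular, I was surprised to find that the annotations were missing from the printout, even though they were displayed on screen. I had to save the annotated file as a printed PDF and then print that.

Question: how can I list all PDFs that have at least one annotation on at least one page?

Apparently, pdfinfo reports AcroForm when there are annotations: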

            find -type f -iname "*.pdf" -exec pdfinfo {} \;

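but this does not show the file names.

I am not familiar with qpdf, but it does not seem to provide this information.

Thanks

Solution

With pdfinfo from poppler-utils you can say,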

find . -type f -iname '*.pdf' | while read -r pn
do  pdfinfo "$pn" |
    grep -q '^Form: *AcroForm' && printf '%s\n' "$pn"
done

to list the names of the PDF files for which pdfinfo reports:

Form:           AcroForm

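However, in my tests it missed several PDFs with text annotations and listed several that had none, so I would avoid it for this job. Below are 2 alternatives: qpdf supports all annotation subtypes, while python3-poppler-qt5 supports only a subset but can be faster.

(For non-POSIX shells, adapt the commands in this post.)

Edit: edited the find constructs to avoid the unsafe and GNU-dependent {}.

qpdf has supported a JSON representation of non-content PDF data since version 8.3.0, and if your system has the jq JSON processor, you can list the unique PDF annotation types as tab-separated values (in this case, the output is discarded and only the exit code is used):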

find . -type f -iname '*.pdf' | while read -r pn
do  qpdf --json --no-warn -- "$pn" |
    jq -e -r --arg typls '*' -f annots.jq > /dev/null && 
    printf '%s\n' "$pn"
done

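where

--arg typls '*' specifies the desired annotation subtypes, e.g. * for all (the default), or Text,FreeText,Link for a selection
-e sets exit code 4 if there is no output (no annotations found)
-r produces raw (non-JSON) output
The jq script file annots.jq contains the following: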

#! /usr/bin/env jq-1.6
def annots:
    ( if ($typls | length) > 0 and $typls != "*"
      then $typls
      else
        # annotation types, per Adobe's PDF Reference 1.7 (table 8.20)
        "Text,Link,FreeText,Line,Square,Circle,Polygon"
        + ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
        + ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
        + ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
      end | split(",")
    ) as $whitelist
    | .objects
    | .[]
    | objects
    | select( ."/Type" == "/Annot" )
    | select( ."/Subtype" | .[1:] | IN($whitelist[]) )
    | ."/Subtype" | .[1:]
    ;
[ annots ] | unique as $out
| if ($out | length) > 0 then ($out | @tsv) else empty end
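
For readers who would rather not depend on jq, the same filter can be sketched in Python. This is only a sketch under assumptions: it expects the JSON layout that the jq script above reads (a top-level "objects" mapping whose values may be dictionaries with "/Type" and "/Subtype" keys), and the function name annot_subtypes is made up for illustration. Unlike the jq script, passing no selection accepts every subtype it meets instead of checking against a fixed list.

```python
import json

def annot_subtypes(qpdf_json, wanted=None):
    """Return the sorted, unique annotation subtypes found in parsed qpdf JSON.

    qpdf_json: dict parsed from qpdf's JSON output (layout assumed as above).
    wanted:    optional iterable of subtype names without the leading slash,
               e.g. ["Text", "Link"]; None means accept every subtype.
    """
    found = set()
    for obj in qpdf_json.get("objects", {}).values():
        # Skip arrays, numbers and other non-dictionary objects.
        if not isinstance(obj, dict):
            continue
        if obj.get("/Type") != "/Annot":
            continue
        subtype = obj.get("/Subtype")
        if not isinstance(subtype, str):
            continue
        name = subtype[1:]  # drop the leading "/" of the PDF name
        if wanted is None or name in wanted:
            found.add(name)
    return sorted(found)


if __name__ == "__main__":
    # Hand-written sample in the assumed layout, not real qpdf output.
    sample = json.loads("""
    {"objects": {
        "1 0 R": {"/Type": "/Catalog"},
        "5 0 R": {"/Type": "/Annot", "/Subtype": "/Text"},
        "6 0 R": {"/Type": "/Annot", "/Subtype": "/Link"},
        "7 0 R": [1, 2, 3]
    }}
    """)
    print(annot_subtypes(sample))            # all subtypes present
    print(annot_subtypes(sample, ["Text"]))  # restricted to Text
```

An empty result plays the role of jq's non-zero exit code: the file has no matching annotation. Newer qpdf releases may emit a different JSON layout by default, so check the qpdf manual for your version before relying on the "objects" key.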

For many purposes, Python 3.x with python3-poppler-qt5 can process the whole list of files in one go:

find . -type f -iname '*.pdf' -exec python3 path/to/script -t 1,7 {} '+'

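where the -t option lists the desired annotation subtypes, per the poppler documentation; 1 is AText and 7 is ALink. Without -t, all subtypes known to poppler (0 to 14) are selected, which means that not every existing subtype is supported.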

#! /usr/bin/env python3.8
import popplerqt5

def gotAnnot(pdfPathname,subtypls):
    pdoc = popplerqt5.Poppler.Document.load(pdfPathname)
    if pdoc is None:            ## file could not be loaded: report no match
        return False
    for pgindex in range(pdoc.numPages()):
        annls = pdoc.page(pgindex).annotations()
        if annls is not None and len(annls) > 0:
            for a in annls:
                if a.subType() in subtypls:
                    return True
    return False

if __name__ == "__main__":
    import sys,getopt
    typls = range(14+1)         ## default: all subtypes
    opts,args = getopt.getopt(sys.argv[1:],"t:")
    for o,a in opts:
        if o == "-t" and a != "*":
            typls = [int(c) for c in a.split(",")]
    for pathnm in args:
        if gotAnnot(pathnm,typls):
            print(pathnm)
