Empty parsers and Tika Server mode

Problem

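I can't figure out how to load parsers into Tika. According to the documentation, tika-app comes with the parsers pre-packaged (https://tika.apache.org/1.17/gettingstarted.html). When I run the following command to start the server, I get the output shown after it: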

    ./.java-buildpack/open_jdk_jre/bin/java -jar ./lib/tika-app-1.24.1.jar -s --port ${PORT}

    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR Nov 02,2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR WARNING: J2KImageReader not loaded. jpeg2000 files will not be processed.
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR See https://pdfBox.apache.org/2.0/dependencies.html#jai-image-io
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR for optional dependencies.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Nov 02,2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR WARNING: org.xerial's sqlite-jdbc is not loaded.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Please provide the jar on your classpath to parse sqlite files.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR See tika-parsers/pom.xml for the correct version.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] OUT Successfully started tika-app's server on port: 8080
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR WARNING: The server option in tika-app is deprecated and will be removed
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR by Tika 2.0 if not shortly after Tika 1.14.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR Please migrate to the JAX-RS tika-server package.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR See https://wiki.apache.org/tika/TikaJAXRS for usage.
    2020-11-02T13:31:25.66-0600 [HEALTH/0] ERR Failed to make HTTP request to '/version' on port 8080: timed out after 1.00 seconds
    2020-11-02T13:31:25.66-0600 [CELL/0] ERR Timed out after 1m0s: health check never passed.

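I have the latest Tika version, 1.24.1. The documentation mentions downloading tika-server and passing a classpath at runtime that points to tika-parsers.jar (https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-ParsersMissing), but I can't find a parsers.jar file anywhere. I'm running it with openjdk-jre-1.8.0.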

Solution

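The parsers should be bundled by default. Tika App in server mode (-s) is a plain socket-based server, not an HTTP server, which would explain why the HTTP health check against '/version' in your log never gets an answer. You can confirm that the server itself is working by sending it a file with netcat and checking whether you get a response: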

    nc localhost 8080 -q2 < test.pdf

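To use this from Python, you would need to write custom code that opens a socket, sends the input, sends SHUT_WR to signal the end of the input, and then reads the output back.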
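A minimal sketch of that socket exchange, using only the standard library. The function name `tika_app_extract` and the default host, port, and timeout are assumptions for illustration; the port matches the 8080 shown in the log above.

```python
import socket


def tika_app_extract(data: bytes, host: str = "localhost",
                     port: int = 8080, timeout: float = 30.0) -> bytes:
    """Send raw document bytes to a socket server and return its reply.

    Hypothetical helper: it follows the same steps as the netcat call,
    namely connect, send the file, half-close, then read until EOF.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(data)
        # Half-close the connection so the server sees end-of-input
        # while we can still read its response.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read: server closed its side
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

Reading test.pdf in binary mode and passing its bytes to `tika_app_extract` should be equivalent to the netcat command; how the returned bytes are encoded depends on the server, so decode them only once you have checked what it sends back.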

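If you are using the tika-python library, it expects Tika Server, which lives in the tika-server JAR, not the tika-app JAR. The library has some helper settings, so you can either point it at the JAR or host your own instance (self-run or Docker) and give it the URL.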
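If you host your own Tika Server instance, you can check that it speaks HTTP before wiring up tika-python. The sketch below uses only the standard library and assumes the server's usual REST endpoints (GET /version and PUT /tika) and its default port 9998; the function names and the `base_url` default are hypothetical, so adjust them to your deployment.

```python
import urllib.request


def tika_server_version(base_url: str = "http://localhost:9998",
                        timeout: float = 10.0) -> str:
    """Return the version string reported by a running Tika Server."""
    with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def tika_server_extract_text(data: bytes,
                             base_url: str = "http://localhost:9998",
                             timeout: float = 60.0) -> str:
    """PUT raw document bytes to /tika and return the extracted plain text."""
    req = urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Assumption: a generic type is sent so that urllib does not
            # add its default form-encoded Content-Type header.
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Once `tika_server_version()` answers, the same base URL is what you would give to tika-python, for example through its `serverEndpoint` argument or the `TIKA_SERVER_ENDPOINT` environment variable. Check the tika-python README for the exact settings in your version.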