October 22, 2024
java

Apache Tika Server v2 Not Exposing Async or Pipes Endpoints


My goal is to use Tika server to take an S3 source/destination URL and asynchronously parse various file types. Using this guide as a starting point, I got Tika server (2.9.2) running locally in Docker, but I don’t see any /async or /pipes endpoints. I don’t expect them to work locally without a bucket, which is fine, but I’d expect the endpoints to at least show up. This is based on their documentation on tika-pipes.

These are the only logs I get on startup (screenshot not preserved here), and the /async and /pipes endpoints both return 404. The main home page looks fine but also doesn’t list the routes I’m looking for.
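To make the 404 check repeatable, here is a minimal sketch in Python. It is not part of my setup, and the function name `probe` is purely illustrative. It requests each path and records the HTTP status code, assuming the server is reachable on the published port 9998. A 404 would suggest that no route is registered at that path; I would expect a different status (for example 405) if the route existed but rejected the HTTP method, though that is an assumption about how the server responds.

```python
import urllib.error
import urllib.request


def probe(base_url, paths, method="GET", timeout=5.0):
    """Return {path: HTTP status code} for each path under base_url."""
    # Bypass any configured proxy, since the server is expected to be local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    results = {}
    for path in paths:
        req = urllib.request.Request(base_url.rstrip("/") + path, method=method)
        try:
            with opener.open(req, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses raise HTTPError; the code is what we want to record.
            results[path] = err.code
    return results
```

For example, `probe("http://localhost:9998", ["/", "/async", "/pipes"])` could be run against the container described below, and repeated with `method="POST"` to see whether the status changes.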

I’m assuming that either I need to explicitly expose those endpoints, or the server isn’t picking up the jars I brought in and therefore isn’t autoloading them. Or maybe there’s something about the config file that I don’t understand.

Any pointers are appreciated!

My tika-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
  </parsers>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>./config/tika-config.xml</tikaConfig>
    </params>
  </pipes>
  <async>
    <params>
      <timeoutMillis>1000000</timeoutMillis>
    </params>
  </async>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <region>us-east-1</region>
        <bucket>tika-bucket</bucket>
        <credentialsProvider>instance</credentialsProvider>
        <spoolToTemp>false</spoolToTemp>
        <extractUserMetadata>false</extractUserMetadata>
        <maxConnections>100</maxConnections>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <param name="name" type="string">s3e</param>
        <param name="region" type="string">us-east-1</param>
        <param name="credentialsProvider" type="string">instance</param>
        <param name="bucket" type="string">tika-bucket</param>
        <param name="fileExtension" type="string">json</param>
        <param name="spoolToTemp" type="bool">true</param>
      </params>
    </emitter>
  </emitters>
</properties>
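As a sanity check that the file is well-formed and declares what I think it does, here is a minimal sketch in Python that parses the config and lists the unsecure-features flag plus each fetcher and emitter with its name. The helper names `summarize` and `_param` are hypothetical. The script only reads the XML and says nothing about how Tika itself interprets it. It handles both parameter styles used above (`<name>…</name>` and `<param name="name">…</param>`).

```python
import xml.etree.ElementTree as ET


def _param(params, key):
    """Read a parameter written as <key>v</key> or <param name="key">v</param>."""
    if params is None:
        return None
    direct = params.find(key)
    if direct is not None:
        return (direct.text or "").strip()
    for p in params.findall("param"):
        if p.get("name") == key:
            return (p.text or "").strip()
    return None


def summarize(config_path):
    """Return the unsecure-features flag and (class, name) pairs for fetchers/emitters."""
    root = ET.parse(config_path).getroot()
    summary = {
        "enableUnsecureFeatures": _param(
            root.find("server/params"), "enableUnsecureFeatures"
        ),
        "fetchers": [],
        "emitters": [],
    }
    for kind, tag in (("fetchers", "fetcher"), ("emitters", "emitter")):
        for el in root.findall(f"{kind}/{tag}"):
            summary[kind].append((el.get("class"), _param(el.find("params"), "name")))
    return summary
```

Running `summarize("tika-config.xml")` on the host, next to the Dockerfile, shows which fetcher and emitter names the file declares, independent of whether the server loads them.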

Dockerfile (I removed a bunch of lines that just downloaded dependencies; they are in the tutorial linked above):

FROM ubuntu:focal as base
RUN apt-get update

ENV TIKA_VERSION 2.9.2
ENV TIKA_SERVER_JAR tika-server-standard

FROM base as dependencies

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install gdal-bin tesseract-ocr \
        tesseract-ocr-eng curl gnupg

# Set this environment variable if you need to run OCR
ENV OMP_THREAD_LIMIT=1

RUN apt-get -y install openjdk-17-jdk

FROM dependencies as fetch_tika

# download all the tika dependencies (removed those lines of code for this question)

ENV TIKA_VERSION=$TIKA_VERSION
RUN mkdir /tika-bin
COPY --from=fetch_tika /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar /tika-bin/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar
# The extra dependencies need to be added into tika-bin together with the tika-server jar
COPY --from=fetch_tika /tika-fetcher-s3-${TIKA_VERSION}.jar /tika-bin/tika-fetcher-s3-${TIKA_VERSION}.jar
COPY --from=fetch_tika /tika-emitter-s3-${TIKA_VERSION}.jar /tika-bin/tika-emitter-s3-${TIKA_VERSION}.jar
RUN mkdir /config
COPY tika-config.xml /config

EXPOSE 9998
ENTRYPOINT [ "/bin/sh", "-c", "exec java -cp \"/tika-bin/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@"]

Then build + run:

docker build --tag 'tika_server_local' .

docker run -d \
    --name tika_container \
    -v tika_dir:/config \
    -p 9998:9998 tika_server_local:latest \
    -c ./config/tika-config.xml


