
Apache Tika Server v2 Not Exposing Async or Pipes Endpoints


My goal is to use Tika server to take an S3 source/destination URL and asynchronously parse various file types. Using this guide as a starting point, I got Tika server (2.9.2) running locally with Docker, but I don’t see any /async or /pipes endpoints. I don’t expect them to work locally without a bucket, which is fine, but I’d expect the endpoints to at least show up. This is based on their documentation on tika-pipes.

These are the only logs I get on startup and the /async and /pipes endpoints both return 404s. The main home page looks fine but also doesn’t show the routes I’m looking for.
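To make the 404 check reproducible, here is a minimal Python sketch that requests each path and prints the status code. The base URL `http://localhost:9998` comes from the port mapping in the `docker run` command; the helper names `http_status` and `probe` are made up for illustration. One assumption to keep in mind: if /async and /pipes only accept POST, a plain GET against a registered resource would more likely come back as 405 than 404, so a 404 here suggests the resource was never registered at all.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def probe(base_url, paths, get_status=http_status):
    """Map each path to the status code returned for base_url + path."""
    base = base_url.rstrip("/")
    return {path: get_status(base + path) for path in paths}


if __name__ == "__main__":
    # Assumed local address, taken from "-p 9998:9998" in the run command.
    results = probe("http://localhost:9998", ["/", "/async", "/pipes"])
    for path, status in results.items():
        print(f"{path}: {status}")
```

The `get_status` parameter exists only so the lookup can be swapped out when no server is running.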

I’m assuming that either I need to explicitly expose those endpoints, or the server isn’t picking up the jars I brought in and therefore isn’t autoloading them. Or maybe there is something else I don’t understand about the config file.

Any pointers are appreciated!

My tika-config.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
  </parsers>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>./config/tika-config.xml</tikaConfig>
    </params>
  </pipes>
  <async>
    <params>
      <timeoutMillis>1000000</timeoutMillis>
    </params>
  </async>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <region>us-east-1</region>
        <bucket>tika-bucket</bucket>
        <credentialsProvider>instance</credentialsProvider>
        <spoolToTemp>false</spoolToTemp>
        <extractUserMetadata>false</extractUserMetadata>
        <maxConnections>100</maxConnections>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <param name="name" type="string">s3e</param>
        <param name="region" type="string">us-east-1</param>
        <param name="credentialsProvider" type="string">instance</param>
        <param name="bucket" type="string">tika-bucket</param>
        <param name="fileExtension" type="string">json</param>
        <param name="spoolToTemp" type="bool">true</param>
      </params>
    </emitter>
  </emitters>
</properties>
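The config above declares the fetcher's parameters as child elements (`<name>s3f</name>`) and the emitter's as `<param name="…">` entries. As a quick sanity check that the file is well-formed and that both names are where they are expected, a small Python sketch along these lines could list them. It reads both styles; whether Tika 2.9.2 accepts both styles for these classes is not confirmed here, and `component_names` is a hypothetical helper, not part of Tika.

```python
import xml.etree.ElementTree as ET


def component_names(config_xml, section, tag):
    """Map each <tag> class under <section> to the value of its "name" param."""
    root = ET.fromstring(config_xml)
    names = {}
    for comp in root.findall(f"./{section}/{tag}"):
        params = comp.find("params")
        if params is None:
            continue
        # Style used by the fetcher: <name>s3f</name>
        name = params.findtext("name")
        if name is None:
            # Style used by the emitter: <param name="name" ...>s3e</param>
            for param in params.findall("param"):
                if param.get("name") == "name":
                    name = param.text
                    break
        names[comp.get("class")] = name
    return names


if __name__ == "__main__":
    # Read as bytes so the parser honors the encoding in the XML declaration.
    with open("tika-config.xml", "rb") as fh:
        xml_bytes = fh.read()
    print(component_names(xml_bytes, "fetchers", "fetcher"))
    print(component_names(xml_bytes, "emitters", "emitter"))
```

A parse error or a `None` name here would point at the config file itself rather than at the classpath.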

Dockerfile (I removed a number of lines from this part that only download dependencies; they are in the tutorial linked above):

FROM ubuntu:focal as base
RUN apt-get update

ENV TIKA_VERSION=2.9.2
ENV TIKA_SERVER_JAR=tika-server-standard

FROM base as dependencies

RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install gdal-bin tesseract-ocr \
        tesseract-ocr-eng curl gnupg

# Set this environment variable if you need to run OCR
ENV OMP_THREAD_LIMIT=1

RUN apt-get -y install openjdk-17-jdk

FROM dependencies as fetch_tika

# download all the tika dependencies (removed those lines of code for this question)

ENV TIKA_VERSION=$TIKA_VERSION
RUN mkdir /tika-bin
COPY --from=fetch_tika /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar /tika-bin/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar
# The extra dependencies need to be added into tika-bin together with the tika-server jar
COPY --from=fetch_tika /tika-fetcher-s3-${TIKA_VERSION}.jar /tika-bin/tika-fetcher-s3-${TIKA_VERSION}.jar
COPY --from=fetch_tika /tika-emitter-s3-${TIKA_VERSION}.jar /tika-bin/tika-emitter-s3-${TIKA_VERSION}.jar
RUN mkdir /config
COPY tika-config.xml /config

EXPOSE 9998
ENTRYPOINT [ "/bin/sh", "-c", "exec java -cp \"/tika-bin/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@"]

Then build + run:

docker build --tag 'tika_server_local' .

docker run -d \
    --name tika_container \
    -v tika_dir:/config \
    -p 9998:9998 tika_server_local:latest \
    -c ./config/tika-config.xml


