My goal is to use Tika server to take an S3 source/destination URL and asynchronously parse various file types. Using this guide as a starting point, I got Tika server (2.9.2) running locally with Docker, but I don’t see any /async or /pipes endpoints. I don’t expect them to work locally without a bucket, which is fine, but I’d expect the endpoints to at least show up. This is based on their documentation on tika-pipes.
These are the only logs I get on startup, and the /async and /pipes endpoints both return 404s. The main home page looks fine, but it doesn’t list the routes I’m looking for either.
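For anyone who wants to reproduce the 404s, here is a minimal sketch of a probe script. The base URL `http://localhost:9998` matches the port mapping in this post; the helper name `probe` and the list of paths are just illustrative. It sends plain GET requests, which is an assumption on my part: for an endpoint that only accepts another method, I would expect a registered route to answer with something other than 404 (typically 405), so a 404 suggests the route is not registered at all.

```python
import urllib.error
import urllib.request


def probe(base_url, paths, method="GET", timeout=5.0):
    """Return {path: HTTP status code}, or None for a path if the server is unreachable."""
    results = {}
    for path in paths:
        req = urllib.request.Request(base_url.rstrip("/") + path, method=method)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                results[path] = resp.status
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses raise HTTPError; the status code is still useful.
            results[path] = err.code
        except (urllib.error.URLError, OSError):
            # Connection refused, DNS failure, timeout, and similar.
            results[path] = None
    return results


if __name__ == "__main__":
    # Port 9998 is the one published by the docker run command in this post.
    statuses = probe("http://localhost:9998", ["/", "/tika", "/async", "/pipes"])
    for path, status in statuses.items():
        print(f"{path:8} -> {status}")
```

A status of `None` means the container is not reachable at all, which is a different problem from a missing route.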
I’m assuming that either I need to explicitly expose those endpoints, or it’s not picking up the jars I brought in and therefore not autoloading them. Or maybe there is something else I don’t understand about the config file.
Any pointers are appreciated!
My tika-config.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
  </parsers>
  <server>
    <params>
      <enableUnsecureFeatures>true</enableUnsecureFeatures>
    </params>
  </server>
  <pipes>
    <params>
      <tikaConfig>./config/tika-config.xml</tikaConfig>
    </params>
  </pipes>
  <async>
    <params>
      <timeoutMillis>1000000</timeoutMillis>
    </params>
  </async>
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.s3.S3Fetcher">
      <params>
        <name>s3f</name>
        <region>us-east-1</region>
        <bucket>tika-bucket</bucket>
        <credentialsProvider>instance</credentialsProvider>
        <spoolToTemp>false</spoolToTemp>
        <extractUserMetadata>false</extractUserMetadata>
        <maxConnections>100</maxConnections>
      </params>
    </fetcher>
  </fetchers>
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.s3.S3Emitter">
      <params>
        <param name="name" type="string">s3e</param>
        <param name="region" type="string">us-east-1</param>
        <param name="credentialsProvider" type="string">instance</param>
        <param name="bucket" type="string">tika-bucket</param>
        <param name="fileExtension" type="string">json</param>
        <param name="spoolToTemp" type="bool">true</param>
      </params>
    </emitter>
  </emitters>
</properties>
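To rule out a malformed config, a small sanity check like the one below could help. It is only a sketch: it confirms the XML is well-formed and lists the top-level sections plus each fetcher/emitter class and name, but it says nothing about how Tika itself interprets the file. The helper names `summarize` and `component_name` are made up for this example, and the lookup handles both parameter styles used above (`<name>...</name>` and `<param name="name">...</param>`).

```python
import sys
import xml.etree.ElementTree as ET


def component_name(component):
    """Find the 'name' parameter of a fetcher/emitter in either parameter style."""
    params = component.find("params")
    if params is None:
        return None
    direct = params.find("name")  # <name>s3f</name>
    if direct is not None and direct.text:
        return direct.text.strip()
    for param in params.findall("param"):  # <param name="name" ...>s3e</param>
        if param.get("name") == "name" and param.text:
            return param.text.strip()
    return None


def summarize(xml_data):
    """Return the top-level section tags and (class, name) pairs for fetchers and emitters."""
    root = ET.fromstring(xml_data)
    summary = {"sections": [child.tag for child in root]}
    for section, tag in (("fetchers", "fetcher"), ("emitters", "emitter")):
        summary[section] = [
            (comp.get("class"), component_name(comp))
            for comp in root.findall(f"{section}/{tag}")
        ]
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_config.py tika-config.xml
    with open(sys.argv[1], "rb") as handle:
        print(summarize(handle.read()))
```

For the config above, I would expect this to report the `parsers`, `server`, `pipes`, `async`, `fetchers`, and `emitters` sections, with `s3f` and `s3e` as the fetcher and emitter names.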
Dockerfile (I removed a number of lines from this part that just download dependencies; they are in the tutorial linked above):
FROM ubuntu:focal as base
RUN apt-get update
ENV TIKA_VERSION 2.9.2
ENV TIKA_SERVER_JAR tika-server-standard
FROM base as dependencies
RUN DEBIAN_FRONTEND=noninteractive apt-get update && apt-get -y install gdal-bin tesseract-ocr \
tesseract-ocr-eng curl gnupg
# Set this environment variable if you need to run OCR
ENV OMP_THREAD_LIMIT=1
RUN apt-get -y install openjdk-17-jdk
FROM dependencies as fetch_tika
# download all the tika dependencies (removed those lines of code for this question)
ENV TIKA_VERSION=$TIKA_VERSION
RUN mkdir /tika-bin
COPY --from=fetch_tika /${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar /tika-bin/${TIKA_SERVER_JAR}-${TIKA_VERSION}.jar
# The extra dependencies need to be added into tika-bin together with the tika-server jar
COPY --from=fetch_tika /tika-fetcher-s3-${TIKA_VERSION}.jar /tika-bin/tika-fetcher-s3-${TIKA_VERSION}.jar
COPY --from=fetch_tika /tika-emitter-s3-${TIKA_VERSION}.jar /tika-bin/tika-emitter-s3-${TIKA_VERSION}.jar
RUN mkdir /config
COPY tika-config.xml /config
EXPOSE 9998
ENTRYPOINT [ "/bin/sh", "-c", "exec java -cp \"/tika-bin/*\" org.apache.tika.server.core.TikaServerCli -h 0.0.0.0 $0 $@"]
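To test the "jars are not being picked up" theory, one option is to check which jars in the classpath directory actually contain the fetcher and emitter classes named in the config. The sketch below assumes the jars have been copied out of the container first (for example with `docker cp tika_container:/tika-bin ./tika-bin`); the function name `jars_containing` and the local `./tika-bin` path are illustrative, not part of Tika.

```python
import sys
import zipfile
from pathlib import Path


def jars_containing(jar_dir, class_name):
    """Return the names of jars in jar_dir that contain the given fully qualified class."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                hits.append(jar.name)
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_class.py ./tika-bin
    for cls in (
        "org.apache.tika.pipes.fetcher.s3.S3Fetcher",
        "org.apache.tika.pipes.emitter.s3.S3Emitter",
        "org.apache.tika.server.core.TikaServerCli",
    ):
        print(cls, "->", jars_containing(sys.argv[1], cls) or "NOT FOUND")
```

If a class shows up as not found, the corresponding jar never made it into `/tika-bin`, which would point at the Dockerfile rather than the config.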
Then build + run:
docker build --tag 'tika_server_local' .
docker run -d \
--name tika_container \
-v tika_dir:/config \
-p 9998:9998 tika_server_local:latest \
-c ./config/tika-config.xml