tesseract-ocr slow inside docker compared to local

  • Thread starter: qoob
I've successfully installed tesseract 5.3.1 inside a Docker container built on the base image python:3.10 (FROM python:3.10), and I run a script that takes a file_path and runs the file through PyMuPDF with OCR enabled (using tesseract as the backend).

On my local machine (Windows), I have the same versions of tesseract and PyMuPDF installed. Running the exact same code on the exact same file with the same library versions, I get wildly different timings: everything inside Docker runs much slower than it does natively on the same machine.

I tried three configurations:

Running locally on my machine

Code:
100%|██████████| 24/24 [00:49<00:00,  2.06s/it]
It took 0:00:49 to process 24 pages.

Running with the docker install

Code:
100%|██████████| 24/24 [03:14<00:00,  8.12s/it]
It took 0:03:14 to process 24 pages.

Running within docker, through bind mount

Code:
100%|██████████| 24/24 [02:06<00:00,  5.29s/it]
It took 0:02:06 to process 24 pages.
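
To put the three runs side by side, here is a minimal sketch that turns the totals above into seconds per page and a slowdown factor relative to the local run. The totals are copied from the progress bars; the names runs and summarize are only illustrative. Because the totals are rounded to whole seconds, the per-page figures can differ slightly from what tqdm printed.

```python
from datetime import timedelta

# Wall-clock totals copied from the three progress bars above (24 pages each).
runs = {
    "local": timedelta(seconds=49),
    "docker install": timedelta(minutes=3, seconds=14),
    "docker bind mount": timedelta(minutes=2, seconds=6),
}
PAGES = 24


def summarize(runs, baseline="local", pages=PAGES):
    """Return {name: (seconds_per_page, slowdown_vs_baseline)}."""
    base = runs[baseline].total_seconds()
    return {
        name: (t.total_seconds() / pages, t.total_seconds() / base)
        for name, t in runs.items()
    }


summary = summarize(runs)
for name, (per_page, slowdown) in summary.items():
    print(f"{name:18s} {per_page:5.2f} s/page  {slowdown:4.2f}x local")
```

On these numbers the Docker install takes roughly 4 times as long as the local run, and the bind-mount variant roughly 2.6 times as long.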

In docker compose I have the following bind mount, and I've set the TESSDATA_PREFIX environment variable to /code/ocr/tessdata so that it points at the bind mount.

Code:
- type: bind
  source: "C:/Program Files/Tesseract-OCR/tessdata"
  target: /code/ocr/tessdata

The script is essentially this:

Code:
import fitz  # pip install pymupdf

content_dict = {}  # page number -> OCR text of that page
pages = fitz.open(pdf_file_path)  # pdf_file_path: path to the input PDF
for page_num, page in enumerate(pages, start=1):
    list_of_text_in_block = []
    # OCR the full page at 400 dpi (tesseract does the recognition)
    page_data = page.get_textpage_ocr(dpi=400, full=True)
    text_blocks = page_data.extractBLOCKS()
    for text_block in text_blocks:
        block_text = text_block[4]  # index 4 of a block tuple is its text
        list_of_text_in_block.append(block_text)
    content_dict[page_num] = ' '.join(list_of_text_in_block)
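
One way to narrow this down might be to record the time of every page separately, to see whether the slowdown is spread evenly or concentrated in a few pages (for example the first one). The helper below is a rough sketch using only the standard library; time_each is a made-up name, and the stand-in workload only exists so the snippet runs without PyMuPDF.

```python
import time


def time_each(items, work):
    """Call work(item) for every item; return (results, seconds_per_item)."""
    results, timings = [], []
    for item in items:
        start = time.perf_counter()
        results.append(work(item))
        timings.append(time.perf_counter() - start)
    return results, timings


# Stand-in workload so the sketch runs anywhere. With the script above,
# the call would hypothetically look like:
#   _, ocr_times = time_each(pages, lambda p: p.get_textpage_ocr(dpi=400, full=True))
results, timings = time_each([1, 2, 3], lambda n: n * 2)
for index, seconds in enumerate(timings, start=1):
    print(f"item {index}: {seconds:.6f} s")
```

Comparing the per-page lists from the three configurations would show whether the gap is a fixed cost or scales with every page.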

The command tesseract --version yields this inside docker:

Code:
tesseract 5.3.1
 leptonica-1.79.0
  libgif 5.1.9 : libjpeg 6b (libjpeg-turbo 2.0.6) : libpng 1.6.37 : libtiff 4.2.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.4.0     
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libcurl/7.74.0 OpenSSL/1.1.1n zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh2/1.9.0 nghttp2/1.43.0 librtmp/2.3

and this on my local machine (note: 5.3.1 is the same as 5.3.1.20230401):

Code:
tesseract v5.3.1.20230401
 leptonica-1.83.1
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.4) : libpng 1.6.39 : libtiff 4.5.0 : zlib 1.2.13 : libwebp 1.3.0 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.2.9 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.2
 Found libcurl/8.0.1 Schannel zlib/1.2.13 brotli/1.0.9 zstd/1.5.4 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.10.0
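
Since the two dumps are hard to compare by eye, here is a small sketch that extracts the "Found ..." feature names from each and prints the ones that appear in only one build. The two strings are trimmed copies of the output above; found_features is an illustrative helper, not part of tesseract.

```python
import re

# "Found ..." lines copied from the two version dumps above; the long
# dependency lists after libcurl/libarchive are trimmed for brevity.
DOCKER_VERSION = """\
tesseract 5.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libcurl/7.74.0 OpenSSL/1.1.1n
"""
LOCAL_VERSION = """\
tesseract v5.3.1.20230401
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.6.2 zlib/1.2.13
 Found libcurl/8.0.1 Schannel
"""


def found_features(version_text):
    """Return the set of feature names from the 'Found ...' lines."""
    features = set()
    for line in version_text.splitlines():
        line = line.strip()
        if line.startswith("Found "):
            # Keep only the name, dropping any version after ' ' or '/'.
            rest = line[len("Found "):]
            features.add(re.split(r"[ /]", rest, maxsplit=1)[0])
    return features


docker = found_features(DOCKER_VERSION)
local = found_features(LOCAL_VERSION)
print("only in docker:", sorted(docker - local))
print("only in local: ", sorted(local - docker))
```

For the output above this reports OpenMP as present only in the Docker build and libarchive only in the local build. Whether either difference has anything to do with the slowdown is not established here.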

Looking at what happens inside Docker while the script is running, this is what I get: there are 12 cores available, but only one is being used, at 100 %. Even if the slowdown were due to resource usage, that would not explain why the bind mount option is about 35 % faster than the install within Docker, since both use the same resources. (Image of Docker resource usage: https://i.sstatic.net/tQrlF.png)
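
To compare what the process actually sees in each environment, a small report like the one below could be run both on the host and inside the container. It is only a sketch: cpu_and_openmp_report is a made-up helper, and checking the standard OpenMP variables OMP_THREAD_LIMIT and OMP_NUM_THREADS is an assumption prompted by the "Found OpenMP" line in the Docker build, not a confirmed cause.

```python
import os


def cpu_and_openmp_report():
    """Collect CPU visibility and OpenMP-related settings for this process."""
    report = {
        "os.cpu_count": os.cpu_count(),
        # sched_getaffinity only exists on Linux, so guard for Windows.
        "usable_cpus": (
            len(os.sched_getaffinity(0))
            if hasattr(os, "sched_getaffinity")
            else None
        ),
    }
    for var in ("OMP_THREAD_LIMIT", "OMP_NUM_THREADS", "TESSDATA_PREFIX"):
        report[var] = os.environ.get(var)  # None means "not set"
    return report


report = cpu_and_openmp_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If the two reports differ, that at least shows where the environments diverge before any tuning is attempted.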