OiO.lk Community platform!


Trying to get accurate OCR in Python

Thread starter: Jacob (Guest)
I am trying to extract text from PDF documents using pytesseract, but the results are inaccurate. In particular, the code at the bottom of the page reads ZI2440A, but pytesseract prints Z12440A (a 1 in place of the I). Is there a way to preprocess the image before OCR, or is there a different tool that would work better?

I have attached the page I used, already converted from PDF to JPG, with sensitive information blocked out. My code is below.

Test Doc: https://i.sstatic.net/4aJmTNmL.jpg

Code:
import pytesseract
from pdf2image import convert_from_path

# Point pytesseract at the Tesseract binary before running OCR.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

file = r"C:\Users\jkaplan\Documents\2023_HYDE, MATTHEW_SIGNED E-FILE AUTHORIZATION FORM.pdf"

# Render the PDF to a list of PIL images, one per page.
pages = convert_from_path(file, use_pdftocairo=True)
pages[0].save('testdoc.jpg')  # saved copy of the first page

# OCR the first page.
text = pytesseract.image_to_string(pages[0], lang="eng")
print(text)

I also tried extracting the text with pypdf and pdfminer, but both read this character as 1 instead of I.
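One workaround that does not depend on the OCR engine is to correct look-alike characters after the fact, using what is known about the code's layout. The sketch below is only an illustration: it assumes the code always has the shape two letters, four digits, one letter (as ZI2440A does), which the post does not confirm, and the names `CODE_SHAPE`, `fix_code`, and `fix_codes_in_text` as well as the look-alike tables are made up for this example.

```python
import re

# Hypothetical look-alike tables; extend them for the fonts in your documents.
DIGIT_TO_LETTER = {"1": "I", "0": "O", "5": "S", "8": "B"}
LETTER_TO_DIGIT = {"I": "1", "O": "0", "S": "5", "B": "8"}

# ASSUMPTION: the code is 2 letters, 4 digits, 1 letter (like ZI2440A).
# "L" marks a letter position, "D" a digit position.
CODE_SHAPE = "LLDDDDL"

# Candidate tokens: exactly 7 characters drawn from capitals and digits.
CANDIDATE = re.compile(r"\b[A-Z0-9]{7}\b")


def fix_code(token, shape=CODE_SHAPE):
    """Return token with look-alikes swapped to fit shape, or None if it cannot fit."""
    if len(token) != len(shape):
        return None
    fixed = []
    for char, kind in zip(token, shape):
        if kind == "L":
            # A digit in a letter position may be a misread letter.
            char = DIGIT_TO_LETTER.get(char, char)
            if not char.isalpha():
                return None
        else:
            # A letter in a digit position may be a misread digit.
            char = LETTER_TO_DIGIT.get(char, char)
            if not char.isdigit():
                return None
        fixed.append(char)
    return "".join(fixed)


def fix_codes_in_text(text):
    """Rewrite every 7-character token that can be made to fit the shape."""
    return CANDIDATE.sub(lambda m: fix_code(m.group()) or m.group(), text)


print(fix_codes_in_text("Form Z12440A"))
```

In the code above this would be applied to the `text` returned by `image_to_string`. It only helps when the layout of the code is known in advance, and a 7-digit number made up of look-alike digits in the letter positions would also be rewritten, so the shape and tables should be kept as narrow as the documents allow.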