October 21, 2024
python

Trouble locating a table in an image using OpenCV


I’m working on extracting text from documents. The text is either laid out as paragraphs or divided into sections and zones.

Tesseract by itself does an excellent job of extracting the text, but as you can see from the image, the middle section (in red) is a table that doesn’t follow a well-defined structure. I don’t care about the table structure itself; what I’m interested in is keeping the extracted text in its context, so going cell by cell seems like the solution.

I couldn’t get OpenCV to detect that table section. The plan was to crop each cell and run OCR on it separately from the other cells, so that the text stays in context.

example document

Here’s what I tried:

import cv2
import pytesseract
import numpy as np
import matplotlib.pyplot as plt

# Define paths
RAW_DATA_PATH = '../../data/raw/'
PROCESSED_DATA_PATH = '../../data/processed/'
RESULTS_PATH = '../../data/results/'

# Load the image (here, a page image converted from a PDF); adjust the path to match your file
image_path = f'{RAW_DATA_PATH}download0.png'
image = cv2.imread(image_path, cv2.IMREAD_COLOR)
if image is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError(f'Could not read image: {image_path}')

# Preprocess the image (convert to grayscale, binarize it, etc.)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY_INV)

# Build long, thin kernels that respond only to horizontal or vertical runs
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))

# Detect horizontal lines
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
# Detect vertical lines
vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)

# Combine both horizontal and vertical lines
table_structure = cv2.add(horizontal_lines, vertical_lines)

# Find contours of the table cells
contours, _ = cv2.findContours(table_structure, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Sort contours top to bottom, then left to right, for a stable ordering of table cells
contours = sorted(contours, key=lambda c: (cv2.boundingRect(c)[1], cv2.boundingRect(c)[0]))  # Sort by (Y, X)

# Define a function to extract text from each contour (table cell)
def extract_text_from_contour(image, contour):
    x, y, w, h = cv2.boundingRect(contour)
    cell_image = image[y:y+h, x:x+w]
    text = pytesseract.image_to_string(cell_image, lang='fra', config='--psm 6')
    return text, (x, y, w, h)

# Iterate through each contour and extract text
table_texts = []
annotated = image.copy()  # Draw on a copy so rectangles never end up inside later crops
for contour in contours:
    text, (x, y, w, h) = extract_text_from_contour(image, contour)
    table_texts.append(text)

    # Mark the cell on the annotated copy
    cv2.rectangle(annotated, (x, y), (x+w, y+h), (0, 255, 0), 2)

# Optionally, display all detected cells at once
plt.imshow(cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB))
plt.show()

# Display the extracted text for each table cell
for idx, text in enumerate(table_texts):
    print(f"Cell {idx + 1}:")
    print(text)
    print("------")

# Save the annotated image with table contours for verification
cv2.imwrite(f'{PROCESSED_DATA_PATH}processed_table_image.png', annotated)
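
A side note on ordering: sorting on the raw Y coordinate is sensitive to a pixel or two of jitter between cells in the same row, so the cells of one row can come out in the wrong order. Below is a minimal sketch of a row-aware ordering, assuming the cells are available as `(x, y, w, h)` tuples (for example from `cv2.boundingRect`). The helper name `sort_boxes_reading_order` and the `row_tolerance` default are made up for illustration and would need tuning for a real scan.

```python
def sort_boxes_reading_order(boxes, row_tolerance=10):
    """Order (x, y, w, h) boxes top to bottom, then left to right within a row.

    Boxes whose top edge lies within `row_tolerance` pixels of the first box
    in the current row are treated as belonging to that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)   # same row as the previous boxes
        else:
            rows.append([box])     # top edge is too far away: start a new row
    # Flatten the rows, sorting each one by its left edge
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


# Hypothetical boxes for a 2x2 table whose rows are off by a couple of pixels
example_boxes = [(150, 22, 100, 50), (20, 20, 100, 50),
                 (20, 100, 100, 50), (150, 98, 100, 50)]
ordered = sort_boxes_reading_order(example_boxes)
```

With the contours from the script above, the input would be `[cv2.boundingRect(c) for c in contours]`.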

Any suggestions?


