Trouble locating a table in an image using OpenCV

I’m working on extracting text from documents. The text can be either paragraph-like or split into sections and zones.

Tesseract on its own does a very good job of extracting the text, but as you can see from the image, the middle section (in red) is a table that doesn’t follow a well-defined structure. I don’t care about the table structure itself. What I’m interested in is keeping the extracted text in its context, so going cell by cell seems like the solution.

I couldn’t get OpenCV to detect that table section. The plan was to crop each cell and run OCR on it separately from the other cells, so that the text stays in context.

Here’s what I did:

import cv2
import pytesseract
import numpy as np
import matplotlib.pyplot as plt

# Define paths (the input image is a page converted from a PDF)
RAW_DATA_PATH = '../../data/raw/'
PROCESSED_DATA_PATH = '../../data/processed/'
RESULTS_PATH = '../../data/results/'

image_path = f'{RAW_DATA_PATH}download0.png'

# Load the image; cv2.imread returns None instead of raising if the path is wrong
image = cv2.imread(image_path, cv2.IMREAD_COLOR)
if image is None:
    raise FileNotFoundError(image_path)

# Preprocess the image (convert to grayscale, binarize it, etc.)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY_INV)

# Detect horizontal and vertical lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))

# Detect horizontal lines
horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
# Detect vertical lines
vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)

# Combine both horizontal and vertical lines
table_structure = cv2.add(horizontal_lines, vertical_lines)

# Find contours in the line mask. Note: RETR_EXTERNAL keeps only the outermost
# contours, so these are outlines of connected line groups, not individual cells.
contours, _ = cv2.findContours(table_structure, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Sort contours top to bottom by the Y coordinate of their bounding box
contours = sorted(contours, key=lambda x: cv2.boundingRect(x)[1])

# Define a function to extract text from each contour (table cell)
def extract_text_from_contour(image, contour):
    x, y, w, h = cv2.boundingRect(contour)
    cell_image = image[y:y+h, x:x+w]
    text = pytesseract.image_to_string(cell_image, lang='fra', config='--psm 6')
    return text, (x, y, w, h)

# Iterate through each contour and extract text.
# Draw on a copy so the green rectangles never end up inside a later crop.
annotated = image.copy()
table_texts = []
for contour in contours:
    text, (x, y, w, h) = extract_text_from_contour(image, contour)
    table_texts.append(text)
    cv2.rectangle(annotated, (x, y), (x+w, y+h), (0, 255, 0), 2)

# Display all detected regions once, instead of opening one figure per contour
plt.imshow(cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB))
plt.show()

# Display the extracted text for each table cell
for idx, text in enumerate(table_texts):
    print(f"Cell {idx + 1}:")
    print(text)
    print("------")

# Save the annotated image with table contours for verification
cv2.imwrite(f'{PROCESSED_DATA_PATH}processed_table_image.png', annotated)
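
A side note on the sort above: ordering by Y alone does not guarantee left-to-right order within a row, and cells whose top edges differ by a pixel or two can end up interleaved. Below is a minimal sketch of row-aware ordering. It assumes bounding boxes as (x, y, w, h) tuples, as returned by cv2.boundingRect. The names `sort_boxes_reading_order` and `row_tolerance` are hypothetical helpers, not OpenCV API, and the tolerance of 10 pixels is a guess that would need tuning per document.

```python
def sort_boxes_reading_order(boxes, row_tolerance=10):
    """Order (x, y, w, h) boxes top-to-bottom, then left-to-right within a row.

    A box joins the current row when its top edge lies within row_tolerance
    pixels of the top edge of the first box in that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):  # top-to-bottom by y
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b[0]))  # left-to-right by x
    return ordered


# Hypothetical boxes for a 2x2 grid whose top edges are slightly misaligned
boxes = [(150, 22, 130, 78), (20, 20, 130, 80), (150, 101, 130, 79), (20, 100, 130, 80)]
print(sort_boxes_reading_order(boxes))
```

With this ordering, the OCR output of each cell can be concatenated row by row, which is closer to how a person would read the table.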

Any suggestions?
