I’m reading in text from a bunch of PDFs using the following code:
import fitz  # PyMuPDF
import numpy as np
import pandas as pd
# open the document
doc = fitz.open(filename_path)
# get the text from each page in the document
doc_text = ""
for page in doc:
    page_text = page.get_text("text")
    doc_text = doc_text + page_text
# store the document text in a "text" column in my dataframe
doc_df["text"] = doc_text
It mostly works fine, but I’ve noticed that words containing ‘ff’, such as ‘stuff’, are not read in correctly: they come out as e.g. ‘stu@’ or ‘stuI’. From a brief search, it seems this has something to do with ‘ligatures’, but I don’t know what they are or how to resolve it.
Example text similar to what I read in from my PDF:
"I found some stuff in the bag"
Text after pymupdf has read it in:
"I found some stuI in the bag"
It doesn’t seem to be a static conversion either, as on one occasion it converted ff to @ instead (that was in a different word, but the same phrase as above is used below for illustration):
"I found some stu@ in the bag"
What should the corrected code look like so I can stop this happening?
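For context on what a ligature is: fonts often draw letter pairs such as ff, fi or fl as one joined glyph, and Unicode has single code points for some of them (for example U+FB00 for ff). The sketch below is only an illustration of expanding those code points back into plain letters after extraction. The helper name expand_ligatures and the LIGATURES table are hypothetical, not part of PyMuPDF. It assumes the extracted text actually contains the Unicode ligature characters. If the PDF's font maps the ff glyph to an unrelated character such as ‘@’ or ‘I’, as in the output above, this replacement will not recover the letters.

```python
# Hypothetical helper: expand Unicode ligature code points into plain letters.
# Assumption: the extracted text contains real ligature characters (U+FB00..U+FB04),
# not arbitrary wrong characters caused by a bad font mapping in the PDF.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text: str) -> str:
    """Replace each single-code-point ligature with its component letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


# Example: the last character of the input is the single ligature U+FB00.
cleaned = expand_ligatures("I found some stu\ufb00 in the bag")
print(cleaned)
```

In the loop above this could be applied as page_text = expand_ligatures(page.get_text("text")). PyMuPDF's documentation also describes a TEXT_PRESERVE_LIGATURES flag that can be passed through the flags argument of get_text; whether changing it helps with this particular file is untested here.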