How can I stop pymupdf converting 'ff' to a different character such as @ or I?

I’m reading in text from a bunch of PDFs using the following code:

import fitz  # PyMuPDF
import pandas as pd

# open the document (filename_path is defined elsewhere)
doc = fitz.open(filename_path)

# accumulate the text from each page in the document
doc_text = ""
for page in doc:
    doc_text += page.get_text("text")

# store the document text in a "text" column in my dataframe
# (doc_df is an existing DataFrame)
doc_df["text"] = doc_text

It mostly works fine, but I’ve noticed that words containing ‘ff’, such as ‘stuff’, are not read in correctly: they come out as e.g. ‘stu@’ or ‘stuI’. From a brief search it seems this has something to do with ‘ligatures’, but I don’t know what they are or how to resolve the problem.

Example text similar to what I read in from my PDF:

"I found some stuff in the bag"

Text after pymupdf has read it in:

"I found some stuI in the bag"

The conversion doesn’t seem to be consistent either: on one occasion ‘ff’ came out as ‘@’ instead. That happened in a different word, but I’ve reused the phrase above for illustration:

"I found some stu@ in the bag"

What should the corrected code look like so I can stop this happening?
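
For context, a ligature is a single glyph that stands for several letters, and Unicode has dedicated code points for the common Latin ones (U+FB00 for ff, U+FB01 for fi, and so on). Below is a minimal sketch of a post-processing step, assuming the extracted text actually contains those code points. The helper name expand_ligatures and the mapping table are hypothetical, not part of PyMuPDF. If the PDF's font maps the ff glyph to a plain '@' or 'I', this will not recover the original letters, because that information is already lost in the extracted string.

```python
# Hypothetical helper: expand common Latin ligature code points
# into their constituent letters after text extraction.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text: str) -> str:
    """Return text with the ligature characters above replaced by plain letters."""
    return text.translate(LIGATURES)


if __name__ == "__main__":
    # "stu" followed by the single ff ligature character U+FB00
    print(expand_ligatures("I found some stu\ufb00 in the bag"))
```

In the loop above this would be applied as doc_text += expand_ligatures(page.get_text("text")). PyMuPDF's get_text also accepts a flags argument, and as far as I can tell from the documentation, leaving fitz.TEXT_PRESERVE_LIGATURES out of those flags should make the library expand ligatures itself, so that may be worth trying first.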
