
Extracting text from a PDF with a custom font


I have a large PDF file containing text with specialized scientific notation. I'm trying to extract the text using pdfplumber.

At first, I noticed that some symbols are extracted as capital Latin letters, and that technical symbols such as '[' and codes such as (cid:8) also appear in the output. Moreover, the same code often corresponds to different symbols in different places in the file. I worked around this by collecting not only the text representation of each symbol but also the name of its font.
However, I now wonder whether it is possible to extract the encoding directly from the PDF file, that is, to get for each pair such as {'symbol': 'e', 'font': 'ejdeij+4brane'} the glyph it is actually displayed as.
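
For reference, the workaround described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code I used: the function names `count_glyphs` and `glyphs_in_pdf` and the `path` argument are made up for the example. It relies only on the `text` and `fontname` keys that pdfplumber exposes on each entry of `page.chars`.

```python
from collections import Counter


def count_glyphs(chars):
    """Count (text, fontname) pairs in pdfplumber-style char dicts.

    `chars` is any iterable of dicts with "text" and "fontname" keys,
    which is the shape of the entries in pdfplumber's `page.chars`.
    """
    counts = Counter()
    for ch in chars:
        # The extracted text alone is ambiguous, so key on the font as well.
        counts[(ch["text"], ch["fontname"])] += 1
    return counts


def glyphs_in_pdf(path):
    """Aggregate (text, fontname) counts over every page of one PDF.

    `path` is a placeholder for the location of the file.
    """
    import pdfplumber  # imported here so count_glyphs works without it

    counts = Counter()
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            counts.update(count_glyphs(page.chars))
    return counts
```

The resulting counter lists every distinct (symbol, font) pair in the file, but each pair still has to be matched to its visible glyph by hand, which is the step I would like to automate.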


