I have a voluminous PDF file containing text with specific scientific notation. I’m trying to extract the text using pdfplumber.
At first, I noticed that certain symbols are extracted as capital Latin letters, and that stray characters such as '[' and codes such as (cid:8) also appear in the output. Moreover, the same code often corresponds to different symbols in the file, depending on the font. I worked around this by collecting not only the text representation of each symbol but also the name of its font.
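Here is a minimal sketch of that workaround, not my exact code. It relies on the `text` and `fontname` keys that pdfplumber puts on each entry of `page.chars`; the function names are just illustrative.

```python
from collections import Counter


def count_symbol_font_pairs(chars):
    """Count (text, fontname) pairs in an iterable of pdfplumber-style char dicts."""
    pairs = Counter()
    for ch in chars:
        # Each char dict carries the extracted text and the name of its font.
        pairs[(ch["text"], ch["fontname"])] += 1
    return pairs


def count_pairs_in_pdf(path):
    """Open a PDF with pdfplumber and count the pairs over all pages."""
    import pdfplumber  # imported lazily so the helper above works without it

    totals = Counter()
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            totals.update(count_symbol_font_pairs(page.chars))
    return totals
```

Calling `count_pairs_in_pdf("document.pdf").most_common(20)` (the file name is a placeholder) lists the most frequent pairs, which makes it easy to spot the same extracted text appearing under several fonts.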
However, I now wonder whether it is possible to extract the encoding directly from the PDF file. I mean getting information in a format like {'symbol': 'e', 'font': 'ejdeij+4brane'}, together with the symbol it is actually displayed as.