October 25, 2024
java

How To Extract Text From PDF With Formatting or In HTML Form?


I’m working on an Android app that extracts text from PDF files while preserving the formatting as much as possible. Ideally, the output should be in HTML format so that I can highlight the text later and create a new PDF.

I’ve tried using pdfbox-android, a port of Apache PDFBox for Android, and managed to extract plain text. However, this approach only provides unformatted text, which loses structure like paragraphs, bullet points, images, and headings.
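To illustrate the kind of structure I would like to recover, here is a minimal sketch of mapping extracted lines to heading and paragraph tags based on font size. `BlockTagger`, `Line`, and the 1.2x / 1.6x thresholds are hypothetical names and values chosen for illustration, not part of pdfbox-android. It assumes a font size in points can be obtained for each line (for example from the `TextPosition` objects the stripper processes) and that the text has already been HTML-escaped.

```java
import java.util.List;

/** Hypothetical sketch: maps text lines with a known font size to HTML block tags. */
final class BlockTagger {
    private BlockTagger() {}

    /** One extracted line plus the font size (in points) it was drawn with. */
    static final class Line {
        final String text;       // assumed to be HTML-escaped already
        final float fontSizePt;

        Line(String text, float fontSizePt) {
            this.text = text;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Picks a tag from assumed thresholds relative to the body font size. */
    static String tagFor(float fontSizePt, float bodySizePt) {
        if (fontSizePt >= bodySizePt * 1.6f) return "h1";
        if (fontSizePt >= bodySizePt * 1.2f) return "h2";
        return "p";
    }

    /** Emits one block element per line, each followed by a newline. */
    static String toHtml(List<Line> lines, float bodySizePt) {
        StringBuilder html = new StringBuilder();
        for (Line line : lines) {
            String tag = tagFor(line.fontSizePt, bodySizePt);
            html.append('<').append(tag).append('>')
                .append(line.text)
                .append("</").append(tag).append(">\n");
        }
        return html.toString();
    }
}
```

The thresholds would need tuning per document, and bullet points or images would need separate handling that this sketch does not attempt.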

I have tried using this class:

import com.tom_roush.pdfbox.pdmodel.PDDocument;
import com.tom_roush.pdfbox.text.PDFTextStripper;
import com.tom_roush.pdfbox.text.TextPosition;
import java.io.File;
import java.io.IOException;

public class PDFToHTMLConverter extends PDFTextStripper {
    private StringBuilder html;

    public PDFToHTMLConverter() throws IOException {
        super();
        this.html = new StringBuilder();
    }

    @Override
    protected void startDocument(PDDocument document) {
        html.append("<html><body>");
    }

    @Override
    protected void endDocument(PDDocument document) {
        html.append("</body></html>");
    }

    @Override
    protected void writeString(String text) throws IOException {
        // Wrap each line in a paragraph; note that the text is reversed character by character
        html.append("<p>").append(new StringBuilder(text).reverse()).append("</p>");
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Append each character and apply any custom HTML styling here if needed
        html.append(text.getUnicode());
    }

    public String getHTMLText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            this.writeText(document, new NullWriter());
        }
        return html.toString();
    }
}

and this NullWriter:

import java.io.Writer;
import java.io.IOException;


public class NullWriter extends Writer {
    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Do nothing
    }

    @Override
    public void flush() throws IOException {
        // Do nothing
    }

    @Override
    public void close() throws IOException {
        // Do nothing
    }
}

So far, this does not produce correct output for Arabic text, and none of the formatting is preserved.
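For the HTML side of the Arabic problem, one direction that might help is to keep each line in logical order (no manual reversal), escape it, and mark right-to-left lines with `dir="rtl"` so the renderer handles the display order. The sketch below uses only `java.text.Bidi` from the standard library. `HtmlParagraphs` and its methods are hypothetical names, and it assumes the extracted line is already in logical order, which I have not verified for my PDFs.

```java
import java.text.Bidi;

/** Hypothetical helper, not part of pdfbox-android. */
final class HtmlParagraphs {
    private HtmlParagraphs() {}

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Wraps one line in a paragraph. The base direction is taken from the
     * first strongly directional character, defaulting to left-to-right.
     */
    static String toParagraph(String line) {
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        String dir = bidi.baseIsLeftToRight() ? "" : " dir=\"rtl\"";
        return "<p" + dir + ">" + escape(line) + "</p>";
    }
}
```

With this, `HtmlParagraphs.toParagraph(text)` could replace the manual `<p>` wrapping in `writeString`. It only addresses direction and escaping, not fonts or layout.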


