I’m working on an Android app that extracts text from PDF files while preserving the formatting as much as possible. Ideally, the output should be in HTML format so that I can highlight the text later and create a new PDF.
I’ve tried using pdfbox-android, a port of Apache PDFBox for Android, and managed to extract plain text. However, this approach only provides unformatted text, which loses structure like paragraphs, bullet points, images, and headings.
I have tried using this class:
import com.tom_roush.pdfbox.pdmodel.PDDocument;
import com.tom_roush.pdfbox.text.PDFTextStripper;
import com.tom_roush.pdfbox.text.TextPosition;

import java.io.File;
import java.io.IOException;

public class PDFToHTMLConverter extends PDFTextStripper {
    private StringBuilder html;

    public PDFToHTMLConverter() throws IOException {
        super();
        this.html = new StringBuilder();
    }

    @Override
    protected void startDocument(PDDocument document) {
        html.append("<html><body>");
    }

    @Override
    protected void endDocument(PDDocument document) {
        html.append("</body></html>");
    }

    @Override
    protected void writeString(String text) throws IOException {
        html.append("<p>").append(text).append("</p>"); // Wrap each line in a paragraph
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Append each character and apply any custom HTML styling here if needed
        html.append(text.getUnicode());
    }

    public String getHTMLText(File pdfFile) throws IOException {
        try (PDDocument document = PDDocument.load(pdfFile)) {
            this.writeText(document, new NullWriter());
        }
        return html.toString();
    }
}
and this NullWriter:
import java.io.IOException;
import java.io.Writer;

public class NullWriter extends Writer {
    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Do nothing
    }

    @Override
    public void flush() throws IOException {
        // Do nothing
    }

    @Override
    public void close() throws IOException {
        // Do nothing
    }
}
So far, it does not produce correct output for Arabic text, and none of the formatting is preserved.
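To illustrate the paragraph-wrapping step that `writeString` is meant to perform, here is a minimal sketch. It is not a fix for the converter. It uses only the Java standard library and assumes each line arrives as a `String` in logical (reading) order. The class `HtmlLine` and its methods `escape`, `isRightToLeft`, and `toParagraph` are hypothetical names introduced for this example; they are not part of pdfbox-android. The sketch escapes HTML special characters and marks the paragraph `dir="rtl"` when the first strongly directional character is right-to-left, as it is for Arabic.

```java
public class HtmlLine {

    /** Escapes the characters that are special in HTML text and attribute values. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Returns true if the first strongly directional character is right-to-left. */
    public static boolean isRightToLeft(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
            if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                return false;
            }
            i += Character.charCount(cp); // skip digits, spaces, punctuation
        }
        return false; // no strong character found: default to left-to-right
    }

    /** Wraps one line in a paragraph element with an explicit text direction. */
    public static String toParagraph(String line) {
        String dir = isRightToLeft(line) ? "rtl" : "ltr";
        return "<p dir=\"" + dir + "\">" + escape(line) + "</p>";
    }

    public static void main(String[] args) {
        System.out.println(toParagraph("a < b & c"));
        System.out.println(toParagraph("\u0645\u0631\u062D\u0628\u0627 PDF"));
    }
}
```

In the converter, `html.append(HtmlLine.toParagraph(text))` could stand in for the manual `<p>` concatenation in `writeString`. Whether the extracted text reaches that method in logical order is a separate question that this sketch does not address.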