I want to write an application that converts Excel spreadsheets to HTML and preserves styling. Apache Tika is the best free solution I have found so far. I tested some conversions on the command line, such as:
java -jar tika-app-2.9.2.jar --html spreadsheet.xlsx > output.html
and the output looks great. I get an HTML file with tons of markup (e.g. table tags, tr tags, etc.).
I am having some trouble replicating this behavior in code, using Kotlin with Tika packages. The closest I have gotten is the code below, which parses a spreadsheet and outputs the content as a string.
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser
import org.apache.tika.sax.BodyContentHandler
import java.io.File
import java.io.FileInputStream

fun tikaHelper(inputFilePath: String) {
    // BodyContentHandler collects only the plain text of the document body
    val handler = BodyContentHandler()
    val metadata = Metadata()
    val pcontext = ParseContext()
    // OOXML parser
    val msofficeparser = OOXMLParser()
    // use {} closes the stream once parsing finishes
    FileInputStream(File(inputFilePath)).use { inputstream ->
        msofficeparser.parse(inputstream, handler, metadata, pcontext)
    }
    println("Contents of the document:$handler")
}

fun main() {
    val inputFilePath = "./spreadsheet.xlsx"
    tikaHelper(inputFilePath)
}
This simply outputs the data as a string. How can I mimic the behavior of the --html CLI argument?
I tried following various docs online, such as https://www.tutorialspoint.com/tika/tika_extracting_ms_office_files.htm, but that only gets me as far as spreadsheet -> string. I need to go from spreadsheet -> HTML.
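For reference, here is a minimal sketch of one direction that might work. It rests on the assumption that the plain-text output comes from BodyContentHandler, which keeps only character data, and that handing the parser a SAX handler that serializes the XHTML events would give markup instead. The function name toHtml is hypothetical, the serializer is the JDK's own SAXTransformerFactory with the output method set to "html", and AutoDetectParser assumes the standard Tika parser package is on the classpath. Whether this matches the --html output exactly is not something I have confirmed.

```kotlin
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import java.io.InputStream
import java.io.StringWriter
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.sax.SAXTransformerFactory
import javax.xml.transform.stream.StreamResult

// Hypothetical helper: parses a stream with Tika and returns the SAX events as HTML text.
fun toHtml(input: InputStream): String {
    val writer = StringWriter()

    // A TransformerHandler is a SAX ContentHandler that serializes events to a Result.
    val factory = TransformerFactory.newInstance() as SAXTransformerFactory
    val handler = factory.newTransformerHandler()
    handler.transformer.setOutputProperty(OutputKeys.METHOD, "html")
    handler.transformer.setOutputProperty(OutputKeys.INDENT, "yes")
    handler.setResult(StreamResult(writer))

    // AutoDetectParser picks a parser from the detected type (assumed to cover .xlsx).
    input.use { stream ->
        AutoDetectParser().parse(stream, handler, Metadata(), ParseContext())
    }
    return writer.toString()
}
```

Calling toHtml(File("./spreadsheet.xlsx").inputStream()) and writing the returned string to output.html would be the equivalent of the shell redirect above. As far as I can tell, the markup Tika emits is structural (table, tr, td) rather than cell formatting such as fonts or colors.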