When it comes to parsing PDF files in Java, two popular libraries stand out: Apache Tika and Apache PDFBox. Both libraries provide powerful features for working with PDF documents, but they have different approaches and trade-offs. In this article, we will explore how to parse a PDF using each library and compare their pros and cons.
Parsing PDF (and more) using Apache Tika
Apache Tika is a versatile content analysis toolkit that supports parsing various file formats, including PDF. It aims to provide a unified interface for content extraction and metadata retrieval. Here’s how you can parse a PDF using Apache Tika:
Add the necessary dependencies to your project’s configuration.
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>
And here is the code to parse the PDF in Java:
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TikaPdfParser {
    public static void main(String[] args) {
        String filePath = "path/to/your/pdf/file.pdf";
        try (InputStream inputStream = new FileInputStream(new File(filePath))) {
            // Detects the document type and delegates to the matching parser
            AutoDetectParser parser = new AutoDetectParser();
            // Collects the body text of the document
            BodyContentHandler handler = new BodyContentHandler();
            // Filled in by the parser during parsing; it must not be null
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            parser.parse(inputStream, handler, metadata, parseContext);
            String text = handler.toString();
            System.out.println(text);
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
Parsing PDF using Tika and Quarkus
If you are using Quarkus to develop your applications, it is even easier: there is a Quarkus extension for Apache Tika that simplifies parsing PDF or OpenOffice documents.
Just add the following dependencies to your project:
<dependency>
    <groupId>io.quarkiverse.tika</groupId>
    <artifactId>quarkus-tika</artifactId>
    <version>${quarkus-tika.version}</version>
</dependency>
<!-- some of the operations done here require AWT -->
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-awt</artifactId>
</dependency>
And here is a sample QuarkusApplication which you can run from the command line:
import java.io.*;

// On Quarkus 3 and later, import jakarta.inject.Inject instead
import javax.inject.Inject;

import io.quarkus.runtime.Quarkus;
import io.quarkus.runtime.QuarkusApplication;
import io.quarkus.runtime.annotations.QuarkusMain;
import io.quarkus.tika.TikaParser;

@QuarkusMain
public class HelloWorldMain implements QuarkusApplication {

    @Inject
    TikaParser parser;

    public static void main(String... args) {
        Quarkus.run(HelloWorldMain.class, args);
    }

    @Override
    public int run(String... args) throws Exception {
        if (args.length < 1) {
            System.out.println("Usage: quarkus dev <filename> ");
            // The returned value becomes the exit code of the application
            return 1;
        }
        try (InputStream inputStream = new FileInputStream(new File(args[0]))) {
            System.out.println("=============================");
            System.out.println(parser.getText(inputStream));
            System.out.println("=============================");
        } catch (IOException e) {
            e.printStackTrace();
        }
        return 0;
    }
}
For example, you can run it using the “quarkus” CLI as follows:
quarkus dev /path/to/pdf
Pros of Apache Tika
- Unified Interface: Apache Tika provides a consistent API for parsing various file formats, making it easier to work with different document types.
- Metadata Extraction: Tika excels at extracting metadata from PDF files, such as author, title, creation date, and more.
- Support for Multiple Formats: Tika supports parsing not only PDF but also a wide range of other file formats, such as Microsoft Office documents, HTML, XML, and more.
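To illustrate the metadata point above, here is a minimal sketch that prints every metadata field Tika reports for a file. It is written as a JBang script; the class name TikaMetadataDump and the helper readMetadata are hypothetical names chosen for this example, and the dependency versions simply mirror the ones used earlier. Which fields appear (author, title, creation date, and so on) depends entirely on the document.

```java
//DEPS org.apache.tika:tika-core:2.8.0
//DEPS org.apache.tika:tika-parsers-standard-package:2.8.0
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class TikaMetadataDump {

    // Hypothetical helper: parses the stream and returns the metadata
    // fields Tika reported, sorted by field name.
    static Map<String, String> readMetadata(InputStream in)
            throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        // -1 disables the handler's write limit; the body text is ignored here.
        new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());

        Map<String, String> fields = new TreeMap<>();
        for (String name : metadata.names()) {
            // A field can hold several values, so join them for display.
            fields.put(name, String.join(", ", metadata.getValues(name)));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: jbang TikaMetadataDump.java <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            readMetadata(in).forEach((name, value) -> System.out.println(name + " = " + value));
        }
    }
}
```

Because the parser is an AutoDetectParser, the same script should work unchanged on other formats that Tika supports, not only on PDF files.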
Cons of Apache Tika
- Limited PDF-Specific Features: Apache Tika focuses on content extraction and metadata retrieval. Its PDF parser is built on PDFBox, but the Tika API does not expose PDF manipulation features such as merging documents or adding annotations.
- Performance Overhead: Tika provides a generalized approach to handle various formats, which can introduce some performance overhead compared to specialized libraries.
Parsing PDF with PDFBox
Apache PDFBox is a robust Java library specifically designed for working with PDF files. It offers comprehensive functionality for creating, manipulating, and extracting data from PDF documents. Let’s see how to parse a PDF using Apache PDFBox:
As an example, let’s code a JBang script which you can use to extract the text of a PDF from the command line:
//DEPS org.apache.pdfbox:pdfbox:2.0.28
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfParser {
    public static void main(String[] args) {
        if (args.length < 1 || args[0] == null) {
            System.err.println("Please provide the path to the PDF file as the first command-line argument.");
            return;
        }
        String filePath = args[0];
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
With JBang, you can run it by passing the script file and the path to the PDF:
jbang PdfParser.java /path/to/file.pdf
If you build the application with Maven instead, replace the //DEPS line with the equivalent dependency in your pom.xml.
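For reference, the Maven equivalent of that //DEPS line would look roughly like this; the coordinates and version are taken directly from the script above:

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>
```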
To learn how to run a Java main class from the command line, check this article: Run a Java Class from Maven made simple
Pros of Apache PDFBox
- Rich PDF Manipulation: Apache PDFBox provides extensive features for working with PDF files, including parsing, text extraction, metadata manipulation, merging documents, adding annotations, and more.
- PDF-Specific Capabilities: PDFBox offers fine-grained control over PDF elements, such as fonts, images, bookmarks, and annotations, making it suitable for advanced PDF processing tasks.
- Active Community: Apache PDFBox has an active community and frequent updates, ensuring ongoing support and bug fixes.
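As a small illustration of the finer control mentioned above, the following sketch extracts text from a page range only and prints the document title. It assumes PDFBox 2.0.x, the same version as the earlier script; the class name PdfPageRange and the helper extractPages are hypothetical names chosen for this example.

```java
//DEPS org.apache.pdfbox:pdfbox:2.0.28
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfPageRange {

    // Hypothetical helper: extracts text from firstPage..lastPage
    // (1-based, inclusive), clamped to the pages the document has.
    static String extractPages(PDDocument document, int firstPage, int lastPage) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        // Order text by its position on the page rather than by stream order.
        stripper.setSortByPosition(true);
        stripper.setStartPage(Math.max(1, firstPage));
        stripper.setEndPage(Math.min(document.getNumberOfPages(), lastPage));
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("Usage: jbang PdfPageRange.java <file.pdf> <firstPage> <lastPage>");
            return;
        }
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            PDDocumentInformation info = document.getDocumentInformation();
            // The title is null when the PDF does not declare one.
            System.out.println("Title: " + info.getTitle() + ", pages: " + document.getNumberOfPages());
            System.out.println(extractPages(document, Integer.parseInt(args[1]), Integer.parseInt(args[2])));
        }
    }
}
```

For example, jbang PdfPageRange.java /path/to/file.pdf 1 3 would print the text of the first three pages.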
Cons of Apache PDFBox
- Steeper Learning Curve: Due to its rich feature set and complex API, Apache PDFBox might have a steeper learning curve compared to simpler libraries like Tika.
- Lack of Format Support: PDFBox focuses on PDF, so it offers little help with other file formats, which can be a drawback for projects requiring multi-format parsing.
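If a project has to handle mixed input, one option is to check whether a file looks like a PDF before choosing a library. The sketch below uses only the JDK and tests for the %PDF- marker that PDF files begin with. The class name PdfSniffer and the routing messages are hypothetical, and the check is deliberately strict: a real-world file with stray bytes before the marker would be rejected by it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class PdfSniffer {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the given leading bytes start with the PDF header marker.
    static boolean looksLikePdf(byte[] head) {
        return head.length >= PDF_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    // Reads only the first few bytes of the file and checks them.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[PDF_MAGIC.length];
            int read = in.readNBytes(head, 0, head.length); // requires Java 9 or later
            return looksLikePdf(Arrays.copyOf(head, read));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: jbang PdfSniffer.java <file>");
            return;
        }
        // Hypothetical routing decision: PDFBox for PDF files, Tika for everything else.
        if (looksLikePdf(Paths.get(args[0]))) {
            System.out.println("PDF detected: a PDFBox-based parser can handle this file");
        } else {
            System.out.println("Not a PDF: a multi-format parser such as Tika is a better fit");
        }
    }
}
```

Tika performs this kind of detection for you across many formats, so a manual check like this mainly makes sense when PDFBox is the only parsing library on the classpath.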
Conclusion
Both Apache Tika and Apache PDFBox offer powerful capabilities for parsing PDF files in Java, but they have different strengths and trade-offs. Apache Tika provides a unified interface for parsing various file formats, including PDF, with excellent metadata extraction capabilities. On the other hand, Apache PDFBox is a dedicated PDF library with advanced PDF manipulation features but a narrower focus.