Parsing PDF in Java made simple

When it comes to parsing PDF files in Java, two popular libraries stand out: Apache Tika and Apache PDFBox. Both libraries provide powerful features for working with PDF documents, but they have different approaches and trade-offs. In this article, we will explore how to parse a PDF using each library and compare their pros and cons.

Parsing PDF (and more) using Apache Tika

Apache Tika is a versatile content analysis toolkit that supports parsing various file formats, including PDF. It aims to provide a unified interface for content extraction and metadata retrieval. Here’s how you can parse a PDF using Apache Tika:

Add the following dependencies to your Maven pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.8.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.8.0</version>
</dependency>

And here is the code to parse the PDF in Java:

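import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TikaPdfParser {

    public static void main(String[] args) {
        String filePath = "path/to/your/pdf/file.pdf";
        try (InputStream inputStream = new FileInputStream(new File(filePath))) {
            // AutoDetectParser detects the document type and delegates to the matching parser
            AutoDetectParser parser = new AutoDetectParser();
            // -1 disables the default 100,000-character write limit
            BodyContentHandler handler = new BodyContentHandler(-1);
            // Metadata must not be null: the parser reads hints from it and stores document metadata in it
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            parser.parse(inputStream, handler, metadata, parseContext);
            String text = handler.toString();
            System.out.println(text);
        } catch (IOException | SAXException | TikaException e) {
            // parse() also declares the checked SAXException and TikaException
            e.printStackTrace();
        }
    }
}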
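
AutoDetectParser works out the document type before handing the stream to a format-specific parser. To give a feel for what that detection step involves, here is a minimal sketch using only the JDK: it checks for the %PDF- marker that a well-formed PDF file starts with. This is an illustration under that assumption, not Tika's actual detector, and the names PdfHeaderCheck and looksLikePdf are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfHeaderCheck {

    // A well-formed PDF begins with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes of a file start with the PDF marker. */
    static boolean looksLikePdf(byte[] leadingBytes) {
        if (leadingBytes.length < PDF_MAGIC.length) {
            return false; // too short to contain the marker
        }
        byte[] header = Arrays.copyOf(leadingBytes, PDF_MAGIC.length);
        return Arrays.equals(header, PDF_MAGIC);
    }

    public static void main(String[] args) {
        // In-memory samples stand in for the first bytes read from real files.
        byte[] pdfSample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlSample = "<html>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikePdf(pdfSample));  // prints "true"
        System.out.println(looksLikePdf(htmlSample)); // prints "false"
    }
}
```

A real detector looks at far more than one signature, which is part of why a general-purpose toolkit carries some overhead compared to a PDF-only library.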

Parsing PDF using Tika and Quarkus

If you are using Quarkus to develop your applications, it is even easier: there is a Quarkus extension for Apache Tika that simplifies parsing PDF and OpenOffice documents.

Add the following dependencies to your project:

<dependency>
    <groupId>io.quarkiverse.tika</groupId>
    <artifactId>quarkus-tika</artifactId>
    <version>${quarkus-tika.version}</version>
</dependency>
<!-- some of the operations done here require AWT -->
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-awt</artifactId>
</dependency>

And here is a sample QuarkusApplication which you can run from the command line:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// On Quarkus 3 and later, import jakarta.inject.Inject instead
import javax.inject.Inject;

import io.quarkus.runtime.Quarkus;
import io.quarkus.runtime.QuarkusApplication;
import io.quarkus.runtime.annotations.QuarkusMain;
import io.quarkus.tika.TikaParser;

@QuarkusMain
public class HelloWorldMain implements QuarkusApplication {

    @Inject
    TikaParser parser;

    public static void main(String... args) {
        Quarkus.run(HelloWorldMain.class, args);
    }

    @Override
    public int run(String... args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: quarkus dev <filename>");
            // Return a non-zero exit code instead of calling System.exit()
            return 1;
        }

        // TikaParser.getText() detects the format and returns the extracted text
        try (InputStream inputStream = new FileInputStream(new File(args[0]))) {
            System.out.println("=============================");
            System.out.println(parser.getText(inputStream));
            System.out.println("=============================");
        } catch (IOException e) {
            e.printStackTrace();
            return 1;
        }
        return 0;
    }
}

For example, you can run it with the quarkus CLI as follows:

quarkus dev /path/to/pdf

Here is a sample execution:

[Screenshot: java parse pdf]

Pros of Apache Tika

  • Unified Interface: Apache Tika provides a consistent API for parsing various file formats, making it easier to work with different document types.
  • Metadata Extraction: Tika excels at extracting metadata from PDF files, such as author, title, creation date, and more.
  • Support for Multiple Formats: Tika supports parsing not only PDF but also a wide range of other file formats, such as Microsoft Office documents, HTML, XML, and more.

Cons of Apache Tika

  • Limited PDF-Specific Features: Apache Tika focuses on content extraction and metadata retrieval, so its PDF parsing capabilities might be less advanced compared to dedicated PDF libraries.
  • Performance Overhead: Tika provides a generalized approach to handle various formats, which can introduce some performance overhead compared to specialized libraries.

Parsing PDF with PDFBox

Apache PDFBox is a robust Java library specifically designed for working with PDF files. It offers comprehensive functionality for creating, manipulating, and extracting data from PDF documents.

As an example, let’s write a JBang script that extracts the text of a PDF from the command line:

//DEPS org.apache.pdfbox:pdfbox:2.0.28

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;

public class PdfParser {

    public static void main(String[] args) {
        if (args.length < 1 || args[0] == null) {
            System.err.println("Please provide the path to the PDF file as the first command-line argument.");
            return;
        }

        String filePath = args[0];
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            PDFTextStripper textStripper = new PDFTextStripper();
            String text = textStripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

With JBang, save the script as PdfParser.java and run it, passing the path to the PDF as the first argument:

jbang PdfParser.java /path/to/file.pdf

If you build the application with Maven instead, replace the //DEPS line with the equivalent dependency in your pom.xml.
To learn how to run a Java main class from the command line, see this article: Run a Java Class from Maven made simple
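
For reference, the //DEPS line in the script above uses the coordinates group:artifact:version, so the matching Maven dependency would look roughly like this (same version as in the script; adjust it to the release you actually use):

```xml
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.28</version>
</dependency>
```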

Pros of Apache PDFBox

  • Rich PDF Manipulation: Apache PDFBox provides extensive features for working with PDF files, including parsing, text extraction, metadata manipulation, merging documents, adding annotations, and more.
  • PDF-Specific Capabilities: PDFBox offers fine-grained control over PDF elements, such as fonts, images, bookmarks, and annotations, making it suitable for advanced PDF processing tasks.
  • Active Community: Apache PDFBox has an active community and frequent updates, ensuring ongoing support and bug fixes.

Cons of Apache PDFBox

  • Steeper Learning Curve: Because of its rich feature set and larger API, Apache PDFBox has a steeper learning curve than a higher-level library like Tika.
  • Limited Format Support: PDFBox focuses on PDF and offers little support for other file formats, which can be a drawback for projects requiring multi-format parsing.

Conclusion

Both Apache Tika and Apache PDFBox offer powerful capabilities for parsing PDF files in Java, but they have different strengths and trade-offs. Apache Tika provides a unified interface for parsing various file formats, including PDF, with excellent metadata extraction capabilities. On the other hand, Apache PDFBox is a dedicated PDF library with advanced PDF manipulation features but a narrower focus.