Converting a PDF to Text in Java made simple

In this tutorial, we will explore how to convert a PDF document into raw text using the Apache PDFBox library in Java. This process can be particularly useful for applications that need to analyze or summarize the contents of a PDF, such as providing input to an AI chatbot.

Introduction to PDFBox

Apache PDFBox is an open-source Java library that allows for the creation, manipulation, and extraction of data from PDF documents. It’s a robust tool for converting PDF content to raw text, which can then be processed by various text analysis tools or AI models.

Setting Up PDFBox

First, you’ll need to include PDFBox in your project. If you’re using Maven, add the following dependency to your pom.xml:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version>
</dependency>

As usual, our examples are JBang scripts, so you don’t need to create a Maven project to run them: the //DEPS comment at the top of each script tells JBang which dependency to fetch. To learn more about JBang, read this article: JBang: Create Java scripts like a pro

Converting PDF to Text

To convert a PDF document to text, you will use the PDFTextStripper class provided by PDFBox. This class extracts text from a given PDF document.

//DEPS org.apache.pdfbox:pdfbox-app:3.0.2
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class PdfToText {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java PdfToText <input-pdf> <output-text-file>");
            System.exit(1);
        }

        String pdfFilePath = args[0];
        String textFilePath = args[1];

        File pdfFile = new File(pdfFilePath);
        File textFile = new File(textFilePath);

        // PDFBox 3.x loads documents through Loader.loadPDF (PDDocument.load was removed).
        // try-with-resources closes the document and the writer even if extraction fails.
        try (PDDocument document = Loader.loadPDF(pdfFile);
             FileWriter writer = new FileWriter(textFile)) {

            // Extract text from the PDF
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String text = pdfStripper.getText(document);

            // Save the extracted text to a file
            writer.write(text);

            System.out.println("Text extraction completed. Output saved to: " + textFilePath);
        } catch (IOException e) {
            System.err.println("Error occurred while processing the PDF: " + e.getMessage());
        }
    }
}

The class above loads a PDF and saves its contents as a raw text file.
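One caveat when saving extracted text: a FileWriter constructed from just a File encodes text with the platform default charset, which on Java versions before 18 is not guaranteed to be UTF-8. PDFs often contain accented letters and typographic quotes, so it can be safer to choose the encoding explicitly. The following is a minimal sketch rather than part of the original script; the class name Utf8TextWriter and the helper saveAsUtf8 are made up for illustration, and it assumes Java 11 or later for Files.writeString.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8TextWriter {

    // Hypothetical helper: writes the text as UTF-8 regardless of the platform default charset.
    public static void saveAsUtf8(String text, Path target) throws IOException {
        Files.writeString(target, text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the real output path.
        Path out = Files.createTempFile("extracted", ".txt");
        saveAsUtf8("Caf\u00e9 menu", out);

        // Read the file back with the same charset to confirm the round trip.
        System.out.println(Files.readString(out, StandardCharsets.UTF_8));
        Files.delete(out);
    }
}
```

In the script, the same idea would mean passing the text returned by PDFTextStripper to such a helper instead of a FileWriter.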

Breaking the Text into Chapters

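Unfortunately, the extracted text carries no marker for where each chapter begins or ends, and you cannot rely on the PDF metadata to tell you. Splitting the text into individual chapters is useful if you want to feed them one at a time to a tool such as a chatbot. Fortunately, when the chapter headings follow a predictable pattern, a regular expression can detect where each chapter begins.

For example, in the PDF I wanted to convert, each chapter begins with a line containing only the text “Chapter n”, where n is the chapter number.

Therefore, you can use this simple regular expression to break the large text into chapters. The (?m) flag makes ^ and $ match at line boundaries, and the lookahead (?=...) splits just before each heading without consuming it, so every chapter keeps its title: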

String regex = "(?m)(?=^Chapter \\d+$)";
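To see what this expression does before running it on a whole book, here is a small self-contained sketch that applies it to an in-memory sample. The class name ChapterSplitDemo, the helper splitChapters, and the sample text are invented for illustration; only the regular expression comes from the tutorial.

```java
import java.util.regex.Pattern;

public class ChapterSplitDemo {

    // Same expression as above: split before every line that is exactly "Chapter <number>".
    private static final Pattern CHAPTER_START = Pattern.compile("(?m)(?=^Chapter \\d+$)");

    // Hypothetical helper: returns the chunks produced by splitting at chapter headings.
    public static String[] splitChapters(String text) {
        return CHAPTER_START.split(text);
    }

    public static void main(String[] args) {
        String sample = "Preface\nChapter 1\nFirst body\nChapter 2\nSecond body\n";
        String[] parts = splitChapters(sample);

        // Prints three parts: the preface, then one part per chapter, headings included.
        for (int i = 0; i < parts.length; i++) {
            System.out.println("--- part " + i + " ---");
            System.out.print(parts[i]);
        }
    }
}
```

Note that text before the first heading (here, the preface) becomes its own chunk, and that a heading such as "Chapter 1 Overview" would not match, because the expression requires the line to end right after the number.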

Here is a full example that splits the PDF text into multiple text files, one for each chapter:

//DEPS org.apache.pdfbox:pdfbox-app:3.0.2

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class PdfToTextSplit {

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: java PdfToTextSplit <PDF file path> <output file prefix>");
            System.exit(1);
        }

        String pdfFilePath = args[0];
        String outputFilePath = args[1];
        File pdfFile = new File(pdfFilePath);

        if (!pdfFile.exists()) {
            System.err.println("The file " + pdfFilePath + " does not exist.");
            System.exit(1);
        }

        try {
            String text = convertPDFToText(pdfFile);
           
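            // The lookahead splits before each "Chapter n" line, so every chunk keeps its heading.
            // Any text before the first heading (cover, table of contents) ends up in chunk 0.
            String regex = "(?m)(?=^Chapter \\d+$)";
            String[] chapters = text.split(regex);
            System.out.println("Found chapters: " + chapters.length);
            for (int ii = 0; ii < chapters.length; ii++) {
                // The second argument acts as a prefix: the chunk index is appended to it
                String output = outputFilePath + ii;
                saveTextToFile(chapters[ii], output);
                System.out.println("Chapter extracted and saved to " + output);
            }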
        } catch (IOException e) {
            System.err.println("An error occurred while processing the PDF: " + e.getMessage());
        }
    }

    public static String convertPDFToText(File pdfFile) throws IOException {
        try (PDDocument document = Loader.loadPDF(pdfFile)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            return pdfStripper.getText(document);
        }
    }

    public static void saveTextToFile(String text, String outputFilePath) throws IOException {
        try (FileWriter writer = new FileWriter(outputFilePath)) {
            writer.write(text);
        }
    }
}

Launch it and you are done!
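One possible refinement, if you plan to hand the files to another tool: the script names each file by appending the chunk index to the output path, and it also writes any text found before the first chapter heading. The sketch below shows one way to produce zero-padded names with a .txt extension and to skip chunks that contain only whitespace. The class ChapterFileNames, its two helpers, and the naming scheme are assumptions made for illustration, not part of the original script; String.isBlank requires Java 11 or later.

```java
import java.util.ArrayList;
import java.util.List;

public class ChapterFileNames {

    // Hypothetical naming scheme: <prefix>-NN.txt, zero-padded so files sort in chapter order.
    public static String fileNameFor(String prefix, int index) {
        return String.format("%s-%02d.txt", prefix, index);
    }

    // Keeps only the chunks that contain something other than whitespace.
    public static List<String> nonBlankChunks(String[] chunks) {
        List<String> kept = new ArrayList<>();
        for (String chunk : chunks) {
            if (!chunk.isBlank()) {
                kept.add(chunk);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample chunks standing in for the result of text.split(regex).
        String[] chunks = {"\n", "Chapter 1\nFirst body\n", "Chapter 2\nSecond body\n"};
        List<String> kept = nonBlankChunks(chunks);
        for (int i = 0; i < kept.size(); i++) {
            System.out.println(fileNameFor("book", i + 1) + " <- " + kept.get(i).length() + " chars");
        }
    }
}
```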

Conclusion

Converting PDF documents to raw text using the PDFBox library is a straightforward and powerful way to enable various text processing applications. Whether it’s for summarization, question answering, or content analysis, extracting text from PDFs provides a versatile input for AI chatbots and other text-based tools.

By following this tutorial, you should be able to set up PDFBox, extract text from PDFs, and understand how to leverage this text for further processing with AI models.