How to Parse PDF in PHP

Have you ever opened a PDF file in a text editor to search for a specific line or word? It does not work. Most of what you will find is binary data that makes no sense to a human reader.

Parsing PDF files is tedious for any software developer, not because the task is conceptually hard but because of how PDF files are built. A PDF file is made up of objects, each identified by a unique number. These objects hold content such as images, text, and more. They are often compressed and sometimes encrypted, making it nearly impossible to process a PDF as a plain text document.
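To make the object structure concrete, here is a minimal sketch that builds a single PDF-style stream object by hand and shows why a plain-text search fails. The object number, the text, and the variable names are made up for illustration, and the fragment is not a complete PDF file. It assumes PHP's zlib extension is available.

```php
<?php
// Illustrative only: one hand-built object shaped like a PDF content stream.
// A real PDF also has a header, a cross-reference table, and a trailer.

// Content-stream operators that show the same text on ten successive lines.
$content = "BT /F1 12 Tf 14 TL 72 720 Td "
    . str_repeat("(Hello PDF) Tj T* ", 10)
    . "ET";

// FlateDecode streams hold zlib-format data, which gzcompress() produces.
$compressed = gzcompress($content);

// Object number 4, generation 0, wrapping the compressed stream.
$object = "4 0 obj\n"
    . "<< /Length " . strlen($compressed) . " /Filter /FlateDecode >>\n"
    . "stream\n" . $compressed . "\nendstream\nendobj\n";

// Search the raw bytes for the text, as a text editor would.
$foundInRaw = strpos($object, "Hello PDF") !== false;

// Read the stream length from the dictionary, then cut out and inflate the stream.
preg_match('/\/Length (\d+)/', $object, $match);
$start = strpos($object, "stream\n") + strlen("stream\n");
$decoded = gzuncompress(substr($object, $start, (int) $match[1]));
$foundInDecoded = strpos($decoded, "Hello PDF") !== false;

echo "Found in raw bytes: " . ($foundInRaw ? "yes" : "no") . "\n";
echo "Found after decompression: " . ($foundInDecoded ? "yes" : "no") . "\n";
```

Libraries such as PDFParser take care of locating and decoding objects like this one, which is why the rest of this guide relies on a library instead of plain string functions.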

In this guide, you will learn how to parse PDF documents using the PHP programming language.


The first step is to set up a development environment. We will start by installing PHP and the required libraries.

To install PHP, open the terminal and enter the command:

$ sudo apt-get install php -y

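Once PHP is installed, use it to install Composer. Take the installer URL for the first command and the full SHA-384 checksum for the second from the Composer download page; the checksum changes with every installer release, so the value shown here is abbreviated:

php -r "copy('', 'composer-setup.php');"

php -r "if (hash_file('sha384', 'composer-setup.php') === '...8a03482574915d1a8') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;"

php composer-setup.php

php -r "unlink('composer-setup.php');"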
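The verification command boils down to comparing a SHA-384 digest of the downloaded file against a published value. The sketch below shows the same kind of check on a throwaway file. The helper name `fileMatchesChecksum` and the temporary file are hypothetical, and the expected digest is computed locally purely so the example can run on its own.

```php
<?php
// Hypothetical helper: true when the file's SHA-384 digest equals the expected hex string.
function fileMatchesChecksum(string $path, string $expectedHex): bool
{
    $actual = hash_file('sha384', $path);
    // hash_file() returns false when the file cannot be read.
    return $actual !== false && hash_equals(strtolower($expectedHex), $actual);
}

// Stand-in for a downloaded installer.
$path = tempnam(sys_get_temp_dir(), 'chk');
file_put_contents($path, "echo 'pretend installer';\n");

// In real use this value comes from the publisher, not from the file itself.
$published = hash('sha384', "echo 'pretend installer';\n");

$intactOk = fileMatchesChecksum($path, $published);

// Simulate a corrupted or tampered download.
file_put_contents($path, "tampered", FILE_APPEND);
$tamperedOk = fileMatchesChecksum($path, $published);

unlink($path);

echo $intactOk ? "Intact file verified\n" : "Intact file rejected\n";
echo $tamperedOk ? "Tampered file verified\n" : "Tampered file rejected\n";
```

Using `hash_equals()` instead of `===` is a design choice for this sketch: it compares the two strings in constant time, which is a common habit when checking digests.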

Once Composer is installed and set up, we can proceed to install the PDFParser library.

Open the terminal and enter the command:

$ php composer.phar require smalot/pdfparser
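Before moving on, a quick sanity check can confirm that the pieces are in place. This is a sketch that assumes the script sits in the same directory as Composer's `vendor` folder; the function name `checkEnvironment` is made up for this example.

```php
<?php
// Hypothetical helper: collects basic facts about the local setup.
function checkEnvironment(string $projectDir): array
{
    $autoload = $projectDir . '/vendor/autoload.php';
    $hasAutoload = is_file($autoload);

    if ($hasAutoload) {
        // Load Composer's autoloader so the parser class can be found.
        require_once $autoload;
    }

    return [
        'php_version'  => PHP_VERSION,
        'autoloader'   => $hasAutoload,
        'parser_class' => class_exists(\Smalot\PdfParser\Parser::class),
    ];
}

foreach (checkEnvironment(__DIR__) as $name => $value) {
    // Booleans are printed as yes/no; the version string is printed as-is.
    echo $name . ": " . (is_bool($value) ? ($value ? "yes" : "no") : $value) . "\n";
}
```

If `autoloader` or `parser_class` reports "no", the `composer.phar require` step most likely ran in a different directory.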

Generate PDF File

The next step is to select a PDF file to work with. There are various ways and resources you can use to create a PDF file. For example, if you are on Windows, you can export a .doc/.docx document to PDF.

However, for this example, we will use one of the free sample PDF files readily available on the internet.

Select one of the available PDF files and save it next to your script as sample1.pdf, the file name used in the code that follows.

NOTE: Make sure to scan such documents for malware before using them. Tools such as VirusTotal are great resources.
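One way to act on this note is to look the file up by its hash instead of uploading it. The sketch below is an assumption-laden example, not part of the original workflow: it checks that a file starts with the `%PDF-` signature and prints its SHA-256 digest, which can be pasted into VirusTotal's search box. The function name `describePdf` and the placeholder file it writes are hypothetical.

```php
<?php
// Hypothetical helper: reports whether a file looks like a PDF and returns its SHA-256 digest.
function describePdf(string $path): array
{
    // Read the first five bytes; PDF files normally begin with the signature "%PDF-".
    $head = (string) file_get_contents($path, false, null, 0, 5);

    return [
        'looks_like_pdf' => $head === '%PDF-',
        'sha256'         => hash_file('sha256', $path),
    ];
}

// Use sample1.pdf when present; otherwise write a tiny placeholder so the sketch still runs.
$path = 'sample1.pdf';
$isPlaceholder = !is_file($path);
if ($isPlaceholder) {
    $path = tempnam(sys_get_temp_dir(), 'pdf');
    file_put_contents($path, "%PDF-1.4\n% placeholder, not a real document\n");
}

$info = describePdf($path);
echo "Looks like a PDF: " . ($info['looks_like_pdf'] ? "yes" : "no") . "\n";
echo "SHA-256: " . $info['sha256'] . "\n";

if ($isPlaceholder) {
    unlink($path);
}
```

A matching signature only says the file claims to be a PDF; it says nothing about whether the content is safe.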

The following is a scan report of sample1.pdf file.

Extract PDF Metadata

To extract metadata from the PDF using the PDF parser library, we can implement sample code as shown below:


    <?php

    // Load the Composer autoloader
    include 'vendor/autoload.php';

    // Parse the PDF file
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile("sample1.pdf");

    // Get the document metadata as an associative array
    $metadata = $pdf->getDetails();

    // Print each property; array values are joined into one string
    foreach ($metadata as $meta => $value) {
        if (is_array($value)) {
            $value = implode(", ", $value);
        }
        echo $meta . "=>" . $value . "\n";
    }



The above code should fetch metadata information about the file. Such information includes:

CreationDate:  2016-12-22T11:43:55-05:00

Creator: Adobe InDesign CC 2015 (Macintosh)

ModDate: 2016-12-29T15:47:20-05:00

Producer: Adobe PDF Library 15.0

Trapped: False

Pages: 1
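In practice, a script usually wants a few specific properties and needs to cope with files that cannot be parsed. The following sketch builds on the metadata example under a few assumptions: `vendor/autoload.php` and `sample1.pdf` are in the working directory, and the helper `metadataValue` is a made-up name. Which keys are present depends on the PDF, so missing ones fall back to a default.

```php
<?php
// Hypothetical helper: returns one metadata property as a string, or a default.
function metadataValue(array $metadata, string $key, string $default = 'n/a'): string
{
    if (!array_key_exists($key, $metadata)) {
        return $default;
    }
    $value = $metadata[$key];
    // Some properties hold several values; join them into one string.
    return is_array($value) ? implode(', ', $value) : (string) $value;
}

// Only touch the library when the autoloader and the sample file exist.
if (is_file('vendor/autoload.php') && is_file('sample1.pdf')) {
    require_once 'vendor/autoload.php';

    try {
        $parser = new \Smalot\PdfParser\Parser();
        $details = $parser->parseFile('sample1.pdf')->getDetails();

        echo "Producer: " . metadataValue($details, 'Producer') . "\n";
        echo "Pages: " . metadataValue($details, 'Pages') . "\n";
    } catch (\Exception $e) {
        // Parsing failed, for example on a damaged file.
        echo "Could not parse PDF: " . $e->getMessage() . "\n";
    }
}
```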

Extract Text

To extract the text from all pages of the PDF, we can implement the code as shown below:


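    <?php

    // Load the Composer autoloader
    include "vendor/autoload.php";

    // Parse the PDF file
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile("sample1.pdf");

    // Extract the text of the whole document and print it
    $text = $pdf->getText();
    echo $text;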


Once we run the code above, we should see the text extracted from the sample1.pdf file printed to the terminal.
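Calling `getText()` on the document returns the text of every page in one string. To find out where a word occurs, the text can be read page by page instead. The sketch below assumes the library's `getPages()` method, which returns page objects that each offer their own `getText()`; the helper `findPagesContaining` and the search word are hypothetical.

```php
<?php
// Hypothetical helper: returns the 1-based numbers of pages whose text contains $needle.
// $pages is any list of objects with a getText() method, such as the result of getPages().
function findPagesContaining(iterable $pages, string $needle): array
{
    $hits = [];
    $number = 0;
    foreach ($pages as $page) {
        $number++;
        // stripos() makes the comparison case-insensitive.
        if (stripos($page->getText(), $needle) !== false) {
            $hits[] = $number;
        }
    }
    return $hits;
}

// Only touch the library when the autoloader and the sample file exist.
if (is_file('vendor/autoload.php') && is_file('sample1.pdf')) {
    require_once 'vendor/autoload.php';

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile('sample1.pdf');

    // "sample" is a placeholder search word.
    $hits = findPagesContaining($pdf->getPages(), 'sample');
    echo $hits ? "Found on page(s): " . implode(', ', $hits) . "\n" : "Not found\n";
}
```

Because the helper only relies on a `getText()` method, it can be tried out with simple stand-in objects before pointing it at a real document.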


This guide shows you how you can parse PDF files using PHP and the PDFParser library. Check the documentation to learn more.

About the author

John Otieno

My name is John, and I am a fellow geek like you. I am passionate about all things computers, from hardware and operating systems to programming. My dream is to share my knowledge with the world and help out fellow geeks. Follow my content by subscribing to the LinuxHint mailing list.