
How to Parse PDF in PHP

Have you ever opened a PDF file in a text editor to search for a specific line or word? It does not work. Most of what you will find is binary data that makes no sense.

Parsing PDF files is tedious for any software developer, not because the task is conceptually hard but because of how PDF files are structured. A PDF file is made up of objects, each identified by a unique number. These objects can hold information such as images, text, and more. They are often compressed and sometimes encrypted, making it nearly impossible to process PDFs as plain text documents.

In this guide, you will learn how to parse PDF documents using the PHP programming language.
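To see why a plain text search fails, the sketch below imitates how a compressed content stream might be stored. It is an illustration only: the sample string is made up, it assumes PHP's zlib extension is enabled, and a real PDF wraps such streams in numbered objects and may use other filters.

```php
<?php
// Illustrative only: a made-up text-drawing instruction in PDF content syntax.
$content = "BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET";

// PDF's FlateDecode filter stores zlib/deflate data, the same format
// that gzcompress() produces.
$stored = gzcompress($content);

// Searching the stored bytes for the visible text...
$foundInStored = strpos($stored, "Hello, PDF") !== false;

// ...versus searching after the stream has been decompressed.
$decoded = gzuncompress($stored);
$foundInDecoded = strpos($decoded, "Hello, PDF") !== false;

var_dump($foundInStored, $foundInDecoded);
```

A parser library does this decoding for every stream in the file, which is why the rest of this guide relies on one instead of reading the bytes directly.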

Setup

The first step is to set up a development environment. We will start by installing PHP and the required libraries.

To install PHP, open the terminal and enter the command:

$ sudo apt-get install php -y

Once PHP is installed, use it to install Composer as shown in the commands:

php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');"

php -r "if (hash_file('sha384', 'composer-setup.php') === '906a84df04cea2aa72f40b5f787e49f22d4c2f19492ac310e8cba5b96ac8b64115ac402c8cd292b8a03482574915d1a8') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;"


php composer-setup.php

php -r "unlink('composer-setup.php');"

Once Composer is installed and set up, we can proceed to install the PDFParser library.

Open the terminal and enter the command:

$ php composer.phar require smalot/pdfparser

Generate PDF File

The next step is to choose a PDF file to work with. There are various ways to create one. For example, if you are on Windows, you can export a .doc/.docx document to PDF.

However, for this example, we will use free files readily available on the internet. Open your browser and navigate to the resource provided below:

https://filesamples.com/formats/pdf

Please select one of the available PDF files and save it on your system.

NOTE: Be sure to scan downloaded files for malware before using them. Tools such as VirusTotal are great resources.

https://www.virustotal.com/gui/

The following is a scan report of the sample1.pdf file.

https://www.virustotal.com/gui/file/6b22904a7de5b77bf40598c37e94e01771485e1b900651b58bf50af7009f8056

Extract PDF Metadata

To extract metadata from the PDF using the PDF parser library, we can implement sample code as shown below:

<?php

    // Load the Composer autoloader
    include 'vendor/autoload.php';

    // Parse the PDF file
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile("sample1.pdf");

    // Fetch the metadata as an associative array
    $metadata = $pdf->getDetails();

    // Print each property as key=>value
    foreach ($metadata as $meta => $value) {
        // Some properties hold a list of values; join them into one string
        if (is_array($value)) {
            $value = implode(", ", $value);
        }
        echo $meta . "=>" . $value . "\n";
    }

?>

The above code should fetch metadata information about the file. Such information includes:

CreationDate: 2016-12-22T11:43:55-05:00

Creator: Adobe InDesign CC 2015 (Macintosh)

ModDate: 2016-12-29T15:47:20-05:00

Producer: Adobe PDF Library 15.0

Trapped: False

Pages: 1
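Metadata fields are optional in a PDF, so a given file may lack some of the properties listed above. The following is a minimal sketch of a lookup with a fallback value. The helper name detailOrDefault is hypothetical and not part of the PDFParser library, and the sample array only stands in for what getDetails() would return. It also assumes list values are flat, not nested.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): read one metadata
// field, returning a default when the PDF does not define it.
function detailOrDefault(array $details, string $key, string $default = "(not set)"): string
{
    if (!array_key_exists($key, $details)) {
        return $default;
    }
    $value = $details[$key];
    // Join list values into one string; cast everything else to string
    return is_array($value) ? implode(", ", $value) : (string) $value;
}

// Sample data shaped like the listing above; real values come from getDetails().
$sample = [
    "Creator"  => "Adobe InDesign CC 2015 (Macintosh)",
    "Producer" => "Adobe PDF Library 15.0",
    "Pages"    => 1,
];

echo "Pages: " . detailOrDefault($sample, "Pages") . "\n";
echo "Title: " . detailOrDefault($sample, "Title") . "\n";
```

Using a default avoids undefined-index warnings when a script processes PDFs from different producers.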

Extract Text

To extract the text of the PDF, we can implement the code as shown below:

<?php

    include "vendor/autoload.php";

    $parser = new \Smalot\PdfParser\Parser();

    $pdf = $parser->parseFile("sample1.pdf");

    $text = $pdf->getText();

    echo $text;

?>

Once we run the code above, we should see the text extracted from the sample1.pdf file.
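Coming back to the opening question of searching a PDF for a specific word, the sketch below goes through the document page by page using the library's getPages() and getText() methods and prints the matching lines. The helper linesContaining and the search term "Lorem" are illustrative assumptions, not part of the library. The PDF section only runs if vendor/autoload.php and sample1.pdf exist in the current directory.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): return the lines of
// $text that contain $needle, ignoring case.
function linesContaining(string $text, string $needle): array
{
    $matches = [];
    foreach (preg_split('/\R/', $text) as $line) {
        if ($needle !== "" && stripos($line, $needle) !== false) {
            $matches[] = trim($line);
        }
    }
    return $matches;
}

// Only run the PDF part when the library and the sample file are present.
if (is_file("vendor/autoload.php") && is_file("sample1.pdf")) {
    include "vendor/autoload.php";

    try {
        $parser = new \Smalot\PdfParser\Parser();
        $pdf = $parser->parseFile("sample1.pdf");

        $pageNumber = 0;
        foreach ($pdf->getPages() as $page) {
            $pageNumber++;
            // "Lorem" is a placeholder search term
            foreach (linesContaining($page->getText(), "Lorem") as $line) {
                echo "Page {$pageNumber}: {$line}\n";
            }
        }
    } catch (\Exception $e) {
        // The parser raises an exception for input it cannot read
        echo "Could not parse PDF: " . $e->getMessage() . "\n";
    }
}
```

Wrapping the call in try/catch keeps the script from stopping with an uncaught exception when a file turns out to be damaged or unsupported.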

Closing

This guide shows you how you can parse PDF files using PHP and the PDFParser library. Check the documentation to learn more.

About the author

John Otieno
