Parsing PDF files is tedious for any software developer, not because the task is conceptually hard but because of how PDF files are structured. A PDF file is made up of objects, each identified by a unique number. These objects hold content such as images, text, and more. Their data is often compressed and sometimes encrypted, which makes it nearly impossible to process a PDF as a plain text document.
This guide shows how to parse PDF documents using the PHP programming language.
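To make the object structure described above more concrete, here is a minimal sketch that builds a tiny PDF-like fragment by hand and inspects it. The fragment is not a complete, valid PDF, and the helper names listObjectIds and inflateFirstStream are hypothetical; the sketch assumes PHP's zlib extension is available and only illustrates numbered objects and a compressed stream.

```php
<?php
// Return [object number, generation number] for every "N G obj" header.
function listObjectIds(string $raw): array
{
    preg_match_all('/(\d+)\s+(\d+)\s+obj\b/', $raw, $matches, PREG_SET_ORDER);
    $ids = [];
    foreach ($matches as $match) {
        $ids[] = [(int) $match[1], (int) $match[2]];
    }
    return $ids;
}

// Read the first stream using its declared /Length, then inflate it.
function inflateFirstStream(string $raw): ?string
{
    if (!preg_match('/\/Length\s+(\d+).*?stream\r?\n/s', $raw, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $start = $m[0][1] + strlen($m[0][0]); // first byte after "stream\n"
    $length = (int) $m[1][0];             // byte count declared by /Length
    $data = @gzuncompress(substr($raw, $start, $length));
    return $data === false ? null : $data;
}

// Hand-made fragment: one plain object and one zlib-compressed stream object.
$content = "BT /F1 12 Tf (Hello PDF) Tj ET";
$compressed = gzcompress($content);
$raw = "1 0 obj\n<< /Type /Catalog >>\nendobj\n"
     . "2 0 obj\n<< /Length " . strlen($compressed) . " /Filter /FlateDecode >>\n"
     . "stream\n" . $compressed . "\nendstream\nendobj\n";

print_r(listObjectIds($raw));                   // the two object headers found
echo inflateFirstStream($raw), PHP_EOL;         // the original content-stream text
```

Real PDF files add cross-reference tables, several filter types, and optional encryption on top of this, which is why a dedicated parsing library is the practical choice.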
Setup
The first step is to set up a development environment. We will start by installing PHP and the required libraries.
To install PHP, open the terminal and install it with your system's package manager (for example, sudo apt install php-cli on Debian or Ubuntu).
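Once PHP is installed, use it to download, verify, and run the Composer installer as shown in the commands below. The hash in the second command changes with every installer release, so copy the current value from https://getcomposer.org/download/ if verification fails:
php -r "copy('https://getcomposer.org/installer', 'composer-setup.php');"
php -r "if (hash_file('sha384', 'composer-setup.php') === '906a84df04cea2aa72f40b5f787e49f22d4c2f19492ac310e8cba5b96ac8b64115ac402c8cd292b8a03482574915d1a8') { echo 'Installer verified'; } else { echo 'Installer corrupt'; unlink('composer-setup.php'); } echo PHP_EOL;"
php composer-setup.php
php -r "unlink('composer-setup.php');"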
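With Composer installed and set up, we can proceed to install the PDFParser library, which is published as the smalot/pdfparser package.
Open the terminal in your project directory and enter the command below (use composer instead of php composer.phar if Composer is installed globally):
php composer.phar require smalot/pdfparser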
Generate PDF File
The next step is to select a PDF file to work with. There are various ways to create one. For example, if you are on Windows, you can export a .doc or .docx document to PDF.
However, for this example, we will use free files readily available on the internet. Open your browser and navigate to the resource provided below:
https://filesamples.com/formats/pdf
Please select one of the available PDF files and save it on your system.
NOTE: Be sure to scan downloaded files for malware before using them. Tools such as VirusTotal are great resources.
https://www.virustotal.com/gui/
The following is a scan report of the sample1.pdf file:
https://www.virustotal.com/gui/file/6b22904a7de5b77bf40598c37e94e01771485e1b900651b58bf50af7009f8056
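The report URL above ends in a 64-character hexadecimal string, which is the length of a SHA-256 digest. Assuming that string is the file's SHA-256 hash, the following sketch computes the same fingerprint locally so you can compare it with the report, and also checks that the file begins with the usual %PDF- signature. The function name describeDownload is hypothetical, and the sketch is a quick sanity check rather than a malware scan.

```php
<?php
// Report whether a file starts with the PDF signature and return its SHA-256 hash.
function describeDownload(string $path): array
{
    if (!is_readable($path)) {
        throw new InvalidArgumentException("Cannot read file: $path");
    }
    $handle = fopen($path, 'rb');
    $magic = fread($handle, 5); // a PDF normally begins with "%PDF-"
    fclose($handle);

    return [
        'looks_like_pdf' => $magic === '%PDF-',
        'sha256' => hash_file('sha256', $path),
    ];
}

// Only runs when the sample file from this guide is present.
if (is_readable('sample1.pdf')) {
    print_r(describeDownload('sample1.pdf'));
}
```

If the printed hash differs from the one in the report, the file you downloaded is not the file that was scanned.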
Extract PDF Metadata
To extract metadata from the PDF using the PDFParser library, we can implement sample code as shown below:
<?php
// include composer autoloader
include 'vendor/autoload.php';
// parse pdf
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile("sample1.pdf");
// get metadata
$metadata = $pdf->getDetails();
// loop each property
foreach ($metadata as $meta => $value) {
    // some properties hold several values; join them into one string
    if (is_array($value)) {
        $value = implode(", ", $value);
    }
    echo $meta . "=>" . $value . "\n";
}
?>
The above code should fetch metadata information about the file. Such information includes:
Creator: Adobe InDesign CC 2015 (Macintosh)
ModDate: 2016-12-29T15:47:20-05:00
Producer: Adobe PDF Library 15.0
Trapped: False
Pages: 1
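Not every PDF carries every property, so code that relies on specific fields benefits from defaults. The sketch below takes an array shaped like the metadata listed above and produces a small summary, turning ModDate into a date object when it can be parsed. The function name summarizeDetails, the 'unknown' placeholder, and the choice of fields are illustrative assumptions, not part of the library.

```php
<?php
// Pick a few common properties out of a metadata array, with safe fallbacks.
function summarizeDetails(array $details): array
{
    $modified = null;
    if (isset($details['ModDate']) && is_string($details['ModDate'])) {
        try {
            $modified = new DateTimeImmutable($details['ModDate']);
        } catch (Exception $e) {
            $modified = null; // keep null when the date string cannot be parsed
        }
    }

    return [
        'creator'  => $details['Creator'] ?? 'unknown',
        'producer' => $details['Producer'] ?? 'unknown',
        'pages'    => (int) ($details['Pages'] ?? 0),
        'modified' => $modified,
    ];
}

// Sample input copied from the values listed above.
$summary = summarizeDetails([
    'Creator'  => 'Adobe InDesign CC 2015 (Macintosh)',
    'ModDate'  => '2016-12-29T15:47:20-05:00',
    'Producer' => 'Adobe PDF Library 15.0',
    'Pages'    => 1,
]);
echo $summary['creator'], ', ', $summary['pages'], ' page(s)', PHP_EOL;
```

With the library installed, the array returned by getDetails() could be passed to the function in place of the sample input.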
Extract Text
To extract the text from all pages of the PDF, we can implement the code as shown below:
<?php
// include composer autoloader
include "vendor/autoload.php";
// parse pdf
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile("sample1.pdf");
// get the text of the whole document
$text = $pdf->getText();
echo $text;
?>
Once we run the code above, we should see the text extracted from the sample1.pdf file printed to the terminal.
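If you need the text of each page separately rather than one combined string, the library's document object also exposes its pages through getPages(), and each page has its own getText() method. The sketch below shows one way to collect text per page; the function name textByPage is hypothetical, and the stand-in page objects exist only so the example runs without the library or a PDF file.

```php
<?php
// Collect trimmed text per page, keyed by 1-based page number.
// Accepts any list of objects that provide a getText() method.
function textByPage(iterable $pages): array
{
    $result = [];
    $number = 1;
    foreach ($pages as $page) {
        $result[$number++] = trim($page->getText());
    }
    return $result;
}

// Stand-in pages (assumption for illustration only).
$fakePages = [
    new class { public function getText(): string { return " First page \n"; } },
    new class { public function getText(): string { return "Second page"; } },
];

foreach (textByPage($fakePages) as $n => $text) {
    echo "Page $n: $text", PHP_EOL;
}
```

With a parsed document, the call would be textByPage($pdf->getPages()), where $pdf is the object returned by parseFile().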
Closing
This guide showed how to parse PDF files using PHP and the PDFParser library. Check the library's documentation to learn more.