How to Use PDF Loaders in LangChain?

LangChain is a framework for building applications on top of large language models (LLMs), which generate text based on content drawn from different sources. The Portable Document Format (PDF), introduced by Adobe in the early 1990s, is a standard for storing documents that combine text, images, and other content. LangChain provides several PDF loaders that extract the text of a PDF so that an LLM application can work with it.

This article will demonstrate the process of using PDF loaders in LangChain.

To use the PDF loaders in LangChain, follow the steps in this guide:

Setup Prerequisites

Before using the PDF loaders, install LangChain:

pip install langchain

The next module to install is OpenAI, using the following command:

pip install openai
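Since each method below depends on a different package, it can help to check what is already installed before running the examples. The following is a minimal sketch that uses only the standard library. The helper name missing_modules is hypothetical, and the names in the example list are the import names assumed for the packages installed so far in this guide.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Import names assumed for the packages installed so far in this guide
print(missing_modules(["langchain", "openai"]))
```

An empty list means that every module in the list was found. Any name that is printed still needs to be installed with pip.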

Method 1: Using PyPDF Library

The first PDF loader is PyPDFLoader, which is built on the pypdf library and converts a PDF into a list of documents. Install pypdf before importing the loader by running the following command:

pip install pypdf

Upload the PDF file to Google Colaboratory (Colab) using this code:

from google.colab import files

upload = files.upload()

After running the above code, choose the file from the local system to upload it to the Colab session.

After that, import PyPDFLoader from LangChain and load the PDF file, splitting it into a list of documents:

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("Paradis debuts.pdf")

pages = loader.load_and_split()

Now, use an index to fetch a single document from the list:

pages[10]
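Indexing with pages[10] raises an IndexError when the PDF produces fewer than eleven documents, so it can be worth checking the length first. The sketch below illustrates this with a stand-in Document class that only mimics the page_content and metadata attributes. The class and the get_chunk helper are hypothetical and are not part of LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a loaded document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def get_chunk(pages, index):
    """Return pages[index], or None when the index is out of range."""
    if 0 <= index < len(pages):
        return pages[index]
    return None


# Three fake documents standing in for the result of loader.load_and_split()
pages = [Document(f"text of page {i}", {"page": i}) for i in range(3)]

print(len(pages))            # how many documents were produced
print(get_chunk(pages, 1))   # an existing document
print(get_chunk(pages, 10))  # None instead of an IndexError
```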

Method 2: Using OpenAIEmbeddings Library

The next method combines the PDF loader with OpenAIEmbeddings, which requires the installation of the following modules:

  • Tiktoken
  • FAISS

Install the tiktoken tokenizer, which is needed when using OpenAIEmbeddings:

pip install tiktoken

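FAISS is a library for efficient similarity search over large collections of vectors. The faiss-gpu build is installed here; on a machine without a GPU, the faiss-cpu package is the usual choice: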

pip install faiss-gpu

After that, provide the OpenAI API key using the getpass library, storing it in an environment variable through the os library:

import os

import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
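When the notebook is re-run, it can be convenient to ask for the key only if it is not already set. A minimal sketch of that idea follows. The helper name ensure_api_key and its prompt_fn parameter are hypothetical and were introduced here only to make the function easy to try without typing a real key.

```python
import getpass
import os


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass.getpass):
    """Set the environment variable only if it is missing, then return its value."""
    if not os.environ.get(var_name):
        # With the default prompt_fn (getpass.getpass) the typed key is not echoed
        os.environ[var_name] = prompt_fn(f"{var_name}: ")
    return os.environ[var_name]
```

Calling ensure_api_key() in place of the assignment above prompts once and then reuses the stored value on later calls.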

Now, import OpenAIEmbeddings and FAISS from LangChain, build an index from the pages loaded in the first method, and query it with the similarity_search() function:

from langchain.vectorstores import FAISS

from langchain.embeddings.openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())

docs = faiss_index.similarity_search("What are objects", k=2)

for doc in docs:
  print(str(doc.metadata["page"]) + ":", doc.page_content[:50])

The above code embeds the pages, runs a FAISS similarity search for the query, and prints the page number and the first 50 characters of each of the two best matches.
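To make the idea of a similarity search more concrete, here is a toy sketch that needs no API key. It is not FAISS and does not use OpenAI embeddings: each text is turned into a simple word-count vector, and cosine similarity (an illustrative choice) ranks the texts against the query. All function names and the sample texts are assumptions made for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (NOT an OpenAI embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(texts, query, k=2):
    """Return the k texts most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(embed(t), q), reverse=True)
    return ranked[:k]


# Hand-written sample texts standing in for the pages of a PDF
chunks = [
    "Objects bundle data and behaviour together.",
    "The weather was sunny on the first day.",
    "Classes are blueprints used to create objects.",
]
for hit in similarity_search(chunks, "What are objects", k=2):
    print(hit[:50])
```

A real vector store follows the same pattern of embedding the texts, embedding the query, and returning the closest matches, but with learned embeddings and an index built for speed.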

Method 3: Using Unstructured Library

Install the unstructured module to use this method for loading PDF documents in LangChain:

pip install unstructured

Install the pdf2image module, which is also used while loading the PDF files:

pip install pdf2image

Another module required by this method is pdfminer.six, which provides the pdfminer.high_level functions:

pip install pdfminer.six

After that, import the UnstructuredPDFLoader class from LangChain:

from langchain.document_loaders import UnstructuredPDFLoader

Load the PDF file using the following code with the name of the PDF uploaded in the first method:

loader = UnstructuredPDFLoader("Paradis debuts.pdf")

Load the contents of the PDF file into the data variable using the loader.load() method:

data = loader.load()

Method 4: Using Unstructured Library While Retaining Elements

The unstructured library splits the PDF into small elements such as titles and paragraphs. By default, the loader combines these elements into a single document, but they can be kept separate by passing mode="elements":

loader = UnstructuredPDFLoader("Paradis debuts.pdf", mode="elements")

Simply load the PDF in the data variable:

data = loader.load()

Print the data using the index number of the element to display it on the screen:

data[10]
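Once the elements are kept separate, they can be grouped by type. The sketch below assumes that each element carries a "category" entry in its metadata; that key name is an assumption and should be checked against the output of data[10]. Plain dictionaries stand in for the loaded documents, and the helper name group_by_category is hypothetical.

```python
from collections import defaultdict


def group_by_category(elements):
    """Map each category name to the list of element texts in that category."""
    groups = defaultdict(list)
    for el in elements:
        # "category" is the assumed metadata key; fall back to "Unknown"
        category = el["metadata"].get("category", "Unknown")
        groups[category].append(el["page_content"])
    return dict(groups)


# Hand-written stand-ins for the documents returned in elements mode
elements = [
    {"page_content": "Introduction", "metadata": {"category": "Title"}},
    {"page_content": "PDF files store text and images.", "metadata": {"category": "NarrativeText"}},
    {"page_content": "Loaders turn them into documents.", "metadata": {"category": "NarrativeText"}},
    {"page_content": "42", "metadata": {}},
]
print(group_by_category(elements))
```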

Method 5: Using OnlinePDFLoader Library

The user can also use an Online PDF loader to load PDFs from the internet using the following code:

from langchain.document_loaders import OnlinePDFLoader

Load the data from the internet by providing the URL of the PDF file:

loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")

Load the file into the data variable and print it:

data = loader.load()

print(data)

Method 6: Using PyPDFium2 Library

Install the PyPDFium2 module using the following code block to load PDF files:

pip install pypdfium2

After the installation, simply import the library to use its resources and methods:

from langchain.document_loaders import PyPDFium2Loader

Use the PyPDFium2Loader() method with the name of the file to load it in LangChain:

loader = PyPDFium2Loader("Paradis debuts.pdf")

Simply print the contents of the document using the data variable:

data = loader.load()

print(data)

Method 7: Using PDFMinerLoader Library

Import PDFMinerLoader, which relies on the pdfminer.six module installed in Method 3:

from langchain.document_loaders import PDFMinerLoader

Create the loader by passing PDFMinerLoader() the name of the file uploaded in the first method; any other PDF file can be used in the same way:

loader = PDFMinerLoader("Paradis debuts.pdf")

PDFMinerLoader loads the contents into the data variable, which can then be displayed with the print() function:

data = loader.load()

print(data)

Method 8: Using PDFMinerLoader Library to Get HTML Text

PDFMiner can also be used to convert the PDF to HTML in LangChain through the PDFMinerPDFasHTMLLoader class:

from langchain.document_loaders import PDFMinerPDFasHTMLLoader

Create the loader for the PDF file using this class:

loader = PDFMinerPDFasHTMLLoader("Paradis debuts.pdf")

Load the complete data/file as one document:

data = loader.load()[0]

Using the BeautifulSoup library (installed with pip install beautifulsoup4 if it is missing), the user can parse the generated HTML and extract useful details from it, such as the font size of each block of text:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data.page_content,'html.parser')

content = soup.find_all('div')

Collect consecutive pieces of text that share a font size into snippets; duplicate snippets could additionally be removed to filter out repeated headers and footers:

import re

cur_fs = None
cur_text = ''
snippets = []  # (text, font_size) pairs; consecutive text with one font size is merged

for c in content:
  sp = c.find('span')
  if not sp:
    continue
  st = sp.get('style')
  if not st:
    continue
  # Raw string so that \d is passed to the regex engine unchanged
  fs = re.findall(r'font-size:(\d+)px', st)
  if not fs:
    continue
  fs = int(fs[0])
  if not cur_fs:
    cur_fs = fs
  if fs == cur_fs:
    cur_text += c.text
  else:
    snippets.append((cur_text, cur_fs))
    cur_fs = fs
    cur_text = c.text

# Append the final run once the loop has finished
snippets.append((cur_text, cur_fs))
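The grouping step can also be tried in isolation, without a PDF. The sketch below is an alternative formulation using itertools.groupby rather than the loop from this guide, and it works on hand-written (text, font_size) pairs. The helper name merge_runs and the sample data are assumptions made for illustration.

```python
from itertools import groupby


def merge_runs(pieces):
    """Merge consecutive (text, font_size) pairs that share a font size."""
    snippets = []
    # groupby only groups adjacent items, which is exactly what is needed here
    for size, run in groupby(pieces, key=lambda p: p[1]):
        snippets.append(("".join(text for text, _ in run), size))
    return snippets


# Hand-written sample: one heading, two body pieces, one footer
pieces = [("Title", 20), ("Body one. ", 10), ("Body two.", 10), ("Footer", 8)]
print(merge_runs(pieces))
```

The two body pieces are merged into one snippet because they are adjacent and share a font size, while the heading and footer stay separate.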

Next, turn the snippets into semantic sections, assuming that a heading uses a larger font than the content that follows it:

from langchain.docstore.document import Document

cur_idx = -1
semantic_snippets = []

for s in snippets:
  # A font larger than the current heading starts a new section
  if not semantic_snippets or s[1] > semantic_snippets[cur_idx].metadata['heading_font']:
    metadata = {'heading': s[0], 'content_font': 0, 'heading_font': s[1]}
    metadata.update(data.metadata)
    semantic_snippets.append(Document(page_content='', metadata=metadata))
    cur_idx += 1
    continue

  # Same or smaller font than the current content: append to the current section
  if (not semantic_snippets[cur_idx].metadata['content_font']
      or s[1] <= semantic_snippets[cur_idx].metadata['content_font']):
    semantic_snippets[cur_idx].page_content += s[0]
    semantic_snippets[cur_idx].metadata['content_font'] = max(
        s[1], semantic_snippets[cur_idx].metadata['content_font'])
    continue

  # Larger than the content font but not larger than the heading: new section
  metadata = {'heading': s[0], 'content_font': 0, 'heading_font': s[1]}
  metadata.update(data.metadata)
  semantic_snippets.append(Document(page_content='', metadata=metadata))
  cur_idx += 1

Use semantic_snippets with an index to inspect a single section:

semantic_snippets[2]

Method 9: Using PyMuPDFLoader Library

Each PDF loader has its own advantages: PyMuPDFLoader is the fastest at loading PDF files and also returns useful information such as metadata about the content. Install the module to use it in LangChain:

pip install pymupdf

Import the loader that uses the module installed previously:

from langchain.document_loaders import PyMuPDFLoader

Create the loader for the PDF file:

loader = PyMuPDFLoader("Paradis debuts.pdf")

Load the file into the data variable:

data = loader.load()

Use an index to access a single page, as PyMuPDFLoader returns one document per page:

data[0]

Method 10: Using PyPDFDirectoryLoader Library

To load every PDF in a directory, import PyPDFDirectoryLoader, which uses the pypdf module installed in the first method:

from langchain.document_loaders import PyPDFDirectoryLoader

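Create the loader with the path of the directory that contains the PDF files rather than the path of a single file. Here "." is used on the assumption that the file uploaded earlier is in the current working directory:

loader = PyPDFDirectoryLoader(".")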

Load the data into the docs variable:

docs = loader.load()
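Before pointing a loader at a directory, it can be useful to see which PDF files that directory actually contains. The following sketch uses only pathlib and a temporary directory with empty placeholder files. The helper name find_pdfs is hypothetical, and the "*.pdf" pattern is an assumption that may differ from the pattern a given loader uses.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory, recursive=False):
    """Return the sorted paths of all .pdf files found in a directory."""
    root = Path(directory)
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(p for p in root.glob(pattern) if p.is_file())


# Build a small throwaway directory tree with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.pdf").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a pdf")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.pdf").write_bytes(b"")
    top_level = [p.name for p in find_pdfs(tmp)]
    everything = [p.name for p in find_pdfs(tmp, recursive=True)]

print(top_level)
print(everything)
```

If the list is empty for the real directory, a directory loader pointed at it would most likely have nothing to load.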

Method 11: Using PDFPlumberLoader Library

PDFPlumberLoader is another loader for PDF files in LangChain. Install the pdfplumber module first:

pip install pdfplumber

Import the PDFPlumberLoader class to load PDF files:

from langchain.document_loaders import PDFPlumberLoader

Create the loader using PDFPlumberLoader():

loader = PDFPlumberLoader("Paradis debuts.pdf")

Simply load the data in the data variable using the loader.load() method:

data = loader.load()

Print the data variable using the index number of the document to be displayed on the screen:

data[0]

That is all about using the PDF loaders in LangChain to fetch data from PDF files/documents.

Conclusion

To use the PDF loaders in LangChain, install the LangChain and OpenAI modules first, followed by the module that each loader depends on. LangChain offers several ways to load PDF files, and this guide has demonstrated eleven of them with their code and explanations, covering local files, directories, and PDFs fetched from the internet.

About the author

Talha Mahmood

As a technical author, I am eager to learn about writing and technology. I have a degree in computer science which gives me a deep understanding of technical concepts and the ability to communicate them to a variety of audiences effectively.