How to Use Parent Document Retriever in LangChain?

The LangChain framework enables developers to build applications around Large Language Models (LLMs) that understand and generate natural language. LangChain lets developers store document data as embeddings in vector stores so a model can work with that data, and it also enables the user to build retrievers that extract data from databases or vector stores for the model.

This post will demonstrate the process of using the parent document retriever in LangChain.

How to Use a Parent Document Retriever in LangChain?

A parent document retriever in LangChain splits documents into smaller chunks so that each chunk keeps a focused meaning when it is embedded. The parent document is the whole document, or a larger chunk, from which the smaller chunks are extracted: the small chunks are what get searched, and the matching parent is what gets returned.
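Before diving into the LangChain API, the core idea can be shown with a minimal, dependency-free sketch. The helper names below (split_chunks, build_index, retrieve_parent) are hypothetical stand-ins for what LangChain's ParentDocumentRetriever does internally: index the small chunks, but hand back the parent.

```python
# Minimal sketch of the parent-document idea (hypothetical helpers;
# the real work is done by LangChain's ParentDocumentRetriever).

def split_chunks(text, size):
    """Split text into fixed-size chunks (a stand-in for a text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(document, child_size=20):
    """Map each small (child) chunk back to its parent document."""
    return {child: document for child in split_chunks(document, child_size)}

def retrieve_parent(index, query):
    """Search over the small chunks, but return the full parent document."""
    for child, parent in index.items():
        if query in child:
            return parent
    return None

doc = "Justice Breyer served on the Supreme Court for many years."
index = build_index(doc)
print(retrieve_parent(index, "Breyer"))  # a small chunk matches, the whole doc is returned
```

The real retriever replaces the substring match with a vector similarity search over embeddings, but the child-to-parent mapping works the same way.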

To learn the process of using the parent document retriever in LangChain, simply check out this guide:

Step 1: Install Modules

First, start using the parent document retriever by installing the LangChain framework using the pip command:

pip install langchain

Install the Chroma database module to save the embeddings of the document and retrieve data from it:

pip install chromadb

Install tiktoken, a tokenizer that splits the document's text into tokens (small chunks of text):

pip install tiktoken

Get the OpenAI module by executing the following command in the Python notebook to install its dependencies and libraries:

pip install openai

Step 2: Setup Environment & Upload Data

The next step is to set up the environment using the API key from the OpenAI account:

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

Now, upload the documents from the local system by importing the files module from google.colab and calling the upload() method:

from google.colab import files
uploaded = files.upload()

Step 3: Import Libraries

The next step contains the code for importing the required libraries for using the parent document retrievers using the LangChain framework:

from langchain.retrievers import ParentDocumentRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader

Load the documents used to build the retriever with the TextLoader() method, passing the path of each file, and collect them in the docs list:

loaders = [
    TextLoader('Data.txt'),
    TextLoader('state_of_the_union.txt'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

Step 4: Retrieving Complete Documents

Once the documents/files are loaded to the model, simply build the embeddings of the documents, and store them in the vector stores:

child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

Now, call the add_documents() method on the retriever to index the loaded documents:

retriever.add_documents(docs, ids=None)

The following code lists the keys (document IDs) stored in the docstore for the uploaded files:

list(store.yield_keys())

After getting the embeddings of the documents, call the similarity_search() method with the query to get the small chunks from the document:

sub_docs = vectorstore.similarity_search("justice breyer")

Call the print() method to display the chunks called in the previous code based on the query:

print(sub_docs[0].page_content)

Call the get_relevant_documents() method on the retriever to fetch the complete parent documents that match the query:

retrieved_docs = retriever.get_relevant_documents("justice breyer")

Printing the whole document would take a lot of space, so simply get the length (in characters) of the first retrieved document:

len(retrieved_docs[0].page_content)
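This length is much larger than any single child chunk: the child splitter caps each indexed chunk at 400 characters, while the retriever returns the full parent document. A dependency-free sketch of that size difference (using hypothetical text, not the actual uploaded files):

```python
# Child chunks are capped at chunk_size characters; the parent is not.
chunk_size = 400
parent_text = "word " * 2000          # a 10,000-character stand-in document

child_chunks = [parent_text[i:i + chunk_size]
                for i in range(0, len(parent_text), chunk_size)]

print(max(len(c) for c in child_chunks))  # every child chunk is <= 400
print(len(parent_text))                   # the parent is 10,000 characters
```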

Step 5: Retrieving Larger Chunks

This step will not use the whole document as the parent; instead, it takes a bigger chunk of the document as the parent and retrieves smaller chunks from it:

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()

Configure the retriever to get the smaller token from the huge pool of data stored in the “vectorstore” variable:

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

Call the retriever to get the larger chunks from the vector stores using the docs variable in the argument of the function:

retriever.add_documents(docs)

Get the length of these documents from the docs variable via the below command:

len(list(store.yield_keys()))
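The count here is larger than the number of loaded files because each file is first split into 2000-character parent chunks, and each parent chunk gets its own key in the store. A rough, dependency-free sketch of that arithmetic (the document lengths below are hypothetical, chosen only to illustrate how a count like 23 can arise; the real splitter also respects separators and overlap, so actual counts vary):

```python
import math

parent_chunk_size = 2000
doc_lengths = [38_000, 8_000]   # hypothetical character counts for two files

# Each file contributes roughly ceil(length / parent_chunk_size) parent chunks.
parent_count = sum(math.ceil(n / parent_chunk_size) for n in doc_lengths)
print(parent_count)  # 23 parent keys for these lengths
```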

Simply get a smaller chunk from a bigger one; the previous output shows that there are 23 parent chunks stored in the vector store. The query is used to get the relevant data using the similarity_search() method to retrieve data from the vector store:

sub_docs = vectorstore.similarity_search("justice breyer")

Print the smaller chunks using the query mentioned in the previous code to display them on the screen:

print(sub_docs[0].page_content)

Now, use the retriever on the complete dataset stored in the database, passing the query as the argument of the method:

retrieved_docs = retriever.get_relevant_documents("justice breyer")

Get the length (in characters) of the first parent chunk that was retrieved:

len(retrieved_docs[0].page_content)

We cannot display all the chunks here, so only the first chunk (index 0) is displayed using the following code:

print(retrieved_docs[0].page_content)

That is all about the process of using the parent document retriever in LangChain.

Conclusion

To use the parent document retriever in LangChain, simply install the modules and set up the OpenAI environment using its API key. After that, import the required libraries from LangChain for the parent document retriever, and then load the documents for the model. The parent document can be either the whole document or a larger chunk, and the smaller chunks extracted from it are matched against the query. This post has elaborated on the process of using the parent document retriever in LangChain.

About the author

Talha Mahmood

As a technical author, I am eager to learn about writing and technology. I have a degree in computer science which gives me a deep understanding of technical concepts and the ability to communicate them to a variety of audiences effectively.