
How to Use MultiVector Retrievers in LangChain?

LangChain offers the modules and libraries that developers need to build applications on top of Large Language Models, or LLMs. These models are trained on massive amounts of data written in natural languages like English, Spanish, etc. so that humans can ask questions in their native language. Such a dataset has its own complexities, and a massive dataset can be stored across multiple vectors to avoid performance issues.

This post will demonstrate the process of using MultiVector retrievers in LangChain.

How to Use MultiVector Retrievers in LangChain?

LangChain provides the “MultiVectorRetriever” class to create multiple vectors per document and store them in vector stores or databases. To learn the process of using MultiVector retrievers in LangChain, simply go through the listed steps:

Step 1: Install Modules

First, install the LangChain module to get its dependencies for using the MultiVector retriever:

pip install langchain

Install the Chroma database using the pip command to get the database for storing the vectors of the document:

pip install chromadb

Install the “tiktoken” tokenizer, which is used when splitting the documents into smaller chunks so they can be stored in the Chroma database:

pip install tiktoken

The last module to install is the OpenAI framework, which is required to set up the OpenAI environment and use the OpenAIEmbeddings() method:

pip install openai

Step 2: Setup Environment & Upload Data

Now, the next step after getting all the required modules is to set up the environment using the API key from the OpenAI account:

import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

Use the following code to upload the documents to the Python Notebook after importing the files library (this step applies to Google Colab):

from google.colab import files
uploaded = files.upload()

Step 3: Import Libraries

After completing all the installations and setup, simply import the required libraries from the installed modules to get into the process of using MultiVector retrievers:

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader

Use the TextLoader() method to load the documents uploaded to the Python Notebook in the second step:

loaders = [
    TextLoader('Data.txt'),
    TextLoader('state_of_the_union.txt'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

Method 1: Use MultiVector Retrievers Via Small Chunks

The first method splits each complete document into small chunks and stores them in the vector store, each tagged with its parent document’s identity number. The small chunks are what get matched during search, while the full parent document can then be extracted from the docstore by the MultiVector retriever:

Step 1: Creating Small Chunks

This is the first major step towards using the MultiVector retriever, as the following code configures the vector store, the document store, and the retriever itself, and assigns an identity number to each loaded document. The small chunks will be stored in the vector store after creating their embeddings, which convert the text into numbers:

vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
import uuid
doc_ids = [str(uuid.uuid4()) for _ in docs]

Use the text splitter to convert the documents into chunks of 400 characters:

child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

Now make sub-documents from the chunks, attaching each parent document’s identity number to its chunks’ metadata for use later:

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

Now, add the sub-documents to the vector store and the original documents to the docstore, keyed by their identity numbers:

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Step 2: Testing the Retriever

Once the documents are split into small chunks and stored in the database, simply test the underlying vector store using a similarity search with a query from the document; this returns one of the small chunks:

retriever.vectorstore.similarity_search("justice breyer")[0]


The complete document is too long to print on the screen, so the following code only displays the length of the first document that the retriever got using the query. Note that the retriever returns the full parent document, not just the matching chunk:

len(retriever.get_relevant_documents("justice breyer")[0].page_content)
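To make the chunk-to-parent mapping concrete, here is a minimal, dependency-free sketch of the idea behind the MultiVector retriever (plain Python dictionaries stand in for the Chroma vector store and InMemoryStore; the names here are illustrative, not LangChain’s API):

```python
import uuid

# one long parent document, as loaded in the earlier steps
parent_doc = "Justice Breyer served on the Supreme Court for many years. " * 50
doc_id = str(uuid.uuid4())

# child chunks of 400 characters, each carrying the parent's id in metadata
chunk_size = 400
chunks = [
    {"page_content": parent_doc[i:i + chunk_size], "metadata": {"doc_id": doc_id}}
    for i in range(0, len(parent_doc), chunk_size)
]

# the docstore maps ids back to full documents (stands in for InMemoryStore)
docstore = {doc_id: parent_doc}

# pretend the similarity search matched the first chunk ...
matched_chunk = chunks[0]
# ... the retriever then uses the chunk's metadata to fetch the full parent
retrieved_parent = docstore[matched_chunk["metadata"]["doc_id"]]

print(len(matched_chunk["page_content"]))        # small: at most 400 characters
print(len(retrieved_parent) > chunk_size)        # the full document is much longer
```

This is why similarity_search() above prints a small chunk while get_relevant_documents() returns the complete parent document.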

Method 2: Use MultiVector Retrievers Via Summaries

This method stores summaries of the documents in the vector store, instead of the small chunks used in the previous method. A summary offers a concise version of each document with a smaller quantity of embeddings or tokens, which increases efficiency:

Step 1: Getting Summary

The next steps explain the process of using MultiVector retrievers in LangChain by extracting summaries from the whole documents. Getting summaries of the documents decreases the performance load and makes a dataset with a huge number of documents easier to manage:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
import uuid
from langchain.schema.document import Document

Build the template or structure for the prompts and configure the chain for the LLM using the ChatOpenAI() and StrOutputParser() methods:

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)
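The “|” operator above is LangChain’s Expression Language (LCEL) composing the input mapping, the prompt template, the model, and the output parser into a single chain. As a rough, dependency-free sketch of that composition idea (the Step class and the fake LLM below are our own illustration, not LangChain’s implementation):

```python
# A minimal mock of pipe-style composition: each step is a callable,
# and "|" chains them so the output of one feeds the next.
class Step:
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # compose: run self first, then the next step
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

extract = Step(lambda doc: doc["page_content"])                       # pull the text out
template = Step(lambda t: f"Summarize the following document:\n\n{t}")  # build the prompt
fake_llm = Step(lambda prompt: "A short summary.")                    # stands in for ChatOpenAI

chain = extract | template | fake_llm
print(chain.invoke({"page_content": "Some long document text."}))
# -> A short summary.
```

In the real chain, ChatOpenAI() replaces the fake LLM and StrOutputParser() extracts the plain-text answer from the model’s message.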

Step 2: Storing Summaries

Define the summaries variable by running the chain over all the documents with the batch() method, allowing up to five concurrent requests:

summaries = chain.batch(docs, {"max_concurrency": 5})

Build the vector store using the Chroma database to store the summaries and their embeddings, along with a docstore for the original documents:

vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

Create a list of Document objects from the summaries, attaching the parent document’s ID number to each summary:

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

Now store the summaries and original files/documents in the vector store with their identity numbers:

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
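The bookkeeping here mirrors Method 1: the summaries are what get embedded and searched, while the full documents sit in the docstore under the same IDs. A minimal, dependency-free sketch of that mapping (plain dictionaries stand in for the stores; the names are illustrative, not LangChain’s API):

```python
import uuid

# full documents and their (pretend) LLM-generated summaries
documents = [
    "A very long speech mentioning Justice Breyer and his service. " * 40,
    "Another long document discussing the state of the economy. " * 40,
]
summaries = ["A speech honoring Justice Breyer.", "Notes on the economy."]

doc_ids = [str(uuid.uuid4()) for _ in documents]

# searchable entries: short summaries tagged with their parent's id
summary_index = [
    {"page_content": s, "metadata": {"doc_id": doc_ids[i]}}
    for i, s in enumerate(summaries)
]
docstore = dict(zip(doc_ids, documents))  # stands in for InMemoryStore

# pretend the query "justice breyer" matched the first summary ...
matched = summary_index[0]
# ... the retriever returns the full parent document, not the summary
full_doc = docstore[matched["metadata"]["doc_id"]]

print(len(matched["page_content"]) < len(full_doc))  # the summary is much shorter
```

This is why the steps below show a short summary from similarity_search() but a full-length document from get_relevant_documents().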

Query the vector store directly using the similarity_search() method to get the summaries that match the query:

sub_docs = vectorstore.similarity_search("justice breyer")

Step 3: Testing the Retriever

Now print the first document extracted and stored in the sub_docs variable by specifying the index number of the summary:

sub_docs[0]

Store all the documents retrieved for the query using the get_relevant_documents() method; this returns the full original documents, not the summaries:

retrieved_docs = retriever.get_relevant_documents("justice breyer")

Print only the length of the first document extracted using the query mentioned in the retriever:

len(retrieved_docs[0].page_content)

That’s all about the process of using the MultiVector retriever in LangChain.

Conclusion

To use MultiVector retrievers in LangChain, first install the required modules and then import their libraries to get the dependencies for the MultiVector retriever. The MultiVector retriever can be used through multiple methods, like splitting the documents into smaller chunks or getting summaries of the documents. This post has elaborated on all the steps for using a MultiVector retriever in LangChain with both of the above methods.

About the author

Talha Mahmood

As a technical author, I am eager to learn about writing and technology. I have a degree in computer science which gives me a deep understanding of technical concepts and the ability to communicate them to a variety of audiences effectively.