Training a large language model requires a massive amount of data, which is difficult to manage without vector stores or data lakes. Weaviate is an open-source vector database that can store and manage large data objects and vector embeddings for AI applications. It allows the user to retrieve data using a similarity search based on queries or prompts, and it exposes interfaces such as GraphQL and REST alongside client libraries for several programming languages.
This guide will explain the process of using Weaviate self-querying in LangChain.
How to Use Weaviate Self-Querying in LangChain?
To use Weaviate self-querying in LangChain, follow the steps below:
Prerequisites
Before using self-query with the Weaviate database in LangChain, create a cluster in your Weaviate account and note its connection credentials (cluster URL and API key).
Install Frameworks
After creating the cluster in the Weaviate vector database, simply install LangChain to use self-query for Weaviate data:
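A typical installation command (package unpinned; pin a version if you need reproducibility):

```shell
pip install langchain
```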
Install the OpenAI module to use the Embedding functions while creating self-query:
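The OpenAI SDK can be installed with pip:

```shell
pip install openai
```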
Install the lark module (required by LangChain's self-query constructor) along with the weaviate-client package to access the cluster:
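Both packages can be installed together:

```shell
pip install lark weaviate-client
```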
The tiktoken tokenizer is also required to retrieve data from the database:
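It is installable via pip as well:

```shell
pip install tiktoken
```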
After installing all the modules, provide the OpenAI and Weaviate API keys from their respective accounts; these are used to access each service:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["WEAVIATE_API_KEY"] = getpass.getpass("Weaviate API Key:")
Import Libraries
After that, import the required libraries: Document to create the data objects stored in Weaviate, OpenAIEmbeddings to embed the data after splitting it into small chunks, and Weaviate to access the vector store:
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
#importing Weaviate vector stores to create a vector database
from langchain.vectorstores import Weaviate
import os
embeddings = OpenAIEmbeddings()
Insert Data in Weaviate Cluster
After importing all the necessary libraries, create the documents and store them in the database using the Weaviate cluster URL:
docs = [
Document(
page_content="Earth is a million years old",
metadata={"year": 2003, "rating": 8.7, "genre": "science fiction"},
),
Document(
page_content="Mark Boucher gets lost in space",
metadata={"year": 2009, "director": "Ab De-Villiers", "rating": 9.2},
),
Document(
page_content="A doctor gets lost in a series of dreams",
metadata={"year": 2006, "director": "Satoshi Kon", "rating": 7.6},
),
Document(
page_content="A bunch of highly talented ladies/women are saving the world",
metadata={"year": 2019, "director": "Sara Taylor", "rating": 8.3},
),
Document(
page_content="Toy cars are fighting for their existence at a racing track",
metadata={"year": 2000, "genre": "animated"},
),
Document(
page_content="prisoners plan to escape but are caught",
metadata={
"year": 2009, "director": "Ben Ducket", "genre": "thriller", "rating": 9.9,
},
),
]
vectorstore = Weaviate.from_documents(
docs, embeddings, weaviate_url="https://demo-8pmnonvn.weaviate.network"
)
Configure Self-Query Retriever
After storing data in the Weaviate database, create a self-query retriever to extract data from Weaviate by defining the metadata fields it can filter on:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
AttributeInfo(
name="genre",
description="The genre of the movie",
type="string or list[string]",
),
AttributeInfo(
name="year",
description="The year the movie was released",
type="integer",
),
AttributeInfo(
name="director",
description="The name of the movie director",
type="string",
),
AttributeInfo(
name="rating", description="A 1-10 rating for the movie", type="float"
),
]
#configure the retriever using the OpenAI LLM to fetch data from the database
from langchain.llms import OpenAI
document_content_description = "Get basic info about the movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
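Under the hood, the retriever asks the LLM to split a prompt into a semantic query plus a structured metadata filter, then applies that filter during the vector search. The following is a stdlib-only sketch of how such a filter selects from the sample metadata (the names and the filter are hypothetical, not LangChain internals):

```python
# Toy illustration of a self-query metadata filter (hypothetical, not LangChain code)
sample_docs = [
    {"year": 2009, "director": "Ab De-Villiers", "rating": 9.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 7.6},
    {"year": 2009, "director": "Ben Ducket", "genre": "thriller", "rating": 9.9},
]

# A structured filter the LLM might produce for the prompt
# "a movie released in 2009 with a rating above 9"
def matches(meta):
    return meta.get("year") == 2009 and meta.get("rating", 0) > 9

filtered = [d for d in sample_docs if matches(d)]
print(filtered)  # both 2009 movies rated above 9
```

The semantic part of the query then only searches within the documents that pass the filter.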
Test Self-Query Retriever
Now, test the self-query by providing a prompt in natural language to extract data from the database. For example, the retriever returns the movies related to the science fiction genre stored in the database:
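Such a call can be sketched as follows (the prompt is illustrative; running it requires the retriever and API keys configured above):

```python
# Semantic-only prompt: no metadata filter is generated
retriever.get_relevant_documents("What are some movies about science fiction?")
```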
After that, use a prompt with a filter to get exact matches from Weaviate:
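For instance, a prompt that implies a rating filter (illustrative; the LLM translates the wording into a comparison on the rating field):

```python
# The LLM turns "higher than 8.5" into a rating > 8.5 metadata filter
retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")
```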
Using Filter K
Configure the retriever to enable the limit, which can be used to fetch an exact number of records from the database:
retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
enable_limit=True,
verbose=True,
)
The prompt can include the exact number of documents to fetch from the Weaviate vector database:
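With enable_limit=True, the number in the prompt is turned into a limit on the results (prompt illustrative; requires the setup above):

```python
# "two" is parsed into k=2, so only two documents are returned
retriever.get_relevant_documents("What are two movies about science fiction?")
```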
Running the retriever fetches only two science fiction movies from the database:
That is all about using the self-query for Weaviate in the LangChain framework.
Conclusion
To use Weaviate self-querying in LangChain, install LangChain, OpenAI, lark, and the Weaviate client to get data from the database. After that, set the OpenAI and Weaviate API keys to access each service and insert data into the Weaviate database using the cluster URL. Finally, create the self-query retriever and apply filters and limits with the query to fetch data. This post demonstrated the process of using self-query with the Weaviate database in LangChain.