Training a large language model requires a massive amount of data, which is difficult to manage without vector stores or data lakes. Weaviate is an open-source vector database that can store and manage large data objects and vector embeddings for AI applications. It allows the user to retrieve data using a similarity search based on queries or prompts, and it exposes interfaces such as GraphQL and REST alongside client libraries for several programming languages.
This guide will explain the process of using Weaviate self-querying in LangChain.
How to Use Weaviate Self-Querying in LangChain?
To use Weaviate self-querying in LangChain, follow the steps below:
Prerequisites
Before using self-query with the Weaviate database in LangChain, create a cluster in your Weaviate account and note its connection credentials (cluster URL and API key).
Install Frameworks
After creating the cluster in the Weaviate vector database, simply install LangChain to use self-query for Weaviate data:
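A typical installation command (package unpinned; pin a version if you need reproducibility):

```shell
pip install langchain
```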
Install the OpenAI module to use the Embedding functions while creating self-query:
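The OpenAI SDK can be installed with pip:

```shell
pip install openai
```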
Install the lark module (required by LangChain's self-query constructor) along with the weaviate-client package to access the cluster:
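Both packages can be installed together:

```shell
pip install lark weaviate-client
```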
The tiktoken tokenizer is also required to retrieve data from the database:
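It is installable via pip as well:

```shell
pip install tiktoken
```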
After installing all the modules, provide the OpenAI and Weaviate API keys from their respective accounts; these are used to access each service:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["WEAVIATE_API_KEY"] = getpass.getpass("Weaviate API Key:")
Import Libraries
After that, import the required libraries: Document to create the data objects stored in Weaviate, OpenAIEmbeddings to embed the data after splitting it into small chunks, and Weaviate to access the vector store:
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
#importing Weaviate vector stores to create a vector database
from langchain.vectorstores import Weaviate
import os
embeddings = OpenAIEmbeddings()
Insert Data in Weaviate Cluster
After importing all the necessary libraries, create the documents and store them in the database using the Weaviate cluster URL:
docs = [
Document(
page_content="Earth is a million years old",
metadata={"year": 2003, "rating": 8.7, "genre": "science fiction"},
),
Document(
page_content="Mark Boucher gets lost in space",
metadata={"year": 2009, "director": "Ab De-Villiers", "rating": 9.2},
),
Document(
page_content="A doctor gets lost in a series of dreams",
metadata={"year": 2006, "director": "Satoshi Kon", "rating": 7.6},
),
Document(
page_content="A bunch of highly talented ladies/women are saving the world",
metadata={"year": 2019, "director": "Sara Taylor", "rating": 8.3},
),
Document(
page_content="Toy cars are fighting for their existence at a racing track",
metadata={"year": 2000, "genre": "animated"},
),
Document(
page_content="prisoners plan to escape but are caught",
metadata={
"year": 2009, "director": "Ben Ducket", "genre": "thriller", "rating": 9.9,
},
),
]
vectorstore = Weaviate.from_documents(
docs, embeddings, weaviate_url="https://demo-8pmnonvn.weaviate.network"
)
Configure Self-Query Retriever
After storing data in the Weaviate database, create a self-query retriever to extract data from Weaviate by defining the metadata fields it can filter on:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
metadata_field_info = [
AttributeInfo(
name="genre",
description="The genre of the movie",
type="string or list[string]",
),
AttributeInfo(
name="year",
description="The year the movie was released",
type="integer",
),
AttributeInfo(
name="director",
description="The name of the movie director",
type="string",
),
AttributeInfo(
name="rating", description="A 1-10 rating for the movie", type="float"
),
]
#configure the retriever using the OpenAI LLM to fetch data from the database
from langchain.llms import OpenAI
document_content_description = "Get basic info about the movie"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
llm, vectorstore, document_content_description, metadata_field_info, verbose=True
)
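Under the hood, the retriever asks the LLM to split a prompt into a semantic query plus a structured metadata filter, then applies that filter during the vector search. The following is a stdlib-only sketch of how such a filter selects from the sample metadata (the names and the filter are hypothetical, not LangChain internals):

```python
# Toy illustration of a self-query metadata filter (hypothetical, not LangChain code)
sample_docs = [
    {"year": 2009, "director": "Ab De-Villiers", "rating": 9.2},
    {"year": 2006, "director": "Satoshi Kon", "rating": 7.6},
    {"year": 2009, "director": "Ben Ducket", "genre": "thriller", "rating": 9.9},
]

# A structured filter the LLM might produce for the prompt
# "a movie released in 2009 with a rating above 9"
def matches(meta):
    return meta.get("year") == 2009 and meta.get("rating", 0) > 9

filtered = [d for d in sample_docs if matches(d)]
print(filtered)  # both 2009 movies rated above 9
```

The semantic part of the query then only searches within the documents that pass the filter.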
Test Self-Query Retriever
Now, test the self-query by providing a prompt in natural language to extract data from the database. For example, the retriever returns the movies related to the science fiction genre stored in the database:
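Such a call can be sketched as follows (the prompt is illustrative; running it requires the retriever and API keys configured above):

```python
# Semantic-only prompt: no metadata filter is generated
retriever.get_relevant_documents("What are some movies about science fiction?")
```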
After that, use a prompt with a filter to get exact matches from Weaviate:
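For instance, a prompt that implies a rating filter (illustrative; the LLM translates the wording into a comparison on the rating field):

```python
# The LLM turns "higher than 8.5" into a rating > 8.5 metadata filter
retriever.get_relevant_documents("I want to watch a movie rated higher than 8.5")
```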
Using Filter K
Configure the retriever to enable the limit, which can be used to fetch an exact number of records from the database:
retriever = SelfQueryRetriever.from_llm(
llm,
vectorstore,
document_content_description,
metadata_field_info,
enable_limit=True,
verbose=True,
)
The prompt can include the exact number of documents to fetch from the Weaviate vector database:
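With enable_limit=True, the number in the prompt is turned into a limit on the results (prompt illustrative; requires the setup above):

```python
# "two" is parsed into k=2, so only two documents are returned
retriever.get_relevant_documents("What are two movies about science fiction?")
```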
Running the retriever fetches only two science fiction movies from the database:
That is all about using the self-query for Weaviate in the LangChain framework.
Conclusion
To use Weaviate self-querying in LangChain, install LangChain, OpenAI, lark, and the Weaviate client to get data from the database. After that, set the OpenAI and Weaviate API keys to access each service and insert data into the Weaviate database using the cluster URL. Finally, create the self-query retriever and apply filters and limits with the query to fetch data. This post demonstrated the process of using self-query with the Weaviate database in LangChain.