
Milvus Hybrid Search

In Milvus, a hybrid search is a vector search that incorporates attribute filtering. By using Boolean expressions to filter the scalar fields or the primary key field, you can narrow down a search to the entities that meet specific conditions.
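To make the idea concrete before we touch the server, the following is a minimal pure-Python sketch of the concept: keep only the rows that pass a Boolean condition, then rank what remains by vector distance. The toy_hybrid_search() and squared_l2() helpers are hypothetical names made up for this illustration. This is not how Milvus runs the query internally; it only mirrors the result that we expect later in this tutorial.

```python
# Toy rows that mirror the film entities used later in this tutorial.
films = [
    {"film_id": 1, "film_release_year": 2010, "film_intro": [0.1, 0.5]},
    {"film_id": 2, "film_release_year": 1994, "film_intro": [0.2, 0.7]},
]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_hybrid_search(rows, query, predicate, limit):
    """Hypothetical helper: filter rows with a predicate, then rank by distance."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: squared_l2(row["film_intro"], query))
    return [row["film_id"] for row in candidates[:limit]]


# Same condition as the Boolean expression used later: film_release_year <= 2000
hits = toy_hybrid_search(
    films, [0.1, 0.2], lambda row: row["film_release_year"] <= 2000, limit=10
)
print(hits)
```

Without the condition, both films would be ranked; with it, only the 1994 film remains a candidate.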

This tutorial demonstrates how to perform a basic hybrid search in Milvus using the PyMilvus API.

Requirements:

To use the provided methods and the code in this post, ensure that you have the following:

  1. Access to a Milvus server
  2. Python 3.10 or higher
  3. PyMilvus installed

With the given requirements met, we can proceed with the tutorial.
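If you are unsure whether your environment is ready, a small check like the following sketch can help. The check_requirements() helper is a hypothetical name introduced here; it only verifies the Python version and that the pymilvus package can be found, and it does not test access to a Milvus server.

```python
import importlib.util
import sys


def check_requirements(min_python=(3, 10)):
    """Hypothetical helper: return a list of problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or higher is required")
    # find_spec() returns None when a top-level package cannot be located.
    if importlib.util.find_spec("pymilvus") is None:
        problems.append("PyMilvus is not installed (try: pip install pymilvus)")
    return problems


for problem in check_requirements():
    print(problem)
```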

Create a Collection

Before we dive into performing a hybrid search in Milvus, let us start by setting up a basic collection for demonstration purposes.

We can do this by setting up the necessary collection parameters such as the field schema, collection schema, and the collection name.

We use the Milvus Python API to create a collection to store the film information. Remember that this is demo data and does not represent real-world data or a real-world schema configuration.

from pymilvus import CollectionSchema, FieldSchema, DataType

film_id = FieldSchema(
  name="film_id",
  dtype=DataType.INT64,
  is_primary=True,
)

film_title = FieldSchema(
  name="film_title",
  dtype=DataType.VARCHAR,
  max_length=200,
)

film_director = FieldSchema(
  name="film_director",
  dtype=DataType.VARCHAR,
  max_length=100,
)

film_release_year = FieldSchema(
  name="film_release_year",
  dtype=DataType.INT64,
)

film_genre = FieldSchema(
  name="film_genre",
  dtype=DataType.VARCHAR,
  max_length=50,
)

film_intro = FieldSchema(
  name="film_intro",
  dtype=DataType.FLOAT_VECTOR,
  dim=2
)

schema = CollectionSchema(
  fields=[film_id, film_title, film_director, film_release_year, film_genre, film_intro],
  description="Film search",
  enable_dynamic_field=True
)

collection_name = "film"

The previous example uses the PyMilvus SDK to set up a basic collection schema with the defined parameters. It contains scalar fields such as “film_id”, “film_title”, and “film_release_year”, plus “film_intro”, a two-dimensional float vector field that we search on later.

Once we are done, we can create the collection using the previous schema. The code is as follows:

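from pymilvus import Collection, connections

# Connect to the Milvus server first; adjust the host and port for your setup.
connections.connect(
  alias="default",
  host='localhost',
  port='19530'
)

collection = Collection(
  name=collection_name,
  schema=schema,
  using='default',
  shards_num=2
)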

Once the collection is created, we can use the collection.insert() method to insert sample data into it. First, define the data as shown in the following:

film_data = [
  {
    "film_id": 1,
    "film_title": "Inception",
    "film_director": "Christopher Nolan",
    "film_release_year": 2010,
    "film_genre": "Science Fiction",
    "film_intro": [0.1, 0.5],  # Sample float vector
  },
  {
    "film_id": 2,
    "film_title": "The Shawshank Redemption",
    "film_director": "Frank Darabont",
    "film_release_year": 1994,
    "film_genre": "Drama",
    "film_intro": [0.2, 0.7],  # Sample float vector
  },
]

To insert the data, connect to the Milvus server and call the insert() method as follows:

collection = Collection("film")
mr = collection.insert(film_data)

This should insert the previously defined “film_data” into the film collection.
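Since the schema constrains each field, it can be handy to check the rows before sending them to the server. The following is a minimal sketch under stated assumptions: validate_row() and the constants are hypothetical and hand-copied from the schema that we defined earlier, and the string limit is measured in UTF-8 bytes, which is an assumption about how the limit is counted. It is a local sanity check, not a replacement for the validation that Milvus performs.

```python
# Hypothetical, hand-written mirror of the schema constraints defined earlier.
VARCHAR_LIMITS = {"film_title": 200, "film_director": 100, "film_genre": 50}
INT_FIELDS = ("film_id", "film_release_year")
VECTOR_FIELD, VECTOR_DIM = "film_intro", 2


def validate_row(row):
    """Return a list of problems found in one row; an empty list means it looks fine."""
    problems = []
    for name in INT_FIELDS:
        if not isinstance(row.get(name), int):
            problems.append(f"{name} must be an integer")
    for name, limit in VARCHAR_LIMITS.items():
        value = row.get(name)
        # Assumption: the limit is counted in UTF-8 bytes.
        if not isinstance(value, str) or len(value.encode("utf-8")) > limit:
            problems.append(f"{name} must be a string of at most {limit} bytes")
    vector = row.get(VECTOR_FIELD)
    if not isinstance(vector, list) or len(vector) != VECTOR_DIM:
        problems.append(f"{VECTOR_FIELD} must be a list of {VECTOR_DIM} floats")
    return problems


sample_row = {
    "film_id": 2,
    "film_title": "The Shawshank Redemption",
    "film_director": "Frank Darabont",
    "film_release_year": 1994,
    "film_genre": "Drama",
    "film_intro": [0.2, 0.7],
}
print(validate_row(sample_row))
```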

Load the Collection

Before searching the data that is stored in the collection, we need to load the collection from the system disk to the system memory. You can check out our tutorial on Milvus Load Collection to learn more.

For now, we use the PyMilvus SDK to load the film collection to the system memory as follows:

from pymilvus import Collection

collection = Collection("film")

collection.load()

This should make the collection available for searching, as we demonstrate in the following steps.

Milvus Hybrid Search

Finally, we can perform a hybrid search. Let us start by defining exactly what a Milvus hybrid search is.

In Milvus, a hybrid search is a search technique that combines vector search with attribute filtering. This approach allows us to search based on both vector similarity and specific attribute conditions.

By applying Boolean expressions to filter the scalar fields or the primary key field, we can constrain the search to the entities that meet the specified conditions.

Hence, a hybrid search in Milvus offers both versatility and precision by combining vector similarity and attribute-based filtering in a single query.
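The Boolean expressions mentioned above are passed to Milvus as plain strings. As a rough illustration, the following sketch composes such a string from smaller comparisons. The compare() and all_of() helpers are hypothetical names made up for this example and are not part of PyMilvus. The quoting is deliberately naive and does not escape quotes inside string values.

```python
ALLOWED_OPERATORS = {"==", "!=", "<", "<=", ">", ">="}


def compare(field, op, value):
    """Hypothetical helper: render one comparison such as film_release_year <= 2000."""
    if op not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {op}")
    # String values are wrapped in double quotes; numbers are written as-is.
    literal = f'"{value}"' if isinstance(value, str) else str(value)
    return f"{field} {op} {literal}"


def all_of(*clauses):
    """Hypothetical helper: join clauses so that every one of them must hold."""
    return " and ".join(f"({clause})" for clause in clauses)


expr = all_of(
    compare("film_release_year", "<=", 2000),
    compare("film_genre", "==", "Drama"),
)
print(expr)
```

The resulting string can then be supplied as the “expr” value of a search.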

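NOTE: Milvus may require an index on the fields that we wish to search and, depending on the version, the vector field may need to be indexed before collection.load() succeeds. If loading fails with an index-related error, create the index first and then load the collection again. In our example, we can create an index on the “film_intro” field as follows: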

from pymilvus import Collection

collection = Collection('film')

index_params = {
  "index_type": "IVF_FLAT",
  "metric_type": "L2",
  "params": {
    "nlist": 2
  }
}

collection.create_index(
  field_name="film_intro",
  index_params=index_params,
  index_name="intro_index"
)

To also create an index on the “film_release_year” field, we can run the code as follows:

from pymilvus import Collection

collection = Collection('film')

collection.create_index(
  field_name="film_release_year",
  index_name="year_index"
)

Milvus Hybrid Search Example:

Let us demonstrate how we can perform a vector hybrid search by specifying a Boolean expression to filter the scalar field of the entities.

Consider the following example:

search_param = {
  "data": [[0.1, 0.2]],
  "anns_field": "film_intro",
  "param": {"metric_type": "L2", "params": {"nprobe": 10}, "offset": 0},
  "limit": 10,
  "expr": "film_release_year <= 2000",
}
res = collection.search(**search_param)
print(res)

In this case, we perform a hybrid search by specifying a Boolean expression that matches only the entities whose “film_release_year” is less than or equal to 2000.

Once we run the previous code, we should get the output as follows:

["['id: 2, distance: 0.25999999046325684, entity: {}']"]

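In this case, the only match is the entity with an ID of 2, the second entity that we inserted, which was released in 1994. The first entity is excluded because its release year of 2010 fails the filter. The distance is measured between the query vector and the “film_intro” vector of the matching entity, and the entity part is empty because the search does not request any output fields.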
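We can reproduce the reported distance by hand. The following sketch computes the squared Euclidean distance between the query vector and the vector of the second film. That the reported value is the squared distance, without a square root, is inferred here from the numbers matching. The rounding step assumes that the value is held as a 32-bit float, which would explain why the output is slightly below 0.26.

```python
import struct

query = [0.1, 0.2]
film_2_intro = [0.2, 0.7]

# Squared Euclidean distance: (0.1 - 0.2)^2 + (0.2 - 0.7)^2 = 0.01 + 0.25
squared = sum((q - v) ** 2 for q, v in zip(query, film_2_intro))
print(round(squared, 6))

# Assumption: the server reports a 32-bit float, so round-trip through one.
as_float32 = struct.unpack("f", struct.pack("f", squared))[0]
print(as_float32)
```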

Full Source Code:

The following shows the full source code that is used in this post including the schema creation, data insertion, and the hybrid search.

from pymilvus import CollectionSchema, FieldSchema, DataType, Collection, connections

# Connect to the Milvus server; adjust the host and port for your setup.
connections.connect(
  alias="default",
  host='localhost',
  port='19530'
)

# Define the scalar fields and the vector field.
film_id = FieldSchema(
  name="film_id",
  dtype=DataType.INT64,
  is_primary=True,
)

film_title = FieldSchema(
  name="film_title",
  dtype=DataType.VARCHAR,
  max_length=200,
)

film_director = FieldSchema(
  name="film_director",
  dtype=DataType.VARCHAR,
  max_length=100,
)

film_release_year = FieldSchema(
  name="film_release_year",
  dtype=DataType.INT64,
)

film_genre = FieldSchema(
  name="film_genre",
  dtype=DataType.VARCHAR,
  max_length=50,
)

film_intro = FieldSchema(
  name="film_intro",
  dtype=DataType.FLOAT_VECTOR,
  dim=2
)

schema = CollectionSchema(
  fields=[film_id, film_title, film_director, film_release_year, film_genre, film_intro],
  description="Film search",
  enable_dynamic_field=True
)

collection_name = "film"

# Create the collection from the schema.
collection = Collection(
  name=collection_name,
  schema=schema,
  using='default',
  shards_num=2
)

film_data = [
  {
    "film_id": 1,
    "film_title": "Inception",
    "film_director": "Christopher Nolan",
    "film_release_year": 2010,
    "film_genre": "Science Fiction",
    "film_intro": [0.1, 0.5],  # Sample float vector
  },
  {
    "film_id": 2,
    "film_title": "The Shawshank Redemption",
    "film_director": "Frank Darabont",
    "film_release_year": 1994,
    "film_genre": "Drama",
    "film_intro": [0.2, 0.7],  # Sample float vector
  },
]

# Insert the sample rows.
mr = collection.insert(film_data)

# Index the vector field (see the NOTE above) before loading the collection.
index_params = {
  "index_type": "IVF_FLAT",
  "metric_type": "L2",
  "params": {"nlist": 2}
}
collection.create_index(
  field_name="film_intro",
  index_params=index_params,
  index_name="intro_index"
)

# Load the collection into memory so that it can be searched.
collection.load()

# Hybrid search: vector similarity plus a Boolean filter on the release year.
search_param = {
  "data": [[0.1, 0.2]],
  "anns_field": "film_intro",
  "param": {"metric_type": "L2", "params": {"nprobe": 10}, "offset": 0},
  "limit": 10,
  "expr": "film_release_year <= 2000",
}

res = collection.search(**search_param)

print(res)

Conclusion

We explored how hybrid search works in Milvus. We also demonstrated, using practical examples, how to create a schema, insert data, and specify Boolean expressions on a given entity field to perform a hybrid search.

About the author

John Otieno
