# Vector Search

Vector search is a search method that represents data as vectors. It is commonly used in applications such as image search, video search, and text search, as well as in machine learning applications such as image classification and clustering.

This section introduces how to create and manipulate vector indexes to accelerate vector search, as well as how to configure different types of vector indexes.

# Command Reference

# Creating a Table with Vectors

Currently, we support two types of vectors: array of float32 and binary string. Correspondingly, embedding vectors will be represented as Array(Float32) or FixedString in MyScale.

WARNING

The binary string type of vectors is only applicable to the DB version 1.3.1 or higher.

TIP

All vectors of an embedding vector column must have same dimension.

  • For array of float32 vectors, use CONSTRAINT to avoid errors. For example, CONSTRAINT constraint_name_1 CHECK length(data) = 128.
  • For binary string vectors, represent them as a fixed-length FixedString(N) type. Consequently, it is unnecessary to specify the dimensional constraints for these vectors, since each vector's dimension must be consistent and equal to 8 * N.

The following are examples of creating tables for different types of vector data.

Creating a table with float vectors

-- Create table with a column of 128D float32 vectors and length constraint
CREATE TABLE test_float_vector
(
    id    UInt32,
    data  Array(Float32),
    CONSTRAINT check_length CHECK length(data) = 128,
    date  Date,
    label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
) ENGINE = MergeTree ORDER BY id

Creating a table with binary vectors

-- Create table with a column of 128D binary string vectors (packed into 16 bytes)
CREATE TABLE test_binary_vector
(
    id    UInt32,
    binary_data  FixedString(16),
    date  Date,
    label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
) ENGINE = MergeTree ORDER BY id

NOTE

Data type names are case-sensitive and an error will be returned if the case is incorrect.

# Creating a Vector Index

A vector index must be created before running a vector search. The syntax for creating a vector index is as follows:

ALTER TABLE [db.]table_name
ADD VECTOR INDEX index_name column_name
[TYPE index_type('param1 = value1', 'param2 = value2', ...)]
  • index_name: the name of the vector index.
  • column_name: the name of the column on which the vector index will be created. This column must be either of type Array(Float32) or FixedString. When it is of the type Array(Float32), it must have a constraint specifying the length of the array.
  • index_type: the specific type of vector index.

TIP

For array of float32 vectors, we highly recommend using the MSTG algorithm for optimal results. However, the following other index types are also available.

  • FLAT: the brute force algorithm for comparison
  • ScaNN: the algorithm developed by Google Research for Scalable Nearest Neighbors
  • IVF family: including IVFFLAT, IVFSQ, and IVFPQ
  • HNSW family: including HNSWFLAT, HNSWSQ, and HNSWPQ

For binary string vectors, the index types available are BinaryMSTG, BinaryFLAT, BinaryIVF, and BinaryHNSW. Among these, we also highly recommend using the BinaryMSTG algorithm for optimal results.

NOTE

The vector index types of MSTG and BinaryMSTG are exclusively offered in the MyScale SaaS/BYOC, while all other types are available in all versions, including open-source, SaaS, and BYOC.

There are also two ways to create a default vector index without specifying a index_type:

  1. Create a vector index with index_type equal to DEFAULT.
  2. No need to specify the TYPE index_type(...) field when creating a vector index.

TIP

In the open-source MyscaleDB, we create ScaNN and BinaryIVF vector indexes by default. While in MyScale SaaS/BYOC, we create MSTG and BinaryMSTG vector indexes by default.

For example, to create a vector index named idx of type MSTG for the vector column in the test_float_vector table on SaaS, use one of the following commands:

-- Specify the vector index type.
ALTER TABLE test_float_vector ADD VECTOR INDEX idx vector TYPE MSTG
-- Create a vector index with index_type equal to DEFAULT.
ALTER TABLE test_float_vector ADD VECTOR INDEX idx vector TYPE DEFAULT
-- The vector index type is not specified.
ALTER TABLE test_float_vector ADD VECTOR INDEX idx vector

TIP

For detailed information on parameters for specific types, see the section Explanation of Vector Index Configuration Options. For suggestions on performance tuning, see the section Advice on Performance Tuning.

From DB version v1.4.0 or higher, MyScale allows to create multiple vector indexes on a table. However only one vector index is allowed on a vector column. For example, a table can be created with multiple vector columns:

-- Create table with two float32 vectors columns with different dimensions
CREATE TABLE test_multiple_vectors
(
    id    UInt32,
    vec1  Array(Float32),
    vec2  Array(Float32),
    CONSTRAINT check_length CHECK length(vec1) = 128,
    CONSTRAINT check_length CHECK length(vec2) = 64,
    date  Date,
    label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
) ENGINE = MergeTree ORDER BY id

Then a vector index can be created on each vector column:

-- add MSTG index type on vec1 column
ALTER TABLE test_multiple_vectors ADD VECTOR INDEX idx1 vec1 TYPE MSTG
-- add another MSTG vector index on vec2 column
ALTER TABLE test_multiple_vectors ADD VECTOR INDEX idx2 vec2 TYPE MSTG

# Deleting a Vector Index

The following syntax can be used to delete a vector index. It will remove the index and free up related memory and disk resources. If the index is currently being built, the building process will also be stopped immediately.

ALTER TABLE [db.]table_name DROP VECTOR INDEX index_name

# Checking the Status of Vector Indexes

To see the current status of vector indexes, you can use the system.vector_indices system table. The following syntax allows you to view all existing vector indexes:

SELECT * FROM system.vector_indices

You can use a WHERE clause to filter the results by table name or other criteria. For example, to view the vector indexes for a specific table, such as test_float_vector, you can use the following command:

SELECT table, name, status FROM system.vector_indices
WHERE table = 'test_float_vector'

This will output information on the vector index, including its current status, which can be one of the following:

  1. Built: This status indicates that the index has been successfully constructed and is ready for use.
  2. InProgress: This status means that the index is currently being built or updated. During this time, the index may be incomplete, and vector search on data that have not been indexed will fall back to the brute force algorithm, which is much slower.
  3. Error: If the index encounters an error during construction or use, it will enter the Error status. This could be due to a variety of reasons, such as invalid input data or system failures. When the index is in this status, it is typically unavailable for use until the error is resolved.

For vector indexes in the Error status, you can view the failure reasons with the following command:

SELECT table, name, latest_failed_part, latest_fail_reason
FROM system.vector_indices WHERE status = 'Error'

The distance() function is utilized to perform vector searches in MyScale. It calculates the distance between a specified vector and all vector data in a specified column, and returns the top candidates. The basic syntax for the distance() function is as follows:

distance('param1 = value1', 'param2 = value2')(column_name, query_vector)
  • params represents search-specific parameters. params can also include index-specific parameters, such as nprobe = 1 for IVFFLAT vector index, to define the scope of the search.
  • column_name refers to the column containing the vector data to be searched.
  • query_vector is the vector that will be searched.

    TIP

    • For array of float32 vector, it is important to include a decimal point in the query vector to prevent it from being recognized as an Array(UInt64) type, which would result in an error when executing the query. For example, a 128D array of float32 vector in the format [3.0, 9, ..., 4].
    • For binary string vector, it may not be as easy to display and input as other types, but we can manipulate it with the help of specific functions.
      • Use the char() function to produce a binary vector. This function returns the binary string with the length as the number of passed arguments and each byte has the value of corresponding argument. For example, a 128D binary string vector in the format char(128, 254, ... 127, 100).
      • Use the unbin() function to produce a binary vector. This function interprets each pair of binary digits (in the argument) as a number and converts it to the byte represented by the number. For example, a 128D binary string vector in the format unbin('01010...01010').
      • Use the unhex() function to produce a binary vector. This function interprets each pair of hexadecimal digits (in the argument) as a number and converts it to the byte represented by the number. For example, a 128D binary string vector in the format unhex('FFEEDDCC...112233').
  • The distance() function should be used with order by and limit clause to get the top candidates.
  • The sorting directions of the distance function column in order by clause need to correspond to the metric types (Cosine, IP, etc.) of vector index, otherwise an error will be reported. When metric type is IP, the sorting direction must be DESC.

A typical array of float32 vector search query would look like this:

SELECT id, date, label,
  distance(data, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist
FROM test_float_vector
ORDER BY dist LIMIT 10

This query will return the id, date, label, and the distance between the vector column and the query vector [3.0, 9, ..., 4] from the test_float_vector table. The ORDER BY dist LIMIT 10 clause specifies that the top 10 closest results should be returned.

Output:

id date label dist
3 "2024-08-11" "animal" 0
790110 "2001-10-14" "person" 102904
396372 "1987-12-15" "animal" 108579
401952 "1975-08-24" "animal" 117388
603558 "1999-09-26" "animal" 118487
25589 "1978-08-29" "animal" 119259
12632 "2019-02-25" "animal" 119662
800289 "2000-07-09" "building" 119673
16298 "1997-03-11" "animal" 120011
395903 "2020-08-19" "animal" 121352

The query result will be a table with three columns: id, date, label, and dist, showing the id of the vector, the date, the label, and the distance between the top vector results and the query vector respectively.

Accordingly, a typical binary string vector search query would look like this:

SELECT id, date, label,
  distance(binary_data, unhex('ABCDEF0123456789ABCDEF0123456789')) AS dist
FROM test_binary_vector
ORDER BY dist LIMIT 10

# Vector Search with Filters

Vector search with filters allows you to narrow down the results based on values from other columns or distance values. For instance, the following query will return the id, vector, and the distance between the vector column and the query vector [3.0, 9, ..., 4] from the test_float_vector table, but only for rows where the id column is greater than 100000:

SELECT id, date, label,
  distance(data, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist
FROM test_float_vector
WHERE id > 100000
ORDER BY dist LIMIT 10

Output:

id date label dist
790110 "2001-10-14" "person" 102904
396372 "1987-12-15" "animal" 108579
401952 "1975-08-24" "animal" 117388
603558 "1999-09-26" "animal" 118487
800289 "2000-07-09" "building" 119673
395903 "2020-08-19" "animal" 121352
600737 "1972-08-25" "animal" 125027
790101 "1990-02-22" "person" 129224
790265 "2019-05-26" "building" 133267
198290 "1974-04-22" "building" 134178

To filter by distance values, use the WHERE clause as follows:

SELECT id, date, label,
  distance(data, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist
FROM test_float_vector
WHERE dist < 110000
ORDER BY dist LIMIT 10

This query will return the id, date, label, and the distance between the vector column and the query vector [3.0, 9, ..., 4] from the test_float_vector table, but only for rows where the distance is less than 110000.

Output:

id date label dist
3 "2024-08-11" "animal" 0
790110 "2001-10-14" "person" 102904
396372 "1987-12-15" "animal" 108579

# Multiple Vector Searches

Note

The feature is available only in DB version v1.8 or higher.

Mutlitple Vector Searches allows you to use multiple distance() functions in a vector search query. The top candidates for multiple distance() functions are combined based on part and row id. The result type of distance() function is Float32. Hence, for a row with two vectors, vec1 and vec2, only one vector appears in top candidates, the distance() value of the other vector would be NaN. A typical multiple vector searches query would look like this:

SELECT id, date, label, distance(vec1, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist1,
distance(vec2, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3]) AS dist2
FROM test_multiple_vectors
ORDER BY dist1 + dist2
LIMIT 10;
  • The vector search select query should be used with order by and limit clause to get the top candidates.
  • For cases where distance() values include NaN, you can use expression like ifNotFinite(dist1, <default_value>) to provide a proper default value for NaN value.

# Explanation of Vector Index Configuration Options

Vector indexes take two kinds of parameters: index creation parameters and search parameters.

Index creation parameters are specified during index creation, for example

ALTER TABLE [db.]table_name
ADD VECTOR INDEX index_name column_name
[TYPE index_type
(
  'creation_param1 = value1',
  'creation_param2 = value2',
  ...
)]

Search parameters are specified during search, for example

SELECT
  id,
  distance('search_param1 = value1', 'search_param2 = value2')(column_name, query_vector) as dist
FROM [db.]table_name
ORDER BY dist LIMIT 10

# General Index Creation Parameters

The following parameters can be employed in the creation of any type of vector index:

  • metric_type: This parameter determines the distance metric used in vector search.
    • For array of float32 vectors, three options are available, with L2 as the default value.
      • L2: The L2 metric, also known as Euclidean distance.
      • Cosine: The cosine distance, which is based on cosine similarity. The cosine distance formula is calculated as:

        The cosine similarity can be obtained by subtracting the distance value from 1.
      • IP: The inner product (IP) metric. Please note that when using the IP metric, it's necessary to use ORDER BY ... DESC, as higher IP values indicate greater similarity.
    • For binary string vectors, one option is available, with Hamming as the default value.
      • Hamming: The Hamming distance between two equal-length vectors is the count of differing symbols at corresponding positions.
      • Jaccard: The Jaccard distance measures dissimilarity. The Jaccard distance formula is calculated as:

        The Jaccard coefficient can be obtained by subtracting the distance value from 1.

# MSTG / BinaryMSTG Parameters

The Multi-Scale Tree Graph (MSTG) algorithm, developed by MyScale, is a proprietary solution designed to deliver high data density and high performance for both standard and filtered vector search operations.

Index Creation Parameter:

MSTG, and BinaryMSTG don't take any parameters other than the metric_type described above during index creation.

Search Parameter:

  • alpha = float: This parameter controls the accuracy of the search operation. The higher the value, the more accurate the search, but the lower the QPS. The default value is 3, and the valid range is between 1 and 4.

# ScaNN Parameters

ScaNN is an algorithm developed by Google for fast and scalable nearest neighbor search in high-dimensional vector spaces.

Index Creation Parameter:

ScaNN doesn't take any parameters other than the metric_type described above during index creation.

Search Parameter:

  • alpha = float: This parameter controls the accuracy of the search operation. The higher the value, the more accurate the search, but the lower the QPS. The default value is 3, and the valid range is between 1 and 4.

# FLAT / BinaryFLAT Parameters

FLAT and BinaryFLAT are the simplest form of vector indexing, directly computing distances based on raw data without any additional optimization parameters. It is useful for prototyping and ensuring the accuracy of search results, but it is not recommended for production use due to its relatively slow performance.

# IVFFLAT Parameters

IVFFLAT is a hierarchical index that employs clustering to divide vectors into smaller clusters for more efficient searches.

Index Creation Parameter:

  • ncentroids = int: Determines the number of clusters into which all vector data will be divided. Larger values of ncentroids result in slower table building times. Default value is 1024.

Search Parameter:

  • nprobe = int: Specifies the number of clusters to search during a search operation. Larger values result in slower searches but greater accuracy. Default value is 1.

Suggested Parameter Values:

It is recommended to choose a value between 1000-10000 for ncentroids, with a preference for values close to the square root of the amount of data. If ncentroids is too large, performance may be impacted. A value between 0.1-10% of ncentroids is suggested for nprobe.

# IVFPQ Parameters

Index Creation Parameter:

  • ncentroids = int: Refer to IVFFlat.

  • M = int: Reduces the original vector dimension to M. M must be divisible by the original vector dimension. Default value is 16.

  • bit_size = int: Refers to the size of the Product Quantization (PQ) encoding table used to replace the original vector. Valid values are multiples of 4, with a default value of 8.

Search Parameter:

Suggested Parameter Values:

The recommended values for ncentroids and nprobe are similar to those for IVFFlat. It is important to note that the compression ratio of PQ is calculated as (bit_size * M) / (original_dimension * original_element_size). For a 128-dimensional float32 vector, when M = 16 and bit_size = 8, the corresponding compression ratio is (16*8)/(128*32) = 1/32. Excessively high compression ratios can greatly impact the accuracy of search results, so it is recommended to keep bit_size at 8 and M within 1/4 of the original dimension to avoid this issue.

# IVFSQ Parameters

Index Creation Parameter:

  • ncentroids = int: Refer to IVFFlat.

  • bit_size = string: Acceptable values are 8bit, 6bit, 4bit, 8bit_uniform, 8bit_direct, 4bit_uniform, and QT_fp16, with a default value of 8bit.

The Scalar Quantization (SQ) algorithm is used to compress each vector dimension while retaining the number of dimensions. When bit_size is set to 8, the compression rate is approximately 25%. The precision of the algorithm increases in the order 8bit_direct, 8bit_uniform and 8bit, but index construction speed is inversely proportional to precision. 8bit_direct converts float to uint_8 using static_cast, 8bit_uniform evenly divides all float values into 256 bins, and 8bit evenly divides each dimension into 256 bins. 4bit_uniform divides and quantizes data into 16 bins. QT_fp16 is a variation of the SQ algorithm that uses half-precision floating-point numbers, and the details can be found at the link https://gist.github.com/rygorous/2156668 (opens new window).

Search Parameter:

# HNSWFLAT Parameters

The Hierarchical Navigable Small World (HNSW) algorithm is a type of approximate nearest neighbor search algorithm that is designed to quickly find the nearest neighbors of a given query point in high-dimensional space. It does this by organizeing the data points into a multi-level graph data structure. The HNSW algorithm uses the principle of "small world" networks, in which most nodes are only a small number of steps away from each other, to navigate the graph and efficiently find the closest data points to the query point. It is known for its high performance and scalability in large datasets and high dimensional space.

Index Creation Parameter:

  • m = int: This parameter determines the number of neighbors of each data point in the HNSW diagram and affects the quality of the index. A larger m will result in higher accuracy of the search results but will also increase the time it takes to build the index. Default value is 16.
  • ef_c = int: This parameter determines the size of the priority queue used by HNSW when creating the index, and it affects the quality of the index. A larger ef_c will result in a higher precision of the search results but will also increase the time it takes to build the index. Default value is 100.

Search Parameter:

  • ef_s = int: This parameter determines the size of the priority queue used by HNSW during the search operation. A larger ef_s will result in a higher precision of the search results, but will also increase the search time. Default value is 50.

Suggested Parameter Values:

It is generally recommended to set m between 8-128 and ef_c between 50-400. Doubling ef_c will approximately double the index building time. The value of ef_s should be adjusted according to the search requirements, and it is recommended to use the same value range as ef_c. A lower value can be chosen if a low latency requirement is needed.

# HNSWSQ Parameters

Index Creation Parameter:

Search Parameter:

Suggested Parameter Values:

Refer to IVFSQ for bit_size selection, and refer to HNSWFLAT for the rest.

# Advice on Performance Tuning

In general, we recommend using the MSTG/BinaryMSTG index and the default values for index creation to optimize performance. When searching, you can adjust the alpha value based on desired throughput and precision.

In MyScale, data is automatically split into multiple parts, and one vector index is created for each part internally. For optimal performance, we suggest optimizing the table into one data part before creating the vector index. You can use the following command to optimize a table:

OPTIMIZE TABLE test_vector FINAL

Therefore, the recommended order of operations is:

  1. Create table using CREATE TABLE ...;
  2. Insert data using INSERT INTO ...;
  3. Optimize the table;
  4. Create the vector index using ALTER TABLE ... ADD VECTOR INDEX;
  5. Wait for the vector index to be built;
  6. Start searching.

It is essential to note that optimizing a table after index creation can consume a lot of memory. Therefore, please do not optimize a table after the index creation.

Last Updated: Tue Oct 08 2024 07:44:34 GMT+0000