diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 4ca98465..71ea0f53 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -50,6 +50,7 @@ markdown_extensions: - pymdownx.superfences - pymdownx.tabbed: alternate_style: true +- md_in_html nav: - Home: index.md diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md index 497d37ce..1a2b4cbd 100644 --- a/docs/src/ann_indexes.md +++ b/docs/src/ann_indexes.md @@ -1,7 +1,7 @@ # ANN (Approximate Nearest Neighbor) Indexes You can create an index over your vector data to make search faster. -Vector indexes are faster but less accurate than exhaustive search. +Vector indexes are faster but less accurate than exhaustive search (KNN or Flat Search). LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results. Currently, LanceDB does *not* automatically create the ANN index. @@ -10,7 +10,18 @@ If you can live with <100ms latency, skipping index creation is a simpler workfl In the future we will look to automatically create and configure the ANN index. -## Creating an ANN Index +## Types of Index + +Lance can support multiple index types, the most widely used one is `IVF_PQ`. + +* `IVF_PQ`: use **Inverted File Index (IVF)** to first divide the dataset into `N` partitions, + and then use **Product Quantization** to compress vectors in each partition. +* `DISKANN` (**Experimental**): organize the vector as a on-disk graph, where the vertices approximately + represent the nearest neighbors of each vector. + +## Creating an IVF_PQ Index + +Lance supports `IVF_PQ` index type by default. === "Python" Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method. @@ -45,15 +56,18 @@ In the future we will look to automatically create and configure the ANN index. await table.createIndex({ type: 'ivf_pq', column: 'vector', num_partitions: 256, num_sub_vectors: 96 }) ``` -Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index -creation by providing the following parameters: +- **metric** (default: "L2"): The distance metric to use. By default it uses euclidean distance "`L2`". +We also support "cosine" and "dot" distance as well. +- **num_partitions** (default: 256): The number of partitions of the index. +- **num_sub_vectors** (default: 96): The number of sub-vectors (M) that will be created during Product Quantization (PQ). +For D dimensional vector, it will be divided into `M` of `D/M` sub-vectors, each of which is presented by +a single PQ code. + +
+ ![IVF PQ](./assets/ivf_pq.png) +
IVF_PQ index with num_partitions=2, num_sub_vectors=4
+
-- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support "cosine" distance. -- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table -with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional. -A higher number leads to faster queries, but it makes index generation slower. -- **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes -search more accurate, but also makes the index larger and slower to build. ## Querying an ANN Index @@ -138,3 +152,31 @@ You can select the columns returned by the query using a select clause. .select(["id"]) .execute() ``` + +## FAQ + +### When is it necessary to create an ANN vector index. + +`LanceDB` has manually tuned SIMD code for computing vector distances. +In our benchmarks, computing 100K pairs of 1K dimension vectors only take less than 20ms. +For small dataset (<100K rows) or the applications which can accept 100ms latency, vector indices are usually not necessary. + +For large-scale or higher dimension vectors, it is beneficial to create vector index. + +### How big is my index, and how many memory will it take. + +In LanceDB, all vector indices are disk-based, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code. + +For example, with a 1024-dimension dataset, if we choose `num_sub_vectors=64`, each sub-vector has `1024 / 64 = 16` float32 numbers. +Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times of space reduction. + +### How to choose `num_partitions` and `num_sub_vectors` for `IVF_PQ` index. + +`num_partitions` is used to decide how many partitions the first level `IVF` index uses. +Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train. +On `SIFT-1M` dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency / recall. + +`num_sub_vectors` decides how many Product Quantization code to generate on each vector. Because +Product Quantization is a lossy compression of the original vector, the more `num_sub_vectors` usually results to +less space distortion, and thus yield better accuracy. However, similarly, more `num_sub_vectors` causes heavier I/O and +more PQ computation, thus, higher latency. `dimension / num_sub_vectors` should be aligned with 8 for better SIMD efficiency. \ No newline at end of file diff --git a/docs/src/assets/ivf_pq.png b/docs/src/assets/ivf_pq.png new file mode 100644 index 00000000..72d7652b Binary files /dev/null and b/docs/src/assets/ivf_pq.png differ diff --git a/docs/src/search.md b/docs/src/search.md index 1a3c23cd..ab35fc0f 100644 --- a/docs/src/search.md +++ b/docs/src/search.md @@ -25,9 +25,9 @@ Currently, we support the following metrics: ### Flat Search +If LanceDB does not create a vector index, LanceDB would need to scan (`Flat Search`) the entire vector column +and compute the distance for each vector in order to find the closest matches. -If there is no [vector index is created](ann_indexes.md), LanceDB will just brute-force scan -the vector column and compute the distance.