[Doc] Describe the difference between ANN and KNN, and how to create indices. (#293)

2026-06-03 12:20:40 +00:00 · 2023-07-13 08:52:58 -07:00
parent ff1d206182
commit 08cc483ec9
4 changed files with 66 additions and 13 deletions
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -50,6 +50,7 @@ markdown_extensions:
 - pymdownx.superfences
 - pymdownx.tabbed:
    alternate_style: true
+- md_in_html

 nav:
 - Home: index.md
--- a/docs/src/ann_indexes.md
+++ b/docs/src/ann_indexes.md
@@ -1,7 +1,7 @@
 # ANN (Approximate Nearest Neighbor) Indexes

 You can create an index over your vector data to make search faster.
-Vector indexes are faster but less accurate than exhaustive search.
+Vector indexes are faster but less accurate than exhaustive search (KNN or Flat Search).
 LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.

 Currently, LanceDB does *not* automatically create the ANN index.
@@ -10,7 +10,18 @@ If you can live with <100ms latency, skipping index creation is a simpler workfl

 In the future we will look to automatically create and configure the ANN index.

-## Creating an ANN Index
+## Types of Index
+
+Lance can support multiple index types, the most widely used one is `IVF_PQ`.
+
+* `IVF_PQ`: use **Inverted File Index (IVF)** to first divide the dataset into `N` partitions,
+   and then use **Product Quantization** to compress vectors in each partition.
+* `DISKANN` (**Experimental**): organize the vector as a on-disk graph, where the vertices approximately
+   represent the nearest neighbors of each vector.
+
+## Creating an IVF_PQ Index
+
+Lance supports `IVF_PQ` index type by default.

 === "Python"
     Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
@@ -45,15 +56,18 @@ In the future we will look to automatically create and configure the ANN index.
     await table.createIndex({ type: 'ivf_pq', column: 'vector', num_partitions: 256, num_sub_vectors: 96 })
     ```

-Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
-creation by providing the following parameters:
+- **metric** (default: "L2"): The distance metric to use. By default it uses euclidean distance "`L2`".
+We also support "cosine" and "dot" distance as well.
+- **num_partitions** (default: 256): The number of partitions of the index.
+- **num_sub_vectors** (default: 96): The number of sub-vectors (M) that will be created during Product Quantization (PQ).
+For D dimensional vector, it will be divided into `M` of `D/M` sub-vectors, each of which is presented by
+a single PQ code.
+
+<figure markdown>
+  ![IVF PQ](./assets/ivf_pq.png)
+  <figcaption>IVF_PQ index with <code>num_partitions=2, num_sub_vectors=4</code></figcaption>
+</figure>

- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support "cosine" distance.
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
-with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
-A higher number leads to faster queries, but it makes index generation slower.
- **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
-search more accurate, but also makes the index larger and slower to build.

 ## Querying an ANN Index

@@ -138,3 +152,31 @@ You can select the columns returned by the query using a select clause.
         .select(["id"])
         .execute()
     ```
+
+## FAQ
+
+### When is it necessary to create an ANN vector index.
+
+`LanceDB` has manually tuned SIMD code for computing vector distances.
+In our benchmarks, computing 100K pairs of 1K dimension vectors only take less than 20ms.
+For small dataset (<100K rows) or the applications which can accept 100ms latency, vector indices are usually not necessary.
+
+For large-scale or higher dimension vectors, it is beneficial to create vector index.
+
+### How big is my index, and how many memory will it take.
+
+In LanceDB, all vector indices are disk-based, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code.
+
+For example, with a 1024-dimension dataset, if we choose `num_sub_vectors=64`, each sub-vector has `1024 / 64 = 16` float32 numbers.
+Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times of space reduction.
+
+### How to choose `num_partitions` and `num_sub_vectors` for `IVF_PQ` index.
+
+`num_partitions` is used to decide how many partitions the first level `IVF` index uses.
+Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train.
+On `SIFT-1M` dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency / recall.
+
+`num_sub_vectors` decides how many Product Quantization code to generate on each vector. Because
+Product Quantization is a lossy compression of the original vector, the more `num_sub_vectors` usually results to
+less space distortion, and thus yield better accuracy. However, similarly, more `num_sub_vectors` causes heavier I/O and
+more PQ computation, thus, higher latency. `dimension / num_sub_vectors` should be aligned with 8 for better SIMD efficiency.
--- a/docs/src/assets/ivf_pq.png
+++ b/docs/src/assets/ivf_pq.png
--- a/docs/src/search.md
+++ b/docs/src/search.md
@@ -25,9 +25,9 @@ Currently, we support the following metrics:

 ### Flat Search

+If LanceDB does not create a vector index, LanceDB would need to scan (`Flat Search`) the entire vector column
+and compute the distance for each vector in order to find the closest matches.

-If there is no [vector index is created](ann_indexes.md), LanceDB will just brute-force scan
-the vector column and compute the distance.

 <!-- Setup Code
 ```python
@@ -106,6 +106,16 @@ as well.
    ```


-### Search with Vector Index.
+### Approximate Nearest Neighbor (ANN) Search with Vector Index.
+
+To accelerate vector retrievals, it is common to build vector indices.
+A vector index is a data structure specifically designed to efficiently organize and
+search vector data based on their similarity or distance metrics.
+By constructing a vector index, you can reduce the search space and avoid the need
+for brute-force scanning of the entire vector column.
+
+However, fast vector search using indices often entails making a trade-off with accuracy to some extent.
+This is why it is often called **Approximate Nearest Neighbors (ANN)** search, while the Flat Search (KNN)
+always returns 100% recall.

 See [ANN Index](ann_indexes.md) for more details.