Add more detailed docs for the ANN index and search features

2026-01-03 18:32:55 +00:00 · 2023-04-24 19:19:55 -07:00
parent 177192f852
commit 4f2dae8a0d
3 changed files with 34 additions and 13 deletions
--- a/docs/src/ann_indexes.md
+++ b/docs/src/ann_indexes.md
@@ -1,12 +1,18 @@
 # ANN (Approximate Nearest Neighbor) Indexes

-You can create an index over your vector data to make search faster. Vector indexes are faster but less accurate than exhaustive search. LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
+You can create an index over your vector data to make search faster.
+Vector indexes are faster but less accurate than exhaustive search.
+LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.

-Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration.
+Currently, LanceDB does *not* automatically create the ANN index.
+LanceDB also has optimized code for exhaustive KNN search. For many use cases, datasets under 100K vectors won't require index creation at all.
+If you can live with <100ms latency, skipping index creation is a simpler workflow that also guarantees 100% recall.
+
+In the future we will look to automatically create and configure the ANN index.

 ## Creating an ANN Index

-Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function.
+Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.

 ```python
 import lancedb
@@ -28,11 +34,11 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96)
 Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
 creation by providing the following parameters:

-- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table 
-with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional. 
-A higher number leads to faster queries, but it makes index generation slower. 
+- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
+with ~1M vectors should use 256 partitions. You can specify an arbitrary number of partitions, but powers of 2 are most conventional.
+A higher number leads to faster queries, but it makes index generation slower.
 - **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
-search more accurate, but also makes the index larger and slower to build. 
+search more accurate, but also makes the index larger and slower to build.

 ## Querying an ANN Index

@@ -41,15 +47,20 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance
 There are a couple of parameters that can be used to fine-tune the search:

 - **limit** (default: 10): The amount of results that will be returned
-- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
-- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory. A higher number makes 
-search more accurate but also slower.
+- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.<br/>
+  Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.<br/>
+  e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
+  Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
+- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
+  A higher number makes search more accurate but also slower. If you find the recall is less than ideal, try refine_factor=10 to start.<br/>
+  e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
+  Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.

 ```python
 tbl.search(np.random.random((768))) \
    .limit(2) \
    .nprobes(20) \
-    .refine_factor(20) \
+    .refine_factor(10) \
    .to_df()

                                              vector       item       score
@@ -57,7 +68,9 @@ tbl.search(np.random.random((768))) \
 1  [0.48587373, 0.269207, 0.15095535, 0.65531915,...  item 3953  108.393867
 ```

-The search will return the data requested in addition to the score of each item. The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
+The search will return the data requested in addition to the score of each item.
+
+**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.

 ### Filtering (where clause)

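Reviewer sketch (not part of the commit): the rules of thumb added to `ann_indexes.md` above can be turned into simple arithmetic. The helper `suggest_index_params` is hypothetical and is not a LanceDB API. The 4000 vectors-per-partition default is an assumed midpoint of the 3-5K guidance, and the reading that `refine_factor` multiplies `limit` is inferred from the top-20 / `refine_factor=200` example. A literal 5-10% of 256 partitions is 13-25 probes, which only partly overlaps the ~20-40 quoted in the docs, so treat both as starting points.

```python
import math


def suggest_index_params(num_vectors: int, limit: int = 10,
                         vectors_per_partition: int = 4000) -> dict:
    """Hypothetical helper (not a LanceDB API): apply the documented rules of thumb."""
    # Aim for ~3-5K vectors per partition, then snap to a power of 2
    # by rounding in log2 space.
    raw = max(1.0, num_vectors / vectors_per_partition)
    num_partitions = 2 ** round(math.log2(raw))

    # Probe roughly 5-10% of the partitions, and always at least one.
    nprobes_low = max(1, math.ceil(num_partitions * 5 / 100))
    nprobes_high = max(nprobes_low, math.floor(num_partitions * 10 / 100))

    # Assumption: refine_factor * limit candidates are re-ranked, so this
    # value re-ranks about one partition's worth of vectors.
    refine_whole_partition = max(1, round(num_vectors / num_partitions / limit))

    return {
        "num_partitions": num_partitions,
        "nprobes_range": (nprobes_low, nprobes_high),
        "refine_factor_whole_partition": refine_whole_partition,
    }


params = suggest_index_params(1_000_000, limit=20)
print(params)
```

The resulting numbers could then be passed to the `create_index(num_partitions=...)`, `.nprobes(...)`, and `.refine_factor(...)` calls shown in the diff and adjusted after measuring recall and latency on real data.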
--- a/python/lancedb/query.py
+++ b/python/lancedb/query.py
@@ -108,7 +108,12 @@ class LanceQueryBuilder:
        return self

    def to_df(self) -> pd.DataFrame:
-        """Execute the query and return the results as a pandas DataFrame."""
+        """
+        Execute the query and return the results as a pandas DataFrame.
+        In addition to the selected columns, LanceDB also returns the vector
+        column and a "score" column, which is the distance between the query
+        vector and the returned vector.
+        """
        ds = self._table.to_lance()
        # TODO indexed search
        tbl = ds.to_table(
--- a/python/lancedb/table.py
+++ b/python/lancedb/table.py
@@ -166,6 +166,9 @@ class LanceTable:
        Returns
        -------
        A LanceQueryBuilder object representing the query.
+        Once executed, the query returns the selected columns, the vector,
+        and a "score" column, which is the distance between the query
+        vector and the returned vector.
        """
        if isinstance(query, list):
            query = np.array(query)
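
Reviewer sketch (not part of the commit): to make the "score" semantics in the docstrings above concrete, here is a minimal pure-Python exhaustive search. The function `exhaustive_search` and the sample rows are hypothetical, and Euclidean (L2) distance is an assumption; the diff does not state which metric LanceDB uses. It only illustrates that the score is a distance and that a lower value means a closer match.

```python
import heapq
import math


def exhaustive_search(rows, query, limit=10):
    """Brute-force KNN over (item, vector) pairs; illustration only."""
    # Score every row by its distance to the query (assumed metric: L2).
    scored = ((math.dist(vector, query), item, vector) for item, vector in rows)
    # Keep the `limit` smallest distances: lower score = more relevant.
    best = heapq.nsmallest(limit, scored, key=lambda r: r[0])
    return [
        {"item": item, "vector": vector, "score": score}
        for score, item, vector in best
    ]


rows = [
    ("item a", [0.0, 0.0]),
    ("item b", [3.0, 4.0]),
    ("item c", [1.0, 0.0]),
]
results = exhaustive_search(rows, query=[0.0, 0.0], limit=2)
for r in results:
    print(r["item"], r["score"])
```

Because every row is scored, this kind of search has full recall, which is the trade-off the docs describe for small tables without an ANN index.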