diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md index 96b93e2b..c66c39b9 100644 --- a/docs/src/ann_indexes.md +++ b/docs/src/ann_indexes.md @@ -1,12 +1,18 @@ # ANN (Approximate Nearest Neighbor) Indexes -You can create an index over your vector data to make search faster. Vector indexes are faster but less accurate than exhaustive search. LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results. +You can create an index over your vector data to make search faster. +Vector indexes are faster but less accurate than exhaustive search. +LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results. -Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration. +Currently, LanceDB does *not* automatically create the ANN index. +LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all. +If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall. + +In the future we will look to automatically create and configure the ANN index. ## Creating an ANN Index -Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function. +Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method. ```python import lancedb @@ -28,11 +34,11 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96) Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index creation by providing the following parameters: -- **num_partitions** (default: 256): The number of partitions of the index. 
The number of partitions should be configured so each partition has 3-5K vectors. For example, a table -with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional. -A higher number leads to faster queries, but it makes index generation slower. +- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table +with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional. +A higher number leads to faster queries, but it makes index generation slower. - **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes -search more accurate, but also makes the index larger and slower to build. +search more accurate, but also makes the index larger and slower to build. ## Querying an ANN Index @@ -41,15 +47,20 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance There are a couple of parameters that can be used to fine-tune the search: - **limit** (default: 10): The amount of results that will be returned -- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower. -- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory. A higher number makes -search more accurate but also slower. +- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
+ Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.
+ e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.
+ Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored. +- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.
+ A higher number makes search more accurate but also slower. If you find the recall is less than ideal, try refine_factor=10 to start.
+ e.g., for 1M vectors divided into 256 partitions (~3,900 vectors each), if you're looking for top 20, then refine_factor=200 reranks about 20 x 200 = 4,000 candidates, roughly a whole partition.
+ Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored. ```python tbl.search(np.random.random((768))) \ .limit(2) \ .nprobes(20) \ - .refine_factor(20) \ + .refine_factor(10) \ .to_df() vector item score @@ -57,7 +68,9 @@ tbl.search(np.random.random((768))) \ 1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867 ``` -The search will return the data requested in addition to the score of each item. The score is the distance between the query vector and the element. A lower number means that the result is more relevant. +The search will return the data requested in addition to the score of each item. + +**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant. ### Filtering (where clause) diff --git a/python/lancedb/query.py b/python/lancedb/query.py index 21333bec..fcaf8937 100644 --- a/python/lancedb/query.py +++ b/python/lancedb/query.py @@ -108,7 +108,12 @@ class LanceQueryBuilder: return self def to_df(self) -> pd.DataFrame: - """Execute the query and return the results as a pandas DataFrame.""" + """ + Execute the query and return the results as a pandas DataFrame. + In addition to the selected columns, LanceDB also returns a vector + and also the "score" column which is the distance between the query + vector and the returned vector. + """ ds = self._table.to_lance() # TODO indexed search tbl = ds.to_table( diff --git a/python/lancedb/table.py b/python/lancedb/table.py index f798fb37..a2eded9c 100644 --- a/python/lancedb/table.py +++ b/python/lancedb/table.py @@ -166,6 +166,9 @@ class LanceTable: Returns ------- A LanceQueryBuilder object representing the query. + Once executed, the query returns selected columns, the vector, + and also the "score" column which is the distance between the query + vector and the returned vector. """ if isinstance(query, list): query = np.array(query)
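
The tuning guidance added in the docs hunk above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not part of LanceDB: the helper names are hypothetical, and it assumes (as the refine_factor example in the docs implies) that refinement re-ranks roughly `limit * refine_factor` candidates.

```python
# Hypothetical helpers for sanity-checking ANN tuning values.
# None of these functions exist in LanceDB; they only do arithmetic.


def vectors_per_partition(num_vectors: int, num_partitions: int) -> float:
    """Average number of vectors stored in each index partition."""
    return num_vectors / num_partitions


def probe_coverage(num_partitions: int, nprobes: int) -> float:
    """Fraction of partitions scanned by a single query."""
    return nprobes / num_partitions


def refine_candidates(limit: int, refine_factor: int) -> int:
    """Assumed number of candidates re-ranked in memory: limit * refine_factor."""
    return limit * refine_factor


# Values taken from the docs example: 1M vectors, 256 partitions.
per_partition = vectors_per_partition(1_000_000, 256)  # 3906.25
coverage = probe_coverage(256, 20)                      # 0.078125
candidates = refine_candidates(20, 200)                 # 4000

print(f"vectors per partition: {per_partition:.0f}")
print(f"nprobes=20 scans {coverage:.1%} of partitions")
print(f"refine_factor=200 with limit=20 re-ranks {candidates} candidates")
```

With these numbers, each partition falls inside the recommended 3-5K range, and 4,000 re-ranked candidates slightly exceeds the size of one partition, which matches the "reranks the whole partition" remark.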