Add more detailed docs for the ANN index and search features

This commit is contained in:
Chang She
2023-04-24 19:19:55 -07:00
parent 177192f852
commit 4f2dae8a0d
3 changed files with 34 additions and 13 deletions

View File

@@ -1,12 +1,18 @@
# ANN (Approximate Nearest Neighbor) Indexes
You can create an index over your vector data to make search faster. Vector indexes are faster but less accurate than exhaustive search. LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
You can create an index over your vector data to make search faster.
Vector indexes are faster but less accurate than exhaustive search.
LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration.
Currently, LanceDB does *not* automatically create the ANN index.
LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
In the future we will look to automatically create and configure the ANN index.
## Creating an ANN Index
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function.
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
```python
import lancedb
@@ -28,11 +34,11 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96)
Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
creation by providing the following parameters:
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
A higher number leads to faster queries, but it makes index generation slower.
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
A higher number leads to faster queries, but it makes index generation slower.
- **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
search more accurate, but also makes the index larger and slower to build.
search more accurate, but also makes the index larger and slower to build.
## Querying an ANN Index
@@ -41,15 +47,20 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance
There are a couple of parameters that can be used to fine-tune the search:
- **limit** (default: 10): The amount of results that will be returned
- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory. A higher number makes
search more accurate but also slower.
- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.<br/>
Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.<br/>
e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
A higher number makes search more accurate but also slower. If you find the recall is less than idea, try refine_factor=10 to start.<br/>
e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
```python
tbl.search(np.random.random((768))) \
.limit(2) \
.nprobes(20) \
.refine_factor(20) \
.refine_factor(10) \
.to_df()
vector item score
@@ -57,7 +68,9 @@ tbl.search(np.random.random((768))) \
1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
```
The search will return the data requested in addition to the score of each item. The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
The search will return the data requested in addition to the score of each item.
**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
### Filtering (where clause)

View File

@@ -108,7 +108,12 @@ class LanceQueryBuilder:
return self
def to_df(self) -> pd.DataFrame:
"""Execute the query and return the results as a pandas DataFrame."""
"""
Execute the query and return the results as a pandas DataFrame.
In addition to the selected columns, LanceDB also returns a vector
and also the "score" column which is the distance between the query
vector and the returned vector.
"""
ds = self._table.to_lance()
# TODO indexed search
tbl = ds.to_table(

View File

@@ -166,6 +166,9 @@ class LanceTable:
Returns
-------
A LanceQueryBuilder object representing the query.
Once executed, the query returns selected columns, the vector,
and also the "score" column which is the distance between the query
vector and the returned vector.
"""
if isinstance(query, list):
query = np.array(query)