[DOCS][PYTHON] Update docs for clarity (#531)

I only modified those docs pages that are untouched in existing unmerged
PRs, so hopefully there are no merge conflicts!

1. The `tantivy-py` version specified in the docs doesn't work (pip
install fails), but with the latest version of pip and wheel installed
on my Mac, I was able to just `pip install tantivy` and FTS works great
for me. I updated the docs page to include this in
7ca4b757ce but can always modify to
another specific version in case this breaks any tests.
2. The `.add()` method for Python should take in a list of dicts as the
first option (to also align with the JS API), and additionally, users
can pass an existing pandas DataFrame to add to a table. Hope this makes
sense.
3. I've had multiple conversations with users who are unclear that the
terms "exhaustive", "flat" and "KNN" are all the same kind of search, so
I've updated the verbiage of this section to clarify this.
4. Fixed typos and improved clarity in the ANN indexes page.
This commit is contained in:
Prashanth Rao
2023-10-03 00:16:53 -04:00
committed by GitHub
parent d326146a40
commit 1d1f8964d2
4 changed files with 30 additions and 22 deletions

View File

@@ -6,7 +6,7 @@ LanceDB provides many parameters to fine-tune the index's size, the speed of que
Currently, LanceDB does *not* automatically create the ANN index.
LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
If you can live with < 100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
In the future we will look to automatically create and configure the ANN index.
@@ -154,28 +154,28 @@ You can select the columns returned by the query using a select clause.
## FAQ
### When is it necessary to create an ANN vector index.
### When is it necessary to create an ANN vector index?
`LanceDB` has manually tuned SIMD code for computing vector distances.
In our benchmarks, computing 100K pairs of 1K dimension vectors only take less than 20ms.
For small dataset (<100K rows) or the applications which can accept 100ms latency, vector indices are usually not necessary.
`LanceDB` has manually-tuned SIMD code for computing vector distances.
In our benchmarks, computing 100K pairs of 1K dimension vectors takes **less than 20ms**.
For small datasets (< 100K rows) or applications that can accept 100ms latency, vector indices are usually not necessary.
For large-scale or higher dimension vectors, it is beneficial to create vector index.
### How big is my index, and how many memory will it take.
### How big is my index, and how many memory will it take?
In LanceDB, all vector indices are disk-based, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code.
In LanceDB, all vector indices are **disk-based**, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code.
For example, with a 1024-dimension dataset, if we choose `num_sub_vectors=64`, each sub-vector has `1024 / 64 = 16` float32 numbers.
Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times of space reduction.
### How to choose `num_partitions` and `num_sub_vectors` for `IVF_PQ` index.
### How to choose `num_partitions` and `num_sub_vectors` for `IVF_PQ` index?
`num_partitions` is used to decide how many partitions the first level `IVF` index uses.
Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train.
On `SIFT-1M` dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency / recall.
`num_sub_vectors` decides how many Product Quantization code to generate on each vector. Because
Product Quantization is a lossy compression of the original vector, the more `num_sub_vectors` usually results to
less space distortion, and thus yield better accuracy. However, similarly, more `num_sub_vectors` causes heavier I/O and
more PQ computation, thus, higher latency. `dimension / num_sub_vectors` should be aligned with 8 for better SIMD efficiency.
`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate on each vector. Because
PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in
less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and
more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.

View File

@@ -123,9 +123,15 @@ After a table has been created, you can always add more data to it using
=== "Python"
```python
df = pd.DataFrame([{"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
{"vector": [9.5, 56.2], "item": "buzz", "price": 200.0}])
tbl.add(df)
# Option 1: Add a list of dicts to a table
data = [{"vector": [1.3, 1.4], "item": "fizz", "price": 100.0},
{"vector": [9.5, 56.2], "item": "buzz", "price": 200.0}]
tbl.add(data)
# Option 2: Add a pandas DataFrame to a table
df = pd.DataFrame(data)
tbl.add(data)
```
=== "Javascript"

View File

@@ -6,17 +6,19 @@ to make this available for JS as well.
## Installation
To use full text search, you must install optional dependency tantivy-py:
To use full text search, you must install the dependency `tantivy-py`:
# tantivy 0.19.2
pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985
# tantivy 0.20.1
```sh
pip install tantivy==0.20.1
```
## Quickstart
Assume:
1. `table` is a LanceDB Table
2. `text` is the name of the Table column that we want to index
2. `text` is the name of the `Table` column that we want to index
For example,

View File

@@ -25,8 +25,8 @@ Currently, we support the following metrics:
### Flat Search
If LanceDB does not create a vector index, LanceDB would need to scan (`Flat Search`) the entire vector column
and compute the distance for each vector in order to find the closest matches.
If you do not create a vector index, LanceDB would need to exhaustively scan the entire vector column (via `Flat Search`)
and compute the distance for *every* vector in order to find the closest matches. This is effectively a KNN search.
<!-- Setup Code
@@ -110,7 +110,7 @@ as well.
To accelerate vector retrievals, it is common to build vector indices.
A vector index is a data structure specifically designed to efficiently organize and
search vector data based on their similarity or distance metrics.
search vector data based on their similarity via the chosen distance metric.
By constructing a vector index, you can reduce the search space and avoid the need
for brute-force scanning of the entire vector column.