diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md index 58c336e6..bf0eb547 100644 --- a/docs/src/ann_indexes.md +++ b/docs/src/ann_indexes.md @@ -6,7 +6,7 @@ LanceDB provides many parameters to fine-tune the index's size, the speed of queries Currently, LanceDB does *not* automatically create the ANN index. LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all. -If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall. +If you can live with < 100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall. In the future we will look to automatically create and configure the ANN index. @@ -154,28 +154,28 @@ You can select the columns returned by the query using a select clause. ## FAQ -### When is it necessary to create an ANN vector index. +### When is it necessary to create an ANN vector index? -`LanceDB` has manually tuned SIMD code for computing vector distances. -In our benchmarks, computing 100K pairs of 1K dimension vectors only take less than 20ms. -For small dataset (<100K rows) or the applications which can accept 100ms latency, vector indices are usually not necessary. +`LanceDB` has manually-tuned SIMD code for computing vector distances. +In our benchmarks, computing 100K pairs of 1K-dimension vectors takes **less than 20ms**. +For small datasets (< 100K rows) or applications that can accept 100ms latency, vector indices are usually not necessary. For large-scale or higher dimension vectors, it is beneficial to create vector index. -### How big is my index, and how many memory will it take. +### How big is my index, and how much memory will it take? -In LanceDB, all vector indices are disk-based, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. 
Additionally, each sub-vector is usually encoded into 1 byte PQ code. +In LanceDB, all vector indices are **disk-based**, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into a 1-byte PQ code. For example, with a 1024-dimension dataset, if we choose `num_sub_vectors=64`, each sub-vector has `1024 / 64 = 16` float32 numbers. Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times of space reduction. -### How to choose `num_partitions` and `num_sub_vectors` for `IVF_PQ` index. +### How to choose `num_partitions` and `num_sub_vectors` for the `IVF_PQ` index? `num_partitions` is used to decide how many partitions the first level `IVF` index uses. Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train. On `SIFT-1M` dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency / recall. -`num_sub_vectors` decides how many Product Quantization code to generate on each vector. Because -Product Quantization is a lossy compression of the original vector, the more `num_sub_vectors` usually results to -less space distortion, and thus yield better accuracy. However, similarly, more `num_sub_vectors` causes heavier I/O and -more PQ computation, thus, higher latency. +`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate for each vector. Because +PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in +less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and +more PQ computation, and thus higher latency. 
`dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency. \ No newline at end of file diff --git a/docs/src/basic.md b/docs/src/basic.md index 53c235ae..8b24f809 100644 --- a/docs/src/basic.md +++ b/docs/src/basic.md @@ -123,9 +123,15 @@ After a table has been created, you can always add more data to it using === "Python" ```python - df = pd.DataFrame([{"vector": [1.3, 1.4], "item": "fizz", "price": 100.0}, - {"vector": [9.5, 56.2], "item": "buzz", "price": 200.0}]) - tbl.add(df) + + # Option 1: Add a list of dicts to a table + data = [{"vector": [1.3, 1.4], "item": "fizz", "price": 100.0}, + {"vector": [9.5, 56.2], "item": "buzz", "price": 200.0}] + tbl.add(data) + + # Option 2: Add a pandas DataFrame to a table + df = pd.DataFrame(data) + tbl.add(df) ``` === "Javascript" diff --git a/docs/src/fts.md b/docs/src/fts.md index 58dea5d7..dafcb055 100644 --- a/docs/src/fts.md +++ b/docs/src/fts.md @@ -6,17 +6,19 @@ to make this available for JS as well. ## Installation -To use full text search, you must install optional dependency tantivy-py: +To use full text search, you must install the dependency `tantivy-py`: -# tantivy 0.19.2 -pip install tantivy@git+https://github.com/quickwit-oss/tantivy-py#164adc87e1a033117001cf70e38c82a53014d985 +```sh +# tantivy 0.20.1 +pip install tantivy==0.20.1 +``` ## Quickstart Assume: 1. `table` is a LanceDB Table -2. `text` is the name of the Table column that we want to index +2. `text` is the name of the `Table` column that we want to index For example, diff --git a/docs/src/search.md b/docs/src/search.md index ab35fc0f..8c5aa96c 100644 --- a/docs/src/search.md +++ b/docs/src/search.md @@ -25,8 +25,8 @@ Currently, we support the following metrics: ### Flat Search -If LanceDB does not create a vector index, LanceDB would need to scan (`Flat Search`) the entire vector column -and compute the distance for each vector in order to find the closest matches. 
+If you do not create a vector index, LanceDB would need to exhaustively scan the entire vector column (via `Flat Search`) +and compute the distance for *every* vector in order to find the closest matches. This is effectively a KNN search.
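The exhaustive scan described in the flat-search wording above amounts to brute-force KNN. A minimal, LanceDB-independent sketch in plain Python (the `flat_search` helper and the sample vectors are purely illustrative, not part of the LanceDB API, which runs this scan in optimized SIMD code):

```python
import math

def flat_search(vectors, query, k=2):
    """Brute-force (flat) KNN: compute the L2 distance to every stored vector."""
    scored = [
        (math.dist(query, v), i)  # distance from the query to each vector
        for i, v in enumerate(vectors)
    ]
    scored.sort()  # closest first
    return [i for _, i in scored[:k]]  # indices of the k nearest vectors

vectors = [[1.3, 1.4], [9.5, 56.2], [0.0, 0.1]]
print(flat_search(vectors, query=[1.0, 1.0], k=2))  # → [0, 2]
```

Because every vector is scored, recall is 100% by construction; an ANN index trades some of that recall for speed by scoring only a subset of candidates.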