diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md index 5460e012..3f776f49 100644 --- a/docs/src/ann_indexes.md +++ b/docs/src/ann_indexes.md @@ -7,20 +7,11 @@ for brute-force scanning of the entire vector space. A vector index is faster but less accurate than exhaustive search (kNN or flat search). LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results. -Currently, LanceDB does _not_ automatically create the ANN index. -LanceDB has optimized code for kNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all. -If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall. +## Disk-based Index -In the future we will look to automatically create and configure the ANN index as data comes in. - -## Types of Index - -Lance can support multiple index types, the most widely used one is `IVF_PQ`. - -- `IVF_PQ`: use **Inverted File Index (IVF)** to first divide the dataset into `N` partitions, - and then use **Product Quantization** to compress vectors in each partition. -- `DiskANN` (**Experimental**): organize the vector as a on-disk graph, where the vertices approximately - represent the nearest neighbors of each vector. +Lance provides an `IVF_PQ` disk-based index. It uses **Inverted File Index (IVF)** to first divide +the dataset into `N` partitions, and then applies **Product Quantization** to compress vectors in each partition. +See the [indexing](concepts/index_ivfpq.md) concepts guide for more information on how this works. 
## Creating an IVF_PQ Index @@ -88,7 +79,7 @@ You can specify the GPU device to train IVF partitions via ) ``` -=== "Macos" +=== "MacOS" ```python @@ -100,7 +91,7 @@ You can specify the GPU device to train IVF partitions via ) ``` -Trouble shootings: +Troubleshooting: If you see `AssertionError: Torch not compiled with CUDA enabled`, you need to [install PyTorch with CUDA support](https://pytorch.org/get-started/locally/). @@ -187,13 +178,21 @@ You can select the columns returned by the query using a select clause. ## FAQ +### Why do I need to manually create an index? + +Currently, LanceDB does _not_ automatically create the ANN index. +LanceDB is well-optimized for kNN (exhaustive search) via a disk-based index. For many use-cases, +datasets of the order of ~100K vectors don't require index creation. If you can live with up to +100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall. + ### When is it necessary to create an ANN vector index? -`LanceDB` has manually-tuned SIMD code for computing vector distances. -In our benchmarks, computing 100K pairs of 1K dimension vectors takes **less than 20ms**. -For small datasets (< 100K rows) or applications that can accept 100ms latency, vector indices are usually not necessary. +`LanceDB` comes out-of-the-box with highly optimized SIMD code for computing vector similarity. +In our benchmarks, computing distances for 100K pairs of 1K dimension vectors takes **less than 20ms**. +We observe that for small datasets (~100K rows) or for applications that can accept 100ms latency, +vector indices are usually not necessary. -For large-scale or higher dimension vectors, it is beneficial to create vector index. +For large-scale or higher dimension vectors, it can be beneficial to create a vector index for performance. ### How big is my index, and how many memory will it take? 
diff --git a/docs/src/basic.md b/docs/src/basic.md index f942472c..6df07b59 100644 --- a/docs/src/basic.md +++ b/docs/src/basic.md @@ -46,7 +46,7 @@ !!! info "Please also make sure you're using the same version of Arrow as in the [vectordb crate](https://github.com/lancedb/lancedb/blob/main/Cargo.toml)" -## How to connect to a database +## Connect to a database === "Python" @@ -69,17 +69,22 @@ ```rust #[tokio::main] async fn main() -> Result<()> { - --8<-- "rust/vectordb/examples/simple.rs:connect" + --8<-- "rust/lancedb/examples/simple.rs:connect" } ``` - !!! info "See [examples/simple.rs](https://github.com/lancedb/lancedb/tree/main/rust/vectordb/examples/simple.rs) for a full working example." + !!! info "See [examples/simple.rs](https://github.com/lancedb/lancedb/tree/main/rust/lancedb/examples/simple.rs) for a full working example." LanceDB will create the directory if it doesn't exist (including parent directories). If you need a reminder of the uri, you can call `db.uri()`. -## How to create a table +## Create a table + +### Directly insert data to a new table + +If you have data to insert into the table at creation time, you can simultaneously create a +table and insert the data to it. === "Python" @@ -118,17 +123,18 @@ If you need a reminder of the uri, you can call `db.uri()`. use arrow_schema::{DataType, Schema, Field}; use arrow_array::{RecordBatch, RecordBatchIterator}; - --8<-- "rust/vectordb/examples/simple.rs:create_table" + --8<-- "rust/lancedb/examples/simple.rs:create_table" ``` If the table already exists, LanceDB will raise an error by default. -!!! info "Under the hood, LanceDB is converting the input data into an Apache Arrow table and persisting it to disk in [Lance format](https://www.github.com/lancedb/lance)." +!!! info "Under the hood, LanceDB converts the input data into an Apache Arrow table and persists it to disk using the [Lance format](https://www.github.com/lancedb/lance)." 
-### Creating an empty table +### Create an empty table Sometimes you may not have the data to insert into the table at creation time. -In this case, you can create an empty table and specify the schema. +In this case, you can create an empty table and specify the schema, so that you can add +data to the table at a later time (such that it conforms to the schema). === "Python" @@ -147,12 +153,12 @@ In this case, you can create an empty table and specify the schema. === "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:create_empty_table" + --8<-- "rust/lancedb/examples/simple.rs:create_empty_table" ``` -## How to open an existing table +## Open an existing table -Once created, you can open a table using the following code: +Once created, you can open a table as follows: === "Python" @@ -169,7 +175,7 @@ Once created, you can open a table using the following code: === "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:open_with_existing_file" + --8<-- "rust/lancedb/examples/simple.rs:open_with_existing_file" ``` If you forget the name of your table, you can always get a listing of all table names: @@ -189,12 +195,12 @@ If you forget the name of your table, you can always get a listing of all table === "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:list_names" + --8<-- "rust/lancedb/examples/simple.rs:list_names" ``` -## How to add data to a table +## Add data to a table -After a table has been created, you can always add more data to it using +After a table has been created, you can always add more data to it as follows: === "Python" @@ -219,12 +225,12 @@ After a table has been created, you can always add more data to it using === "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:add" + --8<-- "rust/lancedb/examples/simple.rs:add" ``` -## How to search for (approximate) nearest neighbors +## Search for nearest neighbors -Once you've embedded the query, you can find its nearest neighbors using the following code: +Once you've embedded 
the query, you can find its nearest neighbors as follows: === "Python" @@ -245,11 +251,12 @@ Once you've embedded the query, you can find its nearest neighbors using the fol ```rust use futures::TryStreamExt; - --8<-- "rust/vectordb/examples/simple.rs:search" + --8<-- "rust/lancedb/examples/simple.rs:search" ``` By default, LanceDB runs a brute-force scan over dataset to find the K nearest neighbours (KNN). For tables with more than 50K vectors, creating an ANN index is recommended to speed up search performance. +LanceDB allows you to create an ANN index on a table as follows: === "Python" @@ -266,12 +273,17 @@ For tables with more than 50K vectors, creating an ANN index is recommended to s === "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:create_index" + --8<-- "rust/lancedb/examples/simple.rs:create_index" ``` -Check [Approximate Nearest Neighbor (ANN) Indexes](/ann_indices.md) section for more details. +!!! note "Why do I need to create an index manually?" + LanceDB does not automatically create the ANN index, for two reasons. The first is that it's optimized + for really fast retrievals via a disk-based index, and the second is that data and query workloads can + be very diverse, so there's no one-size-fits-all index configuration. LanceDB provides many parameters + to fine-tune index size, query latency and accuracy. See the section on + [ANN indexes](ann_indexes.md) for more details. -## How to delete rows from a table +## Delete rows from a table Use the `delete()` method on tables to delete rows from a table. To choose which rows to delete, provide a filter that matches on the metadata columns. @@ -292,7 +304,7 @@ This can delete any number of rows that match the filter. 
=== "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:delete" + --8<-- "rust/lancedb/examples/simple.rs:delete" ``` The deletion predicate is a SQL expression that supports the same expressions @@ -307,7 +319,7 @@ To see what expressions are supported, see the [SQL filters](sql.md) section. Read more: [vectordb.Table.delete](javascript/interfaces/Table.md#delete) -## How to remove a table +## Drop a table Use the `drop_table()` method on the database to remove a table. @@ -333,7 +345,7 @@ Use the `drop_table()` method on the database to remove a table. === "Rust" ```rust - --8<-- "rust/vectordb/examples/simple.rs:drop_table" + --8<-- "rust/lancedb/examples/simple.rs:drop_table" ``` !!! note "Bundling `vectordb` apps with Webpack" diff --git a/docs/src/concepts/index_ivfpq.md b/docs/src/concepts/index_ivfpq.md index 6b4e8489..044373f4 100644 --- a/docs/src/concepts/index_ivfpq.md +++ b/docs/src/concepts/index_ivfpq.md @@ -81,24 +81,4 @@ The above query will perform a search on the table `tbl` using the given query v * `to_pandas()`: Convert the results to a pandas DataFrame And there you have it! You now understand what an IVF-PQ index is, and how to create and query it in LanceDB. - - -## FAQ - -### When is it necessary to create a vector index? - -LanceDB has manually-tuned SIMD code for computing vector distances. In our benchmarks, computing 100K pairs of 1K dimension vectors takes **<20ms**. For small datasets (<100K rows) or applications that can accept up to 100ms latency, vector indices are usually not necessary. - -For large-scale or higher dimension vectors, it is beneficial to create vector index. - -### How big is my index, and how much memory will it take? - -In LanceDB, all vector indices are disk-based, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code. 
- -For example, with 1024-dimension vectors, if we choose `num_sub_vectors = 64`, each sub-vector has `1024 / 64 = 16` float32 numbers. Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times of space reduction. - -### How to choose `num_partitions` and `num_sub_vectors` for IVF_PQ index? - -`num_partitions` is used to decide how many partitions the first level IVF index uses. Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train. On SIFT-1M dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency/recall. - -`num_sub_vectors` specifies how many PQ short codes to generate on each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency. \ No newline at end of file +To see how to create an IVF-PQ index in LanceDB, take a look at the [ANN indexes](../ann_indexes.md) section. diff --git a/docs/src/faq.md b/docs/src/faq.md index b2fb6d67..2e64cc82 100644 --- a/docs/src/faq.md +++ b/docs/src/faq.md @@ -40,7 +40,7 @@ LanceDB and its underlying data format, Lance, are built to scale to really larg No. LanceDB is blazing fast (due to its disk-based index) for even brute force kNN search, within reason. In our benchmarks, computing 100K pairs of 1000-dimension vectors takes less than 20ms. For small datasets of ~100K records or applications that can accept ~100ms latency, an ANN index is usually not necessary. -For large-scale (>1M) or higher dimension vectors, it is beneficial to create an ANN index. +For large-scale (>1M) or higher dimension vectors, it is beneficial to create an ANN index. 
See the [ANN indexes](ann_indexes.md) section for more details. ### Does LanceDB support full-text search?