[docs]: Fix issues with Rust code snippets in "quick start" (#1047)

The renaming of `vectordb` to `lancedb` broke the Rust code snippets in the [quick start docs](https://lancedb.github.io/lancedb/basic/#__tabbed_5_3) (their include paths pointed to a non-existent directory). This PR fixes the code snippets and the paths in the docs page. It also includes further fixes to the indexing docs, shown below 👇🏽.
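
To catch this kind of breakage after a rename, a small check along the following lines could be run over the docs. This is only a sketch: the helper name `find_broken_snippets` is hypothetical, and it assumes that `--8<--` include paths are resolved relative to the repository root, which depends on how the snippets extension is configured.

```python
import re
from pathlib import Path

# Matches snippet include lines such as:
#   --8<-- "rust/lancedb/examples/simple.rs:connect"
# Group 1 is the file path; an optional ":section" suffix is ignored.
SNIPPET_RE = re.compile(r'--8<--\s+"([^":]+)(?::[^"]*)?"')


def find_broken_snippets(docs_dir, repo_root):
    """Return (markdown file, line number, path) for includes whose file is missing.

    Assumption: include paths are relative to repo_root.
    """
    broken = []
    for md in sorted(Path(docs_dir).rglob("*.md")):
        lines = md.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, 1):
            match = SNIPPET_RE.search(line)
            if match and not (Path(repo_root) / match.group(1)).is_file():
                broken.append((str(md), lineno, match.group(1)))
    return broken


if __name__ == "__main__":
    # Run from the repository root; prints one line per dangling include.
    for md, lineno, target in find_broken_snippets("docs/src", "."):
        print(f"{md}:{lineno}: missing {target}")
```

Run from the repository root, it lists every include whose target file no longer exists, such as the stale `rust/vectordb/...` paths fixed below.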
2026-07-09 14:00:44 +00:00 · 2024-03-03 18:59:57 -05:00
parent acfdf1b9cb
commit 14566df213
4 changed files with 57 additions and 66 deletions
--- a/docs/src/ann_indexes.md
+++ b/docs/src/ann_indexes.md
@@ -7,20 +7,11 @@ for brute-force scanning of the entire vector space.
 A vector index is faster but less accurate than exhaustive search (kNN or flat search).
 LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.

-Currently, LanceDB does _not_ automatically create the ANN index.
-LanceDB has optimized code for kNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
-If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
+## Disk-based Index

-In the future we will look to automatically create and configure the ANN index as data comes in.
-
-## Types of Index
-
-Lance can support multiple index types, the most widely used one is `IVF_PQ`.
-
- `IVF_PQ`: use **Inverted File Index (IVF)** to first divide the dataset into `N` partitions,
-  and then use **Product Quantization** to compress vectors in each partition.
- `DiskANN` (**Experimental**): organize the vector as a on-disk graph, where the vertices approximately
-  represent the nearest neighbors of each vector.
+Lance provides an `IVF_PQ` disk-based index. It uses **Inverted File Index (IVF)** to first divide
+the dataset into `N` partitions, and then applies **Product Quantization** to compress vectors in each partition.
+See the [indexing](concepts/index_ivfpq.md) concepts guide for more information on how this works.

 ## Creating an IVF_PQ Index

@@ -88,7 +79,7 @@ You can specify the GPU device to train IVF partitions via
     )
     ```

-=== "Macos"
+=== "MacOS"

     <!-- skip-test -->
     ```python
@@ -100,7 +91,7 @@ You can specify the GPU device to train IVF partitions via
     )
     ```

-Trouble shootings:
+Troubleshooting:

 If you see `AssertionError: Torch not compiled with CUDA enabled`, you need to [install
 PyTorch with CUDA support](https://pytorch.org/get-started/locally/).
@@ -187,13 +178,21 @@ You can select the columns returned by the query using a select clause.

 ## FAQ

+### Why do I need to manually create an index?
+
+Currently, LanceDB does _not_ automatically create the ANN index.
+LanceDB is well-optimized for kNN (exhaustive search) via a disk-based index. For many use-cases,
+datasets of the order of ~100K vectors don't require index creation. If you can live with up to
+100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
+
 ### When is it necessary to create an ANN vector index?

-`LanceDB` has manually-tuned SIMD code for computing vector distances.
-In our benchmarks, computing 100K pairs of 1K dimension vectors takes **less than 20ms**.
-For small datasets (< 100K rows) or applications that can accept 100ms latency, vector indices are usually not necessary.
+`LanceDB` comes out-of-the-box with highly optimized SIMD code for computing vector similarity.
+In our benchmarks, computing distances for 100K pairs of 1K dimension vectors takes **less than 20ms**.
+We observe that for small datasets (~100K rows) or for applications that can accept 100ms latency,
+vector indices are usually not necessary.

-For large-scale or higher dimension vectors, it is beneficial to create vector index.
+For large-scale or higher dimension vectors, it can be beneficial to create a vector index for performance.

 ### How big is my index, and how much memory will it take?

--- a/docs/src/basic.md
+++ b/docs/src/basic.md
@@ -46,7 +46,7 @@

    !!! info "Please also make sure you're using the same version of Arrow as in the [vectordb crate](https://github.com/lancedb/lancedb/blob/main/Cargo.toml)"

-## How to connect to a database
+## Connect to a database

 === "Python"

@@ -69,17 +69,22 @@
    ```rust
    #[tokio::main]
    async fn main() -> Result<()> {
-        --8<-- "rust/vectordb/examples/simple.rs:connect"
+        --8<-- "rust/lancedb/examples/simple.rs:connect"
    }
    ```

-    !!! info "See [examples/simple.rs](https://github.com/lancedb/lancedb/tree/main/rust/vectordb/examples/simple.rs) for a full working example."
+    !!! info "See [examples/simple.rs](https://github.com/lancedb/lancedb/tree/main/rust/lancedb/examples/simple.rs) for a full working example."

 LanceDB will create the directory if it doesn't exist (including parent directories).

 If you need a reminder of the uri, you can call `db.uri()`.

-## How to create a table
+## Create a table
+
+### Directly insert data into a new table
+
+If you have data to insert into the table at creation time, you can simultaneously create a
+table and insert the data into it.

 === "Python"

@@ -118,17 +123,18 @@ If you need a reminder of the uri, you can call `db.uri()`.
    use arrow_schema::{DataType, Schema, Field};
    use arrow_array::{RecordBatch, RecordBatchIterator};

-    --8<-- "rust/vectordb/examples/simple.rs:create_table"
+    --8<-- "rust/lancedb/examples/simple.rs:create_table"
    ```

    If the table already exists, LanceDB will raise an error by default.

-!!! info "Under the hood, LanceDB is converting the input data into an Apache Arrow table and persisting it to disk in [Lance format](https://www.github.com/lancedb/lance)."
+!!! info "Under the hood, LanceDB converts the input data into an Apache Arrow table and persists it to disk using the [Lance format](https://www.github.com/lancedb/lance)."

-### Creating an empty table
+### Create an empty table

 Sometimes you may not have the data to insert into the table at creation time.
-In this case, you can create an empty table and specify the schema.
+In this case, you can create an empty table and specify the schema, then add data that
+conforms to that schema at a later time.

 === "Python"

@@ -147,12 +153,12 @@ In this case, you can create an empty table and specify the schema.
 === "Rust"

    ```rust
-    --8<-- "rust/vectordb/examples/simple.rs:create_empty_table"
+    --8<-- "rust/lancedb/examples/simple.rs:create_empty_table"
    ```

-## How to open an existing table
+## Open an existing table

-Once created, you can open a table using the following code:
+Once created, you can open a table as follows:

 === "Python"

@@ -169,7 +175,7 @@ Once created, you can open a table using the following code:
 === "Rust"

    ```rust
-    --8<-- "rust/vectordb/examples/simple.rs:open_with_existing_file"
+    --8<-- "rust/lancedb/examples/simple.rs:open_with_existing_file"
    ```

 If you forget the name of your table, you can always get a listing of all table names:
@@ -189,12 +195,12 @@ If you forget the name of your table, you can always get a listing of all table
 === "Rust"

    ```rust
-    --8<-- "rust/vectordb/examples/simple.rs:list_names"
+    --8<-- "rust/lancedb/examples/simple.rs:list_names"
    ```

-## How to add data to a table
+## Add data to a table

-After a table has been created, you can always add more data to it using
+After a table has been created, you can always add more data to it as follows:

 === "Python"

@@ -219,12 +225,12 @@ After a table has been created, you can always add more data to it using
 === "Rust"

    ```rust
-    --8<-- "rust/vectordb/examples/simple.rs:add"
+    --8<-- "rust/lancedb/examples/simple.rs:add"
    ```

-## How to search for (approximate) nearest neighbors
+## Search for nearest neighbors

-Once you've embedded the query, you can find its nearest neighbors using the following code:
+Once you've embedded the query, you can find its nearest neighbors as follows:

 === "Python"

@@ -245,11 +251,12 @@ Once you've embedded the query, you can find its nearest neighbors using the fol
    ```rust
    use futures::TryStreamExt;

-    --8<-- "rust/vectordb/examples/simple.rs:search"
+    --8<-- "rust/lancedb/examples/simple.rs:search"
    ```

 By default, LanceDB runs a brute-force scan over the dataset to find the K nearest neighbours (KNN).
 For tables with more than 50K vectors, creating an ANN index is recommended to speed up search performance.
+LanceDB allows you to create an ANN index on a table as follows:

 === "Python"

@@ -266,12 +273,17 @@ For tables with more than 50K vectors, creating an ANN index is recommended to s
 === "Rust"

    ```rust
-     --8<-- "rust/vectordb/examples/simple.rs:create_index"
+     --8<-- "rust/lancedb/examples/simple.rs:create_index"
    ```

-Check [Approximate Nearest Neighbor (ANN) Indexes](/ann_indices.md) section for more details.
+!!! note "Why do I need to create an index manually?"
+    LanceDB does not automatically create the ANN index, for two reasons. The first is that it's optimized
+    for really fast retrievals via a disk-based index, and the second is that data and query workloads can
+    be very diverse, so there's no one-size-fits-all index configuration. LanceDB provides many parameters
+    to fine-tune index size, query latency and accuracy. See the section on
+    [ANN indexes](ann_indexes.md) for more details.

-## How to delete rows from a table
+## Delete rows from a table

 Use the `delete()` method on tables to delete rows from a table. To choose
 which rows to delete, provide a filter that matches on the metadata columns.
@@ -292,7 +304,7 @@ This can delete any number of rows that match the filter.
 === "Rust"

    ```rust
-    --8<-- "rust/vectordb/examples/simple.rs:delete"
+    --8<-- "rust/lancedb/examples/simple.rs:delete"
    ```

 The deletion predicate is a SQL expression that supports the same expressions
@@ -307,7 +319,7 @@ To see what expressions are supported, see the [SQL filters](sql.md) section.

      Read more: [vectordb.Table.delete](javascript/interfaces/Table.md#delete)

-## How to remove a table
+## Drop a table

 Use the `drop_table()` method on the database to remove a table.

@@ -333,7 +345,7 @@ Use the `drop_table()` method on the database to remove a table.
 === "Rust"

    ```rust
-    --8<-- "rust/vectordb/examples/simple.rs:drop_table"
+    --8<-- "rust/lancedb/examples/simple.rs:drop_table"
    ```

 !!! note "Bundling `vectordb` apps with Webpack"
--- a/docs/src/concepts/index_ivfpq.md
+++ b/docs/src/concepts/index_ivfpq.md
@@ -81,24 +81,4 @@ The above query will perform a search on the table `tbl` using the given query v
 * `to_pandas()`: Convert the results to a pandas DataFrame

 And there you have it! You now understand what an IVF-PQ index is, and how to create and query it in LanceDB.
-
-
-## FAQ
-
-### When is it necessary to create a vector index?
-
-LanceDB has manually-tuned SIMD code for computing vector distances. In our benchmarks, computing 100K pairs of 1K dimension vectors takes **<20ms**. For small datasets (<100K rows) or applications that can accept up to 100ms latency, vector indices are usually not necessary.
-
-For large-scale or higher dimension vectors, it is beneficial to create vector index.
-
-### How big is my index, and how much memory will it take?
-
-In LanceDB, all vector indices are disk-based, meaning that when responding to a vector query, only the relevant pages from the index file are loaded from disk and cached in memory. Additionally, each sub-vector is usually encoded into 1 byte PQ code.
-
-For example, with 1024-dimension vectors, if we choose `num_sub_vectors = 64`, each sub-vector has `1024 / 64 = 16` float32 numbers. Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times of space reduction.
-
-### How to choose `num_partitions` and `num_sub_vectors` for IVF_PQ index?
-
-`num_partitions` is used to decide how many partitions the first level IVF index uses. Higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train. On SIFT-1M dataset, our benchmark shows that keeping each partition 1K-4K rows lead to a good latency/recall.
-
-`num_sub_vectors` specifies how many PQ short codes to generate on each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
+To see how to create an IVF-PQ index in LanceDB, take a look at the [ANN indexes](../ann_indexes.md) section.
--- a/docs/src/faq.md
+++ b/docs/src/faq.md
@@ -40,7 +40,7 @@ LanceDB and its underlying data format, Lance, are built to scale to really larg

 No. LanceDB is blazing fast (due to its disk-based index) for even brute force kNN search, within reason. In our benchmarks, computing 100K pairs of 1000-dimension vectors takes less than 20ms. For small datasets of ~100K records or applications that can accept ~100ms latency, an ANN index is usually not necessary.

-For large-scale (>1M) or higher dimension vectors, it is beneficial to create an ANN index.
+For large-scale (>1M) or higher dimension vectors, it is beneficial to create an ANN index. See the [ANN indexes](ann_indexes.md) section for more details.

 ### Does LanceDB support full-text search?