diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 12cfcc3d..931a7c80 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -27,7 +27,6 @@ theme: - content.tabs.link - content.action.edit - toc.follow - # - toc.integrate - navigation.top - navigation.tabs - navigation.tabs.sticky @@ -64,7 +63,7 @@ plugins: add_image: True # Automatically add meta image add_keywords: True # Add page keywords in the header tag add_share_buttons: True # Add social share buttons - add_authors: False # Display page authors + add_authors: False # Display page authors add_desc: False add_dates: False @@ -140,11 +139,13 @@ nav: - Serverless Website Chatbot: examples/serverless_website_chatbot.md - YouTube Transcript Search: examples/youtube_transcript_bot_with_nodejs.md - TransformersJS Embedding Search: examples/transformerjs_embedding_search_nodejs.md + - 🦀 Rust: + - Overview: examples/examples_rust.md - 💭 FAQs: faq.md - ⚙️ API reference: - 🐍 Python: python/python.md - 👾 JavaScript: javascript/modules.md - - 🦀 Rust: https://docs.rs/vectordb/latest/vectordb/ + - 🦀 Rust: https://docs.rs/lancedb/latest/lancedb/ - ☁️ LanceDB Cloud: - Overview: cloud/index.md - API reference: @@ -188,21 +189,21 @@ nav: - Pydantic: python/pydantic.md - Voxel51: integrations/voxel51.md - PromptTools: integrations/prompttools.md -- Python examples: +- Examples: - examples/index.md - YouTube Transcript Search: notebooks/youtube_transcript_search.ipynb - Documentation QA Bot using LangChain: notebooks/code_qa_bot.ipynb - Multimodal search using CLIP: notebooks/multimodal_search.ipynb - Serverless QA Bot with S3 and Lambda: examples/serverless_lancedb_with_s3_and_lambda.md - Serverless QA Bot with Modal: examples/serverless_qa_bot_with_modal_and_langchain.md -- Javascript examples: - - Overview: examples/examples_js.md - - YouTube Transcript Search: examples/youtube_transcript_bot_with_nodejs.md + - YouTube Transcript Search (JS): examples/youtube_transcript_bot_with_nodejs.md - Serverless
Chatbot from any website: examples/serverless_website_chatbot.md - TransformersJS Embedding Search: examples/transformerjs_embedding_search_nodejs.md - API reference: + - Overview: api_reference.md - Python: python/python.md - Javascript: javascript/modules.md + - Rust: https://docs.rs/lancedb/latest/lancedb/index.html - LanceDB Cloud: - Overview: cloud/index.md - API reference: diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md index 3f776f49..37a3cd33 100644 --- a/docs/src/ann_indexes.md +++ b/docs/src/ann_indexes.md @@ -19,39 +19,61 @@ Lance supports `IVF_PQ` index type by default. === "Python" - Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method. + Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method. - ```python - import lancedb - import numpy as np - uri = "data/sample-lancedb" - db = lancedb.connect(uri) + ```python + import lancedb + import numpy as np + uri = "data/sample-lancedb" + db = lancedb.connect(uri) - # Create 10,000 sample vectors - data = [{"vector": row, "item": f"item {i}"} + # Create 10,000 sample vectors + data = [{"vector": row, "item": f"item {i}"} for i, row in enumerate(np.random.random((10_000, 1536)).astype('float32'))] - # Add the vectors to a table - tbl = db.create_table("my_vectors", data=data) + # Add the vectors to a table + tbl = db.create_table("my_vectors", data=data) - # Create and train the index - you need to have enough data in the table for an effective training step - tbl.create_index(num_partitions=256, num_sub_vectors=96) - ``` + # Create and train the index - you need to have enough data in the table for an effective training step + tbl.create_index(num_partitions=256, num_sub_vectors=96) + ``` === "Typescript" - ```typescript - --8<--- "docs/src/ann_indexes.ts:import" + ```typescript + --8<--- "docs/src/ann_indexes.ts:import" - --8<-- 
"docs/src/ann_indexes.ts:ingest" - ``` + --8<-- "docs/src/ann_indexes.ts:ingest" + ``` -- **metric** (default: "L2"): The distance metric to use. By default it uses euclidean distance "`L2`". +=== "Rust" + + ```rust + --8<-- "rust/lancedb/examples/ivf_pq.rs:create_index" + ``` + + IVF_PQ index parameters are more fully defined in the [crate docs](https://docs.rs/lancedb/latest/lancedb/index/vector/struct.IvfPqIndexBuilder.html). + +The following IVF_PQ parameters can be specified: + +- **distance_type**: The distance metric to use. By default it uses euclidean distance "`L2`". We also support "cosine" and "dot" distance as well. -- **num_partitions** (default: 256): The number of partitions of the index. -- **num_sub_vectors** (default: 96): The number of sub-vectors (M) that will be created during Product Quantization (PQ). - For D dimensional vector, it will be divided into `M` of `D/M` sub-vectors, each of which is presented by - a single PQ code. +- **num_partitions**: The number of partitions in the index. The default is the square root + of the number of rows. + +!!! note + + In the synchronous python SDK and node's `vectordb` the default is 256. This default has + changed in the asynchronous python SDK and node's `lancedb`. + +- **num_sub_vectors**: The number of sub-vectors (M) that will be created during Product Quantization (PQ). + For D dimensional vector, it will be divided into `M` subvectors with dimension `D/M`, each of which is replaced by + a single PQ code. The default is the dimension of the vector divided by 16. + +!!! note + + In the synchronous python SDK and node's `vectordb` the default is currently 96. This default has + changed in the asynchronous python SDK and node's `lancedb`.
![IVF PQ](./assets/ivf_pq.png) @@ -114,25 +136,33 @@ There are a couple of parameters that can be used to fine-tune the search: === "Python" - ```python - tbl.search(np.random.random((1536))) \ - .limit(2) \ - .nprobes(20) \ - .refine_factor(10) \ - .to_pandas() - ``` + ```python + tbl.search(np.random.random((1536))) \ + .limit(2) \ + .nprobes(20) \ + .refine_factor(10) \ + .to_pandas() + ``` - ```text + ```text vector item _distance - 0 [0.44949695, 0.8444449, 0.06281311, 0.23338133... item 1141 103.575333 - 1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867 - ``` + 0 [0.44949695, 0.8444449, 0.06281311, 0.23338133... item 1141 103.575333 + 1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867 + ``` === "Typescript" - ```typescript - --8<-- "docs/src/ann_indexes.ts:search1" - ``` + ```typescript + --8<-- "docs/src/ann_indexes.ts:search1" + ``` + +=== "Rust" + + ```rust + --8<-- "rust/lancedb/examples/ivf_pq.rs:search1" + ``` + + Vector search options are more fully defined in the [crate docs](https://docs.rs/lancedb/latest/lancedb/query/struct.Query.html#method.nearest_to). The search will return the data requested in addition to the distance of each item. @@ -181,7 +211,7 @@ You can select the columns returned by the query using a select clause. ### Why do I need to manually create an index? Currently, LanceDB does _not_ automatically create the ANN index. -LanceDB is well-optimized for kNN (exhaustive search) via a disk-based index. For many use-cases, +LanceDB is well-optimized for kNN (exhaustive search) via a disk-based index. For many use-cases, datasets of the order of ~100K vectors don't require index creation. If you can live with up to 100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall. 
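The IVF_PQ defaults described in the `ann_indexes.md` hunk above (partitions as the square root of the row count, sub-vectors as the dimension divided by 16) can be checked with a little arithmetic. The sketch below is illustrative only: `default_ivf_pq_params` is a hypothetical helper, not a LanceDB API, and the rounding behavior (floor, minimum of 1) is an assumption rather than something the diff states.

```python
import math


def default_ivf_pq_params(num_rows: int, dim: int) -> dict:
    """Hypothetical helper (not a LanceDB API) mirroring the documented defaults."""
    # Documented default: square root of the number of rows.
    # Assumption: the value is floored to an integer and is at least 1.
    num_partitions = max(1, math.isqrt(num_rows))
    # Documented default: vector dimension divided by 16.
    # Assumption: integer division, with a minimum of 1.
    num_sub_vectors = max(1, dim // 16)
    return {
        "num_partitions": num_partitions,
        "num_sub_vectors": num_sub_vectors,
        # Each sub-vector covers D / M of the original dimensions.
        "sub_vector_dim": dim // num_sub_vectors,
    }


# The sample table in the Python example: 10,000 rows of 1536-dimensional vectors.
params = default_ivf_pq_params(num_rows=10_000, dim=1536)
print(params)
```

For that sample table this gives 100 partitions and 96 sub-vectors of 16 dimensions each, so the new sub-vector default coincides with the old fixed value of 96, while the partition default drops from 256 to 100.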
diff --git a/docs/src/api_reference.md b/docs/src/api_reference.md new file mode 100644 index 00000000..face89f7 --- /dev/null +++ b/docs/src/api_reference.md @@ -0,0 +1,7 @@ +# API Reference + +The API references for the LanceDB client SDKs are available at the following locations: + +- [Python](python/python.md) +- [JavaScript](javascript/modules.md) +- [Rust](https://docs.rs/lancedb/latest/lancedb/index.html) diff --git a/docs/src/basic.md b/docs/src/basic.md index 6df07b59..b5de77a7 100644 --- a/docs/src/basic.md +++ b/docs/src/basic.md @@ -3,7 +3,7 @@ !!! info "LanceDB can be run in a number of ways:" * Embedded within an existing backend (like your Django, Flask, Node.js or FastAPI application) - * Connected to directly from a client application like a Jupyter notebook for analytical workloads + * Directly from a client application like a Jupyter notebook for analytical workloads * Deployed as a remote serverless database ![](assets/lancedb_embedded_explanation.png) @@ -24,13 +24,11 @@ === "Rust" - !!! warning "Rust SDK is experimental, might introduce breaking changes in the near future" - ```shell - cargo add vectordb + cargo add lancedb ``` - !!! info "To use the vectordb create, you first need to install protobuf." + !!! info "To use the lancedb crate, you first need to install protobuf." === "macOS" @@ -44,7 +42,7 @@ sudo apt install -y protobuf-compiler libssl-dev ``` - !!! info "Please also make sure you're using the same version of Arrow as in the [vectordb crate](https://github.com/lancedb/lancedb/blob/main/Cargo.toml)" + !!! info "Please also make sure you're using the same version of Arrow as in the [lancedb crate](https://github.com/lancedb/lancedb/blob/main/Cargo.toml)" ## Connect to a database @@ -81,10 +79,11 @@ If you need a reminder of the uri, you can call `db.uri()`.
## Create a table -### Directly insert data to a new table +### Create a table from initial data -If you have data to insert into the table at creation time, you can simultaneously create a -table and insert the data to it. +If you have data to insert into the table at creation time, you can simultaneously create a +table and insert the data into it. The schema of the data will be used as the schema of the +table. === "Python" @@ -120,21 +119,27 @@ table and insert the data to it. === "Rust" ```rust - use arrow_schema::{DataType, Schema, Field}; - use arrow_array::{RecordBatch, RecordBatchIterator}; - --8<-- "rust/lancedb/examples/simple.rs:create_table" ``` - If the table already exists, LanceDB will raise an error by default. + If the table already exists, LanceDB will raise an error by default. See + [the mode option](https://docs.rs/lancedb/latest/lancedb/connection/struct.CreateTableBuilder.html#method.mode) + for details on how to overwrite (or open) existing tables instead. -!!! info "Under the hood, LanceDB converts the input data into an Apache Arrow table and persists it to disk using the [Lance format](https://www.github.com/lancedb/lance)." + !!! info "Providing table records in Rust" + + The Rust SDK currently expects data to be provided as an Arrow + [RecordBatchReader](https://docs.rs/arrow-array/latest/arrow_array/trait.RecordBatchReader.html). + Support for additional formats (such as serde or polars) is on the roadmap. + +!!! info "Under the hood, LanceDB reads in the Apache Arrow data and persists it to disk using the [Lance format](https://www.github.com/lancedb/lance)." ### Create an empty table Sometimes you may not have the data to insert into the table at creation time. In this case, you can create an empty table and specify the schema, so that you can add -data to the table at a later time (such that it conforms to the schema). +data to the table at a later time (as long as it conforms to the schema).
This is +similar to a `CREATE TABLE` statement in SQL. === "Python" @@ -175,7 +180,7 @@ Once created, you can open a table as follows: === "Rust" ```rust - --8<-- "rust/lancedb/examples/simple.rs:open_with_existing_file" + --8<-- "rust/lancedb/examples/simple.rs:open_existing_tbl" ``` If you forget the name of your table, you can always get a listing of all table names: @@ -254,6 +259,14 @@ Once you've embedded the query, you can find its nearest neighbors as follows: --8<-- "rust/lancedb/examples/simple.rs:search" ``` + !!! info "Query vectors in Rust" + Rust does not yet support automatic execution of embedding functions. You will need to + calculate embeddings yourself. Support for this is on the roadmap and can be tracked at + https://github.com/lancedb/lancedb/issues/994 + + Query vectors can be provided as Arrow arrays or a Vec/slice of Rust floats. + Support for additional formats (e.g. `polars::series::Series`) is on the roadmap. + By default, LanceDB runs a brute-force scan over dataset to find the K nearest neighbours (KNN). For tables with more than 50K vectors, creating an ANN index is recommended to speed up search performance. LanceDB allows you to create an ANN index on a table as follows: @@ -277,7 +290,7 @@ LanceDB allows you to create an ANN index on a table as follows: ``` !!! note "Why do I need to create an index manually?" - LanceDB does not automatically create the ANN index, for two reasons. The first is that it's optimized + LanceDB does not automatically create the ANN index for two reasons. The first is that it's optimized for really fast retrievals via a disk-based index, and the second is that data and query workloads can be very diverse, so there's no one-size-fits-all index configuration. LanceDB provides many parameters to fine-tune index size, query latency and accuracy. See the section on @@ -308,8 +321,9 @@ This can delete any number of rows that match the filter.
``` The deletion predicate is a SQL expression that supports the same expressions -as the `where()` clause on a search. They can be as simple or complex as needed. -To see what expressions are supported, see the [SQL filters](sql.md) section. +as the `where()` clause (`only_if()` in Rust) on a search. They can be as +simple or complex as needed. To see what expressions are supported, see the +[SQL filters](sql.md) section. === "Python" @@ -319,6 +333,10 @@ To see what expressions are supported, see the [SQL filters](sql.md) section. Read more: [vectordb.Table.delete](javascript/interfaces/Table.md#delete) +=== "Rust" + + Read more: [lancedb::Table::delete](https://docs.rs/lancedb/latest/lancedb/table/struct.Table.html#method.delete) + ## Drop a table Use the `drop_table()` method on the database to remove a table. diff --git a/docs/src/examples/examples_rust.md b/docs/src/examples/examples_rust.md new file mode 100644 index 00000000..aa1ae0df --- /dev/null +++ b/docs/src/examples/examples_rust.md @@ -0,0 +1,3 @@ +# Examples: Rust + +Our Rust SDK is now stable. Examples are coming soon. diff --git a/docs/src/examples/index.md b/docs/src/examples/index.md index 73a4d0df..885e80b1 100644 --- a/docs/src/examples/index.md +++ b/docs/src/examples/index.md @@ -2,10 +2,11 @@ ## Recipes and example code -LanceDB provides language APIs, allowing you to embed a database in your language of choice. We currently provide Python and Javascript APIs, with the Rust API and examples actively being worked on and will be available soon. +LanceDB provides language APIs, allowing you to embed a database in your language of choice. 
* 🐍 [Python](examples_python.md) examples -* πŸ‘Ύ [JavaScript](exampled_js.md) examples +* πŸ‘Ύ [JavaScript](examples_js.md) examples +* πŸ¦€ Rust examples (coming soon) ## Applications powered by LanceDB diff --git a/docs/src/faq.md b/docs/src/faq.md index 2e64cc82..4eb2583f 100644 --- a/docs/src/faq.md +++ b/docs/src/faq.md @@ -16,7 +16,7 @@ As we mention in our talk titled β€œ[Lance, a modern columnar data format](https ### Why build in Rust? πŸ¦€ -We believe that the Rust ecosystem has attained mainstream maturity and that Rust will form the underpinnings of large parts of the data and ML landscape in a few years. Performance, latency and reliability are paramount to a vector DB, and building in Rust allows us to iterate and release updates more rapidly due to Rust’s safety guarantees. Both Lance (the data format) and LanceDB (the database) are written entirely in Rust. We also provide Python and JavaScript client libraries to interact with the database. Our Rust API is a little rough around the edges right now, but is fast becoming on par with the Python and JS APIs. +We believe that the Rust ecosystem has attained mainstream maturity and that Rust will form the underpinnings of large parts of the data and ML landscape in a few years. Performance, latency and reliability are paramount to a vector DB, and building in Rust allows us to iterate and release updates more rapidly due to Rust’s safety guarantees. Both Lance (the data format) and LanceDB (the database) are written entirely in Rust. We also provide Python, JavaScript, and Rust client libraries to interact with the database. ### What is the difference between LanceDB OSS and LanceDB Cloud? @@ -44,7 +44,7 @@ For large-scale (>1M) or higher dimension vectors, it is beneficial to create an ### Does LanceDB support full-text search? -Yes, LanceDB supports full-text search (FTS) via [Tantivy](https://github.com/quickwit-oss/tantivy). 
Our current FTS integration is Python-only, and our goal is to push it down to the Rust level in future versions to enable much more powerful search capabilities available to our Python, JavaScript and Rust clients. +Yes, LanceDB supports full-text search (FTS) via [Tantivy](https://github.com/quickwit-oss/tantivy). Our current FTS integration is Python-only, and our goal is to push it down to the Rust level in future versions to enable much more powerful search capabilities available to our Python, JavaScript and Rust clients. Follow along in the [Github issue](https://github.com/lancedb/lance/issues/1195) ### How can I speed up data inserts? diff --git a/docs/src/fts.md b/docs/src/fts.md index 61665f72..ac385197 100644 --- a/docs/src/fts.md +++ b/docs/src/fts.md @@ -1,6 +1,6 @@ # Full-text search -LanceDB provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy) (currently Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions. Our goal is to push the FTS integration down to the Rust level in the future, so that it's available for JavaScript users as well. +LanceDB provides support for full-text search via [Tantivy](https://github.com/quickwit-oss/tantivy) (currently Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions. Our goal is to push the FTS integration down to the Rust level in the future, so that it's available for Rust and JavaScript users as well. Follow along at [this Github issue](https://github.com/lancedb/lance/issues/1195) A hybrid search solution combining vector and full-text search is also on the way. @@ -77,7 +77,7 @@ table.search("puppy").limit(10).where("meta='foo'").to_list() ## Phrase queries vs. 
terms queries -For full-text search you can specify either a **phrase** query like `"the old man and the sea"`, +For full-text search you can specify either a **phrase** query like `"the old man and the sea"`, or a **terms** search query like `"(Old AND Man) AND Sea"`. For more details on the terms query syntax, see Tantivy's [query parser rules](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html). @@ -112,7 +112,7 @@ double quotes replaced by single quotes. ## Configurations -By default, LanceDB configures a 1GB heap size limit for creating the index. You can +By default, LanceDB configures a 1GB heap size limit for creating the index. You can reduce this if running on a smaller node, or increase this for faster performance while indexing a larger corpus. @@ -128,7 +128,6 @@ table.create_fts_index(["text1", "text2"], writer_heap_size=heap, replace=True) If you add data after FTS index creation, it won't be reflected in search results until you do a full reindex. -2. We currently only support local filesystem paths for the FTS index. +2. We currently only support local filesystem paths for the FTS index. This is a tantivy limitation. We've implemented an object store plugin but there's no way in tantivy-py to specify to use it. - diff --git a/docs/src/index.md b/docs/src/index.md index 884f29a3..8339edf7 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -28,7 +28,7 @@ LanceDB **Cloud** is a SaaS (software-as-a-service) solution that runs serverles * Fast production-scale vector similarity, full-text & hybrid search and a SQL query interface (via [DataFusion](https://github.com/apache/arrow-datafusion)) -* Native Python and Javascript/Typescript support +* Python, Javascript/Typescript, and Rust support * Store, query & manage multi-modal data (text, images, videos, point clouds, etc.), not just the embeddings and metadata @@ -54,3 +54,4 @@ The following pages go deeper into the internal of LanceDB and how to use it. 
* [Ecosystem Integrations](integrations/index.md): Integrate LanceDB with other tools in the data ecosystem * [Python API Reference](python/python.md): Python OSS and Cloud API references * [JavaScript API Reference](javascript/modules.md): JavaScript OSS and Cloud API references +* [Rust API Reference](https://docs.rs/lancedb/latest/lancedb/index.html): Rust API reference diff --git a/docs/src/search.md b/docs/src/search.md index 911fc668..bcf0b27a 100644 --- a/docs/src/search.md +++ b/docs/src/search.md @@ -22,7 +22,7 @@ Currently, LanceDB supports the following metrics: ## Exhaustive search (kNN) If you do not create a vector index, LanceDB exhaustively scans the _entire_ vector space -and compute the distance to every vector in order to find the exact nearest neighbors. This is effectively a kNN search. +and computes the distance to every vector in order to find the exact nearest neighbors. This is effectively a kNN search.
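
The exhaustive (kNN) search described in the `search.md` hunk above, where the distance from the query to every stored vector is computed and the closest ones are kept, can be sketched in a few lines of plain Python. This is a conceptual illustration only, not LanceDB's implementation: `knn_exhaustive` is a hypothetical name, and plain Euclidean distance stands in for whichever metric the table is configured with.

```python
import heapq
import math


def knn_exhaustive(query, vectors, k):
    """Return the k nearest (distance, row_index) pairs, closest first.

    Hypothetical helper for illustration; it scans every vector, which is
    what makes the result exact (100% recall) but linear in the table size.
    """
    # Score every stored vector against the query; nothing is skipped.
    scored = ((math.dist(query, vec), idx) for idx, vec in enumerate(vectors))
    # Keep only the k smallest distances.
    return heapq.nsmallest(k, scored)


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0], [0.5, 0.5]]
neighbors = knn_exhaustive([0.0, 0.0], vectors, k=2)
print(neighbors)
```

An ANN index trades some of that exactness for speed by only scoring vectors in a subset of partitions, which is why recall is no longer guaranteed once an index is in use.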