docs: Updates and refactor (#683)

This PR makes incremental changes to the documentation.

* Closes #697 
* Closes #698

## Chores
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

## Restructure/new content 
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Commit 119b928a52 (parent 8bcdc81fd3), authored by Prashanth Rao, committed by GitHub on 2024-01-18 13:48:37 -05:00.
59 changed files with 1406 additions and 770 deletions.


# DuckDB
LanceDB is very well-integrated with [DuckDB](https://duckdb.org/), an in-process SQL OLAP database. This integration is done via [Arrow](https://duckdb.org/docs/guides/python/sql_on_arrow).
We can demonstrate this by first installing `duckdb` and `lancedb`.
```shell
pip install duckdb lancedb
```
We will re-use the dataset [created previously](./pandas_and_pyarrow.md):
```python
import lancedb

db = lancedb.connect("data/sample-lancedb")
table = db.create_table("pd_table", data=data)
arrow_table = table.to_arrow()
```
DuckDB can directly query the `pyarrow.Table` object:
```python
import duckdb

duckdb.query("SELECT * FROM arrow_table")
```
You can just as easily run other DuckDB SQL queries on your data.
```py
duckdb.query("SELECT mean(price) FROM arrow_table")
```



# Pandas and PyArrow
Because Lance is built on top of [Apache Arrow](https://arrow.apache.org/),
LanceDB is tightly integrated with the Python data ecosystem, including [Pandas](https://pandas.pydata.org/)
and PyArrow. The sequence of steps in a typical workflow is shown below.
## Create dataset
First, we need to connect to a LanceDB database.
```py
import lancedb

db = lancedb.connect("data/sample-lancedb")
```
We can load a Pandas `DataFrame` to LanceDB directly.
```py
import pandas as pd

data = pd.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0]
})
table = db.create_table("pd_table", data=data)
```
Similar to the [`pyarrow.write_dataset()`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html) method, LanceDB's
[`db.create_table()`](python.md/#lancedb.db.DBConnection.create_table) accepts data in a variety of forms.

If you have a dataset that is larger than memory, you can create a table with `Iterator[pyarrow.RecordBatch]` to lazily load the data:
```py
schema = pa.schema([...])
table = db.create_table("iterable_table", data=make_batches(), schema=schema)
```
You will find detailed instructions on creating a LanceDB dataset in the
[Getting Started](../basic.md#quick-start) and [API](python.md/#lancedb.db.DBConnection.create_table)
sections.
## Vector search
We can now perform similarity search via the LanceDB Python API.
```py
# Open the table previously created.
table = db.open_table("pd_table")
df = table.search([100, 100]).to_pandas()
print(df)
0 [5.9, 26.5] bar 20.0 14257.05957
```
If you have a simple filter, it's faster to provide a `where` clause to LanceDB's `search` method.
For more complex filters or aggregations, you can always resort to using the underlying `DataFrame` methods after performing a search.
```python
df = table.search([100, 100]).to_pandas()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
```


# Polars
LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. As with Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance tables and Polars DataFrames is in progress, but at the moment you can read a Polars DataFrame into LanceDB and output the search results from a query to a Polars DataFrame.
## Create dataset
First, we need to connect to a LanceDB database.
```py
import lancedb
db = lancedb.connect("data/polars-lancedb")
```
We can load a Polars `DataFrame` to LanceDB directly.
```py
import polars as pl
data = pl.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0]
})
table = db.create_table("pl_table", data=data)
```
## Vector search
We can now perform similarity search via the LanceDB Python API.
```py
query = [3.1, 4.1]
result = table.search(query).limit(1).to_polars()
assert len(result) == 1
assert result["item"][0] == "foo"
```
In addition to the selected columns, LanceDB also returns the vector itself,
along with a `_distance` column that contains the distance between the query
vector and the returned vector.
```
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector ┆ item ┆ price ┆ _distance │
│ --- ┆ --- ┆ --- ┆ --- │
│ array[f32, 2] ┆ str ┆ f64 ┆ f32 │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 ┆ 0.0 │
└───────────────┴──────┴───────┴───────────┘
```


# Python API Reference
This section contains the API reference for the OSS Python API.
## Installation
`pip install lancedb`
::: lancedb.embeddings.open_clip.OpenClipEmbeddings
::: lancedb.embeddings.utils.with_embeddings
## Context


# Python API Reference (SaaS)
This section contains the API reference for the SaaS Python API.
## Installation