Docs updates incl. Polars (#827)

This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max width issue on mobile: Content should now render more
cleanly and be more readable on smaller devices
- [x] Improve image quality of flowchart in data management page
- [x] Fix syntax highlighting in text at the bottom of the IVF-PQ
concepts page
- [x] Add example of Polars LazyFrames to docs (Integrations)
- [x] Add example of adding data to tables using Polars (guides)
Authored by Prashanth Rao on 2024-01-18 23:43:59 -05:00; committed by Weston Pace
parent 4d5d748acd
commit e6bb907d81
5 changed files with 128 additions and 11 deletions

Binary file changed (flowchart image, not shown): 542 KiB before → 224 KiB after


@@ -101,4 +101,4 @@ For example, with 1024-dimension vectors, if we choose `num_sub_vectors = 64`, e
`num_partitions` determines how many partitions the first-level IVF index uses. A higher number of partitions can lead to more efficient I/O during queries and better accuracy, but it takes much longer to train. On the SIFT-1M dataset, our benchmarks show that keeping each partition at 1K-4K rows leads to a good latency/recall tradeoff.
`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate on each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus, higher latency. dimension / num_sub_vectors should be a multiple of 8 for optimum SIMD efficiency.
`num_sub_vectors` specifies how many PQ short codes to generate on each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus, higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
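
To make this concrete, here is a minimal sketch of creating an IVF-PQ index with these parameters through the LanceDB Python API. The database path, table name, and parameter values are illustrative assumptions for a table of 1024-dimensional vectors, not tuned recommendations.

```python
import lancedb

# Hypothetical database path and table name, for illustration only.
db = lancedb.connect("data/sample-lancedb")
tbl = db.open_table("my_vectors")  # assumed to hold 1024-dimensional vectors

tbl.create_index(
    metric="L2",
    num_partitions=256,   # aim for roughly 1K-4K rows per partition
    num_sub_vectors=64,   # 1024 / 64 = 16, a multiple of 8 for SIMD efficiency
)
```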


@@ -79,6 +79,24 @@ This guide will show how to create tables, insert data into them, and update the
table = db.create_table("my_table", data, schema=custom_schema)
```
### From a Polars DataFrame
LanceDB supports [Polars](https://pola.rs/), a modern, fast DataFrame library
written in Rust. As with Pandas, the Polars integration is enabled by PyArrow
under the hood. A deeper integration between LanceDB Tables and Polars DataFrames
is on the way.
```python
import polars as pl
data = pl.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0],
})
table = db.create_table("pl_table", data=data)
```
### From PyArrow Tables
You can also create LanceDB tables directly from PyArrow tables.
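
For example, here is a minimal sketch mirroring the Polars example above; it assumes the same `db` connection, and the table name `pa_table` is a placeholder.

```python
import pyarrow as pa

# Same columns as the Polars example, built as a PyArrow table.
data = pa.table({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0],
})
table = db.create_table("pa_table", data=data)  # "pa_table" is an assumed name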
@@ -358,6 +376,15 @@ After a table has been created, you can always add more data to it using the var
tbl.add(df)
```
### Add a Polars DataFrame
```python
df = pl.DataFrame({
    "vector": [[1.3, 1.4], [9.5, 56.2]],
    "item": ["banana", "apple"],
    "price": [5.0, 7.0],
})
tbl.add(df)
```
### Add an Iterator
You can also add a large dataset in batches using an iterator over any supported data type, as in the sketch below.
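
Here is a minimal sketch using a generator of PyArrow record batches; the batch contents and the two-element vector schema are illustrative assumptions, and `tbl` is assumed to already have a matching schema.

```python
import pyarrow as pa

def make_batches():
    # Yield the data in chunks so the whole dataset never sits in memory at once.
    for _ in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            names=["vector", "item", "price"],
        )

tbl.add(make_batches())
```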


@@ -2,12 +2,13 @@
LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. As with Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress, but at the moment, you can read a Polars DataFrame into LanceDB and return the results of a query as a Polars DataFrame.
## Create dataset
## Create & Query LanceDB Table
First, we need to connect to a LanceDB database.
### From Polars DataFrame
First, we connect to a LanceDB database.
```py
import lancedb
db = lancedb.connect("data/polars-lancedb")
@@ -26,15 +27,13 @@ data = pl.DataFrame({
table = db.create_table("pl_table", data=data)
```
## Vector search
We can now perform similarity search via the LanceDB Python API.
```py
query = [3.1, 4.1]
query = [3.0, 4.0]
result = table.search(query).limit(1).to_polars()
assert len(result) == 1
assert result["item"][0] == "foo"
print(result)
print(type(result))
```
In addition to the selected columns, LanceDB also returns a vector and a `_distance` column, which is the distance between the query vector and the returned vector.
@@ -50,4 +49,94 @@ shape: (1, 4)
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 ┆ 0.0 │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```
Note that the type of the result from a table search is a Polars DataFrame.
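
Search results can also be filtered and projected before conversion. Here is a minimal sketch, where the filter predicate and the column selection are illustrative:

```py
result = (
    table.search([3.0, 4.0])
    .where("price < 15.0")       # SQL-style filter on metadata columns
    .select(["item", "price"])   # return only a subset of columns
    .limit(1)
    .to_polars()
)
print(result)
```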
### From Pydantic Models
Alternatively, we can create an empty LanceDB Table using a Pydantic schema and populate it with a Polars DataFrame.
```py
import polars as pl
from lancedb.pydantic import Vector, LanceModel

class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float

data = {
    "vector": [[3.1, 4.1]],
    "item": "foo",
    "price": 10.0,
}
table = db.create_table("test_table", schema=Item)
df = pl.DataFrame(data)
# Add Polars DataFrame to table
table.add(df)
```
The table can now be queried as usual.
```py
result = table.search([3.0, 4.0]).limit(1).to_polars()
print(result)
print(type(result))
```
```
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector ┆ item ┆ price ┆ _distance │
│ --- ┆ --- ┆ --- ┆ --- │
│ array[f32, 2] ┆ str ┆ f64 ┆ f32 │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 ┆ 0.02 │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```
The result is the same as the previous one, again returned as a DataFrame.
## Dump Table to LazyFrame
As you iterate on your application, you'll likely need to work with the whole table's data frequently.
LanceDB tables can be converted directly into a Polars LazyFrame for further processing.
```python
ldf = table.to_polars()
print(type(ldf))
```
Unlike the result of a search, the object returned here is a LazyFrame.
```
<class 'polars.lazyframe.frame.LazyFrame'>
```
We can now work with the LazyFrame as we would in Polars, and collect the first result.
```python
print(ldf.first().collect())
```
```
shape: (1, 3)
┌───────────────┬──────┬───────┐
│ vector ┆ item ┆ price │
│ --- ┆ --- ┆ --- │
│ array[f32, 2] ┆ str ┆ f64 │
╞═══════════════╪══════╪═══════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 │
└───────────────┴──────┴───────┘
```
It's beneficial not to convert the LanceDB table into a DataFrame eagerly, because the table can be far larger than memory; Polars LazyFrames let us work with such larger-than-memory datasets without loading everything into memory at once.
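
As an example, here is a minimal sketch of a lazy pipeline over the table; the filter predicate is illustrative, and no data is materialized until `collect()` is called:

```python
# Build a lazy query plan; nothing is executed yet.
cheap_items = ldf.filter(pl.col("price") < 15.0).select(["item", "price"])

# Execution and materialization happen only at collect time.
print(cheap_items.collect())
```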


@@ -11,11 +11,12 @@
/* grid */
.md-grid {
max-width: 80%;
max-width: 95%;
}
@media (min-width: 1220px) {
.md-main__inner {
max-width: 80%;
margin-top: 0;
}
.md-sidebar {