diff --git a/docs/src/assets/lancedb_storage_tradeoffs.png b/docs/src/assets/lancedb_storage_tradeoffs.png
index 022a50a5..0fccf430 100644
Binary files a/docs/src/assets/lancedb_storage_tradeoffs.png and b/docs/src/assets/lancedb_storage_tradeoffs.png differ
diff --git a/docs/src/concepts/index_ivfpq.md b/docs/src/concepts/index_ivfpq.md
index 7c5b8dfa..6b4e8489 100644
--- a/docs/src/concepts/index_ivfpq.md
+++ b/docs/src/concepts/index_ivfpq.md
@@ -101,4 +101,4 @@ For example, with 1024-dimension vectors, if we choose `num_sub_vectors = 64`, e
 
 `num_partitions` decides how many partitions the first-level IVF index uses. A higher number of partitions can lead to more efficient I/O during queries and better accuracy, but the index takes much more time to train. On the SIFT-1M dataset, our benchmark shows that keeping each partition at 1K-4K rows leads to a good latency/recall trade-off.
 
-`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate on each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus, higher latency. dimension / num_sub_vectors should be a multiple of 8 for optimum SIMD efficiency.
\ No newline at end of file
+`num_sub_vectors` specifies how many PQ short codes to generate for each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less distortion, and thus yields better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimal SIMD efficiency.
\ No newline at end of file
diff --git a/docs/src/guides/tables.md b/docs/src/guides/tables.md
index 2eccfeac..ede6aee5 100644
--- a/docs/src/guides/tables.md
+++ b/docs/src/guides/tables.md
@@ -79,6 +79,24 @@ This guide will show how to create tables, insert data into them, and update the
 table = db.create_table("my_table", data, schema=custom_schema)
 ```
+
+ ### From a Polars DataFrame
+
+ LanceDB supports [Polars](https://pola.rs/), a modern, fast DataFrame library
+ written in Rust. As with Pandas, the Polars integration is enabled by PyArrow
+ under the hood. A deeper integration between LanceDB Tables and Polars DataFrames
+ is on the way.
+
+ ```python
+ import polars as pl
+
+ data = pl.DataFrame({
+     "vector": [[3.1, 4.1], [5.9, 26.5]],
+     "item": ["foo", "bar"],
+     "price": [10.0, 20.0]
+ })
+ table = db.create_table("pl_table", data=data)
+ ```
 ### From PyArrow Tables
 
 You can also create LanceDB tables directly from PyArrow tables
 
@@ -358,6 +376,15 @@ After a table has been created, you can always add more data to it using the var
 tbl.add(df)
 ```
+
+ ### Add a Polars DataFrame
+
+ ```python
+ df = pl.DataFrame({
+     "vector": [[1.3, 1.4], [9.5, 56.2]], "item": ["banana", "apple"], "price": [5.0, 7.0]
+ })
+ tbl.add(df)
+ ```
+
 ### Add an Iterator
 
 You can also add a large dataset batch in one go using an Iterator of any supported data type.
 
diff --git a/docs/src/python/polars_arrow.md b/docs/src/python/polars_arrow.md
index db78290d..7bf5ad43 100644
--- a/docs/src/python/polars_arrow.md
+++ b/docs/src/python/polars_arrow.md
@@ -2,12 +2,13 @@
 LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. As with Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress, but at the moment, you can read a Polars DataFrame into LanceDB and output the search results from a query to a Polars DataFrame.
-## Create dataset
+## Create & Query LanceDB Table
 
-First, we need to connect to a LanceDB database.
+### From Polars DataFrame
+
+First, we connect to a LanceDB database.
 
 ```py
-
 import lancedb
 
 db = lancedb.connect("data/polars-lancedb")
 
@@ -26,15 +27,13 @@ data = pl.DataFrame({
 table = db.create_table("pl_table", data=data)
 ```
 
-## Vector search
-
 We can now perform similarity search via the LanceDB Python API.
 
 ```py
-query = [3.1, 4.1]
+query = [3.0, 4.0]
 result = table.search(query).limit(1).to_polars()
-assert len(result) == 1
-assert result["item"][0] == "foo"
+print(result)
+print(type(result))
 ```
 
 In addition to the selected columns, LanceDB also returns a vector
@@ -50,4 +49,94 @@ shape: (1, 4)
 ╞═══════════════╪══════╪═══════╪═══════════╡
 │ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.0       │
 └───────────────┴──────┴───────┴───────────┘
-```
\ No newline at end of file
+
+```
+
+Note that the type of the result from a table search is a Polars DataFrame.
+
+### From Pydantic Models
+
+Alternatively, we can create an empty LanceDB Table using a Pydantic schema and populate it with a Polars DataFrame.
+
+```py
+import polars as pl
+from lancedb.pydantic import Vector, LanceModel
+
+
+class Item(LanceModel):
+    vector: Vector(2)
+    item: str
+    price: float
+
+data = {
+    "vector": [[3.1, 4.1]],
+    "item": "foo",
+    "price": 10.0,
+}
+
+table = db.create_table("test_table", schema=Item)
+df = pl.DataFrame(data)
+# Add the Polars DataFrame to the table
+table.add(df)
+```
+
+The table can now be queried as usual.
+
+```py
+result = table.search([3.0, 4.0]).limit(1).to_polars()
+print(result)
+print(type(result))
+```
+
+```
+shape: (1, 4)
+┌───────────────┬──────┬───────┬───────────┐
+│ vector        ┆ item ┆ price ┆ _distance │
+│ ---           ┆ ---  ┆ ---   ┆ ---       │
+│ array[f32, 2] ┆ str  ┆ f64   ┆ f32       │
+╞═══════════════╪══════╪═══════╪═══════════╡
+│ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.02      │
+└───────────────┴──────┴───────┴───────────┘
+
+```
+
+This result is the same as before, again returned as a DataFrame.
+
+## Dump Table to LazyFrame
+
+As you iterate on your application, you'll likely need to work with the whole table's data frequently.
+LanceDB tables can also be converted directly into a Polars LazyFrame for further processing.
+
+```python
+ldf = table.to_polars()
+print(type(ldf))
+```
+
+Unlike the search result from a query, we can see that the type of the result is a LazyFrame.
+
+```
+
+```
+
+We can now work with the LazyFrame as we would in Polars, and collect the first result.
+
+```python
+print(ldf.first().collect())
+```
+
+```
+shape: (1, 3)
+┌───────────────┬──────┬───────┐
+│ vector        ┆ item ┆ price │
+│ ---           ┆ ---  ┆ ---   │
+│ array[f32, 2] ┆ str  ┆ f64   │
+╞═══════════════╪══════╪═══════╡
+│ [3.1, 4.1]    ┆ foo  ┆ 10.0  │
+└───────────────┴──────┴───────┘
+```
+
+It's beneficial not to convert the LanceDB Table to a DataFrame
+because the table can be far larger than memory; Polars LazyFrames
+let us work with such larger-than-memory datasets without loading
+everything into memory at once.
diff --git a/docs/src/styles/extra.css b/docs/src/styles/extra.css
index e78d7158..1075e4d6 100644
--- a/docs/src/styles/extra.css
+++ b/docs/src/styles/extra.css
@@ -11,11 +11,12 @@
 /* grid */
 .md-grid {
-  max-width: 80%;
+  max-width: 95%;
 }
 
 @media (min-width: 1220px) {
   .md-main__inner {
+    max-width: 80%;
     margin-top: 0;
   }
   .md-sidebar {