Docs updates incl. Polars (#827)

This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max width issue on mobile: Content should now render more
cleanly and be more readable on smaller devices
- [x] Improve image quality of flowchart in data management page
- [x] Fix syntax highlighting in text at the bottom of the IVF-PQ
concepts page
- [x] Add example of Polars LazyFrames to docs (Integrations)
- [x] Add example of adding data to tables using Polars (guides)
This commit is contained in:
Prashanth Rao
2024-01-18 23:43:59 -05:00
committed by Weston Pace
parent 4d5d748acd
commit e6bb907d81
5 changed files with 128 additions and 11 deletions

@@ -2,12 +2,13 @@
LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. Just like in Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress, but at the moment, you can read a Polars DataFrame into LanceDB and output the search results from a query to a Polars DataFrame.
## Create dataset
## Create & Query LanceDB Table
First, we need to connect to a LanceDB database.
### From Polars DataFrame
First, we connect to a LanceDB database.
```py
import lancedb
db = lancedb.connect("data/polars-lancedb")
@@ -26,15 +27,13 @@ data = pl.DataFrame({
table = db.create_table("pl_table", data=data)
```
## Vector search
We can now perform a similarity search via the LanceDB Python API.
```py
query = [3.1, 4.1]
query = [3.0, 4.0]
result = table.search(query).limit(1).to_polars()
assert len(result) == 1
assert result["item"][0] == "foo"
print(result)
print(type(result))
```
In addition to the selected columns, LanceDB also returns a vector
@@ -50,4 +49,94 @@ shape: (1, 4)
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 ┆ 0.0 │
└───────────────┴──────┴───────┴───────────┘
```
<class 'polars.dataframe.frame.DataFrame'>
```
Note that the type of the result from a table search is a Polars DataFrame.
### From Pydantic Models
Alternatively, we can create an empty LanceDB Table from a Pydantic schema and then populate it with a Polars DataFrame.
```py
import polars as pl
from lancedb.pydantic import Vector, LanceModel
class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float
data = {
"vector": [[3.1, 4.1]],
"item": "foo",
"price": 10.0,
}
table = db.create_table("test_table", schema=Item)
df = pl.DataFrame(data)
# Add Polars DataFrame to table
table.add(df)
```
The table can now be queried as usual.
```py
result = table.search([3.0, 4.0]).limit(1).to_polars()
print(result)
print(type(result))
```
```
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector ┆ item ┆ price ┆ _distance │
│ --- ┆ --- ┆ --- ┆ --- │
│ array[f32, 2] ┆ str ┆ f64 ┆ f32 │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 ┆ 0.02 │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```
As in the previous example, the search returns a Polars DataFrame.
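The `_distance` value of 0.02 in the output above is consistent with a squared Euclidean (L2) distance between the query and the stored vector. The following is a minimal sketch of that arithmetic in plain Python, assuming the table uses LanceDB's L2 metric; the variable names are illustrative and not part of the LanceDB API.

```python
# Assumption: the search uses the L2 metric and reports the squared distance.
query = [3.0, 4.0]   # the query vector from the example above
stored = [3.1, 4.1]  # the vector stored in the table

# Sum of squared per-dimension differences
squared_l2 = sum((q - s) ** 2 for q, s in zip(query, stored))

print(round(squared_l2, 2))
```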
## Dump Table to LazyFrame
As you iterate on your application, you'll often need to work with the whole table's data.
LanceDB tables can also be converted directly into a Polars LazyFrame for further processing.
```python
ldf = table.to_polars()
print(type(ldf))
```
Unlike a search result, which is a DataFrame, the result here is a LazyFrame.
```
<class 'polars.lazyframe.frame.LazyFrame'>
```
We can now work with the LazyFrame as we would in Polars, and collect the first row.
```python
print(ldf.first().collect())
```
```
shape: (1, 3)
┌───────────────┬──────┬───────┐
│ vector ┆ item ┆ price │
│ --- ┆ --- ┆ --- │
│ array[f32, 2] ┆ str ┆ f64 │
╞═══════════════╪══════╪═══════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 │
└───────────────┴──────┴───────┘
```
It is beneficial not to convert the LanceDB Table to a DataFrame
because the table can be much larger than memory. Polars LazyFrames
let us work with such larger-than-memory datasets without loading
them into memory all at once.