mirror of
https://github.com/lancedb/lancedb.git
synced 2026-01-03 18:32:55 +00:00
Docs updates incl. Polars (#827)
This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max-width issue on mobile: content should now render more cleanly and be more readable on smaller devices
- [x] Improve image quality of the flowchart on the data management page
- [x] Fix syntax highlighting in the text at the bottom of the IVF-PQ concepts page
- [x] Add an example of Polars LazyFrames to the docs (integrations)
- [x] Add an example of adding data to tables using Polars (guides)
committed by Weston Pace
parent 4d5d748acd
commit e6bb907d81
Binary file not shown.
Before Width: | Height: | Size: 542 KiB After Width: | Height: | Size: 224 KiB |
@@ -101,4 +101,4 @@ For example, with 1024-dimension vectors, if we choose `num_sub_vectors = 64`, e

`num_partitions` decides how many partitions the first-level IVF index uses. A higher number of partitions can lead to more efficient I/O during queries and better accuracy, but takes much longer to train. On the SIFT-1M dataset, our benchmarks show that keeping each partition at 1K-4K rows leads to a good latency/recall trade-off.

-`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate for each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion and thus better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus higher latency. dimension / num_sub_vectors should be a multiple of 8 for optimum SIMD efficiency.
+`num_sub_vectors` specifies how many PQ short codes to generate for each vector. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion and thus better accuracy. However, a higher `num_sub_vectors` also causes heavier I/O and more PQ computation, and thus higher latency. `dimension / num_sub_vectors` should be a multiple of 8 for optimum SIMD efficiency.
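As a sketch of how these two parameters fit together, the helper below checks the multiple-of-8 guideline; the commented `create_index` call illustrates where the parameters would be passed, but `tbl` and the exact argument names are assumptions here, not taken from this diff:

```python
def pq_layout_ok(dim: int, num_sub_vectors: int) -> bool:
    """Check the guideline: dimension / num_sub_vectors should be a whole
    number and a multiple of 8 for SIMD-friendly sub-vector widths."""
    return dim % num_sub_vectors == 0 and (dim // num_sub_vectors) % 8 == 0

# With 1024-dimension vectors and num_sub_vectors = 64, each PQ sub-vector
# covers 1024 / 64 = 16 floats, which is a multiple of 8.
assert pq_layout_ok(1024, 64)

# Hypothetical index creation on an existing table `tbl` (argument names
# assumed; check the LanceDB docs for the current signature):
# tbl.create_index(metric="L2", num_partitions=256, num_sub_vectors=64)
```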
@@ -79,6 +79,24 @@ This guide will show how to create tables, insert data into them, and update the

table = db.create_table("my_table", data, schema=custom_schema)
```

### From a Polars DataFrame

LanceDB supports [Polars](https://pola.rs/), a modern, fast DataFrame library written in Rust. Just like in Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between LanceDB Tables and Polars DataFrames is on the way.

```python
import polars as pl

data = pl.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0]
})
table = db.create_table("pl_table", data=data)
```

### From PyArrow Tables

You can also create LanceDB tables directly from PyArrow tables.
@@ -358,6 +376,15 @@ After a table has been created, you can always add more data to it using the var

tbl.add(df)
```

### Add a Polars DataFrame

```python
df = pl.DataFrame({
    "vector": [[1.3, 1.4], [9.5, 56.2]],
    "item": ["banana", "apple"],
    "price": [5.0, 7.0]
})
tbl.add(df)
```

### Add an Iterator

You can also add a large dataset in batches using an iterator of any supported data type.
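A minimal sketch of batching rows through a plain Python generator; the batching helper is ordinary Python, while the `tbl.add` call is commented out because it assumes an existing LanceDB table:

```python
def batched(rows, batch_size=1000):
    """Yield lists of row dicts in fixed-size batches so the full dataset
    never has to sit in memory at once."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

rows = ({"vector": [float(i), float(i + 1)], "item": f"item-{i}", "price": 1.0}
        for i in range(2500))

# Hypothetical: stream the batches into an existing LanceDB table `tbl`:
# for batch in batched(rows):
#     tbl.add(batch)
```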
@@ -2,12 +2,13 @@

LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. Just like in Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress, but for now you can read a Polars DataFrame into LanceDB and output the search results from a query to a Polars DataFrame.

-## Create dataset
+## Create & Query LanceDB Table

+### From Polars DataFrame

-First, we need to connect to a LanceDB database.
+First, we connect to a LanceDB database.

```py
import lancedb

db = lancedb.connect("data/polars-lancedb")
```

@@ -26,15 +27,13 @@ data = pl.DataFrame({

table = db.create_table("pl_table", data=data)
```

## Vector search

We can now perform similarity search via the LanceDB Python API.

```py
-query = [3.1, 4.1]
+query = [3.0, 4.0]
result = table.search(query).limit(1).to_polars()
-assert len(result) == 1
-assert result["item"][0] == "foo"
+print(result)
+print(type(result))
```
In addition to the selected columns, LanceDB also returns a vector

@@ -50,4 +49,94 @@ shape: (1, 4)

╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1] ┆ foo ┆ 10.0 ┆ 0.0 │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```

Note that the type of the result from a table search is a Polars DataFrame.
### From Pydantic Models

Alternatively, we can create an empty LanceDB Table using a Pydantic schema and populate it with a Polars DataFrame.

```py
import polars as pl
from lancedb.pydantic import Vector, LanceModel


class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float


data = {
    "vector": [[3.1, 4.1]],
    "item": "foo",
    "price": 10.0,
}

table = db.create_table("test_table", schema=Item)
df = pl.DataFrame(data)
# Add the Polars DataFrame to the table
table.add(df)
```
The table can now be queried as usual.

```py
result = table.search([3.0, 4.0]).limit(1).to_polars()
print(result)
print(type(result))
```

```
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector        ┆ item ┆ price ┆ _distance │
│ ---           ┆ ---  ┆ ---   ┆ ---       │
│ array[f32, 2] ┆ str  ┆ f64   ┆ f32       │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.02      │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```

The result is the same as before, returned as a Polars DataFrame.
## Dump Table to LazyFrame

As you iterate on your application, you'll often need to work with the whole table's data. LanceDB tables can also be converted directly into a Polars LazyFrame for further processing.

```python
ldf = table.to_polars()
print(type(ldf))
```

Unlike the search result from a query, the type of the result here is a LazyFrame.

```
<class 'polars.lazyframe.frame.LazyFrame'>
```

We can now work with the LazyFrame as we would in Polars, and collect the first result.

```python
print(ldf.first().collect())
```
```
shape: (1, 3)
┌───────────────┬──────┬───────┐
│ vector        ┆ item ┆ price │
│ ---           ┆ ---  ┆ ---   │
│ array[f32, 2] ┆ str  ┆ f64   │
╞═══════════════╪══════╪═══════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  │
└───────────────┴──────┴───────┘
```

It's beneficial not to convert the LanceDB Table to a DataFrame because the table can be far larger than memory; Polars LazyFrames let us work with such larger-than-memory datasets without loading everything into memory at once.
@@ -11,11 +11,12 @@

/* grid */
.md-grid {
-  max-width: 80%;
+  max-width: 95%;
}

@media (min-width: 1220px) {
  .md-main__inner {
    max-width: 80%;
    margin-top: 0;
  }
  .md-sidebar {