lancedb/docs/src/python/polars_arrow.md

Polars

LanceDB supports Polars, a fast DataFrame library for Python written in Rust. As with pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress; at the moment, you can load a Polars DataFrame into LanceDB and return the results of a query as a Polars DataFrame.

Create & Query LanceDB Table

From Polars DataFrame

First, we connect to a LanceDB database.

=== "Sync API"

```py
--8<-- "python/python/tests/docs/test_python.py:import-lancedb"
--8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb"
```

=== "Async API"

```py
--8<-- "python/python/tests/docs/test_python.py:import-lancedb"
--8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb_async"
```

We can load a Polars DataFrame to LanceDB directly.

=== "Sync API"

```py
--8<-- "python/python/tests/docs/test_python.py:import-polars"
--8<-- "python/python/tests/docs/test_python.py:create_table_polars"
```

=== "Async API"

```py
--8<-- "python/python/tests/docs/test_python.py:import-polars"
--8<-- "python/python/tests/docs/test_python.py:create_table_polars_async"
```

We can now perform similarity search via the LanceDB Python API.

=== "Sync API"

```py
--8<-- "python/python/tests/docs/test_python.py:vector_search_polars"
```

=== "Async API"

```py
--8<-- "python/python/tests/docs/test_python.py:vector_search_polars_async"
```

In addition to the selected columns, LanceDB also returns the vector column and a `_distance` column, which holds the distance between the query vector and the returned vector.

shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector        ┆ item ┆ price ┆ _distance │
│ ---           ┆ ---  ┆ ---   ┆ ---       │
│ array[f32, 2] ┆ str  ┆ f64   ┆ f32       │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.0       │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>

Note that the type of the result from a table search is a Polars DataFrame.
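To make the `_distance` column concrete, here is a small pure-Python sketch of how such a value can arise. It assumes the table uses an L2 metric reported as the squared Euclidean distance, and it uses two hypothetical query vectors; neither assumption is taken from the snippets above.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


stored = [3.1, 4.1]      # vector of the returned row
same_query = [3.1, 4.1]  # hypothetical query identical to the stored vector
near_query = [3.0, 4.0]  # hypothetical query slightly off the stored vector

d_same = squared_l2(same_query, stored)
d_near = squared_l2(near_query, stored)

print(d_same)            # 0.0
print(round(d_near, 6))  # 0.02
```

Under these assumptions, a `_distance` of `0.0` corresponds to a query that matches the stored vector exactly, while a nearby query yields a small positive value.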

From Pydantic Models

Alternatively, we can create an empty LanceDB Table using a Pydantic schema and populate it with a Polars DataFrame.

--8<-- "python/python/tests/docs/test_python.py:import-polars"
--8<-- "python/python/tests/docs/test_python.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_python.py:class_Item"
--8<-- "python/python/tests/docs/test_python.py:create_table_pydantic"

The table can now be queried as usual.

--8<-- "python/python/tests/docs/test_python.py:vector_search_polars"
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector        ┆ item ┆ price ┆ _distance │
│ ---           ┆ ---  ┆ ---   ┆ ---       │
│ array[f32, 2] ┆ str  ┆ f64   ┆ f32       │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.02      │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>

As in the previous example, the result is returned as a Polars DataFrame.

Dump Table to LazyFrame

As you iterate on your application, you'll often need to work with the whole table's data. LanceDB tables can also be converted directly into a Polars LazyFrame for further processing.

--8<-- "python/python/tests/docs/test_python.py:dump_table_lazyform"

Unlike a search result, which is a DataFrame, the result here is a LazyFrame.

<class 'polars.lazyframe.frame.LazyFrame'>

We can now work with the LazyFrame as we would in Polars, and collect the first result.

--8<-- "python/python/tests/docs/test_python.py:print_table_lazyform"
shape: (1, 3)
┌───────────────┬──────┬───────┐
│ vector        ┆ item ┆ price │
│ ---           ┆ ---  ┆ ---   │
│ array[f32, 2] ┆ str  ┆ f64   │
╞═══════════════╪══════╪═══════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  │
└───────────────┴──────┴───────┘

Keeping the LanceDB Table as a LazyFrame rather than converting it to a DataFrame is beneficial because the table can be much larger than memory. Polars LazyFrames let us work with such larger-than-memory datasets by not loading them into memory all at once.