Mirror of https://github.com/lancedb/lancedb.git, synced 2026-01-15 00:02:59 +00:00
feat: pare down docs to only show API refs (#2770)
This PR does the following:

- Pare down the docs to only what's needed (Python, JS/TS API docs and a pointer to Rust docs)
- Styling changes to be more in line with the main website theme

The relative URLs remain unchanged, so assuming CI passes, there should be no breaking changes from the main docs site that points back here.
@@ -1,53 +0,0 @@

# Apache Datafusion

In Python, LanceDB tables can also be queried with [Apache Datafusion](https://datafusion.apache.org/), an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. This means you can write complex SQL queries to analyze your data in LanceDB.

This integration is done via the [Datafusion FFI](https://docs.rs/datafusion-ffi/latest/datafusion_ffi/), which provides a native integration between LanceDB and Datafusion.

The Datafusion FFI passes column selections and basic filters down to LanceDB, reducing the amount of data scanned when executing your query. Additionally, the integration supports streaming data from LanceDB tables, which makes it possible to run larger-than-memory aggregations.

We can demonstrate this by first installing `datafusion` and `lancedb`.

```shell
pip install datafusion lancedb
```

We will re-use the dataset [created previously](./pandas_and_pyarrow.md):

```python
import lancedb

from datafusion import SessionContext
from lance import FFILanceTableProvider

db = lancedb.connect("data/sample-lancedb")
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}
]
lance_table = db.create_table("lance_table", data)

ctx = SessionContext()

ffi_lance_table = FFILanceTableProvider(
    lance_table.to_lance(), with_row_id=True, with_row_addr=True
)
ctx.register_table_provider("ffi_lance_table", ffi_lance_table)
```

The `to_lance` method converts the LanceDB table to a `LanceDataset`, which is accessible to Datafusion through the Datafusion FFI integration layer.
To query the resulting Lance dataset in Datafusion, you first need to register the dataset with Datafusion and then reference it by the same name in your SQL query.

```python
ctx.table("ffi_lance_table")
ctx.sql("SELECT * FROM ffi_lance_table")
```

```
┌─────────────┬─────────┬────────┬─────────────────┬─────────────────┐
│ vector      │ item    │ price  │ _rowid          │ _rowaddr        │
│ float[]     │ varchar │ double │ bigint unsigned │ bigint unsigned │
├─────────────┼─────────┼────────┼─────────────────┼─────────────────┤
│ [3.1, 4.1]  │ foo     │ 10.0   │ 0               │ 0               │
│ [5.9, 26.5] │ bar     │ 20.0   │ 1               │ 1               │
└─────────────┴─────────┴────────┴─────────────────┴─────────────────┘
```
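
As a sketch of how the pushdown described above looks in practice (reusing `ctx` and the registered `ffi_lance_table` from the example; the exact pushdown behavior depends on your Datafusion and Lance versions), a more selective query might look like this:

```python
# Only `item` and `price` are selected, and the basic `price > 15.0` predicate
# can be pushed down into the Lance scan, so less data is read and transferred.
result = ctx.sql("SELECT item, price FROM ffi_lance_table WHERE price > 15.0")
print(result.to_pandas())
```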
@@ -1,61 +0,0 @@

# DuckDB

In Python, LanceDB tables can also be queried with [DuckDB](https://duckdb.org/), an in-process SQL OLAP database. This means you can write complex SQL queries to analyze your data in LanceDB.

This integration is done via [Apache Arrow](https://duckdb.org/docs/guides/python/sql_on_arrow), which provides zero-copy data sharing between LanceDB and DuckDB. DuckDB is capable of passing down column selections and basic filters to LanceDB, reducing the amount of data that needs to be scanned to perform your query. Finally, the integration allows streaming data from LanceDB tables, allowing you to aggregate tables that won't fit into memory. All of this uses the same mechanism described in DuckDB's blog post *[DuckDB quacks Arrow](https://duckdb.org/2021/12/03/duck-arrow.html)*.

We can demonstrate this by first installing `duckdb` and `lancedb`.

```shell
pip install duckdb lancedb
```

We will re-use the dataset [created previously](./pandas_and_pyarrow.md):

```python
import lancedb

db = lancedb.connect("data/sample-lancedb")
data = [
    {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
    {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}
]
table = db.create_table("pd_table", data=data)
```

The `to_lance` method converts the LanceDB table to a `LanceDataset`, which is accessible to DuckDB through the Arrow compatibility layer.
To query the resulting Lance dataset in DuckDB, all you need to do is reference the dataset by the same name in your SQL query.

```python
import duckdb

arrow_table = table.to_lance()

duckdb.query("SELECT * FROM arrow_table")
```

```
┌─────────────┬─────────┬────────┐
│   vector    │  item   │ price  │
│   float[]   │ varchar │ double │
├─────────────┼─────────┼────────┤
│ [3.1, 4.1]  │ foo     │  10.0  │
│ [5.9, 26.5] │ bar     │  20.0  │
└─────────────┴─────────┴────────┘
```

You can easily run any other DuckDB SQL query on your data.

```py
duckdb.query("SELECT mean(price) FROM arrow_table")
```

```
┌─────────────┐
│ mean(price) │
│   double    │
├─────────────┤
│    15.0     │
└─────────────┘
```
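
For example, a grouped aggregation can be collected straight into a pandas DataFrame, a sketch using DuckDB's `to_df()` helper on the relation returned by `duckdb.query` (the column aliases are illustrative):

```python
# Group the Lance-backed Arrow table by item and materialize as pandas.
duckdb.query(
    "SELECT item, count(*) AS n, avg(price) AS avg_price "
    "FROM arrow_table GROUP BY item"
).to_df()
```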
@@ -1,97 +0,0 @@

# Pandas and PyArrow

Because Lance is built on top of [Apache Arrow](https://arrow.apache.org/),
LanceDB is tightly integrated with the Python data ecosystem, including [Pandas](https://pandas.pydata.org/)
and PyArrow. The sequence of steps in a typical workflow is shown below.

## Create dataset

First, we need to connect to a LanceDB database.

=== "Sync API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:import-lancedb"
    --8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb"
    ```

=== "Async API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:import-lancedb"
    --8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb_async"
    ```

We can load a Pandas `DataFrame` to LanceDB directly.

=== "Sync API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:import-pandas"
    --8<-- "python/python/tests/docs/test_python.py:create_table_pandas"
    ```

=== "Async API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:import-pandas"
    --8<-- "python/python/tests/docs/test_python.py:create_table_pandas_async"
    ```
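
Because the snippets above are included from the test suite, here is a minimal self-contained sketch of the same step (synchronous API; the table name and data are illustrative, and `db` is assumed to be the connection from the previous step):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
# create_table accepts a pandas DataFrame directly
tbl = db.create_table("pd_table", data=df)
```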

Similar to the [`pyarrow.write_dataset()`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html) method, LanceDB's
[`db.create_table()`](python.md/#lancedb.db.DBConnection.create_table) accepts data in a variety of forms.

If you have a dataset that is larger than memory, you can create a table with `Iterator[pyarrow.RecordBatch]` to lazily load the data:

=== "Sync API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:import-iterable"
    --8<-- "python/python/tests/docs/test_python.py:import-pyarrow"
    --8<-- "python/python/tests/docs/test_python.py:make_batches"
    --8<-- "python/python/tests/docs/test_python.py:create_table_iterable"
    ```

=== "Async API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:import-iterable"
    --8<-- "python/python/tests/docs/test_python.py:import-pyarrow"
    --8<-- "python/python/tests/docs/test_python.py:make_batches"
    --8<-- "python/python/tests/docs/test_python.py:create_table_iterable_async"
    ```
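
As a minimal self-contained sketch of this lazy-loading pattern (synchronous API; the batch generator, schema, and table name below are illustrative rather than the exact test snippet):

```python
from typing import Iterable

import pyarrow as pa
import lancedb

def make_batches() -> Iterable[pa.RecordBatch]:
    # Yield the data in chunks so the full dataset never has to fit in memory.
    for _ in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                pa.array([[3.1, 4.1], [5.9, 26.5]], pa.list_(pa.float32(), 2)),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0]),
            ],
            ["vector", "item", "price"],
        )

schema = pa.schema(
    [
        pa.field("vector", pa.list_(pa.float32(), 2)),
        pa.field("item", pa.utf8()),
        pa.field("price", pa.float64()),
    ]
)

db = lancedb.connect("data/sample-lancedb")
tbl = db.create_table("batched_table", make_batches(), schema=schema)
```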

You will find detailed instructions for creating a LanceDB dataset in the
[Getting Started](../basic.md#quick-start) and [API](python.md/#lancedb.db.DBConnection.create_table)
sections.

## Vector search

We can now perform similarity search via the LanceDB Python API.

=== "Sync API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:vector_search"
    ```

=== "Async API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:vector_search_async"
    ```

```
        vector item  price    _distance
0  [5.9, 26.5]  bar   20.0  14257.05957
```

If you have a simple filter, it's faster to provide a `where` clause to LanceDB's `search` method.
For more complex filters or aggregations, you can always resort to using the underlying `DataFrame` methods after performing a search.

=== "Sync API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:vector_search_with_filter"
    ```

=== "Async API"

    ```python
    --8<-- "python/python/tests/docs/test_python.py:vector_search_with_filter_async"
    ```
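
A minimal sketch of both approaches (synchronous API; assuming the table object from the earlier steps is named `tbl`, with an illustrative query vector, filter, and follow-up aggregation):

```python
# Simple filters can be pushed into the search via a `where` clause.
df = (
    tbl.search([100.0, 100.0])
       .where("price < 15.0")
       .limit(2)
       .to_pandas()
)

# More complex post-processing can fall back to pandas on the result.
df.groupby("item")["price"].mean()
```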
@@ -1,141 +0,0 @@

# Polars

LanceDB supports [Polars](https://github.com/pola-rs/polars), a blazingly fast DataFrame library for Python written in Rust. Just like in Pandas, the Polars integration is enabled by PyArrow under the hood. A deeper integration between Lance Tables and Polars DataFrames is in progress, but at the moment, you can read a Polars DataFrame into LanceDB and output the search results from a query to a Polars DataFrame.

## Create & Query LanceDB Table

### From Polars DataFrame

First, we connect to a LanceDB database.

=== "Sync API"

    ```py
    --8<-- "python/python/tests/docs/test_python.py:import-lancedb"
    --8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb"
    ```

=== "Async API"

    ```py
    --8<-- "python/python/tests/docs/test_python.py:import-lancedb"
    --8<-- "python/python/tests/docs/test_python.py:connect_to_lancedb_async"
    ```

We can load a Polars `DataFrame` to LanceDB directly.

=== "Sync API"

    ```py
    --8<-- "python/python/tests/docs/test_python.py:import-polars"
    --8<-- "python/python/tests/docs/test_python.py:create_table_polars"
    ```

=== "Async API"

    ```py
    --8<-- "python/python/tests/docs/test_python.py:import-polars"
    --8<-- "python/python/tests/docs/test_python.py:create_table_polars_async"
    ```
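
Since the snippets above are pulled from the test suite, here is a minimal self-contained sketch of the same step (synchronous API; the table name and data are illustrative, and `db` is assumed to be the connection from the previous step):

```python
import polars as pl

df = pl.DataFrame(
    {
        "vector": [[3.1, 4.1], [5.9, 26.5]],
        "item": ["foo", "bar"],
        "price": [10.0, 20.0],
    }
)
# create_table accepts a Polars DataFrame directly
tbl = db.create_table("pl_table", data=df)
```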

We can now perform similarity search via the LanceDB Python API.

=== "Sync API"

    ```py
    --8<-- "python/python/tests/docs/test_python.py:vector_search_polars"
    ```

=== "Async API"

    ```py
    --8<-- "python/python/tests/docs/test_python.py:vector_search_polars_async"
    ```

In addition to the selected columns, LanceDB also returns the vector and the `_distance` column, which is the distance between the query vector and the returned vector.

```
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector        ┆ item ┆ price ┆ _distance │
│ ---           ┆ ---  ┆ ---   ┆ ---       │
│ array[f32, 2] ┆ str  ┆ f64   ┆ f32       │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.0       │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```

Note that the type of the result from a table search is a Polars DataFrame.

### From Pydantic Models

Alternatively, we can create an empty LanceDB Table using a Pydantic schema and populate it with a Polars DataFrame.

```py
--8<-- "python/python/tests/docs/test_python.py:import-polars"
--8<-- "python/python/tests/docs/test_python.py:import-lancedb-pydantic"
--8<-- "python/python/tests/docs/test_python.py:class_Item"
--8<-- "python/python/tests/docs/test_python.py:create_table_pydantic"
```
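
Since the snippet above is included from the test suite, here is a minimal self-contained sketch of the same idea (the model, table name, and data are illustrative, and `db` is assumed to be the connection from earlier):

```python
import polars as pl
from lancedb.pydantic import LanceModel, Vector

class Item(LanceModel):
    vector: Vector(2)
    item: str
    price: float

# Create an empty table from the Pydantic schema, then add a Polars DataFrame.
tbl = db.create_table("pydantic_polars_table", schema=Item)
tbl.add(
    pl.DataFrame(
        {
            "vector": [[3.1, 4.1], [5.9, 26.5]],
            "item": ["foo", "bar"],
            "price": [10.0, 20.0],
        }
    )
)
```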

The table can now be queried as usual.

```py
--8<-- "python/python/tests/docs/test_python.py:vector_search_polars"
```

```
shape: (1, 4)
┌───────────────┬──────┬───────┬───────────┐
│ vector        ┆ item ┆ price ┆ _distance │
│ ---           ┆ ---  ┆ ---   ┆ ---       │
│ array[f32, 2] ┆ str  ┆ f64   ┆ f32       │
╞═══════════════╪══════╪═══════╪═══════════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  ┆ 0.02      │
└───────────────┴──────┴───────┴───────────┘
<class 'polars.dataframe.frame.DataFrame'>
```

This result is the same as the previous one, with a DataFrame returned.

## Dump Table to LazyFrame

As you iterate on your application, you'll likely need to work with the whole table's data pretty frequently.
LanceDB tables can also be converted directly into a Polars LazyFrame for further processing.

```python
--8<-- "python/python/tests/docs/test_python.py:dump_table_lazyform"
```
Unlike the search result from a query, we can see that the type of the result is a LazyFrame.
|
||||
|
||||
```
|
||||
<class 'polars.lazyframe.frame.LazyFrame'>
|
||||
```
|
||||
|
||||
We can now work with the LazyFrame as we would in Polars, and collect the first result.
|
||||
|
||||
```python
|
||||
--8<-- "python/python/tests/docs/test_python.py:print_table_lazyform"
|
||||
```
|
||||
|
||||

```
shape: (1, 3)
┌───────────────┬──────┬───────┐
│ vector        ┆ item ┆ price │
│ ---           ┆ ---  ┆ ---   │
│ array[f32, 2] ┆ str  ┆ f64   │
╞═══════════════╪══════╪═══════╡
│ [3.1, 4.1]    ┆ foo  ┆ 10.0  │
└───────────────┴──────┴───────┘
```

It's beneficial not to convert the LanceDB table to a DataFrame because the table can
potentially be much larger than memory; Polars LazyFrames let us work with such
larger-than-memory datasets without loading everything into memory at once.
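
As a sketch of that lazy workflow (assuming `tbl` is the table object from the earlier examples and that the conversion shown in the snippet above is a `to_polars()` call returning a Polars LazyFrame; the filter is illustrative):

```python
import polars as pl

ldf = tbl.to_polars()  # assumed to return a polars LazyFrame, as shown above
result = (
    ldf.filter(pl.col("price") < 15.0)  # becomes part of the lazy query plan
       .select(["item", "price"])
       .collect()                       # only now is data materialized
)
print(result)
```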
@@ -1,47 +0,0 @@

# Pydantic

[Pydantic](https://docs.pydantic.dev/latest/) is a data validation library in Python.
LanceDB integrates with Pydantic for schema inference, data ingestion, and query result casting.
Using [LanceModel][lancedb.pydantic.LanceModel], users can seamlessly
integrate Pydantic with the rest of the LanceDB APIs.

```python
--8<-- "python/python/tests/docs/test_pydantic_integration.py:imports"

--8<-- "python/python/tests/docs/test_pydantic_integration.py:base_model"

--8<-- "python/python/tests/docs/test_pydantic_integration.py:set_url"
--8<-- "python/python/tests/docs/test_pydantic_integration.py:base_example"
```

## Vector Field

LanceDB provides a [`Vector(dim)`](python.md#lancedb.pydantic.Vector) method to define a
vector Field in a Pydantic Model.

::: lancedb.pydantic.Vector

## Type Conversion

LanceDB automatically converts Pydantic fields to
[Apache Arrow DataType](https://arrow.apache.org/docs/python/generated/pyarrow.DataType.html#pyarrow.DataType).

Currently supported type conversions:

| Pydantic Field Type | PyArrow Data Type                   |
| ------------------- | ----------------------------------- |
| `int`               | `pyarrow.int64`                     |
| `float`             | `pyarrow.float64`                   |
| `bool`              | `pyarrow.bool_`                     |
| `str`               | `pyarrow.utf8()`                    |
| `list`              | `pyarrow.List`                      |
| `BaseModel`         | `pyarrow.Struct`                    |
| `Vector(n)`         | `pyarrow.FixedSizeList(float32, n)` |

LanceDB supports creating an Apache Arrow Schema from a
[Pydantic BaseModel][pydantic.BaseModel]
via the [pydantic_to_schema()](python.md#lancedb.pydantic.pydantic_to_schema) method.

::: lancedb.pydantic.pydantic_to_schema
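
A minimal sketch tying these pieces together (the model and field names follow the example data used elsewhere in these docs):

```python
import pyarrow as pa
from lancedb.pydantic import LanceModel, Vector, pydantic_to_schema

class Item(LanceModel):
    vector: Vector(2)  # -> pyarrow.FixedSizeList(float32, 2)
    item: str          # -> pyarrow.utf8()
    price: float       # -> pyarrow.float64

schema = pydantic_to_schema(Item)
assert isinstance(schema, pa.Schema)
print(schema)
```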
@@ -1,7 +1,6 @@
 # Python API Reference

-This section contains the API reference for the Python API. There is a
-synchronous and an asynchronous API client.
+This section contains the API reference for the Python API of [LanceDB](https://github.com/lancedb/lancedb). Both synchronous and asynchronous APIs are available.

 The general flow of using the API is:
@@ -1,24 +0,0 @@

# Python API Reference (SaaS)

This section contains the API reference for the LanceDB Cloud Python API.

## Installation

```shell
pip install lancedb
```

## Connection

::: lancedb.connect

::: lancedb.remote.db.RemoteDBConnection
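
As a hedged sketch of connecting to LanceDB Cloud (the database URI, API key, and region values below are placeholders; consult the `lancedb.connect` reference above for the authoritative parameters):

```python
import lancedb

# Placeholder values: substitute your own database URI, API key, and region.
db = lancedb.connect(
    "db://your-database",
    api_key="sk_...",
    region="us-east-1",
)
tbl = db.open_table("my_table")
```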

## Table

::: lancedb.remote.table.RemoteTable
    options:
      filters:
        - "!cleanup_old_versions"
        - "!compact_files"
        - "!optimize"