lancedb/docs/src/python/arrow.md
Will Jones 722462c38b chore: upgrade Lance and rename score to _distance (#398)
BREAKING CHANGE: The `score` column has been renamed to `_distance` to
more accurately describe the semantics (smaller means closer / better).

---------

Co-authored-by: Lei Xu <lei@lancedb.com>
2023-08-11 21:42:33 -07:00

Pandas and PyArrow

Built on top of Apache Arrow, LanceDB is easy to integrate with the Python ecosystem, including Pandas and PyArrow.

Create dataset

First, we need to connect to a LanceDB database.


import lancedb

db = lancedb.connect("data/sample-lancedb")

Afterwards, we write a Pandas DataFrame to LanceDB directly.

import pandas as pd

data = pd.DataFrame({
    "vector": [[3.1, 4.1], [5.9, 26.5]],
    "item": ["foo", "bar"],
    "price": [10.0, 20.0]
})
table = db.create_table("pd_table", data=data)

Similar to pyarrow.dataset.write_dataset(), db.create_table() accepts data in a wide range of forms.

For example, if your dataset is larger than memory, you can create a table from an Iterator[pyarrow.RecordBatch] so that the data is generated lazily:


from typing import Iterable
import pyarrow as pa
import lancedb

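def make_batches() -> Iterable[pa.RecordBatch]:
    # Yield five small batches; each one is built only when the consumer asks for it.
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                # Explicit types so that every batch matches the schema declared below.
                pa.array([[3.1, 4.1], [5.9, 26.5]], type=pa.list_(pa.float32())),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0], type=pa.float32()),
            ],
            ["vector", "item", "price"])

schema = pa.schema([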
    pa.field("vector", pa.list_(pa.float32())),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])

table = db.create_table("iterable_table", data=make_batches(), schema=schema)
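To illustrate why an iterator keeps memory use bounded, here is a minimal pure-Python sketch of the same lazy pattern. The names `iter_chunks`, `generate_rows`, and `batch_size` are hypothetical and are not part of LanceDB or PyArrow. Each yielded chunk is a plain list of row dicts, which could then be converted to a record batch (for instance with `pa.RecordBatch.from_pylist`, assuming a PyArrow version that provides it).

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def iter_chunks(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Hypothetical helper: yield lists of at most batch_size rows, one at a time."""
    it = iter(rows)
    while True:
        # Pull only the next batch_size rows from the source; nothing else is held.
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def generate_rows(n: int) -> Iterator[Row]:
    """Stand-in for a large data source that produces rows on demand."""
    for i in range(n):
        yield {"vector": [float(i), float(i + 1)], "item": f"item-{i}", "price": 10.0 + i}


# Consume the chunks one by one; only a single chunk exists in memory at a time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(generate_rows(5), batch_size=2)]
print(chunk_sizes)
```

With five rows and a batch size of two, the generator produces two full chunks and one final chunk with a single row.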

Detailed instructions for creating a dataset are in the Basic Operations and API sections.

We can now perform a similarity search via the LanceDB Python API.

# Open the table previously created.
table = db.open_table("pd_table")

query_vector = [100, 100]
# Pandas DataFrame
df = table.search(query_vector).limit(1).to_df()
print(df)
        vector  item  price    _distance
0  [5.9, 26.5]   bar   20.0  14257.05957
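The `_distance` value can be checked by hand. The sketch below assumes the metric in use is squared Euclidean (L2) distance, which is consistent with the number printed above. The helper `squared_l2` is illustrative only and is not a LanceDB function.

```python
def squared_l2(a, b):
    """Sum of squared per-dimension differences (assumed metric, see lead-in)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [100, 100]
rows = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}

# Compute the distance from the query to every stored vector.
distances = {item: squared_l2(query, vec) for item, vec in rows.items()}

# Smaller _distance means closer, so the best match is the minimum.
nearest = min(distances, key=distances.get)
print(nearest, round(distances[nearest], 2))
```

Under that assumption, "bar" comes out nearest at roughly 14257.06 (94.1² + 73.5²), while "foo" is farther away at roughly 18586.42. The small difference from the printed 14257.05957 would be explained by the table storing 32-bit floats.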

If you have a simple filter, it's faster to provide a where clause to LanceDB's search query. If you have more complex criteria, you can always apply the filter to the resulting Pandas DataFrame.


# Apply the filter via LanceDB
results = table.search([100, 100]).where("price < 15").to_df()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"

# Apply the filter via Pandas
df = table.search([100, 100]).to_df()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"