Pandas and PyArrow
Built on top of Apache Arrow,
LanceDB is easy to integrate with the Python ecosystem, including Pandas
and PyArrow.
Create dataset
First, we need to connect to a LanceDB database.
import lancedb
db = lancedb.connect("data/sample-lancedb")
Afterwards, we write a Pandas DataFrame to LanceDB directly.
import pandas as pd
data = pd.DataFrame({
"vector": [[3.1, 4.1], [5.9, 26.5]],
"item": ["foo", "bar"],
"price": [10.0, 20.0]
})
table = db.create_table("pd_table", data=data)
Similar to pyarrow.dataset.write_dataset(),
db.create_table() accepts data in a wide range of forms.
For example, if your dataset is larger than memory, you can create a table from an Iterator[pyarrow.RecordBatch]
to generate the data lazily:
from typing import Iterable
import pyarrow as pa
import lancedb
def make_batches() -> Iterable[pa.RecordBatch]:
    for i in range(5):
        yield pa.RecordBatch.from_arrays(
            [
                # Explicit types so each batch matches the schema declared below.
                pa.array([[3.1, 4.1], [5.9, 26.5]], type=pa.list_(pa.float32())),
                pa.array(["foo", "bar"]),
                pa.array([10.0, 20.0], type=pa.float32()),
            ],
            ["vector", "item", "price"])

schema = pa.schema([
    pa.field("vector", pa.list_(pa.float32())),
    pa.field("item", pa.utf8()),
    pa.field("price", pa.float32()),
])

table = db.create_table("iterable_table", data=make_batches(), schema=schema)
Detailed instructions for creating a dataset are in the Basic Operations and API sections.
Vector Search
We can now perform a similarity search via the LanceDB Python API.
# Open the table previously created.
table = db.open_table("pd_table")
query_vector = [100, 100]
# Pandas DataFrame
df = table.search(query_vector).limit(1).to_df()
print(df)
        vector item  price    _distance
0  [5.9, 26.5]  bar   20.0  14257.05957
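The `_distance` value in this output is consistent with the squared Euclidean (L2) distance between the query and the stored vector. Assuming that is the metric in use (it is not stated here), a plain-Python check of the two rows might look like this; `squared_l2` is a hypothetical helper, not part of LanceDB.

```python
def squared_l2(a, b):
    """Sum of squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

query_vector = [100, 100]
rows = {"foo": [3.1, 4.1], "bar": [5.9, 26.5]}

# Distance from the query to every stored vector.
distances = {item: squared_l2(query_vector, vec) for item, vec in rows.items()}

# Smaller means closer, so the best match has the minimum distance.
nearest = min(distances, key=distances.get)
print(nearest, round(distances[nearest], 2))  # bar 14257.06
```

The small difference from 14257.05957 is presumably due to the vectors being stored and compared as 32-bit floats.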
If you have a simple filter, it's faster to provide a where clause to LanceDB's search query.
If you have more complex criteria, you can always apply the filter to the resulting Pandas DataFrame.
# Apply the filter via LanceDB
results = table.search([100, 100]).where("price < 15").to_df()
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
# Apply the filter via Pandas
df = table.search([100, 100]).to_df()
results = df[df.price < 15]
assert len(results) == 1
assert results["item"].iloc[0] == "foo"
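For criteria that are awkward to express as a single where clause, filtering the returned DataFrame might look like the sketch below. The DataFrame is a hand-written stand-in for the search result (with rounded `_distance` values), so the example runs without a database.

```python
import pandas as pd

# Hand-written stand-in for table.search([100, 100]).to_df()
df = pd.DataFrame({
    "vector": [[5.9, 26.5], [3.1, 4.1]],
    "item": ["bar", "foo"],
    "price": [20.0, 10.0],
    "_distance": [14257.06, 18586.42],
})

# Combine a numeric condition with a string condition.
mask = (df["price"] < 15) & df["item"].str.startswith("f")

# Keep matching rows, closest first.
results = df[mask].sort_values("_distance")
print(results[["item", "price", "_distance"]])
```

Note that filtering after the search only sees the rows the search returned, so a small `limit` can hide rows that would otherwise match.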