mirror of https://github.com/lancedb/lancedb.git synced 2026-01-07 12:22:59 +00:00

Chang She 371d2f979e feat(python): add option to flatten output in to_pandas (#722 )

Closes https://github.com/lancedb/lance/issues/1738

We add a `flatten` parameter to the signature of `to_pandas`. By default
this is None and does nothing.
If set to True or -1, then LanceDB will flatten structs before
converting to a pandas dataframe. All nested structs are also flattened.
If set to any positive integer, then LanceDB will flatten structs up to
the specified level of nesting.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>

2023-12-20 12:23:07 -08:00

5.8 KiB

Raw Blame History

Vector Search

Vector search finds the vectors in the database that are nearest to a query vector. In a recommendation system or search engine, you can use it to find products similar to the one you searched for. In LLM and other AI applications, each data point can be represented by an embedding generated by a model, and vector search returns the most relevant items.

Searching a high-dimensional vector space means finding the K nearest neighbors (KNN) of the query vector.

Metric

In LanceDB, a metric defines how the distance between a pair of vectors is computed. The following metrics are currently supported:

| Metric | Description |
| ------ | ----------- |
| `L2` | Euclidean / L2 distance |
| `Cosine` | Cosine similarity |
| `Dot` | Dot product |
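
To make the three metrics concrete, here is a minimal NumPy sketch that computes each quantity for a pair of small vectors. The vectors `a` and `b` are made-up values chosen for easy arithmetic, and the code only illustrates the underlying math; it is not LanceDB's implementation.

```python
import numpy as np

# Two small example vectors (hypothetical values, chosen for easy arithmetic).
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 0.0])

# Euclidean / L2 distance: length of the difference vector.
l2 = np.linalg.norm(a - b)

# Dot product: sum of element-wise products; sensitive to vector magnitude.
dot = np.dot(a, b)

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(l2, dot, cosine)
```

For normalized embeddings, cosine similarity and dot product rank neighbors identically, so the choice between them mostly matters when vector magnitudes differ.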

Search

Flat Search

If you do not create a vector index, LanceDB exhaustively scans the entire vector column (a flat search) and computes the distance to every vector to find the closest matches. This is effectively a KNN search.

=== "Python"

```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")

tbl = db.open_table("my_vectors")

# to_list() returns a list of Python dicts, one per matching row
results = tbl.search(np.random.random(1536)) \
    .limit(10) \
    .to_list()
```

=== "JavaScript"

```javascript
const vectordb = require('vectordb')
const db = await vectordb.connect('data/sample-lancedb')

const tbl = await db.openTable("my_vectors")

const results_1 = await tbl.search(Array(1536).fill(1.2))
    .limit(10)
    .execute()
```

By default, `l2` is used as the metric type. You can also specify a different metric type:

=== "Python"

```python
# Same flat search, but ranked by the cosine metric instead of the default l2
results = tbl.search(np.random.random(1536)) \
    .metric("cosine") \
    .limit(10) \
    .to_list()
```

=== "JavaScript"

```javascript
const results_2 = await tbl.search(Array(1536).fill(1.2))
    .metricType("cosine")
    .limit(10)
    .execute()
```
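
Conceptually, a flat search computes the distance from the query to every stored vector and keeps the `k` smallest. The following NumPy sketch shows that idea with the L2 metric. The helper `flat_search` and the random data are hypothetical and only stand in for a table's vector column; this is not how LanceDB implements the scan internally.

```python
import numpy as np

def flat_search(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return the row indices of the k vectors closest to `query` (L2 distance)."""
    # Distance from the query to every row: this is the exhaustive scan.
    distances = np.linalg.norm(vectors - query, axis=1)
    # Sort all distances and keep the k smallest.
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
vectors = rng.random((1000, 16))   # stand-in for a vector column
query = vectors[42]                # query with a vector that is in the dataset

top = flat_search(vectors, query, k=10)
print(top)
```

Because the query is itself a row of the dataset, the first returned index is that row, at distance zero. The cost grows linearly with the number of rows, which is what a vector index is meant to avoid.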

Approximate Nearest Neighbor (ANN) Search with Vector Index

To accelerate vector retrievals, it is common to build vector indices. A vector index is a data structure specifically designed to efficiently organize and search vector data based on their similarity via the chosen distance metric. By constructing a vector index, you can reduce the search space and avoid the need for brute-force scanning of the entire vector column.

However, fast vector search using an index usually trades away some accuracy. This is why it is called Approximate Nearest Neighbor (ANN) search, whereas flat search (KNN) always achieves 100% recall.
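
Recall here means the fraction of the true nearest neighbors that a search actually returns. The sketch below measures it in NumPy. The `recall_at_k` helper is hypothetical, and the "approximate" search is a deliberately crude stand-in that scans only every other row; a real ANN index narrows the search space far more cleverly.

```python
import numpy as np

def recall_at_k(exact_ids, approx_ids) -> float:
    """Fraction of the exact top-k neighbors that also appear in the approximate result."""
    exact = set(np.asarray(exact_ids).tolist())
    approx = set(np.asarray(approx_ids).tolist())
    return len(exact & approx) / len(exact)

rng = np.random.default_rng(0)
vectors = rng.random((1000, 16))
query = rng.random(16)
k = 10

# Exact (flat) search: rank every row by L2 distance.
distances = np.linalg.norm(vectors - query, axis=1)
exact_ids = np.argsort(distances)[:k]

# Crude stand-in for an index: only consider every other row.
candidate_ids = np.arange(0, len(vectors), 2)
approx_ids = candidate_ids[np.argsort(distances[candidate_ids])[:k]]

recall_exact = recall_at_k(exact_ids, exact_ids)    # flat search vs. itself
recall_approx = recall_at_k(exact_ids, approx_ids)  # reduced search space
print(recall_exact, recall_approx)
```

Scanning fewer rows is faster, but any true neighbor outside the candidate set is missed, which lowers recall.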

See ANN Index for more details.

Output formats

LanceDB can return results in several formats commonly used in Python. Let's create a LanceDB table with a nested schema:

```python
from datetime import datetime

import lancedb
import numpy as np
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel

uri = "data/sample-lancedb-nested"
# Connect to the database; the original snippet used `db` without defining it
db = lancedb.connect(uri)

class Metadata(BaseModel):
    source: str
    timestamp: datetime

class Document(BaseModel):
    content: str
    meta: Metadata

class LanceSchema(LanceModel):
    id: str
    vector: Vector(1536)
    payload: Document

# Let's add 100 sample rows to our dataset
data = [LanceSchema(
    id=f"id{i}",
    vector=np.random.randn(1536),
    payload=Document(
        content=f"document{i}", meta=Metadata(source=f"source{i%10}", timestamp=datetime.now())
    ),
) for i in range(100)]

tbl = db.create_table("documents", data=data)
```

As a pyarrow table

Using `to_arrow()` we can get the results back as a pyarrow Table. This result table has the same columns as the LanceDB table, with the addition of a `_distance` column for vector search or a `score` column for full text search.

```python
tbl.search(np.random.randn(1536)).to_arrow()
```

As a pandas dataframe

You can also get the results as a pandas dataframe.

```python
tbl.search(np.random.randn(1536)).to_pandas()
```

While other formats like Arrow, Pydantic, and Python dicts have a natural way to handle nested schemas, pandas can only store nested data as a column of Python dicts, which makes nested fields awkward to reference. For convenience, you can tell LanceDB to flatten a nested schema when creating the pandas dataframe.

```python
tbl.search(np.random.randn(1536)).to_pandas(flatten=True)
```

If your table has a deeply nested struct, you can control how many levels of nesting to flatten by passing in a positive integer.

```python
tbl.search(np.random.randn(1536)).to_pandas(flatten=1)
```
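
To illustrate what full and level-limited flattening mean, here is a small pure-Python sketch that applies the same idea to a single nested row. The helper `flatten_record`, its `max_depth` parameter, and the dot-separated column names are assumptions made for illustration; they are not LanceDB's implementation, and the exact column names LanceDB produces may differ.

```python
from typing import Any, Dict, Optional

def flatten_record(record: Dict[str, Any], max_depth: Optional[int] = None,
                   sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into one level; max_depth=None flattens every level."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict) and (max_depth is None or max_depth > 0):
            # Descend one level, spending one unit of the remaining depth budget.
            next_depth = None if max_depth is None else max_depth - 1
            for child_key, child_value in flatten_record(value, next_depth, sep).items():
                flat[f"{key}{sep}{child_key}"] = child_value
        else:
            flat[key] = value
    return flat

# A row shaped like the nested schema above (values are hypothetical).
row = {"id": "id0", "payload": {"content": "document0", "meta": {"source": "source0"}}}

fully_flat = flatten_record(row)              # analogous to flatten=True
one_level = flatten_record(row, max_depth=1)  # analogous to flatten=1

print(sorted(fully_flat))
print(sorted(one_level))
```

With one level of flattening, `payload` is split into its children but `meta` remains a nested value; flattening every level splits `meta` as well.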

As a list of python dicts

You can also return results as a list of Python dicts.

```python
tbl.search(np.random.randn(1536)).to_list()
```

As a list of pydantic models

Just as we can add data using pydantic models, we can retrieve results as pydantic models:

```python
tbl.search(np.random.randn(1536)).to_pydantic(LanceSchema)
```

Note that in this case the extra `_distance` field is discarded since it's not part of `LanceSchema`.
