Mirror of https://github.com/lancedb/lancedb.git (synced 2025-12-24 22:09:58 +00:00)
Compare commits
21 Commits
| SHA1 |
|---|
| afa7fe19e6 |
| 66080d791b |
| 5554fddd54 |
| f06ea935fe |
| a8db7f56d2 |
| 7a375185a1 |
| 6592b4c13b |
| 72a44eb927 |
| b0e578c609 |
| 89e6232aeb |
| 44ea687984 |
| 4f2dae8a0d |
| 5e748e6e70 |
| 177192f852 |
| 1fb596942f |
| 73d3cb78e6 |
| a1583444ec |
| 78e4f4d1a8 |
| b92eb988b6 |
| 0cd092814d |
| a6294925df |
@@ -8,7 +8,7 @@
 <a href="https://lancedb.github.io/lancedb/">Documentation</a> •
 <a href="https://blog.eto.ai/">Blog</a> •
 <a href="https://discord.gg/zMM32dvNtd">Discord</a> •
-<a href="https://twitter.com/etodotai">Twitter</a>
+<a href="https://twitter.com/lancedb">Twitter</a>
 
 </p>
 </div>
@@ -15,6 +15,7 @@ nav:
 - Home: index.md
 - Basics: basic.md
 - Embeddings: embedding.md
 - Indexing: ann_indexes.md
+- Integrations: integrations.md
 - Python API: python.md
@@ -1,12 +1,18 @@
 # ANN (Approximate Nearest Neighbor) Indexes
 
-You can create an index over your vector data to make search faster. Vector indexes are faster but less accurate than exhaustive search. LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
+You can create an index over your vector data to make search faster.
+Vector indexes are faster but less accurate than exhaustive search.
+LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
 
-Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration.
+Currently, LanceDB does *not* automatically create the ANN index.
+LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
+If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
+
+In the future we will look to automatically create and configure the ANN index.
 
 ## Creating an ANN Index
 
-Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function.
+Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
 
 ```python
 import lancedb
@@ -28,11 +34,12 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96)
 Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
 creation by providing the following parameters:
 
+- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
 - **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
 with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
 A higher number leads to faster queries, but it makes index generation slower.
 - **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
 search more accurate, but also makes the index larger and slower to build.
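As a concrete instance of the sizing guidance above, the call below indexes a table of roughly one million vectors (the table handle `tbl` and the dataset size are assumed for illustration; the keyword arguments match the new `create_index` signature shown later in this compare):

```python
# ~1M vectors / 256 partitions keeps each partition in the 3-5K vector range.
# metric="cosine" is the alternative to the default "L2" (euclidean) distance.
tbl.create_index(
    metric="cosine",
    num_partitions=256,   # more partitions: faster queries, slower index build
    num_sub_vectors=96,   # more sub-vectors: better recall, larger/slower index
)
```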
## Querying an ANN Index
@@ -41,15 +48,21 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance
 There are a couple of parameters that can be used to fine-tune the search:
 
 - **limit** (default: 10): The amount of results that will be returned
-- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
-- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory. A higher number makes
-search more accurate but also slower.
+- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.<br/>
+Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.<br/>
+e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
+Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
+- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
+A higher number makes search more accurate but also slower. If you find the recall is less than idea, try refine_factor=10 to start.<br/>
+e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
+Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
 
 ```python
 tbl.search(np.random.random((768))) \
     .limit(2) \
     .nprobes(20) \
-    .refine_factor(20) \
+    .refine_factor(10) \
     .to_df()
 
 vector item score
@@ -57,7 +70,9 @@ tbl.search(np.random.random((768))) \
 1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
 ```
 
-The search will return the data requested in addition to the score of each item. The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
+The search will return the data requested in addition to the score of each item.
+
+**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
 
 ### Filtering (where clause)
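The `where()` builder method behind this heading accepts a SQL-style predicate. A minimal sketch of a filtered search (the query vector and filter value are illustrative; the method names come from the LanceQueryBuilder changes and tests later in this compare):

```python
# combine vector search with a metadata filter, then materialize as a DataFrame
tbl.search(np.random.random((768))) \
    .where("id = 2") \
    .limit(10) \
    .to_df()
```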
@@ -1,6 +1,6 @@
 # Welcome to LanceDB's Documentation
 
-LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrevial, filtering and management of embeddings.
+LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrivial, filtering and management of embeddings.
 
 The key features of LanceDB include:
 
@@ -42,5 +42,6 @@ We will be adding completed demo apps built using LanceDB.
 ## Documentation Quick Links
 * [`Basic Operations`](basic.md) - basic functionality of LanceDB.
 * [`Embedding Functions`](embedding.md) - functions for working with embeddings.
 * [`Indexing`](ann_indexes.md) - create vector indexes to speed up queries.
+* [`Ecosystem Integrations`](integrations.md) - integrating LanceDB with python data tooling ecosystem.
 * [`API Reference`](python.md) - detailed documentation for the LanceDB Python SDK.
@@ -1,7 +1,6 @@
 {
 "cells": [
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "42bf01fb",
 "metadata": {},
@@ -22,10 +21,10 @@
 "output_type": "stream",
 "text": [
 "\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
+"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
 "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
 "\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
+"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
 "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
 ]
 }
@@ -88,7 +87,6 @@
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "5ac2b6a3",
 "metadata": {},
@@ -231,7 +229,6 @@
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "2106b5bb",
 "metadata": {},
@@ -251,7 +248,7 @@
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
-"model_id": "39f3161f3ef54a129cd65fb296332b54",
+"model_id": "c6f1c76d9567421d88911923388d2530",
 "version_major": 2,
 "version_minor": 0
 },
@@ -574,7 +571,6 @@
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "559a095b",
 "metadata": {},
@@ -631,7 +627,7 @@
 " <iframe\n",
 " width=\"400\"\n",
 " height=\"300\"\n",
-" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289.76\"\n",
+" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289\"\n",
 " frameborder=\"0\"\n",
 " allowfullscreen\n",
 " \n",
@@ -639,7 +635,7 @@
 " "
 ],
 "text/plain": [
-"<IPython.lib.display.YouTubeVideo at 0x177fde4d0>"
+"<IPython.lib.display.YouTubeVideo at 0x13ec062c0>"
 ]
 },
 "execution_count": 15,
@@ -651,7 +647,7 @@
 "from IPython.display import YouTubeVideo\n",
 "\n",
 "top_match = context.iloc[0]\n",
-"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=top_match[\"start\"])"
+"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=int(top_match[\"start\"]))"
 ]
 },
 {
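The last notebook hunk casts the clip's start offset to an integer before building the embed URL. A standalone restatement of why (the `context` DataFrame and its columns mirror the notebook cell above; 289.76 is the fractional value visible in the old iframe src):

```python
from IPython.display import YouTubeVideo

# The transcript index stores fractional timestamps (e.g. 289.76 seconds),
# but YouTube's embed "start" parameter only accepts whole seconds, so the
# offset is truncated before the player URL is generated.
top_match = context.iloc[0]
YouTubeVideo(top_match["url"].split("/")[-1], start=int(top_match["start"]))
```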
@@ -12,6 +12,8 @@
 # limitations under the License.
 
 import math
+import sys
 
 from retry import retry
 from typing import Callable, Union
 
@@ -64,13 +66,17 @@ class EmbeddingFunction:
             return self.func(c.tolist())
 
         if len(self.rate_limiter_kwargs) > 0:
-            import ratelimiter
+            v = int(sys.version_info.minor)
+            if v >= 11:
+                print("WARNING: rate limit only support up to 3.10, proceeding without rate limiter")
+            else:
+                import ratelimiter
 
-            max_calls = self.rate_limiter_kwargs["max_calls"]
-            limiter = ratelimiter.RateLimiter(
-                max_calls, period=self.rate_limiter_kwargs["period"]
-            )
-            embed_func = limiter(embed_func)
+                max_calls = self.rate_limiter_kwargs["max_calls"]
+                limiter = ratelimiter.RateLimiter(
+                    max_calls, period=self.rate_limiter_kwargs["period"]
+                )
+                embed_func = limiter(embed_func)
         batches = self.to_batches(text)
         embeds = [emb for c in batches for emb in embed_func(c)]
         return embeds
@@ -79,11 +85,6 @@ class EmbeddingFunction:
         return f"EmbeddingFunction(func={self.func})"
 
     def rate_limit(self, max_calls=0.9, period=1.0):
-        import sys
-
-        v = int(sys.version_info.minor)
-        if v >= 11:
-            raise ValueError("rate limit only support up to 3.10")
         self.rate_limiter_kwargs = dict(max_calls=max_calls, period=period)
         return self
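The net effect of the two hunks above is that the interpreter check moves out of `rate_limit()` (where it used to raise) and into call time, where it now degrades to an unthrottled call on Python 3.11+. A standalone sketch of the same pattern, outside the lancedb class (the `throttle` helper and the embedding callable are illustrative only, not lancedb API):

```python
import sys


def throttle(func, max_calls=0.9, period=1.0):
    """Wrap func with the ratelimiter package when it is usable (Python <= 3.10)."""
    if sys.version_info.minor >= 11:
        # ratelimiter does not support Python 3.11+; fall back to the bare callable.
        print("WARNING: rate limit only support up to 3.10, proceeding without rate limiter")
        return func
    import ratelimiter

    # a RateLimiter instance works as a decorator: calls beyond max_calls per
    # period block until the window rolls over
    return ratelimiter.RateLimiter(max_calls, period=period)(func)


# usage sketch: embed_batch would be an embedding callable taking a list of strings
# throttled = throttle(embed_batch, max_calls=0.9, period=1.0)
```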
@@ -24,6 +24,7 @@ class LanceQueryBuilder:
     """
 
     def __init__(self, table: "lancedb.table.LanceTable", query: np.ndarray):
+        self._metric = "L2"
         self._nprobes = 20
         self._refine_factor = None
         self._table = table
@@ -77,6 +78,21 @@ class LanceQueryBuilder:
         self._where = where
         return self
 
+    def metric(self, metric: str) -> LanceQueryBuilder:
+        """Set the distance metric to use.
+
+        Parameters
+        ----------
+        metric: str
+            The distance metric to use. By default "l2" is used.
+
+        Returns
+        -------
+        The LanceQueryBuilder object.
+        """
+        self._metric = metric
+        return self
+
     def nprobes(self, nprobes: int) -> LanceQueryBuilder:
         """Set the number of probes to use.
 
@@ -108,7 +124,12 @@ class LanceQueryBuilder:
         return self
 
     def to_df(self) -> pd.DataFrame:
-        """Execute the query and return the results as a pandas DataFrame."""
+        """
+        Execute the query and return the results as a pandas DataFrame.
+        In addition to the selected columns, LanceDB also returns a vector
+        and also the "score" column which is the distance between the query
+        vector and the returned vector.
+        """
         ds = self._table.to_lance()
         # TODO indexed search
         tbl = ds.to_table(
@@ -118,6 +139,7 @@
             "column": VECTOR_COLUMN_NAME,
             "q": self._query,
             "k": self._limit,
+            "metric": self._metric,
             "nprobes": self._nprobes,
             "refine_factor": self._refine_factor,
         },
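With the new builder method and the `metric` key threaded into the query parameters, switching a search to cosine distance is a one-liner. A brief usage sketch (the table handle and the two-element query vector are assumed for illustration):

```python
# "score" will hold cosine distances instead of the default L2 (euclidean) distances
df = tbl.search([4, 8]).metric("cosine").limit(1).to_df()
```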
@@ -106,11 +106,14 @@
     def _dataset_uri(self) -> str:
         return os.path.join(self._conn.uri, f"{self.name}.lance")
 
-    def create_index(self, num_partitions=256, num_sub_vectors=96):
+    def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
         """Create an index on the table.
 
         Parameters
         ----------
+        metric: str, default "L2"
+            The distance metric to use when creating the index. Valid values are "L2" or "cosine".
+            L2 is euclidean distance.
         num_partitions: int
             The number of IVF partitions to use when creating the index.
             Default is 256.
@@ -121,6 +124,7 @@ class LanceTable:
         self._dataset.create_index(
             column=VECTOR_COLUMN_NAME,
             index_type="IVF_PQ",
+            metric=metric,
             num_partitions=num_partitions,
             num_sub_vectors=num_sub_vectors,
         )
@@ -166,6 +170,9 @@ class LanceTable:
         Returns
         -------
         A LanceQueryBuilder object representing the query.
+        Once executed, the query returns selected columns, the vector,
+        and also the "score" column which is the distance between the query
+        vector and the returned vector.
         """
         if isinstance(query, list):
             query = np.array(query)
@@ -1,7 +1,7 @@
 [project]
 name = "lancedb"
-version = "0.1"
-dependencies = ["pylance>=0.4.3", "ratelimiter", "retry", "tqdm"]
+version = "0.1.1"
+dependencies = ["pylance>=0.4.4", "ratelimiter", "retry", "tqdm"]
 description = "lancedb"
 authors = [
     { name = "Lance Devs", email = "dev@eto.ai" },
@@ -14,7 +14,9 @@
 import lance
 from lancedb.query import LanceQueryBuilder
 
 import numpy as np
 import pandas as pd
+import pandas.testing as tm
 import pyarrow as pa
 
 import pytest
@@ -60,3 +62,21 @@ def test_query_builder_with_filter(table):
     df = LanceQueryBuilder(table, [0, 0]).where("id = 2").to_df()
     assert df["id"].values[0] == 2
     assert all(df["vector"].values[0] == [3, 4])
+
+
+def test_query_builder_with_metric(table):
+    query = [4, 8]
+    df_default = LanceQueryBuilder(table, query).to_df()
+    df_l2 = LanceQueryBuilder(table, query).metric("l2").to_df()
+    tm.assert_frame_equal(df_default, df_l2)
+
+    df_cosine = LanceQueryBuilder(table, query).metric("cosine").limit(1).to_df()
+    assert df_cosine.score[0] == pytest.approx(
+        cosine_distance(query, df_cosine.vector[0]),
+        abs=1e-6,
+    )
+    assert 0 <= df_cosine.score[0] <= 1
+
+
+def cosine_distance(vec1, vec2):
+    return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
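One detail worth noting about the new test's final assertion: cosine distance as defined in `cosine_distance` ranges over [0, 2] in general, and the `0 <= score <= 1` bound only holds when the vectors involved point into the same half-space, as the non-negative [4, 8] query and [3, 4] fixture vector here do. A quick check of both cases (pure NumPy, independent of lancedb):

```python
import numpy as np

def cosine_distance(vec1, vec2):
    return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print(cosine_distance([4, 8], [3, 4]))   # ~0.016: similar directions stay in [0, 1]
print(cosine_distance([1, 0], [-1, 0]))  # 2.0: opposite directions reach the upper bound
```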