Compare commits

...

21 Commits
v0.1 ... v0.1.1

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Chang She | afa7fe19e6 | bump version for v0.1.1 | 2023-04-26 16:55:25 -07:00 |
| Chang She | 66080d791b | Merge pull request #46 from lancedb/changhiskhan/improve-index-docs | 2023-04-25 21:13:51 -07:00 |
| Chang She | 5554fddd54 | Merge branch 'main' into changhiskhan/improve-index-docs | 2023-04-25 21:04:01 -07:00 |
| Chang She | f06ea935fe | Merge pull request #47 from lancedb/changhiskhan/expose-metric: Make distance metric configurable in LanceDB | 2023-04-25 21:02:59 -07:00 |
| Chang She | a8db7f56d2 | tolerance | 2023-04-25 20:08:18 -07:00 |
| Chang She | 7a375185a1 | increment lance version to include cosine distance fix | 2023-04-25 19:57:58 -07:00 |
| Chang She | 6592b4c13b | document metric in create_index | 2023-04-24 22:46:21 -07:00 |
| Chang She | 72a44eb927 | specify metric during index creation | 2023-04-24 22:45:37 -07:00 |
| Chang She | b0e578c609 | add documentation for metric | 2023-04-24 22:42:30 -07:00 |
| Chang She | 89e6232aeb | Make distance metric configurable during search | 2023-04-24 22:40:40 -07:00 |
| Chang She | 44ea687984 | Merge pull request #45 from lancedb/changhiskhan/notebook-fix: Minor notebook fix. Closes #40 | 2023-04-24 20:12:03 -07:00 |
| Chang She | 4f2dae8a0d | Add more detailed docs for the ANN index and search features | 2023-04-24 19:19:55 -07:00 |
| Chang She | 5e748e6e70 | Minor notebook fix. Closes #40 | 2023-04-24 18:46:05 -07:00 |
| Chang She | 177192f852 | Merge pull request #37 from lancedb/gsilvestrin/ratelimit_3.11: skipping embeddings rate limit when python version > 3.10 | 2023-04-22 21:03:18 -07:00 |
| Lei Xu | 1fb596942f | Merge pull request #39 from wilhelmjung/patch-1: Update index.md | 2023-04-22 20:32:36 -07:00 |
| YangWeiliang_DeepNova@Deepexi | 73d3cb78e6 | Update index.md: Just a typo. Fixed. | 2023-04-23 09:52:23 +08:00 |
| gsilvestrin | a1583444ec | add ann_index to main doc page | 2023-04-20 16:07:25 -07:00 |
| gsilvestrin | 78e4f4d1a8 | add ann_index to main doc page | 2023-04-20 13:19:10 -07:00 |
| gsilvestrin | b92eb988b6 | add ann_index to main doc page | 2023-04-20 11:51:42 -07:00 |
| gsilvestrin | 0cd092814d | skipping rate limit when python version > 3.10 | 2023-04-20 10:28:14 -07:00 |
| Jai | a6294925df | Update README.md | 2023-04-20 10:22:03 -07:00 |
10 changed files with 102 additions and 39 deletions

View File

@@ -8,7 +8,7 @@
<a href="https://lancedb.github.io/lancedb/">Documentation</a>
<a href="https://blog.eto.ai/">Blog</a>
<a href="https://discord.gg/zMM32dvNtd">Discord</a>
<a href="https://twitter.com/etodotai">Twitter</a>
<a href="https://twitter.com/lancedb">Twitter</a>
</p>
</div>

View File

@@ -15,6 +15,7 @@ nav:
- Home: index.md
- Basics: basic.md
- Embeddings: embedding.md
- Indexing: ann_indexes.md
- Integrations: integrations.md
- Python API: python.md

View File

@@ -1,12 +1,18 @@
# ANN (Approximate Nearest Neighbor) Indexes
You can create an index over your vector data to make search faster.
Vector indexes are faster but less accurate than exhaustive search.
LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration.
Currently, LanceDB does *not* automatically create the ANN index.
LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
In the future we will look to automatically create and configure the ANN index.
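As a rough illustration of the no-index workflow described above, here is a minimal sketch of brute-force search on a small table. It assumes the `lancedb.connect`/`open_table` connection API and a hypothetical table name:

```python
# Minimal sketch: exhaustive (flat) KNN on a small table, with no create_index() call.
# "my_vectors" and its columns are hypothetical.
import lancedb
import numpy as np

db = lancedb.connect("./data/sample-lancedb")
tbl = db.open_table("my_vectors")  # e.g. <100K rows of 768-dim vectors

# Exhaustive scan: slower than an ANN index, but exact (100% recall).
df = tbl.search(np.random.random(768)).limit(10).to_df()
print(df)
```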
## Creating an ANN Index
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function.
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
```python
import lancedb
@@ -28,11 +34,12 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96)
Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
creation by providing the following parameters:
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
with ~1M vectors should use 256 partitions. You can specify an arbitrary number of partitions, but powers of 2 are most conventional.
A higher number leads to faster queries, but it makes index generation slower.
- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
- **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
search more accurate, but also makes the index larger and slower to build.
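Tying the parameters above together, a hedged example of creating an index with the new `metric` argument (the connection and table name are hypothetical; only the `create_index` signature comes from this release):

```python
# Sketch: create an IVF_PQ index with an explicit distance metric.
# ~1M rows / 256 partitions keeps each partition in the 3-5K vector range;
# 96 sub-vectors controls the PQ code size.
import lancedb

db = lancedb.connect("./data/sample-lancedb")
tbl = db.open_table("my_vectors")  # hypothetical table of ~1M 768-dim vectors

tbl.create_index(metric="cosine", num_partitions=256, num_sub_vectors=96)
```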
## Querying an ANN Index
@@ -41,15 +48,21 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance
There are a couple of parameters that can be used to fine-tune the search:
- **limit** (default: 10): The number of results that will be returned
- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.<br/>
Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.<br/>
e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
A higher number makes search more accurate but also slower. If you find the recall is less than ideal, try refine_factor=10 to start.<br/>
e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
```python
tbl.search(np.random.random((768))) \
.limit(2) \
.nprobes(20) \
.refine_factor(20) \
.refine_factor(10) \
.to_df()
vector item score
@@ -57,7 +70,9 @@ tbl.search(np.random.random((768))) \
1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
```
The search will return the data requested in addition to the score of each item.
**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
### Filtering (where clause)

View File

@@ -1,6 +1,6 @@
# Welcome to LanceDB's Documentation
LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrevial, filtering and management of embeddings.
LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrieval, filtering and management of embeddings.
The key features of LanceDB include:
@@ -42,5 +42,6 @@ We will be adding completed demo apps built using LanceDB.
## Documentation Quick Links
* [`Basic Operations`](basic.md) - basic functionality of LanceDB.
* [`Embedding Functions`](embedding.md) - functions for working with embeddings.
* [`Indexing`](ann_indexes.md) - create vector indexes to speed up queries.
* [`Ecosystem Integrations`](integrations.md) - integrating LanceDB with the Python data tooling ecosystem.
* [`API Reference`](python.md) - detailed documentation for the LanceDB Python SDK.

View File

@@ -1,7 +1,6 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "42bf01fb",
"metadata": {},
@@ -22,10 +21,10 @@
"output_type": "stream",
"text": [
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
"\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
]
}
@@ -88,7 +87,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5ac2b6a3",
"metadata": {},
@@ -231,7 +229,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "2106b5bb",
"metadata": {},
@@ -251,7 +248,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "39f3161f3ef54a129cd65fb296332b54",
"model_id": "c6f1c76d9567421d88911923388d2530",
"version_major": 2,
"version_minor": 0
},
@@ -574,7 +571,6 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "559a095b",
"metadata": {},
@@ -631,7 +627,7 @@
" <iframe\n",
" width=\"400\"\n",
" height=\"300\"\n",
" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289.76\"\n",
" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" \n",
@@ -639,7 +635,7 @@
" "
],
"text/plain": [
"<IPython.lib.display.YouTubeVideo at 0x177fde4d0>"
"<IPython.lib.display.YouTubeVideo at 0x13ec062c0>"
]
},
"execution_count": 15,
@@ -651,7 +647,7 @@
"from IPython.display import YouTubeVideo\n",
"\n",
"top_match = context.iloc[0]\n",
"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=top_match[\"start\"])"
"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=int(top_match[\"start\"]))"
]
},
{

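For context on the notebook change above: YouTube's embed player expects an integer `start` offset in seconds, so the float coming out of the transcript data is cast to `int`. A small hedged sketch of the same fix:

```python
# Sketch of the notebook fix: cast a float start time to int before embedding.
from IPython.display import YouTubeVideo

video_id = "pNvujJ1XyeQ"  # taken from the embed URL above
start = 289.76            # e.g. a float offset pulled from transcript data
YouTubeVideo(video_id, start=int(start))  # embed URL ends up with ?start=289
```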
View File

@@ -12,6 +12,8 @@
# limitations under the License.
import math
import sys
from retry import retry
from typing import Callable, Union
@@ -64,13 +66,17 @@ class EmbeddingFunction:
return self.func(c.tolist())
if len(self.rate_limiter_kwargs) > 0:
import ratelimiter
v = int(sys.version_info.minor)
if v >= 11:
print("WARNING: rate limit only support up to 3.10, proceeding without rate limiter")
else:
import ratelimiter
max_calls = self.rate_limiter_kwargs["max_calls"]
limiter = ratelimiter.RateLimiter(
max_calls, period=self.rate_limiter_kwargs["period"]
)
embed_func = limiter(embed_func)
max_calls = self.rate_limiter_kwargs["max_calls"]
limiter = ratelimiter.RateLimiter(
max_calls, period=self.rate_limiter_kwargs["period"]
)
embed_func = limiter(embed_func)
batches = self.to_batches(text)
embeds = [emb for c in batches for emb in embed_func(c)]
return embeds
@@ -79,11 +85,6 @@ class EmbeddingFunction:
return f"EmbeddingFunction(func={self.func})"
def rate_limit(self, max_calls=0.9, period=1.0):
import sys
v = int(sys.version_info.minor)
if v >= 11:
raise ValueError("rate limit only support up to 3.10")
self.rate_limiter_kwargs = dict(max_calls=max_calls, period=period)
return self
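The change above guards the rate limiter behind a Python version check, since the `ratelimiter` dependency only works up to Python 3.10 (the reason for the cap in the diff). A standalone sketch of the same pattern, with illustrative function names that are not part of the LanceDB API:

```python
import sys

def maybe_rate_limited(func, max_calls=0.9, period=1.0):
    # On 3.11+ the ratelimiter package is unusable, so fall back to the raw function.
    if sys.version_info.minor >= 11:
        print("WARNING: rate limit only supported up to 3.10, proceeding without rate limiter")
        return func
    import ratelimiter  # only imported on supported versions

    limiter = ratelimiter.RateLimiter(max_calls, period=period)
    return limiter(func)  # RateLimiter also works as a decorator

def embed_batch(texts):
    ...  # call an embedding endpoint here

embed_batch = maybe_rate_limited(embed_batch, max_calls=0.9, period=1.0)
```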

View File

@@ -24,6 +24,7 @@ class LanceQueryBuilder:
"""
def __init__(self, table: "lancedb.table.LanceTable", query: np.ndarray):
self._metric = "L2"
self._nprobes = 20
self._refine_factor = None
self._table = table
@@ -77,6 +78,21 @@ class LanceQueryBuilder:
self._where = where
return self
def metric(self, metric: str) -> LanceQueryBuilder:
"""Set the distance metric to use.
Parameters
----------
metric: str
The distance metric to use. By default "l2" is used.
Returns
-------
The LanceQueryBuilder object.
"""
self._metric = metric
return self
def nprobes(self, nprobes: int) -> LanceQueryBuilder:
"""Set the number of probes to use.
@@ -108,7 +124,12 @@ class LanceQueryBuilder:
return self
def to_df(self) -> pd.DataFrame:
"""Execute the query and return the results as a pandas DataFrame."""
"""
Execute the query and return the results as a pandas DataFrame.
In addition to the selected columns, LanceDB also returns a vector
and also the "score" column which is the distance between the query
vector and the returned vector.
"""
ds = self._table.to_lance()
# TODO indexed search
tbl = ds.to_table(
@@ -118,6 +139,7 @@ class LanceQueryBuilder:
"column": VECTOR_COLUMN_NAME,
"q": self._query,
"k": self._limit,
"metric": self._metric,
"nprobes": self._nprobes,
"refine_factor": self._refine_factor,
},

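For orientation, the builder's `to_df()` ultimately issues a `nearest` scan against the underlying Lance dataset with the fields shown above. A hedged sketch of the roughly equivalent direct call (the dataset path and query vector are hypothetical):

```python
# Roughly what LanceQueryBuilder.to_df() does under the hood in this release.
import lance

ds = lance.dataset("./data/sample-lancedb/my_vectors.lance")
tbl = ds.to_table(
    nearest={
        "column": "vector",   # VECTOR_COLUMN_NAME
        "q": [0.1] * 768,     # query vector
        "k": 10,              # limit
        "metric": "cosine",   # new in this release
        "nprobes": 20,
        "refine_factor": None,
    }
)
df = tbl.to_pandas()
```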
View File

@@ -106,11 +106,14 @@ class LanceTable:
def _dataset_uri(self) -> str:
return os.path.join(self._conn.uri, f"{self.name}.lance")
def create_index(self, num_partitions=256, num_sub_vectors=96):
def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
"""Create an index on the table.
Parameters
----------
metric: str, default "L2"
The distance metric to use when creating the index. Valid values are "L2" or "cosine".
L2 is euclidean distance.
num_partitions: int
The number of IVF partitions to use when creating the index.
Default is 256.
@@ -121,6 +124,7 @@ class LanceTable:
self._dataset.create_index(
column=VECTOR_COLUMN_NAME,
index_type="IVF_PQ",
metric=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
)
@@ -166,6 +170,9 @@ class LanceTable:
Returns
-------
A LanceQueryBuilder object representing the query.
Once executed, the query returns selected columns, the vector,
and also the "score" column which is the distance between the query
vector and the returned vector.
"""
if isinstance(query, list):
query = np.array(query)
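Putting the pieces from this release together, a hedged end-to-end example; the table name and rows are made up, and `create_table` with a list of dicts is assumed from the basic usage docs:

```python
# Sketch: query-time metric via the new LanceQueryBuilder.metric() method.
import lancedb

db = lancedb.connect("./data/sample-lancedb")
tbl = db.create_table(
    "metric_demo",
    data=[
        {"vector": [3.1, 4.1], "item": "foo"},
        {"vector": [5.9, 26.5], "item": "bar"},
    ],
)

# With only two rows there is no ANN index; nprobes/refine_factor would be ignored.
df = tbl.search([10.0, 10.0]).metric("cosine").limit(1).to_df()
print(df[["item", "score"]])  # "score" is the cosine distance to the query vector
```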

View File

@@ -1,7 +1,7 @@
[project]
name = "lancedb"
version = "0.1"
dependencies = ["pylance>=0.4.3", "ratelimiter", "retry", "tqdm"]
version = "0.1.1"
dependencies = ["pylance>=0.4.4", "ratelimiter", "retry", "tqdm"]
description = "lancedb"
authors = [
{ name = "Lance Devs", email = "dev@eto.ai" },

View File

@@ -14,7 +14,9 @@
import lance
from lancedb.query import LanceQueryBuilder
import numpy as np
import pandas as pd
import pandas.testing as tm
import pyarrow as pa
import pytest
@@ -60,3 +62,21 @@ def test_query_builder_with_filter(table):
df = LanceQueryBuilder(table, [0, 0]).where("id = 2").to_df()
assert df["id"].values[0] == 2
assert all(df["vector"].values[0] == [3, 4])
def test_query_builder_with_metric(table):
query = [4, 8]
df_default = LanceQueryBuilder(table, query).to_df()
df_l2 = LanceQueryBuilder(table, query).metric("l2").to_df()
tm.assert_frame_equal(df_default, df_l2)
df_cosine = LanceQueryBuilder(table, query).metric("cosine").limit(1).to_df()
assert df_cosine.score[0] == pytest.approx(
cosine_distance(query, df_cosine.vector[0]),
abs=1e-6,
)
assert 0 <= df_cosine.score[0] <= 1
def cosine_distance(vec1, vec2):
return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
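As a quick sanity check on the `cosine_distance` helper used by the new test, a worked example with concrete numbers (not part of the test file):

```python
import numpy as np

vec1 = np.array([4.0, 8.0])  # the test's query vector
vec2 = np.array([3.0, 4.0])  # a vector present in the test table fixture
# dot = 44, |vec1| = sqrt(80) ~ 8.944, |vec2| = 5
# cosine similarity ~ 0.9839, so cosine distance ~ 0.0161 (within [0, 1] as asserted)
print(1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))
```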