Mirror of https://github.com/lancedb/lancedb.git (synced 2025-12-24 22:09:58 +00:00)
Compare commits
21 Commits
| SHA1 |
|---|
| afa7fe19e6 |
| 66080d791b |
| 5554fddd54 |
| f06ea935fe |
| a8db7f56d2 |
| 7a375185a1 |
| 6592b4c13b |
| 72a44eb927 |
| b0e578c609 |
| 89e6232aeb |
| 44ea687984 |
| 4f2dae8a0d |
| 5e748e6e70 |
| 177192f852 |
| 1fb596942f |
| 73d3cb78e6 |
| a1583444ec |
| 78e4f4d1a8 |
| b92eb988b6 |
| 0cd092814d |
| a6294925df |
@@ -8,7 +8,7 @@
 <a href="https://lancedb.github.io/lancedb/">Documentation</a> •
 <a href="https://blog.eto.ai/">Blog</a> •
 <a href="https://discord.gg/zMM32dvNtd">Discord</a> •
-<a href="https://twitter.com/etodotai">Twitter</a>
+<a href="https://twitter.com/lancedb">Twitter</a>
 
 </p>
 </div>
@@ -15,6 +15,7 @@ nav:
 - Home: index.md
 - Basics: basic.md
 - Embeddings: embedding.md
 - Indexing: ann_indexes.md
+- Integrations: integrations.md
 - Python API: python.md
@@ -1,12 +1,18 @@
 # ANN (Approximate Nearest Neighbor) Indexes
 
-You can create an index over your vector data to make search faster. Vector indexes are faster but less accurate than exhaustive search. LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
+You can create an index over your vector data to make search faster.
+Vector indexes are faster but less accurate than exhaustive search.
+LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
 
-Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration.
+Currently, LanceDB does *not* automatically create the ANN index.
+LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
+If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
+
+In the future we will look to automatically create and configure the ANN index.
 
 ## Creating an ANN Index
 
-Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function.
+Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
 
 ```python
 import lancedb
@@ -28,11 +34,12 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96)
 Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
 creation by providing the following parameters:
 
+- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
 - **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
 with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
 A higher number leads to faster queries, but it makes index generation slower.
 - **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
 search more accurate, but also makes the index larger and slower to build.
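As a concrete instance of the sizing guidance above, the call below indexes a table of roughly one million vectors (the table handle `tbl` and the dataset size are assumed for illustration; the keyword arguments match the new `create_index` signature shown later in this compare):

```python
# ~1M vectors / 256 partitions keeps each partition in the 3-5K vector range.
# metric="cosine" is the alternative to the default "L2" (euclidean) distance.
tbl.create_index(
    metric="cosine",
    num_partitions=256,   # more partitions: faster queries, slower index build
    num_sub_vectors=96,   # more sub-vectors: better recall, larger/slower index
)
```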
## Querying an ANN Index
@@ -41,15 +48,21 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance
 There are a couple of parameters that can be used to fine-tune the search:
 
 - **limit** (default: 10): The amount of results that will be returned
-- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
-- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory. A higher number makes
-search more accurate but also slower.
+- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.<br/>
+Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.<br/>
+e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
+Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
+- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
+A higher number makes search more accurate but also slower. If you find the recall is less than idea, try refine_factor=10 to start.<br/>
+e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
+Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
 
 ```python
 tbl.search(np.random.random((768))) \
     .limit(2) \
     .nprobes(20) \
-    .refine_factor(20) \
+    .refine_factor(10) \
     .to_df()
 
 vector item score
@@ -57,7 +70,9 @@ tbl.search(np.random.random((768))) \
 1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
 ```
 
-The search will return the data requested in addition to the score of each item. The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
+The search will return the data requested in addition to the score of each item.
+
+**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
 
 ### Filtering (where clause)
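The `where()` builder method behind this heading accepts a SQL-style predicate. A minimal sketch of a filtered search (the query vector and filter value are illustrative; the method names come from the LanceQueryBuilder changes and tests later in this compare):

```python
# combine vector search with a metadata filter, then materialize as a DataFrame
tbl.search(np.random.random((768))) \
    .where("id = 2") \
    .limit(10) \
    .to_df()
```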
@@ -1,6 +1,6 @@
 # Welcome to LanceDB's Documentation
 
-LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrevial, filtering and management of embeddings.
+LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrivial, filtering and management of embeddings.
 
 The key features of LanceDB include:
 
@@ -42,5 +42,6 @@ We will be adding completed demo apps built using LanceDB.
 ## Documentation Quick Links
 * [`Basic Operations`](basic.md) - basic functionality of LanceDB.
 * [`Embedding Functions`](embedding.md) - functions for working with embeddings.
 * [`Indexing`](ann_indexes.md) - create vector indexes to speed up queries.
+* [`Ecosystem Integrations`](integrations.md) - integrating LanceDB with python data tooling ecosystem.
 * [`API Reference`](python.md) - detailed documentation for the LanceDB Python SDK.
@@ -1,7 +1,6 @@
 {
 "cells": [
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "42bf01fb",
 "metadata": {},
@@ -22,10 +21,10 @@
 "output_type": "stream",
 "text": [
 "\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
+"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
 "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
 "\n",
-"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
+"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
 "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
 ]
 }
@@ -88,7 +87,6 @@
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "5ac2b6a3",
 "metadata": {},
@@ -231,7 +229,6 @@
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "2106b5bb",
 "metadata": {},
@@ -251,7 +248,7 @@
 {
 "data": {
 "application/vnd.jupyter.widget-view+json": {
-"model_id": "39f3161f3ef54a129cd65fb296332b54",
+"model_id": "c6f1c76d9567421d88911923388d2530",
 "version_major": 2,
 "version_minor": 0
 },
@@ -574,7 +571,6 @@
 ]
 },
 {
-"attachments": {},
 "cell_type": "markdown",
 "id": "559a095b",
 "metadata": {},
@@ -631,7 +627,7 @@
 " <iframe\n",
 " width=\"400\"\n",
 " height=\"300\"\n",
-" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289.76\"\n",
+" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289\"\n",
 " frameborder=\"0\"\n",
 " allowfullscreen\n",
 " \n",
@@ -639,7 +635,7 @@
 " "
 ],
 "text/plain": [
-"<IPython.lib.display.YouTubeVideo at 0x177fde4d0>"
+"<IPython.lib.display.YouTubeVideo at 0x13ec062c0>"
 ]
 },
 "execution_count": 15,
@@ -651,7 +647,7 @@
 "from IPython.display import YouTubeVideo\n",
 "\n",
 "top_match = context.iloc[0]\n",
-"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=top_match[\"start\"])"
+"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=int(top_match[\"start\"]))"
 ]
 },
 {
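The last notebook hunk casts the clip's start offset to an integer before building the embed URL. A standalone restatement of why (the `context` DataFrame and its columns mirror the notebook cell above; 289.76 is the fractional value visible in the old iframe src):

```python
from IPython.display import YouTubeVideo

# The transcript index stores fractional timestamps (e.g. 289.76 seconds),
# but YouTube's embed "start" parameter only accepts whole seconds, so the
# offset is truncated before the player URL is generated.
top_match = context.iloc[0]
YouTubeVideo(top_match["url"].split("/")[-1], start=int(top_match["start"]))
```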
@@ -12,6 +12,8 @@
 # limitations under the License.
 
 import math
+import sys
 
 from retry import retry
 from typing import Callable, Union
 
@@ -64,13 +66,17 @@ class EmbeddingFunction:
             return self.func(c.tolist())
 
         if len(self.rate_limiter_kwargs) > 0:
-            import ratelimiter
+            v = int(sys.version_info.minor)
+            if v >= 11:
+                print("WARNING: rate limit only support up to 3.10, proceeding without rate limiter")
+            else:
+                import ratelimiter
 
-            max_calls = self.rate_limiter_kwargs["max_calls"]
-            limiter = ratelimiter.RateLimiter(
-                max_calls, period=self.rate_limiter_kwargs["period"]
-            )
-            embed_func = limiter(embed_func)
+                max_calls = self.rate_limiter_kwargs["max_calls"]
+                limiter = ratelimiter.RateLimiter(
+                    max_calls, period=self.rate_limiter_kwargs["period"]
+                )
+                embed_func = limiter(embed_func)
         batches = self.to_batches(text)
         embeds = [emb for c in batches for emb in embed_func(c)]
         return embeds
@@ -79,11 +85,6 @@ class EmbeddingFunction:
         return f"EmbeddingFunction(func={self.func})"
 
     def rate_limit(self, max_calls=0.9, period=1.0):
-        import sys
-
-        v = int(sys.version_info.minor)
-        if v >= 11:
-            raise ValueError("rate limit only support up to 3.10")
         self.rate_limiter_kwargs = dict(max_calls=max_calls, period=period)
         return self
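The net effect of the two hunks above is that the interpreter check moves out of `rate_limit()` (where it used to raise) and into call time, where it now degrades to an unthrottled call on Python 3.11+. A standalone sketch of the same pattern, outside the lancedb class (the `throttle` helper and the embedding callable are illustrative only, not lancedb API):

```python
import sys


def throttle(func, max_calls=0.9, period=1.0):
    """Wrap func with the ratelimiter package when it is usable (Python <= 3.10)."""
    if sys.version_info.minor >= 11:
        # ratelimiter does not support Python 3.11+; fall back to the bare callable.
        print("WARNING: rate limit only support up to 3.10, proceeding without rate limiter")
        return func
    import ratelimiter

    # a RateLimiter instance works as a decorator: calls beyond max_calls per
    # period block until the window rolls over
    return ratelimiter.RateLimiter(max_calls, period=period)(func)


# usage sketch: embed_batch would be an embedding callable taking a list of strings
# throttled = throttle(embed_batch, max_calls=0.9, period=1.0)
```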
@@ -24,6 +24,7 @@ class LanceQueryBuilder:
     """
 
     def __init__(self, table: "lancedb.table.LanceTable", query: np.ndarray):
+        self._metric = "L2"
         self._nprobes = 20
         self._refine_factor = None
         self._table = table
@@ -77,6 +78,21 @@ class LanceQueryBuilder:
         self._where = where
         return self
 
+    def metric(self, metric: str) -> LanceQueryBuilder:
+        """Set the distance metric to use.
+
+        Parameters
+        ----------
+        metric: str
+            The distance metric to use. By default "l2" is used.
+
+        Returns
+        -------
+        The LanceQueryBuilder object.
+        """
+        self._metric = metric
+        return self
+
     def nprobes(self, nprobes: int) -> LanceQueryBuilder:
         """Set the number of probes to use.
 
@@ -108,7 +124,12 @@ class LanceQueryBuilder:
         return self
 
     def to_df(self) -> pd.DataFrame:
-        """Execute the query and return the results as a pandas DataFrame."""
+        """
+        Execute the query and return the results as a pandas DataFrame.
+        In addition to the selected columns, LanceDB also returns a vector
+        and also the "score" column which is the distance between the query
+        vector and the returned vector.
+        """
         ds = self._table.to_lance()
         # TODO indexed search
         tbl = ds.to_table(
@@ -118,6 +139,7 @@
             "column": VECTOR_COLUMN_NAME,
             "q": self._query,
             "k": self._limit,
+            "metric": self._metric,
             "nprobes": self._nprobes,
             "refine_factor": self._refine_factor,
         },
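With the new builder method and the `metric` key threaded into the query parameters, switching a search to cosine distance is a one-liner. A brief usage sketch (the table handle and the two-element query vector are assumed for illustration):

```python
# "score" will hold cosine distances instead of the default L2 (euclidean) distances
df = tbl.search([4, 8]).metric("cosine").limit(1).to_df()
```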
@@ -106,11 +106,14 @@
     def _dataset_uri(self) -> str:
         return os.path.join(self._conn.uri, f"{self.name}.lance")
 
-    def create_index(self, num_partitions=256, num_sub_vectors=96):
+    def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
         """Create an index on the table.
 
         Parameters
         ----------
+        metric: str, default "L2"
+            The distance metric to use when creating the index. Valid values are "L2" or "cosine".
+            L2 is euclidean distance.
         num_partitions: int
             The number of IVF partitions to use when creating the index.
             Default is 256.
@@ -121,6 +124,7 @@ class LanceTable:
         self._dataset.create_index(
             column=VECTOR_COLUMN_NAME,
             index_type="IVF_PQ",
+            metric=metric,
             num_partitions=num_partitions,
             num_sub_vectors=num_sub_vectors,
         )
@@ -166,6 +170,9 @@ class LanceTable:
         Returns
         -------
         A LanceQueryBuilder object representing the query.
+        Once executed, the query returns selected columns, the vector,
+        and also the "score" column which is the distance between the query
+        vector and the returned vector.
         """
         if isinstance(query, list):
             query = np.array(query)
@@ -1,7 +1,7 @@
 [project]
 name = "lancedb"
-version = "0.1"
-dependencies = ["pylance>=0.4.3", "ratelimiter", "retry", "tqdm"]
+version = "0.1.1"
+dependencies = ["pylance>=0.4.4", "ratelimiter", "retry", "tqdm"]
 description = "lancedb"
 authors = [
     { name = "Lance Devs", email = "dev@eto.ai" },
@@ -14,7 +14,9 @@
 import lance
 from lancedb.query import LanceQueryBuilder
 
 import numpy as np
 import pandas as pd
+import pandas.testing as tm
 import pyarrow as pa
 
 import pytest
@@ -60,3 +62,21 @@ def test_query_builder_with_filter(table):
     df = LanceQueryBuilder(table, [0, 0]).where("id = 2").to_df()
     assert df["id"].values[0] == 2
     assert all(df["vector"].values[0] == [3, 4])
+
+
+def test_query_builder_with_metric(table):
+    query = [4, 8]
+    df_default = LanceQueryBuilder(table, query).to_df()
+    df_l2 = LanceQueryBuilder(table, query).metric("l2").to_df()
+    tm.assert_frame_equal(df_default, df_l2)
+
+    df_cosine = LanceQueryBuilder(table, query).metric("cosine").limit(1).to_df()
+    assert df_cosine.score[0] == pytest.approx(
+        cosine_distance(query, df_cosine.vector[0]),
+        abs=1e-6,
+    )
+    assert 0 <= df_cosine.score[0] <= 1
+
+
+def cosine_distance(vec1, vec2):
+    return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
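One detail worth noting about the new test's final assertion: cosine distance as defined in `cosine_distance` ranges over [0, 2] in general, and the `0 <= score <= 1` bound only holds when the vectors involved point into the same half-space, as the non-negative [4, 8] query and [3, 4] fixture vector here do. A quick check of both cases (pure NumPy, independent of lancedb):

```python
import numpy as np

def cosine_distance(vec1, vec2):
    return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print(cosine_distance([4, 8], [3, 4]))   # ~0.016: similar directions stay in [0, 1]
print(cosine_distance([1, 0], [-1, 0]))  # 2.0: opposite directions reach the upper bound
```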