From cdd08ef35c36bd33eaf49c354d491068818a274e Mon Sep 17 00:00:00 2001
From: Lei Xu
Date: Fri, 2 Jun 2023 17:18:01 -0700
Subject: [PATCH] [Doc] Metrics types. (#135)

Closes #129
---
 docs/mkdocs.yml         |  4 +-
 docs/src/ann_indexes.md |  6 +--
 docs/src/search.md      | 85 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+), 4 deletions(-)
 create mode 100644 docs/src/search.md

diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index 80cc7fe6..a53926d7 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -38,13 +38,15 @@ nav:
 - Home: index.md
 - Basics: basic.md
 - Embeddings: embedding.md
-- Indexing: ann_indexes.md
 - Python full-text search: fts.md
 - Python integrations: integrations.md
 - Python examples:
   - YouTube Transcript Search using OpenAI: notebooks/youtube_transcript_search.ipynb
   - Documentation QA Bot using LangChain: notebooks/code_qa_bot.ipynb
   - Multimodal search using CLIP: notebooks/multimodal_search.ipynb
+- References:
+  - Vector Search: search.md
+  - Indexing: ann_indexes.md
 - API references:
   - Python API: python/python.md
   - Javascript API: javascript/modules.md
diff --git a/docs/src/ann_indexes.md b/docs/src/ann_indexes.md
index 050247e5..bd25c789 100644
--- a/docs/src/ann_indexes.md
+++ b/docs/src/ann_indexes.md
@@ -18,7 +18,7 @@ In the future we will look to automatically create and configure the ANN index.
     ```python
     import lancedb
     import numpy as np
-    uri = "~/.lancedb"
+    uri = "data/sample-lancedb"
     db = lancedb.connect(uri)
 
     # Create 10,000 sample vectors
@@ -48,7 +48,7 @@ In the future we will look to automatically create and configure the ANN index.
 Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
 creation by providing the following parameters:
 
-- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
+- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support "cosine" distance.
 - **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
 with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
 A higher number leads to faster queries, but it makes index generation slower.
@@ -87,7 +87,7 @@ There are a couple of parameters that can be used to fine-tune the search:
 === "Javascript"
     ```javascript
     const results = await table
-      .search(Array(1536).fill(1.2))
+      .search(Array(768).fill(1.2))
       .limit(2)
       .nprobes(20)
       .refineFactor(10)
diff --git a/docs/src/search.md b/docs/src/search.md
new file mode 100644
index 00000000..db9a59ae
--- /dev/null
+++ b/docs/src/search.md
@@ -0,0 +1,85 @@
+# Vector Search
+
+`Vector Search` finds the nearest vectors from the database.
+In a recommendation system or search engine, you can find similar products from
+the one you searched.
+In LLM and other AI applications,
+each data point can be [presented by the embeddings generated from some models](embedding.md),
+it returns the most relevant features.
+
+A search in high-dimensional vector space, is to find `K-Nearest-Neighbors (KNN)` of the query vector.
+
+## Metric
+
+In LanceDB, a `Metric` is the way to describe the distance between a pair of vectors.
+Currently, we support the following metrics:
+
+| Metric | Description |
+| ----------- | ------------------------------------ |
+| `L2` | [Euclidean / L2 distance](https://en.wikipedia.org/wiki/Euclidean_distance) |
+| `Cosine` | [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)|
+
+
+## Search
+
+### Flat Search
+
+
+If there is no [vector index is created](ann_indexes.md), LanceDB will just brute-force scan
+the vector column and compute the distance.
+
+=== "Python"
+
+    ```python
+    import lancedb
+    db = lancedb.connect("data/sample-lancedb")
+
+    tbl = db.open_table("my_vectors")
+
+    df = tbl.search(np.random.random((768)))
+        .limit(10)
+        .to_df()
+    ```
+
+=== "JavaScript"
+
+    ```javascript
+    const vectordb = require('vectordb')
+    const db = await vectordb.connect('data/sample-lancedb')
+
+    tbl = db.open_table("my_vectors")
+
+    const results = await tbl.search(Array(768))
+        .limit(20)
+        .execute()
+    ```
+
+By default, `l2` will be used as `Metric` type. You can customize the metric type
+as well.
+
+=== "Python"
+
+    ```python
+    df = tbl.search(np.random.random((768)))
+        .metric("cosine")
+        .limit(10)
+        .to_df()
+    ```
+
+=== "JavaScript"
+
+    ```javascript
+    const vectordb = require('vectordb')
+    const db = await vectordb.connect('data/sample-lancedb')
+
+    tbl = db.open_table("my_vectors")
+
+    const results = await tbl.search(Array(768))
+        .metric("cosine")
+        .limit(20)
+        .execute()
+    ```
+
+### Search with Vector Index.
+
+See [ANN Index](ann_indexes.md) for more details.
\ No newline at end of file
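
Note on the new `search.md` page: the two metrics and the flat (brute-force) search it describes can be illustrated without LanceDB. The following is a minimal standard-library sketch, not LanceDB's implementation. The names `l2_distance`, `cosine_distance`, `METRICS`, and `flat_search` are hypothetical, and the cosine metric is assumed here to mean a distance of 1 minus the cosine similarity, which the page itself does not spell out.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity. Neither vector may be all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical lookup table mirroring the metric names used in the docs.
METRICS = {"l2": l2_distance, "cosine": cosine_distance}


def flat_search(vectors, query, k, metric="l2"):
    """Brute-force KNN: score every stored vector and keep the k closest.

    Returns a list of (row_index, distance) pairs, nearest first.
    """
    distance = METRICS[metric]
    scored = [(i, distance(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # stable sort, ties keep row order
    return scored[:k]


# Toy data: four 2-dimensional vectors and one query.
vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.9, 0.1]]
query = [2.5, 0.0]

print(flat_search(vectors, query, k=2, metric="l2"))
print(flat_search(vectors, query, k=2, metric="cosine"))
```

The two calls rank the toy rows differently: L2 prefers the vector closest in position, while the cosine variant ignores magnitude and treats every vector pointing in the query's direction as equally near. That difference is the practical reason to choose the metric per embedding model.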