[Doc] Metrics types. (#135)

Closes #129
Lei Xu
2023-06-02 17:18:01 -07:00
committed by GitHub
parent adcb2a1387
commit cdd08ef35c
3 changed files with 91 additions and 4 deletions

@@ -38,13 +38,15 @@ nav:
- Home: index.md
- Basics: basic.md
- Embeddings: embedding.md
- Indexing: ann_indexes.md
- Python full-text search: fts.md
- Python integrations: integrations.md
- Python examples:
- YouTube Transcript Search using OpenAI: notebooks/youtube_transcript_search.ipynb
- Documentation QA Bot using LangChain: notebooks/code_qa_bot.ipynb
- Multimodal search using CLIP: notebooks/multimodal_search.ipynb
- References:
- Vector Search: search.md
- Indexing: ann_indexes.md
- API references:
- Python API: python/python.md
- Javascript API: javascript/modules.md

@@ -18,7 +18,7 @@ In the future we will look to automatically create and configure the ANN index.
```python
import lancedb
import numpy as np
uri = "~/.lancedb"
uri = "data/sample-lancedb"
db = lancedb.connect(uri)
# Create 10,000 sample vectors
@@ -48,7 +48,7 @@ In the future we will look to automatically create and configure the ANN index.
Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
creation by providing the following parameters:
- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support "cosine" distance.
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
A higher number leads to faster queries, but it makes index generation slower.
@@ -87,7 +87,7 @@ There are a couple of parameters that can be used to fine-tune the search:
=== "Javascript"
```javascript
const results = await table
.search(Array(1536).fill(1.2))
.search(Array(768).fill(1.2))
.limit(2)
.nprobes(20)
.refineFactor(10)

docs/src/search.md

@@ -0,0 +1,85 @@
# Vector Search
`Vector Search` finds the vectors in the database that are nearest to a query vector.
In a recommendation system or search engine, it can find products similar to
the one you searched for.
In LLM and other AI applications,
each data point can be [represented by embeddings generated from a model](embedding.md),
and vector search returns the most relevant ones.
A search in a high-dimensional vector space finds the `K-Nearest-Neighbors (KNN)` of the query vector.
## Metric
In LanceDB, a `Metric` defines how the distance between a pair of vectors is computed.
Currently, we support the following metrics:
| Metric | Description |
| ----------- | ------------------------------------ |
| `L2` | [Euclidean / L2 distance](https://en.wikipedia.org/wiki/Euclidean_distance) |
| `cosine` | Distance based on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) |
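
To make the two metrics concrete, here is a minimal pure-Python sketch of how each distance can be computed. It is an illustration only, not LanceDB's implementation: the helper names `l2_distance` and `cosine_distance` are hypothetical, and the sketch assumes cosine distance means `1 - cosine similarity` and that L2 is the unsquared Euclidean distance.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 1.0]  # orthogonal to a

# L2 depends on magnitude; cosine distance depends only on direction.
print(l2_distance(a, b), cosine_distance(a, b))
print(l2_distance(a, c), cosine_distance(a, c))
```

Vectors `a` and `b` point the same way, so their cosine distance is zero even though their L2 distance is not. This is why cosine is a common choice when embeddings are not normalized.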
## Search
### Flat Search
If no [vector index](ann_indexes.md) has been created, LanceDB performs a brute-force scan of
the vector column, computing the distance from the query to every vector.
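
Conceptually, a brute-force scan amounts to scoring every stored vector against the query and keeping the `k` smallest distances. The following pure-Python sketch shows that idea under stated assumptions: `flat_search` and `l2_distance` are hypothetical helpers written for illustration, not part of the LanceDB API, and the real scan runs over the table's vector column rather than a Python list.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def flat_search(vectors, query, k, distance=l2_distance):
    """Return the k (index, distance) pairs closest to query, nearest first."""
    # Score every vector: this is the O(n) scan that an index avoids.
    scored = ((i, distance(v, query)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
hits = flat_search(vectors, query=[1.0, 1.0], k=2)
for index, dist in hits:
    print(index, dist)
```

The cost grows linearly with the number of rows, which is acceptable for small tables and is the main reason to build an index for large ones.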
=== "Python"
```python
import lancedb
import numpy as np

db = lancedb.connect("data/sample-lancedb")
tbl = db.open_table("my_vectors")
# Parentheses let the chained calls span multiple lines
df = (
    tbl.search(np.random.random((768)))
    .limit(10)
    .to_df()
)
```
=== "JavaScript"
```javascript
const vectordb = require('vectordb')
const db = await vectordb.connect('data/sample-lancedb')
const tbl = await db.openTable('my_vectors')
// Query with a 768-dimensional vector
const results = await tbl.search(Array(768).fill(1.2))
    .limit(20)
    .execute()
```
By default, `L2` is used as the `Metric` type. You can also specify
a different metric.
=== "Python"
```python
# Reuses `tbl` and `np` from the previous Python example
df = (
    tbl.search(np.random.random((768)))
    .metric("cosine")
    .limit(10)
    .to_df()
)
```
=== "JavaScript"
```javascript
const vectordb = require('vectordb')
const db = await vectordb.connect('data/sample-lancedb')
const tbl = await db.openTable('my_vectors')
// Query with a 768-dimensional vector
const results = await tbl.search(Array(768).fill(1.2))
    .metric("cosine")
    .limit(20)
    .execute()
```
### Search with Vector Index
See [ANN Index](ann_indexes.md) for more details.