feat: support multivector type (#2005)

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
BubbleCal
2025-01-14 06:10:40 +08:00
committed by GitHub
parent ce9506db71
commit 66cbf6b6c5
9 changed files with 255 additions and 72 deletions

docs/src/assets/maxsim.png (binary file, 10 KiB, not shown)


@@ -138,6 +138,36 @@ LanceDB supports binary vectors as a data type, and has the ability to search bi
--8<-- "python/python/tests/docs/test_binary_vector.py:async_binary_vector"
```
## Multivector type
LanceDB supports a multivector type, which is useful when a single item is represented by multiple vectors (e.g. with ColBERT and ColPali).
You can build an index on a multivector column and search it; the query can be either a single vector or multiple vectors. If the query is a multivector `mq`, its similarity (distance) to any multivector `mv` in the dataset is defined as:
![maxsim](assets/maxsim.png)
where `sim` is the similarity function (e.g. cosine).
Currently, only the `cosine` metric is supported for multivector search.
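The formula image is not reproduced here, so the following is only a minimal sketch of the definition, assuming it is the ColBERT-style MaxSim that the image name suggests: for each query vector, take its best similarity against the vectors of `mv`, then sum those maxima. The helper names `cosine_similarity` and `maxsim` are hypothetical and are not part of the LanceDB API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def maxsim(mq, mv, sim=cosine_similarity):
    """Assumed MaxSim: sum over query vectors of the best match in mv."""
    return sum(max(sim(q, v) for v in mv) for q in mq)


# A query multivector with two vectors and a stored multivector with two vectors.
mq = [[1.0, 0.0], [0.0, 1.0]]
mv = [[1.0, 0.0], [1.0, 1.0]]
score = maxsim(mq, mv)

# A single-vector query is the special case of a one-element multivector.
single_score = maxsim([[1.0, 0.0]], mv)
```

Under this reading, a single-vector query reduces to the best similarity between that vector and any vector of the stored multivector.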
=== "Python"

    === "sync API"

        ```python
        --8<-- "python/python/tests/docs/test_multivector.py:imports"
        --8<-- "python/python/tests/docs/test_multivector.py:sync_multivector"
        ```

    === "async API"

        ```python
        --8<-- "python/python/tests/docs/test_multivector.py:imports"
        --8<-- "python/python/tests/docs/test_multivector.py:async_multivector"
        ```

## Search with distance range
You can also search for vectors within a specific distance range from the query vector. This is useful when you want to find vectors that are not just the nearest neighbors, but also those that are within a certain distance. This can be done by using the `distance_range` method.
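To illustrate what a distance-range search returns, here is a brute-force sketch in plain Python. It does not call LanceDB: the function `search_in_range` is hypothetical, plain Euclidean distance stands in for whatever metric the table uses, and the inclusive lower bound and exclusive upper bound are assumptions that should be checked against the `distance_range` reference.

```python
import math


def search_in_range(query, vectors, lower_bound=None, upper_bound=None):
    """Return (index, distance) pairs whose distance lies in the given range.

    Assumption: lower_bound is inclusive and upper_bound is exclusive;
    a bound of None means that side of the range is open.
    """
    results = []
    for idx, vec in enumerate(vectors):
        d = math.dist(query, vec)  # Euclidean distance, used for illustration only
        if lower_bound is not None and d < lower_bound:
            continue
        if upper_bound is not None and d >= upper_bound:
            continue
        results.append((idx, d))
    # Closest matches first, as in a nearest-neighbor search.
    return sorted(results, key=lambda r: r[1])


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
query = [0.0, 0.0]

bounded = search_in_range(query, vectors, lower_bound=0.5, upper_bound=5.0)
open_ended = search_in_range(query, vectors, lower_bound=0.5)
```

Here `bounded` keeps only the vector at distance 1, while `open_ended` also keeps the one at distance 5 because no upper bound is applied.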


@@ -18,7 +18,7 @@ import numpy as np
uri = "data/sample-lancedb"
data = [{"vector": row, "item": f"item {i}", "id": i}
-for i, row in enumerate(np.random.random((10_000, 2)).astype('int'))]
+for i, row in enumerate(np.random.random((10_000, 2)))]
# Synchronous client
db = lancedb.connect(uri)