mirror of
https://github.com/lancedb/lancedb.git
synced 2025-12-25 14:29:56 +00:00
Compare commits
34 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
59014a01e0 | ||
|
|
47ae17ea05 | ||
|
|
b6739f3f66 | ||
|
|
3a2df0ce45 | ||
|
|
c0bc65cdfa | ||
|
|
298b81f0b0 | ||
|
|
fe7a3ccd60 | ||
|
|
baf8d7c1a1 | ||
|
|
2021e1bf6d | ||
|
|
2dbe71cf88 | ||
|
|
afe19ade7f | ||
|
|
118efdce73 | ||
|
|
b0426387e7 | ||
|
|
afa7fe19e6 | ||
|
|
66080d791b | ||
|
|
5554fddd54 | ||
|
|
f06ea935fe | ||
|
|
a8db7f56d2 | ||
|
|
7a375185a1 | ||
|
|
6592b4c13b | ||
|
|
72a44eb927 | ||
|
|
b0e578c609 | ||
|
|
89e6232aeb | ||
|
|
44ea687984 | ||
|
|
4f2dae8a0d | ||
|
|
5e748e6e70 | ||
|
|
177192f852 | ||
|
|
1fb596942f | ||
|
|
73d3cb78e6 | ||
|
|
a1583444ec | ||
|
|
78e4f4d1a8 | ||
|
|
b92eb988b6 | ||
|
|
0cd092814d | ||
|
|
a6294925df |
2
.gitignore
vendored
2
.gitignore
vendored
@@ -15,3 +15,5 @@ python/build
|
||||
python/dist
|
||||
|
||||
notebooks/.ipynb_checkpoints
|
||||
|
||||
**/.hypothesis
|
||||
|
||||
@@ -3,12 +3,12 @@
|
||||
|
||||
<img width="275" alt="LanceDB Logo" src="https://user-images.githubusercontent.com/917119/226205734-6063d87a-1ecc-45fe-85be-1dea6383a3d8.png">
|
||||
|
||||
**Serverless, low-latency vector database for AI applications**
|
||||
**Developer-friendly, serverless vector database for AI applications**
|
||||
|
||||
<a href="https://lancedb.github.io/lancedb/">Documentation</a> •
|
||||
<a href="https://blog.eto.ai/">Blog</a> •
|
||||
<a href="https://discord.gg/zMM32dvNtd">Discord</a> •
|
||||
<a href="https://twitter.com/etodotai">Twitter</a>
|
||||
<a href="https://twitter.com/lancedb">Twitter</a>
|
||||
|
||||
</p>
|
||||
</div>
|
||||
@@ -21,6 +21,10 @@ The key features of LanceDB include:
|
||||
|
||||
* Production-scale vector search with no servers to manage.
|
||||
|
||||
* Optimized for multi-modal data (text, images, videos, point clouds and more).
|
||||
|
||||
* Native Python and Javascript/Typescript support (coming soon).
|
||||
|
||||
* Combine attribute-based information with vectors and store them as a single source-of-truth.
|
||||
|
||||
* Zero-copy, automatic versioning, manage versions of your data without needing extra infrastructure.
|
||||
|
||||
@@ -15,6 +15,7 @@ nav:
|
||||
- Home: index.md
|
||||
- Basics: basic.md
|
||||
- Embeddings: embedding.md
|
||||
- Indexing: ann_indexes.md
|
||||
- Integrations: integrations.md
|
||||
- Python API: python.md
|
||||
|
||||
|
||||
@@ -1,12 +1,18 @@
|
||||
# ANN (Approximate Nearest Neighbor) Indexes
|
||||
|
||||
You can create an index over your vector data to make search faster. Vector indexes are faster but less accurate than exhaustive search. LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
|
||||
You can create an index over your vector data to make search faster.
|
||||
Vector indexes are faster but less accurate than exhaustive search.
|
||||
LanceDB provides many parameters to fine-tune the index's size, the speed of queries, and the accuracy of results.
|
||||
|
||||
Currently, LanceDB does not automatically create the ANN index. In the future we will look to improve this experience and automate index creation and configuration.
|
||||
Currently, LanceDB does *not* automatically create the ANN index.
|
||||
LanceDB has optimized code for KNN as well. For many use-cases, datasets under 100K vectors won't require index creation at all.
|
||||
If you can live with <100ms latency, skipping index creation is a simpler workflow while guaranteeing 100% recall.
|
||||
|
||||
In the future we will look to automatically create and configure the ANN index.
|
||||
|
||||
## Creating an ANN Index
|
||||
|
||||
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) function.
|
||||
Creating indexes is done via the [create_index](https://lancedb.github.io/lancedb/python/#lancedb.table.LanceTable.create_index) method.
|
||||
|
||||
```python
|
||||
import lancedb
|
||||
@@ -28,11 +34,12 @@ tbl.create_index(num_partitions=256, num_sub_vectors=96)
|
||||
Since `create_index` has a training step, it can take a few minutes to finish for large tables. You can control the index
|
||||
creation by providing the following parameters:
|
||||
|
||||
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
|
||||
with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
|
||||
A higher number leads to faster queries, but it makes index generation slower.
|
||||
- **metric** (default: "L2"): The distance metric to use. By default we use euclidean distance. We also support cosine distance.
|
||||
- **num_partitions** (default: 256): The number of partitions of the index. The number of partitions should be configured so each partition has 3-5K vectors. For example, a table
|
||||
with ~1M vectors should use 256 partitions. You can specify arbitrary number of partitions but powers of 2 is most conventional.
|
||||
A higher number leads to faster queries, but it makes index generation slower.
|
||||
- **num_sub_vectors** (default: 96): The number of subvectors (M) that will be created during Product Quantization (PQ). A larger number makes
|
||||
search more accurate, but also makes the index larger and slower to build.
|
||||
search more accurate, but also makes the index larger and slower to build.
|
||||
|
||||
## Querying an ANN Index
|
||||
|
||||
@@ -41,15 +48,21 @@ Querying vector indexes is done via the [search](https://lancedb.github.io/lance
|
||||
There are a couple of parameters that can be used to fine-tune the search:
|
||||
|
||||
- **limit** (default: 10): The amount of results that will be returned
|
||||
- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.
|
||||
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory. A higher number makes
|
||||
search more accurate but also slower.
|
||||
- **nprobes** (default: 20): The number of probes used. A higher number makes search more accurate but also slower.<br/>
|
||||
Most of the time, setting nprobes to cover 5-10% of the dataset should achieve high recall with low latency.<br/>
|
||||
e.g., for 1M vectors divided up into 256 partitions, nprobes should be set to ~20-40.<br/>
|
||||
Note: nprobes is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
|
||||
- **refine_factor** (default: None): Refine the results by reading extra elements and re-ranking them in memory.<br/>
|
||||
A higher number makes search more accurate but also slower. If you find the recall is less than idea, try refine_factor=10 to start.<br/>
|
||||
e.g., for 1M vectors divided into 256 partitions, if you're looking for top 20, then refine_factor=200 reranks the whole partition.<br/>
|
||||
Note: refine_factor is only applicable if an ANN index is present. If specified on a table without an ANN index, it is ignored.
|
||||
|
||||
|
||||
```python
|
||||
tbl.search(np.random.random((768))) \
|
||||
.limit(2) \
|
||||
.nprobes(20) \
|
||||
.refine_factor(20) \
|
||||
.refine_factor(10) \
|
||||
.to_df()
|
||||
|
||||
vector item score
|
||||
@@ -57,7 +70,9 @@ tbl.search(np.random.random((768))) \
|
||||
1 [0.48587373, 0.269207, 0.15095535, 0.65531915,... item 3953 108.393867
|
||||
```
|
||||
|
||||
The search will return the data requested in addition to the score of each item. The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
|
||||
The search will return the data requested in addition to the score of each item.
|
||||
|
||||
**Note:** The score is the distance between the query vector and the element. A lower number means that the result is more relevant.
|
||||
|
||||
### Filtering (where clause)
|
||||
|
||||
|
||||
@@ -1,11 +1,15 @@
|
||||
# Welcome to LanceDB's Documentation
|
||||
|
||||
LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrevial, filtering and management of embeddings.
|
||||
LanceDB is an open-source database for vector-search built with persistent storage, which greatly simplifies retrivial, filtering and management of embeddings.
|
||||
|
||||
The key features of LanceDB include:
|
||||
|
||||
* Production-scale vector search with no servers to manage.
|
||||
|
||||
* Optimized for multi-modal data (text, images, videos, point clouds and more).
|
||||
|
||||
* Native Python and Javascript/Typescript support (coming soon).
|
||||
|
||||
* Combine attribute-based information with vectors and store them as a single source-of-truth.
|
||||
|
||||
* Zero-copy, automatic versioning, manage versions of your data without needing extra infrastructure.
|
||||
@@ -42,5 +46,6 @@ We will be adding completed demo apps built using LanceDB.
|
||||
## Documentation Quick Links
|
||||
* [`Basic Operations`](basic.md) - basic functionality of LanceDB.
|
||||
* [`Embedding Functions`](embedding.md) - functions for working with embeddings.
|
||||
* [`Indexing`](ann_indexes.md) - create vector indexes to speed up queries.
|
||||
* [`Ecosystem Integrations`](integrations.md) - integrating LanceDB with python data tooling ecosystem.
|
||||
* [`API Reference`](python.md) - detailed documentation for the LanceDB Python SDK.
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "42bf01fb",
|
||||
"metadata": {},
|
||||
@@ -22,10 +21,10 @@
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
|
||||
"\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.0.1\u001b[0m\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.1.1\u001b[0m\n",
|
||||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
|
||||
]
|
||||
}
|
||||
@@ -88,7 +87,6 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "5ac2b6a3",
|
||||
"metadata": {},
|
||||
@@ -231,7 +229,6 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "2106b5bb",
|
||||
"metadata": {},
|
||||
@@ -251,7 +248,7 @@
|
||||
{
|
||||
"data": {
|
||||
"application/vnd.jupyter.widget-view+json": {
|
||||
"model_id": "39f3161f3ef54a129cd65fb296332b54",
|
||||
"model_id": "c6f1c76d9567421d88911923388d2530",
|
||||
"version_major": 2,
|
||||
"version_minor": 0
|
||||
},
|
||||
@@ -574,7 +571,6 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"attachments": {},
|
||||
"cell_type": "markdown",
|
||||
"id": "559a095b",
|
||||
"metadata": {},
|
||||
@@ -631,7 +627,7 @@
|
||||
" <iframe\n",
|
||||
" width=\"400\"\n",
|
||||
" height=\"300\"\n",
|
||||
" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289.76\"\n",
|
||||
" src=\"https://www.youtube.com/embed/pNvujJ1XyeQ?start=289\"\n",
|
||||
" frameborder=\"0\"\n",
|
||||
" allowfullscreen\n",
|
||||
" \n",
|
||||
@@ -639,7 +635,7 @@
|
||||
" "
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.lib.display.YouTubeVideo at 0x177fde4d0>"
|
||||
"<IPython.lib.display.YouTubeVideo at 0x13ec062c0>"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
@@ -651,7 +647,7 @@
|
||||
"from IPython.display import YouTubeVideo\n",
|
||||
"\n",
|
||||
"top_match = context.iloc[0]\n",
|
||||
"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=top_match[\"start\"])"
|
||||
"YouTubeVideo(top_match[\"url\"].split(\"/\")[-1], start=int(top_match[\"start\"]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -11,7 +11,7 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from .db import LanceDBConnection, URI
|
||||
from .db import URI, LanceDBConnection
|
||||
|
||||
|
||||
def connect(uri: URI) -> LanceDBConnection:
|
||||
|
||||
@@ -11,7 +11,7 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from pathlib import Path
|
||||
from typing import Union, List
|
||||
from typing import List, Union
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
@@ -14,10 +14,12 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pyarrow as pa
|
||||
|
||||
from .common import URI, DATA
|
||||
from .common import DATA, URI
|
||||
from .table import LanceTable
|
||||
from .util import get_uri_scheme
|
||||
|
||||
|
||||
class LanceDBConnection:
|
||||
@@ -26,10 +28,12 @@ class LanceDBConnection:
|
||||
"""
|
||||
|
||||
def __init__(self, uri: URI):
|
||||
if isinstance(uri, str):
|
||||
uri = Path(uri)
|
||||
uri = uri.expanduser().absolute()
|
||||
Path(uri).mkdir(parents=True, exist_ok=True)
|
||||
is_local = isinstance(uri, Path) or get_uri_scheme(uri) == "file"
|
||||
if is_local:
|
||||
if isinstance(uri, str):
|
||||
uri = Path(uri)
|
||||
uri = uri.expanduser().absolute()
|
||||
Path(uri).mkdir(parents=True, exist_ok=True)
|
||||
self._uri = str(uri)
|
||||
|
||||
@property
|
||||
@@ -43,7 +47,11 @@ class LanceDBConnection:
|
||||
-------
|
||||
A list of table names.
|
||||
"""
|
||||
return [p.stem for p in Path(self.uri).glob("*.lance")]
|
||||
if get_uri_scheme(self.uri) == "file":
|
||||
return [p.stem for p in Path(self.uri).glob("*.lance")]
|
||||
raise NotImplementedError(
|
||||
"List table_names is only supported for local filesystem for now"
|
||||
)
|
||||
|
||||
def __len__(self) -> int:
|
||||
return len(self.table_names())
|
||||
|
||||
@@ -12,13 +12,14 @@
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
from retry import retry
|
||||
import sys
|
||||
from typing import Callable, Union
|
||||
|
||||
from lance.vector import vec_to_table
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import pyarrow as pa
|
||||
from lance.vector import vec_to_table
|
||||
from retry import retry
|
||||
|
||||
|
||||
def with_embeddings(
|
||||
@@ -64,13 +65,19 @@ class EmbeddingFunction:
|
||||
return self.func(c.tolist())
|
||||
|
||||
if len(self.rate_limiter_kwargs) > 0:
|
||||
import ratelimiter
|
||||
v = int(sys.version_info.minor)
|
||||
if v >= 11:
|
||||
print(
|
||||
"WARNING: rate limit only support up to 3.10, proceeding without rate limiter"
|
||||
)
|
||||
else:
|
||||
import ratelimiter
|
||||
|
||||
max_calls = self.rate_limiter_kwargs["max_calls"]
|
||||
limiter = ratelimiter.RateLimiter(
|
||||
max_calls, period=self.rate_limiter_kwargs["period"]
|
||||
)
|
||||
embed_func = limiter(embed_func)
|
||||
max_calls = self.rate_limiter_kwargs["max_calls"]
|
||||
limiter = ratelimiter.RateLimiter(
|
||||
max_calls, period=self.rate_limiter_kwargs["period"]
|
||||
)
|
||||
embed_func = limiter(embed_func)
|
||||
batches = self.to_batches(text)
|
||||
embeds = [emb for c in batches for emb in embed_func(c)]
|
||||
return embeds
|
||||
@@ -79,11 +86,6 @@ class EmbeddingFunction:
|
||||
return f"EmbeddingFunction(func={self.func})"
|
||||
|
||||
def rate_limit(self, max_calls=0.9, period=1.0):
|
||||
import sys
|
||||
|
||||
v = int(sys.version_info.minor)
|
||||
if v >= 11:
|
||||
raise ValueError("rate limit only support up to 3.10")
|
||||
self.rate_limiter_kwargs = dict(max_calls=max_calls, period=period)
|
||||
return self
|
||||
|
||||
|
||||
@@ -24,6 +24,7 @@ class LanceQueryBuilder:
|
||||
"""
|
||||
|
||||
def __init__(self, table: "lancedb.table.LanceTable", query: np.ndarray):
|
||||
self._metric = "L2"
|
||||
self._nprobes = 20
|
||||
self._refine_factor = None
|
||||
self._table = table
|
||||
@@ -77,6 +78,21 @@ class LanceQueryBuilder:
|
||||
self._where = where
|
||||
return self
|
||||
|
||||
def metric(self, metric: str) -> LanceQueryBuilder:
|
||||
"""Set the distance metric to use.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
metric: str
|
||||
The distance metric to use. By default "l2" is used.
|
||||
|
||||
Returns
|
||||
-------
|
||||
The LanceQueryBuilder object.
|
||||
"""
|
||||
self._metric = metric
|
||||
return self
|
||||
|
||||
def nprobes(self, nprobes: int) -> LanceQueryBuilder:
|
||||
"""Set the number of probes to use.
|
||||
|
||||
@@ -108,7 +124,12 @@ class LanceQueryBuilder:
|
||||
return self
|
||||
|
||||
def to_df(self) -> pd.DataFrame:
|
||||
"""Execute the query and return the results as a pandas DataFrame."""
|
||||
"""
|
||||
Execute the query and return the results as a pandas DataFrame.
|
||||
In addition to the selected columns, LanceDB also returns a vector
|
||||
and also the "score" column which is the distance between the query
|
||||
vector and the returned vector.
|
||||
"""
|
||||
ds = self._table.to_lance()
|
||||
# TODO indexed search
|
||||
tbl = ds.to_table(
|
||||
@@ -118,6 +139,7 @@ class LanceQueryBuilder:
|
||||
"column": VECTOR_COLUMN_NAME,
|
||||
"q": self._query,
|
||||
"k": self._limit,
|
||||
"metric": self._metric,
|
||||
"nprobes": self._nprobes,
|
||||
"refine_factor": self._refine_factor,
|
||||
},
|
||||
|
||||
@@ -19,12 +19,12 @@ from functools import cached_property
|
||||
import lance
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
from lance import LanceDataset
|
||||
import pyarrow as pa
|
||||
from lance import LanceDataset
|
||||
from lance.vector import vec_to_table
|
||||
|
||||
from .common import DATA, VEC, VECTOR_COLUMN_NAME
|
||||
from .query import LanceQueryBuilder
|
||||
from .common import DATA, VECTOR_COLUMN_NAME, VEC
|
||||
|
||||
|
||||
def _sanitize_data(data, schema):
|
||||
@@ -106,11 +106,14 @@ class LanceTable:
|
||||
def _dataset_uri(self) -> str:
|
||||
return os.path.join(self._conn.uri, f"{self.name}.lance")
|
||||
|
||||
def create_index(self, num_partitions=256, num_sub_vectors=96):
|
||||
def create_index(self, metric="L2", num_partitions=256, num_sub_vectors=96):
|
||||
"""Create an index on the table.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
metric: str, default "L2"
|
||||
The distance metric to use when creating the index. Valid values are "L2" or "cosine".
|
||||
L2 is euclidean distance.
|
||||
num_partitions: int
|
||||
The number of IVF partitions to use when creating the index.
|
||||
Default is 256.
|
||||
@@ -121,6 +124,7 @@ class LanceTable:
|
||||
self._dataset.create_index(
|
||||
column=VECTOR_COLUMN_NAME,
|
||||
index_type="IVF_PQ",
|
||||
metric=metric,
|
||||
num_partitions=num_partitions,
|
||||
num_sub_vectors=num_sub_vectors,
|
||||
)
|
||||
@@ -166,6 +170,9 @@ class LanceTable:
|
||||
Returns
|
||||
-------
|
||||
A LanceQueryBuilder object representing the query.
|
||||
Once executed, the query returns selected columns, the vector,
|
||||
and also the "score" column which is the distance between the query
|
||||
vector and the returned vector.
|
||||
"""
|
||||
if isinstance(query, list):
|
||||
query = np.array(query)
|
||||
|
||||
43
python/lancedb/util.py
Normal file
43
python/lancedb/util.py
Normal file
@@ -0,0 +1,43 @@
|
||||
# Copyright 2023 LanceDB Developers
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from urllib.parse import ParseResult, urlparse
|
||||
|
||||
from pyarrow import fs
|
||||
|
||||
|
||||
def get_uri_scheme(uri: str) -> str:
|
||||
"""
|
||||
Get the scheme of a URI. If the URI does not have a scheme, assume it is a file URI.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
uri : str
|
||||
The URI to parse.
|
||||
|
||||
Returns
|
||||
-------
|
||||
str: The scheme of the URI.
|
||||
"""
|
||||
parsed = urlparse(uri)
|
||||
scheme = parsed.scheme
|
||||
if not scheme:
|
||||
scheme = "file"
|
||||
elif scheme in ["s3a", "s3n"]:
|
||||
scheme = "s3"
|
||||
elif len(scheme) == 1:
|
||||
# Windows drive names are parsed as the scheme
|
||||
# e.g. "c:\path" -> ParseResult(scheme="c", netloc="", path="/path", ...)
|
||||
# So we add special handling here for schemes that are a single character
|
||||
scheme = "file"
|
||||
return scheme
|
||||
@@ -1,10 +1,10 @@
|
||||
[project]
|
||||
name = "lancedb"
|
||||
version = "0.1"
|
||||
dependencies = ["pylance>=0.4.3", "ratelimiter", "retry", "tqdm"]
|
||||
version = "0.1.2"
|
||||
dependencies = ["pylance>=0.4.6", "ratelimiter", "retry", "tqdm"]
|
||||
description = "lancedb"
|
||||
authors = [
|
||||
{ name = "Lance Devs", email = "dev@eto.ai" },
|
||||
{ name = "LanceDB Devs", email = "dev@lancedb.com" },
|
||||
]
|
||||
license = { file = "LICENSE" }
|
||||
readme = "README.md"
|
||||
|
||||
@@ -11,10 +11,11 @@
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import lancedb
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
import lancedb
|
||||
|
||||
|
||||
def test_basic(tmp_path):
|
||||
db = lancedb.connect(tmp_path)
|
||||
|
||||
@@ -12,13 +12,14 @@
|
||||
# limitations under the License.
|
||||
|
||||
import lance
|
||||
from lancedb.query import LanceQueryBuilder
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import pandas.testing as tm
|
||||
import pyarrow as pa
|
||||
|
||||
import pytest
|
||||
|
||||
from lancedb.query import LanceQueryBuilder
|
||||
|
||||
|
||||
class MockTable:
|
||||
def __init__(self, tmp_path):
|
||||
@@ -60,3 +61,21 @@ def test_query_builder_with_filter(table):
|
||||
df = LanceQueryBuilder(table, [0, 0]).where("id = 2").to_df()
|
||||
assert df["id"].values[0] == 2
|
||||
assert all(df["vector"].values[0] == [3, 4])
|
||||
|
||||
|
||||
def test_query_builder_with_metric(table):
|
||||
query = [4, 8]
|
||||
df_default = LanceQueryBuilder(table, query).to_df()
|
||||
df_l2 = LanceQueryBuilder(table, query).metric("l2").to_df()
|
||||
tm.assert_frame_equal(df_default, df_l2)
|
||||
|
||||
df_cosine = LanceQueryBuilder(table, query).metric("cosine").limit(1).to_df()
|
||||
assert df_cosine.score[0] == pytest.approx(
|
||||
cosine_distance(query, df_cosine.vector[0]),
|
||||
abs=1e-6,
|
||||
)
|
||||
assert 0 <= df_cosine.score[0] <= 1
|
||||
|
||||
|
||||
def cosine_distance(vec1, vec2):
|
||||
return 1 - np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
|
||||
|
||||
30
python/tests/test_util.py
Normal file
30
python/tests/test_util.py
Normal file
@@ -0,0 +1,30 @@
|
||||
# Copyright 2023 LanceDB Developers
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from lancedb.util import get_uri_scheme
|
||||
|
||||
|
||||
def test_normalize_uri():
|
||||
uris = [
|
||||
"relative/path",
|
||||
"/absolute/path",
|
||||
"file:///absolute/path",
|
||||
"s3://bucket/path",
|
||||
"gs://bucket/path",
|
||||
"c:\\windows\\path",
|
||||
]
|
||||
schemes = ["file", "file", "file", "s3", "gs", "file"]
|
||||
|
||||
for uri, expected_scheme in zip(uris, schemes):
|
||||
parsed_scheme = get_uri_scheme(uri)
|
||||
assert parsed_scheme == expected_scheme
|
||||
12
rust/Cargo.toml
Normal file
12
rust/Cargo.toml
Normal file
@@ -0,0 +1,12 @@
|
||||
[package]
|
||||
name = "vectordb"
|
||||
version = "0.0.1"
|
||||
edition = "2021"
|
||||
description = "Serverless, low-latency vector database for AI applications"
|
||||
license = "Apache-2.0"
|
||||
repository = "https://github.com/lancedb/lancedb"
|
||||
|
||||
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
|
||||
|
||||
[dependencies]
|
||||
lance = "0.4.3"
|
||||
14
rust/src/lib.rs
Normal file
14
rust/src/lib.rs
Normal file
@@ -0,0 +1,14 @@
|
||||
pub fn add(left: usize, right: usize) -> usize {
|
||||
left + right
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn it_works() {
|
||||
let result = add(2, 2);
|
||||
assert_eq!(result, 4);
|
||||
}
|
||||
}
|
||||
Reference in New Issue
Block a user