Compare commits


10 Commits
600 ... v0.4.13

Author SHA1 Message Date
Lance Release
e688484bd3 Bump version: 0.4.12 → 0.4.13 2024-03-16 05:21:44 +00:00
Weston Pace
3bcd61c8de feat: bump lance to 0.10.4 (#1123) 2024-03-15 22:21:04 -07:00
vincent d warmerdam
c76ec48603 Explain Voronoi seed initialisation (#1114)
This PR fixes https://github.com/lancedb/lancedb/issues/1112. It turned
out that K-means is currently used internally, so I figured adding that
context to the docs would be nice.
2024-03-15 14:16:05 -07:00
Christian Di Lorenzo
d974413745 fix(python): Add python azure blob read support (#1102)
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should now be as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```
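For credentials, the new code path reads the standard Azure storage environment variables (see the `fs_from_uri` change further down) and relies on the optional `adlfs` dependency added to the `azure` extra. A minimal setup sketch, with placeholder values:

```python
import os

import lancedb

# Placeholder credentials; the adlfs-backed filesystem added in this PR
# reads these two environment variables when the URI scheme is "az".
os.environ["AZURE_STORAGE_ACCOUNT_NAME"] = "<account>"
os.environ["AZURE_STORAGE_ACCOUNT_KEY"] = "<key>"

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
```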

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
2024-03-15 14:15:41 -07:00
Weston Pace
ec4f2fbd30 feat: update lance to v0.10.3 (#1094) 2024-03-15 08:50:28 -07:00
Ayush Chaurasia
6375ea419a chore(python): Increase event interval for telemetry (#1108)
Increasing the event reporting interval from 5 minutes to 60 minutes.
2024-03-15 17:04:43 +05:30
Raghav Dixit
6689192cee doc updates (#1085)
closes #1084
2024-03-14 14:38:28 +05:30
Chang She
dbec598610 feat(python): support optional vector field in pydantic model (#1097)
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this do not require embeddings to be calculated
explicitly by the user, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, trying to construct pydantic model instances without a vector
fails because it's a required field.

Instead, you need to add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are then filled with zeros. Instead, in
`_append_vector_col` we have to add an additional check so that embedding
generation is also called when the vector column is present but entirely null.
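Roughly, the check amounts to treating an all-null vector column the same as a missing one (a sketch only; the actual change is in the `_append_vector_col` diff below):

```python
import pyarrow as pa
import pyarrow.compute as pc

def needs_embeddings(data: pa.Table, vector_column: str) -> bool:
    # Missing column: embeddings must be computed.
    if vector_column not in data.column_names:
        return True
    # Present but entirely null (e.g. pydantic default=None): recompute.
    return pc.all(pc.is_null(data[vector_column])).as_py()
```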
2024-03-14 03:05:08 +05:30
QianZhu
8f6e7ce4f3 add index_stats python api (#1096)
The integration test will be covered in another PR:
https://github.com/lancedb/sophon/pull/1876
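A hedged usage sketch based on the `RemoteTable` diff below (the connection URI, API key, and index UUID are placeholders, not real values):

```python
import lancedb

# Hypothetical remote (LanceDB Cloud) connection; all values are placeholders.
db = lancedb.connect("db://my-db", api_key="sk-...", region="us-east-1")
table = db.open_table("my_table")

indices = table.list_indices()             # existing endpoint
stats = table.index_stats("<index-uuid>")  # new endpoint added in this PR
print(stats)
```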
2024-03-13 08:47:54 -07:00
Chang She
b482f41bf4 fix(python): fix typo in passing in the api_key explicitly (#1098)
fix silly typo
2024-03-12 22:01:12 -07:00
17 changed files with 750 additions and 28 deletions

View File

@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.4.12
current_version = 0.4.13
commit = True
message = Bump version: {current_version} → {new_version}
tag = True

View File

@@ -14,10 +14,10 @@ keywords = ["lancedb", "lance", "database", "vector", "search"]
categories = ["database-implementations"]
[workspace.dependencies]
lance = { "version" = "=0.10.2", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.10.2" }
lance-linalg = { "version" = "=0.10.2" }
lance-testing = { "version" = "=0.10.2" }
lance = { "version" = "=0.10.4", "features" = ["dynamodb"] }
lance-index = { "version" = "=0.10.4" }
lance-linalg = { "version" = "=0.10.4" }
lance-testing = { "version" = "=0.10.4" }
# Note that this one does not include pyarrow
arrow = { version = "50.0", optional = false }
arrow-array = "50.0"

View File

@@ -31,7 +31,7 @@ As an example, consider starting with 128-dimensional vector consisting of 32-bi
While PQ helps with reducing the size of the index, IVF primarily addresses search performance. The primary purpose of an inverted file index is to facilitate rapid and effective nearest neighbor search by narrowing down the search space.
In IVF, the PQ vector space is divided into *Voronoi cells*, which are essentially partitions that consist of all the points in the space that are within a threshold distance of the given region's seed point. These seed points are used to create an inverted index that correlates each centroid with a list of vectors in the space, allowing a search to be restricted to just a subset of vectors in the index.
In IVF, the PQ vector space is divided into *Voronoi cells*, which are essentially partitions that consist of all the points in the space that are within a threshold distance of the given region's seed point. These seed points are initialized by running K-means over the stored vectors. The centroids of K-means become the seed points, which then each define a region. These regions are then used to create an inverted index that correlates each centroid with a list of vectors in the space, allowing a search to be restricted to just a subset of vectors in the index.
![](../assets/ivfpq_ivf_desc.webp)
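Purely as an illustration of the seeding described above (not LanceDB's actual implementation): run K-means over the stored vectors, treat the resulting centroids as the Voronoi seed points, and keep an inverted list of vector ids per centroid:

```python
import numpy as np

def build_toy_ivf(vectors: np.ndarray, n_partitions: int, n_iters: int = 20):
    """Toy IVF construction: K-means centroids act as Voronoi seed points."""
    rng = np.random.default_rng(0)
    centroids = vectors[rng.choice(len(vectors), n_partitions, replace=False)]
    for _ in range(n_iters):
        # Assign every vector to its nearest centroid (its Voronoi cell).
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=-1)
        assignments = dists.argmin(axis=1)
        # Move each centroid to the mean of its cell.
        for k in range(n_partitions):
            members = vectors[assignments == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    # Inverted index: centroid id -> ids of the vectors in that cell.
    inverted = {k: np.flatnonzero(assignments == k).tolist() for k in range(n_partitions)}
    return centroids, inverted
```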

View File

@@ -224,7 +224,6 @@ This embedding function supports ingesting images as both bytes and urls. You ca
!!! info
LanceDB supports ingesting images directly from accessible links.
```python
db = lancedb.connect(tmp_path)
@@ -290,4 +289,67 @@ print(actual.label)
```
### Imagebind embeddings
We have support for [imagebind](https://github.com/facebookresearch/ImageBind) model embeddings. You can download our version of the packaged model via `pip install imagebind-packaged==0.1.2`.
This function is registered as `imagebind` and supports Audio, Video and Text modalities (extending to Thermal, Depth, and IMU data):
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| `name` | `str` | `"imagebind_huge"` | Name of the model. |
| `device` | `str` | `"cpu"` | The device to run the model on. Can be `"cpu"` or `"gpu"`. |
| `normalize` | `bool` | `False` | Set to `True` to normalize your inputs before model ingestion. |
Below is an example demonstrating how the API works:
```python
db = lancedb.connect(tmp_path)
registry = EmbeddingFunctionRegistry.get_instance()
func = registry.get("imagebind").create()
class ImageBindModel(LanceModel):
text: str
image_uri: str = func.SourceField()
audio_path: str
vector: Vector(func.ndims()) = func.VectorField()
# add locally accessible image paths
text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]
# Load data
inputs = [
{"text": a, "audio_path": b, "image_uri": c}
for a, b, c in zip(text_list, audio_paths, image_paths)
]
#create table and add data
table = db.create_table("img_bind", schema=ImageBindModel)
table.add(inputs)
```
Now, we can search using any modality:
#### Image search
```python
query_image = "./assets/dog_image2.jpg" #download an image and enter that path here
actual = table.search(query_image).limit(1).to_pydantic(ImageBindModel)[0]
print(actual.text == "dog")
```
#### Audio search
```python
query_audio = "./assets/car_audio2.wav" #download an audio clip and enter path here
actual = table.search(query_audio).limit(1).to_pydantic(ImageBindModel)[0]
print(actual.text == "car")
```
#### Text search
You can pass any text query and fetch the result as follows:
```python
query = "an animal which flies and tweets"
actual = table.search(query).limit(1).to_pydantic(ImageBindModel)[0]
print(actual.text == "bird")
```
If you have any questions about the embeddings API, supported models, or see a relevant model missing, please raise an issue [on GitHub](https://github.com/lancedb/lancedb/issues).

File diff suppressed because one or more lines are too long

View File

@@ -1,6 +1,6 @@
{
"name": "vectordb",
"version": "0.4.12",
"version": "0.4.13",
"description": " Serverless, low-latency vector database for AI applications",
"main": "dist/index.js",
"types": "dist/index.d.ts",
@@ -88,10 +88,10 @@
}
},
"optionalDependencies": {
"@lancedb/vectordb-darwin-arm64": "0.4.12",
"@lancedb/vectordb-darwin-x64": "0.4.12",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.12",
"@lancedb/vectordb-linux-x64-gnu": "0.4.12",
"@lancedb/vectordb-win32-x64-msvc": "0.4.12"
"@lancedb/vectordb-darwin-arm64": "0.4.13",
"@lancedb/vectordb-darwin-x64": "0.4.13",
"@lancedb/vectordb-linux-arm64-gnu": "0.4.13",
"@lancedb/vectordb-linux-x64-gnu": "0.4.13",
"@lancedb/vectordb-win32-x64-msvc": "0.4.13"
}
}

View File

@@ -3,7 +3,7 @@ name = "lancedb"
version = "0.6.3"
dependencies = [
"deprecation",
"pylance==0.10.2",
"pylance==0.10.4",
"ratelimiter~=1.0",
"retry>=0.9.2",
"tqdm>=4.27.0",
@@ -81,6 +81,7 @@ embeddings = [
"awscli>=1.29.57",
"botocore>=1.31.57",
]
azure = ["adlfs>=2024.2.0"]
[tool.maturin]
python-source = "python"

View File

@@ -31,7 +31,7 @@ class ImageBindEmbeddings(EmbeddingFunction):
six different modalities: images, text, audio, depth, thermal, and IMU data
to download package, run :
`pip install imagebind@git+https://github.com/raghavdixit99/ImageBind`
`pip install imagebind-packaged==0.1.2`
"""
name: str = "imagebind_huge"

View File

@@ -113,5 +113,5 @@ class OpenAIEmbeddings(TextEmbeddingFunction):
if self.organization:
kwargs["organization"] = self.organization
if self.api_key:
kwargs["api_key"] = self
kwargs["api_key"] = self.api_key
return openai.OpenAI(**kwargs)

View File

@@ -68,10 +68,16 @@ class RemoteTable(Table):
def list_indices(self):
"""List all the indices on the table"""
print(self._name)
resp = self._conn._client.post(f"/v1/table/{self._name}/index/list/")
return resp
def index_stats(self, index_uuid: str):
"""List all the indices on the table"""
resp = self._conn._client.post(
f"/v1/table/{self._name}/index/{index_uuid}/stats/"
)
return resp
def create_scalar_index(
self,
column: str,

View File

@@ -118,7 +118,8 @@ def _append_vector_col(data: pa.Table, metadata: dict, schema: Optional[pa.Schem
functions = EmbeddingFunctionRegistry.get_instance().parse_functions(metadata)
for vector_column, conf in functions.items():
func = conf.function
if vector_column not in data.column_names:
no_vector_column = vector_column not in data.column_names
if no_vector_column or pc.all(pc.is_null(data[vector_column])).as_py():
col_data = func.compute_source_embeddings_with_retry(
data[conf.source_column]
)
@@ -126,9 +127,16 @@ def _append_vector_col(data: pa.Table, metadata: dict, schema: Optional[pa.Schem
dtype = schema.field(vector_column).type
else:
dtype = pa.list_(pa.float32(), len(col_data[0]))
data = data.append_column(
pa.field(vector_column, type=dtype), pa.array(col_data, type=dtype)
)
if no_vector_column:
data = data.append_column(
pa.field(vector_column, type=dtype), pa.array(col_data, type=dtype)
)
else:
data = data.set_column(
data.column_names.index(vector_column),
pa.field(vector_column, type=dtype),
pa.array(col_data, type=dtype),
)
return data

View File

@@ -26,6 +26,18 @@ import pyarrow as pa
import pyarrow.fs as pa_fs
def safe_import_adlfs():
try:
import adlfs
return adlfs
except ImportError:
return None
adlfs = safe_import_adlfs()
def get_uri_scheme(uri: str) -> str:
"""
Get the scheme of a URI. If the URI does not have a scheme, assume it is a file URI.
@@ -92,6 +104,17 @@ def fs_from_uri(uri: str) -> Tuple[pa_fs.FileSystem, str]:
path = get_uri_location(uri)
return fs, path
elif get_uri_scheme(uri) == "az" and adlfs is not None:
az_blob_fs = adlfs.AzureBlobFileSystem(
account_name=os.environ.get("AZURE_STORAGE_ACCOUNT_NAME"),
account_key=os.environ.get("AZURE_STORAGE_ACCOUNT_KEY"),
)
fs = pa_fs.PyFileSystem(pa_fs.FSSpecHandler(az_blob_fs))
path = get_uri_location(uri)
return fs, path
return pa_fs.FileSystem.from_uri(uri)

View File

@@ -69,7 +69,7 @@ class _Events:
self.throttled_event_names = ["search_table"]
self.throttled_events = set()
self.max_events = 5 # max events to store in memory
self.rate_limit = 60.0 * 5 # rate limit (seconds)
self.rate_limit = 60.0 * 60.0 # rate limit (seconds)
self.time = 0.0
if is_git_dir():

View File

@@ -11,6 +11,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
from typing import List, Union
import lance
import lancedb
@@ -23,6 +24,8 @@ from lancedb.embeddings import (
EmbeddingFunctionRegistry,
with_embeddings,
)
from lancedb.embeddings.base import TextEmbeddingFunction
from lancedb.embeddings.registry import get_registry, register
from lancedb.pydantic import LanceModel, Vector
@@ -112,3 +115,34 @@ def test_embedding_function_rate_limit(tmp_path):
table.add([{"text": "hello world"}])
table.add([{"text": "hello world"}])
assert len(table) == 2
def test_add_optional_vector(tmp_path):
@register("mock-embedding")
class MockEmbeddingFunction(TextEmbeddingFunction):
def ndims(self):
return 128
def generate_embeddings(
self, texts: Union[List[str], np.ndarray]
) -> List[np.array]:
"""
Generate the embeddings for the given texts
"""
return [np.random.randn(self.ndims()).tolist() for _ in range(len(texts))]
registry = get_registry()
model = registry.get("mock-embedding").create()
class LanceSchema(LanceModel):
id: str
vector: Vector(model.ndims()) = model.VectorField(default=None)
text: str = model.SourceField()
db = lancedb.connect(tmp_path)
tbl = db.create_table("optional_vector", schema=LanceSchema)
# add works
expected = LanceSchema(id="id", text="text")
tbl.add([expected])
assert not (np.abs(tbl.to_pandas()["vector"][0]) < 1e-6).all()

View File

@@ -16,16 +16,35 @@ import os
import lancedb
import pytest
# AWS:
# You need to set up AWS credentials and a base path to run this test. Example:
# AWS_PROFILE=default TEST_S3_BASE_URL=s3://my_bucket/dataset pytest tests/test_io.py
#
# Azure:
# You need to set up Azure credentials and a base path to run this test. Example:
# export AZURE_STORAGE_ACCOUNT_NAME="<account>"
# export AZURE_STORAGE_ACCOUNT_KEY="<key>"
# export REMOTE_BASE_URL=az://my_blob/dataset
# pytest tests/test_io.py
@pytest.fixture(autouse=True, scope="module")
def setup():
yield
if remote_url := os.environ.get("REMOTE_BASE_URL"):
db = lancedb.connect(remote_url)
for table in db.table_names():
db.drop_table(table)
@pytest.mark.skipif(
(os.environ.get("TEST_S3_BASE_URL") is None),
reason="please setup s3 base url",
(os.environ.get("REMOTE_BASE_URL") is None),
reason="please setup remote base url",
)
def test_s3_io():
db = lancedb.connect(os.environ.get("TEST_S3_BASE_URL"))
def test_remote_io():
db = lancedb.connect(os.environ.get("REMOTE_BASE_URL"))
assert db.table_names() == []
table = db.create_table(

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb-node"
version = "0.4.12"
version = "0.4.13"
description = "Serverless, low-latency vector database for AI applications"
license.workspace = true
edition.workspace = true

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.4.12"
version = "0.4.13"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true