This changes the default values for some namespace parameters in the
remote python SDK from None to [], to match the underlying code it
calls.
Prior to this commit, failing to supply "namespace" with the remote SDK
would cause an error because the underlying code it dispatches to does
not consider None to be valid input.
## Summary
This PR adds the missing `name` parameter to `create_scalar_index` and
`create_fts_index` methods in the Python SDK, which was inadvertently
omitted when it was added to `create_index` in PR #2586.
## Changes
- Add `name: Optional[str] = None` parameter to abstract
`Table.create_scalar_index` and `Table.create_fts_index` methods
- Update `LanceTable` implementation to accept and pass the `name`
parameter to the underlying Rust layer
- Update `RemoteTable` implementation to accept and pass the `name`
parameter
- Enhanced tests to verify custom index names work correctly for both
scalar and FTS indices
- When `name` is not provided, default names are generated (e.g.,
`{column}_idx`)
## Test plan
- [x] Added test cases for custom names in scalar index creation
- [x] Added test cases for custom names in FTS index creation
- [x] Verified existing tests continue to pass
- [x] Code formatting and linting checks pass
This ensures API consistency across all index creation methods in the
LanceDB Python SDK.
Fixes#2616🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
This PR adds support of multi-level namespace in a LanceDB database,
according to the Lance Namespace spec.
This allows users to create namespace inside a database connection,
perform create, drop, list, list_tables in a namespace. (other
operations like update, describe will be in a follow-up PR)
The 3 types of database connections behave like the following:
1 Local database connections will continue to have just a flat list of
tables for backwards compatibility.
2. Remote database connections will make REST API calls according to the
APIs in the Lance Namespace spec.
3. Lance Namespace connections will invoke the corresponding operations
against the specific namespace implementation which could have different
behaviors regarding these APIs.
All the table APIs now take identifier instead of name, for example
`/v1/table/{name}/create` is now `/v1/table/{id}/create`. If a table is
directly in the root namespace, the API call is identical. If the table
is in a namespace, then the full table ID should be used, with `$` as
the default delimiter (`.` is a special character and creates issues
with URL parsing so `$` is used), for example
`/v1/table/ns1$table1/create`. If a different parameter needs to be
passed in, user can configure the `id_delimiter` in client config and
that becomes a query parameter, for example
`/v1/table/ns1__table1/create?delimiter=__`
The Python and Typescript APIs are kept backwards compatible, but the
following Rust APIs are not:
1. `Connection::drop_table(&self, name: impl AsRef<str>) -> Result<()>`
is now `Connection::drop_table(&self, name: impl AsRef<str>, namespace:
&[String]) -> Result<()>`
2. `Connection::drop_all_tables(&self) -> Result<()>` is now
`Connection::drop_all_tables(&self, name: impl AsRef<str>) ->
Result<()>`
This PR integrates `lancedb` with `lance-namespace` so that users can
use LanceDB client to access Lance tables in any catalog services. In
general, we expect most of the logic to be delegated to the existing
`LanceDBConnection` and `LanceTable`, but the namespace implemenation
will control how table is created, dropped, and describe where the table
is stored with any related storage options like access credentials.
The implementation currently only supports a 1 level namespace that
directly contains tables. We will introduce nested namespace support in
a separated PR.
Users are expected to use it in the following way:
```python
>>> import lancedb
>>> import pyarrow as pa
>>> # Connect using GlueNamespace
>>> db = lancedb.connect_namespace("glue", {"catalog_id": "123456789012"})
>>> # Create a table with schema
>>> schema = pa.schema([
... pa.field("id", pa.int64()),
... pa.field("vector", pa.list_(pa.float32(), 2))
... ])
>>> table = db.create_table("my_table", schema=schema)
>>> # List tables
>>> db.table_names()
['my_table']
```
This shrinks the size of a local embedded build that can disable all the
default features. When combined with
https://github.com/lancedb/lance/pull/4362 and the dependencies are
updated to point to the fix, this resolves#2567 fully.
Verified by patching the workspace to redirect to my clone of lance with
the PR applied.
```
cargo tree -p lancedb -e no-build -e no-dev --no-default-features -i aws-config | less
```
The reason that lance itself needs to change too is that many
dependencies within that project depend on lance-io/default and lancedb
depends on them which transitively ends up enabling the cloud
regardless. The PR in lance removes the dependency on lance-io/default
from all sibling crates.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Enables two new parameters when building indices:
* `name`: Allows explicitly setting a name on the index. Default is
`{col_name}_idx`.
* `train` (default `True`): When set to `False`, an empty index will be
immediately created.
The upgrade of Lance means there are also additional behaviors from
cd76a993b8:
* When a scalar index is created on a Table, it will be kept around even
if all rows are deleted or updated.
* Scalar indices can be created on empty tables. They will default to
`train=False` if the table is empty.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
all 3 example are running now with:
```
cargo run --example simple
cargo run --example full_text_search
cargo run --example ivf_pq
```
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Adds tests to ensure that users can paginate through simple scan, FTS,
and vector search results using `limit` and `offset`.
Tests upstream work: https://github.com/lancedb/lance/pull/4318Closes#2459
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Documentation**
- Clarified and expanded explanations of data management concepts in
LanceDB.
- Added notes on automatic background fragment compaction and
incremental reindexing support in LanceDB Cloud/Enterprise.
- Updated details on disabling interim exhaustive kNN search during
background reindexing.
- Improved formatting and removed outdated FTS reindexing subsection.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
These operations have existed in lance for a long while and many users
need to drop down to lance for this capability. This PR adds the API and
implements it using filters (e.g. `_rowid IN (...)`) so that in doesn't
currently add any load to `BaseTable`. I'm not sure that is sustainable
as base table implementations may want to specialize how they handle
this method. However, I figure it is a good starting point.
In addition, unlike Lance, this API does not currently guarantee
anything about the order of the take results. This is necessary for the
fallback filter approach to work (SQL filters cannot guarantee result
order)
Hello!
I'm new to lancedb and interested in the Rust SDK.
I couldn't find a good hybrid search example in Rust, so I created one.
## Usage
```bash
$ cargo run --quiet --example hybrid_search --features=sentence-transformers
Result: Python is a popular programming language.
Result: Mount Everest is the highest mountain in the world.
Result: The first computer programmer was Ada Lovelace.
Result: Coffee is one of the most popular beverages in the world.
Result: Basketball is a sport played with a ball and a hoop.
```
## Summary
- Enhanced vector column detection to use substring matching instead of
exact matching
- Now detects columns with names containing "vector" or "embedding"
(case-insensitive)
- Added integer vector support to Node.js implementation (matching
Python)
- Comprehensive test coverage for both float and integer vector types
## Changes
### Python (`python/python/lancedb/table.py`)
- Updated `_infer_target_schema()` to use substring matching with helper
function `_is_vector_column()`
- Preserved original field names instead of forcing "vector"
- Consolidated duplicate logic for better maintainability
### Node.js (`nodejs/lancedb/arrow.ts`)
- Enhanced type inference with `nameSuggestsVectorColumn()` helper
function
- Added `isAllIntegers()` function with performance optimization (checks
first 10 elements)
- Implemented integer vector support using `Uint8` type (matching
Python)
- Improved type safety by removing `any` usage
### Tests
- **Python**: Added
`test_infer_target_schema_with_vector_embedding_names()` in
`test_util.py`
- **Node.js**: Added comprehensive test case in `arrow.test.ts`
- Both test suites cover various naming patterns and integer/float
vector types
## Examples of newly supported column names:
- `user_vector`, `text_embedding`, `doc_embeddings`
- `my_vector_field`, `embedding_model`
- `VECTOR_COL`, `Vector_Mixed` (case-insensitive)
- Both float and integer arrays are properly converted to fixed-size
lists
## Test plan
- [x] All existing tests pass (backward compatibility maintained)
- [x] New tests pass for both Python and Node.js implementations
- [x] Integer vector detection works correctly in Node.js
- [x] Code passes linting and formatting checks
- [x] Performance optimized for large vector arrays
Fixes#2546🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
### Summary
This PR adds **SigLIP** (Sigmoid Loss Image Pretraining) as a new
embedding model in the LanceDB embedding registry. SigLIP improves
image-text alignment performance using sigmoid-based contrastive loss
and offers robust zero-shot generalization.
Fixes#2498
### What’s Implemented
#### 1. `SigLIP` Embedding Class
* Added `SigLIP` support under `python/lancedb/embeddings/siglip.py`
* Implements:
* `compute_source_embeddings`
* `_batch_generate_embeddings`
* Normalization logic
* Batch-wise progress logging for image embedding
#### 2. Registry Integration
* Registered `SigLIP` in `embeddings/__init__.py`
* `SigLIP` now usable via `connect(..., embedding="siglip")`
#### 3. Evaluation Benchmark Support
* Added SigLIP to `test_embeddings_slow.py` for side-by-side
benchmarking with OpenCLIP and ImageBind
### New Test Methods
#### `test_siglip`
* End-to-end test to verify embeddings table creation and vector shape
for SigLIP

#### `test_siglip_vs_openclip_vs_imagebind_benchmark_full`
* Benchmarks:
* **Recall\@1 / 5 / 10**
* **mAP (Mean Average Precision)**
* **Embedding & Search Latency**
* Dimensionality reporting

### Notes
* SigLIP outputs 768D embeddings (vs 512D for OpenCLIP)
* Benchmark shows competitive performance despite higher dimensionality
* I'm still new to contributing to open-source and learning as I go.
Please feel free to suggest any improvements — I'm happy to make
changes!
## Summary
- Adds an overall `timeout` parameter to `TimeoutConfig` that limits the
total time for the entire request
- Can be set via config or `LANCE_CLIENT_TIMEOUT` environment variable
- Exposed in Python and Node.js bindings
- Includes comprehensive tests
## Test plan
- [x] Unit tests for Rust TimeoutConfig
- [x] Integration tests for Python bindings
- [x] Integration tests for Node.js bindings
- [x] All existing tests pass
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
Update license field from `Apache 2.0` to be `Apache-2.0` for all
Node.js packages.
This was causing GitHub's Dependency Review license check to fail with:
> The validity of the licenses of the dependencies below could not be
determined. Ensure that they are valid SPDX licenses
`nprobes` with a value greater than 20 fails with the minimum error:
```
self = <lancedb.query.AsyncVectorQuery object at 0x10b749720>, minimum_nprobes = 30
def minimum_nprobes(self, minimum_nprobes: int) -> Self:
"""Set the minimum number of probes to use.
See `nprobes` for more details.
These partitions will be searched on every indexed vector query and will
increase recall at the expense of latency.
"""
> self._inner.minimum_nprobes(minimum_nprobes)
E ValueError: Invalid input, minimum_nprobes must be less than or equal to maximum_nprobes
python/lancedb/query.py:2744: ValueError
```
Putting the max set before the min seems reasonable but it causes this
reasonable case to fail:
```
def test_nprobes_min_max_works_sync(table):
LanceVectorQueryBuilder(table, [0, 0], "vector").minimum_nprobes(2).maximum_nprobes(4).to_list()
```
with
```
self = <lancedb.query.AsyncVectorQuery object at 0x1203f1c90>, maximum_nprobes = 4
def maximum_nprobes(self, maximum_nprobes: int) -> Self:
"""Set the maximum number of probes to use.
See `nprobes` for more details.
If this value is greater than `minimum_nprobes` then the excess partitions
will be searched only if we have not found enough results.
This can be useful when there is a narrow filter to allow these queries to
spend more time searching and avoid potential false negatives.
If this value is 0 then no limit will be applied and all partitions could be
searched if needed to satisfy the limit.
"""
> self._inner.maximum_nprobes(maximum_nprobes)
E ValueError: Invalid input, maximum_nprobes must be greater than or equal to minimum_nprobes
python/lancedb/query.py:2761: ValueError
```.
The case I care about is where min == max, but this solution handles it
even if they're not. If both min and max exist, we set both to the
minimum and then set the max. This isn't 100% the same as the minimum
setter checks for 0 on the min and `.nprobes` does not do any sanity
checking at all. But I figured this was the most reasonable and general
solution without touching more of this code.
As part of this I noticed the error messages were a bit ambiguous so I
made them symmetric and clarified them while I was here.
The `MetadataEraserExec` is super lightweight and doesn't really justify
partitioning. I had a plan recently that was partitioning just for this
node and that seems wasteful.