Add a new test feature which allows for running the lancedb tests
against a remote server. Convert over a few tests in src/connection.rs
as a proof of concept.
To make local development easier, the remote tests can be run locally
from a Makefile. This file can also be used to run the feature tests,
with a single invocation of 'make'. (The feature tests require bringing
up a docker compose environment.)
This PR adds support for namespace-backed databases through
lance-namespace integration, enabling centralized table management
through namespace APIs.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Support shallow cloning a dataset at a specific location to create a new
dataset, using the shallow_clone feature in Lance. Also introduce remote
`clone` API for remote tables for this functionality.
We had previously prototyped a `Catalog` trait anticipating a
three-tiered Catalog-Database-Table structure. Now that we have
namespaces in the `Database` we can support any tiering scheme and the
`Catalog` trait is no longer needed.
## Summary
This PR introduces a `HeaderProvider` which is called for all remote
HTTP calls to get the latest headers to inject. This is useful for
features like adding the latest auth tokens where the header provider
can auto-refresh tokens internally and each request always set the
refreshed token.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Updates lance to 0.35.0-beta4, which also incurs a datafusion update.
This brings in a fix for a memory leak in index caching, resulting from
a cyclical reference.
This PR adds mTLS (mutual TLS) configuration support for the LanceDB
remote HTTP client, allowing users to authenticate with client
certificates and configure custom CA certificates for server
verification.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Some of the data fusion optimizers optimize based on data statistics
(e.g. total bytes, number of rows).
If those statistics are not supplied, optimizers cannot optimize on top.
One example is Anti Hash Join which can optimize from LeftAnti (Left:
big table, Right: small table) to RightAnti (Left: small table, Right:
big table). Left Anti requires reading the whole big & small table while
RightAnti only requires reading the whole left table and supports limit
push down to only read partial of big table
This PR adds support of multi-level namespace in a LanceDB database,
according to the Lance Namespace spec.
This allows users to create namespace inside a database connection,
perform create, drop, list, list_tables in a namespace. (other
operations like update, describe will be in a follow-up PR)
The 3 types of database connections behave like the following:
1 Local database connections will continue to have just a flat list of
tables for backwards compatibility.
2. Remote database connections will make REST API calls according to the
APIs in the Lance Namespace spec.
3. Lance Namespace connections will invoke the corresponding operations
against the specific namespace implementation which could have different
behaviors regarding these APIs.
All the table APIs now take identifier instead of name, for example
`/v1/table/{name}/create` is now `/v1/table/{id}/create`. If a table is
directly in the root namespace, the API call is identical. If the table
is in a namespace, then the full table ID should be used, with `$` as
the default delimiter (`.` is a special character and creates issues
with URL parsing so `$` is used), for example
`/v1/table/ns1$table1/create`. If a different parameter needs to be
passed in, user can configure the `id_delimiter` in client config and
that becomes a query parameter, for example
`/v1/table/ns1__table1/create?delimiter=__`
The Python and Typescript APIs are kept backwards compatible, but the
following Rust APIs are not:
1. `Connection::drop_table(&self, name: impl AsRef<str>) -> Result<()>`
is now `Connection::drop_table(&self, name: impl AsRef<str>, namespace:
&[String]) -> Result<()>`
2. `Connection::drop_all_tables(&self) -> Result<()>` is now
`Connection::drop_all_tables(&self, name: impl AsRef<str>) ->
Result<()>`
This shrinks the size of a local embedded build that can disable all the
default features. When combined with
https://github.com/lancedb/lance/pull/4362 and the dependencies are
updated to point to the fix, this resolves#2567 fully.
Verified by patching the workspace to redirect to my clone of lance with
the PR applied.
```
cargo tree -p lancedb -e no-build -e no-dev --no-default-features -i aws-config | less
```
The reason that lance itself needs to change too is that many
dependencies within that project depend on lance-io/default and lancedb
depends on them which transitively ends up enabling the cloud
regardless. The PR in lance removes the dependency on lance-io/default
from all sibling crates.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Enables two new parameters when building indices:
* `name`: Allows explicitly setting a name on the index. Default is
`{col_name}_idx`.
* `train` (default `True`): When set to `False`, an empty index will be
immediately created.
The upgrade of Lance means there are also additional behaviors from
cd76a993b8:
* When a scalar index is created on a Table, it will be kept around even
if all rows are deleted or updated.
* Scalar indices can be created on empty tables. They will default to
`train=False` if the table is empty.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
all 3 example are running now with:
```
cargo run --example simple
cargo run --example full_text_search
cargo run --example ivf_pq
```
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Adds tests to ensure that users can paginate through simple scan, FTS,
and vector search results using `limit` and `offset`.
Tests upstream work: https://github.com/lancedb/lance/pull/4318Closes#2459
These operations have existed in lance for a long while and many users
need to drop down to lance for this capability. This PR adds the API and
implements it using filters (e.g. `_rowid IN (...)`) so that in doesn't
currently add any load to `BaseTable`. I'm not sure that is sustainable
as base table implementations may want to specialize how they handle
this method. However, I figure it is a good starting point.
In addition, unlike Lance, this API does not currently guarantee
anything about the order of the take results. This is necessary for the
fallback filter approach to work (SQL filters cannot guarantee result
order)
Hello!
I'm new to lancedb and interested in the Rust SDK.
I couldn't find a good hybrid search example in Rust, so I created one.
## Usage
```bash
$ cargo run --quiet --example hybrid_search --features=sentence-transformers
Result: Python is a popular programming language.
Result: Mount Everest is the highest mountain in the world.
Result: The first computer programmer was Ada Lovelace.
Result: Coffee is one of the most popular beverages in the world.
Result: Basketball is a sport played with a ball and a hoop.
```
## Summary
- Adds an overall `timeout` parameter to `TimeoutConfig` that limits the
total time for the entire request
- Can be set via config or `LANCE_CLIENT_TIMEOUT` environment variable
- Exposed in Python and Node.js bindings
- Includes comprehensive tests
## Test plan
- [x] Unit tests for Rust TimeoutConfig
- [x] Integration tests for Python bindings
- [x] Integration tests for Node.js bindings
- [x] All existing tests pass
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
`nprobes` with a value greater than 20 fails with the minimum error:
```
self = <lancedb.query.AsyncVectorQuery object at 0x10b749720>, minimum_nprobes = 30
def minimum_nprobes(self, minimum_nprobes: int) -> Self:
"""Set the minimum number of probes to use.
See `nprobes` for more details.
These partitions will be searched on every indexed vector query and will
increase recall at the expense of latency.
"""
> self._inner.minimum_nprobes(minimum_nprobes)
E ValueError: Invalid input, minimum_nprobes must be less than or equal to maximum_nprobes
python/lancedb/query.py:2744: ValueError
```
Putting the max set before the min seems reasonable but it causes this
reasonable case to fail:
```
def test_nprobes_min_max_works_sync(table):
LanceVectorQueryBuilder(table, [0, 0], "vector").minimum_nprobes(2).maximum_nprobes(4).to_list()
```
with
```
self = <lancedb.query.AsyncVectorQuery object at 0x1203f1c90>, maximum_nprobes = 4
def maximum_nprobes(self, maximum_nprobes: int) -> Self:
"""Set the maximum number of probes to use.
See `nprobes` for more details.
If this value is greater than `minimum_nprobes` then the excess partitions
will be searched only if we have not found enough results.
This can be useful when there is a narrow filter to allow these queries to
spend more time searching and avoid potential false negatives.
If this value is 0 then no limit will be applied and all partitions could be
searched if needed to satisfy the limit.
"""
> self._inner.maximum_nprobes(maximum_nprobes)
E ValueError: Invalid input, maximum_nprobes must be greater than or equal to minimum_nprobes
python/lancedb/query.py:2761: ValueError
```.
The case I care about is where min == max, but this solution handles it
even if they're not. If both min and max exist, we set both to the
minimum and then set the max. This isn't 100% the same as the minimum
setter checks for 0 on the min and `.nprobes` does not do any sanity
checking at all. But I figured this was the most reasonable and general
solution without touching more of this code.
As part of this I noticed the error messages were a bit ambiguous so I
made them symmetric and clarified them while I was here.
The `MetadataEraserExec` is super lightweight and doesn't really justify
partitioning. I had a plan recently that was partitioning just for this
node and that seems wasteful.
## Summary
- Exposes `Session` in Python and Typescript so users can set the
`index_cache_size_bytes` and `metadata_cache_size_bytes`
* The `Session` is attached to the `Connection`, and thus shared across
all tables in that connection.
- Adds deprecation warnings for table-level cache configuration
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>