This updates the workspace Lance dependencies from `v7.1.0-beta.4` to
`v7.2.0-beta.3` and refreshes `Cargo.lock`.
The lockfile now points at Lance commit
`7c070f760fa8e24c8015cb2afbd22c5e6b7898e8` and includes the transitive
dependency updates required by the new beta.
Update the Rust workspace Lance git dependencies and Java lance-core
dependency to v7.2.0-beta.1.
This keeps LanceDB aligned with the latest Lance beta release and
refreshes the Cargo lockfile for the new Lance dependency graph.
## Summary
When an `LsmWriteSpec` is installed on a table (#3396), `merge_insert`
upsert
calls are dispatched through Lance's MemWAL `ShardWriter` (LSM-style
append)
instead of the standard merge path.
- **`use_lsm_write`** — a `merge_insert` builder option, default `true`;
set it
`false` to use the standard path for a call even when a spec is set.
- **`assume_pre_sharded`** — a `merge_insert` builder option, default
`false`;
skips the per-row shard check and routes by the first row only.
- **`close_lsm_writers`** — drains and closes the table's cached MemWAL
shard
writers.
- The `merge_insert` **`on`** columns default to, and are validated
against,
the table's unenforced primary key.
- Shard writers are cached alongside the dataset (in
`DatasetConsistencyWrapper`) and reused for the session.
- `MergeResult` gains **`num_rows`** — on the LSM path the insert/update
breakdown is unknown until compaction, so only the total is reported.
Routing covers all three sharding strategies — bucket (murmur3,
Iceberg-compatible), identity, and unsharded. Each `merge_insert` call
targets
a single shard; the whole input is collected and validated before a
single
atomic `ShardWriter::put`, so a validation failure leaves the MemWAL
untouched.
Bindings: Python (`merge_insert(...).use_lsm_write(...)` /
`.assume_pre_sharded(...)`, `Table.close_lsm_writers`) and TypeScript
(`mergeInsert(...).useLsmWrite(...)` / `.assumePreSharded(...)`,
`Table.closeLsmWriters`).
## Context
Reconstructed from the original #3354 branch onto current `main`: the
branch
predated the #3394 (unenforced primary key) / #3396 (`LsmWriteSpec`)
split and
has been rebuilt on that merged foundation. Depends on Lance
`v7.0.0-beta.13`.
The MemWAL read path (reading un-flushed shard data back into queries)
and
remote (LanceDB Cloud) LSM support are follow-ups.
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
Bumps the pinned toolchain in `rust-toolchain.toml` from 1.94.0 to
1.95.0.
Fixes new lints surfaced by clippy on 1.95.0:
- `manual_checked_ops` — fragment size mean in `table.rs` uses
`checked_div`
- `explicit_counter_loop` — shuffle test loop in `shuffle.rs`
No rustc warnings were introduced.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a table is created with `pa.json_()` (PyArrow's JSON extension
type),
it is stored internally as `lance.json` (LargeBinary with `lance.json`
extension metadata). Calling `table.add()` with `pa.json_()` data failed
with:
```
RuntimeError: lance error: Append with different schema:
`data` should have type json but type was large_binary
```
`build_field_exprs` in `rust/lancedb/src/table/datafusion/cast.rs` saw
that
the input field (`Utf8` with `arrow.json` metadata) differed from the
table
field (`LargeBinary` with `lance.json` metadata). Since
`can_cast_types(Utf8, LargeBinary)` is true, it inserted a DataFusion
`Utf8 → LargeBinary` cast. That cast preserved the input field's
`arrow.json`
extension metadata instead of adopting the table's `lance.json`
metadata, so
lance-core detected a schema mismatch and rejected the append.
This adds a special case in `build_field_exprs`: when the input is
`arrow.json` and the table field is `lance.json`, the expression is
passed
through unchanged. Lance-core's write path already handles the
`arrow.json → lance.json` conversion (including JSONB encoding), so no
DataFusion cast is needed.
Fixes#3144
Continues #3291 from a fork (the original author's branch could not be
pushed to). The original commits are preserved; an additional commit
fixes
the CI failures on that PR — formatting, a missing trait import, and
read-back assertions that assumed binary storage when a lance.json
column
is read back as `Utf8`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: yunju.lly <yunju.lly@antgroup.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
- Bump lance dependency from `v7.0.0-beta.13` to `v7.0.0-rc.1`
- Remove PK constraint from `LsmWriteSpec::Bucket` docs and
`Table::set_lsm_write_spec` docs
- Remove test assertions that expected rejection when no PK is set or
when bucket column != PK
Closes https://github.com/lance-format/lance/issues/6917
Closes client side work of #3370
### Summary
- Plumbs `read_consistency_interval` from `ConnectBuilder` through
`RestfulLanceDbClient` so remote reads attach an
`x-lancedb-min-timestamp` freshness header. None = no header (default),
zero = "now", positive = `now - interval`.
- Adds per-table `FreshnessState` on `RemoteTable`: write responses
(`update`, `delete`, `merge_insert`, `add_columns`, `alter_columns`,
`drop_columns`) track the committed version, and the next read sends
`x-lancedb-min-version` so the server's cache honors read-your-write.
- `checkout(v)` / `checkout_tag(t)` / `checkout_latest()` / `restore()`
reset the freshness state appropriately; the validating `/describe/` and
tag-resolve requests are sent without freshness headers so they don't
carry stale state.
- Updates Rust, Python, and Node docstrings and calls out that stronger
consistency raises per-read latency and cost.
### Testing
- Unit tests cover default behavior, interval=0, positive interval,
checkout_latest baseline, min_version-after-write, checkout clears
state, and the two no-stale-header invariants on `checkout(v)` and
`checkout_tag(t)`.
- Ran smoke tests against local remote table to verify functionality
Modified the parameter of delete to a Predicate that could be
constructed from either datafusion Expr, from str (to support SQL
predicate), or from String to support python and javascript bindings.
When a datafusion Expr is used, it avoids the overhead of serializing to
SQL and re-parsing when callers already have an Expr (e.g. from query
planning).
The native implementation uses lance's `DeleteBuilder::from_expr`. The
remote implementation converts the Expr to SQL via `expr_to_sql_string`
before sending to the server, consistent with the existing query and
count_rows paths.
Closes#3204
Signed-off-by: Yuval Lifshitz <ylifshit@ibm.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
The `test_resolve_user_id_*` tests in `remote/client.rs` mutate the
process-global `LANCEDB_USER_ID` and `LANCEDB_USER_ID_ENV_KEY`
environment variables. cargo runs tests in a binary across multiple
threads, so one test's `remove_var` can race another's `set_var` between
when it's set and when `resolve_user_id()` reads it.
This surfaced as an intermittent failure of
`test_resolve_user_id_from_env_key` on Windows CI:
```
assertion `left == right` failed
left: None
right: Some("custom-env-user-id")
```
Annotates the five env-mutating tests with `serial_test`'s
`#[serial(user_id_env)]` so they run serially with respect to each
other.
Should be backported to `release/v0.28` (CI for #3421 hit this same
flake).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes#3407.
Remote tables now resolve create-index field paths against the table
schema before sending requests, so nested, escaped, and case-insensitive
inputs use the same canonical path contract as local tables. Remote
`list_indices()` also canonicalizes returned columns against the current
schema, and the remote query tests lock explicit nested vector and FTS
request payloads.
LanceDB default vector column discovery only considered top-level
fields, so tables with a single nested vector leaf still required users
to pass an explicit field path. This updates Rust and Python discovery
to recurse into struct fields, return canonical field paths, and
preserve actionable errors when no default or multiple defaults exist.
The explicit nested path flow for index creation and search remains
supported across Rust, Python, and Node, with regression coverage for
single nested vector leaves, multiple candidate leaves, and schemas
without vector leaves.
Closes#3405.
Index metadata APIs now resolve stored field ids back to Lance canonical
field paths instead of leaf names, so nested indexes such as
`metadata.user_id` and escaped literal-dot fields round-trip through
`list_indices()`. Native index creation also canonicalizes the input
path before handing it to Lance, keeping local metadata consistent with
the field-path contract while remote responses continue to expose
server-provided canonical columns.
Fixes#3403.
Native index creation was resolving requested columns through top-level
Arrow schema lookup before handing the request to Lance, which rejected
nested paths and could collapse a nested field to its leaf name. This PR
resolves index targets with Lance field-path semantics, passes the
canonical path through to Lance, and reports indexed columns from field
ids as canonical full paths.
This also removes the Python native FTS guard that rejected dotted paths
so scalar, vector, and FTS index creation share the same nested-field
contract. Related to #3402.
Fixes#3136.
## Summary
- `WithEmbeddingsScannable::scan_as_stream` matched columns positionally
against the table schema, so a `CastError` was raised whenever the
computed batch order differed from the table schema order.
- The mismatch surfaced when `add_columns` added a new physical column
**after** an embedding column: the table schema became
`[..., embedding, extra]`, but `compute_embeddings_for_batch` always
appends embeddings at the end, producing `[..., extra, embedding]`.
Position 2 then tried to cast e.g. `score: Float64` →
`embedding: FixedSizeList` and failed.
- Now we look each output column up by name in the result batch, which
is order-independent. If a non-embedding column required by the table
schema is missing from the input, we return a clear `InvalidInput`
error instead of a confusing cast error.
## Reproduction (from the issue)
```text
Table created with: [id, text, text_vec(embedding)]
add_columns("score") → schema: [id, text, text_vec, score]
table.add([id, text, score]) → BEFORE: CastError on position 2
AFTER: succeeds, embedding is computed
```
## Tests
-
`data::scannable.rs::test_with_embeddings_scannable_column_added_after_embedding`
— unit test exercising the exact column-order mismatch via
`WithEmbeddingsScannable::with_schema`.
-
`data::scannable.rs::test_with_embeddings_scannable_missing_required_column`
— covers the new "missing column" error path.
- `table::add_data.rs::test_add_with_embeddings_after_add_columns`
— end-to-end regression test mirroring the reproduction in the issue
(create table with embedding → `add_columns` → `table.add`).
## Test plan
- [x] `cargo check --quiet --features remote --tests --examples`
- [x] `cargo clippy --quiet --features remote --tests --examples`
- [x] `cargo fmt --all`
- [x] `cargo test --quiet --features remote -p lancedb embedding` — 18
tests pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#3243.
This PR exposes a new public api `Permutation.take_offsets(offsets:
list[int])`, since users initially had to call __getitems__ directly to
batch-fetch rows by position.
Currently, the name matches the existing `Table.take_offsets` pattern,
and now the dunder `__getitem__` and `__getitems__` now delegate to it.
Also, fixes a parse error when `PermutationReader::take_offsets` gets an
empty list. Now returns an empty `RecordBatch` with the correct schema
instead. Bundled this because without the fix the new public API blows
up on a perfectly reasonable input.
`__getitems__` is preserved since PyTorch's batched DataLoader requires
it.
### Testing
- Added 3 new Rust tests for empty offsets including permutation table
with Select::All, Select::Columns, and identity path
- Added 3 new Python tests for the public API including a happy case,
and empty input on both identity and permutation
clippy, format, check all clean!
cc: @westonpace
The 0.29 release happened on a branch because the main line had already
moved past the 6.0.0 stable lance release. As a result the version bump
commits ended up on the branch. This merges those commits back into
main.
---------
Co-authored-by: Lance Release <lance-dev@lancedb.com>
## Summary
Split out from #3354
Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` /
`unset_lsm_write_spec` to
install and clear the spec that selects Lance's MemWAL LSM-style write
path for
`merge_insert`.
`LsmWriteSpec` offers three sharding strategies, all built on Lance's
`InitializeMemWalBuilder`:
- `LsmWriteSpec::bucket(column, num_buckets)` — hash-bucket sharding by
the
single-column unenforced primary key.
- `LsmWriteSpec::identity(column)` — identity sharding by the raw value
of a
scalar column.
- `LsmWriteSpec::unsharded()` — a single MemWAL shard.
Each can be refined with `with_maintained_indexes(...)` (indexes the
MemWAL
keeps up to date as rows are appended) and
`with_writer_config_defaults(...)`
(default `ShardWriter` configuration recorded in the MemWAL index, so
every
writer starts from the same defaults). All variants require the table to
have
an unenforced primary key.
- `set_lsm_write_spec` installs the spec by initializing the MemWAL
index;
`unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting
to
the standard `merge_insert` path. `unset` is idempotent.
- Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`,
`set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript
(`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` /
`"unsharded"`). `RemoteTable` returns `NotSupported`.
The actual `merge_insert` LSM dispatch and `ShardWriter` write path are
a
follow-up — this PR only installs and clears the spec.
## Summary
Adds `Table::set_unenforced_primary_key` — records a single column as
the
table's unenforced primary key in Lance schema field metadata.
"Unenforced"
means LanceDB does not check uniqueness on write; the key is metadata
that
`merge_insert` consumes.
- Single-column only; the column must exist and have a supported dtype
(Int32, Int64, Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary).
The
API accepts an iterable for binding ergonomics but requires exactly one
column — compound keys are rejected.
- The primary key is immutable: calling this on a table that already has
an
unenforced primary key is rejected. Concurrent writers racing to set the
key
fail at commit time rather than silently overriding it.
- `RemoteTable` returns `NotSupported`.
- Bindings: Python (`AsyncTable`, `LanceTable`, `RemoteTable`) and
TypeScript
(`Table.setUnenforcedPrimaryKey`).
## Context
Split out from #3354 per review feedback, so the unenforced primary key
and the
`merge_insert` sharding spec land as separate reviewable PRs.
No Lance dependency bump — `main` is already on v7.0.0-beta.10, which
includes
the field-metadata round-trip fix the API relies on. Enforcing
primary-key
immutability at the Lance commit layer (so the cross-column concurrent
race is
also rejected) is a companion Lance change: lance-format/lance#6810.
Closes#3261.
## Summary
Adds `bytes` to the accepted types of `lancedb.expr.lit()` so that
binary scalars can be used in filter / projection expressions. The
previous attempt in #3235 had to be reverted because DataFusion's SQL
unparser does not support `Binary` / `LargeBinary` scalars, so any
expression containing such a literal would fail in both `to_sql()` and
`__repr__`.
## How
`expr_to_sql_string` now has two paths:
- **Fast path** (no binary literals): delegate to DataFusion's unparser
unchanged.
- **Slow path**: rewrite each `Binary(Some(bytes))` literal in the tree
to a unique string-literal placeholder, run the unparser, then
substitute `'<placeholder>'` with `X'<HEX>'` in the resulting SQL.
`Binary(None)` / `LargeBinary(None)` are rewritten to
`ScalarValue::Null` so the unparser emits plain `NULL`.
This keeps DataFusion as the single source of truth for operator and
function serialization, so binary literals work in every expression node
type the unparser already supports — including nested cases like
`contains(col("data"), lit(b"\xff"))`, `NOT (col == lit(b"..."))`, and
`col.cast(...) == lit(b"...")`.
## Changes
- `rust/lancedb/src/expr/sql.rs`: placeholder-substitution
implementation.
- `rust/lancedb/src/expr.rs`: 4 new unit tests covering binary literals
in equality, compound predicates, scalar function calls, negation, and
`NULL` binary literals.
- `python/src/expr.rs`: `expr_lit` accepts `PyBytes` and produces
`ScalarValue::Binary`.
- `python/Cargo.toml` + `Cargo.lock`: pull in `datafusion-common` for
`ScalarValue`.
- `python/python/lancedb/expr.py`: extend `ExprLike` and `lit()` type
annotations / docstrings with `bytes`.
- `python/python/lancedb/_lancedb.pyi`: update `expr_lit` stub.
- `python/tests/test_expr.py`: unit tests for `to_sql` / `repr` of
binary literals and an integration test against a real `pa.binary()`
column for equality / inequality / compound filters.
## Example
```python
from lancedb.expr import col, lit, func
# Equality against a binary column
col("payload") == lit(b"\xca\xfe")
# Expr((payload = X'CAFE'))
# Nested inside a function call (previously failed)
func("contains", col("data"), lit(b"\xff"))
# Expr(contains(data, X'FF'))
# repr() no longer crashes
repr(lit(b"\xde\xad\xbe\xef"))
# "Expr(X'DEADBEEF')"
```
## Verification
- [x] `cargo test -p lancedb --lib expr::` — 12/12 pass (was 9; +3 new
tests)
- [x] `cargo check --features remote --tests --examples` — clean
- [x] `cargo clippy --features remote --tests --examples` — no warnings
- [x] `cargo fmt --all -- --check` — clean
- [x] `pytest python/tests/test_expr.py` — 76/76 pass (was 74; +2 new
tests)
- [x] `ruff check python` / `ruff format --check python` — clean
## Follow-ups (not in this PR)
Issue #3261 also raises the possibility of a *truncated* `__repr__` for
very large binary literals. This PR keeps `__repr__` exact (it forwards
to `to_sql()`), since truncating display output would diverge from the
SQL that actually gets executed. A display-only truncation could be
added in a follow-up by giving `__repr__` its own renderer.
Made with [Cursor](https://cursor.com)
Co-authored-by: Cursor <cursoragent@cursor.com>
## Summary
`LanceNamespaceDatabase::open_table` and `create_table` were squashing
`NamespaceError::TableNotFound` and `TableAlreadyExists` into generic
`Error::Runtime`, so callers couldn't distinguish a missing-table or
duplicate-table error from any other internal failure. Downstream this
surfaced to geneva-style code as HTTP 500 / "internal server error" on
operations that should have been 400/404 — see
[ENT-1235](https://linear.app/lancedb/issue/ENT-1235/fix-ns-errors-for-create-tableopen-table).
This PR walks the boxed-error chain from `lance::Error::Namespace` down
to the inner `NamespaceError` and maps its `ErrorCode` onto the proper
`lancedb::Error` variant:
- `NamespaceError::TableNotFound` → `Error::TableNotFound { name, source
}`
- `NamespaceError::TableAlreadyExists` → `Error::TableAlreadyExists {
name }`
- everything else → `Error::Runtime` (unchanged behavior for the long
tail)
It also replaces the existing `e.to_string().contains("already exists")`
string match in `LanceNamespaceDatabase::create_table` with a downcast
on the `NamespaceError` code. That string-match happened to work for the
`dir` backend but isn't guaranteed to match the REST namespace backend's
error format; the downcast works for both.
The chain-walk is needed because `DatasetBuilder::from_namespace`
re-wraps the inner namespace error in a fresh `lance::Error::Namespace`,
so a single top-level downcast misses it.
## How this helps geneva
Geneva's workaround (linked in the parent issue) currently has to use
`except Exception:` with a `# todo: this is too broad` comment, plus
`str(e).lower().contains("already exists")` string matching, because the
namespace-impl path raised a generic `RuntimeError`. After this PR:
- `db.open_table("missing")` raises `ValueError("Table 'missing' was not
found")` (via the existing Python binding mapping of `TableNotFound` →
`PyValueError`) — geneva can catch `ValueError` cleanly.
- `db.create_table("dup")` raises `ValueError("Table 'dup' already
exists")` reliably across both `dir` and REST backends, so the existing
string match becomes deterministic.
In phalanx (the sophon REST server), `LanceDBError::TableNotFound` and
`LanceDBError::TableAlreadyExists` already map directly to HTTP 404 and
HTTP 400 respectively — see
[phalanx/src/error.rs:77-94](https://github.com/lancedb/sophon/blob/main/src/rust/phalanx/src/error.rs#L77).
No phalanx code change is needed for the bug fix; the previous 500 came
from phalanx's string-match fallback not finding `"namespace"` AND `"not
found"` in the `Runtime` error's debug-formatted message.
## Follow-up
[ENT-1246](https://linear.app/lancedb/issue/ENT-1246/remove-dead-namespace-error-string-matching-in-phalanx)
— after this lands and phalanx picks up the new lancedb, the
string-matching fallback for table errors in
`src/rust/phalanx/src/error.rs` (lines 99-168, 236-256, 502-514) and
`src/rust/phalanx/src/rest/table/create_table.rs` (lines 224-241)
becomes dead code and can be removed. The `// TODO: Refactor for better
namespace error handling` comment at phalanx/src/error.rs:96-98 is
exactly what this PR addresses on the lancedb side; ENT-1246 finishes
the cleanup on the sophon side.
## Test plan
- [x] `cargo test --quiet --features remote -p lancedb --lib` — all 495
lib tests pass, including 4 new tests in `database::namespace::tests`:
- `test_namespace_table_not_found` — extended to assert
`Error::TableNotFound` (was just `is_err()`)
- `test_namespace_open_table_not_found_at_root` — covers the
root-namespace path
- `test_namespace_create_table_already_exists` — covers child namespace
- `test_namespace_create_table_already_exists_at_root` — covers root
namespace
- [x] `cargo clippy --quiet --features remote --tests` — clean
- [x] `cargo fmt --all` — clean
- [x] Manually confirmed (via test failures before the fix) that the two
`open_table` tests were returning `Error::Runtime { message: "Failed to
get table info from namespace: Namespace { source: TableNotFound { ... }
}" }` prior to this change.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was
documented at https://docs.lancedb.com/indexing/vector-index but
`lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised
`ValueError: Unknown index type IVF_HNSW_FLAT` — the underlying
`pylance` already accepted it, only the LanceDB wrapper was missing the
wiring.
**Rust core (`rust/lancedb`):**
- Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the
`IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`).
- Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)`
helper, keeping symmetry with the other `IVF_HNSW_*` variants.
- Forward the variant in `RemoteTable::create_index` and add two
parametrised tests (default + customised config) for the JSON
serialisation.
- New `NativeTable` integration test
(`test_create_index_ivf_hnsw_flat`).
**Python binding (`python/`):**
- New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias.
- PyO3 `extract_index_params` recognises the `HnswFlat` config.
- `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync
`RemoteTable.create_index` both dispatch to the new config.
- `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover
the new type so `pyright`/`make check` stays clean.
- Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync
dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage.
- Existing `test_index_statistics_index_type_lists_all_supported_values`
updated to include `IVF_HNSW_FLAT`.
A matching Node.js / TypeScript binding is in a follow-up PR.
Closes#3331
## Test plan
- [ ] \`cargo check --quiet --features remote --tests --examples\`
- [ ] \`cargo test --quiet --features remote -p lancedb\` (covers the
new \`test_create_index_ivf_hnsw_flat\` and the two new parametrised
\`RemoteTable::create_index\` cases)
- [ ] \`cargo fmt --all\` / \`cargo clippy --quiet --features remote
--tests --examples\`
- [ ] \`cd python && make develop && make check && make test\` (covers
the two new async tests, the alias test, the dispatcher test, and the
updated \`test_index_statistics_index_type_lists_all_supported_values\`
assertion)
This wires Lance's existing `jieba/*` and `lindera/*` native FTS
tokenizers through the Python SDK instead of leaving them behind
disabled features and narrow public typing. It also documents the
`LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for
successful CJK indexing plus missing-model error guidance.
Closes#2168.
## Summary
`url::Url::query_pairs_mut()` leaves the URL with `query=Some("")` after
`.clear()` even when the input had no query string. The listing-database
connect path then captured that empty query into
`ListingDatabase::query_string`, and `table_uri()` blindly appended
`?<query>` to every per-table URI — producing URIs like
`s3://bucket/prefix/foo.lance?`.
The trailing `?` is benign for normal table operations, but it breaks
any caller that constructs a sub-path from the table URI. In particular,
MemWAL flushes write to `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`,
which `url::Url::parse` then re-parses as `path=<base table>` +
`query=/_mem_wal/...`. `Dataset::write` resolves the base table dataset,
finds it already exists, and fails with `Dataset already exists:
…_gen_1` on the very first MemTable flush (observed deterministically
against S3 across all merge_insert LSM modes; tracked in
[lance-format/lance#6713](https://github.com/lance-format/lance/pull/6715)).
## Fix
Treat `Some("")` query the same as no query when capturing
`query_string`. A real `?foo=bar` query is still propagated unchanged.
Adds a regression test covering both the empty-query and non-empty-query
paths.
## Verification
- `url::Url::parse("s3://bucket/prefix/").query()` → `None`, but after
`query_pairs_mut().clear()` → `Some("")`. Confirmed in a standalone
repro.
- Without this fix, every `table_uri()` for an `s3://`-style connection
ends with `?`, breaking MemWAL and any future sub-path consumer in the
same way.
- New unit test `test_table_uri_url_path_has_no_trailing_question_mark`
exercises both code paths.
Adds manifest_enabled for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
## Summary
- make `TlsConfig::default()` enable hostname verification by default
- align the Rust default with the documented Python and Node behavior
- update the Rust unit test to lock in the safe default
So far, I have been using a hacky approach that creates and opens
namespace-backed table, by getting its location and use a temporary
lancedb connection to create or open it. This was working for features
like credentials vending but is no longer fully working for the managed
versioning feature, recently geneva tests have been failing here and
there and various patches are not addressing the root cause. This PR
fully fixes this and implements proper rust binding for it.
Specifically:
- build a real Rust namespace-backed connection from the Python
namespace client
- route namespace table create/open through that connection instead of
resolved-location temp connections
- keep namespace client naming consistent in the Rust bridge and
preserve federated namespace + DuckDB behavior
## Summary
- delegate child-namespace `ListingDatabase` operations through an
eagerly initialized `LanceNamespaceDatabase`
- support nested namespace create/open/list/drop flows without requiring
callers to inject explicit locations
- add `namespace_client_properties` plumbing for local and namespace
connections so directory namespace settings like
`table_version_tracking_enabled` can be configured
- add regression tests for nested namespace ops and namespace client
property propagation
Bumps the Rust toolchain to 1.94.0 (latest installed) to unblock CI
failures caused by the AWS SDK's MSRV requirement. No lint fixes were
needed.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Add a `user_id` field to `ClientConfig` that allows users to identify
themselves to LanceDB Cloud/Enterprise
- The user_id is sent as the `x-lancedb-user-id` HTTP header in all
requests
- Supports three configuration methods:
- Direct assignment via `ClientConfig.user_id`
- Environment variable `LANCEDB_USER_ID`
- Indirect env var lookup via `LANCEDB_USER_ID_ENV_KEY`
Closes#3230🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>