## Summary
When an `LsmWriteSpec` is installed on a table (#3396), `merge_insert`
upsert
calls are dispatched through Lance's MemWAL `ShardWriter` (LSM-style
append)
instead of the standard merge path.
- **`use_lsm_write`** — a `merge_insert` builder option, default `true`;
set it
`false` to use the standard path for a call even when a spec is set.
- **`assume_pre_sharded`** — a `merge_insert` builder option, default
`false`;
skips the per-row shard check and routes by the first row only.
- **`close_lsm_writers`** — drains and closes the table's cached MemWAL
shard
writers.
- The `merge_insert` **`on`** columns default to, and are validated
against,
the table's unenforced primary key.
- Shard writers are cached alongside the dataset (in
`DatasetConsistencyWrapper`) and reused for the session.
- `MergeResult` gains **`num_rows`** — on the LSM path the insert/update
breakdown is unknown until compaction, so only the total is reported.
Routing covers all three sharding strategies — bucket (murmur3,
Iceberg-compatible), identity, and unsharded. Each `merge_insert` call
targets
a single shard; the whole input is collected and validated before a
single
atomic `ShardWriter::put`, so a validation failure leaves the MemWAL
untouched.
Bindings: Python (`merge_insert(...).use_lsm_write(...)` /
`.assume_pre_sharded(...)`, `Table.close_lsm_writers`) and TypeScript
(`mergeInsert(...).useLsmWrite(...)` / `.assumePreSharded(...)`,
`Table.closeLsmWriters`).
## Context
Reconstructed from the original #3354 branch onto current `main`: the
branch
predated the #3394 (unenforced primary key) / #3396 (`LsmWriteSpec`)
split and
has been rebuilt on that merged foundation. Depends on Lance
`v7.0.0-beta.13`.
The MemWAL read path (reading un-flushed shard data back into queries)
and
remote (LanceDB Cloud) LSM support are follow-ups.
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
The latest main Python workflow fails across multiple matrix jobs
because `test_remote_create_index_new_api` opens a remote table whose
mocked schema only exposes `id`, while the new `create_index(...,
config=...)` path validates the requested indexed columns.
This updates the remote-table fixture to include the indexed columns
used by the smoke test and checks the emitted column payloads, keeping
the test aligned with the schema-aware API path.
## Summary
- Transitions `LanceTable` and `RemoteTable` to use the unified
`create_index()` API matching `AsyncTable`
- Deprecates `create_scalar_index()` and `create_fts_index()` with
deprecation warnings
- Adds detection logic to distinguish legacy vs new API calls
- Adds `@overload` decorators for type checker compatibility
- Adds `accelerator` parameter to IVF config classes for GPU support
**New API:**
```python
table.create_index("vec", config=IvfPq(distance_type="l2"))
table.create_index("col", config=BTree())
table.create_index("text_col", config=FTS(with_position=True))
```
**Legacy API (deprecated):**
```python
table.create_index("l2", vector_column_name="vec") # emits DeprecationWarning
table.create_scalar_index("col", index_type="BTREE") # deprecated
table.create_fts_index("text_col") # deprecated
```
Fixes#2879🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
- Bump lance dependency from `v7.0.0-beta.13` to `v7.0.0-rc.1`
- Remove PK constraint from `LsmWriteSpec::Bucket` docs and
`Table::set_lsm_write_spec` docs
- Remove test assertions that expected rejection when no PK is set or
when bucket column != PK
Closes https://github.com/lance-format/lance/issues/6917
## Problem
`LinearCombinationReranker.merge_results` has two related bugs that make
it return **inverted relevance rankings** — the least relevant document
ranks first (closes#3154).
### Bug 1 — `_combine_score` subtracts from 1, inverting the final
ranking
```python
def _combine_score(self, vector_score, fts_score):
return 1 - (self.weight * vector_score + (1 - self.weight) * fts_score)
```
Both `vector_score` (already converted via `_invert_score`) and
`fts_score` (BM25 relevance) are in **higher-is-better** space. Wrapping
the weighted average in `1 - (...)` flips the direction: a perfectly
matching document (`vector_score=1, fts_score=1`) gets `_relevance_score
= 0.0`, while a non-matching document gets a high score.
### Bug 2 — Documents missing an FTS score are rewarded, not penalised
```python
fts_score = result.get("_score", fill) # fill=1.0 by default
```
When a document has no FTS match, `fts_score = fill = 1.0`. In
`_combine_score` (with the bug-1 formula), this large value becomes a
**negative penalty** via `1 - (... + 0.3 * 1.0)`, counterintuitively
*boosting* the document's score. By contrast, missing vector results
correctly receive `_invert_score(fill) = 0.0` (penalised).
## Fix
**Bug 1** — remove the `1 -` inversion from `_combine_score`:
```python
def _combine_score(self, vector_score, fts_score):
return self.weight * vector_score + (1 - self.weight) * fts_score
```
**Bug 2** — use `1 - fill` for missing FTS scores so both penalties are
symmetric (mirror of what `_invert_score(fill)` already does for missing
vector scores):
```python
fts_score = result.get("_score", 1 - fill) # was: fill
```
With `fill=1.0` (default): `1 - 1.0 = 0.0` — missing-FTS entries
contribute `0` to the FTS term, identical to how missing-vector entries
contribute `0` to the vector term.
## Verification
Concrete example from the issue. With `weight=0.7`, `fill=1.0`:
| Document | `_distance` | `_score` | Old `_relevance_score` | New
`_relevance_score` |
|----------|-------------|----------|------------------------|------------------------|
| `apple orange` | 0.0 (best) | 2.41 (only FTS) | 0.30 (**wrong: ranked
2nd**) | 1.42 (**correct: ranked 1st**) |
| `banana grape` | 0.9999 (worst) | — | 0.70 (**wrong: ranked 1st**) |
0.00 (**correct: ranked last**) |
## Tests
Two regression tests added to `python/python/tests/test_rerankers.py`:
- `test_linear_combination_best_match_ranks_first` — the document with
the smallest distance **and** an FTS match must have the highest
`_relevance_score`.
- `test_linear_combination_missing_fts_is_penalised` — a document with
any FTS score must beat an otherwise-equal document with no FTS match.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Fixes#3407.
Remote tables now resolve create-index field paths against the table
schema before sending requests, so nested, escaped, and case-insensitive
inputs use the same canonical path contract as local tables. Remote
`list_indices()` also canonicalizes returned columns against the current
schema, and the remote query tests lock explicit nested vector and FTS
request payloads.
## Summary
Fixes#3424.
`LanceTakeQueryBuilder.to_batches()` raised `AttributeError:
'AsyncTakeQuery' object has no attribute 'execute'`. The inherited
`BaseQueryBuilder.to_batches` called `self._inner.execute(...)`, but
`self._inner` is an `AsyncQueryBase` (Python wrapper) — only its native
inner exposes `execute`. Every other sync builder overrides
`to_batches`, so the bug only surfaced on take-query builders, which
inherit the base unchanged. `take_offsets(...).to_batches()` is broken
for the same reason.
Route the sync wrapper through the async `to_batches` on the background
event loop, so the native `execute` is invoked from inside an awaiting
context (matching how the async path works correctly).
## Repro
```python
import lancedb, pyarrow as pa, tempfile
db = lancedb.connect(tempfile.mkdtemp())
tbl = db.create_table("t", data=pa.table({"a": list(range(100))}))
tbl.take_row_ids([0, 1, 2]).to_arrow() # works
tbl.search().to_batches() # works
list(tbl.take_row_ids([0, 1, 2]).to_batches()) # AttributeError (before)
```
## Test plan
- [x] New regression test `test_take_queries_to_batches` covers
`take_offsets(...).to_batches()`, `take_row_ids(...).to_batches()`, and
the `select(...)` projection — all fail on `main` with the patch
reverted, all pass with the fix.
- [x] `test_take_queries`, `test_query_builder_batches`, and
`test_query_schema` still pass.
- [x] `ruff format --check` and `ruff check` clean on changed files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LanceDB default vector column discovery only considered top-level
fields, so tables with a single nested vector leaf still required users
to pass an explicit field path. This updates Rust and Python discovery
to recurse into struct fields, return canonical field paths, and
preserve actionable errors when no default or multiple defaults exist.
The explicit nested path flow for index creation and search remains
supported across Rust, Python, and Node, with regression coverage for
single nested vector leaves, multiple candidate leaves, and schemas
without vector leaves.
Closes#3405.
There have been a couple of reports of this function freezing debuggers
because it triggers a network round-trip but is assumed to be extremely
light-weight: https://github.com/lancedb/lancedb/discussions/2853. We'll
just cache the last version we see.
I considered digging into see if we could assume or get the version at
create time or after other operations, but that could be a bit of a
rabbit hole as I'm a bit unfamiliar with this. Claude was having a hard
time of it too 😅 I propose we see how the currently implementation goes
and improve it if people find "unknown" or stale values coming up
disruptively often before improving this further.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
- Fix `name in db` and `len(db)` for local Python connections with more
than one page of tables.
- Use `list_tables()` pagination instead of deprecated `table_names()`
with its default 10-item page.
- Add regression coverage with 20 tables so later pages are included.
Fixes#2727.
## Validation
- `python3 -m py_compile python/python/lancedb/db.py
python/python/tests/test_db.py`
- No-build Python harness that extracts and executes the edited
`LanceDBConnection` pagination methods: passed
- `uvx ruff check python/python/lancedb/db.py
python/python/tests/test_db.py`
- `uvx ruff format --check python/python/lancedb/db.py
python/python/tests/test_db.py`
Note: `uv run pytest
python/tests/test_db.py::test_db_contains_and_len_include_all_table_name_pages
-q` was attempted first, but it stayed in the broad Rust/PyO3 native
extension build and was stopped before pytest started.
Adds regression coverage for Python FTS APIs targeting nested text
leaves, including sync and async match, phrase, and hybrid query paths.
This also locks in the intended error boundary: nested text leaf paths
are valid, while struct containers, non-text leaves, and missing paths
remain rejected.
Fixes#3404.
Index metadata APIs now resolve stored field ids back to Lance canonical
field paths instead of leaf names, so nested indexes such as
`metadata.user_id` and escaped literal-dot fields round-trip through
`list_indices()`. Native index creation also canonicalizes the input
path before handing it to Lance, keeping local metadata consistent with
the field-path contract while remote responses continue to expose
server-provided canonical columns.
Fixes#3403.
Native index creation was resolving requested columns through top-level
Arrow schema lookup before handing the request to Lance, which rejected
nested paths and could collapse a nested field to its leaf name. This PR
resolves index targets with Lance field-path semantics, passes the
canonical path through to Lance, and reports indexed columns from field
ids as canonical full paths.
This also removes the Python native FTS guard that rejected dotted paths
so scalar, vector, and FTS index creation share the same nested-field
contract. Related to #3402.
## Feature
This PR aligns LanceDB Python `to_pandas()` APIs with Lance pandas
conversion capabilities while keeping LanceDB query-specific semantics
intact.
- Adds `blob_mode` and pandas `**kwargs` support to local table
`to_pandas()`.
- Delegates local `LanceTable.to_pandas()` to Lance dataset
`to_pandas(blob_mode=..., **kwargs)`.
- Keeps remote table `to_pandas()` unsupported with
`NotImplementedError`.
- Allows sync and async query `to_pandas()` to forward pandas kwargs
after LanceDB `flatten` and `timeout` handling.
Why we need this feature:
Users can access Lance blob-aware pandas conversion from LanceDB local
tables and can pass PyArrow pandas conversion options through
table/query APIs without losing existing `flatten` or `timeout`
behavior.
How it works:
The table API exposes a `BlobMode` literal type for `lazy`, `bytes`, and
`descriptions`. Local tables call through to the backing Lance dataset.
Query APIs do not add `blob_mode`; they materialize Arrow results, apply
LanceDB flattening when requested, and then call `to_pandas(**kwargs)`.
## Validation
- `uv run --frozen --extra tests pytest
python/tests/test_table.py::test_table_to_pandas_default_matches_arrow
python/tests/test_table.py::test_table_to_pandas_blob_bytes
python/tests/test_table.py::test_table_to_pandas_kwargs
python/tests/test_query.py::test_query_to_pandas_kwargs
python/tests/test_query.py::test_query_timeout
python/tests/test_remote_db.py::test_table_to_pandas_not_supported`
- `uv run --frozen --extra dev ruff check python/lancedb/table.py
python/lancedb/query.py python/lancedb/remote/table.py
python/tests/test_table.py python/tests/test_query.py
python/tests/test_remote_db.py`
- `uv run --frozen --extra tests pytest python/tests/test_table.py
python/tests/test_query.py python/tests/test_remote_db.py`
Note: `python/uv.lock` was intentionally not committed in this branch.
Closes#3243.
This PR exposes a new public api `Permutation.take_offsets(offsets:
list[int])`, since users initially had to call __getitems__ directly to
batch-fetch rows by position.
Currently, the name matches the existing `Table.take_offsets` pattern,
and now the dunder `__getitem__` and `__getitems__` now delegate to it.
Also, fixes a parse error when `PermutationReader::take_offsets` gets an
empty list. Now returns an empty `RecordBatch` with the correct schema
instead. Bundled this because without the fix the new public API blows
up on a perfectly reasonable input.
`__getitems__` is preserved since PyTorch's batched DataLoader requires
it.
### Testing
- Added 3 new Rust tests for empty offsets including permutation table
with Select::All, Select::Columns, and identity path
- Added 3 new Python tests for the public API including a happy case,
and empty input on both identity and permutation
clippy, format, check all clean!
cc: @westonpace
## Summary
Split out from #3354
Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` /
`unset_lsm_write_spec` to
install and clear the spec that selects Lance's MemWAL LSM-style write
path for
`merge_insert`.
`LsmWriteSpec` offers three sharding strategies, all built on Lance's
`InitializeMemWalBuilder`:
- `LsmWriteSpec::bucket(column, num_buckets)` — hash-bucket sharding by
the
single-column unenforced primary key.
- `LsmWriteSpec::identity(column)` — identity sharding by the raw value
of a
scalar column.
- `LsmWriteSpec::unsharded()` — a single MemWAL shard.
Each can be refined with `with_maintained_indexes(...)` (indexes the
MemWAL
keeps up to date as rows are appended) and
`with_writer_config_defaults(...)`
(default `ShardWriter` configuration recorded in the MemWAL index, so
every
writer starts from the same defaults). All variants require the table to
have
an unenforced primary key.
- `set_lsm_write_spec` installs the spec by initializing the MemWAL
index;
`unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting
to
the standard `merge_insert` path. `unset` is idempotent.
- Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`,
`set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript
(`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` /
`"unsharded"`). `RemoteTable` returns `NotSupported`.
The actual `merge_insert` LSM dispatch and `ShardWriter` write path are
a
follow-up — this PR only installs and clears the spec.
## Summary
Adds `Table::set_unenforced_primary_key` — records a single column as
the
table's unenforced primary key in Lance schema field metadata.
"Unenforced"
means LanceDB does not check uniqueness on write; the key is metadata
that
`merge_insert` consumes.
- Single-column only; the column must exist and have a supported dtype
(Int32, Int64, Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary).
The
API accepts an iterable for binding ergonomics but requires exactly one
column — compound keys are rejected.
- The primary key is immutable: calling this on a table that already has
an
unenforced primary key is rejected. Concurrent writers racing to set the
key
fail at commit time rather than silently overriding it.
- `RemoteTable` returns `NotSupported`.
- Bindings: Python (`AsyncTable`, `LanceTable`, `RemoteTable`) and
TypeScript
(`Table.setUnenforcedPrimaryKey`).
## Context
Split out from #3354 per review feedback, so the unenforced primary key
and the
`merge_insert` sharding spec land as separate reviewable PRs.
No Lance dependency bump — `main` is already on v7.0.0-beta.10, which
includes
the field-metadata round-trip fix the API relies on. Enforcing
primary-key
immutability at the Lance commit layer (so the cross-column concurrent
race is
also rejected) is a companion Lance change: lance-format/lance#6810.
Follow-up to #3371 , which added runtime validation for namespace `mode`
and `behavior` parameters in the NodeJS SDK. Bringing the same fix to
Python for cross-SDK consistency.
**Before:** unrecognized values were silently dropped to `None`, so
`db.create_namespace(["x"], mode="foobar")` would quietly fall through
to the server's default mode and hide caller typos.
**After:** raises `ValueError` listing the valid values.
## Summary
Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was
documented at https://docs.lancedb.com/indexing/vector-index but
`lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised
`ValueError: Unknown index type IVF_HNSW_FLAT` — the underlying
`pylance` already accepted it, only the LanceDB wrapper was missing the
wiring.
**Rust core (`rust/lancedb`):**
- Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the
`IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`).
- Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)`
helper, keeping symmetry with the other `IVF_HNSW_*` variants.
- Forward the variant in `RemoteTable::create_index` and add two
parametrised tests (default + customised config) for the JSON
serialisation.
- New `NativeTable` integration test
(`test_create_index_ivf_hnsw_flat`).
**Python binding (`python/`):**
- New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias.
- PyO3 `extract_index_params` recognises the `HnswFlat` config.
- `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync
`RemoteTable.create_index` both dispatch to the new config.
- `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover
the new type so `pyright`/`make check` stays clean.
- Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync
dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage.
- Existing `test_index_statistics_index_type_lists_all_supported_values`
updated to include `IVF_HNSW_FLAT`.
A matching Node.js / TypeScript binding is in a follow-up PR.
Closes#3331
## Test plan
- [ ] \`cargo check --quiet --features remote --tests --examples\`
- [ ] \`cargo test --quiet --features remote -p lancedb\` (covers the
new \`test_create_index_ivf_hnsw_flat\` and the two new parametrised
\`RemoteTable::create_index\` cases)
- [ ] \`cargo fmt --all\` / \`cargo clippy --quiet --features remote
--tests --examples\`
- [ ] \`cd python && make develop && make check && make test\` (covers
the two new async tests, the alias test, the dispatcher test, and the
updated \`test_index_statistics_index_type_lists_all_supported_values\`
assertion)
This wires Lance's existing `jieba/*` and `lindera/*` native FTS
tokenizers through the Python SDK instead of leaving them behind
disabled features and narrow public typing. It also documents the
`LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for
successful CJK indexing plus missing-model error guidance.
Closes#2168.
## Summary
PyTorch's `DataLoader` uses fork-based multiprocessing by default on
Linux, but threads do not survive `fork()`. LanceDB's Python bindings
drive async work through two threaded layers, both of which become inert
in a forked child:
- `BackgroundEventLoop` runs an asyncio loop on a Python
`threading.Thread`.
- `pyo3-async-runtimes::tokio` holds a global multi-threaded tokio
runtime whose worker threads also die on fork — and its runtime lives in
a `OnceLock` that cannot be replaced after first use.
As a result, any `Permutation` (or other async API) used inside a
fork-based `DataLoader` worker hangs indefinitely. This PR makes both
layers fork-safe so `Permutation` works as a `torch.utils.data.Dataset`
with `num_workers > 0`.
## Approach
### Rust — new `python/src/runtime.rs`
Mirrors the pattern used in [Lance's Python
bindings](456198cd6f/python/src/lib.rs (L139)),
adapted for the async-bridge use case.
- `LanceRuntime` implements `pyo3_async_runtimes::generic::Runtime +
ContextExt`, backed by an `AtomicPtr<tokio::runtime::Runtime>` we own
(sidestepping `pyo3-async-runtimes`'s frozen `OnceLock` global).
- A `pthread_atfork(after_in_child)` handler nulls the pointer; the next
`spawn` rebuilds the runtime in the child. The previous runtime is
intentionally **leaked** — calling `Drop` would try to join now-dead
worker threads and hang.
- `runtime::future_into_py` is a drop-in for
`pyo3_async_runtimes::tokio::future_into_py`. All ~80 call sites in
`arrow.rs` / `connection.rs` / `permutation.rs` / `query.rs` /
`table.rs` are updated to route through it.
- `python/Cargo.toml` adds `libc = "0.2"` and the tokio
`rt-multi-thread` feature.
### Python — `lancedb/background_loop.py`
- Refactors `BackgroundEventLoop.__init__` to a reusable `_start()`
method.
- An `os.register_at_fork(after_in_child=…)` hook calls `LOOP._start()`
to give the singleton a fresh asyncio loop and thread **in place**. This
matters because the rest of the codebase imports `LOOP` via `from
.background_loop import LOOP` — rebinding the module attribute would
leave those references holding the dead loop.
### Python — `lancedb/__init__.py`
Removes the `__warn_on_fork` pre-fork warning (and the now-unused
`import warnings`). Fork is supported.
## Test plan
- [x] New `test_permutation_dataloader_fork_workers` in
`python/tests/test_torch.py`: runs a `Permutation` through
`torch.utils.data.DataLoader(num_workers=2,
multiprocessing_context="fork")` inside a spawn-isolated child with a
30s hang detector. **Pre-fix**: timed out at 36s. **Post-fix**: passes
in ~3.6s.
- [x] New `test_remote_connection_after_fork` in
`python/tests/test_remote_db.py`: forks a child that creates a fresh
`lancedb.connect(...)` against a mock HTTP server and calls
`table_names()`; passes in <1s, validates the runtime reset is
sufficient for fresh remote clients.
- [x] All 62 tests in `test_torch.py` + `test_permutation.py` pass.
- [x] All 35 tests in `test_remote_db.py` pass.
- [x] `test_table.py` (87) + `test_db.py` + `test_query.py` (157, minus
one unrelated `sentence_transformers` import skip) — 244 passing.
- [x] `cargo clippy -p lancedb-python --tests` clean.
- [x] `cargo fmt`, `ruff check`, `ruff format` all clean.
## Known limitation (follow-up)
This PR makes a **freshly-built** `lancedb.connect(...)` work in a
forked child. An **inherited** `Connection` from the parent still
carries an inherited `reqwest::Client` whose hyper connection pool
references socket FDs and TCP/TLS state shared with the parent — using
it from the child after fork is unsafe (especially with HTTP/1.1
keep-alive). The recommended pattern for fork-based `DataLoader` workers
that hit a remote DB is to construct a new connection inside the worker.
Auto-clearing inherited HTTP client pools on fork would require tracking
live `Connection` instances in `lancedb` core and is left for a
follow-up PR.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
When pytorch is used with multiprocessing and the mp mode is spawn then
the Permutation needs to be pickled. It could not be pickled because
`Table` and `Connection` are not serializable. This PR adds pickle
support to Permutation without adding general pickle support to `Table`
or `Connection`. To add general support we probably need to start by
adding serialization in the namespace client.
In the meantime this PR enable pickling by adding special cases for:
* In-memory tables (just serialize as Arrow IPC)
* Native tables (serialize the URI)
If a user is not using one of the above cases (e.g. using a remote
connection) then they will need to provide a connection factory that can
be pickled.
## Breaking change
`PermutationBuilder.persist(...)` is removed from the Python bindings;
the permutation table is now always in-memory. The underlying Rust
`PermutationBuilder::persist` API is untouched and can be re-exposed
later if needed. It probably won't make sense to do that until we have a
way to serialize `Table` and `Connection`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This follows the Rust-side Tantivy removal by deleting the remaining
Python Tantivy runtime, tests, and packaging references.
It also turns the legacy Python-only Tantivy parameters into explicit
errors and stops reading legacy `_indices/fts` directories so Python FTS
is fully native-only.
## Summary
- pass `namespace_client` through the Python create-table path
- ensure schema-only namespace table creation uses the namespace-aware
empty-table flow
- fix reopening namespace tables created without initial data
## Summary
- delegate child-namespace `ListingDatabase` operations through an
eagerly initialized `LanceNamespaceDatabase`
- support nested namespace create/open/list/drop flows without requiring
callers to inject explicit locations
- add `namespace_client_properties` plumbing for local and namespace
connections so directory namespace settings like
`table_version_tracking_enabled` can be configured
- add regression tests for nested namespace ops and namespace client
property propagation
## Summary
Add connection serialization and child namespace support to
`LanceDBConnection`.
- `DBConnection.serialize()` / `lancedb.deserialize()` for connection
reconstruction in remote workers
- Cache `namespace_client()` in `LanceDBConnection` to avoid repeated
DirectoryNamespace builds
- `LanceDBConnection` transparently delegates child namespace operations
(open_table, create_table, list_tables, drop_table, create_namespace,
etc.) to `LanceNamespaceDBConnection` via `_namespace_conn()`
- Root namespace operations still go through the original Rust path
- Generic worker property override mechanism: any
`namespace_client_properties` key prefixed with `_lancedb_worker_` has
the prefix stripped and overrides the corresponding property when
`deserialize(data, for_worker=True)`
- `LanceNamespaceDBConnection` stores
`namespace_client_impl`/`namespace_client_properties` for serialization
roundtrip
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`.get(b"split_names", None).decode()` was called unconditionally in both
Permutations.__init__ and Permutation.from_tables(), crashing with
AttributeError when schema metadata existed but lacked the split_names
key. Guard the decode behind a None check and add regression tests.
## Problem
`on_bad_vectors="drop"` is supposed to remove invalid vector rows before
write, but for some schema-defined vector columns it can still fail
later during Arrow cast instead of dropping the bad row.
Repro:
```python
class MySchema(LanceModel):
text: str
embedding: Vector(16)
table = db.create_table("test", schema=MySchema)
table.add(
[
{"text": "hello", "embedding": []},
{"text": "bar", "embedding": [0.1] * 16},
],
on_bad_vectors="drop",
)
```
Before:
```
RuntimeError
Arrow error: C Data interface error: Invalid: ListType can only be casted to FixedSizeListType if the lists are all the expected size.
```
After:
```
rows 1
texts ['bar']
```
## Solution
Make bad-vector sanitization use schema dimensions before cast, while
keeping the handling scoped to vector columns identified by schema
metadata or existing vector-name heuristics.
This also preserves existing integer vector inputs and avoids applying
on_bad_vectors to unrelated fixed-size float columns.
Fixes#1670
Signed-off-by: yaommen <myanstu@163.com>
## Summary
Fixes#1846.
Python `Enum` fields raised `TypeError: Converting Pydantic type to
Arrow Type: unsupported type <enum 'SomethingTypes'>` when converting a
Pydantic model to an Arrow schema.
The fix adds Enum detection in `_pydantic_type_to_arrow_type`. When an
Enum subclass is encountered, the value type of its members is inspected
and mapped to the appropriate Arrow type:
- `str`-valued enums (e.g. `class Status(str, Enum)`) → `pa.utf8()`
- `int`-valued enums (e.g. `class Priority(int, Enum)`) → `pa.int64()`
- Other homogeneous value types → the Arrow type for that Python type
- Mixed-value or empty enums → `pa.utf8()` (safe fallback)
This covers the common `(str, Enum)` and `(int, Enum)` mixin patterns
used in practice.
## Changes
- `python/python/lancedb/pydantic.py`: add Enum branch in
`_pydantic_type_to_arrow_type`
- `python/python/tests/test_pydantic.py`: add `test_enum_types` covering
`str`, `int`, and `Optional` Enum fields
## Note on #2395
PR #2395 handles `StrEnum` (Python 3.11+) specifically, using a
dictionary-encoded type. This PR handles the broader `(str, Enum)` /
`(int, Enum)` mixin pattern that works across all Python versions and
stores values as their natural Arrow type.
AI assistance was used in developing this fix.
1. Refactored every client (Rust core, Python, Node/TypeScript) so
“namespace” usage is explicit: code now keeps namespace paths
(namespace_path) separate from namespace clients (namespace_client).
Connections propagate the client, table creation routes through it, and
managed versioning defaults are resolved from namespace metadata. Python
gained LanceNamespaceDBConnection/async counterparts, and the
namespace-focused tests were rewritten to match the clarified API
surface.
2. Synchronized the workspace with Lance 5.0.0-beta.3 (see
https://github.com/lance-format/lance/pull/6186 for the upstream
namespace refactor), updating Cargo/uv lockfiles and ensuring all
bindings align with the new namespace semantics.
3. Added a namespace-backed code path to lancedb.connect() via new
keyword arguments (namespace_client_impl, namespace_client_properties,
plus the existing pushdown-ops flag). When those kwargs are supplied,
connect() delegates to connect_namespace, so users can opt into
namespace clients without changing APIs. (The async helper will gain
parity in a later change)
The test added in #3190 unconditionally imports `PIL`, which is an
optional dependency. This causes CI failures in environments where
Pillow isn't installed (`ModuleNotFoundError: No module named 'PIL'`).
Use `pytest.importorskip` to skip gracefully when Pillow is unavailable.
Fixes CI failure on main.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Namespace tests expected `RuntimeError` for table-not-found and
namespace-not-empty cases, but `lance_namespace` raises
`TableNotFoundError` and `NamespaceNotEmptyError` which inherit from
`Exception`, not `RuntimeError`.
- Updated `pytest.raises` to use the correct exception types.
## Test plan
- [x] CI passes on `test_namespace.py`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
url_retrieve() calls urllib.request.urlopen() but only urllib.error was
imported, causing AttributeError for any HTTP URL input. This affects
open-clip, siglip, and jinaai embedding functions when processing image
URLs.
The bug has existed since the embeddings API refactor (#580) but was
masked because most users pass local file paths or bytes rather than
HTTP URLs.
Fixes#3183
## Summary
When `table.add(mode='overwrite')` is called, PyArrow infers input data
types (e.g. `list<double>`) which differ from the original table schema
(e.g. `fixed_size_list<float32>`). Previously, overwrite mode bypassed
`cast_to_table_schema()` entirely, so the inferred types replaced the
original schema, breaking vector search.
This fix builds a merged target schema for overwrite: columns present in
the existing table schema keep their original types, while columns
unique to the input pass through as-is. This way
`cast_to_table_schema()` is applied unconditionally, preserving vector
column types without blocking schema evolution.
## Changes
- `rust/lancedb/src/table/add_data.rs`: For overwrite mode, construct a
target schema by matching input columns against the existing table
schema, then cast. Non-overwrite (append) path is unchanged.
- Added `test_add_overwrite_preserves_vector_type` test that creates a
table with `fixed_size_list<float32>`, overwrites with `list<double>`
input, and asserts the original type is preserved.
## Test Plan
- `cargo test --features remote -p lancedb -- test_add_overwrite` — all
4 overwrite tests pass
- Full suite: 454 passed, 2 failed (pre-existing `remote::retry` flakes
unrelated to this change)
---------
Signed-off-by: majiayu000 <1835304752@qq.com>
dict.update() mutates in place and returns None. Assigning its result
caused with_metadata(None) to strip all schema metadata when embedding
metadata was merged during create_table with embedding_functions.
## Summary
Adds progress reporting for `table.add()` so users can track large write
operations. The progress callback is available in Rust, Python (sync and
async), and through the PyO3 bindings.
### Usage
Pass `progress=True` to get an automatic tqdm bar:
```python
table.add(data, progress=True)
# 100%|██████████| 1000000/1000000 [00:12<00:00, 82345 rows/s, 45.2 MB/s | 4/4 workers]
```
Or pass a tqdm bar for more control:
```python
from tqdm import tqdm
with tqdm(unit=" rows") as pbar:
table.add(data, progress=pbar)
```
Or use a callback for custom progress handling:
```python
def on_progress(p):
print(f"{p['output_rows']}/{p['total_rows']} rows, "
f"{p['active_tasks']}/{p['total_tasks']} workers, "
f"done={p['done']}")
table.add(data, progress=on_progress)
```
In Rust:
```rust
table.add(data)
.progress(|p| println!("{}/{:?} rows", p.output_rows(), p.total_rows()))
.execute()
.await?;
```
### Details
- `WriteProgress` struct in Rust with getters for `elapsed`,
`output_rows`, `output_bytes`, `total_rows`, `active_tasks`,
`total_tasks`, and `done`. Fields are private behind getters so new
fields can be added without breaking changes.
- `WriteProgressTracker` tracks progress across parallel write tasks
using a mutex for row/byte counts and atomics for active task counts.
- Active task tracking uses an RAII guard pattern (`ActiveTaskGuard`)
that increments on creation and decrements on drop.
- For remote writes, `output_bytes` reflects IPC wire bytes rather than
in-memory Arrow size. For local writes it uses in-memory Arrow size as a
proxy (see TODO below).
- tqdm postfix displays throughput (MB/s) and worker utilization
(active/total).
- The `done` callback always fires, even on error (via `FinishOnDrop`),
so progress bars are always finalized.
### TODO
- Track actual bytes written to disk for local tables. This requires
Lance to expose a progress callback from its write path. See
lance-format/lance#6247.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Problem
The generated Python API docs for
`lancedb.table.IndexStatistics.index_type` were misleading because
mkdocstrings renders that field’s type annotation directly, and the
existing `Literal[...]` listed only a subset of the actual canonical SDK
index type strings.
Current (missing index types):
<img width="823" height="83" alt="image"
src="https://github.com/user-attachments/assets/f6f29fe3-4c16-4d00-a4e9-28a7cd6e19ec"
/>
## Fix
- Update the `IndexStatistics.index_type` annotation in
`python/python/lancedb/table.py` to include the full supported set of
canonical values, so the generated docs show all valid index_type
strings inline.
- Add a small regression test in `python/python/tests/test_index.py` to
ensure the docs-facing annotation does not drift silently again in case
we add a new index/quantization type in the future.
- Bumps mkdocs and material theme versions to mkdocs 1.6 to allow access
to more features like hooks
After fix (all index types are included and tested for in the
annotations):
<img width="1017" height="93" alt="image"
src="https://github.com/user-attachments/assets/66c74d5c-34b3-4b44-8173-3ee23e3648ac"
/>
When using hybrid search with a where filter, the prefilter argument is
silently inverted. Passing prefilter=True actually performs
post-filtering, and prefilter=False actually performs pre-filtering.
When we write data with `add()`, we can input data to the table's
schema. However, we were using "safe" mode, which propagates errors as
nulls. For example, if you pass `u64::max` into a field that is a `u32`,
it will just write null instead of giving overflow error. Now it
propagates the overflow. This is the same behavior as other systems like
DuckDB.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Add `@value_to_sql.register(dict)` handler that converts Python dicts
to DataFusion's `named_struct()` SQL syntax
- Enables updating struct-typed columns via `table.update(values={"col":
{"field_a": 1, "field_b": "hello"}})`
- Recursively handles nested structs, lists, nulls, and all existing
scalar types
Closes#1363
## Details
The `named_struct` function was introduced in DataFusion 38 and is now
available (LanceDB uses DataFusion 52.1). The implementation follows the
existing `singledispatch` pattern in `util.py`.
**Example conversion:**
```python
value_to_sql({"field_a": 1, "field_b": "hello"})
# => "named_struct('field_a', 1, 'field_b', 'hello')"
```
## Test plan
- [x] Unit tests for flat struct, nested struct, list inside struct,
mixed types, null values, and empty dict
- [ ] CI integration tests with actual table.update() on struct columns
🔗 [DataFusion named_struct
docs](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#named-struct)
We don't necessarily need to do this, but one user was confused having
used `fast_search=True` as a keyword argument for vector searches, but
being unable to do so for FTS, even after the most recent changes. I
think this is the only discrepancy in where that is possible.
This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is it allows streaming requests to remote tables,
and at the same time allowing retries for most inputs.
In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.
For Python, all are retry-able, except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.datasets.Dataset` are retry-able *and* streaming.
A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:
* casting input data to target schema
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.
In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>