### Summary
- Closes#3362
- Adds `prewarmData(columns?: string[])` to the Node bindings, mirroring
the Rust and Python implementations
### Testing
- [x] `npm run build` (regenerates the napi `.node` module + TS
declarations)
- [x] `npm run lint`
- [x] `npm test
- [ ] live test against remote table - just waiting for my dev stack to
get created
### Documentation
- updated docs
## Summary
Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was
documented at https://docs.lancedb.com/indexing/vector-index but
`lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised
`ValueError: Unknown index type IVF_HNSW_FLAT` — the underlying
`pylance` already accepted it, only the LanceDB wrapper was missing the
wiring.
**Rust core (`rust/lancedb`):**
- Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the
`IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`).
- Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)`
helper, keeping symmetry with the other `IVF_HNSW_*` variants.
- Forward the variant in `RemoteTable::create_index` and add two
parametrised tests (default + customised config) for the JSON
serialisation.
- New `NativeTable` integration test
(`test_create_index_ivf_hnsw_flat`).
**Python binding (`python/`):**
- New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias.
- PyO3 `extract_index_params` recognises the `HnswFlat` config.
- `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync
`RemoteTable.create_index` both dispatch to the new config.
- `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover
the new type so `pyright`/`make check` stays clean.
- Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync
dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage.
- Existing `test_index_statistics_index_type_lists_all_supported_values`
updated to include `IVF_HNSW_FLAT`.
A matching Node.js / TypeScript binding is in a follow-up PR.
Closes#3331
## Test plan
- [ ] \`cargo check --quiet --features remote --tests --examples\`
- [ ] \`cargo test --quiet --features remote -p lancedb\` (covers the
new \`test_create_index_ivf_hnsw_flat\` and the two new parametrised
\`RemoteTable::create_index\` cases)
- [ ] \`cargo fmt --all\` / \`cargo clippy --quiet --features remote
--tests --examples\`
- [ ] \`cd python && make develop && make check && make test\` (covers
the two new async tests, the alias test, the dispatcher test, and the
updated \`test_index_statistics_index_type_lists_all_supported_values\`
assertion)
This wires Lance's existing `jieba/*` and `lindera/*` native FTS
tokenizers through the Python SDK instead of leaving them behind
disabled features and narrow public typing. It also documents the
`LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for
successful CJK indexing plus missing-model error guidance.
Closes#2168.
## Summary
`url::Url::query_pairs_mut()` leaves the URL with `query=Some("")` after
`.clear()` even when the input had no query string. The listing-database
connect path then captured that empty query into
`ListingDatabase::query_string`, and `table_uri()` blindly appended
`?<query>` to every per-table URI — producing URIs like
`s3://bucket/prefix/foo.lance?`.
The trailing `?` is benign for normal table operations, but it breaks
any caller that constructs a sub-path from the table URI. In particular,
MemWAL flushes write to `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`,
which `url::Url::parse` then re-parses as `path=<base table>` +
`query=/_mem_wal/...`. `Dataset::write` resolves the base table dataset,
finds it already exists, and fails with `Dataset already exists:
…_gen_1` on the very first MemTable flush (observed deterministically
against S3 across all merge_insert LSM modes; tracked in
[lance-format/lance#6713](https://github.com/lance-format/lance/pull/6715)).
## Fix
Treat `Some("")` query the same as no query when capturing
`query_string`. A real `?foo=bar` query is still propagated unchanged.
Adds a regression test covering both the empty-query and non-empty-query
paths.
## Verification
- `url::Url::parse("s3://bucket/prefix/").query()` → `None`, but after
`query_pairs_mut().clear()` → `Some("")`. Confirmed in a standalone
repro.
- Without this fix, every `table_uri()` for an `s3://`-style connection
ends with `?`, breaking MemWAL and any future sub-path consumer in the
same way.
- New unit test `test_table_uri_url_path_has_no_trailing_question_mark`
exercises both code paths.
Fixes#3299
## Problem
Two security issues exist in `.github/workflows/java-publish.yml`:
1. **`gpg-passphrase` input is misused**: `actions/setup-java`'s
`gpg-passphrase` input expects the **name** of an environment variable
(default: `GPG_PASSPHRASE`), not the secret value itself. The previous
value `${{ secrets.GPG_PASSPHRASE }}` was setting the env var name to
the actual secret, which is incorrect.
2. **Passphrase visible on the command line**: `-Dgpg.passphrase=${{
secrets.GPG_PASSPHRASE }}` passes the GPG passphrase as a Maven system
property argument, making it visible in process listings and potentially
echoed in debug logs — a supply-chain security risk for release
workflows.
## Solution
- Fix `gpg-passphrase: MAVEN_GPG_PASSPHRASE` — use the correct env var
name so `actions/setup-java` generates a proper Maven `settings.xml`
entry that reads from `MAVEN_GPG_PASSPHRASE`.
- Remove `-Dgpg.passphrase=...` from the Maven CLI invocation.
- Add `MAVEN_GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}` to the
`env:` block of the Publish step, so the passphrase is available as an
environment variable rather than a CLI argument.
## Testing
The Java publish workflow only runs on tag pushes, so this cannot be
exercised in a PR build. The logic change is straightforward:
`actions/setup-java` is documented to write a `settings.xml` that reads
`<gpg.passphrase>` from the named env var, and `maven-gpg-plugin` picks
it up from there without any CLI argument.
Co-authored-by: octo-patch <octo-patch@github.com>
## Summary
PyTorch's `DataLoader` uses fork-based multiprocessing by default on
Linux, but threads do not survive `fork()`. LanceDB's Python bindings
drive async work through two threaded layers, both of which become inert
in a forked child:
- `BackgroundEventLoop` runs an asyncio loop on a Python
`threading.Thread`.
- `pyo3-async-runtimes::tokio` holds a global multi-threaded tokio
runtime whose worker threads also die on fork — and its runtime lives in
a `OnceLock` that cannot be replaced after first use.
As a result, any `Permutation` (or other async API) used inside a
fork-based `DataLoader` worker hangs indefinitely. This PR makes both
layers fork-safe so `Permutation` works as a `torch.utils.data.Dataset`
with `num_workers > 0`.
## Approach
### Rust — new `python/src/runtime.rs`
Mirrors the pattern used in [Lance's Python
bindings](456198cd6f/python/src/lib.rs (L139)),
adapted for the async-bridge use case.
- `LanceRuntime` implements `pyo3_async_runtimes::generic::Runtime +
ContextExt`, backed by an `AtomicPtr<tokio::runtime::Runtime>` we own
(sidestepping `pyo3-async-runtimes`'s frozen `OnceLock` global).
- A `pthread_atfork(after_in_child)` handler nulls the pointer; the next
`spawn` rebuilds the runtime in the child. The previous runtime is
intentionally **leaked** — calling `Drop` would try to join now-dead
worker threads and hang.
- `runtime::future_into_py` is a drop-in for
`pyo3_async_runtimes::tokio::future_into_py`. All ~80 call sites in
`arrow.rs` / `connection.rs` / `permutation.rs` / `query.rs` /
`table.rs` are updated to route through it.
- `python/Cargo.toml` adds `libc = "0.2"` and the tokio
`rt-multi-thread` feature.
### Python — `lancedb/background_loop.py`
- Refactors `BackgroundEventLoop.__init__` to a reusable `_start()`
method.
- An `os.register_at_fork(after_in_child=…)` hook calls `LOOP._start()`
to give the singleton a fresh asyncio loop and thread **in place**. This
matters because the rest of the codebase imports `LOOP` via `from
.background_loop import LOOP` — rebinding the module attribute would
leave those references holding the dead loop.
### Python — `lancedb/__init__.py`
Removes the `__warn_on_fork` pre-fork warning (and the now-unused
`import warnings`). Fork is supported.
## Test plan
- [x] New `test_permutation_dataloader_fork_workers` in
`python/tests/test_torch.py`: runs a `Permutation` through
`torch.utils.data.DataLoader(num_workers=2,
multiprocessing_context="fork")` inside a spawn-isolated child with a
30s hang detector. **Pre-fix**: timed out at 36s. **Post-fix**: passes
in ~3.6s.
- [x] New `test_remote_connection_after_fork` in
`python/tests/test_remote_db.py`: forks a child that creates a fresh
`lancedb.connect(...)` against a mock HTTP server and calls
`table_names()`; passes in <1s, validates the runtime reset is
sufficient for fresh remote clients.
- [x] All 62 tests in `test_torch.py` + `test_permutation.py` pass.
- [x] All 35 tests in `test_remote_db.py` pass.
- [x] `test_table.py` (87) + `test_db.py` + `test_query.py` (157, minus
one unrelated `sentence_transformers` import skip) — 244 passing.
- [x] `cargo clippy -p lancedb-python --tests` clean.
- [x] `cargo fmt`, `ruff check`, `ruff format` all clean.
## Known limitation (follow-up)
This PR makes a **freshly-built** `lancedb.connect(...)` work in a
forked child. An **inherited** `Connection` from the parent still
carries an inherited `reqwest::Client` whose hyper connection pool
references socket FDs and TCP/TLS state shared with the parent — using
it from the child after fork is unsafe (especially with HTTP/1.1
keep-alive). The recommended pattern for fork-based `DataLoader` workers
that hit a remote DB is to construct a new connection inside the worker.
Auto-clearing inherited HTTP client pools on fork would require tracking
live `Connection` instances in `lancedb` core and is left for a
follow-up PR.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
When pytorch is used with multiprocessing and the mp mode is spawn then
the Permutation needs to be pickled. It could not be pickled because
`Table` and `Connection` are not serializable. This PR adds pickle
support to Permutation without adding general pickle support to `Table`
or `Connection`. To add general support we probably need to start by
adding serialization in the namespace client.
In the meantime this PR enable pickling by adding special cases for:
* In-memory tables (just serialize as Arrow IPC)
* Native tables (serialize the URI)
If a user is not using one of the above cases (e.g. using a remote
connection) then they will need to provide a connection factory that can
be pickled.
## Breaking change
`PermutationBuilder.persist(...)` is removed from the Python bindings;
the permutation table is now always in-memory. The underlying Rust
`PermutationBuilder::persist` API is untouched and can be re-exposed
later if needed. It probably won't make sense to do that until we have a
way to serialize `Table` and `Connection`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `build:release` command already outputs the `*.node` files directly
to the `dist/` directory via the `--output-dir dist` flag.
Therefore, the `postbuild:release` script, which attempts to copy
`*.node` files from the `lancedb/` source directory, fails with a "no
such file or directory" error because the source files do not exist
there.
This commit removes the redundant `postbuild:release` script to resolve
the build failure.
fix#3284
Signed-off-by: qingfeng-occ <qing.feng@zte.com.cn>
Hi, the hybrid query error message looks like it can use a space, just
added it.
```python
def _validate_query(self, query, vector=None, text=None):
if query is not None and (vector is not None or text is not None):
raise ValueError(
"You can either provide a string query in search() method"
"or set `vector()` and `text()` explicitly for hybrid search."
"But not both."
)
```
Adds manifest_enabled for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
## Summary
- Update `rustls-webpki` 0.103.10 → 0.103.13 to fix RUSTSEC-2026-0104
(reachable panic in CRL parsing)
- Add advisory ignore for the legacy `rustls-webpki` 0.101.7 copy pinned
to the aws-smithy/rustls 0.21 chain (same chain already exempted for
RUSTSEC-2026-0098/0099)
Fixes the `deny` CI job failure seen in #3325.
## Test plan
- [x] `cargo deny check advisories` passes locally
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a `deny.toml` at the workspace root and a `deny` CI job that runs
`cargo deny check` on every PR. Catches yanked crates, license drift,
banned or wildcard dependencies, unapproved sources, and new RUSTSEC
advisories.
As part of wiring this up:
- Updated `aws-lc-rs` 1.13.0 → 1.16.3 / `aws-lc-sys` 0.28.0 → 0.40.0 to
clear four 2026 AWS-LC advisories (timing side-channel, PKCS7 bypass,
CRL scope). Removed the `=0.28.0` workaround pin; the original build
failure no longer reproduces.
- Updated `bytes`, `zlib-rs`, `rand`, `rustls-webpki`, `lz4_flex` to
clear their current advisories.
- Marked `lancedb-nodejs` and `lancedb-python` as `publish = false` and
pinned `lzma-sys` from `*` to `0.1` so `bans.wildcards = "deny"` can
be enforced.
10 remaining advisories have no safe upgrade available (transitive via
opendal, lance, datafusion, async-openai, aws-sdk on the legacy rustls
0.21 chain). Each is ignored in `deny.toml` with a per-entry rationale
and a link to the RUSTSEC advisory. New advisories still fail CI.
Fixes#3297
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
- Replaces `LANCEDB_PYPI_API_TOKEN` (long-lived token) with OIDC trusted
publishing via `pypa/gh-action-pypi-publish`
- Adds `id-token: write` permission to linux/mac/windows jobs
- Removes `twine`-based upload and the `pypi_token` input from
`upload_wheel` composite action
- Enables PEP 740 Sigstore attestations on published wheels as a bonus
After merging, rotate/revoke the `LANCEDB_PYPI_API_TOKEN` secret.
Closes#3294🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `.github/dependabot.yml` enabling weekly cargo update PRs for the
root workspace, which produces the Rust binaries we ship: the Node.js
and Python native extensions. The `rust/lancedb` library crate shares
the same lockfile — its consumers pick versions themselves, but bumping
transitive deps here keeps the shipped binaries current.
Also removes the misleading `exclude = ["python"]` line from the root
`Cargo.toml`: `python` is listed in `members`, and `cargo metadata`
confirms it's a workspace member, so the exclude was dead code that
implied the opposite.
Minor/patch updates are grouped to reduce PR noise.
Part of #3292. Only covers the cargo ecosystem; pip, npm, and
github-actions can follow.
## Summary
- make `TlsConfig::default()` enable hostname verification by default
- align the Rust default with the documented Python and Node behavior
- update the Rust unit test to lock in the safe default
This follows the Rust-side Tantivy removal by deleting the remaining
Python Tantivy runtime, tests, and packaging references.
It also turns the legacy Python-only Tantivy parameters into explicit
errors and stops reading legacy `_indices/fts` directories so Python FTS
is fully native-only.
Adds `permissions: contents: read` to the 10 workflows that had no
top-level permissions block. Workflows that already declared
permissions, or individual jobs that need elevated permissions (`issues:
write`, `pull-requests: write`, `contents: write`), are left unchanged.
Affected workflows: `dev.yml`, `java-publish.yml`, `java.yml`,
`license-header-check.yml`, `nodejs.yml`, `pypi-publish.yml`,
`python.yml`, `rust.yml`, `update_package_lock_run.yml`,
`update_package_lock_run_nodejs.yml`
Fixes#3269.
## What I observed
Using a reranker in a hybrid query could keep the Node.js process alive
even after `table.close()` and `db.close()`.
## Root cause
The reranker callback bridge used a `ThreadsafeFunction` in referenced
mode, which can keep the event loop alive longer than intended.
## Minimal fix
- In `nodejs/src/rerankers.rs`, create the reranker callback TSFN in
weak mode (`.weak::<true>()`).
- Add a regression test in `nodejs/__test__/rerankers.test.ts` that
spawns a child process, runs a rerank query, and asserts the process
exits naturally.
## Validation
- Built Node bindings successfully.
- Ran targeted tests: `rerankers.test.ts` passes (including new
regression test).
- Pre-commit checks for changed files were run and clean.
So far, I have been using a hacky approach that creates and opens
namespace-backed table, by getting its location and use a temporary
lancedb connection to create or open it. This was working for features
like credentials vending but is no longer fully working for the managed
versioning feature, recently geneva tests have been failing here and
there and various patches are not addressing the root cause. This PR
fully fixes this and implements proper rust binding for it.
Specifically:
- build a real Rust namespace-backed connection from the Python
namespace client
- route namespace table create/open through that connection instead of
resolved-location temp connections
- keep namespace client naming consistent in the Rust bridge and
preserve federated namespace + DuckDB behavior
## Summary
- pass `namespace_client` through the Python create-table path
- ensure schema-only namespace table creation uses the namespace-aware
empty-table flow
- fix reopening namespace tables created without initial data
## Summary
- delegate child-namespace `ListingDatabase` operations through an
eagerly initialized `LanceNamespaceDatabase`
- support nested namespace create/open/list/drop flows without requiring
callers to inject explicit locations
- add `namespace_client_properties` plumbing for local and namespace
connections so directory namespace settings like
`table_version_tracking_enabled` can be configured
- add regression tests for nested namespace ops and namespace client
property propagation
## Summary
Add connection serialization and child namespace support to
`LanceDBConnection`.
- `DBConnection.serialize()` / `lancedb.deserialize()` for connection
reconstruction in remote workers
- Cache `namespace_client()` in `LanceDBConnection` to avoid repeated
DirectoryNamespace builds
- `LanceDBConnection` transparently delegates child namespace operations
(open_table, create_table, list_tables, drop_table, create_namespace,
etc.) to `LanceNamespaceDBConnection` via `_namespace_conn()`
- Root namespace operations still go through the original Rust path
- Generic worker property override mechanism: any
`namespace_client_properties` key prefixed with `_lancedb_worker_` has
the prefix stripped and overrides the corresponding property when
`deserialize(data, for_worker=True)`
- `LanceNamespaceDBConnection` stores
`namespace_client_impl`/`namespace_client_properties` for serialization
roundtrip
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>