The 0.29 release happened on a branch because the main line had already
moved past the 6.0.0 stable Lance release. As a result, the version bump
commits ended up on the branch. This merges those commits back into
main.
---------
Co-authored-by: Lance Release <lance-dev@lancedb.com>
## Summary
Split out from #3354
Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` / `unset_lsm_write_spec` to
install and clear the spec that selects Lance's MemWAL LSM-style write path for
`merge_insert`.

`LsmWriteSpec` offers three sharding strategies, all built on Lance's
`InitializeMemWalBuilder`:
- `LsmWriteSpec::bucket(column, num_buckets)`: hash-bucket sharding by the
  single-column unenforced primary key.
- `LsmWriteSpec::identity(column)`: identity sharding by the raw value of a
  scalar column.
- `LsmWriteSpec::unsharded()`: a single MemWAL shard.

Each can be refined with `with_maintained_indexes(...)` (indexes the MemWAL
keeps up to date as rows are appended) and `with_writer_config_defaults(...)`
(default `ShardWriter` configuration recorded in the MemWAL index, so every
writer starts from the same defaults). All variants require the table to have
an unenforced primary key.
- `set_lsm_write_spec` installs the spec by initializing the MemWAL index;
  `unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting to
  the standard `merge_insert` path. `unset` is idempotent.
- Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`,
  `set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript
  (`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` /
  `"unsharded"`). `RemoteTable` returns `NotSupported`.

The actual `merge_insert` LSM dispatch and `ShardWriter` write path are a
follow-up; this PR only installs and clears the spec.
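
As a rough illustration of how the three strategies differ, here is a toy sketch in plain Python. It does not call LanceDB or Lance; the function names are hypothetical, and SHA-256 is only a stand-in because the hash Lance actually uses for bucket sharding is not stated here.

```python
import hashlib


def bucket_shard(key: bytes, num_buckets: int) -> int:
    """Hash-bucket sharding: a stable hash of the key, modulo the bucket count.

    SHA-256 is an assumption for illustration, not Lance's real hash.
    """
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def identity_shard(value):
    """Identity sharding: the raw column value is the shard id."""
    return value


def unsharded_shard(_key) -> int:
    """Unsharded: every row lands in the single shard."""
    return 0
```

The point of the sketch is only that `bucket` spreads keys over a fixed number of shards, `identity` creates one shard per distinct value, and `unsharded` uses exactly one.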
## Summary
Adds `Table::set_unenforced_primary_key`, which records a single column as the
table's unenforced primary key in Lance schema field metadata. "Unenforced"
means LanceDB does not check uniqueness on write; the key is metadata that
`merge_insert` consumes.
- Single-column only; the column must exist and have a supported dtype
  (Int32, Int64, Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary). The
  API accepts an iterable for binding ergonomics but requires exactly one
  column; compound keys are rejected.
- The primary key is immutable: calling this on a table that already has an
  unenforced primary key is rejected. Concurrent writers racing to set the key
  fail at commit time rather than silently overriding it.
- `RemoteTable` returns `NotSupported`.
- Bindings: Python (`AsyncTable`, `LanceTable`, `RemoteTable`) and TypeScript
  (`Table.setUnenforcedPrimaryKey`).
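
The validation rules above can be modelled with a small stand-alone class. This is a hypothetical sketch, not the LanceDB implementation: `ToyTable`, the string dtype names, the error messages, and the order of the checks are all assumptions made for illustration.

```python
# dtypes listed as supported in the description above
SUPPORTED_DTYPES = {
    "Int32", "Int64", "Utf8", "LargeUtf8",
    "Binary", "LargeBinary", "FixedSizeBinary",
}


class ToyTable:
    """Hypothetical in-memory model of the primary-key rules."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> dtype name
        self.primary_key = None

    def set_unenforced_primary_key(self, columns):
        # Accept a bare string or an iterable of names (assumed ergonomics).
        columns = [columns] if isinstance(columns, str) else list(columns)
        if self.primary_key is not None:
            raise ValueError("primary key is already set and is immutable")
        if len(columns) != 1:
            raise ValueError("exactly one column is required")
        (column,) = columns
        if column not in self.schema:
            raise ValueError(f"column {column!r} does not exist")
        if self.schema[column] not in SUPPORTED_DTYPES:
            raise ValueError(f"unsupported dtype {self.schema[column]!r}")
        # No uniqueness check happens here: the key is metadata only.
        self.primary_key = column
```

Note that nothing in the sketch inspects row values, which is what "unenforced" means in practice.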
## Context
Split out from #3354 per review feedback, so the unenforced primary key and the
`merge_insert` sharding spec land as separate reviewable PRs.

No Lance dependency bump: `main` is already on v7.0.0-beta.10, which includes
the field-metadata round-trip fix the API relies on. Enforcing primary-key
immutability at the Lance commit layer (so the cross-column concurrent race is
also rejected) is a companion Lance change: lance-format/lance#6810.
### Summary
- Expose Connection.renameTable in the Node.js bindings and align it
with existing namespace-aware connection APIs.
### Changes
- Add napi-rs rename_table on Connection, delegating to Rust
Connection::rename_table.
- Add renameTable(oldName, newName, namespacePath?) on abstract
Connection and implement on LocalConnection.
- Add a connection test that renames a table and checks names / open
behavior.
#### Testing
- cd nodejs && npm run build
- cd nodejs && npm test __test__/connection.test.ts
Fixes #3364
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Closes #3261.
## Summary
Adds `bytes` to the accepted types of `lancedb.expr.lit()` so that
binary scalars can be used in filter / projection expressions. The
previous attempt in #3235 had to be reverted because DataFusion's SQL
unparser does not support `Binary` / `LargeBinary` scalars, so any
expression containing such a literal would fail in both `to_sql()` and
`__repr__`.
## How
`expr_to_sql_string` now has two paths:
- **Fast path** (no binary literals): delegate to DataFusion's unparser
unchanged.
- **Slow path**: rewrite each `Binary(Some(bytes))` literal in the tree
to a unique string-literal placeholder, run the unparser, then
substitute `'<placeholder>'` with `X'<HEX>'` in the resulting SQL.
`Binary(None)` / `LargeBinary(None)` are rewritten to
`ScalarValue::Null` so the unparser emits plain `NULL`.
This keeps DataFusion as the single source of truth for operator and
function serialization, so binary literals work in every expression node
type the unparser already supports — including nested cases like
`contains(col("data"), lit(b"\xff"))`, `NOT (col == lit(b"..."))`, and
`col.cast(...) == lit(b"...")`.
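
To make the two paths concrete, here is a self-contained Python sketch of the placeholder technique. It is not the Rust code in `sql.rs`: the tuple-based expression tree, `toy_unparse` (a stand-in for DataFusion's unparser that rejects `bytes`), and the placeholder format are all assumptions for illustration.

```python
import uuid


def toy_unparse(node):
    """Stand-in unparser: handles columns, literals, binary operators."""
    if node[0] == "col":
        return node[1]
    if node[0] == "lit":
        value = node[1]
        if value is None:
            return "NULL"
        if isinstance(value, bytes):
            raise TypeError("binary literals are not supported")
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)
    op, left, right = node
    return f"({toy_unparse(left)} {op} {toy_unparse(right)})"


def has_binary(node):
    if node[0] == "lit":
        return isinstance(node[1], bytes)
    if node[0] == "col":
        return False
    return has_binary(node[1]) or has_binary(node[2])


def expr_to_sql(node):
    if not has_binary(node):
        return toy_unparse(node)  # fast path: unparser output unchanged

    placeholders = {}

    def rewrite(n):
        if n[0] == "lit" and isinstance(n[1], bytes):
            token = f"__bin_{uuid.uuid4().hex}__"  # unique string placeholder
            placeholders[token] = n[1]
            return ("lit", token)
        if n[0] in ("lit", "col"):
            return n
        return (n[0], rewrite(n[1]), rewrite(n[2]))

    sql = toy_unparse(rewrite(node))  # slow path: unparse, then substitute
    for token, raw in placeholders.items():
        sql = sql.replace(f"'{token}'", f"X'{raw.hex().upper()}'")
    return sql
```

Because the substitution happens on the finished SQL string, the unparser never needs to know that binary literals exist.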
## Changes
- `rust/lancedb/src/expr/sql.rs`: placeholder-substitution
implementation.
- `rust/lancedb/src/expr.rs`: 4 new unit tests covering binary literals
in equality, compound predicates, scalar function calls, negation, and
`NULL` binary literals.
- `python/src/expr.rs`: `expr_lit` accepts `PyBytes` and produces
`ScalarValue::Binary`.
- `python/Cargo.toml` + `Cargo.lock`: pull in `datafusion-common` for
`ScalarValue`.
- `python/python/lancedb/expr.py`: extend `ExprLike` and `lit()` type
annotations / docstrings with `bytes`.
- `python/python/lancedb/_lancedb.pyi`: update `expr_lit` stub.
- `python/tests/test_expr.py`: unit tests for `to_sql` / `repr` of
binary literals and an integration test against a real `pa.binary()`
column for equality / inequality / compound filters.
## Example
```python
from lancedb.expr import col, lit, func
# Equality against a binary column
col("payload") == lit(b"\xca\xfe")
# Expr((payload = X'CAFE'))
# Nested inside a function call (previously failed)
func("contains", col("data"), lit(b"\xff"))
# Expr(contains(data, X'FF'))
# repr() no longer crashes
repr(lit(b"\xde\xad\xbe\xef"))
# "Expr(X'DEADBEEF')"
```
## Verification
- [x] `cargo test -p lancedb --lib expr::` — 12/12 pass (was 9; +3 new
tests)
- [x] `cargo check --features remote --tests --examples` — clean
- [x] `cargo clippy --features remote --tests --examples` — no warnings
- [x] `cargo fmt --all -- --check` — clean
- [x] `pytest python/tests/test_expr.py` — 76/76 pass (was 74; +2 new
tests)
- [x] `ruff check python` / `ruff format --check python` — clean
## Follow-ups (not in this PR)
Issue #3261 also raises the possibility of a *truncated* `__repr__` for
very large binary literals. This PR keeps `__repr__` exact (it forwards
to `to_sql()`), since truncating display output would diverge from the
SQL that actually gets executed. A display-only truncation could be
added in a follow-up by giving `__repr__` its own renderer.
Made with [Cursor](https://cursor.com)
Co-authored-by: Cursor <cursoragent@cursor.com>
## Summary
`LanceNamespaceDatabase::open_table` and `create_table` were squashing
`NamespaceError::TableNotFound` and `TableAlreadyExists` into generic
`Error::Runtime`, so callers couldn't distinguish a missing-table or
duplicate-table error from any other internal failure. Downstream this
surfaced to geneva-style code as HTTP 500 / "internal server error" on
operations that should have been 400/404 — see
[ENT-1235](https://linear.app/lancedb/issue/ENT-1235/fix-ns-errors-for-create-tableopen-table).
This PR walks the boxed-error chain from `lance::Error::Namespace` down
to the inner `NamespaceError` and maps its `ErrorCode` onto the proper
`lancedb::Error` variant:
- `NamespaceError::TableNotFound` → `Error::TableNotFound { name, source
}`
- `NamespaceError::TableAlreadyExists` → `Error::TableAlreadyExists {
name }`
- everything else → `Error::Runtime` (unchanged behavior for the long
tail)
It also replaces the existing `e.to_string().contains("already exists")`
string match in `LanceNamespaceDatabase::create_table` with a downcast
on the `NamespaceError` code. That string-match happened to work for the
`dir` backend but isn't guaranteed to match the REST namespace backend's
error format; the downcast works for both.
The chain-walk is needed because `DatasetBuilder::from_namespace`
re-wraps the inner namespace error in a fresh `lance::Error::Namespace`,
so a single top-level downcast misses it.
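
The chain walk can be sketched in Python using exception causes in place of Rust's boxed `source()` chain. All class and function names below are hypothetical stand-ins, not the `lancedb` types; the sketch only shows why looking at the top-level error alone is not enough.

```python
class NamespaceError(Exception):
    """Stand-in for the inner namespace error carrying an error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


class TableNotFoundError(ValueError):
    pass


class TableAlreadyExistsError(ValueError):
    pass


def find_namespace_error(err):
    """Walk the cause chain until a NamespaceError is found (or it ends)."""
    seen = set()
    while err is not None and id(err) not in seen:
        if isinstance(err, NamespaceError):
            return err
        seen.add(id(err))  # guard against cyclic chains
        err = err.__cause__
    return None


def map_error(err, table_name):
    inner = find_namespace_error(err)
    if inner is not None and inner.code == "TableNotFound":
        return TableNotFoundError(f"Table '{table_name}' was not found")
    if inner is not None and inner.code == "TableAlreadyExists":
        return TableAlreadyExistsError(f"Table '{table_name}' already exists")
    return RuntimeError(str(err))  # unchanged behavior for everything else


def wrap(message, cause):
    """Simulate a layer that re-wraps an inner error."""
    outer = RuntimeError(message)
    outer.__cause__ = cause
    return outer
```

With two layers of wrapping, as in `wrap("outer", wrap("inner wrapper", NamespaceError(...)))`, only the loop in `find_namespace_error` reaches the coded error.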
## How this helps geneva
Geneva's workaround (linked in the parent issue) currently has to use
`except Exception:` with a `# todo: this is too broad` comment, plus
`"already exists" in str(e).lower()` string matching, because the
namespace-impl path raised a generic `RuntimeError`. After this PR:
- `db.open_table("missing")` raises `ValueError("Table 'missing' was not
found")` (via the existing Python binding mapping of `TableNotFound` →
`PyValueError`) — geneva can catch `ValueError` cleanly.
- `db.create_table("dup")` raises `ValueError("Table 'dup' already
exists")` reliably across both `dir` and REST backends, so the existing
string match becomes deterministic.
In phalanx (the sophon REST server), `LanceDBError::TableNotFound` and
`LanceDBError::TableAlreadyExists` already map directly to HTTP 404 and
HTTP 400 respectively — see
[phalanx/src/error.rs:77-94](https://github.com/lancedb/sophon/blob/main/src/rust/phalanx/src/error.rs#L77).
No phalanx code change is needed for the bug fix; the previous 500 came
from phalanx's string-match fallback not finding `"namespace"` AND `"not
found"` in the `Runtime` error's debug-formatted message.
## Follow-up
[ENT-1246](https://linear.app/lancedb/issue/ENT-1246/remove-dead-namespace-error-string-matching-in-phalanx)
— after this lands and phalanx picks up the new lancedb, the
string-matching fallback for table errors in
`src/rust/phalanx/src/error.rs` (lines 99-168, 236-256, 502-514) and
`src/rust/phalanx/src/rest/table/create_table.rs` (lines 224-241)
becomes dead code and can be removed. The `// TODO: Refactor for better
namespace error handling` comment at phalanx/src/error.rs:96-98 is
exactly what this PR addresses on the lancedb side; ENT-1246 finishes
the cleanup on the sophon side.
## Test plan
- [x] `cargo test --quiet --features remote -p lancedb --lib` — all 495
lib tests pass, including 4 new tests in `database::namespace::tests`:
- `test_namespace_table_not_found` — extended to assert
`Error::TableNotFound` (was just `is_err()`)
- `test_namespace_open_table_not_found_at_root` — covers the
root-namespace path
- `test_namespace_create_table_already_exists` — covers child namespace
- `test_namespace_create_table_already_exists_at_root` — covers root
namespace
- [x] `cargo clippy --quiet --features remote --tests` — clean
- [x] `cargo fmt --all` — clean
- [x] Manually confirmed (via test failures before the fix) that the two
`open_table` tests were returning `Error::Runtime { message: "Failed to
get table info from namespace: Namespace { source: TableNotFound { ... }
}" }` prior to this change.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to #3371, which added runtime validation for namespace `mode`
and `behavior` parameters in the NodeJS SDK. Bringing the same fix to
Python for cross-SDK consistency.
**Before:** unrecognized values were silently dropped to `None`, so
`db.create_namespace(["x"], mode="foobar")` would quietly fall through
to the server's default mode and hide caller typos.
**After:** raises `ValueError` listing the valid values.
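
A minimal sketch of the validation pattern follows. The helper name and the set of valid mode strings are assumptions for illustration; the real values are whatever the SDK defines.

```python
# Assumed example values; the SDK's actual accepted modes may differ.
CREATE_MODES = ("create", "exist_ok", "overwrite")


def validate_choice(param, value, valid):
    """Return value unchanged if valid; raise instead of dropping it to None."""
    if value is None:
        return None  # caller did not pass the option
    if value not in valid:
        raise ValueError(
            f"Invalid {param} {value!r}; valid values are: {', '.join(valid)}"
        )
    return value
```

Raising at the call site means a typo such as `mode="foobar"` fails immediately rather than silently falling back to the server default.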
## **Summary**
This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.
A `Scannable` wraps a schema, an optional row-count hint, a rescannable
flag, and a batch-producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.
This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow-up work.
---
## **Design**
### **Transport**
The transport uses the **Arrow IPC Stream format, one batch at a time**.
The JS side encodes each `RecordBatch` into a self contained IPC Stream
message containing schema, batch, and end of stream. The message is
returned as a `Buffer` through a napi `ThreadsafeFunction`. The Rust
side decodes it using `arrow_ipc::reader::StreamReader`.
Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.
I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.
Third party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode and
decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.
---
### **API**
```ts
class Scannable {
readonly schema: Schema
readonly numRows: number | null
readonly rescannable: boolean
static fromFactory(schema, factory, opts?)
static fromTable(table, opts?)
static fromIterable(schema, iter, opts?)
static fromRecordBatchReader(reader, opts?)
}
```
The FFI boundary consists of a single callback:
`getNextBatch(isStart: boolean): Promise<Buffer | null>`
`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid stream, for example a retried write
after a network error. Without this signal a retry would resume a stale
iterator and silently skip already emitted batches.
In addition, a schema only IPC buffer is transferred once during
construction.
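
The reset behavior of `isStart` can be modelled in a few lines. This is a synchronous Python sketch of the idea only: the real callback is asynchronous and returns IPC-encoded buffers, and `make_get_next_batch` is a hypothetical name.

```python
def make_get_next_batch(factory):
    """Build a callback that restarts from batch 0 whenever is_start is True."""
    state = {"iterator": None}

    def get_next_batch(is_start):
        if is_start or state["iterator"] is None:
            # Scan boundary: drop any cached iterator and re-invoke the factory.
            state["iterator"] = iter(factory())
        # None signals end of stream, mirroring the `null` return in the API.
        return next(state["iterator"], None)

    return get_next_batch
```

If the `is_start` check were removed, a scan abandoned after the first batch would make the next scan begin at the second batch, which is the silent skip described above.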
---
## **Changes**
* `nodejs/src/scannable.rs`
Adds `NapiScannable` and the `LanceScannable` implementation. Implements
`schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
Includes per batch schema validation against the declared schema, one
shot enforcement for non rescannable sources, and a scan boundary reset
signal (`isStart`) so rescannable sources restart from batch 0 on every
`scan_as_stream` call rather than resuming a stale iterator.
* `nodejs/src/lib.rs`
Module registration.
* `nodejs/lancedb/scannable.ts`
Defines the `Scannable` class and the four constructors listed above.
Each constructor rejects option combinations it cannot honor, for
example a `rescannable: true` request on a one shot iterable or reader,
and a `numRows` that disagrees with an in memory table's row count.
* `nodejs/lancedb/index.ts`
Exports the new primitive.
* `nodejs/__test__/scannable.test.ts`
Test suite for the primitive.
---
## **Validation**
Before implementing the bridge, I ran an end to end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.
The harness covered the following scenarios:
* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe
All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly end
to end and that Node's `Buffer` size limit does not constrain the
overall stream.
Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is consumed partially and the scan is
dropped mid stream, then re-scanned. The re-scan replays from batch 0
rather than resuming the stale iterator. The same harness was run with
the `isStart` reset path disabled and the mid stream restart case failed
as expected, confirming the test exercises the real regression.
These harnesses are not meant to replace the full test suite, which is
described below.
---
## **Tests**
`__test__/scannable.test.ts` covers construction, metadata reflection,
per constructor defaults and overrides, construction time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.
Runtime scan behavior including `scan_as_stream`, one shot enforcement
on non rescannable sources, schema mismatch detection, IPC decode
failures, and rescannable restart semantics is not exercised here. There
is no in tree JS consumer of `NapiScannable` yet. This mirrors Python's
`PyScannable`, which has no dedicated test file and is covered
transitively through the consumers that accept a Scannable.
Runtime coverage will follow in the consumer migration work.
---
## **Status**
Ready for review.
Closes #3223
---
## Summary
Switch the nodejs bindings and examples package from npm to pnpm 11 to
pick up its stronger supply-chain defaults:
- `minimumReleaseAge` defaults to 1 day, so newly-published (potentially
compromised) versions aren't resolved into installs for at least 24h.
- Install lifecycle scripts (`preinstall`/`install`/`postinstall`) are
no longer run for arbitrary transitive deps; only an explicit allowlist
may run them, and unapproved scripts cause install to fail
(`strictDepBuilds: true`).
- Audit uses GHSA IDs and `--fix=update` to add patched versions to
`minimumReleaseAgeExclude`.
This is the same class of protection that would have blunted the recent
TanStack/`@uipath`/etc. compromise discussed in the [Aikido
write-up](https://www.aikido.dev/blog/mini-shai-hulud-is-back-tanstack-compromised).
## Changes
- Replace `nodejs/package-lock.json` and
`nodejs/examples/package-lock.json` with `pnpm-lock.yaml`.
- Pin pnpm via `packageManager: pnpm@11.1.1` in both `package.json`s.
- Add `pnpm-workspace.yaml` with the four build-script packages we
actually need: `@biomejs/biome`, `onnxruntime-node`, `protobufjs`,
`sharp`. Everything else is blocked from running install scripts.
- Update package.json scripts (`npm run X` → `pnpm X`).
- Update workflows: `.github/workflows/nodejs.yml`,
`.github/workflows/npm-publish.yml`, and
`.github/workflows/codex-fix-ci.yml` — install pnpm via
`pnpm/action-setup@v4` and switch `setup-node` caches to
`pnpm-lock.yaml`.
- Refresh `nodejs/AGENTS.md`, `nodejs/CLAUDE.md`, and
`nodejs/CONTRIBUTING.md`.
`docs/package-lock.json` is **not** touched — out of scope for this PR.
## Test plan
- [ ] `Lint` job (lint Rust/TS + examples lint) passes on CI.
- [ ] `Linux (NodeJS 18/20)` build+test passes, including the examples
test step.
- [ ] `macos` build+test passes.
- [ ] `NPM Publish` workflow's PR dry-run completes (build matrix + test
matrix + dry `npm publish`).
- [ ] No new install-script approvals are required at install time.
## Follow-ups
- `update_package_lock_run_nodejs.yml` references a composite action
path that doesn't exist
(`./.github/workflows/update_package_lock_nodejs`); it was already
broken pre-PR. We may want to either delete this workflow or rewrite it
for pnpm in a follow-up.
- Consider migrating `docs/` to pnpm in a separate PR.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Summary
- Closes #3362
- Adds `prewarmData(columns?: string[])` to the Node bindings, mirroring
the Rust and Python implementations
### Testing
- [x] `npm run build` (regenerates the napi `.node` module + TS
declarations)
- [x] `npm run lint`
- [x] `npm test`
- [ ] live test against remote table - just waiting for my dev stack to
get created
### Documentation
- updated docs
## Summary
Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was
documented at https://docs.lancedb.com/indexing/vector-index but
`lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised
`ValueError: Unknown index type IVF_HNSW_FLAT` — the underlying
`pylance` already accepted it, only the LanceDB wrapper was missing the
wiring.
**Rust core (`rust/lancedb`):**
- Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the
`IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`).
- Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)`
helper, keeping symmetry with the other `IVF_HNSW_*` variants.
- Forward the variant in `RemoteTable::create_index` and add two
parametrised tests (default + customised config) for the JSON
serialisation.
- New `NativeTable` integration test
(`test_create_index_ivf_hnsw_flat`).
**Python binding (`python/`):**
- New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias.
- PyO3 `extract_index_params` recognises the `HnswFlat` config.
- `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync
`RemoteTable.create_index` both dispatch to the new config.
- `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover
the new type so `pyright`/`make check` stays clean.
- Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync
dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage.
- Existing `test_index_statistics_index_type_lists_all_supported_values`
updated to include `IVF_HNSW_FLAT`.
A matching Node.js / TypeScript binding is in a follow-up PR.
Closes #3331
## Test plan
- [ ] `cargo check --quiet --features remote --tests --examples`
- [ ] `cargo test --quiet --features remote -p lancedb` (covers the
new `test_create_index_ivf_hnsw_flat` and the two new parametrised
`RemoteTable::create_index` cases)
- [ ] `cargo fmt --all` / `cargo clippy --quiet --features remote
--tests --examples`
- [ ] `cd python && make develop && make check && make test` (covers
the two new async tests, the alias test, the dispatcher test, and the
updated `test_index_statistics_index_type_lists_all_supported_values`
assertion)
This wires Lance's existing `jieba/*` and `lindera/*` native FTS
tokenizers through the Python SDK instead of leaving them behind
disabled features and narrow public typing. It also documents the
`LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for
successful CJK indexing plus missing-model error guidance.
Closes #2168.
## Summary
`url::Url::query_pairs_mut()` leaves the URL with `query=Some("")` after
`.clear()` even when the input had no query string. The listing-database
connect path then captured that empty query into
`ListingDatabase::query_string`, and `table_uri()` blindly appended
`?<query>` to every per-table URI — producing URIs like
`s3://bucket/prefix/foo.lance?`.
The trailing `?` is benign for normal table operations, but it breaks
any caller that constructs a sub-path from the table URI. In particular,
MemWAL flushes write to `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`,
which `url::Url::parse` then re-parses as `path=<base table>` +
`query=/_mem_wal/...`. `Dataset::write` resolves the base table dataset,
finds it already exists, and fails with `Dataset already exists:
…_gen_1` on the very first MemTable flush (observed deterministically
against S3 across all merge_insert LSM modes; tracked in
[lance-format/lance#6713](https://github.com/lance-format/lance/pull/6715)).
## Fix
Treat `Some("")` query the same as no query when capturing
`query_string`. A real `?foo=bar` query is still propagated unchanged.
Adds a regression test covering both the empty-query and non-empty-query
paths.
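
The shape of the fix can be shown with a short Python analogue. The real change is in the Rust listing database; `table_uri` here is a simplified, hypothetical version, and `urllib.parse.urlsplit` stands in for `url::Url::parse` to show how a trailing `?` swallows a sub-path.

```python
from urllib.parse import urlsplit


def table_uri(base_uri, table_name, query_string):
    """Build a per-table URI, treating an empty query like no query."""
    uri = f"{base_uri.rstrip('/')}/{table_name}.lance"
    if query_string:  # both None and "" skip the "?" suffix
        uri += f"?{query_string}"
    return uri


def path_and_query(uri):
    """Show how a parser splits a URI into path and query."""
    parts = urlsplit(uri)
    return parts.path, parts.query
```

For example, `path_and_query("s3://bucket/prefix/foo.lance?" + "/_mem_wal/shard")` puts the sub-path in the query component and leaves the path pointing at the base table, which is the misparse described above.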
## Verification
- `url::Url::parse("s3://bucket/prefix/").query()` → `None`, but after
`query_pairs_mut().clear()` → `Some("")`. Confirmed in a standalone
repro.
- Without this fix, every `table_uri()` for an `s3://`-style connection
ends with `?`, breaking MemWAL and any future sub-path consumer in the
same way.
- New unit test `test_table_uri_url_path_has_no_trailing_question_mark`
exercises both code paths.
Fixes #3299
## Problem
Two security issues exist in `.github/workflows/java-publish.yml`:
1. **`gpg-passphrase` input is misused**: `actions/setup-java`'s
`gpg-passphrase` input expects the **name** of an environment variable
(default: `GPG_PASSPHRASE`), not the secret value itself. The previous
value `${{ secrets.GPG_PASSPHRASE }}` was setting the env var name to
the actual secret, which is incorrect.
2. **Passphrase visible on the command line**: `-Dgpg.passphrase=${{
secrets.GPG_PASSPHRASE }}` passes the GPG passphrase as a Maven system
property argument, making it visible in process listings and potentially
echoed in debug logs — a supply-chain security risk for release
workflows.
## Solution
- Fix `gpg-passphrase: MAVEN_GPG_PASSPHRASE` — use the correct env var
name so `actions/setup-java` generates a proper Maven `settings.xml`
entry that reads from `MAVEN_GPG_PASSPHRASE`.
- Remove `-Dgpg.passphrase=...` from the Maven CLI invocation.
- Add `MAVEN_GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}` to the
`env:` block of the Publish step, so the passphrase is available as an
environment variable rather than a CLI argument.
## Testing
The Java publish workflow only runs on tag pushes, so this cannot be
exercised in a PR build. The logic change is straightforward:
`actions/setup-java` is documented to write a `settings.xml` that reads
`<gpg.passphrase>` from the named env var, and `maven-gpg-plugin` picks
it up from there without any CLI argument.
Co-authored-by: octo-patch <octo-patch@github.com>
## Summary
PyTorch's `DataLoader` uses fork-based multiprocessing by default on
Linux, but threads do not survive `fork()`. LanceDB's Python bindings
drive async work through two threaded layers, both of which become inert
in a forked child:
- `BackgroundEventLoop` runs an asyncio loop on a Python
`threading.Thread`.
- `pyo3-async-runtimes::tokio` holds a global multi-threaded tokio
runtime whose worker threads also die on fork — and its runtime lives in
a `OnceLock` that cannot be replaced after first use.
As a result, any `Permutation` (or other async API) used inside a
fork-based `DataLoader` worker hangs indefinitely. This PR makes both
layers fork-safe so `Permutation` works as a `torch.utils.data.Dataset`
with `num_workers > 0`.
## Approach
### Rust — new `python/src/runtime.rs`
Mirrors the pattern used in [Lance's Python
bindings](456198cd6f/python/src/lib.rs (L139)),
adapted for the async-bridge use case.
- `LanceRuntime` implements `pyo3_async_runtimes::generic::Runtime +
ContextExt`, backed by an `AtomicPtr<tokio::runtime::Runtime>` we own
(sidestepping `pyo3-async-runtimes`'s frozen `OnceLock` global).
- A `pthread_atfork(after_in_child)` handler nulls the pointer; the next
`spawn` rebuilds the runtime in the child. The previous runtime is
intentionally **leaked** — calling `Drop` would try to join now-dead
worker threads and hang.
- `runtime::future_into_py` is a drop-in for
`pyo3_async_runtimes::tokio::future_into_py`. All ~80 call sites in
`arrow.rs` / `connection.rs` / `permutation.rs` / `query.rs` /
`table.rs` are updated to route through it.
- `python/Cargo.toml` adds `libc = "0.2"` and the tokio
`rt-multi-thread` feature.
### Python — `lancedb/background_loop.py`
- Refactors `BackgroundEventLoop.__init__` to a reusable `_start()`
method.
- An `os.register_at_fork(after_in_child=…)` hook calls `LOOP._start()`
to give the singleton a fresh asyncio loop and thread **in place**. This
matters because the rest of the codebase imports `LOOP` via `from
.background_loop import LOOP` — rebinding the module attribute would
leave those references holding the dead loop.
### Python — `lancedb/__init__.py`
Removes the `__warn_on_fork` pre-fork warning (and the now-unused
`import warnings`). Fork is supported.
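
The in-place restart on the Python side can be sketched as follows. This is a simplified stand-in for `lancedb/background_loop.py`, not a copy of it: the `run` helper and its timeout are assumptions, and only the `_start()` refactor and the `os.register_at_fork` hook come from the description above.

```python
import asyncio
import os
import threading


class BackgroundEventLoop:
    """Runs an asyncio loop on a background thread."""

    def __init__(self):
        self._start()

    def _start(self):
        # Reusable setup: safe to call again to replace a dead loop/thread.
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def run(self, coro, timeout=5.0):
        # Hypothetical helper: submit a coroutine and wait for its result.
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result(timeout)


LOOP = BackgroundEventLoop()

# After fork, the child has no background thread. Restart the singleton in
# place so that existing `from module import LOOP` references stay valid.
if hasattr(os, "register_at_fork"):
    os.register_at_fork(after_in_child=LOOP._start)
```

Mutating the existing object rather than assigning a new `LOOP` is the key design choice: every module that imported the name earlier keeps working in the child.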
## Test plan
- [x] New `test_permutation_dataloader_fork_workers` in
`python/tests/test_torch.py`: runs a `Permutation` through
`torch.utils.data.DataLoader(num_workers=2,
multiprocessing_context="fork")` inside a spawn-isolated child with a
30s hang detector. **Pre-fix**: timed out at 36s. **Post-fix**: passes
in ~3.6s.
- [x] New `test_remote_connection_after_fork` in
`python/tests/test_remote_db.py`: forks a child that creates a fresh
`lancedb.connect(...)` against a mock HTTP server and calls
`table_names()`; passes in <1s, validates the runtime reset is
sufficient for fresh remote clients.
- [x] All 62 tests in `test_torch.py` + `test_permutation.py` pass.
- [x] All 35 tests in `test_remote_db.py` pass.
- [x] `test_table.py` (87) + `test_db.py` + `test_query.py` (157, minus
one unrelated `sentence_transformers` import skip) — 244 passing.
- [x] `cargo clippy -p lancedb-python --tests` clean.
- [x] `cargo fmt`, `ruff check`, `ruff format` all clean.
## Known limitation (follow-up)
This PR makes a **freshly-built** `lancedb.connect(...)` work in a
forked child. An **inherited** `Connection` from the parent still
carries an inherited `reqwest::Client` whose hyper connection pool
references socket FDs and TCP/TLS state shared with the parent — using
it from the child after fork is unsafe (especially with HTTP/1.1
keep-alive). The recommended pattern for fork-based `DataLoader` workers
that hit a remote DB is to construct a new connection inside the worker.
Auto-clearing inherited HTTP client pools on fork would require tracking
live `Connection` instances in `lancedb` core and is left for a
follow-up PR.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
When PyTorch is used with multiprocessing in `spawn` mode, the
`Permutation` needs to be pickled. It could not be pickled because
`Table` and `Connection` are not serializable. This PR adds pickle
support to `Permutation` without adding general pickle support to `Table`
or `Connection`. To add general support we probably need to start by
adding serialization in the namespace client.
In the meantime this PR enables pickling by adding special cases for:
* In-memory tables (just serialize as Arrow IPC)
* Native tables (serialize the URI)
If a user is not using one of the above cases (e.g. using a remote
connection) then they will need to provide a connection factory that can
be pickled.
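
The "serialize the URI" special case can be illustrated with Python's `__getstate__` / `__setstate__` hooks. The classes below are hypothetical toys, not the LanceDB `Permutation` or `Table`; a `threading.Lock` stands in for a native handle that cannot be pickled.

```python
import pickle
import threading


class ToyTable:
    def __init__(self, uri):
        self.uri = uri
        self._handle = threading.Lock()  # stand-in for an unpicklable handle


class ToyPermutation:
    def __init__(self, table, row_ids):
        self.table = table
        self.row_ids = list(row_ids)

    def __getstate__(self):
        # Serialize only what is needed to reopen the table, not the handle.
        return {"uri": self.table.uri, "row_ids": self.row_ids}

    def __setstate__(self, state):
        # Reopen the table from its URI in the receiving process.
        self.table = ToyTable(state["uri"])
        self.row_ids = state["row_ids"]


def roundtrip(permutation):
    """Pickle and unpickle, as a spawn-based worker start would."""
    return pickle.loads(pickle.dumps(permutation))
```

A remote connection has no URI that is sufficient to reopen it, which is why that case needs a user-supplied picklable connection factory instead.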
## Breaking change
`PermutationBuilder.persist(...)` is removed from the Python bindings;
the permutation table is now always in-memory. The underlying Rust
`PermutationBuilder::persist` API is untouched and can be re-exposed
later if needed. It probably won't make sense to do that until we have a
way to serialize `Table` and `Connection`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `build:release` command already outputs the `*.node` files directly
to the `dist/` directory via the `--output-dir dist` flag.
Therefore, the `postbuild:release` script, which attempts to copy
`*.node` files from the `lancedb/` source directory, fails with a "no
such file or directory" error because the source files do not exist
there.
This commit removes the redundant `postbuild:release` script to resolve
the build failure.
Fixes #3284
Signed-off-by: qingfeng-occ <qing.feng@zte.com.cn>
The hybrid query error message was missing spaces between its
concatenated string literals; this adds them.
```python
def _validate_query(self, query, vector=None, text=None):
    if query is not None and (vector is not None or text is not None):
        raise ValueError(
            "You can either provide a string query in search() method "
            "or set `vector()` and `text()` explicitly for hybrid search. "
            "But not both."
        )
```
Adds `manifest_enabled` for local/native connections so that directory
namespace manifests can be the source of truth. This includes migration
from directory listing and the feature wiring for Azure credential
vending. The option is also exposed through the Rust, Python, and Node
bindings, with focused validation.
## Summary
- Update `rustls-webpki` 0.103.10 → 0.103.13 to fix RUSTSEC-2026-0104
(reachable panic in CRL parsing)
- Add advisory ignore for the legacy `rustls-webpki` 0.101.7 copy pinned
to the aws-smithy/rustls 0.21 chain (same chain already exempted for
RUSTSEC-2026-0098/0099)
Fixes the `deny` CI job failure seen in #3325.
## Test plan
- [x] `cargo deny check advisories` passes locally
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a `deny.toml` at the workspace root and a `deny` CI job that runs
`cargo deny check` on every PR. Catches yanked crates, license drift,
banned or wildcard dependencies, unapproved sources, and new RUSTSEC
advisories.
As part of wiring this up:
- Updated `aws-lc-rs` 1.13.0 → 1.16.3 / `aws-lc-sys` 0.28.0 → 0.40.0 to
clear four 2026 AWS-LC advisories (timing side-channel, PKCS7 bypass,
CRL scope). Removed the `=0.28.0` workaround pin; the original build
failure no longer reproduces.
- Updated `bytes`, `zlib-rs`, `rand`, `rustls-webpki`, `lz4_flex` to
clear their current advisories.
- Marked `lancedb-nodejs` and `lancedb-python` as `publish = false` and
pinned `lzma-sys` from `*` to `0.1` so `bans.wildcards = "deny"` can
be enforced.
10 remaining advisories have no safe upgrade available (transitive via
opendal, lance, datafusion, async-openai, aws-sdk on the legacy rustls
0.21 chain). Each is ignored in `deny.toml` with a per-entry rationale
and a link to the RUSTSEC advisory. New advisories still fail CI.
Fixes #3297
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
- Replaces `LANCEDB_PYPI_API_TOKEN` (long-lived token) with OIDC trusted
publishing via `pypa/gh-action-pypi-publish`
- Adds `id-token: write` permission to linux/mac/windows jobs
- Removes `twine`-based upload and the `pypi_token` input from
`upload_wheel` composite action
- Enables PEP 740 Sigstore attestations on published wheels as a bonus
After merging, rotate/revoke the `LANCEDB_PYPI_API_TOKEN` secret.
Closes #3294
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds `.github/dependabot.yml` enabling weekly cargo update PRs for the
root workspace, which produces the Rust binaries we ship: the Node.js
and Python native extensions. The `rust/lancedb` library crate shares
the same lockfile — its consumers pick versions themselves, but bumping
transitive deps here keeps the shipped binaries current.
Also removes the misleading `exclude = ["python"]` line from the root
`Cargo.toml`: `python` is listed in `members`, and `cargo metadata`
confirms it's a workspace member, so the exclude was dead code that
implied the opposite.
Minor/patch updates are grouped to reduce PR noise.
Part of #3292. Only covers the cargo ecosystem; pip, npm, and
github-actions can follow.
## Summary
- make `TlsConfig::default()` enable hostname verification by default
- align the Rust default with the documented Python and Node behavior
- update the Rust unit test to lock in the safe default
This follows the Rust-side Tantivy removal by deleting the remaining
Python Tantivy runtime, tests, and packaging references.
It also turns the legacy Python-only Tantivy parameters into explicit
errors and stops reading legacy `_indices/fts` directories so Python FTS
is fully native-only.
Adds `permissions: contents: read` to the 10 workflows that had no
top-level permissions block. Workflows that already declared
permissions, or individual jobs that need elevated permissions (`issues:
write`, `pull-requests: write`, `contents: write`), are left unchanged.
Affected workflows: `dev.yml`, `java-publish.yml`, `java.yml`,
`license-header-check.yml`, `nodejs.yml`, `pypi-publish.yml`,
`python.yml`, `rust.yml`, `update_package_lock_run.yml`,
`update_package_lock_run_nodejs.yml`
Fixes #3269.
## What I observed
Using a reranker in a hybrid query could keep the Node.js process alive
even after `table.close()` and `db.close()`.
## Root cause
The reranker callback bridge used a `ThreadsafeFunction` in referenced
mode, which can keep the event loop alive longer than intended.
## Minimal fix
- In `nodejs/src/rerankers.rs`, create the reranker callback TSFN in
weak mode (`.weak::<true>()`).
- Add a regression test in `nodejs/__test__/rerankers.test.ts` that
spawns a child process, runs a rerank query, and asserts the process
exits naturally.
## Validation
- Built Node bindings successfully.
- Ran targeted tests: `rerankers.test.ts` passes (including new
regression test).
- Pre-commit checks for changed files were run and clean.
So far, I have been using a hacky approach to create and open
namespace-backed tables: get the table's location, then use a temporary
lancedb connection to create or open it. This worked for features like
credentials vending but no longer fully works for the managed versioning
feature. Geneva tests have recently been failing intermittently, and the
various patches have not addressed the root cause. This PR fixes the
root cause and implements a proper Rust binding for it.
Specifically:
- build a real Rust namespace-backed connection from the Python
namespace client
- route namespace table create/open through that connection instead of
resolved-location temp connections
- keep namespace client naming consistent in the Rust bridge and
preserve federated namespace + DuckDB behavior
## Summary
- pass `namespace_client` through the Python create-table path
- ensure schema-only namespace table creation uses the namespace-aware
empty-table flow
- fix reopening namespace tables created without initial data
## Summary
- delegate child-namespace `ListingDatabase` operations through an
eagerly initialized `LanceNamespaceDatabase`
- support nested namespace create/open/list/drop flows without requiring
callers to inject explicit locations
- add `namespace_client_properties` plumbing for local and namespace
connections so directory namespace settings like
`table_version_tracking_enabled` can be configured
- add regression tests for nested namespace ops and namespace client
property propagation
## Summary
Add connection serialization and child namespace support to
`LanceDBConnection`.
- `DBConnection.serialize()` / `lancedb.deserialize()` for connection
reconstruction in remote workers
- Cache `namespace_client()` in `LanceDBConnection` to avoid repeated
DirectoryNamespace builds
- `LanceDBConnection` transparently delegates child namespace operations
(open_table, create_table, list_tables, drop_table, create_namespace,
etc.) to `LanceNamespaceDBConnection` via `_namespace_conn()`
- Root namespace operations still go through the original Rust path
- Generic worker property override mechanism: any
`namespace_client_properties` key prefixed with `_lancedb_worker_` has
the prefix stripped and overrides the corresponding property when
`deserialize(data, for_worker=True)`
- `LanceNamespaceDBConnection` stores
`namespace_client_impl`/`namespace_client_properties` for serialization
roundtrip
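A minimal sketch of the worker override rule described above, using plain dicts. `apply_worker_overrides` is a hypothetical helper, not the LanceDB implementation, and dropping the prefixed keys from the resolved properties is an assumption made for illustration.

```python
WORKER_PREFIX = "_lancedb_worker_"


def apply_worker_overrides(properties, for_worker):
    """Hypothetical helper: resolve namespace client properties for a worker."""
    # Assumption: prefixed keys are never passed through to the client as-is.
    resolved = {
        key: value
        for key, value in properties.items()
        if not key.startswith(WORKER_PREFIX)
    }
    if for_worker:
        for key, value in properties.items():
            if key.startswith(WORKER_PREFIX):
                # Strip the prefix and override the corresponding property.
                resolved[key[len(WORKER_PREFIX):]] = value
    return resolved


props = {
    "endpoint": "https://public.example",
    "_lancedb_worker_endpoint": "http://internal.example",
}
print(apply_worker_overrides(props, for_worker=False))
print(apply_worker_overrides(props, for_worker=True))
```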
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bumps the Rust toolchain to 1.94.0 (latest installed) to unblock CI
failures caused by the AWS SDK's MSRV requirement. No lint fixes were
needed.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- migrate gemini-text embedding provider from deprecated
google.generativeai to google.genai
- update Python embedding extra dependency to google-genai
- update default model name to gemini-embedding-001
- adapt embed calls to Client().models.embed_content(...)
- apply lint fixes from CI
## Related
- Closes #3191
`.get(b"split_names", None).decode()` was called unconditionally in both
Permutations.__init__ and Permutation.from_tables(), crashing with
AttributeError when schema metadata existed but lacked the split_names
key. Guard the decode behind a None check and add regression tests.
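The guard can be sketched as follows. `read_split_names` is a hypothetical helper name and the metadata payload is made up; only the `None` check before `.decode()` mirrors the fix described above.

```python
def read_split_names(metadata):
    """Hypothetical helper: return the decoded split_names entry, or None."""
    # Schema metadata may be None, or a bytes-keyed dict without the key.
    raw = (metadata or {}).get(b"split_names")
    if raw is None:
        return None
    return raw.decode()


# The unguarded form fails when the key is absent:
try:
    {b"other": b"x"}.get(b"split_names", None).decode()
except AttributeError as exc:
    unguarded_error = exc

print(read_split_names({b"other": b"x"}))
print(read_split_names({b"split_names": b'["train", "test"]'}))
```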
## Problem
`on_bad_vectors="drop"` is supposed to remove invalid vector rows before
write, but for some schema-defined vector columns it can still fail
later during Arrow cast instead of dropping the bad row.
Repro:
```python
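import lancedb
from lancedb.pydantic import LanceModel, Vector

# Any local path works here; this one is only for the repro.
db = lancedb.connect("/tmp/lancedb-on-bad-vectors")


class MySchema(LanceModel):
    text: str
    embedding: Vector(16)


table = db.create_table("test", schema=MySchema)
table.add(
    [
        {"text": "hello", "embedding": []},
        {"text": "bar", "embedding": [0.1] * 16},
    ],
    on_bad_vectors="drop",
)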
```
Before:
```
RuntimeError
Arrow error: C Data interface error: Invalid: ListType can only be casted to FixedSizeListType if the lists are all the expected size.
```
After:
```
rows 1
texts ['bar']
```
## Solution
Make bad-vector sanitization use schema dimensions before cast, while
keeping the handling scoped to vector columns identified by schema
metadata or existing vector-name heuristics.
This also preserves existing integer vector inputs and avoids applying
on_bad_vectors to unrelated fixed-size float columns.
Fixes #1670
Signed-off-by: yaommen <myanstu@163.com>
## Summary
- Add a `user_id` field to `ClientConfig` that allows users to identify
themselves to LanceDB Cloud/Enterprise
- The user_id is sent as the `x-lancedb-user-id` HTTP header in all
requests
- Supports three configuration methods:
- Direct assignment via `ClientConfig.user_id`
- Environment variable `LANCEDB_USER_ID`
- Indirect env var lookup via `LANCEDB_USER_ID_ENV_KEY`
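A rough sketch of how the three configuration methods could be resolved. The precedence order (direct value, then `LANCEDB_USER_ID`, then the variable named by `LANCEDB_USER_ID_ENV_KEY`) is an assumption, and `resolve_user_id` / `user_id_headers` are hypothetical helpers, not the SDK's API.

```python
import os


def resolve_user_id(explicit=None, env=None):
    """Hypothetical resolver; the precedence order here is assumed."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    if env.get("LANCEDB_USER_ID"):
        return env["LANCEDB_USER_ID"]
    # Indirect lookup: the variable holds the *name* of another variable.
    key = env.get("LANCEDB_USER_ID_ENV_KEY")
    if key:
        return env.get(key)
    return None


def user_id_headers(user_id):
    """Build the header described above, or nothing when no user id is set."""
    return {"x-lancedb-user-id": user_id} if user_id else {}


fake_env = {"LANCEDB_USER_ID_ENV_KEY": "CI_USER", "CI_USER": "alice"}
print(user_id_headers(resolve_user_id(env=fake_env)))
```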
Closes #3230
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
Fixes #1846.
Python `Enum` fields raised `TypeError: Converting Pydantic type to
Arrow Type: unsupported type <enum 'SomethingTypes'>` when converting a
Pydantic model to an Arrow schema.
The fix adds Enum detection in `_pydantic_type_to_arrow_type`. When an
Enum subclass is encountered, the value type of its members is inspected
and mapped to the appropriate Arrow type:
- `str`-valued enums (e.g. `class Status(str, Enum)`) → `pa.utf8()`
- `int`-valued enums (e.g. `class Priority(int, Enum)`) → `pa.int64()`
- Other homogeneous value types → the Arrow type for that Python type
- Mixed-value or empty enums → `pa.utf8()` (safe fallback)
This covers the common `(str, Enum)` and `(int, Enum)` mixin patterns
used in practice.
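The value-type inspection can be sketched without Arrow. `enum_value_type` below is a hypothetical stand-in that returns the Python type a caller would then map to an Arrow type (`str` to `pa.utf8()`, `int` to `pa.int64()`); it is not the code in `_pydantic_type_to_arrow_type`.

```python
from enum import Enum


def enum_value_type(enum_cls):
    """Hypothetical helper: pick the Python type that represents an Enum."""
    value_types = {type(member.value) for member in enum_cls}
    if len(value_types) == 1:
        # Homogeneous enum: use the members' own value type.
        return value_types.pop()
    # Mixed-value or empty enum: fall back to strings.
    return str


class Status(str, Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


class Priority(int, Enum):
    LOW = 1
    HIGH = 2


class Mixed(Enum):
    A = 1
    B = "b"


class Empty(Enum):
    pass


for cls in (Status, Priority, Mixed, Empty):
    print(cls.__name__, enum_value_type(cls).__name__)
```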
## Changes
- `python/python/lancedb/pydantic.py`: add Enum branch in
`_pydantic_type_to_arrow_type`
- `python/python/tests/test_pydantic.py`: add `test_enum_types` covering
`str`, `int`, and `Optional` Enum fields
## Note on #2395
PR #2395 handles `StrEnum` (Python 3.11+) specifically, using a
dictionary-encoded type. This PR handles the broader `(str, Enum)` /
`(int, Enum)` mixin pattern that works across all Python versions and
stores values as their natural Arrow type.
AI assistance was used in developing this fix.
1. Refactored every client (Rust core, Python, Node/TypeScript) so
“namespace” usage is explicit: code now keeps namespace paths
(namespace_path) separate from namespace clients (namespace_client).
Connections propagate the client, table creation routes through it, and
managed versioning defaults are resolved from namespace metadata. Python
gained LanceNamespaceDBConnection/async counterparts, and the
namespace-focused tests were rewritten to match the clarified API
surface.
2. Synchronized the workspace with Lance 5.0.0-beta.3 (see
https://github.com/lance-format/lance/pull/6186 for the upstream
namespace refactor), updating Cargo/uv lockfiles and ensuring all
bindings align with the new namespace semantics.
3. Added a namespace-backed code path to lancedb.connect() via new
keyword arguments (namespace_client_impl, namespace_client_properties,
plus the existing pushdown-ops flag). When those kwargs are supplied,
connect() delegates to connect_namespace, so users can opt into
namespace clients without changing APIs. (The async helper will gain
parity in a later change)
Bumps all lance-* workspace dependencies from `4.0.0-rc.3` (git source)
to the stable `4.0.0` release on crates.io, removing the `git`/`tag`
overrides.
No code changes were required — compiles and passes clippy cleanly.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes #1540
I could not reproduce this on current `main` from Python, but I could
still reproduce it from the Rust SDK.
Python no longer reproduces because the current Python vector/hybrid
query paths re-chunk results into a `pyarrow.Table` before returning
batches. Rust still reproduced because `max_batch_length` was passed
into planning/scanning, but vector search could still emit larger
`RecordBatch`es later in execution (for example after KNN / TopK), so it
was not enforced on the final Rust output stream.
This PR enforces `max_batch_length` on the final Rust query output
stream and adds Rust regression coverage.
Before the fix, the Rust repro produced:
`num_batches=2, max_batch=8192, min_batch=1808, all_le_100=false`
After the fix, the same repro produces batches `<= 100`.
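Conceptually, the enforcement is a final re-chunking step over the output stream. The sketch below illustrates that idea in Python over plain lists; it is not the Rust implementation, and `rechunk` is a hypothetical name.

```python
def rechunk(batches, max_batch_length):
    """Split each incoming batch so no emitted batch exceeds the limit."""
    for batch in batches:
        for start in range(0, len(batch), max_batch_length):
            yield batch[start:start + max_batch_length]


# Two oversized batches with the row counts from the repro output above.
upstream = [list(range(8192)), list(range(1808))]
sizes = [len(chunk) for chunk in rechunk(upstream, 100)]
print(max(sizes), sum(sizes))
```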
## Runnable Rust repro
Before this fix, current `main` could still return batches like `[8192,
1808]` here even with `max_batch_length = 100`:
```rust
use std::sync::Arc;

use arrow_array::{
    types::Float32Type, FixedSizeListArray, RecordBatch, RecordBatchReader, StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use lancedb::query::{ExecutableQuery, QueryBase, QueryExecutionOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tmp = tempfile::tempdir()?;
    let uri = tmp.path().to_str().unwrap();
    let rows = 10_000;

    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Utf8, false),
        Field::new(
            "vector",
            DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), 4),
            false,
        ),
    ]));
    let ids = StringArray::from_iter_values((0..rows).map(|i| format!("row-{i}")));
    let vectors = FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
        (0..rows).map(|i| Some(vec![Some(i as f32), Some(1.0), Some(2.0), Some(3.0)])),
        4,
    );
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids), Arc::new(vectors)])?;
    let reader: Box<dyn RecordBatchReader + Send> = Box::new(
        arrow_array::RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
    );

    let db = lancedb::connect(uri).execute().await?;
    let table = db.create_table("test", reader).execute().await?;

    let mut opts = QueryExecutionOptions::default();
    opts.max_batch_length = 100;

    let mut stream = table
        .query()
        .nearest_to(vec![0.0, 1.0, 2.0, 3.0])?
        .limit(rows)
        .execute_with_options(opts)
        .await?;

    let mut sizes = Vec::new();
    while let Some(batch) = stream.try_next().await? {
        sizes.push(batch.num_rows());
    }
    println!("{sizes:?}");
    Ok(())
}
```
Signed-off-by: yaommen <myanstu@163.com>
The test added in #3190 unconditionally imports `PIL`, which is an
optional dependency. This causes CI failures in environments where
Pillow isn't installed (`ModuleNotFoundError: No module named 'PIL'`).
Use `pytest.importorskip` to skip gracefully when Pillow is unavailable.
Fixes CI failure on main.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Namespace tests expected `RuntimeError` for table-not-found and
namespace-not-empty cases, but `lance_namespace` raises
`TableNotFoundError` and `NamespaceNotEmptyError` which inherit from
`Exception`, not `RuntimeError`.
- Updated `pytest.raises` to use the correct exception types.
## Test plan
- [x] CI passes on `test_namespace.py`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
url_retrieve() calls urllib.request.urlopen() but only urllib.error was
imported, causing AttributeError for any HTTP URL input. This affects
open-clip, siglip, and jinaai embedding functions when processing image
URLs.
The bug has existed since the embeddings API refactor (#580) but was
masked because most users pass local file paths or bytes rather than
HTTP URLs.
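A small sketch of the corrected import pattern. `url_retrieve_sketch` is a simplified stand-in for the real helper, and a `data:` URL is used so the example runs without network access. Importing `urllib.error` on its own is not expected to make `urllib.request` available as an attribute, unless some other module happens to have imported it first.

```python
import urllib.request  # explicit import; urllib.error alone is not enough


def url_retrieve_sketch(url):
    """Simplified stand-in: fetch the raw bytes behind a URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# data: URLs are handled by urllib's default opener, so this stays offline.
payload = url_retrieve_sketch("data:text/plain;base64,aGVsbG8=")
print(payload)
```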
Fixes #2716
## Summary
Add support for querying with Float16Array, Float64Array, and Uint8Array
vectors in the Node.js SDK, eliminating precision loss from the previous
`Float32Array.from()` conversion.
## Implementation
Follows @wjones127's [5-step
plan](https://github.com/lancedb/lancedb/issues/2716#issuecomment-3447750543):
### Rust (`nodejs/src/query.rs`)
1. `bytes_to_arrow_array(data: Uint8Array, dtype: String)` helper that:
- Creates an Arrow `Buffer` from the raw bytes
- Wraps it in a typed `ScalarBuffer<T>` based on the dtype enum
- Constructs a `PrimitiveArray` and returns `Arc<dyn Array>`
2. `nearest_to_raw(data, dtype)` and `add_query_vector_raw(data, dtype)` NAPI
methods that pass the type-erased array to the core
`nearest_to`/`add_query_vector` which already accept `impl
IntoQueryVector` for `Arc<dyn Array>`
### TypeScript (`nodejs/lancedb/query.ts`, `arrow.ts`)
3. Extended `IntoVector` type to include `Uint8Array` (and
`Float16Array` via runtime check for Node 22+)
4. `extractVectorBuffer()` helper detects non-Float32 typed arrays and
extracts their underlying byte buffer + dtype string
5. `nearestTo()` and `addQueryVector()` route through the raw NAPI path when
the input is Float16/Float64/Uint8
### Backward compatibility
Existing `Float32Array` and `number[]` inputs are unchanged -- they still
use the original `nearest_to(Float32Array)` NAPI method. The new raw path
is only used when a non-Float32 typed array is detected.
## Usage
```typescript
// Float16Array (Node 22+) -- no precision loss
const f16vec = new Float16Array([0.1, 0.2, 0.3]);
const f16Results = await
table.query().nearestTo(f16vec).limit(10).toArray();
// Float64Array -- no precision loss
const f64vec = new Float64Array([0.1, 0.2, 0.3]);
const f64Results = await
table.query().nearestTo(f64vec).limit(10).toArray();
// Uint8Array (binary embeddings)
const u8vec = new Uint8Array([1, 0, 1, 1, 0]);
const u8Results = await
table.query().nearestTo(u8vec).limit(10).toArray();
// Existing usage unchanged
const results = await table.query().nearestTo([0.1, 0.2,
0.3]).limit(10).toArray();
```
## Note on dependencies
The Rust side uses `arrow_array`, `arrow_buffer`, and `half` crates.
These should already be in the dependency tree via `lancedb` core, but
`Cargo.toml` may need explicit entries for `half` and the arrow
sub-crates in the nodejs workspace.
---------
Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Fixes #3183
## Summary
When `table.add(mode='overwrite')` is called, PyArrow infers input data
types (e.g. `list<double>`) which differ from the original table schema
(e.g. `fixed_size_list<float32>`). Previously, overwrite mode bypassed
`cast_to_table_schema()` entirely, so the inferred types replaced the
original schema, breaking vector search.
This fix builds a merged target schema for overwrite: columns present in
the existing table schema keep their original types, while columns
unique to the input pass through as-is. This way
`cast_to_table_schema()` is applied unconditionally, preserving vector
column types without blocking schema evolution.
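The merge rule can be sketched with plain dicts standing in for Arrow schemas (column name to a type string). `merged_target_schema` is a hypothetical helper; the actual change lives in the Rust `add_data.rs` path.

```python
def merged_target_schema(table_schema, input_schema):
    """Keep existing types for known columns; pass new columns through."""
    return {
        name: table_schema.get(name, input_type)
        for name, input_type in input_schema.items()
    }


existing = {"id": "int64", "vector": "fixed_size_list<float32>[4]"}
incoming = {"id": "int64", "vector": "list<double>", "note": "string"}
print(merged_target_schema(existing, incoming))
```

In this sketch, `vector` keeps its fixed-size type while the new `note` column is added with the inferred type, so schema evolution is still possible.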
## Changes
- `rust/lancedb/src/table/add_data.rs`: For overwrite mode, construct a
target schema by matching input columns against the existing table
schema, then cast. Non-overwrite (append) path is unchanged.
- Added `test_add_overwrite_preserves_vector_type` test that creates a
table with `fixed_size_list<float32>`, overwrites with `list<double>`
input, and asserts the original type is preserved.
## Test Plan
- `cargo test --features remote -p lancedb -- test_add_overwrite` — all
4 overwrite tests pass
- Full suite: 454 passed, 2 failed (pre-existing `remote::retry` flakes
unrelated to this change)
---------
Signed-off-by: majiayu000 <1835304752@qq.com>
dict.update() mutates in place and returns None. Assigning its result
caused with_metadata(None) to strip all schema metadata when embedding
metadata was merged during create_table with embedding_functions.
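A minimal reproduction of the pattern with illustrative keys and values (not the real schema metadata):

```python
metadata = {"existing": "value"}
embedding_metadata = {"embedding_functions": "[...]"}

# Buggy pattern: update() mutates in place and returns None.
broken = dict(metadata).update(embedding_metadata)

# Fixed pattern: update first, then use the dict itself.
merged = dict(metadata)
merged.update(embedding_metadata)

# Equivalent one-liner that does return a new dict.
merged_alt = {**metadata, **embedding_metadata}

print(broken)
print(merged)
```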
This patch mitigates template injection vulnerabilities in GitHub
Workflows by replacing direct references with an environment variable.
Aikido used AI to generate this PR.
High confidence: Aikido has a robust set of benchmarks for similar
fixes, and they are proven to be effective.
Co-authored-by: aikido-autofix[bot] <119856028+aikido-autofix[bot]@users.noreply.github.com>
Replace ~30 production `lock().unwrap()` calls that would cascade-panic
on a poisoned Mutex. Functions returning `Result` now propagate the
poison as an error via `?` (leveraging the existing `From<PoisonError>`
impl). Functions without a `Result` return recover via
`unwrap_or_else(|e| e.into_inner())`, which is safe because the guarded
data (counters, caches, RNG state) remains logically valid after a
panic.
## Summary
Adds progress reporting for `table.add()` so users can track large write
operations. The progress callback is available in Rust, Python (sync and
async), and through the PyO3 bindings.
### Usage
Pass `progress=True` to get an automatic tqdm bar:
```python
table.add(data, progress=True)
# 100%|██████████| 1000000/1000000 [00:12<00:00, 82345 rows/s, 45.2 MB/s | 4/4 workers]
```
Or pass a tqdm bar for more control:
```python
from tqdm import tqdm

with tqdm(unit=" rows") as pbar:
    table.add(data, progress=pbar)
```
Or use a callback for custom progress handling:
```python
def on_progress(p):
    print(f"{p['output_rows']}/{p['total_rows']} rows, "
          f"{p['active_tasks']}/{p['total_tasks']} workers, "
          f"done={p['done']}")

table.add(data, progress=on_progress)
```
In Rust:
```rust
table.add(data)
    .progress(|p| println!("{}/{:?} rows", p.output_rows(), p.total_rows()))
    .execute()
    .await?;
```
### Details
- `WriteProgress` struct in Rust with getters for `elapsed`,
`output_rows`, `output_bytes`, `total_rows`, `active_tasks`,
`total_tasks`, and `done`. Fields are private behind getters so new
fields can be added without breaking changes.
- `WriteProgressTracker` tracks progress across parallel write tasks
using a mutex for row/byte counts and atomics for active task counts.
- Active task tracking uses an RAII guard pattern (`ActiveTaskGuard`)
that increments on creation and decrements on drop.
- For remote writes, `output_bytes` reflects IPC wire bytes rather than
in-memory Arrow size. For local writes it uses in-memory Arrow size as a
proxy (see TODO below).
- tqdm postfix displays throughput (MB/s) and worker utilization
(active/total).
- The `done` callback always fires, even on error (via `FinishOnDrop`),
so progress bars are always finalized.
### TODO
- Track actual bytes written to disk for local tables. This requires
Lance to expose a progress callback from its write path. See
lance-format/lance#6247.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Lance v4.1.0-beta requires the default-https-client feature on
aws-sdk-dynamodb and aws-sdk-s3, which was introduced in the March
2025 AWS SDK release. Update all AWS SDK pins to versions from the
same AWS SDK release to maintain internal dependency compatibility.
Co-authored-by: Esteban Gutierrez <esteban@lancedb.com>
Similar to https://github.com/lancedb/lancedb/pull/3062, we can write in
parallel to remote tables if the input data source is large enough.
We take advantage of new endpoints coming in server version 0.4.0, which
allow writing data in multiple requests and then committing at the end
in a single request.
To make testing easier, I also introduce a `write_parallelism`
parameter. In the future, we can expose that in Python and NodeJS so
users can manually specify the parallelism they get.
Closes #2861
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Problem
The generated Python API docs for
`lancedb.table.IndexStatistics.index_type` were misleading because
mkdocstrings renders that field’s type annotation directly, and the
existing `Literal[...]` listed only a subset of the actual canonical SDK
index type strings.
Current (missing index types):
<img width="823" height="83" alt="image"
src="https://github.com/user-attachments/assets/f6f29fe3-4c16-4d00-a4e9-28a7cd6e19ec"
/>
## Fix
- Update the `IndexStatistics.index_type` annotation in
`python/python/lancedb/table.py` to include the full supported set of
canonical values, so the generated docs show all valid index_type
strings inline.
- Add a small regression test in `python/python/tests/test_index.py` to
ensure the docs-facing annotation does not drift silently again in case
we add a new index/quantization type in the future.
- Bumps mkdocs and material theme versions to mkdocs 1.6 to allow access
to more features like hooks
After fix (all index types are included and tested for in the
annotations):
<img width="1017" height="93" alt="image"
src="https://github.com/user-attachments/assets/66c74d5c-34b3-4b44-8173-3ee23e3648ac"
/>
When Lance 3.0.0 was released, the check_lance_release.py script did not
make a PR for it because it was a pre-release. This change may not be
perfect, but it always ranks stable releases above non-stable releases.
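One way to express that ranking is a sort key that puts stability first. This is a sketch under assumptions: `release_sort_key` is a hypothetical function, the tag format is assumed to be `MAJOR.MINOR.PATCH[-PRERELEASE]`, and the real script may compare versions differently.

```python
import re


def release_sort_key(tag):
    """Hypothetical key: any stable release outranks any pre-release."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:-(.+))?", tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag}")
    major, minor, patch, prerelease = match.groups()
    is_stable = prerelease is None
    return (is_stable, int(major), int(minor), int(patch))


tags = ["3.0.0-rc.3", "2.0.1", "3.0.0", "3.1.0-beta.1"]
print(sorted(tags, key=release_sort_key))
print(max(tags, key=release_sort_key))
```

Note the trade-off in this sketch: `3.1.0-beta.1` ranks below `2.0.1`, which is the kind of imperfection the description alludes to.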
When using hybrid search with a where filter, the prefilter argument is
silently inverted. Passing prefilter=True actually performs
post-filtering, and prefilter=False actually performs pre-filtering.
## Summary
- Update all 14 lance crates from `3.0.0-rc.3` (git source) to `3.0.0`
(crates.io release)
- Remove git/tag source references since 3.0.0 is published on crates.io
## Test plan
- [x] `cargo check --features remote --tests --examples` passes
- [x] `cargo clippy --features remote --tests --examples` passes
- [ ] CI passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Upgrade LocalStack from 3.3 to 4.0 in `docker-compose.yml` to fix S3
integration test failures in CI
- Version 3.3 has compatibility issues with newer Python 3.13 and
updated boto3 dependencies
- Matches the LocalStack version used successfully in the lance
repository
## Test plan
- [ ] Verify `docker compose up --detach --wait` completes successfully
in CI
- [ ] All tests in `test_s3.py` pass (5 tests)
- [ ] All `@pytest.mark.s3_test` tests in
`test_namespace_integration.py` pass (7 tests)
- [ ] No regressions in non-integration test jobs (Mac, Windows)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Move away from buildjet, which is shutting down runners for GHA [^1]
* Add `Cargo.lock` to build jobs, so when we upgrade locked dependencies
we check the builds actually pass. CI started failing because
dependencies were changed in #3116 without running all build jobs.
* Add fixes for aws-lc-rs build in NodeJS.
[^1]: https://buildjet.com/for-github-actions/blog/we-are-shutting-down
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add support for passing field/data type information into add_columns()
method, bringing parity with Python bindings. The method now accepts:
- AddColumnsSql[] - SQL expressions (existing functionality)
- Field - single Arrow field with explicit data type
- Field[] - array of Arrow fields with explicit data types
- Schema - Arrow schema with explicit data types
New columns added via Field/Schema are initialized with null values. All
field-based columns must be nullable due to null initialization.
Resolves #3107
---------
Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- Removes the "Experimental API" section from `optimize` method
documentation across Rust, Python, and TypeScript
- Adds a warning to `delete_unverified` documentation in all bindings:
this should only be set to true if you can guarantee no other process is
working on the dataset, otherwise it could be corrupted
- Fixes a typo ("shoudl" → "should")
Closes #3125
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Implement `RemoteTable.prewarm_data(columns)` calling `POST
/v1/table/{id}/page_cache/prewarm/`
- Implement `RemoteTable.prewarm_index(name)` calling `POST
/v1/table/{id}/index/{name}/prewarm/` (previously returned
`NotSupported`)
- Add `BaseTable::prewarm_data(columns)` trait method and `Table` public
API in Rust core
- Add PyO3 bindings and Python API (`AsyncTable`, `LanceTable`,
`RemoteTable`) for `prewarm_data`
- Add type stubs for `prewarm_index` and `prewarm_data` in
`_lancedb.pyi`
- Upgrade Lance to 3.0.0-rc.3 with breaking change fixes
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Update dependencies across Rust, Python, Node.js, Java, Docker, and
docs
- Pin unpinned dependency lower bounds to prevent silent downgrades
- Bump CI actions to current major versions
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
When we create tables without using Arrow by parsing JS records we
always infer to float64. Many times embeddings are not float64 and it
would be nice to be able to use the native type without requiring users
to pull in Arrow. We can utilize JS's builtin Float32Array to do this.
This PR also adds support for UInt8/16/32 and Int8/16/32 arrays as well.
Closes #3115
Without this fix, if a user works with the native table directly (for
example, calling `add_columns`), the namespace is not actually
propagated, even when the table is configured to use a namespace db
connection.
The fix brings lancedb's Python binding up to date with an
implementation similar to https://github.com/lance-format/lance/pull/5968
and makes sure the namespace is fully propagated through all the related
calls.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
When we write data with `add()`, we cast the input data to the table's
schema. However, we were using "safe" mode, which turns cast errors into
nulls. For example, if you pass `u64::max` into a field that is a `u32`,
it would just write null instead of giving an overflow error. Now the
overflow error is propagated. This is the same behavior as other systems
like DuckDB.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Prior to this commit we supported passing the azure storage account name
to the lancedb remote SDK through headers. This adds support for client
ID and tenant ID as well.
This PR migrates all Rust crates in the workspace to Rust 2024 edition
and addresses the resulting compatibility updates. It also fixes all
clippy warnings surfaced by the workspace checks so the codebase remains
warning-free under the current lint configuration.
Context:
- Scope: workspace edition bump (`2021` -> `2024`) plus follow-up
refactors required by new edition and clippy rules.
- Validation: `cargo fmt --all` and `cargo clippy --quiet --features
remote --tests --examples -- -D warnings` both pass.
## Summary
- Add `@value_to_sql.register(dict)` handler that converts Python dicts
to DataFusion's `named_struct()` SQL syntax
- Enables updating struct-typed columns via `table.update(values={"col":
{"field_a": 1, "field_b": "hello"}})`
- Recursively handles nested structs, lists, nulls, and all existing
scalar types
Closes #1363
## Details
The `named_struct` function was introduced in DataFusion 38 and is now
available (LanceDB uses DataFusion 52.1). The implementation follows the
existing `singledispatch` pattern in `util.py`.
**Example conversion:**
```python
value_to_sql({"field_a": 1, "field_b": "hello"})
# => "named_struct('field_a', 1, 'field_b', 'hello')"
```
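To show how a recursive `singledispatch` handler produces that output, here is a self-contained sketch. `to_sql` is a hypothetical stand-in, not the `value_to_sql` in `util.py`; in particular, the list and string rendering are assumptions made for illustration.

```python
from functools import singledispatch


@singledispatch
def to_sql(value):
    raise NotImplementedError(f"unsupported type: {type(value)}")


@to_sql.register(type(None))
def _(value):
    return "NULL"


@to_sql.register(bool)
def _(value):
    return "TRUE" if value else "FALSE"


@to_sql.register(int)
def _(value):
    return str(value)


@to_sql.register(str)
def _(value):
    # Escape single quotes by doubling them.
    return "'" + value.replace("'", "''") + "'"


@to_sql.register(list)
def _(value):
    # Assumed list syntax for this sketch.
    return "[" + ", ".join(to_sql(item) for item in value) + "]"


@to_sql.register(dict)
def _(value):
    # named_struct takes alternating field-name / field-value arguments.
    args = []
    for name, field_value in value.items():
        args.append(to_sql(str(name)))
        args.append(to_sql(field_value))
    return "named_struct(" + ", ".join(args) + ")"


print(to_sql({"field_a": 1, "field_b": "hello"}))
print(to_sql({"outer": {"inner": None, "tags": [1, 2]}}))
```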
## Test plan
- [x] Unit tests for flat struct, nested struct, list inside struct,
mixed types, null values, and empty dict
- [ ] CI integration tests with actual table.update() on struct columns
🔗 [DataFusion named_struct
docs](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#named-struct)
This PR fixes the npm publish dry-run failure for prerelease versions
without changing the existing workflow trigger behavior. The publish
step now detects prerelease versions from `nodejs/package.json` and
always appends `--tag preview` when needed.
Context:
- On `main` pushes, the workflow still runs `npm publish --dry-run` by
design.
- Recent failures were caused by prerelease versions (for example
`0.27.0-beta.3`) running without `--tag`, which npm rejects.
- The previous `refs/tags/v...-beta...` check did not apply on branch
pushes, so dry-run could fail even though release tags worked.
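The detection logic is roughly the following. The real publish step is a shell snippet in the workflow; this Python version is only an illustration, and the function name `npm_publish_args` is hypothetical.

```python
import json


def npm_publish_args(package_json_text: str, dry_run: bool = False) -> list[str]:
    """Build an npm publish command line, tagging prereleases as `preview`."""
    version = json.loads(package_json_text)["version"]
    args = ["npm", "publish"]
    if dry_run:
        args.append("--dry-run")
    # In semver, a prerelease suffix follows a hyphen after the core version.
    # Build metadata (after "+") is stripped first so it cannot confuse the check.
    core_and_prerelease = version.split("+", 1)[0]
    if "-" in core_and_prerelease:
        args += ["--tag", "preview"]
    return args
```

The key point is that the decision depends only on the version in `package.json`, not on the git ref that triggered the workflow.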
We don't necessarily need to do this, but one user was confused: they had
used `fast_search=True` as a keyword argument for vector searches but
could not do the same for FTS, even after the most recent changes. I
think this is the only remaining discrepancy in where that is possible.
## Summary
Adds a Rust expression builder API as a type-safe alternative to SQL
strings for query filters.
## Motivation
Filtering with raw SQL strings can be awkward when using variables and
special types:
Closes #3038
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
When input data is sufficiently large, we automatically split up into
parallel writes using a round-robin exchange operator. We sample the
first batch to determine data width, and target size of 1 million rows
or 2GB, whichever is smaller.
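As a rough sketch of the sizing rule (not the actual Rust implementation): the helper name, the `max_partitions` cap, and the reading of 2GB as 2 GiB are assumptions for illustration.

```python
import math

TARGET_ROWS = 1_000_000
TARGET_BYTES = 2 * 1024**3  # assumes "2GB" means 2 GiB


def plan_write_partitions(total_rows, sample_rows, sample_bytes, max_partitions=8):
    """Estimate how many parallel writers to use from a sampled first batch."""
    if total_rows <= 0 or sample_rows <= 0:
        return 1
    # Row width estimated from the sampled batch.
    bytes_per_row = max(1, sample_bytes // sample_rows)
    # Each writer targets 1M rows or 2GB, whichever is reached first.
    rows_per_partition = max(1, min(TARGET_ROWS, TARGET_BYTES // bytes_per_row))
    wanted = math.ceil(total_rows / rows_per_partition)
    return max(1, min(max_partitions, wanted))
```

Narrow rows hit the row target first, while wide rows (for example large vectors or blobs) hit the byte target first and get split into more partitions for the same row count.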
This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is it allows streaming requests to remote tables,
and at the same time allowing retries for most inputs.
In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.
For Python, all are retry-able, except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.datasets.Dataset` are retry-able *and* streaming.
A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:
* casting input data to target schema
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.
In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Upgrades `@napi-rs/cli` from v2 to v3, `napi`/`napi-derive` Rust
crates to 3.x
- Fixes a bug
([napi-rs#1170](https://github.com/napi-rs/napi-rs/issues/1170)) where
the CLI failed to locate the built `.node` binary when a custom Cargo
target directory is set (via `config.toml`)
## Changes
**package.json / CLI**:
- `napi.name` → `napi.binaryName`, `napi.triples` → `napi.targets`
- Removed `--no-const-enum` flag and fixed output dir arg
- `napi universal` → `napi universalize`
**Rust API migration**:
- `#[napi::module_init]` → `#[napi_derive::module_init]`
- `napi::JsObject` → `Object`, `.get::<_, T>()` → `.get::<T>()`
- `ErrorStrategy` removed; `ThreadsafeFunction` now takes an explicit
`Return` type with `CalleeHandled = false` const generic
- `JsFunction` + `create_threadsafe_function` replaced by typed
`Function<Args, Return>` + `build_threadsafe_function().build()`
- `RerankerCallbacks` struct removed (`Function<'env,...>` can't be
stored in structs); `VectorQuery::rerank` now accepts the function
directly
- `ClassInstance::clone()` now returns `ClassInstance`, fixed with
explicit deref
- `Vec<u8>` in `#[napi(object)]` now maps to `Array<number>` in v3;
changed to `Buffer` to preserve the TypeScript `Buffer` type
**TypeScript**:
- `inner.rerank({ rerankHybrid: async (_, args) => ... })` →
`inner.rerank(async (args) => ...)`
- Header provider callback wrapped in `async` to match stricter typed
constructor signature
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
`DatasetConsistencyWrapper::update()` only stored datasets with a
strictly newer
version. This caused `migrate_manifest_paths_v2` to silently drop its
update since
the migration renames files without bumping the dataset version. The
subsequent
`uses_v2_manifest_paths()` call would then return the stale cached
dataset.
Changed the version check from `>` to `>=` so same-version updates are
accepted.
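A minimal model of the behavior change, using hypothetical `Snapshot` and `ConsistencyCache` classes rather than the real Rust types:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    version: int
    manifest_scheme: str  # e.g. "v1" or "v2"; stands in for dataset state


class ConsistencyCache:
    """Holds the latest known snapshot; `update` decides whether to replace it."""

    def __init__(self, snapshot, accept_same_version=True):
        self._snapshot = snapshot
        self._accept_same_version = accept_same_version

    def get(self):
        return self._snapshot

    def update(self, candidate):
        current = self._snapshot.version
        newer = candidate.version > current
        same = candidate.version == current
        # Old rule: only `newer`. New rule: `newer or same`, so a rewrite that
        # keeps the version number (like the manifest path migration) is stored.
        if newer or (same and self._accept_same_version):
            self._snapshot = candidate
```

With `accept_same_version=False` the sketch reproduces the old behavior, where a same-version update is silently dropped and readers keep seeing the stale snapshot.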
## Test plan
- [x] Existing `test_create_table_v2_manifest_paths_async` Python test
should pass
- [x] Existing `should be able to migrate tables to the V2 manifest
paths` NodeJS test should pass
- [x] All dataset wrapper unit tests pass locally
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This updates `DatasetConsistencyWrapper` to block less:
1. `DatasetConsistencyWrapper::get()` just returns `Arc<Dataset>` now,
instead of a guard that blocks writes.
`DatasetConsistencyWrapper::get_mut()` is gone; now write methods just
use `get()` and then later call `update()` with the new version. This
means a given table handle can do concurrent reads **and** writes.
2. In weak consistency mode, it now checks for dataset updates in the
background, instead of blocking calls to `get()`.
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
## Summary
Fixes #1679
This PR prevents the OpenAI embedding function from retrying when
receiving a 401 Unauthorized error. Authentication errors are permanent
failures that won't be fixed by retrying, yet the current implementation
retries all exceptions up to 7 times by default.
## Changes
- Modified `retry_with_exponential_backoff` in `utils.py` to check for
non-retryable errors before retrying
- Added `_is_non_retryable_error` helper function that detects:
- Exceptions with name `AuthenticationError` (OpenAI's 401 error)
- Exceptions with `status_code` attribute of 401 or 403
- Enhanced OpenAI embeddings to explicitly catch and re-raise
`AuthenticationError` with better logging
- Added unit test `test_openai_no_retry_on_401` to verify authentication
errors don't trigger retries
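The shape of the check is roughly as follows. This is a simplified sketch, not the code in `utils.py`: the function names, the doubling backoff, and the injectable `sleep` parameter are assumptions.

```python
import time


def is_non_retryable_error(exc: BaseException) -> bool:
    """Return True for errors that retrying cannot fix."""
    # Match by class name so the check works without importing the OpenAI SDK.
    if type(exc).__name__ == "AuthenticationError":
        return True
    return getattr(exc, "status_code", None) in (401, 403)


def retry_with_backoff(func, max_retries=7, initial_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying transient failures with exponential backoff."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            # Permanent failures and the final attempt are re-raised immediately.
            if is_non_retryable_error(exc) or attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter makes it easy to assert in a test that no delay happens for authentication errors.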
## Test Plan
- Added test that verifies:
1. A function raising `AuthenticationError` is only called once
2. No retry delays occur (sleep is never called)
- Existing tests continue to pass
- Formatting applied via `make format`
## Example Behavior
**Before**: With an invalid API key, users would see 7 retry attempts
over ~2 minutes:
```
WARNING:root:Error occurred: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}}
Retrying in 3.97 seconds (retry 1 of 7)
WARNING:root:Error occurred: Error code: 401...
Retrying in 7.94 seconds (retry 2 of 7)
...
```
**After**: With an invalid API key, the error is raised immediately:
```
ERROR:root:Authentication failed: Invalid API key provided
AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}}
```
This provides better UX and prevents unnecessary API calls that would
fail anyway.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
There are old and outdated files in our embedding registry that can
confuse coding agents. This PR deprecates the following files that have
newer, more modern methods to generate such embeddings.
- Deprecate `embeddings/siglip.py`
- Deprecate `embeddings/gte.py`
## Why this change?
Per a discussion with @AyushExel, the [embedding registry directory
](1840aa7edc/python/python/lancedb/embeddings)
in the LanceDB repo has a number of outdated files that need to be
deprecated.
See https://github.com/lancedb/docs/issues/85 for the docs gaps that
identified this.
- Add note in `openclip` docs that it can be used for SigLip embeddings,
which it now supports
- Add note in the `sentence-transformers` page that ALL text embedding
models on Hugging Face can be used
There were two issues:
1. The python code needs to get access to the underlying rust table to
setup the permutation reader and the attributes involved in this differ
between the python local table and remote table objects.
~~2. The remote table was sending projection dictionaries as arrays of
tuples and (on LanceDB cloud at least) it does not appear this is how
rest servers are setup to receive them.~~ (this is now fixed as #3023)
~~Leaving as draft as this is built on
https://github.com/lancedb/lancedb/pull/3016~~
## Problem
When applying hard filters that result in zero matches, hybrid search
crashes with `IndexError: list index out of range` during reranking.
This happens because empty result tables are passed through the full
reranker pipeline, which expects at least one result.
Traceback from the issue:
```
lancedb/query.py: in _combine_hybrid_results
results = reranker.rerank_hybrid(fts_query, vector_results, fts_results)
lancedb/rerankers/answerdotai.py: in rerank_hybrid
combined_results = self._rerank(combined_results, query)
...
IndexError: list index out of range
```
## Fix
Added an early return in `_combine_hybrid_results` when both vector and
FTS results are empty. Instead of passing empty tables through
normalization, reranking, and score restoration (which can fail in
various ways), we now build a properly-typed empty result table with the
`_relevance_score` column and return it directly.
## Test
Added `test_empty_hybrid_result_reranker` that exercises
`_combine_hybrid_results` directly with empty vector and FTS tables,
verifying:
- Returns empty table with correct schema
- Includes `_relevance_score` column
- Respects `with_row_ids` flag
Closes #2425
Completes the **merge_insert.rs** checklist item from #2949.
## Changes
- Moved `MergeResult` struct from `table.rs` to `table/merge.rs`
- Moved the `NativeTable::merge_insert` implementation into
`merge::execute_merge_insert()`, with the trait impl now delegating to
it (same pattern as `delete.rs`)
- Moved `test_merge_insert` and `test_merge_insert_use_index` tests into
`table/merge.rs`
- Improved moved tests to use `memory://` URIs instead of temporary
directories
- Cleaned up unused imports from `table.rs` (`FutureExt`,
`TryFutureExt`, `Either`, `WhenMatched`, `WhenNotMatchedBySource`,
`LanceMergeInsertBuilder`)
- `MergeResult` is re-exported from `table.rs` so the public API is
unchanged
## Testing
`cargo build -p lancedb` compiles cleanly with no warnings.
References #2949. Moved query logic and helpers from table.rs to
query.rs. Refactored tests using guidelines and added coverage for multi
vector plan structure.
When a table has a read consistency interval, queries within the
interval skip the version check. Once the interval expires, a list call
checks for new versions. If the version hasn't changed, the timer should
reset so the next interval begins, but it didn't. The timer stayed
expired, so every query after that triggered a list call, even though
nothing changed.
This affects all read operations (queries, schema lookups, searches) on
tables with read_consistency_interval set. Each operation adds a
list("_versions/") call to object storage, adding latency proportional
to the store's list performance. For high-QPS workloads, this can
saturate object store list throughput and significantly degrade query
latency.
Bug flow:
1. Every read operation (query, schema, search) calls
ensure_up_to_date()
2. ensure_up_to_date() calls is_up_to_date(), which compares
last_consistency_check.elapsed() against
read_consistency_interval
3. If the interval has expired, it calls reload()
4. reload() calls need_reload(), which calls latest_version_id() — this
is the list IOP
(list("_versions/"))
5. If no new version, reload() returns early without resetting
last_consistency_check
6. On the next query, step 2 sees the stale timer again → step 3 → step
4 → another list IOP
7. This repeats on every query forever
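The fix amounts to resetting the timer whether or not a new version was found. A small Python model of the flow, with a hypothetical `VersionedHandle` class and an injected clock standing in for the Rust code:

```python
class VersionedHandle:
    """Models a table handle with a read consistency interval."""

    def __init__(self, interval, clock, list_latest_version):
        self.interval = interval
        self.clock = clock
        self.list_latest_version = list_latest_version  # the expensive list call
        self.version = list_latest_version()
        self.last_check = clock()

    def ensure_up_to_date(self):
        # Within the interval: skip the version check entirely.
        if self.clock() - self.last_check < self.interval:
            return
        latest = self.list_latest_version()
        if latest != self.version:
            self.version = latest
        # The fix: restart the interval even when nothing changed. Before the
        # fix, this reset only happened when a new version was loaded.
        self.last_check = self.clock()
```

With the reset in place, an unchanged table costs one list call per interval instead of one per read.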
Caches the schema of remote tables and invalidates the cache:
1. After a 30-second TTL
2. When we do an operation that changes the schema (e.g. add_columns) or
checks out a different version (e.g. checkout_version)
3. When we get a 400, 404, or 500 response
If the schema is retrieved close to the TTL, we optimistically fetch the
schema in the background. This means a continuous stream of queries will
never have the schema fetch on the critical path.
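The caching policy can be sketched as below. This is a Python model, not the Rust implementation: the `SchemaCache` class, the 5-second `refresh_window`, and the injectable `clock` and `spawn` hooks are assumptions; only the 30-second TTL comes from this PR.

```python
import threading
import time


class SchemaCache:
    """TTL cache that refreshes in the background shortly before expiry."""

    def __init__(self, fetch, ttl=30.0, refresh_window=5.0,
                 clock=time.monotonic, spawn=None):
        self._fetch = fetch
        self._ttl = ttl
        self._refresh_window = refresh_window
        self._clock = clock
        # By default, background work runs on a daemon thread.
        self._spawn = spawn or (
            lambda fn: threading.Thread(target=fn, daemon=True).start()
        )
        self._lock = threading.Lock()
        self._value = None
        self._fetched_at = None
        self._refreshing = False

    def invalidate(self):
        """Drop the cached schema, e.g. after add_columns or an error response."""
        with self._lock:
            self._value = None
            self._fetched_at = None

    def _store(self, value):
        with self._lock:
            self._value = value
            self._fetched_at = self._clock()
            self._refreshing = False

    def _refresh(self):
        try:
            value = self._fetch()
        except Exception:
            # Allow a later call to try again.
            with self._lock:
                self._refreshing = False
            return
        self._store(value)

    def get(self):
        with self._lock:
            age = None if self._fetched_at is None else self._clock() - self._fetched_at
            fresh = age is not None and age < self._ttl
            value = self._value
            near_expiry = fresh and age >= self._ttl - self._refresh_window
            start_refresh = near_expiry and not self._refreshing
            if start_refresh:
                self._refreshing = True
        if not fresh:
            # Cache miss or expired entry: fetch on the calling thread.
            value = self._fetch()
            self._store(value)
            return value
        if start_refresh:
            # Still fresh but close to expiry: refresh off the critical path.
            self._spawn(self._refresh)
        return value
```

A caller that keeps querying sees only cached values after the first fetch, because each entry is replaced before it expires.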
Closes #3014
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
BREAKING CHANGE: Arbitrary `impl RecordBatchReader` is no longer
accepted; it must be made into `Box<dyn RecordBatchReader>`.
This PR replaces `IntoArrow` with a new trait `Scannable` to define
input row data. This provides the following advantages:
1. **We can implement `Scannable` for more types than `IntoArrow`, such
as `RecordBatch` and `Vec<RecordBatch>`.** The `IntoArrow` trait was
implemented for arbitrary `T: RecordBatchReader`, and the Rust compiler
would prevent us from implementing it for foreign types like
`RecordBatch` because (theoretically) those types might implement
`RecordBatchReader` in the future. That's why we implement `Scannable`
for `Box<dyn RecordBatchReader>` instead; since it's a concrete type it
doesn't block implementing for other foreign types.
2. **We can potentially replay `Scannable` values**. Previously, we had
to choose between buffering all data in memory and supporting retries of
writes. But because `Scannable` things can optionally support
re-scanning, we now have a way of supporting retries while also
streaming.
3. **`Scannable` can provide hints like `num_rows`, which can be used to
schedule parallel writers.** Without knowing the total number of rows,
it's difficult to know whether it's worth writing multiple files in
parallel.
We don't yet fully take advantage of (2) and (3), but will in future
PRs. For (2), in order to be ready to leverage this, we need to hook the
`Scannable` implementation up to Python and NodeJS bindings. Right now
they always pass down a stream, but we want to make sure they support
retries when possible. And for (3), this will need to be hooked up to
#2939 and to a pipeline for running pre-processing steps (like embedding
generation).
## Other changes
* Moved `create_table` and `add_data` into their own modules. I've
created a follow up issue to split up `table.rs` further, as it's by far
the largest file: https://github.com/lancedb/lancedb/issues/2949
* Eliminated the `HAS_DATA` generic for `CreateTableBuilder`. I didn't
see any public-facing places where we differentiated methods, which is
why I felt this simplification was okay.
* Added an `Error::External` variant and integrated some conversions to
allow certain errors to pass through transparently. This will fully work
once we upgrade Lance and get to take advantage of changes in
https://github.com/lance-format/lance/pull/5606
* Added LZ4 compression support for write requests to remote endpoints.
I checked and this has been supported on the server for > 1 year.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This changes around the output format of `Permutation` in some breaking
ways but I think the API is still new enough to be considered
experimental.
1. In order to align with both huggingface's dataset and torch's
expectations the default output format is now a list of dicts
(row-major) instead of a dict of lists (column-major). I've added a
python_col option which will return the dict of lists.
2. In order to align with pytorch's expectation the `torch` format is
now a list of tensors (row-major) instead of a 2D tensor (column-major).
I've added a torch_col option which will return the 2D tensor instead.
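To make the two layouts concrete, here is a plain-Python illustration of row-major versus column-major output. The helper names are hypothetical and this is not how `Permutation` builds its batches.

```python
def columns_to_rows(columns: dict) -> list:
    """Column-major dict of lists -> row-major list of dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def rows_to_columns(rows: list) -> dict:
    """Row-major list of dicts -> column-major dict of lists."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}
```

The row-major form is what a dataset that yields one item per row naturally produces, while the column-major form keeps each field contiguous.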
Added tests for torch integration with Permutation
~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013
merges as this is built on top of that~~
Closes #3000
The hybrid search `explain_plan` now shows the reranker as the top-level
node with
the vector and FTS sub-plans indented underneath, instead of just
listing them
separately with no reranker context.
**Before:**
```
Vector Search Plan:
ProjectionExec: ...
FTS Search Plan:
ProjectionExec: ...
```
**After:**
```
RRFReranker(K=60)
Vector Search Plan:
ProjectionExec: ...
FTS Search Plan:
ProjectionExec: ...
```
Other rerankers display similarly, e.g.
`LinearCombinationReranker(weight=0.7, fill=1.0)`,
`MRRReranker(weight_vector=0.5, weight_fts=0.5)`,
`CohereReranker(model_name=name)`.
---------
Signed-off-by: dask-58 <googldhruv@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Fixes #2999
The error message previously said `"field_names must be a string when
use_tantivy=False"`, implying that users should switch to the
to-be-deprecated tantivy backend (#2998).
Updated the error message and docstring to instead guide users to create
a separate FTS index for each field.
Signed-off-by: dask-58 <googldhruv@gmail.com>
## Summary
Continues the modularization effort of table operations as outlined in
#2949.
- Extracts optimization operations (`OptimizeAction`, `OptimizeStats`,
`execute_optimize`, `compact_files_impl`, `cleanup_old_versions`,
`optimize_indices`) from
`table.rs` into `table/optimize.rs`
- Public API remains unchanged via re-exports
- Adds comprehensive tests including error cases with message assertions
## Test plan
- [x] All new optimization tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## Summary
- Added `repository` field to all nodejs package.json files (main
package + 7 platform-specific packages)
- This fixes the npm publish E422 error where sigstore provenance
verification fails because the repository.url was empty
## Root Cause
Failing CI:
https://github.com/lancedb/lancedb/actions/runs/21770794768/job/62821570260
npm's sigstore provenance verification requires the `repository.url`
field in package.json to match the GitHub repository URL from the
provenance bundle. The platform-specific packages
(`@lancedb/lancedb-darwin-arm64`, etc.) were missing this field
entirely, causing the publish to fail with:
```
npm error 422 Unprocessable Entity - Error verifying sigstore provenance bundle:
Failed to validate repository information: package.json: "repository.url" is "",
expected to match "https://github.com/lancedb/lancedb" from provenance
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Continues the modularization effort of schema evolution operations as
outlined in #2949
## Summary
- Extracts schema evolution operations (add_columns, alter_columns,
drop_columns) from `table.rs` into `table/schema_evolution.rs`
- Public API remains unchanged via re-exports
## Test plan
- [x] All new schema evolution tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes
## Summary
- Codex CLI v0.95.0 ([PR
#10258](https://github.com/openai/codex/pull/10258)) hardened git
command safety so force push (`git push -f`, `--force`,
`--force-with-lease`, `+refspec`) now requires approval, which blocks it
in non-interactive `exec` mode.
- This broke the
[codex-update-lance-dependency](https://github.com/lancedb/lancedb/actions/runs/21727536000/job/62673436482)
workflow — the job succeeded but failed to push the branch or create the
PR.
- Replace force push with `gh api` branch deletion followed by regular
`git push`.
- Also update the script to bump Java lance-core version which was
missing previously
## Test plan
- [x] Re-run the `Codex Update Lance Dependency` workflow with a test
tag to verify the push and PR creation succeed:
https://github.com/lancedb/lancedb/pull/2983
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Regenerate TypeScript docs to include the new initialStorageOptions()
and latestStorageOptions() methods added in #2966.
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
The `report-failure` jobs in npm, cargo, and pypi publish workflows
checked for
`release` or `workflow_dispatch` events, but these workflows are
triggered by tag
pushes where `github.event_name` is `push`. The condition was never
true, so failure
notifications were silently skipped.
- Use `startsWith(github.ref, 'refs/tags/...')` to match actual tag
triggers
- Add `failure()` to only notify on actual failures
This matches the pattern already used by `java-publish.yml`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Expose `initial_storage_options()` and `latest_storage_options()` in
lance Dataset, in lancedb rust, python and typescript SDKs.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Implements `InsertExec` and `RemoteInsertExec` to support running
inserts in DataFusion.
## Context
In https://github.com/lancedb/lancedb/pull/2929, I've prototyped moving
the insert pipeline into DataFusion. This will enable parallelism at two
levels:
1. Running preprocessing, such as casting the input schema or computing
embeddings
2. Writing out files
This PR is just the first part of running the actual writes. In the end,
the plans might look like:
```
InsertExec
RepartitionExec num_partitions=<write_parallelism>
ProjectionExec vector=compute_embedding()
RepartitionExec num_partitions=<num_cpus>
DataSourceExec
```
where `num_cpus` is used to take advantage of all cores, while
`write_parallelism` might be less than `num_cpus` if there are too few
rows to want to split writes across `num_cpus` files.
Later PRs will move the preprocessing steps into DataFusion, and then
hook this up to the `Table::add()` implementations.
## Relation to future SQL work
We eventually plan on having the Remote SDK go through a FlightSQL
endpoint. Then for most queries we will send just the SQL string to the
server, and not run any sort of DataFusion plan on the client.
However, I think writes will be a little special, especially bulk writes
where we need to upload large streams of data and likely want
parallelism. So we'll have different code paths for writes, and I think
using DataFusion makes sense, especially as long as we are doing the
pre-processing on the client side still.
## Summary
This PR changes takeRowIds to accept bigint[] instead of
number[], matching the type of _rowid returned by withRowId().
## Problem
When retrieving row IDs using `withRowId()` and querying them back with
`takeRowIds()`, users get an error because:
1. _rowid values are returned as JavaScript bigint
2. takeRowIds() expected number[]
3. NAPI failed to convert: Error: Failed to convert napi value BigInt
into rust type i64
## Reproduction
```js
import lancedb from '@lancedb/lancedb';
const db = await lancedb.connect('memory://');
const table = await db.createTable('test', [{ id: 1, vector: [1.0, 2.0] }]);
const results = await table.query().withRowId().toArray();
const rowIds = results.map(row => row._rowid);
console.log('types:', rowIds.map(id => typeof id)); // ['bigint']
await table.takeRowIds(rowIds).toArray(); // ❌ Error before fix
```
## Solution
- Updated TypeScript signature from takeRowIds(rowIds: number[]) to
takeRowIds(rowIds: bigint[])
- Updated Rust NAPI binding to accept Vec<BigInt> and convert using
get_u64()
Fixes #2722
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
References #2949. Part 2 of the table.rs refactor. Moved UpdateResult,
UpdateBuilder, and execution logic to src/table/update.rs. No functional
changes; the API remains identical.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Fixes #2612
This PR exposes the private _fast_search attribute via a public
fast_search() method in the synchronous LanceVectorQueryBuilder.
Previously, enabling fast search in the sync API required accessing a
private member (query._fast_search = True). This change aligns the
synchronous API with the Async and Remote APIs, allowing for cleaner,
more Pythonic method chaining.
Changes:
Added fast_search() method to LanceVectorQueryBuilder in
python/python/lancedb/query.py.
Added a unit test verifying the flag works with high-dimensional data
(2560 dims) and chaining.
Example Usage:
Before:
```
query = table.search(vector)
query._fast_search = True # Private attribute usage
results = query.limit(10).to_pandas()
```
After:
```
results = (
table.search(vector)
.fast_search()
.limit(10)
.to_pandas()
)
```
Verification:
I have added a test case (test_fast_search_high_dimension) that
replicates the scenario described in the issue (2560 dimensions, cosine
distance) to ensure the pipeline constructs the query correctly without
errors.
Checklist:
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
- [ ] Documentation has been updated (inline docstrings).
Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
## Summary
- PR #2957 changed the permutation builder to only select `_rowid` from
the base table, but `Splitter::project()` for hash and calculated splits
replaced the selection entirely, dropping `_rowid`.
- Include `_rowid` in the column selections for hash and calculated
split projections.
- Fix a Python test that queried the permutation table for base table
columns no longer materialized.
Fixes the `test_split_hash`, `test_split_hash_with_discard`,
`test_split_calculated`, `test_shuffle_combined_with_splits`, and
`test_filter_with_splits` failures in `test_permutation.py`.
## Test plan
- [x] `cargo test -p lancedb -- permutation` (22 passed)
- [x] `pytest python/tests/test_permutation.py` (46 passed)
- [x] `npm test __test__/permutation.test.ts` (20 passed)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
References #2949. Moved DeleteResult and delete() implementation to
src/table/delete.rs. No functional changes. Added a test for delete,
which passes. Will work on refactoring update next.
Fixes #2898
Problem:
Sync API cancellations didn’t stop remote query coroutines, so requests
could continue after interrupt.
Changes:
- Cancel run_coroutine_threadsafe futures on any BaseException in the
sync background loop
- Update cancellation test to avoid starting a real background thread
and cover GeneratorExit
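The cancellation pattern looks roughly like this. It is a minimal sketch with hypothetical helper names, not the exact code in the sync background loop.

```python
import asyncio


def wait_or_cancel(future):
    """Block on a concurrent.futures-style future; cancel it if the wait is interrupted."""
    try:
        return future.result()
    except BaseException:
        # BaseException covers KeyboardInterrupt and GeneratorExit, which
        # `except Exception` would miss. Cancelling an already finished
        # future is a harmless no-op.
        future.cancel()
        raise


def run_sync(coro, loop):
    """Run `coro` on a loop owned by another thread and wait for its result."""
    return wait_or_cancel(asyncio.run_coroutine_threadsafe(coro, loop))
```

Splitting out `wait_or_cancel` means the behavior can be tested with a fake future, without starting a real background thread.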
Fixes the Rust SDK's `create_empty_table` to properly support embedding
column definitions, bringing it to parity with the Python SDK.
## Problem
The Rust SDK's `Connection::create_empty_table` did not support setting
embedding columns. When using `.add_embedding()` on the builder, the
embedding column definitions were lost because
`TableDefinition::new_from_schema(schema)` marks all columns as physical
only, without embedding metadata.
The Python SDK worked around this by creating an empty record batch with
proper schema metadata rather than using `create_empty_table` directly.
## Solution
Modified `CreateTableBuilder<false>` to handle embeddings.
Closes #2759
Importing `PIL` alone does not guarantee that the `Image` submodule is
loaded. In a clean environment where no other code has imported
`PIL.Image` before, `PIL.Image` does not exist on the `PIL` package,
which leads to the AttributeError.
The permutation table was always intended to be a small table of row id
pointers (and split id). However, it was accidentally doing a full
materialization of the base table 🤦
This PR changes the permutation builder to only store row id and split
id.
Realized our MSRV check was inert because `rust-toolchain.toml` was
overriding the Rust version. We now set the `RUSTUP_TOOLCHAIN` environment
variable, which overrides that.
Also needed to update to MSRV 1.88 (due to dependencies like Lance and
DataFusion) and fix some clippy warnings.
Unlike in Amazon S3, in Azure bucket names are not globally unique.
Instead, the combination of (storage_account_name, bucket_name) is
unique.
Therefore, when using Azure blob store, we always need a way to
configure the storage account name. One way is to use the
storage_options hash map and set azure_storage_account_name. Another way
is to set an environment variable, AZURE_STORAGE_ACCOUNT_NAME.
Prior to this PR, the second way (environment variable) did not work
with remote connections. This is because the existing code that checks
for these environment variables happens inside the Azure object store
implementation itself, which does not run locally when using remote
connections.
This PR addresses that situation by adding a check of the environment
variable. This functions as a default if the relevant storage option is
not set in the storage_options hash map.
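The precedence is simple: an explicit storage option wins, and the environment variable is the fallback. A sketch of that rule in Python (the helper name is hypothetical; the real check lives in the Rust remote client):

```python
import os


def resolve_azure_account_name(storage_options, environ=os.environ):
    """Return the storage account name, preferring explicit options over the environment."""
    explicit = storage_options.get("azure_storage_account_name")
    if explicit:
        return explicit
    # Fall back to the environment variable; None if neither is set.
    return environ.get("AZURE_STORAGE_ACCOUNT_NAME")
```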
In pandas 3.0+, string columns now convert to Arrow large_utf8. This PR mainly
makes sure our test accounts for the difference across the pandas
versions when constructing schema.
BREAKING CHANGE: removes `aws`, `dynamodb`, `azure`, `gcs`, `oss`,
`huggingface` from default Rust features. They can be enabled by users
as needed.
They are still enabled for Python and NodeJS, since those users don't
control the compilation of artifacts.
Closes #2911
RemoteDBConnection should support passing exist_ok to create_table, just
like LanceDBConnection (the non-remote form) does. It can support this
by passing 'exist_ok' as the mode parameter.
Implement parallel execution of multiple embedding functions using
`std::thread::scope` to improve performance when a table has multiple
embedding columns.
Key changes:
- Add compute_embeddings_parallel() helper method to WithEmbeddings
- Use fast path for single embeddings (no threading overhead)
- Use scoped threads for parallel execution of multiple embeddings
- Add comprehensive tests including parallelization timing verification
- Update WithEmbeddings documentation
Performance improvements:
- I/O-bound embeddings (OpenAI, Bedrock): High benefit from concurrent
API calls
- CPU-bound embeddings (sentence-transformers): Medium benefit from core
utilization
- Single embedding: No overhead (fast path)
Closes TODO on line 266 in rust/lancedb/src/embeddings.rs
Aesthetic and styling fixes to the SDK reference docs:
- [x] Improve readability of LanceDB in the header
- [x] Make header more compact, and consistent in gradient color with
the main website/docs
- [x] Updated favicon to match with the docs page
- [x] Enable permalink display to allow users to get anchor links to
each function/method
- [x] Point readers to the main docs at
[docs.lancedb.com](https://docs.lancedb.com)
Convert test_table_names to test both remote and local connections.
This PR also includes some miscellaneous improvements in
src/test_utils/connection.rs. It starts a thread to drain stdout from
the server process. It adds the
PRINT_LANCEDB_TEST_CONNECTION_SCRIPT_OUTPUT environment variable, which
optionally displays server stdout.
Fix a bash conditional in run_with_test_connection.sh.
The page_token and limit parameters for table_names() are supported by
both local storage and LanceDB Cloud, not just Cloud as the docstring
incorrectly stated.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Adds `Table.to_lance()` and `Table.to_polars()` methods (non-abstract
methods, defaulting to `NotImplementedError`) so type checkers like
mypy, pyright and ty don’t flag them as unknown attributes on `Table`.
Not making these abstract methods should keep existing remote/other
`Table` implementations instantiable.
This is a non-breaking change to existing functionality and is purely for
the purpose of pleasing static type-checkers like mypy, ty and pyright.
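A reduced example of the pattern, using a stand-in `Table` class with one hypothetical abstract method (`count_rows`) rather than the real class definition:

```python
from abc import ABC, abstractmethod


class Table(ABC):
    @abstractmethod
    def count_rows(self) -> int:
        """Stand-in for the methods every table implementation must provide."""

    # Concrete (non-abstract) defaults: type checkers see the attribute, and
    # subclasses that don't override it can still be instantiated.
    def to_lance(self):
        raise NotImplementedError("to_lance is not supported by this table type")

    def to_polars(self):
        raise NotImplementedError("to_polars is not supported by this table type")


class RemoteTable(Table):
    def count_rows(self) -> int:
        return 0
```

Had `to_lance` been marked `@abstractmethod`, `RemoteTable()` would fail with a `TypeError` until every subclass implemented it.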
<img width="626" height="134" alt="image"
src="https://github.com/user-attachments/assets/f4619bca-a882-432b-bd23-ae8f189ff9e3"
/>
Issues found during integration tests:
1. describe_namespace should use POST
2. service needs to access the underlying namespace to be able to do
operations like create_empty_table directly, or get credentials in
isolated paths like a remote take
## Summary
- update all lance crates to v1.0.0 using the helper script (falls back
to the v1.0.0 tag)
- refresh Cargo.lock to pull the new release
- add script fallback to retry with the git tag when a crates.io release
is unavailable
## Testing
- cargo clippy --workspace --tests --all-features -- -D warnings
- cargo fmt --all
Tag: https://github.com/lance-format/lance/releases/tag/v1.0.0
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
## Summary
- infer integer vector columns as float32 when any value exceeds uint8
range or is negative
- keep uint8 for integer vectors within range and nulls only
- add sync/async tests covering large integer vector inference
## Testing
- ./.venv/bin/pytest python/python/tests/test_table.py -k
"large_int_vectors"
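
The inference rule in the summary can be sketched roughly as follows. This is an illustration only: `infer_int_vector_dtype` is a hypothetical helper that returns a dtype name as a string, not the function used in the Python package.

```python
from typing import Iterable, Optional, Sequence

UINT8_MIN, UINT8_MAX = 0, 255


def infer_int_vector_dtype(
    vectors: Iterable[Optional[Sequence[Optional[int]]]],
) -> str:
    """Return "float32" if any value is outside the uint8 range, else "uint8"."""
    for vector in vectors:
        if vector is None:
            continue  # null vectors carry no range information
        for value in vector:
            if value is None:
                continue  # null elements are ignored as well
            if value < UINT8_MIN or value > UINT8_MAX:
                return "float32"
    # Everything in range (or only nulls): keep the compact uint8 type.
    return "uint8"
```

For example, `[[0, 255], None]` stays `uint8`, while `[[1, 256]]` or `[[-1, 2]]` is widened to `float32`.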
In #2845 we ported the lancedb integration in lance-namespace to
lancedb. But that is too specific to RestNamespace. We can improve the
user entry point so that we can put local mode and a future version of the
Flight SQL-based LanceDB server all behind this single
`LanceDbNamespaceClientBuilder` API.
Also I renamed `namespace` to `namespaceClient` to avoid confusion with
the namespace path.
After the refactoring on both client and server side, we should have the
ability to fully use lance REST namespace to call into LanceDB cloud and
enterprise. We can avoid having a JNI implementation (which today does
not really do anything except for vending a connection object), and just
use lance-core's RestNamespace.
We will at this moment have a LanceDbRestNamespaceBuilder to allow users
to more easily build the RestNamespace to talk to LanceDB Cloud or
Enterprise endpoint.
In the future, we could extend this further to also support the local
mode through DirectoryNamespace. That will be a separate PR.
Adds IVF_SQ index config through Rust core and Python bindings, plus
alias names IvfHnswSq/Pq for backward compatibility. Updates
remote/table helpers and types to accept the new index type. Includes
tests covering IVF SQ creation and alias usage.
1. Use generated models in lance-namespace for request response models
to avoid multiple layers of conversions
2. Make sure the API is consistent with the namespace spec
3. Deprecate the table_names API in favor of the list_tables API in
namespace that allows full pagination support without the need to have
sorted table names
4. Add the describe_namespace API, which was missing from the original
implementation
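
To illustrate the token-based pagination mentioned in item 3, here is a rough sketch. The `(page_token, limit)` call shape, the `(names, next_token)` return value, and `fake_list_tables` are assumptions made for the example, not the actual `list_tables` signature.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def iter_all_tables(
    list_tables: Callable[[Optional[str], int], Page], limit: int = 2
) -> Iterator[str]:
    """Follow opaque page tokens until the backend reports no more pages."""
    page_token: Optional[str] = None
    while True:
        names, page_token = list_tables(page_token, limit)
        yield from names
        if page_token is None:
            return


def fake_list_tables(page_token: Optional[str], limit: int) -> Page:
    """Hypothetical in-memory backend; note the names are not sorted."""
    tables = ["users", "events", "docs", "images", "audit"]
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    next_token = str(end) if end < len(tables) else None
    return tables[start:end], next_token


all_tables = list(iter_all_tables(fake_list_tables))
```

Because the caller only passes the token back unchanged, the backend is free to return tables in any order, which is why sorted table names are no longer needed.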
Currently a table in a namespace is still backed with a `NativeTable`,
which means after getting the location of the table and optional storage
options override from `namespace.describe_table`, all things work like a
normal local table. However, namespace also supports `query_table`,
which is exactly the same API as remote table. This PR adds a
`server_side_query` capability. When enabled, it runs the query by
calling `namespace.query_table`. For a namespace that implements the
operation (e.g. REST namespace), this could hit a backend server that
could execute the query faster (e.g. using a distributed engine).
We have very low download stats for mac x86, and also latest github
runners for mac are all arm, so it makes sense at this point to
deprecate x86 support in general.
Add support for enabling stable row IDs when creating tables via the
`new_table_enable_stable_row_ids` storage option.
Stable row IDs ensure that row identifiers remain constant after
compaction, update, delete, and merge operations. This is useful for
materialized views and other use cases that need to track source rows
across these operations.
The option can be set at two levels:
- Connection level: applies to all tables created with that connection
- Table level: per-table override via create_table storage_options
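
The two-level precedence could be sketched like this. The helper name, the string values `"true"` / `"false"`, and the default of disabled are assumptions for illustration; only the option key comes from the description above.

```python
from typing import Dict, Optional

STABLE_ROW_IDS_KEY = "new_table_enable_stable_row_ids"


def resolve_stable_row_ids(
    connection_options: Dict[str, str],
    table_options: Optional[Dict[str, str]] = None,
) -> bool:
    """Table-level storage options override connection-level ones."""
    merged = dict(connection_options)
    merged.update(table_options or {})
    # Assumption: the option is carried as a string and defaults to disabled.
    return merged.get(STABLE_ROW_IDS_KEY, "false").lower() == "true"


connection_options = {STABLE_ROW_IDS_KEY: "true"}

# Inherits the connection-level setting.
inherited = resolve_stable_row_ids(connection_options)
# A per-table override switches it off for one table only.
overridden = resolve_stable_row_ids(connection_options, {STABLE_ROW_IDS_KEY: "false"})
```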
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- bump all Lance crates to v1.0.0-beta.16 via ci/set_lance_version.py
- refresh Cargo.lock (reqwest/opendal/etc.) to satisfy the new release
## Verification
- cargo clippy --workspace --tests --all-features -- -D warnings
- cargo fmt --all
Triggered by
[refs/tags/v1.0.0-beta.16](https://github.com/lance-format/lance/releases/tag/v1.0.0-beta.16)
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
2025-12-01 23:07:03 -08:00
335 changed files with 101963 additions and 26090 deletions
1. Extract the run ID from the workflow URL. The URL format is https://github.com/lancedb/lancedb/actions/runs/<run_id>.
2. Use "gh run view <run_id> --json jobs,conclusion,name" to get information about the failed run.
3. Identify which jobs failed. For each failed job, use "gh run view <run_id> --job <job_id> --log-failed" to get the failure logs.
4. Analyze the failure logs to understand what went wrong. Common failures include:
- Compilation errors
- Test failures
- Clippy warnings treated as errors
- Formatting issues
- Dependency issues
5. Based on the analysis, fix the issues in the codebase:
- For compilation errors: Fix the code that doesn't compile
- For test failures: Fix the failing tests or the code they test
- For clippy warnings: Apply the suggested fixes
- For formatting issues: Run "cargo fmt --all"
- For other issues: Apply appropriate fixes
6. After making fixes, verify them locally:
- Run "cargo fmt --all" to ensure formatting is correct
- Run "cargo clippy --workspace --tests --all-features -- -D warnings" to check for issues
- Run ONLY the specific failing tests to confirm they pass now:
- For Rust test failures: Run the specific test with "cargo test -p <crate> <test_name>"
- For Python test failures: Build with "cd python && maturin develop" then run "pytest <specific_test_file>::<test_name>"
- For Java test failures: Run "cd java && mvn test -Dtest=<TestClass>#<testMethod>"
- For TypeScript test failures: Run "cd nodejs && pnpm build && pnpm test -- --testNamePattern='<test_name>'"
- Do NOT run the full test suite - only run the tests that were failing
7. If the additional guidelines are provided, follow them as well.
8. Inspect "git status --short" and "git diff" to review your changes.
9. Create a fix branch: "git checkout -b codex/fix-ci-<run_id>".
10. Stage all changes with "git add -A" and commit with message "fix: resolve CI failures from run <run_id>".
11. Push the branch: "git push origin codex/fix-ci-<run_id>". If the remote branch exists, delete it first with "gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/codex/fix-ci-<run_id>" then push. Do NOT use "git push --force" or "git push -f".
12. Create a pull request targeting "${BRANCH}":
- Title: "ci: <short summary describing the fix>" (e.g., "ci: fix clippy warnings in lancedb" or "ci: resolve test flakiness in vector search")
- First, write the PR body to /tmp/pr-body.md using a heredoc (cat <<'PREOF' > /tmp/pr-body.md). The body should include:
- Link to the failing workflow run
- Summary of what failed
- Description of the fixes applied
- Then run "gh pr create --base ${BRANCH} --body-file /tmp/pr-body.md".
13. Display the new PR URL, "git status --short", and a summary of what was fixed.
Constraints:
- Use bash commands for all operations.
- Do not merge the PR.
- Do not modify GitHub workflow files unless they are the cause of the failure.
- If any command fails, diagnose and attempt to fix the issue instead of aborting immediately.
- If you cannot fix the issue automatically, create the PR anyway with a clear explanation of what you tried and what remains to be fixed.
- env "GH_TOKEN" is available, use "gh" tools for GitHub-related operations.
# Use "chore" for beta/rc versions, "feat" for stable releases
if [[ "${VERSION}" == *beta* ]] || [[ "${VERSION}" == *rc* ]]; then
COMMIT_TYPE="chore"
else
COMMIT_TYPE="feat"
fi
cat <<EOF >/tmp/codex-prompt.txt
You are running inside the lancedb repository on a GitHub Actions runner. Update the Lance dependency to version ${VERSION} and prepare a pull request for maintainers to review.
Follow these steps exactly:
1. Use script "ci/set_lance_version.py" to update Lance dependencies. The script already refreshes Cargo metadata, so allow it to finish even if it takes time.
2. Run "cargo clippy --workspace --tests --all-features -- -D warnings". If diagnostics appear, fix them yourself and rerun clippy until it exits cleanly. Do not skip any warnings.
3. After clippy succeeds, run "cargo fmt --all" to format the workspace.
4. Ensure the repository is clean except for intentional changes. Inspect "git status --short" and "git diff" to confirm the dependency update and any required fixes.
5. Create and switch to a new branch named "${BRANCH_NAME}" (replace any duplicated hyphens if necessary).
6. Stage all relevant files with "git add -A". Commit using the message "chore: update lance dependency to v${VERSION}".
7. Push the branch to origin. If the branch already exists, force-push your changes.
8. env "GH_TOKEN" is available, use "gh" tools for github related operations like creating pull request.
9. Create a pull request targeting "main" with title "chore: update lance dependency to v${VERSION}". In the body, summarize the dependency bump, clippy/fmt verification, and link the triggering tag (${TAG}).
10. After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
1. Use script "ci/set_lance_version.py" to update Lance Rust dependencies. The script already refreshes Cargo metadata, so allow it to finish even if it takes time.
2. Update the Java lance-core dependency version in "java/pom.xml": change the "<lance-core.version>...</lance-core.version>" property to "${VERSION}".
3. Run "cargo clippy --workspace --tests --all-features -- -D warnings". If diagnostics appear, fix them yourself and rerun clippy until it exits cleanly. Do not skip any warnings.
4. After clippy succeeds, run "cargo fmt --all" to format the workspace.
5. Ensure the repository is clean except for intentional changes. Inspect "git status --short" and "git diff" to confirm the dependency update and any required fixes.
6. Create and switch to a new branch named "${BRANCH_NAME}" (replace any duplicated hyphens if necessary).
7. Stage all relevant files with "git add -A". Commit using the message "${COMMIT_TYPE}: update lance dependency to v${VERSION}".
8. Push the branch to origin. If the remote branch already exists, delete it first with "gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/${BRANCH_NAME}" then push with "git push origin ${BRANCH_NAME}". Do NOT use "git push --force" or "git push -f".
9. env "GH_TOKEN" is available, use "gh" tools for github related operations like creating pull request.
10. Create a pull request targeting "main" with title "${COMMIT_TYPE}: update lance dependency to v${VERSION}". First, write the PR body to /tmp/pr-body.md using a heredoc (cat <<'EOF' > /tmp/pr-body.md). The body should summarize the dependency bump, clippy/fmt verification, and link the triggering tag (${TAG}). Then run "gh pr create --body-file /tmp/pr-body.md".
11. After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
Constraints:
- Use bash commands; avoid modifying GitHub workflow files other than through the scripted task above.
Voyage AI provides cutting-edge embedding models and rerankers.
Using the Voyage AI API requires the `voyageai` package, which can be installed using `pip install voyageai`. Voyage AI embeddings are used to generate embeddings for text data. The embeddings can be used for various tasks like semantic search, clustering, and classification.
You also need to set the `VOYAGE_API_KEY` environment variable to use the Voyage AI API.
- voyage-4-lite (1024 dims, optimized for latency and cost, 1M batch tokens)
- voyage-4-large (1024 dims, best retrieval quality, 120K batch tokens)
**Voyage-3 Series**
- voyage-3
- voyage-3-lite
**Domain-Specific Models**
- voyage-finance-2
- voyage-multilingual-2
- voyage-law-2
- voyage-code-2
Supported parameters (to be passed in `create` method) are:
| Parameter | Type | Default Value | Description |
|---|---|--------|---------|
| `name` | `str` | `None` | The model ID of the model to use. Supported base models for Text Embeddings: voyage-4, voyage-4-lite, voyage-4-large, voyage-3, voyage-3-lite, voyage-finance-2, voyage-multilingual-2, voyage-law-2, voyage-code-2 |
| `input_type` | `str` | `None` | Type of the input text. Defaults to `None`. Other options: `query`, `document`. |
| `truncation` | `bool` | `True` | Whether to truncate the input texts to fit within the context length. |
* Guava InternalFutureFailureAccess and InternalFutures (com.google.guava:failureaccess:1.0.2 - https://github.com/google/guava/failureaccess)
* Guava ListenableFuture only (com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava - https://github.com/google/guava/listenablefuture)
* Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.16.0 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
* Jackson module: Old JAXB Annotations (javax.xml.bind) (com.fasterxml.jackson.module:jackson-module-jaxb-annotations:2.17.1 - https://github.com/FasterXML/jackson-modules-base)