This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is it allows streaming requests to remote tables,
and at the same time allowing retries for most inputs.
In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.
For Python, all are retry-able, except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.datasets.Dataset` are retry-able *and* streaming.
A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:
* casting input data to target schema
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.
In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Upgrades `@napi-rs/cli` from v2 to v3, `napi`/`napi-derive` Rust
crates to 3.x
- Fixes a bug
([napi-rs#1170](https://github.com/napi-rs/napi-rs/issues/1170)) where
the CLI failed to locate the built `.node` binary when a custom Cargo
target directory is set (via `config.toml`)
## Changes
**package.json / CLI**:
- `napi.name` → `napi.binaryName`, `napi.triples` → `napi.targets`
- Removed `--no-const-enum` flag and fixed output dir arg
- `napi universal` → `napi universalize`
**Rust API migration**:
- `#[napi::module_init]` → `#[napi_derive::module_init]`
- `napi::JsObject` → `Object`, `.get::<_, T>()` → `.get::<T>()`
- `ErrorStrategy` removed; `ThreadsafeFunction` now takes an explicit
`Return` type with `CalleeHandled = false` const generic
- `JsFunction` + `create_threadsafe_function` replaced by typed
`Function<Args, Return>` + `build_threadsafe_function().build()`
- `RerankerCallbacks` struct removed (`Function<'env,...>` can't be
stored in structs); `VectorQuery::rerank` now accepts the function
directly
- `ClassInstance::clone()` now returns `ClassInstance`, fixed with
explicit deref
- `Vec<u8>` in `#[napi(object)]` now maps to `Array<number>` in v3;
changed to `Buffer` to preserve the TypeScript `Buffer` type
**TypeScript**:
- `inner.rerank({ rerankHybrid: async (_, args) => ... })` →
`inner.rerank(async (args) => ...)`
- Header provider callback wrapped in `async` to match stricter typed
constructor signature
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
BREAKING CHANGE: Arbitrary `impl RecordBatchReader` is no longer
accepted, it must be made into `Box<dyn RecordBatchReader>`.
This PR replaces `IntoArrow` with a new trait `Scannable` to define
input row data. This provides the following advantages:
1. **We can implement `Scannable` for more types than `IntoArrow`, such
as `RecordBatch` and `Vec<RecordBatch>`.** The `IntoArrow` trait was
implemented for arbitrary `T: RecordBatchReader`, and the Rust compiler
would prevent us from implementing it for foreign types like
`RecordBatch` because (theoretically) those types might implement
`RecordBatchReader` in the future. That's why we implement `Scannable`
for `Box<dyn RecordBatchReader>` instead; since it's a concrete type it
doesn't block implementing for other foreign types.
2. **We can potentially replay `Scannable` values**. Previously, we had
to choose between buffering all data in memory and supporting retries of
writes. But because `Scannable` things can optionally support
re-scanning, we now have a way of supporting retries while also
streaming.
3. **`Scannable` can provide hints like `num_rows`, which can be used to
schedule parallel writers.** Without knowing the total number of rows,
it's difficult to know whether it's worth writing multiple files in
parallel.
We don't yet fully take advantage of (2) and (3) yet, but will in future
PRs. For (2), in order to be ready to leverage this, we need to hook the
`Scannable` implementation up to Python and NodeJS bindings. Right now
they always pass down a stream, but we want to make sure they support
retries when possible. And for (3), this will need to be hooked up to
#2939 and to a pipeline for running pre-processing steps (like embedding
generation).
## Other changes
* Moved `create_table` and `add_data` into their own modules. I've
created a follow up issue to split up `table.rs` further, as it's by far
the largest file: https://github.com/lancedb/lancedb/issues/2949
* Eliminated the `HAS_DATA` generic for `CreateTableBuilder`. I didn't
see any public-facing places where we differentiated methods, which is
why I felt this simplification was okay.
* Added an `Error::External` variant and integrated some conversions to
allow certain errors to pass through transparently. This will fully work
once we upgrade Lance and get to take advantage of changes in
https://github.com/lance-format/lance/pull/5606
* Added LZ4 compression support for write requests to remote endpoints.
I checked and this has been supported on the server for > 1 year.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Added `repository` field to all nodejs package.json files (main
package + 7 platform-specific packages)
- This fixes the npm publish E422 error where sigstore provenance
verification fails because the repository.url was empty
## Root Cause
Failing CI:
https://github.com/lancedb/lancedb/actions/runs/21770794768/job/62821570260
npm's sigstore provenance verification requires the `repository.url`
field in package.json to match the GitHub repository URL from the
provenance bundle. The platform-specific packages
(`@lancedb/lancedb-darwin-arm64`, etc.) were missing this field
entirely, causing the publish to fail with:
```
npm error 422 Unprocessable Entity - Error verifying sigstore provenance bundle:
Failed to validate repository information: package.json: "repository.url" is "",
expected to match "https://github.com/lancedb/lancedb" from provenance
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Expose `initial_storage_options()` and `latest_storage_options()` in
lance Dataset, in lancedb rust, python and typescript SDKs.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
This PR changes takeRowIds to accept bigint[] instead of
number[], matching the type of _rowid returned by withRowId().
## Problem
When retrieving row IDs using \withRowId()\ and querying them back with
takeRowIds(), users get an error because:
1. _rowid values are returned as JavaScript bigint
2. takeRowIds() expected number[]
3. NAPI failed to convert: Error: Failed to convert napi value BigInt
into rust type i64
## Reproduction
\\\js
import lancedb from '@lancedb/lancedb';
const db = await lancedb.connect('memory://');
const table = await db.createTable('test', [{ id: 1, vector: [1.0, 2.0]
}]);
const results = await table.query().withRowId().toArray();
const rowIds = results.map(row => row._rowid);
console.log('types:', rowIds.map(id => typeof id)); // ['bigint']
await table.takeRowIds(rowIds).toArray(); // ⌠Error before fix
\\\
## Solution
- Updated TypeScript signature from takeRowIds(rowIds: number[]) to
takeRowIds(rowIds: bigint[])
- Updated Rust NAPI binding to accept Vec<BigInt> and convert using
get_u64()
Fixes#2722
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
BREAKING CHANGE: removes `aws`, `dynamodb`, `azure`, `gcs`, `oss`,
`huggingface` from default Rust features. They can be enabled by users
as needed.
They are still enabled for Python and NodeJS, since those users don't
control the compilation of artifacts.
Closes#2911
Fixes the breaking CI for nodejs, related to the documentation of the
new Permutation API in typescript.
- Expanded the generated typings in `nodejs/lancedb/native.d.ts` to
include `SplitCalculatedOptions`, `splitNames` fields, and the
persist/options-based `splitCalculated` methods so the permutation
exports match the native API.
- The previous block comment block had an inconsistency.
`splitCalculated` takes an options object (`SplitCalculatedOptions`) in
our bindings, not a bare string. The previous example showed
`builder.splitCalculated("user_id % 3");`, which doesn’t match the
actual signature and would fail TS typecheck. I updated the comment to
`builder.splitCalculated({ calculation: "user_id % 3" });` so the
example is now correct.
- Updated the `splitCalculated` example in
`nodejs/lancedb/permutation.ts` to use the options object.
- Ran `npm docs` to ensure docs build correctly.
> [!NOTE]
> **Disclaimer**: I used GPT-5.1-Codex-Max to make these updates, but I
have read the code and run `npm run docs` to verify that they work and
are correct to the best of my knowledge.
Did a full scan of all URLs that used to point to the old mkdocs pages,
and now links to the appropriate pages on lancedb.com/docs or lance.org
docs.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pipes the num_attempts field from lance's merge insert result
through lancedb. This allows callers of merge_insert to get a better
idea of whether transaction conflicts are occurring.
* Add `ci` profile for smaller build caches. This had a meaningful
impact in Lance, and I expect a similar impact here.
https://github.com/lancedb/lance/pull/5236
* Get caching working in Rust. Previously was not working due to
`workspaces: rust`.
* Get caching working in NodeJs lint job. Previously wasn't working
because we installed the toolchain **after** we called `- uses:
Swatinem/rust-cache@v2`, which invalidates the cache locally.
* Fix broken pytest from async io transition
(`pytest.PytestRemovedIn9Warning`)
* Altered `get_num_sub_vectors` to handle bug in case of 4-bit PQ. This
was cause of `rust future panicked: unknown error`. Raised an issue
upstream to change panic to error:
https://github.com/lancedb/lance/issues/5257
* Call `npm run docs` to fix doc issue.
* Disable flakey Windows test for consistency. It's just an OS-specific
timer issue, not our fault.
* Fix Windows absolute path handling in namespaces. Was causing CI
failure `OSError: [WinError 123] The filename, directory name, or volume
label syntax is incorrect: `