## **Summary**
This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.
A `Scannable` wraps a schema, an optional row count hint, a rescannable
flag, and a batch-producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.
This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow-up work.
---
## **Design**
### **Transport**
The transport uses the **Arrow IPC Stream format, one batch at a time**.
The JS side encodes each `RecordBatch` into a self-contained IPC Stream
message containing the schema, the batch, and the end-of-stream marker. The message is
returned as a `Buffer` through a napi `ThreadsafeFunction`. The Rust
side decodes it using `arrow_ipc::reader::StreamReader`.
Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.
I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.
Third-party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode and
decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.
---
### **API**
```ts
class Scannable {
readonly schema: Schema
readonly numRows: number | null
readonly rescannable: boolean
static fromFactory(schema, factory, opts?)
static fromTable(table, opts?)
static fromIterable(schema, iter, opts?)
static fromRecordBatchReader(reader, opts?)
}
```
The FFI boundary consists of a single callback:
`getNextBatch(isStart: boolean): Promise<Buffer | null>`
`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid-stream, for example when a write is
retried after a network error. Without this signal, a retry would resume
a stale iterator and silently skip batches that were already emitted.
In addition, a schema-only IPC buffer is transferred once during
construction.
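The restart behavior described above can be sketched as follows. This is a minimal illustration, not the code in `scannable.ts`: `RescannableSource`, `EncodedBatch`, and `threeBatches` are hypothetical names, and a plain `Uint8Array` stands in for an encoded IPC Stream message.

```ts
// Hypothetical stand-in for one encoded Arrow IPC Stream message.
type EncodedBatch = Uint8Array;

// Illustrative only: holds the cached iterator and resets it at scan boundaries.
class RescannableSource {
  private current: Iterator<EncodedBatch> | null = null;
  private readonly factory: () => Iterator<EncodedBatch>;

  constructor(factory: () => Iterator<EncodedBatch>) {
    this.factory = factory;
  }

  // Synchronous core: returns the next batch, or null at end of stream.
  next(isStart: boolean): EncodedBatch | null {
    if (isStart || this.current === null) {
      // New scan: drop any cached iterator and re-invoke the factory.
      this.current = this.factory();
    }
    const step = this.current.next();
    return step.done ? null : step.value;
  }

  // Same shape as the FFI callback: getNextBatch(isStart) => Promise<Buffer | null>.
  getNextBatch = async (isStart: boolean): Promise<EncodedBatch | null> =>
    this.next(isStart);
}

// A factory that yields three one-byte "batches" tagged 0, 1, 2.
function* threeBatches(): Generator<EncodedBatch> {
  for (let i = 0; i < 3; i++) {
    yield Uint8Array.of(i);
  }
}
```

If a scan is abandoned after the first batch, the next call with `isStart = true` starts again from the batch tagged 0 instead of continuing with the batch tagged 1.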
---
## **Changes**
* `nodejs/src/scannable.rs`
Adds `NapiScannable` and the `LanceScannable` implementation. Implements
`schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
Includes per-batch schema validation against the declared schema,
one-shot enforcement for non-rescannable sources, and a scan-boundary
reset signal (`isStart`) so rescannable sources restart from batch 0 on
every `scan_as_stream` call rather than resuming a stale iterator.
* `nodejs/src/lib.rs`
Module registration.
* `nodejs/lancedb/scannable.ts`
Defines the `Scannable` class and the four constructors listed above.
Each constructor rejects option combinations it cannot honor, for
example a `rescannable: true` request on a one-shot iterable or reader,
or a `numRows` that disagrees with an in-memory table's row count.
* `nodejs/lancedb/index.ts`
Exports the new primitive.
* `nodejs/__test__/scannable.test.ts`
Test suite for the primitive.
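The construction-time checks described for `scannable.ts` could look roughly like the sketch below. `resolveOptions`, `SourceKind`, and the error messages are hypothetical, and the defaults shown (one-shot sources default to `rescannable: false`, others to `true`) are assumptions rather than the PR's documented behavior.

```ts
// Hypothetical names; the real option handling lives in scannable.ts.
type SourceKind = "factory" | "table" | "iterable" | "reader";

interface ScannableOptions {
  numRows?: number;
  rescannable?: boolean;
}

interface ResolvedOptions {
  numRows: number | null;
  rescannable: boolean;
}

function resolveOptions(
  kind: SourceKind,
  opts: ScannableOptions = {},
  tableRows?: number,
): ResolvedOptions {
  // Iterables and readers can only be consumed once.
  const oneShot = kind === "iterable" || kind === "reader";
  if (oneShot && opts.rescannable === true) {
    throw new Error(`a ${kind} source is one-shot and cannot be rescannable`);
  }
  // An in-memory table knows its row count, so a conflicting hint is an error.
  if (
    kind === "table" &&
    tableRows !== undefined &&
    opts.numRows !== undefined &&
    opts.numRows !== tableRows
  ) {
    throw new Error(
      `numRows ${opts.numRows} disagrees with table row count ${tableRows}`,
    );
  }
  const numRows =
    kind === "table" ? (tableRows ?? opts.numRows ?? null) : (opts.numRows ?? null);
  // Assumed default: rescannable unless the source is one-shot.
  const rescannable = opts.rescannable ?? !oneShot;
  return { numRows, rescannable };
}
```

Rejecting bad combinations at construction keeps the failure close to the caller's mistake instead of surfacing it later, in the middle of a scan.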
---
## **Validation**
Before implementing the bridge, I ran an end-to-end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.
The harness covered the following scenarios:
* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe
All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly end
to end and that Node's `Buffer` size limit does not constrain the
overall stream.
Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is consumed partially, the scan is dropped
mid-stream, and the source is then re-scanned. The re-scan replays from
batch 0 rather than resuming the stale iterator. The same harness was run
with the `isStart` reset path disabled, and the mid-stream restart case
failed as expected, confirming the test exercises the real regression.
These harnesses are not meant to replace the full test suite, which is
described below.
---
## **Tests**
`__test__/scannable.test.ts` covers construction, metadata reflection,
per-constructor defaults and overrides, construction-time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.
Runtime scan behavior is not exercised here. That includes
`scan_as_stream`, one-shot enforcement on non-rescannable sources,
schema mismatch detection, IPC decode failures, and rescannable restart
semantics. There is no in-tree JS consumer of `NapiScannable` yet. This
mirrors Python's `PyScannable`, which has no dedicated test file and is
covered transitively through the consumers that accept a Scannable.
Runtime coverage will follow in the consumer migration work.
---
## **Status**
Ready for review.
Closes #3223
---
Adds `manifest_enabled` for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
## Summary
- Upgrades `@napi-rs/cli` from v2 to v3, `napi`/`napi-derive` Rust
crates to 3.x
- Fixes a bug
([napi-rs#1170](https://github.com/napi-rs/napi-rs/issues/1170)) where
the CLI failed to locate the built `.node` binary when a custom Cargo
target directory is set (via `config.toml`)
## Changes
**package.json / CLI**:
- `napi.name` → `napi.binaryName`, `napi.triples` → `napi.targets`
- Removed `--no-const-enum` flag and fixed output dir arg
- `napi universal` → `napi universalize`
**Rust API migration**:
- `#[napi::module_init]` → `#[napi_derive::module_init]`
- `napi::JsObject` → `Object`, `.get::<_, T>()` → `.get::<T>()`
- `ErrorStrategy` removed; `ThreadsafeFunction` now takes an explicit
`Return` type with `CalleeHandled = false` const generic
- `JsFunction` + `create_threadsafe_function` replaced by typed
`Function<Args, Return>` + `build_threadsafe_function().build()`
- `RerankerCallbacks` struct removed (`Function<'env,...>` can't be
stored in structs); `VectorQuery::rerank` now accepts the function
directly
- `ClassInstance::clone()` now returns `ClassInstance`, fixed with
explicit deref
- `Vec<u8>` in `#[napi(object)]` now maps to `Array<number>` in v3;
changed to `Buffer` to preserve the TypeScript `Buffer` type
**TypeScript**:
- `inner.rerank({ rerankHybrid: async (_, args) => ... })` →
`inner.rerank(async (args) => ...)`
- Header provider callback wrapped in `async` to match stricter typed
constructor signature
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Did a full scan of all URLs that used to point to the old mkdocs pages
and updated them to link to the appropriate pages on lancedb.com/docs or
the lance.org docs.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I'm working on a lancedb version of pytorch data loading (and hopefully
addressing https://github.com/lancedb/lance/issues/3727).
However, rather than rely on pytorch for everything I'm moving some of
the things that pytorch does into rust. This gives us more control over
data loading (e.g. using shards or a hash-based split) and it allows
permutations to be persistent. In particular I hope to be able to:
* Create a persistent permutation
* This permutation can handle splits, filtering, shuffling, and sharding
* Create a rust data loader that can read a permutation (one or more
splits), or a subset of a permutation (for DDP)
* Create a python data loader that delegates to the rust data loader
* Eventually create integrations for other data loading libraries,
including rust & node
## Summary
This PR introduces a `HeaderProvider` which is called for all remote
HTTP calls to get the latest headers to inject. This is useful for
features like adding the latest auth tokens, where the header provider
can auto-refresh tokens internally so that each request always carries
the refreshed token.
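A minimal sketch of the auto-refresh idea, under stated assumptions: `HeaderProviderLike`, `RefreshingTokenProvider`, and `Token` are hypothetical names, the real `HeaderProvider` interface may differ (for example, it may be asynchronous), and the clock is injected only to keep the example deterministic.

```ts
// Hypothetical shape, not the actual `HeaderProvider` API.
interface HeaderProviderLike {
  getHeaders(): Record<string, string>;
}

interface Token {
  value: string;
  expiresAtMs: number;
}

class RefreshingTokenProvider implements HeaderProviderLike {
  private token: Token | null = null;
  private readonly fetchToken: () => Token;
  private readonly now: () => number;

  constructor(fetchToken: () => Token, now: () => number = Date.now) {
    this.fetchToken = fetchToken;
    this.now = now;
  }

  // Called before every request: refresh the token only if it is missing or expired.
  getHeaders(): Record<string, string> {
    let token = this.token;
    if (token === null || this.now() >= token.expiresAtMs) {
      token = this.fetchToken();
      this.token = token;
    }
    return { authorization: `Bearer ${token.value}` };
  }
}
```

Because the provider is consulted on each call, callers never hold a stale token themselves; the refresh policy stays in one place.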
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- Exposes `Session` in Python and TypeScript so users can set the
`index_cache_size_bytes` and `metadata_cache_size_bytes`
  - The `Session` is attached to the `Connection`, and thus shared across
    all tables in that connection.
- Adds deprecation warnings for table-level cache configuration
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
This reverts commit a547c523c2 (#2281).
The current implementation can cause panics and performance degradation.
I will bring this back with more testing in
https://github.com/lancedb/lancedb/pull/2311
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Documentation**
  - Enhanced clarity on read consistency settings with updated
    descriptions and default behavior.
  - Removed outdated warnings about eventual consistency from the
    troubleshooting guide.
- **Refactor**
  - Streamlined the handling of the read consistency interval across
    integrations, now defaulting to "None" for improved performance.
  - Simplified internal logic to offer a more consistent experience.
- **Tests**
  - Updated test expectations to reflect the new default representation
    for the read consistency interval.
  - Removed redundant tests related to "no consistency" settings for
    streamlined testing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.
Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users' expectations.
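The three regimes described above can be summarized in a small decision function. This is only an illustration of the policy in TypeScript, not the Rust implementation: `refreshAction` and its return labels are hypothetical, and the assumption that a positive interval triggers a background load once the interval has elapsed is mine.

```ts
// Hypothetical labels for what a read should do about table versions.
type RefreshAction = "none" | "blocking" | "background" | "reuse";

function refreshAction(
  intervalSecs: number | null,
  secsSinceLastCheck: number,
): RefreshAction {
  // No interval configured: never check for a new version.
  if (intervalSecs === null) return "none";
  // Interval of zero: strong consistency, so the read waits for the latest version.
  if (intervalSecs === 0) return "blocking";
  // Positive interval: load the next version in the background once it is due
  // (assumed trigger), otherwise keep serving the current version.
  return secsSinceLastCheck >= intervalSecs ? "background" : "reuse";
}
```

With a 5 second interval, a read arriving 2 seconds after the last check serves the current version, and one arriving 6 seconds after it kicks off a background load without blocking.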
* Make `npm run docs` fail if there are any warnings. This will catch
items missing from the API reference.
* Add a check in our CI to make sure `npm run docs` runs without warnings
and doesn't generate any new files (which would indicate the docs are out of date).
* Hide constructors that aren't user facing.
* Remove unused enum `WriteMode`.
Closes #2068
Support hybrid search in both rust and node SDKs.
- Adds a new rerankers package to rust LanceDB, with the implementation
of the default RRF reranker
- Adds a new hybrid package to lancedb, with some helper methods related
to hybrid search such as normalizing scores and converting score columns
to rank columns
- Adds the capability for LanceDB's VectorQuery to perform hybrid search if it
has both nearest-vector and full-text search parameters.
- Adds wrappers for reranker implementations to the nodejs SDK.
Additional rerankers will be added in follow-up PRs.
https://github.com/lancedb/lancedb/issues/1921
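For readers unfamiliar with RRF (reciprocal rank fusion), the sketch below shows the general technique in TypeScript. It is not the Rust reranker from this PR: `rrf` is a hypothetical helper, ranks are assumed to start at 1, and `k = 60` is a commonly used constant assumed here as the default.

```ts
// Reciprocal rank fusion: score(id) = sum over result lists of 1 / (k + rank).
// Each inner array is one result list, ordered best first.
function rrf(
  rankings: string[][],
  k: number = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // assumed 1-based ranks
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: fuse a vector-search ranking with a full-text-search ranking.
const fused = rrf([
  ["a", "b", "c"], // vector search results
  ["b", "d"], // full-text search results
]);
```

Here `b` ranks first because it appears in both lists, which is the point of fusing on ranks: the two searches produce scores on different scales, but their rank positions are directly comparable.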
---
Notes about how the rust rerankers are wrapped for calling from JS:
I wanted to keep the core reranker logic, and the invocation of the
reranker by the query code, in Rust. This aligns with the philosophy of
the new node SDK where it's just a thin wrapper around Rust. However, I
also wanted to have support for users who want to add custom rerankers
written in JavaScript.
When we add a reranker to the query from JavaScript, it adds a special
Rust reranker that has a callback to the JavaScript code (which could
then turn around and call an underlying Rust reranker implementation if
desired). This adds a bit of complexity, but overall I think it moves us
in the right direction of having the majority of the query logic in the
underlying Rust SDK while keeping the option open to support custom
JavaScript rerankers.
This exposes the `LANCEDB_LOG` environment variable in node, so that
users can now turn on logging.
In addition, fixes a bug where only the top-level error from Rust was
being shown. This PR makes sure the full error chain is included in the
error message. In the future, we will improve this so the error chain is
set on the [cause](https://nodejs.org/api/errors.html#errorcause)
property of JS errors (https://github.com/lancedb/lancedb/issues/1779).
Fixes #1774
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.
1. Fixes #1168
2. Closes #1165
3. Closes #1082
4. Closes #439
5. Closes #897
6. Closes #642
7. Closes #281
8. Closes #114
9. Closes #990
10. Deprecates `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).
While doing this we also cleaned up some inconsistencies between the
SDKs:
* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying, so
I changed it a bit to ensure we clean up tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.
Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
This PR adds the same consistency semantics as were added in #828. It
*does not* add the same lazy-loading of tables, since that breaks some
existing tests.
This closes #998.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>