feat(nodejs): add Scannable primitive for streaming ingestion (#3271)

## **Summary**

This PR adds a **Scannable primitive** to the Node.js bindings, bringing parity with Python's `PyScannable`. A `Scannable` wraps a schema, an optional row-count hint, a rescannable flag, and a batch-producing callback. On the Rust side it implements `lancedb::data::scannable::Scannable`.

The goal is to give consumers such as `Table.add`, `createTable`, and `mergeInsert` a way to stream data without materializing the full dataset in JS memory. This PR introduces only the primitive. Migrating existing consumers to use it will come in follow-up work.

---

## **Design**

### **Transport**

The transport uses the **Arrow IPC Stream format, one batch at a time**. The JS side encodes each `RecordBatch` into a self-contained IPC Stream message containing the schema, the batch, and an end-of-stream marker. The message is returned as a `Buffer` through a napi `ThreadsafeFunction`. The Rust side decodes it using `arrow_ipc::reader::StreamReader`.

Only one batch is active at a time, so JS memory stays bounded by the batch size. The Node `Buffer` size limit of about 4 GiB therefore applies to each batch but does not constrain the stream as a whole.

I initially evaluated the Arrow C Data Interface, which is the approach used in Python. I dropped that path after confirming that the `apache-arrow` npm package does not expose a C Data Interface export in any supported version from 15 to 18. JavaScript is not listed in Arrow's C Data Interface implementation table, and the upstream tracking issue remains open with no scheduled work. Third-party FFI shims would introduce additional dependency risk without solving the core maintenance problem.

Using IPC adds one encode and one decode step per batch, but the cost is predictable and typically dominated by Lance's write path.

---

### **API**

```ts
class Scannable {
  readonly schema: Schema
  readonly numRows: number | null
  readonly rescannable: boolean

  static fromFactory(schema, factory, opts?)
  static fromTable(table, opts?)
  static fromIterable(schema, iter, opts?)
  static fromRecordBatchReader(reader, opts?)
}
```

The FFI boundary consists of a single callback:

`getNextBatch(isStart: boolean): Promise<Buffer | null>`

`isStart` is `true` on the first call of each new scan and `false` for every call after it. The JS side uses it to drop any cached iterator and re-invoke the factory at scan boundaries. This is what makes a rescannable source restart at batch 0 on every `scan_as_stream` call, even when a previous scan ended mid-stream, for example when a write is retried after a network error. Without this signal, a retry would resume a stale iterator and silently skip batches that had already been emitted.

In addition, a schema-only IPC buffer is transferred once during construction.

---

## **Changes**

* `nodejs/src/scannable.rs`: Adds `NapiScannable` and the `LanceScannable` implementation. Implements `schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`. Includes per-batch schema validation against the declared schema, one-shot enforcement for non-rescannable sources, and a scan-boundary reset signal (`isStart`) so rescannable sources restart from batch 0 on every `scan_as_stream` call rather than resuming a stale iterator.
* `nodejs/src/lib.rs`: Module registration.
* `nodejs/lancedb/scannable.ts`: Defines the `Scannable` class and the four constructors listed above. Each constructor rejects option combinations it cannot honor, for example a `rescannable: true` request on a one-shot iterable or reader, or a `numRows` that disagrees with an in-memory table's row count.
* `nodejs/lancedb/index.ts`: Exports the new primitive.
* `nodejs/__test__/scannable.test.ts`: Test suite for the primitive.

---

## **Validation**

Before implementing the bridge, I ran an end-to-end harness with a JS producer feeding a standalone Rust consumer built against the same `arrow-ipc` version used in the bridge.

The harness covered the following scenarios:

* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe

All scenarios completed with bounded memory usage. The goal of this harness was to confirm that the IPC Stream transport works correctly end to end and that Node's `Buffer` size limit does not constrain the overall stream.

Separately, the rescannable restart contract was verified with a focused harness. A rescannable source is consumed partially, the scan is dropped mid-stream, and the source is then re-scanned. The re-scan replays from batch 0 rather than resuming the stale iterator. The same harness was run with the `isStart` reset path disabled, and the mid-stream restart case failed as expected, confirming that the test exercises the real regression.

These harnesses are not meant to replace the full test suite, which is described below.

---

## **Tests**

`__test__/scannable.test.ts` covers construction, metadata reflection, per-constructor defaults and overrides, construction-time validation, the native handle surface, and schema variety across empty tables, nested types, `FixedSizeList`, and wide schemas.

Runtime scan behavior is not exercised here. That includes `scan_as_stream`, one-shot enforcement on non-rescannable sources, schema mismatch detection, IPC decode failures, and rescannable restart semantics. There is no in-tree JS consumer of `NapiScannable` yet. This mirrors Python's `PyScannable`, which has no dedicated test file and is covered transitively through the consumers that accept a Scannable. Runtime coverage will follow in the consumer migration work.

---

## **Status**

Ready for review.

Closes #3223

---
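The `isStart` contract described in the API section can be illustrated with a small, dependency-free sketch. This is not the code in `nodejs/lancedb/scannable.ts`: `makeGetNextBatch`, `threeBatches`, and `demoRestart` are hypothetical names, and plain byte arrays stand in for encoded IPC Stream buffers so the example runs without Arrow.

```ts
// Hypothetical sketch of the JS side of the `isStart` contract.
// Byte arrays stand in for IPC Stream buffers; nothing here is the PR's code.
type EncodedBatch = Uint8Array;
type GetNextBatch = (isStart: boolean) => Promise<EncodedBatch | null>;

// Wrap an iterator factory in a callback that honors the `isStart` signal.
function makeGetNextBatch(
  factory: () => Iterator<EncodedBatch>,
): GetNextBatch {
  let cached: Iterator<EncodedBatch> | null = null;
  return async (isStart: boolean) => {
    // At a scan boundary, drop any iterator left over from a previous
    // (possibly abandoned) scan and ask the factory for a fresh one.
    if (isStart || cached === null) {
      cached = factory();
    }
    const step = cached.next();
    return step.done ? null : step.value;
  };
}

// Three fake "batches", each tagged with its index in the first byte.
function* threeBatches(): Generator<EncodedBatch> {
  for (let i = 0; i < 3; i++) {
    yield Uint8Array.of(i);
  }
}

// Simulate a scan that is abandoned after one batch, then retried.
async function demoRestart(): Promise<number[]> {
  const getNextBatch = makeGetNextBatch(threeBatches);
  const seen: number[] = [];

  // Scan 1: pull a single batch, then drop the scan (e.g. a failed write).
  const first = await getNextBatch(true);
  if (first !== null) seen.push(first[0]);

  // Scan 2: the first pull carries isStart=true, later pulls carry false.
  let isStart = true;
  for (;;) {
    const buf = await getNextBatch(isStart);
    isStart = false;
    if (buf === null) break;
    seen.push(buf[0]);
  }
  return seen; // batch indices in the order they were observed
}
```

In this sketch `demoRestart()` resolves to `[0, 0, 1, 2]`: the retry replays batch 0 instead of continuing from batch 1. If the `isStart` check were removed, the second scan would resume the stale iterator and batch 0 would be missing from the retry.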
2026-05-16 11:30:41 +00:00 · 2026-05-15 03:37:41 +05:30
parent 9330a9b851
commit df4ad9f851
9 changed files with 1183 additions and 0 deletions
--- a/nodejs/src/lib.rs
+++ b/nodejs/src/lib.rs
@@ -16,6 +16,7 @@ pub mod permutation;
 mod query;
 pub mod remote;
 mod rerankers;
+mod scannable;
 mod session;
 mod table;
 mod util;
--- a/nodejs/src/scannable.rs
+++ b/nodejs/src/scannable.rs
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: Apache-2.0
+// SPDX-FileCopyrightText: Copyright The LanceDB Authors
+
+//! NodeJS binding for the [`lancedb::data::scannable::Scannable`] trait.
+//!
+//! The JS side supplies a `getNextBatch(isStart)` callback that returns the
+//! next Arrow `RecordBatch` encoded as a self-contained Arrow IPC Stream
+//! message (schema message + record batch message + EOS marker) wrapped in a
+//! `Buffer`, or `null` when the stream is exhausted. The Rust side parses
+//! each buffer with `arrow_ipc::reader::StreamReader`, validates every
+//! standalone batch stream against the declared schema, and yields decoded
+//! `RecordBatch`es as a [`SendableRecordBatchStream`].
+//!
+//! `isStart` is `true` on the first `getNextBatch` call of each new
+//! `scan_as_stream` and `false` thereafter. JS uses it to drop any cached
+//! iterator and re-invoke its factory at scan boundaries, so retries
+//! triggered by mid-stream failures restart at batch 0.
+
+use std::io::Cursor;
+use std::sync::Arc;
+
+use arrow_array::RecordBatch;
+use arrow_ipc::reader::StreamReader;
+use arrow_schema::SchemaRef;
+use futures::stream::once;
+use lancedb::arrow::{SendableRecordBatchStream, SimpleRecordBatchStream};
+use lancedb::data::scannable::Scannable as LanceScannable;
+use lancedb::ipc::ipc_file_to_schema;
+use lancedb::{Error, Result as LanceResult};
+use napi::bindgen_prelude::*;
+use napi::threadsafe_function::ThreadsafeFunction;
+use napi_derive::napi;
+
+/// Threadsafe handle to the JS `getNextBatch` callback. The callback takes a
+/// single boolean `isStart` (`true` on the first call of each new scan) and
+/// returns a Promise that resolves to a `Buffer` containing one IPC Stream
+/// message, or `null` at end-of-stream.
+type GetNextBatchFn = ThreadsafeFunction<bool, Promise<Option<Buffer>>, bool, Status, false>;
+
+/// A Rust-side view of a JS-constructed `Scannable`.
+///
+/// Held in JS as the return value of the `Scannable` class constructor. When
+/// passed to a consumer that accepts `impl lancedb::data::scannable::Scannable`,
+/// the consumer invokes `scan_as_stream()` to pull batches through the JS
+/// callback.
+#[napi]
+pub struct NapiScannable {
+    schema: SchemaRef,
+    num_rows: Option<usize>,
+    rescannable: bool,
+    // `ThreadsafeFunction` is not `Clone`; wrap in `Arc` so the stream
+    // returned by `scan_as_stream` can own a handle independent of `self`.
+    get_next_batch: Arc<GetNextBatchFn>,
+    // Tracks whether a scan has already started; used to enforce one-shot
+    // semantics on non-rescannable sources.
+    scanned: bool,
+}
+
+#[napi]
+impl NapiScannable {
+    /// Construct a new `NapiScannable`.
+    ///
+    /// - `schema_buf` -- Arrow IPC File buffer carrying only the schema (no batches).
+    /// - `num_rows` -- optional row count hint; not validated against the stream.
+    /// - `rescannable` -- whether `get_next_batch` may be re-driven after the
+    ///   scan completes.
+    /// - `get_next_batch` -- JS callback that yields the next batch as an Arrow
+    ///   IPC Stream message wrapped in a `Buffer`, or `null` at EOF. The
+    ///   `isStart` argument is `true` on the first call of each new scan;
+    ///   JS uses it to discard any cached iterator before pulling.
+    #[napi(constructor)]
+    pub fn new(
+        schema_buf: Buffer,
+        num_rows: Option<i64>,
+        rescannable: bool,
+        get_next_batch: Function<bool, Promise<Option<Buffer>>>,
+    ) -> napi::Result<Self> {
+        let schema = ipc_file_to_schema(schema_buf.to_vec())
+            .map_err(|e| napi::Error::from_reason(format!("Invalid schema buffer: {}", e)))?;
+        let num_rows = num_rows
+            .map(|n| {
+                usize::try_from(n)
+                    .map_err(|_| napi::Error::from_reason("num_rows must be non-negative"))
+            })
+            .transpose()?;
+        let get_next_batch = Arc::new(get_next_batch.build_threadsafe_function().build()?);
+        Ok(Self {
+            schema,
+            num_rows,
+            rescannable,
+            get_next_batch,
+            scanned: false,
+        })
+    }
+}
+
+impl std::fmt::Debug for NapiScannable {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        f.debug_struct("NapiScannable")
+            .field("schema", &self.schema)
+            .field("num_rows", &self.num_rows)
+            .field("rescannable", &self.rescannable)
+            .finish()
+    }
+}
+
+impl LanceScannable for NapiScannable {
+    fn schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }
+
+    fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
+        let schema = self.schema.clone();
+
+        // One-shot enforcement for non-rescannable sources: return a stream
+        // whose first item is an error.
+        if self.scanned && !self.rescannable {
+            let err_stream = once(async {
+                Err(Error::InvalidInput {
+                    message: "Scannable has already been consumed (non-rescannable source)"
+                        .to_string(),
+                })
+            });
+            return Box::pin(SimpleRecordBatchStream::new(err_stream, schema));
+        }
+        self.scanned = true;
+
+        let tsfn = Arc::clone(&self.get_next_batch);
+        let declared_schema = schema.clone();
+
+        // State threaded through the unfold. `is_first_pull` starts true so
+        // the first call into JS signals a new-scan boundary; JS uses it to
+        // reset any cached iterator before factory()-ing a fresh one.
+        let initial = State {
+            tsfn,
+            batch_index: 0,
+            declared_schema,
+            errored: false,
+            is_first_pull: true,
+        };
+
+        let stream = futures::stream::unfold(initial, |mut state| async move {
+            if state.errored {
+                return None;
+            }
+
+            // Pull the next IPC Stream buffer from JS. `is_first_pull` is
+            // consumed here and cleared so subsequent pulls continue the
+            // same scan rather than restarting it.
+            let is_start = state.is_first_pull;
+            state.is_first_pull = false;
+            let buf = match pull_next(&state.tsfn, is_start).await {
+                Ok(Some(buf)) => buf,
+                Ok(None) => return None,
+                Err(e) => {
+                    state.errored = true;
+                    return Some((Err(e), state));
+                }
+            };
+
+            match decode_one_batch(buf.as_ref(), &state.declared_schema) {
+                Ok(batch) => {
+                    state.batch_index += 1;
+                    Some((Ok(batch), state))
+                }
+                Err(e) => {
+                    let tagged = Error::Runtime {
+                        message: format!(
+                            "[scannable/rust-bridge] failure at batch index {}: {}",
+                            state.batch_index, e
+                        ),
+                    };
+                    state.errored = true;
+                    Some((Err(tagged), state))
+                }
+            }
+        });
+
+        Box::pin(SimpleRecordBatchStream::new(stream, schema))
+    }
+
+    fn num_rows(&self) -> Option<usize> {
+        self.num_rows
+    }
+
+    fn rescannable(&self) -> bool {
+        self.rescannable
+    }
+}
+
+struct State {
+    tsfn: Arc<GetNextBatchFn>,
+    batch_index: usize,
+    declared_schema: SchemaRef,
+    errored: bool,
+    /// True for the very first pull of a new scan. Forwarded to JS so the
+    /// callback can drop any cached iterator and call its factory fresh,
+    /// which makes rescannable sources restart at batch 0 even when the
+    /// previous scan ended mid-stream.
+    is_first_pull: bool,
+}
+
+/// Invoke the JS callback and await its Promise. `is_start` is forwarded to
+/// the JS side as the `isStart` argument so it can reset its iterator at the
+/// scan boundary. Errors on the JS side surface here as rejected promises
+/// and are tunneled back as `lancedb::Error::Runtime`.
+async fn pull_next(tsfn: &GetNextBatchFn, is_start: bool) -> LanceResult<Option<Buffer>> {
+    let promise = tsfn
+        .call_async(is_start)
+        .await
+        .map_err(|e| Error::Runtime {
+            message: format!(
+                "[scannable/js-factory] napi error status={}, reason={}",
+                e.status, e.reason
+            ),
+        })?;
+    promise.await.map_err(|e| Error::Runtime {
+        message: format!(
+            "[scannable/js-iterator] napi error status={}, reason={}",
+            e.status, e.reason
+        ),
+    })
+}
+
+/// Decode one IPC Stream buffer (schema + batch + EOS) into a `RecordBatch`.
+/// Each buffer is a standalone IPC stream, so every decoded stream schema must
+/// match the one declared at construction.
+fn decode_one_batch(buf: &[u8], declared: &SchemaRef) -> LanceResult<RecordBatch> {
+    let reader = StreamReader::try_new(Cursor::new(buf), None).map_err(|e| Error::Runtime {
+        message: format!("failed to open IPC stream reader: {}", e),
+    })?;
+
+    let actual = reader.schema();
+    if actual.as_ref() != declared.as_ref() {
+        return Err(Error::InvalidInput {
+            message: format!(
+                "declared schema does not match stream schema: declared={:?} actual={:?}",
+                declared, actual
+            ),
+        });
+    }
+
+    let mut iter = reader;
+    let batch = iter
+        .next()
+        .ok_or_else(|| Error::Runtime {
+            message: "IPC stream contained schema but no record batch".to_string(),
+        })?
+        .map_err(|e| Error::Runtime {
+            message: format!("failed to decode record batch: {}", e),
+        })?;
+    Ok(batch)
+}
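
Since there is no in-tree JS consumer yet, the pull loop in `scan_as_stream` above can be modeled in TypeScript to exercise a `getNextBatch` callback in isolation. The following is a minimal sketch under stated assumptions: `ModelScannable`, `PullFn`, and `demoPullLoop` are hypothetical names that do not exist in this PR, byte arrays stand in for IPC buffers, and no decoding or schema validation is performed.

```ts
// Hypothetical test helper that models the Rust pull loop in TypeScript.
// It is not part of this PR and does not decode or validate Arrow data.
type PullFn = (isStart: boolean) => Promise<Uint8Array | null>;

class ModelScannable {
  private scanned = false;
  private readonly rescannable: boolean;
  private readonly pull: PullFn;

  constructor(rescannable: boolean, pull: PullFn) {
    this.rescannable = rescannable;
    this.pull = pull;
  }

  // Mirrors the control flow of `scan_as_stream`: refuse a second scan of a
  // one-shot source, send isStart=true only on the first pull, stop at null.
  // The Rust code reports the one-shot error as the first stream item; this
  // model rejects the returned promise instead.
  async scan(): Promise<Uint8Array[]> {
    if (this.scanned && !this.rescannable) {
      throw new Error(
        "Scannable has already been consumed (non-rescannable source)",
      );
    }
    this.scanned = true;

    const batches: Uint8Array[] = [];
    let isFirstPull = true;
    for (;;) {
      const isStart = isFirstPull;
      isFirstPull = false;
      // A rejection from the callback propagates and ends the scan.
      const buf = await this.pull(isStart);
      if (buf === null) break; // end of stream
      batches.push(buf);
    }
    return batches;
  }
}

// Record the flags seen by the callback and try to scan a one-shot source twice.
async function demoPullLoop(): Promise<{ flags: boolean[]; secondScan: string }> {
  const flags: boolean[] = [];
  let remaining = 2;
  const pull: PullFn = async (isStart) => {
    flags.push(isStart);
    return remaining-- > 0 ? Uint8Array.of(remaining) : null;
  };

  const source = new ModelScannable(false, pull);
  await source.scan();

  let secondScan = "ok";
  try {
    await source.scan();
  } catch (e) {
    secondScan = (e as Error).message;
  }
  return { flags, secondScan };
}
```

In this model the callback observes the flags `[true, false, false]` for a two-batch source, and the second `scan()` on the non-rescannable source fails before the callback is invoked again.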