mirror of https://github.com/lancedb/lancedb.git synced 2026-07-03 11:00:40 +00:00

Tanay df4ad9f851 feat(nodejs): add Scannable primitive for streaming ingestion (#3271)

## **Summary**

This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.

A `Scannable` wraps a schema, an optional row-count hint, a rescannable
flag, and a batch-producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.

This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow-up work.

---

## **Design**

### **Transport**

The transport uses the **Arrow IPC Stream format, one batch at a time**.

The JS side encodes each `RecordBatch` into a self-contained IPC Stream
message containing the schema, the batch, and an end-of-stream marker.
The message is returned as a `Buffer` through a napi
`ThreadsafeFunction`. The Rust side decodes it using
`arrow_ipc::reader::StreamReader`.

Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.

I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.

Third-party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode and
decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.

---

### **API**

```ts
class Scannable {
  readonly schema: Schema
  readonly numRows: number | null
  readonly rescannable: boolean

  static fromFactory(schema, factory, opts?)
  static fromTable(table, opts?)
  static fromIterable(schema, iter, opts?)
  static fromRecordBatchReader(reader, opts?)
}
```

The FFI boundary consists of a single callback:

`getNextBatch(isStart: boolean): Promise<Buffer | null>`

`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid-stream (for example, a write retried
after a network error). Without this signal, a retry would resume a
stale iterator and silently skip batches that were already emitted.
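
The reset behavior described above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: `makeGetNextBatch`, `arrayCursor`, `BatchCursor`, and `CursorFactory` are hypothetical names, batches are modeled as strings rather than IPC-encoded `Buffer`s, and the callback is synchronous here for brevity, whereas the real one returns a `Promise`.

```ts
// Hypothetical stand-ins: the real callback is async and returns IPC-encoded
// Buffers. Strings and a synchronous signature keep the sketch short.
type Batch = string;

interface BatchCursor {
  next(): Batch | null; // null signals end of stream
}

type CursorFactory = () => BatchCursor;

// Builds a cursor over a fixed array of batches (illustration only).
function arrayCursor(batches: Batch[]): BatchCursor {
  let i = 0;
  return {
    next: () => (i < batches.length ? batches[i++] : null),
  };
}

// Returns a getNextBatch-style callback. When isStart is true, any cached
// cursor is discarded and the factory is invoked again, so the scan begins
// at batch 0 even if the previous scan stopped partway through.
function makeGetNextBatch(
  factory: CursorFactory,
): (isStart: boolean) => Batch | null {
  let cursor: BatchCursor | null = null;
  return (isStart: boolean): Batch | null => {
    if (isStart || cursor === null) {
      cursor = factory();
    }
    return cursor.next();
  };
}

// Usage: abandon a scan after one batch, then start a new scan.
const getNextBatch = makeGetNextBatch(() => arrayCursor(["b0", "b1", "b2"]));
const first = getNextBatch(true); // new scan, yields batch 0
const restarted = getNextBatch(true); // retry: the factory is re-invoked
const second = getNextBatch(false); // continues the retried scan
```

If the `isStart` check were removed, the second call would return the batch after `first` instead of replaying batch 0, which is the silent-skip failure the signal is meant to prevent.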

In addition, a schema-only IPC buffer is transferred once during
construction.

---

## **Changes**

* `nodejs/src/scannable.rs`
  Adds `NapiScannable` and the `LanceScannable` implementation. Implements
  `schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
  Includes per-batch schema validation against the declared schema,
  one-shot enforcement for non-rescannable sources, and a scan-boundary
  reset signal (`isStart`) so rescannable sources restart from batch 0 on
  every `scan_as_stream` call rather than resuming a stale iterator.

* `nodejs/src/lib.rs`
  Module registration.

* `nodejs/lancedb/scannable.ts`
  Defines the `Scannable` class and the four constructors listed above.
  Each constructor rejects option combinations it cannot honor, for
  example a `rescannable: true` request on a one-shot iterable or reader,
  or a `numRows` that disagrees with an in-memory table's row count.

* `nodejs/lancedb/index.ts`
  Exports the new primitive.

* `nodejs/__test__/scannable.test.ts`
  Test suite for the primitive.
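
The construction-time checks mentioned for `scannable.ts` can be pictured as follows. This is a hedged sketch under stated assumptions, not the code in this PR: `ScannableOptions`, `ResolvedOptions`, `resolveOneShotOptions`, and `resolveTableOptions` are hypothetical names, and the defaults (`rescannable: false` for one-shot sources, `true` for in-memory tables) are assumed for illustration.

```ts
// Hypothetical option and result shapes, for illustration only.
interface ScannableOptions {
  numRows?: number;
  rescannable?: boolean;
}

interface ResolvedOptions {
  numRows: number | null;
  rescannable: boolean;
}

// A one-shot source (iterable or reader) can be consumed only once, so a
// request for rescannable: true cannot be honored and is rejected.
function resolveOneShotOptions(opts: ScannableOptions = {}): ResolvedOptions {
  if (opts.rescannable === true) {
    throw new Error("a one-shot source cannot be marked rescannable");
  }
  return { numRows: opts.numRows ?? null, rescannable: false };
}

// An in-memory table knows its own row count, so a conflicting numRows
// hint is rejected instead of being silently overridden.
function resolveTableOptions(
  tableRows: number,
  opts: ScannableOptions = {},
): ResolvedOptions {
  if (opts.numRows !== undefined && opts.numRows !== tableRows) {
    throw new Error(
      `numRows ${opts.numRows} does not match table row count ${tableRows}`,
    );
  }
  // Assumed default: an in-memory table can be replayed.
  return { numRows: tableRows, rescannable: opts.rescannable ?? true };
}
```

Failing at construction time surfaces the mismatch before any write starts, which is easier to diagnose than an error partway through a scan.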

---

## **Validation**

Before implementing the bridge, I ran an end-to-end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.

The harness covered the following scenarios:

* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe

All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly end
to end and that Node's `Buffer` size limit does not constrain the
overall stream.

Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is partially consumed, the scan is dropped
mid-stream, and the source is then re-scanned. The re-scan replays from
batch 0 rather than resuming the stale iterator. The same harness was
run with the `isStart` reset path disabled, and the mid-stream restart
case failed as expected, confirming that the test exercises the real
regression.

These harnesses are not meant to replace the full test suite, which is
described below.

---

## **Tests**

`__test__/scannable.test.ts` covers construction, metadata reflection,
per-constructor defaults and overrides, construction-time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.

Runtime scan behavior is not exercised here. That includes
`scan_as_stream`, one-shot enforcement on non-rescannable sources,
schema mismatch detection, IPC decode failures, and rescannable restart
semantics. There is no in-tree JS consumer of `NapiScannable` yet. This
mirrors Python's `PyScannable`, which has no dedicated test file and is
covered transitively through the consumers that accept a Scannable.

Runtime coverage will follow in the consumer migration work.

---

## **Status**

Ready for review.

Closes #3223

---

2026-05-14 15:07:41 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| src | feat(nodejs): add Scannable primitive for streaming ingestion (#3271) | 2026-05-14 15:07:41 -07:00 |
| mkdocs.yml | docs: add meth/func names to mkdocstrings (#3101) | 2026-03-06 08:54:45 -08:00 |
| openapi.yml | fix: metric type inconsistency (#2122) | 2025-03-12 10:28:37 -07:00 |
| package-lock.json | feat: support multivector for JS SDK (#2527) | 2025-07-22 21:19:34 +08:00 |
| package.json | chore: convert all js doc test to use snippet. (#881) | 2024-04-05 16:28:56 -07:00 |
| README.md | docs: fix broken documentation links (#3278) | 2026-04-15 20:56:59 +08:00 |
| requirements.txt | docs: fix rendering issues with missing index types in API docs (#3143) | 2026-03-20 09:34:42 -07:00 |
| robots.txt | feat: pare down docs to only show API refs (#2770) | 2025-11-10 12:04:57 -05:00 |
| tsconfig.json | doc: use code snippet for typescript examples (#880) | 2024-04-05 16:28:56 -07:00 |

README.md

LanceDB Documentation

LanceDB docs are available at docs.lancedb.com.

The SDK docs are built and deployed automatically by GitHub Actions whenever a commit is pushed to the main branch, so the docs may show unreleased features.

Building the docs

Setup

1. Install LanceDB Python. See the setup instructions in the Python contributing guide, then run `make develop` to install the Python package.
2. Install the documentation dependencies. From the LanceDB repo root, run `pip install -r docs/requirements.txt`.

Preview the docs

cd docs
mkdocs serve

If you want to just generate the HTML files:

PYTHONPATH=. mkdocs build -f docs/mkdocs.yml

If successful, you should see a docs/site directory that you can verify locally.

Adding examples

To make sure examples are correct, we put examples in test files so they can be run as part of our test suites.

The tests are located at:

* Python: python/python/tests/docs
* TypeScript: nodejs/examples/

Checking Python examples

cd python
pytest -vv python/tests/docs

Checking TypeScript examples

The @lancedb/lancedb package must be built before running the tests:

pushd nodejs
npm ci
npm run build
popd

Then you can run the examples by going to the nodejs/examples directory and running the tests like a normal npm package:

pushd nodejs/examples
npm ci
npm test
popd

API documentation

Python

The Python API documentation is organized based on the file docs/src/python/python.md. We manually add entries there so we can control the organization of the reference page. However, this means any new types must be manually added to the file. No additional steps are needed to generate the API documentation.

TypeScript

The TypeScript API documentation is generated from the TypeScript source code using typedoc.

When new APIs are added, you must manually re-run the typedoc command to update the API documentation, then check the regenerated files into the repository.

pushd nodejs
npm run docs
popd