mirror of https://github.com/lancedb/lancedb.git synced 2026-05-15 11:00:41 +00:00

Tanay df4ad9f851 feat(nodejs): add Scannable primitive for streaming ingestion (#3271)

## **Summary**

This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.

A `Scannable` wraps a schema, an optional row count hint, a rescannable
flag, and a batch-producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.

This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow-up work.

---

## **Design**

### **Transport**

The transport uses the **Arrow IPC Stream format, one batch at a time**.

The JS side encodes each `RecordBatch` into a self-contained IPC Stream
message containing the schema, the batch, and an end-of-stream marker.
The message is returned as a `Buffer` through a napi
`ThreadsafeFunction`. The Rust side decodes it using
`arrow_ipc::reader::StreamReader`.

Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.
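
The bounded-memory property comes from the pull order rather than from Arrow itself. The sketch below illustrates it in plain TypeScript under stated assumptions: `PullFn`, `drain`, and `makeProducer` are hypothetical names, a `Uint8Array` stands in for the encoded `Buffer`, and the real consumer is the Rust side of the bridge, not JS code like this.

```typescript
// Stand-in for the batch callback: resolves to one encoded message, or
// null at end of stream. (Hypothetical shape, for illustration only.)
type PullFn = () => Promise<Uint8Array | null>;

// Pulls messages one at a time. The next message is requested only after
// the current one has been handled, so at most one message is held.
async function drain(
  pull: PullFn,
  handle: (msg: Uint8Array) => Promise<void>,
): Promise<{ totalBytes: number; peakBytes: number }> {
  let totalBytes = 0;
  let peakBytes = 0;
  let msg = await pull();
  while (msg !== null) {
    totalBytes += msg.byteLength;
    peakBytes = Math.max(peakBytes, msg.byteLength);
    await handle(msg); // finish with this message first
    msg = await pull(); // only then ask for the next one
  }
  return { totalBytes, peakBytes };
}

// Hypothetical producer: `count` messages of `size` bytes each, allocated
// lazily so that nothing exists before it is requested.
function makeProducer(count: number, size: number): PullFn {
  let sent = 0;
  return async () => {
    if (sent >= count) return null;
    sent += 1;
    return new Uint8Array(size);
  };
}
```

With this shape, the total number of bytes moved can grow without bound while the largest buffer alive at any moment stays at the size of one message.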

I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.

Third-party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode step and
one decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.

---

### **API**

```ts
class Scannable {
  readonly schema: Schema
  readonly numRows: number | null
  readonly rescannable: boolean

  static fromFactory(schema, factory, opts?)
  static fromTable(table, opts?)
  static fromIterable(schema, iter, opts?)
  static fromRecordBatchReader(reader, opts?)
}
```

The FFI boundary consists of a single callback:

`getNextBatch(isStart: boolean): Promise<Buffer | null>`

`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid-stream, for example when a write is
retried after a network error. Without this signal, a retry would resume
a stale iterator and silently skip batches that were already emitted.

In addition, a schema-only IPC buffer is transferred once during
construction.
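
The `isStart` contract described above can be sketched in a few lines. This is a minimal illustration, not the code in `nodejs/lancedb/scannable.ts`: `EncodedBatch`, `BatchFactory`, `makeGetNextBatch`, and `threeBatches` are hypothetical names, and a `Uint8Array` stands in for the `Buffer` that the real callback returns.

```typescript
// Stand-in for one encoded IPC message (the real bridge returns a Buffer).
type EncodedBatch = Uint8Array;

// Hypothetical factory shape: every call starts a fresh pass over the data.
type BatchFactory = () => AsyncIterator<EncodedBatch>;

function makeGetNextBatch(
  factory: BatchFactory,
): (isStart: boolean) => Promise<EncodedBatch | null> {
  let current: AsyncIterator<EncodedBatch> | null = null;

  return async (isStart) => {
    // A new scan drops whatever iterator an earlier scan left behind,
    // even if that scan stopped mid-stream.
    if (isStart || current === null) {
      current = factory();
    }
    const step = await current.next();
    // null tells the consumer that the stream has ended.
    return step.done ? null : step.value;
  };
}

// Example source: three one-byte "batches" tagged 0, 1, 2.
async function* threeBatches(): AsyncGenerator<EncodedBatch> {
  for (let i = 0; i < 3; i++) {
    yield Uint8Array.of(i);
  }
}
```

If the `isStart` branch were removed, a second scan that begins after a partial first scan would continue from the cached iterator and start at batch 2 instead of batch 0, which is the silent-skip failure described above.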

---

## **Changes**

* `nodejs/src/scannable.rs`
  Adds `NapiScannable` and the `LanceScannable` implementation. Implements
  `schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
  Includes per-batch schema validation against the declared schema,
  one-shot enforcement for non-rescannable sources, and a scan boundary
  reset signal (`isStart`) so rescannable sources restart from batch 0 on
  every `scan_as_stream` call rather than resuming a stale iterator.

* `nodejs/src/lib.rs`
  Module registration.

* `nodejs/lancedb/scannable.ts`
  Defines the `Scannable` class and the four constructors listed above.
  Each constructor rejects option combinations it cannot honor, for
  example a `rescannable: true` request on a one-shot iterable or reader,
  or a `numRows` that disagrees with an in-memory table's row count.

* `nodejs/lancedb/index.ts`
  Exports the new primitive.

* `nodejs/__test__/scannable.test.ts`
  Test suite for the primitive.
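
The construction-time checks described for `nodejs/lancedb/scannable.ts` might look roughly like the sketch below. Everything here is an assumption made for illustration: the option and metadata shapes, the helper names `resolveOneShotOptions` and `resolveTableOptions`, the error messages, and the defaults (in particular, treating an in-memory table as rescannable unless told otherwise) are not taken from the PR.

```typescript
// Hypothetical option and metadata shapes; the real signatures may differ.
interface ScannableOptions {
  numRows?: number;
  rescannable?: boolean;
}

interface ScannableMeta {
  numRows: number | null;
  rescannable: boolean;
}

// One-shot sources (a plain iterable or a reader) cannot be replayed, so a
// request for rescannable: true is rejected instead of silently ignored.
function resolveOneShotOptions(opts: ScannableOptions = {}): ScannableMeta {
  if (opts.rescannable === true) {
    throw new Error("a one-shot source cannot be marked rescannable");
  }
  return { numRows: opts.numRows ?? null, rescannable: false };
}

// An in-memory table knows its own row count, so a conflicting hint is an
// error. The rescannable default of true is an assumption of this sketch.
function resolveTableOptions(
  tableNumRows: number,
  opts: ScannableOptions = {},
): ScannableMeta {
  if (opts.numRows !== undefined && opts.numRows !== tableNumRows) {
    throw new Error(
      `numRows ${opts.numRows} disagrees with table row count ${tableNumRows}`,
    );
  }
  return { numRows: tableNumRows, rescannable: opts.rescannable ?? true };
}
```

Failing at construction keeps a misdeclared source from reaching the write path, where the same mistake would only surface partway through a scan.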

---

## **Validation**

Before implementing the bridge, I ran an end-to-end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.

The harness covered the following scenarios:

* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe

All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly
end-to-end and that Node's `Buffer` size limit does not constrain the
overall stream.

Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is consumed partially, the scan is dropped
mid-stream, and the source is then re-scanned. The re-scan replays from
batch 0 rather than resuming the stale iterator. The same harness was
run with the `isStart` reset path disabled, and the mid-stream restart
case failed as expected, confirming the test exercises the real
regression.

These harnesses are not meant to replace the full test suite, which is
described below.

---

## **Tests**

`__test__/scannable.test.ts` covers construction, metadata reflection,
per-constructor defaults and overrides, construction-time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.

Runtime scan behavior, including `scan_as_stream`, one-shot enforcement
on non-rescannable sources, schema mismatch detection, IPC decode
failures, and rescannable restart semantics, is not exercised here.
There is no in-tree JS consumer of `NapiScannable` yet. This mirrors
Python's `PyScannable`, which has no dedicated test file and is covered
transitively through the consumers that accept a Scannable.

Runtime coverage will follow in the consumer migration work.

---

## **Status**

Ready for review.

Closes #3223

---

2026-05-14 15:07:41 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| .cargo | chore: clippy::string_to_string has been replaced by implicit_clone (#2817) | 2025-11-26 16:30:35 +08:00 |
| .github | ci(nodejs): switch from npm to pnpm 11 (#3373) | 2026-05-13 11:27:38 -07:00 |
| | ci: modify check_lance_release.py to prefer stable releases over betas (#3146) | 2026-03-17 09:21:30 -07:00 |
| dockerfiles | refactor(python): remove legacy tantivy FTS support (#3282) | 2026-04-20 09:28:45 +08:00 |
| docs | feat(nodejs): add Scannable primitive for streaming ingestion (#3271) | 2026-05-14 15:07:41 -07:00 |
| java | chore: update lance dependency to v7.0.0-beta.7 (#3356) | 2026-05-07 16:04:38 -07:00 |
| nodejs | feat(nodejs): add Scannable primitive for streaming ingestion (#3271) | 2026-05-14 15:07:41 -07:00 |
| python | feat(python): add IVF_HNSW_FLAT vector index support (#3366) | 2026-05-11 15:08:32 -07:00 |
| rust | feat(python): add IVF_HNSW_FLAT vector index support (#3366) | 2026-05-11 15:08:32 -07:00 |
| .bumpversion.toml | Bump version: 0.28.0-beta.10 → 0.28.0-beta.11 | 2026-04-29 17:53:49 +00:00 |
| .gitignore | feat: bump lance version to 0.40-0-beta.2 (#2772) | 2025-11-10 14:36:37 -08:00 |
| .pre-commit-config.yaml | fix(python): typing (#2167) | 2025-03-10 09:01:23 -07:00 |
| about.hbs | feat: add third party licenses lists (#3010) | 2026-02-09 16:16:46 -08:00 |
| about.toml | feat: add third party licenses lists (#3010) | 2026-02-09 16:16:46 -08:00 |
| AGENTS.md | ci: add agents and add reviewing instructions (#2754) | 2025-10-29 17:28:26 -07:00 |
| Cargo.lock | feat(nodejs): add namespace management methods on Connection (#3371) | 2026-05-13 11:49:27 -07:00 |
| Cargo.toml | chore: update lance dependency to v7.0.0-beta.7 (#3356) | 2026-05-07 16:04:38 -07:00 |
| CLAUDE.md | ci: add agents and add reviewing instructions (#2754) | 2025-10-29 17:28:26 -07:00 |
| CONTRIBUTING.md | docs: contributing guide (#1970) | 2025-01-07 15:11:16 -08:00 |
| deny.toml | feat(python): support model-backed native FTS tokenizers (#3289) | 2026-05-08 23:53:14 +08:00 |
| docker-compose.yml | fix(ci): upgrade LocalStack to 4.0 for S3 integration tests (#3147) | 2026-03-16 09:02:11 -07:00 |
| LICENSE | initial commit | 2023-03-17 18:15:19 -07:00 |
| Makefile | feat: add third party licenses lists (#3010) | 2026-02-09 16:16:46 -08:00 |
| pyright_report.csv | fix(python): typing (#2167) | 2025-03-10 09:01:23 -07:00 |
| README.md | docs: fix broken documentation links (#3278) | 2026-04-15 20:56:59 +08:00 |
| release_process.md | ci: enable java auto release (#1602) | 2024-09-19 10:51:03 -07:00 |
| RUST_THIRD_PARTY_LICENSES.html | feat: add third party licenses lists (#3010) | 2026-02-09 16:16:46 -08:00 |
| rust-toolchain.toml | chore: bump Rust toolchain from 1.91.0 to 1.94.0 (#3257) | 2026-04-10 07:57:47 -07:00 |

README.md

The Multimodal AI Lakehouse

How to Install ✦ Detailed Documentation ✦ Tutorials and Recipes ✦ Contributors

The ultimate multimodal data platform for AI/ML applications.

LanceDB is designed for fast, scalable, and production-ready vector search. It is built on top of the Lance columnar format. You can store, index, and search over petabytes of multimodal data and vectors with ease. LanceDB is a central location where developers can build, train and analyze their AI workloads.

Demo: Multimodal Search by Keyword, Vector or with SQL

Star LanceDB to get updates!

⭐ Click here ⭐ to see how fast we're growing!

Key Features:

* Fast Vector Search: Search billions of vectors in milliseconds with state-of-the-art indexing.
* Comprehensive Search: Support for vector similarity search, full-text search and SQL.
* Multimodal Support: Store, query and filter vectors, metadata and multimodal data (text, images, videos, point clouds, and more).
* Advanced Features: Zero-copy access and automatic versioning, so you can manage versions of your data without extra infrastructure. GPU support for building vector indexes.

Products:

* Open Source & Local: 100% open source, runs locally or in your cloud. No vendor lock-in.
* Cloud and Enterprise: Production-scale vector search with no servers to manage. Complete data sovereignty and security.

Ecosystem:

* Columnar Storage: Built on the Lance columnar format for efficient storage and analytics.
* Seamless Integration: Python, Node.js, Rust, and REST APIs for easy integration. Native Python and JavaScript/TypeScript support.
* Rich Ecosystem: Integrations with LangChain 🦜️🔗, LlamaIndex 🦙, Apache Arrow, Pandas, Polars, DuckDB and more on the way.

How to Install:

Follow the Quickstart doc to set up LanceDB locally.

API & SDK: We also support Python, TypeScript and Rust SDKs

| Interface | Documentation |
| --- | --- |
| Python SDK | https://lancedb.github.io/lancedb/python/python/ |
| TypeScript SDK | https://lancedb.github.io/lancedb/js/globals/ |
| Rust SDK | https://docs.rs/lancedb/latest/lancedb/index.html |
| REST API | https://docs.lancedb.com/api-reference/rest |

Join Us and Contribute

We welcome contributions from everyone, whether you're a developer, a researcher, or just someone who wants to help out.

If you have any suggestions or feature requests, please feel free to open an issue on GitHub or discuss it on our Discord server.

Check out the GitHub Issues if you would like to work on the features that are planned for the future.

Contributors

Stay in Touch With Us

Languages

HTML 39.5%

Rust 29%

Python 23%

TypeScript 8%

Shell 0.3%

Other 0.1%
