Commit Graph

28 Commits

Author SHA1 Message Date
Tanay
df4ad9f851 feat(nodejs): add Scannable primitive for streaming ingestion (#3271)
## **Summary**

This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.

A `Scannable` wraps a schema, an optional row count hint, a rescannable
flag, and a batch producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.

This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow up work.

---

## **Design**

### **Transport**

The transport uses the **Arrow IPC Stream format, one batch at a time**.

The JS side encodes each `RecordBatch` into a self contained IPC Stream
message containing schema, batch, and end of stream. The message is
returned as a `Buffer` through a napi `ThreadsafeFunction`. The Rust
side decodes it using `arrow_ipc::reader::StreamReader`.

Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.
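
As a minimal sketch of the one-batch-at-a-time idea (not the bridge's actual code), the example below streams many small messages and tracks the largest one seen. `Batch`, `encodeBatch`, and `decodeBatch` are hypothetical stand-ins: raw float64 bytes take the place of a real Arrow IPC Stream message so that the sketch runs without dependencies.

```typescript
// Stand-in for an Arrow RecordBatch: a single column of numbers.
type Batch = number[];

// Hypothetical codec. The real bridge writes an Arrow IPC Stream message
// (schema + batch + end of stream); raw float64 bytes stand in for it here.
function encodeBatch(batch: Batch): Uint8Array {
  const floats = Float64Array.from(batch);
  return new Uint8Array(floats.buffer);
}

function decodeBatch(message: Uint8Array): Batch {
  // Copy into a fresh buffer so the float64 view starts at offset 0.
  const copy = message.slice();
  return Array.from(new Float64Array(copy.buffer));
}

// Producer: yields one self-contained message per batch, lazily.
function* produce(batches: Iterable<Batch>): Generator<Uint8Array> {
  for (const batch of batches) {
    yield encodeBatch(batch);
  }
}

// Consumer: decodes each message as it arrives and records the largest
// message seen, which is what bounds memory on the producing side.
function consume(messages: Iterable<Uint8Array>): { rows: number; peakBytes: number } {
  let rows = 0;
  let peakBytes = 0;
  for (const message of messages) {
    peakBytes = Math.max(peakBytes, message.byteLength);
    rows += decodeBatch(message).length;
  }
  return { rows, peakBytes };
}

// Lazily generates `count` batches of `rowsPerBatch` rows each.
function* manyBatches(count: number, rowsPerBatch: number): Generator<Batch> {
  for (let i = 0; i < count; i++) {
    yield new Array<number>(rowsPerBatch).fill(i);
  }
}

const streamed = consume(produce(manyBatches(1000, 8)));
```

The total streamed size grows with the number of batches, while the peak message size depends only on the batch size.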

I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.

Third party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode and
decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.

---

### **API**

```ts
class Scannable {
  readonly schema: Schema
  readonly numRows: number | null
  readonly rescannable: boolean

  static fromFactory(schema, factory, opts?)
  static fromTable(table, opts?)
  static fromIterable(schema, iter, opts?)
  static fromRecordBatchReader(reader, opts?)
}
```

The FFI boundary consists of a single callback:

`getNextBatch(isStart: boolean): Promise<Buffer | null>`

`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid stream, for example a retried write
after a network error. Without this signal a retry would resume a stale
iterator and silently skip already emitted batches.

In addition, a schema only IPC buffer is transferred once during
construction.
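
The restart behavior described above can be sketched roughly as follows. This is an illustration of the contract only, not the code in `scannable.ts`: `makeGetNextBatch` and `EncodedBatch` are hypothetical names, and a `Uint8Array` stands in for the `Buffer` the real callback returns.

```typescript
// Stand-in for an IPC-encoded batch; the real callback returns a Node Buffer.
type EncodedBatch = Uint8Array;

// Builds a getNextBatch-style callback around a factory that can be
// re-invoked to start a fresh pass over the data.
function makeGetNextBatch(
  factory: () => Iterator<EncodedBatch>,
): (isStart: boolean) => Promise<EncodedBatch | null> {
  let cached: Iterator<EncodedBatch> | null = null;

  return async (isStart: boolean): Promise<EncodedBatch | null> => {
    if (isStart) {
      // Scan boundary: drop any stale iterator and start again at batch 0.
      cached = factory();
    }
    if (cached === null) {
      throw new Error("getNextBatch called before a scan was started");
    }
    const step = cached.next();
    // null signals end of stream to the consumer.
    return step.done ? null : step.value;
  };
}
```

With this shape, a consumer that abandons a scan after batch 1 and then calls the callback with `isStart` set to `true` receives batch 0 again instead of batch 2.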

---

## **Changes**

* `nodejs/src/scannable.rs`
  Adds `NapiScannable` and the `LanceScannable` implementation. Implements
  `schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
  Includes per batch schema validation against the declared schema, one
  shot enforcement for non rescannable sources, and a scan boundary reset
  signal (`isStart`) so rescannable sources restart from batch 0 on every
  `scan_as_stream` call rather than resuming a stale iterator.

* `nodejs/src/lib.rs`
  Module registration.

* `nodejs/lancedb/scannable.ts`
  Defines the `Scannable` class and the four constructors listed above.
  Each constructor rejects option combinations it cannot honor, for
  example a `rescannable: true` request on a one shot iterable or reader,
  and a `numRows` that disagrees with an in memory table's row count.

* `nodejs/lancedb/index.ts`
  Exports the new primitive.

* `nodejs/__test__/scannable.test.ts`
  Test suite for the primitive.
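
To illustrate the construction time checks mentioned for `scannable.ts`, here is a hedged sketch. `resolveOptions`, `SourceKind`, and the default values it picks are assumptions made for the example and may not match the real constructors.

```typescript
interface ScannableOptions {
  numRows?: number;
  rescannable?: boolean;
}

// Hypothetical tag for the four kinds of source the constructors accept.
type SourceKind = "factory" | "table" | "iterable" | "reader";

// Validates the requested options against what the source can honor and
// fills in defaults. The defaults are assumptions for this sketch.
function resolveOptions(
  kind: SourceKind,
  opts: ScannableOptions = {},
  tableRows?: number,
): { numRows: number | null; rescannable: boolean } {
  const oneShot = kind === "iterable" || kind === "reader";

  if (oneShot && opts.rescannable === true) {
    throw new Error(`a ${kind} source is one shot, so rescannable: true cannot be honored`);
  }
  if (
    kind === "table" &&
    tableRows !== undefined &&
    opts.numRows !== undefined &&
    opts.numRows !== tableRows
  ) {
    throw new Error(`numRows ${opts.numRows} disagrees with the table's ${tableRows} rows`);
  }

  // An in-memory table knows its own row count; other sources only have the hint.
  const inferredRows = kind === "table" && tableRows !== undefined ? tableRows : null;
  return {
    numRows: opts.numRows ?? inferredRows,
    rescannable: opts.rescannable ?? !oneShot,
  };
}
```

Failing at construction keeps a bad option combination from surfacing later as a confusing mid-scan error.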

---

## **Validation**

Before implementing the bridge, I ran an end to end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.

The harness covered the following scenarios:

* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe

All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly end
to end and that Node's `Buffer` size limit does not constrain the
overall stream.

Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is consumed partially and the scan is
dropped mid stream, then re-scanned. The re-scan replays from batch 0
rather than resuming the stale iterator. The same harness was run with
the `isStart` reset path disabled and the mid stream restart case failed
as expected, confirming the test exercises the real regression.

These harnesses are not meant to replace the full test suite, which is
described below.

---

## **Tests**

`__test__/scannable.test.ts` covers construction, metadata reflection,
per constructor defaults and overrides, construction time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.

Runtime scan behavior including `scan_as_stream`, one shot enforcement
on non rescannable sources, schema mismatch detection, IPC decode
failures, and rescannable restart semantics is not exercised here. There
is no in tree JS consumer of `NapiScannable` yet. This mirrors Python's
`PyScannable`, which has no dedicated test file and is covered
transitively through the consumers that accept a Scannable.

Runtime coverage will follow in the consumer migration work.

---

## **Status**

Ready for review.

Closes #3223

---
2026-05-14 15:07:41 -07:00
Brendan Clement
9330a9b851 feat(nodejs): expose connectNamespace for namespace-backed connections (#3383)
### Summary 

Adds `connectNamespace(implName, properties, options?)` to the Node.js
SDK. Closes #3380.

### Testing
- pnpm test
- Ran smoke test

```
import { connectNamespace } from "lancedb";
import { tmpdir } from "os";
import { mkdtempSync } from "fs";
import { join } from "path";

const dir = mkdtempSync(join(tmpdir(), "lancedb-connect-namespace-smoke-"));
console.log(`Using temp dir: ${dir}\n`);

// 1. Happy path: connect via the "dir" namespace impl, create + list a table.
console.log('Connecting via connectNamespace("dir", { root })...');
const db = await connectNamespace("dir", { root: dir });
console.log("  ✓ connected:", db.display());

console.log("Creating a table and listing it...");
await db.createTable("users", [
  { id: 1, name: "alice" },
  { id: 2, name: "bob" },
]);
console.log("  ✓ tableNames ->", await db.tableNames());

const table = await db.openTable("users");
console.log("  ✓ users.countRows ->", await table.countRows());

// 2. Storage options pass-through.
console.log("\nReconnecting with storageOptions (plumbing check)...");
const dbWithOpts = await connectNamespace(
  "dir",
  { root: dir },
  { storageOptions: { newTableDataStorageVersion: "stable" } },
);
console.log("  ✓ connected with storageOptions:", dbWithOpts.display());
await dbWithOpts.close();

// 3. Empty implName -> clear error.
console.log("\nCalling connectNamespace('', {}) (expect error)...");
try {
  await connectNamespace("", {});
  console.error("  UNEXPECTED: empty implName did not throw");
} catch (err) {
  console.log(`  ✓ Got expected error: ${err.message.split("\n")[0]}`);
}

// 4. Unknown impl -> error.
console.log("\nCalling connectNamespace('not-a-real-impl', {}) (expect error)...");
try {
  await connectNamespace("not-a-real-impl", {});
  console.error("  UNEXPECTED: unknown impl did not throw");
} catch (err) {
  console.log(`  ✓ Got expected error: ${err.message.split("\n")[0]}`);
}

// 5. Create a table inside a child namespace, then reconnect with a fresh
//    connectNamespace call and confirm the table is reachable via that
//    namespace path. (The dir+manifest impl keeps the namespace hierarchy in
//    a root manifest, so "scoping" happens via namespacePath args, not by
//    pointing root at a subdir.)
console.log("\nCreating a table inside a child namespace...");
const dir2 = mkdtempSync(join(tmpdir(), "lancedb-connect-namespace-smoke-"));
const writer = await connectNamespace("dir", {
  root: dir2,
  manifest_enabled: "true",
});
await writer.createNamespace(["analytics"]);
await writer.createTable(
  "orders",
  [
    { id: 1, total: 10 },
    { id: 2, total: 20 },
  ],
  ["analytics"],
);
console.log(
  "  ✓ writer sees tables under [analytics] ->",
  await writer.tableNames(["analytics"]),
);
await writer.close();

console.log("Reconnecting and reading the table via its namespace path...");
const reader = await connectNamespace("dir", {
  root: dir2,
  manifest_enabled: "true",
});
console.log(
  "  ✓ reader tableNames(['analytics']) ->",
  await reader.tableNames(["analytics"]),
);
const orders = await reader.openTable("orders", ["analytics"]);
console.log("  ✓ orders.countRows via reader ->", await orders.countRows());
await reader.close();

await db.close();
console.log("\nAll checks passed.");
```

```
Using temp dir: /var/folders/bj/hn6jv9c50y301d1nx0y8xmn00000gn/T/lancedb-connect-namespace-smoke-WByF1P

Connecting via connectNamespace("dir", { root })...
  ✓ connected: LanceNamespaceDatabase
Creating a table and listing it...
  ✓ tableNames -> [ 'users' ]
  ✓ users.countRows -> 2

Reconnecting with storageOptions (plumbing check)...
  ✓ connected with storageOptions: LanceNamespaceDatabase

Calling connectNamespace('', {}) (expect error)...
  ✓ Got expected error: implName must be a non-empty string

Calling connectNamespace('not-a-real-impl', {}) (expect error)...
  ✓ Got expected error: Invalid input, Failed to connect to namespace: Namespace { source: Unsupported { message: "Implementation 'not-a-real-impl' is not available. Supported: dir, rest" }, location: Location { file: "/Users/brendan/.cargo/git/checkouts/lance-8ddea23c38163eda/f693245/rust/lance-namespace-impls/src/connect.rs", line: 216, column: 14 } }

Creating a table inside a child namespace...
  ✓ writer sees tables under [analytics] -> [ 'orders' ]
Reconnecting and reading the table via its namespace path...
  ✓ reader tableNames(['analytics']) -> [ 'orders' ]
  ✓ orders.countRows via reader -> 2

All checks passed.
```

### Docs
- regenerated docs
2026-05-13 16:16:56 -07:00
Jack Ye
25dfe2cfd4 feat: add manifest-enabled directory namespace mode (#3332)
Adds manifest_enabled for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
2026-04-29 09:22:06 -07:00
Gezi-lzq
10879d99b8 docs: fix broken documentation links (#3278) 2026-04-15 20:56:59 +08:00
Will Jones
367262662d feat(nodejs): upgrade napi-rs from v2 to v3 (#3057)
## Summary

- Upgrades `@napi-rs/cli` from v2 to v3, `napi`/`napi-derive` Rust
crates to 3.x
- Fixes a bug
([napi-rs#1170](https://github.com/napi-rs/napi-rs/issues/1170)) where
the CLI failed to locate the built `.node` binary when a custom Cargo
target directory is set (via `config.toml`)

## Changes

**package.json / CLI**:
- `napi.name` → `napi.binaryName`, `napi.triples` → `napi.targets`
- Removed `--no-const-enum` flag and fixed output dir arg
- `napi universal` → `napi universalize`

**Rust API migration**:
- `#[napi::module_init]` → `#[napi_derive::module_init]`
- `napi::JsObject` → `Object`, `.get::<_, T>()` → `.get::<T>()`
- `ErrorStrategy` removed; `ThreadsafeFunction` now takes an explicit
`Return` type with `CalleeHandled = false` const generic
- `JsFunction` + `create_threadsafe_function` replaced by typed
`Function<Args, Return>` + `build_threadsafe_function().build()`
- `RerankerCallbacks` struct removed (`Function<'env,...>` can't be
stored in structs); `VectorQuery::rerank` now accepts the function
directly
- `ClassInstance::clone()` now returns `ClassInstance`, fixed with
explicit deref
- `Vec<u8>` in `#[napi(object)]` now maps to `Array<number>` in v3;
changed to `Buffer` to preserve the TypeScript `Buffer` type

**TypeScript**:
- `inner.rerank({ rerankHybrid: async (_, args) => ... })` →
`inner.rerank(async (args) => ...)`
- Header provider callback wrapped in `async` to match stricter typed
constructor signature

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-23 14:42:55 -08:00
Prashanth Rao
135dfdc7ec docs: 404 and outdated URLs should now work (#2800)
Did a full scan of all URLs that used to point to the old mkdocs pages,
and now links to the appropriate pages on lancedb.com/docs or lance.org
docs.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-20 11:14:20 -08:00
Weston Pace
5a19cf15a6 feat: a utility for creating "permutation views" (#2552)
I'm working on a lancedb version of pytorch data loading (and hopefully
addressing https://github.com/lancedb/lance/issues/3727).

However, rather than rely on pytorch for everything I'm moving some of
the things that pytorch does into rust. This gives us more control over
data loading (e.g. using shards or a hash-based split) and it allows
permutations to be persistent. In particular I hope to be able to:

* Create a persistent permutation
* This permutation can handle splits, filtering, shuffling, and sharding
* Create a rust data loader that can read a permutation (one or more
splits), or a subset of a permutation (for DDP)
* Create a python data loader that delegates to the rust data loader

Eventually create integrations for other data loading libraries,
including rust & node
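
As an illustration of what a hash-based split could look like (a sketch only, written in TypeScript rather than the Rust this work targets), the example below assigns each row to a split deterministically. `fnv1a` and `assignSplit` are hypothetical helpers, and the hash runs over UTF-16 code units instead of bytes for brevity.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

interface SplitSpec {
  name: string;
  weight: number;
}

// Maps a row id to one split. The same (seed, rowId) pair always lands in
// the same split, so the assignment can be recomputed instead of stored.
function assignSplit(rowId: string, splits: SplitSpec[], seed = ""): string {
  const total = splits.reduce((sum, split) => sum + split.weight, 0);
  // Scale the hash into [0, total).
  const point = (fnv1a(seed + rowId) / 0x100000000) * total;
  let upperBound = 0;
  for (const split of splits) {
    upperBound += split.weight;
    if (point < upperBound) {
      return split.name;
    }
  }
  return splits[splits.length - 1].name; // guard against rounding at the edge
}

const exampleSplits: SplitSpec[] = [
  { name: "train", weight: 0.8 },
  { name: "test", weight: 0.2 },
];
const exampleAssignment = assignSplit("row-42", exampleSplits);
```

Because the assignment depends only on the row id and the seed, adding rows later does not move existing rows between splits.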
2025-10-09 18:07:31 -07:00
Jack Ye
8da74dcb37 feat: support per-request header override (#2631)
## Summary

This PR introduces a `HeaderProvider` which is called for all remote
HTTP calls to get the latest headers to inject. This is useful for
features like adding the latest auth tokens, where the header provider
can auto-refresh tokens internally so that each request always sets the
refreshed token.
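
A rough sketch of the auto-refresh pattern this enables is shown below. `RefreshingHeaderProvider`, `Token`, and `getHeaders` are hypothetical names chosen for the example and are not the interface added by this PR; the token fetcher and clock are injected so the sketch stays self-contained.

```typescript
interface Token {
  value: string;
  expiresAtMs: number;
}

// Hypothetical provider: hands out headers for each request and refreshes
// the token shortly before it expires.
class RefreshingHeaderProvider {
  private token: Token | null = null;
  private readonly fetchToken: () => Token;
  private readonly now: () => number;
  private readonly refreshMarginMs: number;

  constructor(
    fetchToken: () => Token,
    now: () => number = () => Date.now(),
    refreshMarginMs: number = 30000,
  ) {
    this.fetchToken = fetchToken;
    this.now = now;
    this.refreshMarginMs = refreshMarginMs;
  }

  // Called once per outgoing request.
  getHeaders(): Record<string, string> {
    let token = this.token;
    if (token === null || this.now() >= token.expiresAtMs - this.refreshMarginMs) {
      token = this.fetchToken();
      this.token = token;
    }
    return { authorization: `Bearer ${token.value}` };
  }
}
```

Callers never see an expired token, because the decision to refresh is made on every call rather than once at connection time.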

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-10 13:44:00 -07:00
Will Jones
3d1f102087 feat: allow Python and Typescript users to create Sessions (#2530)
## Summary
- Exposes `Session` in Python and Typescript so users can set the
`index_cache_size_bytes` and `metadata_cache_size_bytes`
* The `Session` is attached to the `Connection`, and thus shared across
all tables in that connection.
- Adds deprecation warnings for table-level cache configuration


🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-24 12:06:29 -07:00
Will Jones
b3a4efd587 fix: revert change default read_consistency_interval=5s (#2327)
This reverts commit a547c523c2 (#2281).

The current implementation can cause panics and performance degradation.
I will bring this back with more testing in
https://github.com/lancedb/lancedb/pull/2311

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

- **Documentation**
  - Enhanced clarity on read consistency settings with updated
    descriptions and default behavior.
  - Removed outdated warnings about eventual consistency from the
    troubleshooting guide.

- **Refactor**
  - Streamlined the handling of the read consistency interval across
    integrations, now defaulting to "None" for improved performance.
  - Simplified internal logic to offer a more consistent experience.

- **Tests**
  - Updated test expectations to reflect the new default representation
    for the read consistency interval.
  - Removed redundant tests related to "no consistency" settings for
    streamlined testing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-04-14 08:48:15 -07:00
Will Jones
a547c523c2 feat!: change default read_consistency_interval=5s (#2281)
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.

Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users' expectations.
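
The three cases above can be summarized in a small decision function. This is a sketch of the behavior as described in this message, not the Rust implementation; `refreshMode` and its return labels are made up for the example, and treating an unelapsed interval as "skip" is my assumption about how the interval is applied.

```typescript
type RefreshMode = "never" | "blocking" | "background" | "skip";

// intervalSecs mirrors read_consistency_interval: null means unset (None).
function refreshMode(intervalSecs: number | null, secsSinceLastCheck: number): RefreshMode {
  if (intervalSecs === null) {
    return "never"; // never check for a new table version
  }
  if (intervalSecs === 0) {
    return "blocking"; // strong consistency: check on every read, under the lock
  }
  // Assumed: only look for a new version once the interval has elapsed,
  // and load it in the background without blocking readers.
  return secsSinceLastCheck >= intervalSecs ? "background" : "skip";
}
```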
2025-03-28 11:04:31 -07:00
Will Jones
e05c0cd87e ci(node): check docs in CI (#2084)
* Make `npm run docs` fail if there are any warnings. This will catch
items missing from the API reference.
* Add a check in our CI to make sure `npm run docs` runs without warnings
and doesn't generate any new files (indicating it might be out-of-date).
* Hide constructors that aren't user facing.
* Remove unused enum `WriteMode`.

Closes #2068
2025-01-30 16:06:06 -08:00
Will Jones
15f8f4d627 ci: check license headers (#2076)
Based on the same workflow in Lance.
2025-01-29 08:27:07 -08:00
Bert
c9f248b058 feat: add hybrid search to node and rust SDKs (#1940)
Support hybrid search in both rust and node SDKs.

- Adds a new rerankers package to rust LanceDB, with the implementation
of the default RRF reranker
- Adds a new hybrid package to lancedb, with some helper methods related
to hybrid search such as normalizing scores and converting score column
to rank columns
- Adds the capability for LanceDB VectorQuery to perform hybrid search if
it has both nearest vector and full text search parameters.
- Adds wrappers for reranker implementations to nodejs SDK.

Additional rerankers will be added in follow-up PRs.
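
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of the general technique in TypeScript. It is not the Rust reranker added in this PR: `reciprocalRankFusion` is a hypothetical helper, it works on plain id lists rather than Arrow batches, and `k = 60` is simply the constant commonly used in the literature.

```typescript
// Each ranking is a list of ids ordered best-first, for example one from
// vector search and one from full text search.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const fused = reciprocalRankFusion([
  ["a", "b", "c"], // hypothetical vector search order
  ["c", "a", "d"], // hypothetical full text search order
]);
```

Only ranks enter the formula, which is why converting score columns to rank columns is a useful helper: the raw vector distances and text scores never need to be on the same scale.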

https://github.com/lancedb/lancedb/issues/1921

---
Notes about how the rust rerankers are wrapped for calling from JS:

I wanted to keep the core reranker logic, and the invocation of the
reranker by the query code, in Rust. This aligns with the philosophy of
the new node SDK where it's just a thin wrapper around Rust. However, I
also wanted to have support for users who want to add custom rerankers
written in Javascript.

When we add a reranker to the query from Javascript, it adds a special
Rust reranker that has a callback to the Javascript code (which could
then turn around and call an underlying Rust reranker implementation if
desired). This adds a bit of complexity, but overall I think it moves us
in the right direction of having the majority of the query logic in the
underlying Rust SDK while keeping the option open to support custom
Javascript Rerankers.
2024-12-30 09:03:41 -05:00
Will Jones
b9921d56cc fix(node): update default log level to warn (#1801)
🤦
2024-11-06 09:13:53 -08:00
Will Jones
a324f4ad7a feat(node): enable logging and show full errors (#1775)
This exposes the `LANCEDB_LOG` environment variable in node, so that
users can now turn on logging.

In addition, fixes a bug where only the top-level error from Rust was
being shown. This PR makes sure the full error chain is included in the
error message. In the future, we will improve this so the error chain is
set on the [cause](https://nodejs.org/api/errors.html#errorcause)
property of JS errors (see https://github.com/lancedb/lancedb/issues/1779).
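
To make the idea of a "full error chain" concrete, the sketch below flattens a chain of nested causes into one message. It is illustrative only: `formatErrorChain` and the `Caused by:` separator are assumptions, not the format this PR emits from Rust.

```typescript
// Minimal shape of an error that may carry a nested cause.
interface ChainedError {
  message: string;
  cause?: unknown;
}

// Walks the cause chain and joins every message, top-level error first.
function formatErrorChain(error: ChainedError): string {
  const parts: string[] = [];
  const seen = new Set<unknown>(); // guards against cyclic cause chains
  let current: unknown = error;
  while (typeof current === "object" && current !== null && !seen.has(current)) {
    seen.add(current);
    const { message, cause } = current as ChainedError;
    parts.push(String(message));
    current = cause;
  }
  return parts.join("\nCaused by: ");
}

const exampleChain = formatErrorChain({
  message: "Failed to open table",
  cause: { message: "IO error", cause: { message: "permission denied" } },
});
```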

Fixes #1774
2024-10-29 15:13:34 -07:00
Will Jones
f3b6a1f55b feat(node): bind remote SDK to rust implementation (#1730)
Closes [#2509](https://github.com/lancedb/sophon/issues/2509)

This is the Node.js analogue of #1700
2024-10-09 11:46:27 -06:00
Cory Grinstead
b3e5ac6d2a feat(nodejs): feature parity [2/N] - add table.name and lancedb.connect({args}) (#1380)
depends on https://github.com/lancedb/lancedb/pull/1378

see proper diff here
https://github.com/universalmind303/lancedb/compare/remote-table-node...universalmind303:lancedb:table-name
2024-06-21 11:38:26 -05:00
Cory Grinstead
bc19a75f65 feat(nodejs): merge insert (#1351)
closes https://github.com/lancedb/lancedb/issues/1349
2024-06-11 15:05:15 -05:00
Weston Pace
d5586c9c32 feat: make it possible to opt in to using the v2 format (#1352)
This also exposed the max_batch_length configuration option in
python/node (it was needed to verify if we are actually in v2 mode or
not)
2024-06-04 21:52:14 -07:00
Will Jones
1d23af213b feat: expose storage options in LanceDB (#1204)
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.

1. Fixes #1168
2. Closes #1165
3. Closes #1082
4. Closes #439
5. Closes #897
6. Closes #642
7. Closes #281
8. Closes #114
9. Closes #990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
2024-04-10 10:12:04 -07:00
Weston Pace
4180b44472 feat: refactor the query API and add query support to the python async API (#1113)
In addition, there are also a number of changes in nodejs to the
docstrings of existing methods because this PR adds a jsdoc linter.
2024-04-05 16:32:47 -07:00
Weston Pace
f822255683 feat: add create_index to the async python API (#1052)
This also refactors the rust lancedb index builder API (and,
correspondingly, the nodejs API)
2024-04-05 16:32:14 -07:00
Weston Pace
8033a44d68 feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying, so
I changed it a bit to ensure we clean up tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:31:36 -07:00
Will Jones
c5b0934bfb feat(node): add read_consistency_interval to Node and Rust (#1002)
This PR adds the same consistency semantics as were added in #828. It
*does not* add the same lazy-loading of tables, since that breaks some
existing tests.

This closes #998.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:30:40 -07:00
Lei Xu
cef0293985 feat(napi): Issue queries as node SDK (#868)
* Query as a fluent API and `AsyncIterator<RecordBatch>`
* Much more docs
* Add tests for auto infer vector search columns with different
dimensions.
2024-04-05 16:28:18 -07:00
Lei Xu
db4a979278 feat(napi): Provide a new createIndex API in the napi SDK. (#857) 2024-04-05 16:27:51 -07:00
Lei Xu
efcaa433fe feat: rework NodeJS SDK using napi (#847)
Use Napi to write a Node.js SDK that follows Polars for better
maintainability, while keeping most of the logic in Rust.
2024-04-05 16:27:51 -07:00