BREAKING CHANGE: When passing multiple where clauses to a query, they
now stack instead of replacing the previous filter.
Previously, calling `where`/`only_if` more than once on a query silently
replaced the previous filter, so only the last filter was applied. This
was
surprising and could return rows that an earlier filter should have
excluded.
This implements the alternative suggested in
https://github.com/lancedb/lancedb/pull/3514#issuecomment-4664901580:
instead of
rejecting a second filter, repeated filters are combined with a logical
AND
(`(previous) AND (new)`).
The combination happens in the Rust core (`QueryBase::only_if` and
`only_if_expr`), so it applies to all SDKs at once (Rust, Python async,
and
TypeScript). The Python sync query builder keeps its own filter state,
so it
combines filters in the binding layer as well.
SQL string and expression filters are combined within their own
representation.
When the two representations are mixed, the expression is lowered to SQL
(via
`expr_to_sql_string`) and the filters are combined as SQL strings, so
chaining
`where` works regardless of which form each filter takes.
Fixes#2649
## Tests
- Rust: `cargo test --features remote -p lancedb --lib query`
- Python: `uv run --extra tests pytest python/tests/test_query.py`
- TypeScript: `pnpm test __test__/query.test.ts`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
### Description
Adding branch support for RemoteTable by threading a branch selector
onto every operation the data plane accepts it on. Exposes the
currentBranch to nodejs and python through the bindings.
Matching the server handlers, the branch rides as:
- a `?branch=` query parameter for Arrow-body and query-only ops
(insert, merge_insert, multipart_*, version/list, drop_index)
- a `branch` field in the JSON body for everything else (count_rows,
query, update, delete, create_index, column ops, index list/stats,
stats, restore, describe, tags create/update)
A main-branch handle (`branch == None`) produces byte-identical requests
to before: no `branch` field and no `?branch=`
- Handle-per-branch: `create_branch` / `checkout_branch` return a new
handle with fresh caches and reset version/freshness state, mirroring
`NativeTable`.
- `create_branch` maps 409 to already-exists, 400 to invalid, and 404 to
not-found with source context, and sends without retry so the 409 stays
observable.
- `Ref` translation covers version, version-number (relative to the
handle's branch), and tag (resolved via the tags endpoint); `"main"` and
empty normalize to the main branch.
- Python branch handles persist their branch (and pinned version) across
pickle/fork, so a forked or pickled handle reopens on its branch rather
than silently reverting to main.
### Tests
- Rust mock tests per op category (query-param and body mechanisms,
branch CRUD, error paths, backward-compat).
- Python sync branch CRUD, `open_table(branch=)`, and a pickle
round-trip regression test.
## Summary
Surfaces the rich per-index metadata added in #3497 to the Python and
Node.js language bindings. Closes#3495.
New optional fields exposed on `IndexConfig` in both bindings:
- `index_uuid` / `indexUuid` — UUID of the first index segment
- `type_url` / `typeUrl` — protobuf type URL for the index
- `created_at` / `createdAt` — creation timestamp (milliseconds since
Unix epoch)
- `num_indexed_rows` / `numIndexedRows` — rows covered by the index
- `num_unindexed_rows` / `numUnindexedRows` — rows not yet indexed
- `size_bytes` / `sizeBytes` — total index file size in bytes
- `num_segments` / `numSegments` — number of index segments
- `index_version` / `indexVersion` — on-disk format version
- `index_details` / `indexDetails` — type-specific JSON details string
All fields are `None`/`undefined` for remote tables (which don't yet
surface this metadata through the server response).
## Changes
- `python/src/index.rs`: extend `IndexConfig` pyclass; update `From`
impl; update `__getitem__`
- `python/python/lancedb/_lancedb.pyi`: add type hints for new fields
- `python/python/tests/test_table.py`: new `test_index_config_fields`
test
- `nodejs/src/table.rs`: extend `IndexConfig` napi struct; update `From`
impl
- `nodejs/__test__/table.test.ts`: new test; update existing `toEqual`
assertions to `expect.objectContaining` to accommodate new fields
## Test plan
- [x] Python: `uv run --extra tests pytest
python/tests/test_table.py::test_index_config_fields`
- [x] Node.js: `pnpm test __test__/table.test.ts`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds an FM-Index — a scalar index over string and binary columns that
accelerates substring search (`contains(col, 'needle')`), distinct from
the tokenized `FTS` index — across the Rust core and the Python and
TypeScript bindings.
## Rust
- `Index::Fm(FmIndexBuilder)` and `IndexType::Fm`.
- `make_index_params` maps `Index::Fm` to Lance's
`ScalarIndexParams::for_builtin(BuiltinIndexType::Fm)`.
- `supported_fm_data_type` validates
`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` columns.
- `list_indices` round-trips the type (`"Fm"` → `IndexType::Fm`); the
remote wire type is `"FM"`.
## Python
Adds `lancedb.index.Fm`, accepted by `create_index`:
```python
from lancedb.index import Fm
await tbl.create_index("text", config=Fm())
```
## TypeScript
Adds the `Index.fm()` factory:
```ts
await tbl.createIndex("text", { config: Index.fm() });
```
## Summary
This PR extends nested-field regression coverage across Rust
local/remote, Python sync/async, and Node so canonical escaped paths
stay consistent across scalar, vector, and FTS index lifecycle behavior.
It also aligns LanceDB's LabelList type gate with Lance by accepting
`LargeList<primitive>` columns while keeping `List<Struct<...>>`
unsupported until Lance defines stable membership semantics for struct
labels.
Part of #3406.
### Description
Stacked on #3490. Adds an optional version to branch checkout across the
Rust core and the Python and TypeScript SDKs, so you can open a specific
version on a branch ("version V of branch B"), not just the branch's
latest version
Rust
```rust
// Open version 3 of branch "exp" (a read-only view): check out from an
// existing table, or open it directly from the connection.
let exp_v3 = table.checkout_branch("exp", Some(3)).await?;
let exp_v3 = db.open_table("items").branch("exp").version(3).execute().await?;
// checkout_latest re-attaches to the branch's writable HEAD.
exp_v3.checkout_latest().await?;
// With no branch, a version opens main at that version.
let main_v3 = db.open_table("items").version(3).execute().await?;
```
Python
```python
# Open version 3 of branch "exp" (a read-only view): check out from an
# existing table, or open it directly from the connection.
branch_v3 = await table.branches.checkout("exp", version=3)
branch_v3 = await db.open_table("items", branch="exp", version=3)
# checkout_latest re-attaches to the branch's writable HEAD.
await branch_v3.checkout_latest()
# With no branch, a version opens main at that version.
main_v3 = await db.open_table("items", version=3)
```
TypeScript
```typescript
// Open version 3 of branch "exp" (a read-only view): check out from an
// existing table, or open it directly from the connection.
const branchV3 = await (await table.branches()).checkout("exp", 3);
const opened = await db.openTable("items", undefined, { branch: "exp", version: 3 });
// checkoutLatest re-attaches to the branch's writable HEAD.
await branchV3.checkoutLatest();
// With no branch, a version opens main at that version.
const mainV3 = await db.openTable("items", undefined, { version: 3 });
```
### Testing
- Added unit tests (Rust, Python sync + async, TypeScript):
branch-scoped resolution at a version number shared with `main` and with
another branch, read-only enforcement on a pinned handle,
`checkout_latest` recovery to the branch's HEAD, fork-point reads, and
the nonexistent-version/branch error paths.
- Ran smoke tests against the Python and TypeScript SDKs on local
machine.
### Description
Adds first-class support for table branches across the Rust core and the
Python and TypeScript SDKs.
Rust
```rust
use lance::dataset::refs::Ref;
// Create a branch from main and write to it — main is untouched.
let exp = table.create_branch("exp", Ref::Version(None, None)).await?;
exp.add(batches).await?;
// Reopen the branch later: check out from a table, or open it directly.
let exp = table.checkout_branch("exp").await?;
let exp = db.open_table("items").branch("exp").execute().await?;
let branches = table.list_branches().await?;
table.delete_branch("exp").await?;
```
Python
```python
# Create a branch from main and write to it
branch = await table.branches.create("exp", from_ref="main")
await branch.add(data)
# Reopen the branch later: check out from a table, or open it directly.
branch = await table.branches.checkout("exp")
branch = await db.open_table("items", branch="exp")
await table.branches.list()
await table.branches.delete("exp")
```
TypeScript
```typescript
const branches = await table.branches();
// Create a branch from main and write to it
const branch = await branches.create("exp");
await branch.add(data);
// Reopen the branch later: check out from a table, or open it directly.
const checkedOut = await branches.checkout("exp");
const opened = await db.openTable("items", undefined, { branch: "exp" });
await branches.list();
await branches.delete("exp");
```
### Testing
- Added unit tests
- ran smoke tests against python and typescript sdks on local machine
### Next steps
- Add RemoteTable support
- Add Branch Comparison support
- Merge Branching support
BREAKING CHANGE: direct Rust users lose the `IndexStatistics::loss`
field. Python and Node.js consumers are unaffected in practice for
remote tables (the value was always `None`/absent), but the attribute is
gone for local tables too.
`IndexStatistics::loss` was local-only — LanceDB Cloud never returned
it, so
`RemoteTable::index_stats` always set `loss: None`. It's vestigial; this
removes it.
- Remove `loss` from `IndexStatistics` and the internal `IndexMetadata`
in `rust/lancedb/src/index.rs`, plus the summing logic in
`NativeTable::index_stats`.
- Drop `loss` from the Python and Node.js bindings (and their
tests/docs).
Fixes#3493🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
### Summary
Adds update_field_metadata to the client SDK (Rust core, Python, and
TypeScript) so clients can edit per-field (column) Arrow metadata
(schema.fields[].metadata)
### Testing
- added unit tests
- ran E2E against a local server on both local and remote tables (set →
merge → delete), across Python sync/async and TypeScript
### Next steps
- deprecate replace_field_metadata in the python lancedb favor of this
(typescript didn't have replace_field_metadata method). This matches
Lance's API direction (Lance already deprecated replace_field_metadata
for update_field_metadata)
## Summary
When an `LsmWriteSpec` is installed on a table (#3396), `merge_insert`
upsert
calls are dispatched through Lance's MemWAL `ShardWriter` (LSM-style
append)
instead of the standard merge path.
- **`use_lsm_write`** — a `merge_insert` builder option, default `true`;
set it
`false` to use the standard path for a call even when a spec is set.
- **`assume_pre_sharded`** — a `merge_insert` builder option, default
`false`;
skips the per-row shard check and routes by the first row only.
- **`close_lsm_writers`** — drains and closes the table's cached MemWAL
shard
writers.
- The `merge_insert` **`on`** columns default to, and are validated
against,
the table's unenforced primary key.
- Shard writers are cached alongside the dataset (in
`DatasetConsistencyWrapper`) and reused for the session.
- `MergeResult` gains **`num_rows`** — on the LSM path the insert/update
breakdown is unknown until compaction, so only the total is reported.
Routing covers all three sharding strategies — bucket (murmur3,
Iceberg-compatible), identity, and unsharded. Each `merge_insert` call
targets
a single shard; the whole input is collected and validated before a
single
atomic `ShardWriter::put`, so a validation failure leaves the MemWAL
untouched.
Bindings: Python (`merge_insert(...).use_lsm_write(...)` /
`.assume_pre_sharded(...)`, `Table.close_lsm_writers`) and TypeScript
(`mergeInsert(...).useLsmWrite(...)` / `.assumePreSharded(...)`,
`Table.closeLsmWriters`).
## Context
Reconstructed from the original #3354 branch onto current `main`: the
branch
predated the #3394 (unenforced primary key) / #3396 (`LsmWriteSpec`)
split and
has been rebuilt on that merged foundation. Depends on Lance
`v7.0.0-beta.13`.
The MemWAL read path (reading un-flushed shard data back into queries)
and
remote (LanceDB Cloud) LSM support are follow-ups.
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
## Summary
- Bump lance dependency from `v7.0.0-beta.13` to `v7.0.0-rc.1`
- Remove PK constraint from `LsmWriteSpec::Bucket` docs and
`Table::set_lsm_write_spec` docs
- Remove test assertions that expected rejection when no PK is set or
when bucket column != PK
Closes https://github.com/lance-format/lance/issues/6917
LanceDB default vector column discovery only considered top-level
fields, so tables with a single nested vector leaf still required users
to pass an explicit field path. This updates Rust and Python discovery
to recurse into struct fields, return canonical field paths, and
preserve actionable errors when no default or multiple defaults exist.
The explicit nested path flow for index creation and search remains
supported across Rust, Python, and Node, with regression coverage for
single nested vector leaves, multiple candidate leaves, and schemas
without vector leaves.
Closes#3405.
### Summary
- Add an optional `progress` callback to `Table.add(data, { progress
})`. Callback fires once per batch written and once more with `done:
true` when the write completes.
- Errors thrown from the user's callback are logged with `console.warn`
and swallowed
### Testing
- npm test
- ran smoke test script to verify functionality
## Summary
Split out from #3354
Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` /
`unset_lsm_write_spec` to
install and clear the spec that selects Lance's MemWAL LSM-style write
path for
`merge_insert`.
`LsmWriteSpec` offers three sharding strategies, all built on Lance's
`InitializeMemWalBuilder`:
- `LsmWriteSpec::bucket(column, num_buckets)` — hash-bucket sharding by
the
single-column unenforced primary key.
- `LsmWriteSpec::identity(column)` — identity sharding by the raw value
of a
scalar column.
- `LsmWriteSpec::unsharded()` — a single MemWAL shard.
Each can be refined with `with_maintained_indexes(...)` (indexes the
MemWAL
keeps up to date as rows are appended) and
`with_writer_config_defaults(...)`
(default `ShardWriter` configuration recorded in the MemWAL index, so
every
writer starts from the same defaults). All variants require the table to
have
an unenforced primary key.
- `set_lsm_write_spec` installs the spec by initializing the MemWAL
index;
`unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting
to
the standard `merge_insert` path. `unset` is idempotent.
- Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`,
`set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript
(`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` /
`"unsharded"`). `RemoteTable` returns `NotSupported`.
The actual `merge_insert` LSM dispatch and `ShardWriter` write path are
a
follow-up — this PR only installs and clears the spec.
## Summary
Adds `Table::set_unenforced_primary_key` — records a single column as
the
table's unenforced primary key in Lance schema field metadata.
"Unenforced"
means LanceDB does not check uniqueness on write; the key is metadata
that
`merge_insert` consumes.
- Single-column only; the column must exist and have a supported dtype
(Int32, Int64, Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary).
The
API accepts an iterable for binding ergonomics but requires exactly one
column — compound keys are rejected.
- The primary key is immutable: calling this on a table that already has
an
unenforced primary key is rejected. Concurrent writers racing to set the
key
fail at commit time rather than silently overriding it.
- `RemoteTable` returns `NotSupported`.
- Bindings: Python (`AsyncTable`, `LanceTable`, `RemoteTable`) and
TypeScript
(`Table.setUnenforcedPrimaryKey`).
## Context
Split out from #3354 per review feedback, so the unenforced primary key
and the
`merge_insert` sharding spec land as separate reviewable PRs.
No Lance dependency bump — `main` is already on v7.0.0-beta.10, which
includes
the field-metadata round-trip fix the API relies on. Enforcing
primary-key
immutability at the Lance commit layer (so the cross-column concurrent
race is
also rejected) is a companion Lance change: lance-format/lance#6810.
### Summary
- Expose Connection.renameTable in the Node.js bindings and align it
with existing namespace-aware connection APIs.
### Changes
- Add napi-rs rename_table on Connection, delegating to Rust
Connection::rename_table.
- Add renameTable(oldName, newName, namespacePath?) on abstract
Connection and implement on LocalConnection.
- Add a connection test that renames a table and checks names / open
behavior.
#### Testing
- cd nodejs && npm run build
- cd nodejs && npm test __test__/connection.test.ts
fix : #3364
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## **Summary**
This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.
A `Scannable` wraps a schema, an optional row count hint, a rescannable
flag, and a batch producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.
This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow up work.
---
## **Design**
### **Transport**
The transport uses the **Arrow IPC Stream format, one batch at a time**.
The JS side encodes each `RecordBatch` into a self contained IPC Stream
message containing schema, batch, and end of stream. The message is
returned as a `Buffer` through a napi `ThreadsafeFunction`. The Rust
side decodes it using `arrow_ipc::reader::StreamReader`.
Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.
I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.
Third party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode and
decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.
---
### **API**
```ts
class Scannable {
readonly schema: Schema
readonly numRows: number | null
readonly rescannable: boolean
static fromFactory(schema, factory, opts?)
static fromTable(table, opts?)
static fromIterable(schema, iter, opts?)
static fromRecordBatchReader(reader, opts?)
}
```
The FFI boundary consists of a single callback:
`getNextBatch(isStart: boolean): Promise<Buffer | null>`
`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid stream, for example a retried write
after a network error. Without this signal a retry would resume a stale
iterator and silently skip already emitted batches.
In addition, a schema only IPC buffer is transferred once during
construction.
---
## **Changes**
* `nodejs/src/scannable.rs`
Adds `NapiScannable` and the `LanceScannable` implementation. Implements
`schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
Includes per batch schema validation against the declared schema, one
shot enforcement for non rescannable sources, and a scan boundary reset
signal (`isStart`) so rescannable sources restart from batch 0 on every
`scan_as_stream` call rather than resuming a stale iterator.
* `nodejs/src/lib.rs`
Module registration.
* `nodejs/lancedb/scannable.ts`
Defines the `Scannable` class and the four constructors listed above.
Each constructor rejects option combinations it cannot honor, for
example a `rescannable: true` request on a one shot iterable or reader,
and a `numRows` that disagrees with an in memory table's row count.
* `nodejs/lancedb/index.ts`
Exports the new primitive.
* `nodejs/__test__/scannable.test.ts`
Test suite for the primitive.
---
## **Validation**
Before implementing the bridge, I ran an end to end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.
The harness covered the following scenarios:
* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe
All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly end
to end and that Node's `Buffer` size limit does not constrain the
overall stream.
Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is consumed partially and the scan is
dropped mid stream, then re-scanned. The re-scan replays from batch 0
rather than resuming the stale iterator. The same harness was run with
the `isStart` reset path disabled and the mid stream restart case failed
as expected, confirming the test exercises the real regression.
These harnesses are not meant to replace the full test suite, which is
described below.
---
## **Tests**
`__test__/scannable.test.ts` covers construction, metadata reflection,
per constructor defaults and overrides, construction time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.
Runtime scan behavior including `scan_as_stream`, one shot enforcement
on non rescannable sources, schema mismatch detection, IPC decode
failures, and rescannable restart semantics is not exercised here. There
is no in tree JS consumer of `NapiScannable` yet. This mirrors Python's
`PyScannable`, which has no dedicated test file and is covered
transitively through the consumers that accept a Scannable.
Runtime coverage will follow in the consumer migration work.
---
## **Status**
Ready for review.
Closes#3223
---
### Summary
- Closes#3362
- Adds `prewarmData(columns?: string[])` to the Node bindings, mirroring
the Rust and Python implementations
### Testing
- [x] `npm run build` (regenerates the napi `.node` module + TS
declarations)
- [x] `npm run lint`
- [x] `npm test
- [ ] live test against remote table - just waiting for my dev stack to
get created
### Documentation
- updated docs
Fixes#3269.
## What I observed
Using a reranker in a hybrid query could keep the Node.js process alive
even after `table.close()` and `db.close()`.
## Root cause
The reranker callback bridge used a `ThreadsafeFunction` in referenced
mode, which can keep the event loop alive longer than intended.
## Minimal fix
- In `nodejs/src/rerankers.rs`, create the reranker callback TSFN in
weak mode (`.weak::<true>()`).
- Add a regression test in `nodejs/__test__/rerankers.test.ts` that
spawns a child process, runs a rerank query, and asserts the process
exits naturally.
## Validation
- Built Node bindings successfully.
- Ran targeted tests: `rerankers.test.ts` passes (including new
regression test).
- Pre-commit checks for changed files were run and clean.
1. Refactored every client (Rust core, Python, Node/TypeScript) so
“namespace” usage is explicit: code now keeps namespace paths
(namespace_path) separate from namespace clients (namespace_client).
Connections propagate the client, table creation routes through it, and
managed versioning defaults are resolved from namespace metadata. Python
gained LanceNamespaceDBConnection/async counterparts, and the
namespace-focused tests were rewritten to match the clarified API
surface.
2. Synchronized the workspace with Lance 5.0.0-beta.3 (see
https://github.com/lance-format/lance/pull/6186 for the upstream
namespace refactor), updating Cargo/uv lockfiles and ensuring all
bindings align with the new namespace semantics.
3. Added a namespace-backed code path to lancedb.connect() via new
keyword arguments (namespace_client_impl, namespace_client_properties,
plus the existing pushdown-ops flag). When those kwargs are supplied,
connect() delegates to connect_namespace, so users can opt into
namespace clients without changing APIs. (The async helper will gain
parity in a later change)
Fixes#2716
## Summary
Add support for querying with Float16Array, Float64Array, and Uint8Array
vectors in the Node.js SDK, eliminating precision loss from the previous
\Float32Array.from()\ conversion.
## Implementation
Follows @wjones127's [5-step
plan](https://github.com/lancedb/lancedb/issues/2716#issuecomment-3447750543):
### Rust (\
odejs/src/query.rs\)
1. \ytes_to_arrow_array(data: Uint8Array, dtype: String)\ helper that:
- Creates an Arrow \Buffer\ from the raw bytes
- Wraps it in a typed \ScalarBuffer<T>\ based on the dtype enum
- Constructs a \PrimitiveArray\ and returns \Arc<dyn Array>\
2. \
earest_to_raw(data, dtype)\ and \dd_query_vector_raw(data, dtype)\ NAPI
methods that pass the type-erased array to the core \
earest_to\/\dd_query_vector\ which already accept \impl
IntoQueryVector\ for \Arc<dyn Array>\
### TypeScript (\
odejs/lancedb/query.ts\, \rrow.ts\)
3. Extended \IntoVector\ type to include \Uint8Array\ (and
\Float16Array\ via runtime check for Node 22+)
4. \xtractVectorBuffer()\ helper detects non-Float32 typed arrays and
extracts their underlying byte buffer + dtype string
5. \
earestTo()\ and \ddQueryVector()\ route through the raw NAPI path when
the input is Float16/Float64/Uint8
### Backward compatibility
Existing \Float32Array\ and \
umber[]\ inputs are unchanged -- they still use the original \
earest_to(Float32Array)\ NAPI method. The new raw path is only used when
a non-Float32 typed array is detected.
## Usage
\\\ ypescript
// Float16Array (Node 22+) -- no precision loss
const f16vec = new Float16Array([0.1, 0.2, 0.3]);
const results = await
table.query().nearestTo(f16vec).limit(10).toArray();
// Float64Array -- no precision loss
const f64vec = new Float64Array([0.1, 0.2, 0.3]);
const results = await
table.query().nearestTo(f64vec).limit(10).toArray();
// Uint8Array (binary embeddings)
const u8vec = new Uint8Array([1, 0, 1, 1, 0]);
const results = await
table.query().nearestTo(u8vec).limit(10).toArray();
// Existing usage unchanged
const results = await table.query().nearestTo([0.1, 0.2,
0.3]).limit(10).toArray();
\\\
## Note on dependencies
The Rust side uses \rrow_array\, \rrow_buffer\, and \half\ crates.
These should already be in the dependency tree via \lancedb\ core, but
\Cargo.toml\ may need explicit entries for \half\ and the arrow
sub-crates in the nodejs workspace.
---------
Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Add support for passing field/data type information into add_columns()
method, bringing parity with Python bindings. The method now accepts:
- AddColumnsSql[] - SQL expressions (existing functionality)
- Field - single Arrow field with explicit data type
- Field[] - array of Arrow fields with explicit data types
- Schema - Arrow schema with explicit data types
New columns added via Field/Schema are initialized with null values. All
field-based columns must be nullable due to null initialization.
Resolves#3107
---------
Signed-off-by: Pratik <pratikrocks.dey11@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
When we create tables without using Arrow by parsing JS records we
always infer to float64. Many times embeddings are not float64 and it
would be nice to be able to use the native type without requiring users
to pull in Arrow. We can utilize JS's builtin Float32Array to do this.
This PR also adds support for UInt8/16/32 and Int8/16/32 arrays as well.
Closes#3115
## Summary
This PR changes takeRowIds to accept bigint[] instead of
number[], matching the type of _rowid returned by withRowId().
## Problem
When retrieving row IDs using \withRowId()\ and querying them back with
takeRowIds(), users get an error because:
1. _rowid values are returned as JavaScript bigint
2. takeRowIds() expected number[]
3. NAPI failed to convert: Error: Failed to convert napi value BigInt
into rust type i64
## Reproduction
\\\js
import lancedb from '@lancedb/lancedb';
const db = await lancedb.connect('memory://');
const table = await db.createTable('test', [{ id: 1, vector: [1.0, 2.0]
}]);
const results = await table.query().withRowId().toArray();
const rowIds = results.map(row => row._rowid);
console.log('types:', rowIds.map(id => typeof id)); // ['bigint']
await table.takeRowIds(rowIds).toArray(); // ⌠Error before fix
\\\
## Solution
- Updated TypeScript signature from takeRowIds(rowIds: number[]) to
takeRowIds(rowIds: bigint[])
- Updated Rust NAPI binding to accept Vec<BigInt> and convert using
get_u64()
Fixes#2722
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
I'm working on a lancedb version of pytorch data loading (and hopefully
addressing https://github.com/lancedb/lance/issues/3727).
However, rather than rely on pytorch for everything I'm moving some of
the things that pytorch does into rust. This gives us more control over
data loading (e.g. using shards or a hash-based split) and it allows
permutations to be persistent. In particular I hope to be able to:
* Create a persistent permutation
* This permutation can handle splits, filtering, shuffling, and sharding
* Create a rust data loader that can read a permutation (one or more
splits), or a subset of a permutation (for DDP)
* Create a python data loader that delegates to the rust data loader
Eventually create integrations for other data loading libraries,
including rust & node
The [`FieldLike` type in
arrow.ts](5ec12c9971/nodejs/lancedb/arrow.ts (L71-L78))
can have a `type: string` property, but before this change, actually
trying to create a table that has a schema that specifies field types by
name results in an error:
```
Error: Expected a Type but object was null/undefined
```
This change adds support for mapping some type name strings to arrow
`DataType`s, so that passing `FieldLike`s with a `type: string` property
to `sanitizeField` does not throw an error.
The type names that can be passed are upper/lowercase variations of the
keys of the `constructorsByTypeName` object. This does not support
mapping types that need parameters, such as timestamps which need
timezones.
With this, it is possible to create empty tables from `SchemaLike`
objects without instantiating arrow types, e.g.:
```
import { SchemaLike } from "../lancedb/arrow"
// ...
const schemaLike = {
fields: [
{
name: "id",
type: "int64",
nullable: true,
},
{
name: "vector",
type: "float64",
nullable: true,
},
],
// ...
} satisfies SchemaLike;
const table = await con.createEmptyTable("test", schemaLike);
```
This change also makes `FieldLike.nullable` required since the `sanitizeField` function throws if it is undefined.
**Problem**: When a vector field is marked as nullable, users should be
able to omit it or pass `undefined`, but this was throwing an error:
"Table has embeddings: 'vector', but no embedding function was provided"
fixes: #2646
**Solution**: Modified `validateSchemaEmbeddings` to check
`field.nullable` before treating `undefined` values as missing embedding
fields.
**Changes**:
- Fixed validation logic in `nodejs/lancedb/arrow.ts`
- Enabled previously skipped test for nullable fields
- Added reproduction test case
**Behavior**:
- ✅ `{ vector: undefined }` now works for nullable fields
- ✅ `{}` (omitted field) now works for nullable fields
- ✅ `{ vector: null }` still works (unchanged)
- ✅ Non-nullable fields still properly throw errors (unchanged)
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: neha <neha@posthog.com>
### Bug Fix: Undefined Values in Nullable Fields
**Issue**: When inserting data with `undefined` values into nullable
fields, LanceDB was incorrectly coercing them to default values (`false`
for booleans, `NaN` for numbers, `""` for strings) instead of `null`.
**Fix**: Modified the `makeVector()` function in `arrow.ts` to properly
convert `undefined` values to `null` for nullable fields before passing
data to Apache Arrow.
fixes: #2645
**Result**: Now `{ text: undefined, number: undefined, bool: undefined
}` correctly becomes `{ text: null, number: null, bool: null }` when
fields are marked as nullable in the schema.
**Files Changed**:
- `nodejs/lancedb/arrow.ts` (core fix)
- `nodejs/__test__/arrow.test.ts` (test coverage)
- This ensures proper null handling for nullable fields as expected by
users.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
### Solution
Added special handling in `makeVector` function for boolean arrays where
all values are null. The fix creates a proper null bitmap using
`makeData` and `arrowMakeVector` instead of relying on Apache Arrow's
`vectorFromArray` which doesn't handle this edge case correctly.
fixes: #2644
### Changes
- Added null value detection for boolean types in `makeVector` function
- Creates proper Arrow data structure with null bitmap when all boolean
values are null
- Preserves existing behavior for non-null boolean values and other data
types
- Fixes the boolean null value bug while maintaining backward
compatibility.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Support shallow cloning a dataset at a specific location to create a new
dataset, using the shallow_clone feature in Lance. Also introduce remote
`clone` API for remote tables for this functionality.
- Fixes issue where passing `{ vector: undefined }` with an embedding
function threw "Found field not in schema" error instead of calling the
embedding function like `null` or omitted fields.
**Changes:**
- Modified `rowPathsAndValues` to skip undefined values during schema
inference
- Added test case verifying undefined, null, and omitted vector fields
all work correctly
**Before:** `{ vector: undefined }` → Error
**After:** `{ vector: undefined }` → Calls embedding function
Closes#2647
## Summary
This PR introduces a `HeaderProvider` which is called for all remote
HTTP calls to get the latest headers to inject. This is useful for
features like adding the latest auth tokens where the header provider
can auto-refresh tokens internally and each request always set the
refreshed token.
---------
Co-authored-by: Claude <noreply@anthropic.com>
This PR adds mTLS (mutual TLS) configuration support for the LanceDB
remote HTTP client, allowing users to authenticate with client
certificates and configure custom CA certificates for server
verification.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Enables two new parameters when building indices:
* `name`: Allows explicitly setting a name on the index. Default is
`{col_name}_idx`.
* `train` (default `True`): When set to `False`, an empty index will be
immediately created.
The upgrade of Lance means there are also additional behaviors from
cd76a993b8:
* When a scalar index is created on a Table, it will be kept around even
if all rows are deleted or updated.
* Scalar indices can be created on empty tables. They will default to
`train=False` if the table is empty.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Adds tests to ensure that users can paginate through simple scan, FTS,
and vector search results using `limit` and `offset`.
Tests upstream work: https://github.com/lancedb/lance/pull/4318Closes#2459