## Summary
Split out from #3354
Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` / `unset_lsm_write_spec` to
install and clear the spec that selects Lance's MemWAL LSM-style write path for
`merge_insert`.
`LsmWriteSpec` offers three sharding strategies, all built on Lance's
`InitializeMemWalBuilder`:
- `LsmWriteSpec::bucket(column, num_buckets)`: hash-bucket sharding by the
  single-column unenforced primary key.
- `LsmWriteSpec::identity(column)`: identity sharding by the raw value of a
  scalar column.
- `LsmWriteSpec::unsharded()`: a single MemWAL shard.
Each can be refined with `with_maintained_indexes(...)` (indexes the MemWAL
keeps up to date as rows are appended) and `with_writer_config_defaults(...)`
(default `ShardWriter` configuration recorded in the MemWAL index, so every
writer starts from the same defaults). All variants require the table to have
an unenforced primary key.
- `set_lsm_write_spec` installs the spec by initializing the MemWAL index;
  `unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting to
  the standard `merge_insert` path. `unset` is idempotent.
- Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`,
  `set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript
  (`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` /
  `"unsharded"`). `RemoteTable` returns `NotSupported`.
The actual `merge_insert` LSM dispatch and `ShardWriter` write path are a
follow-up; this PR only installs and clears the spec.
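
As a rough illustration of how the three sharding strategies differ, here is a minimal Python sketch. It is not Lance's implementation: the function names are hypothetical, and SHA-256 is only a stand-in because the PR does not say which hash the bucket strategy uses.

```python
import hashlib


def bucket_shard(key, num_buckets: int) -> int:
    """Hash-bucket sharding: stable hash of the key, modulo the bucket count.

    SHA-256 is an assumption for illustration, not Lance's actual hash.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def identity_shard(value):
    """Identity sharding: the raw column value itself names the shard."""
    return value


def unsharded_shard(_row=None) -> int:
    """Unsharded: every row lands in the single shard 0."""
    return 0
```

The point of the sketch is only that bucket sharding bounds the number of shards up front, identity sharding creates one shard per distinct value, and the unsharded spec has exactly one.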
## **Summary**
This PR adds a **Scannable primitive** to the Node.js bindings, bringing
parity with Python's `PyScannable`.
A `Scannable` wraps a schema, an optional row count hint, a rescannable
flag, and a batch-producing callback. On the Rust side it implements
`lancedb::data::scannable::Scannable`. The goal is to give consumers
such as `Table.add`, `createTable`, and `mergeInsert` a way to stream
data without materializing the full dataset in JS memory.
This PR introduces only the primitive. Migrating existing consumers to
use it will come in follow-up work.
---
## **Design**
### **Transport**
The transport uses the **Arrow IPC Stream format, one batch at a time**.
The JS side encodes each `RecordBatch` into a self-contained IPC Stream
message containing the schema, the batch, and an end-of-stream marker. The
message is returned as a `Buffer` through a napi `ThreadsafeFunction`. The
Rust side decodes it using `arrow_ipc::reader::StreamReader`.
Only one batch is active at a time, so JS memory stays bounded by the
batch size. The Node `Buffer` size limit of about 4 GiB therefore does
not constrain the stream as a whole.
I initially evaluated the Arrow C Data Interface, which is the approach
used in Python. I dropped that path after confirming that the
`apache-arrow` npm package does not expose a C Data Interface export in
any supported version from 15 to 18. JavaScript is not listed in Arrow's
C Data Interface implementation table, and the upstream tracking issue
remains open with no scheduled work.
Third-party FFI shims would introduce additional dependency risk without
solving the core maintenance problem. Using IPC adds one encode and one
decode step per batch, but the cost is predictable and typically
dominated by Lance's write path.
---
### **API**
```ts
class Scannable {
  readonly schema: Schema
  readonly numRows: number | null
  readonly rescannable: boolean
  static fromFactory(schema, factory, opts?)
  static fromTable(table, opts?)
  static fromIterable(schema, iter, opts?)
  static fromRecordBatchReader(reader, opts?)
}
```
The FFI boundary consists of a single callback:
`getNextBatch(isStart: boolean): Promise<Buffer | null>`
`isStart` is `true` on the first call of each new scan and `false` for
every call after it. The JS side uses it to drop any cached iterator and
re-invoke the factory at scan boundaries. This is what makes a
rescannable source restart at batch 0 on every `scan_as_stream` call,
even when a previous scan ended mid-stream, for example a retried write
after a network error. Without this signal, a retry would resume a stale
iterator and silently skip already-emitted batches.
In addition, a schema-only IPC buffer is transferred once during
construction.
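
The restart contract can be modeled in a few lines. The following is a language-neutral sketch in Python, not the actual JS implementation: `BatchSource` and `get_next_batch` are hypothetical names, and plain `bytes` values stand in for encoded IPC buffers.

```python
from typing import Callable, Iterable, Iterator, Optional


class BatchSource:
    """Hypothetical model of the JS side of the getNextBatch(isStart) callback."""

    def __init__(self, factory: Callable[[], Iterable[bytes]]):
        self._factory = factory
        self._iterator: Optional[Iterator[bytes]] = None

    def get_next_batch(self, is_start: bool) -> Optional[bytes]:
        if is_start or self._iterator is None:
            # Scan boundary: drop any cached iterator and re-invoke the factory.
            self._iterator = iter(self._factory())
        # None plays the role of the JS `null` end-of-stream marker.
        return next(self._iterator, None)


source = BatchSource(lambda: [b"batch-0", b"batch-1", b"batch-2"])
first = source.get_next_batch(True)       # new scan starts at batch 0
second = source.get_next_batch(False)     # same scan continues
restarted = source.get_next_batch(True)   # scan abandoned mid-stream, then retried
```

If the `is_start` check were removed, the third call would return `b"batch-2"` instead of replaying from `b"batch-0"`, which is the silent skip described above.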
---
## **Changes**
* `nodejs/src/scannable.rs`
Adds `NapiScannable` and the `LanceScannable` implementation. Implements
`schema()`, `num_rows()`, `rescannable()`, and `scan_as_stream()`.
Includes per-batch schema validation against the declared schema,
one-shot enforcement for non-rescannable sources, and a scan boundary reset
signal (`isStart`) so rescannable sources restart from batch 0 on every
`scan_as_stream` call rather than resuming a stale iterator.
* `nodejs/src/lib.rs`
Module registration.
* `nodejs/lancedb/scannable.ts`
Defines the `Scannable` class and the four constructors listed above.
Each constructor rejects option combinations it cannot honor, for
example a `rescannable: true` request on a one-shot iterable or reader,
or a `numRows` that disagrees with an in-memory table's row count.
* `nodejs/lancedb/index.ts`
Exports the new primitive.
* `nodejs/__test__/scannable.test.ts`
Test suite for the primitive.
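
The construction-time checks described for `nodejs/lancedb/scannable.ts` amount to a small validation step. The sketch below models the two rules named above in Python. The function name, the `kind` strings, and the default values (one-shot sources default to non-rescannable, other sources to rescannable) are assumptions for illustration, not the real TypeScript code.

```python
from typing import Optional

# Sources that can only be consumed once (assumed labels, not real identifiers).
ONE_SHOT_KINDS = {"iterable", "reader"}


def resolve_options(kind: str,
                    table_rows: Optional[int] = None,
                    num_rows: Optional[int] = None,
                    rescannable: Optional[bool] = None) -> dict:
    """Validate caller options against what the source kind can honor."""
    one_shot = kind in ONE_SHOT_KINDS
    if one_shot and rescannable:
        raise ValueError(f"a {kind} source is one-shot and cannot be rescannable")
    if kind == "table" and num_rows is not None and num_rows != table_rows:
        raise ValueError(
            f"numRows={num_rows} disagrees with the table's {table_rows} rows"
        )
    return {
        # An in-memory table knows its own row count; others rely on the hint.
        "num_rows": table_rows if kind == "table" else num_rows,
        # Assumed defaults: only one-shot sources default to non-rescannable.
        "rescannable": (not one_shot) if rescannable is None else rescannable,
    }
```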
---
## **Validation**
Before implementing the bridge, I ran an end-to-end harness with a JS
producer feeding a standalone Rust consumer built against the same
`arrow-ipc` version used in the bridge.
The harness covered the following scenarios:
* happy path
* empty stream
* 1,000 small batches
* 10 large batches
* mixed primitive types with nullables
* nested `List<Struct<>>`
* truncated stream error handling
* declared schema mismatch validation
* a 6 GB stress test through the pipe
All scenarios completed with bounded memory usage. The goal of this
harness was to confirm that the IPC Stream transport works correctly end
to end and that Node's `Buffer` size limit does not constrain the
overall stream.
Separately, the rescannable restart contract was verified with a focused
harness. A rescannable source is consumed partially and the scan is
dropped mid-stream, then re-scanned. The re-scan replays from batch 0
rather than resuming the stale iterator. The same harness was run with
the `isStart` reset path disabled and the mid-stream restart case failed
as expected, confirming the test exercises the real regression.
These harnesses are not meant to replace the full test suite, which is
described below.
---
## **Tests**
`__test__/scannable.test.ts` covers construction, metadata reflection,
per-constructor defaults and overrides, construction-time validation,
the native handle surface, and schema variety across empty tables,
nested types, `FixedSizeList`, and wide schemas.
Runtime scan behavior, including `scan_as_stream`, one-shot enforcement
on non-rescannable sources, schema mismatch detection, IPC decode
failures, and rescannable restart semantics, is not exercised here. There
is no in-tree JS consumer of `NapiScannable` yet. This mirrors Python's
`PyScannable`, which has no dedicated test file and is covered
transitively through the consumers that accept a Scannable.
Runtime coverage will follow in the consumer migration work.
---
## **Status**
Ready for review.
Closes #3223
---
Adds `manifest_enabled` for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and feature wiring for Azure credential vending. Also
exposes the option through the Rust, Python, and Node bindings with focused
validation.
## Summary
- Add a `user_id` field to `ClientConfig` that allows users to identify
themselves to LanceDB Cloud/Enterprise
- The user_id is sent as the `x-lancedb-user-id` HTTP header in all
requests
- Supports three configuration methods:
  - Direct assignment via `ClientConfig.user_id`
  - Environment variable `LANCEDB_USER_ID`
  - Indirect env var lookup via `LANCEDB_USER_ID_ENV_KEY`
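
A minimal sketch of how the three configuration methods could resolve to a single header value. The precedence order (direct assignment, then `LANCEDB_USER_ID`, then the indirect lookup) is an assumption, since the PR text does not state it, and both function names are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_user_id(configured: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Pick the user id from config or environment (assumed precedence)."""
    if configured:
        return configured
    direct = env.get("LANCEDB_USER_ID")
    if direct:
        return direct
    # Indirect lookup: this variable holds the NAME of another variable.
    key = env.get("LANCEDB_USER_ID_ENV_KEY")
    if key:
        return env.get(key) or None
    return None


def user_id_headers(user_id: Optional[str]) -> Dict[str, str]:
    """Build the header sent with each request, or nothing if unset."""
    return {"x-lancedb-user-id": user_id} if user_id else {}
```

For example, with `LANCEDB_USER_ID_ENV_KEY=MY_APP_USER` and `MY_APP_USER=alice` (both illustrative values), the resolved header would be `x-lancedb-user-id: alice`.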
Closes #3230
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
* Move away from buildjet, which is shutting down runners for GHA [^1]
* Add `Cargo.lock` to build jobs, so when we upgrade locked dependencies
we check the builds actually pass. CI started failing because
dependencies were changed in #3116 without running all build jobs.
* Add fixes for aws-lc-rs build in NodeJS.
[^1]: https://buildjet.com/for-github-actions/blog/we-are-shutting-down
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Did a full scan of all URLs that used to point to the old mkdocs pages
and updated them to link to the appropriate pages on lancedb.com/docs or
the lance.org docs.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pipes the num_attempts field from lance's merge insert result
through lancedb. This allows callers of merge_insert to get a better
idea of whether transaction conflicts are occurring.
I'm working on a lancedb version of pytorch data loading (and hopefully
addressing https://github.com/lancedb/lance/issues/3727).
However, rather than rely on pytorch for everything I'm moving some of
the things that pytorch does into rust. This gives us more control over
data loading (e.g. using shards or a hash-based split) and it allows
permutations to be persistent. In particular I hope to be able to:
* Create a persistent permutation
* This permutation can handle splits, filtering, shuffling, and sharding
* Create a rust data loader that can read a permutation (one or more
splits), or a subset of a permutation (for DDP)
* Create a python data loader that delegates to the rust data loader
Eventually create integrations for other data loading libraries,
including rust & node
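
To make the idea concrete, here is a small Python sketch of a reproducible permutation with a hash-based split, a seeded shuffle, and per-rank sharding for DDP. It only illustrates the concepts listed above; the function names, the use of SHA-256, and the strided sharding scheme are assumptions, not the planned Rust design.

```python
import hashlib
import random
from typing import List


def split_of(row_id: int, num_splits: int) -> int:
    """Hash-based split: a row always lands in the same split."""
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_splits


def build_permutation(num_rows: int, split: int, num_splits: int, seed: int) -> List[int]:
    """Select one split, then shuffle it with a fixed seed so it can be rebuilt."""
    rows = [r for r in range(num_rows) if split_of(r, num_splits) == split]
    random.Random(seed).shuffle(rows)
    return rows


def shard_for_rank(permutation: List[int], rank: int, world_size: int) -> List[int]:
    """Give each DDP rank a disjoint, strided slice of the permutation."""
    return permutation[rank::world_size]
```

Because both the split assignment and the shuffle are deterministic, the same row order can be recomputed or stored, which is what makes the permutation persistent.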
## Summary
This PR introduces a `HeaderProvider` which is called for all remote
HTTP calls to get the latest headers to inject. This is useful for
features like adding the latest auth tokens, where the header provider
can auto-refresh tokens internally so that each request always carries
the refreshed token.
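
A minimal sketch of the auto-refresh pattern such a provider enables. The class below is standalone and hypothetical: it does not subclass the real `HeaderProvider`, and the `fetch_token` callback and `Authorization: Bearer` header format are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RefreshingTokenProvider:
    """Hypothetical provider that refreshes its token once it has expired."""

    def __init__(self,
                 fetch_token: Callable[[], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_token = fetch_token  # returns (token, lifetime in seconds)
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get_headers(self) -> Dict[str, str]:
        """Called before every request; refreshes only when the token is stale."""
        if self._token is None or self._clock() >= self._expires_at:
            self._token, lifetime = self._fetch_token()
            self._expires_at = self._clock() + lifetime
        return {"Authorization": f"Bearer {self._token}"}
```

Since the provider is consulted on every call, callers never have to rebuild the connection when a token rotates.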
---------
Co-authored-by: Claude <noreply@anthropic.com>
Enables two new parameters when building indices:
* `name`: Allows explicitly setting a name on the index. Default is
`{col_name}_idx`.
* `train` (default `True`): When set to `False`, an empty index will be
immediately created.
The upgrade of Lance means there are also additional behaviors from
cd76a993b8:
* When a scalar index is created on a Table, it will be kept around even
if all rows are deleted or updated.
* Scalar indices can be created on empty tables. They will default to
`train=False` if the table is empty.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
These operations have existed in lance for a long while and many users
need to drop down to lance for this capability. This PR adds the API and
implements it using filters (e.g. `_rowid IN (...)`) so that it doesn't
currently add any load to `BaseTable`. I'm not sure that is sustainable
as base table implementations may want to specialize how they handle
this method. However, I figure it is a good starting point.
In addition, unlike Lance, this API does not currently guarantee
anything about the order of the take results. This is necessary for the
fallback filter approach to work (SQL filters cannot guarantee result
order)
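
The filter-based fallback and its ordering caveat can be sketched as follows. Both helpers are hypothetical; the second shows how a caller who needs the requested order could restore it client-side, assuming each returned row carries its `_rowid`.

```python
from typing import Dict, Iterable, List


def rowid_filter(row_ids: Iterable[int]) -> str:
    """Build the SQL filter used by the fallback take implementation."""
    ids = [int(r) for r in row_ids]
    if not ids:
        raise ValueError("need at least one row id")
    return "_rowid IN ({})".format(", ".join(str(i) for i in ids))


def restore_order(rows: List[Dict], row_ids: List[int]) -> List[Dict]:
    """Reorder filter results to match the requested ids, skipping missing ones."""
    by_id = {row["_rowid"]: row for row in rows}
    return [by_id[i] for i in row_ids if i in by_id]
```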
## Summary
- Exposes `Session` in Python and TypeScript so users can set the
  `index_cache_size_bytes` and `metadata_cache_size_bytes`
  - The `Session` is attached to the `Connection`, and thus shared across
    all tables in that connection.
- Adds deprecation warnings for table-level cache configuration
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
Provides the ability to set a timeout for merge insert. The default
underlying timeout is however long the first attempt takes, or if there
are multiple attempts, 30 seconds. This has two use cases:
1. Make the timeout shorter, when you want to fail if it takes too long.
2. Allow taking more time to do retries.
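
One way to read these semantics is as a retry loop whose deadline is only checked between attempts: the first attempt always runs, and further retries stop once the timeout has elapsed. The sketch below models that reading. It is not Lance's retry logic; `Conflict` and `run_with_retry_timeout` are hypothetical names, and a real implementation may also cancel an attempt that is still in flight.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class Conflict(Exception):
    """Stand-in for a retryable transaction conflict."""


def run_with_retry_timeout(attempt: Callable[[], T],
                           timeout_s: float = 30.0,
                           clock: Callable[[], float] = time.monotonic) -> Tuple[T, int]:
    """Retry `attempt` on conflict until it succeeds or the timeout elapses.

    Returns the result together with the number of attempts made.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return attempt(), attempts
        except Conflict:
            # The deadline is checked only after a failed attempt, so the
            # first attempt always gets to finish.
            if clock() - start >= timeout_s:
                raise TimeoutError(f"gave up after {attempts} attempts")
```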
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added support for specifying a timeout when performing merge insert
operations in Python, Node.js, and Rust APIs.
- Introduced a new option to control the maximum allowed execution time
for merge inserts, including retry timeout handling.
- **Documentation**
- Updated and added documentation to describe the new timeout option and
its usage in APIs.
- **Tests**
- Added and updated tests to verify correct timeout behavior during
merge insert operations.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Return version info for all write operations (add, update, merge_insert,
and column modification operations).
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Table modification operations (add, update, delete, merge,
add/alter/drop columns) now return detailed result objects including
version numbers and operation statistics.
- Result objects provide clearer feedback such as rows affected and new
table version after each operation.
- **Documentation**
- Updated documentation to describe new result objects and their fields
for all relevant table operations.
- Added documentation for new result interfaces and updated method
return types in Node.js and Python APIs.
- **Tests**
- Enhanced test coverage to assert correctness of returned versioning
and operation metadata after table modifications.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Based on this comment:
https://github.com/lancedb/lancedb/issues/2228#issuecomment-2730463075
and https://github.com/lancedb/lance/pull/2357
Here is my attempt at implementing bindings for returning merge stats
from a `merge_insert.execute` call for lancedb.
Note: I have almost no idea what I am doing in Rust but tried to follow
existing code patterns and pay attention to compiler hints.
- The change in the nodejs binding appeared to be necessary to get
compilation to work; presumably this could actually work properly by
returning some kind of NAPI JS object of the stats data?
- I am unsure of what to do with the remote/table.rs changes, which were
necessary for compilation to work; I assume this is related to LanceDB
Cloud, but I am unsure of the best way to handle that at this point.
Proof of function:
```python
import lancedb
import pandas as pd

db = lancedb.connect("/tmp/test.db")

# Initial table contents: five documents keyed by `id`.
test_data = pd.DataFrame(
    {
        "title": ["Hello", "Test Document", "Example", "Data Sample", "Last One"],
        "id": [1, 2, 3, 4, 5],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
        ],
    }
)
table = db.create_table("documents", data=test_data, exist_ok=True, mode="overwrite")

# Upsert payload: ids 1-5 match existing rows, id 6 is new.
update_data = pd.DataFrame(
    {
        "title": [
            "Hello, World",
            "Test Document, it's good",
            "Example",
            "Data Sample",
            "Last One",
            "New One",
        ],
        "id": [1, 2, 3, 4, 5, 6],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
            "New content",
        ],
    }
)

# Update every matched row and insert the unmatched one.
stats = (
    table.merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(update_data)
)
print(stats)
```
returns
```
{'num_inserted_rows': 1, 'num_updated_rows': 5, 'num_deleted_rows': 0}
```
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Merge-insert operations now return detailed statistics, including
counts of inserted, updated, and deleted rows.
- **Bug Fixes**
- Tests updated to validate returned merge-insert statistics for
accuracy.
- **Documentation**
- Method documentation improved to reflect new return values and clarify
merge operation results.
- Added documentation for the new `MergeStats` interface detailing
operation statistics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
* Add a new "table stats" API to expose basic table and fragment
statistics with local and remote table implementations
### Questions
* This is using `calculate_data_stats` to determine total bytes in the
table. This seems like a potentially expensive operation - are there any
concerns about performance for large datasets?
### Notes
* bytes_on_disk seems to be stored at the column level but there does
not seem to be a way to easily calculate total bytes per fragment. This
may need to be added in lance before we can support fragment size
(bytes) statistics.
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added a method to retrieve comprehensive table statistics, including
total rows, index counts, storage size, and detailed fragment size
metrics such as minimum, maximum, mean, and percentiles.
- Enabled fetching of table statistics from remote sources through
asynchronous requests.
- Extended table interfaces across Python, Rust, and Node.js to support
synchronous and asynchronous retrieval of table statistics.
- **Tests**
- Introduced tests to verify the accuracy of the new table statistics
feature for both populated and empty tables.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
* Add new wait_for_index() table operation that polls until indices are
created/fully indexed
* Add an optional wait timeout parameter to all create_index operations
* Python and NodeJS interfaces
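
The polling behaviour can be sketched as a simple loop with a deadline. This is an illustration only: `wait_for_indices` and the `is_ready` callback are hypothetical names, and the real `wait_for_index()` may poll on a different schedule.

```python
import time
from typing import Callable, Iterable


def wait_for_indices(is_ready: Callable[[str], bool],
                     index_names: Iterable[str],
                     timeout_s: float,
                     poll_interval_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until every named index reports ready, or raise on timeout."""
    deadline = clock() + timeout_s
    pending = list(index_names)
    while True:
        # Keep only the indices that are still building.
        pending = [name for name in pending if not is_ready(name)]
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError(f"indices not ready: {pending}")
        sleep(poll_interval_s)
```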
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added optional waiting for index creation completion with configurable
timeout.
- Introduced methods to poll and wait for indices to be fully built
across sync and async tables.
- Extended index creation APIs to accept a wait timeout parameter.
- **Bug Fixes**
- Added a new timeout error variant for improved error reporting on
index operations.
- **Tests**
- Added tests covering successful index readiness waiting, timeout
scenarios, and missing index cases.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced full-text search capabilities with support for phrase
queries, fuzzy matching, boosting, and multi-column matching.
- Search methods now accept full-text query objects directly, improving
query flexibility and precision.
- Python and JavaScript SDKs updated to handle full-text queries
seamlessly, including async search support.
- **Tests**
- Added comprehensive tests covering fuzzy search, phrase search, and
boosted queries to ensure robust full-text search functionality.
- **Documentation**
- Updated query class documentation to reflect new constructor options
and removal of deprecated methods for clarity and simplicity.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This reverts commit a547c523c2 (#2281).
The current implementation can cause panics and performance degradation.
I will bring this back with more testing in
https://github.com/lancedb/lancedb/pull/2311
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Documentation**
- Enhanced clarity on read consistency settings with updated
descriptions and default behavior.
- Removed outdated warnings about eventual consistency from the
troubleshooting guide.
- **Refactor**
- Streamlined the handling of the read consistency interval across
integrations, now defaulting to "None" for improved performance.
- Simplified internal logic to offer a more consistent experience.
- **Tests**
- Updated test expectations to reflect the new default representation
for the read consistency interval.
- Removed redundant tests related to "no consistency" settings for
streamlined testing.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Closes #2287
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added configurable timeout support for query executions. Users can now
specify maximum wait times for queries, enhancing control over
long-running operations across various integrations.
- **Tests**
- Expanded test coverage to validate timeout behavior in both
synchronous and asynchronous query flows, ensuring timely error
responses when query execution exceeds the specified limit.
- Introduced a new test suite to verify query operations when a timeout
is reached, checking for appropriate error handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Updated dependency versions for improved performance and
compatibility.
- **New Features**
- Added support for structured full-text search with expanded query
types (e.g., match, phrase, boost, multi-match) and flexible input
formats.
- Introduced a new method to check server support for structural
full-text search features.
- Enhanced the query system with new classes and interfaces for handling
various full-text queries.
- Expanded the functionality of existing methods to accept more complex
query structures, including updates to method signatures.
- **Bug Fixes**
- Improved error handling and reporting for full-text search queries.
- **Refactor**
- Enhanced query processing with streamlined input handling and improved
error reporting, ensuring more robust and consistent search results
across platforms.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: BubbleCal <bubble-cal@outlook.com>
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.
Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users' expectations.
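
The three behaviours (never check, always block, refresh in the background) can be modeled with a small single-threaded sketch. `VersionCache` and its `spawn` callback are hypothetical; the real implementation is asynchronous Rust code and needs proper locking, which this model leaves out.

```python
from typing import Callable, Optional


class VersionCache:
    """Hypothetical model of how reads interact with version refreshes."""

    def __init__(self,
                 load_latest: Callable[[], int],
                 interval_s: Optional[float],
                 spawn: Callable[[Callable[[], None]], None],
                 now: float = 0.0):
        self._load_latest = load_latest
        self._interval_s = interval_s
        self._spawn = spawn  # schedules a task to run in the background
        self._checked_at = now
        self._refreshing = False
        self.version = load_latest()

    def _refresh(self, now: float) -> None:
        self.version = self._load_latest()
        self._checked_at = now
        self._refreshing = False

    def read_version(self, now: float) -> int:
        if self._interval_s is None:
            return self.version  # never check for a newer version
        if self._interval_s == 0:
            self._refresh(now)   # strong consistency: the read blocks
            return self.version
        stale = now - self._checked_at >= self._interval_s
        if stale and not self._refreshing:
            self._refreshing = True
            self._spawn(lambda: self._refresh(now))
        return self.version      # the reader is never blocked here
```

With a positive interval, a read that finds the version stale still returns immediately with the old version, and later reads see the new one once the background task has finished.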
- adds `loss` into the index stats for vector index
- now `optimize` can retrain the vector index
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Previously, users could only specify new data types in `alterColumns` as
strings:
```ts
await tbl.alterColumns([
  {
    path: "price",
    dataType: "float",
  },
]);
```
But this has some problems:
1. It wasn't clear what were valid types
2. It was impossible to specify nested types, like lists and vector
columns.
This PR changes it to take an Arrow data type, similar to how the Python
API works. This allows casting vector types:
```ts
await tbl.alterColumns([
  {
    path: "vector",
    dataType: new arrow.FixedSizeList(
      2,
      new arrow.Field("item", new arrow.Float16(), false),
    ),
  },
]);
```
Closes #2185
Closes #1106
Unfortunately, these need to be set at the connection level. I
investigated whether users could access their own context through
`AsyncLocalStorage` if we let them provide a callback. However, it doesn't
seem like NAPI supports this right now. I filed an issue:
https://github.com/napi-rs/napi-rs/issues/2456
This opens up the door for more custom database implementations than the
two we have today. The biggest change should be invisible:
`ConnectionInternal` has been renamed to `Database`, made public, and
refactored.
However, there are a few breaking changes. `data_storage_version` and
`enable_v2_manifest_paths` have been moved from options on
`create_table` to options for the database which are now set via
`storage_options`.
Before:
```python
db = connect(uri)
tbl = db.create_table("my_table", data, data_storage_version="legacy", enable_v2_manifest_paths=True)
```
After:
```python
db = connect(uri, storage_options={
"new_table_enable_v2_manifest_paths": "true",
"new_table_data_storage_version": "legacy"
})
tbl = db.create_table("my_table", data)
```
BREAKING CHANGE: the data_storage_version, enable_v2_manifest_paths
options have moved from options to create_table to storage_options.
BREAKING CHANGE: the use_legacy_format option has been removed,
data_storage_version has replaced it for some time now
* Make `npm run docs` fail if there are any warnings. This will catch
items missing from the API reference.
* Add a check in our CI to make sure `npm run docs` runs without warnings
and doesn't generate any new files (indicating it might be out-of-date).
* Hide constructors that aren't user facing.
* Remove unused enum `WriteMode`.
Closes #2068
* Sets `"useCodeBlocks": true`
* Adds a post-processing script `nodejs/typedoc_post_process.js` that
puts the parameter description on the same line as the parameter name,
like it is in our Python docs. This makes the text hierarchy clearer in
those sections and also makes the sections shorter.
BREAKING CHANGE: default tokenizer no longer does stemming or stop-word
removal. Users should explicitly turn that option on in the future.
- upgrade lance to 0.19.1
- update the FTS docs
- update the FTS API
Upstream change notes:
https://github.com/lancedb/lance/releases/tag/v0.19.1
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
We aren't yet ready to switch over the examples since almost all JS
examples rely on embeddings and we haven't yet ported those over.
However, this makes it possible for those who are interested to start
using `@lancedb/lancedb`.