## Summary

When an `LsmWriteSpec` is installed on a table (#3396), `merge_insert` upsert calls are dispatched through Lance's MemWAL `ShardWriter` (LSM-style append) instead of the standard merge path.

- **`use_lsm_write`**: a `merge_insert` builder option, default `true`; set it `false` to use the standard path for a call even when a spec is set.
- **`assume_pre_sharded`**: a `merge_insert` builder option, default `false`; skips the per-row shard check and routes by the first row only.
- **`close_lsm_writers`**: drains and closes the table's cached MemWAL shard writers.
- The `merge_insert` **`on`** columns default to, and are validated against, the table's unenforced primary key.
- Shard writers are cached alongside the dataset (in `DatasetConsistencyWrapper`) and reused for the session.
- `MergeResult` gains **`num_rows`**: on the LSM path the insert/update breakdown is unknown until compaction, so only the total is reported.

Routing covers all three sharding strategies: bucket (murmur3, Iceberg-compatible), identity, and unsharded. Each `merge_insert` call targets a single shard; the whole input is collected and validated before a single atomic `ShardWriter::put`, so a validation failure leaves the MemWAL untouched.

Bindings: Python (`merge_insert(...).use_lsm_write(...)` / `.assume_pre_sharded(...)`, `Table.close_lsm_writers`) and TypeScript (`mergeInsert(...).useLsmWrite(...)` / `.assumePreSharded(...)`, `Table.closeLsmWriters`).

## Context

Reconstructed from the original #3354 branch onto current `main`: the branch predated the #3394 (unenforced primary key) / #3396 (`LsmWriteSpec`) split and has been rebuilt on that merged foundation.

Depends on Lance `v7.0.0-beta.13`. The MemWAL read path (reading un-flushed shard data back into queries) and remote (LanceDB Cloud) LSM support are follow-ups.

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
# Interface: LsmWriteSpec

Specification selecting Lance's MemWAL LSM-style write path for `mergeInsert`.

`specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`, `column` and `numBuckets` are required; for `"identity"`, `column` is required and must be a deterministic function of the unenforced primary key (every row with a given primary key must always produce the same `column` value, or upserts of that key can land in different shards and a stale version can win).
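The per-variant requirements above can be illustrated with a small client-side check. This is only a sketch: the interface below is a local mirror of the documented shape so that the snippet compiles on its own, `checkLsmWriteSpec` is a hypothetical helper that is not part of `@lancedb/lancedb`, and the column names `id` and `region` are made up for illustration.

```typescript
// Local mirror of the documented interface (assumption: kept in sync by hand).
interface LsmWriteSpec {
  specType: "bucket" | "identity" | "unsharded";
  column?: string;
  numBuckets?: number;
  maintainedIndexes?: string[];
  writerConfigDefaults?: Record<string, string>;
}

// Hypothetical helper: returns one message per violated requirement.
function checkLsmWriteSpec(spec: LsmWriteSpec): string[] {
  const problems: string[] = [];
  if (spec.specType === "bucket") {
    if (!spec.column) {
      problems.push('"bucket" requires column');
    }
    if (spec.numBuckets === undefined) {
      problems.push('"bucket" requires numBuckets');
    } else if (
      !Number.isInteger(spec.numBuckets) ||
      spec.numBuckets < 1 ||
      spec.numBuckets > 1024
    ) {
      problems.push("numBuckets must be an integer in [1, 1024]");
    }
  } else if (spec.specType === "identity") {
    if (!spec.column) {
      problems.push('"identity" requires column');
    }
  }
  // "unsharded" has no required fields in the documented shape.
  return problems;
}

// One example per variant; column names are illustrative only.
const bucketSpec: LsmWriteSpec = { specType: "bucket", column: "id", numBuckets: 16 };
const identitySpec: LsmWriteSpec = { specType: "identity", column: "region" };
const unshardedSpec: LsmWriteSpec = { specType: "unsharded" };

console.log(checkLsmWriteSpec(bucketSpec).length === 0);
console.log(checkLsmWriteSpec(identitySpec).length === 0);
console.log(checkLsmWriteSpec(unshardedSpec).length === 0);
```

The check cannot verify the identity requirement that `column` is a deterministic function of the primary key; that property depends on the data and has to be guaranteed by whoever writes the rows.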
## Properties

### column?

`optional column: string;`

Bucket and identity variants: the sharding column.

### maintainedIndexes?

`optional maintainedIndexes: string[];`

Names of indexes the MemWAL should keep up to date during writes.

### numBuckets?

`optional numBuckets: number;`

Bucket variant: the number of buckets, in [1, 1024].

### specType

`specType: "bucket" | "identity" | "unsharded";`

One of `"bucket"`, `"identity"`, or `"unsharded"`.

### writerConfigDefaults?

`optional writerConfigDefaults: Record<string, string>;`

Default `ShardWriter` configuration recorded in the MemWAL index.
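For the bucket variant, the change description calls the routing murmur3-based and Iceberg-compatible. The sketch below shows what an Iceberg-style bucket assignment looks like for a string key: a 32-bit murmur3 (x86 variant, seed 0) over the UTF-8 bytes, with the sign bit masked off before taking the remainder by `numBuckets`. The function names `murmur3_32` and `bucketFor` are hypothetical, and it is an assumption that Lance hashes string keys in exactly this way; treat the code as an illustration of the technique, not as the library's implementation.

```typescript
// 32-bit murmur3 (x86 variant) over raw bytes; returns a signed 32-bit integer.
function murmur3_32(data: Uint8Array, seed = 0): number {
  const c1 = 0xcc9e2d51;
  const c2 = 0x1b873593;
  const rotl = (x: number, r: number): number => (x << r) | (x >>> (32 - r));

  let h = seed | 0;
  const nblocks = data.length >> 2;

  // Body: consume the input four bytes at a time, little-endian.
  for (let i = 0; i < nblocks; i++) {
    const o = i * 4;
    let k = data[o] | (data[o + 1] << 8) | (data[o + 2] << 16) | (data[o + 3] << 24);
    k = Math.imul(k, c1);
    k = rotl(k, 15);
    k = Math.imul(k, c2);
    h ^= k;
    h = rotl(h, 13);
    h = (Math.imul(h, 5) + 0xe6546b64) | 0;
  }

  // Tail: mix in the remaining one to three bytes, if any.
  const tail = nblocks * 4;
  const rem = data.length & 3;
  if (rem > 0) {
    let k = 0;
    if (rem === 3) k ^= data[tail + 2] << 16;
    if (rem >= 2) k ^= data[tail + 1] << 8;
    k ^= data[tail];
    k = Math.imul(k, c1);
    k = rotl(k, 15);
    k = Math.imul(k, c2);
    h ^= k;
  }

  // Finalization: fold in the length and avalanche the bits.
  h ^= data.length;
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h | 0;
}

// Iceberg-style bucket: clear the sign bit, then take the remainder.
function bucketFor(key: string, numBuckets: number): number {
  if (!Number.isInteger(numBuckets) || numBuckets < 1 || numBuckets > 1024) {
    throw new RangeError("numBuckets must be an integer in [1, 1024]");
  }
  const bytes = new TextEncoder().encode(key);
  return (murmur3_32(bytes) & 0x7fffffff) % numBuckets;
}

// The same key always maps to the same bucket, which is what keeps all
// upserts of one primary key in a single shard.
console.log(bucketFor("user-42", 16) === bucketFor("user-42", 16));
```

Because the bucket depends only on the key bytes and `numBuckets`, changing `numBuckets` after data has been written would move keys between buckets, so the value is best treated as fixed once chosen.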