Compare commits

..

13 Commits

Author SHA1 Message Date
Xuanwo
ba7a0f2579 Merge remote-tracking branch 'origin/main' into xuanwo/remote-pytorch-multiprocessing
# Conflicts:
#	python/python/lancedb/remote/table.py
2026-06-01 17:54:45 +08:00
Xuanwo
5744ec048a Refine remote table pickle handling 2026-06-01 17:49:04 +08:00
dependabot[bot]
968277be79 chore(deps): bump the rust-minor-patch group with 5 updates (#3465)
Bumps the rust-minor-patch group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [log](https://github.com/rust-lang/log) | `0.4.29` | `0.4.30` |
| [serde_json](https://github.com/serde-rs/json) | `1.0.149` | `1.0.150`
|
| [http](https://github.com/hyperium/http) | `1.4.0` | `1.4.1` |
| [uuid](https://github.com/uuid-rs/uuid) | `1.23.1` | `1.23.2` |
| [aws-smithy-runtime](https://github.com/smithy-lang/smithy-rs) |
`1.11.1` | `1.11.3` |

Updates `log` from 0.4.29 to 0.4.30
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/rust-lang/log/releases">log's
releases</a>.</em></p>
<blockquote>
<h2>0.4.30</h2>
<h3>What's Changed</h3>
<ul>
<li>Support capturing of <code>std::net</code> types by <a
href="https://github.com/KodrAus"><code>@​KodrAus</code></a> in <a
href="https://redirect.github.com/rust-lang/log/pull/724">rust-lang/log#724</a></li>
</ul>
<h3>New Contributors</h3>
<ul>
<li><a href="https://github.com/V0ldek"><code>@​V0ldek</code></a> made
their first contribution in <a
href="https://redirect.github.com/rust-lang/log/pull/720">rust-lang/log#720</a></li>
<li><a href="https://github.com/woodruffw"><code>@​woodruffw</code></a>
made their first contribution in <a
href="https://redirect.github.com/rust-lang/log/pull/723">rust-lang/log#723</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/rust-lang/log/compare/0.4.29...0.4.30">https://github.com/rust-lang/log/compare/0.4.29...0.4.30</a></p>
<h3>Notable Changes</h3>
<ul>
<li>MSRV is bumped to 1.71.0 in <a
href="https://redirect.github.com/rust-lang/log/pull/723">rust-lang/log#723</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/rust-lang/log/blob/master/CHANGELOG.md">log's
changelog</a>.</em></p>
<blockquote>
<h2>[0.4.30] - 2026-05-21</h2>
<h3>What's Changed</h3>
<ul>
<li>Support capturing of <code>std::net</code> types by <a
href="https://github.com/KodrAus"><code>@​KodrAus</code></a> in <a
href="https://redirect.github.com/rust-lang/log/pull/724">rust-lang/log#724</a></li>
</ul>
<h3>New Contributors</h3>
<ul>
<li><a href="https://github.com/V0ldek"><code>@​V0ldek</code></a> made
their first contribution in <a
href="https://redirect.github.com/rust-lang/log/pull/720">rust-lang/log#720</a></li>
<li><a href="https://github.com/woodruffw"><code>@​woodruffw</code></a>
made their first contribution in <a
href="https://redirect.github.com/rust-lang/log/pull/723">rust-lang/log#723</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/rust-lang/log/compare/0.4.29...0.4.30">https://github.com/rust-lang/log/compare/0.4.29...0.4.30</a></p>
<h3>Notable Changes</h3>
<ul>
<li>MSRV is bumped to 1.71.0 in <a
href="https://redirect.github.com/rust-lang/log/pull/723">rust-lang/log#723</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="9c55760b49"><code>9c55760</code></a>
Merge pull request <a
href="https://redirect.github.com/rust-lang/log/issues/725">#725</a>
from rust-lang/cargo/0.4.30</li>
<li><a
href="d1acb0585c"><code>d1acb05</code></a>
update docs on current MSRV and note latest bump in changelog</li>
<li><a
href="50682937b0"><code>5068293</code></a>
prepare for 0.4.30 release</li>
<li><a
href="7ccd873cb5"><code>7ccd873</code></a>
Merge pull request <a
href="https://redirect.github.com/rust-lang/log/issues/724">#724</a>
from rust-lang/feat/net-to-value</li>
<li><a
href="923dfaaf00"><code>923dfaa</code></a>
fix up test cfgs</li>
<li><a
href="ecb7de8daf"><code>ecb7de8</code></a>
gate net value impls on std</li>
<li><a
href="67bb4f6d2e"><code>67bb4f6</code></a>
run fmt</li>
<li><a
href="25f49fe3d3"><code>25f49fe</code></a>
rework net type capturing</li>
<li><a
href="7087dcb95c"><code>7087dcb</code></a>
feat: impl ToValue for core::net types</li>
<li><a
href="67bc7e32c6"><code>67bc7e3</code></a>
Merge pull request <a
href="https://redirect.github.com/rust-lang/log/issues/723">#723</a>
from woodruffw-forks/ww/ci</li>
<li>Additional commits viewable in <a
href="https://github.com/rust-lang/log/compare/0.4.29...0.4.30">compare
view</a></li>
</ul>
</details>
<br />

Updates `serde_json` from 1.0.149 to 1.0.150
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/serde-rs/json/releases">serde_json's
releases</a>.</em></p>
<blockquote>
<h2>v1.0.150</h2>
<ul>
<li>Reject non-string enum object keys (<a
href="https://redirect.github.com/serde-rs/json/issues/1324">#1324</a>,
thanks <a
href="https://github.com/puneetdixit200"><code>@​puneetdixit200</code></a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="a1ae73ac6a"><code>a1ae73a</code></a>
Release 1.0.150</li>
<li><a
href="1a360b0a6c"><code>1a360b0</code></a>
Merge pull request <a
href="https://redirect.github.com/serde-rs/json/issues/1324">#1324</a>
from puneetdixit200/reject-non-string-enum-keys</li>
<li><a
href="2037b634f9"><code>2037b63</code></a>
Reject non-string enum object keys</li>
<li><a
href="5d30df60e9"><code>5d30df6</code></a>
Resolve manual_assert_eq pedantic clippy lint</li>
<li><a
href="dc8003a88e"><code>dc8003a</code></a>
Raise required compiler for preserve_order feature to 1.85</li>
<li><a
href="a42fa980f8"><code>a42fa98</code></a>
Unpin CI miri toolchain</li>
<li><a
href="684a60eba1"><code>684a60e</code></a>
Pin CI miri to nightly-2026-02-11</li>
<li><a
href="7c7da3302b"><code>7c7da33</code></a>
Raise required compiler to Rust 1.71</li>
<li><a
href="acf4850e29"><code>acf4850</code></a>
Simplify Number::is_f64</li>
<li><a
href="6b8ceab565"><code>6b8ceab</code></a>
Resolve unnecessary_map_or clippy lint</li>
<li>Additional commits viewable in <a
href="https://github.com/serde-rs/json/compare/v1.0.149...v1.0.150">compare
view</a></li>
</ul>
</details>
<br />

Updates `http` from 1.4.0 to 1.4.1
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/hyperium/http/releases">http's
releases</a>.</em></p>
<blockquote>
<h2>v1.4.1</h2>
<h2>tl;dr</h2>
<ul>
<li>Fix <code>PathAndQuery::from_static()</code> and
<code>from_shared()</code> to reject inputs that do not start with
<code>/</code>.</li>
<li>Fix <code>Extend</code> for <code>HeaderMap</code> to clamp max size
hint and not overflow.</li>
<li>Fix <code>header::IntoIter</code> that could use-after-free if the
generic value type could panic on drop.</li>
<li>Fix <code>header::{IterMut, ValuesIterMut}</code> to not violate
stacked borrows.</li>
</ul>
<h2>What's Changed</h2>
<ul>
<li>chore(header): fix clippy::assign_op_pattern by <a
href="https://github.com/rxc-amzn"><code>@​rxc-amzn</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/806">hyperium/http#806</a></li>
<li>ci: pin itoa in msrv job by <a
href="https://github.com/seanmonstar"><code>@​seanmonstar</code></a> in
<a
href="https://redirect.github.com/hyperium/http/pull/813">hyperium/http#813</a></li>
<li>Remove unnecessary explicit lifetimes by <a
href="https://github.com/jplatte"><code>@​jplatte</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/815">hyperium/http#815</a></li>
<li>chore(ci): update to actions/checkout@v6 by <a
href="https://github.com/tottoto"><code>@​tottoto</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/819">hyperium/http#819</a></li>
<li>tests: update to rand 0.10 by <a
href="https://github.com/tottoto"><code>@​tottoto</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/818">hyperium/http#818</a></li>
<li>refactor: Remove usage of float instruction by <a
href="https://github.com/AurelienFT"><code>@​AurelienFT</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/823">hyperium/http#823</a></li>
<li>refactor(uri): consolidate PathAndQuery::from_shared and from_static
by <a
href="https://github.com/seanmonstar"><code>@​seanmonstar</code></a> in
<a
href="https://redirect.github.com/hyperium/http/pull/825">hyperium/http#825</a></li>
<li>fix(uri): reject Path::from_shared/from_static if doesn't start with
slash by <a
href="https://github.com/seanmonstar"><code>@​seanmonstar</code></a> in
<a
href="https://redirect.github.com/hyperium/http/pull/826">hyperium/http#826</a></li>
<li>Rephrase comment by <a
href="https://github.com/daalfox"><code>@​daalfox</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/827">hyperium/http#827</a></li>
<li>Fix typo in request builder docs by <a
href="https://github.com/vleksis"><code>@​vleksis</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/831">hyperium/http#831</a></li>
<li>fix: clamp Extend size hint so HeaderMap reserve cannot overflow by
<a href="https://github.com/SAY-5"><code>@​SAY-5</code></a> in <a
href="https://redirect.github.com/hyperium/http/pull/833">hyperium/http#833</a></li>
<li>fix(headers): fix stacked borrows for IterMut/ValuesIterMut by <a
href="https://github.com/seanmonstar"><code>@​seanmonstar</code></a> in
<a
href="https://redirect.github.com/hyperium/http/pull/837">hyperium/http#837</a></li>
<li>fix(header): use a set_len guard in IntoIter drop by <a
href="https://github.com/seanmonstar"><code>@​seanmonstar</code></a> in
<a
href="https://redirect.github.com/hyperium/http/pull/838">hyperium/http#838</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/rxc-amzn"><code>@​rxc-amzn</code></a>
made their first contribution in <a
href="https://redirect.github.com/hyperium/http/pull/806">hyperium/http#806</a></li>
<li><a
href="https://github.com/AurelienFT"><code>@​AurelienFT</code></a> made
their first contribution in <a
href="https://redirect.github.com/hyperium/http/pull/823">hyperium/http#823</a></li>
<li><a href="https://github.com/daalfox"><code>@​daalfox</code></a> made
their first contribution in <a
href="https://redirect.github.com/hyperium/http/pull/827">hyperium/http#827</a></li>
<li><a href="https://github.com/vleksis"><code>@​vleksis</code></a> made
their first contribution in <a
href="https://redirect.github.com/hyperium/http/pull/831">hyperium/http#831</a></li>
<li><a href="https://github.com/SAY-5"><code>@​SAY-5</code></a> made
their first contribution in <a
href="https://redirect.github.com/hyperium/http/pull/833">hyperium/http#833</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/hyperium/http/compare/v1.4.0...v1.4.1">https://github.com/hyperium/http/compare/v1.4.0...v1.4.1</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/hyperium/http/blob/master/CHANGELOG.md">http's
changelog</a>.</em></p>
<blockquote>
<h1>1.4.1 (May 25, 2026)</h1>
<ul>
<li>Fix <code>PathAndQuery::from_static()</code> and
<code>from_shared()</code> to reject inputs that do not start with
<code>/</code>.</li>
<li>Fix <code>Extend</code> for <code>HeaderMap</code> to clamp max size
hint and not overflow.</li>
<li>Fix <code>header::IntoIter</code> that could use-after-free if the
generic value type could panic on drop.</li>
<li>Fix <code>header::{IterMut, ValuesIterMut}</code> to not violate
stacked borrows.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="a24c968ba3"><code>a24c968</code></a>
v1.4.1</li>
<li><a
href="bc3b0441be"><code>bc3b044</code></a>
fix(header): use a set_len guard in IntoIter drop (<a
href="https://redirect.github.com/hyperium/http/issues/838">#838</a>)</li>
<li><a
href="1b968dc519"><code>1b968dc</code></a>
fix(header): fix stacked borrows for IterMut/ValuesIterMut (<a
href="https://redirect.github.com/hyperium/http/issues/837">#837</a>)</li>
<li><a
href="6e2dd42a15"><code>6e2dd42</code></a>
fix: clamp Extend size hint so HeaderMap reserve cannot overflow (<a
href="https://redirect.github.com/hyperium/http/issues/833">#833</a>)</li>
<li><a
href="68e0abb052"><code>68e0abb</code></a>
docs: fix typo in request builder docs (<a
href="https://redirect.github.com/hyperium/http/issues/831">#831</a>)</li>
<li><a
href="29dd307b3e"><code>29dd307</code></a>
docs(extensions): rephrase internal comment (<a
href="https://redirect.github.com/hyperium/http/issues/827">#827</a>)</li>
<li><a
href="ae48fb55b0"><code>ae48fb5</code></a>
fix(uri): reject Path::from_shared/from_static if doesn't start with
slash (#...</li>
<li><a
href="1ad200ec4c"><code>1ad200e</code></a>
refactor(uri): consolidate PathAndQuery::from_shared and from_static (<a
href="https://redirect.github.com/hyperium/http/issues/825">#825</a>)</li>
<li><a
href="d59d939f92"><code>d59d939</code></a>
refactor: Remove usage of float instruction (<a
href="https://redirect.github.com/hyperium/http/issues/823">#823</a>)</li>
<li><a
href="ed680c4d90"><code>ed680c4</code></a>
tests: update to rand 0.10 (<a
href="https://redirect.github.com/hyperium/http/issues/818">#818</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/hyperium/http/compare/v1.4.0...v1.4.1">compare
view</a></li>
</ul>
</details>
<br />

Updates `uuid` from 1.23.1 to 1.23.2
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/uuid-rs/uuid/releases">uuid's
releases</a>.</em></p>
<blockquote>
<h2>v1.23.2</h2>
<h2>What's Changed</h2>
<ul>
<li>Improve error messages for ambiguous formats by <a
href="https://github.com/KodrAus"><code>@​KodrAus</code></a> in <a
href="https://redirect.github.com/uuid-rs/uuid/pull/882">uuid-rs/uuid#882</a></li>
<li>Prepare for 1.23.2 release by <a
href="https://github.com/KodrAus"><code>@​KodrAus</code></a> in <a
href="https://redirect.github.com/uuid-rs/uuid/pull/883">uuid-rs/uuid#883</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/uuid-rs/uuid/compare/v1.23.1...v1.23.2">https://github.com/uuid-rs/uuid/compare/v1.23.1...v1.23.2</a></p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d11965705f"><code>d119657</code></a>
Merge pull request <a
href="https://redirect.github.com/uuid-rs/uuid/issues/883">#883</a> from
uuid-rs/cargo/v1.23.2</li>
<li><a
href="0651cfcb89"><code>0651cfc</code></a>
prepare for 1.23.2 release</li>
<li><a
href="e8dea0c1fd"><code>e8dea0c</code></a>
Merge pull request <a
href="https://redirect.github.com/uuid-rs/uuid/issues/882">#882</a> from
uuid-rs/fix/error-msgs</li>
<li><a
href="bdc429a8c7"><code>bdc429a</code></a>
fix up serde messages</li>
<li><a
href="d4342e400d"><code>d4342e4</code></a>
make indexes 0 based and fix up more error messages</li>
<li><a
href="4ad479fc20"><code>4ad479f</code></a>
work on more accurate parser errors</li>
<li>See full diff in <a
href="https://github.com/uuid-rs/uuid/compare/v1.23.1...v1.23.2">compare
view</a></li>
</ul>
</details>
<br />

Updates `aws-smithy-runtime` from 1.11.1 to 1.11.3
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/smithy-lang/smithy-rs/commits">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore <dependency name> major version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's major version (unless you unignore this specific
dependency's major version or upgrade to it yourself)
- `@dependabot ignore <dependency name> minor version` will close this
group update PR and stop Dependabot creating any more for the specific
dependency's minor version (unless you unignore this specific
dependency's minor version or upgrade to it yourself)
- `@dependabot ignore <dependency name>` will close this group update PR
and stop Dependabot creating any more for the specific dependency
(unless you unignore this specific dependency or upgrade to it yourself)
- `@dependabot unignore <dependency name>` will remove all of the ignore
conditions of the specified dependency
- `@dependabot unignore <dependency name> <ignore condition>` will
remove the ignore condition of the specified dependency and ignore
conditions


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-30 19:22:50 -07:00
Xuanwo
5638907fa5 chore: update Lance to v7.2.0-beta.1 (#3461)
Update the Rust workspace Lance git dependencies and Java lance-core
dependency to v7.2.0-beta.1.

This keeps LanceDB aligned with the latest Lance beta release and
refreshes the Cargo lockfile for the new Lance dependency graph.
2026-05-30 00:18:22 +08:00
Heng Ge
048f52c2aa feat(table): route merge_insert through the MemWAL LSM write path (#3354)
## Summary

When an `LsmWriteSpec` is installed on a table (#3396), `merge_insert`
upsert
calls are dispatched through Lance's MemWAL `ShardWriter` (LSM-style
append)
instead of the standard merge path.

- **`use_lsm_write`** — a `merge_insert` builder option, default `true`;
set it
  `false` to use the standard path for a call even when a spec is set.
- **`assume_pre_sharded`** — a `merge_insert` builder option, default
`false`;
  skips the per-row shard check and routes by the first row only.
- **`close_lsm_writers`** — drains and closes the table's cached MemWAL
shard
  writers.
- The `merge_insert` **`on`** columns default to, and are validated
against,
  the table's unenforced primary key.
- Shard writers are cached alongside the dataset (in
  `DatasetConsistencyWrapper`) and reused for the session.
- `MergeResult` gains **`num_rows`** — on the LSM path the insert/update
  breakdown is unknown until compaction, so only the total is reported.

Routing covers all three sharding strategies — bucket (murmur3,
Iceberg-compatible), identity, and unsharded. Each `merge_insert` call
targets
a single shard; the whole input is collected and validated before a
single
atomic `ShardWriter::put`, so a validation failure leaves the MemWAL
untouched.

Bindings: Python (`merge_insert(...).use_lsm_write(...)` /
`.assume_pre_sharded(...)`, `Table.close_lsm_writers`) and TypeScript
(`mergeInsert(...).useLsmWrite(...)` / `.assumePreSharded(...)`,
`Table.closeLsmWriters`).

## Context

Reconstructed from the original #3354 branch onto current `main`: the
branch
predated the #3394 (unenforced primary key) / #3396 (`LsmWriteSpec`)
split and
has been rebuilt on that merged foundation. Depends on Lance
`v7.0.0-beta.13`.

The MemWAL read path (reading un-flushed shard data back into queries)
and
remote (LanceDB Cloud) LSM support are follow-ups.

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
2026-05-29 08:48:11 -07:00
Will Jones
458dcabbd2 chore: upgrade Rust toolchain to 1.95.0 (#3390)
Bumps the pinned toolchain in `rust-toolchain.toml` from 1.94.0 to
1.95.0.

Fixes new lints surfaced by clippy on 1.95.0:

- `manual_checked_ops` — fragment size mean in `table.rs` uses
`checked_div`
- `explicit_counter_loop` — shuffle test loop in `shuffle.rs`

No rustc warnings were introduced.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 08:21:45 -07:00
Xuanwo
60ac5c9a7c test(python): fix remote create_index schema fixture (#3462)
The latest main Python workflow fails across multiple matrix jobs
because `test_remote_create_index_new_api` opens a remote table whose
mocked schema only exposes `id`, while the new `create_index(...,
config=...)` path validates the requested indexed columns.

This updates the remote-table fixture to include the indexed columns
used by the smoke test and checks the emitted column payloads, keeping
the test aligned with the schema-aware API path.
2026-05-29 23:04:42 +08:00
Will Jones
d05fe8ec44 feat(python): unify sync create_index API to match async API (#2882)
## Summary

- Transitions `LanceTable` and `RemoteTable` to use the unified
`create_index()` API matching `AsyncTable`
- Deprecates `create_scalar_index()` and `create_fts_index()` with
deprecation warnings
- Adds detection logic to distinguish legacy vs new API calls
- Adds `@overload` decorators for type checker compatibility
- Adds `accelerator` parameter to IVF config classes for GPU support

**New API:**
```python
table.create_index("vec", config=IvfPq(distance_type="l2"))
table.create_index("col", config=BTree())
table.create_index("text_col", config=FTS(with_position=True))
```

**Legacy API (deprecated):**
```python
table.create_index("l2", vector_column_name="vec")  # emits DeprecationWarning
table.create_scalar_index("col", index_type="BTREE")  # deprecated
table.create_fts_index("text_col")  # deprecated
```

Fixes #2879

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-05-28 16:41:47 -07:00
Will Jones
ab982d7f65 perf: migrate list_indices to use Lance's describe_indices (#3108)
This needs https://github.com/lance-format/lance/pull/6099 to work.

Closes #3140

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 16:41:05 -07:00
Will Jones
a3339b7bdd ci: drop manylinux2_17 wheel builds (#3455)
manylinux2_17 reached EOL in 2024 and pyarrow stopped publishing 2_17
wheels long ago. We already build manylinux2_28 wheels, so drop the 2_17
matrix entries.

Fixes #3452

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 11:30:42 -07:00
Will Jones
b20cdc4f93 ci: fix pypi publish on mac/windows/arm (#3449)
The python-v0.32.0 publish run failed on every build matrix entry. Three
independent issues:

1. **Mac and Windows**: `pypa/gh-action-pypi-publish` only runs on
Linux, but was being called inline from each build job.
2. **Linux (all arches)**: `pypa/gh-action-pypi-publish` derives its
docker image name from `github.action_repository`, which is empty when
the action is invoked from inside a composite action
(actions/runner#2473 — pypa's own `action.yml` references this bug). It
falls back to `github.repository`, generating
`docker://ghcr.io/lancedb/lancedb:<tag>`, which doesn't exist →
`denied`. Only the ARM matrix entry surfaced this because it failed
first and cancel-cascaded the rest.
3. **Windows**: `upload-artifact` in `build_windows_wheel` pointed at
`python\target\wheels`, but maturin writes to the workspace-root
`target/wheels`. The artifact was always empty. Also, `pypi-publish.yml`
passed a `vcpkg_token` input that the composite doesn't declare.

## Changes

- Build jobs (linux/mac/windows) now upload their wheels as
`actions/upload-artifact` artifacts.
- New Linux `publish` job downloads all wheel artifacts and runs the
Fury or PyPA publish step directly (not via a composite), so
`github.action_repository` resolves correctly.
- Delete the unused `upload_wheel` composite action.
- Drop the broken upload-artifact step inside `build_windows_wheel`.
- Remove the bogus `vcpkg_token` input.
- Fury upload now loops over all wheels instead of just the first.
- Bump `actions/checkout`, `actions/upload-artifact`,
`actions/download-artifact` to current major versions (Node 24) to clear
deprecation warnings.
- Bump Windows job timeout 60 → 90 minutes; previous run was
cancel-timing-out on a 60m cap.
- Use `rust-lld` as the Windows MSVC linker via
`CARGO_TARGET_X86_64_PC_WINDOWS_MSVC_LINKER`. `link.exe` is
single-threaded and the long pole on Windows builds.

Fixes #3445

## Test plan

- [x] Open this PR — `paths` filter triggers a dry-run build on all
three platforms.
- [x] Verify all three builds produce wheels.
- [x] Confirm the `pypa/gh-action-pypi-publish` container actually
starts (the actions/runner#2473 bug) via the `publish-dry-run` job
pointed at TestPyPI.
- [x] **REMOVE BEFORE MERGE**: drop the `publish-dry-run` job and the
now-redundant `actions/upload-artifact` runs on PRs (currently always-on
so the dry-run has wheels to publish).
- [ ] After merge, cherry-pick onto `python-v0.32.0` and force-push the
tag to re-trigger the publish.
2026-05-27 13:43:42 -07:00
LanceDB Robot
e77a62e35a chore: update lance dependency to v7.1.0-beta.4 (#3450)
## Summary

- Updates Lance Rust workspace dependencies to `v7.1.0-beta.4` using
`ci/set_lance_version.py`.
- Updates the Java `lance-core` dependency property to `7.1.0-beta.4`.
- Triggering Lance tag:
https://github.com/lance-format/lance/releases/tag/v7.1.0-beta.4

## Verification

- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`

Co-authored-by: Daniel Rammer <hamersaw@protonmail.com>
2026-05-27 08:45:28 -05:00
Xuanwo
8d5f65af61 Support remote tables in PyTorch dataloaders 2026-05-22 19:02:38 +08:00
55 changed files with 3881 additions and 546 deletions

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.30.0"
current_version = "0.30.0-beta.1"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

View File

@@ -29,7 +29,3 @@ runs:
args: ${{ inputs.args }}
docker-options: "-e PIP_EXTRA_INDEX_URL='https://pypi.fury.io/lance-format/ https://pypi.fury.io/lancedb/'"
working-directory: python
- uses: actions/upload-artifact@v4
with:
name: windows-wheels
path: python\target\wheels

View File

@@ -8,6 +8,9 @@ on:
# This should trigger a dry run (we skip the final publish step)
paths:
- .github/workflows/pypi-publish.yml
- .github/workflows/build_linux_wheel/action.yml
- .github/workflows/build_mac_wheel/action.yml
- .github/workflows/build_windows_wheel/action.yml
- Cargo.toml # Change in dependency frequently breaks builds
- Cargo.lock
@@ -21,32 +24,21 @@ jobs:
linux:
name: Python ${{ matrix.config.platform }} manylinux${{ matrix.config.manylinux }}
timeout-minutes: 60
permissions:
id-token: write
contents: read
strategy:
matrix:
config:
- platform: x86_64
manylinux: "2_17"
extra_args: ""
runner: ubuntu-22.04
- platform: x86_64
manylinux: "2_28"
extra_args: "--features fp16kernels"
runner: ubuntu-22.04
- platform: aarch64
manylinux: "2_17"
extra_args: ""
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
runner: ubuntu-2404-8x-arm64
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
- platform: aarch64
manylinux: "2_28"
extra_args: "--features fp16kernels"
runner: ubuntu-2404-8x-arm64
runs-on: ${{ matrix.config.runner }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -60,15 +52,14 @@ jobs:
args: "--release --strip ${{ matrix.config.extra_args }}"
arm-build: ${{ matrix.config.platform == 'aarch64' }}
manylinux: ${{ matrix.config.manylinux }}
- uses: ./.github/workflows/upload_wheel
- uses: actions/upload-artifact@v7
if: startsWith(github.ref, 'refs/tags/python-v')
with:
fury_token: ${{ secrets.FURY_TOKEN }}
name: wheels-linux-${{ matrix.config.platform }}-${{ matrix.config.manylinux }}
path: target/wheels/lancedb-*.whl
if-no-files-found: error
mac:
timeout-minutes: 90
permissions:
id-token: write
contents: read
runs-on: ${{ matrix.config.runner }}
strategy:
matrix:
@@ -78,7 +69,7 @@ jobs:
env:
MACOSX_DEPLOYMENT_TARGET: 10.15
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -90,18 +81,21 @@ jobs:
with:
python-minor-version: 10
args: "--release --strip --target ${{ matrix.config.target }} --features fp16kernels"
- uses: ./.github/workflows/upload_wheel
- uses: actions/upload-artifact@v7
if: startsWith(github.ref, 'refs/tags/python-v')
with:
fury_token: ${{ secrets.FURY_TOKEN }}
name: wheels-mac-${{ matrix.config.target }}
path: target/wheels/lancedb-*.whl
if-no-files-found: error
windows:
timeout-minutes: 60
permissions:
id-token: write
contents: read
timeout-minutes: 90
runs-on: windows-latest
env:
# link.exe is single-threaded and the long pole on Windows builds. Use
# rustc's bundled lld-link instead.
CARGO_TARGET_X86_64_PC_WINDOWS_MSVC_LINKER: rust-lld
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -113,18 +107,70 @@ jobs:
with:
python-minor-version: 10
args: "--release --strip"
vcpkg_token: ${{ secrets.VCPKG_GITHUB_PACKAGES }}
- uses: ./.github/workflows/upload_wheel
- uses: actions/upload-artifact@v7
if: startsWith(github.ref, 'refs/tags/python-v')
with:
fury_token: ${{ secrets.FURY_TOKEN }}
name: wheels-windows
path: target/wheels/lancedb-*.whl
if-no-files-found: error
publish:
name: Publish wheels
if: startsWith(github.ref, 'refs/tags/python-v')
needs: [linux, mac, windows]
runs-on: ubuntu-latest
permissions:
id-token: write
contents: read
steps:
- uses: actions/checkout@v6
- name: Download wheel artifacts
uses: actions/download-artifact@v8
with:
pattern: wheels-*
path: target/wheels
merge-multiple: true
- name: List wheels
run: ls -la target/wheels
- name: Choose repo
id: choose_repo
run: |
if [[ ${{ github.ref }} == *beta* ]]; then
echo "repo=fury" >> $GITHUB_OUTPUT
else
echo "repo=pypi" >> $GITHUB_OUTPUT
fi
- name: Publish to Fury
if: steps.choose_repo.outputs.repo == 'fury'
env:
FURY_TOKEN: ${{ secrets.FURY_TOKEN }}
run: |
shopt -s nullglob
WHEELS=(target/wheels/lancedb-*.whl)
if [[ ${#WHEELS[@]} -eq 0 ]]; then
echo "No wheels found in target/wheels/" >&2
exit 1
fi
for WHEEL in "${WHEELS[@]}"; do
echo "Uploading $WHEEL to Fury"
curl -f -F package=@"$WHEEL" "https://$FURY_TOKEN@push.fury.io/lancedb/"
done
# NOTE: pypa/gh-action-pypi-publish must be invoked directly from a
# workflow file, not from inside a composite action. When called from a
# composite, `github.action_repository` is empty (actions/runner#2473)
# and the action falls back to `github.repository`, producing a bogus
# `docker://ghcr.io/<repo>:<ref>` image reference that GHA tries to pull.
- name: Publish to PyPI
if: steps.choose_repo.outputs.repo == 'pypi'
uses: pypa/gh-action-pypi-publish@release/v1
with:
packages-dir: target/wheels/
gh-release:
if: startsWith(github.ref, 'refs/tags/python-v')
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -187,13 +233,13 @@ jobs:
report-failure:
name: Report Workflow Failure
runs-on: ubuntu-latest
needs: [linux, mac, windows]
needs: [linux, mac, windows, publish]
permissions:
contents: read
issues: write
if: always() && failure() && startsWith(github.ref, 'refs/tags/python-v')
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- uses: ./.github/actions/create-failure-issue
with:
job-results: ${{ toJSON(needs) }}

View File

@@ -1,34 +0,0 @@
name: upload-wheel
description: "Upload wheels to Pypi"
inputs:
fury_token:
required: true
description: "release token for the fury repo"
runs:
using: "composite"
steps:
- name: Choose repo
shell: bash
id: choose_repo
run: |
if [[ ${{ github.ref }} == *beta* ]]; then
echo "repo=fury" >> $GITHUB_OUTPUT
else
echo "repo=pypi" >> $GITHUB_OUTPUT
fi
- name: Publish to Fury
if: steps.choose_repo.outputs.repo == 'fury'
shell: bash
env:
FURY_TOKEN: ${{ inputs.fury_token }}
run: |
WHEEL=$(ls target/wheels/lancedb-*.whl 2> /dev/null | head -n 1)
echo "Uploading $WHEEL to Fury"
curl -f -F package=@$WHEEL https://$FURY_TOKEN@push.fury.io/lancedb/
- name: Publish to PyPI
if: steps.choose_repo.outputs.repo == 'pypi'
uses: pypa/gh-action-pypi-publish@release/v1
with:
packages-dir: target/wheels/

319
Cargo.lock generated
View File

@@ -568,7 +568,7 @@ dependencies = [
"bytes",
"fastrand",
"hex",
"http 1.4.0",
"http 1.4.1",
"sha1 0.10.6",
"time",
"tokio",
@@ -631,7 +631,7 @@ dependencies = [
"bytes-utils",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"http-body 0.4.6",
"http-body 1.0.1",
"percent-encoding",
@@ -661,7 +661,7 @@ dependencies = [
"bytes",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"http-body-util",
"regex-lite",
"tracing",
@@ -686,7 +686,7 @@ dependencies = [
"bytes",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"regex-lite",
"tracing",
]
@@ -710,7 +710,7 @@ dependencies = [
"bytes",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"regex-lite",
"tracing",
]
@@ -740,7 +740,7 @@ dependencies = [
"hex",
"hmac 0.13.0",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"lru",
"percent-encoding",
@@ -769,7 +769,7 @@ dependencies = [
"bytes",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"regex-lite",
"tracing",
]
@@ -793,7 +793,7 @@ dependencies = [
"bytes",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"regex-lite",
"tracing",
]
@@ -818,7 +818,7 @@ dependencies = [
"aws-types",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"regex-lite",
"tracing",
]
@@ -840,7 +840,7 @@ dependencies = [
"hex",
"hmac 0.13.0",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"p256",
"percent-encoding",
"ring",
@@ -873,7 +873,7 @@ dependencies = [
"bytes",
"crc-fast",
"hex",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"md-5 0.11.0",
@@ -907,7 +907,7 @@ dependencies = [
"bytes-utils",
"futures-core",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"percent-encoding",
@@ -928,7 +928,7 @@ dependencies = [
"h2 0.3.27",
"h2 0.4.14",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"http-body 0.4.6",
"hyper 0.14.32",
"hyper 1.9.0",
@@ -990,7 +990,7 @@ dependencies = [
"bytes",
"fastrand",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"http-body 0.4.6",
"http-body 1.0.1",
"http-body-util",
@@ -1011,7 +1011,7 @@ dependencies = [
"aws-smithy-types",
"bytes",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"pin-project-lite",
"tokio",
"tracing",
@@ -1037,7 +1037,7 @@ checksum = "7442cb268338f0eb8278140a107c046756aa01093d8ef5e99628d34ae09c94f5"
dependencies = [
"aws-smithy-runtime-api",
"aws-smithy-types",
"http 1.4.0",
"http 1.4.1",
]
[[package]]
@@ -1051,7 +1051,7 @@ dependencies = [
"bytes-utils",
"futures-core",
"http 0.2.12",
"http 1.4.0",
"http 1.4.1",
"http-body 0.4.6",
"http-body 1.0.1",
"http-body-util",
@@ -1099,7 +1099,7 @@ dependencies = [
"axum-core",
"bytes",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"hyper 1.9.0",
@@ -1132,7 +1132,7 @@ dependencies = [
"async-trait",
"bytes",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"mime",
@@ -1411,6 +1411,12 @@ dependencies = [
"syn 2.0.117",
]
[[package]]
name = "bytecount"
version = "0.6.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "175812e0be2bccb6abe50bb8d566126198344f707e304f45c648fd8f2cc0365e"
[[package]]
name = "bytemuck"
version = "1.25.0"
@@ -1534,9 +1540,9 @@ dependencies = [
[[package]]
name = "cedarwood"
version = "0.4.6"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6d910bedd62c24733263d0bed247460853c9d22e8956bd4cd964302095e04e90"
checksum = "c0524a528a6a0288df1863c3c20fe92c301875b4941e7b6c4b394ab08c5a4c55"
dependencies = [
"smallvec",
]
@@ -3296,9 +3302,8 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
[[package]]
name = "fsst"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bcd0ce0249ac12fd44fcde62d435c36d881952c2f0df4d1de24b45e1dbba5ddb"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"rand 0.9.4",
@@ -3688,7 +3693,7 @@ dependencies = [
"fnv",
"futures-core",
"futures-sink",
"http 1.4.0",
"http 1.4.1",
"indexmap 2.14.0",
"slab",
"tokio",
@@ -3794,7 +3799,7 @@ checksum = "629d8f3bbeda9d148036d6b0de0a3ab947abd08ce90626327fc3547a49d59d97"
dependencies = [
"dirs",
"futures",
"http 1.4.0",
"http 1.4.1",
"indicatif",
"libc",
"log",
@@ -3817,7 +3822,7 @@ checksum = "430b33fa84f92796d4d263070b6c0d3ca219df7b9a0e1853ee431029b1612bcd"
dependencies = [
"async-trait",
"bytes",
"http 1.4.0",
"http 1.4.1",
"more-asserts",
"serde",
"thiserror 2.0.18",
@@ -3871,9 +3876,9 @@ dependencies = [
[[package]]
name = "http"
version = "1.4.0"
version = "1.4.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e3ba2a386d7f85a81f119ad7498ebe444d2e22c2af0b86b069416ace48b3311a"
checksum = "8be7462df143984c4598a256ef469b251d7d7f9e271135073e78fc535414f3d0"
dependencies = [
"bytes",
"itoa",
@@ -3897,7 +3902,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1efedce1fb8e6913f23e0c92de8e62cd5b772a67e7b3946df930a62566c93184"
dependencies = [
"bytes",
"http 1.4.0",
"http 1.4.1",
]
[[package]]
@@ -3908,7 +3913,7 @@ checksum = "b021d93e26becf5dc7e1b75b1bed1fd93124b374ceb73f43d4d4eafec896a64a"
dependencies = [
"bytes",
"futures-core",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"pin-project-lite",
]
@@ -3975,7 +3980,7 @@ dependencies = [
"futures-channel",
"futures-core",
"h2 0.4.14",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"httparse",
"httpdate",
@@ -4007,7 +4012,7 @@ version = "0.27.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "33ca68d021ef39cf6463ab54c1d0f5daf03377b70561305bb89a8f83aab66e0f"
dependencies = [
"http 1.4.0",
"http 1.4.1",
"hyper 1.9.0",
"hyper-util",
"rustls 0.23.40",
@@ -4028,7 +4033,7 @@ dependencies = [
"bytes",
"futures-channel",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"hyper 1.9.0",
"ipnet",
@@ -4090,6 +4095,21 @@ dependencies = [
"zerovec",
]
[[package]]
name = "icu_locale"
version = "2.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d5a396343c7208121dc86e35623d3dfe19814a7613cfd14964994cdc9c9a2e26"
dependencies = [
"icu_collections",
"icu_locale_core",
"icu_locale_data",
"icu_provider",
"potential_utf",
"tinystr",
"zerovec",
]
[[package]]
name = "icu_locale_core"
version = "2.2.0"
@@ -4098,11 +4118,18 @@ checksum = "92219b62b3e2b4d88ac5119f8904c10f8f61bf7e95b640d25ba3075e6cac2c29"
dependencies = [
"displaydoc",
"litemap",
"serde",
"tinystr",
"writeable",
"zerovec",
]
[[package]]
name = "icu_locale_data"
version = "2.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d5fdcc9ac77c6d74ff5cf6e65ef3181d6af32003b16fce3a77fb451d2f695993"
[[package]]
name = "icu_normalizer"
version = "2.2.0"
@@ -4151,6 +4178,8 @@ checksum = "139c4cf31c8b5f33d7e199446eff9c1e02decfc2f0eec2c8d71f65befa45b421"
dependencies = [
"displaydoc",
"icu_locale_core",
"serde",
"stable_deref_trait",
"writeable",
"yoke",
"zerofrom",
@@ -4158,6 +4187,27 @@ dependencies = [
"zerovec",
]
[[package]]
name = "icu_segmenter"
version = "2.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5c0794db0b1a86193ac9c48768d0e6c52c54448e0870ad87907d456ee0dac964"
dependencies = [
"icu_collections",
"icu_locale",
"icu_provider",
"icu_segmenter_data",
"potential_utf",
"utf8_iter",
"zerovec",
]
[[package]]
name = "icu_segmenter_data"
version = "2.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e4a2c462a4d927d512f5f882a033ddd62f33a05bb9f230d98f736ac3dc85938f"
[[package]]
name = "id-arena"
version = "2.3.0"
@@ -4319,19 +4369,20 @@ checksum = "9028f49264629065d057f340a86acb84867925865f73bbf8d47b4d149a7e88b8"
[[package]]
name = "jieba-macros"
version = "0.9.0"
version = "0.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a29cfc5dcd898604c6f80363411fa6b6b08e27d1d253d6225b9cb6702ea02fc0"
checksum = "46adade69b634535a8f495cf87710ed893cff53e1dbc9dd750c2ab81c5defb82"
dependencies = [
"phf_codegen 0.13.1",
]
[[package]]
name = "jieba-rs"
version = "0.9.0"
version = "0.10.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3245d6e9d1d5facbd6a23848d6b67e3439738ccbb4fa5a3d65da315ba1a910a2"
checksum = "11b53580aaa8ec8b713da271da434f8947409242c537a9ab3f7b76bdbb19e8a9"
dependencies = [
"bytecount",
"cedarwood",
"jieba-macros",
"phf 0.13.1",
@@ -4519,9 +4570,8 @@ checksum = "e037a2e1d8d5fdbd49b16a4ea09d5d6401c1f29eca5ff29d03d3824dba16256a"
[[package]]
name = "lance"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3944aca86f4c78f4da04af1c2bf33e664a2826b7af72972ad200d6b9de59019f"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arc-swap",
"arrow",
@@ -4566,6 +4616,7 @@ dependencies = [
"lance-io",
"lance-linalg",
"lance-namespace",
"lance-select",
"lance-table",
"lance-tokenizer",
"log",
@@ -4593,9 +4644,8 @@ dependencies = [
[[package]]
name = "lance-arrow"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "253f4a0a70580c985b91e65e9ca6cad644825a4078de28d8efbacf3ffbd7ecdc"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4613,11 +4663,34 @@ dependencies = [
"rand 0.9.4",
]
[[package]]
name = "lance-arrow-scalar"
version = "58.0.0"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-buffer",
"arrow-cast",
"arrow-data",
"arrow-row",
"arrow-schema",
"half",
]
[[package]]
name = "lance-arrow-stats"
version = "58.0.0"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-schema",
"lance-arrow-scalar",
]
[[package]]
name = "lance-bitpacking"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "80c4d12521b1945041dd515a56aa0854973138e7ac12111c92572e33e4ecb593"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrayref",
"paste",
@@ -4626,9 +4699,8 @@ dependencies = [
[[package]]
name = "lance-core"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "13f84020da5a484e2f07dd1796e09785ed7cd889857ebc4cb77e32ef214ee594"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4663,9 +4735,8 @@ dependencies = [
[[package]]
name = "lance-datafusion"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7460597a66534a75987993d4dac5bc330586d99c5b79ae73367dbcbd4e29e576"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow",
"arrow-array",
@@ -4695,9 +4766,8 @@ dependencies = [
[[package]]
name = "lance-datagen"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "046f5506ed2271cd941a050de7bf535dd3aedc291aadec836a63fa56c5926e3b"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow",
"arrow-array",
@@ -4715,9 +4785,8 @@ dependencies = [
[[package]]
name = "lance-encoding"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7af54edf43dcf9d6a56cc636eb35d457e68373c6448dca3f0891b3325b4a24e6"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4752,9 +4821,8 @@ dependencies = [
[[package]]
name = "lance-file"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0772ae2d6207995dc1eb28aff9507f78e90b3362b58f311da001e9dc25f3d736"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4785,9 +4853,8 @@ dependencies = [
[[package]]
name = "lance-index"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e71fbfb51096a903cb524fe0da716f5f15fbc4a6b6f84cd6dec21abf319c5e84"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arc-swap",
"arrow",
@@ -4817,6 +4884,7 @@ dependencies = [
"jieba-rs",
"jsonb",
"lance-arrow",
"lance-arrow-stats",
"lance-core",
"lance-datafusion",
"lance-datagen",
@@ -4824,6 +4892,7 @@ dependencies = [
"lance-file",
"lance-io",
"lance-linalg",
"lance-select",
"lance-table",
"lance-tokenizer",
"libm",
@@ -4851,9 +4920,8 @@ dependencies = [
[[package]]
name = "lance-io"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bab8c98ef1b870b20541d27f3ca4efdf7c9f5c25214233be07d231ba88900219"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow",
"arrow-arith",
@@ -4872,7 +4940,7 @@ dependencies = [
"chrono",
"deepsize",
"futures",
"http 1.4.0",
"http 1.4.1",
"io-uring",
"lance-arrow",
"lance-core",
@@ -4895,9 +4963,8 @@ dependencies = [
[[package]]
name = "lance-linalg"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6b4c51cad0ac780b02dc4da48528479e7693c03e8d05390510bbc69ca2a9a1f1"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4913,9 +4980,8 @@ dependencies = [
[[package]]
name = "lance-namespace"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "014e8332ca0615506342e0d3af608639864b68396973be14239f09c9f21f1fc2"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow",
"async-trait",
@@ -4927,9 +4993,8 @@ dependencies = [
[[package]]
name = "lance-namespace-impls"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e8d1231906a3cf92dd3dcda7d14a09c4835af6cd2bcd76dfd2481e87f20a282d"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow",
"arrow-ipc",
@@ -4976,11 +5041,25 @@ dependencies = [
"url",
]
[[package]]
name = "lance-select"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-buffer",
"byteorder",
"bytes",
"deepsize",
"itertools 0.13.0",
"lance-core",
"roaring",
]
[[package]]
name = "lance-table"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b16f1355904aea4ebb04ffc70c58c97901e10bde44452b4b021de4a1f329250d"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow",
"arrow-array",
@@ -4999,6 +5078,7 @@ dependencies = [
"lance-core",
"lance-file",
"lance-io",
"lance-select",
"log",
"object_store",
"prost",
@@ -5019,9 +5099,8 @@ dependencies = [
[[package]]
name = "lance-testing"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3094c2aacbd1fa093d809fc54fa911e3498671ba451041341d7caaa18460e6b2"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"arrow-array",
"arrow-schema",
@@ -5032,10 +5111,10 @@ dependencies = [
[[package]]
name = "lance-tokenizer"
version = "7.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b39b7f5ed9d0c0b716bf599b559d888267ed1dfe4c4e29d3648b51d2a28940cf"
version = "7.2.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v7.2.0-beta.1#b9995aba6115e8e4bc43179a45cbd0f9a170f305"
dependencies = [
"icu_segmenter",
"jieba-rs",
"lindera",
"rust-stemmers",
@@ -5082,7 +5161,7 @@ dependencies = [
"futures",
"half",
"hf-hub",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"lance",
"lance-arrow",
@@ -5362,9 +5441,9 @@ dependencies = [
[[package]]
name = "log"
version = "0.4.29"
version = "0.4.30"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897"
checksum = "616ec5685824bcc94416c6d4a7a446eea774a31efd7062c8480ba6fd06d7a6e5"
[[package]]
name = "loom"
@@ -5934,7 +6013,7 @@ dependencies = [
"futures-channel",
"futures-core",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body-util",
"httparse",
"humantime",
@@ -6047,7 +6126,7 @@ dependencies = [
"base64 0.22.1",
"bytes",
"futures",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"jiff",
"log",
@@ -6072,7 +6151,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "048b1b29c503263bdd80a9afe46a68cd02ea9bd361185b1feab4b151078998e9"
dependencies = [
"futures",
"http 1.4.0",
"http 1.4.1",
"mea",
"opendal-core",
]
@@ -6116,7 +6195,7 @@ checksum = "7452bf3ec61cfd81ac9ad9ada17825931e9e371d44a045c6bfab9596c0a2ac3b"
dependencies = [
"base64 0.22.1",
"bytes",
"http 1.4.0",
"http 1.4.1",
"log",
"opendal-core",
"opendal-service-azure-common",
@@ -6136,7 +6215,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8f9884c2d8cf8ba2bb077d79c877dac5863ba3bab9e2c9c1e41a2e0491404772"
dependencies = [
"bytes",
"http 1.4.0",
"http 1.4.1",
"log",
"opendal-core",
"opendal-service-azure-common",
@@ -6154,7 +6233,7 @@ version = "0.56.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ffb0e45d6c8dcf66ce2da20e241bcb80e6e540e109a4ff20f318f6c9b4c54e0c"
dependencies = [
"http 1.4.0",
"http 1.4.1",
"opendal-core",
]
@@ -6166,7 +6245,7 @@ checksum = "70a49477a10163431896d106136117f5670717f9c9e49cf6f710528800c6633a"
dependencies = [
"async-trait",
"bytes",
"http 1.4.0",
"http 1.4.1",
"log",
"opendal-core",
"percent-encoding",
@@ -6187,7 +6266,7 @@ checksum = "7b2ab7a2a8a11dfe257ef4db5c0de798acbcd0d6429c37382dad2154bc06a388"
dependencies = [
"bytes",
"hf-xet",
"http 1.4.0",
"http 1.4.1",
"log",
"opendal-core",
"percent-encoding",
@@ -6203,7 +6282,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "29c8a917829ad06d21b639558532cb0101fe49b040d946d673a73018683fac05"
dependencies = [
"bytes",
"http 1.4.0",
"http 1.4.1",
"log",
"opendal-core",
"quick-xml 0.38.4",
@@ -6222,7 +6301,7 @@ dependencies = [
"base64 0.22.1",
"bytes",
"crc32c",
"http 1.4.0",
"http 1.4.1",
"log",
"md-5 0.10.6",
"opendal-core",
@@ -6966,6 +7045,8 @@ version = "0.1.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0103b1cef7ec0cf76490e969665504990193874ea05c85ff9bab8b911d0a0564"
dependencies = [
"serde_core",
"writeable",
"zerovec",
]
@@ -7629,7 +7710,7 @@ checksum = "57ac2757f3140aa2e213b554148ae0b52733e624fc6723f0cc6bb3d440176c95"
dependencies = [
"anyhow",
"form_urlencoded",
"http 1.4.0",
"http 1.4.1",
"log",
"percent-encoding",
"reqsign-core",
@@ -7647,7 +7728,7 @@ dependencies = [
"anyhow",
"bytes",
"form_urlencoded",
"http 1.4.0",
"http 1.4.1",
"log",
"percent-encoding",
"quick-xml 0.39.4",
@@ -7669,7 +7750,7 @@ dependencies = [
"base64 0.22.1",
"bytes",
"form_urlencoded",
"http 1.4.0",
"http 1.4.1",
"jsonwebtoken",
"log",
"pem",
@@ -7694,7 +7775,7 @@ dependencies = [
"futures",
"hex",
"hmac 0.12.1",
"http 1.4.0",
"http 1.4.1",
"jiff",
"log",
"percent-encoding",
@@ -7721,7 +7802,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "35cc609b49c69e76ecaceb775a03f792d1ed3e7755ab3548d4534fd801e3242e"
dependencies = [
"form_urlencoded",
"http 1.4.0",
"http 1.4.1",
"jsonwebtoken",
"log",
"percent-encoding",
@@ -7746,7 +7827,7 @@ dependencies = [
"futures-core",
"futures-util",
"h2 0.4.14",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"hyper 1.9.0",
@@ -7790,7 +7871,7 @@ dependencies = [
"bytes",
"futures-core",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"hyper 1.9.0",
@@ -7844,7 +7925,7 @@ checksum = "199dda04a536b532d0cc04d7979e39b1c763ea749bf91507017069c00b96056f"
dependencies = [
"anyhow",
"async-trait",
"http 1.4.0",
"http 1.4.1",
"reqwest 0.13.3",
"thiserror 2.0.18",
"tower-service",
@@ -8479,6 +8560,12 @@ dependencies = [
"digest 0.11.3",
]
[[package]]
name = "sha1_smol"
version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "bbfa15b3dddfee50a0fff136974b3e1bde555604ba463834a7eb7deb6417705d"
[[package]]
name = "sha2"
version = "0.10.9"
@@ -9198,6 +9285,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8323304221c2a851516f22236c5722a72eaa19749016521d6dff0824447d96d"
dependencies = [
"displaydoc",
"serde_core",
"zerovec",
]
@@ -9386,7 +9474,7 @@ checksum = "1e9cd434a998747dd2c4276bc96ee2e0c7a2eadf3cae88e52be55a05fa9053f5"
dependencies = [
"bitflags 2.11.1",
"bytes",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"pin-project-lite",
@@ -9406,7 +9494,7 @@ dependencies = [
"bytes",
"futures-core",
"futures-util",
"http 1.4.0",
"http 1.4.1",
"http-body 1.0.1",
"http-body-util",
"pin-project-lite",
@@ -9695,13 +9783,14 @@ checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821"
[[package]]
name = "uuid"
version = "1.23.1"
version = "1.23.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ddd74a9687298c6858e9b88ec8935ec45d22e8fd5e6394fa1bd4e99a87789c76"
checksum = "d258b83ceec21034727ecee8c382cfa6c3e133699b0742c64571814fb420c9f7"
dependencies = [
"getrandom 0.4.2",
"js-sys",
"serde_core",
"sha1_smol",
"wasm-bindgen",
]
@@ -10426,7 +10515,7 @@ dependencies = [
"clap",
"crc32fast",
"futures",
"http 1.4.0",
"http 1.4.1",
"hyper 1.9.0",
"lazy_static",
"more-asserts",
@@ -10500,7 +10589,7 @@ dependencies = [
"chrono",
"clap",
"gearhash",
"http 1.4.0",
"http 1.4.1",
"itertools 0.14.0",
"lazy_static",
"more-asserts",
@@ -10665,6 +10754,7 @@ dependencies = [
"displaydoc",
"yoke",
"zerofrom",
"zerovec",
]
[[package]]
@@ -10673,6 +10763,7 @@ version = "0.11.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "90f911cbc359ab6af17377d242225f4d75119aec87ea711a880987b18cd7b239"
dependencies = [
"serde",
"yoke",
"zerofrom",
"zerovec-derive",

View File

@@ -13,20 +13,20 @@ categories = ["database-implementations"]
rust-version = "1.91.0"
[workspace.dependencies]
lance = { "version" = "=7.0.0", default-features = false }
lance-core = "=7.0.0"
lance-datagen = "=7.0.0"
lance-file = "=7.0.0"
lance-io = { "version" = "=7.0.0", default-features = false }
lance-index = "=7.0.0"
lance-linalg = "=7.0.0"
lance-namespace = "=7.0.0"
lance-namespace-impls = { "version" = "=7.0.0", default-features = false }
lance-table = "=7.0.0"
lance-testing = "=7.0.0"
lance-datafusion = "=7.0.0"
lance-encoding = "=7.0.0"
lance-arrow = "=7.0.0"
lance = { "version" = "=7.2.0-beta.1", default-features = false, "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=7.2.0-beta.1", default-features = false, "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=7.2.0-beta.1", default-features = false, "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=7.2.0-beta.1", "tag" = "v7.2.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "58.0.0", optional = false }

View File

@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-core</artifactId>
<version>0.30.0</version>
<version>0.30.0-beta.1</version>
</dependency>
```

View File

@@ -76,6 +76,57 @@ the query optimizer chooses a suboptimal path.
***
### useLsmWrite()
```ts
useLsmWrite(useLsmWrite): MergeInsertBuilder
```
Controls whether the merge uses the MemWAL LSM write path.
By default (unset), a `mergeInsert` on a table with an LSM write spec is
routed through Lance's MemWAL shard writer, and a table without one uses
the standard path. Pass `false` to force the standard path even when a
spec is set. Pass `true` to require a spec — `mergeInsert` rejects if none
is installed.
#### Parameters
* **useLsmWrite**: `boolean`
Whether to use the LSM write path.
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
***
### validateSingleShard()
```ts
validateSingleShard(validateSingleShard): MergeInsertBuilder
```
Controls how an LSM merge checks that its input targets a single shard.
When a table has an LSM write spec, every row in a `mergeInsert` call must
route to the same shard. When `true` (the default), every row is inspected
to verify this. When `false`, only the first row is inspected and the
shard it routes to is used for the whole input — a faster path for callers
that have already pre-sharded their input. Has no effect on tables without
an LSM write spec.
#### Parameters
* **validateSingleShard**: `boolean`
Whether to check every row routes to one shard. Defaults to `true`.
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
***
### whenMatchedUpdateAll()
```ts

View File

@@ -187,6 +187,25 @@ Any attempt to use the table after it is closed will result in an error.
***
### closeLsmWriters()
```ts
abstract closeLsmWriters(): Promise<void>
```
Drain and close any cached MemWAL shard writers held for this table.
When an [LsmWriteSpec](../interfaces/LsmWriteSpec.md) is installed, `mergeInsert` opens MemWAL
shard writers and caches them for reuse across calls. This closes them,
flushing pending data; writers reopen lazily on the next `mergeInsert`.
It is a no-op when no writers are cached.
#### Returns
`Promise`&lt;`void`&gt;
***
### countRows()
```ts

View File

@@ -11,7 +11,10 @@ Specification selecting Lance's MemWAL LSM-style write path for
`specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`,
`column` and `numBuckets` are required; for `"identity"`, `column` is
required.
required and must be a deterministic function of the unenforced primary
key (every row with a given primary key must always produce the same
`column` value, or upserts of that key can land in different shards and a
stale version can win).
## Properties

View File

@@ -32,6 +32,14 @@ numInsertedRows: number;
***
### numRows
```ts
numRows: number;
```
***
### numUpdatedRows
```ts

View File

@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.30.0-final.0</version>
<version>0.30.0-beta.1</version>
<relativePath>../pom.xml</relativePath>
</parent>

View File

@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.30.0-final.0</version>
<version>0.30.0-beta.1</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<arrow.version>15.0.0</arrow.version>
<lance-core.version>7.0.0</lance-core.version>
<lance-core.version>7.2.0-beta.1</lance-core.version>
<spotless.skip>false</spotless.skip>
<spotless.version>2.30.0</spotless.version>
<spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>

View File

@@ -1,7 +1,7 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.30.0"
version = "0.30.0-beta.1"
publish = false
license.workspace = true
description.workspace = true

View File

@@ -2625,3 +2625,97 @@ describe("setLsmWriteSpec / unsetLsmWriteSpec", () => {
).rejects.toThrow();
});
});
describe("LSM merge insert", () => {
let tmpDir: tmp.DirResult;
beforeEach(() => {
tmpDir = tmp.dirSync({ unsafeCleanup: true });
});
afterEach(() => tmpDir.removeCallback());
async function bucketTable(conn: Connection): Promise<Table> {
// The primary key column must be non-nullable.
const table = await conn.createEmptyTable(
"t",
new arrow.Schema([
new arrow.Field("id", new arrow.Utf8(), false),
new arrow.Field("value", new arrow.Float64(), true),
]),
);
await table.add([
{ id: "a", value: 1 },
{ id: "b", value: 2 },
]);
await table.setUnenforcedPrimaryKey("id");
// numBuckets = 1: every row routes to the single bucket.
await table.setLsmWriteSpec({
specType: "bucket",
column: "id",
numBuckets: 1,
});
return table;
}
it("routes merge_insert through the shard writer", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
const res = await table
.mergeInsert("id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute([
{ id: "c", value: 3 },
{ id: "d", value: 4 },
]);
// LSM path: rows go to the MemWAL, so only numRows is populated.
expect(res.numRows).toBe(2);
expect(res.version).toBe(0);
expect(res.numInsertedRows).toBe(0);
await table.closeLsmWriters();
});
it("falls back to the standard path with useLsmWrite(false)", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
const res = await table
.mergeInsert("id")
.whenNotMatchedInsertAll()
.useLsmWrite(false)
.execute([
{ id: "b", value: 9 },
{ id: "e", value: 5 },
]);
// Standard path commits: id="e" inserted ("b" already exists).
expect(res.numInsertedRows).toBe(1);
expect(await table.countRows()).toBe(3);
});
it("supports validateSingleShard(false)", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
const res = await table
.mergeInsert("id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.validateSingleShard(false)
.execute([{ id: "f", value: 6 }]);
expect(res.numRows).toBe(1);
});
it("rejects a non-upsert merge under an LSM spec", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
await expect(
table
.mergeInsert("id")
.whenNotMatchedInsertAll()
.execute([{ id: "g", value: 7 }]),
).rejects.toThrow();
});
});

View File

@@ -87,6 +87,41 @@ export class MergeInsertBuilder {
this.#schema,
);
}
/**
* Controls whether the merge uses the MemWAL LSM write path.
*
* By default (unset), a `mergeInsert` on a table with an LSM write spec is
* routed through Lance's MemWAL shard writer, and a table without one uses
* the standard path. Pass `false` to force the standard path even when a
* spec is set. Pass `true` to require a spec — `mergeInsert` rejects if none
* is installed.
*
* @param useLsmWrite - Whether to use the LSM write path.
*/
useLsmWrite(useLsmWrite: boolean): MergeInsertBuilder {
return new MergeInsertBuilder(
this.#native.useLsmWrite(useLsmWrite),
this.#schema,
);
}
/**
* Controls how an LSM merge checks that its input targets a single shard.
*
* When a table has an LSM write spec, every row in a `mergeInsert` call must
* route to the same shard. When `true` (the default), every row is inspected
* to verify this. When `false`, only the first row is inspected and the
* shard it routes to is used for the whole input — a faster path for callers
* that have already pre-sharded their input. Has no effect on tables without
* an LSM write spec.
*
* @param validateSingleShard - Whether to check every row routes to one shard. Defaults to `true`.
*/
validateSingleShard(validateSingleShard: boolean): MergeInsertBuilder {
return new MergeInsertBuilder(
this.#native.validateSingleShard(validateSingleShard),
this.#schema,
);
}
/**
* Executes the merge insert operation
*

View File

@@ -161,7 +161,10 @@ export interface Version {
*
* `specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`,
* `column` and `numBuckets` are required; for `"identity"`, `column` is
* required.
* required and must be a deterministic function of the unenforced primary
* key (every row with a given primary key must always produce the same
* `column` value, or upserts of that key can land in different shards and a
* stale version can win).
*/
export interface LsmWriteSpec {
/** One of `"bucket"`, `"identity"`, or `"unsharded"`. */
@@ -567,6 +570,16 @@ export abstract class Table {
* @returns {Promise<void>}
*/
abstract unsetLsmWriteSpec(): Promise<void>;
/**
* Drain and close any cached MemWAL shard writers held for this table.
*
* When an {@link LsmWriteSpec} is installed, `mergeInsert` opens MemWAL
* shard writers and caches them for reuse across calls. This closes them,
* flushing pending data; writers reopen lazily on the next `mergeInsert`.
* It is a no-op when no writers are cached.
* @returns {Promise<void>}
*/
abstract closeLsmWriters(): Promise<void>;
/** Retrieve the version of the table */
abstract version(): Promise<number>;
@@ -1041,6 +1054,10 @@ export class LocalTable extends Table {
return await this.inner.unsetLsmWriteSpec();
}
async closeLsmWriters(): Promise<void> {
return await this.inner.closeLsmWriters();
}
async version(): Promise<number> {
return await this.inner.version();
}

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": [
"win32"
],

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.30.0",
"version": "0.30.0-beta.1",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",

View File

@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.30.0",
"version": "0.30.0-beta.1",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",

View File

@@ -50,6 +50,20 @@ impl NativeMergeInsertBuilder {
this
}
#[napi]
pub fn use_lsm_write(&self, use_lsm_write: bool) -> Self {
let mut this = self.clone();
this.inner.use_lsm_write(use_lsm_write);
this
}
#[napi]
pub fn validate_single_shard(&self, validate_single_shard: bool) -> Self {
let mut this = self.clone();
this.inner.validate_single_shard(validate_single_shard);
this
}
#[napi(catch_unwind)]
pub async fn execute(&self, buf: Buffer) -> napi::Result<MergeResult> {
let data = ipc_file_to_batches(buf.to_vec())

View File

@@ -391,6 +391,11 @@ impl Table {
.default_error()
}
#[napi(catch_unwind)]
pub async fn close_lsm_writers(&self) -> napi::Result<()> {
self.inner_ref()?.close_lsm_writers().await.default_error()
}
#[napi(catch_unwind)]
pub async fn version(&self) -> napi::Result<i64> {
self.inner_ref()?
@@ -940,6 +945,7 @@ pub struct MergeResult {
pub num_updated_rows: i64,
pub num_deleted_rows: i64,
pub num_attempts: i64,
pub num_rows: i64,
}
impl From<lancedb::table::MergeResult> for MergeResult {
@@ -950,6 +956,7 @@ impl From<lancedb::table::MergeResult> for MergeResult {
num_updated_rows: value.num_updated_rows as i64,
num_deleted_rows: value.num_deleted_rows as i64,
num_attempts: value.num_attempts as i64,
num_rows: value.num_rows as i64,
}
}
}

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.33.0"
current_version = "0.33.0-beta.1"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb-python"
version = "0.33.0"
version = "0.33.0-beta.1"
publish = false
edition.workspace = true
description = "Python bindings for LanceDB"

View File

@@ -315,6 +315,15 @@ def deserialize_conn(
manifest_enabled=parsed.get("manifest_enabled", False),
namespace_client_properties=parsed.get("namespace_client_properties"),
)
elif connection_type == "remote":
return RemoteDBConnection(
parsed["db_url"],
parsed["api_key"],
parsed.get("region", "us-east-1"),
host_override=parsed.get("host_override"),
client_config=parsed.get("client_config"),
storage_options=storage_options,
)
else:
raise ValueError(f"Unknown connection_type: {connection_type}")

View File

@@ -220,6 +220,7 @@ class Table:
async def set_unenforced_primary_key(self, columns: List[str]) -> None: ...
async def set_lsm_write_spec(self, spec: LsmWriteSpec) -> None: ...
async def unset_lsm_write_spec(self) -> None: ...
async def close_lsm_writers(self) -> None: ...
@property
def tags(self) -> Tags: ...
def query(self) -> Query: ...
@@ -420,6 +421,7 @@ class MergeResult:
num_inserted_rows: int
num_deleted_rows: int
num_attempts: int
num_rows: int
class LsmWriteSpec:
"""Specification selecting Lance's MemWAL LSM-style write path for

View File

@@ -281,6 +281,9 @@ class HnswPq:
m: int = 20
ef_construction: int = 300
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -386,6 +389,9 @@ class HnswSq:
m: int = 20
ef_construction: int = 300
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -579,6 +585,9 @@ class IvfFlat:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -609,6 +618,9 @@ class IvfSq:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -739,6 +751,9 @@ class IvfPq:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -792,6 +807,9 @@ class IvfRq:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
__all__ = [

View File

@@ -34,6 +34,8 @@ class LanceMergeInsertBuilder(object):
self._when_not_matched_by_source_condition = None
self._timeout = None
self._use_index = True
self._use_lsm_write = None
self._validate_single_shard = None
def when_matched_update_all(
self, *, where: Optional[str] = None
@@ -96,6 +98,46 @@ class LanceMergeInsertBuilder(object):
self._use_index = use_index
return self
def use_lsm_write(self, use_lsm_write: bool) -> LanceMergeInsertBuilder:
"""
Controls whether the merge uses the MemWAL LSM write path.
By default (unset), a `merge_insert` on a table with an LSM write spec
is routed through Lance's MemWAL shard writer, and a table without one
uses the standard path. Pass `False` to force the standard path even
when a spec is set. Pass `True` to require a spec — `merge_insert`
raises an error if none is installed.
Parameters
----------
use_lsm_write: bool
Whether to use the LSM write path.
"""
self._use_lsm_write = use_lsm_write
return self
def validate_single_shard(
self, validate_single_shard: bool
) -> LanceMergeInsertBuilder:
"""
Controls how an LSM merge checks that its input targets a single shard.
When a table has an LSM write spec, every row in a `merge_insert` call
must route to the same shard. When `True` (the default), every row is
inspected to verify this. When `False`, only the first row is inspected
and the shard it routes to is used for the whole input — a faster path
for callers that have already pre-sharded their input.
Has no effect on tables without an LSM write spec.
Parameters
----------
validate_single_shard: bool
Whether to check every row routes to one shard. Defaults to `True`.
"""
self._validate_single_shard = validate_single_shard
return self
def execute(
self,
new_data: DATA,

View File

@@ -3,12 +3,13 @@
import copy
import json
import os
from deprecation import deprecated
import pyarrow as pa
from ._lancedb import async_permutation_builder, PermutationReader
from .table import LanceTable
from .table import LanceTable, Table
from .background_loop import LOOP
from .util import batch_to_tensor, batch_to_tensor_rows
from typing import Any, Callable, Iterator, Literal, Optional, TYPE_CHECKING, Union
@@ -354,6 +355,49 @@ class Transforms:
DEFAULT_BATCH_SIZE = 100
def _table_to_pickle_state(table: Table) -> dict[str, Any]:
from .remote.table import RemoteTable
if isinstance(table, RemoteTable):
return {
"kind": "remote",
"table": table,
}
if not isinstance(table, LanceTable):
raise ValueError(f"Cannot pickle table of type {type(table)!r}")
base_uri = table._conn.uri
if base_uri.startswith("memory://"):
return {
"kind": "memory",
"name": table.name,
"data": table.to_arrow(),
}
return {
"kind": "local",
"name": table.name,
"uri": base_uri,
"namespace": table._namespace_path,
"storage_options": table._conn.storage_options,
}
def _table_from_pickle_state(state: dict[str, Any]) -> Table:
from . import connect
kind = state["kind"]
if kind == "remote":
return state["table"]
if kind == "memory":
return connect("memory://").create_table(state["name"], state["data"])
if kind == "local":
db = connect(state["uri"], storage_options=state["storage_options"])
return db.open_table(state["name"], namespace_path=state["namespace"] or None)
raise ValueError(f"Unknown table pickle state kind: {kind}")
class Permutation:
"""
A Permutation is a view of a dataset that can be used as input to model training
@@ -369,15 +413,15 @@ class Permutation:
def __init__(
self,
base_table: LanceTable,
permutation_table: Optional[LanceTable],
base_table: Table,
permutation_table: Optional[Table],
split: int,
selection: dict[str, str],
batch_size: int,
transform_fn: Callable[pa.RecordBatch, Any],
offset: Optional[int] = None,
limit: Optional[int] = None,
connection_factory: Optional[Callable[[str], LanceTable]] = None,
connection_factory: Optional[Callable[[str], Table]] = None,
_reader: Optional[PermutationReader] = None,
):
"""
@@ -397,6 +441,7 @@ class Permutation:
if _reader is None:
_reader = LOOP.run(self._build_reader())
self.reader: PermutationReader = _reader
self._pid = os.getpid()
async def _build_reader(self) -> PermutationReader:
reader = await PermutationReader.from_tables(
@@ -428,29 +473,25 @@ class Permutation:
return new
def with_connection_factory(
self, connection_factory: Callable[[str], LanceTable]
self, connection_factory: Callable[[str], Table]
) -> "Permutation":
"""
Creates a new permutation that will use ``connection_factory`` to reopen
the base table when this permutation is unpickled in a worker process.
The factory is a callable that takes a single argument the base table
name and returns a [LanceTable]. It must be picklable; the worker
The factory is a callable that takes a single argument, the base table
name, and returns a LanceDB table. It must be picklable; the worker
will pickle it via standard ``pickle`` and call it to recover the base
table. Picklable callables in practice means top-level (module-level)
functions, ``functools.partial`` of such functions, or instances of
picklable classes implementing ``__call__``. Lambdas and closures over
local variables don't pickle with the default protocol.
Setting a factory is necessary when the URI alone is not enough to
re-open the connection — most importantly for LanceDB Cloud (``db://``)
connections, where ``api_key`` and ``region`` aren't recoverable from
the connection object after construction.
For local file or cloud-storage paths the factory is optional: if not
set, ``__getstate__`` falls back to capturing
``(uri, storage_options, namespace_path)`` and re-opening via
``lancedb.connect(uri, storage_options=...)``.
A factory is optional for normal local and remote LanceDB connections:
if not set, ``__getstate__`` captures the table's own picklable reopen
state. Use a factory when that default state is not enough, for example
when credentials should be loaded from the worker environment instead
of being embedded in the pickle.
Examples
--------
@@ -508,7 +549,7 @@ class Permutation:
return new
@classmethod
def identity(cls, table: LanceTable) -> "Permutation":
def identity(cls, table: Table) -> "Permutation":
"""
Creates an identity permutation for the given table.
"""
@@ -517,8 +558,8 @@ class Permutation:
@classmethod
def from_tables(
cls,
base_table: LanceTable,
permutation_table: Optional[LanceTable] = None,
base_table: Table,
permutation_table: Optional[Table] = None,
split: Optional[Union[str, int]] = None,
) -> "Permutation":
"""
@@ -594,11 +635,10 @@ class Permutation:
The base table is captured either via a user-supplied
``connection_factory`` (see [with_connection_factory]) or, as a
fallback, by introspecting ``(uri, storage_options, namespace_path)``
on the connection. The permutation table — always an in-memory
LanceDB table — is captured as a pyarrow Table (which pickles via
Arrow IPC natively). The reader is dropped from the wire format;
``__setstate__`` rebuilds it from the restored tables.
fallback, by the table's own picklable reopen state. The permutation
table is captured as a pyarrow Table (which pickles via Arrow IPC
natively). The reader is dropped from the wire format and rebuilt
lazily on first use.
"""
permutation_data: Optional[pa.Table] = None
if self.permutation_table is not None:
@@ -622,39 +662,9 @@ class Permutation:
# namespace from the existing connection.
return common
# URI-introspection fallback: only viable for native (OSS) connections
# where (uri, storage_options) is enough to reopen. Remote / cloud
# connections don't expose recoverable api_key / region — those users
# must call with_connection_factory().
try:
base_uri = self.base_table._conn.uri
storage_options = self.base_table._conn.storage_options
except AttributeError as e:
raise ValueError(
"Cannot pickle this Permutation: the base table's connection "
"does not expose a uri/storage_options, which usually means it "
"is a remote (LanceDB Cloud) connection. Call "
"Permutation.with_connection_factory(...) first to provide a "
"picklable callable that re-opens the base table from a worker "
"process."
) from e
if base_uri.startswith("memory://"):
# In-memory base tables don't exist in any worker process by
# default, so dump the entire base table into the pickle. This
# can be expensive for large datasets — users with large
# in-memory base tables should either persist them or set a
# connection_factory.
return {
**common,
"base_table_data": self.base_table.to_arrow(),
}
return {
**common,
"base_table_uri": base_uri,
"base_table_namespace": self.base_table._namespace_path,
"base_table_storage_options": storage_options,
"base_table_state": _table_to_pickle_state(self.base_table),
}
def __setstate__(self, state: dict[str, Any]) -> None:
@@ -663,6 +673,8 @@ class Permutation:
connection_factory = state["connection_factory"]
if connection_factory is not None:
base_table = connection_factory(state["base_table_name"])
elif "base_table_state" in state:
base_table = _table_from_pickle_state(state["base_table_state"])
elif "base_table_data" in state:
# In-memory base table inlined into the pickle; rebuild the same
# way we rebuild the in-memory permutation table.
@@ -680,7 +692,7 @@ class Permutation:
namespace_path=state["base_table_namespace"] or None,
)
permutation_table: Optional[LanceTable] = None
permutation_table: Optional[Table] = None
if state["permutation_data"] is not None:
mem_db = connect("memory://")
permutation_table = mem_db.create_table(
@@ -696,10 +708,28 @@ class Permutation:
self.offset = state["offset"]
self.limit = state["limit"]
self.connection_factory = connection_factory
self.reader = None
self._pid = None
def _ensure_open(self) -> None:
pid = os.getpid()
if self.reader is not None and getattr(self, "_pid", None) == pid:
return
# The reader owns Rust-side table handles. Rebuild it after unpickle or
# fork even though the Python table wrappers reopen themselves.
if hasattr(self.base_table, "_ensure_open"):
self.base_table._ensure_open()
if self.permutation_table is not None and hasattr(
self.permutation_table, "_ensure_open"
):
self.permutation_table._ensure_open()
self.reader = LOOP.run(self._build_reader())
self._pid = pid
@property
def schema(self) -> pa.Schema:
self._ensure_open()
async def do_output_schema():
return await self.reader.output_schema(self.selection)
@@ -717,6 +747,7 @@ class Permutation:
"""
The number of rows in the permutation
"""
self._ensure_open()
return self.reader.count_rows()
@property
@@ -875,6 +906,7 @@ class Permutation:
If skip_last_batch is True, the last batch will be skipped if it is not a
multiple of batch_size.
"""
self._ensure_open()
async def get_iter():
return await self.reader.read(self.selection, batch_size=batch_size)
@@ -976,6 +1008,7 @@ class Permutation:
so `with_format` and `with_transform` affect this method in the same way
they affect iteration.
"""
self._ensure_open()
async def do_take_offsets():
return await self.reader.take_offsets(offsets, selection=self.selection)
@@ -1011,9 +1044,11 @@ class Permutation:
"""
Skip the first `skip` rows of the permutation
"""
self._ensure_open()
new = copy.copy(self)
new.offset = skip
new.reader = LOOP.run(new._build_reader())
new._pid = os.getpid()
return new
@deprecated(details="Use with_take instead")
@@ -1032,9 +1067,11 @@ class Permutation:
"""
Limit the permutation to `limit` rows (following any `skip`)
"""
self._ensure_open()
new = copy.copy(self)
new.limit = limit
new.reader = LOOP.run(new._build_reader())
new._pid = os.getpid()
return new
@deprecated(details="Use with_repeat instead")

View File

@@ -3,6 +3,7 @@
from datetime import timedelta
import json
import logging
from concurrent.futures import ThreadPoolExecutor
import sys
@@ -17,7 +18,7 @@ else:
# Remove this import to fix circular dependency
# from lancedb import connect_async
from lancedb.remote import ClientConfig
from lancedb.remote import ClientConfig, RetryConfig, TimeoutConfig, TlsConfig
import pyarrow as pa
from ..common import DATA
@@ -36,6 +37,64 @@ from ..table import Table
from ..util import validate_table_name
def _duration_seconds(value: Optional[timedelta]) -> Optional[float]:
return value.total_seconds() if value is not None else None
def _timeout_config_to_dict(
config: Optional[TimeoutConfig],
) -> Optional[dict[str, Any]]:
if config is None:
return None
return {
"timeout": _duration_seconds(config.timeout),
"connect_timeout": _duration_seconds(config.connect_timeout),
"read_timeout": _duration_seconds(config.read_timeout),
"pool_idle_timeout": _duration_seconds(config.pool_idle_timeout),
}
def _retry_config_to_dict(config: RetryConfig) -> dict[str, Any]:
return {
"retries": config.retries,
"connect_retries": config.connect_retries,
"read_retries": config.read_retries,
"backoff_factor": config.backoff_factor,
"backoff_jitter": config.backoff_jitter,
"statuses": config.statuses,
}
def _tls_config_to_dict(config: Optional[TlsConfig]) -> Optional[dict[str, Any]]:
if config is None:
return None
return {
"cert_file": config.cert_file,
"key_file": config.key_file,
"ssl_ca_cert": config.ssl_ca_cert,
"assert_hostname": config.assert_hostname,
}
def _client_config_to_dict(config: ClientConfig) -> dict[str, Any]:
if config.header_provider is not None:
raise ValueError(
"Cannot serialize a remote connection with a header_provider. "
"Use static api_key/extra_headers or provide a worker-side "
"connection factory instead."
)
return {
"user_agent": config.user_agent,
"retry_config": _retry_config_to_dict(config.retry_config),
"timeout_config": _timeout_config_to_dict(config.timeout_config),
"extra_headers": config.extra_headers,
"id_delimiter": config.id_delimiter,
"tls_config": _tls_config_to_dict(config.tls_config),
"header_provider": None,
"user_id": config.user_id,
}
class RemoteDBConnection(DBConnection):
"""A connection to a remote LanceDB database."""
@@ -89,6 +148,11 @@ class RemoteDBConnection(DBConnection):
parsed = urlparse(db_url)
if parsed.scheme != "db":
raise ValueError(f"Invalid scheme: {parsed.scheme}, only accepts db://")
self.db_url = db_url
self.api_key = api_key
self.region = region
self.host_override = host_override
self.storage_options = storage_options
self.db_name = parsed.netloc
self.client_config = client_config
@@ -111,6 +175,20 @@ class RemoteDBConnection(DBConnection):
def __repr__(self) -> str:
return f"RemoteConnect(name={self.db_name})"
@override
def serialize(self) -> str:
return json.dumps(
{
"connection_type": "remote",
"db_url": self.db_url,
"api_key": self.api_key,
"region": self.region,
"host_override": self.host_override,
"client_config": _client_config_to_dict(self.client_config),
"storage_options": self.storage_options,
}
)
@override
def list_namespaces(
self,
@@ -331,7 +409,12 @@ class RemoteDBConnection(DBConnection):
)
table = LOOP.run(self._conn.open_table(name, namespace_path=namespace_path))
return RemoteTable(table, self.db_name)
return RemoteTable(
table,
self.db_name,
connection_state=self.serialize,
namespace_path=namespace_path,
)
def clone_table(
self,
@@ -380,7 +463,12 @@ class RemoteDBConnection(DBConnection):
is_shallow=is_shallow,
)
)
return RemoteTable(table, self.db_name)
return RemoteTable(
table,
self.db_name,
connection_state=self.serialize,
namespace_path=target_namespace_path,
)
@override
def create_table(
@@ -525,7 +613,12 @@ class RemoteDBConnection(DBConnection):
fill_value=fill_value,
)
)
return RemoteTable(table, self.db_name)
return RemoteTable(
table,
self.db_name,
connection_state=self.serialize,
namespace_path=namespace_path,
)
@override
def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):

View File

@@ -2,11 +2,25 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
from datetime import timedelta
import deprecation
import logging
from functools import cached_property
from typing import Any, Callable, Dict, Iterable, List, Optional, Union, Literal
import os
from typing import (
Any,
Callable,
Dict,
Iterable,
List,
Optional,
Union,
Literal,
overload,
)
import warnings
from lancedb import __version__
from lancedb._lancedb import (
AddColumnsResult,
AddResult,
@@ -32,6 +46,7 @@ from lancedb.index import (
LabelList,
)
from lancedb.remote.db import LOOP
from lancedb.table import IndexConfigType, KNOWN_METRICS
import pyarrow as pa
from lancedb.common import DATA, VEC, VECTOR_COLUMN_NAME
@@ -49,14 +64,80 @@ class RemoteTable(Table):
self,
table: AsyncTable,
db_name: str,
*,
connection_state: Optional[Union[str, Callable[[], str]]] = None,
namespace_path: Optional[List[str]] = None,
):
self._table = table
self._table_handle = table
self._name = table.name
self.db_name = db_name
self._connection_state = connection_state
self._namespace_path = list(namespace_path or [])
self._checkout_version: Optional[int] = None
self._pid = os.getpid()
def _serialized_connection_state(self) -> str:
if self._connection_state is None:
raise RuntimeError(
"Cannot reopen this remote table because it does not carry "
"serialized connection state"
)
if callable(self._connection_state):
self._connection_state = self._connection_state()
return self._connection_state
@property
def _table(self) -> AsyncTable:
self._ensure_open()
assert self._table_handle is not None
return self._table_handle
@_table.setter
def _table(self, table: AsyncTable) -> None:
self._table_handle = table
self._name = table.name
self._pid = os.getpid()
def _ensure_open(self) -> None:
pid = os.getpid()
if self._table_handle is not None and self._pid == pid:
return
# Pickle clears the handle; fork inherits a handle created in the
# parent process. In both cases reopen before touching the Rust client.
from lancedb import deserialize_conn
db = deserialize_conn(self._serialized_connection_state(), for_worker=True)
table = db.open_table(self._name, namespace_path=self._namespace_path)
if self._checkout_version is not None:
table.checkout(self._checkout_version)
self._table_handle = table._table
self.db_name = table.db_name
self._pid = pid
def __getstate__(self) -> dict:
return {
"connection_state": self._serialized_connection_state(),
"db_name": self.db_name,
"name": self.name,
"namespace_path": self._namespace_path,
"checkout_version": self._checkout_version,
}
def __setstate__(self, state: dict) -> None:
self._table_handle = None
self._name = state["name"]
self.db_name = state["db_name"]
self._connection_state = state["connection_state"]
self._namespace_path = state["namespace_path"]
self._checkout_version = state["checkout_version"]
self._pid = None
@property
def name(self) -> str:
"""The name of the table"""
return self._table.name
return self._name
def __repr__(self) -> str:
return f"RemoteTable({self.db_name}.{self.name})"
@@ -106,13 +187,19 @@ class RemoteTable(Table):
raise NotImplementedError("to_pandas() is not yet supported on LanceDB cloud.")
def checkout(self, version: Union[int, str]):
return LOOP.run(self._table.checkout(version))
result = LOOP.run(self._table.checkout(version))
self._checkout_version = self.version
return result
def checkout_latest(self):
return LOOP.run(self._table.checkout_latest())
result = LOOP.run(self._table.checkout_latest())
self._checkout_version = None
return result
def restore(self, version: Optional[Union[int, str]] = None):
return LOOP.run(self._table.restore(version))
result = LOOP.run(self._table.restore(version))
self._checkout_version = None
return result
def list_indices(self) -> Iterable[IndexConfig]:
"""List all the indices on the table"""
@@ -122,6 +209,11 @@ class RemoteTable(Table):
"""List all the stats of a specified index"""
return LOOP.run(self._table.index_stats(index_uuid))
@deprecation.deprecated(
deprecated_in="0.25.0",
current_version=__version__,
details="Use create_index() with config=BTree()/Bitmap()/LabelList() instead.",
)
def create_scalar_index(
self,
column: str,
@@ -131,7 +223,12 @@ class RemoteTable(Table):
wait_timeout: Optional[timedelta] = None,
name: Optional[str] = None,
):
"""Creates a scalar index
"""Creates a scalar index.
.. deprecated:: 0.25.0
Use :meth:`create_index` with a BTree, Bitmap, or LabelList config instead.
Example: ``table.create_index("column", config=BTree())``
Parameters
----------
column : str
@@ -162,6 +259,11 @@ class RemoteTable(Table):
)
)
@deprecation.deprecated(
deprecated_in="0.25.0",
current_version=__version__,
details="Use create_index() with config=FTS() instead.",
)
def create_fts_index(
self,
column: str,
@@ -182,6 +284,12 @@ class RemoteTable(Table):
prefix_only: bool = False,
name: Optional[str] = None,
):
"""Create a full-text search index on a column.
.. deprecated:: 0.25.0
Use :meth:`create_index` with an FTS config instead.
Example: ``table.create_index("text_column", config=FTS())``
"""
config = FTS(
with_position=with_position,
base_tokenizer=base_tokenizer,
@@ -205,9 +313,43 @@ class RemoteTable(Table):
)
)
# New unified API overload
@overload
def create_index(
self,
metric="l2",
column: str,
/,
*,
config: IndexConfigType,
wait_timeout: Optional[timedelta] = ...,
name: Optional[str] = ...,
train: bool = ...,
) -> None: ...
# Legacy API overload (deprecated)
@overload
def create_index(
self,
metric: Literal["l2", "cosine", "dot", "hamming"] = ...,
vector_column_name: str = ...,
index_cache_size: Optional[int] = ...,
num_partitions: Optional[int] = ...,
num_sub_vectors: Optional[int] = ...,
replace: Optional[bool] = ...,
accelerator: Optional[str] = ...,
index_type: Literal[
"VECTOR", "IVF_FLAT", "IVF_SQ", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
] = ...,
wait_timeout: Optional[timedelta] = ...,
*,
num_bits: int = ...,
name: Optional[str] = ...,
train: bool = ...,
) -> None: ...
def create_index(
self,
metric: str = "l2",
vector_column_name: str = VECTOR_COLUMN_NAME,
index_cache_size: Optional[int] = None,
num_partitions: Optional[int] = None,
@@ -218,89 +360,113 @@ class RemoteTable(Table):
wait_timeout: Optional[timedelta] = None,
*,
num_bits: int = 8,
config: Optional[IndexConfigType] = None,
name: Optional[str] = None,
train: bool = True,
):
"""Create an index on the table.
"""Create an index on a column.
Parameters
----------
metric : str
The metric to use for the index. Default is "l2".
vector_column_name : str
The name of the vector column. Default is "vector".
This method supports both the new unified API and the legacy API
for backwards compatibility. The new API takes the column name as the
first positional argument and an index configuration object via
``config``; the legacy API takes the distance metric as the first
argument plus separate ``vector_column_name`` / ``num_partitions`` /
etc. parameters, and emits a ``DeprecationWarning``.
Examples
--------
>>> import lancedb
>>> import uuid
>>> from lancedb.schema import vector
>>> db = lancedb.connect("db://...", api_key="...", # doctest: +SKIP
... region="...") # doctest: +SKIP
>>> table_name = uuid.uuid4().hex
>>> schema = pa.schema(
... [
... pa.field("id", pa.uint32(), False),
... pa.field("vector", vector(128), False),
... pa.field("s", pa.string(), False),
... ]
New API (recommended):
>>> table.create_index( # doctest: +SKIP
... "vector", config=IvfPq(distance_type="l2")
... )
>>> table = db.create_table( # doctest: +SKIP
... table_name, # doctest: +SKIP
... schema=schema, # doctest: +SKIP
>>> table.create_index("category", config=BTree()) # doctest: +SKIP
>>> table.create_index("content", config=FTS()) # doctest: +SKIP
Legacy API (deprecated):
>>> table.create_index( # doctest: +SKIP
... "l2", vector_column_name="vector"
... )
>>> table.create_index("l2", "vector") # doctest: +SKIP
"""
# Detect whether this is a legacy API call
is_legacy = self._is_legacy_create_index_call(
metric,
config,
num_partitions,
num_sub_vectors,
vector_column_name,
accelerator,
index_cache_size,
replace,
)
if accelerator is not None:
logging.warning(
"GPU accelerator is not yet supported on LanceDB cloud."
"If you have 100M+ vectors to index,"
"please contact us at contact@lancedb.com"
)
if replace is not None:
logging.warning(
"replace is not supported on LanceDB cloud."
"Existing indexes will always be replaced."
if is_legacy:
warnings.warn(
"The create_index() API with metric/num_partitions parameters is "
"deprecated and will be removed in a future version. "
"Please migrate to the new unified API:\n"
" # Old (deprecated):\n"
" table.create_index('l2', vector_column_name='my_vector')\n"
" # New (recommended):\n"
" table.create_index('my_vector', config=IvfPq(distance_type='l2'))",
DeprecationWarning,
stacklevel=2,
)
index_type = index_type.upper()
if index_type == "VECTOR" or index_type == "IVF_PQ":
config = IvfPq(
distance_type=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
num_bits=num_bits,
)
elif index_type == "IVF_RQ":
config = IvfRq(
distance_type=metric,
num_partitions=num_partitions,
num_bits=num_bits,
)
elif index_type == "IVF_SQ":
config = IvfSq(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_HNSW_PQ":
raise ValueError(
"IVF_HNSW_PQ is not supported on LanceDB cloud."
"Please use IVF_HNSW_SQ instead."
)
elif index_type == "IVF_HNSW_SQ":
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_HNSW_FLAT":
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_FLAT":
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
column = vector_column_name
if accelerator is not None:
logging.warning(
"GPU accelerator is not yet supported on LanceDB cloud."
"If you have 100M+ vectors to index,"
"please contact us at contact@lancedb.com"
)
if replace is not None:
logging.warning(
"replace is not supported on LanceDB cloud."
"Existing indexes will always be replaced."
)
idx_type = index_type.upper()
if idx_type == "VECTOR" or idx_type == "IVF_PQ":
config = IvfPq(
distance_type=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
num_bits=num_bits,
)
elif idx_type == "IVF_RQ":
config = IvfRq(
distance_type=metric,
num_partitions=num_partitions,
num_bits=num_bits,
)
elif idx_type == "IVF_SQ":
config = IvfSq(distance_type=metric, num_partitions=num_partitions)
elif idx_type == "IVF_HNSW_PQ":
raise ValueError(
"IVF_HNSW_PQ is not supported on LanceDB cloud."
"Please use IVF_HNSW_SQ instead."
)
elif idx_type == "IVF_HNSW_SQ":
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
elif idx_type == "IVF_HNSW_FLAT":
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
elif idx_type == "IVF_FLAT":
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
else:
raise ValueError(
f"Unknown vector index type: {idx_type}. Valid options are"
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
)
else:
raise ValueError(
f"Unknown vector index type: {index_type}. Valid options are"
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
)
column = metric
LOOP.run(
self._table.create_index(
vector_column_name,
column,
config=config,
wait_timeout=wait_timeout,
name=name,
@@ -308,6 +474,37 @@ class RemoteTable(Table):
)
)
def _is_legacy_create_index_call(
self,
first_arg: str,
config: Optional[IndexConfigType],
num_partitions: Optional[int],
num_sub_vectors: Optional[int],
vector_column_name: str,
accelerator: Optional[str],
index_cache_size: Optional[int],
replace: Optional[bool],
) -> bool:
"""Detect if this is a legacy create_index call."""
if config is not None:
return False
if any(
x is not None
for x in (
num_partitions,
num_sub_vectors,
accelerator,
index_cache_size,
replace,
)
):
return True
if vector_column_name != VECTOR_COLUMN_NAME:
return True
if first_arg.lower() in KNOWN_METRICS:
return True
return False
def add(
self,
data: DATA,
@@ -668,6 +865,10 @@ class RemoteTable(Table):
"""Not supported on LanceDB Cloud."""
return LOOP.run(self._table.unset_lsm_write_spec())
def close_lsm_writers(self) -> None:
"""No-op on LanceDB Cloud (no local shard writers)."""
return LOOP.run(self._table.close_lsm_writers())
def drop_index(self, index_name: str):
return LOOP.run(self._table.drop_index(index_name))

View File

@@ -174,6 +174,24 @@ if TYPE_CHECKING:
DistanceType,
)
# Type alias for index configuration objects
IndexConfigType = Union[
IvfFlat,
IvfPq,
IvfSq,
IvfRq,
HnswFlat,
HnswPq,
HnswSq,
BTree,
Bitmap,
LabelList,
FTS,
]
# Known distance metrics for legacy API detection
KNOWN_METRICS = {"l2", "cosine", "dot", "hamming"}
def _into_pyarrow_reader(
data, schema: Optional[pa.Schema] = None
@@ -807,11 +825,49 @@ class Table(ABC):
"""
raise NotImplementedError
# New unified API overload
@overload
def create_index(
self,
metric="l2",
num_partitions=256,
num_sub_vectors=96,
column: str,
/,
*,
config: IndexConfigType,
replace: bool = ...,
wait_timeout: Optional[timedelta] = ...,
name: Optional[str] = ...,
train: bool = ...,
) -> None: ...
# Legacy API overload (deprecated)
@overload
def create_index(
self,
metric: Literal["l2", "cosine", "dot", "hamming"] = ...,
num_partitions: Optional[int] = ...,
num_sub_vectors: Optional[int] = ...,
vector_column_name: str = ...,
replace: bool = ...,
accelerator: Optional[str] = ...,
index_cache_size: Optional[int] = ...,
*,
index_type: VectorIndexType = ...,
wait_timeout: Optional[timedelta] = ...,
num_bits: int = ...,
max_iterations: int = ...,
sample_rate: int = ...,
m: int = ...,
ef_construction: int = ...,
name: Optional[str] = ...,
train: bool = ...,
target_partition_size: Optional[int] = ...,
) -> None: ...
def create_index(
self,
metric: DistanceType = "l2",
num_partitions: Optional[int] = None,
num_sub_vectors: Optional[int] = None,
vector_column_name: str = VECTOR_COLUMN_NAME,
replace: bool = True,
accelerator: Optional[str] = None,
@@ -824,46 +880,53 @@ class Table(ABC):
sample_rate: int = 256,
m: int = 20,
ef_construction: int = 300,
config: Optional[IndexConfigType] = None,
name: Optional[str] = None,
train: bool = True,
target_partition_size: Optional[int] = None,
):
"""Create an index on the table.
"""Create an index on a column.
This method supports both the new unified API and the legacy API
for backwards compatibility. The new API takes the column name as the
first positional argument and an index configuration object via
``config``; the legacy API takes the distance metric as the first
argument plus separate ``vector_column_name`` / ``num_partitions`` /
etc. parameters, and emits a ``DeprecationWarning``.
Parameters
----------
metric: str, default "l2"
The distance metric to use when creating the index.
Valid values are "l2", "cosine", "dot", or "hamming".
l2 is euclidean distance.
Hamming is available only for binary vectors.
num_partitions: int, default 256
The number of IVF partitions to use when creating the index.
Default is 256.
num_sub_vectors: int, default 96
The number of PQ sub-vectors to use when creating the index.
Default is 96.
vector_column_name: str, default "vector"
The vector column name to create the index.
replace: bool, default True
- If True, replace the existing index if it exists.
metric : str
For new API: the column name to index.
For legacy API: the distance metric ("l2", "cosine", "dot", "hamming").
config : IndexConfigType, optional
The index configuration object. If provided, uses the new unified API.
Can be one of: IvfFlat, IvfPq, IvfSq, IvfRq, HnswPq, HnswSq,
BTree, Bitmap, LabelList, FTS.
replace : bool, default True
Whether to replace an existing index on this column.
wait_timeout : timedelta, optional
Timeout to wait for async indexing to complete.
name : str, optional
Custom name for the index.
train : bool, default True
Whether to train the index with existing data.
- If False, raise an error if duplicate index exists.
accelerator: str, default None
If set, use the given accelerator to create the index.
Only support "cuda" for now.
index_cache_size : int, optional
The size of the index cache in number of entries. Default value is 256.
num_bits: int
The number of bits to encode sub-vectors. Only used with the IVF_PQ index.
Only 4 and 8 are supported.
wait_timeout: timedelta, optional
The timeout to wait if indexing is asynchronous.
name: str, optional
The name of the index. If not provided, a default name will be generated.
train: bool, default True
Whether to train the index with existing data. Vector indices always train
with existing data.
Examples
--------
New API (recommended):
>>> table.create_index( # doctest: +SKIP
... "vector", config=IvfPq(distance_type="l2")
... )
>>> table.create_index("category", config=BTree()) # doctest: +SKIP
>>> table.create_index("content", config=FTS()) # doctest: +SKIP
Legacy API (deprecated):
>>> table.create_index( # doctest: +SKIP
... "l2", vector_column_name="vector"
... )
"""
raise NotImplementedError
@@ -1188,7 +1251,7 @@ class Table(ABC):
... .when_not_matched_insert_all() \\
... .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0, num_attempts=1)
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0, num_attempts=1, num_rows=3)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
@@ -2250,11 +2313,51 @@ class LanceTable(Table):
dataset, allow_pyarrow_filter=False, batch_size=batch_size
)
# New unified API overload
@overload
def create_index(
self,
metric: DistanceType = "l2",
num_partitions=None,
num_sub_vectors=None,
column: str,
/,
*,
config: IndexConfigType,
replace: bool = ...,
wait_timeout: Optional[timedelta] = ...,
name: Optional[str] = ...,
train: bool = ...,
) -> None: ...
# Legacy API overload (deprecated)
@overload
def create_index(
self,
metric: Literal["l2", "cosine", "dot", "hamming"] = ...,
num_partitions: Optional[int] = ...,
num_sub_vectors: Optional[int] = ...,
vector_column_name: str = ...,
replace: bool = ...,
accelerator: Optional[str] = ...,
index_cache_size: Optional[int] = ...,
num_bits: int = ...,
index_type: Literal[
"IVF_FLAT", "IVF_SQ", "IVF_PQ", "IVF_RQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
] = ...,
max_iterations: int = ...,
sample_rate: int = ...,
m: int = ...,
ef_construction: int = ...,
*,
wait_timeout: Optional[timedelta] = ...,
name: Optional[str] = ...,
train: bool = ...,
target_partition_size: Optional[int] = ...,
) -> None: ...
def create_index(
self,
metric: str = "l2",
num_partitions: Optional[int] = None,
num_sub_vectors: Optional[int] = None,
vector_column_name: str = VECTOR_COLUMN_NAME,
replace: bool = True,
accelerator: Optional[str] = None,
@@ -2274,47 +2377,232 @@ class LanceTable(Table):
m: int = 20,
ef_construction: int = 300,
*,
config: Optional[IndexConfigType] = None,
wait_timeout: Optional[timedelta] = None,
name: Optional[str] = None,
train: bool = True,
target_partition_size: Optional[int] = None,
):
"""Create an index on the table."""
if accelerator is not None:
# accelerator is only supported through pylance.
self.to_lance().create_index(
column=vector_column_name,
index_type=index_type,
"""Create an index on a column.
This method supports both the new unified API and the legacy API
for backwards compatibility. The new API takes the column name as the
first positional argument and an index configuration object via
``config``; the legacy API takes the distance metric as the first
argument plus separate ``vector_column_name`` / ``num_partitions`` /
etc. parameters, and emits a ``DeprecationWarning``.
Parameters
----------
metric : str
For new API: the column name to index.
For legacy API: the distance metric ("l2", "cosine", "dot", "hamming").
config : IndexConfigType, optional
The index configuration object. If provided, uses the new unified API.
Can be one of: IvfFlat, IvfPq, IvfSq, IvfRq, HnswPq, HnswSq,
BTree, Bitmap, LabelList, FTS.
replace : bool, default True
Whether to replace an existing index on this column.
wait_timeout : timedelta, optional
Timeout to wait for async indexing to complete.
name : str, optional
Custom name for the index.
train : bool, default True
Whether to train the index with existing data.
Examples
--------
New API (recommended):
>>> table.create_index( # doctest: +SKIP
... "vector", config=IvfPq(distance_type="l2")
... )
>>> table.create_index("category", config=BTree()) # doctest: +SKIP
>>> table.create_index("content", config=FTS()) # doctest: +SKIP
Legacy API (deprecated):
>>> table.create_index( # doctest: +SKIP
... "l2", vector_column_name="vector"
... )
"""
# Detect whether this is a legacy API call
is_legacy = self._is_legacy_create_index_call(
metric,
config,
num_partitions,
num_sub_vectors,
vector_column_name,
accelerator,
index_cache_size,
)
if is_legacy:
warnings.warn(
"The create_index() API with metric/num_partitions parameters is "
"deprecated and will be removed in a future version. "
"Please migrate to the new unified API:\n"
" # Old (deprecated):\n"
" table.create_index('l2', vector_column_name='my_vector')\n"
" # New (recommended):\n"
" table.create_index('my_vector', config=IvfPq(distance_type='l2'))",
DeprecationWarning,
stacklevel=2,
)
# Legacy API: first arg is the distance metric
column = vector_column_name
# Build config from legacy parameters
config = self._build_vector_config_from_legacy_params(
metric=metric,
index_type=index_type,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
replace=replace,
accelerator=accelerator,
index_cache_size=index_cache_size,
num_bits=num_bits,
max_iterations=max_iterations,
sample_rate=sample_rate,
m=m,
ef_construction=ef_construction,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
self.checkout_latest()
return
elif index_type == "IVF_FLAT":
config = IvfFlat(
# Handle accelerator through pylance
if accelerator is not None:
self.to_lance().create_index(
column=column,
index_type=index_type,
metric=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
replace=replace,
accelerator=accelerator,
index_cache_size=index_cache_size,
num_bits=num_bits,
m=m,
ef_construction=ef_construction,
target_partition_size=target_partition_size,
)
self.checkout_latest()
return
else:
# New API: metric is the column name
column = metric
# Check if config has accelerator set and dispatch to pylance
if config is not None and hasattr(config, "accelerator"):
acc = getattr(config, "accelerator", None)
if acc is not None:
# Dispatch to pylance for GPU acceleration
index_type_map = {
"IvfFlat": "IVF_FLAT",
"IvfSq": "IVF_SQ",
"IvfPq": "IVF_PQ",
"IvfRq": "IVF_RQ",
"HnswPq": "IVF_HNSW_PQ",
"HnswSq": "IVF_HNSW_SQ",
}
cfg_type = type(config).__name__
lance_index_type = index_type_map.get(cfg_type, "IVF_PQ")
self.to_lance().create_index(
column=column,
index_type=lance_index_type,
metric=getattr(config, "distance_type", "l2"),
num_partitions=getattr(config, "num_partitions", None),
num_sub_vectors=getattr(config, "num_sub_vectors", None),
replace=replace,
accelerator=acc,
num_bits=getattr(config, "num_bits", 8),
m=getattr(config, "m", 20),
ef_construction=getattr(config, "ef_construction", 300),
target_partition_size=getattr(
config, "target_partition_size", None
),
)
self.checkout_latest()
return
return LOOP.run(
self._table.create_index(
column,
replace=replace,
config=config,
wait_timeout=wait_timeout,
name=name,
train=train,
)
)
def _is_legacy_create_index_call(
self,
first_arg: str,
config: Optional[IndexConfigType],
num_partitions: Optional[int],
num_sub_vectors: Optional[int],
vector_column_name: str,
accelerator: Optional[str],
index_cache_size: Optional[int],
) -> bool:
"""Detect if this is a legacy create_index call."""
# If config is provided, it's definitely the new API
if config is not None:
return False
# If old-style parameters were explicitly set, it's legacy
if any(
x is not None
for x in (num_partitions, num_sub_vectors, accelerator, index_cache_size)
):
return True
# If vector_column_name differs from default, it's legacy
if vector_column_name != VECTOR_COLUMN_NAME:
return True
# If first arg is a known metric, assume legacy
if first_arg.lower() in KNOWN_METRICS:
return True
# Otherwise assume new API
return False
def _build_vector_config_from_legacy_params(
self,
metric: str,
index_type: str,
num_partitions: Optional[int],
num_sub_vectors: Optional[int],
num_bits: int,
max_iterations: int,
sample_rate: int,
m: int,
ef_construction: int,
target_partition_size: Optional[int],
accelerator: Optional[str],
) -> IndexConfigType:
"""Build an index config object from legacy parameters."""
if index_type == "IVF_FLAT":
return IvfFlat(
distance_type=metric,
num_partitions=num_partitions,
max_iterations=max_iterations,
sample_rate=sample_rate,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
elif index_type == "IVF_SQ":
config = IvfSq(
return IvfSq(
distance_type=metric,
num_partitions=num_partitions,
max_iterations=max_iterations,
sample_rate=sample_rate,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
elif index_type == "IVF_PQ":
config = IvfPq(
return IvfPq(
distance_type=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
@@ -2322,18 +2610,20 @@ class LanceTable(Table):
max_iterations=max_iterations,
sample_rate=sample_rate,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
elif index_type == "IVF_RQ":
config = IvfRq(
return IvfRq(
distance_type=metric,
num_partitions=num_partitions,
num_bits=num_bits,
max_iterations=max_iterations,
sample_rate=sample_rate,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
elif index_type == "IVF_HNSW_PQ":
config = HnswPq(
return HnswPq(
distance_type=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
@@ -2343,9 +2633,10 @@ class LanceTable(Table):
m=m,
ef_construction=ef_construction,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
elif index_type == "IVF_HNSW_SQ":
config = HnswSq(
return HnswSq(
distance_type=metric,
num_partitions=num_partitions,
max_iterations=max_iterations,
@@ -2353,9 +2644,10 @@ class LanceTable(Table):
m=m,
ef_construction=ef_construction,
target_partition_size=target_partition_size,
accelerator=accelerator,
)
elif index_type == "IVF_HNSW_FLAT":
config = HnswFlat(
return HnswFlat(
distance_type=metric,
num_partitions=num_partitions,
max_iterations=max_iterations,
@@ -2367,16 +2659,6 @@ class LanceTable(Table):
else:
raise ValueError(f"Unknown index type {index_type}")
return LOOP.run(
self._table.create_index(
vector_column_name,
replace=replace,
config=config,
name=name,
train=train,
)
)
def drop_index(self, name: str) -> None:
"""
Drops an index from the table
@@ -2476,6 +2758,11 @@ class LanceTable(Table):
"""
return LOOP.run(self._table.latest_storage_options())
@deprecation.deprecated(
deprecated_in="0.25.0",
current_version=__version__,
details="Use create_index() with config=BTree()/Bitmap()/LabelList() instead.",
)
def create_scalar_index(
self,
column: str,
@@ -2484,6 +2771,12 @@ class LanceTable(Table):
index_type: ScalarIndexType = "BTREE",
name: Optional[str] = None,
):
"""Create a scalar index on a column.
.. deprecated:: 0.25.0
Use :meth:`create_index` with a BTree, Bitmap, or LabelList config instead.
Example: ``table.create_index("column", config=BTree())``
"""
if index_type == "BTREE":
config = BTree()
elif index_type == "BITMAP":
@@ -2496,6 +2789,11 @@ class LanceTable(Table):
self._table.create_index(column, replace=replace, config=config, name=name)
)
@deprecation.deprecated(
deprecated_in="0.25.0",
current_version=__version__,
details="Use create_index() with config=FTS() instead.",
)
def create_fts_index(
self,
field_names: Union[str, List[str]],
@@ -2519,6 +2817,12 @@ class LanceTable(Table):
prefix_only: bool = False,
name: Optional[str] = None,
):
"""Create a full-text search index on a column.
.. deprecated:: 0.25.0
Use :meth:`create_index` with an FTS config instead.
Example: ``table.create_index("text_column", config=FTS())``
"""
self._ensure_no_legacy_fts_index()
if use_tantivy:
@@ -3297,6 +3601,11 @@ class LanceTable(Table):
[`AsyncTable.unset_lsm_write_spec`][lancedb.AsyncTable.unset_lsm_write_spec]."""
return LOOP.run(self._table.unset_lsm_write_spec())
def close_lsm_writers(self) -> None:
"""Close cached MemWAL shard writers. See
[`AsyncTable.close_lsm_writers`][lancedb.AsyncTable.close_lsm_writers]."""
return LOOP.run(self._table.close_lsm_writers())
def uses_v2_manifest_paths(self) -> bool:
"""
Check if the table is using the new v2 manifest paths.
@@ -3905,6 +4214,16 @@ class AsyncTable:
"""
await self._inner.unset_lsm_write_spec()
async def close_lsm_writers(self) -> None:
"""Drain and close any cached MemWAL shard writers for this table.
When an LSM write spec is installed, `merge_insert` opens MemWAL shard
writers and caches them for reuse across calls. This closes them,
flushing pending data; writers reopen lazily on the next
`merge_insert`. It is a no-op when no writers are cached.
"""
await self._inner.close_lsm_writers()
@property
def name(self) -> str:
"""The name of the table."""
@@ -4355,7 +4674,7 @@ class AsyncTable:
... .when_not_matched_insert_all() \\
... .execute(new_data)
>>> res
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0, num_attempts=1)
MergeResult(version=2, num_updated_rows=2, num_inserted_rows=1, num_deleted_rows=0, num_attempts=1, num_rows=3)
>>> # The order of new rows is non-deterministic since we use
>>> # a hash-join as part of this operation and so we sort here
>>> table.to_arrow().sort_by("a").to_pandas()
@@ -4735,6 +5054,8 @@ class AsyncTable:
when_not_matched_by_source_condition=merge._when_not_matched_by_source_condition,
timeout=merge._timeout,
use_index=merge._use_index,
use_lsm_write=merge._use_lsm_write,
validate_single_shard=merge._validate_single_shard,
),
)

View File

@@ -57,7 +57,7 @@ async def test_upsert_async(mem_db_async):
await table.count_rows() # 3
res
# MergeResult(version=2, num_updated_rows=1,
# num_inserted_rows=1, num_deleted_rows=0)
# num_inserted_rows=1, num_deleted_rows=0, num_rows=2)
# --8<-- [end:upsert_basic_async]
assert await table.count_rows() == 3
assert res.version == 2
@@ -86,7 +86,7 @@ def test_insert_if_not_exists(mem_db):
table.count_rows() # 3
res
# MergeResult(version=2, num_updated_rows=0,
# num_inserted_rows=1, num_deleted_rows=0)
# num_inserted_rows=1, num_deleted_rows=0, num_rows=1)
# --8<-- [end:insert_if_not_exists]
assert table.count_rows() == 3
assert res.version == 2
@@ -116,7 +116,7 @@ async def test_insert_if_not_exists_async(mem_db_async):
await table.count_rows() # 3
res
# MergeResult(version=2, num_updated_rows=0,
# num_inserted_rows=1, num_deleted_rows=0)
# num_inserted_rows=1, num_deleted_rows=0, num_rows=1)
# --8<-- [end:insert_if_not_exists]
assert await table.count_rows() == 3
assert res.version == 2
@@ -150,7 +150,7 @@ def test_replace_range(mem_db):
table.count_rows("doc_id = 1") # 1
res
# MergeResult(version=2, num_updated_rows=1,
# num_inserted_rows=0, num_deleted_rows=1)
# num_inserted_rows=0, num_deleted_rows=1, num_rows=1)
# --8<-- [end:insert_if_not_exists]
assert table.count_rows("doc_id = 1") == 1
assert res.version == 2
@@ -185,7 +185,7 @@ async def test_replace_range_async(mem_db_async):
await table.count_rows("doc_id = 1") # 1
res
# MergeResult(version=2, num_updated_rows=1,
# num_inserted_rows=0, num_deleted_rows=1)
# num_inserted_rows=0, num_deleted_rows=1, num_rows=1)
# --8<-- [end:insert_if_not_exists]
assert await table.count_rows("doc_id = 1") == 1
assert res.version == 2

View File

@@ -215,11 +215,12 @@ def test_reject_legacy_tantivy_index(table):
@pytest.mark.parametrize("with_position", [True, False])
def test_create_inverted_index(table, with_position):
table.create_fts_index(
"text",
with_position=with_position,
name="custom_fts_index",
)
with pytest.warns(DeprecationWarning, match="create_fts_index"):
table.create_fts_index(
"text",
with_position=with_position,
name="custom_fts_index",
)
indices = table.list_indices()
fts_indices = [i for i in indices if i.index_type == "FTS"]
assert any(i.name == "custom_fts_index" for i in fts_indices)

View File

@@ -162,12 +162,13 @@ async def test_create_bitmap_index(some_table: AsyncTable):
await some_table.create_index("data", config=Bitmap())
indices = await some_table.list_indices()
assert len(indices) == 3
# list_indices returns indices in alphabetical order by name
assert indices[0].index_type == "Bitmap"
assert indices[0].columns == ["id"]
assert indices[0].columns == ["data"]
assert indices[1].index_type == "Bitmap"
assert indices[1].columns == ["is_active"]
assert indices[1].columns == ["id"]
assert indices[2].index_type == "Bitmap"
assert indices[2].columns == ["data"]
assert indices[2].columns == ["is_active"]
index_name = indices[0].name
stats = await some_table.index_stats(index_name)

View File

@@ -0,0 +1,196 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Tests for the MemWAL LSM ``merge_insert`` dispatch."""
from datetime import timedelta
import lancedb
import pyarrow as pa
import pytest
from lancedb._lancedb import LsmWriteSpec
SCHEMA = pa.schema(
[
pa.field("id", pa.int64(), nullable=False),
pa.field("value", pa.int64(), nullable=False),
]
)
REGION_SCHEMA = pa.schema(
[
pa.field("id", pa.int64(), nullable=False),
pa.field("region", pa.utf8(), nullable=False),
]
)
def _reader(ids):
batch = pa.RecordBatch.from_arrays(
[
pa.array(ids, type=pa.int64()),
pa.array(list(range(len(ids))), type=pa.int64()),
],
schema=SCHEMA,
)
return pa.RecordBatchReader.from_batches(SCHEMA, [batch])
def _region_reader(rows):
batch = pa.RecordBatch.from_arrays(
[
pa.array([row[0] for row in rows], type=pa.int64()),
pa.array([row[1] for row in rows], type=pa.utf8()),
],
schema=REGION_SCHEMA,
)
return pa.RecordBatchReader.from_batches(REGION_SCHEMA, [batch])
def _bucket_table(tmp_path):
"""A table with ``id`` as the primary key and a single-bucket LSM spec."""
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _reader([1, 2, 3]))
table.set_unenforced_primary_key("id")
# num_buckets = 1: every row routes to the single bucket.
table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1))
return table
def test_lsm_merge_insert_bucket(tmp_path):
table = _bucket_table(tmp_path)
# Empty `on` defaults to the primary key.
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([3, 4, 5]))
)
# LSM path: rows go to the MemWAL, so only num_rows is populated.
assert result.num_rows == 3
assert result.version == 0
assert result.num_inserted_rows == 0
assert result.num_updated_rows == 0
def test_lsm_merge_insert_unsharded(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _reader([1, 2, 3]))
table.set_unenforced_primary_key("id")
table.set_lsm_write_spec(LsmWriteSpec.unsharded())
result = (
table.merge_insert("id")
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([10, 11, 12, 13]))
)
assert result.num_rows == 4
def test_lsm_merge_insert_identity(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _region_reader([(1, "us"), (2, "us")]))
table.set_unenforced_primary_key("id")
table.set_lsm_write_spec(LsmWriteSpec.identity("region"))
# All rows share one identity value, so they route to one shard.
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_region_reader([(3, "us"), (4, "us")]))
)
assert result.num_rows == 2
def test_lsm_merge_insert_use_lsm_write_false(tmp_path):
table = _bucket_table(tmp_path) # rows id = 1, 2, 3
# use_lsm_write(False) opts out: the standard path runs and commits.
result = (
table.merge_insert("id")
.when_not_matched_insert_all()
.use_lsm_write(False)
.execute(_reader([3, 4, 5]))
)
assert result.num_inserted_rows == 2
assert table.count_rows() == 5
def test_lsm_merge_insert_validate_single_shard_off(tmp_path):
table = _bucket_table(tmp_path)
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.validate_single_shard(False)
.execute(_reader([6, 7, 8]))
)
assert result.num_rows == 3
def test_lsm_merge_insert_use_lsm_write_true_requires_spec(tmp_path):
# A table with a primary key but no LSM write spec installed.
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _reader([1, 2, 3]))
table.set_unenforced_primary_key("id")
with pytest.raises(Exception, match="use_lsm_write"):
(
table.merge_insert("id")
.when_matched_update_all()
.when_not_matched_insert_all()
.use_lsm_write(True)
.execute(_reader([4]))
)
def test_lsm_merge_insert_rejects_on_not_primary_key(tmp_path):
table = _bucket_table(tmp_path)
with pytest.raises(Exception, match="primary key"):
(
table.merge_insert("value")
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([1]))
)
def test_lsm_merge_insert_rejects_non_upsert(tmp_path):
table = _bucket_table(tmp_path)
# Insert-only (no when_matched_update_all) is not the upsert shape.
with pytest.raises(Exception, match="upsert"):
table.merge_insert([]).when_not_matched_insert_all().execute(_reader([4]))
def test_lsm_close_writers(tmp_path):
table = _bucket_table(tmp_path)
(
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([7, 8]))
)
table.close_lsm_writers()
# The writer reopens lazily on the next merge_insert.
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([9]))
)
assert result.num_rows == 1
@pytest.mark.asyncio
async def test_async_lsm_merge_insert(tmp_path):
db = await lancedb.connect_async(
tmp_path, read_consistency_interval=timedelta(seconds=0)
)
table = await db.create_table("t", _reader([1, 2, 3]))
await table.set_unenforced_primary_key("id")
await table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1))
builder = (
table.merge_insert([]).when_matched_update_all().when_not_matched_insert_all()
)
result = await builder.execute(_reader([3, 4, 5]))
assert result.num_rows == 3
await table.close_lsm_writers()

View File

@@ -1,12 +1,13 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import re
from concurrent.futures import ThreadPoolExecutor
import contextlib
from datetime import timedelta
import http.server
import json
import multiprocessing as mp
import pickle
import re
import sys
import threading
import time
@@ -171,6 +172,155 @@ def test_table_len_sync():
assert len(table) == 1
def test_remote_connection_serializes():
def handler(request):
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"tables": []}')
with mock_lancedb_connection(handler) as db:
serialized = json.loads(db.serialize())
assert isinstance(serialized["client_config"], dict)
restored = lancedb.deserialize_conn(db.serialize())
assert restored.table_names() == []
def test_remote_table_is_picklable():
def handler(request):
request.close_connection = True
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
payload = json.dumps(
{
"version": 1,
"schema": {
"fields": [
{"name": "id", "type": {"type": "int64"}, "nullable": False}
]
},
}
)
request.wfile.write(payload.encode())
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"3")
else:
request.send_response(404)
request.end_headers()
with mock_lancedb_connection(handler) as db:
table = db.open_table("test")
restored = pickle.loads(pickle.dumps(table))
assert restored.count_rows() == 3
def test_remote_table_open_does_not_require_picklable_client_config():
from lancedb.remote import HeaderProvider
class LocalHeaderProvider(HeaderProvider):
def get_headers(self):
return {"X-Test-Header": "present"}
def handler(request):
request.close_connection = True
assert request.headers.get("X-Test-Header") == "present"
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"version": 1, "schema": {"fields": []}}')
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"3")
else:
request.send_response(404)
request.end_headers()
with http.server.HTTPServer(
("localhost", 0), make_mock_http_handler(handler)
) as server:
port = server.server_address[1]
handle = threading.Thread(target=server.serve_forever)
handle.start()
try:
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
"header_provider": LocalHeaderProvider(),
},
)
table = db.open_table("test")
assert table.count_rows() == 3
with pytest.raises(ValueError, match="header_provider"):
pickle.dumps(table)
finally:
server.shutdown()
handle.join()
def test_remote_permutation_is_picklable():
from lancedb.permutation import Permutation
rows = list(range(10))
def handler(request):
request.close_connection = True
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
payload = json.dumps(
{
"version": 1,
"schema": {
"fields": [
{"name": "a", "type": {"type": "int64"}, "nullable": False}
]
},
}
)
request.wfile.write(payload.encode())
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(str(len(rows)).encode())
elif request.path == "/v1/table/test/query/":
content_len = int(request.headers.get("Content-Length"))
body = json.loads(request.rfile.read(content_len))
if "filter" in body:
match = re.search(r"_rowoffset in \((.*?)\)", body["filter"])
offsets = [int(offset.strip()) for offset in match.group(1).split(",")]
else:
offsets = rows
table = pa.table({"a": [rows[offset] for offset in offsets]})
request.send_response(200)
request.send_header("Content-Type", "application/vnd.apache.arrow.file")
request.end_headers()
with pa.ipc.new_file(request.wfile, schema=table.schema) as writer:
writer.write_table(table)
else:
request.send_response(404)
request.end_headers()
with mock_lancedb_connection(handler) as db:
permutation = Permutation.identity(db.open_table("test"))
restored = pickle.loads(pickle.dumps(permutation))
assert restored.__getitems__([0, 2, 4]) == [{"a": 0}, {"a": 2}, {"a": 4}]
def test_create_table_exist_ok():
def handler(request):
if request.path == "/v1/table/test/create/?mode=exist_ok":
@@ -436,22 +586,25 @@ def test_table_create_indices():
# This is a smoke-test.
table = db.create_table("test", [{"id": 1}])
# Test create_scalar_index with custom name
table.create_scalar_index(
"id", wait_timeout=timedelta(seconds=2), name="custom_scalar_idx"
)
# Test create_scalar_index with custom name (legacy method)
with pytest.warns(DeprecationWarning, match="create_scalar_index"):
table.create_scalar_index(
"id", wait_timeout=timedelta(seconds=2), name="custom_scalar_idx"
)
# Test create_fts_index with custom name
table.create_fts_index(
"text", wait_timeout=timedelta(seconds=2), name="custom_fts_idx"
)
# Test create_fts_index with custom name (legacy method)
with pytest.warns(DeprecationWarning, match="create_fts_index"):
table.create_fts_index(
"text", wait_timeout=timedelta(seconds=2), name="custom_fts_idx"
)
# Test create_index with custom name
table.create_index(
vector_column_name="vector",
wait_timeout=timedelta(seconds=10),
name="custom_vector_idx",
)
# Test create_index with custom name (legacy form: vector_column_name kwarg)
with pytest.warns(DeprecationWarning, match="create_index"):
table.create_index(
vector_column_name="vector",
wait_timeout=timedelta(seconds=10),
name="custom_vector_idx",
)
# Validate that the name parameter was passed correctly in requests
assert len(received_requests) == 3
@@ -480,6 +633,98 @@ def test_table_create_indices():
table.drop_index("custom_fts_idx")
def test_remote_create_index_new_api():
received_requests = []
def handler(request):
if request.path == "/v1/table/test/create_index/":
content_len = int(request.headers.get("Content-Length", 0))
body = request.rfile.read(content_len) if content_len > 0 else b""
received_requests.append(json.loads(body) if body else {})
request.send_response(200)
request.end_headers()
elif request.path == "/v1/table/test/create/?mode=create":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"{}")
elif request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(
json.dumps(
dict(
version=1,
schema=dict(
fields=[
dict(name="id", type={"type": "int64"}, nullable=False),
dict(
name="category",
type={"type": "string"},
nullable=False,
),
dict(
name="text", type={"type": "string"}, nullable=False
),
dict(
name="vector",
type={
"type": "fixed_size_list",
"fields": [
dict(
name="item",
type={"type": "float"},
nullable=True,
)
],
"length": 2,
},
nullable=False,
),
]
),
)
).encode()
)
else:
request.send_response(404)
request.end_headers()
from lancedb.index import BTree, FTS, IvfPq, IvfRq
with mock_lancedb_connection(handler) as db:
table = db.create_table("test", [{"id": 1}])
# New API: column-first, config= kwarg. Should NOT emit DeprecationWarning.
import warnings as _warnings
with _warnings.catch_warnings():
_warnings.simplefilter("error", DeprecationWarning)
table.create_index("vector", config=IvfPq(distance_type="l2"))
table.create_index("category", config=BTree())
table.create_index("text", config=FTS())
# IvfRq via new API
table.create_index("vector", config=IvfRq(distance_type="l2"))
# Legacy index_type="IVF_RQ" routes to IvfRq config under the hood.
with pytest.warns(DeprecationWarning, match="create_index"):
table.create_index(
vector_column_name="vector",
index_type="IVF_RQ",
num_partitions=8,
)
assert len(received_requests) == 5
assert [req["column"] for req in received_requests] == [
"vector",
"category",
"text",
"vector",
"vector",
]
def test_table_wait_for_index_timeout():
def handler(request):
index_stats = dict(
@@ -1305,6 +1550,10 @@ def _remote_fork_child(port: int, queue) -> None:
queue.put(db.table_names())
def _remote_table_fork_child(table, queue) -> None:
queue.put(table.count_rows())
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
@@ -1367,3 +1616,65 @@ def test_remote_connection_after_fork():
finally:
server.shutdown()
server_thread.join()
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
"fork() is unavailable on Windows and unsafe on macOS "
"(Apple frameworks/TLS are not fork-safe)"
),
)
def test_inherited_remote_table_reopens_after_fork():
def handler(request):
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"version": 1, "schema": {"fields": []}}')
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"7")
else:
request.send_response(404)
request.end_headers()
server = http.server.HTTPServer(("localhost", 0), make_mock_http_handler(handler))
port = server.server_address[1]
server_thread = threading.Thread(target=server.serve_forever)
server_thread.start()
try:
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
table = db.open_table("test")
assert table.count_rows() == 7
ctx = mp.get_context("fork")
queue = ctx.Queue()
proc = ctx.Process(target=_remote_table_fork_child, args=(table, queue))
proc.start()
proc.join(timeout=15)
if proc.is_alive():
proc.terminate()
proc.join(timeout=5)
if proc.is_alive():
proc.kill()
proc.join()
pytest.fail("Remote table hung after fork")
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no result"
assert queue.get() == 7
finally:
server.shutdown()
server_thread.join()

View File

@@ -4,6 +4,7 @@
import os
import sys
import warnings
from datetime import date, datetime, timedelta
from time import sleep
from typing import List
@@ -11,7 +12,7 @@ from unittest.mock import patch
import lancedb
from lancedb.dependencies import _PANDAS_AVAILABLE
from lancedb.index import HnswFlat, HnswPq, HnswSq, IvfPq
from lancedb.index import BTree, FTS, HnswFlat, HnswPq, HnswSq, IvfPq
import numpy as np
import polars as pl
import pyarrow as pa
@@ -928,7 +929,12 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
num_bits=4,
)
mock_create_index.assert_called_with(
"vector", replace=True, config=expected_config, name=None, train=True
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
)
# Test with target_partition_size
@@ -948,7 +954,12 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
target_partition_size=8192,
)
mock_create_index.assert_called_with(
"vector", replace=True, config=expected_config, name=None, train=True
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
)
# target_partition_size has a default value,
@@ -967,7 +978,12 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
num_bits=4,
)
mock_create_index.assert_called_with(
"vector", replace=True, config=expected_config, name=None, train=True
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
)
table.create_index(
@@ -978,7 +994,12 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
)
expected_config = HnswPq(distance_type="dot")
mock_create_index.assert_called_with(
"my_vector", replace=False, config=expected_config, name=None, train=True
"my_vector",
replace=False,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
)
table.create_index(
@@ -993,7 +1014,12 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
distance_type="cosine", sample_rate=0.1, m=29, ef_construction=10
)
mock_create_index.assert_called_with(
"my_vector", replace=True, config=expected_config, name=None, train=True
"my_vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
)
table.create_index(
@@ -1008,7 +1034,12 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
distance_type="cosine", sample_rate=0.1, m=29, ef_construction=10
)
mock_create_index.assert_called_with(
"my_vector", replace=True, config=expected_config, name=None, train=True
"my_vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
)
@@ -1032,6 +1063,7 @@ def test_create_index_name_and_train_parameters(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name="my_custom_index",
train=True,
)
@@ -1039,13 +1071,82 @@ def test_create_index_name_and_train_parameters(
# Test with train=False
table.create_index(vector_column_name="vector", train=False)
mock_create_index.assert_called_with(
"vector", replace=True, config=expected_config, name=None, train=False
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=False,
)
# Test with both name and train
table.create_index(vector_column_name="vector", name="my_index_name", train=True)
mock_create_index.assert_called_with(
"vector", replace=True, config=expected_config, name="my_index_name", train=True
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name="my_index_name",
train=True,
)
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_legacy_emits_deprecation_warning(
mock_create_index, mem_db: DBConnection
):
table = mem_db.create_table(
"test",
data=[{"vector": [3.1, 4.1]}, {"vector": [5.9, 26.5]}],
)
with pytest.warns(DeprecationWarning, match="create_index"):
table.create_index(metric="l2", num_partitions=8, vector_column_name="vector")
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_new_api(mock_create_index, mem_db: DBConnection):
table = mem_db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "category": "a", "text": "hello world"},
{"vector": [5.9, 26.5], "category": "b", "text": "goodbye"},
],
)
# Vector index via new API should not warn
with warnings.catch_warnings():
warnings.simplefilter("error", DeprecationWarning)
table.create_index("vector", config=IvfPq(distance_type="l2"))
mock_create_index.assert_called_with(
"vector",
replace=True,
config=IvfPq(distance_type="l2"),
wait_timeout=None,
name=None,
train=True,
)
# Scalar index via new API
table.create_index("category", config=BTree())
mock_create_index.assert_called_with(
"category",
replace=True,
config=BTree(),
wait_timeout=None,
name=None,
train=True,
)
# FTS index via new API
table.create_index("text", config=FTS(with_position=True))
mock_create_index.assert_called_with(
"text",
replace=True,
config=FTS(with_position=True),
wait_timeout=None,
name=None,
train=True,
)
@@ -1861,8 +1962,9 @@ def test_create_scalar_index(mem_db: DBConnection):
"my_table",
data=test_data,
)
# Test with default name
table.create_scalar_index("x")
# Test with default name; confirm DeprecationWarning fires
with pytest.warns(DeprecationWarning, match="create_scalar_index"):
table.create_scalar_index("x")
indices = table.list_indices()
assert len(indices) == 1
scalar_index = indices[0]

View File

@@ -1,10 +1,15 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import contextlib
import functools
import http.server
import json
import multiprocessing as mp
import pickle
import re
import sys
import threading
import lancedb
import pyarrow as pa
@@ -15,6 +20,107 @@ from lancedb.util import tbl_to_tensor
torch = pytest.importorskip("torch")
REMOTE_ROWS = list(range(100))
def _make_mock_http_handler(handler):
class MockLanceDBHandler(http.server.BaseHTTPRequestHandler):
def do_GET(self):
handler(self)
def do_POST(self):
handler(self)
return MockLanceDBHandler
def _remote_schema_payload():
return {
"version": 1,
"schema": {
"fields": [
{"name": "a", "type": {"type": "int64"}, "nullable": False},
]
},
}
def _offsets_from_filter(filter_sql: str | None) -> list[int]:
if filter_sql is None:
return REMOTE_ROWS
match = re.search(r"_rowoffset in \((.*?)\)", filter_sql)
if match is None:
return REMOTE_ROWS
raw_offsets = match.group(1).strip()
if raw_offsets == "":
return []
return [int(offset.strip()) for offset in raw_offsets.split(",")]
def _remote_dataset_handler(request):
request.close_connection = True
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(json.dumps(_remote_schema_payload()).encode())
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(str(len(REMOTE_ROWS)).encode())
elif request.path == "/v1/table/test/query/":
content_len = int(request.headers.get("Content-Length"))
body = json.loads(request.rfile.read(content_len))
offsets = _offsets_from_filter(body.get("filter"))
requested_columns = body.get("columns") or ["a"]
if isinstance(requested_columns, dict):
requested_columns = list(requested_columns)
data = {}
for column in requested_columns:
if column == "a":
data[column] = [REMOTE_ROWS[offset] for offset in offsets]
elif column == "_rowoffset":
data[column] = offsets
elif column == "_rowid":
data[column] = offsets
table = pa.table(data)
request.send_response(200)
request.send_header("Content-Type", "application/vnd.apache.arrow.file")
request.end_headers()
with pa.ipc.new_file(request.wfile, schema=table.schema) as writer:
writer.write_table(table)
else:
request.send_response(404)
request.end_headers()
@contextlib.contextmanager
def _remote_dataset_table():
with http.server.ThreadingHTTPServer(
("localhost", 0), _make_mock_http_handler(_remote_dataset_handler)
) as server:
port = server.server_address[1]
handle = threading.Thread(target=server.serve_forever)
handle.start()
try:
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
yield db.open_table("test")
finally:
server.shutdown()
handle.join()
def _open_native_table(uri: str, table_name: str):
"""Top-level connection factory used by the explicit-factory pickle test.
@@ -107,6 +213,39 @@ def test_permutation_dataloader_multiprocessing(tmp_db):
assert seen == 1000
def test_remote_table_dataloader_multiprocessing():
with _remote_dataset_table() as table:
dataloader = torch.utils.data.DataLoader(
table,
collate_fn=tbl_to_tensor,
batch_size=10,
num_workers=2,
multiprocessing_context="spawn",
)
seen = 0
for batch in dataloader:
assert batch.size(0) == 1
assert batch.size(1) == 10
seen += batch.size(1)
assert seen == len(REMOTE_ROWS)
def test_remote_permutation_dataloader_multiprocessing():
with _remote_dataset_table() as table:
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(
permutation,
batch_size=10,
num_workers=2,
multiprocessing_context="spawn",
)
seen = 0
for batch in dataloader:
assert batch["a"].size(0) == 10
seen += batch["a"].size(0)
assert seen == len(REMOTE_ROWS)
def test_permutation_pickle_with_connection_factory(tmp_path):
"""When the user provides a connection_factory, pickling should round-trip
through that factory rather than introspecting the connection URI. Useful
@@ -171,6 +310,35 @@ def _multiworker_dataloader_target(db_uri: str, result_queue):
result_queue.put(count)
def _remote_multiworker_dataloader_target(port: int, result_queue):
import lancedb
from lancedb.permutation import Permutation
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
table = db.open_table("test")
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(
permutation,
batch_size=10,
num_workers=2,
multiprocessing_context="fork",
)
count = 0
for batch in dataloader:
assert batch["a"].size(0) == 10
count += 1
result_queue.put(count)
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
@@ -208,3 +376,46 @@ def test_permutation_dataloader_fork_workers(tmp_path):
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no batches"
assert queue.get() == 100
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
"fork() is unavailable on Windows and unsafe on macOS "
"(Apple frameworks/TLS are not fork-safe)"
),
)
def test_remote_permutation_dataloader_fork_workers():
with http.server.ThreadingHTTPServer(
("localhost", 0), _make_mock_http_handler(_remote_dataset_handler)
) as server:
port = server.server_address[1]
handle = threading.Thread(target=server.serve_forever)
handle.start()
try:
ctx = mp.get_context("spawn")
queue = ctx.Queue()
proc = ctx.Process(
target=_remote_multiworker_dataloader_target,
args=(port, queue),
)
proc.start()
proc.join(timeout=30)
if proc.is_alive():
proc.terminate()
proc.join(timeout=5)
if proc.is_alive():
proc.kill()
proc.join()
pytest.fail(
"Remote permutation hung when iterated in a fork-based "
"DataLoader worker"
)
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no batches"
assert queue.get() == 10
finally:
server.shutdown()
handle.join()

View File

@@ -143,18 +143,20 @@ pub struct MergeResult {
pub num_inserted_rows: u64,
pub num_deleted_rows: u64,
pub num_attempts: u32,
pub num_rows: u64,
}
#[pymethods]
impl MergeResult {
pub fn __repr__(&self) -> String {
format!(
"MergeResult(version={}, num_updated_rows={}, num_inserted_rows={}, num_deleted_rows={}, num_attempts={})",
"MergeResult(version={}, num_updated_rows={}, num_inserted_rows={}, num_deleted_rows={}, num_attempts={}, num_rows={})",
self.version,
self.num_updated_rows,
self.num_inserted_rows,
self.num_deleted_rows,
self.num_attempts
self.num_attempts,
self.num_rows
)
}
}
@@ -167,6 +169,7 @@ impl From<lancedb::table::MergeResult> for MergeResult {
num_inserted_rows: result.num_inserted_rows,
num_deleted_rows: result.num_deleted_rows,
num_attempts: result.num_attempts,
num_rows: result.num_rows,
}
}
}
@@ -194,6 +197,12 @@ impl LsmWriteSpec {
}
/// Identity sharding — shard by the raw value of `column`.
///
/// `column` must be a deterministic function of the unenforced primary
/// key: every row with a given primary key must always produce the same
/// `column` value, or upserts of that key can land in different shards
/// and a stale version can win. Typically `column` is the primary key
/// itself or a stable attribute of it.
#[staticmethod]
pub fn identity(column: String) -> Self {
Self {
@@ -933,6 +942,12 @@ impl Table {
if let Some(use_index) = parameters.use_index {
builder.use_index(use_index);
}
if let Some(use_lsm_write) = parameters.use_lsm_write {
builder.use_lsm_write(use_lsm_write);
}
if let Some(validate_single_shard) = parameters.validate_single_shard {
builder.validate_single_shard(validate_single_shard);
}
future_into_py(self_.py(), async move {
let res = builder.execute(Box::new(batches)).await.infer_error()?;
@@ -971,6 +986,13 @@ impl Table {
})
}
pub fn close_lsm_writers(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
inner.close_lsm_writers().await.infer_error()
})
}
pub fn uses_v2_manifest_paths(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
@@ -1124,6 +1146,8 @@ pub struct MergeInsertParams {
when_not_matched_by_source_condition: Option<String>,
timeout: Option<std::time::Duration>,
use_index: Option<bool>,
use_lsm_write: Option<bool>,
validate_single_shard: Option<bool>,
}
#[pyclass]

View File

@@ -1,2 +1,2 @@
[toolchain]
channel = "1.94.0"
channel = "1.95.0"

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.30.0"
version = "0.30.0-beta.1"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -75,7 +75,7 @@ reqwest = { version = "0.12.0", default-features = false, features = [
"stream",
], optional = true }
http = { version = "1", optional = true } # Matching what is in reqwest
uuid = { version = "1.7.0", features = ["v4"] }
uuid = { version = "1.7.0", features = ["v4", "v5"] }
polars-arrow = { version = ">=0.37,<0.40.0", optional = true }
polars = { version = ">=0.37,<0.40.0", optional = true }
hf-hub = { version = "0.4.1", optional = true, default-features = false, features = [

View File

@@ -464,11 +464,9 @@ mod tests {
let mut iter = ids.into_iter().map(|o| o.unwrap());
while let Some(first) = iter.next() {
let rows_left_in_clump = if first == 4470 { 19 } else { 29 };
let mut expected_next = first + 1;
for _ in 0..rows_left_in_clump {
for expected_next in (first + 1)..=(first + rows_left_in_clump) {
let next = iter.next().unwrap();
assert_eq!(next, expected_next);
expected_next += 1;
}
}
}

View File

@@ -908,6 +908,15 @@ mod tests {
use serial_test::serial;
use std::time::Duration;
// Serializes the env-var-mutating tests below: cargo test runs tests in
// parallel, but several of these tests read and write the same process-
// global env vars (`LANCEDB_USER_ID*`), so they would race without this.
static ENV_MUTEX: std::sync::Mutex<()> = std::sync::Mutex::new(());
fn lock_env() -> std::sync::MutexGuard<'static, ()> {
ENV_MUTEX.lock().unwrap_or_else(|e| e.into_inner())
}
#[test]
fn test_timeout_config_default() {
let config = TimeoutConfig::default();
@@ -1166,6 +1175,7 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_none() {
let _guard = lock_env();
let config = ClientConfig::default();
// Clear env vars that might be set from other tests
// SAFETY: This is only called in tests
@@ -1179,6 +1189,7 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_from_env() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::set_var("LANCEDB_USER_ID", "env-user-id");
@@ -1194,6 +1205,7 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_from_env_key() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::remove_var("LANCEDB_USER_ID");
@@ -1215,6 +1227,7 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_direct_takes_precedence() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::set_var("LANCEDB_USER_ID", "env-user-id");
@@ -1233,6 +1246,7 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_empty_env_ignored() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::set_var("LANCEDB_USER_ID", "");

View File

@@ -1805,6 +1805,7 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
num_inserted_rows: 0,
num_updated_rows: 0,
num_attempts: 0,
num_rows: 0,
});
}

View File

@@ -89,7 +89,6 @@ use futures::future::join_all;
pub use lance::dataset::refs::{TagContents, Tags as LanceTags};
pub use lance::dataset::scanner::DatasetRecordBatchStream;
use lance::dataset::statistics::DatasetStatisticsExt;
use lance_index::frag_reuse::FRAG_REUSE_INDEX_NAME;
pub use lance_index::optimize::OptimizeOptions;
pub use optimize::{CompactionOptions, OptimizeAction, OptimizeStats};
pub use schema_evolution::{AddColumnsResult, AlterColumnsResult, DropColumnsResult};
@@ -367,6 +366,14 @@ impl LsmWriteSpec {
/// Construct an identity-sharding spec (shard by the raw value of
/// `column`) with no maintained indexes.
///
/// `column` must be a deterministic function of the unenforced primary
/// key: every row with a given primary key must always produce the same
/// `column` value. MemWAL dedups upserts by primary key but tracks
/// generations per shard, so if the same key is written with two
/// different `column` values its versions land in different shards and a
/// stale value can win. Typically `column` is the primary key itself, or
/// a stable attribute of it (e.g. a tenant id).
pub fn identity(column: impl Into<String>) -> Self {
Self::Identity {
column: column.into(),
@@ -581,6 +588,13 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
message: "unset_lsm_write_spec is not supported on this table type".into(),
})
}
/// Drain and close any cached MemWAL shard writers for this table.
///
/// The default implementation is a no-op; table types that maintain
/// MemWAL shard writers override it.
async fn close_lsm_writers(&self) -> Result<()> {
Ok(())
}
/// Gets the table tag manager.
async fn tags(&self) -> Result<Box<dyn Tags + '_>>;
/// Optimize the dataset.
@@ -1387,6 +1401,16 @@ impl Table {
self.inner.unset_lsm_write_spec().await
}
/// Drain and close any cached MemWAL shard writers held for this table.
///
/// When an [`LsmWriteSpec`] is installed, `merge_insert` opens MemWAL shard
/// writers and caches them for reuse across calls. This closes them,
/// flushing pending data; writers reopen lazily on the next `merge_insert`.
/// It is a no-op when no writers are cached.
pub async fn close_lsm_writers(&self) -> Result<()> {
self.inner.close_lsm_writers().await
}
/// Retrieve the version of the table
///
/// LanceDb supports versioning. Every operation that modifies the table increases
@@ -2830,6 +2854,10 @@ impl BaseTable for NativeTable {
merge::lsm::unset_lsm_write_spec(self).await
}
async fn close_lsm_writers(&self) -> Result<()> {
merge::lsm::close_lsm_writers(self).await
}
/// Delete rows from the table
async fn delete(&self, predicate: Predicate<'_>) -> Result<DeleteResult> {
delete::execute_delete(self, predicate).await
@@ -2864,71 +2892,32 @@ impl BaseTable for NativeTable {
async fn list_indices(&self) -> Result<Vec<IndexConfig>> {
let dataset = self.dataset.get().await?;
let indices = dataset.load_indices().await?;
let results = futures::stream::iter(indices.as_slice())
.then(|idx| async {
// skip Lance internal indexes
if idx.name == FRAG_REUSE_INDEX_NAME {
return None;
}
let stats = match dataset.index_statistics(idx.name.as_str()).await {
Ok(stats) => stats,
Err(e) => {
log::warn!(
"Failed to get statistics for index {} ({}): {}",
idx.name,
idx.uuid,
e
);
return None;
}
};
let stats: serde_json::Value = match serde_json::from_str(&stats) {
Ok(stats) => stats,
Err(e) => {
log::warn!(
"Failed to deserialize index statistics for index {} ({}): {}",
idx.name,
idx.uuid,
e
);
return None;
}
};
let Some(index_type) = stats.get("index_type").and_then(|v| v.as_str()) else {
log::warn!(
"Index statistics was missing 'index_type' field for index {} ({})",
idx.name,
idx.uuid
);
return None;
};
let index_type: crate::index::IndexType = match index_type.parse() {
let indices = dataset
.describe_indices(None)
.await?
.into_iter()
.filter_map(|idx_desc| {
let index_type: crate::index::IndexType = match idx_desc.index_type().parse() {
Ok(index_type) => index_type,
Err(e) => {
log::warn!(
"Failed to parse index type for index {} ({}): {}",
idx.name,
idx.uuid,
"Failed to parse index type for index {}: {}",
idx_desc.name(),
e
);
return None;
}
};
let mut columns = Vec::with_capacity(idx.fields.len());
for field_id in &idx.fields {
let field_path = match dataset.schema().field_path(*field_id) {
let field_ids = idx_desc.field_ids();
let mut columns = Vec::with_capacity(field_ids.len());
for field_id in field_ids {
let field_path = match dataset.schema().field_path(*field_id as i32) {
Ok(field_path) => field_path,
Err(e) => {
log::warn!(
"Failed to resolve field path for index {} ({}) field id {}: {}",
idx.name,
idx.uuid,
"Failed to resolve field path for index {} field id {}: {}",
idx_desc.name(),
field_id,
e
);
@@ -2938,17 +2927,14 @@ impl BaseTable for NativeTable {
columns.push(field_path);
}
let name = idx.name.clone();
Some(IndexConfig {
name: idx_desc.name().to_string(),
index_type,
columns,
name,
})
})
.collect::<Vec<_>>()
.await;
Ok(results.into_iter().flatten().collect())
.collect();
Ok(indices)
}
async fn uri(&self) -> Result<String> {
@@ -3058,11 +3044,12 @@ impl BaseTable for NativeTable {
let p99 = *sorted_sizes.get(num_fragments * 99 / 100).unwrap_or(&0);
let min = sorted_sizes.first().copied().unwrap_or(0);
let max = sorted_sizes.last().copied().unwrap_or(0);
let mean = if num_fragments == 0 {
0
} else {
sorted_sizes.iter().copied().sum::<usize>() / num_fragments
};
let mean = sorted_sizes
.iter()
.copied()
.sum::<usize>()
.checked_div(num_fragments)
.unwrap_or(0);
let frag_stats = FragmentStatistics {
num_fragments,
@@ -4062,26 +4049,27 @@ mod tests {
let index_configs = table.list_indices().await.unwrap();
assert_eq!(index_configs.len(), 5);
// list_indices returns indices in alphabetical order by name
let mut configs_iter = index_configs.into_iter();
let index = configs_iter.next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::Bitmap);
assert_eq!(index.columns, vec!["category".to_string()]);
let index = configs_iter.next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::Bitmap);
assert_eq!(index.columns, vec!["is_active".to_string()]);
let index = configs_iter.next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::Bitmap);
assert_eq!(index.columns, vec!["data".to_string()]);
let index = configs_iter.next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::Bitmap);
assert_eq!(index.columns, vec!["large_data".to_string()]);
assert_eq!(index.columns, vec!["is_active".to_string()]);
let index = configs_iter.next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::Bitmap);
assert_eq!(index.columns, vec!["large_category".to_string()]);
let index = configs_iter.next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::Bitmap);
assert_eq!(index.columns, vec!["large_data".to_string()]);
}
#[tokio::test]

View File

@@ -870,8 +870,10 @@ mod tests {
.await
.unwrap();
// Should return empty or nearly empty result
assert!(result[0].num_rows() <= 1);
assert_eq!(
result.iter().map(|batch| batch.num_rows()).sum::<usize>(),
0
);
}
#[tokio::test]

View File

@@ -8,6 +8,7 @@ use std::{
use lance::{Dataset, dataset::refs};
use crate::table::merge::lsm::ShardWriterCache;
use crate::{Error, error::Result, utils::background_cache::BackgroundCache};
/// A wrapper around a [Dataset] that provides consistency checks.
@@ -18,6 +19,10 @@ use crate::{Error, error::Result, utils::background_cache::BackgroundCache};
pub struct DatasetConsistencyWrapper {
state: Arc<Mutex<DatasetState>>,
consistency: ConsistencyMode,
/// The single MemWAL `ShardWriter` for this dataset, co-located so it is
/// cached for the session and shares the dataset's lifecycle. A dataset
/// writes to one shard at a time. Shared by `Arc` across clones.
shard_writer: Arc<ShardWriterCache>,
}
/// The current dataset and whether it is pinned to a specific version.
@@ -67,9 +72,15 @@ impl DatasetConsistencyWrapper {
pinned_version: None,
})),
consistency,
shard_writer: Arc::new(ShardWriterCache::default()),
}
}
/// The MemWAL `ShardWriter` cache co-located with this dataset.
pub(crate) fn shard_writer(&self) -> &Arc<ShardWriterCache> {
&self.shard_writer
}
/// Get the current dataset.
///
/// Behavior depends on the consistency mode:

View File

@@ -41,6 +41,16 @@ pub struct MergeResult {
/// A value of 1 means the operation succeeded on the first try.
#[serde(default)]
pub num_attempts: u32,
/// Total number of rows written.
///
/// On the standard `merge_insert` path this equals
/// `num_inserted_rows + num_updated_rows`. On the MemWAL LSM write path the
/// insert/update breakdown is not known until compaction; in that mode
/// `num_inserted_rows`, `num_updated_rows`, `num_deleted_rows`, `version`
/// and `num_attempts` are all `0` and this field holds the total number of
/// rows written through the shard writer.
#[serde(default)]
pub num_rows: u64,
}
/// A builder used to create and run a merge insert operation
@@ -57,6 +67,8 @@ pub struct MergeInsertBuilder {
pub(crate) when_not_matched_by_source_delete_filt: Option<String>,
pub(crate) timeout: Option<Duration>,
pub(crate) use_index: bool,
pub(crate) use_lsm_write: Option<bool>,
pub(crate) validate_single_shard: bool,
}
impl MergeInsertBuilder {
@@ -71,6 +83,8 @@ impl MergeInsertBuilder {
when_not_matched_by_source_delete_filt: None,
timeout: None,
use_index: true,
use_lsm_write: None,
validate_single_shard: true,
}
}
@@ -150,6 +164,34 @@ impl MergeInsertBuilder {
self
}
/// Controls whether `merge_insert` uses the MemWAL LSM write path.
///
/// By default (unset), a `merge_insert` on a table with an
/// [`LsmWriteSpec`](super::LsmWriteSpec) installed is routed through
/// Lance's MemWAL shard writer, and a table without one uses the standard
/// path. Calling this with `false` forces the standard path even when a
/// spec is set. Calling it with `true` requires a spec — `merge_insert`
/// errors if none is installed.
pub fn use_lsm_write(&mut self, use_lsm_write: bool) -> &mut Self {
self.use_lsm_write = Some(use_lsm_write);
self
}
/// Controls how an LSM `merge_insert` checks that its input targets a
/// single shard.
///
/// When a table has an LSM write spec, every row in a `merge_insert` call
/// must route to the same shard. When `true` (the default), every row is
/// inspected to verify this. When `false`, only the first row is inspected
/// and the shard it routes to is used for the whole input — a faster path
/// for callers that have already pre-sharded their input.
///
/// Has no effect on tables without an LSM write spec.
pub fn validate_single_shard(&mut self, validate_single_shard: bool) -> &mut Self {
self.validate_single_shard = validate_single_shard;
self
}
/// Executes the merge insert operation
///
/// Returns version and statistics about the merge operation including the number of rows
@@ -167,6 +209,23 @@ pub(crate) async fn execute_merge_insert(
params: MergeInsertBuilder,
new_data: Box<dyn RecordBatchReader + Send>,
) -> Result<MergeResult> {
match lsm::lsm_dispatch_decision(table, &params).await? {
lsm::LsmDispatch::Lsm(plan) => {
let future =
lsm::execute_lsm_merge_insert(table, plan, params.validate_single_shard, new_data);
return match params.timeout {
Some(timeout) => match tokio::time::timeout(timeout, future).await {
Ok(result) => result,
Err(_) => Err(Error::Runtime {
message: "merge insert timed out".to_string(),
}),
},
None => future.await,
};
}
lsm::LsmDispatch::Standard => {}
}
let dataset = table.dataset.get().await?;
let mut builder = LanceMergeInsertBuilder::try_new(dataset.clone(), params.on)?;
match (
@@ -219,6 +278,7 @@ pub(crate) async fn execute_merge_insert(
num_inserted_rows: stats.num_inserted_rows,
num_deleted_rows: stats.num_deleted_rows,
num_attempts: stats.num_attempts,
num_rows: stats.num_inserted_rows + stats.num_updated_rows,
})
}
@@ -327,3 +387,366 @@ mod tests {
assert_eq!(table.count_rows(None).await.unwrap(), 25);
}
}
#[cfg(test)]
mod lsm_tests {
use std::sync::Arc;
use arrow_array::{
Int64Array, RecordBatch, RecordBatchIterator, RecordBatchReader, StringArray,
};
use arrow_schema::{DataType, Field, Schema};
use tempfile::{TempDir, tempdir};
use crate::connect;
use crate::error::Error;
use crate::table::{LsmWriteSpec, Table};
/// A reader of `[id: Int64, value: Int64]` rows; `value` is `0..n`.
fn id_value_reader(ids: Vec<i64>) -> Box<dyn RecordBatchReader + Send> {
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("value", DataType::Int64, false),
]));
let n = ids.len() as i64;
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int64Array::from(ids)),
Arc::new(Int64Array::from_iter_values(0..n)),
],
)
.unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(batch)], schema))
}
/// A reader of `[id: Int64, region: Utf8]` rows.
fn id_region_reader(rows: Vec<(i64, &str)>) -> Box<dyn RecordBatchReader + Send> {
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("region", DataType::Utf8, false),
]));
let ids: Vec<i64> = rows.iter().map(|(id, _)| *id).collect();
let regions: Vec<&str> = rows.iter().map(|(_, region)| *region).collect();
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int64Array::from(ids)),
Arc::new(StringArray::from(regions)),
],
)
.unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(batch)], schema))
}
/// A multi-batch reader of `[id: Int64, region: Utf8]` rows.
fn id_region_multi_reader(batches: Vec<Vec<(i64, &str)>>) -> Box<dyn RecordBatchReader + Send> {
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("region", DataType::Utf8, false),
]));
let records: Vec<_> = batches
.into_iter()
.map(|rows| {
let ids: Vec<i64> = rows.iter().map(|(id, _)| *id).collect();
let regions: Vec<&str> = rows.iter().map(|(_, region)| *region).collect();
Ok(RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int64Array::from(ids)),
Arc::new(StringArray::from(regions)),
],
)
.unwrap())
})
.collect();
Box::new(RecordBatchIterator::new(records, schema))
}
/// Create an `[id, value]` table with `id` as the unenforced primary key.
async fn id_value_table(dir: &TempDir) -> Table {
let conn = connect(dir.path().to_str().unwrap())
.execute()
.await
.unwrap();
let table = conn
.create_table("t", id_value_reader(vec![1, 2, 3]))
.execute()
.await
.unwrap();
table.set_unenforced_primary_key(["id"]).await.unwrap();
table
}
#[tokio::test]
async fn lsm_merge_insert_bucket() {
let dir = tempdir().unwrap();
let table = id_value_table(&dir).await;
// num_buckets = 1: every row routes to the single bucket.
table
.set_lsm_write_spec(LsmWriteSpec::bucket("id", 1))
.await
.unwrap();
// Empty `on` defaults to the primary key.
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let result = builder
.execute(id_value_reader(vec![3, 4, 5]))
.await
.unwrap();
// LSM path: rows go to the MemWAL, the breakdown is unknown until
// compaction, so only `num_rows` is populated.
assert_eq!(result.num_rows, 3);
assert_eq!(result.version, 0);
assert_eq!(result.num_inserted_rows, 0);
assert_eq!(result.num_updated_rows, 0);
}
#[tokio::test]
async fn lsm_merge_insert_unsharded() {
let dir = tempdir().unwrap();
let table = id_value_table(&dir).await;
table
.set_lsm_write_spec(LsmWriteSpec::unsharded())
.await
.unwrap();
let mut builder = table.merge_insert(&["id"]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let result = builder
.execute(id_value_reader(vec![10, 11, 12, 13]))
.await
.unwrap();
assert_eq!(result.num_rows, 4);
}
#[tokio::test]
async fn lsm_merge_insert_identity() {
let dir = tempdir().unwrap();
let conn = connect(dir.path().to_str().unwrap())
.execute()
.await
.unwrap();
let table = conn
.create_table("t", id_region_reader(vec![(1, "us"), (2, "us")]))
.execute()
.await
.unwrap();
table.set_unenforced_primary_key(["id"]).await.unwrap();
table
.set_lsm_write_spec(LsmWriteSpec::identity("region"))
.await
.unwrap();
// All rows share one identity value, so they route to one shard.
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let result = builder
.execute(id_region_reader(vec![(3, "us"), (4, "us")]))
.await
.unwrap();
assert_eq!(result.num_rows, 2);
}
#[tokio::test]
async fn lsm_merge_insert_use_lsm_write_false_falls_back() {
let dir = tempdir().unwrap();
let table = id_value_table(&dir).await;
table
.set_lsm_write_spec(LsmWriteSpec::bucket("id", 1))
.await
.unwrap();
// use_lsm_write(false) opts out: the standard path runs and commits.
let mut builder = table.merge_insert(&["id"]);
builder.when_not_matched_insert_all().use_lsm_write(false);
let result = builder
.execute(id_value_reader(vec![3, 4, 5]))
.await
.unwrap();
assert_eq!(result.num_inserted_rows, 2);
assert_eq!(table.count_rows(None).await.unwrap(), 5);
}
#[tokio::test]
async fn lsm_merge_insert_rejects_on_not_primary_key() {
let dir = tempdir().unwrap();
let table = id_value_table(&dir).await;
table
.set_lsm_write_spec(LsmWriteSpec::bucket("id", 1))
.await
.unwrap();
let mut builder = table.merge_insert(&["value"]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let err = builder.execute(id_value_reader(vec![1])).await.unwrap_err();
assert!(matches!(err, Error::InvalidInput { .. }), "got {err:?}");
}
#[tokio::test]
async fn lsm_merge_insert_rejects_non_upsert() {
let dir = tempdir().unwrap();
let table = id_value_table(&dir).await;
table
.set_lsm_write_spec(LsmWriteSpec::bucket("id", 1))
.await
.unwrap();
// Insert-only (no when_matched_update_all) is not the upsert shape.
let mut builder = table.merge_insert(&[]);
builder.when_not_matched_insert_all();
let err = builder.execute(id_value_reader(vec![4])).await.unwrap_err();
assert!(matches!(err, Error::InvalidInput { .. }), "got {err:?}");
}
#[tokio::test]
async fn lsm_close_writers_then_reopen() {
let dir = tempdir().unwrap();
let table = id_value_table(&dir).await;
table
.set_lsm_write_spec(LsmWriteSpec::bucket("id", 1))
.await
.unwrap();
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
builder.execute(id_value_reader(vec![7, 8])).await.unwrap();
table.close_lsm_writers().await.unwrap();
// The writer reopens lazily on the next merge_insert.
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let result = builder.execute(id_value_reader(vec![9])).await.unwrap();
assert_eq!(result.num_rows, 1);
}
#[tokio::test]
async fn lsm_merge_insert_multi_batch() {
let dir = tempdir().unwrap();
let conn = connect(dir.path().to_str().unwrap())
.execute()
.await
.unwrap();
let table = conn
.create_table("t", id_region_reader(vec![(1, "us")]))
.execute()
.await
.unwrap();
table.set_unenforced_primary_key(["id"]).await.unwrap();
table
.set_lsm_write_spec(LsmWriteSpec::identity("region"))
.await
.unwrap();
// Multiple batches that all route to one shard are written together.
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let result = builder
.execute(id_region_multi_reader(vec![
vec![(2, "us"), (3, "us")],
vec![(4, "us")],
]))
.await
.unwrap();
assert_eq!(result.num_rows, 3);
// Batches that route to different shards are rejected; the validation
// runs before any write, so no partial write is left behind.
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let err = builder
.execute(id_region_multi_reader(vec![
vec![(5, "us")],
vec![(6, "eu")],
]))
.await
.unwrap_err();
assert!(matches!(err, Error::InvalidInput { .. }), "got {err:?}");
}
#[tokio::test]
async fn lsm_merge_insert_use_lsm_write_true_requires_spec() {
let dir = tempdir().unwrap();
// id_value_table sets a primary key but no LSM write spec.
let table = id_value_table(&dir).await;
let mut builder = table.merge_insert(&["id"]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all()
.use_lsm_write(true);
let err = builder.execute(id_value_reader(vec![4])).await.unwrap_err();
assert!(matches!(err, Error::InvalidInput { .. }), "got {err:?}");
}
#[tokio::test]
async fn lsm_merge_insert_rejects_second_shard() {
let dir = tempdir().unwrap();
let conn = connect(dir.path().to_str().unwrap())
.execute()
.await
.unwrap();
let table = conn
.create_table("t", id_region_reader(vec![(1, "us")]))
.execute()
.await
.unwrap();
table.set_unenforced_primary_key(["id"]).await.unwrap();
table
.set_lsm_write_spec(LsmWriteSpec::identity("region"))
.await
.unwrap();
// The first merge_insert opens the single writer for shard "us".
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
builder
.execute(id_region_reader(vec![(2, "us")]))
.await
.unwrap();
// A merge_insert routing to a different shard is rejected.
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
let err = builder
.execute(id_region_reader(vec![(3, "eu")]))
.await
.unwrap_err();
assert!(matches!(err, Error::InvalidInput { .. }), "got {err:?}");
// After closing the writer, a different shard can be written.
table.close_lsm_writers().await.unwrap();
let mut builder = table.merge_insert(&[]);
builder
.when_matched_update_all(None)
.when_not_matched_insert_all();
builder
.execute(id_region_reader(vec![(4, "eu")]))
.await
.unwrap();
}
}

File diff suppressed because it is too large Load Diff