Compare commits

40 Commits

Author SHA1 Message Date
Dan Tasse
dca77b5739 docs: update namespace_path parameter to show it used to be namespace 2026-05-13 13:29:43 -04:00
Brendan Clement
011fdd5c94 feat(nodejs): add prewarmData method on Table (#3374)
### Summary
- Closes #3362 
- Adds `prewarmData(columns?: string[])` to the Node bindings, mirroring
the Rust and Python implementations

### Testing
- [x] `npm run build` (regenerates the napi `.node` module + TS
declarations)
- [x] `npm run lint`
- [x] `npm test`
- [ ] live test against remote table - just waiting for my dev stack to
get created

### Documentation
- updated docs
2026-05-12 15:29:48 -07:00
Shengan Zhang
650f173236 feat(python): add IVF_HNSW_FLAT vector index support (#3366)
## Summary

Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was
documented at https://docs.lancedb.com/indexing/vector-index but
`lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised
`ValueError: Unknown index type IVF_HNSW_FLAT` — the underlying
`pylance` already accepted it, only the LanceDB wrapper was missing the
wiring.

**Rust core (`rust/lancedb`):**
- Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the
`IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`).
- Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)`
helper, keeping symmetry with the other `IVF_HNSW_*` variants.
- Forward the variant in `RemoteTable::create_index` and add two
parametrised tests (default + customised config) for the JSON
serialisation.
- New `NativeTable` integration test
(`test_create_index_ivf_hnsw_flat`).

**Python binding (`python/`):**
- New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias.
- PyO3 `extract_index_params` recognises the `HnswFlat` config.
- `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync
`RemoteTable.create_index` both dispatch to the new config.
- `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover
the new type so `pyright`/`make check` stays clean.
- Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync
dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage.
- Existing `test_index_statistics_index_type_lists_all_supported_values`
updated to include `IVF_HNSW_FLAT`.
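
The string-to-config dispatch described above can be pictured with a minimal sketch. `HnswFlatConfig`, `HnswSqConfig`, `_INDEX_CONFIGS`, and `resolve_index_config` are hypothetical names chosen for illustration, and the single `distance_type` field is an assumption; this is not the actual LanceDB code.

```python
from dataclasses import dataclass


@dataclass
class HnswSqConfig:
    # Hypothetical config for the existing IVF_HNSW_SQ index type.
    distance_type: str = "l2"


@dataclass
class HnswFlatConfig:
    # Hypothetical config for the newly wired IVF_HNSW_FLAT index type.
    distance_type: str = "l2"


# Registry mapping the public index_type string to its config class.
_INDEX_CONFIGS = {
    "IVF_HNSW_SQ": HnswSqConfig,
    "IVF_HNSW_FLAT": HnswFlatConfig,
}


def resolve_index_config(index_type: str, **kwargs):
    """Return a config object for index_type, or raise ValueError."""
    try:
        config_cls = _INDEX_CONFIGS[index_type.upper()]
    except KeyError:
        raise ValueError(f"Unknown index type {index_type}") from None
    return config_cls(**kwargs)
```

With a registry like this, supporting a new index type is a one-line addition, and an unregistered name fails with the same kind of `ValueError` quoted in the summary.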

A matching Node.js / TypeScript binding is in a follow-up PR.

Closes #3331

## Test plan

- [ ] `cargo check --quiet --features remote --tests --examples`
- [ ] `cargo test --quiet --features remote -p lancedb` (covers the
new `test_create_index_ivf_hnsw_flat` and the two new parametrised
`RemoteTable::create_index` cases)
- [ ] `cargo fmt --all` / `cargo clippy --quiet --features remote
--tests --examples`
- [ ] `cd python && make develop && make check && make test` (covers
the two new async tests, the alias test, the dispatcher test, and the
updated `test_index_statistics_index_type_lists_all_supported_values`
assertion)
2026-05-11 15:08:32 -07:00
Xuanwo
9b21c136c6 feat(python): support model-backed native FTS tokenizers (#3289)
This wires Lance's existing `jieba/*` and `lindera/*` native FTS
tokenizers through the Python SDK instead of leaving them behind
disabled features and narrow public typing. It also documents the
`LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for
successful CJK indexing plus missing-model error guidance.

Closes #2168.
2026-05-08 23:53:14 +08:00
Heng Ge
694aa48e19 fix(database): drop spurious trailing ? from listing-database URIs (#3357)
## Summary

`url::Url::query_pairs_mut()` leaves the URL with `query=Some("")` after
`.clear()` even when the input had no query string. The listing-database
connect path then captured that empty query into
`ListingDatabase::query_string`, and `table_uri()` blindly appended
`?<query>` to every per-table URI — producing URIs like
`s3://bucket/prefix/foo.lance?`.

The trailing `?` is benign for normal table operations, but it breaks
any caller that constructs a sub-path from the table URI. In particular,
MemWAL flushes write to `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`,
which `url::Url::parse` then re-parses as `path=<base table>` +
`query=/_mem_wal/...`. `Dataset::write` resolves the base table dataset,
finds it already exists, and fails with `Dataset already exists:
…_gen_1` on the very first MemTable flush (observed deterministically
against S3 across all merge_insert LSM modes; tracked in
[lance-format/lance#6713](https://github.com/lance-format/lance/pull/6715)).

## Fix

Treat `Some("")` query the same as no query when capturing
`query_string`. A real `?foo=bar` query is still propagated unchanged.

Adds a regression test covering both the empty-query and non-empty-query
paths.

## Verification

- `url::Url::parse("s3://bucket/prefix/").query()` → `None`, but after
`query_pairs_mut().clear()` → `Some("")`. Confirmed in a standalone
repro.
- Without this fix, every `table_uri()` for an `s3://`-style connection
ends with `?`, breaking MemWAL and any future sub-path consumer in the
same way.
- New unit test `test_table_uri_url_path_has_no_trailing_question_mark`
exercises both code paths.
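
The failure mode can be reproduced outside Rust. The sketch below uses Python's `urllib.parse` rather than the `url` crate, so it only illustrates the parsing effect; `naive_table_uri` and `table_uri` are hypothetical helpers, not the LanceDB functions.

```python
from urllib.parse import urlsplit


def naive_table_uri(base: str, name: str, query) -> str:
    # Appends "?<query>" whenever a query value is present, even if empty.
    uri = f"{base.rstrip('/')}/{name}.lance"
    if query is not None:
        uri = f"{uri}?{query}"
    return uri


def table_uri(base: str, name: str, query) -> str:
    # Treats an empty query the same as no query, as the fix describes.
    uri = f"{base.rstrip('/')}/{name}.lance"
    if query:
        uri = f"{uri}?{query}"
    return uri


base = "s3://bucket/prefix/"
bad = naive_table_uri(base, "foo", "")
# Building a sub-path on top of the URI pushes it into the query component.
parts = urlsplit(bad + "/_mem_wal/shard/abc_gen_1")
print(bad, parts.path, parts.query)
```

Here the sub-path ends up in `parts.query` and `parts.path` still names the base table, which matches the misresolution described in the summary.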
2026-05-07 23:29:29 -07:00
LanceDB Robot
455ba5abbf chore: update lance dependency to v7.0.0-beta.7 (#3356)
## Summary
- Update Lance Rust workspace dependencies to `7.0.0-beta.7` using
`ci/set_lance_version.py`.
- Update the Java `lance-core` Maven property to `7.0.0-beta.7`.
- Refresh `Cargo.lock` for the new Lance tag:
https://github.com/lance-format/lance/releases/tag/v7.0.0-beta.7

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-05-07 16:04:38 -07:00
Octopus
5338aeb006 ci: avoid passing GPG passphrase on command line in Java publish workflow (#3313)
Fixes #3299

## Problem

Two security issues exist in `.github/workflows/java-publish.yml`:

1. **`gpg-passphrase` input is misused**: `actions/setup-java`'s
`gpg-passphrase` input expects the **name** of an environment variable
(default: `GPG_PASSPHRASE`), not the secret value itself. The previous
value `${{ secrets.GPG_PASSPHRASE }}` was setting the env var name to
the actual secret, which is incorrect.

2. **Passphrase visible on the command line**: `-Dgpg.passphrase=${{
secrets.GPG_PASSPHRASE }}` passes the GPG passphrase as a Maven system
property argument, making it visible in process listings and potentially
echoed in debug logs — a supply-chain security risk for release
workflows.

## Solution

- Fix `gpg-passphrase: MAVEN_GPG_PASSPHRASE` — use the correct env var
name so `actions/setup-java` generates a proper Maven `settings.xml`
entry that reads from `MAVEN_GPG_PASSPHRASE`.
- Remove `-Dgpg.passphrase=...` from the Maven CLI invocation.
- Add `MAVEN_GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}` to the
`env:` block of the Publish step, so the passphrase is available as an
environment variable rather than a CLI argument.
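
The difference between the two ways of passing a secret can be sketched in a few lines of Python. This is an illustration of the general principle only: `DEMO_PASSPHRASE`, `build_command`, and `run_with_secret` are hypothetical names, and the child process is a stand-in for the Maven invocation.

```python
import os
import subprocess
import sys

# The child reads the secret from its environment and reports only its length.
CHILD_CODE = "import os; print(len(os.environ['DEMO_PASSPHRASE']))"


def build_command():
    # The argument list is what shows up in process listings; it holds no secret.
    return [sys.executable, "-c", CHILD_CODE]


def run_with_secret(secret: str) -> str:
    # Pass the secret through the child's environment instead of its arguments.
    env = dict(os.environ, DEMO_PASSPHRASE=secret)
    result = subprocess.run(
        build_command(), env=env, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

The child still receives the value, but nothing secret appears in the command line that tools like `ps` or debug logging would show.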

## Testing

The Java publish workflow only runs on tag pushes, so this cannot be
exercised in a PR build. The logic change is straightforward:
`actions/setup-java` is documented to write a `settings.xml` that reads
`<gpg.passphrase>` from the named env var, and `maven-gpg-plugin` picks
it up from there without any CLI argument.

Co-authored-by: octo-patch <octo-patch@github.com>
2026-05-07 08:45:27 -07:00
LanceDB Robot
47a34f5cca chore: update lance dependency to v7.0.0-beta.4 (#3348)
## Summary
- Update Lance Rust dependencies to `v7.0.0-beta.4` using
`ci/set_lance_version.py`.
- Update the Java `lance-core` dependency property to `7.0.0-beta.4`.
- Align LanceDB with dependency updates required by Lance 7, including
`object_store` 0.13 API compatibility.

Triggering tag:
https://github.com/lance-format/lance/releases/tag/v7.0.0-beta.4

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-05-05 18:36:39 -07:00
Weston Pace
a17c241e86 feat(python): make Permutation fork-safe for PyTorch DataLoader workers (#3339)
## Summary

PyTorch's `DataLoader` uses fork-based multiprocessing by default on
Linux, but threads do not survive `fork()`. LanceDB's Python bindings
drive async work through two threaded layers, both of which become inert
in a forked child:

- `BackgroundEventLoop` runs an asyncio loop on a Python
`threading.Thread`.
- `pyo3-async-runtimes::tokio` holds a global multi-threaded tokio
runtime whose worker threads also die on fork — and its runtime lives in
a `OnceLock` that cannot be replaced after first use.

As a result, any `Permutation` (or other async API) used inside a
fork-based `DataLoader` worker hangs indefinitely. This PR makes both
layers fork-safe so `Permutation` works as a `torch.utils.data.Dataset`
with `num_workers > 0`.

## Approach

### Rust — new `python/src/runtime.rs`

Mirrors the pattern used in [Lance's Python
bindings](456198cd6f/python/src/lib.rs (L139)),
adapted for the async-bridge use case.

- `LanceRuntime` implements `pyo3_async_runtimes::generic::Runtime +
ContextExt`, backed by an `AtomicPtr<tokio::runtime::Runtime>` we own
(sidestepping `pyo3-async-runtimes`'s frozen `OnceLock` global).
- A `pthread_atfork(after_in_child)` handler nulls the pointer; the next
`spawn` rebuilds the runtime in the child. The previous runtime is
intentionally **leaked** — calling `Drop` would try to join now-dead
worker threads and hang.
- `runtime::future_into_py` is a drop-in for
`pyo3_async_runtimes::tokio::future_into_py`. All ~80 call sites in
`arrow.rs` / `connection.rs` / `permutation.rs` / `query.rs` /
`table.rs` are updated to route through it.
- `python/Cargo.toml` adds `libc = "0.2"` and the tokio
`rt-multi-thread` feature.

### Python — `lancedb/background_loop.py`

- Refactors `BackgroundEventLoop.__init__` to a reusable `_start()`
method.
- An `os.register_at_fork(after_in_child=…)` hook calls `LOOP._start()`
to give the singleton a fresh asyncio loop and thread **in place**. This
matters because the rest of the codebase imports `LOOP` via `from
.background_loop import LOOP` — rebinding the module attribute would
leave those references holding the dead loop.
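
A minimal sketch of the in-place restart idea, assuming a singleton shaped roughly like the one described. The `BackgroundLoop` class below is a hypothetical stand-in, not the actual `lancedb.background_loop` code.

```python
import asyncio
import os
import threading


class BackgroundLoop:
    """Hypothetical singleton that runs an asyncio loop on a helper thread."""

    def __init__(self):
        self._start()

    def _start(self):
        # Build a fresh loop and thread on the *same* object, so existing
        # references to the singleton stay valid.
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def run(self, coro):
        # Submit a coroutine to the background loop and block for the result.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()


LOOP = BackgroundLoop()

# os.register_at_fork is only available on Unix platforms.
if hasattr(os, "register_at_fork"):
    # In a forked child the old thread is gone; restart the loop in place.
    os.register_at_fork(after_in_child=LOOP._start)


async def add(a, b):
    return a + b
```

Because `_start()` mutates the existing object, a module that did `from background_loop import LOOP` before the fork keeps a working reference afterwards. Rebinding a new instance to the module attribute would not give that guarantee.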

### Python — `lancedb/__init__.py`

Removes the `__warn_on_fork` pre-fork warning (and the now-unused
`import warnings`). Fork is supported.

## Test plan

- [x] New `test_permutation_dataloader_fork_workers` in
`python/tests/test_torch.py`: runs a `Permutation` through
`torch.utils.data.DataLoader(num_workers=2,
multiprocessing_context="fork")` inside a spawn-isolated child with a
30s hang detector. **Pre-fix**: timed out at 36s. **Post-fix**: passes
in ~3.6s.
- [x] New `test_remote_connection_after_fork` in
`python/tests/test_remote_db.py`: forks a child that creates a fresh
`lancedb.connect(...)` against a mock HTTP server and calls
`table_names()`; passes in <1s, validates the runtime reset is
sufficient for fresh remote clients.
- [x] All 62 tests in `test_torch.py` + `test_permutation.py` pass.
- [x] All 35 tests in `test_remote_db.py` pass.
- [x] `test_table.py` (87) + `test_db.py` + `test_query.py` (157, minus
one unrelated `sentence_transformers` import skip) — 244 passing.
- [x] `cargo clippy -p lancedb-python --tests` clean.
- [x] `cargo fmt`, `ruff check`, `ruff format` all clean.

## Known limitation (follow-up)

This PR makes a **freshly-built** `lancedb.connect(...)` work in a
forked child. An **inherited** `Connection` from the parent still
carries an inherited `reqwest::Client` whose hyper connection pool
references socket FDs and TCP/TLS state shared with the parent — using
it from the child after fork is unsafe (especially with HTTP/1.1
keep-alive). The recommended pattern for fork-based `DataLoader` workers
that hit a remote DB is to construct a new connection inside the worker.
Auto-clearing inherited HTTP client pools on fork would require tracking
live `Connection` instances in `lancedb` core and is left for a
follow-up PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:44:10 -07:00
Weston Pace
1fc23e5473 fix(python): make Permutation picklable for PyTorch multiprocessing (#3335)
## Summary

When PyTorch is used with multiprocessing in `spawn` mode, the
`Permutation` must be pickled. Previously it could not be, because
`Table` and `Connection` are not serializable. This PR adds pickle
support to `Permutation` without adding general pickle support to `Table`
or `Connection`. Adding general support would probably need to start
with serialization in the namespace client.

In the meantime, this PR enables pickling by adding special cases for:

 * In-memory tables (just serialize as Arrow IPC)
 * Native tables (serialize the URI)

If a user is not using one of the above cases (e.g. using a remote
connection) then they will need to provide a connection factory that can
be pickled.
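
The "serialize the URI" case can be sketched as follows, assuming a handle that owns an unpicklable connection. `FakeConnection` and `TableRef` are hypothetical classes for illustration, not LanceDB types.

```python
import pickle
import threading


class FakeConnection:
    """Hypothetical connection holding state that cannot be pickled."""

    def __init__(self, uri: str):
        self.uri = uri
        self._lock = threading.Lock()  # locks are not picklable


class TableRef:
    """Hypothetical table handle that pickles as (uri, name) only."""

    def __init__(self, uri: str, name: str):
        self.uri = uri
        self.name = name
        self.conn = FakeConnection(uri)

    def __getstate__(self):
        # Drop the live connection; keep just enough to reopen it.
        return {"uri": self.uri, "name": self.name}

    def __setstate__(self, state):
        self.uri = state["uri"]
        self.name = state["name"]
        # Reconnect on the receiving side (e.g. in a spawned worker).
        self.conn = FakeConnection(self.uri)


original = TableRef("/tmp/demo-db", "items")
restored = pickle.loads(pickle.dumps(original))
```

The restored object has the same URI and name but its own connection. A picklable connection factory plays the same role for cases where a URI alone is not enough to reconnect.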

## Breaking change

`PermutationBuilder.persist(...)` is removed from the Python bindings;
the permutation table is now always in-memory. The underlying Rust
`PermutationBuilder::persist` API is untouched and can be re-exposed
later if needed. It probably won't make sense to do that until we have a
way to serialize `Table` and `Connection`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:37:58 -07:00
qingfeng-occ
87b831bcae fix(node): remove redundant postbuild:release script to fix build failure (#3285)
The `build:release` command already outputs the `*.node` files directly
to the `dist/` directory via the `--output-dir dist` flag.

Therefore, the `postbuild:release` script, which attempts to copy
`*.node` files from the `lancedb/` source directory, fails with a "no
such file or directory" error because the source files do not exist
there.

This commit removes the redundant `postbuild:release` script to resolve
the build failure.

fix #3284

Signed-off-by: qingfeng-occ <qing.feng@zte.com.cn>
2026-05-04 09:37:18 -07:00
Nitesh Yadav
59db036118 fix(python): add missing space in hybrid query error message (#3340)
Hi, the hybrid query error message looked like it could use a space, so
I added one.

```python
def _validate_query(self, query, vector=None, text=None):
    if query is not None and (vector is not None or text is not None):
        raise ValueError(
            "You can either provide a string query in search() method"
            "or set `vector()` and `text()` explicitly for hybrid search."
            "But not both."
        )
```
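
For context, Python joins adjacent string literals with no separator, so a missing trailing space silently fuses words. The sketch below reuses the fragments shown above. The `fixed` variant is an assumption about how such a message is usually repaired, not a copy of the committed text.

```python
# Adjacent literals are concatenated exactly as written.
broken = (
    "You can either provide a string query in search() method"
    "or set `vector()` and `text()` explicitly for hybrid search."
    "But not both."
)

# Assumed repair: end each fragment with a space before the next one.
fixed = (
    "You can either provide a string query in search() method "
    "or set `vector()` and `text()` explicitly for hybrid search. "
    "But not both."
)

print(broken)
print(fixed)
```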
2026-05-02 15:51:00 -07:00
Lance Release
c091243d5b Bump version: 0.28.0-beta.10 → 0.28.0-beta.11 2026-04-29 17:53:49 +00:00
Lance Release
a2aea7b4e5 Bump version: 0.31.0-beta.10 → 0.31.0-beta.11 2026-04-29 17:53:22 +00:00
LanceDB Robot
4a5341edb1 chore: update lance dependency to v6.0.0-beta.7 (#3334)
## Summary
- Update Lance Rust dependencies to `6.0.0-beta.7` using
`ci/set_lance_version.py`.
- Update Java `lance-core.version` to `6.0.0-beta.7`.
- Align Arrow/DataFusion/PyO3 dependency versions and apply required
compatibility fixes for the Lance upgrade.

Triggering tag:
[v6.0.0-beta.7](https://github.com/lance-format/lance/releases/tag/v6.0.0-beta.7)

## Verification
- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-04-29 10:52:25 -07:00
Jack Ye
25dfe2cfd4 feat: add manifest-enabled directory namespace mode (#3332)
Adds manifest_enabled for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
2026-04-29 09:22:06 -07:00
Lance Release
4dcd7f4314 Bump version: 0.28.0-beta.9 → 0.28.0-beta.10 2026-04-28 13:29:26 +00:00
Lance Release
2e36cd9dad Bump version: 0.31.0-beta.9 → 0.31.0-beta.10 2026-04-28 13:29:00 +00:00
Weston Pace
f31e27768a fix: address RUSTSEC-2026-0104 cargo-deny advisory (#3326)
## Summary

- Update `rustls-webpki` 0.103.10 → 0.103.13 to fix RUSTSEC-2026-0104
(reachable panic in CRL parsing)
- Add advisory ignore for the legacy `rustls-webpki` 0.101.7 copy pinned
to the aws-smithy/rustls 0.21 chain (same chain already exempted for
RUSTSEC-2026-0098/0099)

Fixes the `deny` CI job failure seen in #3325.

## Test plan

- [x] `cargo deny check advisories` passes locally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 17:56:10 -07:00
LanceDB Robot
b84150a53e chore: update lance dependency to v6.0.0-beta.4 (#3325)
## Summary

- Updates Lance Rust dependencies to `6.0.0-beta.4` using
`ci/set_lance_version.py`.
- Updates the Java `lance-core.version` property to `6.0.0-beta.4`.
- Triggering Lance tag:
https://github.com/lance-format/lance/releases/tag/v6.0.0-beta.4

## Verification

- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-04-27 15:13:07 -07:00
Will Jones
d135c18db6 ci: add cargo-deny configuration and CI check (#3307)
Adds a `deny.toml` at the workspace root and a `deny` CI job that runs
`cargo deny check` on every PR. Catches yanked crates, license drift,
banned or wildcard dependencies, unapproved sources, and new RUSTSEC
advisories.

As part of wiring this up:

- Updated `aws-lc-rs` 1.13.0 → 1.16.3 / `aws-lc-sys` 0.28.0 → 0.40.0 to
  clear four 2026 AWS-LC advisories (timing side-channel, PKCS7 bypass,
  CRL scope). Removed the `=0.28.0` workaround pin; the original build
  failure no longer reproduces.
- Updated `bytes`, `zlib-rs`, `rand`, `rustls-webpki`, `lz4_flex` to
  clear their current advisories.
- Marked `lancedb-nodejs` and `lancedb-python` as `publish = false` and
  pinned `lzma-sys` from `*` to `0.1` so `bans.wildcards = "deny"` can
  be enforced.

10 remaining advisories have no safe upgrade available (transitive via
opendal, lance, datafusion, async-openai, aws-sdk on the legacy rustls
0.21 chain). Each is ignored in `deny.toml` with a per-entry rationale
and a link to the RUSTSEC advisory. New advisories still fail CI.

Fixes #3297

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:53:15 -07:00
Will Jones
ef399de092 ci: switch PyPI publish to OIDC trusted publishing (#3302)
## Summary

- Replaces `LANCEDB_PYPI_API_TOKEN` (long-lived token) with OIDC trusted
publishing via `pypa/gh-action-pypi-publish`
- Adds `id-token: write` permission to linux/mac/windows jobs
- Removes `twine`-based upload and the `pypi_token` input from
`upload_wheel` composite action
- Enables PEP 740 Sigstore attestations on published wheels as a bonus

After merging, rotate/revoke the `LANCEDB_PYPI_API_TOKEN` secret.

Closes #3294

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 20:53:06 -07:00
Will Jones
0d767abd0e ci: add Dependabot config for shipped Rust binaries (#3300)
Adds `.github/dependabot.yml` enabling weekly cargo update PRs for the
root workspace, which produces the Rust binaries we ship: the Node.js
and Python native extensions. The `rust/lancedb` library crate shares
the same lockfile — its consumers pick versions themselves, but bumping
transitive deps here keeps the shipped binaries current.

Also removes the misleading `exclude = ["python"]` line from the root
`Cargo.toml`: `python` is listed in `members`, and `cargo metadata`
confirms it's a workspace member, so the exclude was dead code that
implied the opposite.

Minor/patch updates are grouped to reduce PR noise.

Part of #3292. Only covers the cargo ecosystem; pip, npm, and
github-actions can follow.
2026-04-24 20:52:54 -07:00
Jack Ye
a92ae0ded5 fix: enable hostname verification by default (#3304)
## Summary

- make `TlsConfig::default()` enable hostname verification by default
- align the Rust default with the documented Python and Node behavior
- update the Rust unit test to lock in the safe default
2026-04-21 08:39:03 -07:00
Xuanwo
c54888a83a refactor(python): remove legacy tantivy FTS support (#3282)
This follows the Rust-side Tantivy removal by deleting the remaining
Python Tantivy runtime, tests, and packaging references.

It also turns the legacy Python-only Tantivy parameters into explicit
errors and stops reading legacy `_indices/fts` directories so Python FTS
is fully native-only.
2026-04-20 09:28:45 +08:00
Will Jones
ba6c44abc9 ci: add top-level permissions to GHA workflows (#3255)
Adds `permissions: contents: read` to the 10 workflows that had no
top-level permissions block. Workflows that already declared
permissions, or individual jobs that need elevated permissions (`issues:
write`, `pull-requests: write`, `contents: write`), are left unchanged.

Affected workflows: `dev.yml`, `java-publish.yml`, `java.yml`,
`license-header-check.yml`, `nodejs.yml`, `pypi-publish.yml`,
`python.yml`, `rust.yml`, `update_package_lock_run.yml`,
`update_package_lock_run_nodejs.yml`
2026-04-20 09:22:27 +08:00
Lance Release
75b0a8e0a3 Bump version: 0.28.0-beta.8 → 0.28.0-beta.9 2026-04-19 20:39:29 +00:00
Lance Release
2a886141f7 Bump version: 0.31.0-beta.8 → 0.31.0-beta.9 2026-04-19 20:39:04 +00:00
Jack Ye
2a1df8edcf fix(rust): materialize declared namespace tables on create (#3288)
## Summary
- handle `declare_table` already-exists conflicts in the Rust namespace
database create path
- reuse declared-but-not-materialized table metadata instead of failing
create mode
- preserve overwrite behavior while allowing declared Geneva system
tables to be materialized
2026-04-19 13:25:53 -07:00
C Kaustubh
fd98b845ea fix(node): prevent reranker from keeping process alive (#3270)
Fixes #3269.

## What I observed
Using a reranker in a hybrid query could keep the Node.js process alive
even after `table.close()` and `db.close()`.

## Root cause
The reranker callback bridge used a `ThreadsafeFunction` in referenced
mode, which can keep the event loop alive longer than intended.

## Minimal fix
- In `nodejs/src/rerankers.rs`, create the reranker callback TSFN in
weak mode (`.weak::<true>()`).
- Add a regression test in `nodejs/__test__/rerankers.test.ts` that
spawns a child process, runs a rerank query, and asserts the process
exits naturally.

## Validation
- Built Node bindings successfully.
- Ran targeted tests: `rerankers.test.ts` passes (including new
regression test).
- Pre-commit checks for changed files were run and clean.
2026-04-19 14:02:23 +08:00
Lance Release
be48ada352 Bump version: 0.28.0-beta.7 → 0.28.0-beta.8 2026-04-19 04:19:10 +00:00
Lance Release
9ad2dfe601 Bump version: 0.31.0-beta.7 → 0.31.0-beta.8 2026-04-19 04:18:45 +00:00
Jack Ye
f909df3e87 fix(python): use namespace-backed rust connection for namespace tables (#3286)
So far, I have been using a hacky approach to create and open
namespace-backed tables: get the table's location, then use a temporary
lancedb connection to create or open it. This worked for features like
credentials vending but no longer fully works for the managed versioning
feature. Recently, geneva tests have been failing intermittently, and
various patches have not addressed the root cause. This PR fixes the
problem properly by implementing a real Rust binding for it.
Specifically:

- build a real Rust namespace-backed connection from the Python
namespace client
- route namespace table create/open through that connection instead of
resolved-location temp connections
- keep namespace client naming consistent in the Rust bridge and
preserve federated namespace + DuckDB behavior
2026-04-18 21:17:52 -07:00
Lance Release
d715bbb588 Bump version: 0.28.0-beta.6 → 0.28.0-beta.7 2026-04-17 08:12:27 +00:00
Lance Release
5ce3d8d141 Bump version: 0.31.0-beta.6 → 0.31.0-beta.7 2026-04-17 08:12:03 +00:00
Jack Ye
5eaac178b1 fix(python): pass namespace client on schema-only table create (#3283)
## Summary
- pass `namespace_client` through the Python create-table path
- ensure schema-only namespace table creation uses the namespace-aware
empty-table flow
- fix reopening namespace tables created without initial data
2026-04-17 01:11:18 -07:00
Lance Release
11af763fcd Bump version: 0.28.0-beta.5 → 0.28.0-beta.6 2026-04-16 18:57:28 +00:00
Lance Release
2ed5452e1c Bump version: 0.31.0-beta.5 → 0.31.0-beta.6 2026-04-16 18:57:05 +00:00
Xuanwo
b7c0b5987c chore: upgrade lance to 6.0.0-beta.1 (#3281) 2026-04-17 02:51:58 +08:00
Jack Ye
97a4b38f19 feat(rust): support nested namespace ops in listing db (#3279)
## Summary
- delegate child-namespace `ListingDatabase` operations through an
eagerly initialized `LanceNamespaceDatabase`
- support nested namespace create/open/list/drop flows without requiring
callers to inject explicit locations
- add `namespace_client_properties` plumbing for local and namespace
connections so directory namespace settings like
`table_version_tracking_enabled` can be configured
- add regression tests for nested namespace ops and namespace client
property propagation
2026-04-16 10:12:28 -07:00
105 changed files with 5553 additions and 2108 deletions

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
-current_version = "0.28.0-beta.5"
+current_version = "0.28.0-beta.11"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

18
.github/dependabot.yml vendored Normal file
View File

@@ -0,0 +1,18 @@
version: 2
# Scope: the root Cargo workspace, which produces the Rust binaries we
# ship to users (the Node.js and Python native extensions). The
# `rust/lancedb` library crate shares the same lockfile; its consumers
# pick their own dependency versions, but bumping transitive deps here
# keeps the binaries we ship current.
updates:
- package-ecosystem: cargo
directory: /
schedule:
interval: weekly
open-pull-requests-limit: 10
groups:
rust-minor-patch:
update-types:
- minor
- patch

View File

@@ -8,6 +8,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
permissions:
contents: read
jobs:
labeler:
permissions:

View File

@@ -19,6 +19,9 @@ on:
paths:
- .github/workflows/java-publish.yml
permissions:
contents: read
jobs:
publish:
name: Build and Publish
@@ -40,7 +43,7 @@ jobs:
server-username: SONATYPE_USER
server-password: SONATYPE_TOKEN
gpg-private-key: ${{ secrets.GPG_PRIVATE_KEY }}
-gpg-passphrase: ${{ secrets.GPG_PASSPHRASE }}
+gpg-passphrase: MAVEN_GPG_PASSPHRASE
- name: Set git config
run: |
git config --global user.email "dev+gha@lancedb.com"
@@ -55,10 +58,11 @@ jobs:
echo "use-agent" >> ~/.gnupg/gpg.conf
echo "pinentry-mode loopback" >> ~/.gnupg/gpg.conf
export GPG_TTY=$(tty)
-./mvnw --batch-mode -DskipTests -DpushChanges=false -Dgpg.passphrase=${{ secrets.GPG_PASSPHRASE }} deploy -pl lancedb-core -am -P deploy-to-ossrh
+./mvnw --batch-mode -DskipTests -DpushChanges=false deploy -pl lancedb-core -am -P deploy-to-ossrh
env:
SONATYPE_USER: ${{ secrets.SONATYPE_USER }}
SONATYPE_TOKEN: ${{ secrets.SONATYPE_TOKEN }}
MAVEN_GPG_PASSPHRASE: ${{ secrets.GPG_PASSPHRASE }}
report-failure:
name: Report Workflow Failure

View File

@@ -24,6 +24,9 @@ on:
- java/**
- .github/workflows/java.yml
permissions:
contents: read
jobs:
build-java:
runs-on: ubuntu-24.04

View File

@@ -10,6 +10,10 @@ on:
- nodejs/**
- java/**
- .github/workflows/license-header-check.yml
permissions:
contents: read
jobs:
check-licenses:
runs-on: ubuntu-latest

View File

@@ -15,6 +15,9 @@ on:
- .github/workflows/nodejs.yml
- docker-compose.yml
permissions:
contents: read
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

View File

@@ -14,10 +14,16 @@ on:
env:
PIP_EXTRA_INDEX_URL: "https://pypi.fury.io/lance-format/ https://pypi.fury.io/lancedb/"
permissions:
contents: read
jobs:
linux:
name: Python ${{ matrix.config.platform }} manylinux${{ matrix.config.manylinux }}
timeout-minutes: 60
permissions:
id-token: write
contents: read
strategy:
matrix:
config:
@@ -57,10 +63,12 @@ jobs:
- uses: ./.github/workflows/upload_wheel
if: startsWith(github.ref, 'refs/tags/python-v')
with:
-pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
fury_token: ${{ secrets.FURY_TOKEN }}
mac:
timeout-minutes: 90
permissions:
id-token: write
contents: read
runs-on: ${{ matrix.config.runner }}
strategy:
matrix:
@@ -85,10 +93,12 @@ jobs:
- uses: ./.github/workflows/upload_wheel
if: startsWith(github.ref, 'refs/tags/python-v')
with:
-pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
fury_token: ${{ secrets.FURY_TOKEN }}
windows:
timeout-minutes: 60
permissions:
id-token: write
contents: read
runs-on: windows-latest
steps:
- uses: actions/checkout@v4
@@ -107,7 +117,6 @@ jobs:
- uses: ./.github/workflows/upload_wheel
if: startsWith(github.ref, 'refs/tags/python-v')
with:
-pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
fury_token: ${{ secrets.FURY_TOKEN }}
gh-release:
if: startsWith(github.ref, 'refs/tags/python-v')

View File

@@ -17,6 +17,9 @@ on:
- .github/workflows/build_windows_wheel/**
- .github/workflows/run_tests/**
permissions:
contents: read
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
@@ -108,7 +111,6 @@ jobs:
- name: Install
run: |
pip install --extra-index-url https://pypi.fury.io/lance-format/ --extra-index-url https://pypi.fury.io/lancedb/ -e .[tests,dev,embeddings]
-pip install tantivy
pip install mlx
- name: Doctest
run: pytest --doctest-modules python/lancedb
@@ -227,6 +229,5 @@ jobs:
pip install "pydantic<2"
pip install pyarrow==16
pip install --extra-index-url https://pypi.fury.io/lance-format/ --extra-index-url https://pypi.fury.io/lancedb/ -e .[tests]
-pip install tantivy
- name: Run tests
run: pytest -m "not slow and not s3_test" -x -v --durations=30 python/tests

View File

@@ -9,9 +9,15 @@ on:
- Cargo.toml
- Cargo.lock
- rust-toolchain.toml
- deny.toml
- rust/**
- nodejs/Cargo.toml
- python/Cargo.toml
- .github/workflows/rust.yml
permissions:
contents: read
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
@@ -53,6 +59,17 @@ jobs:
- name: Run clippy (without remote feature)
run: cargo clippy --profile ci --workspace --tests -- -D warnings
deny:
# Supply-chain checks: advisories, licenses, banned crates, and source
# restrictions. Configuration lives in `deny.toml` at the workspace root.
timeout-minutes: 10
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- uses: EmbarkStudios/cargo-deny-action@v2
with:
command: check advisories bans licenses sources
build-no-lock:
runs-on: ubuntu-24.04
timeout-minutes: 30

View File

@@ -3,6 +3,9 @@ name: Update package-lock.json
on:
workflow_dispatch:
permissions:
contents: read
jobs:
publish:
runs-on: ubuntu-latest

View File

@@ -3,6 +3,9 @@ name: Update NodeJs package-lock.json
on:
workflow_dispatch:
permissions:
contents: read
jobs:
publish:
runs-on: ubuntu-latest

View File

@@ -2,9 +2,6 @@ name: upload-wheel
description: "Upload wheels to Pypi"
inputs:
-pypi_token:
-required: true
-description: "release token for the repo"
fury_token:
required: true
description: "release token for the fury repo"
@@ -12,12 +9,6 @@ inputs:
runs:
using: "composite"
steps:
-- name: Install dependencies
-shell: bash
-run: |
-python -m pip install --upgrade pip
-pip install twine
-python3 -m pip install --upgrade pkginfo
- name: Choose repo
shell: bash
id: choose_repo
@@ -27,19 +18,17 @@ runs:
else
echo "repo=pypi" >> $GITHUB_OUTPUT
fi
-- name: Publish to PyPI
+- name: Publish to Fury
+if: steps.choose_repo.outputs.repo == 'fury'
shell: bash
env:
FURY_TOKEN: ${{ inputs.fury_token }}
-PYPI_TOKEN: ${{ inputs.pypi_token }}
run: |
-if [[ ${{ steps.choose_repo.outputs.repo }} == fury ]]; then
-WHEEL=$(ls target/wheels/lancedb-*.whl 2> /dev/null | head -n 1)
-echo "Uploading $WHEEL to Fury"
-curl -f -F package=@$WHEEL https://$FURY_TOKEN@push.fury.io/lancedb/
-else
-twine upload --repository ${{ steps.choose_repo.outputs.repo }} \
---username __token__ \
---password $PYPI_TOKEN \
-target/wheels/lancedb-*.whl
-fi
+WHEEL=$(ls target/wheels/lancedb-*.whl 2> /dev/null | head -n 1)
+echo "Uploading $WHEEL to Fury"
+curl -f -F package=@$WHEEL https://$FURY_TOKEN@push.fury.io/lancedb/
+- name: Publish to PyPI
+if: steps.choose_repo.outputs.repo == 'pypi'
+uses: pypa/gh-action-pypi-publish@release/v1
+with:
+packages-dir: target/wheels/

Cargo.lock (generated, 2433 lines changed): file diff suppressed because it is too large.

@@ -1,7 +1,5 @@
[workspace]
members = ["rust/lancedb", "nodejs", "python"]
# Python package needs to be built by maturin.
exclude = ["python"]
resolver = "2"
[workspace.package]
@@ -15,40 +13,40 @@ categories = ["database-implementations"]
rust-version = "1.91.0"
[workspace.dependencies]
lance = { "version" = "=5.1.0-beta.3", default-features = false, "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=5.1.0-beta.3", default-features = false, "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=5.1.0-beta.3", default-features = false, "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=5.1.0-beta.3", "tag" = "v5.1.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance = { "version" = "=7.0.0-beta.7", default-features = false, "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=7.0.0-beta.7", default-features = false, "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=7.0.0-beta.7", default-features = false, "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "57.2", optional = false }
arrow-array = "57.2"
arrow-data = "57.2"
arrow-ipc = "57.2"
arrow-ord = "57.2"
arrow-schema = "57.2"
arrow-select = "57.2"
arrow-cast = "57.2"
arrow = { version = "58.0.0", optional = false }
arrow-array = "58.0.0"
arrow-data = "58.0.0"
arrow-ipc = "58.0.0"
arrow-ord = "58.0.0"
arrow-schema = "58.0.0"
arrow-select = "58.0.0"
arrow-cast = "58.0.0"
async-trait = "0"
datafusion = { version = "52.1", default-features = false }
datafusion-catalog = "52.1"
datafusion-common = { version = "52.1", default-features = false }
datafusion-execution = "52.1"
datafusion-expr = "52.1"
datafusion-functions = "52.1"
datafusion-physical-plan = "52.1"
datafusion-physical-expr = "52.1"
datafusion-sql = "52.1"
datafusion = { version = "53.0.0", default-features = false }
datafusion-catalog = "53.0.0"
datafusion-common = { version = "53.0.0", default-features = false }
datafusion-execution = "53.0.0"
datafusion-expr = "53.0.0"
datafusion-functions = "53.0.0"
datafusion-physical-plan = "53.0.0"
datafusion-physical-expr = "53.0.0"
datafusion-sql = "53.0.0"
env_logger = "0.11"
half = { "version" = "2.7.1", default-features = false, features = [
"num-traits",
@@ -56,7 +54,7 @@ half = { "version" = "2.7.1", default-features = false, features = [
futures = "0"
log = "0.4"
moka = { version = "0.12", features = ["future"] }
object_store = "0.12.0"
object_store = "0.13.2"
pin-project = "1.0.7"
rand = "0.9"
snafu = "0.8"

deny.toml (new file, 196 lines added)

@@ -0,0 +1,196 @@
# cargo-deny configuration for LanceDB.
#
# Run locally with `cargo deny check`. See
# https://embarkstudios.github.io/cargo-deny/ for the full reference.
# The set of target triples we care about. cargo-deny will only consider
# dependencies that are used on at least one of these targets. Keeping this
# explicit avoids noise from platform-specific crates (e.g. wasm, android,
# ios) that we never actually ship.
[graph]
targets = [
"x86_64-unknown-linux-gnu",
"aarch64-unknown-linux-gnu",
"x86_64-apple-darwin",
"aarch64-apple-darwin",
"x86_64-pc-windows-msvc",
"aarch64-pc-windows-msvc",
]
all-features = true
[output]
feature-depth = 1
# ---------------------------------------------------------------------------
# Advisories: security vulnerabilities and yanked crates.
# ---------------------------------------------------------------------------
[advisories]
version = 2
# Fail the check if any crate in the lockfile has been yanked from crates.io.
# Yanked crates are a signal the author retracted the release (often due to
# bugs or security issues) and should not be depended on.
yanked = "deny"
# Advisory IDs we have explicitly reviewed and chosen to accept. Every
# entry must include a rationale and, where possible, an upstream issue
# pointing to a fix. Revisit this list whenever dependencies are updated.
ignore = [
# rsa: Marvin Attack timing side-channel in PKCS#1 v1.5 decryption.
# Reached only through opendal → reqsign → rsa. We do not use RSA
# decryption in LanceDB ourselves; this is dormant in the signing path.
# No fixed release exists upstream as of this writing.
# https://rustsec.org/advisories/RUSTSEC-2023-0071
{ id = "RUSTSEC-2023-0071", reason = "rsa crate via opendal/reqsign; no fixed upstream release" },
# instant: unmaintained. Pulled in via backoff → instant. Upstream
# recommends switching to `web-time`; fix has to come from backoff.
# https://rustsec.org/advisories/RUSTSEC-2024-0384
{ id = "RUSTSEC-2024-0384", reason = "transitive via backoff; waiting on backoff replacement" },
# paste: unmaintained (author archived the repo). Used transitively by
# datafusion and the arrow ecosystem; widespread, no drop-in replacement.
# https://rustsec.org/advisories/RUSTSEC-2024-0436
{ id = "RUSTSEC-2024-0436", reason = "transitive via datafusion; awaiting ecosystem migration" },
# encoding: unmaintained. Reached through lindera-dictionary, which is
# required by the native Lindera tokenizer path. Lindera has not migrated
# off this crate yet.
# https://rustsec.org/advisories/RUSTSEC-2021-0153
{ id = "RUSTSEC-2021-0153", reason = "transitive via lindera-dictionary for native Lindera tokenizer" },
# fast-float: unsound and unmaintained. Reached only through polars-arrow
# from the optional Polars integration; replacement requires a Polars
# dependency upgrade.
# https://rustsec.org/advisories/RUSTSEC-2024-0379
{ id = "RUSTSEC-2024-0379", reason = "transitive via polars-arrow; waiting on Polars migration" },
# tantivy: segfault on malformed input due to missing bounds check.
# Pulled in via lance for full-text search. We only feed tantivy
# documents we construct ourselves, not attacker-controlled bytes.
# Tracked for a lance dependency bump.
# https://rustsec.org/advisories/RUSTSEC-2025-0003
{ id = "RUSTSEC-2025-0003", reason = "tantivy via lance; inputs are internally produced, not user-supplied bytes" },
# backoff: unmaintained. Reached only via async-openai. Replacement
# requires async-openai to migrate (or us to drop async-openai).
# https://rustsec.org/advisories/RUSTSEC-2025-0012
{ id = "RUSTSEC-2025-0012", reason = "transitive via async-openai; waiting on upstream migration" },
# number_prefix: unmaintained. Transitive via indicatif → hf-hub.
# No security impact, just maintenance status.
# https://rustsec.org/advisories/RUSTSEC-2025-0119
{ id = "RUSTSEC-2025-0119", reason = "transitive via hf-hub/indicatif; cosmetic formatting crate" },
# bincode: unmaintained. Reached through lindera and lindera-dictionary,
# which are required by the native Lindera tokenizer path. Lindera has not
# migrated to another serialization format yet.
# https://rustsec.org/advisories/RUSTSEC-2025-0141
{ id = "RUSTSEC-2025-0141", reason = "transitive via lindera/lindera-dictionary for native Lindera tokenizer" },
# lru: soundness issue in IterMut. Reached only through aws-sdk-s3 in
# LanceDB's dev-dependency graph; LanceDB does not use that iterator
# directly. Clearing this requires the AWS SDK chain to update lru.
# https://rustsec.org/advisories/RUSTSEC-2026-0002
{ id = "RUSTSEC-2026-0002", reason = "transitive via aws-sdk-s3 dev-dependency; waiting on AWS SDK lru upgrade" },
# rustls-webpki 0.101.7 (old major line): name-constraint checks for
# URI / wildcard names. Pulled in only via the legacy rustls 0.21 chain
# from aws-smithy-http-client. The 0.103 line we actively use is patched.
# Clearing the 0.101 copy requires the aws-sdk chain to migrate off
# rustls 0.21.
# https://rustsec.org/advisories/RUSTSEC-2026-0098
# https://rustsec.org/advisories/RUSTSEC-2026-0099
{ id = "RUSTSEC-2026-0098", reason = "only affects rustls-webpki 0.101 from legacy aws-smithy/rustls 0.21 chain" },
{ id = "RUSTSEC-2026-0099", reason = "only affects rustls-webpki 0.101 from legacy aws-smithy/rustls 0.21 chain" },
# rustls-webpki 0.101.7: reachable panic in CRL parsing. Same legacy
# rustls 0.21 chain from aws-smithy-http-client as above. The 0.103 line
# we actively use is upgraded to 0.103.13 which contains the fix.
# https://rustsec.org/advisories/RUSTSEC-2026-0104
{ id = "RUSTSEC-2026-0104", reason = "only affects rustls-webpki 0.101 from legacy aws-smithy/rustls 0.21 chain" },
# rand 0.8.5: soundness issue only when ThreadRng reseeds inside a custom
# logger. Reached through several transitive chains. LanceDB does not use
# rand from a custom logger; upgrade once all pinned chains accept 0.8.6+.
# https://rustsec.org/advisories/RUSTSEC-2026-0097
{ id = "RUSTSEC-2026-0097", reason = "transitive rand 0.8.5; LanceDB does not call ThreadRng from custom logging" },
]
# ---------------------------------------------------------------------------
# Licenses: only allow licenses we've reviewed as compatible with Apache-2.0.
# ---------------------------------------------------------------------------
[licenses]
version = 2
# SPDX identifiers for licenses that are compatible with our Apache-2.0
# distribution. Additions require legal review.
allow = [
"Apache-2.0",
"Apache-2.0 WITH LLVM-exception",
"MIT",
"BSD-2-Clause",
"BSD-3-Clause",
"ISC",
"Unicode-3.0",
"Unicode-DFS-2016",
"Zlib",
"CC0-1.0",
"MPL-2.0",
"BSL-1.0",
"OpenSSL",
# 0BSD ("BSD Zero Clause") is effectively public domain — no attribution
# required. Pulled in by `mock_instant`.
"0BSD",
# bzip2-1.0.6 is the permissive upstream bzip2 license (BSD-like). Pulled
# in by `libbz2-rs-sys`, the pure-Rust bzip2 implementation.
"bzip2-1.0.6",
# CDLA-Permissive-2.0 is a permissive data license used by `webpki-roots`
# for the Mozilla CA root bundle. Data-only, distribution-compatible.
"CDLA-Permissive-2.0",
]
confidence-threshold = 0.8
# Crates whose license cannot be determined from Cargo metadata but whose
# license we've manually confirmed from upstream. Keep this list minimal.
[[licenses.clarify]]
# polars-arrow-format omits the `license` field in its Cargo.toml, but the
# upstream repo (pola-rs/polars-arrow-format) is dual-licensed Apache-2.0 OR
# MIT. See https://github.com/pola-rs/polars-arrow-format/blob/main/LICENSE
crate = "polars-arrow-format"
expression = "Apache-2.0 OR MIT"
license-files = []
# ---------------------------------------------------------------------------
# Bans: disallow specific crates and flag dependency hygiene issues.
# ---------------------------------------------------------------------------
[bans]
# Warn (not deny) on duplicate versions of the same crate. In a large
# workspace like this one, duplicates are common and often unavoidable
# transitively. We surface them to discourage growth, but don't fail CI.
multiple-versions = "warn"
# Wildcard version requirements (`foo = "*"`) are a footgun — they let any
# future release in without review. Ban them outright.
wildcards = "deny"
# Internal workspace crates reference each other via `path = "..."`, which
# cargo-deny sees as a wildcard version. That's fine for private workspace
# members (not published to crates.io), so allow it specifically for paths.
allow-wildcard-paths = true
# Crates that, if present in the dependency graph, should cause the check to fail.
deny = []
# Crates to skip when checking for duplicate versions.
skip = []
# Similar to `skip`, but also skips the entire transitive subtree.
skip-tree = []
# ---------------------------------------------------------------------------
# Sources: restrict where crates can come from.
# ---------------------------------------------------------------------------
[sources]
# Deny any registry other than the ones explicitly listed below.
unknown-registry = "deny"
# Deny any git dependency whose repository URL isn't in the allow-list below.
# This prevents accidental pulls from arbitrary forks.
unknown-git = "deny"
allow-registry = ["https://github.com/rust-lang/crates.io-index"]
# Lance is developed in a sibling repo and pulled as a git dependency until
# releases are cut to crates.io. Allow that specific repository.
allow-git = [
"https://github.com/lance-format/lance",
]

@@ -24,4 +24,4 @@ RUN python --version && \
rustc --version && \
protoc --version
RUN pip install --no-cache-dir tantivy lancedb
RUN pip install --no-cache-dir lancedb

@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-core</artifactId>
<version>0.28.0-beta.5</version>
<version>0.28.0-beta.11</version>
</dependency>
```

@@ -501,6 +501,34 @@ Modeled after ``VACUUM`` in PostgreSQL.
***
### prewarmData()
```ts
abstract prewarmData(columns?): Promise<void>
```
Prewarm one or more columns of data in the table.
#### Parameters
* **columns?**: `string`[]
The columns to prewarm. If undefined, all columns are prewarmed.
This will load the column data into the page cache so that future queries that
read those columns avoid the initial cold-start latency. This call initiates
prewarming and returns once the request is accepted; the warming itself may
continue in the background. Calling it on already-prewarmed columns is a
no-op on the server.
Prewarming is generally useful for columns used in filters or projections.
Large columns (e.g. high-dimensional vectors or binary data) may not be
practical to prewarm.
This feature is currently only supported on remote tables.
#### Returns
`Promise`&lt;`void`&gt;
***
### prewarmIndex()
```ts

@@ -41,6 +41,29 @@ for testing purposes.
***
### manifestEnabled?
```ts
optional manifestEnabled: boolean;
```
(For LanceDB OSS only): use directory namespace manifests as the source
of truth for table metadata. Existing directory-listed root tables are
migrated into the manifest on access.
***
### namespaceClientProperties?
```ts
optional namespaceClientProperties: Record<string, string>;
```
(For LanceDB OSS only): extra properties for the backing namespace
client used by manifest-enabled native connections.
***
### readConsistencyInterval?
```ts

@@ -94,11 +94,11 @@ of raw SQL strings with [where][lancedb.query.LanceQueryBuilder.where] and
## Full text search
::: lancedb.fts.create_index
Use [lancedb.table.Table.create_fts_index][] for the synchronous API or
[lancedb.table.AsyncTable.create_index][] with [lancedb.index.FTS][] for the
asynchronous API.
::: lancedb.fts.populate_index
::: lancedb.fts.search_index
::: lancedb.index.FTS
## Utilities

@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.28.0-beta.5</version>
<version>0.28.0-beta.11</version>
<relativePath>../pom.xml</relativePath>
</parent>

@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.28.0-beta.5</version>
<version>0.28.0-beta.11</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<arrow.version>15.0.0</arrow.version>
<lance-core.version>5.1.0-beta.3</lance-core.version>
<lance-core.version>7.0.0-beta.7</lance-core.version>
<spotless.skip>false</spotless.skip>
<spotless.version>2.30.0</spotless.version>
<spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>

@@ -1,7 +1,8 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.28.0-beta.5"
version = "0.28.0-beta.11"
publish = false
license.workspace = true
description.workspace = true
repository.workspace = true
@@ -15,7 +16,7 @@ crate-type = ["cdylib"]
async-trait.workspace = true
arrow-ipc.workspace = true
arrow-array.workspace = true
arrow-buffer = "57.2"
arrow-buffer = "58.0.0"
half.workspace = true
arrow-schema.workspace = true
env_logger.workspace = true
@@ -31,8 +32,8 @@ lzma-sys = { version = "0.1", features = ["static"] }
log.workspace = true
# Pin to resolve build failures; update periodically for security patches.
aws-lc-sys = "=0.38.0"
aws-lc-rs = "=1.16.1"
aws-lc-sys = "=0.40.0"
aws-lc-rs = "=1.16.3"
[build-dependencies]
napi-build = "2.3.1"

@@ -1,6 +1,8 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
import { spawn } from "node:child_process";
import * as path from "node:path";
import { RecordBatch } from "apache-arrow";
import * as tmp from "tmp";
import { Connection, Index, Table, connect, makeArrowTable } from "../lancedb";
@@ -76,4 +78,91 @@ describe("rerankers", function () {
expect(result).toHaveLength(2);
});
it("does not keep process alive after rerank query", async function () {
const script = `
import * as lancedb from "./dist/index.js";
import * as os from "node:os";
import * as path from "node:path";
import * as fs from "node:fs/promises";
const dir = await fs.mkdtemp(path.join(os.tmpdir(), "lancedb-rerank-exit-"));
const db = await lancedb.connect(dir);
const table = await db.createTable("test", [{ text: "hello", vector: [1, 2, 3] }], {
mode: "overwrite",
});
await table.createIndex("text", { config: lancedb.Index.fts() });
await table.waitForIndex(["text_idx"], 30);
const reranker = await lancedb.rerankers.RRFReranker.create();
await table
.query()
.nearestTo([1, 2, 3])
.fullTextSearch("hello")
.rerank(reranker)
.toArray();
table.close();
db.close();
`;
await new Promise<void>((resolve, reject) => {
const child = spawn(
process.execPath,
["--input-type=module", "-e", script],
{
cwd: path.resolve(__dirname, ".."),
stdio: ["ignore", "pipe", "pipe"],
},
);
let stdout = "";
let stderr = "";
child.stdout.on("data", (chunk) => {
stdout += chunk.toString();
});
child.stderr.on("data", (chunk) => {
stderr += chunk.toString();
});
const timeout = setTimeout(() => {
child.kill();
reject(
new Error(
`child process did not exit in time\nstdout:\n${stdout}\nstderr:\n${stderr}`,
),
);
}, 20_000);
child.on("error", (err) => {
clearTimeout(timeout);
reject(err);
});
child.on("exit", (code, signal) => {
clearTimeout(timeout);
if (signal !== null) {
reject(
new Error(
`child process exited with signal ${signal}\nstdout:\n${stdout}\nstderr:\n${stderr}`,
),
);
return;
}
if (code !== 0) {
reject(
new Error(
`child process exited with code ${code}\nstdout:\n${stdout}\nstderr:\n${stderr}`,
),
);
return;
}
resolve();
});
});
});
});

@@ -1870,6 +1870,25 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
expect(results.length).toBe(3);
});
test("prewarmData errors on local tables", async () => {
const db = await connect(tmpDir.name);
const data = [
{ text: "alpha", vector: [0.1, 0.2, 0.3] },
{ text: "beta", vector: [0.4, 0.5, 0.6] },
];
const table = await db.createTable("prewarm_data_test", data);
// prewarmData is only supported on remote tables. We verify the call
// is wired through napi and surfaces the expected error for both
// arg shapes (undefined and string[]).
await expect(table.prewarmData()).rejects.toThrow(
"prewarm_data is currently only supported on remote tables",
);
await expect(table.prewarmData(["text"])).rejects.toThrow(
"prewarm_data is currently only supported on remote tables",
);
});
test("full text index on list", async () => {
const db = await connect(tmpDir.name);
const data = [

@@ -285,6 +285,25 @@ export abstract class Table {
*/
abstract prewarmIndex(name: string): Promise<void>;
/**
* Prewarm one or more columns of data in the table.
*
* @param columns The columns to prewarm. If undefined, all columns are prewarmed.
*
* This will load the column data into the page cache so that future queries that
* read those columns avoid the initial cold-start latency. This call initiates
* prewarming and returns once the request is accepted; the warming itself may
* continue in the background. Calling it on already-prewarmed columns is a
* no-op on the server.
*
* Prewarming is generally useful for columns used in filters or projections.
* Large columns (e.g. high-dimensional vectors or binary data) may not be
* practical to prewarm.
*
* This feature is currently only supported on remote tables.
*/
abstract prewarmData(columns?: string[]): Promise<void>;
/**
* Waits for asynchronous indexing to complete on the table.
*
@@ -710,6 +729,10 @@ export class LocalTable extends Table {
await this.inner.prewarmIndex(name);
}
async prewarmData(columns?: string[]): Promise<void> {
await this.inner.prewarmData(columns);
}
async waitForIndex(
indexNames: string[],
timeoutSeconds: number,

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": [
"win32"
],

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",

@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"cpu": [
"x64",
"arm64"

@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.28.0-beta.5",
"version": "0.28.0-beta.11",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",
@@ -75,7 +75,6 @@
"build:debug": "napi build --platform --dts ../lancedb/native.d.ts --js ../lancedb/native.js --output-dir lancedb",
"postbuild:debug": "shx mkdir -p dist && shx cp lancedb/*.node dist/",
"build:release": "napi build --platform --release --dts ../lancedb/native.d.ts --js ../lancedb/native.js --output-dir dist",
"postbuild:release": "shx mkdir -p dist && shx cp lancedb/*.node dist/",
"build": "npm run build:debug && npm run tsc",
"build-release": "npm run build:release && npm run tsc",
"tsc": "tsc -b",

@@ -67,6 +67,12 @@ impl Connection {
builder = builder.storage_option(key, value);
}
}
if let Some(manifest_enabled) = options.manifest_enabled {
builder = builder.manifest_enabled(manifest_enabled);
}
if let Some(namespace_client_properties) = options.namespace_client_properties {
builder = builder.namespace_client_properties(namespace_client_properties);
}
// Create client config, optionally with header provider
let client_config = options.client_config.unwrap_or_default();

@@ -37,6 +37,13 @@ pub struct ConnectionOptions {
///
/// The available options are described at https://docs.lancedb.com/storage/
pub storage_options: Option<HashMap<String, String>>,
/// (For LanceDB OSS only): use directory namespace manifests as the source
/// of truth for table metadata. Existing directory-listed root tables are
/// migrated into the manifest on access.
pub manifest_enabled: Option<bool>,
/// (For LanceDB OSS only): extra properties for the backing namespace
/// client used by manifest-enabled native connections.
pub namespace_client_properties: Option<HashMap<String, String>>,
/// (For LanceDB OSS only): the session to use for this connection. Holds
/// shared caches and other session-specific state.
pub session: Option<session::Session>,

@@ -18,6 +18,7 @@ type RerankHybridFn = ThreadsafeFunction<
RerankHybridCallbackArgs,
Status,
false,
true,
>;
/// Reranker implementation that "wraps" a NodeJS Reranker implementation.
@@ -32,7 +33,10 @@ impl Reranker {
pub fn new(
rerank_hybrid: Function<RerankHybridCallbackArgs, Promise<Buffer>>,
) -> napi::Result<Self> {
let rerank_hybrid = rerank_hybrid.build_threadsafe_function().build()?;
let rerank_hybrid = rerank_hybrid
.build_threadsafe_function()
.weak::<true>()
.build()?;
Ok(Self { rerank_hybrid })
}
}

@@ -159,6 +159,14 @@ impl Table {
.default_error()
}
#[napi(catch_unwind)]
pub async fn prewarm_data(&self, columns: Option<Vec<String>>) -> napi::Result<()> {
self.inner_ref()?
.prewarm_data(columns)
.await
.default_error()
}
#[napi(catch_unwind)]
pub async fn wait_for_index(&self, index_names: Vec<String>, timeout_s: i64) -> Result<()> {
let timeout = std::time::Duration::from_secs(timeout_s.try_into().unwrap());

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.31.0-beta.5"
current_version = "0.31.0-beta.11"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

@@ -1,6 +1,7 @@
[package]
name = "lancedb-python"
version = "0.31.0-beta.5"
version = "0.31.0-beta.11"
publish = false
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true
@@ -14,7 +15,7 @@ name = "_lancedb"
crate-type = ["cdylib"]
[dependencies]
arrow = { version = "57.2", features = ["pyarrow"] }
arrow = { version = "58.0.0", features = ["pyarrow"] }
async-trait = "0.1"
bytes = "1"
lancedb = { path = "../rust/lancedb", default-features = false }
@@ -24,8 +25,8 @@ lance-namespace-impls.workspace = true
lance-io.workspace = true
env_logger.workspace = true
log.workspace = true
pyo3 = { version = "0.26", features = ["extension-module", "abi3-py39"] }
pyo3-async-runtimes = { version = "0.26", features = [
pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39"] }
pyo3-async-runtimes = { version = "0.28", features = [
"attributes",
"tokio-runtime",
] }
@@ -34,10 +35,11 @@ futures.workspace = true
serde = "1"
serde_json = "1"
snafu.workspace = true
tokio = { version = "1.40", features = ["sync"] }
tokio = { version = "1.40", features = ["sync", "rt-multi-thread"] }
libc = "0.2"
[build-dependencies]
pyo3-build-config = { version = "0.26", features = [
pyo3-build-config = { version = "0.28", features = [
"extension-module",
"abi3-py39",
] }

@@ -183,7 +183,6 @@
| stack-data | 0.6.3 | MIT License | http://github.com/alexmojaki/stack_data |
| sympy | 1.14.0 | BSD License | https://sympy.org |
| tabulate | 0.9.0 | MIT License | https://github.com/astanin/python-tabulate |
| tantivy | 0.25.1 | UNKNOWN | UNKNOWN |
| threadpoolctl | 3.6.0 | BSD License | https://github.com/joblib/threadpoolctl |
| timm | 1.0.24 | Apache Software License | https://github.com/huggingface/pytorch-image-models |
| tinycss2 | 1.4.0 | BSD License | https://www.courtbouillon.org/tinycss2 |

@@ -57,7 +57,6 @@ tests = [
"duckdb>=0.9.0",
"pytz>=2023.3",
"polars>=0.19, <=1.3.0",
"tantivy>=0.20.0",
"pyarrow-stubs>=16.0",
"pylance>=5.0.0b5",
"requests>=2.31.0",

@@ -7,7 +7,6 @@ import os
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta
from typing import Dict, Optional, Union, Any, List
import warnings
__version__ = importlib.metadata.version("lancedb")
@@ -73,6 +72,7 @@ def connect(
client_config: Union[ClientConfig, Dict[str, Any], None] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
manifest_enabled: bool = False,
namespace_client_impl: Optional[str] = None,
namespace_client_properties: Optional[Dict[str, str]] = None,
namespace_client_pushdown_operations: Optional[List[str]] = None,
@@ -111,6 +111,10 @@ def connect(
storage_options: dict, optional
Additional options for the storage backend. See available options at
<https://docs.lancedb.com/storage/>
manifest_enabled : bool, default False
When true for local/native connections, use directory namespace
manifests as the source of truth for table metadata. Existing
directory-listed root tables are migrated into the manifest on access.
session: Session, optional
(For LanceDB OSS only)
A session to use for this connection. Sessions allow you to configure
@@ -158,11 +162,11 @@ def connect(
conn : DBConnection
A connection to a LanceDB database.
"""
if namespace_client_impl is not None or namespace_client_properties is not None:
if namespace_client_impl is None or namespace_client_properties is None:
if namespace_client_impl is not None:
if namespace_client_properties is None:
raise ValueError(
"Both namespace_client_impl and "
"namespace_client_properties must be provided"
"namespace_client_properties must be provided when "
"namespace_client_impl is set"
)
if kwargs:
raise ValueError(f"Unknown keyword arguments: {kwargs}")
@@ -175,6 +179,12 @@ def connect(
namespace_client_pushdown_operations=namespace_client_pushdown_operations,
)
if namespace_client_properties is not None and not manifest_enabled:
raise ValueError(
"namespace_client_impl must be provided when using "
"namespace_client_properties unless manifest_enabled=True"
)
if namespace_client_pushdown_operations is not None:
raise ValueError(
"namespace_client_pushdown_operations is only valid when "
@@ -212,6 +222,8 @@ def connect(
read_consistency_interval=read_consistency_interval,
storage_options=storage_options,
session=session,
manifest_enabled=manifest_enabled,
namespace_client_properties=namespace_client_properties,
)
@@ -289,6 +301,8 @@ def deserialize_conn(
parsed["uri"],
read_consistency_interval=rci,
storage_options=storage_options,
manifest_enabled=parsed.get("manifest_enabled", False),
namespace_client_properties=parsed.get("namespace_client_properties"),
)
else:
raise ValueError(f"Unknown connection_type: {connection_type}")
@@ -304,6 +318,8 @@ async def connect_async(
client_config: Optional[Union[ClientConfig, Dict[str, Any]]] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
manifest_enabled: bool = False,
namespace_client_properties: Optional[Dict[str, str]] = None,
) -> AsyncConnection:
"""Connect to a LanceDB database.
@@ -343,6 +359,13 @@ async def connect_async(
cache sizes for index and metadata caches, which can significantly
impact memory use and performance. They can also be re-used across
multiple connections to share the same cache state.
manifest_enabled : bool, default False
When true for local/native connections, use directory namespace
manifests as the source of truth for table metadata. Existing
directory-listed root tables are migrated into the manifest on access.
namespace_client_properties : dict, optional
Additional directory namespace client properties to use with
``manifest_enabled=True``.
Examples
--------
@@ -385,6 +408,8 @@ async def connect_async(
client_config,
storage_options,
session,
manifest_enabled,
namespace_client_properties,
)
)
@@ -412,13 +437,3 @@ __all__ = [
"Table",
"__version__",
]
def __warn_on_fork():
warnings.warn(
"lance is not fork-safe. If you are using multiprocessing, use spawn instead.",
)
if hasattr(os, "register_at_fork"):
os.register_at_fork(before=__warn_on_fork) # type: ignore[attr-defined]
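
Because the `connect()` hunks above lost their +/- markers, the post-change argument validation is easier to follow as a single function. The sketch below is a reconstruction under that assumption, not the code from the commit: `resolve_connection_mode` and its return strings are hypothetical names used only for illustration, and the final error message is placeholder wording because the original is cut off in the diff.

```python
from typing import Dict, List, Optional


def resolve_connection_mode(
    namespace_client_impl: Optional[str] = None,
    namespace_client_properties: Optional[Dict[str, str]] = None,
    manifest_enabled: bool = False,
    namespace_client_pushdown_operations: Optional[List[str]] = None,
) -> str:
    """Hypothetical helper mirroring the checks that appear in the diff."""
    if namespace_client_impl is not None:
        # An explicit implementation still needs its properties.
        if namespace_client_properties is None:
            raise ValueError(
                "namespace_client_properties must be provided when "
                "namespace_client_impl is set"
            )
        return "namespace"

    # Properties without an implementation are only accepted together
    # with manifest_enabled=True.
    if namespace_client_properties is not None and not manifest_enabled:
        raise ValueError(
            "namespace_client_impl must be provided when using "
            "namespace_client_properties unless manifest_enabled=True"
        )
    if namespace_client_pushdown_operations is not None:
        # Placeholder wording: the original message is truncated in the diff.
        raise ValueError("pushdown operations require a namespace client")
    return "direct"
```

Read this way, the change appears to relax the old "both or neither" rule: properties alone are now allowed, but only for manifest-enabled connections.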


@@ -12,6 +12,7 @@ from .index import (
LabelList,
HnswPq,
HnswSq,
HnswFlat,
FTS,
)
from lance_namespace import (
@@ -25,6 +26,7 @@ from .remote import ClientConfig
IvfHnswPq: type[HnswPq] = HnswPq
IvfHnswSq: type[HnswSq] = HnswSq
IvfHnswFlat: type[HnswFlat] = HnswFlat
class PyExpr:
"""A type-safe DataFusion expression node (Rust-side handle)."""
@@ -180,6 +182,7 @@ class Table:
IvfPq,
HnswPq,
HnswSq,
HnswFlat,
BTree,
Bitmap,
LabelList,
@@ -242,6 +245,8 @@ async def connect(
client_config: Optional[Union[ClientConfig, Dict[str, Any]]],
storage_options: Optional[Dict[str, str]],
session: Optional[Session],
manifest_enabled: bool = False,
namespace_client_properties: Optional[Dict[str, str]] = None,
) -> Connection: ...
class RecordBatchStream:
@@ -440,7 +445,7 @@ class AsyncPermutationBuilder:
async def execute(self) -> Table: ...
def async_permutation_builder(
table: Table, dest_table_name: str
table: Table,
) -> AsyncPermutationBuilder: ...
def fts_query_to_json(query: Any) -> str: ...


@@ -2,7 +2,9 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import asyncio
import os
import threading
import warnings
class BackgroundEventLoop:
@@ -13,6 +15,9 @@ class BackgroundEventLoop:
"""
def __init__(self):
self._start()
def _start(self):
self.loop = asyncio.new_event_loop()
self.thread = threading.Thread(
target=self.loop.run_forever,
@@ -31,3 +36,30 @@ class BackgroundEventLoop:
LOOP = BackgroundEventLoop()
_FORK_WARNED = False
def _reset_after_fork():
# Threads do not survive fork(), so the asyncio loop in LOOP.thread is
# dead in the child. Re-initialize the singleton in place so existing
# `from .background_loop import LOOP` references in other modules see
# the new state. The Rust-side tokio runtime is reset analogously by a
# pthread_atfork hook installed in the _lancedb extension.
LOOP._start()
global _FORK_WARNED
if not _FORK_WARNED:
_FORK_WARNED = True
warnings.warn(
"lancedb fork support is experimental: the internal async "
"runtime has been reset in the forked child, but a small chance "
"of deadlock remains if other state was mid-operation at fork "
"time. The 'forkserver' or 'spawn' multiprocessing start method "
"is likely a safer alternative.",
RuntimeWarning,
stacklevel=2,
)
if hasattr(os, "register_at_fork"):
os.register_at_fork(after_in_child=_reset_after_fork)


@@ -79,6 +79,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], default []
The parent namespace to list namespaces in.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -106,6 +107,7 @@ class DBConnection(EnforceOverrides):
----------
namespace_path: List[str]
The namespace identifier to create.
Previously called ``namespace`` in 0.30.2 and earlier.
mode: str, optional
Creation mode - "create" (fail if exists), "exist_ok" (skip if exists),
or "overwrite" (replace if exists). Case insensitive.
@@ -133,6 +135,7 @@ class DBConnection(EnforceOverrides):
----------
namespace_path: List[str]
The namespace identifier to drop.
Previously called ``namespace`` in 0.30.2 and earlier.
mode: str, optional
Whether to skip if not exists ("SKIP") or fail ("FAIL"). Case insensitive.
behavior: str, optional
@@ -157,6 +160,7 @@ class DBConnection(EnforceOverrides):
----------
namespace_path: List[str]
The namespace identifier to describe.
Previously called ``namespace`` in 0.30.2 and earlier.
Returns
-------
@@ -180,6 +184,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], optional
The namespace to list tables in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -210,6 +215,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], default []
The namespace to list tables in.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
The token to use for pagination. If not present, start from the beginning.
Typically, this token is the last table name from the previous page.
@@ -248,6 +254,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], default []
The namespace to create the table in.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
data: The data to initialize the table, *optional*
User must provide at least one of `data` or `schema`.
Acceptable types are:
@@ -416,6 +423,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], optional
The namespace to open the table from.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
index_cache_size: int, default 256
**Deprecated**: Use session-level cache configuration instead.
Create a Session with custom cache sizes and pass it to lancedb.connect().
@@ -451,6 +459,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], default []
The namespace to drop the table from.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
"""
if namespace_path is None:
namespace_path = []
@@ -474,9 +483,11 @@ class DBConnection(EnforceOverrides):
cur_namespace_path: List[str], optional
The namespace of the current table.
None or empty list represents root namespace.
Previously called ``cur_namespace`` in 0.30.2 and earlier.
new_namespace_path: List[str], optional
The namespace to move the table to.
If not specified, defaults to the same as cur_namespace.
If not specified, defaults to the same as cur_namespace_path.
Previously called ``new_namespace`` in 0.30.2 and earlier.
"""
if cur_namespace_path is None:
cur_namespace_path = []
@@ -500,6 +511,7 @@ class DBConnection(EnforceOverrides):
namespace_path: List[str], optional
The namespace to drop all tables from.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
"""
if namespace_path is None:
namespace_path = []
@@ -590,8 +602,13 @@ class LanceDBConnection(DBConnection):
read_consistency_interval: Optional[timedelta] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
manifest_enabled: bool = False,
namespace_client_properties: Optional[Dict[str, str]] = None,
_inner: Optional[LanceDbConnection] = None,
):
self.storage_options = storage_options
self._manifest_enabled = manifest_enabled
self._namespace_client_properties = namespace_client_properties
if _inner is not None:
self._conn = _inner
self._cached_namespace_client = None
@@ -633,6 +650,8 @@ class LanceDBConnection(DBConnection):
None,
storage_options,
session,
manifest_enabled,
namespace_client_properties,
)
# TODO: It would be nice if we didn't store self.storage_options but it is
@@ -640,7 +659,6 @@ class LanceDBConnection(DBConnection):
# work because some paths like LanceDBConnection.from_inner will lose the
# storage_options. Also, this class really shouldn't be holding any state
# beyond _conn.
self.storage_options = storage_options
self._conn = AsyncConnection(LOOP.run(do_connect()))
self._cached_namespace_client: Optional[LanceNamespace] = None
@@ -677,6 +695,8 @@ class LanceDBConnection(DBConnection):
"connection_type": "local",
"uri": self.uri,
"storage_options": self.storage_options,
"manifest_enabled": self._manifest_enabled,
"namespace_client_properties": self._namespace_client_properties,
"read_consistency_interval_seconds": (
rci.total_seconds() if rci else None
),
@@ -705,6 +725,7 @@ class LanceDBConnection(DBConnection):
namespace_path: List[str], optional
The parent namespace to list namespaces in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -772,6 +793,7 @@ class LanceDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to list tables in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -814,6 +836,7 @@ class LanceDBConnection(DBConnection):
----------
namespace_path: List[str], optional
The namespace to list tables in.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
The token to use for pagination.
limit: int, default 10
@@ -868,6 +891,7 @@ class LanceDBConnection(DBConnection):
----------
namespace_path: List[str], optional
The namespace to create the table in.
Previously called ``namespace`` in 0.30.2 and earlier.
See
---
@@ -941,6 +965,7 @@ class LanceDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to open the table from. When non-empty, the
table is resolved through the directory namespace client.
Previously called ``namespace`` in 0.30.2 and earlier.
Returns
-------
@@ -1001,6 +1026,7 @@ class LanceDBConnection(DBConnection):
target_namespace_path: List[str], optional
The namespace for the target table.
None or empty list represents root namespace.
Previously called ``target_namespace`` in 0.30.2 and earlier.
source_version: int, optional
The version of the source table to clone.
source_tag: str, optional
@@ -1046,6 +1072,7 @@ class LanceDBConnection(DBConnection):
The name of the table.
namespace_path: List[str], optional
The namespace to drop the table from.
Previously called ``namespace`` in 0.30.2 and earlier.
ignore_missing: bool, default False
If True, ignore if the table does not exist.
"""
@@ -1084,8 +1111,10 @@ class LanceDBConnection(DBConnection):
The new name of the table.
cur_namespace_path: List[str], optional
The namespace of the current table.
Previously called ``cur_namespace`` in 0.30.2 and earlier.
new_namespace_path: List[str], optional
The namespace to move the table to.
Previously called ``new_namespace`` in 0.30.2 and earlier.
"""
if cur_namespace_path is None:
cur_namespace_path = []
@@ -1208,6 +1237,7 @@ class AsyncConnection(object):
namespace_path: List[str], optional
The parent namespace to list namespaces in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
The token to use for pagination. If not present, start from the beginning.
limit: int, optional
@@ -1237,6 +1267,7 @@ class AsyncConnection(object):
----------
namespace_path: List[str]
The namespace identifier to create.
Previously called ``namespace`` in 0.30.2 and earlier.
mode: str, optional
Creation mode - "create", "exist_ok", or "overwrite". Case insensitive.
properties: Dict[str, str], optional
@@ -1266,6 +1297,7 @@ class AsyncConnection(object):
----------
namespace_path: List[str]
The namespace identifier to drop.
Previously called ``namespace`` in 0.30.2 and earlier.
mode: str, optional
Whether to skip if not exists ("SKIP") or fail ("FAIL"). Case insensitive.
behavior: str, optional
@@ -1293,6 +1325,7 @@ class AsyncConnection(object):
----------
namespace_path: List[str]
The namespace identifier to describe.
Previously called ``namespace`` in 0.30.2 and earlier.
Returns
-------
@@ -1315,6 +1348,7 @@ class AsyncConnection(object):
namespace_path: List[str], optional
The namespace to list tables in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -1350,6 +1384,7 @@ class AsyncConnection(object):
namespace_path: List[str], optional
The namespace to list tables in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
start_after: str, optional
If present, only return names that come lexicographically after the supplied
value.
@@ -1390,6 +1425,7 @@ class AsyncConnection(object):
namespace_path: Optional[List[str]] = None,
embedding_functions: Optional[List[EmbeddingFunctionConfig]] = None,
location: Optional[str] = None,
namespace_client: Optional[Any] = None,
) -> AsyncTable:
"""Create an [AsyncTable][lancedb.table.AsyncTable] in the database.
@@ -1400,6 +1436,7 @@ class AsyncConnection(object):
namespace_path: List[str], default []
The namespace to create the table in.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
data: The data to initialize the table, *optional*
User must provide at least one of `data` or `schema`.
Acceptable types are:
@@ -1587,6 +1624,7 @@ class AsyncConnection(object):
namespace_path=namespace_path,
storage_options=storage_options,
location=location,
namespace_client=namespace_client,
)
else:
data = data_to_reader(data, schema)
@@ -1597,6 +1635,7 @@ class AsyncConnection(object):
namespace_path=namespace_path,
storage_options=storage_options,
location=location,
namespace_client=namespace_client,
)
return AsyncTable(new_table)
@@ -1621,6 +1660,7 @@ class AsyncConnection(object):
namespace_path: List[str], optional
The namespace to open the table from.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
storage_options: dict, optional
Additional options for the storage backend. Options already set on the
connection will be inherited by the table, but can be overridden here.
@@ -1690,6 +1730,7 @@ class AsyncConnection(object):
target_namespace_path: List[str], optional
The namespace for the target table.
None or empty list represents root namespace.
Previously called ``target_namespace`` in 0.30.2 and earlier.
source_version: int, optional
The version of the source table to clone.
source_tag: str, optional
@@ -1732,9 +1773,11 @@ class AsyncConnection(object):
cur_namespace_path: List[str], optional
The namespace of the current table.
None or empty list represents root namespace.
Previously called ``cur_namespace`` in 0.30.2 and earlier.
new_namespace_path: List[str], optional
The namespace to move the table to.
If not specified, defaults to the same as cur_namespace.
If not specified, defaults to the same as cur_namespace_path.
Previously called ``new_namespace`` in 0.30.2 and earlier.
"""
if cur_namespace_path is None:
cur_namespace_path = []
@@ -1763,6 +1806,7 @@ class AsyncConnection(object):
namespace_path: List[str], default []
The namespace to drop the table from.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
ignore_missing: bool, default False
If True, ignore if the table does not exist.
"""
@@ -1784,6 +1828,7 @@ class AsyncConnection(object):
namespace_path: List[str], optional
The namespace to drop all tables from.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
"""
if namespace_path is None:
namespace_path = []
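
The `serialize` and `deserialize_conn` hunks add two keys and read them back with `.get(...)`, which suggests that payloads written before this change should still load. A minimal sketch of that round trip follows, assuming the payload is JSON and trimming it to the relevant keys; `serialize_local` and `connect_kwargs` are hypothetical helpers, not functions from the library.

```python
import json
from typing import Any, Dict, Optional


def serialize_local(
    uri: str,
    manifest_enabled: bool = False,
    namespace_client_properties: Optional[Dict[str, str]] = None,
) -> str:
    # Trimmed-down payload; the real one carries more keys.
    return json.dumps(
        {
            "connection_type": "local",
            "uri": uri,
            "manifest_enabled": manifest_enabled,
            "namespace_client_properties": namespace_client_properties,
        }
    )


def connect_kwargs(payload: str) -> Dict[str, Any]:
    parsed = json.loads(payload)
    return {
        "uri": parsed["uri"],
        # Missing keys fall back to the pre-change behaviour.
        "manifest_enabled": parsed.get("manifest_enabled", False),
        "namespace_client_properties": parsed.get("namespace_client_properties"),
    }


# A payload shaped like one written before the new keys existed.
legacy_payload = json.dumps({"connection_type": "local", "uri": "/tmp/db"})
legacy_kwargs = connect_kwargs(legacy_payload)
```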


@@ -1,201 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Full text search index using tantivy-py"""
import os
from typing import List, Tuple, Optional
import pyarrow as pa
try:
import tantivy
except ImportError:
raise ImportError(
"Please install tantivy-py `pip install tantivy` to use the full text search feature." # noqa: E501
)
from .table import LanceTable
def create_index(
index_path: str,
text_fields: List[str],
ordering_fields: Optional[List[str]] = None,
tokenizer_name: str = "default",
) -> tantivy.Index:
"""
Create a new Index (not populated)
Parameters
----------
index_path : str
Path to the index directory
text_fields : List[str]
List of text fields to index
ordering_fields: List[str]
List of unsigned type fields to order by at search time
tokenizer_name : str, default "default"
The tokenizer to use
Returns
-------
index : tantivy.Index
The index object (not yet populated)
"""
if ordering_fields is None:
ordering_fields = []
# Declaring our schema.
schema_builder = tantivy.SchemaBuilder()
# special field that we'll populate with row_id
schema_builder.add_integer_field("doc_id", stored=True)
# data fields
for name in text_fields:
schema_builder.add_text_field(name, stored=True, tokenizer_name=tokenizer_name)
if ordering_fields:
for name in ordering_fields:
schema_builder.add_unsigned_field(name, fast=True)
schema = schema_builder.build()
os.makedirs(index_path, exist_ok=True)
index = tantivy.Index(schema, path=index_path)
return index
def populate_index(
index: tantivy.Index,
table: LanceTable,
fields: List[str],
writer_heap_size: Optional[int] = None,
ordering_fields: Optional[List[str]] = None,
) -> int:
"""
Populate an index with data from a LanceTable
Parameters
----------
index : tantivy.Index
The index object
table : LanceTable
The table to index
fields : List[str]
List of fields to index
writer_heap_size : int
The writer heap size in bytes, defaults to 1GB
Returns
-------
int
The number of rows indexed
"""
if ordering_fields is None:
ordering_fields = []
writer_heap_size = writer_heap_size or 1024 * 1024 * 1024
# first check the fields exist and are string or large string type
nested = []
for name in fields:
try:
f = table.schema.field(name) # raises KeyError if not found
except KeyError:
f = resolve_path(table.schema, name)
nested.append(name)
if not pa.types.is_string(f.type) and not pa.types.is_large_string(f.type):
raise TypeError(f"Field {name} is not a string type")
# create a tantivy writer
writer = index.writer(heap_size=writer_heap_size)
# write data into index
dataset = table.to_lance()
row_id = 0
max_nested_level = 0
if len(nested) > 0:
max_nested_level = max([len(name.split(".")) for name in nested])
for b in dataset.to_batches(columns=fields + ordering_fields):
if max_nested_level > 0:
b = pa.Table.from_batches([b])
for _ in range(max_nested_level - 1):
b = b.flatten()
for i in range(b.num_rows):
doc = tantivy.Document()
for name in fields:
value = b[name][i].as_py()
if value is not None:
doc.add_text(name, value)
for name in ordering_fields:
value = b[name][i].as_py()
if value is not None:
doc.add_unsigned(name, value)
if not doc.is_empty:
doc.add_integer("doc_id", row_id)
writer.add_document(doc)
row_id += 1
# commit changes
writer.commit()
return row_id
def resolve_path(schema, field_name: str) -> pa.Field:
"""
Resolve a nested field path to a list of field names
Parameters
----------
field_name : str
The field name to resolve
Returns
-------
List[str]
The resolved path
"""
path = field_name.split(".")
field = schema.field(path.pop(0))
for segment in path:
if pa.types.is_struct(field.type):
field = field.type.field(segment)
else:
raise KeyError(f"field {field_name} not found in schema {schema}")
return field
def search_index(
index: tantivy.Index, query: str, limit: int = 10, ordering_field=None
) -> Tuple[Tuple[int], Tuple[float]]:
"""
Search an index for a query
Parameters
----------
index : tantivy.Index
The index object
query : str
The query string
limit : int
The maximum number of results to return
Returns
-------
ids_and_score: list[tuple[int], tuple[float]]
A tuple of two tuples, the first containing the document ids
and the second containing the scores
"""
searcher = index.searcher()
query = index.parse_query(query)
# get top results
if ordering_field:
results = searcher.search(query, limit, order_by_field=ordering_field)
else:
results = searcher.search(query, limit)
if results.count == 0:
return tuple(), tuple()
return tuple(
zip(
*[
(searcher.doc(doc_address)["doc_id"][0], score)
for score, doc_address in results.hits
]
)
)


@@ -7,6 +7,7 @@ from typing import Literal, Optional
from ._lancedb import (
IndexConfig,
)
from .types import BaseTokenizerType
lang_mapping = {
"ar": "Arabic",
@@ -111,8 +112,12 @@ class FTS:
- "simple": Splits text by whitespace and punctuation.
- "whitespace": Split text by whitespace, but not punctuation.
- "raw": No tokenization. The entire text is treated as a single token.
- "ngram": N-gram tokenizer for substring-style matching.
- "jieba/*": Jieba tokenizer loaded from Lance's language model home.
- "lindera/*": Lindera tokenizer loaded from Lance's language model home.
language : str, default "English"
The language to use for tokenization.
The language to use for stemming and stop-word removal. This is not the
primary way to enable CJK tokenization.
max_token_length : int, default 40
The maximum token length to index. Tokens longer than this length will be
ignored.
@@ -127,10 +132,17 @@ class FTS:
ascii_folding : bool, default True
Whether to fold ASCII characters. This converts accented characters to
their ASCII equivalent. For example, "café" would be converted to "cafe".
Notes
-----
Model-backed tokenizers such as ``jieba/default`` and ``lindera/ipadic``
require tokenizer models in Lance's language model home. Set
``LANCE_LANGUAGE_MODEL_HOME`` to override the default platform data
directory under ``lance/language_models``.
"""
with_position: bool = False
base_tokenizer: Literal["simple", "raw", "whitespace"] = "simple"
base_tokenizer: BaseTokenizerType = "simple"
language: str = "English"
max_token_length: Optional[int] = 40
lower_case: bool = True
@@ -376,9 +388,98 @@ class HnswSq:
target_partition_size: Optional[int] = None
@dataclass
class HnswFlat:
"""Describe a HNSW-FLAT index configuration.
HNSW-FLAT stands for Hierarchical Navigable Small World without quantization.
It stores raw vectors in the HNSW graph, providing the highest recall among
the IVF_HNSW family at the cost of more memory and disk space compared to
:class:`HnswSq` or :class:`HnswPq`.
Parameters
----------
distance_type: str, default "l2"
The distance metric used to train the index.
The following distance types are available:
"l2" - Euclidean distance. This is a very common distance metric that
accounts for both magnitude and direction when determining the distance
between vectors. l2 distance has a range of [0, ∞).
"cosine" - Cosine distance. Cosine distance is a distance metric
calculated from the cosine similarity between two vectors. Cosine
similarity is a measure of similarity between two non-zero vectors of an
inner product space. It is defined to equal the cosine of the angle
between them. Unlike l2, the cosine distance is not affected by the
magnitude of the vectors. Cosine distance has a range of [0, 2].
"dot" - Dot product. Dot distance is the dot product of two vectors. Dot
distance has a range of (-∞, ∞). If the vectors are normalized (i.e. their
l2 norm is 1), then dot distance is equivalent to the cosine distance.
num_partitions, default sqrt(num_rows)
The number of IVF partitions to create.
For HNSW, we recommend a small number of partitions. Setting this to 1
works well for most tables. For very large tables, training just one HNSW
graph will require too much memory. Each partition becomes its own HNSW
graph, so setting this value higher reduces the peak memory use of
training.
max_iterations, default 50
Max iterations to train kmeans.
When training an IVF index we use kmeans to calculate the partitions.
This parameter controls how many iterations of kmeans to run.
sample_rate, default 256
The rate used to calculate the number of training vectors for kmeans.
m, default 20
The number of neighbors to select for each vector in the HNSW graph.
This value controls the tradeoff between search speed and accuracy.
The higher the value the more accurate the search but the slower it
will be.
ef_construction, default 300
The number of candidates to evaluate during the construction of the HNSW
graph.
This value controls the tradeoff between build speed and accuracy.
The higher the value the more accurate the build but the slower it will
be. 150 to 300 is the typical range. 100 is a minimum for good quality
search results. In most cases, there is no benefit to setting this higher
than 500. This value should be set to a value that is not less than `ef`
in the search phase.
target_partition_size, default is 1,048,576
The target size of each partition.
"""
distance_type: Literal["l2", "cosine", "dot"] = "l2"
num_partitions: Optional[int] = None
max_iterations: int = 50
sample_rate: int = 256
m: int = 20
ef_construction: int = 300
target_partition_size: Optional[int] = None
# Backwards-compatible aliases
IvfHnswPq = HnswPq
IvfHnswSq = HnswSq
IvfHnswFlat = HnswFlat
@dataclass
@@ -698,11 +799,13 @@ __all__ = [
"IvfPq",
"IvfHnswPq",
"IvfHnswSq",
"IvfHnswFlat",
"IvfSq",
"IvfRq",
"IvfFlat",
"HnswPq",
"HnswSq",
"HnswFlat",
"IndexConfig",
"FTS",
"Bitmap",
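
To illustrate how the backwards-compatible alias behaves, the sketch below uses a stand-in dataclass that copies the field names and defaults shown in the diff. It is not imported from `lancedb`, so nothing here exercises the real index builder.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class HnswFlat:
    """Stand-in with the same fields and defaults as the diff shows."""

    distance_type: Literal["l2", "cosine", "dot"] = "l2"
    num_partitions: Optional[int] = None
    max_iterations: int = 50
    sample_rate: int = 256
    m: int = 20
    ef_construction: int = 300
    target_partition_size: Optional[int] = None


# Same alias pattern as the diff: both names refer to one class object.
IvfHnswFlat = HnswFlat

# A single partition, which the docstring says works well for most tables.
config = IvfHnswFlat(distance_type="cosine", num_partitions=1)
```

Because the alias is a plain assignment, `isinstance` checks written against either name accept instances created through the other.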


@@ -10,7 +10,6 @@ through a namespace abstraction.
from __future__ import annotations
import asyncio
import sys
from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Optional, Union
@@ -25,7 +24,24 @@ if TYPE_CHECKING:
from datetime import timedelta
import pyarrow as pa
from lancedb.db import DBConnection, LanceDBConnection
from lance_namespace_urllib3_client.models.json_arrow_data_type import JsonArrowDataType
from lance_namespace_urllib3_client.models.json_arrow_field import JsonArrowField
from lance_namespace_urllib3_client.models.json_arrow_schema import JsonArrowSchema
from lance_namespace_urllib3_client.models.query_table_request import QueryTableRequest
from lance_namespace_urllib3_client.models.query_table_request_columns import (
QueryTableRequestColumns,
)
from lance_namespace_urllib3_client.models.query_table_request_full_text_query import (
QueryTableRequestFullTextQuery,
)
from lance_namespace_urllib3_client.models.query_table_request_vector import (
QueryTableRequestVector,
)
from lance_namespace_urllib3_client.models.string_fts_query import StringFtsQuery
from lance_namespace.errors import TableNotFoundError
from lancedb._lancedb import connect_namespace_client as _connect_namespace_client
from lancedb.background_loop import LOOP
from lancedb.db import AsyncConnection, DBConnection
from lancedb.namespace_utils import (
_normalize_create_namespace_mode,
_normalize_drop_namespace_mode,
@@ -40,14 +56,11 @@ from lance_namespace import (
ListNamespacesResponse,
ListTablesResponse,
ListTablesRequest,
DescribeTableRequest,
DescribeNamespaceRequest,
DropTableRequest,
ListNamespacesRequest,
CreateNamespaceRequest,
DropNamespaceRequest,
DeclareTableRequest,
CreateTableRequest,
)
from lancedb.table import AsyncTable, LanceTable, Table
from lancedb.util import validate_table_name
@@ -56,21 +69,6 @@ from lancedb.pydantic import LanceModel
from lancedb.embeddings import EmbeddingFunctionConfig
from ._lancedb import Session
from lance_namespace_urllib3_client.models.json_arrow_schema import JsonArrowSchema
from lance_namespace_urllib3_client.models.json_arrow_field import JsonArrowField
from lance_namespace_urllib3_client.models.json_arrow_data_type import JsonArrowDataType
from lance_namespace_urllib3_client.models.query_table_request import QueryTableRequest
from lance_namespace_urllib3_client.models.query_table_request_vector import (
QueryTableRequestVector,
)
from lance_namespace_urllib3_client.models.query_table_request_columns import (
QueryTableRequestColumns,
)
from lance_namespace_urllib3_client.models.query_table_request_full_text_query import (
QueryTableRequestFullTextQuery,
)
from lance_namespace_urllib3_client.models.string_fts_query import StringFtsQuery
def _query_to_namespace_request(
table_id: List[str],
@@ -203,7 +201,8 @@ def _execute_server_side_query(
Parameters
----------
namespace_client : LanceNamespace
The namespace client to use
The namespace client to use.
Previously called ``namespace`` in 0.30.2 and earlier.
table_id : List[str]
The table identifier (namespace path + table name)
query : Query
@@ -390,7 +389,8 @@ class LanceNamespaceDBConnection(DBConnection):
Parameters
----------
namespace_client : LanceNamespace
The namespace client to use for table management
The namespace client to use for table management.
Previously called ``namespace`` in 0.30.2 and earlier.
read_consistency_interval : Optional[timedelta]
The interval at which to check for updates to the table from other
processes. If None, then consistency is not checked.
@@ -424,6 +424,23 @@ class LanceNamespaceDBConnection(DBConnection):
)
self._namespace_client_impl = namespace_client_impl
self._namespace_client_properties = namespace_client_properties
self._inner = AsyncConnection(
_connect_namespace_client(
namespace_client,
read_consistency_interval=(
read_consistency_interval.total_seconds()
if read_consistency_interval is not None
else None
),
storage_options=self.storage_options or None,
session=session,
namespace_client_pushdown_operations=(
list(self._namespace_client_pushdown_operations)
),
namespace_client_impl=namespace_client_impl,
namespace_client_properties=namespace_client_properties,
)
)
@override
def serialize(self) -> str:
@@ -497,13 +514,10 @@ class LanceNamespaceDBConnection(DBConnection):
if mode.lower() not in ["create", "overwrite"]:
raise ValueError("mode must be either 'create' or 'overwrite'")
validate_table_name(name)
table_id = namespace_path + [name]
if "CreateTable" in self._namespace_client_pushdown_operations:
return self._create_table_server_side(
name=name,
data=data,
async_table = LOOP.run(
self._inner.create_table(
name,
data,
schema=schema,
mode=mode,
exist_ok=exist_ok,
@@ -513,130 +527,15 @@ class LanceNamespaceDBConnection(DBConnection):
namespace_path=namespace_path,
storage_options=storage_options,
)
# Local create path: declare_table + local write
# Step 1: Get the table location and storage options from namespace
# In overwrite mode, if table exists, use describe_table to get
# existing location. Otherwise, call create_empty_table to reserve
# a new location
location = None
namespace_storage_options = None
if mode.lower() == "overwrite":
# Try to describe the table first to see if it exists
try:
describe_request = DescribeTableRequest(id=table_id)
describe_response = self._namespace_client.describe_table(
describe_request
)
location = describe_response.location
namespace_storage_options = describe_response.storage_options
except Exception:
# Table doesn't exist, will create a new one below
pass
if location is None:
# Table doesn't exist or mode is "create", reserve a new location
declare_request = DeclareTableRequest(
id=table_id,
location=None,
properties=self.storage_options if self.storage_options else None,
)
declare_response = self._namespace_client.declare_table(declare_request)
if not declare_response.location:
raise ValueError(
"Table location is missing from declare_table response"
)
location = declare_response.location
namespace_storage_options = declare_response.storage_options
# Merge storage options: self.storage_options < user options < namespace options
merged_storage_options = dict(self.storage_options)
if storage_options:
merged_storage_options.update(storage_options)
if namespace_storage_options:
merged_storage_options.update(namespace_storage_options)
# Step 2: Create table using LanceTable.create with the location
# We need a temporary connection for the LanceTable.create method
temp_conn = LanceDBConnection(
location, # Use the actual location as the connection URI
read_consistency_interval=self.read_consistency_interval,
storage_options=merged_storage_options,
session=self.session,
)
# Note: storage_options_provider is auto-created in Rust from namespace_client
tbl = LanceTable.create(
temp_conn,
return LanceTable(
self,
name,
data,
schema,
mode=mode,
exist_ok=exist_ok,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
embedding_functions=embedding_functions,
namespace_path=namespace_path,
storage_options=merged_storage_options,
location=location,
namespace_client=self._namespace_client,
pushdown_operations=self._namespace_client_pushdown_operations,
)
return tbl
def _create_table_server_side(
self,
name: str,
data: Optional[DATA],
schema: Optional[Union[pa.Schema, LanceModel]],
mode: str,
exist_ok: bool,
on_bad_vectors: str,
fill_value: float,
embedding_functions: Optional[List[EmbeddingFunctionConfig]],
namespace_path: Optional[List[str]],
storage_options: Optional[Dict[str, str]],
) -> Table:
"""Create a table using server-side namespace.create_table()."""
if namespace_path is None:
namespace_path = []
table_id = namespace_path + [name]
arrow_ipc_bytes = _data_to_arrow_ipc(
data=data,
schema=schema,
embedding_functions=embedding_functions,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
)
merged = dict(self.storage_options or {})
if storage_options:
merged.update(storage_options)
request = CreateTableRequest(
id=table_id,
mode=_normalize_create_table_mode(mode),
properties=merged or None,
)
try:
self._namespace_client.create_table(request, arrow_ipc_bytes)
except Exception as e:
if exist_ok and "already exists" in str(e).lower():
return self.open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
)
raise
return self.open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
_async=async_table,
)
@override
@@ -650,30 +549,28 @@ class LanceNamespaceDBConnection(DBConnection):
) -> Table:
if namespace_path is None:
namespace_path = []
table_id = namespace_path + [name]
request = DescribeTableRequest(id=table_id)
response = self._namespace_client.describe_table(request)
try:
async_table = LOOP.run(
self._inner.open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
index_cache_size=index_cache_size,
)
)
except RuntimeError as e:
if "Table not found" in str(e):
table_id = namespace_path + [name]
raise TableNotFoundError(f"Table not found: {'$'.join(table_id)}")
raise
# Merge storage options: self.storage_options < user options < namespace options
merged_storage_options = dict(self.storage_options)
if storage_options:
merged_storage_options.update(storage_options)
if response.storage_options:
merged_storage_options.update(response.storage_options)
# Pass managed_versioning to avoid redundant describe_table call in Rust.
# Convert None to False since we already have the answer from describe_table.
managed_versioning = response.managed_versioning is True
# Note: storage_options_provider is auto-created in Rust from namespace_client
return self._lance_table_from_uri(
return LanceTable(
self,
name,
response.location,
namespace_path=namespace_path,
storage_options=merged_storage_options,
index_cache_size=index_cache_size,
namespace_client=self._namespace_client,
managed_versioning=managed_versioning,
pushdown_operations=self._namespace_client_pushdown_operations,
_async=async_table,
)
@override
@@ -729,6 +626,8 @@ class LanceNamespaceDBConnection(DBConnection):
namespace_path : Optional[List[str]]
The parent namespace path to list children from.
If None, lists root-level namespaces.
*Changed in version 0.31.0: renamed from* ``namespace``.
page_token : Optional[str]
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -897,33 +796,34 @@ class LanceNamespaceDBConnection(DBConnection):
namespace_client: Optional[Any] = None,
managed_versioning: Optional[bool] = None,
) -> LanceTable:
# Open a table directly from a URI using the location parameter
# Note: storage_options should already be merged by the caller
# Note: storage_options_provider is auto-created in Rust from namespace_client
# Open a table directly from the namespace-resolved physical location.
#
# Open the table through the Rust namespace-backed connection. The Rust
# layer keeps the logical namespace path and namespace client intact.
if namespace_path is None:
namespace_path = []
temp_conn = LanceDBConnection(
table_uri, # Use the table location as the connection URI
read_consistency_interval=self.read_consistency_interval,
storage_options=storage_options if storage_options is not None else {},
session=self.session,
async_table = LOOP.run(
self._inner.open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
index_cache_size=index_cache_size,
location=None,
namespace_client=namespace_client,
managed_versioning=managed_versioning,
)
)
# Open the table using the temporary connection with the location parameter
# Pass namespace_client to enable managed versioning support and auto-create
# storage options provider
# Pass managed_versioning to avoid redundant describe_table call
# Pass pushdown_operations if configured on this connection
return LanceTable.open(
temp_conn,
return LanceTable(
self,
name,
namespace_path=namespace_path,
storage_options=storage_options,
index_cache_size=index_cache_size,
location=table_uri,
namespace_client=namespace_client,
managed_versioning=managed_versioning,
pushdown_operations=self._namespace_client_pushdown_operations,
_async=async_table,
)
@override
@@ -964,7 +864,8 @@ class AsyncLanceNamespaceDBConnection:
Parameters
----------
namespace_client : LanceNamespace
The namespace client to use for table management
The namespace client to use for table management.
Previously called ``namespace`` in 0.30.2 and earlier.
read_consistency_interval : Optional[timedelta]
The interval at which to check for updates to the table from other
processes. If None, then consistency is not checked.
@@ -990,6 +891,23 @@ class AsyncLanceNamespaceDBConnection:
self._namespace_client_pushdown_operations = set(
namespace_client_pushdown_operations or []
)
self._inner = AsyncConnection(
_connect_namespace_client(
namespace_client,
read_consistency_interval=(
read_consistency_interval.total_seconds()
if read_consistency_interval is not None
else None
),
storage_options=self.storage_options or None,
session=session,
namespace_client_pushdown_operations=(
list(self._namespace_client_pushdown_operations)
),
namespace_client_impl=None,
namespace_client_properties=None,
)
)
async def table_names(
self,
@@ -1041,148 +959,16 @@ class AsyncLanceNamespaceDBConnection:
if mode.lower() not in ["create", "overwrite"]:
raise ValueError("mode must be either 'create' or 'overwrite'")
validate_table_name(name)
table_id = namespace_path + [name]
if "CreateTable" in self._namespace_client_pushdown_operations:
return await self._create_table_server_side(
name=name,
data=data,
schema=schema,
mode=mode,
exist_ok=exist_ok,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
embedding_functions=embedding_functions,
namespace_path=namespace_path,
storage_options=storage_options,
)
# Local create path: declare_table + local write
# Step 1: Get the table location and storage options from namespace
location = None
namespace_storage_options = None
if mode.lower() == "overwrite":
# Try to describe the table first to see if it exists
try:
describe_request = DescribeTableRequest(id=table_id)
describe_response = self._namespace_client.describe_table(
describe_request
)
location = describe_response.location
namespace_storage_options = describe_response.storage_options
except Exception:
# Table doesn't exist, will create a new one below
pass
if location is None:
# Table doesn't exist or mode is "create", reserve a new location
declare_request = DeclareTableRequest(
id=table_id,
location=None,
properties=self.storage_options if self.storage_options else None,
)
declare_response = self._namespace_client.declare_table(declare_request)
if not declare_response.location:
raise ValueError(
"Table location is missing from declare_table response"
)
location = declare_response.location
namespace_storage_options = declare_response.storage_options
# Merge storage options: self.storage_options < user options < namespace options
merged_storage_options = dict(self.storage_options)
if storage_options:
merged_storage_options.update(storage_options)
if namespace_storage_options:
merged_storage_options.update(namespace_storage_options)
# Step 2: Create table using LanceTable.create with the location
# Run the sync operation in a thread
def _create_table():
temp_conn = LanceDBConnection(
location,
read_consistency_interval=self.read_consistency_interval,
storage_options=merged_storage_options,
session=self.session,
)
# storage_options_provider is auto-created in Rust from namespace_client
return LanceTable.create(
temp_conn,
name,
data,
schema,
mode=mode,
exist_ok=exist_ok,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
embedding_functions=embedding_functions,
namespace_path=namespace_path,
storage_options=merged_storage_options,
location=location,
namespace_client=self._namespace_client,
pushdown_operations=self._namespace_client_pushdown_operations,
)
lance_table = await asyncio.to_thread(_create_table)
# Get the underlying async table from LanceTable
return lance_table._table
async def _create_table_server_side(
self,
name: str,
data: Optional[DATA],
schema: Optional[Union[pa.Schema, LanceModel]],
mode: str,
exist_ok: bool,
on_bad_vectors: str,
fill_value: float,
embedding_functions: Optional[List[EmbeddingFunctionConfig]],
namespace_path: Optional[List[str]],
storage_options: Optional[Dict[str, str]],
) -> AsyncTable:
"""Create a table using server-side namespace.create_table()."""
if namespace_path is None:
namespace_path = []
table_id = namespace_path + [name]
def _prepare_and_create():
arrow_ipc_bytes = _data_to_arrow_ipc(
data=data,
schema=schema,
embedding_functions=embedding_functions,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
)
merged = dict(self.storage_options or {})
if storage_options:
merged.update(storage_options)
request = CreateTableRequest(
id=table_id,
mode=_normalize_create_table_mode(mode),
properties=merged or None,
)
self._namespace_client.create_table(request, arrow_ipc_bytes)
try:
await asyncio.to_thread(_prepare_and_create)
except Exception as e:
if exist_ok and "already exists" in str(e).lower():
return await self.open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
)
raise
return await self.open_table(
return await self._inner.create_table(
name,
data,
schema=schema,
mode=mode,
exist_ok=exist_ok,
on_bad_vectors=on_bad_vectors,
fill_value=fill_value,
namespace_path=namespace_path,
embedding_functions=embedding_functions,
storage_options=storage_options,
)
@@ -1197,45 +983,18 @@ class AsyncLanceNamespaceDBConnection:
"""Open an existing table from the namespace."""
if namespace_path is None:
namespace_path = []
table_id = namespace_path + [name]
request = DescribeTableRequest(id=table_id)
response = self._namespace_client.describe_table(request)
# Merge storage options: self.storage_options < user options < namespace options
merged_storage_options = dict(self.storage_options)
if storage_options:
merged_storage_options.update(storage_options)
if response.storage_options:
merged_storage_options.update(response.storage_options)
# Capture managed_versioning from describe response.
# Convert None to False since we already have the answer from describe_table.
managed_versioning = response.managed_versioning is True
# Open table in a thread
# Note: storage_options_provider is auto-created in Rust from namespace_client
def _open_table():
temp_conn = LanceDBConnection(
response.location,
read_consistency_interval=self.read_consistency_interval,
storage_options=merged_storage_options,
session=self.session,
)
return LanceTable.open(
temp_conn,
try:
return await self._inner.open_table(
name,
namespace_path=namespace_path,
storage_options=merged_storage_options,
storage_options=storage_options,
index_cache_size=index_cache_size,
location=response.location,
namespace_client=self._namespace_client,
managed_versioning=managed_versioning,
pushdown_operations=self._namespace_client_pushdown_operations,
)
lance_table = await asyncio.to_thread(_open_table)
return lance_table._table
except RuntimeError as e:
if "Table not found" in str(e):
table_id = namespace_path + [name]
raise TableNotFoundError(f"Table not found: {'$'.join(table_id)}")
raise
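Note on the error translation above: the new `open_table` path catches the generic `RuntimeError` coming from the inner connection and re-raises it as `TableNotFoundError` with a `$`-joined table id. A minimal, self-contained sketch of that pattern follows. `TableNotFoundError` here is a local stand-in class, and `open_or_translate` and `failing_open` are hypothetical names used only for illustration; none of this is the library's own code.

```python
from typing import Callable, List, Optional


class TableNotFoundError(Exception):
    """Local stand-in for the library's error type (assumption)."""


def open_or_translate(
    open_fn: Callable[[str, List[str]], object],
    name: str,
    namespace_path: Optional[List[str]] = None,
) -> object:
    """Call open_fn and translate a 'Table not found' RuntimeError."""
    if namespace_path is None:
        namespace_path = []
    try:
        return open_fn(name, namespace_path)
    except RuntimeError as e:
        if "Table not found" in str(e):
            table_id = namespace_path + [name]
            # Chain explicitly so the original backend error stays visible.
            raise TableNotFoundError(
                f"Table not found: {'$'.join(table_id)}"
            ) from e
        # Any other RuntimeError is passed through unchanged.
        raise


def failing_open(name: str, namespace_path: List[str]) -> object:
    # Hypothetical backend that always reports a missing table.
    raise RuntimeError("Table not found: backend detail")


try:
    open_or_translate(failing_open, "events", ["prod", "logs"])
    message = None
except TableNotFoundError as e:
    message = str(e)
```

The sketch uses `raise ... from e`, which the diff does not; inside an `except` block Python chains the exceptions implicitly either way, so this is a stylistic choice.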
async def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):
"""Drop a table from the namespace."""
@@ -1289,6 +1048,8 @@ class AsyncLanceNamespaceDBConnection:
namespace_path : Optional[List[str]]
The parent namespace path to list children from.
If None, lists root-level namespaces.
*Changed in version 0.31.0: renamed from* ``namespace``.
page_token : Optional[str]
Token for pagination. Use the token from a previous response
to get the next page of results.


@@ -1,11 +1,12 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
from deprecation import deprecated
from lancedb import AsyncConnection, DBConnection
import pyarrow as pa
import copy
import json
from deprecation import deprecated
import pyarrow as pa
from ._lancedb import async_permutation_builder, PermutationReader
from .table import LanceTable
from .background_loop import LOOP
@@ -36,10 +37,7 @@ class PermutationBuilder:
be referenced by name in the future. If names are not provided then they can only
be referenced by their ordinal index. There is no requirement to name every split.
By default, the permutation will be stored in memory and will be lost when the
program exits. To persist the permutation (for very large datasets or to share
the permutation across multiple workers) use the [persist](#persist) method to
create a permanent table.
The permutation is stored in memory and will be lost when the program exits.
"""
def __init__(self, table: LanceTable):
@@ -51,15 +49,6 @@ class PermutationBuilder:
"""
self._async = async_permutation_builder(table)
def persist(
self, database: Union[DBConnection, AsyncConnection], table_name: str
) -> "PermutationBuilder":
"""
Persist the permutation to the given database.
"""
self._async.persist(database, table_name)
return self
def split_random(
self,
*,
@@ -380,20 +369,44 @@ class Permutation:
def __init__(
self,
reader: PermutationReader,
base_table: LanceTable,
permutation_table: Optional[LanceTable],
split: int,
selection: dict[str, str],
batch_size: int,
transform_fn: Callable[[pa.RecordBatch], Any],
offset: Optional[int] = None,
limit: Optional[int] = None,
connection_factory: Optional[Callable[[str], LanceTable]] = None,
_reader: Optional[PermutationReader] = None,
):
"""
Internal constructor. Use [from_tables](#from_tables) instead.
"""
assert reader is not None, "reader is required"
assert base_table is not None, "base_table is required"
assert selection is not None, "selection is required"
self.reader = reader
self.base_table = base_table
self.permutation_table = permutation_table
self.split = split
self.selection = selection
self.transform_fn = transform_fn
self.batch_size = batch_size
self.offset = offset
self.limit = limit
self.connection_factory = connection_factory
if _reader is None:
_reader = LOOP.run(self._build_reader())
self.reader: PermutationReader = _reader
async def _build_reader(self) -> PermutationReader:
reader = await PermutationReader.from_tables(
self.base_table, self.permutation_table, self.split
)
if self.offset is not None:
reader = await reader.with_offset(self.offset)
if self.limit is not None:
reader = await reader.with_limit(self.limit)
return reader
def _with_selection(self, selection: dict[str, str]) -> "Permutation":
"""
@@ -402,21 +415,97 @@ class Permutation:
Does no validation of the selection and replaces it entirely. This is not
intended for public use.
"""
return Permutation(self.reader, selection, self.batch_size, self.transform_fn)
def _with_reader(self, reader: PermutationReader) -> "Permutation":
"""
Creates a new permutation with the given reader
This is an internal method and should not be used directly.
"""
return Permutation(reader, self.selection, self.batch_size, self.transform_fn)
new = copy.copy(self)
new.selection = selection
return new
def with_batch_size(self, batch_size: int) -> "Permutation":
"""
Creates a new permutation with the given batch size
"""
return Permutation(self.reader, self.selection, batch_size, self.transform_fn)
new = copy.copy(self)
new.batch_size = batch_size
return new
def with_connection_factory(
self, connection_factory: Callable[[str], LanceTable]
) -> "Permutation":
"""
Creates a new permutation that will use ``connection_factory`` to reopen
the base table when this permutation is unpickled in a worker process.
The factory is a callable that takes a single argument, the base table
name, and returns a [LanceTable]. It must be picklable: it is serialized
with standard ``pickle`` and the worker calls it to recover the base
table. In practice, picklable callables are top-level (module-level)
functions, ``functools.partial`` of such functions, or instances of
picklable classes implementing ``__call__``. Lambdas and closures over
local variables can't be pickled by the standard ``pickle`` module.
Setting a factory is necessary when the URI alone is not enough to
re-open the connection — most importantly for LanceDB Cloud (``db://``)
connections, where ``api_key`` and ``region`` aren't recoverable from
the connection object after construction.
For local file or cloud-storage paths the factory is optional: if not
set, ``__getstate__`` falls back to capturing
``(uri, storage_options, namespace_path)`` and re-opening via
``lancedb.connect(uri, storage_options=...)``.
Examples
--------
Basic native (file-system path), parameterized via ``functools.partial``::
import functools, lancedb
from lancedb.permutation import Permutation
def open_native_table(uri: str, table_name: str):
return lancedb.connect(uri).open_table(table_name)
factory = functools.partial(open_native_table, "/data/lance_db")
permutation = Permutation.identity(
factory("training")
).with_connection_factory(factory)
Native via :func:`lancedb.connect_namespace` (e.g. a directory- or
REST-backed namespace client). The factory takes the
implementation name and properties dict as partial-bound args so
the worker can rebuild the same namespace connection::
def open_via_namespace(
impl: str, properties: dict[str, str], table_name: str,
):
return lancedb.connect_namespace(impl, properties).open_table(
table_name,
)
factory = functools.partial(
open_via_namespace,
"dir",
{"root": "/data/lance_db"},
)
LanceDB Cloud, reading credentials from env vars at worker startup
so secrets aren't pickled into the dataset::
import os, lancedb
def open_remote_table(table_name: str):
db = lancedb.connect(
"db://my-database",
api_key=os.environ["LANCEDB_API_KEY"],
region=os.environ.get("LANCEDB_REGION", "us-east-1"),
)
return db.open_table(table_name)
permutation = Permutation.identity(
open_remote_table("training")
).with_connection_factory(open_remote_table)
"""
assert connection_factory is not None, "connection_factory is required"
new = copy.copy(self)
new.connection_factory = connection_factory
return new
@classmethod
def identity(cls, table: LanceTable) -> "Permutation":
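Note on the picklability requirement in the `with_connection_factory` docstring: the difference between a `functools.partial` of a module-level function and a lambda can be checked with the standard library alone. The sketch below is illustrative only. `open_table` is a hypothetical stand-in that returns its arguments instead of opening a real table, so the example runs without a database.

```python
import functools
import pickle
from typing import Tuple


def open_table(uri: str, table_name: str) -> Tuple[str, str]:
    # Hypothetical stand-in for a "connect and open" helper. It only echoes
    # its arguments so that the round trip can be observed.
    return (uri, table_name)


# A partial of a module-level function is pickled by reference to the
# function plus the bound arguments.
factory = functools.partial(open_table, "/data/lance_db")
restored = pickle.loads(pickle.dumps(factory))
result = restored("training")

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda name: open_table("/data/lance_db", name))
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    lambda_pickles = False
```

Binding the URI with `functools.partial` keeps the factory's single-argument shape (table name only) while still letting the caller parameterize it.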
@@ -489,11 +578,126 @@ class Permutation:
schema = await reader.output_schema(None)
initial_selection = {name: name for name in schema.names}
return cls(
reader, initial_selection, DEFAULT_BATCH_SIZE, Transforms.arrow2python
base_table,
permutation_table,
split,
initial_selection,
DEFAULT_BATCH_SIZE,
Transforms.arrow2python,
_reader=reader,
)
return LOOP.run(do_from_tables())
def __getstate__(self) -> dict[str, Any]:
"""Build a picklable state dict for this permutation.
The base table is captured either via a user-supplied
``connection_factory`` (see [with_connection_factory]) or, as a
fallback, by introspecting ``(uri, storage_options, namespace_path)``
on the connection. The permutation table — always an in-memory
LanceDB table — is captured as a pyarrow Table (which pickles via
Arrow IPC natively). The reader is dropped from the wire format;
``__setstate__`` rebuilds it from the restored tables.
"""
permutation_data: Optional[pa.Table] = None
if self.permutation_table is not None:
permutation_data = self.permutation_table.to_arrow()
common = {
"base_table_name": self.base_table.name,
"permutation_data": permutation_data,
"split": self.split,
"selection": self.selection,
"batch_size": self.batch_size,
"transform_fn": self.transform_fn,
"offset": self.offset,
"limit": self.limit,
"connection_factory": self.connection_factory,
}
if self.connection_factory is not None:
# The factory carries enough state to recover the base table on
# its own; we don't need to capture the URI / storage options /
# namespace from the existing connection.
return common
# URI-introspection fallback: only viable for native (OSS) connections
# where (uri, storage_options) is enough to reopen. Remote / cloud
# connections don't expose recoverable api_key / region — those users
# must call with_connection_factory().
try:
base_uri = self.base_table._conn.uri
storage_options = self.base_table._conn.storage_options
except AttributeError as e:
raise ValueError(
"Cannot pickle this Permutation: the base table's connection "
"does not expose a uri/storage_options, which usually means it "
"is a remote (LanceDB Cloud) connection. Call "
"Permutation.with_connection_factory(...) first to provide a "
"picklable callable that re-opens the base table from a worker "
"process."
) from e
if base_uri.startswith("memory://"):
# In-memory base tables don't exist in any worker process by
# default, so dump the entire base table into the pickle. This
# can be expensive for large datasets — users with large
# in-memory base tables should either persist them or set a
# connection_factory.
return {
**common,
"base_table_data": self.base_table.to_arrow(),
}
return {
**common,
"base_table_uri": base_uri,
"base_table_namespace": self.base_table._namespace_path,
"base_table_storage_options": storage_options,
}
def __setstate__(self, state: dict[str, Any]) -> None:
from . import connect
connection_factory = state["connection_factory"]
if connection_factory is not None:
base_table = connection_factory(state["base_table_name"])
elif "base_table_data" in state:
# In-memory base table inlined into the pickle; rebuild the same
# way we rebuild the in-memory permutation table.
mem_db = connect("memory://")
base_table = mem_db.create_table(
state["base_table_name"], state["base_table_data"]
)
else:
base_db = connect(
state["base_table_uri"],
storage_options=state["base_table_storage_options"],
)
base_table = base_db.open_table(
state["base_table_name"],
namespace_path=state["base_table_namespace"] or None,
)
permutation_table: Optional[LanceTable] = None
if state["permutation_data"] is not None:
mem_db = connect("memory://")
permutation_table = mem_db.create_table(
"permutation", state["permutation_data"]
)
self.base_table = base_table
self.permutation_table = permutation_table
self.split = state["split"]
self.selection = state["selection"]
self.batch_size = state["batch_size"]
self.transform_fn = state["transform_fn"]
self.offset = state["offset"]
self.limit = state["limit"]
self.connection_factory = connection_factory
self.reader = LOOP.run(self._build_reader())
@property
def schema(self) -> pa.Schema:
async def do_output_schema():
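Note on the `__getstate__` / `__setstate__` pair above: the approach is to leave the live reader out of the pickled state and rebuild it on the receiving side from the fields that were kept. A minimal sketch of that pattern with only the standard library is shown here. The `Dataset` class, its `handle` attribute, and `_open` are hypothetical; a generator plays the role of the non-picklable reader.

```python
import pickle
from typing import Any, Dict, Optional


class Dataset:
    """Hypothetical class that holds a handle which cannot be pickled."""

    def __init__(self, path: str, limit: Optional[int] = None):
        self.path = path
        self.limit = limit
        self.handle = self._open()

    def _open(self):
        # Stand-in for a live resource; generators are not picklable.
        return (i for i in range(self.limit or 0))

    def __getstate__(self) -> Dict[str, Any]:
        # Keep only what is needed to rebuild; the handle is dropped.
        return {"path": self.path, "limit": self.limit}

    def __setstate__(self, state: Dict[str, Any]) -> None:
        self.path = state["path"]
        self.limit = state["limit"]
        # The handle is rebuilt from the restored fields.
        self.handle = self._open()


original = Dataset("/data/lance_db", limit=3)
restored = pickle.loads(pickle.dumps(original))
values = list(restored.handle)
```

Without the two hooks, `pickle.dumps(original)` would fail with a `TypeError` because the generator stored in `handle` cannot be serialized.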
@@ -760,7 +964,9 @@ class Permutation:
for expensive operations such as image decoding.
"""
assert transform is not None, "transform is required"
return Permutation(self.reader, self.selection, self.batch_size, transform)
new = copy.copy(self)
new.transform_fn = transform
return new
def __getitem__(self, index: int) -> Any:
"""
@@ -795,12 +1001,10 @@ class Permutation:
"""
Skip the first `skip` rows of the permutation
"""
async def do_with_skip():
reader = await self.reader.with_offset(skip)
return self._with_reader(reader)
return LOOP.run(do_with_skip())
new = copy.copy(self)
new.offset = skip
new.reader = LOOP.run(new._build_reader())
return new
@deprecated(details="Use with_take instead")
def take(self, limit: int) -> "Permutation":
@@ -818,12 +1022,10 @@ class Permutation:
"""
Limit the permutation to `limit` rows (following any `skip`)
"""
async def do_with_take():
reader = await self.reader.with_limit(limit)
return self._with_reader(reader)
return LOOP.run(do_with_take())
new = copy.copy(self)
new.limit = limit
new.reader = LOOP.run(new._build_reader())
return new
@deprecated(details="Use with_repeat instead")
def repeat(self, times: int) -> "Permutation":
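Note on the `with_*` methods in this file: they now make a shallow copy with `copy.copy`, change one field, and (for skip and take) rebuild the reader from the stored `offset` and `limit`. A small sketch of that copy-then-rebuild pattern follows. `Builder` and its `window` attribute are hypothetical; a `slice` stands in for the reader as the derived state.

```python
import copy
from typing import Optional


class Builder:
    """Hypothetical object with copy-on-write ``with_*`` methods."""

    def __init__(
        self,
        batch_size: int,
        offset: Optional[int] = None,
        limit: Optional[int] = None,
    ):
        self.batch_size = batch_size
        self.offset = offset
        self.limit = limit
        self.window = self._build_window()

    def _build_window(self) -> slice:
        # Derived state: the offset is applied first, then the limit.
        start = self.offset or 0
        stop = None if self.limit is None else start + self.limit
        return slice(start, stop)

    def with_batch_size(self, batch_size: int) -> "Builder":
        new = copy.copy(self)  # shallow copy; the original is untouched
        new.batch_size = batch_size
        return new

    def with_skip(self, skip: int) -> "Builder":
        new = copy.copy(self)
        new.offset = skip
        new.window = new._build_window()  # rebuild derived state
        return new

    def with_take(self, limit: int) -> "Builder":
        new = copy.copy(self)
        new.limit = limit
        new.window = new._build_window()
        return new


base = Builder(batch_size=64)
derived = base.with_skip(2).with_take(3).with_batch_size(8)
rows = list(range(10))[derived.window]
```

Because the derived state is rebuilt from stored fields, the offset is always applied before the limit regardless of call order. One point worth checking against the real class: in plain Python, `copy.copy` on a class that defines `__getstate__` and `__setstate__` but no `__copy__` goes through those hooks, so each shallow copy would run the same capture-and-restore logic that pickling does.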


@@ -25,7 +25,6 @@ import deprecation
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.fs as pa_fs
import pydantic
from lancedb.pydantic import PYDANTIC_VERSION
@@ -1526,9 +1525,7 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
return self._table._output_schema(self.to_query_object())
def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
path, fs, exist = self._table._get_fts_index_path()
if exist:
return self.tantivy_to_arrow()
self._table._ensure_no_legacy_fts_index()
query = self._query
if self._phrase_query:
@@ -1552,90 +1549,6 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
):
raise NotImplementedError("to_batches on an FTS query")
def tantivy_to_arrow(self) -> pa.Table:
try:
import tantivy
except ImportError:
raise ImportError(
"Please install tantivy-py `pip install tantivy` to use the full text search feature." # noqa: E501
)
from .fts import search_index
# get the index path
path, fs, exist = self._table._get_fts_index_path()
# check if the index exist
if not exist:
raise FileNotFoundError(
"Fts index does not exist. "
"Please first call table.create_fts_index(['<field_names>']) to "
"create the fts index."
)
# Check that we are on local filesystem
if not isinstance(fs, pa_fs.LocalFileSystem):
raise NotImplementedError(
"Tantivy-based full text search "
"is only supported on the local filesystem"
)
# open the index
index = tantivy.Index.open(path)
# get the scores and doc ids
query = self._query
if self._phrase_query:
query = query.replace('"', "'")
query = f'"{query}"'
limit = self._limit if self._limit is not None else 10
row_ids, scores = search_index(
index, query, limit, ordering_field=self.ordering_field_name
)
if len(row_ids) == 0:
empty_schema = pa.schema([pa.field("_score", pa.float32())])
return pa.Table.from_batches([], schema=empty_schema)
scores = pa.array(scores)
output_tbl = self._table.to_lance().take(row_ids, columns=self._columns)
output_tbl = output_tbl.append_column("_score", scores)
# this needs to match vector search results which are uint64
row_ids = pa.array(row_ids, type=pa.uint64())
if self._where is not None:
tmp_name = "__lancedb__duckdb__indexer__"
output_tbl = output_tbl.append_column(
tmp_name, pa.array(range(len(output_tbl)))
)
try:
# TODO would be great to have Substrait generate pyarrow compute
# expressions or conversely have pyarrow support SQL expressions
# using Substrait
import duckdb
indexer = duckdb.sql(
f"SELECT {tmp_name} FROM output_tbl WHERE {self._where}"
).to_arrow_table()[tmp_name]
output_tbl = output_tbl.take(indexer).drop([tmp_name])
row_ids = row_ids.take(indexer)
except ImportError:
import tempfile
import lance
# TODO Use "memory://" instead once that's supported
with tempfile.TemporaryDirectory() as tmp:
ds = lance.write_dataset(output_tbl, tmp)
output_tbl = ds.to_table(filter=self._where)
indexer = output_tbl[tmp_name]
row_ids = row_ids.take(indexer)
output_tbl = output_tbl.drop([tmp_name])
if self._with_row_id:
output_tbl = output_tbl.append_column("_rowid", row_ids)
if self._reranker is not None:
output_tbl = self._reranker.rerank_fts(self._query, output_tbl)
return output_tbl
def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
"""Rerank the results using the specified reranker.
@@ -1730,7 +1643,7 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
def _validate_query(self, query, vector=None, text=None):
if query is not None and (vector is not None or text is not None):
raise ValueError(
"You can either provide a string query in search() method"
"You can either provide a string query in search() method "
"or set `vector()` and `text()` explicitly for hybrid search. "
"But not both."
)


@@ -123,6 +123,7 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], optional
The parent namespace to list namespaces in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -155,6 +156,7 @@ class RemoteDBConnection(DBConnection):
----------
namespace_path: List[str]
The namespace identifier to create.
Previously called ``namespace`` in 0.30.2 and earlier.
mode: str, optional
Creation mode - "create" (fail if exists), "exist_ok" (skip if exists),
or "overwrite" (replace if exists). Case insensitive.
@@ -185,6 +187,7 @@ class RemoteDBConnection(DBConnection):
----------
namespace_path: List[str]
The namespace identifier to drop.
Previously called ``namespace`` in 0.30.2 and earlier.
mode: str, optional
Whether to skip if not exists ("SKIP") or fail ("FAIL"). Case insensitive.
behavior: str, optional
@@ -212,6 +215,7 @@ class RemoteDBConnection(DBConnection):
----------
namespace_path: List[str]
The namespace identifier to describe.
Previously called ``namespace`` in 0.30.2 and earlier.
Returns
-------
@@ -234,6 +238,7 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to list tables in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str, optional
Token for pagination. Use the token from a previous response
to get the next page of results.
@@ -271,6 +276,7 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], default []
The namespace to list tables in.
Empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
page_token: str
The last token to start the new page.
limit: int, default 10
@@ -313,6 +319,7 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to open the table from.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
Returns
-------
@@ -352,6 +359,7 @@ class RemoteDBConnection(DBConnection):
target_namespace_path: List[str], optional
The namespace for the target table.
None or empty list represents root namespace.
Previously called ``target_namespace`` in 0.30.2 and earlier.
source_version: int, optional
The version of the source table to clone.
source_tag: str, optional
@@ -403,6 +411,7 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to create the table in.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
data: The data to initialize the table, *optional*
User must provide at least one of `data` or `schema`.
Acceptable types are:
@@ -536,6 +545,7 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to drop the table from.
None or empty list represents root namespace.
Previously called ``namespace`` in 0.30.2 and earlier.
"""
if namespace_path is None:
namespace_path = []
@@ -557,6 +567,12 @@ class RemoteDBConnection(DBConnection):
The current name of the table.
new_name: str
The new name of the table.
cur_namespace_path: List[str], optional
The namespace of the current table.
Previously called ``cur_namespace`` in 0.30.2 and earlier.
new_namespace_path: List[str], optional
The namespace to move the table to.
Previously called ``new_namespace`` in 0.30.2 and earlier.
"""
if cur_namespace_path is None:
cur_namespace_path = []
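
Review note: the docstrings above record the rename from ``namespace`` to ``namespace_path``, but the diff does not show whether the old keyword is still accepted. Callers written against 0.30.2 or earlier could bridge the rename on their own side. A minimal sketch, assuming a hypothetical helper `resolve_namespace_path` that is not part of LanceDB:

```python
import warnings
from typing import List, Optional


def resolve_namespace_path(
    namespace_path: Optional[List[str]] = None,
    namespace: Optional[List[str]] = None,  # hypothetical legacy keyword
) -> List[str]:
    """Map the pre-rename keyword onto ``namespace_path``.

    Caller-side illustration only; LanceDB itself may or may not
    accept the old name.
    """
    if namespace is not None:
        if namespace_path is not None:
            raise TypeError("Pass namespace_path only; namespace is the old name.")
        warnings.warn(
            "'namespace' was renamed to 'namespace_path'",
            DeprecationWarning,
            stacklevel=2,
        )
        namespace_path = namespace
    # None and the empty list both mean the root namespace.
    return [] if namespace_path is None else list(namespace_path)
```

The final line mirrors the `if namespace_path is None: namespace_path = []` normalisation visible in the hunks above.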

View File

@@ -22,6 +22,7 @@ from lancedb.index import (
FTS,
BTree,
Bitmap,
HnswFlat,
HnswSq,
IvfFlat,
IvfPq,
@@ -39,6 +40,7 @@ from lancedb.table import _normalize_progress
from ..query import LanceVectorQueryBuilder, LanceQueryBuilder, LanceTakeQueryBuilder
from ..table import AsyncTable, IndexStatistics, Query, Table, Tags
from ..types import BaseTokenizerType
class RemoteTable(Table):
@@ -167,7 +169,7 @@ class RemoteTable(Table):
wait_timeout: Optional[timedelta] = None,
with_position: bool = False,
# tokenizer configs:
base_tokenizer: str = "simple",
base_tokenizer: BaseTokenizerType = "simple",
language: str = "English",
max_token_length: Optional[int] = 40,
lower_case: bool = True,
@@ -284,13 +286,15 @@ class RemoteTable(Table):
)
elif index_type == "IVF_HNSW_SQ":
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_HNSW_FLAT":
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_FLAT":
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
else:
raise ValueError(
f"Unknown vector index type: {index_type}. Valid options are"
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ'"
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
)
LOOP.run(
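
Review note: the branch added here means the list of valid options in the error message has to be edited by hand every time an index type is added. One alternative is a lookup table from which the message is derived. The sketch below covers only the three branches whose constructor arguments are visible in this hunk; the `_Config` base class and `make_config` are stand-ins invented for illustration, not the real `lancedb.index` classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class _Config:
    """Stand-in for the lancedb.index config classes (assumption)."""

    distance_type: str
    num_partitions: Optional[int] = None


class IvfFlat(_Config):
    pass


class HnswSq(_Config):
    pass


class HnswFlat(_Config):
    pass


# Single source of truth for both dispatch and the error message.
_VECTOR_INDEX_CONFIGS = {
    "IVF_FLAT": IvfFlat,
    "IVF_HNSW_SQ": HnswSq,
    "IVF_HNSW_FLAT": HnswFlat,
}


def make_config(index_type: str, metric: str, num_partitions: Optional[int] = None):
    try:
        config_cls = _VECTOR_INDEX_CONFIGS[index_type]
    except KeyError:
        valid = ", ".join(repr(name) for name in _VECTOR_INDEX_CONFIGS)
        raise ValueError(
            f"Unknown vector index type: {index_type}. Valid options are {valid}"
        ) from None
    return config_cls(distance_type=metric, num_partitions=num_partitions)
```

Index types whose configs take additional arguments would need per-type factories rather than bare classes.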

View File

@@ -57,6 +57,7 @@ from .index import (
LabelList,
HnswPq,
HnswSq,
HnswFlat,
FTS,
)
from .merge import LanceMergeInsertBuilder
@@ -86,6 +87,59 @@ from .util import (
)
from .index import lang_mapping
_MODEL_BACKED_TOKENIZER_PREFIXES = ("jieba", "lindera")
_MODEL_BACKED_TOKENIZER_ERRORS = (
"unknown base tokenizer",
"Invalid directory path:",
"Failed to load Jieba",
"Failed to load tokenizer config",
"Failed to initialize default tokenizer",
)
def _add_unique_note(exception: BaseException, note: str) -> None:
existing_notes = getattr(exception, "__notes__", ()) or ()
message = (
exception.args[0]
if exception.args and isinstance(exception.args[0], str)
else ""
)
if note not in existing_notes and note not in message:
add_note(exception, note)
def _is_model_backed_tokenizer(base_tokenizer: str) -> bool:
return any(
base_tokenizer == prefix or base_tokenizer.startswith(f"{prefix}/")
for prefix in _MODEL_BACKED_TOKENIZER_PREFIXES
)
def _maybe_add_fts_error_note(
exception: BaseException, *, base_tokenizer: str, language: Optional[str] = None
) -> None:
message = str(exception)
if language is not None and "not support the requested language" in message:
supported_langs = ", ".join(lang_mapping.values())
_add_unique_note(exception, f"Supported languages: {supported_langs}")
return
if not _is_model_backed_tokenizer(base_tokenizer):
return
if not any(marker in message for marker in _MODEL_BACKED_TOKENIZER_ERRORS):
return
_add_unique_note(
exception,
"Model-backed tokenizers such as 'jieba/default' and 'lindera/ipadic' "
"require tokenizer models in Lance's language model home. Set "
"LANCE_LANGUAGE_MODEL_HOME to override the default platform data "
"directory under 'lance/language_models'. Expected layouts include "
"'<model-home>/jieba/default/...' and "
"'<model-home>/lindera/ipadic/...'.",
)
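
Review note: the helpers above attach a hint to an exception at most once and only for tokenizer names under the `jieba` or `lindera` prefixes. The standalone sketch below reproduces that logic so it can be run without LanceDB. The local `add_note` is a stand-in for the helper the module imports; its fallback for interpreters older than 3.11 is an assumption about how that helper behaves.

```python
import sys

MODEL_BACKED_PREFIXES = ("jieba", "lindera")


def add_note(exc: BaseException, note: str) -> None:
    """Stand-in for the imported add_note helper (PEP 678 semantics)."""
    if sys.version_info >= (3, 11):
        exc.add_note(note)
    else:
        # Assumed fallback: emulate __notes__ on older interpreters.
        exc.__notes__ = [*getattr(exc, "__notes__", []), note]


def add_unique_note(exc: BaseException, note: str) -> None:
    existing = getattr(exc, "__notes__", ()) or ()
    message = exc.args[0] if exc.args and isinstance(exc.args[0], str) else ""
    # Skip the note if it is already attached or already part of the message.
    if note not in existing and note not in message:
        add_note(exc, note)


def is_model_backed(base_tokenizer: str) -> bool:
    # Matches "jieba" and "jieba/<model>", but not "jiebaX".
    return any(
        base_tokenizer == prefix or base_tokenizer.startswith(f"{prefix}/")
        for prefix in MODEL_BACKED_PREFIXES
    )
```

The duplicate check matters because the sync `create_fts_index` path calls into the async `create_index`, and both wrap the call in an `except` block that may try to attach the same note.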
if TYPE_CHECKING:
from .db import LanceDBConnection
@@ -943,29 +997,29 @@ class Table(ABC):
Parameters
----------
field_names: str or list of str
The name(s) of the field to index.
If ``use_tantivy`` is False (default), only a single field name
(str) is supported. To index multiple fields, create a separate
FTS index for each field.
The name of the field to index. Native FTS indexes can only be
created on a single field at a time. To search over multiple text
fields, create a separate FTS index for each field.
replace: bool, default False
If True, replace the existing index if it exists. Note that this is
not yet an atomic operation; the index will be temporarily
unavailable while the new index is being created.
writer_heap_size: int, default 1GB
Only available with use_tantivy=True
Deprecated legacy Tantivy parameter. Any value other than the
default raises an error.
ordering_field_names:
A list of unsigned type fields to index to optionally order
results on at search time.
only available with use_tantivy=True
Deprecated legacy Tantivy parameter. Setting this raises an error.
tokenizer_name: str, default "default"
The tokenizer to use for the index. Can be "raw", "default" or the 2 letter
language code followed by "_stem". So for english it would be "en_stem".
For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
A legacy compatibility alias for native tokenizer configs. Can be "raw",
"default" or the 2 letter language code followed by "_stem", for
example "en_stem" for English. For new native FTS indexes, use
``base_tokenizer`` directly; ``tokenizer_name`` does not expose
model-backed tokenizer names such as ``jieba/default`` or
``lindera/ipadic``.
use_tantivy: bool, default False
If True, use the legacy full-text search implementation based on tantivy.
If False, use the new full-text search implementation based on lance-index.
Deprecated legacy Tantivy parameter. Setting this to True raises an
error.
with_position: bool, default False
Only available with use_tantivy=False
If False, do not store the positions of the terms in the text.
This can reduce the size of the index and improve indexing speed.
But it will raise an exception for phrase queries.
@@ -975,8 +1029,11 @@ class Table(ABC):
- "whitespace": Split text by whitespace, but not punctuation.
- "raw": No tokenization. The entire text is treated as a single token.
- "ngram": N-Gram tokenizer.
- "jieba/*": Jieba tokenizer loaded from Lance's language model home.
- "lindera/*": Lindera tokenizer loaded from Lance's language model home.
language : str, default "English"
The language to use for tokenization.
The language to use for stemming and stop-word removal. To tokenize
CJK text, set ``base_tokenizer`` to a model-backed tokenizer such as
``jieba/default`` or ``lindera/ipadic`` instead.
max_token_length : int, default 40
The maximum token length to index. Tokens longer than this length will be
ignored.
@@ -1002,6 +1059,13 @@ class Table(ABC):
The timeout to wait if indexing is asynchronous.
name: str, optional
The name of the index. If not provided, a default name will be generated.
Notes
-----
Model-backed tokenizers such as ``jieba/default`` and ``lindera/ipadic``
require tokenizer models in Lance's language model home. Set
``LANCE_LANGUAGE_MODEL_HOME`` to override the default platform data
directory under ``lance/language_models``.
"""
raise NotImplementedError
@@ -1746,6 +1810,16 @@ class Table(ABC):
index_exists = fs.get_file_info(path).type != pa_fs.FileType.NotFound
return (path, fs, index_exists)
def _ensure_no_legacy_fts_index(self):
path, _, exists = self._get_fts_index_path()
if exists:
raise ValueError(
"Legacy Tantivy FTS index detected at "
f"{path}. Tantivy-based FTS has been removed. "
"Delete the legacy index and recreate it with "
"table.create_fts_index(...)."
)
@abstractmethod
def uses_v2_manifest_paths(self) -> bool:
"""
@@ -2163,7 +2237,13 @@ class LanceTable(Table):
index_cache_size: Optional[int] = None,
num_bits: int = 8,
index_type: Literal[
"IVF_FLAT", "IVF_SQ", "IVF_PQ", "IVF_RQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
"IVF_FLAT",
"IVF_SQ",
"IVF_PQ",
"IVF_RQ",
"IVF_HNSW_SQ",
"IVF_HNSW_PQ",
"IVF_HNSW_FLAT",
] = "IVF_PQ",
max_iterations: int = 50,
sample_rate: int = 256,
@@ -2250,6 +2330,16 @@ class LanceTable(Table):
ef_construction=ef_construction,
target_partition_size=target_partition_size,
)
elif index_type == "IVF_HNSW_FLAT":
config = HnswFlat(
distance_type=metric,
num_partitions=num_partitions,
max_iterations=max_iterations,
sample_rate=sample_rate,
m=m,
ef_construction=ef_construction,
target_partition_size=target_partition_size,
)
else:
raise ValueError(f"Unknown index type {index_type}")
@@ -2405,41 +2495,57 @@ class LanceTable(Table):
prefix_only: bool = False,
name: Optional[str] = None,
):
if not use_tantivy:
if not isinstance(field_names, str):
raise ValueError(
"Native FTS indexes can only be created on a single field "
"at a time. To search over multiple text fields, create a "
"separate FTS index for each field."
)
self._ensure_no_legacy_fts_index()
if tokenizer_name is None:
tokenizer_configs = {
"base_tokenizer": base_tokenizer,
"language": language,
"with_position": with_position,
"max_token_length": max_token_length,
"lower_case": lower_case,
"stem": stem,
"remove_stop_words": remove_stop_words,
"ascii_folding": ascii_folding,
"ngram_min_length": ngram_min_length,
"ngram_max_length": ngram_max_length,
"prefix_only": prefix_only,
}
else:
tokenizer_configs = self.infer_tokenizer_configs(tokenizer_name)
config = FTS(
**tokenizer_configs,
if use_tantivy:
raise ValueError(
"Tantivy-based FTS has been removed. "
"Remove use_tantivy and recreate the index with native FTS."
)
if ordering_field_names is not None:
raise ValueError(
"ordering_field_names was only supported by the removed "
"Tantivy-based FTS implementation."
)
if writer_heap_size != 1024 * 1024 * 1024:
raise ValueError(
"writer_heap_size was only supported by the removed "
"Tantivy-based FTS implementation."
)
if not isinstance(field_names, str):
raise ValueError(
"Native FTS indexes can only be created on a single field "
"at a time. To search over multiple text fields, create a "
"separate FTS index for each field."
)
if "." in field_names:
raise ValueError(
"Native FTS indexes can only be created on top-level fields. "
f"Received nested field path: {field_names!r}."
)
# delete the existing legacy index if it exists
if replace:
path, fs, exist = self._get_fts_index_path()
if exist:
fs.delete_dir(path)
if tokenizer_name is None:
tokenizer_configs = {
"base_tokenizer": base_tokenizer,
"language": language,
"with_position": with_position,
"max_token_length": max_token_length,
"lower_case": lower_case,
"stem": stem,
"remove_stop_words": remove_stop_words,
"ascii_folding": ascii_folding,
"ngram_min_length": ngram_min_length,
"ngram_max_length": ngram_max_length,
"prefix_only": prefix_only,
}
else:
tokenizer_configs = self.infer_tokenizer_configs(tokenizer_name)
config = FTS(
**tokenizer_configs,
)
try:
LOOP.run(
self._table.create_index(
field_names,
@@ -2448,42 +2554,13 @@ class LanceTable(Table):
name=name,
)
)
return
from .fts import create_index, populate_index
if isinstance(field_names, str):
field_names = [field_names]
if isinstance(ordering_field_names, str):
ordering_field_names = [ordering_field_names]
path, fs, exist = self._get_fts_index_path()
if exist:
if not replace:
raise ValueError("Index already exists. Use replace=True to overwrite.")
fs.delete_dir(path)
if not isinstance(fs, pa_fs.LocalFileSystem):
raise NotImplementedError(
"Full-text search is only supported on the local filesystem"
except (ValueError, RuntimeError) as e:
_maybe_add_fts_error_note(
e,
base_tokenizer=config.base_tokenizer,
language=config.language,
)
if tokenizer_name is None:
tokenizer_name = "default"
index = create_index(
path,
field_names,
ordering_fields=ordering_field_names,
tokenizer_name=tokenizer_name,
)
populate_index(
index,
self,
field_names,
ordering_fields=ordering_field_names,
writer_heap_size=writer_heap_size,
)
raise e
@staticmethod
def infer_tokenizer_configs(tokenizer_name: str) -> dict:
@@ -2929,6 +3006,7 @@ class LanceTable(Table):
namespace_path=namespace_path,
storage_options=storage_options,
location=location,
namespace_client=namespace_client,
)
)
return self
@@ -3812,7 +3890,18 @@ class AsyncTable:
*,
replace: Optional[bool] = None,
config: Optional[
Union[IvfFlat, IvfPq, IvfRq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]
Union[
IvfFlat,
IvfPq,
IvfRq,
HnswPq,
HnswSq,
HnswFlat,
BTree,
Bitmap,
LabelList,
FTS,
]
] = None,
wait_timeout: Optional[timedelta] = None,
name: Optional[str] = None,
@@ -3859,6 +3948,7 @@ class AsyncTable:
IvfRq,
HnswPq,
HnswSq,
HnswFlat,
BTree,
Bitmap,
LabelList,
@@ -3878,11 +3968,13 @@ class AsyncTable:
name=name,
train=train,
)
except ValueError as e:
if "not support the requested language" in str(e):
supported_langs = ", ".join(lang_mapping.values())
help_msg = f"Supported languages: {supported_langs}"
add_note(e, help_msg)
except (ValueError, RuntimeError) as e:
if isinstance(config, FTS):
_maybe_add_fts_error_note(
e,
base_tokenizer=config.base_tokenizer,
language=config.language,
)
raise e
async def drop_index(self, name: str) -> None:
@@ -5027,6 +5119,7 @@ class IndexStatistics:
"IVF_RQ",
"IVF_HNSW_SQ",
"IVF_HNSW_PQ",
"IVF_HNSW_FLAT",
"FTS",
"BTREE",
"BITMAP",

View File

@@ -24,6 +24,7 @@ VectorIndexType = Literal[
"IVF_PQ",
"IVF_HNSW_SQ",
"IVF_HNSW_PQ",
"IVF_HNSW_FLAT",
"IVF_RQ",
]
ScalarIndexType = Literal["BTREE", "BITMAP", "LABEL_LIST"]
@@ -31,6 +32,7 @@ IndexType = Literal[
"IVF_PQ",
"IVF_HNSW_PQ",
"IVF_HNSW_SQ",
"IVF_HNSW_FLAT",
"IVF_SQ",
"FTS",
"BTREE",
@@ -40,4 +42,5 @@ IndexType = Literal[
]
# Tokenizer literals
BaseTokenizerType = Literal["simple", "raw", "whitespace", "ngram"]
BuiltinTokenizerType = Literal["simple", "raw", "whitespace", "ngram"]
BaseTokenizerType = BuiltinTokenizerType | str
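
Review note: `BuiltinTokenizerType | str` is evaluated when the module is imported, and the `|` operator on typing constructs is only available from Python 3.10 (PEP 604). Whether the package still has to import on older interpreters is not visible in this diff. If it does, `Union` expresses the same alias. The sketch below also shows one way to recover the builtin names at runtime; `BUILTIN_TOKENIZERS` and `is_builtin_tokenizer` are hypothetical names introduced for illustration.

```python
from typing import Literal, Union, get_args

BuiltinTokenizerType = Literal["simple", "raw", "whitespace", "ngram"]
# Same meaning as `BuiltinTokenizerType | str`, but importable before 3.10.
BaseTokenizerType = Union[BuiltinTokenizerType, str]

# Hypothetical helper: the literal values double as a runtime allow-list.
BUILTIN_TOKENIZERS = frozenset(get_args(BuiltinTokenizerType))


def is_builtin_tokenizer(name: str) -> bool:
    return name in BUILTIN_TOKENIZERS
```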

View File

@@ -180,7 +180,7 @@ def test_fts_fuzzy_query():
),
mode="overwrite",
)
table.create_fts_index("text", use_tantivy=False, replace=True)
table.create_fts_index("text", replace=True)
results = table.search(MatchQuery("foo", "text", fuzziness=1)).to_pandas()
assert len(results) == 4
@@ -230,7 +230,7 @@ def test_fts_boost_query():
),
mode="overwrite",
)
table.create_fts_index("desc", use_tantivy=False, replace=True)
table.create_fts_index("desc", replace=True)
results = table.search(
BoostQuery(
@@ -265,7 +265,7 @@ def test_fts_boolean_query(tmp_path):
],
mode="overwrite",
)
table.create_fts_index("text", use_tantivy=False, replace=True)
table.create_fts_index("text", replace=True)
# SHOULD
results = table.search(
@@ -319,9 +319,7 @@ def test_fts_native():
],
)
# passing `use_tantivy=False` to use lance FTS index
# `use_tantivy=True` by default
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
@@ -332,7 +330,6 @@ def test_fts_native():
# --8<-- [start:fts_config_folding]
table.create_fts_index(
"text",
use_tantivy=False,
language="French",
stem=True,
ascii_folding=True,
@@ -346,7 +343,7 @@ def test_fts_native():
table.search("puppy").limit(10).where("text='foo'", prefilter=False).to_list()
# --8<-- [end:fts_postfiltering]
# --8<-- [start:fts_with_position]
table.create_fts_index("text", use_tantivy=False, with_position=True, replace=True)
table.create_fts_index("text", with_position=True, replace=True)
# --8<-- [end:fts_with_position]
# --8<-- [start:fts_incremental_index]
table.add([{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"}])

View File

@@ -0,0 +1,8 @@
我们 98740 r
都 202780 d
有 423765 v
光明 1219 n
的 318825 uj
前途 1263 n
前 62779 f
途 857 n

View File

@@ -0,0 +1,4 @@
segmenter:
mode: "normal"
dictionary:
path: "./python/tests/models/lindera/ipadic/main"
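
Review note: the checked-in config uses a dictionary path relative to the repository root, so it only resolves when tests run from that directory. The `lindera_ipadic` fixture later in this diff rewrites the file with an absolute path for that reason. A minimal sketch of the same step outside pytest follows; `prepare_lindera_home` is a hypothetical helper, the YAML nesting (two and four spaces) is an assumption because the indentation is not recoverable from this view, and the dictionary directory is created empty as a placeholder for the real extracted model.

```python
import tempfile
from pathlib import Path


def prepare_lindera_home(model_home: Path) -> Path:
    """Write <model-home>/lindera/ipadic/config.yml with an absolute path."""
    model_dir = model_home / "lindera" / "ipadic"
    dictionary_dir = model_dir / "main"
    # Placeholder only: real dictionary files must be extracted here.
    dictionary_dir.mkdir(parents=True, exist_ok=True)
    config_path = model_dir / "config.yml"
    config_path.write_text(
        "segmenter:\n"
        '  mode: "normal"\n'
        "  dictionary:\n"
        f'    path: "{dictionary_dir.resolve().as_posix()}"\n',
        encoding="utf-8",
    )
    return config_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(prepare_lindera_home(Path(tmp)).read_text(encoding="utf-8"))
```

After preparing the directory, `LANCE_LANGUAGE_MODEL_HOME` would be set to `model_home`, as the `language_model_home` fixture does with `monkeypatch.setenv`.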

Binary file not shown.

View File

@@ -15,8 +15,7 @@ import pytest
from lancedb.pydantic import LanceModel, Vector
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_basic(tmp_path, use_tantivy):
def test_basic(tmp_path):
db = lancedb.connect(tmp_path)
assert db.uri == str(tmp_path)
@@ -49,7 +48,7 @@ def test_basic(tmp_path, use_tantivy):
assert len(rs) == 1
assert rs["item"].iloc[0] == "foo"
table.create_fts_index("item", use_tantivy=use_tantivy)
table.create_fts_index("item")
rs = table.search("bar", query_type="fts").to_pandas()
assert len(rs) == 1
assert rs["item"].iloc[0] == "bar"

View File

@@ -15,7 +15,10 @@
# limitations under the License.
import os
import random
import shutil
from unittest import mock
from pathlib import Path
import zipfile
import lancedb as ldb
from lancedb.db import DBConnection
@@ -36,8 +39,7 @@ import pytest
import pytest_asyncio
from utils import exception_output
pytest.importorskip("lancedb.fts")
tantivy = pytest.importorskip("tantivy")
TEST_LANGUAGE_MODEL_HOME = Path(__file__).parent / "models"
@pytest.fixture
@@ -92,6 +94,40 @@ def table(tmp_path) -> ldb.table.LanceTable:
return table
@pytest.fixture
def language_model_home(monkeypatch, tmp_path):
model_home = tmp_path / "language-models"
shutil.copytree(TEST_LANGUAGE_MODEL_HOME, model_home)
monkeypatch.setenv("LANCE_LANGUAGE_MODEL_HOME", str(model_home))
return model_home
@pytest.fixture
def lindera_ipadic(language_model_home):
model_path = language_model_home / "lindera" / "ipadic"
extracted_model = model_path / "main"
config_path = model_path / "config.yml"
if extracted_model.exists():
shutil.rmtree(extracted_model)
with zipfile.ZipFile(model_path / "main.zip", "r") as zip_ref:
zip_ref.extractall(model_path)
config_path.write_text(
"segmenter:\n"
' mode: "normal"\n'
" dictionary:\n"
f' path: "{extracted_model.resolve().as_posix()}"\n',
encoding="utf-8",
)
try:
yield
finally:
if extracted_model.exists():
shutil.rmtree(extracted_model)
@pytest_asyncio.fixture
async def async_table(tmp_path) -> ldb.table.AsyncTable:
# Use local random state to avoid affecting other tests
@@ -144,58 +180,53 @@ async def async_table(tmp_path) -> ldb.table.AsyncTable:
return table
def test_create_index(tmp_path):
index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
assert isinstance(index, tantivy.Index)
assert os.path.exists(str(tmp_path / "index"))
@pytest.mark.parametrize(
("kwargs", "match"),
[
(
{"use_tantivy": True},
"Tantivy-based FTS has been removed",
),
(
{"ordering_field_names": ["count"]},
"ordering_field_names was only supported",
),
(
{"writer_heap_size": 128},
"writer_heap_size was only supported",
),
],
)
def test_reject_removed_tantivy_parameters(table, kwargs, match):
with pytest.raises(ValueError, match=match):
table.create_fts_index("text", **kwargs)
def test_create_index_with_stemming(tmp_path, table):
index = ldb.fts.create_index(
str(tmp_path / "index"), ["text"], tokenizer_name="en_stem"
)
assert isinstance(index, tantivy.Index)
assert os.path.exists(str(tmp_path / "index"))
def test_reject_legacy_tantivy_index(table):
path, _, _ = table._get_fts_index_path()
os.makedirs(path, exist_ok=True)
# Check stemming by running tokenizer on non empty table
table.create_fts_index("text", tokenizer_name="en_stem", use_tantivy=True)
with pytest.raises(ValueError, match="Legacy Tantivy FTS index detected"):
table.search("puppy").limit(5).to_list()
with pytest.raises(ValueError, match="Legacy Tantivy FTS index detected"):
table.create_fts_index("text")
@pytest.mark.parametrize("use_tantivy", [True, False])
@pytest.mark.parametrize("with_position", [True, False])
def test_create_inverted_index(table, use_tantivy, with_position):
if use_tantivy and not with_position:
pytest.skip("we don't support building a tantivy index without position")
def test_create_inverted_index(table, with_position):
table.create_fts_index(
"text",
use_tantivy=use_tantivy,
with_position=with_position,
name="custom_fts_index",
)
if not use_tantivy:
indices = table.list_indices()
fts_indices = [i for i in indices if i.index_type == "FTS"]
assert any(i.name == "custom_fts_index" for i in fts_indices)
indices = table.list_indices()
fts_indices = [i for i in indices if i.index_type == "FTS"]
assert any(i.name == "custom_fts_index" for i in fts_indices)
def test_populate_index(tmp_path, table):
index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
assert ldb.fts.populate_index(index, table, ["text"]) == len(table)
def test_search_index(tmp_path, table):
index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
ldb.fts.populate_index(index, table, ["text"])
index.reload()
results = ldb.fts.search_index(index, query="puppy", limit=5)
assert len(results) == 2
assert len(results[0]) == 5 # row_ids
assert len(results[1]) == 5 # _score
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_search_fts(table, use_tantivy):
table.create_fts_index("text", use_tantivy=use_tantivy)
def test_search_fts(table):
table.create_fts_index("text")
results = table.search("puppy").select(["id", "text"]).limit(5).to_list()
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
@@ -204,53 +235,52 @@ def test_search_fts(table, use_tantivy):
results = table.search("puppy").select(["id", "text"]).to_list()
assert len(results) == 10
if not use_tantivy:
# Test with a query
results = (
table.search(MatchQuery("puppy", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
# Test with a query
results = (
table.search(MatchQuery("puppy", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
# Test boost query
results = (
table.search(
BoostQuery(
MatchQuery("puppy", "text"),
MatchQuery("runs", "text"),
)
# Test boost query
results = (
table.search(
BoostQuery(
MatchQuery("puppy", "text"),
MatchQuery("runs", "text"),
)
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
# Test multi match query
table.create_fts_index("text2", use_tantivy=use_tantivy)
results = (
table.search(MultiMatchQuery("puppy", ["text", "text2"]))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
# Test multi match query
table.create_fts_index("text2")
results = (
table.search(MultiMatchQuery("puppy", ["text", "text2"]))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
# Test boolean query
results = (
table.search(MatchQuery("puppy", "text") & MatchQuery("runs", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
for r in results:
assert "puppy" in r["text"]
assert "runs" in r["text"]
# Test boolean query
results = (
table.search(MatchQuery("puppy", "text") & MatchQuery("runs", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
for r in results:
assert "puppy" in r["text"]
assert "runs" in r["text"]
@pytest.mark.asyncio
@@ -318,13 +348,13 @@ async def test_fts_select_async(async_table):
def test_search_fts_phrase_query(table):
table.create_fts_index("text", use_tantivy=False, with_position=False)
table.create_fts_index("text", with_position=False)
try:
phrase_results = table.search('"puppy runs"').limit(100).to_list()
assert False
except Exception:
pass
table.create_fts_index("text", use_tantivy=False, with_position=True, replace=True)
table.create_fts_index("text", with_position=True, replace=True)
results = table.search("puppy").limit(100).to_list()
# Test with quotation marks
@@ -375,8 +405,8 @@ async def test_search_fts_phrase_query_async(async_table):
def test_search_fts_specify_column(table):
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text2", use_tantivy=False)
table.create_fts_index("text")
table.create_fts_index("text2")
results = table.search("puppy", fts_columns="text").limit(5).to_list()
assert len(results) == 5
@@ -470,42 +500,8 @@ async def test_search_fts_specify_column_async(async_table):
pass
def test_search_ordering_field_index_table(tmp_path, table):
table.create_fts_index("text", ordering_field_names=["count"], use_tantivy=True)
rows = (
table.search("puppy", ordering_field_name="count")
.limit(20)
.select(["text", "count"])
.to_list()
)
for r in rows:
assert "puppy" in r["text"]
assert sorted(rows, key=lambda x: x["count"], reverse=True) == rows
def test_search_ordering_field_index(tmp_path, table):
index = ldb.fts.create_index(
str(tmp_path / "index"), ["text"], ordering_fields=["count"]
)
ldb.fts.populate_index(index, table, ["text"], ordering_fields=["count"])
index.reload()
results = ldb.fts.search_index(
index, query="puppy", limit=5, ordering_field="count"
)
assert len(results) == 2
assert len(results[0]) == 5 # row_ids
assert len(results[1]) == 5 # _distance
rows = table.to_lance().take(results[0]).to_pylist()
for r in rows:
assert "puppy" in r["text"]
assert sorted(rows, key=lambda x: x["count"], reverse=True) == rows
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_create_index_from_table(tmp_path, table, use_tantivy):
table.create_fts_index("text", use_tantivy=use_tantivy)
def test_create_index_from_table(tmp_path, table):
table.create_fts_index("text")
df = table.search("puppy").limit(5).select(["text"]).to_pandas()
assert len(df) <= 5
assert "text" in df.columns
@@ -525,36 +521,24 @@ def test_create_index_from_table(tmp_path, table, use_tantivy):
)
with pytest.raises(Exception, match="already exists"):
table.create_fts_index("text", use_tantivy=use_tantivy)
table.create_fts_index("text")
table.create_fts_index("text", replace=True, use_tantivy=use_tantivy)
table.create_fts_index("text", replace=True)
assert len(table.search("gorilla").limit(1).to_pandas()) == 1
def test_create_index_multiple_columns(tmp_path, table):
table.create_fts_index(["text", "text2"], use_tantivy=True)
df = table.search("puppy").limit(5).to_pandas()
assert len(df) == 5
assert "text" in df.columns
assert "text2" in df.columns
def test_empty_rs(tmp_path, table, mocker):
table.create_fts_index(["text", "text2"], use_tantivy=True)
mocker.patch("lancedb.fts.search_index", return_value=([], []))
df = table.search("puppy").limit(5).to_pandas()
assert len(df) == 0
with pytest.raises(ValueError, match="Native FTS indexes can only be created"):
table.create_fts_index(["text", "text2"])
def test_nested_schema(tmp_path, table):
table.create_fts_index("nested.text", use_tantivy=True)
rs = table.search("puppy").limit(5).to_list()
assert len(rs) == 5
with pytest.raises(ValueError, match="top-level fields"):
table.create_fts_index("nested.text")
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_search_index_with_filter(table, use_tantivy):
table.create_fts_index("text", use_tantivy=use_tantivy)
def test_search_index_with_filter(table):
table.create_fts_index("text")
orig_import = __import__
def import_mock(name, *args):
@@ -584,8 +568,7 @@ def test_search_index_with_filter(table, use_tantivy):
assert r["_rowid"] is not None
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_null_input(table, use_tantivy):
def test_null_input(table):
table.add(
[
{
@@ -598,14 +581,13 @@ def test_null_input(table, use_tantivy):
}
]
)
table.create_fts_index("text", use_tantivy=use_tantivy)
table.create_fts_index("text")
def test_syntax(table):
# https://github.com/lancedb/lancedb/issues/769
table.create_fts_index("text", use_tantivy=True)
with pytest.raises(ValueError, match="Syntax Error"):
table.search("they could have been dogs OR").limit(10).to_list()
table.create_fts_index("text")
table.search("they could have been dogs OR").limit(10).to_list()
# these should work
@@ -616,6 +598,7 @@ def test_syntax(table):
).to_list()
# phrase queries
table.create_fts_index("text", with_position=True, replace=True)
table.search("they could have been dogs OR cats").phrase_query().limit(10).to_list()
table.search('"they could have been dogs OR cats"').limit(10).to_list()
table.search('''"the cats OR dogs were not really 'pets' at all"''').limit(
@@ -639,7 +622,7 @@ def test_language(mem_db: DBConnection):
table = mem_db.create_table("test", data=data)
with pytest.raises(ValueError) as e:
table.create_fts_index("text", use_tantivy=False, language="klingon")
table.create_fts_index("text", language="klingon")
assert exception_output(e) == (
"ValueError: LanceDB does not support the requested language: 'klingon'\n"
@@ -650,7 +633,6 @@ def test_language(mem_db: DBConnection):
table.create_fts_index(
"text",
use_tantivy=False,
language="French",
stem=True,
ascii_folding=True,
@@ -690,7 +672,7 @@ def test_fts_on_list(mem_db: DBConnection):
}
)
table = mem_db.create_table("test", data=data)
table.create_fts_index("text", use_tantivy=False, with_position=True)
table.create_fts_index("text", with_position=True)
res = table.search("lance").limit(5).to_list()
assert len(res) == 3
@@ -702,7 +684,7 @@ def test_fts_on_list(mem_db: DBConnection):
def test_fts_ngram(mem_db: DBConnection):
data = pa.table({"text": ["hello world", "lance database", "lance is cool"]})
table = mem_db.create_table("test", data=data)
table.create_fts_index("text", use_tantivy=False, base_tokenizer="ngram")
table.create_fts_index("text", base_tokenizer="ngram")
results = table.search("lan", query_type="fts").limit(10).to_list()
assert len(results) == 2
@@ -721,7 +703,6 @@ def test_fts_ngram(mem_db: DBConnection):
# test setting min_ngram_length and prefix_only
table.create_fts_index(
"text",
use_tantivy=False,
base_tokenizer="ngram",
replace=True,
ngram_min_length=2,
@@ -742,6 +723,90 @@ def test_fts_ngram(mem_db: DBConnection):
assert set(r["text"] for r in results) == {"lance database", "lance is cool"}
def test_fts_jieba_tokenizer(mem_db: DBConnection, language_model_home):
data = pa.table({"text": ["我们都有光明的前途", "光明的前途"]})
table = mem_db.create_table("test_jieba", data=data)
table.create_fts_index(
"text",
base_tokenizer="jieba/default",
stem=False,
remove_stop_words=False,
ascii_folding=False,
)
results = table.search("我们", query_type="fts").limit(10).to_list()
assert [row["text"] for row in results] == ["我们都有光明的前途"]
def test_fts_jieba_missing_language_model_note(
mem_db: DBConnection, monkeypatch, tmp_path
):
missing_root = tmp_path / "missing-language-models"
monkeypatch.setenv("LANCE_LANGUAGE_MODEL_HOME", str(missing_root))
table = mem_db.create_table(
"test_missing_jieba_model",
data=pa.table({"text": ["我们都有光明的前途"]}),
)
with pytest.raises((ValueError, RuntimeError)) as e:
table.create_fts_index(
"text",
base_tokenizer="jieba/default",
stem=False,
remove_stop_words=False,
ascii_folding=False,
)
output = exception_output(e)
assert "Invalid directory path:" in output
assert "LANCE_LANGUAGE_MODEL_HOME" in output
assert "jieba/default" in output
@pytest.mark.asyncio
async def test_fts_jieba_missing_language_model_note_async(monkeypatch, tmp_path):
missing_root = tmp_path / "missing-language-models"
monkeypatch.setenv("LANCE_LANGUAGE_MODEL_HOME", str(missing_root))
db = await ldb.connect_async(tmp_path / "async-db")
table = await db.create_table(
"test_missing_jieba_model_async",
data=pa.table({"text": ["我们都有光明的前途"]}),
)
with pytest.raises((ValueError, RuntimeError)) as e:
await table.create_index(
"text",
config=FTS(
base_tokenizer="jieba/default",
stem=False,
remove_stop_words=False,
ascii_folding=False,
),
)
output = exception_output(e)
assert "Invalid directory path:" in output
assert "LANCE_LANGUAGE_MODEL_HOME" in output
assert "jieba/default" in output
def test_fts_lindera_tokenizer(
mem_db: DBConnection, language_model_home, lindera_ipadic
):
data = pa.table({"text": ["成田国際空港", "東京国際空港", "羽田空港"]})
table = mem_db.create_table("test_lindera", data=data)
table.create_fts_index(
"text",
base_tokenizer="lindera/ipadic",
stem=False,
remove_stop_words=False,
ascii_folding=False,
)
results = table.search("成田", query_type="fts").limit(10).to_list()
assert [row["text"] for row in results] == ["成田国際空港"]
def test_fts_query_to_json():
"""Test that FTS query to_json() produces valid JSON strings with exact format."""
@@ -886,7 +951,7 @@ def test_fts_query_to_json():
def test_fts_fast_search(table):
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
# Insert some unindexed data
table.add(

@@ -28,7 +28,7 @@ def sync_table(tmpdir_factory) -> Table:
}
)
table = db.create_table("test", data)
table.create_fts_index("text", with_position=False, use_tantivy=False)
table.create_fts_index("text", with_position=False)
return table
@@ -192,7 +192,7 @@ def table_with_id(tmpdir_factory) -> Table:
}
)
table = db.create_table("test_with_id", data)
table.create_fts_index("text", with_position=False, use_tantivy=False)
table.create_fts_index("text", with_position=False)
return table

@@ -16,11 +16,13 @@ from lancedb.index import (
IvfSq,
IvfHnswPq,
IvfHnswSq,
IvfHnswFlat,
IvfRq,
Bitmap,
LabelList,
HnswPq,
HnswSq,
HnswFlat,
FTS,
)
from lancedb.table import IndexStatistics
@@ -250,6 +252,21 @@ async def test_create_hnswpq_alias_index(some_table: AsyncTable):
assert indices[0].index_type in {"HnswPq", "IvfHnswPq"}
@pytest.mark.asyncio
async def test_create_hnswflat_index(some_table: AsyncTable):
await some_table.create_index("vector", config=HnswFlat(num_partitions=10))
indices = await some_table.list_indices()
assert len(indices) == 1
@pytest.mark.asyncio
async def test_create_hnswflat_alias_index(some_table: AsyncTable):
await some_table.create_index("vector", config=IvfHnswFlat(num_partitions=5))
indices = await some_table.list_indices()
assert len(indices) == 1
assert indices[0].index_type in {"HnswFlat", "IvfHnswFlat"}
@pytest.mark.asyncio
async def test_create_ivfsq_index(some_table: AsyncTable):
await some_table.create_index("vector", config=IvfSq(num_partitions=10))
@@ -295,6 +312,7 @@ def test_index_statistics_index_type_lists_all_supported_values():
"IVF_RQ",
"IVF_HNSW_SQ",
"IVF_HNSW_PQ",
"IVF_HNSW_FLAT",
"FTS",
"BTREE",
"BITMAP",

@@ -18,6 +18,9 @@ Tests verify:
"""
import copy
import shutil
import sys
import tempfile
import time
import uuid
from typing import Dict, Optional
@@ -387,6 +390,66 @@ def test_namespace_open_table_with_provider(s3_bucket: str, use_custom: bool):
assert get_describe_call_count(inner_ns_client) == describe_count_after_open
@pytest.mark.skipif(
sys.platform == "win32",
reason="TODO: fix schema-only namespace metrics test on Windows",
)
@pytest.mark.parametrize("use_custom", [False, True], ids=["DirectoryNS", "CustomNS"])
def test_namespace_create_schema_only_with_provider(use_custom: bool):
"""
Test creating a schema-only table through namespace with storage options provider.
Verifies:
- declare_table is called once to reserve the location
- describe_table is not needed during create in create mode
- the table can be reopened successfully afterward
- opening the table triggers exactly one describe_table call
"""
temp_dir = tempfile.mkdtemp()
try:
ns_client, inner_ns_client = create_tracking_namespace(
bucket_name=temp_dir,
storage_options={},
credential_expires_in_seconds=3600,
use_custom=use_custom,
)
db = LanceNamespaceDBConnection(ns_client)
namespace_name = f"test_ns_{uuid.uuid4().hex[:8]}"
db.create_namespace([namespace_name])
table_name = f"test_table_{uuid.uuid4().hex}"
namespace_path = [namespace_name]
schema = pa.schema(
[
pa.field("id", pa.int64()),
pa.field("vector", pa.list_(pa.float32(), 2)),
pa.field("text", pa.string()),
]
)
assert get_declare_call_count(inner_ns_client) == 0
assert get_describe_call_count(inner_ns_client) == 0
table = db.create_table(
table_name, schema=schema, namespace_path=namespace_path
)
assert table.name == table_name
assert table.namespace == namespace_path
assert get_declare_call_count(inner_ns_client) == 1
assert get_describe_call_count(inner_ns_client) == 0
reopened_table = db.open_table(table_name, namespace_path=namespace_path)
assert reopened_table.schema == schema
assert get_declare_call_count(inner_ns_client) == 1
assert get_describe_call_count(inner_ns_client) == 1
finally:
shutil.rmtree(temp_dir, ignore_errors=True)
@pytest.mark.s3_test
@pytest.mark.parametrize("use_custom", [False, True], ids=["DirectoryNS", "CustomNS"])
def test_namespace_credential_refresh_on_read(s3_bucket: str, use_custom: bool):

@@ -9,21 +9,6 @@ from lancedb import DBConnection, Table, connect
from lancedb.permutation import Permutation, Permutations, permutation_builder
def test_permutation_persistence(tmp_path):
db = connect(tmp_path)
tbl = db.create_table("test_table", pa.table({"x": range(100), "y": range(100)}))
permutation_tbl = (
permutation_builder(tbl).shuffle().persist(db, "test_permutation").execute()
)
assert permutation_tbl.count_rows() == 100
re_open = db.open_table("test_permutation")
assert re_open.count_rows() == 100
assert permutation_tbl.to_arrow() == re_open.to_arrow()
def test_split_random_ratios(mem_db):
"""Test random splitting with ratios."""
tbl = mem_db.create_table(

@@ -1385,7 +1385,7 @@ def test_query_timeout(tmp_path):
}
)
table = db.create_table("test", data)
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
with pytest.raises(Exception, match="Query timeout"):
table.search().where("text = 'a'").to_list(timeout=timedelta(0))

@@ -6,6 +6,8 @@ import contextlib
from datetime import timedelta
import http.server
import json
import multiprocessing as mp
import sys
import threading
import time
from unittest.mock import MagicMock, patch
@@ -1230,3 +1232,82 @@ def test_background_loop_cancellation(exception):
with pytest.raises(exception):
loop.run(None)
mock_future.cancel.assert_called_once()
def _remote_fork_child(port: int, queue) -> None:
# Build a fresh Connection in the child so we exercise the at-fork-child
# tokio runtime reset rather than relying on an inherited reqwest client.
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
queue.put(db.table_names())
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
"fork() is unavailable on Windows and unsafe on macOS "
"(Apple frameworks/TLS are not fork-safe)"
),
)
def test_remote_connection_after_fork():
"""A freshly-built remote Connection in a forked child should not hang.
The pyo3-async-runtimes tokio runtime would otherwise be inherited from
the parent with dead worker threads; the at-fork-child handler in our
runtime module rebuilds it on first use in the child.
"""
def handler(request):
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"tables": []}')
server = http.server.HTTPServer(("localhost", 0), make_mock_http_handler(handler))
port = server.server_address[1]
server_thread = threading.Thread(target=server.serve_forever)
server_thread.start()
try:
# Hit the server in the parent first so the runtime + LOOP are warm
# before fork; a fresh child must still succeed.
parent_db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
assert parent_db.table_names() == []
ctx = mp.get_context("fork")
queue = ctx.Queue()
proc = ctx.Process(target=_remote_fork_child, args=(port, queue))
proc.start()
proc.join(timeout=15)
if proc.is_alive():
proc.terminate()
proc.join(timeout=5)
if proc.is_alive():
proc.kill()
proc.join()
pytest.fail("Remote connection hung after fork")
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no result"
assert queue.get() == []
# Parent connection must still be usable after the child returned.
assert parent_db.table_names() == []
finally:
server.shutdown()
server_thread.join()

@@ -26,11 +26,8 @@ from lancedb.rerankers import (
)
from lancedb.table import LanceTable
# Tests rely on FTS index
pytest.importorskip("lancedb.fts")
def get_test_table(tmp_path, use_tantivy):
def get_test_table(tmp_path):
db = lancedb.connect(tmp_path)
# Create a LanceDB table schema with a vector and a text column
emb = EmbeddingFunctionRegistry.get_instance().get("test").create()
@@ -98,7 +95,7 @@ def get_test_table(tmp_path, use_tantivy):
)
# Create a fts index
table.create_fts_index("text", use_tantivy=use_tantivy, replace=True)
table.create_fts_index("text", replace=True)
return table, MyTable
@@ -208,8 +205,8 @@ def _run_test_reranker(reranker, table, query, query_vector, schema):
assert len(result) == 20 and result == result_arrow
def _run_test_hybrid_reranker(reranker, tmp_path, use_tantivy):
table, schema = get_test_table(tmp_path, use_tantivy)
def _run_test_hybrid_reranker(reranker, tmp_path):
table, schema = get_test_table(tmp_path)
# The default reranker
result1 = (
table.search(
@@ -285,8 +282,7 @@ def _run_test_hybrid_reranker(reranker, tmp_path, use_tantivy):
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_linear_combination(tmp_path, use_tantivy):
def test_linear_combination(tmp_path):
reranker = LinearCombinationReranker()
vector_results = pa.Table.from_pydict(
@@ -313,22 +309,20 @@ def test_linear_combination(tmp_path, use_tantivy):
assert "_score" not in combined_results.column_names
assert "_relevance_score" in combined_results.column_names
_run_test_hybrid_reranker(reranker, tmp_path, use_tantivy)
_run_test_hybrid_reranker(reranker, tmp_path)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_rrf_reranker(tmp_path, use_tantivy):
def test_rrf_reranker(tmp_path):
reranker = RRFReranker()
_run_test_hybrid_reranker(reranker, tmp_path, use_tantivy)
_run_test_hybrid_reranker(reranker, tmp_path)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_mrr_reranker(tmp_path, use_tantivy):
def test_mrr_reranker(tmp_path):
reranker = MRRReranker()
_run_test_hybrid_reranker(reranker, tmp_path, use_tantivy)
_run_test_hybrid_reranker(reranker, tmp_path)
# Test multi-vector part
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
query = "single player experience"
rs1 = table.search(query, vector_column_name="vector").limit(10).with_row_id(True)
rs2 = (
@@ -363,7 +357,7 @@ def test_rrf_reranker_distance():
table = db.create_table("test", data)
table.create_index(num_partitions=1, num_sub_vectors=2)
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
reranker = RRFReranker(return_score="all")
@@ -422,35 +416,31 @@ def test_rrf_reranker_distance():
@pytest.mark.skipif(
os.environ.get("COHERE_API_KEY") is None, reason="COHERE_API_KEY not set"
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cohere_reranker(tmp_path, use_tantivy):
def test_cohere_reranker(tmp_path):
pytest.importorskip("cohere")
reranker = CohereReranker()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cross_encoder_reranker(tmp_path, use_tantivy):
def test_cross_encoder_reranker(tmp_path):
pytest.importorskip("sentence_transformers")
reranker = CrossEncoderReranker()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_colbert_reranker(tmp_path, use_tantivy):
def test_colbert_reranker(tmp_path):
pytest.importorskip("rerankers")
reranker = ColbertReranker()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_answerdotai_reranker(tmp_path, use_tantivy):
def test_answerdotai_reranker(tmp_path):
pytest.importorskip("rerankers")
reranker = AnswerdotaiRerankers()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -459,10 +449,9 @@ def test_answerdotai_reranker(tmp_path, use_tantivy):
or os.environ.get("OPENAI_BASE_URL") is not None,
reason="OPENAI_API_KEY not set",
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_openai_reranker(tmp_path, use_tantivy):
def test_openai_reranker(tmp_path):
pytest.importorskip("openai")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
reranker = OpenaiReranker()
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -470,10 +459,9 @@ def test_openai_reranker(tmp_path, use_tantivy):
@pytest.mark.skipif(
os.environ.get("JINA_API_KEY") is None, reason="JINA_API_KEY not set"
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_jina_reranker(tmp_path, use_tantivy):
def test_jina_reranker(tmp_path):
pytest.importorskip("jina")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
reranker = JinaReranker()
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -481,11 +469,10 @@ def test_jina_reranker(tmp_path, use_tantivy):
@pytest.mark.skipif(
os.environ.get("VOYAGE_API_KEY") is None, reason="VOYAGE_API_KEY not set"
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_voyageai_reranker(tmp_path, use_tantivy):
def test_voyageai_reranker(tmp_path):
pytest.importorskip("voyageai")
reranker = VoyageAIReranker(model_name="rerank-2.5")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -504,7 +491,7 @@ def test_empty_result_reranker():
# Create empty table with schema
empty_table = db.create_table("empty_table", schema=schema, mode="overwrite")
empty_table.create_fts_index("text", use_tantivy=False, replace=True)
empty_table.create_fts_index("text", replace=True)
for reranker in [
CrossEncoderReranker(),
# ColbertReranker(),
@@ -603,11 +590,10 @@ def test_empty_hybrid_result_reranker():
assert "_rowid" in result.column_names
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cross_encoder_reranker_return_all(tmp_path, use_tantivy):
def test_cross_encoder_reranker_return_all(tmp_path):
pytest.importorskip("sentence_transformers")
reranker = CrossEncoderReranker(return_score="all")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
query = "single player experience"
result = (
table.search(query, query_type="hybrid", vector_column_name="vector")

@@ -242,8 +242,8 @@ def test_s3_dynamodb_sync(s3_bucket: str, commit_table: str, monkeypatch):
# FTS indices should error since they are not supported yet.
with pytest.raises(
NotImplementedError,
match="Full-text search is only supported on the local filesystem",
ValueError,
match="Tantivy-based FTS has been removed",
):
table.create_fts_index("x", use_tantivy=True)

@@ -3,6 +3,7 @@
import os
import sys
from datetime import date, datetime, timedelta
from time import sleep
from typing import List
@@ -10,7 +11,7 @@ from unittest.mock import patch
import lancedb
from lancedb.dependencies import _PANDAS_AVAILABLE
from lancedb.index import HnswPq, HnswSq, IvfPq
from lancedb.index import HnswFlat, HnswPq, HnswSq, IvfPq
import numpy as np
import polars as pl
import pyarrow as pa
@@ -916,6 +917,21 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
"my_vector", replace=True, config=expected_config, name=None, train=True
)
table.create_index(
vector_column_name="my_vector",
metric="cosine",
index_type="IVF_HNSW_FLAT",
sample_rate=0.1,
m=29,
ef_construction=10,
)
expected_config = HnswFlat(
distance_type="cosine", sample_rate=0.1, m=29, ef_construction=10
)
mock_create_index.assert_called_with(
"my_vector", replace=True, config=expected_config, name=None, train=True
)
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_name_and_train_parameters(
@@ -1947,7 +1963,6 @@ def setup_hybrid_search_table(db: DBConnection, embedding_func):
def test_hybrid_search(tmp_db: DBConnection):
# This test uses an FTS index
pytest.importorskip("lancedb.fts")
pytest.importorskip("lance")
table, MyTable, emb = setup_hybrid_search_table(tmp_db, "test")
@@ -2018,7 +2033,6 @@ def test_hybrid_search(tmp_db: DBConnection):
def test_hybrid_search_metric_type(tmp_db: DBConnection):
# This test uses an FTS index
pytest.importorskip("lancedb.fts")
pytest.importorskip("lance")
# Need to use nonnorm as the embedding function so l2 and dot results
@@ -2040,6 +2054,13 @@ def test_hybrid_search_metric_type(tmp_db: DBConnection):
@pytest.mark.parametrize(
"consistency_interval", [None, timedelta(seconds=0), timedelta(seconds=0.1)]
)
@pytest.mark.skipif(
sys.platform == "win32",
reason=(
"TODO: directory namespace is not supported on Windows yet; "
"re-enable after that is fixed."
),
)
def test_consistency(tmp_path, consistency_interval):
db = lancedb.connect(tmp_path)
table = db.create_table("my_table", data=[{"id": 0}])
@@ -2060,7 +2081,6 @@ def test_consistency(tmp_path, consistency_interval):
elif consistency_interval == timedelta(seconds=0):
assert table2.version == table.version
else:
# (consistency_interval == timedelta(seconds=0.1)
assert table2.version == table.version - 1
sleep(0.1)
assert table2.version == table.version

@@ -1,14 +1,29 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import functools
import multiprocessing as mp
import pickle
import sys
import lancedb
import pyarrow as pa
import pytest
from lancedb.permutation import Permutation, Permutations, permutation_builder
from lancedb.util import tbl_to_tensor
from lancedb.permutation import Permutation
torch = pytest.importorskip("torch")
def _open_native_table(uri: str, table_name: str):
"""Top-level connection factory used by the explicit-factory pickle test.
Defined at module scope so that pickle can resolve it by name in the
worker / unpickling process.
"""
return lancedb.connect(uri).open_table(table_name)
def test_table_dataloader(mem_db):
table = mem_db.create_table("test_table", pa.table({"a": range(1000)}))
dataloader = torch.utils.data.DataLoader(
@@ -40,3 +55,156 @@ def test_permutation_dataloader(mem_db):
for batch in dataloader:
assert batch.size(0) == 1
assert batch.size(1) == 10
def test_permutation_is_picklable(tmp_db):
"""A Permutation must be picklable so it can be used with PyTorch's
DataLoader when num_workers > 0 (which uses multiprocessing and pickles
the dataset to pass it to worker processes)."""
table = tmp_db.create_table("test_table", pa.table({"a": range(1000)}))
permutation = Permutation.identity(table)
pickled = pickle.dumps(permutation)
restored = pickle.loads(pickled)
assert len(restored) == 1000
rows = restored.__getitems__([0, 1, 2])
assert rows == [{"a": 0}, {"a": 1}, {"a": 2}]
def test_permutation_with_memory_base_is_picklable(mem_db):
"""An in-memory base table is inlined into the pickle as Arrow IPC bytes
and rebuilt on the other side as an in-memory LanceTable, so the
Permutation round-trips even though the original database can't be
reopened across processes."""
table = mem_db.create_table("test_table", pa.table({"a": range(50)}))
permutation = Permutation.identity(table)
restored = pickle.loads(pickle.dumps(permutation))
assert len(restored) == 50
assert restored.__getitems__([0, 10, 49]) == [{"a": 0}, {"a": 10}, {"a": 49}]
def test_permutation_dataloader_multiprocessing(tmp_db):
"""Using a Permutation with a PyTorch DataLoader that has num_workers > 0
must work end-to-end. Each worker process gets a pickled copy of the
dataset and reads batches from it."""
table = tmp_db.create_table("test_table", pa.table({"a": range(1000)}))
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(
permutation,
batch_size=10,
shuffle=True,
num_workers=2,
multiprocessing_context="spawn",
)
seen = 0
for batch in dataloader:
assert batch["a"].size(0) == 10
seen += batch["a"].size(0)
assert seen == 1000
def test_permutation_pickle_with_connection_factory(tmp_path):
"""When the user provides a connection_factory, pickling should round-trip
through that factory rather than introspecting the connection URI. Useful
for remote / cloud connections where the URI alone isn't reopenable."""
db = lancedb.connect(tmp_path)
db.create_table("test_table", pa.table({"a": range(50)}))
factory = functools.partial(_open_native_table, str(tmp_path))
permutation = Permutation.identity(factory("test_table")).with_connection_factory(
factory
)
restored = pickle.loads(pickle.dumps(permutation))
assert len(restored) == 50
# The factory survives pickling and is what powered base-table reopen.
assert restored.connection_factory is not None
assert restored.connection_factory.func is _open_native_table
assert restored.__getitems__([0, 1, 2]) == [{"a": 0}, {"a": 1}, {"a": 2}]
def test_permutation_with_builder_is_picklable(tmp_db):
"""A Permutation built from a non-identity permutation table must round-trip
through pickle while preserving the row order defined by the permutation."""
table = tmp_db.create_table("test_table", pa.table({"a": range(100)}))
perm_tbl = (
permutation_builder(table)
.split_random(ratios=[0.8, 0.2], seed=42, split_names=["train", "test"])
.shuffle(seed=42)
.execute()
)
permutations = Permutations(table, perm_tbl)
permutation = permutations["train"]
indices = list(range(len(permutation)))
expected = permutation.__getitems__(indices)
restored = pickle.loads(pickle.dumps(permutation))
assert len(restored) == len(permutation)
assert restored.__getitems__(indices) == expected
def _multiworker_dataloader_target(db_uri: str, result_queue):
import lancedb
from lancedb.permutation import Permutation
db = lancedb.connect(db_uri)
table = db.open_table("test_table")
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(
permutation,
batch_size=10,
num_workers=2,
multiprocessing_context="fork",
)
count = 0
for batch in dataloader:
assert batch["a"].size(0) == 10
count += 1
result_queue.put(count)
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
"fork() is unavailable on Windows and unsafe on macOS "
"(Apple frameworks/TLS are not fork-safe)"
),
)
def test_permutation_dataloader_fork_workers(tmp_path):
"""A Permutation used by a fork-based DataLoader should not hang.
PyTorch's DataLoader uses fork-based multiprocessing by default on Linux.
LanceDB drives async work through a background asyncio thread that does
not survive a fork, so any LOOP.run() in a worker blocks forever.
"""
import lancedb
db_uri = str(tmp_path / "db")
db = lancedb.connect(db_uri)
db.create_table("test_table", pa.table({"a": list(range(1000))}))
ctx = mp.get_context("spawn")
queue = ctx.Queue()
proc = ctx.Process(target=_multiworker_dataloader_target, args=(db_uri, queue))
proc.start()
proc.join(timeout=30)
if proc.is_alive():
proc.terminate()
proc.join(timeout=5)
if proc.is_alive():
proc.kill()
proc.join()
pytest.fail("Permutation hung when iterated in a fork-based DataLoader worker")
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no batches"
assert queue.get() == 100
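
The pickling tests above depend on the connection factory being defined at module scope so that pickle can resolve it by name. A minimal standard-library sketch of that idea follows, assuming a dataset that stores a factory instead of a live handle; `open_table` and `Dataset` are hypothetical stand-ins and do not reflect how `Permutation` is actually implemented.

```python
import functools
import pickle


def open_table(uri, table_name):
    """Hypothetical module-level factory; pickle stores it by qualified name."""
    return {"uri": uri, "table": table_name}


class Dataset:
    """Hypothetical dataset that keeps a factory instead of a live handle."""

    def __init__(self, factory, table_name):
        self.factory = factory
        self.table_name = table_name
        # Stand-in for a live table handle; rebuilt after unpickling.
        self._table = factory(table_name)

    def __getstate__(self):
        # Ship only what is needed to reopen the table in another process.
        return {"factory": self.factory, "table_name": self.table_name}

    def __setstate__(self, state):
        # Reopen through the factory on the receiving side.
        self.__init__(state["factory"], state["table_name"])


# functools.partial of a module-level function is picklable; a lambda is not.
factory = functools.partial(open_table, "/tmp/db")
restored = pickle.loads(pickle.dumps(Dataset(factory, "test_table")))
```

Binding the URI with `functools.partial` keeps the factory picklable while still letting the caller fix connection arguments up front, which mirrors how the test builds its factory.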

@@ -3,6 +3,8 @@
use std::sync::Arc;
use crate::error::PythonErrorExt;
use crate::runtime::future_into_py;
use arrow::{
datatypes::SchemaRef,
pyarrow::{IntoPyArrow, ToPyArrow},
@@ -12,9 +14,6 @@ use lancedb::arrow::SendableRecordBatchStream;
use pyo3::{
Bound, Py, PyAny, PyRef, PyResult, Python, exceptions::PyStopAsyncIteration, pyclass, pymethods,
};
use pyo3_async_runtimes::tokio::future_into_py;
use crate::error::PythonErrorExt;
#[pyclass]
pub struct RecordBatchStream {

@@ -1,11 +1,23 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{collections::HashMap, sync::Arc, time::Duration};
use std::{
collections::{HashMap, HashSet},
sync::Arc,
time::Duration,
};
use crate::{
error::PythonErrorExt,
namespace::{create_namespace_storage_options_provider, extract_namespace_arc},
runtime::future_into_py,
table::Table,
};
use arrow::{datatypes::Schema, ffi_stream::ArrowArrayStreamReader, pyarrow::FromPyArrow};
use lancedb::{
connection::Connection as LanceConnection,
connection::NamespaceClientPushdownOperation,
database::namespace::LanceNamespaceDatabase,
database::{CreateTableMode, Database, ReadConsistency},
};
use pyo3::{
@@ -14,13 +26,6 @@ use pyo3::{
pyclass, pyfunction, pymethods,
types::{PyDict, PyDictMethods},
};
use pyo3_async_runtimes::tokio::future_into_py;
use crate::{
error::PythonErrorExt,
namespace::{create_namespace_storage_options_provider, extract_namespace_arc},
table::Table,
};
#[pyclass]
pub struct Connection {
@@ -39,6 +44,29 @@ impl Connection {
}
}
fn parse_namespace_client_pushdown_operations(
operations: Option<Vec<String>>,
) -> PyResult<HashSet<NamespaceClientPushdownOperation>> {
let mut parsed = HashSet::new();
for operation in operations.unwrap_or_default() {
match operation.as_str() {
"QueryTable" => {
parsed.insert(NamespaceClientPushdownOperation::QueryTable);
}
"CreateTable" => {
parsed.insert(NamespaceClientPushdownOperation::CreateTable);
}
_ => {
return Err(PyValueError::new_err(format!(
"Invalid pushdown operation: {}",
operation
)));
}
}
}
Ok(parsed)
}
impl Connection {
fn parse_create_mode_str(mode: &str) -> PyResult<CreateTableMode> {
match mode {
@@ -496,7 +524,7 @@ impl Connection {
}
#[pyfunction]
#[pyo3(signature = (uri, api_key=None, region=None, host_override=None, read_consistency_interval=None, client_config=None, storage_options=None, session=None))]
#[pyo3(signature = (uri, api_key=None, region=None, host_override=None, read_consistency_interval=None, client_config=None, storage_options=None, session=None, manifest_enabled=false, namespace_client_properties=None))]
#[allow(clippy::too_many_arguments)]
pub fn connect(
py: Python<'_>,
@@ -508,6 +536,8 @@ pub fn connect(
client_config: Option<PyClientConfig>,
storage_options: Option<HashMap<String, String>>,
session: Option<crate::session::Session>,
manifest_enabled: bool,
namespace_client_properties: Option<HashMap<String, String>>,
) -> PyResult<Bound<'_, PyAny>> {
future_into_py(py, async move {
let mut builder = lancedb::connect(&uri);
@@ -527,6 +557,12 @@ pub fn connect(
if let Some(storage_options) = storage_options {
builder = builder.storage_options(storage_options);
}
if manifest_enabled {
builder = builder.manifest_enabled(true);
}
if let Some(namespace_client_properties) = namespace_client_properties {
builder = builder.namespace_client_properties(namespace_client_properties);
}
#[cfg(feature = "remote")]
if let Some(client_config) = client_config {
builder = builder.client_config(client_config.into());
@@ -538,6 +574,52 @@ pub fn connect(
})
}
#[pyfunction]
#[pyo3(signature = (
namespace_client,
read_consistency_interval=None,
storage_options=None,
session=None,
namespace_client_pushdown_operations=None,
namespace_client_impl=None,
namespace_client_properties=None,
))]
#[allow(clippy::too_many_arguments)]
pub fn connect_namespace_client(
py: Python<'_>,
namespace_client: Py<PyAny>,
read_consistency_interval: Option<f64>,
storage_options: Option<HashMap<String, String>>,
session: Option<crate::session::Session>,
namespace_client_pushdown_operations: Option<Vec<String>>,
namespace_client_impl: Option<String>,
namespace_client_properties: Option<HashMap<String, String>>,
) -> PyResult<Connection> {
let namespace_client = extract_namespace_arc(py, namespace_client)?;
let read_consistency_interval = read_consistency_interval.map(Duration::from_secs_f64);
let namespace_client_pushdown_operations =
parse_namespace_client_pushdown_operations(namespace_client_pushdown_operations)?;
let ns_impl = namespace_client_impl.unwrap_or_else(|| "python".to_string());
let ns_properties = namespace_client_properties.unwrap_or_default();
let storage_options = storage_options.unwrap_or_default();
let session = session.map(|s| s.inner.clone());
let database = LanceNamespaceDatabase::from_namespace_client(
namespace_client,
ns_impl,
ns_properties,
storage_options,
read_consistency_interval,
session,
namespace_client_pushdown_operations,
);
Ok(Connection::new(LanceConnection::new(
Arc::new(database),
Arc::new(lancedb::embeddings::MemoryRegistry::new()),
)))
}
#[derive(FromPyObject)]
pub struct PyClientConfig {
user_agent: String,

@@ -17,7 +17,7 @@ use pyo3::{Bound, PyAny, PyResult, exceptions::PyValueError, prelude::*, pyfunct
/// [`expr_lit`] and combined with the methods on this struct. On the Python
/// side a thin wrapper class (`lancedb.expr.Expr`) delegates to these methods
/// and adds Python operator overloads.
#[pyclass(name = "PyExpr")]
#[pyclass(name = "PyExpr", from_py_object)]
#[derive(Clone)]
pub struct PyExpr(pub DfExpr);

@@ -33,7 +33,7 @@ impl PyHeaderProvider {
Ok(headers_py) => {
// Convert Python dict to Rust HashMap
let bound_headers = headers_py.bind(py);
let dict: &Bound<PyDict> = bound_headers.downcast().map_err(|e| {
let dict: &Bound<PyDict> = bound_headers.cast().map_err(|e| {
format!("HeaderProvider.get_headers must return a dict: {}", e)
})?;

@@ -1,11 +1,13 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use lancedb::index::vector::{IvfFlatIndexBuilder, IvfRqIndexBuilder, IvfSqIndexBuilder};
use lancedb::index::vector::{
IvfFlatIndexBuilder, IvfHnswFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder,
IvfPqIndexBuilder, IvfRqIndexBuilder, IvfSqIndexBuilder,
};
use lancedb::index::{
Index as LanceDbIndex,
scalar::{BTreeIndexBuilder, FtsIndexBuilder},
vector::{IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder},
};
use pyo3::IntoPyObject;
use pyo3::types::PyStringMethods;
@@ -13,7 +15,7 @@ use pyo3::{
Bound, FromPyObject, PyAny, PyResult, Python,
exceptions::{PyKeyError, PyValueError},
intern, pyclass, pymethods,
types::PyAnyMethods,
types::{PyAnyMethods, PyString},
};
use crate::util::parse_distance_type;
@@ -22,7 +24,7 @@ pub fn class_name(ob: &'_ Bound<'_, PyAny>) -> PyResult<String> {
let full_name = ob
.getattr(intern!(ob.py(), "__class__"))?
.getattr(intern!(ob.py(), "__name__"))?;
let full_name = full_name.downcast()?.to_string_lossy();
let full_name = full_name.cast::<PyString>()?.to_string_lossy();
match full_name.rsplit_once('.') {
Some((_, name)) => Ok(name.to_string()),
@@ -162,8 +164,26 @@ pub fn extract_index_params(source: &Option<Bound<'_, PyAny>>) -> PyResult<Lance
}
Ok(LanceDbIndex::IvfHnswSq(hnsw_sq_builder))
}
"HnswFlat" => {
let params = source.extract::<IvfHnswFlatParams>()?;
let distance_type = parse_distance_type(params.distance_type)?;
let mut hnsw_flat_builder = IvfHnswFlatIndexBuilder::default()
.distance_type(distance_type)
.max_iterations(params.max_iterations)
.sample_rate(params.sample_rate)
.num_edges(params.m)
.ef_construction(params.ef_construction);
if let Some(num_partitions) = params.num_partitions {
hnsw_flat_builder = hnsw_flat_builder.num_partitions(num_partitions);
}
if let Some(target_partition_size) = params.target_partition_size {
hnsw_flat_builder =
hnsw_flat_builder.target_partition_size(target_partition_size);
}
Ok(LanceDbIndex::IvfHnswFlat(hnsw_flat_builder))
}
not_supported => Err(PyValueError::new_err(format!(
"Invalid index type '{}'. Must be one of BTree, Bitmap, LabelList, FTS, IvfPq, IvfSq, IvfHnswPq, or IvfHnswSq",
"Invalid index type '{}'. Must be one of BTree, Bitmap, LabelList, FTS, IvfPq, IvfSq, IvfHnswPq, IvfHnswSq, or IvfHnswFlat",
not_supported
))),
}
@@ -250,6 +270,17 @@ struct IvfHnswSqParams {
target_partition_size: Option<u32>,
}
#[derive(FromPyObject)]
struct IvfHnswFlatParams {
distance_type: String,
num_partitions: Option<u32>,
max_iterations: u32,
sample_rate: u32,
m: u32,
ef_construction: u32,
target_partition_size: Option<u32>,
}
#[pyclass(get_all)]
/// A description of an index currently configured on a column
pub struct IndexConfig {
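Note on the `HnswFlat` arm earlier in this file's diff: required parameters are applied unconditionally, while `num_partitions` and `target_partition_size` are applied only when the Python side supplied them, so the builder's own defaults survive otherwise. A minimal std-only sketch of that pattern follows. `SketchBuilder` and `SketchParams` are hypothetical stand-ins, not the LanceDB types, and their fields are assumptions for illustration.

```rust
/// Hypothetical stand-in for an index builder (not `IvfHnswFlatIndexBuilder`).
#[derive(Debug, Default, PartialEq)]
pub struct SketchBuilder {
    pub num_edges: u32,
    pub ef_construction: u32,
    pub num_partitions: Option<u32>,
    pub target_partition_size: Option<u32>,
}

impl SketchBuilder {
    pub fn num_edges(mut self, m: u32) -> Self {
        self.num_edges = m;
        self
    }
    pub fn ef_construction(mut self, ef: u32) -> Self {
        self.ef_construction = ef;
        self
    }
    pub fn num_partitions(mut self, n: u32) -> Self {
        self.num_partitions = Some(n);
        self
    }
    pub fn target_partition_size(mut self, size: u32) -> Self {
        self.target_partition_size = Some(size);
        self
    }
}

/// Hypothetical parameter bag, shaped like the struct extracted from Python.
pub struct SketchParams {
    pub m: u32,
    pub ef_construction: u32,
    pub num_partitions: Option<u32>,
    pub target_partition_size: Option<u32>,
}

pub fn build(params: &SketchParams) -> SketchBuilder {
    // Required settings are always forwarded.
    let mut builder = SketchBuilder::default()
        .num_edges(params.m)
        .ef_construction(params.ef_construction);
    // Optional settings are forwarded only when present.
    if let Some(n) = params.num_partitions {
        builder = builder.num_partitions(n);
    }
    if let Some(size) = params.target_partition_size {
        builder = builder.target_partition_size(size);
    }
    builder
}
```

Rebinding `builder` inside each `if let` is what a by-value (consuming) builder API requires. A `&mut self` builder would not need the reassignment.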

@@ -2,7 +2,7 @@
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use arrow::RecordBatchStream;
use connection::{Connection, connect};
use connection::{Connection, connect, connect_namespace_client};
use env_logger::Env;
use expr::{PyExpr, expr_col, expr_func, expr_lit};
use index::IndexConfig;
@@ -28,6 +28,7 @@ pub mod index;
pub mod namespace;
pub mod permutation;
pub mod query;
pub mod runtime;
pub mod session;
pub mod table;
pub mod util;
@@ -58,6 +59,7 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_class::<PyPermutationReader>()?;
m.add_class::<PyExpr>()?;
m.add_function(wrap_pyfunction!(connect, m)?)?;
m.add_function(wrap_pyfunction!(connect_namespace_client, m)?)?;
m.add_function(wrap_pyfunction!(permutation::async_permutation_builder, m)?)?;
m.add_function(wrap_pyfunction!(util::validate_table_name, m)?)?;
m.add_function(wrap_pyfunction!(query::fts_query_to_json, m)?)?;

@@ -183,7 +183,7 @@ async fn call_py_method_primitive<Req, Resp>(
) -> lance_core::Result<Resp>
where
Req: serde::Serialize + Send + 'static,
Resp: for<'py> pyo3::FromPyObject<'py> + Send + 'static,
Resp: for<'a, 'py> pyo3::FromPyObject<'a, 'py> + Send + 'static,
{
let request_json = serde_json::to_string(&request).map_err(|e| {
lance_core::Error::io(format!(
@@ -203,7 +203,7 @@ where
// Call the Python method
let result = py_namespace.call_method1(py, method_name, (request_arg,))?;
let value: Resp = result.extract(py)?;
let value: Resp = result.extract(py).map_err(Into::into)?;
Ok::<_, PyErr>(value)
})
})

@@ -4,7 +4,7 @@
use std::sync::{Arc, Mutex};
use crate::{
arrow::RecordBatchStream, connection::Connection, error::PythonErrorExt, table::Table,
arrow::RecordBatchStream, error::PythonErrorExt, runtime::future_into_py, table::Table,
};
use arrow::pyarrow::{PyArrowType, ToPyArrow};
use lancedb::{
@@ -21,16 +21,15 @@ use pyo3::{
pyclass, pymethods,
types::{PyAnyMethods, PyDict, PyDictMethods, PyType},
};
use pyo3_async_runtimes::tokio::future_into_py;
fn table_from_py<'a>(table: Bound<'a, PyAny>) -> PyResult<Bound<'a, Table>> {
if table.hasattr("_inner")? {
Ok(table.getattr("_inner")?.downcast_into::<Table>()?)
Ok(table.getattr("_inner")?.cast_into::<Table>()?)
} else if table.hasattr("_table")? {
Ok(table
.getattr("_table")?
.getattr("_inner")?
.downcast_into::<Table>()?)
.cast_into::<Table>()?)
} else {
Err(PyRuntimeError::new_err(
"Provided table does not appear to be a Table or RemoteTable instance",
@@ -80,24 +79,6 @@ impl PyAsyncPermutationBuilder {
#[pymethods]
impl PyAsyncPermutationBuilder {
#[pyo3(signature = (database, table_name))]
pub fn persist(
slf: PyRefMut<'_, Self>,
database: Bound<'_, PyAny>,
table_name: String,
) -> PyResult<Self> {
let conn = if database.hasattr("_conn")? {
database
.getattr("_conn")?
.getattr("_inner")?
.downcast_into::<Connection>()?
} else {
database.getattr("_inner")?.downcast_into::<Connection>()?
};
let database = conn.borrow().database()?;
slf.modify(|builder| builder.persist(database, table_name))
}
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None, seed=None, split_names=None))]
pub fn split_random(
slf: PyRefMut<'_, Self>,
@@ -243,7 +224,7 @@ impl PyPermutationReader {
let Some(selection) = selection else {
return Ok(Select::All);
};
let selection = selection.downcast_into::<PyDict>()?;
let selection = selection.cast_into::<PyDict>()?;
let selection = selection
.iter()
.map(|(key, value)| {

@@ -4,6 +4,11 @@
use std::sync::Arc;
use std::time::Duration;
use crate::expr::PyExpr;
use crate::runtime::future_into_py;
use crate::util::parse_distance_type;
use crate::{arrow::RecordBatchStream, util::PyLanceDB};
use crate::{error::PythonErrorExt, index::class_name};
use arrow::array::Array;
use arrow::array::ArrayData;
use arrow::array::make_array;
@@ -33,19 +38,16 @@ use pyo3::pyfunction;
use pyo3::pymethods;
use pyo3::types::PyList;
use pyo3::types::{PyDict, PyString};
use pyo3::{FromPyObject, exceptions::PyRuntimeError};
use pyo3::{Borrowed, FromPyObject, exceptions::PyRuntimeError};
use pyo3::{PyErr, pyclass};
use pyo3::{exceptions::PyValueError, intern};
use pyo3_async_runtimes::tokio::future_into_py;
use crate::expr::PyExpr;
use crate::util::parse_distance_type;
use crate::{arrow::RecordBatchStream, util::PyLanceDB};
use crate::{error::PythonErrorExt, index::class_name};
impl<'a, 'py> FromPyObject<'a, 'py> for PyLanceDB<FtsQuery> {
type Error = PyErr;
impl FromPyObject<'_> for PyLanceDB<FtsQuery> {
fn extract_bound(ob: &Bound<'_, PyAny>) -> PyResult<Self> {
match class_name(ob)?.as_str() {
fn extract(ob: Borrowed<'a, 'py, PyAny>) -> PyResult<Self> {
let ob = ob.to_owned();
match class_name(&ob)?.as_str() {
"MatchQuery" => {
let query = ob.getattr("query")?.extract()?;
let column = ob.getattr("column")?.extract()?;
@@ -424,7 +426,7 @@ impl Query {
"Query text is required for nearest_to_text",
))?;
let query = if let Ok(query_text) = fts_query.downcast::<PyString>() {
let query = if let Ok(query_text) = fts_query.cast::<PyString>() {
let mut query_text = query_text.to_string();
let columns = query
.get_item("columns")?
@@ -606,7 +608,7 @@ impl TakeQuery {
}
}
#[pyclass]
#[pyclass(from_py_object)]
#[derive(Clone)]
pub struct FTSQuery {
inner: LanceDbQuery,
@@ -735,7 +737,7 @@ impl FTSQuery {
}
}
#[pyclass]
#[pyclass(from_py_object)]
#[derive(Clone)]
pub struct VectorQuery {
inner: LanceDbVectorQuery,

python/src/runtime.rs (new file, 142 lines)

@@ -0,0 +1,142 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! Fork-safe wrapper around tokio + pyo3-async-runtimes.
//!
//! `pyo3_async_runtimes::tokio` keeps its multi-threaded runtime in a
//! `OnceLock` that can never be replaced. Tokio's worker threads do not
//! survive `fork()`, so once a child inherits a "frozen" runtime, every
//! `future_into_py` call hangs forever.
//!
//! We sidestep the global by routing every future through our own
//! [`LanceRuntime`] (a [`pyo3_async_runtimes::generic::Runtime`] impl) backed
//! by an [`AtomicPtr`] to a tokio runtime that we own. A `pthread_atfork`
//! child handler nulls the pointer; the next `spawn` rebuilds the runtime in
//! the child. This mirrors the pattern used in the Lance Python bindings.
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering};
use pyo3::{Bound, PyAny, PyResult, Python, conversion::IntoPyObject};
use pyo3_async_runtimes::{
TaskLocals,
generic::{ContextExt, JoinError, Runtime},
};
use tokio::{runtime, task};
static RUNTIME: AtomicPtr<runtime::Runtime> = AtomicPtr::new(std::ptr::null_mut());
static RUNTIME_INSTALLING: AtomicBool = AtomicBool::new(false);
static ATFORK_INSTALLED: AtomicBool = AtomicBool::new(false);
fn create_runtime() -> runtime::Runtime {
runtime::Builder::new_multi_thread()
.enable_all()
.thread_name("lancedb-tokio-worker")
.build()
.expect("Failed to build tokio runtime")
}
fn get_runtime() -> &'static runtime::Runtime {
loop {
let ptr = RUNTIME.load(Ordering::SeqCst);
if !ptr.is_null() {
return unsafe { &*ptr };
}
if !RUNTIME_INSTALLING.fetch_or(true, Ordering::SeqCst) {
break;
}
std::thread::yield_now();
}
if !ATFORK_INSTALLED.fetch_or(true, Ordering::SeqCst) {
install_atfork();
}
let new_ptr = Box::into_raw(Box::new(create_runtime()));
RUNTIME.store(new_ptr, Ordering::SeqCst);
unsafe { &*new_ptr }
}
/// Runs in async-signal context after `fork()` in the child. We can only
/// touch atomics here; we deliberately leak the previous runtime because
/// dropping a tokio `Runtime` would try to join its (now-dead) worker
/// threads and hang.
extern "C" fn atfork_child() {
RUNTIME.store(std::ptr::null_mut(), Ordering::SeqCst);
RUNTIME_INSTALLING.store(false, Ordering::SeqCst);
}
#[cfg(not(windows))]
fn install_atfork() {
unsafe { libc::pthread_atfork(None, None, Some(atfork_child)) };
}
#[cfg(windows)]
fn install_atfork() {}
/// Marker type implementing [`Runtime`] over our fork-safe runtime slot.
pub struct LanceRuntime;
/// Newtype wrapper around `tokio::task::JoinError` so we can implement the
/// foreign [`JoinError`] trait without violating orphan rules.
pub struct LanceJoinError(task::JoinError);
impl JoinError for LanceJoinError {
fn is_panic(&self) -> bool {
self.0.is_panic()
}
fn into_panic(self) -> Box<dyn std::any::Any + Send + 'static> {
self.0.into_panic()
}
}
impl Runtime for LanceRuntime {
type JoinError = LanceJoinError;
type JoinHandle = Pin<Box<dyn Future<Output = Result<(), Self::JoinError>> + Send>>;
fn spawn<F>(fut: F) -> Self::JoinHandle
where
F: Future<Output = ()> + Send + 'static,
{
let handle = get_runtime().spawn(fut);
Box::pin(async move { handle.await.map_err(LanceJoinError) })
}
fn spawn_blocking<F>(f: F) -> Self::JoinHandle
where
F: FnOnce() + Send + 'static,
{
let handle = get_runtime().spawn_blocking(f);
Box::pin(async move { handle.await.map_err(LanceJoinError) })
}
}
tokio::task_local! {
static TASK_LOCALS: std::cell::OnceCell<TaskLocals>;
}
impl ContextExt for LanceRuntime {
fn scope<F, R>(locals: TaskLocals, fut: F) -> Pin<Box<dyn Future<Output = R> + Send>>
where
F: Future<Output = R> + Send + 'static,
{
let cell = std::cell::OnceCell::new();
cell.set(locals).unwrap();
Box::pin(TASK_LOCALS.scope(cell, fut))
}
fn get_task_locals() -> Option<TaskLocals> {
TASK_LOCALS
.try_with(|c| c.get().cloned())
.unwrap_or_default()
}
}
/// Drop-in replacement for `pyo3_async_runtimes::tokio::future_into_py` that
/// uses our fork-safe runtime.
pub fn future_into_py<F, T>(py: Python<'_>, fut: F) -> PyResult<Bound<'_, PyAny>>
where
F: Future<Output = PyResult<T>> + Send + 'static,
T: for<'py> IntoPyObject<'py> + Send + 'static,
{
pyo3_async_runtimes::generic::future_into_py::<LanceRuntime, _, T>(py, fut)
}
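Note on `get_runtime` and `atfork_child` above: the slot is a lazily initialised `AtomicPtr` guarded by an "installing" flag, and the fork handler only resets the atomics so the next caller rebuilds the value. A minimal std-only sketch of that pattern follows. `Resource`, `get_resource`, and `reset_slot` are hypothetical names, a plain struct stands in for the tokio runtime, and `reset_slot` is called directly here instead of being registered with `pthread_atfork`.

```rust
use std::sync::atomic::{AtomicBool, AtomicPtr, AtomicUsize, Ordering};

/// Hypothetical stand-in for an expensive resource such as a tokio runtime.
pub struct Resource {
    pub id: usize,
}

static SLOT: AtomicPtr<Resource> = AtomicPtr::new(std::ptr::null_mut());
static INSTALLING: AtomicBool = AtomicBool::new(false);
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

pub fn get_resource() -> &'static Resource {
    loop {
        let ptr = SLOT.load(Ordering::SeqCst);
        if !ptr.is_null() {
            // SAFETY: non-null pointers in SLOT come from Box::into_raw
            // and are never freed, so the reference stays valid.
            return unsafe { &*ptr };
        }
        // The first caller to flip the flag becomes the installer;
        // everyone else spins until the pointer is published.
        if !INSTALLING.fetch_or(true, Ordering::SeqCst) {
            break;
        }
        std::thread::yield_now();
    }
    let id = NEXT_ID.fetch_add(1, Ordering::SeqCst);
    let new_ptr = Box::into_raw(Box::new(Resource { id }));
    SLOT.store(new_ptr, Ordering::SeqCst);
    // SAFETY: new_ptr was just created from a live Box and is never freed.
    unsafe { &*new_ptr }
}

/// Mimics the child-side fork handler: touch only atomics and
/// deliberately leak the previous value instead of dropping it.
pub fn reset_slot() {
    SLOT.store(std::ptr::null_mut(), Ordering::SeqCst);
    INSTALLING.store(false, Ordering::SeqCst);
}
```

Repeated calls return the same instance until `reset_slot` runs, after which the next call installs a fresh one. The leak is intentional in the sketch for the same reason the source gives: dropping the old value in the child would try to join threads that no longer exist.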

@@ -11,7 +11,7 @@ use pyo3::{PyResult, pyclass, pymethods};
/// Sessions allow you to configure cache sizes for index and metadata caches,
/// which can significantly impact memory use and performance. They can
/// also be re-used across multiple connections to share the same cache state.
#[pyclass]
#[pyclass(from_py_object)]
#[derive(Clone)]
pub struct Session {
pub(crate) inner: Arc<LanceSession>,

@@ -2,6 +2,7 @@
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{collections::HashMap, sync::Arc};
use crate::runtime::future_into_py;
use crate::{
connection::Connection,
error::PythonErrorExt,
@@ -24,12 +25,11 @@ use pyo3::{
pyclass, pymethods,
types::{IntoPyDict, PyAnyMethods, PyDict, PyDictMethods},
};
use pyo3_async_runtimes::tokio::future_into_py;
mod scannable;
/// Statistics about a compaction operation.
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct CompactionStats {
/// The number of fragments removed
@@ -43,7 +43,7 @@ pub struct CompactionStats {
}
/// Statistics about a cleanup operation
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct RemovalStats {
/// The number of bytes removed
@@ -53,7 +53,7 @@ pub struct RemovalStats {
}
/// Statistics about an optimize operation
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct OptimizeStats {
/// Statistics about the compaction operation
@@ -62,7 +62,7 @@ pub struct OptimizeStats {
pub prune: RemovalStats,
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct UpdateResult {
pub rows_updated: u64,
@@ -88,7 +88,7 @@ impl From<lancedb::table::UpdateResult> for UpdateResult {
}
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct AddResult {
pub version: u64,
@@ -109,7 +109,7 @@ impl From<lancedb::table::AddResult> for AddResult {
}
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct DeleteResult {
pub num_deleted_rows: u64,
@@ -135,7 +135,7 @@ impl From<lancedb::table::DeleteResult> for DeleteResult {
}
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct MergeResult {
pub version: u64,
@@ -171,7 +171,7 @@ impl From<lancedb::table::MergeResult> for MergeResult {
}
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct AddColumnsResult {
pub version: u64,
@@ -192,7 +192,7 @@ impl From<lancedb::table::AddColumnsResult> for AddColumnsResult {
}
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct AlterColumnsResult {
pub version: u64,
@@ -213,7 +213,7 @@ impl From<lancedb::table::AlterColumnsResult> for AlterColumnsResult {
}
}
#[pyclass(get_all)]
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct DropColumnsResult {
pub version: u64,

@@ -126,8 +126,11 @@ impl Scannable for PyScannable {
}
}
impl<'py> FromPyObject<'py> for PyScannable {
fn extract_bound(ob: &pyo3::Bound<'py, PyAny>) -> pyo3::PyResult<Self> {
impl<'a, 'py> FromPyObject<'a, 'py> for PyScannable {
type Error = pyo3::PyErr;
fn extract(ob: pyo3::Borrowed<'a, 'py, PyAny>) -> pyo3::PyResult<Self> {
let ob = ob.to_owned();
// Convert from Scannable dataclass.
let schema: PyArrowType<Schema> = ob.getattr("schema")?.extract()?;
let schema = Arc::new(schema.0);

python/uv.lock (generated, 40 lines changed)

@@ -1996,7 +1996,6 @@ tests = [
{ name = "pytest-mock" },
{ name = "pytz" },
{ name = "requests" },
{ name = "tantivy" },
]
[package.metadata]
@@ -2050,7 +2049,6 @@ requires-dist = [
{ name = "sentence-transformers", marker = "extra == 'embeddings'", specifier = ">=2.2.0" },
{ name = "sentencepiece", marker = "extra == 'embeddings'", specifier = ">=0.1.99" },
{ name = "sentencepiece", marker = "extra == 'siglip'" },
{ name = "tantivy", marker = "extra == 'tests'", specifier = ">=0.20.0" },
{ name = "torch", marker = "extra == 'clip'" },
{ name = "torch", marker = "extra == 'embeddings'", specifier = ">=2.0.0" },
{ name = "torch", marker = "extra == 'siglip'" },
@@ -4779,44 +4777,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/40/44/4a5f08c96eb108af5cb50b41f76142f0afa346dfa99d5296fe7202a11854/tabulate-0.9.0-py3-none-any.whl", hash = "sha256:024ca478df22e9340661486f85298cff5f6dcdba14f3813e8830015b9ed1948f", size = 35252, upload-time = "2022-10-06T17:21:44.262Z" },
]
[[package]]
name = "tantivy"
version = "0.25.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1b/f9/0cd3955d155d3e3ef74b864769514dd191e5dacba9f0beb7af2d914942ce/tantivy-0.25.1.tar.gz", hash = "sha256:68a3314699a7d18fcf338b52bae8ce46a97dde1128a3e47e33fa4db7f71f265e", size = 75120, upload-time = "2025-12-02T11:57:12.997Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/80/f7/2276bed3bed983ce2970dc70e3571f372587fe4f5f2bac1d6d617df08fa3/tantivy-0.25.1-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:7aa587a3dc9470584cacf5e3640fee93d12ec5f10109669c1f47c4e90820b958", size = 7638510, upload-time = "2025-12-02T11:56:08.754Z" },
{ url = "https://files.pythonhosted.org/packages/20/8c/078dc50570e243414356b05633f52fe544b85179281ffa9f1fe05d76bbd8/tantivy-0.25.1-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:56d77fe667595693d9fa5f0b4545776d84da9526bab0273b3fc6c7536dc0d8a2", size = 3932659, upload-time = "2025-12-02T11:56:10.621Z" },
{ url = "https://files.pythonhosted.org/packages/bd/dc/281c48436a1e3178b58fe463af314434fe0f3a4ec0c7588a362900e0c69e/tantivy-0.25.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5ba8c347cd48595fcaeabb28a909ebce92cf9c5e5c84ab5ba1136a280a307b5c", size = 4197430, upload-time = "2025-12-02T11:56:12.65Z" },
{ url = "https://files.pythonhosted.org/packages/7b/6c/61e6e0b0a350007d10a9b66a35703361d3345e14e7a7cc83494776b2a054/tantivy-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:aa7c4932e8fde1f09f2d46226060e827e197c2749abdc6129d73a752773adc38", size = 4184055, upload-time = "2025-12-02T11:56:14.647Z" },
{ url = "https://files.pythonhosted.org/packages/5f/fd/0eb059b12f0b6f91623a54a46448a83b7f716d08f3bca68c095d697b85da/tantivy-0.25.1-cp310-cp310-win_amd64.whl", hash = "sha256:afcfc5dbb0bcd5d24531f4471737ae0896f33528426ab0b1dad3e427c19120f6", size = 3424134, upload-time = "2025-12-02T11:56:16.242Z" },
{ url = "https://files.pythonhosted.org/packages/4e/7a/8a277f377e8a151fc0e71d4ffc1114aefb6e5e1c7dd609fed0955cf34ed8/tantivy-0.25.1-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:d363d7b4207d3a5aa7f0d212420df35bed18bdb6bae26a2a8bd57428388b7c29", size = 7637033, upload-time = "2025-12-02T11:56:18.104Z" },
{ url = "https://files.pythonhosted.org/packages/71/31/8b4acdedfc9f9a2d04b1340d07eef5213d6f151d1e18da0cb423e5f090d2/tantivy-0.25.1-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:8f4389cf1d889a1df7c5a3195806b4b56c37cee10d8a26faaa0dea35a867b5ff", size = 3932180, upload-time = "2025-12-02T11:56:19.833Z" },
{ url = "https://files.pythonhosted.org/packages/2f/dc/3e8499c21b4b9795e8f2fc54c68ce5b92905aaeadadaa56ecfa9180b11b1/tantivy-0.25.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:99864c09fc54652c3c2486cdf13f86cdc8200f4b481569cb291e095ca5d496e5", size = 4197620, upload-time = "2025-12-02T11:56:21.496Z" },
{ url = "https://files.pythonhosted.org/packages/f8/8e/f2ce62fffc811eb62bead92c7b23c2e218f817cbd54c4f3b802e03ba1438/tantivy-0.25.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:05abf37ddbc5063c575548be0d62931629c086bff7a5a1b67cf5a8f5ebf4cd8c", size = 4183794, upload-time = "2025-12-02T11:56:23.215Z" },
{ url = "https://files.pythonhosted.org/packages/de/64/24e2891b0ba3fd9853e10c296095a33b89bf3efd65e29da1ee5dae736040/tantivy-0.25.1-cp311-cp311-win_amd64.whl", hash = "sha256:f307ee8ad21597b0be23af83008fd66cfd5f958cdfa24ec0aaa08a38e86bbef4", size = 3424235, upload-time = "2025-12-02T11:56:25.172Z" },
{ url = "https://files.pythonhosted.org/packages/41/e7/6849c713ed0996c7628324c60512c4882006f0a62145e56c624a93407f90/tantivy-0.25.1-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:90fd919e5f611809f746560ecf36eb9be824dec62e21ae17a27243759edb9aa1", size = 7621494, upload-time = "2025-12-02T11:56:27.069Z" },
{ url = "https://files.pythonhosted.org/packages/c5/22/c3d8294600dc6e7fa350daef9ff337d3c06e132b81df727de9f7a50c692a/tantivy-0.25.1-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:4613c7cf6c23f3a97989819690a0f956d799354957de7a204abcc60083cebe02", size = 3925219, upload-time = "2025-12-02T11:56:29.403Z" },
{ url = "https://files.pythonhosted.org/packages/41/fc/cbb1df71dd44c9110eff4eaaeda9d44f2d06182fe0452193be20ddfba93f/tantivy-0.25.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c477bd20b4df804d57dfc5033431bef27cde605695ae141b03abbf6ebc069129", size = 4198699, upload-time = "2025-12-02T11:56:31.359Z" },
{ url = "https://files.pythonhosted.org/packages/47/4d/71abb78b774073c3ce12a4faa4351a9d910a71ffa3659526affba163873d/tantivy-0.25.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f9b1a1ba1113c523c7ff7b10f282d6c4074006f7ef8d71e1d973d51bf7291ddb", size = 4183585, upload-time = "2025-12-02T11:56:33.317Z" },
{ url = "https://files.pythonhosted.org/packages/be/16/3f00cd7ec458b92a0e977960af9ddfbeb762127d9acc68da9094a1fda556/tantivy-0.25.1-cp312-cp312-win_amd64.whl", hash = "sha256:9de0bafd3bd7ac9f8f82d53e17562e9db11a5af308fe5185c4bd86feaddbe4a6", size = 3424622, upload-time = "2025-12-02T11:56:34.788Z" },
{ url = "https://files.pythonhosted.org/packages/3d/25/73cfbcf1a8ea49be6c42817431cac46b70a119fe64da903fcc2d92b5b511/tantivy-0.25.1-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:f51ff7196c6f31719202080ed8372d5e3d51e92c749c032fb8234f012e99744c", size = 7622530, upload-time = "2025-12-02T11:56:36.839Z" },
{ url = "https://files.pythonhosted.org/packages/12/c8/c0d7591cdf4f7e7a9fc4da786d1ca8cd1aacffaa2be16ea6d401a8e4a566/tantivy-0.25.1-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:550e63321bfcacc003859f2fa29c1e8e56450807b3c9a501c1add27cfb9236d9", size = 3925637, upload-time = "2025-12-02T11:56:38.425Z" },
{ url = "https://files.pythonhosted.org/packages/3a/09/bedfc223bffec7641b417dd7ab071134b2ef8f8550e9b1fb6014657ef52e/tantivy-0.25.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fde31cc8d6e122faf7902aeea32bc008a429a6e8904e34d3468126a3ec01b016", size = 4197322, upload-time = "2025-12-02T11:56:40.411Z" },
{ url = "https://files.pythonhosted.org/packages/f5/f1/1fa5183500c8042200c9f2b840d34f5bbcfb434a1ee750e7132262d2a5c9/tantivy-0.25.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b11bd5a518b0be645320b47af8493f6a40c4f3234313e37adcf4534a564d27dd", size = 4183143, upload-time = "2025-12-02T11:56:42.048Z" },
{ url = "https://files.pythonhosted.org/packages/d5/74/a4c4f4eb95888ccb784da3b017aa0625ab1ac411bf5d022a9a797d9a2334/tantivy-0.25.1-cp313-cp313-win_amd64.whl", hash = "sha256:cc7fe88853e06b3251ee4fa42b7a2038727f850c8765bcc8167cfc73585dd24e", size = 3423491, upload-time = "2025-12-02T11:56:43.858Z" },
{ url = "https://files.pythonhosted.org/packages/8b/2f/581519492226f97d23bd0adc95dad991ebeaa73ea6abc8bff389a3096d9a/tantivy-0.25.1-cp313-cp313t-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:dae99e75b7eaa9bf5bd16ab106b416370f08c135aed0e117d62a3201cd1ffe36", size = 7610316, upload-time = "2025-12-02T11:56:45.927Z" },
{ url = "https://files.pythonhosted.org/packages/91/40/5d7bc315ab9e6a22c5572656e8ada1c836cfa96dccf533377504fbc3c9d9/tantivy-0.25.1-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:506e9533c5ef4d3df43bad64ffecc0aa97c76e361ea610815dc3a20a9d6b30b3", size = 3919882, upload-time = "2025-12-02T11:56:48.469Z" },
{ url = "https://files.pythonhosted.org/packages/02/b9/e0ef2f57a6a72444cb66c2ffbc310ab33ffaace275f1c4b0319d84ea3f18/tantivy-0.25.1-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5dbd4f8f264dacbcc9dee542832da2173fd53deaaea03f082d95214f8b5ed6bc", size = 4196031, upload-time = "2025-12-02T11:56:50.151Z" },
{ url = "https://files.pythonhosted.org/packages/1e/02/bf3f8cacfd08642e14a73f7956a3fb95d58119132c98c121b9065a1f8615/tantivy-0.25.1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:824c643ccb640dd9e35e00c5d5054ddf3323f56fe4219d57d428a9eeea13d22c", size = 4183437, upload-time = "2025-12-02T11:56:51.818Z" },
{ url = "https://files.pythonhosted.org/packages/9c/83/afa90e570198e2d1139dd567bec3c9cf44d8c54f63a649f16d711ede02f5/tantivy-0.25.1-cp313-cp313t-win_amd64.whl", hash = "sha256:09c987b840afcebac817836ac08407eff17272d8aa60ce6e291f89c81830221d", size = 3419409, upload-time = "2025-12-02T11:56:53.451Z" },
{ url = "https://files.pythonhosted.org/packages/ff/44/9f1d67aa5030f7eebc966c863d1316a510a971dd8bb45651df4acdfae9ed/tantivy-0.25.1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:7f5d29ae85dd0f23df8d15b3e7b341d4f9eb5a446bbb9640df48ac1f6d9e0c6c", size = 7623723, upload-time = "2025-12-02T11:56:55.066Z" },
{ url = "https://files.pythonhosted.org/packages/db/30/6e085bd3ed9d12da3c91c185854abd70f9dfd35fb36a75ea98428d42c30b/tantivy-0.25.1-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:f2d2938fb69a74fc1bb36edfaf7f0d1596fa1264db0f377bda2195c58bcb6245", size = 3926243, upload-time = "2025-12-02T11:56:57.058Z" },
{ url = "https://files.pythonhosted.org/packages/32/f5/a00d65433430f51718e5cc6938df571765d7c4e03aedec5aef4ab567aa9b/tantivy-0.25.1-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4f5ff124c4802558e627091e780b362ca944169736caba5a372eef39a79d0ae0", size = 4207186, upload-time = "2025-12-02T11:56:58.803Z" },
{ url = "https://files.pythonhosted.org/packages/19/63/61bdb12fc95f2a7f77bd419a5149bfa9f28caa76cb569bf2b6b06e1d033e/tantivy-0.25.1-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:43b80ef62a340416139c93d19264e5f808da48e04f9305f1092b8ed22be0a5be", size = 4187312, upload-time = "2025-12-02T11:57:00.595Z" },
{ url = "https://files.pythonhosted.org/packages/b7/de/e39c0b01d59019bf5c38face8b81defbc4a68cebf5e0c53bcb2cd715a449/tantivy-0.25.1-cp314-cp314-win_amd64.whl", hash = "sha256:286b654f40c70c1e6b64b9bc7031ed0bf5c440f5bffeaeeee21a0ee6cc39f0e2", size = 3436535, upload-time = "2025-12-02T11:57:02.267Z" },
]
[[package]]
name = "threadpoolctl"
version = "3.6.0"

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.28.0-beta.5"
version = "0.28.0-beta.11"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -40,7 +40,7 @@ lance-datafusion.workspace = true
lance-datagen = { workspace = true }
lance-file = { workspace = true }
lance-io = { workspace = true }
lance-index = { workspace = true }
lance-index = { workspace = true, features = ["tokenizer-jieba", "tokenizer-lindera"] }
lance-table = { workspace = true }
lance-linalg = { workspace = true }
lance-testing = { workspace = true }
@@ -108,10 +108,20 @@ test-log = "0.2"
[features]
default = []
aws = ["lance/aws", "lance-io/aws", "lance-namespace-impls/dir-aws"]
aws = [
"lance/aws",
"lance-io/aws",
"lance-namespace-impls/dir-aws",
"object_store/aws",
]
oss = ["lance/oss", "lance-io/oss", "lance-namespace-impls/dir-oss"]
gcs = ["lance/gcp", "lance-io/gcp", "lance-namespace-impls/dir-gcp"]
azure = ["lance/azure", "lance-io/azure", "lance-namespace-impls/dir-azure"]
azure = [
"lance/azure",
"lance-io/azure",
"lance-namespace-impls/dir-azure",
"lance-namespace-impls/credential-vendor-azure",
]
huggingface = [
"lance/huggingface",
"lance-io/huggingface",

@@ -582,6 +582,23 @@ pub struct ConnectRequest {
/// Database specific options
pub options: HashMap<String, String>,
/// Extra properties for the equivalent namespace client.
///
/// For a local [`ListingDatabase`], these are merged into the backing
/// `DirectoryNamespace` properties. This is useful for namespace-specific
/// settings such as `table_version_tracking_enabled` that are distinct from
/// storage options.
pub namespace_client_properties: HashMap<String, String>,
/// Use directory namespace manifests as the source of truth for native
/// LanceDB table metadata.
///
/// When enabled for a local/native connection, LanceDB returns a
/// namespace-backed database directly. Directory listing fallback remains
/// enabled for migration, and directory-listing-to-manifest migration is
/// forced on.
pub manifest_enabled: bool,
/// The interval at which to check for updates from other processes.
///
/// If None, then consistency is not checked. For performance
@@ -621,6 +638,8 @@ impl ConnectBuilder {
client_config: Default::default(),
read_consistency_interval: None,
options: HashMap::new(),
namespace_client_properties: HashMap::new(),
manifest_enabled: false,
session: None,
},
embedding_registry: None,
@@ -757,6 +776,42 @@ impl ConnectBuilder {
self
}
/// Set an additional property for the equivalent namespace client.
pub fn namespace_client_property(
mut self,
key: impl Into<String>,
value: impl Into<String>,
) -> Self {
self.request
.namespace_client_properties
.insert(key.into(), value.into());
self
}
/// Set multiple additional properties for the equivalent namespace client.
pub fn namespace_client_properties(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
for (key, value) in pairs {
self.request
.namespace_client_properties
.insert(key.into(), value.into());
}
self
}
/// Enable or disable manifest-backed directory namespace mode for local
/// native connections.
///
/// When enabled, the connection uses the directory namespace database
/// directly for all table operations and forces
/// `dir_listing_to_manifest_migration_enabled=true`.
pub fn manifest_enabled(mut self, enabled: bool) -> Self {
self.request.manifest_enabled = enabled;
self
}
/// The interval at which to check for updates from other processes. This
/// only affects LanceDB OSS.
///
@@ -852,6 +907,16 @@ impl ConnectBuilder {
pub async fn execute(self) -> Result<Connection> {
if self.request.uri.starts_with("db") {
self.execute_remote()
} else if self.request.manifest_enabled {
let internal = Arc::new(
ListingDatabase::connect_manifest_enabled_namespace_database(&self.request).await?,
);
Ok(Connection {
internal,
embedding_registry: self
.embedding_registry
.unwrap_or_else(|| Arc::new(MemoryRegistry::new())),
})
} else {
let internal = Arc::new(ListingDatabase::connect_with_options(&self.request).await?);
Ok(Connection {
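Note on the `execute` hunk above: the backend is chosen in a fixed order, with the remote check first, then `manifest_enabled`, then the plain listing database. A minimal std-only sketch of that ordering follows. `Backend` and `choose_backend` are hypothetical names used only to make the precedence explicit.

```rust
/// Hypothetical labels for the three connection paths in `execute`.
#[derive(Debug, PartialEq)]
pub enum Backend {
    Remote,
    ManifestNamespace,
    Listing,
}

pub fn choose_backend(uri: &str, manifest_enabled: bool) -> Backend {
    if uri.starts_with("db") {
        // Checked first, so `manifest_enabled` has no effect on remote URIs.
        Backend::Remote
    } else if manifest_enabled {
        Backend::ManifestNamespace
    } else {
        Backend::Listing
    }
}
```

Because the prefix test mirrors the `starts_with("db")` check in the diff, a relative local path that happens to begin with `db` would also take the remote branch in this sketch.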
@@ -881,7 +946,7 @@ use std::collections::HashSet;
/// These operations will be executed on the namespace server instead of locally
/// when enabled via [`ConnectNamespaceBuilder::pushdown_operations`].
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum PushdownOperation {
pub enum NamespaceClientPushdownOperation {
/// Execute queries on the namespace server via `query_table()` instead of locally.
QueryTable,
/// Execute table creation on the namespace server via `create_table()`
@@ -893,10 +958,11 @@ pub struct ConnectNamespaceBuilder {
ns_impl: String,
properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
namespace_client_properties: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
embedding_registry: Option<Arc<dyn EmbeddingRegistry>>,
session: Option<Arc<lance::session::Session>>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
}
impl ConnectNamespaceBuilder {
@@ -905,6 +971,7 @@ impl ConnectNamespaceBuilder {
ns_impl: ns_impl.to_string(),
properties,
storage_options: HashMap::new(),
namespace_client_properties: HashMap::new(),
read_consistency_interval: None,
embedding_registry: None,
session: None,
@@ -933,6 +1000,29 @@ impl ConnectNamespaceBuilder {
self
}
/// Set an additional namespace client property.
pub fn namespace_client_property(
mut self,
key: impl Into<String>,
value: impl Into<String>,
) -> Self {
self.namespace_client_properties
.insert(key.into(), value.into());
self
}
/// Set multiple additional namespace client properties.
pub fn namespace_client_properties(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
for (key, value) in pairs {
self.namespace_client_properties
.insert(key.into(), value.into());
}
self
}
/// The interval at which to check for updates from other processes.
///
/// If left unset, consistency is not checked. For maximum read
@@ -970,11 +1060,11 @@ impl ConnectNamespaceBuilder {
/// and leveraging server-side compute resources.
///
/// Available operations:
/// - [`PushdownOperation::QueryTable`]: Execute queries via `namespace.query_table()`
/// - [`PushdownOperation::CreateTable`]: Execute table creation via `namespace.create_table()`
/// - [`NamespaceClientPushdownOperation::QueryTable`]: Execute queries via `namespace.query_table()`
/// - [`NamespaceClientPushdownOperation::CreateTable`]: Execute table creation via `namespace.create_table()`
///
/// By default, no operations are pushed down (all executed locally).
pub fn pushdown_operation(mut self, operation: PushdownOperation) -> Self {
pub fn pushdown_operation(mut self, operation: NamespaceClientPushdownOperation) -> Self {
self.pushdown_operations.insert(operation);
self
}
@@ -984,7 +1074,7 @@ impl ConnectNamespaceBuilder {
/// See [`Self::pushdown_operation`] for details.
pub fn pushdown_operations(
mut self,
operations: impl IntoIterator<Item = PushdownOperation>,
operations: impl IntoIterator<Item = NamespaceClientPushdownOperation>,
) -> Self {
self.pushdown_operations.extend(operations);
self
@@ -994,10 +1084,13 @@ impl ConnectNamespaceBuilder {
pub async fn execute(self) -> Result<Connection> {
use crate::database::namespace::LanceNamespaceDatabase;
let mut properties = self.properties;
properties.extend(self.namespace_client_properties);
let internal = Arc::new(
LanceNamespaceDatabase::connect(
&self.ns_impl,
self.properties,
properties,
self.storage_options,
self.read_consistency_interval,
self.session,
@@ -1070,6 +1163,9 @@ mod tests {
use lance_testing::datagen::{BatchGenerator, IncrementingInt32};
use tempfile::tempdir;
use crate::database::listing::{ListingDatabaseOptions, OPT_NEW_TABLE_V2_MANIFEST_PATHS};
use crate::database::namespace::LanceNamespaceDatabase;
use crate::table::NativeTable;
use crate::test_utils::connection::new_test_connection;
use super::*;
@@ -1117,6 +1213,172 @@ mod tests {
assert_eq!(db.uri(), relative_uri.to_str().unwrap().to_string());
}
#[tokio::test]
async fn test_connect_with_namespace_client_properties() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri)
.namespace_client_property("table_version_tracking_enabled", "true")
.namespace_client_property("manifest_enabled", "true")
.execute()
.await
.unwrap();
let (ns_impl, properties) = db.namespace_client_config().await.unwrap();
assert_eq!(ns_impl, "dir");
assert_eq!(properties.get("root"), Some(&uri.to_string()));
assert_eq!(
properties.get("table_version_tracking_enabled"),
Some(&"true".to_string())
);
assert_eq!(
properties.get("manifest_enabled"),
Some(&"true".to_string())
);
}
#[tokio::test]
async fn test_connect_with_manifest_enabled_uses_directory_namespace() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri)
.manifest_enabled(true)
.storage_option("timeout", "30s")
.namespace_client_property("manifest_enabled", "false")
.namespace_client_property("dir_listing_to_manifest_migration_enabled", "false")
.execute()
.await
.unwrap();
assert!(
db.database()
.as_any()
.downcast_ref::<LanceNamespaceDatabase>()
.is_some()
);
assert_eq!(db.uri(), uri);
let (ns_impl, properties) = db.namespace_client_config().await.unwrap();
assert_eq!(ns_impl, "dir");
assert_eq!(properties.get("root"), Some(&uri.to_string()));
assert_eq!(
properties.get("manifest_enabled"),
Some(&"true".to_string())
);
assert_eq!(
properties.get("dir_listing_to_manifest_migration_enabled"),
Some(&"true".to_string())
);
assert_eq!(properties.get("storage.timeout"), Some(&"30s".to_string()));
}
#[tokio::test]
async fn test_manifest_enabled_rejects_commit_engine_uri() {
let Err(err) = connect("s3+ddb://bucket/db?ddbTableName=manifest")
.manifest_enabled(true)
.execute()
.await
else {
panic!("expected manifest-enabled s3+ddb connection to fail");
};
assert!(
matches!(err, Error::NotSupported { message } if message.contains("commit engine URI schemes"))
);
let Err(err) = connect("s3://bucket/db?engine=ddb&ddbTableName=manifest")
.manifest_enabled(true)
.execute()
.await
else {
panic!("expected manifest-enabled engine query connection to fail");
};
assert!(
matches!(err, Error::NotSupported { message } if message.contains("commit engine"))
);
}
#[tokio::test]
async fn test_manifest_enabled_connection_migrates_root_listing_table() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
connect(uri)
.execute()
.await
.unwrap()
.create_empty_table("legacy", schema)
.execute()
.await
.unwrap();
let db = connect(uri).manifest_enabled(true).execute().await.unwrap();
let tables = db.table_names().execute().await.unwrap();
assert_eq!(tables, vec!["legacy".to_string()]);
db.open_table("legacy").execute().await.unwrap();
}
#[tokio::test]
async fn test_manifest_enabled_preserves_new_table_options() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let options = ListingDatabaseOptions::builder()
.enable_v2_manifest_paths(true)
.build();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
let table = connect(uri)
.manifest_enabled(true)
.database_options(&options)
.execute()
.await
.unwrap()
.create_empty_table("v1_manifest", schema)
.storage_option(OPT_NEW_TABLE_V2_MANIFEST_PATHS, "false")
.execute()
.await
.unwrap();
let native_table = table
.base_table()
.as_any()
.downcast_ref::<NativeTable>()
.unwrap();
assert!(!native_table.uses_v2_manifest_paths().await.unwrap());
}
#[tokio::test]
async fn test_manifest_enabled_vend_input_storage_options() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
let table = connect(uri)
.manifest_enabled(true)
.storage_option("test_storage_option", "test_value")
.namespace_client_property("vend_input_storage_options", "true")
.namespace_client_property(
"vend_input_storage_options_refresh_interval_millis",
"60000",
)
.execute()
.await
.unwrap()
.create_empty_table("vended", schema)
.execute()
.await
.unwrap();
let storage_options = table.latest_storage_options().await.unwrap().unwrap();
assert_eq!(
storage_options.get("test_storage_option"),
Some(&"test_value".to_string())
);
assert!(storage_options.contains_key("expires_at_millis"));
}
#[tokio::test]
async fn test_table_names() {
let tc = new_test_connection().await.unwrap();
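
The connection tests in this hunk suggest a layering order for namespace client properties: caller-supplied properties come first, `root` and `storage.`-prefixed storage options are written on top, and a manifest-enabled connection forces `manifest_enabled` and `dir_listing_to_manifest_migration_enabled` to `"true"` last. A minimal sketch of that ordering, assuming a free function named `layer_namespace_properties` (a hypothetical name and signature, not the crate's API):

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the property layering exercised by the tests
/// above. The name and signature are illustrative only.
pub fn layer_namespace_properties(
    root: &str,
    storage_options: &HashMap<String, String>,
    client_properties: HashMap<String, String>,
    manifest_enabled: bool,
) -> HashMap<String, String> {
    // Start from the caller-supplied properties so later inserts override them.
    let mut properties = client_properties;
    properties.insert("root".to_string(), root.to_string());

    // Storage options are exposed to the namespace client under a `storage.` prefix.
    for (key, value) in storage_options {
        properties.insert(format!("storage.{}", key), value.clone());
    }

    if manifest_enabled {
        // Written last, so a caller-supplied "false" does not survive.
        properties.insert("manifest_enabled".to_string(), "true".to_string());
        properties.insert(
            "dir_listing_to_manifest_migration_enabled".to_string(),
            "true".to_string(),
        );
    }
    properties
}
```

Under this ordering, passing `manifest_enabled = "false"` as a client property on a manifest-enabled connection still yields `"true"`, which matches the assertion in `test_connect_with_manifest_enabled_uses_directory_namespace`.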


@@ -20,6 +20,7 @@ use snafu::ResultExt;
use crate::connection::ConnectRequest;
use crate::database::ReadConsistency;
use crate::database::namespace::LanceNamespaceDatabase;
use crate::error::{CreateDirSnafu, Error, Result};
use crate::io::object_store::MirroringObjectStoreWrapper;
use crate::table::NativeTable;
@@ -255,6 +256,9 @@ pub struct ListingDatabase {
// Session for object stores and caching
session: Arc<lance::session::Session>,
// Namespace-backed database for child namespace operations
namespace_database: Arc<LanceNamespaceDatabase>,
}
impl std::fmt::Display for ListingDatabase {
@@ -281,6 +285,175 @@ const MIRRORED_STORE: &str = "mirroredStore";
/// A connection to LanceDB
impl ListingDatabase {
pub(crate) fn build_namespace_client_properties(
uri: &str,
storage_options: &HashMap<String, String>,
namespace_client_properties: HashMap<String, String>,
) -> HashMap<String, String> {
let mut properties = namespace_client_properties;
properties.insert("root".to_string(), uri.to_string());
for (key, value) in storage_options {
properties.insert(format!("storage.{}", key), value.clone());
}
properties
}
pub(crate) fn build_manifest_enabled_namespace_client_properties(
uri: &str,
storage_options: &HashMap<String, String>,
namespace_client_properties: HashMap<String, String>,
) -> HashMap<String, String> {
let mut properties = Self::build_namespace_client_properties(
uri,
storage_options,
namespace_client_properties,
);
properties.insert("manifest_enabled".to_string(), "true".to_string());
properties.insert(
"dir_listing_to_manifest_migration_enabled".to_string(),
"true".to_string(),
);
properties
}
async fn connect_namespace_database(
uri: &str,
storage_options: HashMap<String, String>,
namespace_client_properties: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
session: Arc<lance::session::Session>,
) -> Result<Arc<LanceNamespaceDatabase>> {
let ns_properties = Self::build_namespace_client_properties(
uri,
&storage_options,
namespace_client_properties,
);
Ok(Arc::new(
LanceNamespaceDatabase::connect(
"dir",
ns_properties,
storage_options,
read_consistency_interval,
Some(session),
HashSet::new(),
)
.await?,
))
}
async fn prepare_namespace_root(
uri: &str,
storage_options: &HashMap<String, String>,
session: Arc<lance::session::Session>,
) -> Result<String> {
match url::Url::parse(uri) {
Ok(url) if url.scheme().len() == 1 && cfg!(windows) => {
let (object_store, _) = ObjectStore::from_uri_and_params(
session.store_registry(),
uri,
&ObjectStoreParams::default(),
)
.await?;
if object_store.is_local() {
Self::try_create_dir(uri).context(CreateDirSnafu { path: uri })?;
}
Ok(uri.to_string())
}
Ok(mut url) => {
if url.scheme().contains('+') {
return Err(Error::NotSupported {
message: "commit engine URI schemes are not supported for manifest-enabled namespace connections".to_string(),
});
}
for (key, value) in url.query_pairs() {
if key == ENGINE {
return Err(Error::NotSupported {
message: format!(
"commit engine '{}' is not supported for manifest-enabled namespace connections",
value
),
});
} else if key == MIRRORED_STORE {
return Err(Error::NotSupported {
message: "mirrored store is not supported for manifest-enabled namespace connections"
.to_string(),
});
}
}
url.set_query(None);
let plain_uri = url.to_string();
let os_params = ObjectStoreParams {
storage_options_accessor: if storage_options.is_empty() {
None
} else {
Some(Arc::new(StorageOptionsAccessor::with_static_options(
storage_options.clone(),
)))
},
..Default::default()
};
let (object_store, _) = ObjectStore::from_uri_and_params(
session.store_registry(),
&plain_uri,
&os_params,
)
.await?;
if object_store.is_local() {
Self::try_create_dir(&plain_uri).context(CreateDirSnafu {
path: plain_uri.clone(),
})?;
}
Ok(plain_uri)
}
Err(_) => {
let (object_store, _) = ObjectStore::from_uri_and_params(
session.store_registry(),
uri,
&ObjectStoreParams::default(),
)
.await?;
if object_store.is_local() {
Self::try_create_dir(uri).context(CreateDirSnafu { path: uri })?;
}
Ok(uri.to_string())
}
}
}
pub(crate) async fn connect_manifest_enabled_namespace_database(
request: &ConnectRequest,
) -> Result<LanceNamespaceDatabase> {
let options = ListingDatabaseOptions::parse_from_map(&request.options)?;
let session = request
.session
.clone()
.unwrap_or_else(|| Arc::new(lance::session::Session::default()));
let namespace_root =
Self::prepare_namespace_root(&request.uri, &options.storage_options, session.clone())
.await?;
let ns_properties = Self::build_manifest_enabled_namespace_client_properties(
&namespace_root,
&options.storage_options,
request.namespace_client_properties.clone(),
);
LanceNamespaceDatabase::connect_with_new_table_config(
"dir",
ns_properties,
options.storage_options,
request.read_consistency_interval,
Some(session),
HashSet::new(),
options.new_table_config,
)
.await
.map(|db| db.with_uri(request.uri.clone()))
}
/// Connect to a listing database
///
/// The URI should be a path to a directory where the tables are stored.
@@ -300,6 +473,7 @@ impl ListingDatabase {
uri,
request.read_consistency_interval,
options.new_table_config,
request.namespace_client_properties.clone(),
request.session.clone(),
)
.await
@@ -331,8 +505,15 @@ impl ListingDatabase {
// Filter out the commit store query param -- it's a lancedb param
url.query_pairs_mut().clear();
url.query_pairs_mut().extend_pairs(filtered_querys);
// Take a copy of the query string so we can propagate it to lance
let query_string = url.query().map(|s| s.to_string());
// Take a copy of the query string so we can propagate it to lance.
// `query_pairs_mut()` leaves the URL with `Some("")` even when no
// pairs survive (or none existed in the first place), so an empty
// string here must be treated the same as "no query" — otherwise
// every table URI ends up with a trailing `?`, which makes downstream
// sub-paths (e.g. MemWAL gen paths) re-parse as path=<base table> +
// query=<sub-path>, causing Lance to find the base table dataset
// when looking up the sub-path.
let query_string = url.query().filter(|q| !q.is_empty()).map(|s| s.to_string());
// clear the query string so we can use the url as the base uri
// use .set_query(None) instead of .set_query("") because the latter
// will add a trailing '?' to the url
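
The comment above describes why an empty query string must be treated as absent. The effect can be shown with the standard library alone. This is a sketch: `capture_query` mirrors the `filter`/`map` chain from the hunk, while `append_query` is a hypothetical helper standing in for however table URIs are assembled downstream.

```rust
/// Treat `Some("")` the same as `None`, as the hunk above does.
pub fn capture_query(raw_query: Option<&str>) -> Option<String> {
    raw_query.filter(|q| !q.is_empty()).map(|s| s.to_string())
}

/// Hypothetical helper: append a query string to a base URI when one exists.
pub fn append_query(base: &str, query: Option<&str>) -> String {
    match query {
        Some(q) => format!("{}?{}", base, q),
        None => base.to_string(),
    }
}
```

Without the filter, an empty query is still `Some("")`, and a helper like `append_query` would emit a bare trailing `?`. With the filter, the base URI comes back unchanged.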
@@ -387,6 +568,15 @@ impl ListingDatabase {
None => None,
};
let namespace_database = Self::connect_namespace_database(
&table_base_uri,
options.storage_options.clone(),
request.namespace_client_properties.clone(),
request.read_consistency_interval,
session.clone(),
)
.await?;
Ok(Self {
uri: table_base_uri,
query_string,
@@ -398,6 +588,7 @@ impl ListingDatabase {
storage_options_provider: None,
new_table_config: options.new_table_config,
session,
namespace_database,
})
}
Err(_) => {
@@ -405,6 +596,7 @@ impl ListingDatabase {
uri,
request.read_consistency_interval,
options.new_table_config,
request.namespace_client_properties.clone(),
request.session.clone(),
)
.await
@@ -416,6 +608,7 @@ impl ListingDatabase {
path: &str,
read_consistency_interval: Option<std::time::Duration>,
new_table_config: NewTableConfig,
namespace_client_properties: HashMap<String, String>,
session: Option<Arc<lance::session::Session>>,
) -> Result<Self> {
let session = session.unwrap_or_else(|| Arc::new(lance::session::Session::default()));
@@ -429,6 +622,15 @@ impl ListingDatabase {
Self::try_create_dir(path).context(CreateDirSnafu { path })?;
}
let namespace_database = Self::connect_namespace_database(
path,
HashMap::new(),
namespace_client_properties,
read_consistency_interval,
session.clone(),
)
.await?;
Ok(Self {
uri: path.to_string(),
query_string: None,
@@ -440,6 +642,7 @@ impl ListingDatabase {
storage_options_provider: None,
new_table_config,
session,
namespace_database,
})
}
@@ -497,6 +700,10 @@ impl ListingDatabase {
Ok(uri)
}
fn namespace_database(&self) -> Arc<LanceNamespaceDatabase> {
self.namespace_database.clone()
}
async fn drop_tables(&self, names: Vec<String>) -> Result<()> {
let object_store_params = ObjectStoreParams {
storage_options_accessor: if self.storage_options.is_empty() {
@@ -515,7 +722,7 @@ impl ListingDatabase {
let commit_handler = commit_handler_from_url(&uri, &Some(object_store_params)).await?;
for name in names {
let dir_name = format!("{}.{}", name, LANCE_EXTENSION);
let full_path = self.base_path.child(dir_name.clone());
let full_path = self.base_path.clone().join(dir_name.clone());
commit_handler.delete(&full_path).await?;
@@ -621,15 +828,12 @@ impl ListingDatabase {
store_params.storage_options_accessor = Some(Arc::new(accessor));
}
write_params.data_storage_version = self
.new_table_config
.data_storage_version
.or(storage_version_override);
write_params.data_storage_version = storage_version_override
.or(write_params.data_storage_version)
.or(self.new_table_config.data_storage_version);
if let Some(enable_v2_manifest_paths) = self
.new_table_config
.enable_v2_manifest_paths
.or(v2_manifest_override)
if let Some(enable_v2_manifest_paths) =
v2_manifest_override.or(self.new_table_config.enable_v2_manifest_paths)
{
write_params.enable_v2_manifest_paths = enable_v2_manifest_paths;
}
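
The hunk above reverses the precedence of the two settings: the per-table override is now consulted first and the connection-level `new_table_config` value last. Because `Option::or` keeps the first `Some`, the order of the chain is the precedence. A small sketch, where `resolve_setting` is an illustrative helper and not part of the crate:

```rust
/// Illustrative helper: the first `Some` in the chain wins, so the argument
/// order encodes the precedence.
pub fn resolve_setting<T>(
    table_override: Option<T>,
    write_param: Option<T>,
    connection_default: Option<T>,
) -> Option<T> {
    table_override.or(write_param).or(connection_default)
}
```

With a connection default of `Some(true)` and a table override of `Some(false)`, the result is `Some(false)`. That appears to be the behavior `test_manifest_enabled_preserves_new_table_options` checks for v2 manifest paths.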
@@ -696,16 +900,7 @@ impl Database for ListingDatabase {
&self,
request: ListNamespacesRequest,
) -> Result<ListNamespacesResponse> {
if request.id.as_ref().map(|v| !v.is_empty()).unwrap_or(false) {
return Err(Error::NotSupported {
message: "Namespace operations are not supported for listing database".into(),
});
}
Ok(ListNamespacesResponse {
namespaces: Vec::new(),
page_token: None,
})
self.namespace_database().list_namespaces(request).await
}
fn uri(&self) -> &str {
@@ -726,36 +921,26 @@ impl Database for ListingDatabase {
async fn create_namespace(
&self,
_request: CreateNamespaceRequest,
request: CreateNamespaceRequest,
) -> Result<CreateNamespaceResponse> {
Err(Error::NotSupported {
message: "Namespace operations are not supported for listing database".into(),
})
self.namespace_database().create_namespace(request).await
}
async fn drop_namespace(
&self,
_request: DropNamespaceRequest,
) -> Result<DropNamespaceResponse> {
Err(Error::NotSupported {
message: "Namespace operations are not supported for listing database".into(),
})
async fn drop_namespace(&self, request: DropNamespaceRequest) -> Result<DropNamespaceResponse> {
self.namespace_database().drop_namespace(request).await
}
async fn describe_namespace(
&self,
_request: DescribeNamespaceRequest,
request: DescribeNamespaceRequest,
) -> Result<DescribeNamespaceResponse> {
Err(Error::NotSupported {
message: "Namespace operations are not supported for listing database".into(),
})
self.namespace_database().describe_namespace(request).await
}
#[allow(deprecated)]
async fn table_names(&self, request: TableNamesRequest) -> Result<Vec<String>> {
if !request.namespace_path.is_empty() {
return Err(Error::NotSupported {
message: "Namespace parameter is not supported for listing database. Only root namespace is supported.".into(),
});
return self.namespace_database().table_names(request).await;
}
let mut f = self
.object_store
@@ -788,9 +973,7 @@ impl Database for ListingDatabase {
async fn list_tables(&self, request: ListTablesRequest) -> Result<ListTablesResponse> {
if request.id.as_ref().map(|v| !v.is_empty()).unwrap_or(false) {
return Err(Error::NotSupported {
message: "Namespace parameter is not supported for listing database. Only root namespace is supported.".into(),
});
return self.namespace_database().list_tables(request).await;
}
let mut f = self
.object_store
@@ -838,11 +1021,8 @@ impl Database for ListingDatabase {
}
async fn create_table(&self, request: CreateTableRequest) -> Result<Arc<dyn BaseTable>> {
// When namespace is not empty, location must be provided
if !request.namespace_path.is_empty() && request.location.is_none() {
return Err(Error::InvalidInput {
message: "Location must be provided when namespace is not empty".into(),
});
if !request.namespace_path.is_empty() {
return self.namespace_database().create_table(request).await;
}
// Use provided location if available, otherwise derive from table name
let table_uri = request
@@ -959,11 +1139,8 @@ impl Database for ListingDatabase {
}
async fn open_table(&self, mut request: OpenTableRequest) -> Result<Arc<dyn BaseTable>> {
// When namespace is not empty, location must be provided
if !request.namespace_path.is_empty() && request.location.is_none() {
return Err(Error::InvalidInput {
message: "Location must be provided when namespace is not empty".into(),
});
if !request.namespace_path.is_empty() {
return self.namespace_database().open_table(request).await;
}
// Use provided location if available, otherwise derive from table name
let table_uri = request
@@ -1059,9 +1236,10 @@ impl Database for ListingDatabase {
async fn drop_table(&self, name: &str, namespace_path: &[String]) -> Result<()> {
if !namespace_path.is_empty() {
return Err(Error::NotSupported {
message: "Namespace parameter is not supported for listing database.".into(),
});
return self
.namespace_database()
.drop_table(name, namespace_path)
.await;
}
self.drop_tables(vec![name.to_string()]).await
}
@@ -1070,9 +1248,10 @@ impl Database for ListingDatabase {
async fn drop_all_tables(&self, namespace_path: &[String]) -> Result<()> {
// Check if namespace parameter is provided
if !namespace_path.is_empty() {
return Err(Error::NotSupported {
message: "Namespace parameter is not supported for listing database.".into(),
});
return self
.namespace_database()
.drop_all_tables(namespace_path)
.await;
}
let tables = self.table_names(TableNamesRequest::default()).await?;
self.drop_tables(tables).await
@@ -1083,30 +1262,11 @@ impl Database for ListingDatabase {
}
async fn namespace_client(&self) -> Result<Arc<dyn lance_namespace::LanceNamespace>> {
// Create a DirectoryNamespace pointing to the same root with the same storage options
let mut builder = lance_namespace_impls::DirectoryNamespaceBuilder::new(&self.uri);
// Add storage options
if !self.storage_options.is_empty() {
builder = builder.storage_options(self.storage_options.clone());
}
// Use the same session
builder = builder.session(self.session.clone());
let namespace = builder.build().await.map_err(|e| Error::Runtime {
message: format!("Failed to create namespace client: {}", e),
})?;
Ok(Arc::new(namespace) as Arc<dyn lance_namespace::LanceNamespace>)
self.namespace_database.namespace_client().await
}
async fn namespace_client_config(&self) -> Result<(String, HashMap<String, String>)> {
let mut properties = HashMap::new();
properties.insert("root".to_string(), self.uri.clone());
for (key, value) in &self.storage_options {
properties.insert(format!("storage.{}", key), value.clone());
}
Ok(("dir".to_string(), properties))
self.namespace_database.namespace_client_config().await
}
}
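
Read together, the table-level hunks above follow one rule: an empty namespace path stays on the existing listing code path, and any non-empty path is forwarded to the namespace-backed database instead of returning `NotSupported`. A sketch of that dispatch, with `Backend` and `backend_for_table_op` as hypothetical names used only for illustration:

```rust
/// Hypothetical label for which implementation serves a table operation.
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    /// The original directory-listing code path (root namespace).
    Listing,
    /// The embedded namespace-backed database (child namespaces).
    Namespace,
}

/// Root-level requests stay local; anything under a namespace is delegated.
pub fn backend_for_table_op(namespace_path: &[String]) -> Backend {
    if namespace_path.is_empty() {
        Backend::Listing
    } else {
        Backend::Namespace
    }
}
```

The namespace-level operations (`list_namespaces`, `create_namespace`, `drop_namespace`, `describe_namespace`) do not branch this way in the diff; they are delegated unconditionally.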
@@ -1132,6 +1292,8 @@ mod tests {
#[cfg(feature = "remote")]
client_config: Default::default(),
options: Default::default(),
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1265,6 +1427,8 @@ mod tests {
#[cfg(feature = "remote")]
client_config: Default::default(),
options: options.clone(),
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1799,6 +1963,8 @@ mod tests {
#[cfg(feature = "remote")]
client_config: Default::default(),
options,
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1904,6 +2070,8 @@ mod tests {
#[cfg(feature = "remote")]
client_config: Default::default(),
options,
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1975,6 +2143,8 @@ mod tests {
#[cfg(feature = "remote")]
client_config: Default::default(),
options,
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -2050,6 +2220,133 @@ mod tests {
assert_eq!(uri, expected);
}
/// Regression: connecting via a URL-style URI (which goes through
/// `url::Url::parse` and the `query_pairs_mut()` path) must not
/// append a trailing `?` to per-table URIs when the input URI has
/// no query string.
///
/// Earlier, `query_pairs_mut().clear()` left the URL with
/// `query=Some("")`, which then propagated as a trailing `?` on
/// every table URI. Sub-path lookups against that URI (e.g. MemWAL
/// `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`) re-parsed as
/// `path=<base table>` + `query=/_mem_wal/...`, causing
/// `Dataset::write` to find the base table dataset and falsely
/// report `Dataset already exists`.
/// Mirrors the URL-mutation step from
/// [`ListingDatabase::connect_with_options`] so we can assert the
/// fix without going through filesystem setup (which is awkward
/// across platforms — see the `file://` test below).
fn capture_query_like_connect(input_uri: &str) -> Option<String> {
let mut url = url::Url::parse(input_uri).unwrap();
let mut filtered_querys = Vec::new();
for (key, value) in url.query_pairs() {
if key == ENGINE || key == MIRRORED_STORE {
continue;
}
filtered_querys.push((key.to_string(), value.to_string()));
}
url.query_pairs_mut().clear();
url.query_pairs_mut().extend_pairs(filtered_querys);
url.query().filter(|q| !q.is_empty()).map(|s| s.to_string())
}
#[test]
fn test_capture_query_treats_empty_as_none() {
// No query at all. With the bug, `query_pairs_mut()` left the
// URL with `query=Some("")` and we used to propagate that.
assert_eq!(
capture_query_like_connect("s3://bucket/prefix/"),
None,
"empty query after mutation must be treated as no query"
);
// Real query is propagated.
assert_eq!(
capture_query_like_connect("s3://bucket/prefix/?foo=bar"),
Some("foo=bar".to_string())
);
// lancedb-internal `engine=` is stripped; nothing remains, so
// query_string is None — not Some("").
assert_eq!(
capture_query_like_connect(&format!("s3://bucket/prefix/?{}=mem", ENGINE)),
None
);
// Mixed: drop `engine=`, keep the rest.
let captured =
capture_query_like_connect(&format!("s3://bucket/prefix/?{}=mem&foo=bar", ENGINE));
assert_eq!(captured.as_deref(), Some("foo=bar"));
}
/// Regression: connecting via a URL-style URI (which goes through
/// `url::Url::parse` and the `query_pairs_mut()` path) must not
/// append a trailing `?` to per-table URIs when the input URI has
/// no query string. Sub-path lookups against such a URI (e.g.
/// MemWAL `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`) re-parse
/// as `path=<base table>` + `query=/_mem_wal/...`, causing
/// `Dataset::write` to find the base table dataset and falsely
/// report `Dataset already exists`.
///
/// Skipped on Windows: `try_create_dir` does not understand
/// `file:///C:/…` paths so `connect_with_options` fails before
/// even reaching the URL-mutation logic. The pure URL-mutation
/// invariant is covered by
/// `test_capture_query_treats_empty_as_none` above, which runs
/// on all platforms.
#[cfg(not(windows))]
#[tokio::test]
async fn test_table_uri_url_path_has_no_trailing_question_mark() {
let tempdir = tempdir().unwrap();
let uri = format!("file://{}", tempdir.path().to_str().unwrap());
let request = ConnectRequest {
uri: uri.clone(),
#[cfg(feature = "remote")]
client_config: Default::default(),
options: Default::default(),
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
let db = ListingDatabase::connect_with_options(&request)
.await
.unwrap();
assert_eq!(
db.query_string, None,
"no input query → no captured query_string"
);
let table_uri = db.table_uri("test").unwrap();
assert!(
!table_uri.ends_with('?'),
"table_uri must not have a trailing `?`: {}",
table_uri
);
assert_eq!(table_uri, format!("{}/test.lance", uri));
// A real query string should still be propagated.
let with_query = format!("{}?foo=bar", uri);
let request_with_query = ConnectRequest {
uri: with_query,
#[cfg(feature = "remote")]
client_config: Default::default(),
options: Default::default(),
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
let db_with_query = ListingDatabase::connect_with_options(&request_with_query)
.await
.unwrap();
assert_eq!(db_with_query.query_string.as_deref(), Some("foo=bar"));
let table_uri = db_with_query.table_uri("test").unwrap();
assert_eq!(table_uri, format!("{}/test.lance?foo=bar", uri));
}
#[tokio::test]
async fn test_namespace_client() {
let (_tempdir, db) = setup_database().await;
@@ -2108,4 +2405,210 @@ mod tests {
assert!(tables.contains(&"table1".to_string()));
assert!(tables.contains(&"table2".to_string()));
}
#[tokio::test]
async fn test_listing_database_namespace_operations() {
let (_tempdir, db) = setup_database().await;
db.create_namespace(CreateNamespaceRequest {
id: Some(vec!["parent".to_string()]),
..Default::default()
})
.await
.unwrap();
db.create_namespace(CreateNamespaceRequest {
id: Some(vec!["parent".to_string(), "child".to_string()]),
..Default::default()
})
.await
.unwrap();
let root_namespaces = db
.list_namespaces(ListNamespacesRequest {
id: Some(vec![]),
..Default::default()
})
.await
.unwrap();
assert!(root_namespaces.namespaces.contains(&"parent".to_string()));
let child_namespaces = db
.list_namespaces(ListNamespacesRequest {
id: Some(vec!["parent".to_string()]),
..Default::default()
})
.await
.unwrap();
assert!(child_namespaces.namespaces.contains(&"child".to_string()));
db.describe_namespace(DescribeNamespaceRequest {
id: Some(vec!["parent".to_string(), "child".to_string()]),
..Default::default()
})
.await
.unwrap();
}
#[tokio::test]
#[cfg(not(windows))] // TODO: support Windows once directory namespace-backed listing DB tests are supported.
async fn test_listing_database_with_namespace_client_properties() {
let tempdir = tempdir().unwrap();
let uri = tempdir.path().to_str().unwrap();
let mut namespace_client_properties = HashMap::new();
namespace_client_properties.insert(
"table_version_tracking_enabled".to_string(),
"true".to_string(),
);
namespace_client_properties.insert("manifest_enabled".to_string(), "true".to_string());
let request = ConnectRequest {
uri: uri.to_string(),
#[cfg(feature = "remote")]
client_config: Default::default(),
options: Default::default(),
namespace_client_properties,
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
let db = ListingDatabase::connect_with_options(&request)
.await
.unwrap();
let namespace_path = vec!["test_ns".to_string()];
db.create_namespace(CreateNamespaceRequest {
id: Some(namespace_path.clone()),
..Default::default()
})
.await
.unwrap();
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
]));
db.create_table(CreateTableRequest {
name: "managed_table".to_string(),
namespace_path: namespace_path.clone(),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
namespace_client: None,
})
.await
.unwrap();
let namespace_client = db.namespace_client().await.unwrap();
let describe = namespace_client
.describe_table(lance_namespace::models::DescribeTableRequest {
id: Some(vec!["test_ns".to_string(), "managed_table".to_string()]),
..Default::default()
})
.await
.unwrap();
assert_eq!(describe.managed_versioning, Some(true));
}
#[tokio::test]
async fn test_listing_database_nested_namespace_table_ops() {
let (_tempdir, db) = setup_database().await;
let namespace_path = vec!["parent".to_string(), "child".to_string()];
db.create_namespace(CreateNamespaceRequest {
id: Some(vec!["parent".to_string()]),
..Default::default()
})
.await
.unwrap();
db.create_namespace(CreateNamespaceRequest {
id: Some(namespace_path.clone()),
..Default::default()
})
.await
.unwrap();
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
]));
db.create_table(CreateTableRequest {
name: "nested_table".to_string(),
namespace_path: namespace_path.clone(),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
namespace_client: None,
})
.await
.unwrap();
let namespace_client = db.namespace_client().await.unwrap();
let describe = namespace_client
.describe_table(lance_namespace::models::DescribeTableRequest {
id: Some(vec![
"parent".to_string(),
"child".to_string(),
"nested_table".to_string(),
]),
..Default::default()
})
.await
.unwrap();
assert!(describe.location.is_some());
let table = db
.open_table(OpenTableRequest {
name: "nested_table".to_string(),
namespace_path: namespace_path.clone(),
index_cache_size: None,
lance_read_params: None,
location: None,
namespace_client: None,
managed_versioning: None,
})
.await
.unwrap();
assert_eq!(table.name(), "nested_table");
#[allow(deprecated)]
let table_names = db
.table_names(TableNamesRequest {
namespace_path: namespace_path.clone(),
start_after: None,
limit: None,
})
.await
.unwrap();
assert_eq!(table_names, vec!["nested_table".to_string()]);
let list_tables = db
.list_tables(ListTablesRequest {
id: Some(namespace_path.clone()),
..Default::default()
})
.await
.unwrap();
assert_eq!(list_tables.tables, vec!["nested_table".to_string()]);
db.drop_table("nested_table", &namespace_path)
.await
.unwrap();
let post_drop = db
.list_tables(ListTablesRequest {
id: Some(namespace_path),
..Default::default()
})
.await
.unwrap();
assert!(post_drop.tables.is_empty());
}
}


@@ -22,10 +22,15 @@ use lance_namespace_impls::ConnectBuilder;
use lance_table::io::commit::CommitHandler;
use lance_table::io::commit::external_manifest::ExternalManifestCommitHandler;
use crate::connection::PushdownOperation;
use crate::connection::NamespaceClientPushdownOperation;
use crate::database::ReadConsistency;
use crate::database::listing::{
NewTableConfig, OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS, OPT_NEW_TABLE_STORAGE_VERSION,
OPT_NEW_TABLE_V2_MANIFEST_PATHS,
};
use crate::error::{Error, Result};
use crate::table::NativeTable;
use lance::dataset::WriteMode;
use super::{
BaseTable, CloneTableRequest, CreateTableMode, CreateTableRequest as DbCreateTableRequest,
@@ -44,21 +49,71 @@ pub struct LanceNamespaceDatabase {
// database URI
uri: String,
// Operations to push down to the namespace server
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
// Namespace implementation type (e.g., "dir", "rest")
ns_impl: String,
// Namespace properties used to construct the namespace client
ns_properties: HashMap<String, String>,
// Options for tables created by this connection
new_table_config: NewTableConfig,
}
impl LanceNamespaceDatabase {
pub fn from_namespace_client(
namespace_client: Arc<dyn LanceNamespace>,
namespace_client_impl: String,
namespace_client_properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
namespace_client_pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
) -> Self {
Self {
namespace: namespace_client,
storage_options,
read_consistency_interval,
session,
uri: format!("namespace://{}", namespace_client_impl),
pushdown_operations: namespace_client_pushdown_operations,
ns_impl: namespace_client_impl,
ns_properties: namespace_client_properties,
new_table_config: NewTableConfig::default(),
}
}
pub(crate) fn with_uri(mut self, uri: impl Into<String>) -> Self {
self.uri = uri.into();
self
}
pub async fn connect(
ns_impl: &str,
ns_properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
) -> Result<Self> {
Self::connect_with_new_table_config(
ns_impl,
ns_properties,
storage_options,
read_consistency_interval,
session,
pushdown_operations,
NewTableConfig::default(),
)
.await
}
pub(crate) async fn connect_with_new_table_config(
ns_impl: &str,
ns_properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
new_table_config: NewTableConfig,
) -> Result<Self> {
let mut builder = ConnectBuilder::new(ns_impl);
for (key, value) in ns_properties.clone() {
@@ -80,8 +135,79 @@ impl LanceNamespaceDatabase {
pushdown_operations,
ns_impl: ns_impl.to_string(),
ns_properties,
new_table_config,
})
}
fn extract_storage_overrides(
&self,
request: &DbCreateTableRequest,
) -> Result<(
Option<lance_encoding::version::LanceFileVersion>,
Option<bool>,
Option<bool>,
)> {
let storage_options = request
.write_options
.lance_write_params
.as_ref()
.and_then(|p| p.store_params.as_ref())
.and_then(|sp| sp.storage_options());
let storage_version_override = storage_options
.and_then(|opts| opts.get(OPT_NEW_TABLE_STORAGE_VERSION))
.map(|s| s.parse::<lance_encoding::version::LanceFileVersion>())
.transpose()?;
let v2_manifest_override = storage_options
.and_then(|opts| opts.get(OPT_NEW_TABLE_V2_MANIFEST_PATHS))
.map(|s| s.parse::<bool>())
.transpose()
.map_err(|_| Error::InvalidInput {
message: "enable_v2_manifest_paths must be a boolean".to_string(),
})?;
let stable_row_ids_override = storage_options
.and_then(|opts| opts.get(OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS))
.map(|s| s.parse::<bool>())
.transpose()
.map_err(|_| Error::InvalidInput {
message: "enable_stable_row_ids must be a boolean".to_string(),
})?;
Ok((
storage_version_override,
v2_manifest_override,
stable_row_ids_override,
))
}
fn apply_new_table_config(
&self,
params: &mut lance::dataset::WriteParams,
request: &DbCreateTableRequest,
) -> Result<()> {
let (storage_version_override, v2_manifest_override, stable_row_ids_override) =
self.extract_storage_overrides(request)?;
params.data_storage_version = storage_version_override
.or(params.data_storage_version)
.or(self.new_table_config.data_storage_version);
if let Some(enable_v2_manifest_paths) =
v2_manifest_override.or(self.new_table_config.enable_v2_manifest_paths)
{
params.enable_v2_manifest_paths = enable_v2_manifest_paths;
}
if let Some(enable_stable_row_ids) =
stable_row_ids_override.or(self.new_table_config.enable_stable_row_ids)
{
params.enable_stable_row_ids = enable_stable_row_ids;
}
Ok(())
}
}
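Note on the precedence in `apply_new_table_config` above: for the two boolean flags, a per-request storage option wins over the connection-level `NewTableConfig`, and when neither is set the value already in the write params is left untouched. A minimal standalone sketch of that resolution for one flag follows. `parse_bool_override`, `resolve_flag`, and the literal key `"enable_stable_row_ids"` are hypothetical names used for illustration only; the real key is whatever `OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS` expands to.

```rust
use std::collections::HashMap;

/// Parse an optional boolean override from string-typed storage options.
/// Returns Ok(None) when the key is absent, and Err when the value is not
/// exactly "true" or "false" (the only strings `bool::from_str` accepts).
fn parse_bool_override(
    opts: &HashMap<String, String>,
    key: &str,
) -> Result<Option<bool>, String> {
    opts.get(key)
        .map(|s| s.parse::<bool>())
        .transpose()
        .map_err(|_| format!("{key} must be a boolean"))
}

/// Resolve the effective flag: request override first, then the
/// connection-level default, otherwise keep the current value.
fn resolve_flag(current: bool, request: Option<bool>, conn_default: Option<bool>) -> bool {
    request.or(conn_default).unwrap_or(current)
}
```

The `map(...).transpose()` step is what turns an `Option<Result<_, _>>` into a `Result<Option<_>, _>`, so a missing key stays distinguishable from a malformed value.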
impl std::fmt::Debug for LanceNamespaceDatabase {
@@ -163,37 +289,23 @@ impl Database for LanceNamespaceDatabase {
async fn create_table(&self, request: DbCreateTableRequest) -> Result<Arc<dyn BaseTable>> {
let mut table_id = request.namespace_path.clone();
table_id.push(request.name.clone());
let describe_request = DescribeTableRequest {
id: Some(table_id.clone()),
..Default::default()
};
let describe_result = self.namespace.describe_table(describe_request).await;
let mut existing_table = None;
match request.mode {
CreateTableMode::Create => {
if describe_result.is_ok() {
return Err(Error::TableAlreadyExists {
name: request.name.clone(),
});
}
}
CreateTableMode::Create => {}
CreateTableMode::Overwrite => {
if describe_result.is_ok() {
// Drop the existing table - must succeed
let drop_request = DropTableRequest {
id: Some(table_id.clone()),
..Default::default()
};
self.namespace
.drop_table(drop_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to drop existing table for overwrite: {}", e),
})?;
}
let describe_request = DescribeTableRequest {
id: Some(table_id.clone()),
..Default::default()
};
existing_table = self.namespace.describe_table(describe_request).await.ok();
}
CreateTableMode::ExistOk(_) => {
let describe_request = DescribeTableRequest {
id: Some(table_id.clone()),
..Default::default()
};
let describe_result = self.namespace.describe_table(describe_request).await;
if describe_result.is_ok() {
let native_table = NativeTable::open_from_namespace(
self.namespace.clone(),
@@ -221,20 +333,86 @@ impl Database for LanceNamespaceDatabase {
};
let (location, initial_storage_options, managed_versioning) = {
let response = self.namespace.declare_table(declare_request).await?;
let loc = response.location.ok_or_else(|| Error::Runtime {
message: "Table location is missing from declare_table response".to_string(),
})?;
// Use storage options from response, fall back to self.storage_options
let opts = response
.storage_options
.or_else(|| Some(self.storage_options.clone()))
.filter(|o| !o.is_empty());
(loc, opts, response.managed_versioning)
if let Some(response) = existing_table {
let loc = response.location.ok_or_else(|| Error::Runtime {
message: "Table location is missing from describe_table response".to_string(),
})?;
let opts = response
.storage_options
.or_else(|| Some(self.storage_options.clone()))
.filter(|o| !o.is_empty());
(loc, opts, response.managed_versioning)
} else {
match self.namespace.declare_table(declare_request).await {
Ok(response) => {
let loc = response.location.ok_or_else(|| Error::Runtime {
message: "Table location is missing from declare_table response"
.to_string(),
})?;
let opts = response
.storage_options
.or_else(|| Some(self.storage_options.clone()))
.filter(|o: &HashMap<String, String>| !o.is_empty());
(loc, opts, response.managed_versioning)
}
Err(e)
if matches!(request.mode, CreateTableMode::Create) && {
let err_str = e.to_string();
err_str.contains("already exists")
|| err_str.contains("TableAlreadyExists")
|| err_str.contains("table already exists")
} =>
{
let response = self
.namespace
.describe_table(DescribeTableRequest {
id: Some(table_id.clone()),
..Default::default()
})
.await
.map_err(|describe_err| Error::Runtime {
message: format!(
"Failed to describe existing declared table after declare conflict: {}",
describe_err
),
})?;
if response.version.is_some() && response.schema.is_some() {
return Err(Error::TableAlreadyExists {
name: request.name.clone(),
});
}
let loc = response.location.ok_or_else(|| Error::Runtime {
message: "Table location is missing from describe_table response"
.to_string(),
})?;
let opts = response
.storage_options
.or_else(|| Some(self.storage_options.clone()))
.filter(|o: &HashMap<String, String>| !o.is_empty());
(loc, opts, response.managed_versioning)
}
Err(e) => {
return Err(Error::Runtime {
message: format!("Failed to declare table: {}", e),
});
}
}
}
};
// Build write params with storage options and commit handler
let mut params = request.write_options.lance_write_params.unwrap_or_default();
let mut params = request
.write_options
.lance_write_params
.clone()
.unwrap_or_default();
self.apply_new_table_config(&mut params, &request)?;
if matches!(request.mode, CreateTableMode::Overwrite) {
params.mode = WriteMode::Overwrite;
}
// Set up storage options if provided
if let Some(storage_opts) = initial_storage_options {
@@ -450,6 +628,47 @@ mod tests {
));
}
#[tokio::test]
async fn test_namespace_connection_with_namespace_client_properties() {
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.namespace_client_property("table_version_tracking_enabled", "true")
.namespace_client_property("manifest_enabled", "true")
.execute()
.await
.expect("Failed to connect to namespace");
conn.create_namespace(CreateNamespaceRequest {
id: Some(vec!["test_ns".into()]),
..Default::default()
})
.await
.expect("Failed to create namespace");
let test_data = create_test_data();
conn.create_table("test_table", test_data)
.namespace(vec!["test_ns".into()])
.execute()
.await
.expect("Failed to create table");
let namespace_client = conn.namespace_client().await.unwrap();
let describe = namespace_client
.describe_table(DescribeTableRequest {
id: Some(vec!["test_ns".into(), "test_table".into()]),
..Default::default()
})
.await
.expect("Failed to describe table");
assert_eq!(describe.managed_versioning, Some(true));
}
#[tokio::test]
async fn test_namespace_create_table_basic() {
// Setup: Create a temporary directory for the namespace
@@ -651,6 +870,58 @@ mod tests {
assert_eq!(id_col.value(2), 30);
}
#[tokio::test]
async fn test_namespace_create_table_after_declare_conflict() {
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
conn.create_namespace(CreateNamespaceRequest {
id: Some(vec!["test_ns".into()]),
..Default::default()
})
.await
.expect("Failed to create namespace");
let namespace_client = conn.namespace_client().await.unwrap();
namespace_client
.declare_table(DeclareTableRequest {
id: Some(vec!["test_ns".into(), "declared_test".into()]),
..Default::default()
})
.await
.expect("Failed to declare table");
let test_data = create_test_data();
let table = conn
.create_table("declared_test", test_data)
.namespace(vec!["test_ns".into()])
.execute()
.await
.expect("Failed to create table after declare conflict");
let results = table
.query()
.execute()
.await
.expect("Failed to query table")
.try_collect::<Vec<_>>()
.await
.expect("Failed to collect results");
assert_eq!(results.len(), 1);
assert_eq!(results[0].num_rows(), 5);
assert_eq!(table.namespace(), &["test_ns"]);
assert_eq!(table.id(), "test_ns$declared_test");
}
#[tokio::test]
async fn test_namespace_create_table_exist_ok_mode() {
// Setup: Create a temporary directory for the namespace
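Note on the declare-conflict branch of `create_table` in this file: when `declare_table` reports that the table already exists in `Create` mode, the code describes the table and only returns `TableAlreadyExists` if the response carries both a version and a schema; otherwise it treats the entry as declared-but-unwritten and reuses its location. A reduced sketch of that decision, under the assumption that a plain struct can stand in for the describe response. `Described`, `Outcome`, `is_declare_conflict`, and `resolve_conflict` are hypothetical names, not part of the crate or of `lance_namespace`.

```rust
/// Hypothetical, trimmed-down view of a describe_table response.
struct Described {
    version: Option<i64>,
    schema: Option<String>,
    location: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    /// The table has data: report a genuine conflict.
    AlreadyExists,
    /// The table was only declared: write into its existing location.
    ReuseLocation(String),
    /// The response is unusable because no location was returned.
    MissingLocation,
}

/// String-based detection of a declare conflict, mirroring the substrings
/// checked in the diff.
fn is_declare_conflict(message: &str) -> bool {
    message.contains("already exists") || message.contains("TableAlreadyExists")
}

fn resolve_conflict(described: Described) -> Outcome {
    if described.version.is_some() && described.schema.is_some() {
        return Outcome::AlreadyExists;
    }
    match described.location {
        Some(loc) => Outcome::ReuseLocation(loc),
        None => Outcome::MissingLocation,
    }
}
```

Matching on error text is fragile if the namespace implementation changes its wording, which is worth keeping in mind when reviewing this branch.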


@@ -13,7 +13,10 @@ use crate::{DistanceType, Error, Result, table::BaseTable};
use self::{
scalar::{BTreeIndexBuilder, BitmapIndexBuilder, LabelListIndexBuilder},
vector::{IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder, IvfSqIndexBuilder},
vector::{
IvfHnswFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder,
IvfSqIndexBuilder,
},
};
pub mod scalar;
@@ -67,6 +70,10 @@ pub enum Index {
/// IVF-HNSW index with Scalar Quantization
/// It is a variant of the HNSW algorithm that uses scalar quantization to compress the vectors.
IvfHnswSq(IvfHnswSqIndexBuilder),
/// IVF-HNSW index without quantization.
/// Stores raw vectors, providing the highest recall at the cost of more memory and disk space.
IvfHnswFlat(IvfHnswFlatIndexBuilder),
}
/// Builder for the create_index operation
@@ -290,6 +297,8 @@ pub enum IndexType {
IvfHnswPq,
#[serde(alias = "IVF_HNSW_SQ")]
IvfHnswSq,
#[serde(alias = "IVF_HNSW_FLAT")]
IvfHnswFlat,
// Scalar
#[serde(alias = "BTREE")]
BTree,
@@ -311,6 +320,7 @@ impl std::fmt::Display for IndexType {
Self::IvfRq => write!(f, "IVF_RQ"),
Self::IvfHnswPq => write!(f, "IVF_HNSW_PQ"),
Self::IvfHnswSq => write!(f, "IVF_HNSW_SQ"),
Self::IvfHnswFlat => write!(f, "IVF_HNSW_FLAT"),
Self::BTree => write!(f, "BTREE"),
Self::Bitmap => write!(f, "BITMAP"),
Self::LabelList => write!(f, "LABEL_LIST"),
@@ -334,6 +344,7 @@ impl std::str::FromStr for IndexType {
"IVF_RQ" => Ok(Self::IvfRq),
"IVF_HNSW_PQ" => Ok(Self::IvfHnswPq),
"IVF_HNSW_SQ" => Ok(Self::IvfHnswSq),
"IVF_HNSW_FLAT" => Ok(Self::IvfHnswFlat),
_ => Err(Error::InvalidInput {
message: format!("the input value {} is not a valid IndexType", value),
}),
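Note on the `IndexType` changes in this file: the `Display` and `FromStr` arms added for `IvfHnswFlat` have to agree on the same string so that an index type survives a round trip through its textual form. A reduced sketch of that invariant with a hypothetical two-variant enum follows. `MiniIndexType` is not part of the crate, and the real `FromStr` may normalise its input before matching.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MiniIndexType {
    IvfHnswSq,
    IvfHnswFlat,
}

impl fmt::Display for MiniIndexType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::IvfHnswSq => write!(f, "IVF_HNSW_SQ"),
            Self::IvfHnswFlat => write!(f, "IVF_HNSW_FLAT"),
        }
    }
}

impl FromStr for MiniIndexType {
    type Err = String;

    // Exact-match parsing; unknown strings produce an error message.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value {
            "IVF_HNSW_SQ" => Ok(Self::IvfHnswSq),
            "IVF_HNSW_FLAT" => Ok(Self::IvfHnswFlat),
            _ => Err(format!("the input value {value} is not a valid IndexType")),
        }
    }
}
```

A round-trip check over every variant is a cheap way to catch a variant that was added to one impl but not the other.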


@@ -474,3 +474,46 @@ impl IvfHnswSqIndexBuilder {
impl_ivf_params_setter!();
impl_hnsw_params_setter!();
}
/// Builder for an IVF_HNSW_FLAT index.
///
/// This index combines IVF partitioning with an HNSW graph per partition,
/// storing raw (unquantized) vectors. It offers the highest recall among
/// the IVF_HNSW family at the cost of more memory and disk space compared
/// to [`IvfHnswSqIndexBuilder`] or [`IvfHnswPqIndexBuilder`].
#[derive(Debug, Clone, Serialize)]
pub struct IvfHnswFlatIndexBuilder {
// IVF
#[serde(rename = "metric_type")]
pub(crate) distance_type: DistanceType,
#[serde(skip_serializing_if = "Option::is_none")]
pub(crate) num_partitions: Option<u32>,
pub(crate) sample_rate: u32,
pub(crate) max_iterations: u32,
#[serde(skip_serializing_if = "Option::is_none")]
pub(crate) target_partition_size: Option<u32>,
// HNSW
pub(crate) m: u32,
pub(crate) ef_construction: u32,
}
impl Default for IvfHnswFlatIndexBuilder {
fn default() -> Self {
Self {
distance_type: DistanceType::L2,
num_partitions: None,
sample_rate: 256,
max_iterations: 50,
m: 20,
ef_construction: 300,
target_partition_size: None,
}
}
}
impl IvfHnswFlatIndexBuilder {
impl_distance_type_setter!();
impl_ivf_params_setter!();
impl_hnsw_params_setter!();
}


@@ -5,11 +5,12 @@
use std::{fmt::Formatter, sync::Arc};
use futures::{TryFutureExt, stream::BoxStream};
use futures::{StreamExt, TryFutureExt, stream::BoxStream};
use lance::io::WrappingObjectStore;
use object_store::{
Error, GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta, ObjectStore,
PutMultipartOptions, PutOptions, PutPayload, PutResult, Result, UploadPart, path::Path,
CopyOptions, Error, GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta,
ObjectStore, ObjectStoreExt, PutMultipartOptions, PutOptions, PutPayload, PutResult, Result,
UploadPart, path::Path,
};
use async_trait::async_trait;
@@ -93,20 +94,6 @@ impl ObjectStore for MirroringObjectStore {
self.primary.get_opts(location, options).await
}
async fn head(&self, location: &Path) -> Result<ObjectMeta> {
self.primary.head(location).await
}
async fn delete(&self, location: &Path) -> Result<()> {
if !location.primary_only() {
match self.secondary.delete(location).await {
Err(Error::NotFound { .. }) | Ok(_) => {}
Err(e) => return Err(e),
}
}
self.primary.delete(location).await
}
fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, Result<ObjectMeta>> {
self.primary.list(prefix)
}
@@ -115,21 +102,40 @@ impl ObjectStore for MirroringObjectStore {
self.primary.list_with_delimiter(prefix).await
}
async fn copy(&self, from: &Path, to: &Path) -> Result<()> {
if to.primary_only() {
self.primary.copy(from, to).await
} else {
self.secondary.copy(from, to).await?;
self.primary.copy(from, to).await?;
Ok(())
}
fn delete_stream(
&self,
locations: BoxStream<'static, Result<Path>>,
) -> BoxStream<'static, Result<Path>> {
let primary = self.primary.clone();
let secondary = self.secondary.clone();
locations
.map(move |location| {
let primary = primary.clone();
let secondary = secondary.clone();
async move {
let location = location?;
if !location.primary_only() {
match secondary.delete(&location).await {
Err(Error::NotFound { .. }) | Ok(_) => {}
Err(e) => return Err(e),
}
}
primary.delete(&location).await?;
Ok(location)
}
})
.buffered(10)
.boxed()
}
async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {
if !to.primary_only() {
self.secondary.copy(from, to).await?;
async fn copy_opts(&self, from: &Path, to: &Path, options: CopyOptions) -> Result<()> {
if to.primary_only() {
self.primary.copy_opts(from, to, options).await
} else {
self.secondary.copy_opts(from, to, options.clone()).await?;
self.primary.copy_opts(from, to, options).await?;
Ok(())
}
self.primary.copy_if_not_exists(from, to).await
}
}


@@ -10,9 +10,9 @@ use bytes::Bytes;
use futures::stream::BoxStream;
use lance::io::WrappingObjectStore;
use object_store::{
GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta, ObjectStore,
PutMultipartOptions, PutOptions, PutPayload, PutResult, Result as OSResult, UploadPart,
path::Path,
CopyOptions, GetOptions, GetResult, ListResult, MultipartUpload, ObjectMeta, ObjectStore,
PutMultipartOptions, PutOptions, PutPayload, PutResult, RenameOptions, Result as OSResult,
UploadPart, path::Path,
};
#[derive(Debug, Default)]
@@ -81,11 +81,6 @@ impl IoTrackingStore {
#[async_trait::async_trait]
#[deny(clippy::missing_trait_methods)]
impl ObjectStore for IoTrackingStore {
async fn put(&self, location: &Path, bytes: PutPayload) -> OSResult<PutResult> {
self.record_write(bytes.content_length() as u64);
self.target.put(location, bytes).await
}
async fn put_opts(
&self,
location: &Path,
@@ -96,14 +91,6 @@ impl ObjectStore for IoTrackingStore {
self.target.put_opts(location, bytes, opts).await
}
async fn put_multipart(&self, location: &Path) -> OSResult<Box<dyn MultipartUpload>> {
let target = self.target.put_multipart(location).await?;
Ok(Box::new(IoTrackingMultipartUpload {
target,
stats: self.stats.clone(),
}))
}
async fn put_multipart_opts(
&self,
location: &Path,
@@ -116,15 +103,6 @@ impl ObjectStore for IoTrackingStore {
}))
}
async fn get(&self, location: &Path) -> OSResult<GetResult> {
let result = self.target.get(location).await;
if let Ok(result) = &result {
let num_bytes = result.range.end - result.range.start;
self.record_read(num_bytes);
}
result
}
async fn get_opts(&self, location: &Path, options: GetOptions) -> OSResult<GetResult> {
let result = self.target.get_opts(location, options).await;
if let Ok(result) = &result {
@@ -134,14 +112,6 @@ impl ObjectStore for IoTrackingStore {
result
}
async fn get_range(&self, location: &Path, range: std::ops::Range<u64>) -> OSResult<Bytes> {
let result = self.target.get_range(location, range).await;
if let Ok(result) = &result {
self.record_read(result.len() as u64);
}
result
}
async fn get_ranges(
&self,
location: &Path,
@@ -154,20 +124,11 @@ impl ObjectStore for IoTrackingStore {
result
}
async fn head(&self, location: &Path) -> OSResult<ObjectMeta> {
self.record_read(0);
self.target.head(location).await
}
async fn delete(&self, location: &Path) -> OSResult<()> {
fn delete_stream(
&self,
locations: BoxStream<'static, OSResult<Path>>,
) -> BoxStream<'static, OSResult<Path>> {
self.record_write(0);
self.target.delete(location).await
}
fn delete_stream<'a>(
&'a self,
locations: BoxStream<'a, OSResult<Path>>,
) -> BoxStream<'a, OSResult<Path>> {
self.target.delete_stream(locations)
}
@@ -190,24 +151,14 @@ impl ObjectStore for IoTrackingStore {
self.target.list_with_delimiter(prefix).await
}
async fn copy(&self, from: &Path, to: &Path) -> OSResult<()> {
async fn copy_opts(&self, from: &Path, to: &Path, options: CopyOptions) -> OSResult<()> {
self.record_write(0);
self.target.copy(from, to).await
self.target.copy_opts(from, to, options).await
}
async fn rename(&self, from: &Path, to: &Path) -> OSResult<()> {
async fn rename_opts(&self, from: &Path, to: &Path, options: RenameOptions) -> OSResult<()> {
self.record_write(0);
self.target.rename(from, to).await
}
async fn rename_if_not_exists(&self, from: &Path, to: &Path) -> OSResult<()> {
self.record_write(0);
self.target.rename_if_not_exists(from, to).await
}
async fn copy_if_not_exists(&self, from: &Path, to: &Path) -> OSResult<()> {
self.record_write(0);
self.target.copy_if_not_exists(from, to).await
self.target.rename_opts(from, to, options).await
}
}


@@ -16,7 +16,7 @@ use crate::remote::retry::{ResolvedRetryConfig, RetryCounter};
const REQUEST_ID_HEADER: HeaderName = HeaderName::from_static("x-request-id");
/// Configuration for TLS/mTLS settings.
#[derive(Clone, Debug, Default)]
#[derive(Clone, Debug)]
pub struct TlsConfig {
/// Path to the client certificate file (PEM format)
pub cert_file: Option<String>,
@@ -24,10 +24,22 @@ pub struct TlsConfig {
pub key_file: Option<String>,
/// Path to the CA certificate file for server verification (PEM format)
pub ssl_ca_cert: Option<String>,
/// Whether to verify the hostname in the server's certificate
/// Whether to verify the hostname in the server's certificate.
/// Defaults to `true`.
pub assert_hostname: bool,
}
impl Default for TlsConfig {
fn default() -> Self {
Self {
cert_file: None,
key_file: None,
ssl_ca_cert: None,
assert_hostname: true,
}
}
}
/// Trait for providing custom headers for each request
#[async_trait::async_trait]
pub trait HeaderProvider: Send + Sync + std::fmt::Debug {
@@ -926,7 +938,7 @@ mod tests {
assert!(config.cert_file.is_none());
assert!(config.key_file.is_none());
assert!(config.ssl_ca_cert.is_none());
assert!(!config.assert_hostname);
assert!(config.assert_hostname);
}
#[test]
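Note on the `TlsConfig` change in this file: replacing `#[derive(Default)]` with a hand-written `impl Default` matters because a derived `Default` initialises every `bool` to `false`, which would leave hostname verification off unless callers opt in. A standalone sketch of the difference, using two hypothetical structs (`DerivedTls` and `ExplicitTls`) rather than the real `TlsConfig`:

```rust
/// Derived Default: `assert_hostname` silently starts as `false`.
#[derive(Clone, Debug, Default)]
struct DerivedTls {
    ssl_ca_cert: Option<String>,
    assert_hostname: bool,
}

/// Hand-written Default: verification is on unless explicitly disabled.
#[derive(Clone, Debug)]
struct ExplicitTls {
    ssl_ca_cert: Option<String>,
    assert_hostname: bool,
}

impl Default for ExplicitTls {
    fn default() -> Self {
        Self {
            ssl_ca_cert: None,
            assert_hostname: true,
        }
    }
}
```

Callers that need to turn the check off for a test endpoint can still do so with struct update syntax, for example `ExplicitTls { assert_hostname: false, ..Default::default() }`.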


@@ -1540,6 +1540,7 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
Index::IvfPq(p) => ("IVF_PQ", Some(to_json(p)?)),
Index::IvfSq(p) => ("IVF_SQ", Some(to_json(p)?)),
Index::IvfHnswSq(p) => ("IVF_HNSW_SQ", Some(to_json(p)?)),
Index::IvfHnswFlat(p) => ("IVF_HNSW_FLAT", Some(to_json(p)?)),
Index::IvfRq(p) => ("IVF_RQ", Some(to_json(p)?)),
Index::BTree(p) => ("BTREE", Some(to_json(p)?)),
Index::Bitmap(p) => ("BITMAP", Some(to_json(p)?)),
@@ -2068,7 +2069,8 @@ mod tests {
use serde_json::json;
use crate::index::vector::{
IvfFlatIndexBuilder, IvfHnswSqIndexBuilder, IvfRqIndexBuilder, IvfSqIndexBuilder,
IvfFlatIndexBuilder, IvfHnswFlatIndexBuilder, IvfHnswSqIndexBuilder, IvfRqIndexBuilder,
IvfSqIndexBuilder,
};
use crate::remote::JSON_CONTENT_TYPE;
use crate::remote::db::DEFAULT_SERVER_VERSION;
@@ -3321,6 +3323,35 @@ mod tests {
.ef_construction(500),
),
),
(
"IVF_HNSW_FLAT",
json!({
"metric_type": "l2",
"sample_rate": 256,
"max_iterations": 50,
"m": 20,
"ef_construction": 300,
}),
Index::IvfHnswFlat(Default::default()),
),
(
"IVF_HNSW_FLAT",
json!({
"metric_type": "cosine",
"num_partitions": 64,
"sample_rate": 256,
"max_iterations": 50,
"m": 40,
"ef_construction": 500,
}),
Index::IvfHnswFlat(
IvfHnswFlatIndexBuilder::default()
.distance_type(DistanceType::Cosine)
.num_partitions(64)
.num_edges(40)
.ef_construction(500),
),
),
(
"IVF_SQ",
json!({


@@ -43,7 +43,7 @@ pub struct RemoteInsertExec<S: HttpSend = Sender> {
client: RestfulLanceDbClient<S>,
input: Arc<dyn ExecutionPlan>,
overwrite: bool,
properties: PlanProperties,
properties: Arc<PlanProperties>,
add_result: Arc<Mutex<Option<AddResult>>>,
metrics: ExecutionPlanMetricsSet,
upload_id: Option<String>,
@@ -118,7 +118,7 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
client,
input,
overwrite,
properties,
properties: Arc::new(properties),
add_result: Arc::new(Mutex::new(None)),
metrics: ExecutionPlanMetricsSet::new(),
upload_id,
@@ -232,7 +232,7 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
self
}
fn properties(&self) -> &PlanProperties {
fn properties(&self) -> &Arc<PlanProperties> {
&self.properties
}


@@ -47,7 +47,7 @@ use std::format;
use std::path::Path;
use std::sync::Arc;
use crate::connection::PushdownOperation;
use crate::connection::NamespaceClientPushdownOperation;
use crate::data::scannable::{PeekedScannable, Scannable, estimate_write_partitions};
use crate::database::Database;
@@ -1272,7 +1272,7 @@ pub struct NativeTable {
pub(crate) namespace_client: Option<Arc<dyn LanceNamespace>>,
// Operations to push down to the namespace server.
// pub(crate) so query.rs can access the field for server-side query execution.
pub(crate) pushdown_operations: HashSet<PushdownOperation>,
pub(crate) pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
}
impl std::fmt::Debug for NativeTable {
@@ -1359,7 +1359,7 @@ impl NativeTable {
params: Option<ReadParams>,
read_consistency_interval: Option<std::time::Duration>,
namespace_client: Option<Arc<dyn LanceNamespace>>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
managed_versioning: Option<bool>,
) -> Result<Self> {
let params = params.unwrap_or_default();
@@ -1470,7 +1470,7 @@ impl NativeTable {
write_store_wrapper: Option<Arc<dyn WrappingObjectStore>>,
params: Option<ReadParams>,
read_consistency_interval: Option<std::time::Duration>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
session: Option<Arc<lance::session::Session>>,
) -> Result<Self> {
let mut params = params.unwrap_or_default();
@@ -1518,7 +1518,7 @@ impl NativeTable {
let id = Self::build_id(&namespace, name);
let stored_namespace_client =
if pushdown_operations.contains(&PushdownOperation::QueryTable) {
if pushdown_operations.contains(&NamespaceClientPushdownOperation::QueryTable) {
Some(namespace_client)
} else {
None
@@ -1588,7 +1588,7 @@ impl NativeTable {
params: Option<WriteParams>,
read_consistency_interval: Option<std::time::Duration>,
namespace_client: Option<Arc<dyn LanceNamespace>>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
) -> Result<Self> {
// Default params uses format v1.
let params = params.unwrap_or(WriteParams {
@@ -1635,7 +1635,7 @@ impl NativeTable {
params: Option<WriteParams>,
read_consistency_interval: Option<std::time::Duration>,
namespace_client: Option<Arc<dyn LanceNamespace>>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
) -> Result<Self> {
let data: Box<dyn Scannable> = Box::new(RecordBatch::new_empty(schema));
Self::create(
@@ -1685,7 +1685,7 @@ impl NativeTable {
write_store_wrapper: Option<Arc<dyn WrappingObjectStore>>,
params: Option<WriteParams>,
read_consistency_interval: Option<std::time::Duration>,
pushdown_operations: HashSet<PushdownOperation>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
session: Option<Arc<lance::session::Session>>,
) -> Result<Self> {
// Build table_id from namespace + name for the storage options provider
@@ -1738,7 +1738,7 @@ impl NativeTable {
let id = Self::build_id(&namespace, name);
let stored_namespace_client =
if pushdown_operations.contains(&PushdownOperation::QueryTable) {
if pushdown_operations.contains(&NamespaceClientPushdownOperation::QueryTable) {
Some(namespace_client)
} else {
None
@@ -2033,6 +2033,24 @@ impl NativeTable {
);
Ok(Box::new(lance_idx_params))
}
Index::IvfHnswFlat(index) => {
Self::validate_index_type(field, "IVF HNSW FLAT", supported_vector_data_type)?;
let ivf_params = Self::build_ivf_params(
index.num_partitions,
index.target_partition_size,
index.sample_rate,
index.max_iterations,
);
let hnsw_params = HnswBuildParams::default()
.num_edges(index.m as usize)
.ef_construction(index.ef_construction as usize);
let lance_idx_params = VectorIndexParams::ivf_hnsw(
index.distance_type.into(),
ivf_params,
hnsw_params,
);
Ok(Box::new(lance_idx_params))
}
}
}
@@ -2058,7 +2076,8 @@ impl NativeTable {
| Index::IvfPq(_)
| Index::IvfRq(_)
| Index::IvfHnswPq(_)
| Index::IvfHnswSq(_) => IndexType::Vector,
| Index::IvfHnswSq(_)
| Index::IvfHnswFlat(_) => IndexType::Vector,
}
}
@@ -3176,6 +3195,56 @@ mod tests {
assert_eq!(stats.num_unindexed_rows, 0);
}
#[tokio::test]
async fn test_create_index_ivf_hnsw_flat() {
use arrow_array::RecordBatch;
use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use rand;
use std::iter::repeat_with;
use crate::index::vector::IvfHnswFlatIndexBuilder;
use arrow_array::Float32Array;
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let conn = connect(uri).execute().await.unwrap();
let dimension = 16;
let schema = Arc::new(ArrowSchema::new(vec![Field::new(
"embeddings",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float32, true)),
dimension,
),
false,
)]));
let float_arr = Float32Array::from(
repeat_with(rand::random::<f32>)
.take(512 * dimension as usize)
.collect::<Vec<f32>>(),
);
let vectors = Arc::new(create_fixed_size_list(float_arr, dimension).unwrap());
let batch = RecordBatch::try_new(schema.clone(), vec![vectors.clone()]).unwrap();
let table = conn.create_table("test", batch).execute().await.unwrap();
let index = IvfHnswFlatIndexBuilder::default();
table
.create_index(&["embeddings"], Index::IvfHnswFlat(index))
.execute()
.await
.unwrap();
let index_configs = table.list_indices().await.unwrap();
assert_eq!(index_configs.len(), 1);
let index = index_configs.into_iter().next().unwrap();
assert_eq!(index.index_type, crate::index::IndexType::IvfHnswFlat);
assert_eq!(index.columns, vec!["embeddings".to_string()]);
assert_eq!(table.count_rows(None).await.unwrap(), 512);
}
fn create_fixed_size_list<T: Array>(values: T, list_size: i32) -> Result<FixedSizeListArray> {
let list_type = DataType::FixedSizeList(
Arc::new(Field::new("item", values.data_type().clone(), true)),


@@ -39,21 +39,26 @@ use lance_index::scalar::FullTextSearchQuery;
struct MetadataEraserExec {
input: Arc<dyn ExecutionPlan>,
schema: Arc<ArrowSchema>,
properties: PlanProperties,
properties: Arc<PlanProperties>,
}
impl MetadataEraserExec {
fn compute_properties_from_input(
input: &Arc<dyn ExecutionPlan>,
schema: &Arc<ArrowSchema>,
) -> PlanProperties {
) -> Arc<PlanProperties> {
let input_properties = input.properties();
let eq_properties = input_properties
.eq_properties
.clone()
.with_new_schema(schema.clone())
.unwrap();
input_properties.clone().with_eq_properties(eq_properties)
Arc::new(
input_properties
.as_ref()
.clone()
.with_eq_properties(eq_properties),
)
}
fn new(input: Arc<dyn ExecutionPlan>) -> Self {
@@ -87,7 +92,7 @@ impl ExecutionPlan for MetadataEraserExec {
self
}
fn properties(&self) -> &PlanProperties {
fn properties(&self) -> &Arc<PlanProperties> {
&self.properties
}

Some files were not shown because too many files have changed in this diff.