Compare commits

14 Commits

Author SHA1 Message Date
Ayush Chaurasia
ff08a996fc feat(python): add LanceTorchDataset / LanceIterableTorchDataset wrappers
Provides first-class PyTorch `Dataset`/`IterableDataset` wrappers around a
LanceDB table or permutation. The wrapper:

* Captures only the URI / table name / connect kwargs needed to re-open
  the table — no Rust handles in pickle output. Works out of the box with
  `DataLoader(num_workers > 0)`, which would otherwise crash a
  hand-rolled subclass.
* Implements both `__getitem__` and PyTorch's `__getitems__` dunder so
  the underlying batched `Permutation.fetch` is used when DataLoader
  fetches a batch of indices.
* Forwards column selection / format / transform / batch_size to the
  underlying Permutation, so users do not have to hand-roll the
  `_ensure_open` boilerplate from the issue.

Builds on the public `Permutation.fetch` API (#3243).

Closes lancedb/lancedb#3242
2026-04-29 22:21:00 +05:30
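The pickle-safety pattern this commit message describes can be sketched roughly as follows. This is a minimal illustration, not the actual `LanceTorchDataset` code: `FakeTable`, `open_table`, and `LazyTableDataset` are hypothetical stand-ins, and the snippet deliberately imports neither lancedb nor torch so that it runs on its own.

```python
import pickle


class FakeTable:
    """Hypothetical stand-in for a native table handle that cannot be pickled."""

    def __init__(self, rows):
        self._rows = rows

    def __reduce__(self):
        raise TypeError("native handle cannot be pickled")

    def fetch(self, indices):
        # Batched access: one call returns all requested rows.
        return [self._rows[i] for i in indices]

    def __len__(self):
        return len(self._rows)


def open_table(uri, table_name):
    """Hypothetical opener; a real wrapper would connect to the database here."""
    return FakeTable([f"{uri}/{table_name}/row-{i}" for i in range(8)])


class LazyTableDataset:
    """Stores only what is needed to re-open the table, never the handle itself."""

    def __init__(self, uri, table_name):
        self.uri = uri
        self.table_name = table_name
        self._table = None  # opened lazily, once per process

    def _ensure_open(self):
        if self._table is None:
            self._table = open_table(self.uri, self.table_name)
        return self._table

    def __getstate__(self):
        # Drop the live handle so worker processes receive plain strings only.
        state = self.__dict__.copy()
        state["_table"] = None
        return state

    def __len__(self):
        return len(self._ensure_open())

    def __getitem__(self, index):
        return self._ensure_open().fetch([index])[0]

    def __getitems__(self, indices):
        # A DataLoader passes a whole batch of indices here when this method exists.
        return self._ensure_open().fetch(list(indices))


dataset = LazyTableDataset("memory://demo", "items")
first = dataset[0]  # forces the handle open in this process
clone = pickle.loads(pickle.dumps(dataset))  # works despite the open handle
batch = clone.__getitems__([1, 3])  # the clone re-opens the table on demand
```

Because the pickled state holds only the URI and table name, each `DataLoader` worker would re-open its own handle on first access instead of inheriting one from the parent process.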
Ayush Chaurasia
049a689a1c feat(python): add public Permutation.fetch(indices) API
Adds a public method that mirrors __getitems__ for batch index access,
so users do not have to call a dunder directly when implementing custom
torch datasets.

Closes lancedb/lancedb#3243
2026-04-29 22:13:42 +05:30
Jack Ye
25dfe2cfd4 feat: add manifest-enabled directory namespace mode (#3332)
Adds manifest_enabled for local/native connections so directory
namespace manifests can be the source of truth, including migration from
directory listing and Azure credential vending feature wiring. Also
exposes the option through Rust, Python, and Node bindings with focused
validation.
2026-04-29 09:22:06 -07:00
Lance Release
4dcd7f4314 Bump version: 0.28.0-beta.9 → 0.28.0-beta.10
2026-04-28 13:29:26 +00:00
Lance Release
2e36cd9dad Bump version: 0.31.0-beta.9 → 0.31.0-beta.10
2026-04-28 13:29:00 +00:00
Weston Pace
f31e27768a fix: address RUSTSEC-2026-0104 cargo-deny advisory (#3326)
## Summary

- Update `rustls-webpki` 0.103.10 → 0.103.13 to fix RUSTSEC-2026-0104
(reachable panic in CRL parsing)
- Add advisory ignore for the legacy `rustls-webpki` 0.101.7 copy pinned
to the aws-smithy/rustls 0.21 chain (same chain already exempted for
RUSTSEC-2026-0098/0099)

Fixes the `deny` CI job failure seen in #3325.

## Test plan

- [x] `cargo deny check advisories` passes locally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 17:56:10 -07:00
LanceDB Robot
b84150a53e chore: update lance dependency to v6.0.0-beta.4 (#3325)
## Summary

- Updates Lance Rust dependencies to `6.0.0-beta.4` using
`ci/set_lance_version.py`.
- Updates the Java `lance-core.version` property to `6.0.0-beta.4`.
- Triggering Lance tag:
https://github.com/lance-format/lance/releases/tag/v6.0.0-beta.4

## Verification

- `cargo clippy --workspace --tests --all-features -- -D warnings`
- `cargo fmt --all`
2026-04-27 15:13:07 -07:00
Will Jones
d135c18db6 ci: add cargo-deny configuration and CI check (#3307)
Adds a `deny.toml` at the workspace root and a `deny` CI job that runs
`cargo deny check` on every PR. Catches yanked crates, license drift,
banned or wildcard dependencies, unapproved sources, and new RUSTSEC
advisories.

As part of wiring this up:

- Updated `aws-lc-rs` 1.13.0 → 1.16.3 / `aws-lc-sys` 0.28.0 → 0.40.0 to
  clear four 2026 AWS-LC advisories (timing side-channel, PKCS7 bypass,
  CRL scope). Removed the `=0.28.0` workaround pin; the original build
  failure no longer reproduces.
- Updated `bytes`, `zlib-rs`, `rand`, `rustls-webpki`, `lz4_flex` to
  clear their current advisories.
- Marked `lancedb-nodejs` and `lancedb-python` as `publish = false` and
  pinned `lzma-sys` from `*` to `0.1` so `bans.wildcards = "deny"` can
  be enforced.

10 remaining advisories have no safe upgrade available (transitive via
opendal, lance, datafusion, async-openai, aws-sdk on the legacy rustls
0.21 chain). Each is ignored in `deny.toml` with a per-entry rationale
and a link to the RUSTSEC advisory. New advisories still fail CI.

Fixes #3297

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 20:53:15 -07:00
Will Jones
ef399de092 ci: switch PyPI publish to OIDC trusted publishing (#3302)
## Summary

- Replaces `LANCEDB_PYPI_API_TOKEN` (long-lived token) with OIDC trusted
publishing via `pypa/gh-action-pypi-publish`
- Adds `id-token: write` permission to linux/mac/windows jobs
- Removes `twine`-based upload and the `pypi_token` input from
`upload_wheel` composite action
- Enables PEP 740 Sigstore attestations on published wheels as a bonus

After merging, rotate/revoke the `LANCEDB_PYPI_API_TOKEN` secret.

Closes #3294

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 20:53:06 -07:00
Will Jones
0d767abd0e ci: add Dependabot config for shipped Rust binaries (#3300)
Adds `.github/dependabot.yml` enabling weekly cargo update PRs for the
root workspace, which produces the Rust binaries we ship: the Node.js
and Python native extensions. The `rust/lancedb` library crate shares
the same lockfile — its consumers pick versions themselves, but bumping
transitive deps here keeps the shipped binaries current.

Also removes the misleading `exclude = ["python"]` line from the root
`Cargo.toml`: `python` is listed in `members`, and `cargo metadata`
confirms it's a workspace member, so the exclude was dead code that
implied the opposite.

Minor/patch updates are grouped to reduce PR noise.

Part of #3292. Only covers the cargo ecosystem; pip, npm, and
github-actions can follow.
2026-04-24 20:52:54 -07:00
Jack Ye
a92ae0ded5 fix: enable hostname verification by default (#3304)
## Summary

- make `TlsConfig::default()` enable hostname verification by default
- align the Rust default with the documented Python and Node behavior
- update the Rust unit test to lock in the safe default
2026-04-21 08:39:03 -07:00
Xuanwo
c54888a83a refactor(python): remove legacy tantivy FTS support (#3282)
This follows the Rust-side Tantivy removal by deleting the remaining
Python Tantivy runtime, tests, and packaging references.

It also turns the legacy Python-only Tantivy parameters into explicit
errors and stops reading legacy `_indices/fts` directories so Python FTS
is fully native-only.
2026-04-20 09:28:45 +08:00
Will Jones
ba6c44abc9 ci: add top-level permissions to GHA workflows (#3255)
Adds `permissions: contents: read` to the 10 workflows that had no
top-level permissions block. Workflows that already declared
permissions, or individual jobs that need elevated permissions (`issues:
write`, `pull-requests: write`, `contents: write`), are left unchanged.

Affected workflows: `dev.yml`, `java-publish.yml`, `java.yml`,
`license-header-check.yml`, `nodejs.yml`, `pypi-publish.yml`,
`python.yml`, `rust.yml`, `update_package_lock_run.yml`,
`update_package_lock_run_nodejs.yml`
2026-04-20 09:22:27 +08:00
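One way to audit which workflows still lack a top-level block is a small script like the one below. It is a rough sketch and not part of this change: the helper name `has_top_level_permissions` is made up for illustration, and it uses a plain line scan rather than a YAML parser, so it assumes the key is written unindented as `permissions:`.

```python
import re

# A top-level key starts in column 0; job-level blocks are indented.
_TOP_LEVEL_PERMISSIONS = re.compile(r"^permissions\s*:", re.MULTILINE)


def has_top_level_permissions(workflow_text: str) -> bool:
    """Return True if the workflow text declares `permissions:` at the top level."""
    return _TOP_LEVEL_PERMISSIONS.search(workflow_text) is not None


# Only a job-level block: the workflow itself has no top-level default.
missing = """\
name: Rust
on:
  push:
jobs:
  build:
    permissions:
      contents: write
    runs-on: ubuntu-24.04
"""

# Top-level read-only default, as added by this commit.
fixed = """\
name: Rust
on:
  push:
permissions:
  contents: read
jobs:
  build:
    runs-on: ubuntu-24.04
"""

job_level_only = has_top_level_permissions(missing)
top_level = has_top_level_permissions(fixed)
```

To scan a checkout, the same function could be applied to the text of each file matched by `pathlib.Path(".github/workflows").glob("*.yml")`.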
Lance Release
75b0a8e0a3 Bump version: 0.28.0-beta.8 → 0.28.0-beta.9
2026-04-19 20:39:29 +00:00
62 changed files with 1542 additions and 808 deletions

View File

@@ -1,5 +1,5 @@
 [tool.bumpversion]
-current_version = "0.28.0-beta.8"
+current_version = "0.28.0-beta.10"
 parse = """(?x)
 (?P<major>0|[1-9]\\d*)\\.
 (?P<minor>0|[1-9]\\d*)\\.

18
.github/dependabot.yml vendored Normal file
View File

@@ -0,0 +1,18 @@
+version: 2
+# Scope: the root Cargo workspace, which produces the Rust binaries we
+# ship to users (the Node.js and Python native extensions). The
+# `rust/lancedb` library crate shares the same lockfile; its consumers
+# pick their own dependency versions, but bumping transitive deps here
+# keeps the binaries we ship current.
+updates:
+  - package-ecosystem: cargo
+    directory: /
+    schedule:
+      interval: weekly
+    open-pull-requests-limit: 10
+    groups:
+      rust-minor-patch:
+        update-types:
+          - minor
+          - patch

View File

@@ -8,6 +8,9 @@ concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true
+permissions:
+  contents: read
 jobs:
   labeler:
     permissions:

View File

@@ -19,6 +19,9 @@ on:
     paths:
       - .github/workflows/java-publish.yml
+permissions:
+  contents: read
 jobs:
   publish:
     name: Build and Publish

View File

@@ -24,6 +24,9 @@ on:
       - java/**
       - .github/workflows/java.yml
+permissions:
+  contents: read
 jobs:
   build-java:
     runs-on: ubuntu-24.04

View File

@@ -10,6 +10,10 @@ on:
       - nodejs/**
       - java/**
       - .github/workflows/license-header-check.yml
+permissions:
+  contents: read
 jobs:
   check-licenses:
     runs-on: ubuntu-latest

View File

@@ -15,6 +15,9 @@ on:
       - .github/workflows/nodejs.yml
       - docker-compose.yml
+permissions:
+  contents: read
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true

View File

@@ -14,10 +14,16 @@ on:
 env:
   PIP_EXTRA_INDEX_URL: "https://pypi.fury.io/lance-format/ https://pypi.fury.io/lancedb/"
+permissions:
+  contents: read
 jobs:
   linux:
     name: Python ${{ matrix.config.platform }} manylinux${{ matrix.config.manylinux }}
     timeout-minutes: 60
+    permissions:
+      id-token: write
+      contents: read
     strategy:
       matrix:
         config:
@@ -57,10 +63,12 @@ jobs:
       - uses: ./.github/workflows/upload_wheel
         if: startsWith(github.ref, 'refs/tags/python-v')
         with:
-          pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
           fury_token: ${{ secrets.FURY_TOKEN }}
   mac:
     timeout-minutes: 90
+    permissions:
+      id-token: write
+      contents: read
     runs-on: ${{ matrix.config.runner }}
     strategy:
       matrix:
@@ -85,10 +93,12 @@ jobs:
       - uses: ./.github/workflows/upload_wheel
         if: startsWith(github.ref, 'refs/tags/python-v')
         with:
-          pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
           fury_token: ${{ secrets.FURY_TOKEN }}
   windows:
     timeout-minutes: 60
+    permissions:
+      id-token: write
+      contents: read
     runs-on: windows-latest
     steps:
       - uses: actions/checkout@v4
@@ -107,7 +117,6 @@ jobs:
       - uses: ./.github/workflows/upload_wheel
         if: startsWith(github.ref, 'refs/tags/python-v')
         with:
-          pypi_token: ${{ secrets.LANCEDB_PYPI_API_TOKEN }}
           fury_token: ${{ secrets.FURY_TOKEN }}
   gh-release:
     if: startsWith(github.ref, 'refs/tags/python-v')

View File

@@ -17,6 +17,9 @@ on:
       - .github/workflows/build_windows_wheel/**
       - .github/workflows/run_tests/**
+permissions:
+  contents: read
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true
@@ -108,7 +111,6 @@ jobs:
       - name: Install
         run: |
           pip install --extra-index-url https://pypi.fury.io/lance-format/ --extra-index-url https://pypi.fury.io/lancedb/ -e .[tests,dev,embeddings]
-          pip install tantivy
           pip install mlx
       - name: Doctest
         run: pytest --doctest-modules python/lancedb
@@ -227,6 +229,5 @@ jobs:
           pip install "pydantic<2"
           pip install pyarrow==16
           pip install --extra-index-url https://pypi.fury.io/lance-format/ --extra-index-url https://pypi.fury.io/lancedb/ -e .[tests]
-          pip install tantivy
       - name: Run tests
         run: pytest -m "not slow and not s3_test" -x -v --durations=30 python/tests

View File

@@ -9,9 +9,15 @@ on:
- Cargo.toml
- Cargo.lock
- rust-toolchain.toml
- deny.toml
- rust/**
- nodejs/Cargo.toml
- python/Cargo.toml
- .github/workflows/rust.yml
permissions:
contents: read
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
@@ -53,6 +59,17 @@ jobs:
       - name: Run clippy (without remote feature)
         run: cargo clippy --profile ci --workspace --tests -- -D warnings
+  deny:
+    # Supply-chain checks: advisories, licenses, banned crates, and source
+    # restrictions. Configuration lives in `deny.toml` at the workspace root.
+    timeout-minutes: 10
+    runs-on: ubuntu-24.04
+    steps:
+      - uses: actions/checkout@v4
+      - uses: EmbarkStudios/cargo-deny-action@v2
+        with:
+          command: check advisories bans licenses sources
   build-no-lock:
     runs-on: ubuntu-24.04
     timeout-minutes: 30

View File

@@ -3,6 +3,9 @@ name: Update package-lock.json
 on:
   workflow_dispatch:
+permissions:
+  contents: read
 jobs:
   publish:
     runs-on: ubuntu-latest

View File

@@ -3,6 +3,9 @@ name: Update NodeJs package-lock.json
 on:
   workflow_dispatch:
+permissions:
+  contents: read
 jobs:
   publish:
     runs-on: ubuntu-latest

View File

@@ -2,9 +2,6 @@ name: upload-wheel
 description: "Upload wheels to Pypi"
 inputs:
-  pypi_token:
-    required: true
-    description: "release token for the repo"
   fury_token:
     required: true
     description: "release token for the fury repo"
@@ -12,12 +9,6 @@ inputs:
 runs:
   using: "composite"
   steps:
-    - name: Install dependencies
-      shell: bash
-      run: |
-        python -m pip install --upgrade pip
-        pip install twine
-        python3 -m pip install --upgrade pkginfo
     - name: Choose repo
       shell: bash
       id: choose_repo
@@ -27,19 +18,17 @@ runs:
         else
           echo "repo=pypi" >> $GITHUB_OUTPUT
         fi
-    - name: Publish to PyPI
+    - name: Publish to Fury
+      if: steps.choose_repo.outputs.repo == 'fury'
       shell: bash
       env:
         FURY_TOKEN: ${{ inputs.fury_token }}
-        PYPI_TOKEN: ${{ inputs.pypi_token }}
       run: |
-        if [[ ${{ steps.choose_repo.outputs.repo }} == fury ]]; then
-          WHEEL=$(ls target/wheels/lancedb-*.whl 2> /dev/null | head -n 1)
-          echo "Uploading $WHEEL to Fury"
-          curl -f -F package=@$WHEEL https://$FURY_TOKEN@push.fury.io/lancedb/
-        else
-          twine upload --repository ${{ steps.choose_repo.outputs.repo }} \
-            --username __token__ \
-            --password $PYPI_TOKEN \
-            target/wheels/lancedb-*.whl
-        fi
+        WHEEL=$(ls target/wheels/lancedb-*.whl 2> /dev/null | head -n 1)
+        echo "Uploading $WHEEL to Fury"
+        curl -f -F package=@$WHEEL https://$FURY_TOKEN@push.fury.io/lancedb/
+    - name: Publish to PyPI
+      if: steps.choose_repo.outputs.repo == 'pypi'
+      uses: pypa/gh-action-pypi-publish@release/v1
+      with:
+        packages-dir: target/wheels/

186
Cargo.lock generated
View File

@@ -572,9 +572,9 @@ dependencies = [
[[package]]
name = "aws-lc-rs"
version = "1.16.1"
version = "1.16.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "94bffc006df10ac2a68c83692d734a465f8ee6c5b384d8545a636f81d858f4bf"
checksum = "0ec6fb3fe69024a75fa7e1bfb48aa6cf59706a101658ea01bfd33b2b248a038f"
dependencies = [
"aws-lc-sys",
"untrusted 0.7.1",
@@ -583,9 +583,9 @@ dependencies = [
[[package]]
name = "aws-lc-sys"
version = "0.38.0"
version = "0.40.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "4321e568ed89bb5a7d291a7f37997c2c0df89809d7b6d12062c81ddb54aa782e"
checksum = "f50037ee5e1e41e7b8f9d161680a725bd1626cb6f8c7e901f91f942850852fe7"
dependencies = [
"cc",
"cmake",
@@ -1373,7 +1373,7 @@ dependencies = [
"memmap2 0.9.10",
"num-traits",
"num_cpus",
"rand 0.9.2",
"rand 0.9.4",
"rand_distr 0.5.1",
"rayon",
"safetensors",
@@ -1409,7 +1409,7 @@ dependencies = [
"candle-nn",
"fancy-regex",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
"rayon",
"serde",
"serde_json",
@@ -1966,7 +1966,7 @@ dependencies = [
"log",
"object_store",
"parking_lot",
"rand 0.9.2",
"rand 0.9.4",
"regex",
"sqlparser 0.59.0",
"tempfile",
@@ -2080,7 +2080,7 @@ dependencies = [
"itertools 0.14.0",
"log",
"object_store",
"rand 0.9.2",
"rand 0.9.4",
"tokio",
"url",
]
@@ -2176,7 +2176,7 @@ dependencies = [
"log",
"object_store",
"parking_lot",
"rand 0.9.2",
"rand 0.9.4",
"tempfile",
"url",
]
@@ -2240,7 +2240,7 @@ dependencies = [
"log",
"md-5",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
"regex",
"sha2",
"unicode-segmentation",
@@ -2642,7 +2642,7 @@ dependencies = [
"libc",
"option-ext",
"redox_users",
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]
@@ -2830,7 +2830,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb"
dependencies = [
"libc",
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]
@@ -2965,7 +2965,7 @@ checksum = "719a903cc23e4a89e87962c2a80fdb45cdaad0983a89bd150bb57b4c8571a7d5"
dependencies = [
"half",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
"rand_distr 0.5.1",
]
@@ -3010,11 +3010,11 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
[[package]]
name = "fsst"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-array",
"rand 0.9.2",
"rand 0.9.4",
]
[[package]]
@@ -3387,7 +3387,7 @@ dependencies = [
"cfg-if",
"crunchy",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
"rand_distr 0.5.1",
"zerocopy",
]
@@ -3470,7 +3470,7 @@ dependencies = [
"libc",
"log",
"num_cpus",
"rand 0.9.2",
"rand 0.9.4",
"reqwest",
"serde",
"serde_json",
@@ -3980,7 +3980,7 @@ dependencies = [
"portable-atomic",
"portable-atomic-util",
"serde_core",
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]
@@ -4043,7 +4043,7 @@ dependencies = [
"nom 8.0.0",
"num-traits",
"ordered-float",
"rand 0.9.2",
"rand 0.9.4",
"ryu",
"serde",
"serde_json",
@@ -4066,8 +4066,8 @@ dependencies = [
[[package]]
name = "lance"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-arith",
@@ -4119,7 +4119,7 @@ dependencies = [
"prost",
"prost-build",
"prost-types",
"rand 0.9.2",
"rand 0.9.4",
"roaring",
"semver",
"serde",
@@ -4135,8 +4135,8 @@ dependencies = [
[[package]]
name = "lance-arrow"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4152,13 +4152,13 @@ dependencies = [
"half",
"jsonb",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
]
[[package]]
name = "lance-bitpacking"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrayref",
"paste",
@@ -4167,8 +4167,8 @@ dependencies = [
[[package]]
name = "lance-core"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4191,7 +4191,7 @@ dependencies = [
"object_store",
"pin-project",
"prost",
"rand 0.9.2",
"rand 0.9.4",
"roaring",
"serde_json",
"snafu 0.9.0",
@@ -4205,8 +4205,8 @@ dependencies = [
[[package]]
name = "lance-datafusion"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-array",
@@ -4237,8 +4237,8 @@ dependencies = [
[[package]]
name = "lance-datagen"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-array",
@@ -4248,7 +4248,7 @@ dependencies = [
"futures",
"half",
"hex",
"rand 0.9.2",
"rand 0.9.4",
"rand_distr 0.5.1",
"rand_xoshiro",
"random_word 0.5.2",
@@ -4256,8 +4256,8 @@ dependencies = [
[[package]]
name = "lance-encoding"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4283,7 +4283,7 @@ dependencies = [
"prost",
"prost-build",
"prost-types",
"rand 0.9.2",
"rand 0.9.4",
"snafu 0.9.0",
"strum",
"tokio",
@@ -4294,8 +4294,8 @@ dependencies = [
[[package]]
name = "lance-file"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4327,8 +4327,8 @@ dependencies = [
[[package]]
name = "lance-index"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-arith",
@@ -4374,7 +4374,7 @@ dependencies = [
"prost",
"prost-build",
"prost-types",
"rand 0.9.2",
"rand 0.9.4",
"rand_distr 0.5.1",
"rangemap",
"rayon",
@@ -4392,8 +4392,8 @@ dependencies = [
[[package]]
name = "lance-io"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-arith",
@@ -4426,7 +4426,7 @@ dependencies = [
"path_abs",
"pin-project",
"prost",
"rand 0.9.2",
"rand 0.9.4",
"serde",
"snafu 0.9.0",
"tempfile",
@@ -4437,8 +4437,8 @@ dependencies = [
[[package]]
name = "lance-linalg"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4449,13 +4449,13 @@ dependencies = [
"lance-arrow",
"lance-core",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
]
[[package]]
name = "lance-namespace"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"async-trait",
@@ -4468,17 +4468,19 @@ dependencies = [
[[package]]
name = "lance-namespace-impls"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-ipc",
"arrow-schema",
"async-trait",
"axum",
"base64 0.22.1",
"bytes",
"chrono",
"futures",
"hmac",
"lance",
"lance-core",
"lance-index",
@@ -4488,10 +4490,12 @@ dependencies = [
"lance-table",
"log",
"object_store",
"rand 0.9.2",
"quick-xml 0.38.4",
"rand 0.9.4",
"reqwest",
"serde",
"serde_json",
"sha2",
"snafu 0.9.0",
"tokio",
"tower",
@@ -4501,9 +4505,9 @@ dependencies = [
[[package]]
name = "lance-namespace-reqwest-client"
version = "0.6.1"
version = "0.7.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ee2e48de899e2931afb67fcddd0a08e439bf5d8b6ea2a2ed9cb8f4df669bd5cc"
checksum = "0f061dd6fe63e3ba4052702a9d40973ee4ac57f612f04222897a149576213832"
dependencies = [
"reqwest",
"serde",
@@ -4514,8 +4518,8 @@ dependencies = [
[[package]]
name = "lance-table"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow",
"arrow-array",
@@ -4539,7 +4543,7 @@ dependencies = [
"prost",
"prost-build",
"prost-types",
"rand 0.9.2",
"rand 0.9.4",
"rangemap",
"roaring",
"semver",
@@ -4554,20 +4558,20 @@ dependencies = [
[[package]]
name = "lance-testing"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"arrow-array",
"arrow-schema",
"lance-arrow",
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
]
[[package]]
name = "lance-tokenizer"
version = "6.0.0-beta.1"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.1#c7a7d3a0e944646e793d297d4a2e2cf7e4fb28a3"
version = "6.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v6.0.0-beta.4#226dafe0a00f995d9e3230c2d61cc06be51994d4"
dependencies = [
"rust-stemmers",
"serde",
@@ -4576,7 +4580,7 @@ dependencies = [
[[package]]
name = "lancedb"
version = "0.28.0-beta.8"
version = "0.28.0-beta.10"
dependencies = [
"ahash",
"anyhow",
@@ -4637,7 +4641,7 @@ dependencies = [
"pin-project",
"polars",
"polars-arrow",
"rand 0.9.2",
"rand 0.9.4",
"random_word 0.4.3",
"regex",
"reqwest",
@@ -4658,7 +4662,7 @@ dependencies = [
[[package]]
name = "lancedb-nodejs"
version = "0.28.0-beta.8"
version = "0.28.0-beta.10"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4680,7 +4684,7 @@ dependencies = [
[[package]]
name = "lancedb-python"
version = "0.31.0-beta.8"
version = "0.31.0-beta.10"
dependencies = [
"arrow",
"async-trait",
@@ -5235,7 +5239,7 @@ version = "0.50.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7957b9740744892f114936ab4a57b3f487491bbeafaf8083688b16841a4240e5"
dependencies = [
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]
@@ -5357,7 +5361,7 @@ dependencies = [
"parking_lot",
"percent-encoding",
"quick-xml 0.38.4",
"rand 0.9.2",
"rand 0.9.4",
"reqwest",
"ring",
"rustls-pemfile",
@@ -6199,8 +6203,8 @@ version = "0.14.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "343d3bd7056eda839b03204e68deff7d1b13aba7af2b2fd16890697274262ee7"
dependencies = [
"heck 0.5.0",
"itertools 0.11.0",
"heck 0.4.1",
"itertools 0.14.0",
"log",
"multimap",
"petgraph",
@@ -6219,7 +6223,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "27c6023962132f4b30eb4c172c91ce92d933da334c59c23cddee82358ddafb0b"
dependencies = [
"anyhow",
"itertools 0.11.0",
"itertools 0.14.0",
"proc-macro2",
"quote",
"syn 2.0.117",
@@ -6402,7 +6406,7 @@ dependencies = [
"bytes",
"getrandom 0.3.4",
"lru-slab",
"rand 0.9.2",
"rand 0.9.4",
"ring",
"rustc-hash",
"rustls 0.23.37",
@@ -6425,7 +6429,7 @@ dependencies = [
"once_cell",
"socket2 0.6.3",
"tracing",
"windows-sys 0.59.0",
"windows-sys 0.60.2",
]
[[package]]
@@ -6468,9 +6472,9 @@ dependencies = [
[[package]]
name = "rand"
version = "0.9.2"
version = "0.9.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6db2770f06117d490610c7488547d543617b21bfa07796d7a12f6f1bd53850d1"
checksum = "44c5af06bb1b7d3216d91932aed5265164bf384dc89cd6ba05cf59a35f5f76ea"
dependencies = [
"rand_chacha 0.9.0",
"rand_core 0.9.5",
@@ -6531,7 +6535,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6a8615d50dcf34fa31f7ab52692afec947c4dd0ab803cc87cb3b0b4570ff7463"
dependencies = [
"num-traits",
"rand 0.9.2",
"rand 0.9.4",
]
[[package]]
@@ -6566,7 +6570,7 @@ dependencies = [
"ahash",
"brotli 8.0.2",
"paste",
"rand 0.9.2",
"rand 0.9.4",
"unicase",
]
@@ -6954,7 +6958,7 @@ dependencies = [
"errno",
"libc",
"linux-raw-sys",
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]
@@ -6980,7 +6984,7 @@ dependencies = [
"once_cell",
"ring",
"rustls-pki-types",
"rustls-webpki 0.103.10",
"rustls-webpki 0.103.13",
"subtle",
"zeroize",
]
@@ -7028,9 +7032,9 @@ dependencies = [
[[package]]
name = "rustls-webpki"
version = "0.103.10"
version = "0.103.13"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "df33b2b81ac578cabaf06b89b0631153a3f416b0a886e8a7a1707fb51abbd1ef"
checksum = "61c429a8649f110dddef65e2a5ad240f747e85f7758a6bccc7e5777bd33f756e"
dependencies = [
"aws-lc-rs",
"ring",
@@ -7465,7 +7469,7 @@ version = "0.8.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c1c97747dbf44bb1ca44a561ece23508e99cb592e862f22222dcf42f51d1e451"
dependencies = [
"heck 0.5.0",
"heck 0.4.1",
"proc-macro2",
"quote",
"syn 2.0.117",
@@ -7477,7 +7481,7 @@ version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "54254b8531cafa275c5e096f62d48c81435d1015405a91198ddb11e967301d40"
dependencies = [
"heck 0.5.0",
"heck 0.4.1",
"proc-macro2",
"quote",
"syn 2.0.117",
@@ -7818,7 +7822,7 @@ dependencies = [
"getrandom 0.4.2",
"once_cell",
"rustix",
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]
@@ -8242,7 +8246,7 @@ version = "2.1.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9ea3136b675547379c4bd395ca6b938e5ad3c3d20fad76e7fe85f9e0d011419c"
dependencies = [
"rand 0.9.2",
"rand 0.9.4",
]
[[package]]
@@ -8298,9 +8302,9 @@ dependencies = [
[[package]]
name = "unicode-segmentation"
version = "1.13.1"
version = "1.13.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "da36089a805484bcccfffe0739803392c8298778a2d2f09febf76fac5ad9025b"
checksum = "9629274872b2bfaf8d66f5f15725007f635594914870f65218920345aa11aa8c"
[[package]]
name = "unicode-width"
@@ -8632,7 +8636,7 @@ version = "0.1.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22"
dependencies = [
"windows-sys 0.59.0",
"windows-sys 0.61.2",
]
[[package]]

View File

@@ -1,7 +1,5 @@
 [workspace]
 members = ["rust/lancedb", "nodejs", "python"]
-# Python package needs to be built by maturin.
-exclude = ["python"]
 resolver = "2"
 [workspace.package]
@@ -15,20 +13,20 @@ categories = ["database-implementations"]
 rust-version = "1.91.0"
 [workspace.dependencies]
-lance = { "version" = "=6.0.0-beta.1", default-features = false, "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-core = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-datagen = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-file = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-io = { "version" = "=6.0.0-beta.1", default-features = false, "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-index = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-linalg = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-namespace = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-namespace-impls = { "version" = "=6.0.0-beta.1", default-features = false, "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-table = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-testing = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-datafusion = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-encoding = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
-lance-arrow = { "version" = "=6.0.0-beta.1", "tag" = "v6.0.0-beta.1", "git" = "https://github.com/lance-format/lance.git" }
+lance = { "version" = "=6.0.0-beta.4", default-features = false, "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-core = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-datagen = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-file = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-io = { "version" = "=6.0.0-beta.4", default-features = false, "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-index = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-linalg = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-namespace = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-namespace-impls = { "version" = "=6.0.0-beta.4", default-features = false, "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-table = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-testing = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-datafusion = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-encoding = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
+lance-arrow = { "version" = "=6.0.0-beta.4", "tag" = "v6.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
 ahash = "0.8"
 # Note that this one does not include pyarrow
 arrow = { version = "57.2", optional = false }

deny.toml Normal file

@@ -0,0 +1,172 @@
# cargo-deny configuration for LanceDB.
#
# Run locally with `cargo deny check`. See
# https://embarkstudios.github.io/cargo-deny/ for the full reference.
# The set of target triples we care about. cargo-deny will only consider
# dependencies that are used on at least one of these targets. Keeping this
# explicit avoids noise from platform-specific crates (e.g. wasm, android,
# ios) that we never actually ship.
[graph]
targets = [
"x86_64-unknown-linux-gnu",
"aarch64-unknown-linux-gnu",
"x86_64-apple-darwin",
"aarch64-apple-darwin",
"x86_64-pc-windows-msvc",
"aarch64-pc-windows-msvc",
]
all-features = true
[output]
feature-depth = 1
# ---------------------------------------------------------------------------
# Advisories: security vulnerabilities and yanked crates.
# ---------------------------------------------------------------------------
[advisories]
version = 2
# Fail the check if any crate in the lockfile has been yanked from crates.io.
# Yanked crates are a signal the author retracted the release (often due to
# bugs or security issues) and should not be depended on.
yanked = "deny"
# Advisory IDs we have explicitly reviewed and chosen to accept. Every
# entry must include a rationale and, where possible, an upstream issue
# pointing to a fix. Revisit this list whenever dependencies are updated.
ignore = [
# rsa: Marvin Attack timing side-channel in PKCS#1 v1.5 decryption.
# Reached only through opendal → reqsign → rsa. We do not use RSA
# decryption in LanceDB ourselves; this is dormant in the signing path.
# No fixed release exists upstream as of this writing.
# https://rustsec.org/advisories/RUSTSEC-2023-0071
{ id = "RUSTSEC-2023-0071", reason = "rsa crate via opendal/reqsign; no fixed upstream release" },
# instant: unmaintained. Pulled in via backoff → instant. Upstream
# recommends switching to `web-time`; fix has to come from backoff.
# https://rustsec.org/advisories/RUSTSEC-2024-0384
{ id = "RUSTSEC-2024-0384", reason = "transitive via backoff; waiting on backoff replacement" },
# paste: unmaintained (author archived the repo). Used transitively by
# datafusion and the arrow ecosystem; widespread, no drop-in replacement.
# https://rustsec.org/advisories/RUSTSEC-2024-0436
{ id = "RUSTSEC-2024-0436", reason = "transitive via datafusion; awaiting ecosystem migration" },
# tantivy: segfault on malformed input due to missing bounds check.
# Pulled in via lance for full-text search. We only feed tantivy
# documents we construct ourselves, not attacker-controlled bytes.
# Tracked for a lance dependency bump.
# https://rustsec.org/advisories/RUSTSEC-2025-0003
{ id = "RUSTSEC-2025-0003", reason = "tantivy via lance; inputs are internally produced, not user-supplied bytes" },
# backoff: unmaintained. Reached only via async-openai. Replacement
# requires async-openai to migrate (or us to drop async-openai).
# https://rustsec.org/advisories/RUSTSEC-2025-0012
{ id = "RUSTSEC-2025-0012", reason = "transitive via async-openai; waiting on upstream migration" },
# number_prefix: unmaintained. Transitive via indicatif → hf-hub.
# No security impact, just maintenance status.
# https://rustsec.org/advisories/RUSTSEC-2025-0119
{ id = "RUSTSEC-2025-0119", reason = "transitive via hf-hub/indicatif; cosmetic formatting crate" },
# rustls-pemfile: unmaintained. Reached from two separate chains:
# rustls-native-certs 0.6 (via hyper-rustls 0.24) and object_store 0.12.
# Both upstream dependencies need to move before we can drop it.
# https://rustsec.org/advisories/RUSTSEC-2025-0134
{ id = "RUSTSEC-2025-0134", reason = "transitive via rustls-native-certs/object_store; waiting on upstream migration" },
# rustls-webpki 0.101.7 (old major line): name-constraint checks for
# URI / wildcard names. Pulled in only via the legacy rustls 0.21 chain
# from aws-smithy-http-client. The 0.103 line we actively use is patched.
# Clearing the 0.101 copy requires the aws-sdk chain to migrate off
# rustls 0.21.
# https://rustsec.org/advisories/RUSTSEC-2026-0098
# https://rustsec.org/advisories/RUSTSEC-2026-0099
{ id = "RUSTSEC-2026-0098", reason = "only affects rustls-webpki 0.101 from legacy aws-smithy/rustls 0.21 chain" },
{ id = "RUSTSEC-2026-0099", reason = "only affects rustls-webpki 0.101 from legacy aws-smithy/rustls 0.21 chain" },
# rustls-webpki 0.101.7: reachable panic in CRL parsing. Same legacy
# rustls 0.21 chain from aws-smithy-http-client as above. The 0.103 line
# we actively use is upgraded to 0.103.13 which contains the fix.
# https://rustsec.org/advisories/RUSTSEC-2026-0104
{ id = "RUSTSEC-2026-0104", reason = "only affects rustls-webpki 0.101 from legacy aws-smithy/rustls 0.21 chain" },
]
# ---------------------------------------------------------------------------
# Licenses: only allow licenses we've reviewed as compatible with Apache-2.0.
# ---------------------------------------------------------------------------
[licenses]
version = 2
# SPDX identifiers for licenses that are compatible with our Apache-2.0
# distribution. Additions require legal review.
allow = [
"Apache-2.0",
"Apache-2.0 WITH LLVM-exception",
"MIT",
"BSD-2-Clause",
"BSD-3-Clause",
"ISC",
"Unicode-3.0",
"Unicode-DFS-2016",
"Zlib",
"CC0-1.0",
"MPL-2.0",
"BSL-1.0",
"OpenSSL",
# 0BSD ("BSD Zero Clause") is effectively public domain — no attribution
# required. Pulled in by `mock_instant`.
"0BSD",
# bzip2-1.0.6 is the permissive upstream bzip2 license (BSD-like). Pulled
# in by `libbz2-rs-sys`, the pure-Rust bzip2 implementation.
"bzip2-1.0.6",
# CDLA-Permissive-2.0 is a permissive data license used by `webpki-roots`
# for the Mozilla CA root bundle. Data-only, distribution-compatible.
"CDLA-Permissive-2.0",
]
confidence-threshold = 0.8
# Crates whose license cannot be determined from Cargo metadata but whose
# license we've manually confirmed from upstream. Keep this list minimal.
[[licenses.clarify]]
# polars-arrow-format omits the `license` field in its Cargo.toml, but the
# upstream repo (pola-rs/polars-arrow-format) is dual-licensed Apache-2.0 OR
# MIT. See https://github.com/pola-rs/polars-arrow-format/blob/main/LICENSE
crate = "polars-arrow-format"
expression = "Apache-2.0 OR MIT"
license-files = []
# ---------------------------------------------------------------------------
# Bans: disallow specific crates and flag dependency hygiene issues.
# ---------------------------------------------------------------------------
[bans]
# Warn (not deny) on duplicate versions of the same crate. In a large
# workspace like this one, duplicates are common and often unavoidable
# transitively. We surface them to discourage growth, but don't fail CI.
multiple-versions = "warn"
# Wildcard version requirements (`foo = "*"`) are a footgun — they let any
# future release in without review. Ban them outright.
wildcards = "deny"
# Internal workspace crates reference each other via `path = "..."`, which
# cargo-deny sees as a wildcard version. That's fine for private workspace
# members (not published to crates.io), so allow it specifically for paths.
allow-wildcard-paths = true
# Crates that, if present in the dependency graph, should cause the check to fail.
deny = []
# Crates to skip when checking for duplicate versions.
skip = []
# Similar to `skip`, but also skips the entire transitive subtree.
skip-tree = []
# ---------------------------------------------------------------------------
# Sources: restrict where crates can come from.
# ---------------------------------------------------------------------------
[sources]
# Deny any registry other than the ones explicitly listed below.
unknown-registry = "deny"
# Deny any git dependency whose repository URL isn't in the allow-list below.
# This prevents accidental pulls from arbitrary forks.
unknown-git = "deny"
allow-registry = ["https://github.com/rust-lang/crates.io-index"]
# Lance is developed in a sibling repo and pulled as a git dependency until
# releases are cut to crates.io. Allow that specific repository.
allow-git = [
"https://github.com/lance-format/lance",
]


@@ -24,4 +24,4 @@ RUN python --version && \
rustc --version && \
protoc --version
RUN pip install --no-cache-dir tantivy lancedb
RUN pip install --no-cache-dir lancedb


@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-core</artifactId>
<version>0.28.0-beta.8</version>
<version>0.28.0-beta.10</version>
</dependency>
```


@@ -41,6 +41,29 @@ for testing purposes.
***
### manifestEnabled?
```ts
optional manifestEnabled: boolean;
```
(For LanceDB OSS only): use directory namespace manifests as the source
of truth for table metadata. Existing directory-listed root tables are
migrated into the manifest on access.
***
### namespaceClientProperties?
```ts
optional namespaceClientProperties: Record<string, string>;
```
(For LanceDB OSS only): extra properties for the backing namespace
client used by manifest-enabled native connections.
***
### readConsistencyInterval?
```ts


@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.28.0-beta.8</version>
<version>0.28.0-beta.10</version>
<relativePath>../pom.xml</relativePath>
</parent>


@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.28.0-beta.8</version>
<version>0.28.0-beta.10</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<arrow.version>15.0.0</arrow.version>
<lance-core.version>6.0.0-beta.1</lance-core.version>
<lance-core.version>6.0.0-beta.4</lance-core.version>
<spotless.skip>false</spotless.skip>
<spotless.version>2.30.0</spotless.version>
<spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>


@@ -1,7 +1,8 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.28.0-beta.8"
version = "0.28.0-beta.10"
publish = false
license.workspace = true
description.workspace = true
repository.workspace = true
@@ -31,8 +32,8 @@ lzma-sys = { version = "0.1", features = ["static"] }
log.workspace = true
# Pin to resolve build failures; update periodically for security patches.
aws-lc-sys = "=0.38.0"
aws-lc-rs = "=1.16.1"
aws-lc-sys = "=0.40.0"
aws-lc-rs = "=1.16.3"
[build-dependencies]
napi-build = "2.3.1"


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": [
"win32"
],


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",


@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"cpu": [
"x64",
"arm64"


@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.28.0-beta.8",
"version": "0.28.0-beta.10",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",


@@ -67,6 +67,12 @@ impl Connection {
builder = builder.storage_option(key, value);
}
}
if let Some(manifest_enabled) = options.manifest_enabled {
builder = builder.manifest_enabled(manifest_enabled);
}
if let Some(namespace_client_properties) = options.namespace_client_properties {
builder = builder.namespace_client_properties(namespace_client_properties);
}
// Create client config, optionally with header provider
let client_config = options.client_config.unwrap_or_default();


@@ -37,6 +37,13 @@ pub struct ConnectionOptions {
///
/// The available options are described at https://docs.lancedb.com/storage/
pub storage_options: Option<HashMap<String, String>>,
/// (For LanceDB OSS only): use directory namespace manifests as the source
/// of truth for table metadata. Existing directory-listed root tables are
/// migrated into the manifest on access.
pub manifest_enabled: Option<bool>,
/// (For LanceDB OSS only): extra properties for the backing namespace
/// client used by manifest-enabled native connections.
pub namespace_client_properties: Option<HashMap<String, String>>,
/// (For LanceDB OSS only): the session to use for this connection. Holds
/// shared caches and other session-specific state.
pub session: Option<session::Session>,


@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.31.0-beta.9"
current_version = "0.31.0-beta.10"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.


@@ -1,6 +1,7 @@
[package]
name = "lancedb-python"
version = "0.31.0-beta.9"
version = "0.31.0-beta.10"
publish = false
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true


@@ -183,7 +183,6 @@
| stack-data | 0.6.3 | MIT License | http://github.com/alexmojaki/stack_data |
| sympy | 1.14.0 | BSD License | https://sympy.org |
| tabulate | 0.9.0 | MIT License | https://github.com/astanin/python-tabulate |
| tantivy | 0.25.1 | UNKNOWN | UNKNOWN |
| threadpoolctl | 3.6.0 | BSD License | https://github.com/joblib/threadpoolctl |
| timm | 1.0.24 | Apache Software License | https://github.com/huggingface/pytorch-image-models |
| tinycss2 | 1.4.0 | BSD License | https://www.courtbouillon.org/tinycss2 |


@@ -57,7 +57,6 @@ tests = [
"duckdb>=0.9.0",
"pytz>=2023.3",
"polars>=0.19, <=1.3.0",
"tantivy>=0.20.0",
"pyarrow-stubs>=16.0",
"pylance>=5.0.0b5",
"requests>=2.31.0",


@@ -73,6 +73,7 @@ def connect(
client_config: Union[ClientConfig, Dict[str, Any], None] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
manifest_enabled: bool = False,
namespace_client_impl: Optional[str] = None,
namespace_client_properties: Optional[Dict[str, str]] = None,
namespace_client_pushdown_operations: Optional[List[str]] = None,
@@ -111,6 +112,10 @@ def connect(
storage_options: dict, optional
Additional options for the storage backend. See available options at
<https://docs.lancedb.com/storage/>
manifest_enabled : bool, default False
When true for local/native connections, use directory namespace
manifests as the source of truth for table metadata. Existing
directory-listed root tables are migrated into the manifest on access.
session: Session, optional
(For LanceDB OSS only)
A session to use for this connection. Sessions allow you to configure
@@ -158,11 +163,11 @@ def connect(
conn : DBConnection
A connection to a LanceDB database.
"""
if namespace_client_impl is not None or namespace_client_properties is not None:
if namespace_client_impl is None or namespace_client_properties is None:
if namespace_client_impl is not None:
if namespace_client_properties is None:
raise ValueError(
"Both namespace_client_impl and "
"namespace_client_properties must be provided"
"namespace_client_properties must be provided when "
"namespace_client_impl is set"
)
if kwargs:
raise ValueError(f"Unknown keyword arguments: {kwargs}")
@@ -175,6 +180,12 @@ def connect(
namespace_client_pushdown_operations=namespace_client_pushdown_operations,
)
if namespace_client_properties is not None and not manifest_enabled:
raise ValueError(
"namespace_client_impl must be provided when using "
"namespace_client_properties unless manifest_enabled=True"
)
if namespace_client_pushdown_operations is not None:
raise ValueError(
"namespace_client_pushdown_operations is only valid when "
@@ -212,6 +223,8 @@ def connect(
read_consistency_interval=read_consistency_interval,
storage_options=storage_options,
session=session,
manifest_enabled=manifest_enabled,
namespace_client_properties=namespace_client_properties,
)
@@ -289,6 +302,8 @@ def deserialize_conn(
parsed["uri"],
read_consistency_interval=rci,
storage_options=storage_options,
manifest_enabled=parsed.get("manifest_enabled", False),
namespace_client_properties=parsed.get("namespace_client_properties"),
)
else:
raise ValueError(f"Unknown connection_type: {connection_type}")
@@ -304,6 +319,8 @@ async def connect_async(
client_config: Optional[Union[ClientConfig, Dict[str, Any]]] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
manifest_enabled: bool = False,
namespace_client_properties: Optional[Dict[str, str]] = None,
) -> AsyncConnection:
"""Connect to a LanceDB database.
@@ -343,6 +360,13 @@ async def connect_async(
cache sizes for index and metadata caches, which can significantly
impact memory use and performance. They can also be re-used across
multiple connections to share the same cache state.
manifest_enabled : bool, default False
When true for local/native connections, use directory namespace
manifests as the source of truth for table metadata. Existing
directory-listed root tables are migrated into the manifest on access.
namespace_client_properties : dict, optional
Additional directory namespace client properties to use with
``manifest_enabled=True``.
Examples
--------
@@ -385,6 +409,8 @@ async def connect_async(
client_config,
storage_options,
session,
manifest_enabled,
namespace_client_properties,
)
)
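The reworked argument checks in `connect` can be read as a small decision function. The sketch below mirrors the new branches from the hunks above. The helper name `resolve_connection_mode` and its string return values are hypothetical, introduced only for illustration; the real function goes on to build a connection rather than return a label.

```python
from typing import Dict, Optional


def resolve_connection_mode(
    namespace_client_impl: Optional[str] = None,
    namespace_client_properties: Optional[Dict[str, str]] = None,
    manifest_enabled: bool = False,
) -> str:
    """Hypothetical helper mirroring the argument validation in the diff."""
    if namespace_client_impl is not None:
        # An explicit namespace implementation always needs its properties.
        if namespace_client_properties is None:
            raise ValueError(
                "namespace_client_properties must be provided when "
                "namespace_client_impl is set"
            )
        return "namespace"
    # Without an implementation, properties are only meaningful for a
    # manifest-enabled native connection.
    if namespace_client_properties is not None and not manifest_enabled:
        raise ValueError(
            "namespace_client_impl must be provided when using "
            "namespace_client_properties unless manifest_enabled=True"
        )
    return "native"


# "example-impl" is a placeholder value, not a real implementation name.
print(resolve_connection_mode("example-impl", {"root": "/tmp/db"}))
print(resolve_connection_mode(None, {"k": "v"}, manifest_enabled=True))
```

The change relative to the old check is that `namespace_client_properties` alone no longer fails outright: it is accepted when `manifest_enabled=True` and rejected otherwise.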


@@ -242,6 +242,8 @@ async def connect(
client_config: Optional[Union[ClientConfig, Dict[str, Any]]],
storage_options: Optional[Dict[str, str]],
session: Optional[Session],
manifest_enabled: bool = False,
namespace_client_properties: Optional[Dict[str, str]] = None,
) -> Connection: ...
class RecordBatchStream:


@@ -590,8 +590,13 @@ class LanceDBConnection(DBConnection):
read_consistency_interval: Optional[timedelta] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
manifest_enabled: bool = False,
namespace_client_properties: Optional[Dict[str, str]] = None,
_inner: Optional[LanceDbConnection] = None,
):
self.storage_options = storage_options
self._manifest_enabled = manifest_enabled
self._namespace_client_properties = namespace_client_properties
if _inner is not None:
self._conn = _inner
self._cached_namespace_client = None
@@ -633,6 +638,8 @@ class LanceDBConnection(DBConnection):
None,
storage_options,
session,
manifest_enabled,
namespace_client_properties,
)
# TODO: It would be nice if we didn't store self.storage_options but it is
@@ -640,7 +647,6 @@ class LanceDBConnection(DBConnection):
# work because some paths like LanceDBConnection.from_inner will lose the
# storage_options. Also, this class really shouldn't be holding any state
# beyond _conn.
self.storage_options = storage_options
self._conn = AsyncConnection(LOOP.run(do_connect()))
self._cached_namespace_client: Optional[LanceNamespace] = None
@@ -677,6 +683,8 @@ class LanceDBConnection(DBConnection):
"connection_type": "local",
"uri": self.uri,
"storage_options": self.storage_options,
"manifest_enabled": self._manifest_enabled,
"namespace_client_properties": self._namespace_client_properties,
"read_consistency_interval_seconds": (
rci.total_seconds() if rci else None
),
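The serialized connection now carries `manifest_enabled` and `namespace_client_properties`, and `deserialize_conn` reads them with `.get(...)` defaults, presumably so that payloads written before these keys existed still load. A minimal sketch of that round trip follows. It assumes a JSON payload, and `serialize_local` / `deserialize_local` are hypothetical stand-ins, not the library's functions.

```python
import json
from typing import Any, Dict, Optional


def serialize_local(
    uri: str,
    storage_options: Optional[Dict[str, str]] = None,
    manifest_enabled: bool = False,
    namespace_client_properties: Optional[Dict[str, str]] = None,
) -> str:
    """Hypothetical stand-in for the serializing side shown in the diff."""
    return json.dumps(
        {
            "connection_type": "local",
            "uri": uri,
            "storage_options": storage_options,
            "manifest_enabled": manifest_enabled,
            "namespace_client_properties": namespace_client_properties,
        }
    )


def deserialize_local(payload: str) -> Dict[str, Any]:
    """Rebuild the connect() keyword arguments, tolerating missing keys."""
    parsed = json.loads(payload)
    connection_type = parsed.get("connection_type")
    if connection_type != "local":
        raise ValueError(f"Unknown connection_type: {connection_type}")
    return {
        "uri": parsed["uri"],
        "storage_options": parsed.get("storage_options"),
        # Defaults keep payloads without the new keys loadable.
        "manifest_enabled": parsed.get("manifest_enabled", False),
        "namespace_client_properties": parsed.get("namespace_client_properties"),
    }


new_style = deserialize_local(serialize_local("/tmp/db", manifest_enabled=True))
old_style = deserialize_local('{"connection_type": "local", "uri": "/tmp/db"}')
print(new_style["manifest_enabled"], old_style["manifest_enabled"])
```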


@@ -1,201 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Full text search index using tantivy-py"""
import os
from typing import List, Tuple, Optional
import pyarrow as pa
try:
import tantivy
except ImportError:
raise ImportError(
"Please install tantivy-py `pip install tantivy` to use the full text search feature." # noqa: E501
)
from .table import LanceTable
def create_index(
index_path: str,
text_fields: List[str],
ordering_fields: Optional[List[str]] = None,
tokenizer_name: str = "default",
) -> tantivy.Index:
"""
Create a new Index (not populated)
Parameters
----------
index_path : str
Path to the index directory
text_fields : List[str]
List of text fields to index
ordering_fields: List[str]
List of unsigned type fields to order by at search time
tokenizer_name : str, default "default"
The tokenizer to use
Returns
-------
index : tantivy.Index
The index object (not yet populated)
"""
if ordering_fields is None:
ordering_fields = []
# Declaring our schema.
schema_builder = tantivy.SchemaBuilder()
# special field that we'll populate with row_id
schema_builder.add_integer_field("doc_id", stored=True)
# data fields
for name in text_fields:
schema_builder.add_text_field(name, stored=True, tokenizer_name=tokenizer_name)
if ordering_fields:
for name in ordering_fields:
schema_builder.add_unsigned_field(name, fast=True)
schema = schema_builder.build()
os.makedirs(index_path, exist_ok=True)
index = tantivy.Index(schema, path=index_path)
return index
def populate_index(
index: tantivy.Index,
table: LanceTable,
fields: List[str],
writer_heap_size: Optional[int] = None,
ordering_fields: Optional[List[str]] = None,
) -> int:
"""
Populate an index with data from a LanceTable
Parameters
----------
index : tantivy.Index
The index object
table : LanceTable
The table to index
fields : List[str]
List of fields to index
writer_heap_size : int
The writer heap size in bytes, defaults to 1GB
Returns
-------
int
The number of rows indexed
"""
if ordering_fields is None:
ordering_fields = []
writer_heap_size = writer_heap_size or 1024 * 1024 * 1024
# first check the fields exist and are string or large string type
nested = []
for name in fields:
try:
f = table.schema.field(name) # raises KeyError if not found
except KeyError:
f = resolve_path(table.schema, name)
nested.append(name)
if not pa.types.is_string(f.type) and not pa.types.is_large_string(f.type):
raise TypeError(f"Field {name} is not a string type")
# create a tantivy writer
writer = index.writer(heap_size=writer_heap_size)
# write data into index
dataset = table.to_lance()
row_id = 0
max_nested_level = 0
if len(nested) > 0:
max_nested_level = max([len(name.split(".")) for name in nested])
for b in dataset.to_batches(columns=fields + ordering_fields):
if max_nested_level > 0:
b = pa.Table.from_batches([b])
for _ in range(max_nested_level - 1):
b = b.flatten()
for i in range(b.num_rows):
doc = tantivy.Document()
for name in fields:
value = b[name][i].as_py()
if value is not None:
doc.add_text(name, value)
for name in ordering_fields:
value = b[name][i].as_py()
if value is not None:
doc.add_unsigned(name, value)
if not doc.is_empty:
doc.add_integer("doc_id", row_id)
writer.add_document(doc)
row_id += 1
# commit changes
writer.commit()
return row_id
def resolve_path(schema, field_name: str) -> pa.Field:
"""
Resolve a nested field path to a list of field names
Parameters
----------
field_name : str
The field name to resolve
Returns
-------
List[str]
The resolved path
"""
path = field_name.split(".")
field = schema.field(path.pop(0))
for segment in path:
if pa.types.is_struct(field.type):
field = field.type.field(segment)
else:
raise KeyError(f"field {field_name} not found in schema {schema}")
return field
def search_index(
index: tantivy.Index, query: str, limit: int = 10, ordering_field=None
) -> Tuple[Tuple[int], Tuple[float]]:
"""
Search an index for a query
Parameters
----------
index : tantivy.Index
The index object
query : str
The query string
limit : int
The maximum number of results to return
Returns
-------
ids_and_score: list[tuple[int], tuple[float]]
A tuple of two tuples, the first containing the document ids
and the second containing the scores
"""
searcher = index.searcher()
query = index.parse_query(query)
# get top results
if ordering_field:
results = searcher.search(query, limit, order_by_field=ordering_field)
else:
results = searcher.search(query, limit)
if results.count == 0:
return tuple(), tuple()
return tuple(
zip(
*[
(searcher.doc(doc_address)["doc_id"][0], score)
for score, doc_address in results.hits
]
)
)


@@ -0,0 +1,230 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""
PyTorch integration for LanceDB.
Exposes ``LanceTorchDataset`` (map-style) and ``LanceIterableTorchDataset``
(iterable-style) wrappers that adapt a LanceDB table or permutation to the
PyTorch ``torch.utils.data`` API, while transparently handling the bits
that make a hand-rolled subclass tricky:
* The underlying Lance reader holds Rust state that is not picklable, but
``DataLoader(num_workers > 0)`` has to copy the dataset into each worker
process (by pickling it under the ``spawn`` start method). These classes
strip the reader on pickle and re-open it in the worker on first read.
* Constructing a permutation from a table involves several steps
(``permutation_builder``/``Permutation.from_tables``/``select_columns``
/``with_format``/...). The wrapper takes those as constructor arguments
and applies them once the dataset is opened in the worker.
Example
-------
>>> import lancedb, torch # doctest: +SKIP
>>> from lancedb.integrations.torch import LanceTorchDataset
>>> db = lancedb.connect(uri) # doctest: +SKIP
>>> tbl = db.open_table("images_224") # doctest: +SKIP
>>> ds = LanceTorchDataset( # doctest: +SKIP
... tbl, columns=["image_bytes", "label"], format="torch"
... )
>>> loader = torch.utils.data.DataLoader( # doctest: +SKIP
... ds, batch_size=64, num_workers=4, shuffle=True,
... )
"""
from typing import Any, Callable, Dict, List, Optional, Union
import torch.utils.data as _torch_data
from ..permutation import Permutation
from ..table import LanceTable
def _capture_table_state(table: LanceTable) -> Dict[str, Any]:
"""Pull just enough state out of a LanceTable so we can re-open the same
table in a forked worker process where the Rust handle isn't valid."""
conn = table._conn
connect_kwargs: Dict[str, Any] = {}
storage_options = getattr(conn, "storage_options", None)
if storage_options is not None:
connect_kwargs["storage_options"] = storage_options
return {
"uri": conn.uri,
"table_name": table.name,
"connect_kwargs": connect_kwargs,
}
def _open_permutation(state: Dict[str, Any]) -> Permutation:
"""Reconstruct a Permutation from a captured state dict."""
import lancedb
db = lancedb.connect(state["uri"], **state["connect_kwargs"])
base = db.open_table(state["table_name"])
perm_table_name = state.get("perm_table_name")
if perm_table_name is not None:
perm_tbl = db.open_table(perm_table_name)
perm = Permutation.from_tables(base, perm_tbl, state.get("split"))
else:
perm = Permutation.identity(base)
columns = state.get("columns")
fmt = state.get("format")
transform = state.get("transform")
batch_size = state.get("batch_size")
if columns is not None:
perm = perm.select_columns(columns)
if fmt is not None:
perm = perm.with_format(fmt)
if transform is not None:
perm = perm.with_transform(transform)
if batch_size is not None:
perm = perm.with_batch_size(batch_size)
return perm
class LanceTorchDataset(_torch_data.Dataset):
"""
A PyTorch map-style ``Dataset`` backed by a LanceDB table or permutation.
Pass the same ``LanceTable`` you already opened (and, optionally, a
permutation table / split / column selection / output format) and use
the result anywhere a ``torch.utils.data.Dataset`` is expected.
The wrapper:
* Stores the URI / table name / storage options needed to re-open the
table, not the Rust reader handle. Pickling keeps only the rebuild
recipe, so ``DataLoader(num_workers > 0)`` works out of the box.
* Implements both ``__getitem__`` and PyTorch's ``__getitems__`` dunder
so the underlying batched ``Permutation.fetch`` is used when the
DataLoader fetches a batch of indices.
Parameters
----------
table : LanceTable, optional
The base table to read from. Either ``table`` or both ``uri`` and
``table_name`` must be provided.
uri : str, optional
Database URI to reconnect to. Required if ``table`` is not given.
table_name : str, optional
Name of the base table within ``uri``.
connect_kwargs : dict, optional
Extra keyword arguments forwarded to ``lancedb.connect`` when
re-opening the database in a worker.
permutation_table : LanceTable, optional
A pre-built permutation table (see ``permutation_builder``) used to
define the row ordering. If omitted, the identity permutation is
used (rows in physical order).
split : str or int, optional
Split selector when ``permutation_table`` defines splits.
columns : list[str], optional
Subset of columns to read.
format : str, optional
Output format, forwarded to ``Permutation.with_format`` (e.g.
``"torch"`` for HuggingFace-style ``dict[str, Tensor]`` batches).
transform : Callable, optional
Custom batch transform, forwarded to ``Permutation.with_transform``.
Must be picklable to work with ``num_workers > 0``.
batch_size : int, optional
Forwarded to ``Permutation.with_batch_size`` for direct iteration.
DataLoader controls its own batching, so this only matters if the
dataset is iterated directly.
"""
def __init__(
self,
table: Optional[LanceTable] = None,
*,
uri: Optional[str] = None,
table_name: Optional[str] = None,
connect_kwargs: Optional[Dict[str, Any]] = None,
permutation_table: Optional[LanceTable] = None,
split: Optional[Union[str, int]] = None,
columns: Optional[List[str]] = None,
format: Optional[str] = None,
transform: Optional[Callable] = None,
batch_size: Optional[int] = None,
):
if table is None and (uri is None or table_name is None):
raise ValueError(
"Provide either `table` or both `uri` and `table_name`."
)
if table is not None:
state = _capture_table_state(table)
if connect_kwargs is not None:
state["connect_kwargs"] = connect_kwargs
else:
state = {
"uri": uri,
"table_name": table_name,
"connect_kwargs": connect_kwargs or {},
}
state["perm_table_name"] = (
permutation_table.name if permutation_table is not None else None
)
state["split"] = split
state["columns"] = columns
state["format"] = format
state["transform"] = transform
state["batch_size"] = batch_size
self._state: Dict[str, Any] = state
self._perm: Optional[Permutation] = None
def __getstate__(self) -> Dict[str, Any]:
# Strip the Rust-backed reader so the dataset is picklable. Workers
# rebuild it on first read via _ensure_open().
d = self.__dict__.copy()
d["_perm"] = None
return d
def __setstate__(self, d: Dict[str, Any]) -> None:
self.__dict__.update(d)
def _ensure_open(self) -> None:
if self._perm is None:
self._perm = _open_permutation(self._state)
def __len__(self) -> int:
self._ensure_open()
return len(self._perm)
def __getitem__(self, index: int) -> Any:
self._ensure_open()
return self._perm[index]
def __getitems__(self, indices: List[int]) -> Any:
self._ensure_open()
return self._perm.fetch(indices)
class LanceIterableTorchDataset(_torch_data.IterableDataset):
"""
PyTorch iterable-style ``IterableDataset`` over a LanceDB permutation.
Yields batches in the order defined by the underlying ``Permutation``.
With ``num_workers > 1`` each worker iterates the full permutation
independently, so every row is yielded once per worker. For sharded
iteration, use the map-style ``LanceTorchDataset`` together with a sampler.
Constructor arguments mirror ``LanceTorchDataset``.
"""
def __init__(self, *args, **kwargs):
self._inner = LanceTorchDataset(*args, **kwargs)
def __getstate__(self) -> Dict[str, Any]:
return {"_inner": self._inner.__getstate__()}
def __setstate__(self, d: Dict[str, Any]) -> None:
self._inner = LanceTorchDataset.__new__(LanceTorchDataset)
self._inner.__setstate__(d["_inner"])
def __iter__(self):
self._inner._ensure_open()
return iter(self._inner._perm)
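
The pickling strategy used by `LanceTorchDataset` above (keep only plain state, drop the live reader in `__getstate__`, reopen lazily in `_ensure_open`) can be illustrated without LanceDB or PyTorch. The following is a minimal sketch, not the library's implementation: `FakeReader` and `LazyDataset` are hypothetical names, and a `threading.Lock` stands in for the non-picklable Rust-backed handle.

```python
import pickle
import threading
from typing import Any, Dict, List, Optional


class FakeReader:
    """Hypothetical stand-in for a native reader that cannot be pickled."""

    def __init__(self, rows: List[int]):
        self._rows = rows
        self._lock = threading.Lock()  # lock objects are not picklable

    def fetch(self, indices: List[int]) -> List[int]:
        return [self._rows[i] for i in indices]


class LazyDataset:
    """Stores only picklable state and opens the reader on first access."""

    def __init__(self, num_rows: int):
        self._state: Dict[str, Any] = {"num_rows": num_rows}
        self._reader: Optional[FakeReader] = None

    def __getstate__(self) -> Dict[str, Any]:
        d = self.__dict__.copy()
        d["_reader"] = None  # drop the live handle before pickling
        return d

    def __setstate__(self, d: Dict[str, Any]) -> None:
        self.__dict__.update(d)

    def _ensure_open(self) -> None:
        # Rebuild the reader from plain state; runs once per process.
        if self._reader is None:
            self._reader = FakeReader(list(range(self._state["num_rows"])))

    def __len__(self) -> int:
        return self._state["num_rows"]

    def __getitem__(self, index: int) -> int:
        self._ensure_open()
        return self._reader.fetch([index])[0]

    def __getitems__(self, indices: List[int]) -> List[int]:
        # Batched path: one reader call for the whole list of indices.
        self._ensure_open()
        return self._reader.fetch(indices)


ds = LazyDataset(10)
first = ds[3]  # opens the reader in this process
clone = pickle.loads(pickle.dumps(ds))  # works even though a reader is open
batch = clone.__getitems__([0, 5, 9])  # the clone reopens its own reader
```

A copy produced by pickling starts with no reader and builds its own on first access, which is the behavior a `DataLoader` worker process needs.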


@@ -779,6 +779,25 @@ class Permutation:
batch = LOOP.run(do_getitems())
return self.transform_fn(batch)
def fetch(self, indices: list[int]) -> Any:
"""
Fetch rows from the permutation by offset.
This is the public batch-access API. It returns the rows for the given
offsets in the same shape as configured by
[with_format](#with_format) / [with_transform](#with_transform).
Examples
--------
>>> import lancedb
>>> db = lancedb.connect("memory:///")
>>> tbl = db.create_table("tbl", data=[{"x": x} for x in range(10)])
>>> perm = Permutation.identity(tbl)
>>> perm.fetch([0, 5, 9])
[{'x': 0}, {'x': 5}, {'x': 9}]
"""
return self.__getitems__(indices)
@deprecated(details="Use with_skip instead")
def skip(self, skip: int) -> "Permutation":
"""

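The `fetch` hunk above adds a public name that delegates to `__getitems__`, so the configured transform still applies. A minimal sketch of that delegation follows; `ToyPermutation` is a hypothetical stand-in that holds rows in a plain list, and its `with_transform` is simplified for illustration rather than copied from the library.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, int]


class ToyPermutation:
    """Hypothetical stand-in: rows live in a list, transform_fn shapes output."""

    def __init__(
        self,
        rows: List[Row],
        transform_fn: Callable[[List[Row]], Any] = lambda batch: batch,
    ):
        self._rows = rows
        self.transform_fn = transform_fn

    def with_transform(self, fn: Callable[[List[Row]], Any]) -> "ToyPermutation":
        # Return a new object so the original configuration is untouched.
        return ToyPermutation(self._rows, fn)

    def __getitems__(self, indices: List[int]) -> Any:
        batch = [self._rows[i] for i in indices]  # IndexError if out of range
        return self.transform_fn(batch)

    def fetch(self, indices: List[int]) -> Any:
        # Public alias: callers do not need to invoke a dunder directly.
        return self.__getitems__(indices)


perm = ToyPermutation([{"x": x} for x in range(10)])
rows = perm.fetch([0, 5, 9])
xs = perm.with_transform(lambda batch: [r["x"] for r in batch]).fetch([0, 5, 9])
```

Because `fetch` only forwards to `__getitems__`, both return the same value for the same indices, which is what `test_fetch_matches_getitems` asserts for the real class.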

@@ -25,7 +25,6 @@ import deprecation
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.fs as pa_fs
import pydantic
from lancedb.pydantic import PYDANTIC_VERSION
@@ -1526,9 +1525,7 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
return self._table._output_schema(self.to_query_object())
def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
path, fs, exist = self._table._get_fts_index_path()
if exist:
return self.tantivy_to_arrow()
self._table._ensure_no_legacy_fts_index()
query = self._query
if self._phrase_query:
@@ -1552,90 +1549,6 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
):
raise NotImplementedError("to_batches on an FTS query")
def tantivy_to_arrow(self) -> pa.Table:
try:
import tantivy
except ImportError:
raise ImportError(
"Please install tantivy-py `pip install tantivy` to use the full text search feature." # noqa: E501
)
from .fts import search_index
# get the index path
path, fs, exist = self._table._get_fts_index_path()
# check if the index exist
if not exist:
raise FileNotFoundError(
"Fts index does not exist. "
"Please first call table.create_fts_index(['<field_names>']) to "
"create the fts index."
)
# Check that we are on local filesystem
if not isinstance(fs, pa_fs.LocalFileSystem):
raise NotImplementedError(
"Tantivy-based full text search "
"is only supported on the local filesystem"
)
# open the index
index = tantivy.Index.open(path)
# get the scores and doc ids
query = self._query
if self._phrase_query:
query = query.replace('"', "'")
query = f'"{query}"'
limit = self._limit if self._limit is not None else 10
row_ids, scores = search_index(
index, query, limit, ordering_field=self.ordering_field_name
)
if len(row_ids) == 0:
empty_schema = pa.schema([pa.field("_score", pa.float32())])
return pa.Table.from_batches([], schema=empty_schema)
scores = pa.array(scores)
output_tbl = self._table.to_lance().take(row_ids, columns=self._columns)
output_tbl = output_tbl.append_column("_score", scores)
# this needs to match vector search results which are uint64
row_ids = pa.array(row_ids, type=pa.uint64())
if self._where is not None:
tmp_name = "__lancedb__duckdb__indexer__"
output_tbl = output_tbl.append_column(
tmp_name, pa.array(range(len(output_tbl)))
)
try:
# TODO would be great to have Substrait generate pyarrow compute
# expressions or conversely have pyarrow support SQL expressions
# using Substrait
import duckdb
indexer = duckdb.sql(
f"SELECT {tmp_name} FROM output_tbl WHERE {self._where}"
).to_arrow_table()[tmp_name]
output_tbl = output_tbl.take(indexer).drop([tmp_name])
row_ids = row_ids.take(indexer)
except ImportError:
import tempfile
import lance
# TODO Use "memory://" instead once that's supported
with tempfile.TemporaryDirectory() as tmp:
ds = lance.write_dataset(output_tbl, tmp)
output_tbl = ds.to_table(filter=self._where)
indexer = output_tbl[tmp_name]
row_ids = row_ids.take(indexer)
output_tbl = output_tbl.drop([tmp_name])
if self._with_row_id:
output_tbl = output_tbl.append_column("_rowid", row_ids)
if self._reranker is not None:
output_tbl = self._reranker.rerank_fts(self._query, output_tbl)
return output_tbl
def rerank(self, reranker: Reranker) -> LanceFtsQueryBuilder:
"""Rerank the results using the specified reranker.


@@ -943,29 +943,26 @@ class Table(ABC):
Parameters
----------
field_names: str or list of str
The name(s) of the field to index.
If ``use_tantivy`` is False (default), only a single field name
(str) is supported. To index multiple fields, create a separate
FTS index for each field.
The name of the field to index. Native FTS indexes can only be
created on a single field at a time. To search over multiple text
fields, create a separate FTS index for each field.
replace: bool, default False
If True, replace the existing index if it exists. Note that this is
not yet an atomic operation; the index will be temporarily
unavailable while the new index is being created.
writer_heap_size: int, default 1GB
Only available with use_tantivy=True
Deprecated legacy Tantivy parameter. Any value other than the
default raises an error.
ordering_field_names:
A list of unsigned type fields to index to optionally order
results on at search time.
only available with use_tantivy=True
Deprecated legacy Tantivy parameter. Setting this raises an error.
tokenizer_name: str, default "default"
The tokenizer to use for the index. Can be "raw", "default" or the 2 letter
language code followed by "_stem". So for english it would be "en_stem".
For available languages see: https://docs.rs/tantivy/latest/tantivy/tokenizer/enum.Language.html
A compatibility alias for native tokenizer configs. Can be "raw",
"default", or a two-letter language code followed by "_stem". For
English, this would be "en_stem".
use_tantivy: bool, default False
If True, use the legacy full-text search implementation based on tantivy.
If False, use the new full-text search implementation based on lance-index.
Deprecated legacy Tantivy parameter. Setting this to True raises an
error.
with_position: bool, default False
Only available with use_tantivy=False
If False, do not store the positions of the terms in the text.
This can reduce the size of the index and improve indexing speed.
But it will raise an exception for phrase queries.
@@ -1746,6 +1743,16 @@ class Table(ABC):
index_exists = fs.get_file_info(path).type != pa_fs.FileType.NotFound
return (path, fs, index_exists)
def _ensure_no_legacy_fts_index(self):
path, _, exists = self._get_fts_index_path()
if exists:
raise ValueError(
"Legacy Tantivy FTS index detected at "
f"{path}. Tantivy-based FTS has been removed. "
"Delete the legacy index and recreate it with "
"table.create_fts_index(...)."
)
@abstractmethod
def uses_v2_manifest_paths(self) -> bool:
"""
@@ -2405,84 +2412,63 @@ class LanceTable(Table):
prefix_only: bool = False,
name: Optional[str] = None,
):
if not use_tantivy:
if not isinstance(field_names, str):
raise ValueError(
"Native FTS indexes can only be created on a single field "
"at a time. To search over multiple text fields, create a "
"separate FTS index for each field."
)
self._ensure_no_legacy_fts_index()
if tokenizer_name is None:
tokenizer_configs = {
"base_tokenizer": base_tokenizer,
"language": language,
"with_position": with_position,
"max_token_length": max_token_length,
"lower_case": lower_case,
"stem": stem,
"remove_stop_words": remove_stop_words,
"ascii_folding": ascii_folding,
"ngram_min_length": ngram_min_length,
"ngram_max_length": ngram_max_length,
"prefix_only": prefix_only,
}
else:
tokenizer_configs = self.infer_tokenizer_configs(tokenizer_name)
config = FTS(
**tokenizer_configs,
if use_tantivy:
raise ValueError(
"Tantivy-based FTS has been removed. "
"Remove use_tantivy and recreate the index with native FTS."
)
# delete the existing legacy index if it exists
if replace:
path, fs, exist = self._get_fts_index_path()
if exist:
fs.delete_dir(path)
LOOP.run(
self._table.create_index(
field_names,
replace=replace,
config=config,
name=name,
)
if ordering_field_names is not None:
raise ValueError(
"ordering_field_names was only supported by the removed "
"Tantivy-based FTS implementation."
)
return
from .fts import create_index, populate_index
if isinstance(field_names, str):
field_names = [field_names]
if isinstance(ordering_field_names, str):
ordering_field_names = [ordering_field_names]
path, fs, exist = self._get_fts_index_path()
if exist:
if not replace:
raise ValueError("Index already exists. Use replace=True to overwrite.")
fs.delete_dir(path)
if not isinstance(fs, pa_fs.LocalFileSystem):
raise NotImplementedError(
"Full-text search is only supported on the local filesystem"
if writer_heap_size != 1024 * 1024 * 1024:
raise ValueError(
"writer_heap_size was only supported by the removed "
"Tantivy-based FTS implementation."
)
if not isinstance(field_names, str):
raise ValueError(
"Native FTS indexes can only be created on a single field "
"at a time. To search over multiple text fields, create a "
"separate FTS index for each field."
)
if "." in field_names:
raise ValueError(
"Native FTS indexes can only be created on top-level fields. "
f"Received nested field path: {field_names!r}."
)
if tokenizer_name is None:
tokenizer_name = "default"
index = create_index(
path,
field_names,
ordering_fields=ordering_field_names,
tokenizer_name=tokenizer_name,
tokenizer_configs = {
"base_tokenizer": base_tokenizer,
"language": language,
"with_position": with_position,
"max_token_length": max_token_length,
"lower_case": lower_case,
"stem": stem,
"remove_stop_words": remove_stop_words,
"ascii_folding": ascii_folding,
"ngram_min_length": ngram_min_length,
"ngram_max_length": ngram_max_length,
"prefix_only": prefix_only,
}
else:
tokenizer_configs = self.infer_tokenizer_configs(tokenizer_name)
config = FTS(
**tokenizer_configs,
)
populate_index(
index,
self,
field_names,
ordering_fields=ordering_field_names,
writer_heap_size=writer_heap_size,
LOOP.run(
self._table.create_index(
field_names,
replace=replace,
config=config,
name=name,
)
)
@staticmethod
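
The new `create_fts_index` body rejects the removed Tantivy options before it builds the native index config. A minimal sketch of that validation order, assuming only what the hunk shows: `validate_fts_args` and `error_for` are hypothetical helpers, the error messages are shortened from the diff, and the legacy-index check (`_ensure_no_legacy_fts_index`) is left out because it needs a real table.

```python
from typing import List, Optional, Union

# 1 GiB: the default that the diff compares writer_heap_size against.
DEFAULT_WRITER_HEAP_SIZE = 1024 * 1024 * 1024


def validate_fts_args(
    field_names: Union[str, List[str]],
    *,
    use_tantivy: bool = False,
    ordering_field_names: Optional[List[str]] = None,
    writer_heap_size: int = DEFAULT_WRITER_HEAP_SIZE,
) -> str:
    """Hypothetical helper: reject removed options, return the field name."""
    if use_tantivy:
        raise ValueError("Tantivy-based FTS has been removed.")
    if ordering_field_names is not None:
        raise ValueError(
            "ordering_field_names was only supported by the removed "
            "Tantivy-based FTS implementation."
        )
    if writer_heap_size != DEFAULT_WRITER_HEAP_SIZE:
        raise ValueError(
            "writer_heap_size was only supported by the removed "
            "Tantivy-based FTS implementation."
        )
    if not isinstance(field_names, str):
        raise ValueError(
            "Native FTS indexes can only be created on a single field at a time."
        )
    if "." in field_names:
        raise ValueError(
            "Native FTS indexes can only be created on top-level fields. "
            f"Received nested field path: {field_names!r}."
        )
    return field_names


def error_for(*args, **kwargs) -> Optional[str]:
    """Return the ValueError message for the given arguments, or None."""
    try:
        validate_fts_args(*args, **kwargs)
    except ValueError as exc:
        return str(exc)
    return None


accepted = validate_fts_args("text")
rejected = error_for("text", use_tantivy=True)
```

Raising on any non-default legacy value, rather than ignoring it, means existing calls that still pass `use_tantivy=True` fail loudly instead of silently building a different kind of index.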


@@ -180,7 +180,7 @@ def test_fts_fuzzy_query():
),
mode="overwrite",
)
table.create_fts_index("text", use_tantivy=False, replace=True)
table.create_fts_index("text", replace=True)
results = table.search(MatchQuery("foo", "text", fuzziness=1)).to_pandas()
assert len(results) == 4
@@ -230,7 +230,7 @@ def test_fts_boost_query():
),
mode="overwrite",
)
table.create_fts_index("desc", use_tantivy=False, replace=True)
table.create_fts_index("desc", replace=True)
results = table.search(
BoostQuery(
@@ -265,7 +265,7 @@ def test_fts_boolean_query(tmp_path):
],
mode="overwrite",
)
table.create_fts_index("text", use_tantivy=False, replace=True)
table.create_fts_index("text", replace=True)
# SHOULD
results = table.search(
@@ -319,9 +319,7 @@ def test_fts_native():
],
)
# passing `use_tantivy=False` to use lance FTS index
# `use_tantivy=True` by default
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
table.search("puppy").limit(10).select(["text"]).to_list()
# [{'text': 'Frodo was a happy puppy', '_score': 0.6931471824645996}]
# ...
@@ -332,7 +330,6 @@ def test_fts_native():
# --8<-- [start:fts_config_folding]
table.create_fts_index(
"text",
use_tantivy=False,
language="French",
stem=True,
ascii_folding=True,
@@ -346,7 +343,7 @@ def test_fts_native():
table.search("puppy").limit(10).where("text='foo'", prefilter=False).to_list()
# --8<-- [end:fts_postfiltering]
# --8<-- [start:fts_with_position]
table.create_fts_index("text", use_tantivy=False, with_position=True, replace=True)
table.create_fts_index("text", with_position=True, replace=True)
# --8<-- [end:fts_with_position]
# --8<-- [start:fts_incremental_index]
table.add([{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"}])


@@ -15,8 +15,7 @@ import pytest
from lancedb.pydantic import LanceModel, Vector
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_basic(tmp_path, use_tantivy):
def test_basic(tmp_path):
db = lancedb.connect(tmp_path)
assert db.uri == str(tmp_path)
@@ -49,7 +48,7 @@ def test_basic(tmp_path, use_tantivy):
assert len(rs) == 1
assert rs["item"].iloc[0] == "foo"
table.create_fts_index("item", use_tantivy=use_tantivy)
table.create_fts_index("item")
rs = table.search("bar", query_type="fts").to_pandas()
assert len(rs) == 1
assert rs["item"].iloc[0] == "bar"


@@ -36,9 +36,6 @@ import pytest
import pytest_asyncio
from utils import exception_output
pytest.importorskip("lancedb.fts")
tantivy = pytest.importorskip("tantivy")
@pytest.fixture
def table(tmp_path) -> ldb.table.LanceTable:
@@ -144,58 +141,53 @@ async def async_table(tmp_path) -> ldb.table.AsyncTable:
return table
def test_create_index(tmp_path):
index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
assert isinstance(index, tantivy.Index)
assert os.path.exists(str(tmp_path / "index"))
@pytest.mark.parametrize(
("kwargs", "match"),
[
(
{"use_tantivy": True},
"Tantivy-based FTS has been removed",
),
(
{"ordering_field_names": ["count"]},
"ordering_field_names was only supported",
),
(
{"writer_heap_size": 128},
"writer_heap_size was only supported",
),
],
)
def test_reject_removed_tantivy_parameters(table, kwargs, match):
with pytest.raises(ValueError, match=match):
table.create_fts_index("text", **kwargs)
def test_create_index_with_stemming(tmp_path, table):
index = ldb.fts.create_index(
str(tmp_path / "index"), ["text"], tokenizer_name="en_stem"
)
assert isinstance(index, tantivy.Index)
assert os.path.exists(str(tmp_path / "index"))
def test_reject_legacy_tantivy_index(table):
path, _, _ = table._get_fts_index_path()
os.makedirs(path, exist_ok=True)
# Check stemming by running tokenizer on non empty table
table.create_fts_index("text", tokenizer_name="en_stem", use_tantivy=True)
with pytest.raises(ValueError, match="Legacy Tantivy FTS index detected"):
table.search("puppy").limit(5).to_list()
with pytest.raises(ValueError, match="Legacy Tantivy FTS index detected"):
table.create_fts_index("text")
@pytest.mark.parametrize("use_tantivy", [True, False])
@pytest.mark.parametrize("with_position", [True, False])
def test_create_inverted_index(table, use_tantivy, with_position):
if use_tantivy and not with_position:
pytest.skip("we don't support building a tantivy index without position")
def test_create_inverted_index(table, with_position):
table.create_fts_index(
"text",
use_tantivy=use_tantivy,
with_position=with_position,
name="custom_fts_index",
)
if not use_tantivy:
indices = table.list_indices()
fts_indices = [i for i in indices if i.index_type == "FTS"]
assert any(i.name == "custom_fts_index" for i in fts_indices)
indices = table.list_indices()
fts_indices = [i for i in indices if i.index_type == "FTS"]
assert any(i.name == "custom_fts_index" for i in fts_indices)
def test_populate_index(tmp_path, table):
index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
assert ldb.fts.populate_index(index, table, ["text"]) == len(table)
def test_search_index(tmp_path, table):
index = ldb.fts.create_index(str(tmp_path / "index"), ["text"])
ldb.fts.populate_index(index, table, ["text"])
index.reload()
results = ldb.fts.search_index(index, query="puppy", limit=5)
assert len(results) == 2
assert len(results[0]) == 5 # row_ids
assert len(results[1]) == 5 # _score
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_search_fts(table, use_tantivy):
table.create_fts_index("text", use_tantivy=use_tantivy)
def test_search_fts(table):
table.create_fts_index("text")
results = table.search("puppy").select(["id", "text"]).limit(5).to_list()
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
@@ -204,53 +196,52 @@ def test_search_fts(table, use_tantivy):
results = table.search("puppy").select(["id", "text"]).to_list()
assert len(results) == 10
if not use_tantivy:
# Test with a query
results = (
table.search(MatchQuery("puppy", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
# Test with a query
results = (
table.search(MatchQuery("puppy", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
# Test boost query
results = (
table.search(
BoostQuery(
MatchQuery("puppy", "text"),
MatchQuery("runs", "text"),
)
# Test boost query
results = (
table.search(
BoostQuery(
MatchQuery("puppy", "text"),
MatchQuery("runs", "text"),
)
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
# Test multi match query
table.create_fts_index("text2", use_tantivy=use_tantivy)
results = (
table.search(MultiMatchQuery("puppy", ["text", "text2"]))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
# Test multi match query
table.create_fts_index("text2")
results = (
table.search(MultiMatchQuery("puppy", ["text", "text2"]))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
# Test boolean query
results = (
table.search(MatchQuery("puppy", "text") & MatchQuery("runs", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
for r in results:
assert "puppy" in r["text"]
assert "runs" in r["text"]
# Test boolean query
results = (
table.search(MatchQuery("puppy", "text") & MatchQuery("runs", "text"))
.select(["id", "text"])
.limit(5)
.to_list()
)
assert len(results) == 5
assert len(results[0]) == 3 # id, text, _score
for r in results:
assert "puppy" in r["text"]
assert "runs" in r["text"]
@pytest.mark.asyncio
@@ -318,13 +309,13 @@ async def test_fts_select_async(async_table):
def test_search_fts_phrase_query(table):
table.create_fts_index("text", use_tantivy=False, with_position=False)
table.create_fts_index("text", with_position=False)
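# A phrase query without stored positions must raise; `assert False` inside a
# try/except Exception block would be swallowed and could never fail the test.
with pytest.raises(Exception):
table.search('"puppy runs"').limit(100).to_list()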
table.create_fts_index("text", use_tantivy=False, with_position=True, replace=True)
table.create_fts_index("text", with_position=True, replace=True)
results = table.search("puppy").limit(100).to_list()
# Test with quotation marks
@@ -375,8 +366,8 @@ async def test_search_fts_phrase_query_async(async_table):
def test_search_fts_specify_column(table):
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text2", use_tantivy=False)
table.create_fts_index("text")
table.create_fts_index("text2")
results = table.search("puppy", fts_columns="text").limit(5).to_list()
assert len(results) == 5
@@ -470,42 +461,8 @@ async def test_search_fts_specify_column_async(async_table):
pass
def test_search_ordering_field_index_table(tmp_path, table):
table.create_fts_index("text", ordering_field_names=["count"], use_tantivy=True)
rows = (
table.search("puppy", ordering_field_name="count")
.limit(20)
.select(["text", "count"])
.to_list()
)
for r in rows:
assert "puppy" in r["text"]
assert sorted(rows, key=lambda x: x["count"], reverse=True) == rows
def test_search_ordering_field_index(tmp_path, table):
index = ldb.fts.create_index(
str(tmp_path / "index"), ["text"], ordering_fields=["count"]
)
ldb.fts.populate_index(index, table, ["text"], ordering_fields=["count"])
index.reload()
results = ldb.fts.search_index(
index, query="puppy", limit=5, ordering_field="count"
)
assert len(results) == 2
assert len(results[0]) == 5 # row_ids
assert len(results[1]) == 5 # _distance
rows = table.to_lance().take(results[0]).to_pylist()
for r in rows:
assert "puppy" in r["text"]
assert sorted(rows, key=lambda x: x["count"], reverse=True) == rows
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_create_index_from_table(tmp_path, table, use_tantivy):
table.create_fts_index("text", use_tantivy=use_tantivy)
def test_create_index_from_table(tmp_path, table):
table.create_fts_index("text")
df = table.search("puppy").limit(5).select(["text"]).to_pandas()
assert len(df) <= 5
assert "text" in df.columns
@@ -525,36 +482,24 @@ def test_create_index_from_table(tmp_path, table, use_tantivy):
)
with pytest.raises(Exception, match="already exists"):
table.create_fts_index("text", use_tantivy=use_tantivy)
table.create_fts_index("text")
table.create_fts_index("text", replace=True, use_tantivy=use_tantivy)
table.create_fts_index("text", replace=True)
assert len(table.search("gorilla").limit(1).to_pandas()) == 1
def test_create_index_multiple_columns(tmp_path, table):
table.create_fts_index(["text", "text2"], use_tantivy=True)
df = table.search("puppy").limit(5).to_pandas()
assert len(df) == 5
assert "text" in df.columns
assert "text2" in df.columns
def test_empty_rs(tmp_path, table, mocker):
table.create_fts_index(["text", "text2"], use_tantivy=True)
mocker.patch("lancedb.fts.search_index", return_value=([], []))
df = table.search("puppy").limit(5).to_pandas()
assert len(df) == 0
with pytest.raises(ValueError, match="Native FTS indexes can only be created"):
table.create_fts_index(["text", "text2"])
def test_nested_schema(tmp_path, table):
table.create_fts_index("nested.text", use_tantivy=True)
rs = table.search("puppy").limit(5).to_list()
assert len(rs) == 5
with pytest.raises(ValueError, match="top-level fields"):
table.create_fts_index("nested.text")
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_search_index_with_filter(table, use_tantivy):
table.create_fts_index("text", use_tantivy=use_tantivy)
def test_search_index_with_filter(table):
table.create_fts_index("text")
orig_import = __import__
def import_mock(name, *args):
@@ -584,8 +529,7 @@ def test_search_index_with_filter(table, use_tantivy):
assert r["_rowid"] is not None
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_null_input(table, use_tantivy):
def test_null_input(table):
table.add(
[
{
@@ -598,14 +542,13 @@ def test_null_input(table, use_tantivy):
}
]
)
table.create_fts_index("text", use_tantivy=use_tantivy)
table.create_fts_index("text")
def test_syntax(table):
# https://github.com/lancedb/lancedb/issues/769
table.create_fts_index("text", use_tantivy=True)
with pytest.raises(ValueError, match="Syntax Error"):
table.search("they could have been dogs OR").limit(10).to_list()
table.create_fts_index("text")
table.search("they could have been dogs OR").limit(10).to_list()
# these should work
@@ -616,6 +559,7 @@ def test_syntax(table):
).to_list()
# phrase queries
table.create_fts_index("text", with_position=True, replace=True)
table.search("they could have been dogs OR cats").phrase_query().limit(10).to_list()
table.search('"they could have been dogs OR cats"').limit(10).to_list()
table.search('''"the cats OR dogs were not really 'pets' at all"''').limit(
@@ -639,7 +583,7 @@ def test_language(mem_db: DBConnection):
table = mem_db.create_table("test", data=data)
with pytest.raises(ValueError) as e:
table.create_fts_index("text", use_tantivy=False, language="klingon")
table.create_fts_index("text", language="klingon")
assert exception_output(e) == (
"ValueError: LanceDB does not support the requested language: 'klingon'\n"
@@ -650,7 +594,6 @@ def test_language(mem_db: DBConnection):
table.create_fts_index(
"text",
use_tantivy=False,
language="French",
stem=True,
ascii_folding=True,
@@ -690,7 +633,7 @@ def test_fts_on_list(mem_db: DBConnection):
}
)
table = mem_db.create_table("test", data=data)
table.create_fts_index("text", use_tantivy=False, with_position=True)
table.create_fts_index("text", with_position=True)
res = table.search("lance").limit(5).to_list()
assert len(res) == 3
@@ -702,7 +645,7 @@ def test_fts_on_list(mem_db: DBConnection):
def test_fts_ngram(mem_db: DBConnection):
data = pa.table({"text": ["hello world", "lance database", "lance is cool"]})
table = mem_db.create_table("test", data=data)
table.create_fts_index("text", use_tantivy=False, base_tokenizer="ngram")
table.create_fts_index("text", base_tokenizer="ngram")
results = table.search("lan", query_type="fts").limit(10).to_list()
assert len(results) == 2
@@ -721,7 +664,6 @@ def test_fts_ngram(mem_db: DBConnection):
# test setting ngram_min_length and prefix_only
table.create_fts_index(
"text",
use_tantivy=False,
base_tokenizer="ngram",
replace=True,
ngram_min_length=2,
@@ -886,7 +828,7 @@ def test_fts_query_to_json():
def test_fts_fast_search(table):
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
# Insert some unindexed data
table.add(


@@ -28,7 +28,7 @@ def sync_table(tmpdir_factory) -> Table:
}
)
table = db.create_table("test", data)
table.create_fts_index("text", with_position=False, use_tantivy=False)
table.create_fts_index("text", with_position=False)
return table
@@ -192,7 +192,7 @@ def table_with_id(tmpdir_factory) -> Table:
}
)
table = db.create_table("test_with_id", data)
table.create_fts_index("text", with_position=False, use_tantivy=False)
table.create_fts_index("text", with_position=False)
return table


@@ -1095,3 +1095,23 @@ def test_getitems_invalid_offset(some_permutation: Permutation):
"""Test __getitems__ with an out-of-range offset raises an error."""
with pytest.raises(Exception):
some_permutation.__getitems__([999999])
def test_fetch_matches_getitems(some_permutation: Permutation):
"""Public fetch() should be equivalent to __getitems__."""
indices = [0, 1, 2, 10, 100]
assert some_permutation.fetch(indices) == some_permutation.__getitems__(indices)
def test_fetch_respects_format(some_permutation: Permutation):
"""fetch() applies the configured format/transform."""
arrow_perm = some_permutation.with_format("arrow")
result = arrow_perm.fetch([0, 1, 2])
assert isinstance(result, pa.RecordBatch)
assert result.num_rows == 3
def test_fetch_invalid_offset(some_permutation: Permutation):
"""fetch() with an out-of-range offset raises an error."""
with pytest.raises(Exception):
some_permutation.fetch([999999])


@@ -1385,7 +1385,7 @@ def test_query_timeout(tmp_path):
}
)
table = db.create_table("test", data)
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
with pytest.raises(Exception, match="Query timeout"):
table.search().where("text = 'a'").to_list(timeout=timedelta(0))


@@ -26,11 +26,8 @@ from lancedb.rerankers import (
)
from lancedb.table import LanceTable
# Tests rely on FTS index
pytest.importorskip("lancedb.fts")
def get_test_table(tmp_path, use_tantivy):
def get_test_table(tmp_path):
db = lancedb.connect(tmp_path)
# Create a LanceDB table schema with a vector and a text column
emb = EmbeddingFunctionRegistry.get_instance().get("test").create()
@@ -98,7 +95,7 @@ def get_test_table(tmp_path, use_tantivy):
)
# Create a fts index
table.create_fts_index("text", use_tantivy=use_tantivy, replace=True)
table.create_fts_index("text", replace=True)
return table, MyTable
@@ -208,8 +205,8 @@ def _run_test_reranker(reranker, table, query, query_vector, schema):
assert len(result) == 20 and result == result_arrow
def _run_test_hybrid_reranker(reranker, tmp_path, use_tantivy):
table, schema = get_test_table(tmp_path, use_tantivy)
def _run_test_hybrid_reranker(reranker, tmp_path):
table, schema = get_test_table(tmp_path)
# The default reranker
result1 = (
table.search(
@@ -285,8 +282,7 @@ def _run_test_hybrid_reranker(reranker, tmp_path, use_tantivy):
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_linear_combination(tmp_path, use_tantivy):
def test_linear_combination(tmp_path):
reranker = LinearCombinationReranker()
vector_results = pa.Table.from_pydict(
@@ -313,22 +309,20 @@ def test_linear_combination(tmp_path, use_tantivy):
assert "_score" not in combined_results.column_names
assert "_relevance_score" in combined_results.column_names
_run_test_hybrid_reranker(reranker, tmp_path, use_tantivy)
_run_test_hybrid_reranker(reranker, tmp_path)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_rrf_reranker(tmp_path, use_tantivy):
def test_rrf_reranker(tmp_path):
reranker = RRFReranker()
_run_test_hybrid_reranker(reranker, tmp_path, use_tantivy)
_run_test_hybrid_reranker(reranker, tmp_path)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_mrr_reranker(tmp_path, use_tantivy):
def test_mrr_reranker(tmp_path):
reranker = MRRReranker()
_run_test_hybrid_reranker(reranker, tmp_path, use_tantivy)
_run_test_hybrid_reranker(reranker, tmp_path)
# Test multi-vector part
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
query = "single player experience"
rs1 = table.search(query, vector_column_name="vector").limit(10).with_row_id(True)
rs2 = (
@@ -363,7 +357,7 @@ def test_rrf_reranker_distance():
table = db.create_table("test", data)
table.create_index(num_partitions=1, num_sub_vectors=2)
table.create_fts_index("text", use_tantivy=False)
table.create_fts_index("text")
reranker = RRFReranker(return_score="all")
@@ -422,35 +416,31 @@ def test_rrf_reranker_distance():
@pytest.mark.skipif(
os.environ.get("COHERE_API_KEY") is None, reason="COHERE_API_KEY not set"
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cohere_reranker(tmp_path, use_tantivy):
def test_cohere_reranker(tmp_path):
pytest.importorskip("cohere")
reranker = CohereReranker()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cross_encoder_reranker(tmp_path, use_tantivy):
def test_cross_encoder_reranker(tmp_path):
pytest.importorskip("sentence_transformers")
reranker = CrossEncoderReranker()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_colbert_reranker(tmp_path, use_tantivy):
def test_colbert_reranker(tmp_path):
pytest.importorskip("rerankers")
reranker = ColbertReranker()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_answerdotai_reranker(tmp_path, use_tantivy):
def test_answerdotai_reranker(tmp_path):
pytest.importorskip("rerankers")
reranker = AnswerdotaiRerankers()
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -459,10 +449,9 @@ def test_answerdotai_reranker(tmp_path, use_tantivy):
or os.environ.get("OPENAI_BASE_URL") is not None,
reason="OPENAI_API_KEY not set",
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_openai_reranker(tmp_path, use_tantivy):
def test_openai_reranker(tmp_path):
pytest.importorskip("openai")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
reranker = OpenaiReranker()
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -470,10 +459,9 @@ def test_openai_reranker(tmp_path, use_tantivy):
@pytest.mark.skipif(
os.environ.get("JINA_API_KEY") is None, reason="JINA_API_KEY not set"
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_jina_reranker(tmp_path, use_tantivy):
def test_jina_reranker(tmp_path):
pytest.importorskip("jina")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
reranker = JinaReranker()
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -481,11 +469,10 @@ def test_jina_reranker(tmp_path, use_tantivy):
@pytest.mark.skipif(
os.environ.get("VOYAGE_API_KEY") is None, reason="VOYAGE_API_KEY not set"
)
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_voyageai_reranker(tmp_path, use_tantivy):
def test_voyageai_reranker(tmp_path):
pytest.importorskip("voyageai")
reranker = VoyageAIReranker(model_name="rerank-2.5")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
_run_test_reranker(reranker, table, "single player experience", None, schema)
@@ -504,7 +491,7 @@ def test_empty_result_reranker():
# Create empty table with schema
empty_table = db.create_table("empty_table", schema=schema, mode="overwrite")
empty_table.create_fts_index("text", use_tantivy=False, replace=True)
empty_table.create_fts_index("text", replace=True)
for reranker in [
CrossEncoderReranker(),
# ColbertReranker(),
@@ -603,11 +590,10 @@ def test_empty_hybrid_result_reranker():
assert "_rowid" in result.column_names
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cross_encoder_reranker_return_all(tmp_path, use_tantivy):
def test_cross_encoder_reranker_return_all(tmp_path):
pytest.importorskip("sentence_transformers")
reranker = CrossEncoderReranker(return_score="all")
table, schema = get_test_table(tmp_path, use_tantivy)
table, schema = get_test_table(tmp_path)
query = "single player experience"
result = (
table.search(query, query_type="hybrid", vector_column_name="vector")

@@ -242,8 +242,8 @@ def test_s3_dynamodb_sync(s3_bucket: str, commit_table: str, monkeypatch):
# FTS indices should error since they are not supported yet.
with pytest.raises(
NotImplementedError,
match="Full-text search is only supported on the local filesystem",
ValueError,
match="Tantivy-based FTS has been removed",
):
table.create_fts_index("x", use_tantivy=True)

@@ -1948,7 +1948,6 @@ def setup_hybrid_search_table(db: DBConnection, embedding_func):
def test_hybrid_search(tmp_db: DBConnection):
# This test uses an FTS index
pytest.importorskip("lancedb.fts")
pytest.importorskip("lance")
table, MyTable, emb = setup_hybrid_search_table(tmp_db, "test")
@@ -2019,7 +2018,6 @@ def test_hybrid_search(tmp_db: DBConnection):
def test_hybrid_search_metric_type(tmp_db: DBConnection):
# This test uses an FTS index
pytest.importorskip("lancedb.fts")
pytest.importorskip("lance")
# Need to use nonnorm as the embedding function so l2 and dot results

@@ -0,0 +1,140 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import pickle
import pyarrow as pa
import pytest
from lancedb import connect
from lancedb.permutation import permutation_builder
torch = pytest.importorskip("torch")
from lancedb.integrations.torch import ( # noqa: E402
LanceIterableTorchDataset,
LanceTorchDataset,
)
@pytest.fixture
def db_path(tmp_path):
"""LanceTorchDataset needs a real, on-disk DB so workers can re-open it."""
return tmp_path
def _make_table(db_path, name="imgs", n=20):
db = connect(db_path)
return db.create_table(
name,
pa.table({"x": [float(i) for i in range(n)], "y": list(range(n))}),
)
def test_basic_len_and_getitem(db_path):
tbl = _make_table(db_path)
ds = LanceTorchDataset(tbl)
assert len(ds) == 20
row = ds[0]
# Default ("python") format = list of dicts; __getitem__ wraps a single index.
assert isinstance(row, list)
assert row[0] == {"x": 0.0, "y": 0}
def test_getitems_uses_fetch(db_path):
tbl = _make_table(db_path)
ds = LanceTorchDataset(tbl)
rows = ds.__getitems__([0, 2, 4])
assert rows == [
{"x": 0.0, "y": 0},
{"x": 2.0, "y": 2},
{"x": 4.0, "y": 4},
]
def test_dataloader_default_collate(db_path):
tbl = _make_table(db_path, n=40)
ds = LanceTorchDataset(tbl)
loader = torch.utils.data.DataLoader(ds, batch_size=8, shuffle=False)
batch = next(iter(loader))
# default collate stacks list-of-dicts into dict-of-tensors
assert isinstance(batch, dict)
assert batch["x"].size() == (8,)
assert batch["y"].size() == (8,)
def test_picklable(db_path):
tbl = _make_table(db_path)
ds = LanceTorchDataset(tbl, columns=["x"])
# Force open then ensure pickle drops the Rust handle.
_ = len(ds)
blob = pickle.dumps(ds)
restored: LanceTorchDataset = pickle.loads(blob)
# Rust state should not survive pickling.
assert restored._perm is None
# …but the dataset must work after re-opening transparently.
assert len(restored) == 20
assert restored[0] == [{"x": 0.0}]
def test_dataloader_with_workers(db_path):
tbl = _make_table(db_path, n=32)
ds = LanceTorchDataset(tbl)
loader = torch.utils.data.DataLoader(
ds, batch_size=4, num_workers=2, shuffle=False
)
batches = list(loader)
seen = []
for b in batches:
seen.extend(b["x"].tolist())
assert sorted(seen) == [float(i) for i in range(32)]
def test_with_permutation_table(db_path):
tbl = _make_table(db_path, n=30)
db = connect(db_path)
perm_tbl = (
permutation_builder(tbl)
.split_random(ratios=[0.5, 0.5], seed=1, split_names=["train", "test"])
.persist(db, "imgs_perm")
.execute()
)
ds = LanceTorchDataset(tbl, permutation_table=perm_tbl, split="train")
# Should pickle/restore the permutation table reference too.
blob = pickle.dumps(ds)
restored = pickle.loads(blob)
assert len(restored) == 15
def test_format_passthrough_dataloader(db_path):
"""Custom `format` is forwarded to the underlying Permutation."""
tbl = _make_table(db_path, n=20)
ds = LanceTorchDataset(tbl, format="arrow")
# Arrow batches don't go through default_collate, so use a no-op collate.
loader = torch.utils.data.DataLoader(
ds, batch_size=5, shuffle=False, collate_fn=lambda x: x
)
batch = next(iter(loader))
assert isinstance(batch, pa.RecordBatch)
assert batch.num_rows == 5
def test_iterable_dataset(db_path):
tbl = _make_table(db_path, n=20)
ds = LanceIterableTorchDataset(tbl, batch_size=5)
batches = list(ds)
# batch_size=5 over 20 rows (with skip_last_batch=True) yields full-size batches only
assert len(batches) == 4
assert all(len(b) == 5 for b in batches)
def test_uri_table_name_constructor(db_path):
_make_table(db_path)
ds = LanceTorchDataset(uri=str(db_path), table_name="imgs")
assert len(ds) == 20
assert ds[0] == [{"x": 0.0, "y": 0}]
def test_constructor_validates_args():
with pytest.raises(ValueError, match="table"):
LanceTorchDataset()
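
The pickling tests above rely on the wrapper keeping only re-open information in its pickled state. A minimal, self-contained sketch of that pattern follows. It does not use LanceDB: `FakeHandle` and `LazyDataset` are hypothetical stand-ins, and the real `LanceTorchDataset` internals may differ.

```python
import pickle


class FakeHandle:
    """Hypothetical stand-in for a native, non-picklable table handle."""

    def __init__(self, uri, table_name):
        self.rows = [{"x": float(i), "y": i} for i in range(20)]

    def __reduce__(self):
        raise TypeError("native handles cannot be pickled")


class LazyDataset:
    """Keeps only what is needed to re-open the table, never the handle."""

    def __init__(self, uri, table_name):
        self.uri = uri
        self.table_name = table_name
        self._handle = None  # opened lazily, once per process

    def _ensure_open(self):
        if self._handle is None:
            self._handle = FakeHandle(self.uri, self.table_name)
        return self._handle

    def __getstate__(self):
        # Drop the handle so worker processes receive plain data only.
        state = self.__dict__.copy()
        state["_handle"] = None
        return state

    def __len__(self):
        return len(self._ensure_open().rows)

    def __getitem__(self, idx):
        return self._ensure_open().rows[idx]

    def __getitems__(self, indices):
        # Batched access: one call serves the whole list of indices.
        rows = self._ensure_open().rows
        return [rows[i] for i in indices]


def roundtrip(ds):
    """Pickle and restore, as a DataLoader worker start-up would."""
    return pickle.loads(pickle.dumps(ds))
```

Calling `len(ds)` before `roundtrip(ds)` forces the handle open, which mirrors what `test_picklable` does before asserting that the restored object starts without one.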

@@ -525,7 +525,7 @@ impl Connection {
}
#[pyfunction]
#[pyo3(signature = (uri, api_key=None, region=None, host_override=None, read_consistency_interval=None, client_config=None, storage_options=None, session=None))]
#[pyo3(signature = (uri, api_key=None, region=None, host_override=None, read_consistency_interval=None, client_config=None, storage_options=None, session=None, manifest_enabled=false, namespace_client_properties=None))]
#[allow(clippy::too_many_arguments)]
pub fn connect(
py: Python<'_>,
@@ -537,6 +537,8 @@ pub fn connect(
client_config: Option<PyClientConfig>,
storage_options: Option<HashMap<String, String>>,
session: Option<crate::session::Session>,
manifest_enabled: bool,
namespace_client_properties: Option<HashMap<String, String>>,
) -> PyResult<Bound<'_, PyAny>> {
future_into_py(py, async move {
let mut builder = lancedb::connect(&uri);
@@ -556,6 +558,12 @@ pub fn connect(
if let Some(storage_options) = storage_options {
builder = builder.storage_options(storage_options);
}
if manifest_enabled {
builder = builder.manifest_enabled(true);
}
if let Some(namespace_client_properties) = namespace_client_properties {
builder = builder.namespace_client_properties(namespace_client_properties);
}
#[cfg(feature = "remote")]
if let Some(client_config) = client_config {
builder = builder.client_config(client_config.into());

python/uv.lock generated

@@ -1996,7 +1996,6 @@ tests = [
{ name = "pytest-mock" },
{ name = "pytz" },
{ name = "requests" },
{ name = "tantivy" },
]
[package.metadata]
@@ -2050,7 +2049,6 @@ requires-dist = [
{ name = "sentence-transformers", marker = "extra == 'embeddings'", specifier = ">=2.2.0" },
{ name = "sentencepiece", marker = "extra == 'embeddings'", specifier = ">=0.1.99" },
{ name = "sentencepiece", marker = "extra == 'siglip'" },
{ name = "tantivy", marker = "extra == 'tests'", specifier = ">=0.20.0" },
{ name = "torch", marker = "extra == 'clip'" },
{ name = "torch", marker = "extra == 'embeddings'", specifier = ">=2.0.0" },
{ name = "torch", marker = "extra == 'siglip'" },
@@ -4779,44 +4777,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/40/44/4a5f08c96eb108af5cb50b41f76142f0afa346dfa99d5296fe7202a11854/tabulate-0.9.0-py3-none-any.whl", hash = "sha256:024ca478df22e9340661486f85298cff5f6dcdba14f3813e8830015b9ed1948f", size = 35252, upload-time = "2022-10-06T17:21:44.262Z" },
]
[[package]]
name = "tantivy"
version = "0.25.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1b/f9/0cd3955d155d3e3ef74b864769514dd191e5dacba9f0beb7af2d914942ce/tantivy-0.25.1.tar.gz", hash = "sha256:68a3314699a7d18fcf338b52bae8ce46a97dde1128a3e47e33fa4db7f71f265e", size = 75120, upload-time = "2025-12-02T11:57:12.997Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/80/f7/2276bed3bed983ce2970dc70e3571f372587fe4f5f2bac1d6d617df08fa3/tantivy-0.25.1-cp310-cp310-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:7aa587a3dc9470584cacf5e3640fee93d12ec5f10109669c1f47c4e90820b958", size = 7638510, upload-time = "2025-12-02T11:56:08.754Z" },
{ url = "https://files.pythonhosted.org/packages/20/8c/078dc50570e243414356b05633f52fe544b85179281ffa9f1fe05d76bbd8/tantivy-0.25.1-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:56d77fe667595693d9fa5f0b4545776d84da9526bab0273b3fc6c7536dc0d8a2", size = 3932659, upload-time = "2025-12-02T11:56:10.621Z" },
{ url = "https://files.pythonhosted.org/packages/bd/dc/281c48436a1e3178b58fe463af314434fe0f3a4ec0c7588a362900e0c69e/tantivy-0.25.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5ba8c347cd48595fcaeabb28a909ebce92cf9c5e5c84ab5ba1136a280a307b5c", size = 4197430, upload-time = "2025-12-02T11:56:12.65Z" },
{ url = "https://files.pythonhosted.org/packages/7b/6c/61e6e0b0a350007d10a9b66a35703361d3345e14e7a7cc83494776b2a054/tantivy-0.25.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:aa7c4932e8fde1f09f2d46226060e827e197c2749abdc6129d73a752773adc38", size = 4184055, upload-time = "2025-12-02T11:56:14.647Z" },
{ url = "https://files.pythonhosted.org/packages/5f/fd/0eb059b12f0b6f91623a54a46448a83b7f716d08f3bca68c095d697b85da/tantivy-0.25.1-cp310-cp310-win_amd64.whl", hash = "sha256:afcfc5dbb0bcd5d24531f4471737ae0896f33528426ab0b1dad3e427c19120f6", size = 3424134, upload-time = "2025-12-02T11:56:16.242Z" },
{ url = "https://files.pythonhosted.org/packages/4e/7a/8a277f377e8a151fc0e71d4ffc1114aefb6e5e1c7dd609fed0955cf34ed8/tantivy-0.25.1-cp311-cp311-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:d363d7b4207d3a5aa7f0d212420df35bed18bdb6bae26a2a8bd57428388b7c29", size = 7637033, upload-time = "2025-12-02T11:56:18.104Z" },
{ url = "https://files.pythonhosted.org/packages/71/31/8b4acdedfc9f9a2d04b1340d07eef5213d6f151d1e18da0cb423e5f090d2/tantivy-0.25.1-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:8f4389cf1d889a1df7c5a3195806b4b56c37cee10d8a26faaa0dea35a867b5ff", size = 3932180, upload-time = "2025-12-02T11:56:19.833Z" },
{ url = "https://files.pythonhosted.org/packages/2f/dc/3e8499c21b4b9795e8f2fc54c68ce5b92905aaeadadaa56ecfa9180b11b1/tantivy-0.25.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:99864c09fc54652c3c2486cdf13f86cdc8200f4b481569cb291e095ca5d496e5", size = 4197620, upload-time = "2025-12-02T11:56:21.496Z" },
{ url = "https://files.pythonhosted.org/packages/f8/8e/f2ce62fffc811eb62bead92c7b23c2e218f817cbd54c4f3b802e03ba1438/tantivy-0.25.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:05abf37ddbc5063c575548be0d62931629c086bff7a5a1b67cf5a8f5ebf4cd8c", size = 4183794, upload-time = "2025-12-02T11:56:23.215Z" },
{ url = "https://files.pythonhosted.org/packages/de/64/24e2891b0ba3fd9853e10c296095a33b89bf3efd65e29da1ee5dae736040/tantivy-0.25.1-cp311-cp311-win_amd64.whl", hash = "sha256:f307ee8ad21597b0be23af83008fd66cfd5f958cdfa24ec0aaa08a38e86bbef4", size = 3424235, upload-time = "2025-12-02T11:56:25.172Z" },
{ url = "https://files.pythonhosted.org/packages/41/e7/6849c713ed0996c7628324c60512c4882006f0a62145e56c624a93407f90/tantivy-0.25.1-cp312-cp312-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:90fd919e5f611809f746560ecf36eb9be824dec62e21ae17a27243759edb9aa1", size = 7621494, upload-time = "2025-12-02T11:56:27.069Z" },
{ url = "https://files.pythonhosted.org/packages/c5/22/c3d8294600dc6e7fa350daef9ff337d3c06e132b81df727de9f7a50c692a/tantivy-0.25.1-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:4613c7cf6c23f3a97989819690a0f956d799354957de7a204abcc60083cebe02", size = 3925219, upload-time = "2025-12-02T11:56:29.403Z" },
{ url = "https://files.pythonhosted.org/packages/41/fc/cbb1df71dd44c9110eff4eaaeda9d44f2d06182fe0452193be20ddfba93f/tantivy-0.25.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c477bd20b4df804d57dfc5033431bef27cde605695ae141b03abbf6ebc069129", size = 4198699, upload-time = "2025-12-02T11:56:31.359Z" },
{ url = "https://files.pythonhosted.org/packages/47/4d/71abb78b774073c3ce12a4faa4351a9d910a71ffa3659526affba163873d/tantivy-0.25.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f9b1a1ba1113c523c7ff7b10f282d6c4074006f7ef8d71e1d973d51bf7291ddb", size = 4183585, upload-time = "2025-12-02T11:56:33.317Z" },
{ url = "https://files.pythonhosted.org/packages/be/16/3f00cd7ec458b92a0e977960af9ddfbeb762127d9acc68da9094a1fda556/tantivy-0.25.1-cp312-cp312-win_amd64.whl", hash = "sha256:9de0bafd3bd7ac9f8f82d53e17562e9db11a5af308fe5185c4bd86feaddbe4a6", size = 3424622, upload-time = "2025-12-02T11:56:34.788Z" },
{ url = "https://files.pythonhosted.org/packages/3d/25/73cfbcf1a8ea49be6c42817431cac46b70a119fe64da903fcc2d92b5b511/tantivy-0.25.1-cp313-cp313-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:f51ff7196c6f31719202080ed8372d5e3d51e92c749c032fb8234f012e99744c", size = 7622530, upload-time = "2025-12-02T11:56:36.839Z" },
{ url = "https://files.pythonhosted.org/packages/12/c8/c0d7591cdf4f7e7a9fc4da786d1ca8cd1aacffaa2be16ea6d401a8e4a566/tantivy-0.25.1-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:550e63321bfcacc003859f2fa29c1e8e56450807b3c9a501c1add27cfb9236d9", size = 3925637, upload-time = "2025-12-02T11:56:38.425Z" },
{ url = "https://files.pythonhosted.org/packages/3a/09/bedfc223bffec7641b417dd7ab071134b2ef8f8550e9b1fb6014657ef52e/tantivy-0.25.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:fde31cc8d6e122faf7902aeea32bc008a429a6e8904e34d3468126a3ec01b016", size = 4197322, upload-time = "2025-12-02T11:56:40.411Z" },
{ url = "https://files.pythonhosted.org/packages/f5/f1/1fa5183500c8042200c9f2b840d34f5bbcfb434a1ee750e7132262d2a5c9/tantivy-0.25.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b11bd5a518b0be645320b47af8493f6a40c4f3234313e37adcf4534a564d27dd", size = 4183143, upload-time = "2025-12-02T11:56:42.048Z" },
{ url = "https://files.pythonhosted.org/packages/d5/74/a4c4f4eb95888ccb784da3b017aa0625ab1ac411bf5d022a9a797d9a2334/tantivy-0.25.1-cp313-cp313-win_amd64.whl", hash = "sha256:cc7fe88853e06b3251ee4fa42b7a2038727f850c8765bcc8167cfc73585dd24e", size = 3423491, upload-time = "2025-12-02T11:56:43.858Z" },
{ url = "https://files.pythonhosted.org/packages/8b/2f/581519492226f97d23bd0adc95dad991ebeaa73ea6abc8bff389a3096d9a/tantivy-0.25.1-cp313-cp313t-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:dae99e75b7eaa9bf5bd16ab106b416370f08c135aed0e117d62a3201cd1ffe36", size = 7610316, upload-time = "2025-12-02T11:56:45.927Z" },
{ url = "https://files.pythonhosted.org/packages/91/40/5d7bc315ab9e6a22c5572656e8ada1c836cfa96dccf533377504fbc3c9d9/tantivy-0.25.1-cp313-cp313t-macosx_10_12_x86_64.whl", hash = "sha256:506e9533c5ef4d3df43bad64ffecc0aa97c76e361ea610815dc3a20a9d6b30b3", size = 3919882, upload-time = "2025-12-02T11:56:48.469Z" },
{ url = "https://files.pythonhosted.org/packages/02/b9/e0ef2f57a6a72444cb66c2ffbc310ab33ffaace275f1c4b0319d84ea3f18/tantivy-0.25.1-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5dbd4f8f264dacbcc9dee542832da2173fd53deaaea03f082d95214f8b5ed6bc", size = 4196031, upload-time = "2025-12-02T11:56:50.151Z" },
{ url = "https://files.pythonhosted.org/packages/1e/02/bf3f8cacfd08642e14a73f7956a3fb95d58119132c98c121b9065a1f8615/tantivy-0.25.1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:824c643ccb640dd9e35e00c5d5054ddf3323f56fe4219d57d428a9eeea13d22c", size = 4183437, upload-time = "2025-12-02T11:56:51.818Z" },
{ url = "https://files.pythonhosted.org/packages/9c/83/afa90e570198e2d1139dd567bec3c9cf44d8c54f63a649f16d711ede02f5/tantivy-0.25.1-cp313-cp313t-win_amd64.whl", hash = "sha256:09c987b840afcebac817836ac08407eff17272d8aa60ce6e291f89c81830221d", size = 3419409, upload-time = "2025-12-02T11:56:53.451Z" },
{ url = "https://files.pythonhosted.org/packages/ff/44/9f1d67aa5030f7eebc966c863d1316a510a971dd8bb45651df4acdfae9ed/tantivy-0.25.1-cp314-cp314-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl", hash = "sha256:7f5d29ae85dd0f23df8d15b3e7b341d4f9eb5a446bbb9640df48ac1f6d9e0c6c", size = 7623723, upload-time = "2025-12-02T11:56:55.066Z" },
{ url = "https://files.pythonhosted.org/packages/db/30/6e085bd3ed9d12da3c91c185854abd70f9dfd35fb36a75ea98428d42c30b/tantivy-0.25.1-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:f2d2938fb69a74fc1bb36edfaf7f0d1596fa1264db0f377bda2195c58bcb6245", size = 3926243, upload-time = "2025-12-02T11:56:57.058Z" },
{ url = "https://files.pythonhosted.org/packages/32/f5/a00d65433430f51718e5cc6938df571765d7c4e03aedec5aef4ab567aa9b/tantivy-0.25.1-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4f5ff124c4802558e627091e780b362ca944169736caba5a372eef39a79d0ae0", size = 4207186, upload-time = "2025-12-02T11:56:58.803Z" },
{ url = "https://files.pythonhosted.org/packages/19/63/61bdb12fc95f2a7f77bd419a5149bfa9f28caa76cb569bf2b6b06e1d033e/tantivy-0.25.1-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:43b80ef62a340416139c93d19264e5f808da48e04f9305f1092b8ed22be0a5be", size = 4187312, upload-time = "2025-12-02T11:57:00.595Z" },
{ url = "https://files.pythonhosted.org/packages/b7/de/e39c0b01d59019bf5c38face8b81defbc4a68cebf5e0c53bcb2cd715a449/tantivy-0.25.1-cp314-cp314-win_amd64.whl", hash = "sha256:286b654f40c70c1e6b64b9bc7031ed0bf5c440f5bffeaeeee21a0ee6cc39f0e2", size = 3436535, upload-time = "2025-12-02T11:57:02.267Z" },
]
[[package]]
name = "threadpoolctl"
version = "3.6.0"

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.28.0-beta.8"
version = "0.28.0-beta.10"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -111,7 +111,12 @@ default = []
aws = ["lance/aws", "lance-io/aws", "lance-namespace-impls/dir-aws"]
oss = ["lance/oss", "lance-io/oss", "lance-namespace-impls/dir-oss"]
gcs = ["lance/gcp", "lance-io/gcp", "lance-namespace-impls/dir-gcp"]
azure = ["lance/azure", "lance-io/azure", "lance-namespace-impls/dir-azure"]
azure = [
"lance/azure",
"lance-io/azure",
"lance-namespace-impls/dir-azure",
"lance-namespace-impls/credential-vendor-azure",
]
huggingface = [
"lance/huggingface",
"lance-io/huggingface",

@@ -590,6 +590,15 @@ pub struct ConnectRequest {
/// storage options.
pub namespace_client_properties: HashMap<String, String>,
/// Use directory namespace manifests as the source of truth for native
/// LanceDB table metadata.
///
/// When enabled for a local/native connection, LanceDB returns a
/// namespace-backed database directly. Directory listing fallback remains
/// enabled for migration, and directory-listing-to-manifest migration is
/// forced on.
pub manifest_enabled: bool,
/// The interval at which to check for updates from other processes.
///
/// If None, then consistency is not checked. For performance
@@ -630,6 +639,7 @@ impl ConnectBuilder {
read_consistency_interval: None,
options: HashMap::new(),
namespace_client_properties: HashMap::new(),
manifest_enabled: false,
session: None,
},
embedding_registry: None,
@@ -791,6 +801,17 @@ impl ConnectBuilder {
self
}
/// Enable or disable manifest-backed directory namespace mode for local
/// native connections.
///
/// When enabled, the connection uses the directory namespace database
/// directly for all table operations and forces
/// `dir_listing_to_manifest_migration_enabled=true`.
pub fn manifest_enabled(mut self, enabled: bool) -> Self {
self.request.manifest_enabled = enabled;
self
}
/// The interval at which to check for updates from other processes. This
/// only affects LanceDB OSS.
///
@@ -886,6 +907,16 @@ impl ConnectBuilder {
pub async fn execute(self) -> Result<Connection> {
if self.request.uri.starts_with("db") {
self.execute_remote()
} else if self.request.manifest_enabled {
let internal = Arc::new(
ListingDatabase::connect_manifest_enabled_namespace_database(&self.request).await?,
);
Ok(Connection {
internal,
embedding_registry: self
.embedding_registry
.unwrap_or_else(|| Arc::new(MemoryRegistry::new())),
})
} else {
let internal = Arc::new(ListingDatabase::connect_with_options(&self.request).await?);
Ok(Connection {
@@ -1132,6 +1163,9 @@ mod tests {
use lance_testing::datagen::{BatchGenerator, IncrementingInt32};
use tempfile::tempdir;
use crate::database::listing::{ListingDatabaseOptions, OPT_NEW_TABLE_V2_MANIFEST_PATHS};
use crate::database::namespace::LanceNamespaceDatabase;
use crate::table::NativeTable;
use crate::test_utils::connection::new_test_connection;
use super::*;
@@ -1204,6 +1238,147 @@ mod tests {
);
}
#[tokio::test]
async fn test_connect_with_manifest_enabled_uses_directory_namespace() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri)
.manifest_enabled(true)
.storage_option("timeout", "30s")
.namespace_client_property("manifest_enabled", "false")
.namespace_client_property("dir_listing_to_manifest_migration_enabled", "false")
.execute()
.await
.unwrap();
assert!(
db.database()
.as_any()
.downcast_ref::<LanceNamespaceDatabase>()
.is_some()
);
assert_eq!(db.uri(), uri);
let (ns_impl, properties) = db.namespace_client_config().await.unwrap();
assert_eq!(ns_impl, "dir");
assert_eq!(properties.get("root"), Some(&uri.to_string()));
assert_eq!(
properties.get("manifest_enabled"),
Some(&"true".to_string())
);
assert_eq!(
properties.get("dir_listing_to_manifest_migration_enabled"),
Some(&"true".to_string())
);
assert_eq!(properties.get("storage.timeout"), Some(&"30s".to_string()));
}
#[tokio::test]
async fn test_manifest_enabled_rejects_commit_engine_uri() {
let Err(err) = connect("s3+ddb://bucket/db?ddbTableName=manifest")
.manifest_enabled(true)
.execute()
.await
else {
panic!("expected manifest-enabled s3+ddb connection to fail");
};
assert!(
matches!(err, Error::NotSupported { message } if message.contains("commit engine URI schemes"))
);
let Err(err) = connect("s3://bucket/db?engine=ddb&ddbTableName=manifest")
.manifest_enabled(true)
.execute()
.await
else {
panic!("expected manifest-enabled engine query connection to fail");
};
assert!(
matches!(err, Error::NotSupported { message } if message.contains("commit engine"))
);
}
#[tokio::test]
async fn test_manifest_enabled_connection_migrates_root_listing_table() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
connect(uri)
.execute()
.await
.unwrap()
.create_empty_table("legacy", schema)
.execute()
.await
.unwrap();
let db = connect(uri).manifest_enabled(true).execute().await.unwrap();
let tables = db.table_names().execute().await.unwrap();
assert_eq!(tables, vec!["legacy".to_string()]);
db.open_table("legacy").execute().await.unwrap();
}
#[tokio::test]
async fn test_manifest_enabled_preserves_new_table_options() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let options = ListingDatabaseOptions::builder()
.enable_v2_manifest_paths(true)
.build();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
let table = connect(uri)
.manifest_enabled(true)
.database_options(&options)
.execute()
.await
.unwrap()
.create_empty_table("v1_manifest", schema)
.storage_option(OPT_NEW_TABLE_V2_MANIFEST_PATHS, "false")
.execute()
.await
.unwrap();
let native_table = table
.base_table()
.as_any()
.downcast_ref::<NativeTable>()
.unwrap();
assert!(!native_table.uses_v2_manifest_paths().await.unwrap());
}
#[tokio::test]
async fn test_manifest_enabled_vend_input_storage_options() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
let table = connect(uri)
.manifest_enabled(true)
.storage_option("test_storage_option", "test_value")
.namespace_client_property("vend_input_storage_options", "true")
.namespace_client_property(
"vend_input_storage_options_refresh_interval_millis",
"60000",
)
.execute()
.await
.unwrap()
.create_empty_table("vended", schema)
.execute()
.await
.unwrap();
let storage_options = table.latest_storage_options().await.unwrap().unwrap();
assert_eq!(
storage_options.get("test_storage_option"),
Some(&"test_value".to_string())
);
assert!(storage_options.contains_key("expires_at_millis"));
}
#[tokio::test]
async fn test_table_names() {
let tc = new_test_connection().await.unwrap();

@@ -285,7 +285,7 @@ const MIRRORED_STORE: &str = "mirroredStore";
/// A connection to LanceDB
impl ListingDatabase {
fn build_namespace_client_properties(
pub(crate) fn build_namespace_client_properties(
uri: &str,
storage_options: &HashMap<String, String>,
namespace_client_properties: HashMap<String, String>,
@@ -298,6 +298,24 @@ impl ListingDatabase {
properties
}
pub(crate) fn build_manifest_enabled_namespace_client_properties(
uri: &str,
storage_options: &HashMap<String, String>,
namespace_client_properties: HashMap<String, String>,
) -> HashMap<String, String> {
let mut properties = Self::build_namespace_client_properties(
uri,
storage_options,
namespace_client_properties,
);
properties.insert("manifest_enabled".to_string(), "true".to_string());
properties.insert(
"dir_listing_to_manifest_migration_enabled".to_string(),
"true".to_string(),
);
properties
}
async fn connect_namespace_database(
uri: &str,
storage_options: HashMap<String, String>,
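
The effect of `build_manifest_enabled_namespace_client_properties` is that the two manifest keys are written last, so a conflicting user-supplied value cannot switch them off. The following is a Python sketch of that ordering, not the Rust implementation. The `root` key and the `storage.` prefix are taken from the assertions in `test_connect_with_manifest_enabled_uses_directory_namespace`; the merge order inside the base helper is an assumption.

```python
def build_namespace_client_properties(uri, storage_options, user_properties):
    # Assumed base layout: `root`, then `storage.`-prefixed storage options,
    # then user-supplied namespace client properties on top.
    props = {"root": uri}
    props.update({f"storage.{k}": v for k, v in storage_options.items()})
    props.update(user_properties)
    return props


def build_manifest_enabled_properties(uri, storage_options, user_properties):
    props = build_namespace_client_properties(uri, storage_options, user_properties)
    # Forced last: these overwrite whatever the caller passed in.
    props["manifest_enabled"] = "true"
    props["dir_listing_to_manifest_migration_enabled"] = "true"
    return props
```

Passing `{"manifest_enabled": "false"}` as a user property therefore still yields `"true"` in the result.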
@@ -323,6 +341,119 @@ impl ListingDatabase {
))
}
async fn prepare_namespace_root(
uri: &str,
storage_options: &HashMap<String, String>,
session: Arc<lance::session::Session>,
) -> Result<String> {
match url::Url::parse(uri) {
Ok(url) if url.scheme().len() == 1 && cfg!(windows) => {
let (object_store, _) = ObjectStore::from_uri_and_params(
session.store_registry(),
uri,
&ObjectStoreParams::default(),
)
.await?;
if object_store.is_local() {
Self::try_create_dir(uri).context(CreateDirSnafu { path: uri })?;
}
Ok(uri.to_string())
}
Ok(mut url) => {
if url.scheme().contains('+') {
return Err(Error::NotSupported {
message: "commit engine URI schemes are not supported for manifest-enabled namespace connections".to_string(),
});
}
for (key, value) in url.query_pairs() {
if key == ENGINE {
return Err(Error::NotSupported {
message: format!(
"commit engine '{}' is not supported for manifest-enabled namespace connections",
value
),
});
} else if key == MIRRORED_STORE {
return Err(Error::NotSupported {
message: "mirrored store is not supported for manifest-enabled namespace connections"
.to_string(),
});
}
}
url.set_query(None);
let plain_uri = url.to_string();
let os_params = ObjectStoreParams {
storage_options_accessor: if storage_options.is_empty() {
None
} else {
Some(Arc::new(StorageOptionsAccessor::with_static_options(
storage_options.clone(),
)))
},
..Default::default()
};
let (object_store, _) = ObjectStore::from_uri_and_params(
session.store_registry(),
&plain_uri,
&os_params,
)
.await?;
if object_store.is_local() {
Self::try_create_dir(&plain_uri).context(CreateDirSnafu {
path: plain_uri.clone(),
})?;
}
Ok(plain_uri)
}
Err(_) => {
let (object_store, _) = ObjectStore::from_uri_and_params(
session.store_registry(),
uri,
&ObjectStoreParams::default(),
)
.await?;
if object_store.is_local() {
Self::try_create_dir(uri).context(CreateDirSnafu { path: uri })?;
}
Ok(uri.to_string())
}
}
}
pub(crate) async fn connect_manifest_enabled_namespace_database(
request: &ConnectRequest,
) -> Result<LanceNamespaceDatabase> {
let options = ListingDatabaseOptions::parse_from_map(&request.options)?;
let session = request
.session
.clone()
.unwrap_or_else(|| Arc::new(lance::session::Session::default()));
let namespace_root =
Self::prepare_namespace_root(&request.uri, &options.storage_options, session.clone())
.await?;
let ns_properties = Self::build_manifest_enabled_namespace_client_properties(
&namespace_root,
&options.storage_options,
request.namespace_client_properties.clone(),
);
LanceNamespaceDatabase::connect_with_new_table_config(
"dir",
ns_properties,
options.storage_options,
request.read_consistency_interval,
Some(session),
HashSet::new(),
options.new_table_config,
)
.await
.map(|db| db.with_uri(request.uri.clone()))
}
/// Connect to a listing database
///
/// The URI should be a path to a directory where the tables are stored.
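
The URI checks in `prepare_namespace_root` can be summarised as: reject `+` in the scheme, reject the `engine` and `mirroredStore` query keys, then strip the query string. Below is a rough Python sketch of those checks using `urllib.parse` rather than the Rust `url` crate. `"engine"` as the value of `ENGINE` is an assumption based on the test URI, `NotSupported` and `plain_namespace_root` are hypothetical names, and object-store probing and directory creation are left out.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

ENGINE = "engine"  # assumed value of the Rust ENGINE constant
MIRRORED_STORE = "mirroredStore"


class NotSupported(Exception):
    """Hypothetical counterpart of Error::NotSupported."""


def plain_namespace_root(uri):
    parts = urlsplit(uri)
    # No scheme (plain path) or a single letter (treated here as a drive
    # letter; the Rust code does this only on Windows): use the URI as-is.
    if len(parts.scheme) <= 1:
        return uri
    if "+" in parts.scheme:
        raise NotSupported(
            "commit engine URI schemes are not supported for "
            "manifest-enabled namespace connections"
        )
    for key, value in parse_qsl(parts.query):
        if key == ENGINE:
            raise NotSupported(
                f"commit engine '{value}' is not supported for "
                "manifest-enabled namespace connections"
            )
        if key == MIRRORED_STORE:
            raise NotSupported(
                "mirrored store is not supported for "
                "manifest-enabled namespace connections"
            )
    # Drop the query string; the remainder is the namespace root.
    return urlunsplit(parts._replace(query=""))
```

The two failing URIs in `test_manifest_enabled_rejects_commit_engine_uri` correspond to the scheme check and the `engine` query check respectively.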
@@ -690,15 +821,12 @@ impl ListingDatabase {
store_params.storage_options_accessor = Some(Arc::new(accessor));
}
write_params.data_storage_version = self
.new_table_config
.data_storage_version
.or(storage_version_override);
write_params.data_storage_version = storage_version_override
.or(write_params.data_storage_version)
.or(self.new_table_config.data_storage_version);
if let Some(enable_v2_manifest_paths) = self
.new_table_config
.enable_v2_manifest_paths
.or(v2_manifest_override)
if let Some(enable_v2_manifest_paths) =
v2_manifest_override.or(self.new_table_config.enable_v2_manifest_paths)
{
write_params.enable_v2_manifest_paths = enable_v2_manifest_paths;
}
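
The hunk above reorders two `Option::or` chains. Assuming the first expression of each pair is the removed version and the second the added one (this page has dropped the +/- markers), the per-table override now takes precedence over the connection-level `new_table_config`. A small Python sketch of the two orderings, with `None` standing in for Rust's `None` and all function names hypothetical:

```python
def first_set(*candidates):
    """Return the first candidate that is not None, like chained Option::or."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def resolve_old(connection_default, table_override):
    # Previous order: the connection-level setting shadows the override.
    return first_set(connection_default, table_override)


def resolve_new(connection_default, table_override, write_param=None):
    # New order: per-table override, then any value already present on the
    # write params, then the connection-level default.
    return first_set(table_override, write_param, connection_default)
```

With a connection default of `True` and a per-table override of `False`, the old order resolves to `True` and the new order to `False`, which is the behaviour `test_manifest_enabled_preserves_new_table_options` checks for v2 manifest paths.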
@@ -1158,6 +1286,7 @@ mod tests {
client_config: Default::default(),
options: Default::default(),
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1292,6 +1421,7 @@ mod tests {
client_config: Default::default(),
options: options.clone(),
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1827,6 +1957,7 @@ mod tests {
client_config: Default::default(),
options,
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -1933,6 +2064,7 @@ mod tests {
client_config: Default::default(),
options,
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -2005,6 +2137,7 @@ mod tests {
client_config: Default::default(),
options,
namespace_client_properties: Default::default(),
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};
@@ -2202,6 +2335,7 @@ mod tests {
client_config: Default::default(),
options: Default::default(),
namespace_client_properties,
manifest_enabled: false,
read_consistency_interval: None,
session: None,
};

@@ -24,6 +24,10 @@ use lance_table::io::commit::external_manifest::ExternalManifestCommitHandler;
use crate::connection::NamespaceClientPushdownOperation;
use crate::database::ReadConsistency;
use crate::database::listing::{
NewTableConfig, OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS, OPT_NEW_TABLE_STORAGE_VERSION,
OPT_NEW_TABLE_V2_MANIFEST_PATHS,
};
use crate::error::{Error, Result};
use crate::table::NativeTable;
use lance::dataset::WriteMode;
@@ -50,6 +54,8 @@ pub struct LanceNamespaceDatabase {
ns_impl: String,
// Namespace properties used to construct the namespace client
ns_properties: HashMap<String, String>,
// Options for tables created by this connection
new_table_config: NewTableConfig,
}
impl LanceNamespaceDatabase {
@@ -71,9 +77,15 @@ impl LanceNamespaceDatabase {
pushdown_operations: namespace_client_pushdown_operations,
ns_impl: namespace_client_impl,
ns_properties: namespace_client_properties,
new_table_config: NewTableConfig::default(),
}
}
pub(crate) fn with_uri(mut self, uri: impl Into<String>) -> Self {
self.uri = uri.into();
self
}
pub async fn connect(
ns_impl: &str,
ns_properties: HashMap<String, String>,
@@ -81,6 +93,27 @@ impl LanceNamespaceDatabase {
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
) -> Result<Self> {
Self::connect_with_new_table_config(
ns_impl,
ns_properties,
storage_options,
read_consistency_interval,
session,
pushdown_operations,
NewTableConfig::default(),
)
.await
}
pub(crate) async fn connect_with_new_table_config(
ns_impl: &str,
ns_properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
new_table_config: NewTableConfig,
) -> Result<Self> {
let mut builder = ConnectBuilder::new(ns_impl);
for (key, value) in ns_properties.clone() {
@@ -102,8 +135,79 @@ impl LanceNamespaceDatabase {
pushdown_operations,
ns_impl: ns_impl.to_string(),
ns_properties,
new_table_config,
})
}
fn extract_storage_overrides(
&self,
request: &DbCreateTableRequest,
) -> Result<(
Option<lance_encoding::version::LanceFileVersion>,
Option<bool>,
Option<bool>,
)> {
let storage_options = request
.write_options
.lance_write_params
.as_ref()
.and_then(|p| p.store_params.as_ref())
.and_then(|sp| sp.storage_options());
let storage_version_override = storage_options
.and_then(|opts| opts.get(OPT_NEW_TABLE_STORAGE_VERSION))
.map(|s| s.parse::<lance_encoding::version::LanceFileVersion>())
.transpose()?;
let v2_manifest_override = storage_options
.and_then(|opts| opts.get(OPT_NEW_TABLE_V2_MANIFEST_PATHS))
.map(|s| s.parse::<bool>())
.transpose()
.map_err(|_| Error::InvalidInput {
message: "enable_v2_manifest_paths must be a boolean".to_string(),
})?;
let stable_row_ids_override = storage_options
.and_then(|opts| opts.get(OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS))
.map(|s| s.parse::<bool>())
.transpose()
.map_err(|_| Error::InvalidInput {
message: "enable_stable_row_ids must be a boolean".to_string(),
})?;
Ok((
storage_version_override,
v2_manifest_override,
stable_row_ids_override,
))
}
fn apply_new_table_config(
&self,
params: &mut lance::dataset::WriteParams,
request: &DbCreateTableRequest,
) -> Result<()> {
let (storage_version_override, v2_manifest_override, stable_row_ids_override) =
self.extract_storage_overrides(request)?;
params.data_storage_version = storage_version_override
.or(params.data_storage_version)
.or(self.new_table_config.data_storage_version);
if let Some(enable_v2_manifest_paths) =
v2_manifest_override.or(self.new_table_config.enable_v2_manifest_paths)
{
params.enable_v2_manifest_paths = enable_v2_manifest_paths;
}
if let Some(enable_stable_row_ids) =
stable_row_ids_override.or(self.new_table_config.enable_stable_row_ids)
{
params.enable_stable_row_ids = enable_stable_row_ids;
}
Ok(())
}
}
impl std::fmt::Debug for LanceNamespaceDatabase {
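
Review note on `extract_storage_overrides` in the hunk above: each override is read with the same `get` / `map(parse)` / `transpose` / `map_err` chain, which turns "key absent" into `Ok(None)` and "key present but malformed" into an error. Below is a minimal, self-contained sketch of that chain. The helper name `parse_bool_option`, the key `example_flag`, and the `String` error type are assumptions made for this illustration; the real code returns the crate's `Error::InvalidInput`.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of LanceDB): reads an optional boolean
/// from a string map.
///
/// Returns `Ok(None)` when the key is absent, `Ok(Some(b))` when it parses,
/// and `Err(..)` when the key is present but is not "true" or "false".
pub fn parse_bool_option(
    opts: &HashMap<String, String>,
    key: &str,
) -> Result<Option<bool>, String> {
    opts.get(key)
        // Option<&String> -> Option<Result<bool, ParseBoolError>>
        .map(|s| s.parse::<bool>())
        // Option<Result<..>> -> Result<Option<bool>, ParseBoolError>
        .transpose()
        // Replace the parse error with a message that names the key.
        .map_err(|_| format!("{key} must be a boolean"))
}
```

Because `transpose` flips the nesting, a single `?` at the call site is enough to propagate a malformed value while still treating a missing key as "no override".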
@@ -299,7 +403,12 @@ impl Database for LanceNamespaceDatabase {
};
// Build write params with storage options and commit handler
-let mut params = request.write_options.lance_write_params.unwrap_or_default();
+let mut params = request
+.write_options
+.lance_write_params
+.clone()
+.unwrap_or_default();
+self.apply_new_table_config(&mut params, &request)?;
if matches!(request.mode, CreateTableMode::Overwrite) {
params.mode = WriteMode::Overwrite;

@@ -16,7 +16,7 @@ use crate::remote::retry::{ResolvedRetryConfig, RetryCounter};
const REQUEST_ID_HEADER: HeaderName = HeaderName::from_static("x-request-id");
/// Configuration for TLS/mTLS settings.
-#[derive(Clone, Debug, Default)]
+#[derive(Clone, Debug)]
pub struct TlsConfig {
/// Path to the client certificate file (PEM format)
pub cert_file: Option<String>,
@@ -24,10 +24,22 @@ pub struct TlsConfig {
pub key_file: Option<String>,
/// Path to the CA certificate file for server verification (PEM format)
pub ssl_ca_cert: Option<String>,
-/// Whether to verify the hostname in the server's certificate
+/// Whether to verify the hostname in the server's certificate.
+/// Defaults to `true`.
pub assert_hostname: bool,
}
impl Default for TlsConfig {
fn default() -> Self {
Self {
cert_file: None,
key_file: None,
ssl_ca_cert: None,
assert_hostname: true,
}
}
}
/// Trait for providing custom headers for each request
#[async_trait::async_trait]
pub trait HeaderProvider: Send + Sync + std::fmt::Debug {
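
Review note on the `TlsConfig` hunk above: a derived `Default` sets every `bool` field to `false`, so the derive is replaced with a hand-written impl in order to make hostname verification on by default. Below is a minimal sketch of the same pattern on a standalone copy of the struct. The type name `TlsConfigSketch` and the helper `without_hostname_check` are hypothetical and exist only for this illustration.

```rust
/// Standalone copy of the fields shown in the diff, renamed so it is
/// clearly not the real `TlsConfig`.
#[derive(Clone, Debug)]
pub struct TlsConfigSketch {
    pub cert_file: Option<String>,
    pub key_file: Option<String>,
    pub ssl_ca_cert: Option<String>,
    pub assert_hostname: bool,
}

impl Default for TlsConfigSketch {
    fn default() -> Self {
        Self {
            cert_file: None,
            key_file: None,
            ssl_ca_cert: None,
            // Secure by default: a derived impl would yield `false` here.
            assert_hostname: true,
        }
    }
}

/// Hypothetical helper: opts out of hostname verification explicitly,
/// keeping every other field at its default via struct update syntax.
pub fn without_hostname_check() -> TlsConfigSketch {
    TlsConfigSketch {
        assert_hostname: false,
        ..Default::default()
    }
}
```

Callers that relied on the old derived default (verification off) now have to opt out explicitly, as `without_hostname_check` does.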
@@ -926,7 +938,7 @@ mod tests {
assert!(config.cert_file.is_none());
assert!(config.key_file.is_none());
assert!(config.ssl_ca_cert.is_none());
-assert!(!config.assert_hostname);
+assert!(config.assert_hostname);
}
#[test]