Compare commits

..

6 Commits

Author SHA1 Message Date
Will Jones
c0a9a4d48a ci: fix pypi publish on mac/windows/arm (#3449)
The python-v0.32.0 publish run failed on every build matrix entry. Three
independent issues:

1. **Mac and Windows**: `pypa/gh-action-pypi-publish` only runs on
Linux, but was being called inline from each build job.
2. **Linux (all arches)**: `pypa/gh-action-pypi-publish` derives its
docker image name from `github.action_repository`, which is empty when
the action is invoked from inside a composite action
(actions/runner#2473 — pypa's own `action.yml` references this bug). It
falls back to `github.repository`, generating
`docker://ghcr.io/lancedb/lancedb:<tag>`, which doesn't exist →
`denied`. Only the ARM matrix entry surfaced this because it failed
first and cancel-cascaded the rest.
3. **Windows**: `upload-artifact` in `build_windows_wheel` pointed at
`python\target\wheels`, but maturin writes to the workspace-root
`target/wheels`. The artifact was always empty. Also, `pypi-publish.yml`
passed a `vcpkg_token` input that the composite doesn't declare.

## Changes

- Build jobs (linux/mac/windows) now upload their wheels as
`actions/upload-artifact` artifacts.
- New Linux `publish` job downloads all wheel artifacts and runs the
Fury or PyPA publish step directly (not via a composite), so
`github.action_repository` resolves correctly.
- Delete the unused `upload_wheel` composite action.
- Drop the broken upload-artifact step inside `build_windows_wheel`.
- Remove the bogus `vcpkg_token` input.
- Fury upload now loops over all wheels instead of just the first.
- Bump `actions/checkout`, `actions/upload-artifact`,
`actions/download-artifact` to current major versions (Node 24) to clear
deprecation warnings.
- Bump Windows job timeout 60 → 90 minutes; previous run was
cancel-timing-out on a 60m cap.
- Use `rust-lld` as the Windows MSVC linker via
`CARGO_TARGET_X86_64_PC_WINDOWS_MSVC_LINKER`. `link.exe` is
single-threaded and the long pole on Windows builds.

Fixes #3445

## Test plan

- [x] Open this PR — `paths` filter triggers a dry-run build on all
three platforms.
- [x] Verify all three builds produce wheels.
- [x] Confirm the `pypa/gh-action-pypi-publish` container actually
starts (the actions/runner#2473 bug) via the `publish-dry-run` job
pointed at TestPyPI.
- [x] **REMOVE BEFORE MERGE**: drop the `publish-dry-run` job and the
now-redundant `actions/upload-artifact` runs on PRs (currently always-on
so the dry-run has wheels to publish).
- [ ] After merge, cherry-pick onto `python-v0.32.0` and force-push the
tag to re-trigger the publish.
2026-05-28 12:35:58 -07:00
Lance Release
32c3d39f2a Bump version: 0.30.0-beta.2 → 0.30.0 2026-05-28 19:03:15 +00:00
Lance Release
7a41f4a5eb Bump version: 0.30.0-beta.1 → 0.30.0-beta.2 2026-05-28 19:02:41 +00:00
Lance Release
28fc8f0f26 Bump version: 0.33.0-beta.2 → 0.33.0 2026-05-28 19:02:05 +00:00
Lance Release
7677a5279c Bump version: 0.33.0-beta.1 → 0.33.0-beta.2 2026-05-28 19:02:03 +00:00
Will Jones
b6c645592a chore: update lance dependency to v7.0.0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:24:12 -07:00
108 changed files with 4221 additions and 15620 deletions

View File

@@ -1,7 +0,0 @@
# Agent Skills
This directory contains repo-scoped code agent skills for the LanceDB project.
Each skill is a folder that contains a required `SKILL.md` and optional bundled resources.
Codex discovers skills from `.agents/skills` in the current working directory and parent directories.

View File

@@ -1,98 +0,0 @@
---
name: lancedb-update-lance-dependency
description: Update LanceDB to a specific Lance release or tag. Use when bumping Lance dependencies in the lancedb repository, including Rust workspace Lance crates, Java lance-core, validation, branch creation, commit, push, and PR creation when requested.
---
# LanceDB Update Lance Dependency
## Scope
Use this skill in the `lancedb/lancedb` repository when updating the Lance dependency to a specific Lance version or tag.
Inputs can be a version (`7.2.0-beta.1`), a tag (`v7.2.0-beta.1`), a tag ref (`refs/tags/v7.2.0-beta.1`), or `latest`.
## Workflow
1. Confirm the worktree status with `git status --short`.
2. Resolve the target Lance version:
- If the input is `latest`, empty, or omitted, run:
```bash
python3 ci/check_lance_release.py
```
Parse the JSON output. If `needs_update` is not `true`, stop without creating a PR. Otherwise use `latest_tag`.
- If the input is explicit, use it directly.
3. Compute update metadata without changing files:
```bash
python3 ci/update_lance_dependency.py "$TAG_OR_VERSION" --metadata-only
```
Before making changes, check for an existing open PR with the emitted `pr_title`:
```bash
gh pr list --search "\"$PR_TITLE\" in:title" --state open --limit 1 --json number,url,title
```
If a matching open PR exists, stop and report it instead of creating a duplicate.
4. Run the deterministic update entrypoint:
```bash
python3 ci/update_lance_dependency.py "$TAG_OR_VERSION"
```
This updates the Rust workspace Lance dependencies through `ci/set_lance_version.py`, updates `java/pom.xml`, refreshes Cargo metadata, and prints JSON metadata containing `branch_name`, `commit_message`, and `pr_title`.
5. Run validation:
```bash
cargo clippy --quiet --workspace --tests --all-features -- -D warnings
cargo fmt --all --quiet
```
Fix real diagnostics and rerun clippy until it succeeds. Do not skip warnings.
6. Inspect `git status --short` and `git diff` to ensure only the Lance dependency update and required compatibility fixes are present.
7. If the task only asks to prepare local changes, stop here and report the changed files and validation result.
8. If the task asks to publish the update, create a branch using the printed `branch_name`, stage all relevant files, and commit using the printed `commit_message`. Do not amend or rewrite existing commits.
9. Push to `origin`. Before creating the PR, check that the current token has push permission:
```bash
gh api repos/lancedb/lancedb --jq .permissions.push
```
If the remote branch already exists for the same generated branch name, delete the remote ref with `gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/$BRANCH_NAME`, then push. Do not force-push.
10. Create a PR targeting `main` with the printed `pr_title`. If there is no PR template, keep the body to two or three concise sentences: state the Lance dependency bump, note any required compatibility fixes, and link the triggering Lance tag or release.
11. Read back the remote PR title after creation. If it is not a Conventional Commit title, fix it immediately.
12. When running in GitHub Actions after creating the LanceDB PR, trigger the Sophon dependency update:
```bash
gh workflow run codex-bump-lancedb-lance.yml \
--repo lancedb/sophon \
-f lance_ref="$LANCE_TAG" \
-f lancedb_ref="$BRANCH_NAME"
gh run list --repo lancedb/sophon --workflow codex-bump-lancedb-lance.yml --limit 1 --json databaseId,url,displayTitle
```
Use the emitted metadata `tag` value as `LANCE_TAG`. Do this only after a new LanceDB PR has been created. If the update was skipped because no update is needed or an open PR already exists, do not trigger Sophon.
## GitHub Actions
When this skill is used from GitHub Actions, `TAG`, `GH_TOKEN`, and `GITHUB_TOKEN` may already be set. Resolve `latest` first when `TAG` is empty. Once an explicit tag or version is known, use:
```bash
python3 ci/update_lance_dependency.py "$TAG" --github-output "$GITHUB_OUTPUT"
```
Then use the emitted `branch_name`, `commit_message`, and `pr_title` values for branch, commit, and PR creation.

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.30.1-beta.2"
current_version = "0.30.0"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

View File

@@ -21,14 +21,3 @@ updates:
update-types:
- minor
- patch
- package-ecosystem: pip
directory: /python
schedule:
interval: weekly
# Only update uv.lock, never widen version requirements in pyproject.toml.
versioning-strategy: lockfile-only
groups:
python-deps:
patterns:
- "*"

View File

@@ -4,16 +4,14 @@ on:
workflow_call:
inputs:
tag:
description: "Tag name from Lance. If omitted, the skill will use the latest Lance release that needs an update."
required: false
default: ""
description: "Tag name from Lance"
required: true
type: string
workflow_dispatch:
inputs:
tag:
description: "Tag name from Lance. Leave empty to use the latest Lance release that needs an update."
required: false
default: ""
description: "Tag name from Lance"
required: true
type: string
permissions:
@@ -27,7 +25,7 @@ jobs:
steps:
- name: Show inputs
run: |
echo "tag = ${{ inputs.tag || 'latest' }}"
echo "tag = ${{ inputs.tag }}"
- name: Checkout Repo LanceDB
uses: actions/checkout@v4
@@ -73,21 +71,65 @@ jobs:
OPENAI_API_KEY: ${{ secrets.CODEX_TOKEN }}
run: |
set -euo pipefail
TARGET_TAG="${TAG:-latest}"
VERSION="${TAG#refs/tags/}"
VERSION="${VERSION#v}"
BRANCH_NAME="codex/update-lance-${VERSION//[^a-zA-Z0-9]/-}"
# Use "chore" for beta/rc versions, "feat" for stable releases
if [[ "${VERSION}" == *beta* ]] || [[ "${VERSION}" == *rc* ]]; then
COMMIT_TYPE="chore"
else
COMMIT_TYPE="feat"
fi
cat <<EOF >/tmp/codex-prompt.txt
You are running inside the lancedb repository on a GitHub Actions runner.
You are running inside the lancedb repository on a GitHub Actions runner. Update the Lance dependency to version ${VERSION} and prepare a pull request for maintainers to review.
Use \$lancedb-update-lance-dependency with target "${TARGET_TAG}".
Follow these steps exactly:
1. Use script "ci/set_lance_version.py" to update Lance Rust dependencies. The script already refreshes Cargo metadata, so allow it to finish even if it takes time.
2. Update the Java lance-core dependency version in "java/pom.xml": change the "<lance-core.version>...</lance-core.version>" property to "${VERSION}".
3. Run "cargo clippy --workspace --tests --all-features -- -D warnings". If diagnostics appear, fix them yourself and rerun clippy until it exits cleanly. Do not skip any warnings.
4. After clippy succeeds, run "cargo fmt --all" to format the workspace.
5. Ensure the repository is clean except for intentional changes. Inspect "git status --short" and "git diff" to confirm the dependency update and any required fixes.
6. Create and switch to a new branch named "${BRANCH_NAME}" (replace any duplicated hyphens if necessary).
7. Stage all relevant files with "git add -A". Commit using the message "${COMMIT_TYPE}: update lance dependency to v${VERSION}".
8. Push the branch to origin. If the remote branch already exists, delete it first with "gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/${BRANCH_NAME}" then push with "git push origin ${BRANCH_NAME}". Do NOT use "git push --force" or "git push -f".
9. env "GH_TOKEN" is available, use "gh" tools for github related operations like creating pull request.
10. Create a pull request targeting "main" with title "${COMMIT_TYPE}: update lance dependency to v${VERSION}". First, write the PR body to /tmp/pr-body.md using a heredoc (cat <<'EOF' > /tmp/pr-body.md). The body should summarize the dependency bump, clippy/fmt verification, and link the triggering tag (${TAG}). Then run "gh pr create --body-file /tmp/pr-body.md".
11. After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
Constraints:
- Use env "GH_TOKEN" for GitHub operations.
- Do not merge the pull request.
- Do not force-push.
- Do not create a duplicate pull request if an open PR already exists for the target Lance version.
- If any command fails, diagnose and fix the root cause instead of aborting.
- After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
- Use bash commands; avoid modifying GitHub workflow files other than through the scripted task above.
- Do not merge the PR.
- If any command fails, diagnose and fix the issue instead of aborting.
EOF
printenv OPENAI_API_KEY | codex login --with-api-key
codex --config shell_environment_policy.ignore_default_excludes=true exec --dangerously-bypass-approvals-and-sandbox "$(cat /tmp/codex-prompt.txt)"
- name: Trigger sophon dependency update
env:
TAG: ${{ inputs.tag }}
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
VERSION="${TAG#refs/tags/}"
VERSION="${VERSION#v}"
LANCEDB_BRANCH="codex/update-lance-${VERSION//[^a-zA-Z0-9]/-}"
echo "Triggering sophon workflow with:"
echo " lance_ref: ${TAG#refs/tags/}"
echo " lancedb_ref: ${LANCEDB_BRANCH}"
gh workflow run codex-bump-lancedb-lance.yml \
--repo lancedb/sophon \
-f lance_ref="${TAG#refs/tags/}" \
-f lancedb_ref="${LANCEDB_BRANCH}"
- name: Show latest sophon workflow run
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
echo "Latest sophon workflow run:"
gh run list --repo lancedb/sophon --workflow codex-bump-lancedb-lance.yml --limit 1 --json databaseId,url,displayTitle

View File

@@ -0,0 +1,62 @@
name: Lance Release Timer
on:
schedule:
- cron: "*/10 * * * *"
workflow_dispatch:
permissions:
contents: read
actions: write
concurrency:
group: lance-release-timer
cancel-in-progress: false
jobs:
trigger-update:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Check for new Lance tag
id: check
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
python3 ci/check_lance_release.py --github-output "$GITHUB_OUTPUT"
- name: Look for existing PR
if: steps.check.outputs.needs_update == 'true'
id: pr
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
TITLE="chore: update lance dependency to v${{ steps.check.outputs.latest_version }}"
COUNT=$(gh pr list --search "\"$TITLE\" in:title" --state open --limit 1 --json number --jq 'length')
if [ "$COUNT" -gt 0 ]; then
echo "Open PR already exists for $TITLE"
echo "pr_exists=true" >> "$GITHUB_OUTPUT"
else
echo "No existing PR for $TITLE"
echo "pr_exists=false" >> "$GITHUB_OUTPUT"
fi
- name: Trigger codex update workflow
if: steps.check.outputs.needs_update == 'true' && steps.pr.outputs.pr_exists != 'true'
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
TAG=${{ steps.check.outputs.latest_tag }}
gh workflow run codex-update-lance-dependency.yml -f tag=refs/tags/$TAG
- name: Show latest codex workflow run
if: steps.check.outputs.needs_update == 'true' && steps.pr.outputs.pr_exists != 'true'
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
gh run list --workflow codex-update-lance-dependency.yml --limit 1 --json databaseId,url,displayTitle

View File

@@ -27,11 +27,19 @@ jobs:
strategy:
matrix:
config:
- platform: x86_64
manylinux: "2_17"
extra_args: ""
runner: ubuntu-22.04
- platform: x86_64
manylinux: "2_28"
extra_args: "--features fp16kernels"
runner: ubuntu-22.04
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
- platform: aarch64
manylinux: "2_17"
extra_args: ""
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
runner: ubuntu-2404-8x-arm64
- platform: aarch64
manylinux: "2_28"
extra_args: "--features fp16kernels"

1
.gitignore vendored
View File

@@ -27,7 +27,6 @@ python/dist
*.so
*.dylib
*.dll
*.pdb
## Javascript
*.node

901
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -13,20 +13,20 @@ categories = ["database-implementations"]
rust-version = "1.91.0"
[workspace.dependencies]
lance = { "version" = "=8.0.0-beta.12", default-features = false, "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=8.0.0-beta.12", default-features = false, "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=8.0.0-beta.12", default-features = false, "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=8.0.0-beta.12", "tag" = "v8.0.0-beta.12", "git" = "https://github.com/lance-format/lance.git" }
lance = { "version" = "=7.0.0", default-features = false }
lance-core = "=7.0.0"
lance-datagen = "=7.0.0"
lance-file = "=7.0.0"
lance-io = { "version" = "=7.0.0", default-features = false }
lance-index = "=7.0.0"
lance-linalg = "=7.0.0"
lance-namespace = "=7.0.0"
lance-namespace-impls = { "version" = "=7.0.0", default-features = false }
lance-table = "=7.0.0"
lance-testing = "=7.0.0"
lance-datafusion = "=7.0.0"
lance-encoding = "=7.0.0"
lance-arrow = "=7.0.0"
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "58.0.0", optional = false }

View File

@@ -1,26 +0,0 @@
# Code review guidelines
Repo-specific guidance for automated PR reviews.
## Cross-SDK parity
LanceDB exposes the same core (`rust/lancedb`) through Python, TypeScript (`nodejs`),
and Java bindings. Behavioral drift between SDKs is a recurring problem, so watch for
parity gaps when reviewing — but only flag real ones:
* If the change adds or modifies user-facing API or behavior in the shared core
(`rust/lancedb`), check whether each binding that should expose it (`python`,
`nodejs`) does. A core change with no corresponding binding update is worth a note.
* If the change adds or modifies a public API in one SDK but not the other, open the
sibling SDK's corresponding module and state whether an equivalent exists. If not,
note it as a possible parity gap and suggest a follow-up issue.
* For bug fixes, first read the sibling SDK's analogous code path to check whether the
same bug exists there. Only raise parity if it actually does. Do not ask to "port" a
fix for a bug that only ever existed in one binding.
* Stay silent on internal-only refactors, tests, docs, and changes with no cross-SDK
surface.
* Parity expectations apply to the Python and TypeScript (`nodejs`) SDKs. Java currently
implements only the remote table, not the local/embedded backend, so it is expected to
be partial — do not flag Java for missing local-only functionality.
* Keep parity feedback to a short, clearly-labeled note (e.g. "Possible SDK parity
gap: …"). It is advisory, not a merge blocker.

View File

@@ -1,126 +0,0 @@
#!/usr/bin/env python3
"""Prepare a Lance dependency update for LanceDB."""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
from pathlib import Path
from typing import Sequence
try:
from check_lance_release import parse_semver
except ModuleNotFoundError:
# Supports importing as ci.update_lance_dependency from tests or ad hoc checks.
from ci.check_lance_release import parse_semver # type: ignore
def normalize_version(raw: str) -> str:
value = raw.strip()
value = value.removeprefix("refs/tags/")
value = value.removeprefix("v")
try:
parse_semver(value)
except ValueError:
raise ValueError(f"Unsupported Lance version or tag: {raw}")
return value
def normalized_tag(version: str) -> str:
return f"v{version}"
def branch_name(version: str) -> str:
suffix = re.sub(r"[^a-zA-Z0-9]+", "-", version).strip("-")
suffix = re.sub(r"-+", "-", suffix)
return f"codex/update-lance-{suffix}"
def commit_type(version: str) -> str:
prerelease = version.split("-", maxsplit=1)[1] if "-" in version else ""
return "chore" if "beta" in prerelease or "rc" in prerelease else "feat"
def metadata_for(version: str) -> dict[str, str]:
kind = commit_type(version)
message = f"{kind}: update lance dependency to v{version}"
return {
"version": version,
"tag": normalized_tag(version),
"branch_name": branch_name(version),
"commit_type": kind,
"commit_message": message,
"pr_title": message,
}
def run_command(cmd: Sequence[str], *, cwd: Path) -> None:
subprocess.run(cmd, cwd=cwd, check=True)
def update_java_lance_core_version(repo_root: Path, version: str) -> None:
pom_path = repo_root / "java" / "pom.xml"
contents = pom_path.read_text(encoding="utf-8")
updated, count = re.subn(
r"(<lance-core\.version>)[^<]+(</lance-core\.version>)",
rf"\g<1>{version}\g<2>",
contents,
count=1,
)
if count != 1:
raise RuntimeError(
"Expected exactly one <lance-core.version> entry in java/pom.xml"
)
pom_path.write_text(updated, encoding="utf-8")
def write_github_outputs(path: str | None, payload: dict[str, str]) -> None:
if not path:
return
with open(path, "a", encoding="utf-8") as output:
for key, value in payload.items():
output.write(f"{key}={value}\n")
def main(argv: Sequence[str] | None = None) -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"tag_or_version",
help="Lance tag or version, for example refs/tags/v7.2.0-beta.1 or 7.2.0",
)
parser.add_argument(
"--repo-root",
type=Path,
default=Path(__file__).resolve().parents[1],
help="Path to the lancedb repository root",
)
parser.add_argument(
"--github-output",
default=None,
help="Optional GitHub Actions output file to receive metadata fields",
)
parser.add_argument(
"--metadata-only",
action="store_true",
help="Only print derived metadata; do not modify dependency files",
)
args = parser.parse_args(argv)
repo_root = args.repo_root.resolve()
version = normalize_version(args.tag_or_version)
payload = metadata_for(version)
if not args.metadata_only:
run_command([sys.executable, "ci/set_lance_version.py", version], cwd=repo_root)
update_java_lance_core_version(repo_root, version)
write_github_outputs(args.github_output, payload)
print(json.dumps(payload, sort_keys=True))
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@@ -147,14 +147,6 @@ allow = [
"CDLA-Permissive-2.0",
]
confidence-threshold = 0.8
# Per-crate license exceptions: allow a license for a specific crate only,
# rather than globally via the `allow` list above.
exceptions = [
# CDDL-1.0 (copyleft) is pulled in only as a dev/profiling dependency via
# `inferno` -> `pprof` -> `lance-testing`; it is a test dependency that we
# do not distribute, so scope the allowance to `inferno` alone.
{ allow = ["CDDL-1.0"], crate = "inferno" },
]
# Crates whose license cannot be determined from Cargo metadata but whose
# license we've manually confirmed from upstream. Keep this list minimal.
[[licenses.clarify]]

View File

@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-core</artifactId>
<version>0.30.1-beta.2</version>
<version>0.30.0</version>
</dependency>
```

View File

@@ -1,43 +0,0 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / BranchContents
# Class: BranchContents
## Constructors
### new BranchContents()
```ts
new BranchContents(): BranchContents
```
#### Returns
[`BranchContents`](BranchContents.md)
## Properties
### manifestSize
```ts
manifestSize: number;
```
***
### parentBranch?
```ts
optional parentBranch: string;
```
***
### parentVersion
```ts
parentVersion: number;
```

View File

@@ -1,96 +0,0 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / Branches
# Class: Branches
Branch manager for a [Table](Table.md).
Unlike tags, `create` and `checkout` return a new [Table](Table.md) handle scoped
to the branch; writes on it do not affect `main`.
## Methods
### checkout()
```ts
checkout(name, version?): Promise<Table>
```
Check out an existing branch and return a handle scoped to it.
With `version` set, the returned handle is pinned to that version of the
branch (a read-only, detached view); otherwise it tracks the branch's
latest and stays writable.
#### Parameters
* **name**: `string`
* **version?**: `number`
#### Returns
`Promise`&lt;[`Table`](Table.md)&gt;
***
### create()
```ts
create(
name,
fromRef?,
fromVersion?): Promise<Table>
```
Create a branch and return a handle scoped to it.
#### Parameters
* **name**: `string`
Name of the new branch.
* **fromRef?**: `string`
Source branch to fork from. Defaults to `main`.
* **fromVersion?**: `number`
A specific version on `fromRef`. Defaults to latest.
#### Returns
`Promise`&lt;[`Table`](Table.md)&gt;
***
### delete()
```ts
delete(name): Promise<void>
```
Delete a branch.
#### Parameters
* **name**: `string`
#### Returns
`Promise`&lt;`void`&gt;
***
### list()
```ts
list(): Promise<Record<string, BranchContents>>
```
List all branches, mapping name to branch metadata.
#### Returns
`Promise`&lt;`Record`&lt;`string`, [`BranchContents`](BranchContents.md)&gt;&gt;

View File

@@ -57,24 +57,6 @@ block size may be added in the future.
***
### fm()
```ts
static fm(): Index
```
Create an FM-Index.
An FM-Index is a scalar index on string or binary columns that accelerates
substring search, i.e. `contains(col, 'needle')`. Unlike the tokenized
full-text-search index, it matches arbitrary substrings of the raw bytes.
#### Returns
[`Index`](Index.md)
***
### fts()
```ts

View File

@@ -76,57 +76,6 @@ the query optimizer chooses a suboptimal path.
***
### useLsmWrite()
```ts
useLsmWrite(useLsmWrite): MergeInsertBuilder
```
Controls whether the merge uses the MemWAL LSM write path.
By default (unset), a `mergeInsert` on a table with an LSM write spec is
routed through Lance's MemWAL shard writer, and a table without one uses
the standard path. Pass `false` to force the standard path even when a
spec is set. Pass `true` to require a spec — `mergeInsert` rejects if none
is installed.
#### Parameters
* **useLsmWrite**: `boolean`
Whether to use the LSM write path.
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
***
### validateSingleShard()
```ts
validateSingleShard(validateSingleShard): MergeInsertBuilder
```
Controls how an LSM merge checks that its input targets a single shard.
When a table has an LSM write spec, every row in a `mergeInsert` call must
route to the same shard. When `true` (the default), every row is inspected
to verify this. When `false`, only the first row is inspected and the
shard it routes to is used for the whole input — a faster path for callers
that have already pre-sharded their input. Has no effect on tables without
an LSM write spec.
#### Parameters
* **validateSingleShard**: `boolean`
Whether to check every row routes to one shard. Defaults to `true`.
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
***
### whenMatchedUpdateAll()
```ts

View File

@@ -110,23 +110,6 @@ containing the new version number of the table after altering the columns.
***
### branches()
```ts
abstract branches(): Promise<Branches>
```
Get the branch manager for this table.
Branches are isolated, writable lines of history forked from another
branch (or version). Writes on a branch do not affect `main`.
#### Returns
`Promise`&lt;[`Branches`](Branches.md)&gt;
***
### checkout()
```ts
@@ -204,25 +187,6 @@ Any attempt to use the table after it is closed will result in an error.
***
### closeLsmWriters()
```ts
abstract closeLsmWriters(): Promise<void>
```
Drain and close any cached MemWAL shard writers held for this table.
When an [LsmWriteSpec](../interfaces/LsmWriteSpec.md) is installed, `mergeInsert` opens MemWAL
shard writers and caches them for reuse across calls. This closes them,
flushing pending data; writers reopen lazily on the next `mergeInsert`.
It is a no-op when no writers are cached.
#### Returns
`Promise`&lt;`void`&gt;
***
### countRows()
```ts
@@ -295,23 +259,6 @@ await table.createIndex("my_float_col");
***
### currentBranch()
```ts
abstract currentBranch(): null | string
```
The branch this table handle is scoped to, or `null` for the main branch.
A handle returned by [Branches.create](Branches.md#create) or [Branches.checkout](Branches.md#checkout)
reports the branch it targets; a handle opened normally reports `null`.
#### Returns
`null` \| `string`
***
### delete()
```ts
@@ -1028,29 +975,6 @@ based on the row being updated (e.g. "my_col + 1")
***
### updateFieldMetadata()
```ts
abstract updateFieldMetadata(updates): Promise<UpdateFieldMetadataResult>
```
Update per-field (column) metadata.
#### Parameters
* **updates**: [`FieldMetadataUpdate`](../interfaces/FieldMetadataUpdate.md)[]
One or more per-field updates. Each
update's metadata is merged into the field's existing metadata by default;
a value of `null` deletes that key, and `replace: true` swaps the whole map.
#### Returns
`Promise`&lt;[`UpdateFieldMetadataResult`](../interfaces/UpdateFieldMetadataResult.md)&gt;
resolves to the new table version.
***
### vectorSearch()
```ts

View File

@@ -19,8 +19,6 @@
- [BooleanQuery](classes/BooleanQuery.md)
- [BoostQuery](classes/BoostQuery.md)
- [BranchContents](classes/BranchContents.md)
- [Branches](classes/Branches.md)
- [Connection](classes/Connection.md)
- [HeaderProvider](classes/HeaderProvider.md)
- [Index](classes/Index.md)
@@ -67,7 +65,6 @@
- [DropNamespaceOptions](interfaces/DropNamespaceOptions.md)
- [DropNamespaceResponse](interfaces/DropNamespaceResponse.md)
- [ExecutableQuery](interfaces/ExecutableQuery.md)
- [FieldMetadataUpdate](interfaces/FieldMetadataUpdate.md)
- [FragmentStatistics](interfaces/FragmentStatistics.md)
- [FragmentSummaryStats](interfaces/FragmentSummaryStats.md)
- [FtsOptions](interfaces/FtsOptions.md)
@@ -104,7 +101,6 @@
- [TimeoutConfig](interfaces/TimeoutConfig.md)
- [TlsConfig](interfaces/TlsConfig.md)
- [TokenResponse](interfaces/TokenResponse.md)
- [UpdateFieldMetadataResult](interfaces/UpdateFieldMetadataResult.md)
- [UpdateOptions](interfaces/UpdateOptions.md)
- [UpdateResult](interfaces/UpdateResult.md)
- [Version](interfaces/Version.md)

View File

@@ -1,41 +0,0 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / FieldMetadataUpdate
# Interface: FieldMetadataUpdate
A per-field metadata update, addressed by dot-path.
## Properties
### metadata
```ts
metadata: Record<string, null | string>;
```
Metadata key/value pairs. Merged into the field's existing metadata by
default; a value of `null` deletes that key.
***
### path
```ts
path: string;
```
Dot-separated path to the field. For a top-level column this is just its
name; for a nested field it's the path, e.g. "a.b.c".
***
### replace?
```ts
optional replace: boolean;
```
If true, replace the field's entire metadata map instead of merging.

View File

@@ -23,31 +23,6 @@ be more columns to represent composite indices.
***
### createdAt?
```ts
optional createdAt: Date;
```
When the index was created.
`undefined` for remote tables or indices created before timestamps were tracked.
***
### indexDetails?
```ts
optional indexDetails: any;
```
Index-type-specific details parsed as a JavaScript object.
Falls back to a raw string if JSON parsing fails. `undefined` for
remote tables or when details are unavailable.
***
### indexType
```ts
@@ -58,30 +33,6 @@ The type of the index
***
### indexUuid?
```ts
optional indexUuid: string;
```
The UUID of the first segment of the index.
`undefined` for remote tables, which do not yet surface this.
***
### indexVersion?
```ts
optional indexVersion: number;
```
The on-disk index format version.
`undefined` for remote tables.
***
### name
```ts
@@ -89,63 +40,3 @@ name: string;
```
The name of the index
***
### numIndexedRows?
```ts
optional numIndexedRows: number;
```
The number of rows indexed, across all segments.
`undefined` for remote tables.
***
### numSegments?
```ts
optional numSegments: number;
```
The number of segments that make up the index.
`undefined` for remote tables.
***
### numUnindexedRows?
```ts
optional numUnindexedRows: number;
```
The number of rows not yet covered by this index.
`undefined` for remote tables.
***
### sizeBytes?
```ts
optional sizeBytes: number;
```
The total size in bytes of all index files across all segments.
`undefined` for remote tables or indices without size tracking.
***
### typeUrl?
```ts
optional typeUrl: string;
```
The protobuf type URL, a precise type identifier for the index.
`undefined` for remote tables.

View File

@@ -30,6 +30,17 @@ The type of the index
***
### loss?
```ts
optional loss: number;
```
The KMeans loss value of the index,
it is only present for vector indices.
***
### numIndexedRows
```ts

View File

@@ -11,10 +11,7 @@ Specification selecting Lance's MemWAL LSM-style write path for
`specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`,
`column` and `numBuckets` are required; for `"identity"`, `column` is
required and must be a deterministic function of the unenforced primary
key (every row with a given primary key must always produce the same
`column` value, or upserts of that key can land in different shards and a
stale version can win).
required.
## Properties

View File

@@ -32,14 +32,6 @@ numInsertedRows: number;
***
### numRows
```ts
numRows: number;
```
***
### numUpdatedRows
```ts

View File

@@ -8,18 +8,6 @@
## Properties
### branch?
```ts
optional branch: string;
```
Open the table scoped to this branch instead of the default branch.
Reads and writes on the returned table operate in the branch's context.
***
### ~~indexCacheSize?~~
```ts
@@ -55,17 +43,3 @@ Options already set on the connection will be inherited by the table,
but can be overridden here.
The available options are described at https://docs.lancedb.com/storage/
***
### version?
```ts
optional version: number;
```
Open the table pinned to this version, producing a read-only view.
Composes with [OpenTableOptions.branch](OpenTableOptions.md#branch): when both are set, opens
that branch at the version; otherwise opens `main` at the version. Call
`checkoutLatest` to return to a writable state.

View File

@@ -1,15 +0,0 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / UpdateFieldMetadataResult
# Interface: UpdateFieldMetadataResult
## Properties
### version
```ts
version: number;
```

View File

@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.30.1-beta.2</version>
<version>0.30.0-final.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

View File

@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.30.1-beta.2</version>
<version>0.30.0-final.0</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<arrow.version>15.0.0</arrow.version>
<lance-core.version>8.0.0-beta.12</lance-core.version>
<lance-core.version>7.0.0</lance-core.version>
<spotless.skip>false</spotless.skip>
<spotless.version>2.30.0</spotless.version>
<spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>

View File

@@ -1,7 +1,7 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.30.1-beta.2"
version = "0.30.0"
publish = false
license.workspace = true
description.workspace = true
@@ -25,12 +25,8 @@ lancedb = { path = "../rust/lancedb", default-features = false }
lance-namespace.workspace = true
napi = { version = "3.8.3", default-features = false, features = [
"napi9",
"async",
"chrono_date",
"serde-json",
"async"
] }
chrono = { version = "0.4", default-features = false, features = ["clock"] }
serde_json = "1"
napi-derive = "3.5.2"
# Prevent dynamic linking of lzma, which comes from datafusion
lzma-sys = { version = "0.1", features = ["static"] }

View File

@@ -191,40 +191,6 @@ describe("remote connection", () => {
);
});
it("supports version time-travel and branches on remote", async () => {
await withMockDatabase(
(req, res) => {
const body = req.url?.includes("/branches/list")
? JSON.stringify({
branches: {
exp: { parentVersion: 1, createAt: 1, manifestSize: 1 },
},
})
: JSON.stringify({ name: "t", version: 2, schema: { fields: [] } });
res.writeHead(200, { "Content-Type": "application/json" }).end(body);
},
async (db) => {
// version-only (and "main" + version) time-travel the main chain
const v2 = await db.openTable("t", undefined, { version: 2 });
expect(v2.currentBranch()).toBeNull();
const mainV2 = await db.openTable("t", undefined, {
branch: "main",
version: 2,
});
expect(mainV2.currentBranch()).toBeNull();
// a non-main branch opens a handle scoped to that branch
const exp = await db.openTable("t", undefined, { branch: "exp" });
expect(exp.currentBranch()).toBe("exp");
const expV2 = await db.openTable("t", undefined, {
branch: "exp",
version: 2,
});
expect(expV2.currentBranch()).toBe("exp");
},
);
});
describe("TlsConfig", () => {
it("should create TlsConfig with all fields", () => {
const tlsConfig: TlsConfig = {

View File

@@ -85,140 +85,6 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
await expect(table.countRows()).resolves.toBe(3);
});
it("should support branches", async () => {
await table.add([{ id: 1 }]);
expect(await table.countRows()).toBe(1);
expect(table.currentBranch()).toBeNull();
// fork an isolated, writable branch from main
const branch = await (await table.branches()).create("exp");
expect(branch.currentBranch()).toBe("exp");
expect(await branch.countRows()).toBe(1);
await branch.add([{ id: 2 }]);
expect(await branch.countRows()).toBe(2);
// main is untouched by branch writes
expect(await table.countRows()).toBe(1);
// listed, with main (null) as the parent
const list = await (await table.branches()).list();
expect(Object.keys(list)).toContain("exp");
expect(list["exp"].parentBranch).toBeNull();
// fromRef="main" is equivalent to the default
await (await table.branches()).create("exp2", "main");
const list2 = await (await table.branches()).list();
expect(list2["exp2"].parentBranch).toBeNull();
// checkout returns a handle scoped to the branch's latest
const checkedOut = await (await table.branches()).checkout("exp");
expect(checkedOut.currentBranch()).toBe("exp");
expect(await checkedOut.countRows()).toBe(2);
// delete removes it
await (await table.branches()).delete("exp");
await (await table.branches()).delete("exp2");
const after = await (await table.branches()).list();
expect(Object.keys(after)).not.toContain("exp");
});
it("should open a branch via open_table", async () => {
const db = await connect(tmpDir.name);
await table.add([{ id: 1 }]);
const branch = await (await table.branches()).create("exp");
await branch.add([{ id: 2 }]);
// open_table(..., { branch }) returns a handle scoped to the branch
const opened = await db.openTable("some_table", undefined, {
branch: "exp",
});
expect(await opened.countRows()).toBe(2);
// opening without branch still tracks main
expect(await (await db.openTable("some_table")).countRows()).toBe(1);
});
it("should open a branch at a version isolated from main and HEAD", async () => {
const db = await connect(tmpDir.name);
// main: a single fork-point row
const t = await db.createTable("bv_table", [{ id: 0 }]);
const mainV1 = await t.version();
// fork "exp", then advance exp AND main independently past the fork so
// they diverge while sharing version numbers
const exp = await (await t.branches()).create("exp");
await exp.add([{ id: 1 }]); // exp: {0, 1}
const expV2 = await exp.version();
await exp.add([{ id: 2 }]); // exp HEAD: {0, 1, 2}
await t.add([{ id: 100 }, { id: 101 }, { id: 102 }]); // main HEAD: {0,100,101,102}
expect(await t.version()).toBe(expV2);
// open exp at the shared version: the data must be exp's, not main's.
// count alone cannot prove this (main@v2 also exists), so assert
// provenance by content.
const pinned = await db.openTable("bv_table", undefined, {
branch: "exp",
version: expV2,
});
expect(await pinned.countRows()).toBe(2); // not exp HEAD (3), not main@v2 (4)
expect(await pinned.countRows("id = 1")).toBe(1); // exp's post-fork row
expect(await pinned.countRows("id = 100")).toBe(0); // main's rows invisible
// the same coordinate is reachable directly via branches().checkout(name, version)
const pinnedDirect = await (await t.branches()).checkout("exp", expV2);
expect(await pinnedDirect.countRows()).toBe(2);
// the HEADs are unaffected
expect(
await (
await db.openTable("bv_table", undefined, { branch: "exp" })
).countRows(),
).toBe(3);
expect(await (await db.openTable("bv_table")).countRows()).toBe(4);
// version-only (no branch) time-travels main itself: its fork-point
// version holds only main's first row, and the shared version number
// resolves to main's data, not the branch's ("opens main at the version")
const oldMain = await db.openTable("bv_table", undefined, {
version: mainV1,
});
expect(await oldMain.countRows()).toBe(1);
const sharedOnMain = await db.openTable("bv_table", undefined, {
version: expV2,
});
expect(await sharedOnMain.countRows()).toBe(4); // main@v2, not exp@v2 (2)
// detached head: writing to a pinned version is rejected
await expect(pinned.add([{ id: 9 }])).rejects.toThrow(
/cannot be modified/,
);
// a nonexistent version is rejected -- on main, and on a branch (a
// distinct resolution path, on the branch's manifests)
await expect(
db.openTable("bv_table", undefined, { version: 9999 }),
).rejects.toThrow();
await expect(
db.openTable("bv_table", undefined, { branch: "exp", version: 9999 }),
).rejects.toThrow();
// checkoutLatest re-attaches the pinned handle to the BRANCH's HEAD
// (writable again), not main's HEAD (4), and not staying pinned (2)
await pinned.checkoutLatest();
expect(await pinned.countRows()).toBe(3); // exp HEAD
await pinned.add([{ id: 3 }]);
expect(await pinned.countRows()).toBe(4); // writable again
});
it("rejects invalid branch inputs", async () => {
const branches = await table.branches();
await expect(branches.create("")).rejects.toThrow("non-empty");
await expect(branches.checkout("")).rejects.toThrow("non-empty");
await expect(branches.delete("")).rejects.toThrow("non-empty");
await expect(branches.create("bad", "main", -1)).rejects.toThrow(
"non-negative",
);
});
it("should show table stats", async () => {
await table.add([{ id: 1 }, { id: 2 }]);
await table.add([{ id: 1 }]);
@@ -849,15 +715,13 @@ describe("When creating an index", () => {
expect(fs.readdirSync(indexDir)).toHaveLength(1);
const indices = await tbl.listIndices();
expect(indices.length).toBe(1);
expect(indices[0]).toEqual(
expect.objectContaining({
name: "vec_idx",
indexType: "IvfPq",
columns: ["vec"],
}),
);
expect(indices[0]).toEqual({
name: "vec_idx",
indexType: "IvfPq",
columns: ["vec"],
});
const stats = await tbl.indexStats("vec_idx");
expect(stats).toBeDefined();
expect(stats?.loss).toBeDefined();
// Search without specifying the column
let rst = await tbl
@@ -917,22 +781,10 @@ describe("When creating an index", () => {
expect(indices2.length).toBe(0);
});
it("should preserve canonical nested field paths across index lifecycle", async () => {
it("should create and search a nested vector index", async () => {
const db = await connect(tmpDir.name);
const nestedSchema = new Schema([
new Field("rowId", new Int32(), true),
new Field("row-id", new Int32(), true),
new Field("userId", new Int32(), true),
new Field(
"metadata",
new Struct([new Field("user_id", new Int32(), true)]),
true,
),
new Field(
"MetaData",
new Struct([new Field("userId", new Int32(), true)]),
true,
),
new Field("id", new Int32(), true),
new Field(
"image",
new Struct([
@@ -944,146 +796,27 @@ describe("When creating an index", () => {
]),
true,
),
new Field(
"payload",
new Struct([new Field("text", new Utf8(), true)]),
true,
),
new Field(
"meta-data",
new Struct([new Field("user-id", new Int32(), true)]),
true,
),
new Field(
"literal",
new Struct([new Field("a.b", new Int32(), true)]),
true,
),
]);
const nestedTable = await db.createTable(
"nested_field_index_lifecycle",
"nested_vector",
makeArrowTable(
Array.from({ length: 300 }, (_, rowId) => ({
rowId,
"row-id": rowId,
userId: rowId,
metadata: { ["user_id"]: rowId },
["MetaData"]: { userId: rowId },
image: { embedding: [rowId, rowId + 1] },
payload: { text: `document ${rowId}` },
"meta-data": { "user-id": rowId },
literal: { "a.b": rowId },
Array.from({ length: 300 }, (_, id) => ({
id,
image: { embedding: [id, id + 1] },
})),
{ schema: nestedSchema },
),
);
await nestedTable.createIndex("rowId", {
config: Index.btree(),
name: "row_id_idx",
});
await nestedTable.createIndex("`row-id`", {
config: Index.btree(),
name: "row_dash_id_idx",
});
await nestedTable.createIndex("userId", {
config: Index.btree(),
name: "top_user_id_idx",
});
await nestedTable.createIndex("metadata.user_id", {
config: Index.btree(),
name: "nested_user_id_idx",
});
await nestedTable.createIndex("MetaData.userId", {
config: Index.btree(),
name: "mixed_case_metadata_user_id_idx",
});
await nestedTable.createIndex("`meta-data`.`user-id`", {
config: Index.btree(),
name: "escaped_names_idx",
});
await nestedTable.createIndex("literal.`a.b`", {
config: Index.btree(),
name: "literal_dot_idx",
});
await nestedTable.createIndex("image.embedding", {
name: "image_embedding_idx",
});
await nestedTable.createIndex("payload.text", {
config: Index.fts({ withPosition: false }),
name: "payload_text_idx",
});
const indices = await nestedTable.listIndices();
expect(indices).toEqual(
expect.arrayContaining([
expect.objectContaining({
name: "row_id_idx",
indexType: "BTree",
columns: ["rowId"],
}),
expect.objectContaining({
name: "row_dash_id_idx",
indexType: "BTree",
columns: ["`row-id`"],
}),
expect.objectContaining({
name: "top_user_id_idx",
indexType: "BTree",
columns: ["userId"],
}),
expect.objectContaining({
name: "nested_user_id_idx",
indexType: "BTree",
columns: ["metadata.user_id"],
}),
expect.objectContaining({
name: "mixed_case_metadata_user_id_idx",
indexType: "BTree",
columns: ["MetaData.userId"],
}),
expect.objectContaining({
name: "escaped_names_idx",
indexType: "BTree",
columns: ["`meta-data`.`user-id`"],
}),
expect.objectContaining({
name: "literal_dot_idx",
indexType: "BTree",
columns: ["literal.`a.b`"],
}),
expect.objectContaining({
name: "image_embedding_idx",
indexType: "IvfPq",
columns: ["image.embedding"],
}),
expect.objectContaining({
name: "payload_text_idx",
indexType: "FTS",
columns: ["payload.text"],
}),
]),
);
const stats = await nestedTable.indexStats(
"mixed_case_metadata_user_id_idx",
);
expect(stats?.numIndexedRows).toEqual(300);
expect(stats?.indexType).toEqual("BTREE");
const filtered = await nestedTable
.query()
.where("MetaData.userId = 42")
.limit(1)
.toArray();
expect(filtered[0].MetaData.userId).toEqual(42);
const escapedFiltered = await nestedTable
.query()
.where("`row-id` = 43")
.limit(1)
.toArray();
expect(escapedFiltered[0]["row-id"]).toEqual(43);
expect(indices).toContainEqual({
name: "image_embedding_idx",
indexType: "IvfPq",
columns: ["image.embedding"],
});
const explicit = await nestedTable
.query()
@@ -1096,37 +829,7 @@ describe("When creating an index", () => {
.nearestTo([0.0, 1.0])
.limit(1)
.toArray();
expect(inferred[0].rowId).toEqual(explicit[0].rowId);
await nestedTable.add([
{
rowId: 300,
"row-id": 300,
userId: 300,
metadata: { ["user_id"]: 300 },
["MetaData"]: { userId: 300 },
image: { embedding: [300.0, 301.0] },
payload: { text: "document 300" },
"meta-data": { "user-id": 300 },
literal: { "a.b": 300 },
},
]);
await nestedTable.optimize();
const indicesAfterOptimize = await nestedTable.listIndices();
expect(indicesAfterOptimize).toEqual(
expect.arrayContaining([
expect.objectContaining({
name: "mixed_case_metadata_user_id_idx",
indexType: "BTree",
columns: ["MetaData.userId"],
}),
expect.objectContaining({
name: "image_embedding_idx",
indexType: "IvfPq",
columns: ["image.embedding"],
}),
]),
);
expect(inferred[0].id).toEqual(explicit[0].id);
});
it("should report multiple nested vector candidates", async () => {
@@ -1260,13 +963,11 @@ describe("When creating an index", () => {
expect(fs.readdirSync(indexDir)).toHaveLength(1);
const indices = await tbl.listIndices();
expect(indices.length).toBe(1);
expect(indices[0]).toEqual(
expect.objectContaining({
name: "vec_idx",
indexType: "IvfHnswSq",
columns: ["vec"],
}),
);
expect(indices[0]).toEqual({
name: "vec_idx",
indexType: "IvfHnswSq",
columns: ["vec"],
});
// Search without specifying the column
let rst = await tbl
@@ -1439,20 +1140,6 @@ describe("When creating an index", () => {
expect(fs.readdirSync(indexDir)).toHaveLength(1);
});
test("create an FM index", async () => {
// FM-Index accelerates substring search on a string/binary column.
const db = await connect(tmpDir.name);
const fmTbl = await db.createTable("fm_table", [
{ id: 0, text: "hello world" },
{ id: 1, text: "foo bar" },
]);
await fmTbl.createIndex("text", {
config: Index.fm(),
});
const indexDir = path.join(tmpDir.name, "fm_table.lance", "_indices");
expect(fs.readdirSync(indexDir)).toHaveLength(1);
});
test("should be able to get index stats", async () => {
await tbl.createIndex("id");
@@ -1463,6 +1150,7 @@ describe("When creating an index", () => {
expect(stats?.distanceType).toBeUndefined();
expect(stats?.indexType).toEqual("BTREE");
expect(stats?.numIndices).toEqual(1);
expect(stats?.loss).toBeUndefined();
});
test("when getting stats on non-existent index", async () => {
@@ -1612,35 +1300,6 @@ describe("When creating an index", () => {
expect(rst64Query.toString()).toEqual(rst64Search.toString());
expect(rst64Query.numRows).toBe(2);
});
it("should expose rich metadata fields on IndexConfig", async () => {
await tbl.createIndex("id", { config: Index.btree() });
await tbl.createIndex("vec");
const indicesByName = Object.fromEntries(
(await tbl.listIndices()).map((idx) => [idx.name, idx]),
);
const scalarIdx = indicesByName["id_idx"];
expect(scalarIdx).toBeDefined();
expect(typeof scalarIdx.indexUuid).toBe("string");
expect(scalarIdx.numIndexedRows).toBe(300);
expect(scalarIdx.numUnindexedRows).toBe(0);
expect(scalarIdx.numSegments).toBeGreaterThanOrEqual(1);
expect(scalarIdx.sizeBytes).toBeGreaterThan(0);
// Use toString check to avoid cross-realm instanceof failures with native Date objects
expect(Object.prototype.toString.call(scalarIdx.createdAt)).toBe(
"[object Date]",
);
expect((scalarIdx.createdAt as Date).getTime()).toBeGreaterThan(0);
expect(typeof scalarIdx.indexDetails).toBe("object");
const vectorIdx = indicesByName["vec_idx"];
expect(vectorIdx).toBeDefined();
expect(typeof vectorIdx.indexUuid).toBe("string");
expect(vectorIdx.numIndexedRows).toBe(300);
expect(typeof vectorIdx.indexDetails).toBe("object");
});
});
describe("When querying a table", () => {
@@ -1912,33 +1571,6 @@ describe("schema evolution", function () {
expect(await table.schema()).toEqual(expectedSchema3);
});
it("can update field metadata", async function () {
const con = await connect(tmpDir.name);
const table = await con.createTable("fm", [
{ id: 1, category: "a" },
{ id: 2, category: "b" },
]);
const res = await table.updateFieldMetadata([
{ path: "category", metadata: { unit: "label", pii: "false" } },
]);
expect(res).toHaveProperty("version");
expect(res.version).toBe(2);
let cat = (await table.schema()).fields.find((f) => f.name === "category");
expect(cat?.metadata.get("unit")).toBe("label");
expect(cat?.metadata.get("pii")).toBe("false");
// merge: add a key, delete one via null, keep the rest
await table.updateFieldMetadata([
{ path: "category", metadata: { source: "import", pii: null } },
]);
cat = (await table.schema()).fields.find((f) => f.name === "category");
expect(cat?.metadata.get("unit")).toBe("label"); // preserved
expect(cat?.metadata.get("source")).toBe("import"); // added
expect(cat?.metadata.has("pii")).toBe(false); // deleted
});
it("can cast to various types", async function () {
const con = await connect(tmpDir.name);
@@ -2993,97 +2625,3 @@ describe("setLsmWriteSpec / unsetLsmWriteSpec", () => {
).rejects.toThrow();
});
});
describe("LSM merge insert", () => {
let tmpDir: tmp.DirResult;
beforeEach(() => {
tmpDir = tmp.dirSync({ unsafeCleanup: true });
});
afterEach(() => tmpDir.removeCallback());
async function bucketTable(conn: Connection): Promise<Table> {
// The primary key column must be non-nullable.
const table = await conn.createEmptyTable(
"t",
new arrow.Schema([
new arrow.Field("id", new arrow.Utf8(), false),
new arrow.Field("value", new arrow.Float64(), true),
]),
);
await table.add([
{ id: "a", value: 1 },
{ id: "b", value: 2 },
]);
await table.setUnenforcedPrimaryKey("id");
// numBuckets = 1: every row routes to the single bucket.
await table.setLsmWriteSpec({
specType: "bucket",
column: "id",
numBuckets: 1,
});
return table;
}
it("routes merge_insert through the shard writer", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
const res = await table
.mergeInsert("id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute([
{ id: "c", value: 3 },
{ id: "d", value: 4 },
]);
// LSM path: rows go to the MemWAL, so only numRows is populated.
expect(res.numRows).toBe(2);
expect(res.version).toBe(0);
expect(res.numInsertedRows).toBe(0);
await table.closeLsmWriters();
});
it("falls back to the standard path with useLsmWrite(false)", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
const res = await table
.mergeInsert("id")
.whenNotMatchedInsertAll()
.useLsmWrite(false)
.execute([
{ id: "b", value: 9 },
{ id: "e", value: 5 },
]);
// Standard path commits: id="e" inserted ("b" already exists).
expect(res.numInsertedRows).toBe(1);
expect(await table.countRows()).toBe(3);
});
it("supports validateSingleShard(false)", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
const res = await table
.mergeInsert("id")
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.validateSingleShard(false)
.execute([{ id: "f", value: 6 }]);
expect(res.numRows).toBe(1);
});
it("rejects a non-upsert merge under an LSM spec", async () => {
const conn = await connect(tmpDir.name);
const table = await bucketTable(conn);
await expect(
table
.mergeInsert("id")
.whenNotMatchedInsertAll()
.execute([{ id: "g", value: 7 }]),
).rejects.toThrow();
});
});

View File

@@ -84,20 +84,6 @@ export interface CreateTableOptions {
}
export interface OpenTableOptions {
/**
* Open the table scoped to this branch instead of the default branch.
*
* Reads and writes on the returned table operate in the branch's context.
*/
branch?: string;
/**
* Open the table pinned to this version, producing a read-only view.
*
* Composes with {@link OpenTableOptions.branch}: when both are set, opens
* that branch at the version; otherwise opens `main` at the version. Call
* `checkoutLatest` to return to a writable state.
*/
version?: number;
/**
* Configuration for object storage.
*
@@ -497,20 +483,7 @@ export class LocalConnection extends Connection {
options?.indexCacheSize,
);
let table: Table = new LocalTable(innerTable);
// "main" is the default branch, so treat it as no branch. On a real branch,
// scope and pin in one step (yielding "version V of branch B"); otherwise
// pin the version, if any, against main.
const branch =
options?.branch != null && options.branch !== "main"
? options.branch
: undefined;
if (branch != null) {
table = await (await table.branches()).checkout(branch, options?.version);
} else if (options?.version != null) {
await table.checkout(options.version);
}
return table;
return new LocalTable(innerTable);
}
async cloneTable(

View File

@@ -38,12 +38,10 @@ export {
FragmentSummaryStats,
Tags,
TagContents,
BranchContents,
MergeResult,
AddResult,
AddColumnsResult,
AlterColumnsResult,
UpdateFieldMetadataResult,
DeleteResult,
DropColumnsResult,
UpdateResult,
@@ -112,7 +110,6 @@ export {
export {
Table,
Branches,
AddDataOptions,
UpdateOptions,
OptimizeOptions,
@@ -120,7 +117,6 @@ export {
WriteProgress,
LsmWriteSpec,
ColumnAlteration,
FieldMetadataUpdate,
} from "./table";
export {

View File

@@ -702,17 +702,6 @@ export class Index {
return new Index(LanceDbIndex.labelList());
}
/**
* Create an FM-Index.
*
* An FM-Index is a scalar index on string or binary columns that accelerates
* substring search, i.e. `contains(col, 'needle')`. Unlike the tokenized
* full-text-search index, it matches arbitrary substrings of the raw bytes.
*/
static fm() {
return new Index(LanceDbIndex.fm());
}
/**
* Create a full text search index
*

View File

@@ -87,41 +87,6 @@ export class MergeInsertBuilder {
this.#schema,
);
}
/**
* Controls whether the merge uses the MemWAL LSM write path.
*
* By default (unset), a `mergeInsert` on a table with an LSM write spec is
* routed through Lance's MemWAL shard writer, and a table without one uses
* the standard path. Pass `false` to force the standard path even when a
* spec is set. Pass `true` to require a spec — `mergeInsert` rejects if none
* is installed.
*
* @param useLsmWrite - Whether to use the LSM write path.
*/
useLsmWrite(useLsmWrite: boolean): MergeInsertBuilder {
return new MergeInsertBuilder(
this.#native.useLsmWrite(useLsmWrite),
this.#schema,
);
}
/**
* Controls how an LSM merge checks that its input targets a single shard.
*
* When a table has an LSM write spec, every row in a `mergeInsert` call must
* route to the same shard. When `true` (the default), every row is inspected
* to verify this. When `false`, only the first row is inspected and the
* shard it routes to is used for the whole input — a faster path for callers
* that have already pre-sharded their input. Has no effect on tables without
* an LSM write spec.
*
* @param validateSingleShard - Whether to check every row routes to one shard. Defaults to `true`.
*/
validateSingleShard(validateSingleShard: boolean): MergeInsertBuilder {
return new MergeInsertBuilder(
this.#native.validateSingleShard(validateSingleShard),
this.#schema,
);
}
/**
* Executes the merge insert operation
*

View File

@@ -25,16 +25,13 @@ import {
AddColumnsSql,
AddResult,
AlterColumnsResult,
BranchContents,
DeleteResult,
DropColumnsResult,
IndexConfig,
IndexStatistics,
Branches as NativeBranches,
OptimizeStats,
TableStatistics,
Tags,
UpdateFieldMetadataResult,
UpdateResult,
Table as _NativeTable,
} from "./native";
@@ -164,10 +161,7 @@ export interface Version {
*
* `specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`,
* `column` and `numBuckets` are required; for `"identity"`, `column` is
* required and must be a deterministic function of the unenforced primary
* key (every row with a given primary key must always produce the same
* `column` value, or upserts of that key can land in different shards and a
* stale version can win).
* required.
*/
export interface LsmWriteSpec {
/** One of `"bucket"`, `"identity"`, or `"unsharded"`. */
@@ -511,18 +505,6 @@ export abstract class Table {
abstract alterColumns(
columnAlterations: ColumnAlteration[],
): Promise<AlterColumnsResult>;
/**
* Update per-field (column) metadata.
* @param {FieldMetadataUpdate[]} updates One or more per-field updates. Each
* update's metadata is merged into the field's existing metadata by default;
* a value of `null` deletes that key, and `replace: true` swaps the whole map.
* @returns {Promise<UpdateFieldMetadataResult>} resolves to the new table version.
*/
abstract updateFieldMetadata(
updates: FieldMetadataUpdate[],
): Promise<UpdateFieldMetadataResult>;
/**
* Drop one or more columns from the dataset
*
@@ -585,16 +567,6 @@ export abstract class Table {
* @returns {Promise<void>}
*/
abstract unsetLsmWriteSpec(): Promise<void>;
/**
* Drain and close any cached MemWAL shard writers held for this table.
*
* When an {@link LsmWriteSpec} is installed, `mergeInsert` opens MemWAL
* shard writers and caches them for reuse across calls. This closes them,
* flushing pending data; writers reopen lazily on the next `mergeInsert`.
* It is a no-op when no writers are cached.
* @returns {Promise<void>}
*/
abstract closeLsmWriters(): Promise<void>;
/** Retrieve the version of the table */
abstract version(): Promise<number>;
@@ -655,22 +627,6 @@ export abstract class Table {
*/
abstract tags(): Promise<Tags>;
/**
* Get the branch manager for this table.
*
* Branches are isolated, writable lines of history forked from another
* branch (or version). Writes on a branch do not affect `main`.
*/
abstract branches(): Promise<Branches>;
/**
* The branch this table handle is scoped to, or `null` for the main branch.
*
* A handle returned by {@link Branches.create} or {@link Branches.checkout}
* reports the branch it targets; a handle opened normally reports `null`.
*/
abstract currentBranch(): string | null;
/**
* Restore the table to the currently checked out version
*
@@ -1068,12 +1024,6 @@ export class LocalTable extends Table {
return await this.inner.alterColumns(processedAlterations);
}
async updateFieldMetadata(
updates: FieldMetadataUpdate[],
): Promise<UpdateFieldMetadataResult> {
return await this.inner.updateFieldMetadata(updates);
}
async dropColumns(columnNames: string[]): Promise<DropColumnsResult> {
return await this.inner.dropColumns(columnNames);
}
@@ -1091,10 +1041,6 @@ export class LocalTable extends Table {
return await this.inner.unsetLsmWriteSpec();
}
async closeLsmWriters(): Promise<void> {
return await this.inner.closeLsmWriters();
}
async version(): Promise<number> {
return await this.inner.version();
}
@@ -1126,14 +1072,6 @@ export class LocalTable extends Table {
return await this.inner.tags();
}
async branches(): Promise<Branches> {
return new Branches(await this.inner.branches());
}
currentBranch(): string | null {
return this.inner.currentBranch() ?? null;
}
async optimize(options?: Partial<OptimizeOptions>): Promise<OptimizeStats> {
let cleanupOlderThanMs;
if (
@@ -1248,73 +1186,3 @@ export interface ColumnAlteration {
/** Set the new nullability. Note that a nullable column cannot be made non-nullable. */
nullable?: boolean;
}
/** A per-field metadata update, addressed by dot-path. */
export interface FieldMetadataUpdate {
/**
* Dot-separated path to the field. For a top-level column this is just its
* name; for a nested field it's the path, e.g. "a.b.c".
*/
path: string;
/**
* Metadata key/value pairs. Merged into the field's existing metadata by
* default; a value of `null` deletes that key.
*/
metadata: Record<string, string | null>;
/** If true, replace the field's entire metadata map instead of merging. */
replace?: boolean;
}
/**
* Branch manager for a {@link Table}.
*
* Unlike tags, `create` and `checkout` return a new {@link Table} handle scoped
* to the branch; writes on it do not affect `main`.
*/
export class Branches {
#inner: NativeBranches;
/**
* Construct a Branches manager. Internal use only.
* @hidden
*/
constructor(inner: NativeBranches) {
this.#inner = inner;
}
/** List all branches, mapping name to branch metadata. */
async list(): Promise<Record<string, BranchContents>> {
return await this.#inner.list();
}
/**
* Create a branch and return a handle scoped to it.
*
* @param name Name of the new branch.
* @param fromRef Source branch to fork from. Defaults to `main`.
* @param fromVersion A specific version on `fromRef`. Defaults to latest.
*/
async create(
name: string,
fromRef?: string,
fromVersion?: number,
): Promise<Table> {
return new LocalTable(await this.#inner.create(name, fromRef, fromVersion));
}
/**
* Check out an existing branch and return a handle scoped to it.
*
* With `version` set, the returned handle is pinned to that version of the
* branch (a read-only, detached view); otherwise it tracks the branch's
* latest and stays writable.
*/
async checkout(name: string, version?: number): Promise<Table> {
return new LocalTable(await this.#inner.checkout(name, version));
}
/** Delete a branch. */
async delete(name: string): Promise<void> {
return await this.#inner.delete(name);
}
}

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": [
"win32"
],

View File

@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",

View File

@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.30.1-beta.2",
"version": "0.30.0",
"cpu": [
"x64",
"arm64"
@@ -26,7 +26,7 @@
"@aws-sdk/client-s3": "3.1003.0",
"@biomejs/biome": "^1.7.3",
"@jest/globals": "^29.7.0",
"@napi-rs/cli": "3.7.0",
"@napi-rs/cli": "3.5.1",
"@types/axios": "^0.14.0",
"@types/jest": "^29.1.2",
"@types/node": "22.7.4",
@@ -2942,9 +2942,9 @@
}
},
"node_modules/@napi-rs/cli": {
"version": "3.7.0",
"resolved": "https://registry.npmjs.org/@napi-rs/cli/-/cli-3.7.0.tgz",
"integrity": "sha512-3d3+rmxlOIV/G1zPWeX4PCxuYnhcCQM2BvY9rtimC8RO0dFR9gtYP+Grov+WoduZtfWRj5N1XvytWeRxxCk5zw==",
"version": "3.5.1",
"resolved": "https://registry.npmjs.org/@napi-rs/cli/-/cli-3.5.1.tgz",
"integrity": "sha512-XBfLQRDcB3qhu6bazdMJsecWW55kR85l5/k0af9BIBELXQSsCFU0fzug7PX8eQp6vVdm7W/U3z6uP5WmITB2Gw==",
"dev": true,
"license": "MIT",
"dependencies": {
@@ -2954,7 +2954,7 @@
"@octokit/rest": "^22.0.1",
"clipanion": "^4.0.0-rc.4",
"colorette": "^2.0.20",
"emnapi": "^1.10.0",
"emnapi": "^1.7.1",
"es-toolkit": "^1.41.0",
"js-yaml": "^4.1.0",
"obug": "^2.0.0",

View File

@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.30.1-beta.2",
"version": "0.30.0",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",
@@ -43,7 +43,7 @@
"@aws-sdk/client-s3": "3.1003.0",
"@biomejs/biome": "^1.7.3",
"@jest/globals": "^29.7.0",
"@napi-rs/cli": "3.7.0",
"@napi-rs/cli": "3.5.1",
"@types/axios": "^0.14.0",
"@types/jest": "^29.1.2",
"@types/node": "22.7.4",

10
nodejs/pnpm-lock.yaml generated
View File

@@ -31,8 +31,8 @@ importers:
specifier: ^29.7.0
version: 29.7.0
'@napi-rs/cli':
specifier: 3.7.0
version: 3.7.0(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)
specifier: 3.5.1
version: 3.5.1(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)
'@types/axios':
specifier: ^0.14.0
version: 0.14.4
@@ -887,8 +887,8 @@ packages:
'@jridgewell/trace-mapping@0.3.31':
resolution: {integrity: sha512-zzNR+SdQSDJzc8joaeP8QQoCQr8NuYx2dIIytl1QeBEZHJ9uW6hebsrYgbz8hJwUQao3TWCMtmfV8Nu1twOLAw==}
'@napi-rs/cli@3.7.0':
resolution: {integrity: sha512-3d3+rmxlOIV/G1zPWeX4PCxuYnhcCQM2BvY9rtimC8RO0dFR9gtYP+Grov+WoduZtfWRj5N1XvytWeRxxCk5zw==}
'@napi-rs/cli@3.5.1':
resolution: {integrity: sha512-XBfLQRDcB3qhu6bazdMJsecWW55kR85l5/k0af9BIBELXQSsCFU0fzug7PX8eQp6vVdm7W/U3z6uP5WmITB2Gw==}
engines: {node: '>= 16'}
hasBin: true
peerDependencies:
@@ -4582,7 +4582,7 @@ snapshots:
'@jridgewell/resolve-uri': 3.1.2
'@jridgewell/sourcemap-codec': 1.5.5
'@napi-rs/cli@3.7.0(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)':
'@napi-rs/cli@3.5.1(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)':
dependencies:
'@inquirer/prompts': 8.4.3(@types/node@22.7.4)
'@napi-rs/cross-toolchain': 1.0.3(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)

View File

@@ -4,7 +4,7 @@
use std::sync::Mutex;
use lancedb::index::Index as LanceDbIndex;
use lancedb::index::scalar::{BTreeIndexBuilder, FmIndexBuilder, FtsIndexBuilder};
use lancedb::index::scalar::{BTreeIndexBuilder, FtsIndexBuilder};
use lancedb::index::vector::{
IvfFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder,
IvfRqIndexBuilder,
@@ -143,13 +143,6 @@ impl Index {
}
}
#[napi(factory)]
pub fn fm() -> Self {
Self {
inner: Mutex::new(Some(LanceDbIndex::Fm(FmIndexBuilder::default()))),
}
}
#[napi(factory)]
#[allow(clippy::too_many_arguments)]
pub fn fts(

View File

@@ -50,20 +50,6 @@ impl NativeMergeInsertBuilder {
this
}
#[napi]
pub fn use_lsm_write(&self, use_lsm_write: bool) -> Self {
let mut this = self.clone();
this.inner.use_lsm_write(use_lsm_write);
this
}
#[napi]
pub fn validate_single_shard(&self, validate_single_shard: bool) -> Self {
let mut this = self.clone();
this.inner.validate_single_shard(validate_single_shard);
this
}
#[napi(catch_unwind)]
pub async fn execute(&self, buf: Buffer) -> napi::Result<MergeResult> {
let data = ipc_file_to_batches(buf.to_vec())

View File

@@ -3,13 +3,10 @@
use std::collections::HashMap;
use chrono::{DateTime, Utc};
use lancedb::ipc::{ipc_file_to_batches, ipc_file_to_schema};
use lancedb::table::{
AddDataMode, ColumnAlteration as LanceColumnAlteration, Duration,
FieldMetadataUpdate as LanceFieldMetadataUpdate, NewColumnTransform, OptimizeAction,
OptimizeOptions, Ref, Table as LanceDbTable,
AddDataMode, ColumnAlteration as LanceColumnAlteration, Duration, NewColumnTransform,
OptimizeAction, OptimizeOptions, Table as LanceDbTable,
};
use napi::bindgen_prelude::*;
use napi::threadsafe_function::{ThreadsafeFunction, ThreadsafeFunctionCallMode};
@@ -358,23 +355,6 @@ impl Table {
Ok(res.into())
}
#[napi(catch_unwind)]
pub async fn update_field_metadata(
&self,
updates: Vec<FieldMetadataUpdate>,
) -> napi::Result<UpdateFieldMetadataResult> {
let updates = updates
.into_iter()
.map(LanceFieldMetadataUpdate::from)
.collect::<Vec<_>>();
let res = self
.inner_ref()?
.update_field_metadata(&updates)
.await
.default_error()?;
Ok(res.into())
}
#[napi(catch_unwind)]
pub async fn drop_columns(&self, columns: Vec<String>) -> napi::Result<DropColumnsResult> {
let col_refs = columns.iter().map(String::as_str).collect::<Vec<_>>();
@@ -411,11 +391,6 @@ impl Table {
.default_error()
}
#[napi(catch_unwind)]
pub async fn close_lsm_writers(&self) -> napi::Result<()> {
self.inner_ref()?.close_lsm_writers().await.default_error()
}
#[napi(catch_unwind)]
pub async fn version(&self) -> napi::Result<i64> {
self.inner_ref()?
@@ -480,19 +455,6 @@ impl Table {
})
}
#[napi(catch_unwind)]
pub async fn branches(&self) -> napi::Result<Branches> {
Ok(Branches {
inner: self.inner_ref()?.clone(),
})
}
/// The branch this handle is scoped to, or `null` for the main branch.
#[napi]
pub fn current_branch(&self) -> napi::Result<Option<String>> {
Ok(self.inner_ref()?.current_branch())
}
#[napi(catch_unwind)]
pub async fn optimize(
&self,
@@ -610,43 +572,6 @@ pub struct IndexConfig {
/// Currently this is always an array of size 1. In the future there may
/// be more columns to represent composite indices.
pub columns: Vec<String>,
/// The UUID of the first segment of the index.
///
/// `undefined` for remote tables, which do not yet surface this.
pub index_uuid: Option<String>,
/// The protobuf type URL, a precise type identifier for the index.
///
/// `undefined` for remote tables.
pub type_url: Option<String>,
/// When the index was created.
///
/// `undefined` for remote tables or indices created before timestamps were tracked.
pub created_at: Option<DateTime<Utc>>,
/// The number of rows indexed, across all segments.
///
/// `undefined` for remote tables.
pub num_indexed_rows: Option<i64>,
/// The number of rows not yet covered by this index.
///
/// `undefined` for remote tables.
pub num_unindexed_rows: Option<i64>,
/// The total size in bytes of all index files across all segments.
///
/// `undefined` for remote tables or indices without size tracking.
pub size_bytes: Option<i64>,
/// The number of segments that make up the index.
///
/// `undefined` for remote tables.
pub num_segments: Option<i32>,
/// The on-disk index format version.
///
/// `undefined` for remote tables.
pub index_version: Option<i32>,
/// Index-type-specific details parsed as a JavaScript object.
///
/// Falls back to a raw string if JSON parsing fails. `undefined` for
/// remote tables or when details are unavailable.
pub index_details: Option<serde_json::Value>,
}
impl From<lancedb::index::IndexConfig> for IndexConfig {
@@ -656,17 +581,6 @@ impl From<lancedb::index::IndexConfig> for IndexConfig {
index_type,
columns: value.columns,
name: value.name,
index_uuid: value.index_uuid,
type_url: value.type_url,
created_at: value.created_at,
num_indexed_rows: value.num_indexed_rows.map(|n| n as i64),
num_unindexed_rows: value.num_unindexed_rows.map(|n| n as i64),
size_bytes: value.size_bytes.map(|n| n as i64),
num_segments: value.num_segments.map(|n| n as i32),
index_version: value.index_version,
index_details: value
.index_details
.and_then(|s| serde_json::from_str(&s).ok()),
}
}
}
@@ -828,29 +742,6 @@ pub struct ColumnAlteration {
pub nullable: Option<bool>,
}
/// A per-field metadata update, addressed by dot-path. Merges into the field's
/// existing metadata by default; a `null` value deletes a key, and `replace`
/// swaps the field's entire metadata map.
#[napi(object)]
pub struct FieldMetadataUpdate {
/// Dot-separated path to the field (e.g. "embedding" or "a.b.c").
pub path: String,
/// Metadata keys to set; a `null` value deletes that key.
pub metadata: HashMap<String, Option<String>>,
/// If true, replace the field's entire metadata map instead of merging.
pub replace: Option<bool>,
}
impl From<FieldMetadataUpdate> for LanceFieldMetadataUpdate {
fn from(js: FieldMetadataUpdate) -> Self {
Self {
path: js.path,
metadata: js.metadata,
replace: js.replace.unwrap_or(false),
}
}
}
impl TryFrom<ColumnAlteration> for LanceColumnAlteration {
type Error = String;
fn try_from(js: ColumnAlteration) -> std::result::Result<Self, Self::Error> {
@@ -901,6 +792,9 @@ pub struct IndexStatistics {
pub distance_type: Option<String>,
/// The number of parts this index is split into.
pub num_indices: Option<u32>,
/// The KMeans loss value of the index,
/// it is only present for vector indices.
pub loss: Option<f64>,
}
impl From<lancedb::index::IndexStatistics> for IndexStatistics {
fn from(value: lancedb::index::IndexStatistics) -> Self {
@@ -910,6 +804,7 @@ impl From<lancedb::index::IndexStatistics> for IndexStatistics {
index_type: value.index_type.to_string(),
distance_type: value.distance_type.map(|d| d.to_string()),
num_indices: value.num_indices,
loss: value.loss,
}
}
}
@@ -1045,7 +940,6 @@ pub struct MergeResult {
pub num_updated_rows: i64,
pub num_deleted_rows: i64,
pub num_attempts: i64,
pub num_rows: i64,
}
impl From<lancedb::table::MergeResult> for MergeResult {
@@ -1056,7 +950,6 @@ impl From<lancedb::table::MergeResult> for MergeResult {
num_updated_rows: value.num_updated_rows as i64,
num_deleted_rows: value.num_deleted_rows as i64,
num_attempts: value.num_attempts as i64,
num_rows: value.num_rows as i64,
}
}
}
@@ -1087,19 +980,6 @@ impl From<lancedb::table::AlterColumnsResult> for AlterColumnsResult {
}
}
#[napi(object)]
pub struct UpdateFieldMetadataResult {
pub version: i64,
}
impl From<lancedb::table::UpdateFieldMetadataResult> for UpdateFieldMetadataResult {
fn from(value: lancedb::table::UpdateFieldMetadataResult) -> Self {
Self {
version: value.version as i64,
}
}
}
#[napi(object)]
pub struct DropColumnsResult {
pub version: i64,
@@ -1119,13 +999,6 @@ pub struct TagContents {
pub manifest_size: i64,
}
#[napi]
pub struct BranchContents {
pub parent_branch: Option<String>,
pub parent_version: i64,
pub manifest_size: i64,
}
#[napi]
pub struct Tags {
inner: LanceDbTable,
@@ -1194,75 +1067,3 @@ impl Tags {
.default_error()
}
}
#[napi]
pub struct Branches {
inner: LanceDbTable,
}
#[napi]
impl Branches {
#[napi]
pub async fn list(&self) -> napi::Result<HashMap<String, BranchContents>> {
let branches = self.inner.list_branches().await.default_error()?;
let result = branches
.into_iter()
.map(|(k, v)| {
(
k,
BranchContents {
parent_branch: v.parent_branch,
parent_version: v.parent_version as i64,
manifest_size: v.manifest_size as i64,
},
)
})
.collect();
Ok(result)
}
#[napi]
pub async fn create(
&self,
name: String,
from_ref: Option<String>,
from_version: Option<i64>,
) -> napi::Result<Table> {
let from_ref = from_ref.filter(|b| b != "main");
let from_version = from_version
.map(|v| {
u64::try_from(v).map_err(|_| {
napi::Error::from_reason("from_version must be a non-negative integer")
})
})
.transpose()?;
let from = Ref::Version(from_ref, from_version);
let table = self
.inner
.create_branch(&name, from)
.await
.default_error()?;
Ok(Table::new(table))
}
#[napi]
pub async fn checkout(&self, name: String, version: Option<i64>) -> napi::Result<Table> {
let version = version
.map(|v| {
u64::try_from(v)
.map_err(|_| napi::Error::from_reason("version must be a non-negative integer"))
})
.transpose()?;
let table = self
.inner
.checkout_branch(&name, version)
.await
.default_error()?;
Ok(Table::new(table))
}
#[napi]
pub async fn delete(&self, name: String) -> napi::Result<()> {
self.inner.delete_branch(&name).await.default_error()
}
}

View File

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.33.1-beta.2"
current_version = "0.33.0"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb-python"
version = "0.33.1-beta.2"
version = "0.33.0"
publish = false
edition.workspace = true
description = "Python bindings for LanceDB"
@@ -26,8 +26,7 @@ lance-namespace-impls.workspace = true
lance-io.workspace = true
env_logger.workspace = true
log.workspace = true
pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39", "chrono"] }
chrono = { version = "0.4", default-features = false, features = ["clock"] }
pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39"] }
pyo3-async-runtimes = { version = "0.28", features = [
"attributes",
"tokio-runtime",

View File

@@ -315,15 +315,6 @@ def deserialize_conn(
manifest_enabled=parsed.get("manifest_enabled", False),
namespace_client_properties=parsed.get("namespace_client_properties"),
)
elif connection_type == "remote":
return RemoteDBConnection(
parsed["db_url"],
parsed["api_key"],
parsed.get("region", "us-east-1"),
host_override=parsed.get("host_override"),
client_config=parsed.get("client_config"),
storage_options=storage_options,
)
else:
raise ValueError(f"Unknown connection_type: {connection_type}")

View File

@@ -1,4 +1,4 @@
from datetime import datetime, timedelta
from datetime import timedelta
from typing import Dict, List, Optional, Tuple, Any, TypedDict, Union, Literal
import pyarrow as pa
@@ -10,7 +10,6 @@ from .index import (
IvfSq,
Bitmap,
LabelList,
Fm,
HnswPq,
HnswSq,
HnswFlat,
@@ -48,7 +47,6 @@ class PyExpr:
def lower(self) -> "PyExpr": ...
def upper(self) -> "PyExpr": ...
def contains(self, substr: "PyExpr") -> "PyExpr": ...
def isin(self, values: List["PyExpr"]) -> "PyExpr": ...
def cast(self, data_type: pa.DataType) -> "PyExpr": ...
def to_sql(self) -> str: ...
@@ -188,7 +186,6 @@ class Table:
BTree,
Bitmap,
LabelList,
Fm,
FTS,
],
replace: Optional[bool],
@@ -205,15 +202,12 @@ class Table:
async def prewarm_index(self, index_name: str) -> None: ...
async def prewarm_data(self, columns: Optional[List[str]] = None) -> None: ...
async def list_indices(self) -> list[IndexConfig]: ...
async def delete(self, filter: Union[str, PyExpr]) -> DeleteResult: ...
async def delete(self, filter: str) -> DeleteResult: ...
async def add_columns(self, columns: list[tuple[str, str]]) -> AddColumnsResult: ...
async def add_columns_with_schema(self, schema: pa.Schema) -> AddColumnsResult: ...
async def alter_columns(
self, columns: list[dict[str, Any]]
) -> AlterColumnsResult: ...
async def update_field_metadata(
self, updates: list[dict[str, Any]]
) -> UpdateFieldMetadataResult: ...
async def optimize(
self,
*,
@@ -226,12 +220,8 @@ class Table:
async def set_unenforced_primary_key(self, columns: List[str]) -> None: ...
async def set_lsm_write_spec(self, spec: LsmWriteSpec) -> None: ...
async def unset_lsm_write_spec(self) -> None: ...
async def close_lsm_writers(self) -> None: ...
@property
def tags(self) -> Tags: ...
@property
def branches(self) -> Branches: ...
def current_branch(self) -> Optional[str]: ...
def query(self) -> Query: ...
def take_offsets(self, offsets: list[int]) -> TakeQuery: ...
def take_row_ids(self, row_ids: list[int]) -> TakeQuery: ...
@@ -244,30 +234,10 @@ class Tags:
async def delete(self, tag: str): ...
async def update(self, tag: str, version: int): ...
class Branches:
async def list(self) -> Dict[str, Any]: ...
async def create(
self,
name: str,
from_ref: Optional[str] = None,
from_version: Optional[int] = None,
) -> Table: ...
async def checkout(self, name: str, version: Optional[int] = None) -> Table: ...
async def delete(self, name: str) -> None: ...
class IndexConfig:
name: str
index_type: str
columns: List[str]
index_uuid: Optional[str]
type_url: Optional[str]
created_at: Optional[datetime]
num_indexed_rows: Optional[int]
num_unindexed_rows: Optional[int]
size_bytes: Optional[int]
num_segments: Optional[int]
index_version: Optional[int]
index_details: Optional[Any]
async def connect(
uri: str,
@@ -450,7 +420,6 @@ class MergeResult:
num_inserted_rows: int
num_deleted_rows: int
num_attempts: int
num_rows: int
class LsmWriteSpec:
"""Specification selecting Lance's MemWAL LSM-style write path for
@@ -489,9 +458,6 @@ class AddColumnsResult:
class AlterColumnsResult:
version: int
class UpdateFieldMetadataResult:
version: int
class DropColumnsResult:
version: int

View File

@@ -2,7 +2,6 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import asyncio
import concurrent.futures
import os
import threading
import warnings
@@ -38,24 +37,6 @@ class BackgroundEventLoop:
LOOP = BackgroundEventLoop()
def _new_embedding_executor() -> concurrent.futures.ThreadPoolExecutor:
return concurrent.futures.ThreadPoolExecutor(thread_name_prefix="lancedb-embedding")
# Embedding functions can block for a long time -- a heavy local model or an
# HTTP request to a remote embeddings API. Running them on asyncio's default
# executor lets them starve the unrelated blocking I/O that shares that pool,
# so they get a dedicated one. See
# https://github.com/lancedb/lancedb/issues/3310.
_EMBEDDING_EXECUTOR = _new_embedding_executor()
def embedding_executor() -> concurrent.futures.ThreadPoolExecutor:
"""Return the executor dedicated to running blocking embedding calls."""
return _EMBEDDING_EXECUTOR
_FORK_WARNED = False
@@ -66,12 +47,6 @@ def _reset_after_fork():
# the new state. The Rust-side tokio runtime is reset analogously by a
# pthread_atfork hook installed in the _lancedb extension.
LOOP._start()
# The embedding executor's worker threads are dead in the child as well.
# Replace it with a fresh pool (threads are spawned lazily, so this is
# cheap); we don't shut down the old one, since joining its dead workers
# could hang.
global _EMBEDDING_EXECUTOR
_EMBEDDING_EXECUTOR = _new_embedding_executor()
global _FORK_WARNED
if not _FORK_WARNED:
_FORK_WARNED = True

View File

@@ -416,8 +416,6 @@ class DBConnection(EnforceOverrides):
namespace_path: Optional[List[str]] = None,
storage_options: Optional[Dict[str, str]] = None,
index_cache_size: Optional[int] = None,
branch: Optional[str] = None,
version: Optional[int] = None,
) -> Table:
"""Open a Lance Table in the database.
@@ -446,14 +444,6 @@ class DBConnection(EnforceOverrides):
connection will be inherited by the table, but can be overridden here.
See available options at
<https://docs.lancedb.com/storage/>
branch: str, optional
If provided, open a handle scoped to this branch instead of the
default branch. Reads and writes operate in the branch's context.
version: int, optional
If provided, open the table pinned to this version, producing a
read-only handle. Composes with ``branch``: when both are given,
opens that branch at the version; otherwise opens ``main`` at the
version. Call ``checkout_latest`` to return to a writable state.
Returns
-------
@@ -968,8 +958,6 @@ class LanceDBConnection(DBConnection):
namespace_path: Optional[List[str]] = None,
storage_options: Optional[Dict[str, str]] = None,
index_cache_size: Optional[int] = None,
branch: Optional[str] = None,
version: Optional[int] = None,
) -> LanceTable:
"""Open a table in the database.
@@ -980,14 +968,6 @@ class LanceDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to open the table from. When non-empty, the
table is resolved through the directory namespace client.
branch: str, optional
If provided, open a handle scoped to this branch instead of the
default branch. Reads and writes operate in the branch's context.
version: int, optional
If provided, open the table pinned to this version, producing a
read-only handle. Composes with ``branch``: when both are given,
opens that branch at the version; otherwise opens ``main`` at the
version. Call ``checkout_latest`` to return to a writable state.
Returns
-------
@@ -1007,26 +987,20 @@ class LanceDBConnection(DBConnection):
)
if namespace_path:
tbl = self._namespace_conn().open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
index_cache_size=index_cache_size,
)
else:
tbl = LanceTable.open(
self,
return self._namespace_conn().open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
index_cache_size=index_cache_size,
)
if branch is not None:
tbl = tbl.branches.checkout(branch, version)
elif version is not None:
tbl.checkout(version)
return tbl
return LanceTable.open(
self,
name,
namespace_path=namespace_path,
storage_options=storage_options,
index_cache_size=index_cache_size,
)
def clone_table(
self,
@@ -1667,8 +1641,6 @@ class AsyncConnection(object):
location: Optional[str] = None,
namespace_client: Optional[Any] = None,
managed_versioning: Optional[bool] = None,
branch: Optional[str] = None,
version: Optional[int] = None,
) -> AsyncTable:
"""Open a Lance Table in the database.
@@ -1704,14 +1676,6 @@ class AsyncConnection(object):
managed_versioning: bool, optional
Whether managed versioning is enabled for this table. If provided,
avoids a redundant describe_table call when namespace_client is set.
branch: str, optional
If provided, open a handle scoped to this branch instead of the
default branch. Reads and writes operate in the branch's context.
version: int, optional
If provided, open the table pinned to this version, producing a
read-only handle. Composes with ``branch``: when both are given,
opens that branch at the version; otherwise opens ``main`` at the
version. Call ``checkout_latest`` to return to a writable state.
Returns
-------
@@ -1728,14 +1692,7 @@ class AsyncConnection(object):
namespace_client=namespace_client,
managed_versioning=managed_versioning,
)
tbl = AsyncTable(table)
# "main" is the default branch, so treat it as no branch: remote rejects
# every branch checkout (even "main"), and the version still applies.
if branch is not None and branch != "main":
tbl = await tbl.branches.checkout(branch, version)
elif version is not None:
await tbl.checkout(version)
return tbl
return AsyncTable(table)
async def clone_table(
self,

View File

@@ -19,7 +19,7 @@ operators::
from __future__ import annotations
from typing import Iterable, Union
from typing import Union
import pyarrow as pa
@@ -174,11 +174,6 @@ class Expr:
"""Return True where the string contains *substr*."""
return Expr(self._inner.contains(_coerce(substr)._inner))
def isin(self, values: "Iterable[ExprLike]") -> "Expr":
"""Return True where the value is one of *values* (SQL ``IN``)."""
inner = [_coerce(v)._inner for v in values]
return Expr(self._inner.isin(inner))
# ── type cast ────────────────────────────────────────────────────────────
def cast(self, data_type: Union[str, "pa.DataType"]) -> "Expr":

View File

@@ -93,20 +93,6 @@ class LabelList:
pass
@dataclass
class Fm:
"""Describe an FM-Index configuration.
`Fm` is a scalar index on string or binary columns that accelerates
substring search, i.e. `contains(col, 'needle')`. Unlike the tokenized
`FTS` index, it matches arbitrary substrings of the raw bytes.
For example, it works with `url`, `path`, `content`, etc.
"""
pass
@dataclass
class FTS:
"""Describe a FTS index configuration.
@@ -295,9 +281,6 @@ class HnswPq:
m: int = 20
ef_construction: int = 300
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -403,9 +386,6 @@ class HnswSq:
m: int = 20
ef_construction: int = 300
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -599,9 +579,6 @@ class IvfFlat:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -632,9 +609,6 @@ class IvfSq:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -765,9 +739,6 @@ class IvfPq:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
@dataclass
@@ -821,9 +792,6 @@ class IvfRq:
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
# create_index() dispatches to pylance to build the index on the accelerator.
accelerator: Optional[str] = None
__all__ = [
@@ -842,5 +810,4 @@ __all__ = [
"FTS",
"Bitmap",
"LabelList",
"Fm",
]

View File

@@ -5,9 +5,7 @@
from __future__ import annotations
from datetime import timedelta
from typing import TYPE_CHECKING, List, Optional, Union
from .expr import Expr
from typing import TYPE_CHECKING, List, Optional
if TYPE_CHECKING:
from .common import DATA
@@ -34,11 +32,8 @@ class LanceMergeInsertBuilder(object):
self._when_not_matched_insert_all = False
self._when_not_matched_by_source_delete = False
self._when_not_matched_by_source_condition = None
self._when_not_matched_by_source_condition_expr = None
self._timeout = None
self._use_index = True
self._use_lsm_write = None
self._validate_single_shard = None
def when_matched_update_all(
self, *, where: Optional[str] = None
@@ -65,7 +60,7 @@ class LanceMergeInsertBuilder(object):
return self
def when_not_matched_by_source_delete(
self, condition: Union[str, Expr, None] = None
self, condition: Optional[str] = None
) -> LanceMergeInsertBuilder:
"""
Rows that exist only in the target table (old data) will be
@@ -74,16 +69,13 @@ class LanceMergeInsertBuilder(object):
Parameters
----------
condition: str or :class:`~lancedb.expr.Expr` or None, default None
condition: Optional[str], default None
If None then all such rows will be deleted. Otherwise the
condition will be used as a filter to limit what rows are deleted.
Can be a SQL string or a type-safe :class:`~lancedb.expr.Expr`
built with :func:`~lancedb.expr.col` and :func:`~lancedb.expr.lit`.
condition will be used as an SQL filter to limit what rows
are deleted.
"""
self._when_not_matched_by_source_delete = True
if isinstance(condition, Expr):
self._when_not_matched_by_source_condition_expr = condition._inner
elif condition is not None:
if condition is not None:
self._when_not_matched_by_source_condition = condition
return self
@@ -104,46 +96,6 @@ class LanceMergeInsertBuilder(object):
self._use_index = use_index
return self
def use_lsm_write(self, use_lsm_write: bool) -> LanceMergeInsertBuilder:
"""
Controls whether the merge uses the MemWAL LSM write path.
By default (unset), a `merge_insert` on a table with an LSM write spec
is routed through Lance's MemWAL shard writer, and a table without one
uses the standard path. Pass `False` to force the standard path even
when a spec is set. Pass `True` to require a spec — `merge_insert`
raises an error if none is installed.
Parameters
----------
use_lsm_write: bool
Whether to use the LSM write path.
"""
self._use_lsm_write = use_lsm_write
return self
def validate_single_shard(
self, validate_single_shard: bool
) -> LanceMergeInsertBuilder:
"""
Controls how an LSM merge checks that its input targets a single shard.
When a table has an LSM write spec, every row in a `merge_insert` call
must route to the same shard. When `True` (the default), every row is
inspected to verify this. When `False`, only the first row is inspected
and the shard it routes to is used for the whole input — a faster path
for callers that have already pre-sharded their input.
Has no effect on tables without an LSM write spec.
Parameters
----------
validate_single_shard: bool
Whether to check every row routes to one shard. Defaults to `True`.
"""
self._validate_single_shard = validate_single_shard
return self
def execute(
self,
new_data: DATA,

View File

@@ -58,7 +58,6 @@ from lance_namespace import (
ListTablesRequest,
DescribeNamespaceRequest,
DropTableRequest,
RenameTableRequest,
ListNamespacesRequest,
CreateNamespaceRequest,
DropNamespaceRequest,
@@ -145,12 +144,7 @@ def _query_to_namespace_request(
if query.postfilter is not None:
prefilter = not query.postfilter
if query.limit is not None:
k = query.limit
elif query.vector is None and query.full_text_query is None:
k = sys.maxsize
else:
k = 10
k = query.limit if query.limit is not None else 10
# Build request kwargs, only including non-None values for optional fields
# that Pydantic doesn't accept as None
@@ -550,8 +544,6 @@ class LanceNamespaceDBConnection(DBConnection):
namespace_path: Optional[List[str]] = None,
storage_options: Optional[Dict[str, str]] = None,
index_cache_size: Optional[int] = None,
branch: Optional[str] = None,
version: Optional[int] = None,
) -> Table:
if namespace_path is None:
namespace_path = []
@@ -570,7 +562,7 @@ class LanceNamespaceDBConnection(DBConnection):
raise TableNotFoundError(f"Table not found: {'$'.join(table_id)}")
raise
tbl = LanceTable(
return LanceTable(
self,
name,
namespace_path=namespace_path,
@@ -578,11 +570,6 @@ class LanceNamespaceDBConnection(DBConnection):
pushdown_operations=self._namespace_client_pushdown_operations,
_async=async_table,
)
if branch is not None:
tbl = tbl.branches.checkout(branch, version)
elif version is not None:
tbl.checkout(version)
return tbl
@override
def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):
@@ -605,14 +592,9 @@ class LanceNamespaceDBConnection(DBConnection):
cur_namespace_path = []
if new_namespace_path is None:
new_namespace_path = []
cur_table_id = cur_namespace_path + [cur_name]
new_namespace_id = new_namespace_path if new_namespace_path else None
request = RenameTableRequest(
id=cur_table_id,
new_table_name=new_name,
new_namespace_id=new_namespace_id,
raise NotImplementedError(
"rename_table is not supported for namespace connections"
)
self._namespace_client.rename_table(request)
@override
def drop_database(self):
@@ -972,7 +954,7 @@ class AsyncLanceNamespaceDBConnection:
if mode.lower() not in ["create", "overwrite"]:
raise ValueError("mode must be either 'create' or 'overwrite'")
validate_table_name(name)
table = await self._inner.create_table(
return await self._inner.create_table(
name,
data,
schema=schema,
@@ -984,11 +966,6 @@ class AsyncLanceNamespaceDBConnection:
embedding_functions=embedding_functions,
storage_options=storage_options,
)
return table._set_namespace_context(
namespace_path=namespace_path,
namespace_client=self._namespace_client,
pushdown_operations=self._namespace_client_pushdown_operations,
)
async def open_table(
self,
@@ -997,14 +974,12 @@ class AsyncLanceNamespaceDBConnection:
namespace_path: Optional[List[str]] = None,
storage_options: Optional[Dict[str, str]] = None,
index_cache_size: Optional[int] = None,
branch: Optional[str] = None,
version: Optional[int] = None,
) -> AsyncTable:
"""Open an existing table from the namespace."""
if namespace_path is None:
namespace_path = []
try:
table = await self._inner.open_table(
return await self._inner.open_table(
name,
namespace_path=namespace_path,
storage_options=storage_options,
@@ -1015,17 +990,6 @@ class AsyncLanceNamespaceDBConnection:
table_id = namespace_path + [name]
raise TableNotFoundError(f"Table not found: {'$'.join(table_id)}")
raise
# "main" is the default branch, so treat it as no branch (mirrors the
# sync remote path); the version still applies.
if branch is not None and branch != "main":
table = await table.branches.checkout(branch, version)
elif version is not None:
await table.checkout(version)
return table._set_namespace_context(
namespace_path=namespace_path,
namespace_client=self._namespace_client,
pushdown_operations=self._namespace_client_pushdown_operations,
)
async def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):
"""Drop a table from the namespace."""
@@ -1042,19 +1006,14 @@ class AsyncLanceNamespaceDBConnection:
cur_namespace_path: Optional[List[str]] = None,
new_namespace_path: Optional[List[str]] = None,
):
"""Rename a table in the namespace."""
"""Rename is not supported for namespace connections."""
if cur_namespace_path is None:
cur_namespace_path = []
if new_namespace_path is None:
new_namespace_path = []
cur_table_id = cur_namespace_path + [cur_name]
new_namespace_id = new_namespace_path if new_namespace_path else None
request = RenameTableRequest(
id=cur_table_id,
new_table_name=new_name,
new_namespace_id=new_namespace_id,
raise NotImplementedError(
"rename_table is not supported for namespace connections"
)
self._namespace_client.rename_table(request)
async def drop_database(self):
"""Deprecated method."""

View File

@@ -3,13 +3,12 @@
import copy
import json
import os
from deprecation import deprecated
import pyarrow as pa
from ._lancedb import async_permutation_builder, PermutationReader
from .table import LanceTable, Table
from .table import LanceTable
from .background_loop import LOOP
from .util import batch_to_tensor, batch_to_tensor_rows
from typing import Any, Callable, Iterator, Literal, Optional, TYPE_CHECKING, Union
@@ -355,49 +354,6 @@ class Transforms:
DEFAULT_BATCH_SIZE = 100
def _table_to_pickle_state(table: Table) -> dict[str, Any]:
from .remote.table import RemoteTable
if isinstance(table, RemoteTable):
return {
"kind": "remote",
"table": table,
}
if not isinstance(table, LanceTable):
raise ValueError(f"Cannot pickle table of type {type(table)!r}")
base_uri = table._conn.uri
if base_uri.startswith("memory://"):
return {
"kind": "memory",
"name": table.name,
"data": table.to_arrow(),
}
return {
"kind": "local",
"name": table.name,
"uri": base_uri,
"namespace": table._namespace_path,
"storage_options": table._conn.storage_options,
}
def _table_from_pickle_state(state: dict[str, Any]) -> Table:
from . import connect
kind = state["kind"]
if kind == "remote":
return state["table"]
if kind == "memory":
return connect("memory://").create_table(state["name"], state["data"])
if kind == "local":
db = connect(state["uri"], storage_options=state["storage_options"])
return db.open_table(state["name"], namespace_path=state["namespace"] or None)
raise ValueError(f"Unknown table pickle state kind: {kind}")
class Permutation:
"""
A Permutation is a view of a dataset that can be used as input to model training
@@ -413,15 +369,15 @@ class Permutation:
def __init__(
self,
base_table: Table,
permutation_table: Optional[Table],
base_table: LanceTable,
permutation_table: Optional[LanceTable],
split: int,
selection: dict[str, str],
batch_size: int,
transform_fn: Callable[pa.RecordBatch, Any],
offset: Optional[int] = None,
limit: Optional[int] = None,
connection_factory: Optional[Callable[[str], Table]] = None,
connection_factory: Optional[Callable[[str], LanceTable]] = None,
_reader: Optional[PermutationReader] = None,
):
"""
@@ -441,7 +397,6 @@ class Permutation:
if _reader is None:
_reader = LOOP.run(self._build_reader())
self.reader: PermutationReader = _reader
self._pid = os.getpid()
async def _build_reader(self) -> PermutationReader:
reader = await PermutationReader.from_tables(
@@ -473,25 +428,29 @@ class Permutation:
return new
def with_connection_factory(
self, connection_factory: Callable[[str], Table]
self, connection_factory: Callable[[str], LanceTable]
) -> "Permutation":
"""
Creates a new permutation that will use ``connection_factory`` to reopen
the base table when this permutation is unpickled in a worker process.
The factory is a callable that takes a single argument, the base table
name, and returns a LanceDB table. It must be picklable; the worker
The factory is a callable that takes a single argument the base table
name and returns a [LanceTable]. It must be picklable; the worker
will pickle it via standard ``pickle`` and call it to recover the base
table. Picklable callables in practice means top-level (module-level)
functions, ``functools.partial`` of such functions, or instances of
picklable classes implementing ``__call__``. Lambdas and closures over
local variables don't pickle with the default protocol.
A factory is optional for normal local and remote LanceDB connections:
if not set, ``__getstate__`` captures the table's own picklable reopen
state. Use a factory when that default state is not enough, for example
when credentials should be loaded from the worker environment instead
of being embedded in the pickle.
Setting a factory is necessary when the URI alone is not enough to
re-open the connection — most importantly for LanceDB Cloud (``db://``)
connections, where ``api_key`` and ``region`` aren't recoverable from
the connection object after construction.
For local file or cloud-storage paths the factory is optional: if not
set, ``__getstate__`` falls back to capturing
``(uri, storage_options, namespace_path)`` and re-opening via
``lancedb.connect(uri, storage_options=...)``.
Examples
--------
@@ -549,7 +508,7 @@ class Permutation:
return new
@classmethod
def identity(cls, table: Table) -> "Permutation":
def identity(cls, table: LanceTable) -> "Permutation":
"""
Creates an identity permutation for the given table.
"""
@@ -558,8 +517,8 @@ class Permutation:
@classmethod
def from_tables(
cls,
base_table: Table,
permutation_table: Optional[Table] = None,
base_table: LanceTable,
permutation_table: Optional[LanceTable] = None,
split: Optional[Union[str, int]] = None,
) -> "Permutation":
"""
@@ -635,10 +594,11 @@ class Permutation:
The base table is captured either via a user-supplied
``connection_factory`` (see [with_connection_factory]) or, as a
fallback, by the table's own picklable reopen state. The permutation
table is captured as a pyarrow Table (which pickles via Arrow IPC
natively). The reader is dropped from the wire format and rebuilt
lazily on first use.
fallback, by introspecting ``(uri, storage_options, namespace_path)``
on the connection. The permutation table — always an in-memory
LanceDB table — is captured as a pyarrow Table (which pickles via
Arrow IPC natively). The reader is dropped from the wire format;
``__setstate__`` rebuilds it from the restored tables.
"""
permutation_data: Optional[pa.Table] = None
if self.permutation_table is not None:
@@ -662,9 +622,39 @@ class Permutation:
# namespace from the existing connection.
return common
# URI-introspection fallback: only viable for native (OSS) connections
# where (uri, storage_options) is enough to reopen. Remote / cloud
# connections don't expose recoverable api_key / region — those users
# must call with_connection_factory().
try:
base_uri = self.base_table._conn.uri
storage_options = self.base_table._conn.storage_options
except AttributeError as e:
raise ValueError(
"Cannot pickle this Permutation: the base table's connection "
"does not expose a uri/storage_options, which usually means it "
"is a remote (LanceDB Cloud) connection. Call "
"Permutation.with_connection_factory(...) first to provide a "
"picklable callable that re-opens the base table from a worker "
"process."
) from e
if base_uri.startswith("memory://"):
# In-memory base tables don't exist in any worker process by
# default, so dump the entire base table into the pickle. This
# can be expensive for large datasets — users with large
# in-memory base tables should either persist them or set a
# connection_factory.
return {
**common,
"base_table_data": self.base_table.to_arrow(),
}
return {
**common,
"base_table_state": _table_to_pickle_state(self.base_table),
"base_table_uri": base_uri,
"base_table_namespace": self.base_table._namespace_path,
"base_table_storage_options": storage_options,
}
def __setstate__(self, state: dict[str, Any]) -> None:
@@ -673,8 +663,6 @@ class Permutation:
connection_factory = state["connection_factory"]
if connection_factory is not None:
base_table = connection_factory(state["base_table_name"])
elif "base_table_state" in state:
base_table = _table_from_pickle_state(state["base_table_state"])
elif "base_table_data" in state:
# In-memory base table inlined into the pickle; rebuild the same
# way we rebuild the in-memory permutation table.
@@ -692,7 +680,7 @@ class Permutation:
namespace_path=state["base_table_namespace"] or None,
)
permutation_table: Optional[Table] = None
permutation_table: Optional[LanceTable] = None
if state["permutation_data"] is not None:
mem_db = connect("memory://")
permutation_table = mem_db.create_table(
@@ -708,28 +696,10 @@ class Permutation:
self.offset = state["offset"]
self.limit = state["limit"]
self.connection_factory = connection_factory
self.reader = None
self._pid = None
def _ensure_open(self) -> None:
pid = os.getpid()
if self.reader is not None and getattr(self, "_pid", None) == pid:
return
# The reader owns Rust-side table handles. Rebuild it after unpickle or
# fork even though the Python table wrappers reopen themselves.
if hasattr(self.base_table, "_ensure_open"):
self.base_table._ensure_open()
if self.permutation_table is not None and hasattr(
self.permutation_table, "_ensure_open"
):
self.permutation_table._ensure_open()
self.reader = LOOP.run(self._build_reader())
self._pid = pid
@property
def schema(self) -> pa.Schema:
self._ensure_open()
async def do_output_schema():
return await self.reader.output_schema(self.selection)
@@ -747,7 +717,6 @@ class Permutation:
"""
The number of rows in the permutation
"""
self._ensure_open()
return self.reader.count_rows()
@property
@@ -906,7 +875,6 @@ class Permutation:
If skip_last_batch is True, the last batch will be skipped if it is not a
multiple of batch_size.
"""
self._ensure_open()
async def get_iter():
return await self.reader.read(self.selection, batch_size=batch_size)
@@ -1008,7 +976,6 @@ class Permutation:
so `with_format` and `with_transform` affect this method in the same way
they affect iteration.
"""
self._ensure_open()
async def do_take_offsets():
return await self.reader.take_offsets(offsets, selection=self.selection)
@@ -1044,11 +1011,9 @@ class Permutation:
"""
Skip the first `skip` rows of the permutation
"""
self._ensure_open()
new = copy.copy(self)
new.offset = skip
new.reader = LOOP.run(new._build_reader())
new._pid = os.getpid()
return new
@deprecated(details="Use with_take instead")
@@ -1067,11 +1032,9 @@ class Permutation:
"""
Limit the permutation to `limit` rows (following any `skip`)
"""
self._ensure_open()
new = copy.copy(self)
new.limit = limit
new.reader = LOOP.run(new._build_reader())
new._pid = os.getpid()
return new
@deprecated(details="Use with_repeat instead")

View File

@@ -41,14 +41,6 @@ from .rerankers.rrf import RRFReranker
from .rerankers.util import check_reranker_result
from .util import flatten_columns
BlobMode = Literal["lazy", "bytes", "descriptions"]
_BLOB_MODE_TO_HANDLING = {
"lazy": "blobs_descriptions",
"bytes": "all_binary",
"descriptions": "blobs_descriptions",
}
if TYPE_CHECKING:
import sys
@@ -63,7 +55,7 @@ if TYPE_CHECKING:
from ._lancedb import VectorQuery as LanceVectorQuery
from .common import VEC
from .pydantic import LanceModel
from .table import AsyncTable, Table
from .table import Table
if sys.version_info >= (3, 11):
from typing import Self
@@ -73,179 +65,6 @@ if TYPE_CHECKING:
T = TypeVar("T", bound="LanceModel")
def _validate_blob_mode(blob_mode: BlobMode) -> None:
if blob_mode not in _BLOB_MODE_TO_HANDLING:
modes = ", ".join(repr(mode) for mode in _BLOB_MODE_TO_HANDLING)
raise ValueError(f"blob_mode must be one of {modes}, got {blob_mode!r}")
def _field_is_blob(field: pa.Field) -> bool:
metadata = field.metadata or {}
return metadata.get(b"lance-encoding:blob") == b"true" or (
metadata.get("lance-encoding:blob") == "true"
)
def _schema_has_blob_field(schema: pa.Schema) -> bool:
return any(_field_is_blob(field) for field in schema)
def _blob_mode_requires_native_pandas(blob_mode: BlobMode, schema: pa.Schema) -> bool:
return blob_mode in _BLOB_MODE_TO_HANDLING and _schema_has_blob_field(schema)
def _unsupported_blob_pandas_error(reason: str) -> RuntimeError:
return RuntimeError(
"blob columns require Lance native scanner conversion for query "
f"to_pandas(), but {reason}. Use a plain scan query or remove blob "
"columns from the projection."
)
def _query_is_plain_scan(query: Query) -> bool:
return (
query.vector is None
and query.full_text_query is None
and not query.postfilter
and not query.order_by
)
def _filter_to_sql(filter: Optional[Union[str, Expr]]) -> Optional[str]:
if filter is None:
return None
if isinstance(filter, Expr):
return filter.to_sql()
return filter
def _projection_to_scanner_kwargs(
columns: Optional[
Union[
List[str], List[Tuple[str, Union[str, Expr]]], Dict[str, Union[str, Expr]]
]
],
) -> Dict[str, Any]:
if columns is None:
return {}
if isinstance(columns, list):
if all(isinstance(column, str) for column in columns):
return {"columns": columns}
if all(isinstance(column, tuple) and len(column) == 2 for column in columns):
return {
"columns": {
name: expr.to_sql() if isinstance(expr, Expr) else expr
for name, expr in columns
}
}
# Let Lance raise the detailed projection validation error.
return {"columns": columns}
projection = {}
for name, expr in columns.items():
if isinstance(expr, Expr):
expr = expr.to_sql()
projection[name] = expr
return {"columns": projection}
def _scanner_kwargs_for_query(
query: Query, blob_mode: BlobMode, dataset: Optional[Any] = None
) -> Dict[str, Any]:
fragments = _scanner_fragments_for_query(query, dataset)
kwargs = {
**_projection_to_scanner_kwargs(query.columns),
"filter": _filter_to_sql(query.filter),
"limit": query.limit,
"offset": query.offset,
"with_row_id": query.with_row_id,
"with_row_address": query.with_row_address,
"fast_search": query.fast_search,
"blob_handling": _BLOB_MODE_TO_HANDLING[blob_mode],
"fragments": fragments,
}
return {key: value for key, value in kwargs.items() if value is not None}
def _scanner_fragments_for_query(query: Query, dataset: Optional[Any]) -> Optional[Any]:
if query.fragments is not None and query.fragment_ids is not None:
raise ValueError("fragments and fragment_ids cannot both be set")
if query.fragments is not None:
return query.fragments
if query.fragment_ids is None:
return None
if dataset is None:
raise ValueError("fragment_ids require a Lance dataset")
requested = set(query.fragment_ids)
fragments = [
fragment
for fragment in dataset.get_fragments()
if fragment.fragment_id in requested
]
found = {fragment.fragment_id for fragment in fragments}
missing = requested - found
if missing:
missing_ids = ", ".join(str(fragment_id) for fragment_id in sorted(missing))
raise ValueError(f"fragment_ids not found in dataset: {missing_ids}")
return fragments
def _ensure_lazy_blob_frame(
df: "pd.DataFrame", schema: pa.Schema, blob_mode: BlobMode
) -> "pd.DataFrame":
if blob_mode != "lazy" or not _schema_has_blob_field(schema) or len(df) == 0:
return df
for field in schema:
if not _field_is_blob(field) or field.name not in df.columns:
continue
value = df[field.name].iloc[0]
if value is not None and not hasattr(value, "readall"):
raise _unsupported_blob_pandas_error(
"the Lance scanner did not return lazy blob files"
)
return df
def _scanner_to_table(scanner: Any) -> pa.Table:
if hasattr(scanner, "to_pyarrow"):
reader = scanner.to_pyarrow()
return reader.read_all()
if hasattr(scanner, "to_table"):
return scanner.to_table()
reader = scanner.to_reader()
return reader.read_all()
def _scanner_to_pandas(scanner: Any, blob_mode: BlobMode, **kwargs) -> "pd.DataFrame":
schema = getattr(scanner, "projected_schema", None)
if schema is None:
schema = getattr(scanner, "schema", None)
if schema is None:
schema = getattr(scanner, "dataset_schema", None)
if callable(schema):
schema = schema()
if hasattr(scanner, "to_pandas"):
try:
df = scanner.to_pandas(blob_mode=blob_mode, **kwargs)
except TypeError as err:
message = str(err)
if "blob_mode" not in message and "unexpected keyword" not in message:
raise
df = scanner.to_pandas(**kwargs)
if schema is not None:
return _ensure_lazy_blob_frame(df, schema, blob_mode)
return df
tbl = _scanner_to_table(scanner)
if blob_mode == "lazy" and _schema_has_blob_field(tbl.schema):
raise _unsupported_blob_pandas_error(
"the Lance scanner does not expose to_pandas"
)
return tbl.to_pandas(**kwargs)
# Pydantic validation function for vector queries
def ensure_vector_query(
val: Any,
@@ -680,13 +499,6 @@ class Query(pydantic.BaseModel):
# if true, include the row id in the results
with_row_id: Optional[bool] = None
# if true, include the row address in the results
with_row_address: Optional[bool] = None
# Lance fragments or fragment ids to scan on scanner-backed plain queries
fragments: Optional[Any] = None
fragment_ids: Optional[List[int]] = None
# offset to start fetching results from
offset: Optional[int] = None
@@ -879,9 +691,6 @@ class LanceQueryBuilder(ABC):
self._where = None
self._postfilter = None
self._with_row_id = None
self._with_row_address = None
self._fragments = None
self._fragment_ids = None
self._vector = None
self._text = None
self._ef = None
@@ -909,7 +718,6 @@ class LanceQueryBuilder(ABC):
self,
flatten: Optional[Union[int, bool]] = None,
*,
blob_mode: BlobMode = "lazy",
timeout: Optional[timedelta] = None,
**kwargs,
) -> "pd.DataFrame":
@@ -929,41 +737,11 @@ class LanceQueryBuilder(ABC):
timeout: Optional[timedelta]
The maximum time to wait for the query to complete.
If None, wait indefinitely.
blob_mode: str, default "lazy"
Controls how blob columns are returned for plain scan queries.
Vector, FTS, hybrid, and other non-native query shapes keep the
existing Arrow conversion path and only support blob descriptions.
**kwargs
Forwarded to pyarrow.Table.to_pandas after query execution and
optional flattening.
"""
_validate_blob_mode(blob_mode)
output_schema = getattr(self, "output_schema", None)
if output_schema is not None:
schema = output_schema()
if _blob_mode_requires_native_pandas(blob_mode, schema):
native_error = None
if (flatten is None or blob_mode == "descriptions") and timeout is None:
try:
df = self._plain_scan_to_pandas(
blob_mode, flatten=flatten, **kwargs
)
if df is not None:
return df
except Exception as err:
native_error = err
reason = (
"this query shape cannot use Lance native pandas conversion"
if native_error is None
else str(native_error)
)
raise _unsupported_blob_pandas_error(reason) from native_error
tbl = flatten_columns(self.to_arrow(timeout=timeout), flatten)
if _blob_mode_requires_native_pandas(blob_mode, tbl.schema):
raise _unsupported_blob_pandas_error(
"this query shape cannot use Lance native pandas conversion"
)
return tbl.to_pandas(**kwargs)
@abstractmethod
@@ -1169,32 +947,6 @@ class LanceQueryBuilder(ABC):
self._with_row_id = with_row_id
return self
def with_row_address(self, with_row_address: bool = True) -> Self:
"""Set whether to return row addresses.
Parameters
----------
with_row_address: bool, default True
If True, return the _rowaddr column in the results.
Returns
-------
LanceQueryBuilder
The LanceQueryBuilder object.
"""
self._with_row_address = with_row_address
return self
def with_fragments(self, fragments: Any) -> Self:
"""Set the Lance fragments to scan for plain scanner-backed queries."""
self._fragments = fragments
return self
def fragment_ids(self, fragment_ids: List[int]) -> Self:
"""Set the Lance fragment ids to scan for plain scanner-backed queries."""
self._fragment_ids = fragment_ids
return self
def explain_plan(self, verbose: Optional[bool] = False) -> str:
"""Return the execution plan for this query.
@@ -1334,25 +1086,6 @@ class LanceQueryBuilder(ABC):
"""
raise NotImplementedError
def _plain_scan_to_pandas(
self,
blob_mode: BlobMode,
flatten: Optional[Union[int, bool]] = None,
**kwargs,
) -> Optional["pd.DataFrame"]:
query = self.to_query_object()
if not _query_is_plain_scan(query):
return None
dataset = self._table.to_lance()
scanner = dataset.scanner(
**_scanner_kwargs_for_query(query, blob_mode, dataset)
)
if flatten is not None:
tbl = flatten_columns(_scanner_to_table(scanner), flatten)
return tbl.to_pandas(**kwargs)
return _scanner_to_pandas(scanner, blob_mode, **kwargs)
@abstractmethod
def to_query_object(self) -> Query:
"""Return a serializable representation of the query
@@ -1624,9 +1357,6 @@ class LanceVectorQueryBuilder(LanceQueryBuilder):
refine_factor=self._refine_factor,
vector_column=self._vector_column,
with_row_id=self._with_row_id,
with_row_address=self._with_row_address,
fragments=self._fragments,
fragment_ids=self._fragment_ids,
offset=self._offset,
fast_search=self._fast_search,
ef=self._ef,
@@ -1829,9 +1559,6 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
limit=self._limit,
postfilter=self._postfilter,
with_row_id=self._with_row_id,
with_row_address=self._with_row_address,
fragments=self._fragments,
fragment_ids=self._fragment_ids,
full_text_query=FullTextSearchQuery(
query=self._query, columns=self._fts_columns
),
@@ -1902,9 +1629,6 @@ class LanceEmptyQueryBuilder(LanceQueryBuilder):
filter=self._where,
limit=self._limit,
with_row_id=self._with_row_id,
with_row_address=self._with_row_address,
fragments=self._fragments,
fragment_ids=self._fragment_ids,
offset=self._offset,
order_by=self._order_by,
)
@@ -2483,11 +2207,7 @@ class AsyncQueryBase(object):
Base class for all async queries (take, scan, vector, fts, hybrid)
"""
def __init__(
self,
inner: Union[LanceQuery, LanceVectorQuery, LanceTakeQuery],
table: Optional["AsyncTable"] = None,
):
def __init__(self, inner: Union[LanceQuery, LanceVectorQuery, LanceTakeQuery]):
"""
Construct an AsyncQueryBase
@@ -2495,10 +2215,6 @@ class AsyncQueryBase(object):
[AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
"""
self._inner = inner
self._table = table
self._with_row_address = None
self._fragments = None
self._fragment_ids = None
def to_query_object(self) -> Query:
"""
@@ -2507,11 +2223,7 @@ class AsyncQueryBase(object):
This is currently experimental but can be useful as the query object is pure
python and more easily serializable.
"""
query = Query.from_inner(self._inner.to_query_request())
query.with_row_address = self._with_row_address
query.fragments = self._fragments
query.fragment_ids = self._fragment_ids
return query
return Query.from_inner(self._inner.to_query_request())
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
"""
@@ -2568,27 +2280,6 @@ class AsyncQueryBase(object):
self._inner.with_row_id()
return self
def with_row_address(self, with_row_address: bool = True) -> Self:
"""
Include the _rowaddr column in scanner-backed plain query results.
"""
self._with_row_address = with_row_address
return self
def with_fragments(self, fragments: Any) -> Self:
"""
Restrict scanner-backed plain query results to the given Lance fragments.
"""
self._fragments = fragments
return self
def fragment_ids(self, fragment_ids: List[int]) -> Self:
"""
Restrict scanner-backed plain query results to the given Lance fragment ids.
"""
self._fragment_ids = fragment_ids
return self
async def to_batches(
self,
*,
@@ -2666,8 +2357,6 @@ class AsyncQueryBase(object):
self,
flatten: Optional[Union[int, bool]] = None,
timeout: Optional[timedelta] = None,
*,
blob_mode: BlobMode = "lazy",
**kwargs,
) -> "pd.DataFrame":
"""
@@ -2701,63 +2390,13 @@ class AsyncQueryBase(object):
The maximum time to wait for the query to complete.
If not specified, no timeout is applied. If the query does not
complete within the specified time, an error will be raised.
blob_mode: str, default "lazy"
Controls how blob columns are returned for plain scan queries.
Vector, FTS, hybrid, and other non-native query shapes keep the
existing Arrow conversion path and only support blob descriptions.
**kwargs
Forwarded to pyarrow.Table.to_pandas after query execution and
optional flattening.
"""
_validate_blob_mode(blob_mode)
if hasattr(self._inner, "output_schema"):
schema = await self.output_schema()
if _blob_mode_requires_native_pandas(blob_mode, schema):
native_error = None
if (flatten is None or blob_mode == "descriptions") and timeout is None:
try:
df = await self._plain_scan_to_pandas(
blob_mode, flatten=flatten, **kwargs
)
if df is not None:
return df
except Exception as err:
native_error = err
reason = (
"this query shape cannot use Lance native pandas conversion"
if native_error is None
else str(native_error)
)
raise _unsupported_blob_pandas_error(reason) from native_error
tbl = flatten_columns(await self.to_arrow(timeout=timeout), flatten)
if _blob_mode_requires_native_pandas(blob_mode, tbl.schema):
raise _unsupported_blob_pandas_error(
"this query shape cannot use Lance native pandas conversion"
)
return tbl.to_pandas(**kwargs)
async def _plain_scan_to_pandas(
self,
blob_mode: BlobMode,
flatten: Optional[Union[int, bool]] = None,
**kwargs,
) -> Optional["pd.DataFrame"]:
if self._table is None:
return None
query = self.to_query_object()
if not _query_is_plain_scan(query):
return None
dataset = await self._table._to_lance()
scanner = dataset.scanner(
**_scanner_kwargs_for_query(query, blob_mode, dataset)
)
if flatten is not None:
tbl = flatten_columns(_scanner_to_table(scanner), flatten)
return tbl.to_pandas(**kwargs)
return _scanner_to_pandas(scanner, blob_mode, **kwargs)
return (
flatten_columns(await self.to_arrow(timeout=timeout), flatten)
).to_pandas(**kwargs)
async def to_polars(
self,
@@ -2864,18 +2503,14 @@ class AsyncStandardQuery(AsyncQueryBase):
Base class for "standard" async queries (all but take currently)
"""
def __init__(
self,
inner: Union[LanceQuery, LanceVectorQuery],
table: Optional["AsyncTable"] = None,
):
def __init__(self, inner: Union[LanceQuery, LanceVectorQuery]):
"""
Construct an AsyncStandardQuery
This method is not intended to be called directly. Instead, use the
[AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
"""
super().__init__(inner, table)
super().__init__(inner)
def where(self, predicate: Union[str, Expr]) -> Self:
"""
@@ -2981,14 +2616,14 @@ class AsyncStandardQuery(AsyncQueryBase):
class AsyncQuery(AsyncStandardQuery):
def __init__(self, inner: LanceQuery, table: Optional["AsyncTable"] = None):
def __init__(self, inner: LanceQuery):
"""
Construct an AsyncQuery
This method is not intended to be called directly. Instead, use the
[AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
"""
super().__init__(inner, table)
super().__init__(inner)
self._inner = inner
@classmethod
@@ -3072,11 +2707,10 @@ class AsyncQuery(AsyncStandardQuery):
new_self = self._inner.nearest_to(query_vectors[0])
for v in query_vectors[1:]:
new_self.add_query_vector(v)
return AsyncVectorQuery(new_self, self._table)
return AsyncVectorQuery(new_self)
else:
return AsyncVectorQuery(
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)),
self._table,
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
)
def nearest_to_text(
@@ -3109,18 +2743,17 @@ class AsyncQuery(AsyncStandardQuery):
if isinstance(query, str):
return AsyncFTSQuery(
self._inner.nearest_to_text({"query": query, "columns": columns}),
self._table,
self._inner.nearest_to_text({"query": query, "columns": columns})
)
# FullTextQuery object
return AsyncFTSQuery(self._inner.nearest_to_text({"query": query}), self._table)
return AsyncFTSQuery(self._inner.nearest_to_text({"query": query}))
class AsyncFTSQuery(AsyncStandardQuery):
"""A query for full text search for LanceDB."""
def __init__(self, inner: LanceFTSQuery, table: Optional["AsyncTable"] = None):
super().__init__(inner, table)
def __init__(self, inner: LanceFTSQuery):
super().__init__(inner)
self._inner = inner
self._reranker = None
@@ -3202,11 +2835,10 @@ class AsyncFTSQuery(AsyncStandardQuery):
new_self = self._inner.nearest_to(query_vectors[0])
for v in query_vectors[1:]:
new_self.add_query_vector(v)
return AsyncHybridQuery(new_self, self._table)
return AsyncHybridQuery(new_self)
else:
return AsyncHybridQuery(
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)),
self._table,
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
)
async def to_batches(
@@ -3397,7 +3029,7 @@ class AsyncVectorQueryBase:
class AsyncVectorQuery(AsyncStandardQuery, AsyncVectorQueryBase):
def __init__(self, inner: LanceVectorQuery, table: Optional["AsyncTable"] = None):
def __init__(self, inner: LanceVectorQuery):
"""
Construct an AsyncVectorQuery
@@ -3407,7 +3039,7 @@ class AsyncVectorQuery(AsyncStandardQuery, AsyncVectorQueryBase):
a vector query. Or you can use
[AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]
"""
super().__init__(inner, table)
super().__init__(inner)
self._inner = inner
self._reranker = None
self._query_string = None
@@ -3461,13 +3093,10 @@ class AsyncVectorQuery(AsyncStandardQuery, AsyncVectorQueryBase):
if isinstance(query, str):
return AsyncHybridQuery(
self._inner.nearest_to_text({"query": query, "columns": columns}),
self._table,
self._inner.nearest_to_text({"query": query, "columns": columns})
)
# FullTextQuery object
return AsyncHybridQuery(
self._inner.nearest_to_text({"query": query}), self._table
)
return AsyncHybridQuery(self._inner.nearest_to_text({"query": query}))
async def to_batches(
self,
@@ -3494,8 +3123,8 @@ class AsyncHybridQuery(AsyncStandardQuery, AsyncVectorQueryBase):
in the `rerank` method to convert the scores to ranks and then normalize them.
"""
def __init__(self, inner: LanceHybridQuery, table: Optional["AsyncTable"] = None):
super().__init__(inner, table)
def __init__(self, inner: LanceHybridQuery):
super().__init__(inner)
self._inner = inner
self._norm = "score"
self._reranker = RRFReranker()
@@ -3536,8 +3165,8 @@ class AsyncHybridQuery(AsyncStandardQuery, AsyncVectorQueryBase):
max_batch_length: Optional[int] = None,
timeout: Optional[timedelta] = None,
) -> AsyncRecordBatchReader:
fts_query = AsyncFTSQuery(self._inner.to_fts_query(), self._table)
vec_query = AsyncVectorQuery(self._inner.to_vector_query(), self._table)
fts_query = AsyncFTSQuery(self._inner.to_fts_query())
vec_query = AsyncVectorQuery(self._inner.to_vector_query())
# save the row ID choice that was made on the query builder and force it
# to actually fetch the row ids because we need this for reranking
@@ -3637,16 +3266,8 @@ class AsyncTakeQuery(AsyncQueryBase):
Builder for parameterizing and executing take queries.
"""
def __init__(self, inner: LanceTakeQuery, table: Optional["AsyncTable"] = None):
super().__init__(inner, table)
async def _plain_scan_to_pandas(
self,
blob_mode: BlobMode,
flatten: Optional[Union[int, bool]] = None,
**kwargs,
) -> Optional["pd.DataFrame"]:
return None
def __init__(self, inner: LanceTakeQuery):
super().__init__(inner)
class BaseQueryBuilder(object):
@@ -3698,27 +3319,6 @@ class BaseQueryBuilder(object):
self._inner.with_row_id()
return self
def with_row_address(self, with_row_address: bool = True) -> Self:
"""
Include the _rowaddr column in scanner-backed plain query results.
"""
self._inner.with_row_address(with_row_address)
return self
def with_fragments(self, fragments: Any) -> Self:
"""
Restrict scanner-backed plain query results to the given Lance fragments.
"""
self._inner.with_fragments(fragments)
return self
def fragment_ids(self, fragment_ids: List[int]) -> Self:
"""
Restrict scanner-backed plain query results to the given Lance fragment ids.
"""
self._inner.fragment_ids(fragment_ids)
return self
def output_schema(self) -> pa.Schema:
"""
Return the output schema for the query
@@ -3800,8 +3400,6 @@ class BaseQueryBuilder(object):
self,
flatten: Optional[Union[int, bool]] = None,
timeout: Optional[timedelta] = None,
*,
blob_mode: BlobMode = "lazy",
**kwargs,
) -> "pd.DataFrame":
"""
@@ -3835,15 +3433,11 @@ class BaseQueryBuilder(object):
The maximum time to wait for the query to complete.
If not specified, no timeout is applied. If the query does not
complete within the specified time, an error will be raised.
blob_mode: str, default "lazy"
Controls how blob columns are returned for plain scan queries.
**kwargs
Forwarded to pyarrow.Table.to_pandas after query execution and
optional flattening.
"""
return LOOP.run(
self._inner.to_pandas(flatten, timeout, blob_mode=blob_mode, **kwargs)
)
return LOOP.run(self._inner.to_pandas(flatten, timeout, **kwargs))
def to_polars(
self,

View File

@@ -3,7 +3,6 @@
from datetime import timedelta
import json
import logging
from concurrent.futures import ThreadPoolExecutor
import sys
@@ -18,7 +17,7 @@ else:
# Remove this import to fix circular dependency
# from lancedb import connect_async
from lancedb.remote import ClientConfig, RetryConfig, TimeoutConfig, TlsConfig
from lancedb.remote import ClientConfig
import pyarrow as pa
from ..common import DATA
@@ -37,64 +36,6 @@ from ..table import Table
from ..util import validate_table_name
def _duration_seconds(value: Optional[timedelta]) -> Optional[float]:
return value.total_seconds() if value is not None else None
def _timeout_config_to_dict(
config: Optional[TimeoutConfig],
) -> Optional[dict[str, Any]]:
if config is None:
return None
return {
"timeout": _duration_seconds(config.timeout),
"connect_timeout": _duration_seconds(config.connect_timeout),
"read_timeout": _duration_seconds(config.read_timeout),
"pool_idle_timeout": _duration_seconds(config.pool_idle_timeout),
}
def _retry_config_to_dict(config: RetryConfig) -> dict[str, Any]:
return {
"retries": config.retries,
"connect_retries": config.connect_retries,
"read_retries": config.read_retries,
"backoff_factor": config.backoff_factor,
"backoff_jitter": config.backoff_jitter,
"statuses": config.statuses,
}
def _tls_config_to_dict(config: Optional[TlsConfig]) -> Optional[dict[str, Any]]:
if config is None:
return None
return {
"cert_file": config.cert_file,
"key_file": config.key_file,
"ssl_ca_cert": config.ssl_ca_cert,
"assert_hostname": config.assert_hostname,
}
def _client_config_to_dict(config: ClientConfig) -> dict[str, Any]:
if config.header_provider is not None:
raise ValueError(
"Cannot serialize a remote connection with a header_provider. "
"Use static api_key/extra_headers or provide a worker-side "
"connection factory instead."
)
return {
"user_agent": config.user_agent,
"retry_config": _retry_config_to_dict(config.retry_config),
"timeout_config": _timeout_config_to_dict(config.timeout_config),
"extra_headers": config.extra_headers,
"id_delimiter": config.id_delimiter,
"tls_config": _tls_config_to_dict(config.tls_config),
"header_provider": None,
"user_id": config.user_id,
}
class RemoteDBConnection(DBConnection):
"""A connection to a remote LanceDB database."""
@@ -148,11 +89,6 @@ class RemoteDBConnection(DBConnection):
parsed = urlparse(db_url)
if parsed.scheme != "db":
raise ValueError(f"Invalid scheme: {parsed.scheme}, only accepts db://")
self.db_url = db_url
self.api_key = api_key
self.region = region
self.host_override = host_override
self.storage_options = storage_options
self.db_name = parsed.netloc
self.client_config = client_config
@@ -175,20 +111,6 @@ class RemoteDBConnection(DBConnection):
def __repr__(self) -> str:
return f"RemoteConnect(name={self.db_name})"
@override
def serialize(self) -> str:
return json.dumps(
{
"connection_type": "remote",
"db_url": self.db_url,
"api_key": self.api_key,
"region": self.region,
"host_override": self.host_override,
"client_config": _client_config_to_dict(self.client_config),
"storage_options": self.storage_options,
}
)
@override
def list_namespaces(
self,
@@ -383,8 +305,6 @@ class RemoteDBConnection(DBConnection):
namespace_path: Optional[List[str]] = None,
storage_options: Optional[Dict[str, str]] = None,
index_cache_size: Optional[int] = None,
branch: Optional[str] = None,
version: Optional[int] = None,
) -> Table:
"""Open a Lance Table in the database.
@@ -395,14 +315,6 @@ class RemoteDBConnection(DBConnection):
namespace_path: List[str], optional
The namespace to open the table from.
None or empty list represents root namespace.
branch: str, optional
If provided, open a handle scoped to this branch instead of the
default branch. Reads and writes operate in the branch's context.
version: int, optional
If provided, open the table pinned to this version, producing a
read-only handle. Composes with ``branch``: when both are given,
opens that branch at the version; otherwise opens ``main`` at the
version. Call ``checkout_latest`` to return to a writable state.
Returns
-------
@@ -419,17 +331,7 @@ class RemoteDBConnection(DBConnection):
)
table = LOOP.run(self._conn.open_table(name, namespace_path=namespace_path))
tbl = RemoteTable(
table,
self.db_name,
connection_state=self.serialize,
namespace_path=namespace_path,
)
if branch is not None:
tbl = tbl.branches.checkout(branch, version)
elif version is not None:
tbl.checkout(version)
return tbl
return RemoteTable(table, self.db_name)
def clone_table(
self,
@@ -478,12 +380,7 @@ class RemoteDBConnection(DBConnection):
is_shallow=is_shallow,
)
)
return RemoteTable(
table,
self.db_name,
connection_state=self.serialize,
namespace_path=target_namespace_path,
)
return RemoteTable(table, self.db_name)
@override
def create_table(
@@ -628,12 +525,7 @@ class RemoteDBConnection(DBConnection):
fill_value=fill_value,
)
)
return RemoteTable(
table,
self.db_name,
connection_state=self.serialize,
namespace_path=namespace_path,
)
return RemoteTable(table, self.db_name)
@override
def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):

View File

@@ -27,9 +27,6 @@ class LanceDBClientError(RuntimeError):
self.request_id = request_id
self.status_code = status_code
def __reduce__(self) -> tuple[type, tuple]:
return (self.__class__, (str(self), self.request_id, self.status_code))
class HttpError(LanceDBClientError):
"""An error that occurred during an HTTP request.
@@ -104,19 +101,3 @@ class RetryError(LanceDBClientError):
self.max_request_failures = max_request_failures
self.max_connect_failures = max_connect_failures
self.max_read_failures = max_read_failures
def __reduce__(self) -> tuple[type, tuple]:
return (
self.__class__,
(
str(self),
self.request_id,
self.request_failures,
self.connect_failures,
self.read_failures,
self.max_request_failures,
self.max_connect_failures,
self.max_read_failures,
self.status_code,
),
)

View File

@@ -2,30 +2,15 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
from datetime import timedelta
import deprecation
import logging
from functools import cached_property
import os
from typing import (
Any,
Callable,
Dict,
Iterable,
List,
Optional,
Union,
Literal,
overload,
)
from typing import Any, Callable, Dict, Iterable, List, Optional, Union, Literal
import warnings
from lancedb import __version__
from lancedb._lancedb import (
AddColumnsResult,
AddResult,
AlterColumnsResult,
UpdateFieldMetadataResult,
DeleteResult,
DropColumnsResult,
IndexConfig,
@@ -47,7 +32,6 @@ from lancedb.index import (
LabelList,
)
from lancedb.remote.db import LOOP
from lancedb.table import IndexConfigType, KNOWN_METRICS
import pyarrow as pa
from lancedb.common import DATA, VEC, VECTOR_COLUMN_NAME
@@ -56,7 +40,7 @@ from lancedb.embeddings import EmbeddingFunctionRegistry
from lancedb.table import _normalize_progress
from ..query import LanceVectorQueryBuilder, LanceQueryBuilder, LanceTakeQueryBuilder
from ..table import AsyncTable, BlobMode, Branches, IndexStatistics, Query, Table, Tags
from ..table import AsyncTable, BlobMode, IndexStatistics, Query, Table, Tags
from ..types import BaseTokenizerType
@@ -65,90 +49,14 @@ class RemoteTable(Table):
self,
table: AsyncTable,
db_name: str,
*,
connection_state: Optional[Union[str, Callable[[], str]]] = None,
namespace_path: Optional[List[str]] = None,
):
self._table_handle = table
self._name = table.name
self._table = table
self.db_name = db_name
self._connection_state = connection_state
self._namespace_path = list(namespace_path or [])
self._checkout_version: Optional[int] = None
# The branch this handle is scoped to (None == main). Persisted so a
# fork/pickle reopen restores the branch instead of reverting to main.
self._branch: Optional[str] = None
self._pid = os.getpid()
def _serialized_connection_state(self) -> str:
if self._connection_state is None:
raise RuntimeError(
"Cannot reopen this remote table because it does not carry "
"serialized connection state"
)
if callable(self._connection_state):
self._connection_state = self._connection_state()
return self._connection_state
@property
def _table(self) -> AsyncTable:
self._ensure_open()
assert self._table_handle is not None
return self._table_handle
@_table.setter
def _table(self, table: AsyncTable) -> None:
self._table_handle = table
self._name = table.name
self._pid = os.getpid()
def _ensure_open(self) -> None:
pid = os.getpid()
if self._table_handle is not None and self._pid == pid:
return
# Pickle clears the handle; fork inherits a handle created in the
# parent process. In both cases reopen before touching the Rust client.
from lancedb import deserialize_conn
db = deserialize_conn(self._serialized_connection_state(), for_worker=True)
# Reopen on the same branch and pinned version (branch=None / version=None
# reproduce the plain main-latest open).
table = db.open_table(
self._name,
namespace_path=self._namespace_path,
branch=self._branch,
version=self._checkout_version,
)
self._table_handle = table._table
self.db_name = table.db_name
self._pid = pid
def __getstate__(self) -> dict:
return {
"connection_state": self._serialized_connection_state(),
"db_name": self.db_name,
"name": self.name,
"namespace_path": self._namespace_path,
"checkout_version": self._checkout_version,
"branch": self._branch,
}
def __setstate__(self, state: dict) -> None:
self._table_handle = None
self._name = state["name"]
self.db_name = state["db_name"]
self._connection_state = state["connection_state"]
self._namespace_path = state["namespace_path"]
self._checkout_version = state["checkout_version"]
self._branch = state.get("branch")
self._pid = None
@property
def name(self) -> str:
"""The name of the table"""
return self._name
return self._table.name
def __repr__(self) -> str:
return f"RemoteTable({self.db_name}.{self.name})"
@@ -170,34 +78,6 @@ class RemoteTable(Table):
def tags(self) -> Tags:
return Tags(self._table)
@property
def branches(self) -> Branches:
"""Branch management for the table.
``create``/``checkout`` return a new table handle scoped to the branch;
writes on it do not affect ``main``.
"""
return Branches(self)
def current_branch(self) -> Optional[str]:
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
return self._table.current_branch()
def _wrap_branch_handle(
self, async_table: AsyncTable, version: Optional[int] = None
) -> "RemoteTable":
# A branch handle stays a RemoteTable with the same connection context.
# Record the branch and version pin so a fork/pickle reopen restores both.
handle = RemoteTable(
async_table,
self.db_name,
connection_state=self._connection_state,
namespace_path=self._namespace_path,
)
handle._branch = async_table.current_branch()
handle._checkout_version = version
return handle
@cached_property
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
"""
@@ -226,19 +106,13 @@ class RemoteTable(Table):
raise NotImplementedError("to_pandas() is not yet supported on LanceDB cloud.")
def checkout(self, version: Union[int, str]):
result = LOOP.run(self._table.checkout(version))
self._checkout_version = self.version
return result
return LOOP.run(self._table.checkout(version))
def checkout_latest(self):
result = LOOP.run(self._table.checkout_latest())
self._checkout_version = None
return result
return LOOP.run(self._table.checkout_latest())
def restore(self, version: Optional[Union[int, str]] = None):
result = LOOP.run(self._table.restore(version))
self._checkout_version = None
return result
return LOOP.run(self._table.restore(version))
def list_indices(self) -> Iterable[IndexConfig]:
"""List all the indices on the table"""
@@ -248,11 +122,6 @@ class RemoteTable(Table):
"""List all the stats of a specified index"""
return LOOP.run(self._table.index_stats(index_uuid))
@deprecation.deprecated(
deprecated_in="0.25.0",
current_version=__version__,
details="Use create_index() with config=BTree()/Bitmap()/LabelList() instead.",
)
def create_scalar_index(
self,
column: str,
@@ -262,12 +131,7 @@ class RemoteTable(Table):
wait_timeout: Optional[timedelta] = None,
name: Optional[str] = None,
):
"""Creates a scalar index.
.. deprecated:: 0.25.0
Use :meth:`create_index` with a BTree, Bitmap, or LabelList config instead.
Example: ``table.create_index("column", config=BTree())``
"""Creates a scalar index
Parameters
----------
column : str
@@ -298,11 +162,6 @@ class RemoteTable(Table):
)
)
@deprecation.deprecated(
deprecated_in="0.25.0",
current_version=__version__,
details="Use create_index() with config=FTS() instead.",
)
def create_fts_index(
self,
column: str,
@@ -323,12 +182,6 @@ class RemoteTable(Table):
prefix_only: bool = False,
name: Optional[str] = None,
):
"""Create a full-text search index on a column.
.. deprecated:: 0.25.0
Use :meth:`create_index` with an FTS config instead.
Example: ``table.create_index("text_column", config=FTS())``
"""
config = FTS(
with_position=with_position,
base_tokenizer=base_tokenizer,
@@ -352,43 +205,9 @@ class RemoteTable(Table):
)
)
# New unified API overload
@overload
def create_index(
self,
column: str,
/,
*,
config: IndexConfigType,
wait_timeout: Optional[timedelta] = ...,
name: Optional[str] = ...,
train: bool = ...,
) -> None: ...
# Legacy API overload (deprecated)
@overload
def create_index(
self,
metric: Literal["l2", "cosine", "dot", "hamming"] = ...,
vector_column_name: str = ...,
index_cache_size: Optional[int] = ...,
num_partitions: Optional[int] = ...,
num_sub_vectors: Optional[int] = ...,
replace: Optional[bool] = ...,
accelerator: Optional[str] = ...,
index_type: Literal[
"VECTOR", "IVF_FLAT", "IVF_SQ", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
] = ...,
wait_timeout: Optional[timedelta] = ...,
*,
num_bits: int = ...,
name: Optional[str] = ...,
train: bool = ...,
) -> None: ...
def create_index(
self,
metric: str = "l2",
metric="l2",
vector_column_name: str = VECTOR_COLUMN_NAME,
index_cache_size: Optional[int] = None,
num_partitions: Optional[int] = None,
@@ -399,113 +218,89 @@ class RemoteTable(Table):
wait_timeout: Optional[timedelta] = None,
*,
num_bits: int = 8,
config: Optional[IndexConfigType] = None,
name: Optional[str] = None,
train: bool = True,
):
"""Create an index on a column.
"""Create an index on the table.
This method supports both the new unified API and the legacy API
for backwards compatibility. The new API takes the column name as the
first positional argument and an index configuration object via
``config``; the legacy API takes the distance metric as the first
argument plus separate ``vector_column_name`` / ``num_partitions`` /
etc. parameters, and emits a ``DeprecationWarning``.
Parameters
----------
metric : str
The metric to use for the index. Default is "l2".
vector_column_name : str
The name of the vector column. Default is "vector".
Examples
--------
New API (recommended):
>>> table.create_index( # doctest: +SKIP
... "vector", config=IvfPq(distance_type="l2")
>>> import lancedb
>>> import uuid
>>> from lancedb.schema import vector
>>> db = lancedb.connect("db://...", api_key="...", # doctest: +SKIP
... region="...") # doctest: +SKIP
>>> table_name = uuid.uuid4().hex
>>> schema = pa.schema(
... [
... pa.field("id", pa.uint32(), False),
... pa.field("vector", vector(128), False),
... pa.field("s", pa.string(), False),
... ]
... )
>>> table.create_index("category", config=BTree()) # doctest: +SKIP
>>> table.create_index("content", config=FTS()) # doctest: +SKIP
Legacy API (deprecated):
>>> table.create_index( # doctest: +SKIP
... "l2", vector_column_name="vector"
>>> table = db.create_table( # doctest: +SKIP
... table_name, # doctest: +SKIP
... schema=schema, # doctest: +SKIP
... )
>>> table.create_index("l2", "vector") # doctest: +SKIP
"""
# Detect whether this is a legacy API call
is_legacy = self._is_legacy_create_index_call(
metric,
config,
num_partitions,
num_sub_vectors,
vector_column_name,
accelerator,
index_cache_size,
replace,
)
if is_legacy:
warnings.warn(
"The create_index() API with metric/num_partitions parameters is "
"deprecated and will be removed in a future version. "
"Please migrate to the new unified API:\n"
" # Old (deprecated):\n"
" table.create_index('l2', vector_column_name='my_vector')\n"
" # New (recommended):\n"
" table.create_index('my_vector', config=IvfPq(distance_type='l2'))",
DeprecationWarning,
stacklevel=2,
if accelerator is not None:
logging.warning(
"GPU accelerator is not yet supported on LanceDB cloud."
"If you have 100M+ vectors to index,"
"please contact us at contact@lancedb.com"
)
if replace is not None:
logging.warning(
"replace is not supported on LanceDB cloud."
"Existing indexes will always be replaced."
)
column = vector_column_name
if accelerator is not None:
logging.warning(
"GPU accelerator is not yet supported on LanceDB cloud."
"If you have 100M+ vectors to index,"
"please contact us at contact@lancedb.com"
)
if replace is not None:
logging.warning(
"replace is not supported on LanceDB cloud."
"Existing indexes will always be replaced."
)
idx_type = index_type.upper()
if idx_type == "VECTOR" or idx_type == "IVF_PQ":
config = IvfPq(
distance_type=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
num_bits=num_bits,
)
elif idx_type == "IVF_RQ":
config = IvfRq(
distance_type=metric,
num_partitions=num_partitions,
num_bits=num_bits,
)
elif idx_type == "IVF_SQ":
config = IvfSq(distance_type=metric, num_partitions=num_partitions)
elif idx_type == "IVF_HNSW_PQ":
raise ValueError(
"IVF_HNSW_PQ is not supported on LanceDB cloud."
"Please use IVF_HNSW_SQ instead."
)
elif idx_type == "IVF_HNSW_SQ":
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
elif idx_type == "IVF_HNSW_FLAT":
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
elif idx_type == "IVF_FLAT":
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
else:
raise ValueError(
f"Unknown vector index type: {idx_type}. Valid options are"
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
)
index_type = index_type.upper()
if index_type == "VECTOR" or index_type == "IVF_PQ":
config = IvfPq(
distance_type=metric,
num_partitions=num_partitions,
num_sub_vectors=num_sub_vectors,
num_bits=num_bits,
)
elif index_type == "IVF_RQ":
config = IvfRq(
distance_type=metric,
num_partitions=num_partitions,
num_bits=num_bits,
)
elif index_type == "IVF_SQ":
config = IvfSq(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_HNSW_PQ":
raise ValueError(
"IVF_HNSW_PQ is not supported on LanceDB cloud."
"Please use IVF_HNSW_SQ instead."
)
elif index_type == "IVF_HNSW_SQ":
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_HNSW_FLAT":
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
elif index_type == "IVF_FLAT":
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
else:
column = metric
raise ValueError(
f"Unknown vector index type: {index_type}. Valid options are"
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
)
LOOP.run(
self._table.create_index(
column,
vector_column_name,
config=config,
wait_timeout=wait_timeout,
name=name,
@@ -513,37 +308,6 @@ class RemoteTable(Table):
)
)
def _is_legacy_create_index_call(
self,
first_arg: str,
config: Optional[IndexConfigType],
num_partitions: Optional[int],
num_sub_vectors: Optional[int],
vector_column_name: str,
accelerator: Optional[str],
index_cache_size: Optional[int],
replace: Optional[bool],
) -> bool:
"""Detect if this is a legacy create_index call."""
if config is not None:
return False
if any(
x is not None
for x in (
num_partitions,
num_sub_vectors,
accelerator,
index_cache_size,
replace,
)
):
return True
if vector_column_name != VECTOR_COLUMN_NAME:
return True
if first_arg.lower() in KNOWN_METRICS:
return True
return False
def add(
self,
data: DATA,
@@ -889,11 +653,6 @@ class RemoteTable(Table):
) -> AlterColumnsResult:
return LOOP.run(self._table.alter_columns(*alterations))
def update_field_metadata(
self, *updates: dict[str, Any]
) -> UpdateFieldMetadataResult:
return LOOP.run(self._table.update_field_metadata(*updates))
def drop_columns(self, columns: Iterable[str]) -> DropColumnsResult:
return LOOP.run(self._table.drop_columns(columns))
@@ -909,10 +668,6 @@ class RemoteTable(Table):
"""Not supported on LanceDB Cloud."""
return LOOP.run(self._table.unset_lsm_write_spec())
def close_lsm_writers(self) -> None:
"""No-op on LanceDB Cloud (no local shard writers)."""
return LOOP.run(self._table.close_lsm_writers())
def drop_index(self, index_name: str):
return LOOP.run(self._table.drop_index(index_name))

View File

@@ -125,9 +125,6 @@ class MRRReranker(Reranker):
This cannot reuse rerank_hybrid because MRR semantics require treating
each vector result as a separate ranking system.
"""
if not vector_results:
raise ValueError("vector_results must not be empty")
if not all(isinstance(v, type(vector_results[0])) for v in vector_results):
raise ValueError(
"All elements in vector_results should be of the same type"

View File

@@ -82,9 +82,6 @@ class RRFReranker(Reranker):
results from multiple vector searches as it doesn't support reranking
vector results individually.
"""
if not vector_results:
raise ValueError("vector_results must not be empty")
# Make sure all elements are of the same type
if not all(isinstance(v, type(vector_results[0])) for v in vector_results):
raise ValueError(

File diff suppressed because it is too large Load Diff

View File

@@ -385,21 +385,6 @@ def _(value: np.ndarray):
return value_to_sql(value.tolist())
@value_to_sql.register(np.bool_)
def _(value: np.bool_):
return value_to_sql(bool(value))
@value_to_sql.register(np.integer)
def _(value: np.integer):
return value_to_sql(int(value))
@value_to_sql.register(np.floating)
def _(value: np.floating):
return value_to_sql(float(value))
def deprecated(func):
"""This is a decorator which can be used to mark functions
as deprecated. It will result in a warning being emitted

View File

@@ -57,7 +57,7 @@ async def test_upsert_async(mem_db_async):
await table.count_rows() # 3
res
# MergeResult(version=2, num_updated_rows=1,
# num_inserted_rows=1, num_deleted_rows=0, num_rows=2)
# num_inserted_rows=1, num_deleted_rows=0)
# --8<-- [end:upsert_basic_async]
assert await table.count_rows() == 3
assert res.version == 2
@@ -86,7 +86,7 @@ def test_insert_if_not_exists(mem_db):
table.count_rows() # 3
res
# MergeResult(version=2, num_updated_rows=0,
# num_inserted_rows=1, num_deleted_rows=0, num_rows=1)
# num_inserted_rows=1, num_deleted_rows=0)
# --8<-- [end:insert_if_not_exists]
assert table.count_rows() == 3
assert res.version == 2
@@ -116,7 +116,7 @@ async def test_insert_if_not_exists_async(mem_db_async):
await table.count_rows() # 3
res
# MergeResult(version=2, num_updated_rows=0,
# num_inserted_rows=1, num_deleted_rows=0, num_rows=1)
# num_inserted_rows=1, num_deleted_rows=0)
# --8<-- [end:insert_if_not_exists]
assert await table.count_rows() == 3
assert res.version == 2
@@ -150,7 +150,7 @@ def test_replace_range(mem_db):
table.count_rows("doc_id = 1") # 1
res
# MergeResult(version=2, num_updated_rows=1,
# num_inserted_rows=0, num_deleted_rows=1, num_rows=1)
# num_inserted_rows=0, num_deleted_rows=1)
# --8<-- [end:insert_if_not_exists]
assert table.count_rows("doc_id = 1") == 1
assert res.version == 2
@@ -185,7 +185,7 @@ async def test_replace_range_async(mem_db_async):
await table.count_rows("doc_id = 1") # 1
res
# MergeResult(version=2, num_updated_rows=1,
# num_inserted_rows=0, num_deleted_rows=1, num_rows=1)
# num_inserted_rows=0, num_deleted_rows=1)
# --8<-- [end:insert_if_not_exists]
assert await table.count_rows("doc_id = 1") == 1
assert res.version == 2

View File

@@ -1,56 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import pickle
from lancedb.remote.errors import HttpError, LanceDBClientError, RetryError
def test_pickle_lancedb_client_error():
err = LanceDBClientError("something went wrong", "req-123", 400)
restored = pickle.loads(pickle.dumps(err))
assert str(restored) == "something went wrong"
assert restored.request_id == "req-123"
assert restored.status_code == 400
def test_pickle_lancedb_client_error_no_status_code():
err = LanceDBClientError("fail", "req-456")
restored = pickle.loads(pickle.dumps(err))
assert str(restored) == "fail"
assert restored.request_id == "req-456"
assert restored.status_code is None
def test_pickle_http_error():
err = HttpError("not found", "req-789", 404)
restored = pickle.loads(pickle.dumps(err))
assert isinstance(restored, HttpError)
assert str(restored) == "not found"
assert restored.request_id == "req-789"
assert restored.status_code == 404
def test_pickle_retry_error():
err = RetryError(
"max retries exceeded",
"req-abc",
request_failures=3,
connect_failures=1,
read_failures=2,
max_request_failures=5,
max_connect_failures=3,
max_read_failures=3,
status_code=503,
)
restored = pickle.loads(pickle.dumps(err))
assert isinstance(restored, RetryError)
assert str(restored) == "max retries exceeded"
assert restored.request_id == "req-abc"
assert restored.request_failures == 3
assert restored.connect_failures == 1
assert restored.read_failures == 2
assert restored.max_request_failures == 5
assert restored.max_connect_failures == 3
assert restored.max_read_failures == 3
assert restored.status_code == 503

View File

@@ -215,12 +215,11 @@ def test_reject_legacy_tantivy_index(table):
@pytest.mark.parametrize("with_position", [True, False])
def test_create_inverted_index(table, with_position):
with pytest.warns(DeprecationWarning, match="create_fts_index"):
table.create_fts_index(
"text",
with_position=with_position,
name="custom_fts_index",
)
table.create_fts_index(
"text",
with_position=with_position,
name="custom_fts_index",
)
indices = table.list_indices()
fts_indices = [i for i in indices if i.index_type == "FTS"]
assert any(i.name == "custom_fts_index" for i in fts_indices)

View File

@@ -20,7 +20,6 @@ from lancedb.index import (
IvfRq,
Bitmap,
LabelList,
Fm,
HnswPq,
HnswSq,
HnswFlat,
@@ -114,14 +113,8 @@ async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
pa.field("user.id", pa.int32()),
]
)
mixed_case_metadata_type = pa.struct([pa.field("userId", pa.int32())])
escaped_metadata_type = pa.struct([pa.field("user-id", pa.int32())])
literal_type = pa.struct([pa.field("a.b", pa.int32())])
data = pa.Table.from_arrays(
[
pa.array([1, 2, 3], type=pa.int32()),
pa.array([1, 2, 3], type=pa.int32()),
pa.array([1, 2, 3], type=pa.int32()),
pa.array([1, 2, 3], type=pa.int32()),
pa.array(
[
@@ -131,67 +124,25 @@ async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
],
type=metadata_type,
),
pa.array(
[{"userId": 10}, {"userId": 20}, {"userId": 30}],
type=mixed_case_metadata_type,
),
pa.array(
[{"user-id": 10}, {"user-id": 20}, {"user-id": 30}],
type=escaped_metadata_type,
),
pa.array(
[{"a.b": 10}, {"a.b": 20}, {"a.b": 30}],
type=literal_type,
),
],
names=[
"rowId",
"row-id",
"userId",
"user_id",
"metadata",
"MetaData",
"meta-data",
"literal",
],
names=["user_id", "metadata"],
)
table = await db_async.create_table("nested_scalar_index", data)
await table.create_index("rowId", config=BTree(), name="row_id_idx")
await table.create_index("`row-id`", config=BTree(), name="row_dash_id_idx")
await table.create_index("userId", config=BTree(), name="top_user_id_idx")
await table.create_index("user_id", config=BTree(), name="top_snake_user_id_idx")
await table.create_index("user_id", config=BTree(), name="top_user_id_idx")
await table.create_index(
"metadata.user_id", config=BTree(), name="nested_user_id_idx"
)
await table.create_index(
"metadata.`user.id`", config=BTree(), name="escaped_user_id_idx"
)
await table.create_index(
"MetaData.userId", config=BTree(), name="mixed_case_metadata_user_id_idx"
)
await table.create_index(
"`meta-data`.`user-id`", config=BTree(), name="escaped_names_idx"
)
await table.create_index("literal.`a.b`", config=BTree(), name="literal_dot_idx")
columns_by_name = {
index.name: index.columns for index in await table.list_indices()
}
assert columns_by_name["row_id_idx"] == ["rowId"]
assert columns_by_name["row_dash_id_idx"] == ["`row-id`"]
assert columns_by_name["top_user_id_idx"] == ["userId"]
assert columns_by_name["top_snake_user_id_idx"] == ["user_id"]
assert columns_by_name["top_user_id_idx"] == ["user_id"]
assert columns_by_name["nested_user_id_idx"] == ["metadata.user_id"]
assert columns_by_name["escaped_user_id_idx"] == ["metadata.`user.id`"]
assert columns_by_name["mixed_case_metadata_user_id_idx"] == ["MetaData.userId"]
assert columns_by_name["escaped_names_idx"] == ["`meta-data`.`user-id`"]
assert columns_by_name["literal_dot_idx"] == ["literal.`a.b`"]
for index_name in columns_by_name:
stats = await table.index_stats(index_name)
assert stats is not None
assert stats.num_indexed_rows == 3
@pytest.mark.asyncio
@@ -204,16 +155,6 @@ async def test_create_fixed_size_binary_index(some_table: AsyncTable):
assert indices[0].columns == ["fsb"]
@pytest.mark.asyncio
async def test_create_fm_index(some_table: AsyncTable):
# FM-Index accelerates substring search on string/binary columns.
await some_table.create_index("data", config=Fm())
indices = await some_table.list_indices()
assert len(indices) == 1
assert indices[0].index_type == "Fm"
assert indices[0].columns == ["data"]
@pytest.mark.asyncio
async def test_create_bitmap_index(some_table: AsyncTable):
await some_table.create_index("id", config=Bitmap())
@@ -221,13 +162,12 @@ async def test_create_bitmap_index(some_table: AsyncTable):
await some_table.create_index("data", config=Bitmap())
indices = await some_table.list_indices()
assert len(indices) == 3
# list_indices returns indices in alphabetical order by name
assert indices[0].index_type == "Bitmap"
assert indices[0].columns == ["data"]
assert indices[0].columns == ["id"]
assert indices[1].index_type == "Bitmap"
assert indices[1].columns == ["id"]
assert indices[1].columns == ["is_active"]
assert indices[2].index_type == "Bitmap"
assert indices[2].columns == ["is_active"]
assert indices[2].columns == ["data"]
index_name = indices[0].name
stats = await some_table.index_stats(index_name)
@@ -248,51 +188,6 @@ async def test_create_label_list_index(some_table: AsyncTable):
await some_table.create_index("tags", config=LabelList())
indices = await some_table.list_indices()
assert str(indices) == '[Index(LabelList, columns=["tags"], name="tags_idx")]'
plan = await some_table.query().where("array_has(tags, 'tag0')").explain_plan()
assert "ScalarIndexQuery" in plan
@pytest.mark.asyncio
async def test_create_large_list_label_list_index(db_async):
data = pa.Table.from_pydict(
{"tags": [[f"tag{i % 2}", "shared"] for i in range(16)]},
schema=pa.schema([pa.field("tags", pa.large_list(pa.string()))]),
)
table = await db_async.create_table("large_list_label_list_index", data)
await table.create_index("tags", config=LabelList())
indices = await table.list_indices()
assert str(indices) == '[Index(LabelList, columns=["tags"], name="tags_idx")]'
plan = await table.query().where("array_has(tags, 'shared')").explain_plan()
assert "ScalarIndexQuery" in plan
@pytest.mark.asyncio
async def test_create_label_list_index_rejects_list_struct(db_async):
item_type = pa.struct(
[
pa.field("tag", pa.string()),
pa.field(
"metadata",
pa.struct([pa.field("userId", pa.string())]),
),
]
)
data = pa.Table.from_pylist(
[
{
"items": [
{"tag": "tag0", "metadata": {"userId": "user0"}},
{"tag": "shared", "metadata": {"userId": "user1"}},
]
}
],
schema=pa.schema([pa.field("items", pa.list_(item_type))]),
)
table = await db_async.create_table("list_struct_label_list_index", data)
with pytest.raises(Exception, match="LabelList index cannot be created"):
await table.create_index("items", config=LabelList())
@pytest.mark.asyncio
@@ -330,6 +225,7 @@ async def test_create_vector_index(some_table: AsyncTable):
assert stats.num_indexed_rows == await some_table.count_rows()
assert stats.num_unindexed_rows == 0
assert stats.num_indices == 1
assert stats.loss >= 0.0
@pytest.mark.asyncio
@@ -353,6 +249,7 @@ async def test_create_4bit_ivfpq_index(some_table: AsyncTable):
assert stats.num_indexed_rows == await some_table.count_rows()
assert stats.num_unindexed_rows == 0
assert stats.num_indices == 1
assert stats.loss >= 0.0
@pytest.mark.asyncio

View File

@@ -1,196 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Tests for the MemWAL LSM ``merge_insert`` dispatch."""
from datetime import timedelta
import lancedb
import pyarrow as pa
import pytest
from lancedb._lancedb import LsmWriteSpec
SCHEMA = pa.schema(
[
pa.field("id", pa.int64(), nullable=False),
pa.field("value", pa.int64(), nullable=False),
]
)
REGION_SCHEMA = pa.schema(
[
pa.field("id", pa.int64(), nullable=False),
pa.field("region", pa.utf8(), nullable=False),
]
)
def _reader(ids):
batch = pa.RecordBatch.from_arrays(
[
pa.array(ids, type=pa.int64()),
pa.array(list(range(len(ids))), type=pa.int64()),
],
schema=SCHEMA,
)
return pa.RecordBatchReader.from_batches(SCHEMA, [batch])
def _region_reader(rows):
batch = pa.RecordBatch.from_arrays(
[
pa.array([row[0] for row in rows], type=pa.int64()),
pa.array([row[1] for row in rows], type=pa.utf8()),
],
schema=REGION_SCHEMA,
)
return pa.RecordBatchReader.from_batches(REGION_SCHEMA, [batch])
def _bucket_table(tmp_path):
"""A table with ``id`` as the primary key and a single-bucket LSM spec."""
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _reader([1, 2, 3]))
table.set_unenforced_primary_key("id")
# num_buckets = 1: every row routes to the single bucket.
table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1))
return table
def test_lsm_merge_insert_bucket(tmp_path):
table = _bucket_table(tmp_path)
# Empty `on` defaults to the primary key.
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([3, 4, 5]))
)
# LSM path: rows go to the MemWAL, so only num_rows is populated.
assert result.num_rows == 3
assert result.version == 0
assert result.num_inserted_rows == 0
assert result.num_updated_rows == 0
def test_lsm_merge_insert_unsharded(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _reader([1, 2, 3]))
table.set_unenforced_primary_key("id")
table.set_lsm_write_spec(LsmWriteSpec.unsharded())
result = (
table.merge_insert("id")
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([10, 11, 12, 13]))
)
assert result.num_rows == 4
def test_lsm_merge_insert_identity(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _region_reader([(1, "us"), (2, "us")]))
table.set_unenforced_primary_key("id")
table.set_lsm_write_spec(LsmWriteSpec.identity("region"))
# All rows share one identity value, so they route to one shard.
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_region_reader([(3, "us"), (4, "us")]))
)
assert result.num_rows == 2
def test_lsm_merge_insert_use_lsm_write_false(tmp_path):
table = _bucket_table(tmp_path) # rows id = 1, 2, 3
# use_lsm_write(False) opts out: the standard path runs and commits.
result = (
table.merge_insert("id")
.when_not_matched_insert_all()
.use_lsm_write(False)
.execute(_reader([3, 4, 5]))
)
assert result.num_inserted_rows == 2
assert table.count_rows() == 5
def test_lsm_merge_insert_validate_single_shard_off(tmp_path):
table = _bucket_table(tmp_path)
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.validate_single_shard(False)
.execute(_reader([6, 7, 8]))
)
assert result.num_rows == 3
def test_lsm_merge_insert_use_lsm_write_true_requires_spec(tmp_path):
# A table with a primary key but no LSM write spec installed.
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
table = db.create_table("t", _reader([1, 2, 3]))
table.set_unenforced_primary_key("id")
with pytest.raises(Exception, match="use_lsm_write"):
(
table.merge_insert("id")
.when_matched_update_all()
.when_not_matched_insert_all()
.use_lsm_write(True)
.execute(_reader([4]))
)
def test_lsm_merge_insert_rejects_on_not_primary_key(tmp_path):
table = _bucket_table(tmp_path)
with pytest.raises(Exception, match="primary key"):
(
table.merge_insert("value")
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([1]))
)
def test_lsm_merge_insert_rejects_non_upsert(tmp_path):
table = _bucket_table(tmp_path)
# Insert-only (no when_matched_update_all) is not the upsert shape.
with pytest.raises(Exception, match="upsert"):
table.merge_insert([]).when_not_matched_insert_all().execute(_reader([4]))
def test_lsm_close_writers(tmp_path):
table = _bucket_table(tmp_path)
(
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([7, 8]))
)
table.close_lsm_writers()
# The writer reopens lazily on the next merge_insert.
result = (
table.merge_insert([])
.when_matched_update_all()
.when_not_matched_insert_all()
.execute(_reader([9]))
)
assert result.num_rows == 1
@pytest.mark.asyncio
async def test_async_lsm_merge_insert(tmp_path):
db = await lancedb.connect_async(
tmp_path, read_consistency_interval=timedelta(seconds=0)
)
table = await db.create_table("t", _reader([1, 2, 3]))
await table.set_unenforced_primary_key("id")
await table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1))
builder = (
table.merge_insert([]).when_matched_update_all().when_not_matched_insert_all()
)
result = await builder.execute(_reader([3, 4, 5]))
assert result.num_rows == 3
await table.close_lsm_writers()

View File

@@ -5,67 +5,10 @@
import tempfile
import shutil
import sys
import pytest
import pyarrow as pa
import lancedb
from lance_namespace.errors import NamespaceNotEmptyError, TableNotFoundError
from lancedb.table import AsyncTable, LanceTable
PUSHDOWN_DATA = pa.table(
{"id": list(range(12)), "text": [f"row-{idx}" for idx in range(12)]}
)
def _ipc_file(table: pa.Table = PUSHDOWN_DATA) -> bytes:
sink = pa.BufferOutputStream()
with pa.ipc.new_file(sink, table.schema) as writer:
writer.write_table(table)
return sink.getvalue().to_pybytes()
class _FailingSyncInner:
name = "hist"
def current_branch(self):
# The pushdown gate only routes server-side when on the default branch.
return None
async def schema(self):
return PUSHDOWN_DATA.schema
async def to_arrow(self):
raise RuntimeError("direct table to_arrow should not be used")
class _FailingAsyncInner:
def name(self):
return "hist"
async def schema(self):
return PUSHDOWN_DATA.schema
def query(self):
raise AssertionError("direct async query should not be used")
class _NamespaceClient:
def __init__(self):
self.requests = []
def query_table(self, request):
self.requests.append(request)
return _ipc_file()
def _namespace_lance_table(namespace_client: _NamespaceClient) -> LanceTable:
table = LanceTable.__new__(LanceTable)
table._table = _FailingSyncInner()
table._namespace_path = ["geneva"]
table._namespace_client = namespace_client
table._pushdown_operations = {"QueryTable"}
return table
class TestNamespaceConnection:
@@ -133,35 +76,6 @@ class TestNamespaceConnection:
assert len(result) == 0
assert list(result.columns) == ["id", "vector", "text"]
def test_table_to_pandas_blob_lazy_through_namespace(self):
"""Namespace-backed tables should use Lance blob-aware pandas conversion."""
pytest.importorskip("lance")
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
db.create_namespace(["test_ns"])
data = pa.table(
{
"id": pa.array([1, 2], pa.int64()),
"blob": pa.array([b"hello", b"world"], pa.large_binary()),
},
schema=pa.schema(
[
pa.field("id", pa.int64()),
pa.field(
"blob",
pa.large_binary(),
metadata={"lance-encoding:blob": "true"},
),
]
),
)
table = db.create_table("blob_table", data, namespace_path=["test_ns"])
df = table.to_pandas(blob_mode="lazy").sort_values("id")
blob = df["blob"].iloc[0]
assert hasattr(blob, "readall")
assert blob.readall() == b"hello"
def test_open_table_through_namespace(self):
"""Test opening an existing table through namespace."""
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
@@ -257,15 +171,8 @@ class TestNamespaceConnection:
assert table_schema.field("id").type == pa.int64()
assert table_schema.field("text").type == pa.string()
def test_rename_table(self):
"""Test that rename_table renames a table in the namespace.
The `dir` namespace implementation in lance-namespace-impls does not
implement `rename_table` yet (only the `rest` backend does), so it
currently falls back to the default trait method which raises
NotSupported. This is expected to start passing once the `dir`
backend gains rename_table support upstream.
"""
def test_rename_table_not_supported(self):
"""Test that rename_table raises NotImplementedError."""
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
# Create a child namespace first
@@ -280,14 +187,9 @@ class TestNamespaceConnection:
)
db.create_table("old_name", schema=schema, namespace_path=["test_ns"])
# Rename the table within the same namespace
with pytest.raises(NotImplementedError, match="rename_table not implemented"):
db.rename_table(
"old_name",
"new_name",
cur_namespace_path=["test_ns"],
new_namespace_path=["test_ns"],
)
# Rename should raise NotImplementedError
with pytest.raises(NotImplementedError, match="rename_table is not supported"):
db.rename_table("old_name", "new_name")
def test_drop_all_tables(self):
"""Test dropping all tables through namespace."""
@@ -805,22 +707,6 @@ class TestPushdownOperations:
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
assert len(db._namespace_client_pushdown_operations) == 0
def test_lance_table_to_arrow_uses_query_pushdown(self):
namespace_client = _NamespaceClient()
table = _namespace_lance_table(namespace_client)
assert table.to_arrow().equals(PUSHDOWN_DATA)
assert table.to_pandas()["id"].tolist() == list(range(12))
assert len(namespace_client.requests) == 2
assert [request.id for request in namespace_client.requests] == [
["geneva", "hist"],
["geneva", "hist"],
]
assert [request.k for request in namespace_client.requests] == [
sys.maxsize,
sys.maxsize,
]
@pytest.mark.asyncio
class TestAsyncPushdownOperations:
@@ -856,39 +742,3 @@ class TestAsyncPushdownOperations:
"""Test that pushdown operations default to empty on async connection."""
db = lancedb.connect_namespace_async("dir", {"root": self.temp_dir})
assert len(db._namespace_client_pushdown_operations) == 0
async def test_async_table_to_arrow_uses_query_pushdown(self):
namespace_client = _NamespaceClient()
table = AsyncTable(
_FailingAsyncInner(),
namespace_path=["geneva"],
namespace_client=namespace_client,
pushdown_operations={"QueryTable"},
)
assert (await table.to_arrow()).equals(PUSHDOWN_DATA)
assert (await table.to_pandas())["id"].tolist() == list(range(12))
assert len(namespace_client.requests) == 2
assert [request.id for request in namespace_client.requests] == [
["geneva", "hist"],
["geneva", "hist"],
]
assert [request.k for request in namespace_client.requests] == [
sys.maxsize,
sys.maxsize,
]
def test_local_table_to_arrow_and_to_pandas_are_unchanged(tmp_path):
db = lancedb.connect(str(tmp_path / "db"))
table = db.create_table(
"local",
data=[
{"id": 1, "vector": [1.0, 2.0]},
{"id": 2, "vector": [3.0, 4.0]},
],
)
assert table.to_arrow().column("id").to_pylist() == [1, 2]
assert table.to_pandas()["id"].tolist() == [1, 2]

View File

@@ -1,686 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Regression matrix for nested field support across LanceDB Python APIs.
Covers the lifecycle described in lancedb/lancedb#3406:
- Nested scalar, vector, and FTS index creation with full dotted paths
- list_indices / index_stats return canonical full paths (not leaf names)
- search, filter, append, optimize behaviour
- Field-name edge cases: mixed case, literal-dot field names, same-name leaves
- Both sync and async Python table APIs
The matrix uses the following field-name variants from the acceptance criteria:
- rowId (camelCase top-level)
- `row-id` (hyphenated top-level, escaped)
- parent.`leaf.name` (struct leaf whose name contains a literal dot)
- MetaData.userId (mixed-case nested path)
- `meta-data`.`user-id` (hyphenated struct with hyphenated leaf)
Note: Lance forbids top-level field names that contain a '.', so the literal-dot
edge case is exercised via a struct leaf field (parent.`leaf.name`) instead.
"""
from datetime import timedelta
import pyarrow as pa
import pytest
import pytest_asyncio
import lancedb
from lancedb.db import AsyncConnection, DBConnection
from lancedb.index import BTree, FTS, IvfPq
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
DIM = 8
# IvfPq requires at least num_partitions * 256 rows by default; keeping rows
# small means we must drop num_sub_vectors and num_partitions very low.
NROWS = 256
def _vec(row: int) -> list:
return [float((row * DIM + i) % 256) for i in range(DIM)]
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture
def sync_db(tmp_path) -> DBConnection:
return lancedb.connect(tmp_path)
@pytest_asyncio.fixture
async def async_db(tmp_path) -> AsyncConnection:
return await lancedb.connect_async(
tmp_path, read_consistency_interval=timedelta(seconds=0)
)
# ---------------------------------------------------------------------------
# Schema / data builders
# ---------------------------------------------------------------------------
def _nested_scalar_schema() -> pa.Schema:
"""Schema with nested scalar fields covering the acceptance-criteria names.
Top-level columns:
- rowId int32 (camelCase top-level)
- row-id int32 (hyphenated top-level name)
- MetaData struct{userId int32} (mixed-case nested path)
- meta-data struct{user-id int32} (hyphenated struct + hyphenated leaf)
Lance disallows top-level field names that contain '.' (e.g. a field
literally named 'a.b'), so that edge case is tested separately using
_literal_dot_schema() below.
"""
return pa.schema(
[
pa.field("rowId", pa.int32()),
pa.field("row-id", pa.int32()),
pa.field(
"MetaData",
pa.struct([pa.field("userId", pa.int32())]),
),
pa.field(
"meta-data",
pa.struct([pa.field("user-id", pa.int32())]),
),
]
)
def _nested_scalar_data(nrows: int = NROWS) -> pa.Table:
schema = _nested_scalar_schema()
return pa.table(
{
"rowId": pa.array(list(range(nrows)), pa.int32()),
"row-id": pa.array(list(range(nrows)), pa.int32()),
"MetaData": pa.array(
[{"userId": i} for i in range(nrows)],
type=pa.struct([pa.field("userId", pa.int32())]),
),
"meta-data": pa.array(
[{"user-id": i} for i in range(nrows)],
type=pa.struct([pa.field("user-id", pa.int32())]),
),
},
schema=schema,
)
def _literal_dot_schema() -> pa.Schema:
"""Schema where a struct *leaf* field is named with a literal dot.
The path used in the index API is ``parent.`leaf.name` ``.
"""
return pa.schema(
[
pa.field("id", pa.int32()),
pa.field(
"parent",
pa.struct([pa.field("leaf.name", pa.int32())]),
),
]
)
def _literal_dot_data(nrows: int = NROWS) -> pa.Table:
parent_type = pa.struct([pa.field("leaf.name", pa.int32())])
return pa.table(
{
"id": pa.array(list(range(nrows)), pa.int32()),
"parent": pa.array(
[{"leaf.name": i} for i in range(nrows)],
type=parent_type,
),
},
schema=_literal_dot_schema(),
)
def _same_leaf_schema() -> pa.Schema:
return pa.schema(
[
pa.field("StructA", pa.struct([pa.field("userId", pa.int32())])),
pa.field("StructB", pa.struct([pa.field("userId", pa.int32())])),
]
)
def _same_leaf_data(nrows: int = NROWS) -> pa.Table:
t = pa.struct([pa.field("userId", pa.int32())])
return pa.table(
{
"StructA": pa.array([{"userId": i} for i in range(nrows)], type=t),
"StructB": pa.array([{"userId": i * 10} for i in range(nrows)], type=t),
},
schema=_same_leaf_schema(),
)
def _nested_vector_schema() -> pa.Schema:
return pa.schema(
[
pa.field("id", pa.int32()),
pa.field(
"image",
pa.struct([pa.field("embedding", pa.list_(pa.float32(), DIM))]),
),
pa.field(
"MetaData",
pa.struct([pa.field("userId", pa.int32())]),
),
]
)
def _nested_vector_data(nrows: int = NROWS) -> pa.Table:
embedding_type = pa.list_(pa.float32(), DIM)
image_type = pa.struct([pa.field("embedding", embedding_type)])
meta_type = pa.struct([pa.field("userId", pa.int32())])
return pa.table(
{
"id": pa.array(list(range(nrows)), pa.int32()),
"image": pa.array(
[{"embedding": _vec(i)} for i in range(nrows)],
type=image_type,
),
"MetaData": pa.array(
[{"userId": i} for i in range(nrows)],
type=meta_type,
),
},
schema=_nested_vector_schema(),
)
def _nested_fts_schema() -> pa.Schema:
return pa.schema(
[
pa.field("id", pa.int32()),
pa.field(
"payload",
pa.struct([pa.field("text", pa.utf8())]),
),
pa.field(
"MetaData",
pa.struct([pa.field("userId", pa.int32())]),
),
]
)
def _nested_fts_data(nrows: int = NROWS) -> pa.Table:
words = ["alpha", "bravo", "charlie", "delta", "echo"]
payload_type = pa.struct([pa.field("text", pa.utf8())])
meta_type = pa.struct([pa.field("userId", pa.int32())])
return pa.table(
{
"id": pa.array(list(range(nrows)), pa.int32()),
"payload": pa.array(
[{"text": words[i % len(words)]} for i in range(nrows)],
type=payload_type,
),
"MetaData": pa.array(
[{"userId": i} for i in range(nrows)],
type=meta_type,
),
},
schema=_nested_fts_schema(),
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _columns_by_name_sync(tbl) -> dict:
return {idx.name: idx.columns for idx in tbl.list_indices()}
async def _columns_by_name_async(tbl) -> dict:
return {idx.name: idx.columns for idx in await tbl.list_indices()}
# ===========================================================================
# SYNC TESTS
# ===========================================================================
#
# The sync LanceTable API uses:
# - create_scalar_index(column, ...) for scalar (BTree/Bitmap/LabelList) indices
# - create_fts_index(column, ...) for full-text-search indices
# - create_index(...) for vector indices (older positional API)
# ===========================================================================
class TestNestedScalarIndexSync:
"""Sync regression matrix for nested scalar (BTree) indices."""
def test_top_level_camelcase_field(self, sync_db):
"""list_indices must return the full camelCase field name."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index("rowId", index_type="BTREE", name="rowid_idx")
col_map = _columns_by_name_sync(tbl)
assert col_map["rowid_idx"] == ["rowId"], (
"list_indices must return 'rowId', not a truncated leaf name"
)
def test_top_level_hyphenated_field_escaped(self, sync_db):
"""Top-level field 'row-id' (hyphenated) accessed via escaped path."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index("`row-id`", index_type="BTREE", name="rowid_hyph_idx")
col_map = _columns_by_name_sync(tbl)
assert col_map["rowid_hyph_idx"] == ["`row-id`"], (
"list_indices must return escaped path '`row-id`'"
)
def test_struct_leaf_literal_dot_field_escaped(self, sync_db):
"""Struct leaf with a literal-dot name: parent.`leaf.name`.
The index listing must use the full escaped path, not just the leaf.
"""
tbl = sync_db.create_table("t", _literal_dot_data())
tbl.create_scalar_index(
"parent.`leaf.name`", index_type="BTREE", name="leaf_dot_idx"
)
col_map = _columns_by_name_sync(tbl)
assert col_map["leaf_dot_idx"] == ["parent.`leaf.name`"], (
"list_indices must return 'parent.`leaf.name`', not just '`leaf.name`'"
)
def test_nested_mixed_case_path(self, sync_db):
"""Nested path MetaData.userId (mixed case) must appear as full path."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index(
"MetaData.userId", index_type="BTREE", name="metadata_userid_idx"
)
col_map = _columns_by_name_sync(tbl)
assert col_map["metadata_userid_idx"] == ["MetaData.userId"], (
"list_indices must return 'MetaData.userId', not leaf 'userId'"
)
def test_nested_hyphenated_path_escaped(self, sync_db):
"""`meta-data`.`user-id` path with both parts escaped."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index(
"`meta-data`.`user-id`", index_type="BTREE", name="metauid_idx"
)
col_map = _columns_by_name_sync(tbl)
assert col_map["metauid_idx"] == ["`meta-data`.`user-id`"], (
"list_indices must return '`meta-data`.`user-id`', not 'user-id'"
)
def test_filter_on_nested_mixed_case(self, sync_db):
"""WHERE filter on a nested dotted path works after index creation."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index(
"MetaData.userId", index_type="BTREE", name="metadata_userid_idx"
)
rows = tbl.search().where("MetaData.userId = 5").to_list()
assert len(rows) == 1
assert rows[0]["MetaData"]["userId"] == 5
def test_append_and_list_indices_stable(self, sync_db):
"""After appending rows the index listing must remain unchanged."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index(
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
)
tbl.add(_nested_scalar_data(nrows=4))
col_map = _columns_by_name_sync(tbl)
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
def test_optimize_and_list_indices_stable(self, tmp_path):
"""After optimize the index listing must still show full paths."""
db = lancedb.connect(tmp_path / "opt_db")
tbl = db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index(
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
)
tbl.add(_nested_scalar_data(nrows=4))
tbl.optimize()
col_map = _columns_by_name_sync(tbl)
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
def test_same_name_leaves_are_distinct(self, sync_db):
"""Two structs sharing a leaf name must produce distinct index paths."""
tbl = sync_db.create_table("same_leaf", _same_leaf_data())
tbl.create_scalar_index(
"StructA.userId", index_type="BTREE", name="a_userid_idx"
)
tbl.create_scalar_index(
"StructB.userId", index_type="BTREE", name="b_userid_idx"
)
col_map = _columns_by_name_sync(tbl)
assert col_map["a_userid_idx"] == ["StructA.userId"]
assert col_map["b_userid_idx"] == ["StructB.userId"]
def test_index_stats_canonical_path(self, sync_db):
"""index_stats round-trip: create on nested field, verify row count."""
tbl = sync_db.create_table("t", _nested_scalar_data())
tbl.create_scalar_index(
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
)
stats = tbl.index_stats("meta_uid_idx")
assert stats is not None
assert stats.index_type == "BTREE"
assert stats.num_indexed_rows == NROWS
class TestNestedVectorIndexSync:
"""Sync regression matrix for nested vector (IvfPq) indices."""
def test_nested_vector_index_full_path(self, sync_db):
"""Listing after vector index creation must use the full dotted path."""
tbl = sync_db.create_table("vt", _nested_vector_data())
tbl.create_index(
num_partitions=2,
num_sub_vectors=2,
vector_column_name="image.embedding",
name="image_emb_idx",
)
col_map = _columns_by_name_sync(tbl)
assert col_map["image_emb_idx"] == ["image.embedding"], (
"list_indices must return 'image.embedding', not leaf 'embedding'"
)
def test_nested_vector_search(self, sync_db):
"""Vector search on nested embedding field must return results."""
tbl = sync_db.create_table("vt", _nested_vector_data())
tbl.create_index(
num_partitions=2,
num_sub_vectors=2,
vector_column_name="image.embedding",
name="image_emb_idx",
)
results = (
tbl.search(_vec(0), vector_column_name="image.embedding").limit(5).to_list()
)
assert len(results) > 0
def test_nested_vector_index_stats(self, sync_db):
"""index_stats for a nested vector index must reflect correct row count."""
tbl = sync_db.create_table("vt", _nested_vector_data())
tbl.create_index(
num_partitions=2,
num_sub_vectors=2,
vector_column_name="image.embedding",
name="image_emb_idx",
)
stats = tbl.index_stats("image_emb_idx")
assert stats is not None
assert stats.num_indexed_rows == NROWS
def test_nested_vector_append_optimize(self, tmp_path):
"""After append and optimize the vector index listing must be stable."""
db = lancedb.connect(tmp_path / "vec_opt_db")
tbl = db.create_table("vt", _nested_vector_data())
tbl.create_index(
num_partitions=2,
num_sub_vectors=2,
vector_column_name="image.embedding",
name="image_emb_idx",
)
tbl.add(_nested_vector_data(nrows=4))
tbl.optimize()
col_map = _columns_by_name_sync(tbl)
assert col_map["image_emb_idx"] == ["image.embedding"]
class TestNestedFTSIndexSync:
"""Sync regression matrix for nested FTS indices."""
def test_nested_fts_index_full_path(self, sync_db):
"""FTS index on payload.text must be listed with the full path."""
tbl = sync_db.create_table("ft", _nested_fts_data())
tbl.create_fts_index("payload.text", name="payload_text_idx")
col_map = _columns_by_name_sync(tbl)
assert col_map["payload_text_idx"] == ["payload.text"], (
"list_indices must return 'payload.text', not leaf 'text'"
)
def test_nested_fts_search(self, sync_db):
"""FTS search on a nested text field must return correct results."""
tbl = sync_db.create_table("ft", _nested_fts_data())
tbl.create_fts_index("payload.text", name="payload_text_idx")
results = (
tbl.search("alpha", query_type="fts", fts_columns="payload.text")
.limit(10)
.to_list()
)
assert len(results) > 0
assert all(row["payload"]["text"] == "alpha" for row in results)
def test_nested_fts_append_optimize(self, tmp_path):
"""After append and optimize the FTS index listing must be stable."""
db = lancedb.connect(tmp_path / "fts_opt_db")
tbl = db.create_table("ft", _nested_fts_data())
tbl.create_fts_index("payload.text", name="payload_text_idx")
tbl.add(_nested_fts_data(nrows=4))
tbl.optimize()
col_map = _columns_by_name_sync(tbl)
assert col_map["payload_text_idx"] == ["payload.text"]
# ===========================================================================
# ASYNC TESTS
# ===========================================================================
#
# The async AsyncTable API uses create_index(column, config=...) uniformly
# for scalar, vector, and FTS indices.
# ===========================================================================
class TestNestedScalarIndexAsync:
"""Async regression matrix for nested scalar (BTree) indices."""
@pytest.mark.asyncio
async def test_top_level_camelcase_field(self, async_db):
"""list_indices must return the full camelCase field name."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index("rowId", config=BTree(), name="rowid_idx")
col_map = await _columns_by_name_async(tbl)
assert col_map["rowid_idx"] == ["rowId"]
@pytest.mark.asyncio
async def test_top_level_hyphenated_field_escaped(self, async_db):
"""Hyphenated top-level field accessed via escaped path."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index("`row-id`", config=BTree(), name="rowid_hyph_idx")
col_map = await _columns_by_name_async(tbl)
assert col_map["rowid_hyph_idx"] == ["`row-id`"]
@pytest.mark.asyncio
async def test_struct_leaf_literal_dot_field_escaped(self, async_db):
"""Struct leaf with a literal-dot name: parent.`leaf.name`."""
tbl = await async_db.create_table("t", _literal_dot_data())
await tbl.create_index(
"parent.`leaf.name`", config=BTree(), name="leaf_dot_idx"
)
col_map = await _columns_by_name_async(tbl)
assert col_map["leaf_dot_idx"] == ["parent.`leaf.name`"]
@pytest.mark.asyncio
async def test_nested_mixed_case_path(self, async_db):
"""Mixed-case nested path MetaData.userId must appear as full path."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index(
"MetaData.userId", config=BTree(), name="metadata_userid_idx"
)
col_map = await _columns_by_name_async(tbl)
assert col_map["metadata_userid_idx"] == ["MetaData.userId"]
@pytest.mark.asyncio
async def test_nested_hyphenated_path_escaped(self, async_db):
"""`meta-data`.`user-id` path with both parts escaped."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index(
"`meta-data`.`user-id`", config=BTree(), name="metauid_idx"
)
col_map = await _columns_by_name_async(tbl)
assert col_map["metauid_idx"] == ["`meta-data`.`user-id`"]
@pytest.mark.asyncio
async def test_filter_on_nested_mixed_case(self, async_db):
"""WHERE filter on a nested dotted path works after index creation."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index(
"MetaData.userId", config=BTree(), name="metadata_userid_idx"
)
rows = await tbl.query().where("MetaData.userId = 5").to_list()
assert len(rows) == 1
assert rows[0]["MetaData"]["userId"] == 5
@pytest.mark.asyncio
async def test_index_stats_canonical_path(self, async_db):
"""index_stats round-trip: create on nested field, verify stats."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
stats = await tbl.index_stats("meta_uid_idx")
assert stats is not None
assert stats.index_type == "BTREE"
assert stats.num_indexed_rows == NROWS
@pytest.mark.asyncio
async def test_append_and_list_indices_stable(self, async_db):
"""After appending rows the index listing must remain unchanged."""
tbl = await async_db.create_table("t", _nested_scalar_data())
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
await tbl.add(_nested_scalar_data(nrows=4))
col_map = await _columns_by_name_async(tbl)
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
@pytest.mark.asyncio
async def test_optimize_and_list_indices_stable(self, tmp_path):
"""After optimize the index listing must still show full paths."""
db = await lancedb.connect_async(
tmp_path / "opt_db", read_consistency_interval=timedelta(seconds=0)
)
tbl = await db.create_table("t", _nested_scalar_data())
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
await tbl.add(_nested_scalar_data(nrows=4))
await tbl.optimize()
col_map = await _columns_by_name_async(tbl)
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
@pytest.mark.asyncio
async def test_same_name_leaves_are_distinct(self, async_db):
"""Two structs sharing a leaf name must produce distinct index paths."""
tbl = await async_db.create_table("same_leaf", _same_leaf_data())
await tbl.create_index("StructA.userId", config=BTree(), name="a_userid_idx")
await tbl.create_index("StructB.userId", config=BTree(), name="b_userid_idx")
col_map = await _columns_by_name_async(tbl)
assert col_map["a_userid_idx"] == ["StructA.userId"]
assert col_map["b_userid_idx"] == ["StructB.userId"]
class TestNestedVectorIndexAsync:
"""Async regression matrix for nested vector (IvfPq) indices."""
@pytest.mark.asyncio
async def test_nested_vector_index_full_path(self, async_db):
"""Listing after vector index creation must use the full dotted path."""
tbl = await async_db.create_table("vt", _nested_vector_data())
await tbl.create_index(
"image.embedding",
config=IvfPq(num_partitions=2, num_sub_vectors=2),
name="image_emb_idx",
)
col_map = await _columns_by_name_async(tbl)
assert col_map["image_emb_idx"] == ["image.embedding"]
@pytest.mark.asyncio
async def test_nested_vector_search(self, async_db):
"""Vector search on nested embedding field must return results."""
tbl = await async_db.create_table("vt", _nested_vector_data())
await tbl.create_index(
"image.embedding",
config=IvfPq(num_partitions=2, num_sub_vectors=2),
name="image_emb_idx",
)
results = (
await tbl.query()
.nearest_to(_vec(0))
.column("image.embedding")
.limit(5)
.to_list()
)
assert len(results) > 0
@pytest.mark.asyncio
async def test_nested_vector_index_stats(self, async_db):
"""index_stats for a nested vector index must reflect correct row count."""
tbl = await async_db.create_table("vt", _nested_vector_data())
await tbl.create_index(
"image.embedding",
config=IvfPq(num_partitions=2, num_sub_vectors=2),
name="image_emb_idx",
)
stats = await tbl.index_stats("image_emb_idx")
assert stats is not None
assert stats.num_indexed_rows == NROWS
@pytest.mark.asyncio
async def test_nested_vector_append_optimize(self, tmp_path):
"""After append and optimize the vector index listing must be stable."""
db = await lancedb.connect_async(
tmp_path / "vec_opt_db", read_consistency_interval=timedelta(seconds=0)
)
tbl = await db.create_table("vt", _nested_vector_data())
await tbl.create_index(
"image.embedding",
config=IvfPq(num_partitions=2, num_sub_vectors=2),
name="image_emb_idx",
)
await tbl.add(_nested_vector_data(nrows=4))
await tbl.optimize()
col_map = await _columns_by_name_async(tbl)
assert col_map["image_emb_idx"] == ["image.embedding"]
class TestNestedFTSIndexAsync:
"""Async regression matrix for nested FTS indices."""
@pytest.mark.asyncio
async def test_nested_fts_index_full_path(self, async_db):
"""FTS index on payload.text must be listed with the full path."""
tbl = await async_db.create_table("ft", _nested_fts_data())
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
col_map = await _columns_by_name_async(tbl)
assert col_map["payload_text_idx"] == ["payload.text"]
@pytest.mark.asyncio
async def test_nested_fts_search(self, async_db):
"""FTS search on a nested text field must return correct results."""
tbl = await async_db.create_table("ft", _nested_fts_data())
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
results = (
await tbl.query()
.nearest_to_text("alpha", columns="payload.text")
.limit(10)
.to_list()
)
assert len(results) > 0
assert all(row["payload"]["text"] == "alpha" for row in results)
@pytest.mark.asyncio
async def test_nested_fts_append_optimize(self, tmp_path):
"""After append and optimize the FTS index listing must be stable."""
db = await lancedb.connect_async(
tmp_path / "fts_opt_db", read_consistency_interval=timedelta(seconds=0)
)
tbl = await db.create_table("ft", _nested_fts_data())
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
await tbl.add(_nested_fts_data(nrows=4))
await tbl.optimize()
col_map = await _columns_by_name_async(tbl)
assert col_map["payload_text_idx"] == ["payload.text"]

View File

@@ -39,35 +39,6 @@ from utils import exception_output
from importlib.util import find_spec
def _blob_query_data():
return pa.table(
{
"id": pa.array([1, 2, 3, 4], pa.int64()),
"tag": pa.array(["drop", "keep", "keep", "keep"], pa.utf8()),
"vector": pa.array(
[[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]],
type=pa.list_(pa.float32(), list_size=2),
),
"blob": pa.array([b"one", b"two", b"three", b"four"], pa.large_binary()),
},
schema=pa.schema(
[
pa.field("id", pa.int64()),
pa.field("tag", pa.utf8()),
pa.field("vector", pa.list_(pa.float32(), list_size=2)),
pa.field(
"blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"}
),
]
),
)
def _assert_lazy_blob(value, expected: bytes):
assert hasattr(value, "readall")
assert value.readall() == expected
@pytest.fixture(scope="module")
def table(tmpdir_factory) -> lancedb.table.Table:
tmp_path = str(tmpdir_factory.mktemp("data"))
@@ -210,216 +181,6 @@ async def test_query_to_pandas_kwargs(table, table_async):
assert async_df["id"].tolist() == [1, 2]
@pytest.mark.parametrize("blob_mode", ["lazy", "bytes", "descriptions"])
def test_plain_scan_query_to_pandas_blob_modes(tmp_db, blob_mode):
pytest.importorskip("lance")
table = tmp_db.create_table(
f"test_query_to_pandas_blob_{blob_mode}", _blob_query_data()
)
df = (
table.search()
.select(["id", "blob"])
.where("id = 1")
.to_pandas(blob_mode=blob_mode)
)
assert df["id"].tolist() == [1]
if blob_mode == "lazy":
_assert_lazy_blob(df["blob"].iloc[0], b"one")
elif blob_mode == "bytes":
assert df["blob"].tolist() == [b"one"]
else:
first = df["blob"].iloc[0]
assert first != b"one"
assert not hasattr(first, "readall")
def test_plain_scan_query_to_pandas_blob_projection(tmp_db):
pytest.importorskip("lance")
table = tmp_db.create_table(
"test_query_to_pandas_blob_projection", _blob_query_data()
)
df = (
table.search()
.where("id >= 2")
.select({"id_alias": "id", "payload": "blob", "double_id": "id * 2"})
.limit(2)
.offset(1)
.to_pandas(blob_mode="bytes")
)
assert df["id_alias"].tolist() == [3, 4]
assert df["payload"].tolist() == [b"three", b"four"]
assert df["double_id"].tolist() == [6, 8]
@pytest.mark.parametrize("blob_mode", ["bytes", "descriptions"])
def test_plain_scan_query_to_pandas_blob_mode_does_not_collect_arrow(
tmp_db, monkeypatch, blob_mode
):
pytest.importorskip("lance")
table = tmp_db.create_table(
"test_query_to_pandas_blob_no_arrow_collect", _blob_query_data()
)
query = table.search().where("id = 1").select(["id", "blob"])
def fail_to_arrow(*args, **kwargs):
raise AssertionError("to_arrow should not be called before native pandas")
monkeypatch.setattr(query, "to_arrow", fail_to_arrow)
df = query.to_pandas(blob_mode=blob_mode)
assert df["id"].tolist() == [1]
if blob_mode == "bytes":
assert df["blob"].tolist() == [b"one"]
else:
first = df["blob"].iloc[0]
assert first != b"one"
assert not hasattr(first, "readall")
def test_plain_scan_query_to_pandas_blob_descriptions_flatten_uses_scanner(
tmp_db, monkeypatch
):
pytest.importorskip("lance")
table = tmp_db.create_table(
"test_query_to_pandas_blob_desc_flatten", _blob_query_data()
)
query = table.search().where("id = 1").select(["id", "blob"])
def fail_to_arrow(*args, **kwargs):
raise AssertionError("to_arrow should not be called before scanner pandas")
monkeypatch.setattr(query, "to_arrow", fail_to_arrow)
df = query.to_pandas(blob_mode="descriptions", flatten=True)
assert df["id"].tolist() == [1]
assert any(column == "blob" or column.startswith("blob.") for column in df.columns)
def test_plain_scan_query_to_pandas_scanner_state(tmp_db):
pytest.importorskip("lance")
data = _blob_query_data()
table = tmp_db.create_table("test_query_to_pandas_scanner_state", data.slice(0, 2))
table.add(data.slice(2, 2))
fragments = table.to_lance().get_fragments()
assert len(fragments) == 2
query = (
table.search()
.select(["id", "blob"])
.with_row_address()
.fragment_ids([fragments[1].fragment_id])
)
query_obj = query.to_query_object()
assert query_obj.with_row_address is True
assert query_obj.fragment_ids == [fragments[1].fragment_id]
df = query.to_pandas(blob_mode="descriptions")
assert df["id"].tolist() == [3, 4]
assert "_rowaddr" in df.columns
assert {rowaddr >> 32 for rowaddr in df["_rowaddr"]} == {fragments[1].fragment_id}
df_by_fragment = (
table.search()
.select(["id", "blob"])
.with_fragments([fragments[0]])
.to_pandas(blob_mode="descriptions")
)
assert df_by_fragment["id"].tolist() == [1, 2]
@pytest.mark.asyncio
async def test_async_plain_scan_query_to_pandas_blob_projection(tmp_db_async):
pytest.importorskip("lance")
table = await tmp_db_async.create_table(
"test_async_query_to_pandas_blob_projection", _blob_query_data()
)
lazy_df = await (
table.query().where("id = 1").select(["id", "blob"]).to_pandas(blob_mode="lazy")
)
assert lazy_df["id"].tolist() == [1]
_assert_lazy_blob(lazy_df["blob"].iloc[0], b"one")
bytes_df = await (
table.query()
.where("id >= 2")
.select({"id_alias": "id", "payload": "blob", "double_id": "id * 2"})
.limit(2)
.offset(1)
.to_pandas(blob_mode="bytes")
)
assert bytes_df["id_alias"].tolist() == [3, 4]
assert bytes_df["payload"].tolist() == [b"three", b"four"]
assert bytes_df["double_id"].tolist() == [6, 8]
desc_df = await (
table.query()
.where("id = 1")
.select(["blob"])
.to_pandas(blob_mode="descriptions")
)
first = desc_df["blob"].iloc[0]
assert first != b"one"
assert not hasattr(first, "readall")
@pytest.mark.asyncio
@pytest.mark.parametrize("blob_mode", ["bytes", "descriptions"])
async def test_async_plain_scan_query_to_pandas_blob_mode_does_not_collect_arrow(
tmp_db_async, monkeypatch, blob_mode
):
pytest.importorskip("lance")
table = await tmp_db_async.create_table(
"test_async_query_to_pandas_blob_no_arrow_collect", _blob_query_data()
)
query = table.query().where("id = 1").select(["id", "blob"])
async def fail_to_arrow(*args, **kwargs):
raise AssertionError("to_arrow should not be called before native pandas")
monkeypatch.setattr(query, "to_arrow", fail_to_arrow)
df = await query.to_pandas(blob_mode=blob_mode)
assert df["id"].tolist() == [1]
if blob_mode == "bytes":
assert df["blob"].tolist() == [b"one"]
else:
first = df["blob"].iloc[0]
assert first != b"one"
assert not hasattr(first, "readall")
def test_vector_query_to_pandas_blob_mode_requires_native_path(tmp_db):
pytest.importorskip("lance")
table = tmp_db.create_table("test_vector_query_blob_mode", _blob_query_data())
with pytest.raises(RuntimeError, match="Lance native pandas conversion"):
table.search([1.0, 0.0]).select(["blob", "vector"]).limit(1).to_pandas(
blob_mode="lazy"
)
def test_vector_query_to_pandas_blob_descriptions_requires_plain_scan(tmp_db):
pytest.importorskip("lance")
table = tmp_db.create_table(
"test_vector_query_blob_descriptions", _blob_query_data()
)
with pytest.raises(RuntimeError, match="plain scan query"):
table.search([1.0, 0.0]).select(["blob", "vector"]).limit(1).to_pandas(
blob_mode="descriptions"
)
def test_order_by_plain_query(mem_db):
table = mem_db.create_table(
"test_order_by",

View File

@@ -1,13 +1,12 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import re
from concurrent.futures import ThreadPoolExecutor
import contextlib
from datetime import timedelta
import http.server
import json
import multiprocessing as mp
import pickle
import re
import sys
import threading
import time
@@ -154,118 +153,6 @@ async def test_async_checkout():
assert await table.count_rows() == 300
def _branch_open_handler(request):
if "/branches/list" in request.path:
body = json.dumps(
{
"branches": {
"exp": {
"parentBranch": None,
"parentVersion": 1,
"createAt": 1,
"manifestSize": 1,
}
}
}
).encode()
else:
# describe (table open + version/branch validation)
body = json.dumps({"version": 2, "schema": {"fields": []}}).encode()
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(body)
def test_remote_open_table_branch_and_version():
with mock_lancedb_connection(_branch_open_handler) as db:
# version-only (and "main" + version) time-travels the main chain
assert db.open_table("test", version=2) is not None
assert db.open_table("test", branch="main", version=2).current_branch() is None
# a non-main branch opens a handle scoped to that branch, with or
# without a version
assert db.open_table("test", branch="exp").current_branch() == "exp"
assert db.open_table("test", branch="exp", version=2).current_branch() == "exp"
def test_remote_table_branches_sync():
# Branch CRUD + current_branch on the sync RemoteTable. The handle returned
# by create/checkout must stay a RemoteTable scoped to the branch.
from lancedb.remote.table import RemoteTable
def handler(request):
if "/branches/list" in request.path:
body = json.dumps(
{
"branches": {
"exp": {
"parentBranch": None,
"parentVersion": 1,
"createAt": 1,
"manifestSize": 1,
}
}
}
).encode()
elif "/branches/create" in request.path or "/branches/delete" in request.path:
body = b"{}"
else:
# describe (table open + checkout validation)
body = json.dumps({"version": 1, "schema": {"fields": []}}).encode()
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(body)
with mock_lancedb_connection(handler) as db:
table = db.open_table("test")
assert isinstance(table, RemoteTable)
assert table.current_branch() is None
branch = table.branches.create("exp")
assert isinstance(branch, RemoteTable)
assert branch.current_branch() == "exp"
# list + checkout round trip; checkout also yields a branch-scoped handle
assert "exp" in table.branches.list()
checked = table.branches.checkout("exp")
assert isinstance(checked, RemoteTable)
assert checked.current_branch() == "exp"
table.branches.delete("exp")
@pytest.mark.asyncio
async def test_async_remote_open_table_branch_and_version():
async with mock_lancedb_connection_async(_branch_open_handler) as db:
# version-only (and "main" + version) time-travels the main chain
assert await db.open_table("test", version=2) is not None
main_v2 = await db.open_table("test", branch="main", version=2)
assert main_v2.current_branch() is None
# a non-main branch opens a handle scoped to that branch
exp = await db.open_table("test", branch="exp")
assert exp.current_branch() == "exp"
exp_v2 = await db.open_table("test", branch="exp", version=2)
assert exp_v2.current_branch() == "exp"
def test_remote_table_branch_survives_pickle():
# Regression: a branch-scoped handle must keep its branch across a
# pickle/fork round-trip (it used to reopen on main).
with mock_lancedb_connection(_branch_open_handler) as db:
branch = db.open_table("test", branch="exp")
assert branch.current_branch() == "exp"
restored = pickle.loads(pickle.dumps(branch))
assert restored.current_branch() == "exp"
# the pinned version is carried through as well
branch_v2 = db.open_table("test", branch="exp", version=2)
restored_v2 = pickle.loads(pickle.dumps(branch_v2))
assert restored_v2.current_branch() == "exp"
def test_table_len_sync():
def handler(request):
if request.path == "/v1/table/test/create/?mode=create":
@@ -284,155 +171,6 @@ def test_table_len_sync():
assert len(table) == 1
def test_remote_connection_serializes():
def handler(request):
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"tables": []}')
with mock_lancedb_connection(handler) as db:
serialized = json.loads(db.serialize())
assert isinstance(serialized["client_config"], dict)
restored = lancedb.deserialize_conn(db.serialize())
assert restored.table_names() == []
def test_remote_table_is_picklable():
def handler(request):
request.close_connection = True
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
payload = json.dumps(
{
"version": 1,
"schema": {
"fields": [
{"name": "id", "type": {"type": "int64"}, "nullable": False}
]
},
}
)
request.wfile.write(payload.encode())
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"3")
else:
request.send_response(404)
request.end_headers()
with mock_lancedb_connection(handler) as db:
table = db.open_table("test")
restored = pickle.loads(pickle.dumps(table))
assert restored.count_rows() == 3
def test_remote_table_open_does_not_require_picklable_client_config():
from lancedb.remote import HeaderProvider
class LocalHeaderProvider(HeaderProvider):
def get_headers(self):
return {"X-Test-Header": "present"}
def handler(request):
request.close_connection = True
assert request.headers.get("X-Test-Header") == "present"
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"version": 1, "schema": {"fields": []}}')
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"3")
else:
request.send_response(404)
request.end_headers()
with http.server.HTTPServer(
("localhost", 0), make_mock_http_handler(handler)
) as server:
port = server.server_address[1]
handle = threading.Thread(target=server.serve_forever)
handle.start()
try:
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
"header_provider": LocalHeaderProvider(),
},
)
table = db.open_table("test")
assert table.count_rows() == 3
with pytest.raises(ValueError, match="header_provider"):
pickle.dumps(table)
finally:
server.shutdown()
handle.join()
def test_remote_permutation_is_picklable():
from lancedb.permutation import Permutation
rows = list(range(10))
def handler(request):
request.close_connection = True
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
payload = json.dumps(
{
"version": 1,
"schema": {
"fields": [
{"name": "a", "type": {"type": "int64"}, "nullable": False}
]
},
}
)
request.wfile.write(payload.encode())
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(str(len(rows)).encode())
elif request.path == "/v1/table/test/query/":
content_len = int(request.headers.get("Content-Length"))
body = json.loads(request.rfile.read(content_len))
if "filter" in body:
match = re.search(r"_rowoffset in \((.*?)\)", body["filter"])
offsets = [int(offset.strip()) for offset in match.group(1).split(",")]
else:
offsets = rows
table = pa.table({"a": [rows[offset] for offset in offsets]})
request.send_response(200)
request.send_header("Content-Type", "application/vnd.apache.arrow.file")
request.end_headers()
with pa.ipc.new_file(request.wfile, schema=table.schema) as writer:
writer.write_table(table)
else:
request.send_response(404)
request.end_headers()
with mock_lancedb_connection(handler) as db:
permutation = Permutation.identity(db.open_table("test"))
restored = pickle.loads(pickle.dumps(permutation))
assert restored.__getitems__([0, 2, 4]) == [{"a": 0}, {"a": 2}, {"a": 4}]
def test_create_table_exist_ok():
def handler(request):
if request.path == "/v1/table/test/create/?mode=exist_ok":
@@ -698,25 +436,22 @@ def test_table_create_indices():
# This is a smoke-test.
table = db.create_table("test", [{"id": 1}])
# Test create_scalar_index with custom name (legacy method)
with pytest.warns(DeprecationWarning, match="create_scalar_index"):
table.create_scalar_index(
"id", wait_timeout=timedelta(seconds=2), name="custom_scalar_idx"
)
# Test create_scalar_index with custom name
table.create_scalar_index(
"id", wait_timeout=timedelta(seconds=2), name="custom_scalar_idx"
)
# Test create_fts_index with custom name (legacy method)
with pytest.warns(DeprecationWarning, match="create_fts_index"):
table.create_fts_index(
"text", wait_timeout=timedelta(seconds=2), name="custom_fts_idx"
)
# Test create_fts_index with custom name
table.create_fts_index(
"text", wait_timeout=timedelta(seconds=2), name="custom_fts_idx"
)
# Test create_index with custom name (legacy form: vector_column_name kwarg)
with pytest.warns(DeprecationWarning, match="create_index"):
table.create_index(
vector_column_name="vector",
wait_timeout=timedelta(seconds=10),
name="custom_vector_idx",
)
# Test create_index with custom name
table.create_index(
vector_column_name="vector",
wait_timeout=timedelta(seconds=10),
name="custom_vector_idx",
)
# Validate that the name parameter was passed correctly in requests
assert len(received_requests) == 3
@@ -745,98 +480,6 @@ def test_table_create_indices():
table.drop_index("custom_fts_idx")
def test_remote_create_index_new_api():
received_requests = []
def handler(request):
if request.path == "/v1/table/test/create_index/":
content_len = int(request.headers.get("Content-Length", 0))
body = request.rfile.read(content_len) if content_len > 0 else b""
received_requests.append(json.loads(body) if body else {})
request.send_response(200)
request.end_headers()
elif request.path == "/v1/table/test/create/?mode=create":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"{}")
elif request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(
json.dumps(
dict(
version=1,
schema=dict(
fields=[
dict(name="id", type={"type": "int64"}, nullable=False),
dict(
name="category",
type={"type": "string"},
nullable=False,
),
dict(
name="text", type={"type": "string"}, nullable=False
),
dict(
name="vector",
type={
"type": "fixed_size_list",
"fields": [
dict(
name="item",
type={"type": "float"},
nullable=True,
)
],
"length": 2,
},
nullable=False,
),
]
),
)
).encode()
)
else:
request.send_response(404)
request.end_headers()
from lancedb.index import BTree, FTS, IvfPq, IvfRq
with mock_lancedb_connection(handler) as db:
table = db.create_table("test", [{"id": 1}])
# New API: column-first, config= kwarg. Should NOT emit DeprecationWarning.
import warnings as _warnings
with _warnings.catch_warnings():
_warnings.simplefilter("error", DeprecationWarning)
table.create_index("vector", config=IvfPq(distance_type="l2"))
table.create_index("category", config=BTree())
table.create_index("text", config=FTS())
# IvfRq via new API
table.create_index("vector", config=IvfRq(distance_type="l2"))
# Legacy index_type="IVF_RQ" routes to IvfRq config under the hood.
with pytest.warns(DeprecationWarning, match="create_index"):
table.create_index(
vector_column_name="vector",
index_type="IVF_RQ",
num_partitions=8,
)
assert len(received_requests) == 5
assert [req["column"] for req in received_requests] == [
"vector",
"category",
"text",
"vector",
"vector",
]
def test_table_wait_for_index_timeout():
def handler(request):
index_stats = dict(
@@ -1662,10 +1305,6 @@ def _remote_fork_child(port: int, queue) -> None:
queue.put(db.table_names())
def _remote_table_fork_child(table, queue) -> None:
queue.put(table.count_rows())
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
@@ -1728,65 +1367,3 @@ def test_remote_connection_after_fork():
finally:
server.shutdown()
server_thread.join()
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
"fork() is unavailable on Windows and unsafe on macOS "
"(Apple frameworks/TLS are not fork-safe)"
),
)
def test_inherited_remote_table_reopens_after_fork():
def handler(request):
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b'{"version": 1, "schema": {"fields": []}}')
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(b"7")
else:
request.send_response(404)
request.end_headers()
server = http.server.HTTPServer(("localhost", 0), make_mock_http_handler(handler))
port = server.server_address[1]
server_thread = threading.Thread(target=server.serve_forever)
server_thread.start()
try:
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
table = db.open_table("test")
assert table.count_rows() == 7
ctx = mp.get_context("fork")
queue = ctx.Queue()
proc = ctx.Process(target=_remote_table_fork_child, args=(table, queue))
proc.start()
proc.join(timeout=15)
if proc.is_alive():
proc.terminate()
proc.join(timeout=5)
if proc.is_alive():
proc.kill()
proc.join()
pytest.fail("Remote table hung after fork")
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no result"
assert queue.get() == 7
finally:
server.shutdown()
server_thread.join()

View File

@@ -344,12 +344,6 @@ def test_mrr_reranker(tmp_path):
assert len(result_deduped) == len(result)
def test_mrr_reranker_empty_input():
reranker = MRRReranker()
with pytest.raises(ValueError, match="must not be empty"):
reranker.rerank_multivector([])
def test_rrf_reranker_distance():
data = pa.table(
{

View File

@@ -4,8 +4,6 @@
import os
import sys
import threading
import warnings
from datetime import date, datetime, timedelta
from time import sleep
from typing import List
@@ -13,7 +11,7 @@ from unittest.mock import patch
import lancedb
from lancedb.dependencies import _PANDAS_AVAILABLE
from lancedb.index import BTree, FTS, HnswFlat, HnswPq, HnswSq, IvfPq
from lancedb.index import HnswFlat, HnswPq, HnswSq, IvfPq
import numpy as np
import polars as pl
import pyarrow as pa
@@ -22,34 +20,11 @@ import pytest
from lancedb.conftest import MockTextEmbeddingFunction
from lancedb.db import AsyncConnection, DBConnection
from lancedb.embeddings import EmbeddingFunctionConfig, EmbeddingFunctionRegistry
from lancedb.expr import col, lit
from lancedb.pydantic import LanceModel, Vector
from lancedb.table import LanceTable
from pydantic import BaseModel
def _blob_test_data():
return pa.table(
{
"id": pa.array([1, 2], pa.int64()),
"blob": pa.array([b"hello", b"world"], pa.large_binary()),
},
schema=pa.schema(
[
pa.field("id", pa.int64()),
pa.field(
"blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"}
),
]
),
)
def _assert_lazy_blob(value, expected: bytes):
assert hasattr(value, "readall")
assert value.readall() == expected
def test_basic(mem_db: DBConnection):
data = [
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
@@ -81,30 +56,27 @@ def test_table_to_pandas_default_matches_arrow(tmp_db: DBConnection):
pd.testing.assert_frame_equal(table.to_pandas(), expected)
def test_table_to_pandas_invalid_blob_mode_non_blob_table(tmp_db: DBConnection):
data = pa.table({"id": [1, 2], "text": ["one", "two"]})
table = tmp_db.create_table("test_to_pandas_invalid_blob_mode", data=data)
with pytest.raises(ValueError, match="blob_mode must be one of"):
table.to_pandas(blob_mode="invalid")
@pytest.mark.parametrize("blob_mode", ["lazy", "bytes", "descriptions"])
def test_table_to_pandas_blob_modes(tmp_db: DBConnection, blob_mode):
def test_table_to_pandas_blob_bytes(tmp_db: DBConnection):
pytest.importorskip("lance")
table = tmp_db.create_table(f"test_to_pandas_blob_{blob_mode}", _blob_test_data())
data = pa.table(
{
"id": pa.array([1, 2], pa.int64()),
"blob": pa.array([b"hello", b"world"], pa.large_binary()),
},
schema=pa.schema(
[
pa.field("id", pa.int64()),
pa.field(
"blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"}
),
]
),
)
table = tmp_db.create_table("test_to_pandas_blob_bytes", data=data)
df = table.to_pandas(blob_mode=blob_mode)
df = table.to_pandas(blob_mode="bytes")
if blob_mode == "lazy":
_assert_lazy_blob(df["blob"].iloc[0], b"hello")
_assert_lazy_blob(df["blob"].iloc[1], b"world")
elif blob_mode == "bytes":
assert df["blob"].tolist() == [b"hello", b"world"]
else:
first = df["blob"].iloc[0]
assert first != b"hello"
assert not hasattr(first, "readall")
assert df["blob"].tolist() == [b"hello", b"world"]
def test_table_to_pandas_kwargs(tmp_db: DBConnection):
@@ -120,8 +92,22 @@ def test_table_to_pandas_kwargs(tmp_db: DBConnection):
@pytest.mark.asyncio
async def test_async_table_to_pandas_blob_bytes(tmp_db_async: AsyncConnection):
pytest.importorskip("lance")
data = pa.table(
{
"id": pa.array([1, 2], pa.int64()),
"blob": pa.array([b"hello", b"world"], pa.large_binary()),
},
schema=pa.schema(
[
pa.field("id", pa.int64()),
pa.field(
"blob", pa.large_binary(), metadata={"lance-encoding:blob": "true"}
),
]
),
)
table = await tmp_db_async.create_table(
"test_async_to_pandas_blob_bytes", data=_blob_test_data()
"test_async_to_pandas_blob_bytes", data=data
)
df = await table.to_pandas(blob_mode="bytes")
@@ -129,19 +115,6 @@ async def test_async_table_to_pandas_blob_bytes(tmp_db_async: AsyncConnection):
assert df["blob"].tolist() == [b"hello", b"world"]
@pytest.mark.asyncio
async def test_async_table_to_pandas_invalid_blob_mode_non_blob_table(
tmp_db_async: AsyncConnection,
):
table = await tmp_db_async.create_table(
"test_async_to_pandas_invalid_blob_mode",
data=pa.table({"id": [1, 2], "text": ["one", "two"]}),
)
with pytest.raises(ValueError, match="blob_mode must be one of"):
await table.to_pandas(blob_mode="invalid")
@pytest.mark.asyncio
async def test_async_table_to_pandas_kwargs(tmp_db_async: AsyncConnection):
pd = pytest.importorskip("pandas")
@@ -929,346 +902,6 @@ async def test_async_tags(mem_db_async: AsyncConnection):
)
def test_branches(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(0))
table = db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
],
)
assert table.count_rows() == 2
# fork an isolated, writable branch from main
branch = table.branches.create("exp")
assert branch.count_rows() == 2
branch.add(data=[{"vector": [10.0, 11.0], "item": "baz", "price": 30.0}])
# writes on the branch do not touch main
assert branch.count_rows() == 3
assert table.count_rows() == 2
# the branch is listed, with main (None) as its parent
branches = table.branches.list()
assert "exp" in branches
assert branches["exp"]["parent_branch"] is None
# from_ref="main" is equivalent to the default
table.branches.create("exp2", from_ref="main")
assert table.branches.list()["exp2"]["parent_branch"] is None
# checkout returns a handle scoped to the branch's latest
checked_out = table.branches.checkout("exp")
assert checked_out.count_rows() == 3
# delete removes it
table.branches.delete("exp")
table.branches.delete("exp2")
assert "exp" not in table.branches.list()
def test_branch_handle_tracks_concurrent_writes(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(0))
table = db.create_table("t", [{"id": 1}])
# two independent handles on the same branch
writer = table.branches.create("exp")
reader = db.open_table("t", branch="exp")
assert reader.count_rows() == 1
# a concurrent write on the branch is visible to the other handle
writer.add([{"id": 2}])
assert reader.count_rows() == 2
# main is unaffected
assert table.count_rows() == 1
def test_branch_name_validation(tmp_path):
db = lancedb.connect(tmp_path)
table = db.create_table("t", [{"id": 1}])
with pytest.raises(ValueError, match="non-empty"):
table.branches.create("")
with pytest.raises(ValueError, match="non-empty"):
table.branches.checkout("")
with pytest.raises(ValueError, match="non-empty"):
table.branches.delete("")
def test_branches_preserve_namespace(tmp_path):
pytest.importorskip(
"lance"
) # namespace_path routes through lance's DirectoryNamespace
db = lancedb.connect(tmp_path)
table = db.create_table("t", [{"id": 1}], namespace_path=["ns1"])
assert table.namespace == ["ns1"]
branch = table.branches.create("exp")
assert branch.namespace == ["ns1"]
assert branch.id == table.id
# opening the branch directly also preserves namespace identity
opened = db.open_table("t", namespace_path=["ns1"], branch="exp")
assert opened.namespace == ["ns1"]
def test_open_table_with_branch(tmp_path):
db = lancedb.connect(tmp_path)
table = db.create_table("t", [{"i": 1}])
table.branches.create("exp").add([{"i": 2}])
# open_table(branch=...) returns a handle scoped to the branch
assert db.open_table("t", branch="exp").count_rows() == 2
# opening without branch still tracks main
assert db.open_table("t").count_rows() == 1
def test_open_table_with_branch_version(tmp_path):
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(0))
# main: a single fork-point row
t = db.create_table("t", [{"i": 0}])
main_v1 = t.version
# fork "exp", then advance exp AND main independently past the fork so they
# diverge while sharing version numbers
exp = t.branches.create("exp")
exp.add([{"i": 1}]) # exp: {0, 1}
exp_v2 = exp.version
exp.add([{"i": 2}]) # exp HEAD: {0, 1, 2}
t.add([{"i": 100}, {"i": 101}, {"i": 102}]) # main HEAD: {0, 100, 101, 102}
assert exp_v2 == t.version, "branch and main must share the version number"
# open exp at the shared version: the data must be exp's, not main's. count
# alone cannot prove this (main@v2 also exists), so assert provenance by
# content.
pinned = db.open_table("t", branch="exp", version=exp_v2)
assert pinned.current_branch() == "exp"
assert pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
assert pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
assert pinned.count_rows("i = 100") == 0 # main's divergent rows are invisible
# the same coordinate is reachable directly via branches.checkout(name, version)
pinned_direct = t.branches.checkout("exp", exp_v2)
assert pinned_direct.current_branch() == "exp"
assert pinned_direct.count_rows() == 2
# the HEADs are unaffected
assert db.open_table("t", branch="exp").count_rows() == 3
assert db.open_table("t").count_rows() == 4
# version-only (no branch) time-travels main itself: its fork-point version
# holds only main's first row, and the shared version number resolves to
# main's data, not the branch's ("opens main at the version")
old_main = db.open_table("t", version=main_v1)
assert old_main.current_branch() is None
assert old_main.count_rows() == 1
shared_on_main = db.open_table("t", version=exp_v2)
assert shared_on_main.current_branch() is None
assert shared_on_main.count_rows() == 4
# detached head: writing to a pinned version is rejected
with pytest.raises((ValueError, RuntimeError), match="cannot be modified"):
pinned.add([{"i": 9}])
# a nonexistent version is rejected -- on main, and on a branch (a distinct
# resolution path, on the branch's manifests)
with pytest.raises((ValueError, RuntimeError)):
db.open_table("t", version=9999)
with pytest.raises((ValueError, RuntimeError)):
db.open_table("t", branch="exp", version=9999)
# checkout_latest re-attaches the pinned handle to the BRANCH's HEAD
# (writable again), not main's HEAD, and not staying pinned
pinned.checkout_latest()
assert pinned.current_branch() == "exp"
assert pinned.count_rows() == 3 # exp HEAD, not main's 4
pinned.add([{"i": 3}])
assert pinned.count_rows() == 4 # writable again
@pytest.mark.asyncio
async def test_async_namespace_open_table_with_branch(tmp_path):
pytest.importorskip("lance") # "dir" impl is lance.namespace.DirectoryNamespace
db = lancedb.connect_namespace_async("dir", {"root": str(tmp_path)})
await db.create_namespace(["ns1"])
table = await db.create_table("t", [{"id": 1}], namespace_path=["ns1"])
branch = await table.branches.create("exp")
await branch.add([{"id": 2}])
# open_table(branch=...) on the async namespace connection must work
opened = await db.open_table("t", namespace_path=["ns1"], branch="exp")
assert await opened.count_rows() == 2
def test_namespace_open_table_with_branch_version(tmp_path):
pytest.importorskip("lance") # "dir" impl is lance.namespace.DirectoryNamespace
db = lancedb.connect_namespace("dir", {"root": str(tmp_path)})
db.create_namespace(["ns1"])
t = db.create_table("t", [{"i": 0}], namespace_path=["ns1"])
# fork "exp", then advance exp AND main past the fork so they diverge while
# sharing version numbers
exp = t.branches.create("exp")
exp.add([{"i": 1}])
exp_v2 = exp.version
exp.add([{"i": 2}])
t.add([{"i": 100}, {"i": 101}, {"i": 102}])
assert exp_v2 == t.version, "branch and main must share the version number"
# open_table(branch=, version=) on the namespace connection reads the
# branch's data at that version, not main's
pinned = db.open_table("t", namespace_path=["ns1"], branch="exp", version=exp_v2)
assert pinned.current_branch() == "exp"
assert pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
assert pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
assert pinned.count_rows("i = 100") == 0 # main's divergent rows are invisible
assert db.open_table("t", namespace_path=["ns1"], branch="exp").count_rows() == 3
@pytest.mark.asyncio
async def test_async_namespace_open_table_with_branch_version(tmp_path):
pytest.importorskip("lance") # "dir" impl is lance.namespace.DirectoryNamespace
db = lancedb.connect_namespace_async("dir", {"root": str(tmp_path)})
await db.create_namespace(["ns1"])
t = await db.create_table("t", [{"i": 0}], namespace_path=["ns1"])
# fork "exp", then advance exp AND main past the fork so they diverge while
# sharing version numbers
exp = await t.branches.create("exp")
await exp.add([{"i": 1}])
exp_v2 = await exp.version()
await exp.add([{"i": 2}])
await t.add([{"i": 100}, {"i": 101}, {"i": 102}])
assert exp_v2 == await t.version(), "branch and main must share the version number"
# open_table(branch=, version=) on the async namespace connection reads the
# branch's data at that version, not main's
pinned = await db.open_table(
"t", namespace_path=["ns1"], branch="exp", version=exp_v2
)
assert pinned.current_branch() == "exp"
assert await pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
assert await pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
assert await pinned.count_rows("i = 100") == 0 # main's rows are invisible
assert (
await (
await db.open_table("t", namespace_path=["ns1"], branch="exp")
).count_rows()
== 3
)
def test_branch_to_lance_targets_branch(tmp_path):
pytest.importorskip("lance")
db = lancedb.connect(tmp_path)
table = db.create_table("t", [{"i": 1}])
branch = table.branches.create("exp")
branch.add([{"i": 2}]) # branch: 2 rows, main: 1 row
assert branch.to_lance().count_rows() == 2
assert table.to_lance().count_rows() == 1
@pytest.mark.asyncio
async def test_async_branches(tmp_path):
db = await lancedb.connect_async(tmp_path)
table = await db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
],
)
assert await table.count_rows() == 2
branch = await table.branches.create("exp")
assert await branch.count_rows() == 2
await branch.add(data=[{"vector": [10.0, 11.0], "item": "baz", "price": 30.0}])
assert await branch.count_rows() == 3
assert await table.count_rows() == 2
branches = await table.branches.list()
assert "exp" in branches
assert branches["exp"]["parent_branch"] is None
await table.branches.create("exp2", from_ref="main")
assert (await table.branches.list())["exp2"]["parent_branch"] is None
checked_out = await table.branches.checkout("exp")
assert await checked_out.count_rows() == 3
await table.branches.delete("exp")
await table.branches.delete("exp2")
assert "exp" not in await table.branches.list()
@pytest.mark.asyncio
async def test_async_open_table_with_branch_version(tmp_path):
db = await lancedb.connect_async(tmp_path, read_consistency_interval=timedelta(0))
# main: a single fork-point row
t = await db.create_table("t", [{"i": 0}])
main_v1 = await t.version()
# fork "exp", then advance exp AND main independently past the fork so they
# diverge while sharing version numbers
exp = await t.branches.create("exp")
await exp.add([{"i": 1}]) # exp: {0, 1}
exp_v2 = await exp.version()
await exp.add([{"i": 2}]) # exp HEAD: {0, 1, 2}
await t.add([{"i": 100}, {"i": 101}, {"i": 102}]) # main HEAD: {0, 100, 101, 102}
assert exp_v2 == await t.version(), "branch and main must share the version number"
# open exp at the shared version: the data must be exp's, not main's. count
# alone cannot prove this (main@v2 also exists), so assert provenance by
# content.
pinned = await db.open_table("t", branch="exp", version=exp_v2)
assert pinned.current_branch() == "exp"
assert await pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
assert await pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
assert await pinned.count_rows("i = 100") == 0 # main's rows are invisible
# the same coordinate is reachable directly via branches.checkout(name, version)
pinned_direct = await t.branches.checkout("exp", exp_v2)
assert pinned_direct.current_branch() == "exp"
assert await pinned_direct.count_rows() == 2
# the HEADs are unaffected
assert await (await db.open_table("t", branch="exp")).count_rows() == 3
assert await (await db.open_table("t")).count_rows() == 4
# version-only (no branch) time-travels main itself: its fork-point version
# holds only main's first row, and the shared version number resolves to
# main's data, not the branch's ("opens main at the version")
old_main = await db.open_table("t", version=main_v1)
assert old_main.current_branch() is None
assert await old_main.count_rows() == 1
shared_on_main = await db.open_table("t", version=exp_v2)
assert shared_on_main.current_branch() is None
assert await shared_on_main.count_rows() == 4
# detached head: writing to a pinned version is rejected
with pytest.raises((ValueError, RuntimeError), match="cannot be modified"):
await pinned.add([{"i": 9}])
# a nonexistent version is rejected -- on main, and on a branch
with pytest.raises((ValueError, RuntimeError)):
await db.open_table("t", version=9999)
with pytest.raises((ValueError, RuntimeError)):
await db.open_table("t", branch="exp", version=9999)
# checkout_latest re-attaches the pinned handle to the BRANCH's HEAD
# (writable again), not main's HEAD, and not staying pinned
await pinned.checkout_latest()
assert pinned.current_branch() == "exp"
assert await pinned.count_rows() == 3 # exp HEAD, not main's 4
await pinned.add([{"i": 3}])
assert await pinned.count_rows() == 4 # writable again
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_method(mock_create_index, mem_db: DBConnection):
table = mem_db.create_table(
@@ -1295,12 +928,7 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
num_bits=4,
)
mock_create_index.assert_called_with(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
"vector", replace=True, config=expected_config, name=None, train=True
)
# Test with target_partition_size
@@ -1320,12 +948,7 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
target_partition_size=8192,
)
mock_create_index.assert_called_with(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
"vector", replace=True, config=expected_config, name=None, train=True
)
# target_partition_size has a default value,
@@ -1344,12 +967,7 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
num_bits=4,
)
mock_create_index.assert_called_with(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
"vector", replace=True, config=expected_config, name=None, train=True
)
table.create_index(
@@ -1360,12 +978,7 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
)
expected_config = HnswPq(distance_type="dot")
mock_create_index.assert_called_with(
"my_vector",
replace=False,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
"my_vector", replace=False, config=expected_config, name=None, train=True
)
table.create_index(
@@ -1380,12 +993,7 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
distance_type="cosine", sample_rate=0.1, m=29, ef_construction=10
)
mock_create_index.assert_called_with(
"my_vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
"my_vector", replace=True, config=expected_config, name=None, train=True
)
table.create_index(
@@ -1400,12 +1008,7 @@ def test_create_index_method(mock_create_index, mem_db: DBConnection):
distance_type="cosine", sample_rate=0.1, m=29, ef_construction=10
)
mock_create_index.assert_called_with(
"my_vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=True,
"my_vector", replace=True, config=expected_config, name=None, train=True
)
@@ -1429,7 +1032,6 @@ def test_create_index_name_and_train_parameters(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name="my_custom_index",
train=True,
)
@@ -1437,82 +1039,13 @@ def test_create_index_name_and_train_parameters(
# Test with train=False
table.create_index(vector_column_name="vector", train=False)
mock_create_index.assert_called_with(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name=None,
train=False,
"vector", replace=True, config=expected_config, name=None, train=False
)
# Test with both name and train
table.create_index(vector_column_name="vector", name="my_index_name", train=True)
mock_create_index.assert_called_with(
"vector",
replace=True,
config=expected_config,
wait_timeout=None,
name="my_index_name",
train=True,
)
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_legacy_emits_deprecation_warning(
mock_create_index, mem_db: DBConnection
):
table = mem_db.create_table(
"test",
data=[{"vector": [3.1, 4.1]}, {"vector": [5.9, 26.5]}],
)
with pytest.warns(DeprecationWarning, match="create_index"):
table.create_index(metric="l2", num_partitions=8, vector_column_name="vector")
@patch("lancedb.table.AsyncTable.create_index")
def test_create_index_new_api(mock_create_index, mem_db: DBConnection):
table = mem_db.create_table(
"test",
data=[
{"vector": [3.1, 4.1], "category": "a", "text": "hello world"},
{"vector": [5.9, 26.5], "category": "b", "text": "goodbye"},
],
)
# Vector index via new API should not warn
with warnings.catch_warnings():
warnings.simplefilter("error", DeprecationWarning)
table.create_index("vector", config=IvfPq(distance_type="l2"))
mock_create_index.assert_called_with(
"vector",
replace=True,
config=IvfPq(distance_type="l2"),
wait_timeout=None,
name=None,
train=True,
)
# Scalar index via new API
table.create_index("category", config=BTree())
mock_create_index.assert_called_with(
"category",
replace=True,
config=BTree(),
wait_timeout=None,
name=None,
train=True,
)
# FTS index via new API
table.create_index("text", config=FTS(with_position=True))
mock_create_index.assert_called_with(
"text",
replace=True,
config=FTS(with_position=True),
wait_timeout=None,
name=None,
train=True,
"vector", replace=True, config=expected_config, name="my_index_name", train=True
)
@@ -1630,45 +1163,6 @@ def test_add_with_empty_fixed_size_list_drops_bad_rows(mem_db: DBConnection):
assert np.allclose(data["embedding"].to_pylist()[0], np.array([0.1] * 16))
def test_add_nullable_struct_with_none(mem_db: DBConnection):
"""Regression test for issue #2654: a nullable struct column whose
first batch contains only None values must not crash in
_align_field_types with AttributeError: 'pyarrow.lib.DataType'
object has no attribute 'fields'.
PyArrow infers an all-None struct column as `null` (not `struct`),
so the type-alignment path needs to handle the case where the
source field type is null and use the target type directly.
"""
# Use the v2.1 file format so that nullable structs are supported.
table = mem_db.create_table(
"test_nullable_struct",
schema=pa.schema(
[
pa.field("id", pa.string()),
pa.field(
"data",
pa.struct([pa.field("x", pa.float32())]),
nullable=True,
),
]
),
storage_options=dict(new_table_data_storage_version="2.1"),
)
# Adding a row with a non-null struct should work.
table.add([{"id": "1", "data": {"x": 1.0}}])
# Adding a row with None for the nullable struct field should also
# work — this is what used to crash.
table.add([{"id": "2", "data": None}])
result = table.to_arrow()
assert result.num_rows == 2
assert result.column("id").to_pylist() == ["1", "2"]
assert result.column("data").to_pylist() == [{"x": 1.0}, None]
def test_add_with_integer_embeddings_preserves_casting(mem_db: DBConnection):
class Schema(LanceModel):
text: str
@@ -1967,38 +1461,6 @@ def test_delete(mem_db: DBConnection):
assert table.to_arrow()["id"].to_pylist() == [1]
def test_delete_expr(mem_db: DBConnection):
table = mem_db.create_table(
"my_table",
data=[
{"vector": [1.1, 0.9], "id": 0},
{"vector": [1.2, 1.9], "id": 1},
{"vector": [1.3, 2.9], "id": 2},
],
)
assert len(table) == 3
delete_res = table.delete(col("id") == lit(0))
assert delete_res.version == 2
assert len(table) == 2
assert sorted(table.to_arrow()["id"].to_pylist()) == [1, 2]
@pytest.mark.asyncio
async def test_delete_expr_async(mem_db_async: AsyncConnection):
table = await mem_db_async.create_table(
"my_table",
data=[
{"vector": [1.1, 0.9], "id": 0},
{"vector": [1.2, 1.9], "id": 1},
{"vector": [1.3, 2.9], "id": 2},
],
)
assert await table.count_rows() == 3
await table.delete(col("id") == lit(0))
assert await table.count_rows() == 2
assert sorted((await table.to_arrow())["id"].to_pylist()) == [1, 2]
def test_update(mem_db: DBConnection):
table = mem_db.create_table(
"my_table",
@@ -2184,50 +1646,6 @@ def test_merge_insert(mem_db: DBConnection):
)
def test_merge_insert_by_source_delete_expr(mem_db: DBConnection):
table = mem_db.create_table(
"my_table",
data=pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]}),
)
new_data = pa.table({"a": [2, 4], "b": ["x", "z"]})
# replace-range, limiting the source-absent delete with an Expr condition
merge_insert_res = (
table.merge_insert("a")
.when_matched_update_all()
.when_not_matched_insert_all()
.when_not_matched_by_source_delete(col("a") > lit(2))
.execute(new_data)
)
assert merge_insert_res.num_inserted_rows == 1
assert merge_insert_res.num_updated_rows == 1
assert merge_insert_res.num_deleted_rows == 1
expected = pa.table({"a": [1, 2, 4], "b": ["a", "x", "z"]})
assert table.to_arrow().sort_by("a") == expected
@pytest.mark.asyncio
async def test_merge_insert_by_source_delete_expr_async(
mem_db_async: AsyncConnection,
):
data = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
table = await mem_db_async.create_table("some_table", data=data)
new_data = pa.table({"a": [2, 4], "b": ["x", "z"]})
# replace-range, limiting the source-absent delete with an Expr condition
await (
table.merge_insert("a")
.when_matched_update_all()
.when_not_matched_insert_all()
.when_not_matched_by_source_delete(col("a") > lit(2))
.execute(new_data)
)
expected = pa.table({"a": [1, 2, 4], "b": ["a", "x", "z"]})
assert (await table.to_arrow()).sort_by("a") == expected
# We vary the data format because there are slight differences in how
# subschemas are handled in different formats
@pytest.mark.parametrize(
@@ -2443,9 +1861,8 @@ def test_create_scalar_index(mem_db: DBConnection):
"my_table",
data=test_data,
)
# Test with default name; confirm DeprecationWarning fires
with pytest.warns(DeprecationWarning, match="create_scalar_index"):
table.create_scalar_index("x")
# Test with default name
table.create_scalar_index("x")
indices = table.list_indices()
assert len(indices) == 1
scalar_index = indices[0]
@@ -2476,32 +1893,18 @@ def test_create_scalar_index(mem_db: DBConnection):
def test_create_index_nested_field_paths(mem_db: DBConnection):
schema = pa.schema(
[
pa.field("rowId", pa.int32()),
pa.field("row-id", pa.int32()),
pa.field("userId", pa.int32()),
pa.field("metadata", pa.struct([pa.field("user_id", pa.int32())])),
pa.field("MetaData", pa.struct([pa.field("userId", pa.int32())])),
pa.field(
"image",
pa.struct([pa.field("embedding", pa.list_(pa.float32(), 2))]),
),
pa.field("payload", pa.struct([pa.field("text", pa.string())])),
pa.field("meta-data", pa.struct([pa.field("user-id", pa.int32())])),
pa.field("literal", pa.struct([pa.field("a.b", pa.int32())])),
]
)
data = pa.Table.from_pylist(
[
{
"rowId": i,
"row-id": i,
"userId": i,
"metadata": {"user_id": i},
"MetaData": {"userId": i},
"image": {"embedding": [float(i), float(i + 1)]},
"payload": {"text": f"document {i}"},
"meta-data": {"user-id": i},
"literal": {"a.b": i},
}
for i in range(256)
],
@@ -2509,37 +1912,19 @@ def test_create_index_nested_field_paths(mem_db: DBConnection):
)
table = mem_db.create_table("nested_index_paths", data=data)
table.create_scalar_index("rowId", name="row_id_idx")
table.create_scalar_index("`row-id`", name="row_dash_id_idx")
table.create_scalar_index("userId", name="top_user_id_idx")
table.create_scalar_index("metadata.user_id", name="metadata_user_id_idx")
table.create_scalar_index("MetaData.userId", name="mixed_case_metadata_user_id_idx")
table.create_scalar_index("`meta-data`.`user-id`", name="escaped_names_idx")
table.create_scalar_index("literal.`a.b`", name="literal_dot_idx")
table.create_index(
vector_column_name="image.embedding",
num_partitions=1,
num_sub_vectors=1,
name="image_embedding_idx",
)
table.create_fts_index("payload.text", with_position=False, name="payload_text_idx")
indices = sorted(table.list_indices(), key=lambda idx: idx.name)
assert [(idx.name, idx.index_type, idx.columns) for idx in indices] == [
("escaped_names_idx", "BTree", ["`meta-data`.`user-id`"]),
("image_embedding_idx", "IvfPq", ["image.embedding"]),
("literal_dot_idx", "BTree", ["literal.`a.b`"]),
("metadata_user_id_idx", "BTree", ["metadata.user_id"]),
("mixed_case_metadata_user_id_idx", "BTree", ["MetaData.userId"]),
("payload_text_idx", "FTS", ["payload.text"]),
("row_dash_id_idx", "BTree", ["`row-id`"]),
("row_id_idx", "BTree", ["rowId"]),
("top_user_id_idx", "BTree", ["userId"]),
]
for index in indices:
stats = table.index_stats(index.name)
assert stats is not None
assert stats.num_indexed_rows == 256
vector_results = (
table.search([0.0, 1.0], vector_column_name="image.embedding")
@@ -2557,63 +1942,6 @@ def test_create_index_nested_field_paths(mem_db: DBConnection):
assert len(filtered_results) == 1
assert filtered_results[0]["metadata"]["user_id"] == 42
escaped_results = table.search().where("`row-id` = 43").limit(1).to_list()
assert len(escaped_results) == 1
assert escaped_results[0]["row-id"] == 43
fts_results = table.search("document 44", query_type="fts").limit(1).to_list()
assert len(fts_results) == 1
assert fts_results[0]["payload"]["text"] == "document 44"
def test_index_config_fields(mem_db: DBConnection):
"""Test that IndexConfig exposes the new rich metadata fields."""
vec_array = pa.array(
[[float(i), float(i + 1)] for i in range(300)], pa.list_(pa.float32(), 2)
)
data = pa.Table.from_pydict({"x": list(range(300)), "vector": vec_array})
table = mem_db.create_table("index_config_fields", data=data)
table.create_scalar_index("x", index_type="BTREE")
table.create_index(
vector_column_name="vector",
num_partitions=1,
num_sub_vectors=1,
)
indices = {idx.name: idx for idx in table.list_indices()}
scalar_idx = indices["x_idx"]
assert scalar_idx.index_uuid is not None
assert isinstance(scalar_idx.index_uuid, str)
assert scalar_idx.num_indexed_rows is not None
assert scalar_idx.num_indexed_rows == 300
assert scalar_idx.num_unindexed_rows is not None
assert scalar_idx.num_unindexed_rows == 0
assert scalar_idx.num_segments is not None
assert scalar_idx.num_segments >= 1
assert scalar_idx.size_bytes is not None
assert scalar_idx.size_bytes > 0
assert scalar_idx.created_at is not None
from datetime import datetime, timezone
assert isinstance(scalar_idx.created_at, datetime)
assert scalar_idx.created_at.tzinfo == timezone.utc
# __getitem__ compatibility
assert scalar_idx["index_uuid"] == scalar_idx.index_uuid
assert scalar_idx["num_indexed_rows"] == scalar_idx.num_indexed_rows
assert scalar_idx["created_at"] == scalar_idx.created_at
# index_details is parsed from JSON into a Python object
assert scalar_idx.index_details is not None
assert isinstance(scalar_idx.index_details, dict)
assert scalar_idx["index_details"] == scalar_idx.index_details
vector_idx = indices["vector_idx"]
assert vector_idx.index_uuid is not None
assert vector_idx.num_indexed_rows == 300
assert isinstance(vector_idx.index_details, dict)
def test_empty_query(mem_db: DBConnection):
table = mem_db.create_table(
@@ -3042,30 +2370,6 @@ def test_alter_columns(mem_db: DBConnection):
assert table.to_arrow().column_names == ["new_id"]
def test_update_field_metadata(mem_db: DBConnection):
data = pa.table({"id": [0, 1], "category": ["a", "b"]})
table = mem_db.create_table("my_table", data=data)
res = table.update_field_metadata(
{"path": "category", "metadata": {"unit": "label", "pii": "false"}}
)
assert res.version == 2
# Arrow field metadata is bytes-keyed
assert table.schema.field("category").metadata == {
b"unit": b"label",
b"pii": b"false",
}
# merge: add a key, delete one via None, keep the rest
table.update_field_metadata(
{"path": "category", "metadata": {"source": "import", "pii": None}}
)
assert table.schema.field("category").metadata == {
b"unit": b"label",
b"source": b"import",
}
@pytest.mark.asyncio
async def test_alter_columns_async(mem_db_async: AsyncConnection):
data = pa.table({"id": [0, 1]})
@@ -3344,38 +2648,3 @@ def test_sanitize_data_metadata_not_stripped():
assert result_schema.metadata is not None
assert result_schema.metadata[b"existing_key"] == b"existing_value"
assert result_schema.metadata[b"new_key"] == b"new_value"
@pytest.mark.asyncio
async def test_async_search_runs_embedding_on_dedicated_executor(
mem_db_async: AsyncConnection,
):
# Regression test for #3310: AsyncTable.search() must run the (potentially
# blocking) query-embedding call on the dedicated embedding executor, not
# asyncio's default executor -- which is shared with other blocking I/O and
# can be starved by a slow embedding call under concurrent load.
func = MockTextEmbeddingFunction.create()
class Schema(LanceModel):
text: str = func.SourceField()
vector: Vector(func.ndims()) = func.VectorField()
table = await mem_db_async.create_table("embed_executor", schema=Schema)
await table.add([{"text": "hello world"}])
captured_threads: List[str] = []
original = MockTextEmbeddingFunction.generate_embeddings
def record_thread(self, texts):
captured_threads.append(threading.current_thread().name)
return original(self, texts)
# Patch only around the search so we capture the query-embedding call, not
# the add-time source-embedding call.
with patch.object(MockTextEmbeddingFunction, "generate_embeddings", record_thread):
await (await table.search("a query string")).limit(1).to_list()
assert captured_threads, "search did not invoke the embedding function"
assert all(name.startswith("lancedb-embedding") for name in captured_threads), (
f"embedding ran off the dedicated executor: {captured_threads}"
)

View File

@@ -1,15 +1,10 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import contextlib
import functools
import http.server
import json
import multiprocessing as mp
import pickle
import re
import sys
import threading
import lancedb
import pyarrow as pa
@@ -20,107 +15,6 @@ from lancedb.util import tbl_to_tensor
torch = pytest.importorskip("torch")
REMOTE_ROWS = list(range(100))
def _make_mock_http_handler(handler):
class MockLanceDBHandler(http.server.BaseHTTPRequestHandler):
def do_GET(self):
handler(self)
def do_POST(self):
handler(self)
return MockLanceDBHandler
def _remote_schema_payload():
return {
"version": 1,
"schema": {
"fields": [
{"name": "a", "type": {"type": "int64"}, "nullable": False},
]
},
}
def _offsets_from_filter(filter_sql: str | None) -> list[int]:
if filter_sql is None:
return REMOTE_ROWS
match = re.search(r"_rowoffset in \((.*?)\)", filter_sql)
if match is None:
return REMOTE_ROWS
raw_offsets = match.group(1).strip()
if raw_offsets == "":
return []
return [int(offset.strip()) for offset in raw_offsets.split(",")]
def _remote_dataset_handler(request):
request.close_connection = True
if request.path == "/v1/table/test/describe/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(json.dumps(_remote_schema_payload()).encode())
elif request.path == "/v1/table/test/count_rows/":
request.send_response(200)
request.send_header("Content-Type", "application/json")
request.end_headers()
request.wfile.write(str(len(REMOTE_ROWS)).encode())
elif request.path == "/v1/table/test/query/":
content_len = int(request.headers.get("Content-Length"))
body = json.loads(request.rfile.read(content_len))
offsets = _offsets_from_filter(body.get("filter"))
requested_columns = body.get("columns") or ["a"]
if isinstance(requested_columns, dict):
requested_columns = list(requested_columns)
data = {}
for column in requested_columns:
if column == "a":
data[column] = [REMOTE_ROWS[offset] for offset in offsets]
elif column == "_rowoffset":
data[column] = offsets
elif column == "_rowid":
data[column] = offsets
table = pa.table(data)
request.send_response(200)
request.send_header("Content-Type", "application/vnd.apache.arrow.file")
request.end_headers()
with pa.ipc.new_file(request.wfile, schema=table.schema) as writer:
writer.write_table(table)
else:
request.send_response(404)
request.end_headers()
@contextlib.contextmanager
def _remote_dataset_table():
with http.server.ThreadingHTTPServer(
("localhost", 0), _make_mock_http_handler(_remote_dataset_handler)
) as server:
port = server.server_address[1]
handle = threading.Thread(target=server.serve_forever)
handle.start()
try:
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
yield db.open_table("test")
finally:
server.shutdown()
handle.join()
def _open_native_table(uri: str, table_name: str):
"""Top-level connection factory used by the explicit-factory pickle test.
@@ -213,39 +107,6 @@ def test_permutation_dataloader_multiprocessing(tmp_db):
assert seen == 1000
def test_remote_table_dataloader_multiprocessing():
with _remote_dataset_table() as table:
dataloader = torch.utils.data.DataLoader(
table,
collate_fn=tbl_to_tensor,
batch_size=10,
num_workers=2,
multiprocessing_context="spawn",
)
seen = 0
for batch in dataloader:
assert batch.size(0) == 1
assert batch.size(1) == 10
seen += batch.size(1)
assert seen == len(REMOTE_ROWS)
def test_remote_permutation_dataloader_multiprocessing():
with _remote_dataset_table() as table:
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(
permutation,
batch_size=10,
num_workers=2,
multiprocessing_context="spawn",
)
seen = 0
for batch in dataloader:
assert batch["a"].size(0) == 10
seen += batch["a"].size(0)
assert seen == len(REMOTE_ROWS)
def test_permutation_pickle_with_connection_factory(tmp_path):
"""When the user provides a connection_factory, pickling should round-trip
through that factory rather than introspecting the connection URI. Useful
@@ -310,35 +171,6 @@ def _multiworker_dataloader_target(db_uri: str, result_queue):
result_queue.put(count)
def _remote_multiworker_dataloader_target(port: int, result_queue):
import lancedb
from lancedb.permutation import Permutation
db = lancedb.connect(
"db://dev",
api_key="fake",
host_override=f"http://localhost:{port}",
client_config={
"retry_config": {"retries": 0},
"timeout_config": {"connect_timeout": 2, "read_timeout": 2},
},
)
table = db.open_table("test")
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(
permutation,
batch_size=10,
num_workers=2,
multiprocessing_context="fork",
)
count = 0
for batch in dataloader:
assert batch["a"].size(0) == 10
count += 1
result_queue.put(count)
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
@@ -376,46 +208,3 @@ def test_permutation_dataloader_fork_workers(tmp_path):
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no batches"
assert queue.get() == 100
@pytest.mark.skipif(
sys.platform != "linux",
reason=(
"fork() is unavailable on Windows and unsafe on macOS "
"(Apple frameworks/TLS are not fork-safe)"
),
)
def test_remote_permutation_dataloader_fork_workers():
with http.server.ThreadingHTTPServer(
("localhost", 0), _make_mock_http_handler(_remote_dataset_handler)
) as server:
port = server.server_address[1]
handle = threading.Thread(target=server.serve_forever)
handle.start()
try:
ctx = mp.get_context("spawn")
queue = ctx.Queue()
proc = ctx.Process(
target=_remote_multiworker_dataloader_target,
args=(port, queue),
)
proc.start()
proc.join(timeout=30)
if proc.is_alive():
proc.terminate()
proc.join(timeout=5)
if proc.is_alive():
proc.kill()
proc.join()
pytest.fail(
"Remote permutation hung when iterated in a fork-based "
"DataLoader worker"
)
assert proc.exitcode == 0, f"child exited with code {proc.exitcode}"
assert not queue.empty(), "child produced no batches"
assert queue.get() == 10
finally:
server.shutdown()
handle.join()

View File

@@ -149,21 +149,6 @@ def test_value_to_sql_dict():
assert value_to_sql({}) == "named_struct()"
def test_value_to_sql_numpy_scalars():
# numpy scalars (e.g. pulled from an ndarray or a pandas column) must
# convert the same way as their native Python counterparts. np.float64
# already worked by virtue of subclassing float, but the integer / bool
# / float32 scalars previously raised NotImplementedError.
import numpy as np
assert value_to_sql(np.int32(5)) == "5"
assert value_to_sql(np.int64(5)) == "5"
assert value_to_sql(np.float32(1.5)) == "1.5"
assert value_to_sql(np.float64(1.5)) == "1.5"
assert value_to_sql(np.bool_(True)) == "TRUE"
assert value_to_sql(np.bool_(False)) == "FALSE"
def test_append_vector_columns():
registry = EmbeddingFunctionRegistry.get_instance()
registry.register("test")(MockTextEmbeddingFunction)

View File

@@ -9,9 +9,7 @@
use arrow::{datatypes::DataType, pyarrow::PyArrowType};
use datafusion_common::ScalarValue;
use lancedb::expr::{
DfExpr, col as ldb_col, contains, expr_cast, is_in, lit as df_lit, lower, upper,
};
use lancedb::expr::{DfExpr, col as ldb_col, contains, expr_cast, lit as df_lit, lower, upper};
use pyo3::types::PyBytes;
use pyo3::{Bound, PyAny, PyResult, exceptions::PyValueError, prelude::*, pyfunction};
@@ -107,14 +105,6 @@ impl PyExpr {
Self(contains(self.0.clone(), substr.0.clone()))
}
// ── membership ───────────────────────────────────────────────────────────
/// Return true where the value is one of the given expressions (SQL ``IN``).
fn isin(&self, list: Vec<Self>) -> Self {
let items: Vec<DfExpr> = list.into_iter().map(|e| e.0).collect();
Self(is_in(self.0.clone(), items))
}
// ── type cast ────────────────────────────────────────────────────────────
/// Cast the expression to `data_type`.

View File

@@ -1,19 +1,18 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use chrono::{DateTime, Utc};
use lancedb::index::vector::{
IvfFlatIndexBuilder, IvfHnswFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder,
IvfPqIndexBuilder, IvfRqIndexBuilder, IvfSqIndexBuilder,
};
use lancedb::index::{
Index as LanceDbIndex,
scalar::{BTreeIndexBuilder, FmIndexBuilder, FtsIndexBuilder},
scalar::{BTreeIndexBuilder, FtsIndexBuilder},
};
use pyo3::IntoPyObject;
use pyo3::types::PyStringMethods;
use pyo3::{
Bound, FromPyObject, Py, PyAny, PyResult, Python,
Bound, FromPyObject, PyAny, PyResult, Python,
exceptions::{PyKeyError, PyValueError},
intern, pyclass, pymethods,
types::{PyAnyMethods, PyString},
@@ -39,7 +38,6 @@ pub fn extract_index_params(source: &Option<Bound<'_, PyAny>>) -> PyResult<Lance
"BTree" => Ok(LanceDbIndex::BTree(BTreeIndexBuilder::default())),
"Bitmap" => Ok(LanceDbIndex::Bitmap(Default::default())),
"LabelList" => Ok(LanceDbIndex::LabelList(Default::default())),
"Fm" => Ok(LanceDbIndex::Fm(FmIndexBuilder::default())),
"FTS" => {
let params = source.extract::<FtsParams>()?;
let inner_opts = FtsIndexBuilder::default()
@@ -185,7 +183,7 @@ pub fn extract_index_params(source: &Option<Bound<'_, PyAny>>) -> PyResult<Lance
Ok(LanceDbIndex::IvfHnswFlat(hnsw_flat_builder))
}
not_supported => Err(PyValueError::new_err(format!(
"Invalid index type '{}'. Must be one of BTree, Bitmap, LabelList, Fm, FTS, IvfPq, IvfSq, IvfHnswPq, IvfHnswSq, or IvfHnswFlat",
"Invalid index type '{}'. Must be one of BTree, Bitmap, LabelList, FTS, IvfPq, IvfSq, IvfHnswPq, IvfHnswSq, or IvfHnswFlat",
not_supported
))),
}
@@ -295,26 +293,6 @@ pub struct IndexConfig {
pub columns: Vec<String>,
/// Name of the index.
pub name: String,
/// The UUID of the first segment of the index.
pub index_uuid: Option<String>,
/// The protobuf type URL, a precise type identifier for the index.
pub type_url: Option<String>,
/// When the index was created.
pub created_at: Option<DateTime<Utc>>,
/// The number of rows indexed, across all segments.
pub num_indexed_rows: Option<u64>,
/// The number of rows not yet covered by this index.
pub num_unindexed_rows: Option<u64>,
/// The total size in bytes of all index files across all segments.
pub size_bytes: Option<u64>,
/// The number of segments that make up the index.
pub num_segments: Option<u32>,
/// The on-disk index format version.
pub index_version: Option<i32>,
/// Index-type-specific details parsed as a Python object (dict, list, etc.).
///
/// Falls back to a raw string if JSON parsing fails. `None` when unavailable.
pub index_details: Option<Py<PyAny>>,
}
#[pymethods]
@@ -333,49 +311,18 @@ impl IndexConfig {
"index_type" => Ok(self.index_type.clone().into_pyobject(py)?.into_any()),
"columns" => Ok(self.columns.clone().into_pyobject(py)?.into_any()),
"name" | "index_name" => Ok(self.name.clone().into_pyobject(py)?.into_any()),
"index_uuid" => Ok(self.index_uuid.clone().into_pyobject(py)?.into_any()),
"type_url" => Ok(self.type_url.clone().into_pyobject(py)?.into_any()),
"created_at" => Ok(self.created_at.into_pyobject(py)?.into_any()),
"num_indexed_rows" => Ok(self.num_indexed_rows.into_pyobject(py)?.into_any()),
"num_unindexed_rows" => Ok(self.num_unindexed_rows.into_pyobject(py)?.into_any()),
"size_bytes" => Ok(self.size_bytes.into_pyobject(py)?.into_any()),
"num_segments" => Ok(self.num_segments.into_pyobject(py)?.into_any()),
"index_version" => Ok(self.index_version.into_pyobject(py)?.into_any()),
"index_details" => Ok(self
.index_details
.as_ref()
.map(|obj| obj.clone_ref(py))
.into_pyobject(py)?
.into_any()),
_ => Err(PyKeyError::new_err(format!("Invalid key: {}", key))),
}
}
}
fn parse_index_details(py: Python<'_>, s: String) -> Py<PyAny> {
let json = py.import("json").expect("json module is always available");
match json.call_method1("loads", (s.as_str(),)) {
Ok(obj) => obj.into_any().unbind(),
Err(_) => s.into_pyobject(py).unwrap().into_any().unbind(),
}
}
impl IndexConfig {
pub fn from_lancedb(py: Python<'_>, value: lancedb::index::IndexConfig) -> Self {
impl From<lancedb::index::IndexConfig> for IndexConfig {
fn from(value: lancedb::index::IndexConfig) -> Self {
let index_type = format!("{:?}", value.index_type);
Self {
index_type,
columns: value.columns,
name: value.name,
index_uuid: value.index_uuid,
type_url: value.type_url,
created_at: value.created_at,
num_indexed_rows: value.num_indexed_rows,
num_unindexed_rows: value.num_unindexed_rows,
size_bytes: value.size_bytes,
num_segments: value.num_segments,
index_version: value.index_version,
index_details: value.index_details.map(|s| parse_index_details(py, s)),
}
}
}

View File

@@ -16,7 +16,7 @@ use query::{FTSQuery, HybridQuery, Query, VectorQuery};
use session::Session;
use table::{
AddColumnsResult, AddResult, AlterColumnsResult, DeleteResult, DropColumnsResult, LsmWriteSpec,
MergeResult, Table, UpdateFieldMetadataResult, UpdateResult,
MergeResult, Table, UpdateResult,
};
pub mod arrow;
@@ -50,7 +50,6 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_class::<RecordBatchStream>()?;
m.add_class::<AddColumnsResult>()?;
m.add_class::<AlterColumnsResult>()?;
m.add_class::<UpdateFieldMetadataResult>()?;
m.add_class::<AddResult>()?;
m.add_class::<MergeResult>()?;
m.add_class::<LsmWriteSpec>()?;

View File

@@ -6,7 +6,6 @@ use crate::runtime::future_into_py;
use crate::{
connection::Connection,
error::PythonErrorExt,
expr::PyExpr,
index::{IndexConfig, extract_index_params},
query::{Query, TakeQuery},
table::scannable::PyScannable,
@@ -17,24 +16,18 @@ use arrow::{
pyarrow::{FromPyArrow, PyArrowType, ToPyArrow},
};
use lancedb::table::{
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, NewColumnTransform,
OptimizeAction, OptimizeOptions, Ref, Table as LanceDbTable,
AddDataMode, ColumnAlteration, Duration, NewColumnTransform, OptimizeAction, OptimizeOptions,
Table as LanceDbTable,
};
use pyo3::{
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
exceptions::{PyRuntimeError, PyValueError},
exceptions::{PyKeyError, PyRuntimeError, PyValueError},
pyclass, pymethods,
types::{IntoPyDict, PyAnyMethods, PyDict, PyDictMethods},
};
mod scannable;
#[derive(FromPyObject)]
enum PredicateArg {
Expr(PyExpr),
Sql(String),
}
/// Statistics about a compaction operation.
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
@@ -150,20 +143,18 @@ pub struct MergeResult {
pub num_inserted_rows: u64,
pub num_deleted_rows: u64,
pub num_attempts: u32,
pub num_rows: u64,
}
#[pymethods]
impl MergeResult {
pub fn __repr__(&self) -> String {
format!(
"MergeResult(version={}, num_updated_rows={}, num_inserted_rows={}, num_deleted_rows={}, num_attempts={}, num_rows={})",
"MergeResult(version={}, num_updated_rows={}, num_inserted_rows={}, num_deleted_rows={}, num_attempts={})",
self.version,
self.num_updated_rows,
self.num_inserted_rows,
self.num_deleted_rows,
self.num_attempts,
self.num_rows
self.num_attempts
)
}
}
@@ -176,7 +167,6 @@ impl From<lancedb::table::MergeResult> for MergeResult {
num_inserted_rows: result.num_inserted_rows,
num_deleted_rows: result.num_deleted_rows,
num_attempts: result.num_attempts,
num_rows: result.num_rows,
}
}
}
@@ -204,12 +194,6 @@ impl LsmWriteSpec {
}
/// Identity sharding — shard by the raw value of `column`.
///
/// `column` must be a deterministic function of the unenforced primary
/// key: every row with a given primary key must always produce the same
/// `column` value, or upserts of that key can land in different shards
/// and a stale version can win. Typically `column` is the primary key
/// itself or a stable attribute of it.
#[staticmethod]
pub fn identity(column: String) -> Self {
Self {
@@ -364,27 +348,6 @@ impl From<lancedb::table::AlterColumnsResult> for AlterColumnsResult {
}
}
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct UpdateFieldMetadataResult {
pub version: u64,
}
#[pymethods]
impl UpdateFieldMetadataResult {
pub fn __repr__(&self) -> String {
format!("UpdateFieldMetadataResult(version={})", self.version)
}
}
impl From<lancedb::table::UpdateFieldMetadataResult> for UpdateFieldMetadataResult {
fn from(result: lancedb::table::UpdateFieldMetadataResult) -> Self {
Self {
version: result.version,
}
}
}
#[pyclass(get_all, from_py_object)]
#[derive(Clone, Debug)]
pub struct DropColumnsResult {
@@ -568,15 +531,10 @@ impl Table {
})
}
#[allow(private_interfaces)]
pub fn delete(self_: PyRef<'_, Self>, condition: PredicateArg) -> PyResult<Bound<'_, PyAny>> {
pub fn delete(self_: PyRef<'_, Self>, condition: String) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
let result = match &condition {
PredicateArg::Expr(e) => inner.delete(&e.0).await,
PredicateArg::Sql(s) => inner.delete(s.as_str()).await,
}
.infer_error()?;
let result = inner.delete(&condition).await.infer_error()?;
Ok(DeleteResult::from(result))
})
}
@@ -694,13 +652,13 @@ impl Table {
pub fn list_indices(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
let indices = inner.list_indices().await.infer_error()?;
Python::attach(|py| {
Ok(indices
.into_iter()
.map(|idx| IndexConfig::from_lancedb(py, idx))
.collect::<Vec<_>>())
})
Ok(inner
.list_indices()
.await
.infer_error()?
.into_iter()
.map(IndexConfig::from)
.collect::<Vec<_>>())
})
}
@@ -723,6 +681,10 @@ impl Table {
dict.set_item("num_indices", num_indices)?;
}
if let Some(loss) = stats.loss {
dict.set_item("loss", loss)?;
}
Ok(Some(dict.unbind()))
})
} else {
@@ -872,15 +834,6 @@ impl Table {
Ok(Tags::new(self.inner_ref()?.clone()))
}
pub fn current_branch(&self) -> PyResult<Option<String>> {
Ok(self.inner_ref()?.current_branch())
}
#[getter]
pub fn branches(&self) -> PyResult<Branches> {
Ok(Branches::new(self.inner_ref()?.clone()))
}
#[pyo3(signature = (offsets))]
pub fn take_offsets(self_: PyRef<'_, Self>, offsets: Vec<u64>) -> PyResult<TakeQuery> {
Ok(TakeQuery::new(
@@ -971,13 +924,8 @@ impl Table {
builder.when_not_matched_insert_all();
}
if parameters.when_not_matched_by_source_delete {
if let Some(e) = parameters.when_not_matched_by_source_condition_expr {
builder.when_not_matched_by_source_delete_expr(e.0);
} else {
builder.when_not_matched_by_source_delete(
parameters.when_not_matched_by_source_condition,
);
}
builder
.when_not_matched_by_source_delete(parameters.when_not_matched_by_source_condition);
}
if let Some(timeout) = parameters.timeout {
builder.timeout(timeout);
@@ -985,12 +933,6 @@ impl Table {
if let Some(use_index) = parameters.use_index {
builder.use_index(use_index);
}
if let Some(use_lsm_write) = parameters.use_lsm_write {
builder.use_lsm_write(use_lsm_write);
}
if let Some(validate_single_shard) = parameters.validate_single_shard {
builder.validate_single_shard(validate_single_shard);
}
future_into_py(self_.py(), async move {
let res = builder.execute(Box::new(batches)).await.infer_error()?;
@@ -1029,13 +971,6 @@ impl Table {
})
}
pub fn close_lsm_writers(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
inner.close_lsm_writers().await.infer_error()
})
}
pub fn uses_v2_manifest_paths(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
@@ -1145,57 +1080,31 @@ impl Table {
field_name: String,
metadata: &Bound<'_, PyDict>,
) -> PyResult<Bound<'a, PyAny>> {
// Deprecated: forwards to the update_field_metadata path (replace mode).
let mut update = FieldMetadataUpdate::new(field_name).replace();
for (key, value) in metadata.into_iter() {
update = update.set(key.extract::<String>()?, value.extract::<String>()?);
let mut new_metadata = HashMap::<String, String>::new();
for (column_name, value) in metadata.into_iter() {
let key: String = column_name.extract()?;
let value: String = value.extract()?;
new_metadata.insert(key, value);
}
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
inner.update_field_metadata(&[update]).await.infer_error()?;
let native_tbl = inner
.as_native()
.ok_or_else(|| PyValueError::new_err("This cannot be run on a remote table"))?;
let schema = native_tbl.manifest().await.infer_error()?.schema;
let field = schema
.field(&field_name)
.ok_or_else(|| PyKeyError::new_err(format!("Field {} not found", field_name)))?;
native_tbl
.replace_field_metadata(vec![(field.id as u32, new_metadata)])
.await
.infer_error()?;
Ok(())
})
}
pub fn update_field_metadata<'a>(
self_: PyRef<'a, Self>,
updates: Vec<Bound<PyDict>>,
) -> PyResult<Bound<'a, PyAny>> {
let updates = updates
.iter()
.map(|update| {
let path: String = update
.get_item("path")?
.ok_or_else(|| PyValueError::new_err("Missing path"))?
.extract()?;
let mut field_update = FieldMetadataUpdate::new(path);
if let Some(metadata) = update.get_item("metadata")? {
let metadata_dict = metadata.cast::<PyDict>()?;
for (key, value) in metadata_dict.iter() {
let key: String = key.extract()?;
if value.is_none() {
field_update = field_update.remove(key);
} else {
field_update = field_update.set(key, value.extract::<String>()?);
}
}
}
if let Some(replace) = update.get_item("replace")?
&& replace.extract::<bool>()?
{
field_update = field_update.replace();
}
Ok(field_update)
})
.collect::<PyResult<Vec<_>>>()?;
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
let result = inner.update_field_metadata(&updates).await.infer_error()?;
Ok(UpdateFieldMetadataResult::from(result))
})
}
}
#[derive(FromPyObject)]
@@ -1213,11 +1122,8 @@ pub struct MergeInsertParams {
when_not_matched_insert_all: bool,
when_not_matched_by_source_delete: bool,
when_not_matched_by_source_condition: Option<String>,
when_not_matched_by_source_condition_expr: Option<PyExpr>,
timeout: Option<std::time::Duration>,
use_index: Option<bool>,
use_lsm_write: Option<bool>,
validate_single_shard: Option<bool>,
}
#[pyclass]
@@ -1288,71 +1194,3 @@ impl Tags {
})
}
}
#[pyclass]
pub struct Branches {
inner: LanceDbTable,
}
impl Branches {
pub fn new(table: LanceDbTable) -> Self {
Self { inner: table }
}
}
#[pymethods]
impl Branches {
pub fn list(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let res = inner.list_branches().await.infer_error()?;
Python::attach(|py| {
let py_dict = PyDict::new(py);
for (name, contents) in res {
let value = PyDict::new(py);
value.set_item("parent_branch", contents.parent_branch)?;
value.set_item("parent_version", contents.parent_version)?;
value.set_item("manifest_size", contents.manifest_size)?;
py_dict.set_item(name, value)?;
}
Ok(py_dict.unbind())
})
})
}
#[pyo3(signature = (name, from_ref=None, from_version=None))]
pub fn create(
self_: PyRef<'_, Self>,
name: String,
from_ref: Option<String>,
from_version: Option<u64>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let from = Ref::Version(from_ref, from_version);
let table = inner.create_branch(&name, from).await.infer_error()?;
Ok(Table::new(table))
})
}
#[pyo3(signature = (name, version=None))]
pub fn checkout(
self_: PyRef<'_, Self>,
name: String,
version: Option<u64>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
let table = inner.checkout_branch(&name, version).await.infer_error()?;
Ok(Table::new(table))
})
}
pub fn delete(self_: PyRef<'_, Self>, name: String) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner.clone();
future_into_py(self_.py(), async move {
inner.delete_branch(&name).await.infer_error()?;
Ok(())
})
}
}

View File

@@ -450,27 +450,6 @@ def binary_table(tmp_path):
return db.create_table("binary_test", data)
class TestExprIsin:
def test_isin_ints(self):
assert col("id").isin([1, 2, 3]).to_sql() == "id IN (1, 2, 3)"
def test_isin_strs(self):
assert (
col("status").isin(["active", "pending"]).to_sql()
== "status IN ('active', 'pending')"
)
def test_isin_coerces_and_mixes(self):
assert col("id").isin([lit(1), 2]).to_sql() == "id IN (1, 2)"
def test_isin_empty(self):
assert col("id").isin([]).to_sql() == "id IN ()"
def test_isin_filter(self, simple_table):
result = simple_table.search().where(col("id").isin([1, 3, 5])).to_arrow()
assert result.num_rows == 3
class TestExprBytesIntegration:
def test_binary_equality_filter(self, binary_table):
result = (

4228
python/uv.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,2 +1,2 @@
[toolchain]
channel = "1.95.0"
channel = "1.94.0"

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.30.1-beta.2"
version = "0.30.0"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -75,7 +75,7 @@ reqwest = { version = "0.12.0", default-features = false, features = [
"stream",
], optional = true }
http = { version = "1", optional = true } # Matching what is in reqwest
uuid = { version = "1.7.0", features = ["v4", "v5"] }
uuid = { version = "1.7.0", features = ["v4"] }
polars-arrow = { version = ">=0.37,<0.40.0", optional = true }
polars = { version = ">=0.37,<0.40.0", optional = true }
hf-hub = { version = "0.4.1", optional = true, default-features = false, features = [

View File

@@ -9,7 +9,6 @@ use std::sync::Arc;
use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;
use lance::dataset::ReadParams;
use lance::dataset::refs::MAIN_BRANCH;
use lance_namespace::models::{
CreateNamespaceRequest, CreateNamespaceResponse, DescribeNamespaceRequest,
DescribeNamespaceResponse, DropNamespaceRequest, DropNamespaceResponse, ListNamespacesRequest,
@@ -120,8 +119,6 @@ pub struct OpenTableBuilder {
parent: Arc<dyn Database>,
request: OpenTableRequest,
embedding_registry: Arc<dyn EmbeddingRegistry>,
branch: Option<String>,
version: Option<u64>,
}
impl OpenTableBuilder {
@@ -142,8 +139,6 @@ impl OpenTableBuilder {
managed_versioning: None,
},
embedding_registry,
branch: None,
version: None,
}
}
@@ -264,48 +259,14 @@ impl OpenTableBuilder {
self
}
/// Open the table scoped to the given branch instead of the default branch.
///
/// Reads and writes on the returned table operate in the branch's context.
pub fn branch(mut self, branch: impl Into<String>) -> Self {
self.branch = Some(branch.into());
self
}
/// Open the table pinned to a specific version, producing a read-only "view".
///
/// Composes with [`Self::branch`]: when a branch is also set, this opens that
/// branch at the given version; otherwise it opens `main` at that version.
/// The returned table is a detached head, so operations that modify the table
/// will fail until [`Table::checkout_latest`] is called.
///
/// ```
/// # use lancedb::Connection;
/// # async fn f(conn: &Connection) -> Result<(), Box<dyn std::error::Error>> {
/// let table = conn.open_table("t").branch("exp").version(3).execute().await?;
/// # Ok(())
/// # }
/// ```
pub fn version(mut self, version: u64) -> Self {
self.version = Some(version);
self
}
/// Open the table
pub async fn execute(self) -> Result<Table> {
let table = self.parent.open_table(self.request).await?;
let table = Table::new_with_embedding_registry(table, self.parent, self.embedding_registry);
// "main" is the default branch, so treat it as no branch.
let branch = self.branch.filter(|b| b.as_str() != MAIN_BRANCH);
match branch {
Some(branch) => table.checkout_branch(&branch, self.version).await,
None => {
if let Some(version) = self.version {
table.checkout(version).await?;
}
Ok(table)
}
}
Ok(Table::new_with_embedding_registry(
table,
self.parent,
self.embedding_registry,
))
}
}

View File

@@ -16,7 +16,7 @@ use lance_namespace::{
CreateNamespaceRequest, CreateNamespaceResponse, DeclareTableRequest,
DescribeNamespaceRequest, DescribeNamespaceResponse, DescribeTableRequest,
DropNamespaceRequest, DropNamespaceResponse, DropTableRequest, ListNamespacesRequest,
ListNamespacesResponse, ListTablesRequest, ListTablesResponse, RenameTableRequest,
ListNamespacesResponse, ListTablesRequest, ListTablesResponse,
},
};
use lance_namespace_impls::ConnectBuilder;
@@ -437,11 +437,8 @@ impl Database for LanceNamespaceDatabase {
// Set up commit handler when managed_versioning is enabled
if managed_versioning == Some(true) {
let external_store = LanceNamespaceExternalManifestStore::for_table_uri(
self.namespace.clone(),
table_id.clone(),
&location,
)?;
let external_store =
LanceNamespaceExternalManifestStore::new(self.namespace.clone(), table_id.clone());
let commit_handler: Arc<dyn CommitHandler> = Arc::new(ExternalManifestCommitHandler {
external_manifest_store: Arc::new(external_store),
});
@@ -491,34 +488,14 @@ impl Database for LanceNamespaceDatabase {
async fn rename_table(
&self,
cur_name: &str,
new_name: &str,
cur_namespace_path: &[String],
new_namespace_path: &[String],
_cur_name: &str,
_new_name: &str,
_cur_namespace_path: &[String],
_new_namespace_path: &[String],
) -> Result<()> {
let mut cur_table_id = cur_namespace_path.to_vec();
cur_table_id.push(cur_name.to_string());
let new_namespace_id = if new_namespace_path.is_empty() {
None
} else {
Some(new_namespace_path.to_vec())
};
let rename_request = RenameTableRequest {
id: Some(cur_table_id),
new_table_name: new_name.to_string(),
new_namespace_id,
..Default::default()
};
self.namespace
.rename_table(rename_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to rename table: {}", e),
})?;
Ok(())
Err(Error::NotSupported {
message: "rename_table is not supported for namespace connections".to_string(),
})
}
async fn drop_table(&self, name: &str, namespace_path: &[String]) -> Result<()> {
@@ -763,64 +740,6 @@ mod tests {
assert!(table_names.contains(&"test_table".to_string()));
}
#[tokio::test]
async fn test_namespace_branch_query_under_pushdown_stays_local() {
// With QueryTable pushdown enabled, a query on the main branch routes to
// the namespace server, but a branch handle must run locally: the
// server-side request carries no branch and would return main's rows.
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.pushdown_operation(NamespaceClientPushdownOperation::QueryTable)
.execute()
.await
.expect("Failed to connect to namespace");
conn.create_namespace(CreateNamespaceRequest {
id: Some(vec!["test_ns".into()]),
..Default::default()
})
.await
.expect("Failed to create namespace");
// main has 5 rows
let table = conn
.create_table("ref_test", create_test_data())
.namespace(vec!["test_ns".into()])
.execute()
.await
.expect("Failed to create table");
let main_version = table.version().await.unwrap();
// fork a branch off main, then add 5 more rows so it differs from main
let branch = table
.create_branch("exp", main_version)
.await
.expect("Failed to create branch");
branch
.add(create_test_data())
.execute()
.await
.expect("Failed to append to branch");
// the branch query must run locally and see the branch's 10 rows --
// not get routed to the server (which carries no branch) and see main's 5
let results = branch
.query()
.execute()
.await
.expect("Failed to query branch")
.try_collect::<Vec<_>>()
.await
.expect("Failed to collect results");
let count: usize = results.iter().map(|b| b.num_rows()).sum();
assert_eq!(count, 10);
}
#[tokio::test]
async fn test_namespace_describe_table() {
// Setup: Create a temporary directory for the namespace

View File

@@ -203,11 +203,11 @@ impl Shuffler {
// Finish writing files
for (file_idx, mut writer) in file_writers.into_iter().enumerate() {
let write_summary = writer.finish().await?;
let num_written = writer.finish().await?;
log::debug!(
"Shuffle job {}: wrote {} rows to file {}",
self.id,
write_summary.num_rows,
num_written,
file_idx
);
}
@@ -464,9 +464,11 @@ mod tests {
let mut iter = ids.into_iter().map(|o| o.unwrap());
while let Some(first) = iter.next() {
let rows_left_in_clump = if first == 4470 { 19 } else { 29 };
for expected_next in (first + 1)..=(first + rows_left_in_clump) {
let mut expected_next = first + 1;
for _ in 0..rows_left_in_clump {
let next = iter.next().unwrap();
assert_eq!(next, expected_next);
expected_next += 1;
}
}
}

View File

@@ -57,10 +57,6 @@ pub fn expr_cast(expr: Expr, data_type: DataType) -> Expr {
cast(expr, data_type)
}
pub fn is_in(expr: Expr, list: Vec<Expr>) -> Expr {
expr.in_list(list, false)
}
lazy_static::lazy_static! {
static ref FUNC_REGISTRY: std::sync::RwLock<std::collections::HashMap<String, Arc<ScalarUDF>>> = {
let mut m = std::collections::HashMap::new();
@@ -198,13 +194,6 @@ mod tests {
assert_eq!(sql, "NOT (data = X'ABCD')");
}
#[test]
fn test_is_in() {
let expr = is_in(col("id"), vec![lit(1i64), lit(2i64), lit(3i64)]);
let sql = expr_to_sql_string(&expr).unwrap();
assert!(sql.contains("IN"), "expected IN in: {}", sql);
}
#[test]
fn test_multiple_binary_literals() {
use datafusion_common::ScalarValue;

View File

@@ -1,7 +1,6 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use chrono::{DateTime, Utc};
use scalar::FtsIndexBuilder;
use serde::Deserialize;
use serde_with::skip_serializing_none;
@@ -13,7 +12,7 @@ use crate::index::vector::IvfRqIndexBuilder;
use crate::{DistanceType, Error, Result, table::BaseTable};
use self::{
scalar::{BTreeIndexBuilder, BitmapIndexBuilder, FmIndexBuilder, LabelListIndexBuilder},
scalar::{BTreeIndexBuilder, BitmapIndexBuilder, LabelListIndexBuilder},
vector::{
IvfHnswFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder,
IvfSqIndexBuilder,
@@ -49,11 +48,6 @@ pub enum Index {
/// using an underlying bitmap index.
LabelList(LabelListIndexBuilder),
/// An `FM` index is a scalar index on string/binary columns that accelerates
/// substring search (`contains(col, 'needle')`). It matches arbitrary
/// substrings of the raw bytes, unlike the tokenized [`Index::FTS`] index.
Fm(FmIndexBuilder),
/// Full text search index using bm25.
FTS(FtsIndexBuilder),
@@ -312,8 +306,6 @@ pub enum IndexType {
Bitmap,
#[serde(alias = "LABEL_LIST")]
LabelList,
#[serde(alias = "FM", alias = "FMINDEX", alias = "FMIndex")]
Fm,
// FTS
#[serde(alias = "INVERTED", alias = "Inverted")]
FTS,
@@ -332,7 +324,6 @@ impl std::fmt::Display for IndexType {
Self::BTree => write!(f, "BTREE"),
Self::Bitmap => write!(f, "BITMAP"),
Self::LabelList => write!(f, "LABEL_LIST"),
Self::Fm => write!(f, "FM"),
Self::FTS => write!(f, "FTS"),
}
}
@@ -346,7 +337,6 @@ impl std::str::FromStr for IndexType {
"BTREE" => Ok(Self::BTree),
"BITMAP" => Ok(Self::Bitmap),
"LABEL_LIST" | "LABELLIST" => Ok(Self::LabelList),
"FM" | "FMINDEX" => Ok(Self::Fm),
"FTS" | "INVERTED" => Ok(Self::FTS),
"IVF_FLAT" => Ok(Self::IvfFlat),
"IVF_SQ" => Ok(Self::IvfSq),
@@ -374,51 +364,6 @@ pub struct IndexConfig {
/// Currently this is always a Vec of size 1. In the future there may
/// be more columns to represent composite indices.
pub columns: Vec<String>,
/// The UUID of the first segment of the index.
///
/// An index may be made up of multiple segments, each with their own UUID.
/// This is the UUID of the first segment. `None` if it could not be
/// determined (e.g. for remote tables, which do not yet surface this).
pub index_uuid: Option<String>,
/// The protobuf type URL, a precise type identifier for the index.
///
/// `None` if unavailable (e.g. for remote tables).
pub type_url: Option<String>,
/// When the index was created, taken as the minimum creation time across
/// all segments.
///
/// `None` if unavailable, such as for indices created before creation
/// timestamps were tracked, or for remote tables.
pub created_at: Option<DateTime<Utc>>,
/// The number of rows indexed, across all segments.
///
/// This is approximate and may include rows that have since been deleted.
/// `None` if unavailable (e.g. for remote tables).
pub num_indexed_rows: Option<u64>,
/// The number of rows in the table that are not yet covered by this index.
///
/// Computed as the table's total row count minus [`Self::num_indexed_rows`].
/// Optimizing the index will fold these rows into it. `None` if unavailable
/// (e.g. for remote tables).
pub num_unindexed_rows: Option<u64>,
/// The total size in bytes of all index files across all segments.
///
/// `None` if size information is unavailable, such as for indices created
/// before file sizes were tracked, or for remote tables.
pub size_bytes: Option<u64>,
/// The number of segments that make up the index.
///
/// `None` if unavailable (e.g. for remote tables).
pub num_segments: Option<u32>,
/// The on-disk index format version, taken from the first segment.
///
/// `None` if unavailable (e.g. for remote tables).
pub index_version: Option<i32>,
/// Index-type-specific details, serialized as JSON.
///
/// The shape of this JSON varies by index type. `None` if the details
/// could not be produced (e.g. no plugin available) or for remote tables.
pub index_details: Option<String>,
}
#[skip_serializing_none]
@@ -427,6 +372,7 @@ pub(crate) struct IndexMetadata {
pub metric_type: Option<DistanceType>,
// Sometimes the index type is provided at this level.
pub index_type: Option<IndexType>,
pub loss: Option<f64>,
}
// This struct is used to deserialize the JSON data returned from the Lance API
@@ -458,4 +404,6 @@ pub struct IndexStatistics {
pub distance_type: Option<DistanceType>,
/// The number of parts this index is split into.
pub num_indices: Option<u32>,
/// The loss value used by the index.
pub loss: Option<f64>,
}

View File

@@ -51,15 +51,6 @@ pub struct BitmapIndexBuilder {}
#[derive(Debug, Clone, Default, serde::Serialize)]
pub struct LabelListIndexBuilder {}
/// Builder for an FM-Index.
///
/// An FM-Index (FerraginaManzini) is a scalar index over string/binary columns
/// that accelerates substring search, i.e. `contains(col, 'needle')`. Unlike an
/// inverted (FTS) index it matches arbitrary substrings of the raw bytes rather
/// than tokenized words.
#[derive(Debug, Clone, Default, serde::Serialize)]
pub struct FmIndexBuilder {}
pub use lance_index::scalar::FullTextSearchQuery;
pub use lance_index::scalar::InvertedIndexParams as FtsIndexBuilder;
pub use lance_index::scalar::InvertedIndexParams;

View File

@@ -908,15 +908,6 @@ mod tests {
use serial_test::serial;
use std::time::Duration;
// Serializes the env-var-mutating tests below: cargo test runs tests in
// parallel, but several of these tests read and write the same process-
// global env vars (`LANCEDB_USER_ID*`), so they would race without this.
static ENV_MUTEX: std::sync::Mutex<()> = std::sync::Mutex::new(());
fn lock_env() -> std::sync::MutexGuard<'static, ()> {
ENV_MUTEX.lock().unwrap_or_else(|e| e.into_inner())
}
#[test]
fn test_timeout_config_default() {
let config = TimeoutConfig::default();
@@ -1175,7 +1166,6 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_none() {
let _guard = lock_env();
let config = ClientConfig::default();
// Clear env vars that might be set from other tests
// SAFETY: This is only called in tests
@@ -1189,7 +1179,6 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_from_env() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::set_var("LANCEDB_USER_ID", "env-user-id");
@@ -1205,7 +1194,6 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_from_env_key() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::remove_var("LANCEDB_USER_ID");
@@ -1227,7 +1215,6 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_direct_takes_precedence() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::set_var("LANCEDB_USER_ID", "env-user-id");
@@ -1246,7 +1233,6 @@ mod tests {
#[test]
#[serial(user_id_env)]
fn test_resolve_user_id_empty_env_ignored() {
let _guard = lock_env();
// SAFETY: This is only called in tests
unsafe {
std::env::set_var("LANCEDB_USER_ID", "");

View File

@@ -983,46 +983,6 @@ mod tests {
assert_eq!(table.name(), "table1");
}
#[tokio::test]
async fn test_open_table_branch_and_version() {
let conn = Connection::new_with_handler(|request| {
let body = if request.url().path() == "/v1/table/t/branches/list/" {
// checkout_branch validates the branch exists via list_branches.
r#"{"branches":{"exp":{"parentVersion":1,"createAt":1,"manifestSize":1}}}"#
} else {
// describe (table open + version/branch validation)
r#"{"table": "t", "version": 2, "schema": {"fields": [
{"name": "a", "type": { "type": "int32" }, "nullable": false}
]}}"#
};
http::Response::builder().status(200).body(body).unwrap()
});
// version-only (and "main" + version) time-travel the main chain
let v2 = conn.open_table("t").version(2).execute().await.unwrap();
assert_eq!(v2.current_branch(), None);
let main_v2 = conn
.open_table("t")
.branch("main")
.version(2)
.execute()
.await
.unwrap();
assert_eq!(main_v2.current_branch(), None);
// a non-main branch opens a handle scoped to that branch
let exp = conn.open_table("t").branch("exp").execute().await.unwrap();
assert_eq!(exp.current_branch(), Some("exp".to_string()));
let exp_v2 = conn
.open_table("t")
.branch("exp")
.version(2)
.execute()
.await
.unwrap();
assert_eq!(exp_v2.current_branch(), Some("exp".to_string()));
}
#[tokio::test]
async fn test_open_table_not_found() {
let conn = Connection::new_with_handler(|_| {

File diff suppressed because it is too large Load Diff

View File

@@ -48,8 +48,6 @@ pub struct RemoteInsertExec<S: HttpSend = Sender> {
metrics: ExecutionPlanMetricsSet,
upload_id: Option<String>,
tracker: Option<Arc<WriteProgressTracker>>,
/// Branch to write to via `?branch=`. `None` targets the main branch.
branch: Option<String>,
}
impl<S: HttpSend + 'static> RemoteInsertExec<S> {
@@ -61,10 +59,9 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
input: Arc<dyn ExecutionPlan>,
overwrite: bool,
tracker: Option<Arc<WriteProgressTracker>>,
branch: Option<String>,
) -> Self {
Self::new_inner(
table_name, identifier, client, input, overwrite, None, tracker, branch,
table_name, identifier, client, input, overwrite, None, tracker,
)
}
@@ -73,7 +70,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
/// Each partition's insert is staged under the given `upload_id` without
/// committing. The caller is responsible for calling the complete (or abort)
/// endpoint after all partitions finish.
#[allow(clippy::too_many_arguments)]
pub fn new_multipart(
table_name: String,
identifier: String,
@@ -82,7 +78,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
overwrite: bool,
upload_id: String,
tracker: Option<Arc<WriteProgressTracker>>,
branch: Option<String>,
) -> Self {
Self::new_inner(
table_name,
@@ -92,11 +87,9 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
overwrite,
Some(upload_id),
tracker,
branch,
)
}
#[allow(clippy::too_many_arguments)]
fn new_inner(
table_name: String,
identifier: String,
@@ -105,7 +98,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
overwrite: bool,
upload_id: Option<String>,
tracker: Option<Arc<WriteProgressTracker>>,
branch: Option<String>,
) -> Self {
let num_partitions = if upload_id.is_some() {
input.output_partitioning().partition_count()
@@ -131,7 +123,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
metrics: ExecutionPlanMetricsSet::new(),
upload_id,
tracker,
branch,
}
}
@@ -282,7 +273,6 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
self.overwrite,
self.upload_id.clone(),
self.tracker.clone(),
self.branch.clone(),
)))
}
@@ -314,7 +304,6 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
let table_name = self.table_name.clone();
let upload_id = self.upload_id.clone();
let tracker = self.tracker.clone();
let branch = self.branch.clone();
let stream = futures::stream::once(async move {
let mut request = client
@@ -327,9 +316,6 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
if let Some(ref uid) = upload_id {
request = request.query(&[("upload_id", uid.as_str())]);
}
if let Some(ref b) = branch {
request = request.query(&[("branch", b.as_str())]);
}
let (error_tx, mut error_rx) = tokio::sync::oneshot::channel();
let body = Self::stream_as_http_body(input_stream, error_tx, tracker)?;

Some files were not shown because too many files have changed in this diff Show More