mirror of
https://github.com/lancedb/lancedb.git
synced 2026-06-07 06:10:38 +00:00
Compare commits: feat/nodej...dependabot (84 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 5f6a12ce6b | | |
| | 59fbfd4158 | | |
| | f37e698e2f | | |
| | 09b1bbc12a | | |
| | c484b24e51 | | |
| | 3868965413 | | |
| | c13ebc6796 | | |
| | 4b287fd9c4 | | |
| | 64194ea8ad | | |
| | e6c5de1a58 | | |
| | 39a9f3e1e9 | | |
| | 952055d428 | | |
| | 927ba2c948 | | |
| | 415d199c15 | | |
| | a16676e05f | | |
| | 4e44262499 | | |
| | 632375faf1 | | |
| | 9969191d0d | | |
| | 1e7326cd8c | | |
| | 9483b534af | | |
| | ac3411e81e | | |
| | 6f18eb4cce | | |
| | 379684391e | | |
| | d065be0474 | | |
| | 7b874905fd | | |
| | a327044e2f | | |
| | f20ec99dec | | |
| | 60f961584c | | |
| | ac699d7ecf | | |
| | 968277be79 | | |
| | 5638907fa5 | | |
| | 048f52c2aa | | |
| | 458dcabbd2 | | |
| | 60ac5c9a7c | | |
| | d05fe8ec44 | | |
| | ab982d7f65 | | |
| | a3339b7bdd | | |
| | b20cdc4f93 | | |
| | e77a62e35a | | |
| | a9f49c8150 | | |
| | a7d9f2e99d | | |
| | 7dba793629 | | |
| | 87bd6694b6 | | |
| | 15e75804c4 | | |
| | df2b6d3dd4 | | |
| | ccec91d957 | | |
| | ec82e36317 | | |
| | da2a1c4a2c | | |
| | 8463a10ebe | | |
| | 7168d64af1 | | |
| | 403c33dff0 | | |
| | a0001043b6 | | |
| | 1bb7acb74f | | |
| | 4ce175276c | | |
| | 4bccb43e56 | | |
| | d5dc4c0f06 | | |
| | 55ae6197c1 | | |
| | 15bd821825 | | |
| | cf162c8a10 | | |
| | 2eba7ebd02 | | |
| | 2d5298b6ee | | |
| | 4cb9147bbf | | |
| | 54a1982ef1 | | |
| | 5bfde47a8e | | |
| | 049b0c8f09 | | |
| | 20556e23a9 | | |
| | 01e272c0b0 | | |
| | ad1634a0a5 | | |
| | 5d1c28922a | | |
| | 53c2164b84 | | |
| | 6286ee8192 | | |
| | af8ca2ad5e | | |
| | aac6c62459 | | |
| | 8df2fff75f | | |
| | 0d30b31998 | | |
| | 6a431ff0a0 | | |
| | ab2c5adf5e | | |
| | f02c4cad90 | | |
| | 7b74c3dd91 | | |
| | 13c6dae9a3 | | |
| | 64aeee84a8 | | |
| | 5b45e44ce3 | | |
| | f893589356 | | |
| | df4ad9f851 | | |
7
.agents/skills/README.md
Normal file
@@ -0,0 +1,7 @@
# Agent Skills

This directory contains repo-scoped code agent skills for the LanceDB project.

Each skill is a folder that contains a required `SKILL.md` and optional bundled resources.

Codex discovers skills from `.agents/skills` in the current working directory and parent directories.
98
.agents/skills/lancedb-update-lance-dependency/SKILL.md
Normal file
@@ -0,0 +1,98 @@
---
name: lancedb-update-lance-dependency
description: Update LanceDB to a specific Lance release or tag. Use when bumping Lance dependencies in the lancedb repository, including Rust workspace Lance crates, Java lance-core, validation, branch creation, commit, push, and PR creation when requested.
---

# LanceDB Update Lance Dependency

## Scope

Use this skill in the `lancedb/lancedb` repository when updating the Lance dependency to a specific Lance version or tag.

Inputs can be a version (`7.2.0-beta.1`), a tag (`v7.2.0-beta.1`), a tag ref (`refs/tags/v7.2.0-beta.1`), or `latest`.

## Workflow

1. Confirm the worktree status with `git status --short`.
2. Resolve the target Lance version:

- If the input is `latest`, empty, or omitted, run:

```bash
python3 ci/check_lance_release.py
```

Parse the JSON output. If `needs_update` is not `true`, stop without creating a PR. Otherwise use `latest_tag`.

- If the input is explicit, use it directly.

3. Compute update metadata without changing files:

```bash
python3 ci/update_lance_dependency.py "$TAG_OR_VERSION" --metadata-only
```

Before making changes, check for an existing open PR with the emitted `pr_title`:

```bash
gh pr list --search "\"$PR_TITLE\" in:title" --state open --limit 1 --json number,url,title
```

If a matching open PR exists, stop and report it instead of creating a duplicate.

4. Run the deterministic update entrypoint:

```bash
python3 ci/update_lance_dependency.py "$TAG_OR_VERSION"
```

This updates the Rust workspace Lance dependencies through `ci/set_lance_version.py`, updates `java/pom.xml`, refreshes Cargo metadata, and prints JSON metadata containing `branch_name`, `commit_message`, and `pr_title`.

5. Run validation:

```bash
cargo clippy --quiet --workspace --tests --all-features -- -D warnings
cargo fmt --all --quiet
```

Fix real diagnostics and rerun clippy until it succeeds. Do not skip warnings.

6. Inspect `git status --short` and `git diff` to ensure only the Lance dependency update and required compatibility fixes are present.

7. If the task only asks to prepare local changes, stop here and report the changed files and validation result.

8. If the task asks to publish the update, create a branch using the printed `branch_name`, stage all relevant files, and commit using the printed `commit_message`. Do not amend or rewrite existing commits.

9. Push to `origin`. Before creating the PR, check that the current token has push permission:

```bash
gh api repos/lancedb/lancedb --jq .permissions.push
```

If the remote branch already exists for the same generated branch name, delete the remote ref with `gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/$BRANCH_NAME`, then push. Do not force-push.

10. Create a PR targeting `main` with the printed `pr_title`. If there is no PR template, keep the body to two or three concise sentences: state the Lance dependency bump, note any required compatibility fixes, and link the triggering Lance tag or release.

11. Read back the remote PR title after creation. If it is not a Conventional Commit title, fix it immediately.

12. When running in GitHub Actions after creating the LanceDB PR, trigger the Sophon dependency update:

```bash
gh workflow run codex-bump-lancedb-lance.yml \
--repo lancedb/sophon \
-f lance_ref="$LANCE_TAG" \
-f lancedb_ref="$BRANCH_NAME"
gh run list --repo lancedb/sophon --workflow codex-bump-lancedb-lance.yml --limit 1 --json databaseId,url,displayTitle
```

Use the emitted metadata `tag` value as `LANCE_TAG`. Do this only after a new LanceDB PR has been created. If the update was skipped because no update is needed or an open PR already exists, do not trigger Sophon.

## GitHub Actions

When this skill is used from GitHub Actions, `TAG`, `GH_TOKEN`, and `GITHUB_TOKEN` may already be set. Resolve `latest` first when `TAG` is empty. Once an explicit tag or version is known, use:

```bash
python3 ci/update_lance_dependency.py "$TAG" --github-output "$GITHUB_OUTPUT"
```

Then use the emitted `branch_name`, `commit_message`, and `pr_title` values for branch, commit, and PR creation.
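Review note: the skill accepts a version, a tag, or a tag ref as input. A minimal sketch of how those three spellings could be reduced to one shape is shown here. The function name `normalize_lance_ref` and the returned keys are hypothetical; the diff does not show how `ci/update_lance_dependency.py` actually parses its argument, and the `latest` case (which the skill resolves through `ci/check_lance_release.py`) is left out.

```python
def normalize_lance_ref(ref):
    """Hypothetical helper: reduce a version, tag, or tag ref to one shape."""
    value = ref.strip()
    # "refs/tags/v7.2.0-beta.1" -> "v7.2.0-beta.1"
    if value.startswith("refs/tags/"):
        value = value[len("refs/tags/"):]
    # "v7.2.0-beta.1" -> "7.2.0-beta.1"; a bare version passes through unchanged.
    version = value[1:] if value.startswith("v") else value
    return {"version": version, "tag": "v" + version}


for candidate in ("7.2.0-beta.1", "v7.2.0-beta.1", "refs/tags/v7.2.0-beta.1"):
    print(candidate, "->", normalize_lance_ref(candidate))
```

All three inputs listed in the Scope section map to the same version and tag under this sketch.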
@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.28.0-beta.11"
current_version = "0.30.1-beta.2"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.
16
.github/dependabot.yml
vendored
@@ -11,8 +11,24 @@ updates:
schedule:
interval: weekly
open-pull-requests-limit: 10
# Only update Cargo.lock, never widen/raise the version requirements in
# Cargo.toml. The goal is keeping the lockfile (and the binaries we ship)
# current on security fixes, not forcing our library's consumers onto
# newer minimum versions.
versioning-strategy: lockfile-only
groups:
rust-minor-patch:
update-types:
- minor
- patch

- package-ecosystem: pip
directory: /python
schedule:
interval: weekly
# Only update uv.lock, never widen version requirements in pyproject.toml.
versioning-strategy: lockfile-only
groups:
python-deps:
patterns:
- "*"
@@ -29,7 +29,3 @@ runs:
args: ${{ inputs.args }}
docker-options: "-e PIP_EXTRA_INDEX_URL='https://pypi.fury.io/lance-format/ https://pypi.fury.io/lancedb/'"
working-directory: python
- uses: actions/upload-artifact@v4
with:
name: windows-wheels
path: python\target\wheels
@@ -4,14 +4,16 @@ on:
workflow_call:
inputs:
tag:
description: "Tag name from Lance"
required: true
description: "Tag name from Lance. If omitted, the skill will use the latest Lance release that needs an update."
required: false
default: ""
type: string
workflow_dispatch:
inputs:
tag:
description: "Tag name from Lance"
required: true
description: "Tag name from Lance. Leave empty to use the latest Lance release that needs an update."
required: false
default: ""
type: string

permissions:
@@ -25,7 +27,7 @@ jobs:
steps:
- name: Show inputs
run: |
echo "tag = ${{ inputs.tag }}"
echo "tag = ${{ inputs.tag || 'latest' }}"

- name: Checkout Repo LanceDB
uses: actions/checkout@v4
@@ -71,65 +73,21 @@ jobs:
OPENAI_API_KEY: ${{ secrets.CODEX_TOKEN }}
run: |
set -euo pipefail
VERSION="${TAG#refs/tags/}"
VERSION="${VERSION#v}"
BRANCH_NAME="codex/update-lance-${VERSION//[^a-zA-Z0-9]/-}"

# Use "chore" for beta/rc versions, "feat" for stable releases
if [[ "${VERSION}" == *beta* ]] || [[ "${VERSION}" == *rc* ]]; then
COMMIT_TYPE="chore"
else
COMMIT_TYPE="feat"
fi
TARGET_TAG="${TAG:-latest}"

cat <<EOF >/tmp/codex-prompt.txt
You are running inside the lancedb repository on a GitHub Actions runner. Update the Lance dependency to version ${VERSION} and prepare a pull request for maintainers to review.
You are running inside the lancedb repository on a GitHub Actions runner.

Follow these steps exactly:
1. Use script "ci/set_lance_version.py" to update Lance Rust dependencies. The script already refreshes Cargo metadata, so allow it to finish even if it takes time.
2. Update the Java lance-core dependency version in "java/pom.xml": change the "<lance-core.version>...</lance-core.version>" property to "${VERSION}".
3. Run "cargo clippy --workspace --tests --all-features -- -D warnings". If diagnostics appear, fix them yourself and rerun clippy until it exits cleanly. Do not skip any warnings.
4. After clippy succeeds, run "cargo fmt --all" to format the workspace.
5. Ensure the repository is clean except for intentional changes. Inspect "git status --short" and "git diff" to confirm the dependency update and any required fixes.
6. Create and switch to a new branch named "${BRANCH_NAME}" (replace any duplicated hyphens if necessary).
7. Stage all relevant files with "git add -A". Commit using the message "${COMMIT_TYPE}: update lance dependency to v${VERSION}".
8. Push the branch to origin. If the remote branch already exists, delete it first with "gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/${BRANCH_NAME}" then push with "git push origin ${BRANCH_NAME}". Do NOT use "git push --force" or "git push -f".
9. env "GH_TOKEN" is available, use "gh" tools for github related operations like creating pull request.
10. Create a pull request targeting "main" with title "${COMMIT_TYPE}: update lance dependency to v${VERSION}". First, write the PR body to /tmp/pr-body.md using a heredoc (cat <<'EOF' > /tmp/pr-body.md). The body should summarize the dependency bump, clippy/fmt verification, and link the triggering tag (${TAG}). Then run "gh pr create --body-file /tmp/pr-body.md".
11. After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
Use \$lancedb-update-lance-dependency with target "${TARGET_TAG}".

Constraints:
- Use bash commands; avoid modifying GitHub workflow files other than through the scripted task above.
- Do not merge the PR.
- If any command fails, diagnose and fix the issue instead of aborting.
- Use env "GH_TOKEN" for GitHub operations.
- Do not merge the pull request.
- Do not force-push.
- Do not create a duplicate pull request if an open PR already exists for the target Lance version.
- If any command fails, diagnose and fix the root cause instead of aborting.
- After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
EOF

printenv OPENAI_API_KEY | codex login --with-api-key
codex --config shell_environment_policy.ignore_default_excludes=true exec --dangerously-bypass-approvals-and-sandbox "$(cat /tmp/codex-prompt.txt)"

- name: Trigger sophon dependency update
env:
TAG: ${{ inputs.tag }}
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
VERSION="${TAG#refs/tags/}"
VERSION="${VERSION#v}"
LANCEDB_BRANCH="codex/update-lance-${VERSION//[^a-zA-Z0-9]/-}"

echo "Triggering sophon workflow with:"
echo " lance_ref: ${TAG#refs/tags/}"
echo " lancedb_ref: ${LANCEDB_BRANCH}"

gh workflow run codex-bump-lancedb-lance.yml \
--repo lancedb/sophon \
-f lance_ref="${TAG#refs/tags/}" \
-f lancedb_ref="${LANCEDB_BRANCH}"

- name: Show latest sophon workflow run
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
echo "Latest sophon workflow run:"
gh run list --repo lancedb/sophon --workflow codex-bump-lancedb-lance.yml --limit 1 --json databaseId,url,displayTitle
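Review note: the shell in this hunk derives a branch name and a commit type from the tag. The sketch below restates that derivation in Python so the rules are easy to check by hand. The function name `derive_update_metadata` is hypothetical, and whether `ci/update_lance_dependency.py` emits exactly these values is an assumption; only the shell expressions in the hunk are taken from the diff.

```python
import re


def derive_update_metadata(tag):
    """Mirror the workflow's shell logic; the names here are illustrative."""
    # Same effect as VERSION="${TAG#refs/tags/}" followed by VERSION="${VERSION#v}".
    version = tag[len("refs/tags/"):] if tag.startswith("refs/tags/") else tag
    if version.startswith("v"):
        version = version[1:]
    # Same effect as ${VERSION//[^a-zA-Z0-9]/-}: every non-alphanumeric becomes "-".
    branch_name = "codex/update-lance-" + re.sub(r"[^a-zA-Z0-9]", "-", version)
    # "chore" for beta/rc versions, "feat" for stable releases.
    commit_type = "chore" if ("beta" in version or "rc" in version) else "feat"
    title = f"{commit_type}: update lance dependency to v{version}"
    return {"branch_name": branch_name, "commit_message": title, "pr_title": title}


print(derive_update_metadata("refs/tags/v7.2.0-beta.1"))
```

Keeping this derivation in one place means the PR title search, the branch deletion, and the Sophon trigger all agree on the same branch name.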
62
.github/workflows/lance-release-timer.yml
vendored
@@ -1,62 +0,0 @@
name: Lance Release Timer

on:
schedule:
- cron: "*/10 * * * *"
workflow_dispatch:

permissions:
contents: read
actions: write

concurrency:
group: lance-release-timer
cancel-in-progress: false

jobs:
trigger-update:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Check for new Lance tag
id: check
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
python3 ci/check_lance_release.py --github-output "$GITHUB_OUTPUT"

- name: Look for existing PR
if: steps.check.outputs.needs_update == 'true'
id: pr
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
TITLE="chore: update lance dependency to v${{ steps.check.outputs.latest_version }}"
COUNT=$(gh pr list --search "\"$TITLE\" in:title" --state open --limit 1 --json number --jq 'length')
if [ "$COUNT" -gt 0 ]; then
echo "Open PR already exists for $TITLE"
echo "pr_exists=true" >> "$GITHUB_OUTPUT"
else
echo "No existing PR for $TITLE"
echo "pr_exists=false" >> "$GITHUB_OUTPUT"
fi

- name: Trigger codex update workflow
if: steps.check.outputs.needs_update == 'true' && steps.pr.outputs.pr_exists != 'true'
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
TAG=${{ steps.check.outputs.latest_tag }}
gh workflow run codex-update-lance-dependency.yml -f tag=refs/tags/$TAG

- name: Show latest codex workflow run
if: steps.check.outputs.needs_update == 'true' && steps.pr.outputs.pr_exists != 'true'
env:
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
gh run list --workflow codex-update-lance-dependency.yml --limit 1 --json databaseId,url,displayTitle
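Review note: the removed timer workflow gated the update on two conditions, an update being needed and no open PR carrying the expected title. A minimal sketch of that gate follows. The function name `should_trigger_update` and the exact JSON shape are assumptions; the field names `needs_update` and `latest_version` come from the workflow outputs shown in the hunk.

```python
import json


def should_trigger_update(check_json, open_pr_titles):
    """Hypothetical gate: trigger only when an update is needed and no PR exists."""
    check = json.loads(check_json)
    if check.get("needs_update") is not True:
        return False
    # Title format copied from the removed "Look for existing PR" step.
    title = "chore: update lance dependency to v" + check["latest_version"]
    return not any(title in existing for existing in open_pr_titles)


sample = json.dumps({"needs_update": True, "latest_version": "8.0.0-beta.6"})
print(should_trigger_update(sample, []))
```

Note that the removed step hard-codes the `chore:` prefix, so under this sketch an open PR titled with `feat:` for a stable release would not be detected as a duplicate.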
5
.github/workflows/nodejs.yml
vendored
@@ -157,7 +157,10 @@ jobs:
npx jest --testEnvironment jest-environment-node-single-context --verbose
macos:
timeout-minutes: 30
runs-on: "macos-14"
# macos-15 ships a newer linker; the older macos-14 linker fails to insert
# branch islands when the debug cdylib's __text section exceeds the 128 MB
# AArch64 B/BL branch range.
runs-on: "macos-15"
defaults:
run:
shell: bash
110
.github/workflows/pypi-publish.yml
vendored
@@ -8,6 +8,9 @@ on:
# This should trigger a dry run (we skip the final publish step)
paths:
- .github/workflows/pypi-publish.yml
- .github/workflows/build_linux_wheel/action.yml
- .github/workflows/build_mac_wheel/action.yml
- .github/workflows/build_windows_wheel/action.yml
- Cargo.toml # Change in dependency frequently breaks builds
- Cargo.lock
@@ -21,32 +24,21 @@ jobs:
linux:
name: Python ${{ matrix.config.platform }} manylinux${{ matrix.config.manylinux }}
timeout-minutes: 60
permissions:
id-token: write
contents: read
strategy:
matrix:
config:
- platform: x86_64
manylinux: "2_17"
extra_args: ""
runner: ubuntu-22.04
- platform: x86_64
manylinux: "2_28"
extra_args: "--features fp16kernels"
runner: ubuntu-22.04
- platform: aarch64
manylinux: "2_17"
extra_args: ""
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
runner: ubuntu-2404-8x-arm64
# For successful fat LTO builds, we need a large runner to avoid OOM errors.
- platform: aarch64
manylinux: "2_28"
extra_args: "--features fp16kernels"
runner: ubuntu-2404-8x-arm64
runs-on: ${{ matrix.config.runner }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -60,15 +52,14 @@ jobs:
args: "--release --strip ${{ matrix.config.extra_args }}"
arm-build: ${{ matrix.config.platform == 'aarch64' }}
manylinux: ${{ matrix.config.manylinux }}
- uses: ./.github/workflows/upload_wheel
- uses: actions/upload-artifact@v7
if: startsWith(github.ref, 'refs/tags/python-v')
with:
fury_token: ${{ secrets.FURY_TOKEN }}
name: wheels-linux-${{ matrix.config.platform }}-${{ matrix.config.manylinux }}
path: target/wheels/lancedb-*.whl
if-no-files-found: error
mac:
timeout-minutes: 90
permissions:
id-token: write
contents: read
runs-on: ${{ matrix.config.runner }}
strategy:
matrix:
@@ -78,7 +69,7 @@ jobs:
env:
MACOSX_DEPLOYMENT_TARGET: 10.15
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -90,18 +81,21 @@ jobs:
with:
python-minor-version: 10
args: "--release --strip --target ${{ matrix.config.target }} --features fp16kernels"
- uses: ./.github/workflows/upload_wheel
- uses: actions/upload-artifact@v7
if: startsWith(github.ref, 'refs/tags/python-v')
with:
fury_token: ${{ secrets.FURY_TOKEN }}
name: wheels-mac-${{ matrix.config.target }}
path: target/wheels/lancedb-*.whl
if-no-files-found: error
windows:
timeout-minutes: 60
permissions:
id-token: write
contents: read
timeout-minutes: 90
runs-on: windows-latest
env:
# link.exe is single-threaded and the long pole on Windows builds. Use
# rustc's bundled lld-link instead.
CARGO_TARGET_X86_64_PC_WINDOWS_MSVC_LINKER: rust-lld
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -113,18 +107,70 @@ jobs:
with:
python-minor-version: 10
args: "--release --strip"
vcpkg_token: ${{ secrets.VCPKG_GITHUB_PACKAGES }}
- uses: ./.github/workflows/upload_wheel
- uses: actions/upload-artifact@v7
if: startsWith(github.ref, 'refs/tags/python-v')
with:
fury_token: ${{ secrets.FURY_TOKEN }}
name: wheels-windows
path: target/wheels/lancedb-*.whl
if-no-files-found: error
publish:
name: Publish wheels
if: startsWith(github.ref, 'refs/tags/python-v')
needs: [linux, mac, windows]
runs-on: ubuntu-latest
permissions:
id-token: write
contents: read
steps:
- uses: actions/checkout@v6
- name: Download wheel artifacts
uses: actions/download-artifact@v8
with:
pattern: wheels-*
path: target/wheels
merge-multiple: true
- name: List wheels
run: ls -la target/wheels
- name: Choose repo
id: choose_repo
run: |
if [[ ${{ github.ref }} == *beta* ]]; then
echo "repo=fury" >> $GITHUB_OUTPUT
else
echo "repo=pypi" >> $GITHUB_OUTPUT
fi
- name: Publish to Fury
if: steps.choose_repo.outputs.repo == 'fury'
env:
FURY_TOKEN: ${{ secrets.FURY_TOKEN }}
run: |
shopt -s nullglob
WHEELS=(target/wheels/lancedb-*.whl)
if [[ ${#WHEELS[@]} -eq 0 ]]; then
echo "No wheels found in target/wheels/" >&2
exit 1
fi
for WHEEL in "${WHEELS[@]}"; do
echo "Uploading $WHEEL to Fury"
curl -f -F package=@"$WHEEL" "https://$FURY_TOKEN@push.fury.io/lancedb/"
done
# NOTE: pypa/gh-action-pypi-publish must be invoked directly from a
# workflow file, not from inside a composite action. When called from a
# composite, `github.action_repository` is empty (actions/runner#2473)
# and the action falls back to `github.repository`, producing a bogus
# `docker://ghcr.io/<repo>:<ref>` image reference that GHA tries to pull.
- name: Publish to PyPI
if: steps.choose_repo.outputs.repo == 'pypi'
uses: pypa/gh-action-pypi-publish@release/v1
with:
packages-dir: target/wheels/
gh-release:
if: startsWith(github.ref, 'refs/tags/python-v')
runs-on: ubuntu-latest
permissions:
contents: write
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
with:
fetch-depth: 0
lfs: true
@@ -187,13 +233,13 @@ jobs:
report-failure:
name: Report Workflow Failure
runs-on: ubuntu-latest
needs: [linux, mac, windows]
needs: [linux, mac, windows, publish]
permissions:
contents: read
issues: write
if: always() && failure() && startsWith(github.ref, 'refs/tags/python-v')
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- uses: ./.github/actions/create-failure-issue
with:
job-results: ${{ toJSON(needs) }}
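Review note: the new `publish` job routes wheels by tag name and uploads every matching wheel rather than only the first. The sketch below restates those two rules in Python. The function names `choose_repo` and `select_wheels` are hypothetical, and the sample file names in the usage lines are made up for illustration; the `*beta*` test and the `lancedb-*.whl` pattern come from the workflow.

```python
import fnmatch


def choose_repo(ref):
    """Refs containing "beta" go to Fury; everything else goes to PyPI."""
    return "fury" if "beta" in ref else "pypi"


def select_wheels(filenames):
    """Return every lancedb wheel, failing loudly when none were downloaded."""
    wheels = sorted(
        name for name in filenames if fnmatch.fnmatchcase(name, "lancedb-*.whl")
    )
    if not wheels:
        raise FileNotFoundError("No wheels found in target/wheels/")
    return wheels


print(choose_repo("refs/tags/python-v0.30.1-beta.2"))
print(select_wheels(["lancedb-0.30.1-cp39-abi3-win_amd64.whl", "notes.txt"]))
```

The removed composite action uploaded only the first wheel returned by `ls ... | head -n 1`; collecting all matches is what makes a single publish job workable after the per-platform artifacts are merged.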
2
.github/workflows/python.yml
vendored
@@ -205,7 +205,7 @@ jobs:
- name: Delete wheels
run: rm -rf target/wheels
pydantic1x:
timeout-minutes: 30
timeout-minutes: 60
runs-on: "ubuntu-24.04"
defaults:
run:
20
.github/workflows/rust.yml
vendored
@@ -233,6 +233,26 @@ jobs:
cargo update -p aws-sdk-sso --precise 1.62.0
cargo update -p aws-sdk-ssooidc --precise 1.63.0
cargo update -p aws-sdk-sts --precise 1.63.0
# aws-runtime/sigv4/credential-types/types and the aws-smithy-*
# crates bumped their MSRV to 1.91.1 in late 2026; pin to the last
# 1.91.0-compatible versions. The order matters — each downgrade
# only succeeds once everything that still pins it at a higher
# version has itself been downgraded.
cargo update -p aws-runtime --precise 1.5.12
cargo update -p aws-types --precise 1.3.9
cargo update -p aws-sigv4 --precise 1.3.5
cargo update -p aws-credential-types --precise 1.2.8
cargo update -p aws-smithy-checksums --precise 0.63.9
cargo update -p aws-smithy-runtime --precise 1.9.3
cargo update -p aws-smithy-http --precise 0.62.4
cargo update -p aws-smithy-eventstream --precise 0.60.12
cargo update -p aws-smithy-http-client --precise 1.1.3
cargo update -p aws-smithy-observability --precise 0.1.4
cargo update -p aws-smithy-query --precise 0.60.8
cargo update -p aws-smithy-runtime-api --precise 1.9.1
cargo update -p aws-smithy-async --precise 1.2.6
cargo update -p aws-smithy-types --precise 1.3.5
cargo update -p aws-smithy-xml --precise 0.60.11
cargo update -p home --precise 0.5.9
- name: cargo +${{ matrix.msrv }} check
env:
34
.github/workflows/upload_wheel/action.yml
vendored
@@ -1,34 +0,0 @@
name: upload-wheel

description: "Upload wheels to Pypi"
inputs:
fury_token:
required: true
description: "release token for the fury repo"

runs:
using: "composite"
steps:
- name: Choose repo
shell: bash
id: choose_repo
run: |
if [[ ${{ github.ref }} == *beta* ]]; then
echo "repo=fury" >> $GITHUB_OUTPUT
else
echo "repo=pypi" >> $GITHUB_OUTPUT
fi
- name: Publish to Fury
if: steps.choose_repo.outputs.repo == 'fury'
shell: bash
env:
FURY_TOKEN: ${{ inputs.fury_token }}
run: |
WHEEL=$(ls target/wheels/lancedb-*.whl 2> /dev/null | head -n 1)
echo "Uploading $WHEEL to Fury"
curl -f -F package=@$WHEEL https://$FURY_TOKEN@push.fury.io/lancedb/
- name: Publish to PyPI
if: steps.choose_repo.outputs.repo == 'pypi'
uses: pypa/gh-action-pypi-publish@release/v1
with:
packages-dir: target/wheels/
28
AGENTS.md
@@ -17,9 +17,33 @@ Common commands:
* Run tests: `cargo test --quiet --features remote --tests`
* Run specific test: `cargo test --quiet --features remote -p <package_name> --test <test_name>`
* Lint: `cargo clippy --quiet --features remote --tests --examples`
* Format: `cargo fmt --all`
* Format Rust: `cargo fmt --all`
* Format Python: `ruff format .`
* Lint Python: `ruff check .`
* Bootstrap Python dev env: `cd python && uv run --extra tests --extra dev maturin develop --extras tests,dev`
* Run Python tests: `cd python && uv run --extra tests pytest python/tests -vv --durations=10 -m "not slow and not s3_test"`
* Run specific Python test: `cd python && uv run --extra tests pytest python/tests/<test_file>.py::<test_name> -q`

Before committing changes, run formatting.
For Python validation, prefer the uv-managed environment declared by `python/uv.lock`.
Do not treat system `python`, global `pytest`, or missing editable-install errors as
final blockers; bootstrap or enter the uv environment instead. If `lancedb._lancedb`
is missing or stale, or if Rust/PyO3 binding code changed, rebuild the Python
extension with the bootstrap command above before running tests.

Before committing changes, run formatting for every language you touched. At minimum:

* Rust changes: run `cargo fmt --all`.
* Python changes: run `ruff format .` and `ruff check .` from the repository root,
and run targeted tests through `cd python && uv run ...`.
* TypeScript changes: run the relevant `npm`/`pnpm` lint, format, build, and docs commands in `nodejs`.

Before creating a PR, the exact value passed to `gh pr create --title` must follow
Conventional Commits, such as `fix: support nested field paths in native index creation`
or `feat(python): add dataset multiprocessing support`. Do not use a plain natural
language summary like `Support nested field paths in native index creation` as the PR
title. The semantic-release check uses the PR title and body as the merge commit message,
so a non-conventional PR title will fail CI. After creating a PR, read the remote PR title
back and fix it immediately if it is not conventional.

## Coding tips
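Review note: the PR title rule in this hunk can be checked mechanically before calling `gh pr create`. The sketch below is one way to do that. The function name `is_conventional_title` is hypothetical, and the list of accepted types is an assumption; the repository's own semantic-release check may accept a different set.

```python
import re

# Assumed type list; the project's CI check may differ.
TYPES = ("feat", "fix", "chore", "docs", "refactor", "perf",
         "test", "build", "ci", "style", "revert")

# type, optional "(scope)", optional "!", then ": " and a non-empty summary.
TITLE_RE = re.compile(r"^(?:%s)(?:\([^()\s]+\))?!?: \S.*$" % "|".join(TYPES))


def is_conventional_title(title):
    """Return True when the title looks like a Conventional Commit header."""
    return TITLE_RE.match(title) is not None


for title in (
    "fix: support nested field paths in native index creation",
    "feat(python): add dataset multiprocessing support",
    "Support nested field paths in native index creation",
):
    print(is_conventional_title(title), title)
```

Running the same check on the title read back from the remote PR covers the "read the remote PR title back" step as well.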
2303
Cargo.lock
generated
File diff suppressed because it is too large
28
Cargo.toml
@@ -13,20 +13,20 @@ categories = ["database-implementations"]
rust-version = "1.91.0"

[workspace.dependencies]
lance = { "version" = "=7.0.0-beta.7", default-features = false, "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=7.0.0-beta.7", default-features = false, "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=7.0.0-beta.7", default-features = false, "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=7.0.0-beta.7", "tag" = "v7.0.0-beta.7", "git" = "https://github.com/lance-format/lance.git" }
lance = { "version" = "=8.0.0-beta.6", default-features = false, "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=8.0.0-beta.6", default-features = false, "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=8.0.0-beta.6", default-features = false, "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=8.0.0-beta.6", "tag" = "v8.0.0-beta.6", "git" = "https://github.com/lance-format/lance.git" }
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "58.0.0", optional = false }
26
REVIEW.md
Normal file
@@ -0,0 +1,26 @@
# Code review guidelines

Repo-specific guidance for automated PR reviews.

## Cross-SDK parity

LanceDB exposes the same core (`rust/lancedb`) through Python, TypeScript (`nodejs`),
and Java bindings. Behavioral drift between SDKs is a recurring problem, so watch for
parity gaps when reviewing — but only flag real ones:

* If the change adds or modifies user-facing API or behavior in the shared core
  (`rust/lancedb`), check whether each binding that should expose it (`python`,
  `nodejs`) does. A core change with no corresponding binding update is worth a note.
* If the change adds or modifies a public API in one SDK but not the other, open the
  sibling SDK's corresponding module and state whether an equivalent exists. If not,
  note it as a possible parity gap and suggest a follow-up issue.
* For bug fixes, first read the sibling SDK's analogous code path to check whether the
  same bug exists there. Only raise parity if it actually does. Do not ask to "port" a
  fix for a bug that only ever existed in one binding.
* Stay silent on internal-only refactors, tests, docs, and changes with no cross-SDK
  surface.
* Parity expectations apply to the Python and TypeScript (`nodejs`) SDKs. Java currently
  implements only the remote table, not the local/embedded backend, so it is expected to
  be partial — do not flag Java for missing local-only functionality.
* Keep parity feedback to a short, clearly-labeled note (e.g. "Possible SDK parity
  gap: …"). It is advisory, not a merge blocker.
@@ -112,25 +112,25 @@ def fetch_remote_tags() -> List[TagInfo]:
|
||||
"api",
|
||||
"-X",
|
||||
"GET",
|
||||
f"repos/{LANCE_REPO}/git/refs/tags",
|
||||
"--paginate",
|
||||
f"repos/{LANCE_REPO}/releases",
|
||||
"--jq",
|
||||
".[].ref",
|
||||
".[].tag_name",
|
||||
"-F",
|
||||
"per_page=20",
|
||||
]
|
||||
)
|
||||
tags: List[TagInfo] = []
|
||||
for line in output.splitlines():
|
||||
ref = line.strip()
|
||||
if not ref.startswith("refs/tags/v"):
|
||||
tag = line.strip()
|
||||
if not tag.startswith("v"):
|
||||
continue
|
||||
tag = ref.split("refs/tags/")[-1]
|
||||
version = tag.lstrip("v")
|
||||
try:
|
||||
tags.append(TagInfo(tag=tag, version=version, semver=parse_semver(version)))
|
||||
except ValueError:
|
||||
continue
|
||||
if not tags:
|
||||
raise RuntimeError("No Lance tags could be parsed from GitHub API output")
|
||||
raise RuntimeError("No Lance releases could be parsed from GitHub API output")
|
||||
return tags
|
||||
|
||||
|
||||
|
||||
126
ci/update_lance_dependency.py
Normal file
@@ -0,0 +1,126 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Prepare a Lance dependency update for LanceDB."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import subprocess
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Sequence
|
||||
|
||||
try:
|
||||
from check_lance_release import parse_semver
|
||||
except ModuleNotFoundError:
|
||||
# Supports importing as ci.update_lance_dependency from tests or ad hoc checks.
|
||||
from ci.check_lance_release import parse_semver # type: ignore
|
||||
|
||||
|
||||
def normalize_version(raw: str) -> str:
|
||||
value = raw.strip()
|
||||
value = value.removeprefix("refs/tags/")
|
||||
value = value.removeprefix("v")
|
||||
try:
|
||||
parse_semver(value)
|
||||
except ValueError:
|
||||
raise ValueError(f"Unsupported Lance version or tag: {raw}")
|
||||
return value
|
||||
|
||||
|
||||
def normalized_tag(version: str) -> str:
|
||||
return f"v{version}"
|
||||
|
||||
|
||||
def branch_name(version: str) -> str:
|
||||
suffix = re.sub(r"[^a-zA-Z0-9]+", "-", version).strip("-")
|
||||
suffix = re.sub(r"-+", "-", suffix)
|
||||
return f"codex/update-lance-{suffix}"
|
||||
|
||||
|
||||
def commit_type(version: str) -> str:
|
||||
prerelease = version.split("-", maxsplit=1)[1] if "-" in version else ""
|
||||
return "chore" if "beta" in prerelease or "rc" in prerelease else "feat"
|
||||
|
||||
|
||||
def metadata_for(version: str) -> dict[str, str]:
|
||||
kind = commit_type(version)
|
||||
message = f"{kind}: update lance dependency to v{version}"
|
||||
return {
|
||||
"version": version,
|
||||
"tag": normalized_tag(version),
|
||||
"branch_name": branch_name(version),
|
||||
"commit_type": kind,
|
||||
"commit_message": message,
|
||||
"pr_title": message,
|
||||
}
|
||||
|
||||
|
||||
def run_command(cmd: Sequence[str], *, cwd: Path) -> None:
|
||||
subprocess.run(cmd, cwd=cwd, check=True)
|
||||
|
||||
|
||||
def update_java_lance_core_version(repo_root: Path, version: str) -> None:
|
||||
pom_path = repo_root / "java" / "pom.xml"
|
||||
contents = pom_path.read_text(encoding="utf-8")
|
||||
updated, count = re.subn(
|
||||
r"(<lance-core\.version>)[^<]+(</lance-core\.version>)",
|
||||
rf"\g<1>{version}\g<2>",
|
||||
contents,
|
||||
count=1,
|
||||
)
|
||||
if count != 1:
|
||||
raise RuntimeError(
|
||||
"Expected exactly one <lance-core.version> entry in java/pom.xml"
|
||||
)
|
||||
pom_path.write_text(updated, encoding="utf-8")
|
||||
|
||||
|
||||
def write_github_outputs(path: str | None, payload: dict[str, str]) -> None:
|
||||
if not path:
|
||||
return
|
||||
with open(path, "a", encoding="utf-8") as output:
|
||||
for key, value in payload.items():
|
||||
output.write(f"{key}={value}\n")
|
||||
|
||||
|
||||
def main(argv: Sequence[str] | None = None) -> int:
|
||||
parser = argparse.ArgumentParser(description=__doc__)
|
||||
parser.add_argument(
|
||||
"tag_or_version",
|
||||
help="Lance tag or version, for example refs/tags/v7.2.0-beta.1 or 7.2.0",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--repo-root",
|
||||
type=Path,
|
||||
default=Path(__file__).resolve().parents[1],
|
||||
help="Path to the lancedb repository root",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--github-output",
|
||||
default=None,
|
||||
help="Optional GitHub Actions output file to receive metadata fields",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--metadata-only",
|
||||
action="store_true",
|
||||
help="Only print derived metadata; do not modify dependency files",
|
||||
)
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
repo_root = args.repo_root.resolve()
|
||||
version = normalize_version(args.tag_or_version)
|
||||
payload = metadata_for(version)
|
||||
|
||||
if not args.metadata_only:
|
||||
run_command([sys.executable, "ci/set_lance_version.py", version], cwd=repo_root)
|
||||
update_java_lance_core_version(repo_root, version)
|
||||
|
||||
write_github_outputs(args.github_output, payload)
|
||||
print(json.dumps(payload, sort_keys=True))
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
@@ -147,6 +147,14 @@ allow = [
|
||||
"CDLA-Permissive-2.0",
|
||||
]
|
||||
confidence-threshold = 0.8
|
||||
# Per-crate license exceptions: allow a license for a specific crate only,
|
||||
# rather than globally via the `allow` list above.
|
||||
exceptions = [
|
||||
# CDDL-1.0 (copyleft) is pulled in only as a dev/profiling dependency via
|
||||
# `inferno` -> `pprof` -> `lance-testing`; it is a test dependency that we
|
||||
# do not distribute, so scope the allowance to `inferno` alone.
|
||||
{ allow = ["CDDL-1.0"], crate = "inferno" },
|
||||
]
|
||||
# Crates whose license cannot be determined from Cargo metadata but whose
|
||||
# license we've manually confirmed from upstream. Keep this list minimal.
|
||||
[[licenses.clarify]]
|
||||
|
||||
@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
|
||||
<dependency>
|
||||
<groupId>com.lancedb</groupId>
|
||||
<artifactId>lancedb-core</artifactId>
|
||||
<version>0.28.0-beta.11</version>
|
||||
<version>0.30.1-beta.2</version>
|
||||
</dependency>
|
||||
```
|
||||
|
||||
|
||||
@@ -437,6 +437,39 @@ Open a table in the database.
|
||||
|
||||
***
|
||||
|
||||
### renameTable()
|
||||
|
||||
```ts
|
||||
abstract renameTable(
|
||||
currentName,
|
||||
newName,
|
||||
options?): Promise<void>
|
||||
```
|
||||
|
||||
Rename a table.
|
||||
|
||||
Currently only supported by LanceDB Cloud. Local OSS connections and
|
||||
namespace-backed connections (via [connectNamespace](../functions/connectNamespace.md)) reject with
|
||||
a "not supported" error.
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **currentName**: `string`
|
||||
The current name of the table.
|
||||
|
||||
* **newName**: `string`
|
||||
The new name for the table.
|
||||
|
||||
* **options?**: [`RenameTableOptions`](../interfaces/RenameTableOptions.md)
|
||||
Optional namespace paths. When
|
||||
`newNamespacePath` is omitted the table stays in `namespacePath`.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<`void`>
|
||||
|
||||
***
|
||||
|
||||
### tableNames()
|
||||
|
||||
#### tableNames(options)
|
||||
|
||||
@@ -76,6 +76,57 @@ the query optimizer chooses a suboptimal path.
|
||||
|
||||
***
|
||||
|
||||
### useLsmWrite()
|
||||
|
||||
```ts
|
||||
useLsmWrite(useLsmWrite): MergeInsertBuilder
|
||||
```
|
||||
|
||||
Controls whether the merge uses the MemWAL LSM write path.
|
||||
|
||||
By default (unset), a `mergeInsert` on a table with an LSM write spec is
|
||||
routed through Lance's MemWAL shard writer, and a table without one uses
|
||||
the standard path. Pass `false` to force the standard path even when a
|
||||
spec is set. Pass `true` to require a spec — `mergeInsert` rejects if none
|
||||
is installed.
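
The three-way behavior described above can be summarized as a small decision function. This is only an illustrative sketch: `chooseWritePath`, `WritePath`, and the `hasLsmWriteSpec` flag are hypothetical names used for explanation and are not part of the LanceDB API.

```ts
type WritePath = "lsm" | "standard";

// Hypothetical model of the routing rule described above; not LanceDB code.
function chooseWritePath(
  useLsmWrite: boolean | undefined,
  hasLsmWriteSpec: boolean,
): WritePath {
  if (useLsmWrite === undefined) {
    // Default: follow whatever the table is configured for.
    return hasLsmWriteSpec ? "lsm" : "standard";
  }
  if (useLsmWrite === false) {
    // Explicit opt-out: standard path even when a spec is set.
    return "standard";
  }
  // Explicit opt-in: a spec is required.
  if (!hasLsmWriteSpec) {
    throw new Error("useLsmWrite(true) requires an LSM write spec");
  }
  return "lsm";
}
```

In this model only the explicit `true` case can fail, which matches the description that `mergeInsert` rejects when a spec is required but none is installed.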
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **useLsmWrite**: `boolean`
|
||||
Whether to use the LSM write path.
|
||||
|
||||
#### Returns
|
||||
|
||||
[`MergeInsertBuilder`](MergeInsertBuilder.md)
|
||||
|
||||
***
|
||||
|
||||
### validateSingleShard()
|
||||
|
||||
```ts
|
||||
validateSingleShard(validateSingleShard): MergeInsertBuilder
|
||||
```
|
||||
|
||||
Controls how an LSM merge checks that its input targets a single shard.
|
||||
|
||||
When a table has an LSM write spec, every row in a `mergeInsert` call must
|
||||
route to the same shard. When `true` (the default), every row is inspected
|
||||
to verify this. When `false`, only the first row is inspected and the
|
||||
shard it routes to is used for the whole input — a faster path for callers
|
||||
that have already pre-sharded their input. Has no effect on tables without
|
||||
an LSM write spec.
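
The difference between the two modes can be sketched as follows. Everything here is an assumption made for illustration: `resolveShard` and the `shardOf` routing callback are hypothetical, and returning `undefined` for empty input is a choice of this sketch rather than documented behavior.

```ts
// Hypothetical model of the single-shard check. `shardOf` stands in for
// whatever routing the table's LSM write spec defines.
function resolveShard<Row>(
  rows: readonly Row[],
  shardOf: (row: Row) => number,
  validateSingleShard: boolean = true,
): number | undefined {
  let first: number | undefined;
  for (const row of rows) {
    const shard = shardOf(row);
    if (first === undefined) {
      first = shard;
      // Fast path: trust the caller's pre-sharding after the first row.
      if (!validateSingleShard) break;
    } else if (shard !== first) {
      throw new Error("input rows route to more than one shard");
    }
  }
  return first;
}
```

With validation disabled the cost is one routing call regardless of input size, but rows that belong to another shard go unnoticed, which is why the fast path suits only input that is already pre-sharded.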
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **validateSingleShard**: `boolean`
|
||||
Whether to check every row routes to one shard. Defaults to `true`.
|
||||
|
||||
#### Returns
|
||||
|
||||
[`MergeInsertBuilder`](MergeInsertBuilder.md)
|
||||
|
||||
***
|
||||
|
||||
### whenMatchedUpdateAll()
|
||||
|
||||
```ts
|
||||
|
||||
@@ -343,6 +343,30 @@ This is useful for pagination.
|
||||
|
||||
***
|
||||
|
||||
### orderBy()
|
||||
|
||||
```ts
|
||||
orderBy(ordering): this
|
||||
```
|
||||
|
||||
Sort the results by the specified column(s).
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **ordering**: [`ColumnOrdering`](../interfaces/ColumnOrdering.md) \| [`ColumnOrdering`](../interfaces/ColumnOrdering.md)[]
|
||||
|
||||
#### Returns
|
||||
|
||||
`this`
|
||||
|
||||
This query builder.
|
||||
|
||||
#### Inherited from
|
||||
|
||||
`StandardQueryBase.orderBy`
|
||||
|
||||
***
|
||||
|
||||
### outputSchema()
|
||||
|
||||
```ts
|
||||
|
||||
173
docs/src/js/classes/Scannable.md
Normal file
@@ -0,0 +1,173 @@
|
||||
[**@lancedb/lancedb**](../README.md) • **Docs**
|
||||
|
||||
***
|
||||
|
||||
[@lancedb/lancedb](../globals.md) / Scannable
|
||||
|
||||
# Class: Scannable
|
||||
|
||||
A data source that can be scanned as a stream of Arrow `RecordBatch`es.
|
||||
|
||||
`Scannable` wraps the schema + optional row count + rescannable flag and
|
||||
a callback that yields batches one at a time. It is passed to consumers
|
||||
(e.g. `Table.add`, `createTable`, `mergeInsert` — follow-up work) that
|
||||
need to pull data without materializing the full dataset in JS memory.
|
||||
|
||||
Batches cross the JS↔Rust boundary as Arrow IPC Stream messages; a fresh
|
||||
writer serializes each batch, and the Rust side decodes it with
|
||||
`arrow_ipc::reader::StreamReader`. One batch is in flight at a time.
|
||||
|
||||
## Properties
|
||||
|
||||
### numRows
|
||||
|
||||
```ts
|
||||
readonly numRows: null | number;
|
||||
```
|
||||
|
||||
***
|
||||
|
||||
### rescannable
|
||||
|
||||
```ts
|
||||
readonly rescannable: boolean;
|
||||
```
|
||||
|
||||
***
|
||||
|
||||
### schema
|
||||
|
||||
```ts
|
||||
readonly schema: Schema<any>;
|
||||
```
|
||||
|
||||
## Methods
|
||||
|
||||
### fromFactory()
|
||||
|
||||
```ts
|
||||
static fromFactory(
|
||||
schema,
|
||||
factory,
|
||||
opts): Promise<Scannable>
|
||||
```
|
||||
|
||||
Build a Scannable from an explicit schema and a factory that returns a
|
||||
fresh batch iterator on each call.
|
||||
|
||||
The factory is invoked once per scan. Each iterator yields
|
||||
`RecordBatch`es matching the declared schema. Use this when you need
|
||||
direct control over the pull loop — for example, to wrap a streaming
|
||||
source whose batches are produced lazily.
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **schema**: `Schema`<`any`>
|
||||
The Arrow schema of the produced batches.
|
||||
|
||||
* **factory**
|
||||
Called at the start of each scan to produce a batch
|
||||
iterator. Must be idempotent when `rescannable` is true.
|
||||
|
||||
* **opts**: [`ScannableOptions`](../interfaces/ScannableOptions.md) = `{}`
|
||||
Optional hints. `rescannable` defaults to `true`; set to
|
||||
`false` if calling `factory()` twice would not reproduce the same data.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<[`Scannable`](Scannable.md)>
|
||||
|
||||
***
|
||||
|
||||
### fromIterable()
|
||||
|
||||
```ts
|
||||
static fromIterable(
|
||||
schema,
|
||||
iter,
|
||||
opts): Promise<Scannable>
|
||||
```
|
||||
|
||||
Build a Scannable from an iterable of `RecordBatch`es. `rescannable`
|
||||
defaults to `false`. Pass an explicit schema so the consumer can
|
||||
validate before any batch is pulled.
|
||||
|
||||
`opts.rescannable: true` is honest for replayable iterables (Arrays,
|
||||
Sets, or custom iterables whose `[Symbol.iterator]()` returns a fresh
|
||||
iterator each call). It is rejected for one-shot iterables (generators,
|
||||
async generators, or already-an-iterator inputs) because their
|
||||
`[Symbol.iterator]()` returns the same exhausted object on the second
|
||||
scan. For replayable sources outside this shape, use
|
||||
`fromFactory(schema, () => createIter(), { rescannable: true })`.
|
||||
|
||||
Note: when `opts.rescannable` is `true`, the constructor calls
|
||||
`[Symbol.iterator]()` once on the input to perform the structural check.
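
The distinction between replayable and one-shot iterables can be seen with plain JavaScript iteration, independent of LanceDB. The sketch below assumes that a one-shot input can be recognized by `[Symbol.iterator]()` returning the input object itself; `looksReplayable` and `countTo` are hypothetical helpers, and the actual structural check may differ.

```ts
// Assumption: an input whose iterator is the input itself is one-shot.
function looksReplayable<T>(input: Iterable<T>): boolean {
  const iterator: unknown = input[Symbol.iterator]();
  return iterator !== input;
}

// A generator object is its own iterator, so it can be consumed only once.
function* countTo(n: number): Generator<number> {
  for (let i = 1; i <= n; i++) {
    yield i;
  }
}

const replayableArray = looksReplayable([1, 2, 3]); // arrays mint a fresh iterator
const replayableSet = looksReplayable(new Set([1, 2, 3])); // so do sets
const replayableGenerator = looksReplayable(countTo(3)); // generators return themselves
```

A second pass over the same generator object yields nothing, which is the failure mode that rejecting `rescannable: true` for such inputs guards against.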
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **schema**: `Schema`<`any`>
|
||||
|
||||
* **iter**: `Iterable`<`RecordBatch`<`any`>> \| `AsyncIterable`<`RecordBatch`<`any`>>
|
||||
|
||||
* **opts**: [`ScannableOptions`](../interfaces/ScannableOptions.md) = `{}`
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<[`Scannable`](Scannable.md)>
|
||||
|
||||
***
|
||||
|
||||
### fromRecordBatchReader()
|
||||
|
||||
```ts
|
||||
static fromRecordBatchReader(reader, opts): Promise<Scannable>
|
||||
```
|
||||
|
||||
Build a Scannable from an Arrow `RecordBatchReader`. A reader can only
|
||||
be consumed once; `rescannable` defaults to `false`.
|
||||
|
||||
The reader must already be opened (via `.open()`) so its `.schema` is
|
||||
populated. `RecordBatchReader.from(...)` returns an unopened reader.
|
||||
|
||||
`opts.rescannable: true` is rejected because `RecordBatchReader` is a
|
||||
self-iterator (its `[Symbol.iterator]()` returns itself), and this
|
||||
constructor does not call `reader.reset()` between scans, so a second
|
||||
scan would always see an exhausted reader. For genuinely replayable
|
||||
sources, use
|
||||
`fromFactory(schema, () => openReader(), { rescannable: true })`,
|
||||
which mints a fresh reader on each scan.
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **reader**: `RecordBatchReader`<`any`>
|
||||
|
||||
* **opts**: [`ScannableOptions`](../interfaces/ScannableOptions.md) = `{}`
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<[`Scannable`](Scannable.md)>
|
||||
|
||||
***
|
||||
|
||||
### fromTable()
|
||||
|
||||
```ts
|
||||
static fromTable(table, opts): Promise<Scannable>
|
||||
```
|
||||
|
||||
Build a Scannable from an in-memory Arrow `Table`. Always rescannable;
|
||||
the table's batches are replayed on each scan.
|
||||
|
||||
The table's row count is authoritative: `opts.numRows` must either be
|
||||
omitted or equal to `table.numRows`. `opts.rescannable` of `false` is
|
||||
rejected because in-memory Tables are always rescannable.
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **table**: `Table`<`any`>
|
||||
|
||||
* **opts**: [`ScannableOptions`](../interfaces/ScannableOptions.md) = `{}`
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<[`Scannable`](Scannable.md)>
|
||||
@@ -187,6 +187,25 @@ Any attempt to use the table after it is closed will result in an error.
|
||||
|
||||
***
|
||||
|
||||
### closeLsmWriters()
|
||||
|
||||
```ts
|
||||
abstract closeLsmWriters(): Promise<void>
|
||||
```
|
||||
|
||||
Drain and close any cached MemWAL shard writers held for this table.
|
||||
|
||||
When an [LsmWriteSpec](../interfaces/LsmWriteSpec.md) is installed, `mergeInsert` opens MemWAL
|
||||
shard writers and caches them for reuse across calls. This closes them,
|
||||
flushing pending data; writers reopen lazily on the next `mergeInsert`.
|
||||
It is a no-op when no writers are cached.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<`void`>
|
||||
|
||||
***
|
||||
|
||||
### countRows()
|
||||
|
||||
```ts
|
||||
@@ -690,6 +709,74 @@ of the given query
|
||||
|
||||
***
|
||||
|
||||
### setLsmWriteSpec()
|
||||
|
||||
```ts
|
||||
abstract setLsmWriteSpec(spec): Promise<void>
|
||||
```
|
||||
|
||||
Install an [LsmWriteSpec](../interfaces/LsmWriteSpec.md) on this table, selecting Lance's MemWAL
|
||||
LSM-style write path for future `mergeInsert` calls.
|
||||
|
||||
`LsmWriteSpec` chooses one of three sharding strategies via `specType`:
|
||||
|
||||
- `"bucket"` — hash-bucket writes by the single-column unenforced primary
|
||||
key (`column` and `numBuckets` required).
|
||||
- `"identity"` — shard by the raw value of a scalar `column`.
|
||||
- `"unsharded"` — route every write to a single shard.
|
||||
|
||||
All variants require the table to have an unenforced primary key
|
||||
([Table#setUnenforcedPrimaryKey](Table.md#setunenforcedprimarykey)); bucket sharding additionally
|
||||
requires it to be the single column being bucketed.
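
The requirements above can be restated as a small pre-flight check. This is a sketch under stated assumptions, not the library's validation: `SpecSketch` and `checkSpec` are hypothetical, the `[1, 1024]` range is taken from the `LsmWriteSpec.numBuckets` documentation, and requiring an integer bucket count is an assumption of this example.

```ts
// Hypothetical subset of LsmWriteSpec used only for this illustration.
interface SpecSketch {
  specType: "bucket" | "identity" | "unsharded";
  column?: string;
  numBuckets?: number;
}

// Returns a list of problems; an empty list means the spec looks plausible.
function checkSpec(spec: SpecSketch, primaryKey: string | undefined): string[] {
  const problems: string[] = [];
  if (primaryKey === undefined) {
    problems.push("table has no unenforced primary key");
  }
  if (spec.specType === "bucket") {
    if (spec.column === undefined) {
      problems.push("bucket spec requires a column");
    } else if (primaryKey !== undefined && spec.column !== primaryKey) {
      problems.push("bucket column must be the primary key column");
    }
    const n = spec.numBuckets;
    if (n === undefined || !Number.isInteger(n) || n < 1 || n > 1024) {
      problems.push("numBuckets must be an integer in [1, 1024]");
    }
  } else if (spec.specType === "identity") {
    if (spec.column === undefined) {
      problems.push("identity spec requires a column");
    }
  }
  return problems;
}
```

Running a check like this before calling `setLsmWriteSpec` is optional; it only surfaces the documented constraints earlier and closer to the calling code.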
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **spec**: [`LsmWriteSpec`](../interfaces/LsmWriteSpec.md)
|
||||
The sharding spec to install.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<`void`>
|
||||
|
||||
#### Example
|
||||
|
||||
```ts
|
||||
await table.setUnenforcedPrimaryKey("id");
|
||||
await table.setLsmWriteSpec({
|
||||
specType: "bucket",
|
||||
column: "id",
|
||||
numBuckets: 16,
|
||||
maintainedIndexes: ["id_idx"],
|
||||
});
|
||||
```
|
||||
|
||||
***
|
||||
|
||||
### setUnenforcedPrimaryKey()
|
||||
|
||||
```ts
|
||||
abstract setUnenforcedPrimaryKey(columns): Promise<void>
|
||||
```
|
||||
|
||||
Set the unenforced primary key for this table to a single column.
|
||||
|
||||
"Unenforced" means LanceDB does not check uniqueness on writes; the
|
||||
column is recorded in the schema as the primary key for use by features
|
||||
such as `merge_insert`. Only single-column primary keys are supported,
|
||||
and the key cannot be changed once set.
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **columns**: `string` \| `string`[]
|
||||
The primary key column. A one-element
|
||||
array is also accepted; passing more than one column is rejected.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<`void`>
|
||||
|
||||
***
|
||||
|
||||
### stats()
|
||||
|
||||
```ts
|
||||
@@ -793,6 +880,23 @@ Return the table as an arrow table
|
||||
|
||||
***
|
||||
|
||||
### unsetLsmWriteSpec()
|
||||
|
||||
```ts
|
||||
abstract unsetLsmWriteSpec(): Promise<void>
|
||||
```
|
||||
|
||||
Remove the [LsmWriteSpec](../interfaces/LsmWriteSpec.md) from this table, reverting to the standard
|
||||
`mergeInsert` write path.
|
||||
|
||||
Errors if no spec is currently set.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<`void`>
|
||||
|
||||
***
|
||||
|
||||
### update()
|
||||
|
||||
#### update(opts)
|
||||
@@ -890,6 +994,29 @@ based on the row being updated (e.g. "my_col + 1")
|
||||
|
||||
***
|
||||
|
||||
### updateFieldMetadata()
|
||||
|
||||
```ts
|
||||
abstract updateFieldMetadata(updates): Promise<UpdateFieldMetadataResult>
|
||||
```
|
||||
|
||||
Update per-field (column) metadata.
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **updates**: [`FieldMetadataUpdate`](../interfaces/FieldMetadataUpdate.md)[]
|
||||
One or more per-field updates. Each
|
||||
update's metadata is merged into the field's existing metadata by default;
|
||||
a value of `null` deletes that key, and `replace: true` swaps the whole map.
|
||||
|
||||
#### Returns
|
||||
|
||||
`Promise`<[`UpdateFieldMetadataResult`](../interfaces/UpdateFieldMetadataResult.md)>
|
||||
|
||||
resolves to the new table version.
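
The merge, delete, and replace semantics can be modeled with a pure function. This is an illustrative sketch only: `MetadataUpdateSketch` and `applyMetadataUpdate` are hypothetical names, and ignoring `null` values when `replace` is `true` is an assumption of the example.

```ts
// Hypothetical stand-in for the metadata-related part of FieldMetadataUpdate.
interface MetadataUpdateSketch {
  metadata: Record<string, string | null>;
  replace?: boolean;
}

// Models how one update changes a field's metadata map.
function applyMetadataUpdate(
  existing: Record<string, string>,
  update: MetadataUpdateSketch,
): Record<string, string> {
  // `replace: true` starts from an empty map; otherwise merge into a copy.
  const result: Record<string, string> = update.replace ? {} : { ...existing };
  for (const [key, value] of Object.entries(update.metadata)) {
    if (value === null) {
      delete result[key]; // null removes the key
    } else {
      result[key] = value;
    }
  }
  return result;
}
```

For example, merging `{ unit: "s", note: null }` into `{ unit: "ms", note: "old" }` leaves only `unit`, now set to `"s"`.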
|
||||
|
||||
***
|
||||
|
||||
### vectorSearch()
|
||||
|
||||
```ts
|
||||
|
||||
@@ -498,6 +498,30 @@ This is useful for pagination.
|
||||
|
||||
***
|
||||
|
||||
### orderBy()
|
||||
|
||||
```ts
|
||||
orderBy(ordering): this
|
||||
```
|
||||
|
||||
Sort the results by the specified column(s).
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **ordering**: [`ColumnOrdering`](../interfaces/ColumnOrdering.md) \| [`ColumnOrdering`](../interfaces/ColumnOrdering.md)[]
|
||||
|
||||
#### Returns
|
||||
|
||||
`this`
|
||||
|
||||
This query builder.
|
||||
|
||||
#### Inherited from
|
||||
|
||||
`StandardQueryBase.orderBy`
|
||||
|
||||
***
|
||||
|
||||
### outputSchema()
|
||||
|
||||
```ts
|
||||
|
||||
@@ -32,6 +32,7 @@
|
||||
- [PhraseQuery](classes/PhraseQuery.md)
|
||||
- [Query](classes/Query.md)
|
||||
- [QueryBase](classes/QueryBase.md)
|
||||
- [Scannable](classes/Scannable.md)
|
||||
- [Session](classes/Session.md)
|
||||
- [StaticHeaderProvider](classes/StaticHeaderProvider.md)
|
||||
- [Table](classes/Table.md)
|
||||
@@ -50,6 +51,7 @@
|
||||
- [AlterColumnsResult](interfaces/AlterColumnsResult.md)
|
||||
- [ClientConfig](interfaces/ClientConfig.md)
|
||||
- [ColumnAlteration](interfaces/ColumnAlteration.md)
|
||||
- [ColumnOrdering](interfaces/ColumnOrdering.md)
|
||||
- [CompactionStats](interfaces/CompactionStats.md)
|
||||
- [ConnectNamespaceOptions](interfaces/ConnectNamespaceOptions.md)
|
||||
- [ConnectionOptions](interfaces/ConnectionOptions.md)
|
||||
@@ -63,6 +65,7 @@
|
||||
- [DropNamespaceOptions](interfaces/DropNamespaceOptions.md)
|
||||
- [DropNamespaceResponse](interfaces/DropNamespaceResponse.md)
|
||||
- [ExecutableQuery](interfaces/ExecutableQuery.md)
|
||||
- [FieldMetadataUpdate](interfaces/FieldMetadataUpdate.md)
|
||||
- [FragmentStatistics](interfaces/FragmentStatistics.md)
|
||||
- [FragmentSummaryStats](interfaces/FragmentSummaryStats.md)
|
||||
- [FtsOptions](interfaces/FtsOptions.md)
|
||||
@@ -78,14 +81,17 @@
|
||||
- [IvfRqOptions](interfaces/IvfRqOptions.md)
|
||||
- [ListNamespacesOptions](interfaces/ListNamespacesOptions.md)
|
||||
- [ListNamespacesResponse](interfaces/ListNamespacesResponse.md)
|
||||
- [LsmWriteSpec](interfaces/LsmWriteSpec.md)
|
||||
- [MergeResult](interfaces/MergeResult.md)
|
||||
- [OpenTableOptions](interfaces/OpenTableOptions.md)
|
||||
- [OptimizeOptions](interfaces/OptimizeOptions.md)
|
||||
- [OptimizeStats](interfaces/OptimizeStats.md)
|
||||
- [QueryExecutionOptions](interfaces/QueryExecutionOptions.md)
|
||||
- [RemovalStats](interfaces/RemovalStats.md)
|
||||
- [RenameTableOptions](interfaces/RenameTableOptions.md)
|
||||
- [RestNamespaceConfig](interfaces/RestNamespaceConfig.md)
|
||||
- [RetryConfig](interfaces/RetryConfig.md)
|
||||
- [ScannableOptions](interfaces/ScannableOptions.md)
|
||||
- [ShuffleOptions](interfaces/ShuffleOptions.md)
|
||||
- [SplitCalculatedOptions](interfaces/SplitCalculatedOptions.md)
|
||||
- [SplitHashOptions](interfaces/SplitHashOptions.md)
|
||||
@@ -96,10 +102,12 @@
|
||||
- [TimeoutConfig](interfaces/TimeoutConfig.md)
|
||||
- [TlsConfig](interfaces/TlsConfig.md)
|
||||
- [TokenResponse](interfaces/TokenResponse.md)
|
||||
- [UpdateFieldMetadataResult](interfaces/UpdateFieldMetadataResult.md)
|
||||
- [UpdateOptions](interfaces/UpdateOptions.md)
|
||||
- [UpdateResult](interfaces/UpdateResult.md)
|
||||
- [Version](interfaces/Version.md)
|
||||
- [WriteExecutionOptions](interfaces/WriteExecutionOptions.md)
|
||||
- [WriteProgress](interfaces/WriteProgress.md)
|
||||
|
||||
## Type Aliases
|
||||
|
||||
|
||||
@@ -19,3 +19,39 @@ mode: "append" | "overwrite";
|
||||
If "append" (the default) then the new data will be added to the table
|
||||
|
||||
If "overwrite" then the new data will replace the existing data in the table.
|
||||
|
||||
***
|
||||
|
||||
### progress()
|
||||
|
||||
```ts
|
||||
progress: (progress) => void;
|
||||
```
|
||||
|
||||
Optional callback invoked periodically with write progress.
|
||||
|
||||
The callback is fired once per batch written and once more with
|
||||
`done: true` when the write completes. Calls are dispatched
|
||||
asynchronously to the JS event loop and never block the write — a slow
|
||||
callback will queue events rather than back-pressure the writer.
|
||||
|
||||
Errors thrown from the callback are logged with `console.warn` and
|
||||
swallowed — they do not abort the write.
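
The error-handling half of this contract can be sketched as a wrapper around the user callback. The names `ProgressSketch` and `guardProgress` are hypothetical, and the sketch deliberately covers only the swallowing of errors, not the asynchronous dispatch to the event loop.

```ts
// Hypothetical subset of WriteProgress used only for this illustration.
interface ProgressSketch {
  outputRows: number;
  done: boolean;
}

// Wraps a callback so that a throw is reported and never propagates.
function guardProgress(
  callback: (p: ProgressSketch) => void,
  warn: (message: string) => void = (m) => console.warn(m),
): (p: ProgressSketch) => void {
  return (p) => {
    try {
      callback(p);
    } catch (err) {
      warn(`progress callback failed: ${String(err)}`);
    }
  };
}
```

Because the failure is only logged, a progress callback should not be relied on for anything the write depends on, such as validation or cancellation.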
|
||||
|
||||
#### Parameters
|
||||
|
||||
* **progress**: [`WriteProgress`](WriteProgress.md)
|
||||
|
||||
#### Returns
|
||||
|
||||
`void`
|
||||
|
||||
#### Example
|
||||
|
||||
```ts
|
||||
await table.add(data, {
|
||||
progress: (p) => {
|
||||
console.log(`${p.outputRows}/${p.totalRows ?? "?"} rows`);
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
31
docs/src/js/interfaces/ColumnOrdering.md
Normal file
@@ -0,0 +1,31 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / ColumnOrdering

# Interface: ColumnOrdering

## Properties

### ascending?

```ts
optional ascending: boolean;
```

***

### columnName

```ts
columnName: string;
```

***

### nullsFirst?

```ts
optional nullsFirst: boolean;
```
@@ -70,16 +70,20 @@ client used by manifest-enabled native connections.
|
||||
optional readConsistencyInterval: number;
|
||||
```
|
||||
|
||||
(For LanceDB OSS only): The interval, in seconds, at which to check for
|
||||
updates to the table from other processes. If None, then consistency is not
|
||||
checked. For performance reasons, this is the default. For strong
|
||||
consistency, set this to zero seconds. Then every read will check for
|
||||
updates from other processes. As a compromise, you can set this to a
|
||||
non-zero value for eventual consistency. If more than that interval
|
||||
has passed since the last check, then the table will be checked for updates.
|
||||
Note: this consistency only applies to read operations. Write operations are
|
||||
The interval, in seconds, at which to check for updates to the table
|
||||
from other processes. If None, then consistency is not checked. For
|
||||
performance reasons, this is the default. For strong consistency, set
|
||||
this to zero seconds. Then every read will check for updates from other
|
||||
processes. As a compromise, you can set this to a non-zero value for
|
||||
eventual consistency. If more than that interval has passed since the
|
||||
last check, then the table will be checked for updates. Note: this
|
||||
consistency only applies to read operations. Write operations are
|
||||
always consistent.
|
||||
|
||||
Stronger consistency is not free. The smaller the interval, the more
|
||||
often each read pays the cost of checking for updates against object
|
||||
storage, raising per-read latency and cost.
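
The three regimes (unset, zero, positive interval) can be written down as a predicate. This is a sketch of the rule as described, not the actual implementation; `shouldCheckForUpdates` and its `secondsSinceLastCheck` parameter are hypothetical.

```ts
// Hypothetical model: decides whether a read should first look for updates.
function shouldCheckForUpdates(
  readConsistencyInterval: number | undefined,
  secondsSinceLastCheck: number,
): boolean {
  if (readConsistencyInterval === undefined) {
    return false; // default: consistency is never checked
  }
  if (readConsistencyInterval === 0) {
    return true; // strong consistency: every read checks
  }
  // Eventual consistency: check only once the interval has elapsed.
  return secondsSinceLastCheck > readConsistencyInterval;
}
```

Under this model the interval is an upper bound on how stale a read can be, paid for with one extra check against storage each time it elapses.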
|
||||
|
||||
***
|
||||
|
||||
### region?
|
||||
|
||||
41
docs/src/js/interfaces/FieldMetadataUpdate.md
Normal file
@@ -0,0 +1,41 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / FieldMetadataUpdate

# Interface: FieldMetadataUpdate

A per-field metadata update, addressed by dot-path.

## Properties

### metadata

```ts
metadata: Record<string, null | string>;
```

Metadata key/value pairs. Merged into the field's existing metadata by
default; a value of `null` deletes that key.

***

### path

```ts
path: string;
```

Dot-separated path to the field. For a top-level column this is just its
name; for a nested field it's the path, e.g. "a.b.c".

***

### replace?

```ts
optional replace: boolean;
```

If true, replace the field's entire metadata map instead of merging.
@@ -30,17 +30,6 @@ The type of the index
|
||||
|
||||
***
|
||||
|
||||
### loss?
|
||||
|
||||
```ts
|
||||
optional loss: number;
|
||||
```
|
||||
|
||||
The KMeans loss value of the index,
|
||||
it is only present for vector indices.
|
||||
|
||||
***
|
||||
|
||||
### numIndexedRows
|
||||
|
||||
```ts
|
||||
|
||||
67
docs/src/js/interfaces/LsmWriteSpec.md
Normal file
@@ -0,0 +1,67 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / LsmWriteSpec

# Interface: LsmWriteSpec

Specification selecting Lance's MemWAL LSM-style write path for
`mergeInsert`.

`specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`,
`column` and `numBuckets` are required; for `"identity"`, `column` is
required and must be a deterministic function of the unenforced primary
key (every row with a given primary key must always produce the same
`column` value, or upserts of that key can land in different shards and a
stale version can win).

## Properties

### column?

```ts
optional column: string;
```

Bucket and identity variants: the sharding column.

***

### maintainedIndexes?

```ts
optional maintainedIndexes: string[];
```

Names of indexes the MemWAL should keep up to date during writes.

***

### numBuckets?

```ts
optional numBuckets: number;
```

Bucket variant: the number of buckets, in `[1, 1024]`.

***

### specType

```ts
specType: "bucket" | "identity" | "unsharded";
```

One of `"bucket"`, `"identity"`, or `"unsharded"`.

***

### writerConfigDefaults?

```ts
optional writerConfigDefaults: Record<string, string>;
```

Default `ShardWriter` configuration recorded in the MemWAL index.
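The per-variant requirements stated for `LsmWriteSpec` can be checked before a spec is handed to `mergeInsert`. This is a minimal sketch under stated assumptions: `checkLsmWriteSpec` is a hypothetical client-side helper, the local type only copies the documented fields, and treating `numBuckets` as an integer is an assumption. The determinism requirement on `column` for `"identity"` cannot be verified this way.

```ts
// Local copy of the documented shape, declared here so the sketch is self-contained.
type LsmWriteSpec = {
  specType: "bucket" | "identity" | "unsharded";
  column?: string;
  numBuckets?: number;
  maintainedIndexes?: string[];
  writerConfigDefaults?: Record<string, string>;
};

// Hypothetical helper: returns a list of problems, empty when the spec
// satisfies the documented per-variant requirements.
function checkLsmWriteSpec(spec: LsmWriteSpec): string[] {
  const problems: string[] = [];
  // Both sharded variants need a sharding column.
  if (spec.specType === "bucket" || spec.specType === "identity") {
    if (!spec.column) {
      problems.push(`"${spec.specType}" requires column`);
    }
  }
  // The bucket variant also needs a bucket count in [1, 1024].
  if (spec.specType === "bucket") {
    const n = spec.numBuckets;
    if (n === undefined || !Number.isInteger(n) || n < 1 || n > 1024) {
      problems.push('"bucket" requires numBuckets in [1, 1024]');
    }
  }
  return problems;
}

const okSpec = checkLsmWriteSpec({ specType: "bucket", column: "id", numBuckets: 16 });
const badSpec = checkLsmWriteSpec({ specType: "bucket" });
```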
@@ -32,6 +32,14 @@ numInsertedRows: number;
|
||||
|
||||
***
|
||||
|
||||
### numRows
|
||||
|
||||
```ts
|
||||
numRows: number;
|
||||
```
|
||||
|
||||
***
|
||||
|
||||
### numUpdatedRows
|
||||
|
||||
```ts
|
||||
|
||||
29
docs/src/js/interfaces/RenameTableOptions.md
Normal file
@@ -0,0 +1,29 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / RenameTableOptions

# Interface: RenameTableOptions

## Properties

### namespacePath?

```ts
optional namespacePath: string[];
```

The namespace path of the table being renamed. Defaults to the root
namespace (`[]`) when omitted.

***

### newNamespacePath?

```ts
optional newNamespacePath: string[];
```

The namespace path to move the table to as part of the rename. When
omitted the table stays in `namespacePath`.
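How these two options map onto a remote rename request can be sketched from the expectations in the remote-connection tests in this change (URL `/v1/table/ns1$table1/rename/`, body keys `new_table_name` and `new_namespace`). The function below is a hypothetical illustration, not the SDK's code, and joining multi-level namespace paths with `$` is an assumption extrapolated from the single-level case in those tests.

```ts
// Local copy of the documented shape, declared here so the sketch is self-contained.
type RenameTableOptions = {
  namespacePath?: string[];
  newNamespacePath?: string[];
};

type RenameBody = { new_table_name: string; new_namespace?: string[] };

// Hypothetical helper: builds the URL and JSON body a rename would send.
function sketchRenameRequest(
  currentName: string,
  newName: string,
  options: RenameTableOptions = {},
): { url: string; body: RenameBody } {
  // The table id is the namespace path plus the table name, "$"-delimited.
  const tableId = [...(options.namespacePath ?? []), currentName].join("$");
  const body: RenameBody = { new_table_name: newName };
  // Only send `new_namespace` when the caller asked for a move, so a plain
  // rename leaves the table in its current namespace.
  if (options.newNamespacePath !== undefined) {
    body.new_namespace = options.newNamespacePath;
  }
  return { url: `/v1/table/${tableId}/rename/`, body };
}

const sameNamespace = sketchRenameRequest("table1", "table2", {
  namespacePath: ["ns1"],
});
const crossNamespace = sketchRenameRequest("table1", "table2", {
  namespacePath: ["ns1"],
  newNamespacePath: ["ns2"],
});
```

Omitting `new_namespace` rather than sending `[]` is the safe default the tests call out: it avoids silently moving a table to the root namespace.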
29
docs/src/js/interfaces/ScannableOptions.md
Normal file
@@ -0,0 +1,29 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / ScannableOptions

# Interface: ScannableOptions

## Properties

### numRows?

```ts
optional numRows: number;
```

Hint about the number of rows. Not validated against the stream.

***

### rescannable?

```ts
optional rescannable: boolean;
```

Whether the source can be scanned more than once. Defaults to `true` for
`fromTable` / `fromFactory` and `false` for `fromIterable` /
`fromRecordBatchReader`.
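The reason `fromIterable` defaults to `rescannable: false` is that many iterables are one-shot. The scannable tests in this change distinguish replayable inputs, whose `[Symbol.iterator]()` returns a fresh iterator on each call, from one-shot inputs, which return themselves or expose only `.next`. The sketch below illustrates that distinction for synchronous inputs only; `looksOneShot` is a hypothetical name and this is not the library's detection helper.

```ts
// Shape used to probe for a synchronous iterator factory.
type MaybeIterable = { [Symbol.iterator]?: () => unknown };

// Hypothetical helper: true when a sync source cannot be scanned twice.
function looksOneShot(source: unknown): boolean {
  // null/undefined are left for a later check when a scan actually starts.
  if (source === null || source === undefined) {
    return false;
  }
  const factory = (source as MaybeIterable)[Symbol.iterator];
  // A raw iterator with only `.next` has no factory, so it cannot be replayed.
  if (typeof factory !== "function") {
    return true;
  }
  // Generators and `Array.prototype.values()` results return themselves here;
  // arrays and sets hand out a fresh iterator each time.
  return factory.call(source) === source;
}

function* numbers(): Generator<number> {
  yield 1;
  yield 2;
}

const arrayIsOneShot = looksOneShot([1, 2]);
const generatorIsOneShot = looksOneShot(numbers());
```

Calling the iterator factory does not consume any items, so the probe is safe to run at construction time. An asynchronous variant would presumably probe `Symbol.asyncIterator` the same way.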
15
docs/src/js/interfaces/UpdateFieldMetadataResult.md
Normal file
@@ -0,0 +1,15 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / UpdateFieldMetadataResult

# Interface: UpdateFieldMetadataResult

## Properties

### version

```ts
version: number;
```
84
docs/src/js/interfaces/WriteProgress.md
Normal file
@@ -0,0 +1,84 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / WriteProgress

# Interface: WriteProgress

Progress snapshot for a write operation, delivered to the `progress`
callback passed to [Table.add](../classes/Table.md#add).

## Properties

### activeTasks

```ts
activeTasks: number;
```

Number of parallel write tasks currently in flight.

***

### done

```ts
done: boolean;
```

`true` for the final callback; `false` otherwise.

***

### elapsedSeconds

```ts
elapsedSeconds: number;
```

Wall-clock seconds since the write started.

***

### outputBytes

```ts
outputBytes: number;
```

Number of bytes written so far.

***

### outputRows

```ts
outputRows: number;
```

Number of rows written so far.

***

### totalRows?

```ts
optional totalRows: number;
```

Total rows expected, when the input source reports it.

Always set on the final callback (the one with `done: true`), falling
back to the actual number of rows written when the source could not
report a row count up front.

***

### totalTasks

```ts
totalTasks: number;
```

Total number of parallel write tasks (the write parallelism).
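A `progress` callback will typically turn each `WriteProgress` snapshot into a status line. The sketch below shows one way to do that while handling the optional `totalRows`. `formatProgress` is a hypothetical helper, the local type only copies the documented fields, and how the callback is passed to `Table.add` is deliberately not shown because its exact signature is not given here.

```ts
// Local copy of the documented shape, declared here so the sketch is self-contained.
type WriteProgress = {
  activeTasks: number;
  done: boolean;
  elapsedSeconds: number;
  outputBytes: number;
  outputRows: number;
  totalRows?: number;
  totalTasks: number;
};

// Hypothetical helper: renders one snapshot as a single status line.
function formatProgress(p: WriteProgress): string {
  // A percentage is only meaningful when the source reported a row count.
  const rows =
    p.totalRows !== undefined && p.totalRows > 0
      ? `${p.outputRows}/${p.totalRows} rows (${Math.floor((100 * p.outputRows) / p.totalRows)}%)`
      : `${p.outputRows} rows`;
  // Guard against division by zero on the very first callback.
  const rate = p.elapsedSeconds > 0 ? Math.round(p.outputRows / p.elapsedSeconds) : 0;
  const state = p.done ? "done" : "writing";
  return `${state}: ${rows}, ${rate} rows/s, ${p.activeTasks}/${p.totalTasks} tasks`;
}

const line = formatProgress({
  activeTasks: 2,
  done: false,
  elapsedSeconds: 2,
  outputBytes: 1024,
  outputRows: 500,
  totalRows: 1000,
  totalTasks: 4,
});
// line is "writing: 500/1000 rows (50%), 250 rows/s, 2/4 tasks"
```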
@@ -166,6 +166,12 @@ lists the indices that LanceDb supports.
|
||||
|
||||
::: lancedb.index.IvfFlat
|
||||
|
||||
::: lancedb.index.IvfSq
|
||||
|
||||
::: lancedb.index.IvfRq
|
||||
|
||||
::: lancedb.index.HnswFlat
|
||||
|
||||
::: lancedb.table.IndexStatistics
|
||||
|
||||
## Querying (Asynchronous)
|
||||
|
||||
@@ -8,7 +8,7 @@
|
||||
<parent>
|
||||
<groupId>com.lancedb</groupId>
|
||||
<artifactId>lancedb-parent</artifactId>
|
||||
<version>0.28.0-beta.11</version>
|
||||
<version>0.30.1-beta.2</version>
|
||||
<relativePath>../pom.xml</relativePath>
|
||||
</parent>
|
||||
|
||||
|
||||
@@ -6,7 +6,7 @@
|
||||
|
||||
<groupId>com.lancedb</groupId>
|
||||
<artifactId>lancedb-parent</artifactId>
|
||||
<version>0.28.0-beta.11</version>
|
||||
<version>0.30.1-beta.2</version>
|
||||
<packaging>pom</packaging>
|
||||
<name>${project.artifactId}</name>
|
||||
<description>LanceDB Java SDK Parent POM</description>
|
||||
@@ -28,7 +28,7 @@
|
||||
<properties>
|
||||
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
|
||||
<arrow.version>15.0.0</arrow.version>
|
||||
<lance-core.version>7.0.0-beta.7</lance-core.version>
|
||||
<lance-core.version>8.0.0-beta.6</lance-core.version>
|
||||
<spotless.skip>false</spotless.skip>
|
||||
<spotless.version>2.30.0</spotless.version>
|
||||
<spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
[package]
|
||||
name = "lancedb-nodejs"
|
||||
edition.workspace = true
|
||||
version = "0.28.0-beta.11"
|
||||
version = "0.30.1-beta.2"
|
||||
publish = false
|
||||
license.workspace = true
|
||||
description.workspace = true
|
||||
|
||||
@@ -47,6 +47,14 @@ describe("given a connection", () => {
|
||||
await db.close();
|
||||
expect(db.isOpen()).toBe(false);
|
||||
await expect(db.tableNames()).rejects.toThrow("Connection is closed");
|
||||
await expect(db.renameTable("a", "b")).rejects.toThrow(
|
||||
"Connection is closed",
|
||||
);
|
||||
});
|
||||
|
||||
it("should report renameTable as unsupported on an OSS connection", async () => {
|
||||
await db.createTable("a", [{ id: 1 }]);
|
||||
await expect(db.renameTable("a", "b")).rejects.toThrow(/not supported/);
|
||||
});
|
||||
it("should be able to create a table from an object arg `createTable(options)`, or args `createTable(name, data, options)`", async () => {
|
||||
let tbl = await db.createTable("test", [{ id: 1 }, { id: 2 }]);
|
||||
@@ -163,18 +171,22 @@ describe("given a connection", () => {
|
||||
|
||||
let manifestDir =
|
||||
tmpDir.name + "/test_manifest_paths_v2_empty.lance/_versions";
|
||||
readdirSync(manifestDir).forEach((file) => {
|
||||
expect(file).toMatch(/^\d{20}\.manifest$/);
|
||||
});
|
||||
readdirSync(manifestDir)
|
||||
.filter((f) => f.endsWith(".manifest"))
|
||||
.forEach((file) => {
|
||||
expect(file).toMatch(/^\d{20}\.manifest$/);
|
||||
});
|
||||
|
||||
table = (await db.createTable("test_manifest_paths_v2", [{ id: 1 }], {
|
||||
enableV2ManifestPaths: true,
|
||||
})) as LocalTable;
|
||||
expect(await table.usesV2ManifestPaths()).toBe(true);
|
||||
manifestDir = tmpDir.name + "/test_manifest_paths_v2.lance/_versions";
|
||||
readdirSync(manifestDir).forEach((file) => {
|
||||
expect(file).toMatch(/^\d{20}\.manifest$/);
|
||||
});
|
||||
readdirSync(manifestDir)
|
||||
.filter((f) => f.endsWith(".manifest"))
|
||||
.forEach((file) => {
|
||||
expect(file).toMatch(/^\d{20}\.manifest$/);
|
||||
});
|
||||
});
|
||||
|
||||
it("should be able to migrate tables to the V2 manifest paths", async () => {
|
||||
@@ -191,16 +203,20 @@ describe("given a connection", () => {
|
||||
|
||||
const manifestDir =
|
||||
tmpDir.name + "/test_manifest_path_migration.lance/_versions";
|
||||
readdirSync(manifestDir).forEach((file) => {
|
||||
expect(file).toMatch(/^\d\.manifest$/);
|
||||
});
|
||||
readdirSync(manifestDir)
|
||||
.filter((f) => f.endsWith(".manifest"))
|
||||
.forEach((file) => {
|
||||
expect(file).toMatch(/^\d\.manifest$/);
|
||||
});
|
||||
|
||||
await table.migrateManifestPathsV2();
|
||||
expect(await table.usesV2ManifestPaths()).toBe(true);
|
||||
|
||||
readdirSync(manifestDir).forEach((file) => {
|
||||
expect(file).toMatch(/^\d{20}\.manifest$/);
|
||||
});
|
||||
readdirSync(manifestDir)
|
||||
.filter((f) => f.endsWith(".manifest"))
|
||||
.forEach((file) => {
|
||||
expect(file).toMatch(/^\d{20}\.manifest$/);
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
|
||||
@@ -109,3 +109,209 @@ describe("Query outputSchema", () => {
|
||||
expect(schema.fields.length).toBe(3);
|
||||
});
|
||||
});
|
||||
|
||||
describe("Query orderBy", () => {
|
||||
let tmpDir: tmp.DirResult;
|
||||
let table: Table;
|
||||
|
||||
beforeEach(async () => {
|
||||
tmpDir = tmp.dirSync({ unsafeCleanup: true });
|
||||
const db = await connect(tmpDir.name);
|
||||
|
||||
// Create table with numeric data for sorting
|
||||
const schema = new Schema([
|
||||
new Field("id", new Int64(), true),
|
||||
new Field("score", new Float32(), true),
|
||||
new Field("name", new Utf8(), true),
|
||||
]);
|
||||
|
||||
const data = makeArrowTable(
|
||||
[
|
||||
{ id: 1n, score: 3.5, name: "charlie" },
|
||||
{ id: 2n, score: 1.2, name: "alice" },
|
||||
{ id: 3n, score: 2.8, name: "bob" },
|
||||
{ id: 4n, score: 0.5, name: "david" },
|
||||
{ id: 5n, score: 4.1, name: "eve" },
|
||||
],
|
||||
{ schema },
|
||||
);
|
||||
table = await db.createTable("test", data);
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
tmpDir.removeCallback();
|
||||
});
|
||||
|
||||
it("should sort by single column ascending", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "score", ascending: true, nullsFirst: false })
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(5);
|
||||
// Verify ascending order
|
||||
expect(results[0].score).toBeCloseTo(0.5, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(1.2, 0.001);
|
||||
expect(results[2].score).toBeCloseTo(2.8, 0.001);
|
||||
expect(results[3].score).toBeCloseTo(3.5, 0.001);
|
||||
expect(results[4].score).toBeCloseTo(4.1, 0.001);
|
||||
});
|
||||
|
||||
it("should sort by single column descending", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "score", ascending: false, nullsFirst: false })
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(5);
|
||||
// Verify descending order
|
||||
expect(results[0].score).toBeCloseTo(4.1, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(3.5, 0.001);
|
||||
expect(results[2].score).toBeCloseTo(2.8, 0.001);
|
||||
expect(results[3].score).toBeCloseTo(1.2, 0.001);
|
||||
expect(results[4].score).toBeCloseTo(0.5, 0.001);
|
||||
});
|
||||
|
||||
it("should use ascending as default direction", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "score" })
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(5);
|
||||
// Verify ascending order (default)
|
||||
expect(results[0].score).toBeCloseTo(0.5, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(1.2, 0.001);
|
||||
expect(results[2].score).toBeCloseTo(2.8, 0.001);
|
||||
expect(results[3].score).toBeCloseTo(3.5, 0.001);
|
||||
expect(results[4].score).toBeCloseTo(4.1, 0.001);
|
||||
});
|
||||
|
||||
it("should sort by string column", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "name" })
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(5);
|
||||
// Verify alphabetical order
|
||||
expect(results[0].name).toBe("alice");
|
||||
expect(results[1].name).toBe("bob");
|
||||
expect(results[2].name).toBe("charlie");
|
||||
expect(results[3].name).toBe("david");
|
||||
expect(results[4].name).toBe("eve");
|
||||
});
|
||||
|
||||
it("should support method chaining with where", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.where("score > 2.0")
|
||||
.orderBy({ columnName: "score" })
|
||||
.toArray();
|
||||
expect(results.length).toBe(3);
|
||||
// Verify filtered and sorted
|
||||
expect(results[0].score).toBeCloseTo(2.8, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(3.5, 0.001);
|
||||
expect(results[2].score).toBeCloseTo(4.1, 0.001);
|
||||
});
|
||||
|
||||
it("should support method chaining with limit", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "score", ascending: false })
|
||||
.limit(3)
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(3);
|
||||
// Verify top 3 in descending order
|
||||
expect(results[0].score).toBeCloseTo(4.1, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(3.5, 0.001);
|
||||
expect(results[2].score).toBeCloseTo(2.8, 0.001);
|
||||
});
|
||||
|
||||
it("should support method chaining with offset", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "score" })
|
||||
.offset(2)
|
||||
.limit(2)
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(2);
|
||||
// Verify results skip first 2 and take next 2
|
||||
expect(results[0].score).toBeCloseTo(2.8, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(3.5, 0.001);
|
||||
});
|
||||
|
||||
it("should support method chaining with select", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.orderBy({ columnName: "name" })
|
||||
.select(["name", "score"])
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(5);
|
||||
// Verify only selected columns are present
|
||||
expect(Object.keys(results[0])).toEqual(["name", "score"]);
|
||||
expect(Object.keys(results[4])).toEqual(["name", "score"]);
|
||||
// Verify sorted by name
|
||||
expect(results[0].name).toBe("alice");
|
||||
expect(results[4].name).toBe("eve");
|
||||
});
|
||||
|
||||
it("should support complex method chaining", async () => {
|
||||
const results = await table
|
||||
.query()
|
||||
.where("score > 1.0")
|
||||
.orderBy({ columnName: "score", ascending: false })
|
||||
.limit(3)
|
||||
.select(["id", "score", "name"])
|
||||
.toArray();
|
||||
|
||||
expect(results.length).toBe(3);
|
||||
// Verify filtered, sorted, limited, and projected
|
||||
expect(results[0].score).toBeCloseTo(4.1, 0.001);
|
||||
expect(results[1].score).toBeCloseTo(3.5, 0.001);
|
||||
expect(results[2].score).toBeCloseTo(2.8, 0.001);
|
||||
expect(Object.keys(results[0])).toEqual(["id", "score", "name"]);
|
||||
});
|
||||
|
||||
it("should support multi-column ordering and null placement", async () => {
|
||||
const schema = new Schema([
|
||||
new Field("group", new Int64(), true),
|
||||
new Field("score", new Float32(), true),
|
||||
new Field("name", new Utf8(), true),
|
||||
]);
|
||||
|
||||
const data = makeArrowTable(
|
||||
[
|
||||
{ group: 1n, score: null, name: "z" },
|
||||
{ group: 1n, score: 1.0, name: "b" },
|
||||
{ group: 1n, score: 1.0, name: "a" },
|
||||
{ group: 2n, score: 0.5, name: "c" },
|
||||
],
|
||||
{ schema },
|
||||
);
|
||||
const nullTable = await (await connect(tmpDir.name)).createTable(
|
||||
"test_multi_order",
|
||||
data,
|
||||
{ mode: "overwrite" },
|
||||
);
|
||||
|
||||
const results = await nullTable
|
||||
.query()
|
||||
.orderBy([
|
||||
{ columnName: "group", ascending: true, nullsFirst: false },
|
||||
{ columnName: "score", ascending: true, nullsFirst: true },
|
||||
{ columnName: "name", ascending: true, nullsFirst: false },
|
||||
])
|
||||
.toArray();
|
||||
|
||||
expect(results.map((r) => [r.group, r.score, r.name])).toEqual([
|
||||
[1n, null, "z"],
|
||||
[1n, 1.0, "a"],
|
||||
[1n, 1.0, "b"],
|
||||
[2n, 0.5, "c"],
|
||||
]);
|
||||
});
|
||||
});
|
||||
|
||||
@@ -617,4 +617,68 @@ describe("remote connection", () => {
|
||||
);
|
||||
});
|
||||
});
|
||||
|
||||
describe("renameTable", () => {
|
||||
async function captureRenameRequest(
|
||||
call: (db: Connection) => Promise<void>,
|
||||
): Promise<{ url: string; body: Record<string, unknown> }> {
|
||||
let captured: { url: string; body: Record<string, unknown> } | undefined;
|
||||
await withMockDatabase((req, res) => {
|
||||
let raw = "";
|
||||
req.on("data", (chunk) => {
|
||||
raw += chunk;
|
||||
});
|
||||
req.on("end", () => {
|
||||
captured = {
|
||||
url: req.url ?? "",
|
||||
body: raw ? JSON.parse(raw) : {},
|
||||
};
|
||||
res.writeHead(200, { "Content-Type": "application/json" }).end("");
|
||||
});
|
||||
}, call);
|
||||
if (!captured) {
|
||||
throw new Error("mock server never saw a request");
|
||||
}
|
||||
return captured;
|
||||
}
|
||||
|
||||
it("sends rename request for a table in the root namespace", async () => {
|
||||
const { url, body } = await captureRenameRequest(async (db) => {
|
||||
await db.renameTable("table1", "table2");
|
||||
});
|
||||
expect(url).toBe("/v1/table/table1/rename/");
|
||||
// biome-ignore lint/style/useNamingConvention: snake_case mandated by the server wire format
|
||||
expect(body).toEqual({ new_table_name: "table2" });
|
||||
});
|
||||
|
||||
it("omits new_namespace when only the current namespace is supplied", async () => {
|
||||
// Safe-default check: passing namespacePath alone must not send
|
||||
// `new_namespace`, so the server keeps the table in its current
|
||||
// namespace instead of silently moving it to root.
|
||||
const { url, body } = await captureRenameRequest(async (db) => {
|
||||
await db.renameTable("table1", "table2", {
|
||||
namespacePath: ["ns1"],
|
||||
});
|
||||
});
|
||||
expect(url).toBe("/v1/table/ns1$table1/rename/");
|
||||
// biome-ignore lint/style/useNamingConvention: snake_case mandated by the server wire format
|
||||
expect(body).toEqual({ new_table_name: "table2" });
|
||||
});
|
||||
|
||||
it("includes new_namespace in the body for a cross-namespace rename", async () => {
|
||||
const { url, body } = await captureRenameRequest(async (db) => {
|
||||
await db.renameTable("table1", "table2", {
|
||||
namespacePath: ["ns1"],
|
||||
newNamespacePath: ["ns2"],
|
||||
});
|
||||
});
|
||||
expect(url).toBe("/v1/table/ns1$table1/rename/");
|
||||
expect(body).toEqual({
|
||||
// biome-ignore lint/style/useNamingConvention: snake_case mandated by the server wire format
|
||||
new_table_name: "table2",
|
||||
// biome-ignore lint/style/useNamingConvention: snake_case mandated by the server wire format
|
||||
new_namespace: ["ns2"],
|
||||
});
|
||||
});
|
||||
});
|
||||
});
|
||||
|
||||
438
nodejs/__test__/scannable.test.ts
Normal file
@@ -0,0 +1,438 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
import {
|
||||
Field,
|
||||
Float16,
|
||||
Int32,
|
||||
type RecordBatch,
|
||||
RecordBatchReader,
|
||||
Schema,
|
||||
tableToIPC,
|
||||
} from "apache-arrow";
|
||||
import { makeArrowTable, makeEmptyTable } from "../lancedb/arrow";
|
||||
import { Scannable } from "../lancedb/scannable";
|
||||
|
||||
function makeTable() {
|
||||
return makeArrowTable(
|
||||
[
|
||||
{ id: 1, name: "a" },
|
||||
{ id: 2, name: "b" },
|
||||
{ id: 3, name: "c" },
|
||||
],
|
||||
{ vectorColumns: {} },
|
||||
);
|
||||
}
|
||||
|
||||
async function makeReader(): Promise<RecordBatchReader> {
|
||||
// `RecordBatchReader.from()` returns an unopened reader; `.schema` is only
|
||||
// populated after `.open()`. Opening sync readers is synchronous.
|
||||
const reader = RecordBatchReader.from(tableToIPC(makeTable()));
|
||||
return reader.open() as RecordBatchReader;
|
||||
}
|
||||
|
||||
describe("Scannable", () => {
|
||||
describe("fromTable", () => {
|
||||
test("reflects schema, numRows, and defaults rescannable=true", async () => {
|
||||
const table = makeTable();
|
||||
const scannable = await Scannable.fromTable(table);
|
||||
|
||||
expect(scannable.schema).toBe(table.schema);
|
||||
expect(scannable.numRows).toBe(table.numRows);
|
||||
expect(scannable.rescannable).toBe(true);
|
||||
});
|
||||
|
||||
test("throws when opts.numRows does not match table.numRows", async () => {
|
||||
await expect(
|
||||
Scannable.fromTable(makeTable(), { numRows: 42 }),
|
||||
).rejects.toThrow(/does not match table\.numRows/);
|
||||
});
|
||||
|
||||
test("throws when opts.rescannable is false", async () => {
|
||||
await expect(
|
||||
Scannable.fromTable(makeTable(), { rescannable: false }),
|
||||
).rejects.toThrow(/always rescannable/);
|
||||
});
|
||||
});
|
||||
|
||||
describe("fromRecordBatchReader", () => {
|
||||
test("reflects schema and defaults numRows=null, rescannable=false", async () => {
|
||||
const reader = await makeReader();
|
||||
const scannable = await Scannable.fromRecordBatchReader(reader);
|
||||
|
||||
expect(scannable.schema).toBe(reader.schema);
|
||||
expect(scannable.numRows).toBeNull();
|
||||
expect(scannable.rescannable).toBe(false);
|
||||
});
|
||||
|
||||
test("honors numRows override", async () => {
|
||||
const scannable = await Scannable.fromRecordBatchReader(
|
||||
await makeReader(),
|
||||
{ numRows: 3 },
|
||||
);
|
||||
|
||||
expect(scannable.numRows).toBe(3);
|
||||
expect(scannable.rescannable).toBe(false);
|
||||
});
|
||||
|
||||
test("rescannable: false explicit does not throw", async () => {
|
||||
const reader = await makeReader();
|
||||
const scannable = await Scannable.fromRecordBatchReader(reader, {
|
||||
rescannable: false,
|
||||
});
|
||||
expect(scannable.rescannable).toBe(false);
|
||||
});
|
||||
|
||||
test("throws when opts.rescannable is true", async () => {
|
||||
const reader = await makeReader();
|
||||
await expect(
|
||||
Scannable.fromRecordBatchReader(reader, { rescannable: true }),
|
||||
).rejects.toThrow(/does not accept rescannable/);
|
||||
});
|
||||
|
||||
test("throws when opts.rescannable is true even alongside numRows", async () => {
|
||||
const reader = await makeReader();
|
||||
await expect(
|
||||
Scannable.fromRecordBatchReader(reader, {
|
||||
numRows: 3,
|
||||
rescannable: true,
|
||||
}),
|
||||
).rejects.toThrow(/does not accept rescannable/);
|
||||
});
|
||||
});
|
||||
|
||||
describe("fromIterable", () => {
|
||||
test("accepts a sync iterable of batches", async () => {
|
||||
const table = makeTable();
|
||||
const scannable = await Scannable.fromIterable(
|
||||
table.schema,
|
||||
table.batches,
|
||||
);
|
||||
|
||||
expect(scannable.schema).toBe(table.schema);
|
||||
expect(scannable.numRows).toBeNull();
|
||||
expect(scannable.rescannable).toBe(false);
|
||||
});
|
||||
|
||||
test("accepts an async iterable of batches", async () => {
|
||||
const table = makeTable();
|
||||
async function* generator(): AsyncGenerator<RecordBatch> {
|
||||
for (const batch of table.batches) {
|
||||
yield batch;
|
||||
}
|
||||
}
|
||||
|
||||
const scannable = await Scannable.fromIterable(table.schema, generator());
|
||||
expect(scannable.schema).toBe(table.schema);
|
||||
expect(scannable.rescannable).toBe(false);
|
||||
});
|
||||
|
||||
describe("rescannable: true detection", () => {
|
||||
// Replayable inputs: [Symbol.iterator]() / [Symbol.asyncIterator]()
|
||||
// returns a fresh iterator each call. Must NOT throw.
|
||||
|
||||
test("Array passes (fresh ArrayIterator each call)", async () => {
|
||||
const table = makeTable();
|
||||
const scannable = await Scannable.fromIterable(
|
||||
table.schema,
|
||||
table.batches,
|
||||
{ rescannable: true },
|
||||
);
|
||||
expect(scannable.rescannable).toBe(true);
|
||||
});
|
||||
|
||||
test("Set passes (fresh SetIterator each call)", async () => {
|
||||
const table = makeTable();
|
||||
const set = new Set<RecordBatch>(table.batches);
|
||||
const scannable = await Scannable.fromIterable(table.schema, set, {
|
||||
rescannable: true,
|
||||
});
|
||||
expect(scannable.rescannable).toBe(true);
|
||||
});
|
||||
|
||||
test("custom Iterable returning a fresh iterator passes", async () => {
|
||||
const table = makeTable();
|
||||
const replayable: Iterable<RecordBatch> = {
|
||||
[Symbol.iterator]() {
|
||||
return table.batches[Symbol.iterator]();
|
||||
},
|
||||
};
|
||||
const scannable = await Scannable.fromIterable(
|
||||
table.schema,
|
||||
replayable,
|
||||
{ rescannable: true },
|
||||
);
|
||||
expect(scannable.rescannable).toBe(true);
|
||||
});
|
||||
|
||||
test("object with generator method passes (fresh generator each call)", async () => {
|
||||
const table = makeTable();
|
||||
const replayable: Iterable<RecordBatch> = {
|
||||
*[Symbol.iterator]() {
|
||||
for (const batch of table.batches) yield batch;
|
||||
},
|
||||
};
|
||||
const scannable = await Scannable.fromIterable(
|
||||
table.schema,
|
||||
replayable,
|
||||
{ rescannable: true },
|
||||
);
|
||||
expect(scannable.rescannable).toBe(true);
|
||||
});
|
||||
|
||||
test("empty Array passes (replayable degenerate case)", async () => {
|
||||
const schema = makeTable().schema;
|
||||
const scannable = await Scannable.fromIterable(
|
||||
schema,
|
||||
[] as RecordBatch[],
|
||||
{ rescannable: true },
|
||||
);
|
||||
expect(scannable.rescannable).toBe(true);
|
||||
});
|
||||
|
||||
// One-shot inputs: [Symbol.iterator]() / [Symbol.asyncIterator]()
|
||||
// returns the same object, or the input is already-an-iterator.
|
||||
// Must throw with a /one-shot/ message.
|
||||
|
||||
test("sync generator throws", async () => {
|
||||
const table = makeTable();
|
||||
function* generator(): Generator<RecordBatch> {
|
||||
for (const batch of table.batches) yield batch;
|
||||
}
|
||||
await expect(
|
||||
Scannable.fromIterable(table.schema, generator(), {
|
||||
rescannable: true,
|
||||
}),
|
||||
).rejects.toThrow(/one-shot/);
|
||||
});
|
||||
|
||||
test("async generator throws", async () => {
|
||||
const table = makeTable();
|
||||
async function* generator(): AsyncGenerator<RecordBatch> {
|
||||
for (const batch of table.batches) yield batch;
|
||||
}
|
||||
await expect(
|
||||
Scannable.fromIterable(table.schema, generator(), {
|
||||
rescannable: true,
|
||||
}),
|
||||
).rejects.toThrow(/one-shot/);
|
||||
});
|
||||
|
||||
test("empty generator throws (one-shot degenerate case)", async () => {
|
||||
const schema = makeTable().schema;
|
||||
function* generator(): Generator<RecordBatch> {
|
||||
// intentionally empty; yields nothing.
|
||||
}
|
||||
await expect(
|
||||
Scannable.fromIterable(schema, generator(), { rescannable: true }),
|
||||
).rejects.toThrow(/one-shot/);
|
||||
});
|
||||
|
||||
test("custom self-iterator throws", async () => {
|
||||
const table = makeTable();
|
||||
const batches = table.batches;
|
||||
let i = 0;
|
||||
const oneShot: Iterable<RecordBatch> & Iterator<RecordBatch> = {
|
||||
[Symbol.iterator]() {
|
||||
return this;
|
||||
},
|
||||
next() {
|
||||
if (i >= batches.length) {
|
||||
return { done: true, value: undefined };
|
||||
}
|
||||
return { done: false, value: batches[i++] };
|
||||
},
|
||||
};
|
||||
await expect(
|
||||
Scannable.fromIterable(table.schema, oneShot, { rescannable: true }),
|
||||
).rejects.toThrow(/one-shot/);
|
||||
});
|
||||
|
||||
test("Array.values() (IterableIterator) throws", async () => {
|
||||
const table = makeTable();
|
||||
const iter = table.batches.values();
|
||||
await expect(
|
||||
Scannable.fromIterable(table.schema, iter, { rescannable: true }),
|
||||
).rejects.toThrow(/one-shot/);
|
||||
});
|
||||
|
||||
test("raw iterator (only `.next`) throws", async () => {
|
||||
const table = makeTable();
|
||||
const batches = table.batches;
|
||||
let i = 0;
|
||||
const rawIter = {
|
||||
next(): IteratorResult<RecordBatch> {
|
||||
if (i >= batches.length) {
|
||||
return { done: true, value: undefined };
|
||||
}
|
||||
return { done: false, value: batches[i++] };
|
||||
},
|
||||
};
|
||||
await expect(
|
||||
Scannable.fromIterable(
|
||||
table.schema,
|
||||
rawIter as unknown as Iterable<RecordBatch>,
|
||||
{ rescannable: true },
|
||||
),
|
||||
).rejects.toThrow(/one-shot/);
|
||||
});
|
||||
|
||||
// Edge: null/undefined must not crash the detection helper. The
|
||||
// null check belongs to `normalizeIterator` and only fires when a
|
||||
// scan starts.
|
||||
|
||||
test("null input does not crash detection at construction", async () => {
|
||||
const schema = makeTable().schema;
|
||||
await expect(
|
||||
Scannable.fromIterable(
|
||||
schema,
|
||||
null as unknown as Iterable<RecordBatch>,
|
||||
{
|
||||
rescannable: true,
|
||||
},
|
||||
),
|
||||
).resolves.toBeDefined();
|
||||
});
|
||||
|
||||
      test("undefined input does not crash detection at construction", async () => {
        const schema = makeTable().schema;
        await expect(
          Scannable.fromIterable(
            schema,
            undefined as unknown as Iterable<RecordBatch>,
            { rescannable: true },
          ),
        ).resolves.toBeDefined();
      });

      // Default (rescannable omitted) skips the check entirely, so even
      // pathological inputs construct without throwing here.

      test("rescannable omitted skips detection entirely (generator passes)", async () => {
        const table = makeTable();
        function* generator(): Generator<RecordBatch> {
          for (const batch of table.batches) yield batch;
        }
        const scannable = await Scannable.fromIterable(
          table.schema,
          generator(),
        );
        expect(scannable.rescannable).toBe(false);
      });

      test("rescannable: false explicit skips detection entirely (generator passes)", async () => {
        const table = makeTable();
        function* generator(): Generator<RecordBatch> {
          for (const batch of table.batches) yield batch;
        }
        const scannable = await Scannable.fromIterable(
          table.schema,
          generator(),
          { rescannable: false },
        );
        expect(scannable.rescannable).toBe(false);
      });
    });
  });

  describe("fromFactory", () => {
    test("defaults rescannable=true and does not invoke the factory eagerly", async () => {
      const table = makeTable();
      const factory = jest.fn(() => table.batches);

      const scannable = await Scannable.fromFactory(table.schema, factory);

      expect(scannable.schema).toBe(table.schema);
      expect(scannable.rescannable).toBe(true);
      expect(factory).not.toHaveBeenCalled();
    });

    test("honors rescannable and numRows overrides", async () => {
      const table = makeTable();
      const scannable = await Scannable.fromFactory(
        table.schema,
        () => table.batches,
        { numRows: 7, rescannable: false },
      );

      expect(scannable.numRows).toBe(7);
      expect(scannable.rescannable).toBe(false);
    });
  });

  describe("validation", () => {
    test("throws when numRows is negative", async () => {
      await expect(
        Scannable.fromFactory(makeTable().schema, () => [], { numRows: -1 }),
      ).rejects.toThrow(/non-negative/);
    });

    test("throws when numRows is not an integer", async () => {
      await expect(
        Scannable.fromFactory(makeTable().schema, () => [], { numRows: 3.5 }),
      ).rejects.toThrow(/integer/);
    });
  });

  describe("native handle", () => {
    test("exposes a native handle via inner", async () => {
      const scannable = await Scannable.fromTable(makeTable());
      expect(scannable.inner).toBeDefined();
      expect(typeof scannable.inner).toBe("object");
      expect(scannable.inner).not.toBeNull();
    });
  });

  // Schema-variety construction tests. Each asserts that construction
  // succeeds against a richer Arrow schema, which transitively exercises
  // schema serialization and the Rust-side `ipc_file_to_schema` for types
  // beyond flat primitives.
  describe("schema variety", () => {
    test("accepts an empty table", async () => {
      const schema = new Schema([new Field("id", new Int32(), true)]);
      const table = makeEmptyTable(schema);
      const scannable = await Scannable.fromTable(table);

      expect(scannable.numRows).toBe(0);
      expect(scannable.schema).toBe(table.schema);
    });

    test("accepts nested struct and list columns", async () => {
      const table = makeArrowTable(
        [
          { id: 1, point: { x: 0, y: 0 }, tags: ["a", "b"] },
          { id: 2, point: { x: 1, y: 2 }, tags: ["c"] },
        ],
        { vectorColumns: {} },
      );
      const scannable = await Scannable.fromTable(table);

      expect(scannable.schema).toBe(table.schema);
      expect(scannable.numRows).toBe(2);
    });

    test("accepts a FixedSizeList (vector) column", async () => {
      const table = makeArrowTable(
        [
          { id: 1, vec: [1, 2, 3] },
          { id: 2, vec: [4, 5, 6] },
        ],
        { vectorColumns: { vec: { type: new Float16() } } },
      );
      const scannable = await Scannable.fromTable(table);

      expect(scannable.schema).toBe(table.schema);
      expect(scannable.numRows).toBe(2);
    });

    test("accepts a table with many columns", async () => {
      const row: Record<string, number> = {};
      for (let i = 0; i < 50; i++) row[`c${i}`] = i;
      const table = makeArrowTable([row, row], { vectorColumns: {} });
      const scannable = await Scannable.fromTable(table);

      expect(scannable.schema.fields.length).toBe(50);
      expect(scannable.numRows).toBe(2);
    });
  });
});
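The tests above distinguish replayable iterables from one-shot ones. The check they exercise could look roughly like the sketch below. It is only an illustration, not the library's code: the function body is an assumption based on the test names and the doc comments (a one-shot source returns itself from `[Symbol.iterator]()` or `[Symbol.asyncIterator]()`), and `selfIterator` is a made-up stand-in for a generator object.

```typescript
// Hypothetical sketch of a structural "one-shot" check. The real helper in
// scannable.ts is not shown in this diff excerpt and may differ.
function isOneShotIterable(input: unknown): boolean {
  // undefined, null and primitives cannot be inspected; report "not one-shot"
  // so that construction does not throw at this point.
  if (input === null || (typeof input !== "object" && typeof input !== "function")) {
    return false;
  }
  const candidate = input as {
    [Symbol.iterator]?: () => unknown;
    [Symbol.asyncIterator]?: () => unknown;
  };
  // A self-iterator hands back the same object every time, so a second scan
  // would see an already exhausted source.
  const syncFactory = candidate[Symbol.iterator];
  if (typeof syncFactory === "function") {
    return syncFactory.call(input) === input;
  }
  const asyncFactory = candidate[Symbol.asyncIterator];
  if (typeof asyncFactory === "function") {
    return asyncFactory.call(input) === input;
  }
  return false;
}

// A hand-written self-iterator, shaped like a generator object.
const selfIterator = {
  next() {
    return { done: true as const, value: undefined };
  },
  [Symbol.iterator]() {
    return this;
  },
};

const oneShot = isOneShotIterable(selfIterator); // self-iterator
const replayable = isOneShotIterable([1, 2, 3]); // arrays mint fresh iterators
const missing = isOneShotIterable(undefined); // no crash on undefined
```

Note that the check calls `[Symbol.iterator]()` once on the input, which matches the note in the `fromIterable` doc comment about the structural check.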
@@ -28,6 +28,7 @@ import {
  List,
  Schema,
  SchemaLike,
  Struct,
  Type,
  Uint8,
  Utf8,
@@ -115,6 +116,48 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
      await expect(table.countRows()).resolves.toBe(1);
    });

    it("should invoke the progress callback", async () => {
      const events: import("../lancedb").WriteProgress[] = [];
      await table.add([{ id: 1 }, { id: 2 }, { id: 3 }], {
        progress: (p) => events.push(p),
      });

      expect(events.length).toBeGreaterThan(0);
      const last = events[events.length - 1];
      expect(last.done).toBe(true);
      // Earlier callbacks must have done=false.
      for (const ev of events.slice(0, -1)) {
        expect(ev.done).toBe(false);
      }
      // outputRows reflects the rows added in this call, not table size.
      expect(last.outputRows).toBe(3);
      // The input source (an array) reports a row count, so totalRows is set.
      expect(last.totalRows).toBe(3);
      // outputRows is monotonic.
      for (let i = 1; i < events.length; i++) {
        expect(events[i].outputRows).toBeGreaterThanOrEqual(
          events[i - 1].outputRows,
        );
      }
    });

    it("should swallow errors thrown from the progress callback", async () => {
      const warn = jest
        .spyOn(console, "warn")
        .mockImplementation(() => undefined);
      try {
        const res = await table.add([{ id: 1 }, { id: 2 }], {
          progress: () => {
            throw new Error("callback bomb");
          },
        });
        expect(res.version).toBeGreaterThan(0);
        expect(warn).toHaveBeenCalled();
      } finally {
        warn.mockRestore();
      }
    });

    it("should let me close the table", async () => {
      expect(table.isOpen()).toBe(true);
      table.close();
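The two progress tests above pin down a contract: intermediate events carry `done: false`, the last event carries `done: true`, `outputRows` never decreases, and an exception thrown by the callback is reported as a warning instead of failing the write. A minimal sketch of that contract follows. The `WriteProgress` interface here is a reduced local stand-in with only the fields the tests assert on, and `safeProgress` and `simulateWrite` are hypothetical names, not part of the LanceDB API.

```typescript
// Reduced stand-in for the WriteProgress shape asserted in the tests.
interface WriteProgress {
  outputRows: number;
  totalRows?: number;
  done: boolean;
}

// Hypothetical wrapper: run the user callback, but turn any exception into a
// warning so that the surrounding write can still complete.
function safeProgress(
  callback: (p: WriteProgress) => void,
  warn: (message: string) => void = (m) => console.warn(m),
): (p: WriteProgress) => void {
  return (p) => {
    try {
      callback(p);
    } catch (err) {
      warn(`progress callback threw: ${String(err)}`);
    }
  };
}

// Hypothetical writer loop: one event per batch with done=false, then a final
// event with done=true once all rows are written.
function simulateWrite(
  rows: number,
  batchSize: number,
  onProgress: (p: WriteProgress) => void,
): number {
  let written = 0;
  while (written < rows) {
    written = Math.min(rows, written + batchSize);
    onProgress({ outputRows: written, totalRows: rows, done: false });
  }
  onProgress({ outputRows: written, totalRows: rows, done: true });
  return written;
}

const events: WriteProgress[] = [];
simulateWrite(3, 2, safeProgress((p) => events.push(p)));

const warnings: string[] = [];
const writtenDespiteThrow = simulateWrite(
  2,
  1,
  safeProgress(
    () => {
      throw new Error("callback bomb");
    },
    (m) => warnings.push(m),
  ),
);
```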
@@ -678,7 +721,7 @@ describe("When creating an index", () => {
      columns: ["vec"],
    });
    const stats = await tbl.indexStats("vec_idx");
    expect(stats?.loss).toBeDefined();
    expect(stats).toBeDefined();

    // Search without specifying the column
    let rst = await tbl
@@ -738,6 +781,113 @@ describe("When creating an index", () => {
    expect(indices2.length).toBe(0);
  });

  it("should create and search a nested vector index", async () => {
    const db = await connect(tmpDir.name);
    const nestedSchema = new Schema([
      new Field("id", new Int32(), true),
      new Field(
        "image",
        new Struct([
          new Field(
            "embedding",
            new FixedSizeList(2, new Field("item", new Float32(), true)),
            true,
          ),
        ]),
        true,
      ),
    ]);
    const nestedTable = await db.createTable(
      "nested_vector",
      makeArrowTable(
        Array.from({ length: 300 }, (_, id) => ({
          id,
          image: { embedding: [id, id + 1] },
        })),
        { schema: nestedSchema },
      ),
    );

    await nestedTable.createIndex("image.embedding", {
      name: "image_embedding_idx",
    });
    const indices = await nestedTable.listIndices();
    expect(indices).toContainEqual({
      name: "image_embedding_idx",
      indexType: "IvfPq",
      columns: ["image.embedding"],
    });

    const explicit = await nestedTable
      .query()
      .nearestTo([0.0, 1.0])
      .column("image.embedding")
      .limit(1)
      .toArray();
    const inferred = await nestedTable
      .query()
      .nearestTo([0.0, 1.0])
      .limit(1)
      .toArray();
    expect(inferred[0].id).toEqual(explicit[0].id);
  });

  it("should report multiple nested vector candidates", async () => {
    const db = await connect(tmpDir.name);
    const nestedSchema = new Schema([
      new Field(
        "image",
        new Struct([
          new Field(
            "embedding",
            new FixedSizeList(2, new Field("item", new Float32(), true)),
            true,
          ),
        ]),
        true,
      ),
      new Field(
        "text",
        new Struct([
          new Field(
            "embedding",
            new FixedSizeList(2, new Field("item", new Float32(), true)),
            true,
          ),
        ]),
        true,
      ),
    ]);
    const nestedTable = await db.createTable(
      "multiple_nested_vectors",
      makeArrowTable(
        [
          {
            image: { embedding: [0.0, 1.0] },
            text: { embedding: [2.0, 3.0] },
          },
        ],
        { schema: nestedSchema },
      ),
    );

    await expect(
      nestedTable.query().nearestTo([0.0, 1.0]).limit(1).toArray(),
    ).rejects.toThrow(/image\.embedding.*text\.embedding/);
  });

  it("should report when no default vector column exists", async () => {
    const db = await connect(tmpDir.name);
    const noVectorTable = await db.createTable(
      "no_vector",
      makeArrowTable([{ id: 0, label: "cat" }]),
    );

    await expect(
      noVectorTable.query().nearestTo([0.0, 1.0]).limit(1).toArray(),
    ).rejects.toThrow(/No vector column/);
  });

  it("should wait for index readiness", async () => {
    // Create an index and then wait for it to be ready
    await tbl.createIndex("vec");
@@ -1000,7 +1150,6 @@ describe("When creating an index", () => {
    expect(stats?.distanceType).toBeUndefined();
    expect(stats?.indexType).toEqual("BTREE");
    expect(stats?.numIndices).toEqual(1);
    expect(stats?.loss).toBeUndefined();
  });

  test("when getting stats on non-existent index", async () => {
@@ -1421,6 +1570,33 @@ describe("schema evolution", function () {
    expect(await table.schema()).toEqual(expectedSchema3);
  });

  it("can update field metadata", async function () {
    const con = await connect(tmpDir.name);
    const table = await con.createTable("fm", [
      { id: 1, category: "a" },
      { id: 2, category: "b" },
    ]);

    const res = await table.updateFieldMetadata([
      { path: "category", metadata: { unit: "label", pii: "false" } },
    ]);
    expect(res).toHaveProperty("version");
    expect(res.version).toBe(2);

    let cat = (await table.schema()).fields.find((f) => f.name === "category");
    expect(cat?.metadata.get("unit")).toBe("label");
    expect(cat?.metadata.get("pii")).toBe("false");

    // merge: add a key, delete one via null, keep the rest
    await table.updateFieldMetadata([
      { path: "category", metadata: { source: "import", pii: null } },
    ]);
    cat = (await table.schema()).fields.find((f) => f.name === "category");
    expect(cat?.metadata.get("unit")).toBe("label"); // preserved
    expect(cat?.metadata.get("source")).toBe("import"); // added
    expect(cat?.metadata.has("pii")).toBe(false); // deleted
  });

  it("can cast to various types", async function () {
    const con = await connect(tmpDir.name);

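The "can update field metadata" test describes merge semantics: new keys are added, a `null` value deletes a key, and untouched keys are preserved. As a rough in-memory illustration of those rules (not the actual implementation, which runs on the native side), `mergeFieldMetadata` below is a hypothetical helper:

```typescript
// Hypothetical helper that mirrors the merge rules the test asserts:
// string value => set or overwrite, null => delete, absent => keep.
function mergeFieldMetadata(
  existing: Map<string, string>,
  updates: Record<string, string | null>,
): Map<string, string> {
  const merged = new Map(existing);
  for (const key of Object.keys(updates)) {
    const value = updates[key];
    if (value === null) {
      merged.delete(key);
    } else {
      merged.set(key, value);
    }
  }
  return merged;
}

// Same sequence of updates as in the test.
const afterFirst = mergeFieldMetadata(new Map<string, string>(), {
  unit: "label",
  pii: "false",
});
const afterSecond = mergeFieldMetadata(afterFirst, {
  source: "import",
  pii: null,
});
```

The helper returns a new map instead of mutating its input, so `afterFirst` still holds the `pii` key after the second call.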
@@ -2348,3 +2524,224 @@ describe("when creating a table with Float32Array vectors", () => {
    expect((fsl.children[0].type as Float32).precision).toBe(1);
  });
});

describe("setUnenforcedPrimaryKey", () => {
  let tmpDir: tmp.DirResult;

  beforeEach(() => {
    tmpDir = tmp.dirSync({ unsafeCleanup: true });
  });
  afterEach(() => tmpDir.removeCallback());

  it("sets a single-column primary key (string or one-element array)", async () => {
    const conn = await connect(tmpDir.name);
    const schema = new arrow.Schema([
      new arrow.Field("id", new arrow.Int64(), false),
    ]);
    const t1 = await conn.createEmptyTable("t1", schema);
    await t1.setUnenforcedPrimaryKey("id");

    const t2 = await conn.createEmptyTable("t2", schema);
    await t2.setUnenforcedPrimaryKey(["id"]);
  });

  it("rejects a compound primary key", async () => {
    const conn = await connect(tmpDir.name);
    const table = await conn.createEmptyTable(
      "t",
      new arrow.Schema([
        new arrow.Field("id", new arrow.Int64(), false),
        new arrow.Field("name", new arrow.Utf8(), false),
      ]),
    );
    await expect(
      table.setUnenforcedPrimaryKey(["id", "name"]),
    ).rejects.toThrow();
  });

  it("rejects changing the primary key once set", async () => {
    const conn = await connect(tmpDir.name);
    const table = await conn.createEmptyTable(
      "t",
      new arrow.Schema([
        new arrow.Field("id", new arrow.Int64(), false),
        new arrow.Field("name", new arrow.Utf8(), false),
      ]),
    );
    await table.setUnenforcedPrimaryKey("id");
    await expect(table.setUnenforcedPrimaryKey("name")).rejects.toThrow();
    await expect(table.setUnenforcedPrimaryKey("id")).rejects.toThrow();
  });
});

describe("setLsmWriteSpec / unsetLsmWriteSpec", () => {
  let tmpDir: tmp.DirResult;

  beforeEach(() => {
    tmpDir = tmp.dirSync({ unsafeCleanup: true });
  });
  afterEach(() => tmpDir.removeCallback());

  async function makeTable(conn: Connection): Promise<Table> {
    return await conn.createEmptyTable(
      "t",
      new arrow.Schema([new arrow.Field("id", new arrow.Int64(), false)]),
    );
  }

  it("installs and removes a bucket spec", async () => {
    const conn = await connect(tmpDir.name);
    const table = await makeTable(conn);

    await table.setUnenforcedPrimaryKey("id");
    await table.setLsmWriteSpec({
      specType: "bucket",
      column: "id",
      numBuckets: 4,
    });
    await table.unsetLsmWriteSpec();
    // A second unset errors — there is no spec left to remove.
    await expect(table.unsetLsmWriteSpec()).rejects.toThrow();
    // A fresh spec can be installed after unset.
    await table.setLsmWriteSpec({
      specType: "bucket",
      column: "id",
      numBuckets: 8,
    });
  });

  it("installs an unsharded spec", async () => {
    const conn = await connect(tmpDir.name);
    const table = await makeTable(conn);

    await table.setUnenforcedPrimaryKey("id");
    await table.setLsmWriteSpec({ specType: "unsharded" });
    await table.unsetLsmWriteSpec();
  });

  it("installs an identity spec", async () => {
    const conn = await connect(tmpDir.name);
    const table = await makeTable(conn);

    await table.setUnenforcedPrimaryKey("id");
    await table.setLsmWriteSpec({ specType: "identity", column: "id" });
    await table.unsetLsmWriteSpec();
  });

  it("rejects an invalid spec", async () => {
    const conn = await connect(tmpDir.name);
    const table = await makeTable(conn);

    await table.setUnenforcedPrimaryKey("id");
    // num_buckets out of range.
    await expect(
      table.setLsmWriteSpec({
        specType: "bucket",
        column: "id",
        numBuckets: 0,
      }),
    ).rejects.toThrow();
    // Column mismatch.
    await expect(
      table.setLsmWriteSpec({
        specType: "bucket",
        column: "missing",
        numBuckets: 4,
      }),
    ).rejects.toThrow();
  });
});

describe("LSM merge insert", () => {
  let tmpDir: tmp.DirResult;

  beforeEach(() => {
    tmpDir = tmp.dirSync({ unsafeCleanup: true });
  });
  afterEach(() => tmpDir.removeCallback());

  async function bucketTable(conn: Connection): Promise<Table> {
    // The primary key column must be non-nullable.
    const table = await conn.createEmptyTable(
      "t",
      new arrow.Schema([
        new arrow.Field("id", new arrow.Utf8(), false),
        new arrow.Field("value", new arrow.Float64(), true),
      ]),
    );
    await table.add([
      { id: "a", value: 1 },
      { id: "b", value: 2 },
    ]);
    await table.setUnenforcedPrimaryKey("id");
    // numBuckets = 1: every row routes to the single bucket.
    await table.setLsmWriteSpec({
      specType: "bucket",
      column: "id",
      numBuckets: 1,
    });
    return table;
  }

  it("routes merge_insert through the shard writer", async () => {
    const conn = await connect(tmpDir.name);
    const table = await bucketTable(conn);

    const res = await table
      .mergeInsert("id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .execute([
        { id: "c", value: 3 },
        { id: "d", value: 4 },
      ]);
    // LSM path: rows go to the MemWAL, so only numRows is populated.
    expect(res.numRows).toBe(2);
    expect(res.version).toBe(0);
    expect(res.numInsertedRows).toBe(0);

    await table.closeLsmWriters();
  });

  it("falls back to the standard path with useLsmWrite(false)", async () => {
    const conn = await connect(tmpDir.name);
    const table = await bucketTable(conn);

    const res = await table
      .mergeInsert("id")
      .whenNotMatchedInsertAll()
      .useLsmWrite(false)
      .execute([
        { id: "b", value: 9 },
        { id: "e", value: 5 },
      ]);
    // Standard path commits: id="e" inserted ("b" already exists).
    expect(res.numInsertedRows).toBe(1);
    expect(await table.countRows()).toBe(3);
  });

  it("supports validateSingleShard(false)", async () => {
    const conn = await connect(tmpDir.name);
    const table = await bucketTable(conn);

    const res = await table
      .mergeInsert("id")
      .whenMatchedUpdateAll()
      .whenNotMatchedInsertAll()
      .validateSingleShard(false)
      .execute([{ id: "f", value: 6 }]);
    expect(res.numRows).toBe(1);
  });

  it("rejects a non-upsert merge under an LSM spec", async () => {
    const conn = await connect(tmpDir.name);
    const table = await bucketTable(conn);

    await expect(
      table
        .mergeInsert("id")
        .whenNotMatchedInsertAll()
        .execute([{ id: "g", value: 7 }]),
    ).rejects.toThrow();
  });
});
@@ -38,5 +38,14 @@ test("filtering examples", async () => {
    // --8<-- [start:sql_search]
    await tbl.query().where("id = 10").limit(10).toArray();
    // --8<-- [end:sql_search]

    // --8<-- [start:orderby_search]
    await tbl
      .query()
      .where("id > 10")
      .orderBy({ columnName: "id", ascending: false })
      .limit(5)
      .toArray();
    // --8<-- [end:orderby_search]
  });
});
@@ -1291,6 +1291,18 @@ export async function fromRecordBatchToBuffer(
  return Buffer.from(await writer.toUint8Array());
}

/**
 * Create a buffer containing a single record batch using the Arrow IPC Stream
 * serialization. Each call produces a self-contained Stream message (schema +
 * batch + EOS) suitable for incremental decode by `arrow_ipc::reader::StreamReader`.
 */
export async function fromRecordBatchToStreamBuffer(
  batch: RecordBatch,
): Promise<Buffer> {
  const writer = RecordBatchStreamWriter.writeAll([batch]);
  return Buffer.from(await writer.toUint8Array());
}

/**
 * Serialize an Arrow Table into a buffer using the Arrow IPC Stream serialization
 *
@@ -144,6 +144,19 @@ export interface DropNamespaceOptions {
  behavior?: "restrict" | "cascade";
}

export interface RenameTableOptions {
  /**
   * The namespace path of the table being renamed. Defaults to the root
   * namespace (`[]`) when omitted.
   */
  namespacePath?: string[];
  /**
   * The namespace path to move the table to as part of the rename. When
   * omitted the table stays in `namespacePath`.
   */
  newNamespacePath?: string[];
}

/**
 * A LanceDB Connection that allows you to open tables and create new ones.
 *
@@ -391,6 +404,24 @@ export abstract class Connection {
      isShallow?: boolean;
    },
  ): Promise<Table>;

  /**
   * Rename a table.
   *
   * Currently only supported by LanceDB Cloud. Local OSS connections and
   * namespace-backed connections (via {@link connectNamespace}) reject with
   * a "not supported" error.
   *
   * @param {string} currentName - The current name of the table.
   * @param {string} newName - The new name for the table.
   * @param {RenameTableOptions} options - Optional namespace paths. When
   *   `newNamespacePath` is omitted the table stays in `namespacePath`.
   */
  abstract renameTable(
    currentName: string,
    newName: string,
    options?: RenameTableOptions,
  ): Promise<void>;
}

/** @hideconstructor */
@@ -651,6 +682,19 @@ export class LocalConnection extends Connection {
      options?.behavior,
    );
  }

  async renameTable(
    currentName: string,
    newName: string,
    options?: RenameTableOptions,
  ): Promise<void> {
    return this.inner.renameTable(
      currentName,
      newName,
      options?.namespacePath ?? [],
      options?.newNamespacePath,
    );
  }
}

/**
@@ -42,6 +42,7 @@ export {
  AddResult,
  AddColumnsResult,
  AlterColumnsResult,
  UpdateFieldMetadataResult,
  DeleteResult,
  DropColumnsResult,
  UpdateResult,
@@ -71,6 +72,7 @@ export {
  CreateNamespaceResponse,
  DropNamespaceResponse,
  DescribeNamespaceResponse,
  RenameTableOptions,
} from "./connection";

export { Session } from "./native.js";
@@ -82,6 +84,7 @@ export {
  VectorQuery,
  TakeQuery,
  QueryExecutionOptions,
  ColumnOrdering,
  FullTextSearchOptions,
  RecordBatchIterator,
  FullTextQuery,
@@ -112,7 +115,10 @@ export {
  UpdateOptions,
  OptimizeOptions,
  Version,
  WriteProgress,
  LsmWriteSpec,
  ColumnAlteration,
  FieldMetadataUpdate,
} from "./table";

export {
@@ -126,6 +132,7 @@ export { MergeInsertBuilder, WriteExecutionOptions } from "./merge";

export * as embedding from "./embedding";
export { permutationBuilder, PermutationBuilder } from "./permutation";
export { Scannable, ScannableOptions } from "./scannable";
export * as rerankers from "./rerankers";
export {
  SchemaLike,
@@ -87,6 +87,41 @@ export class MergeInsertBuilder {
      this.#schema,
    );
  }
  /**
   * Controls whether the merge uses the MemWAL LSM write path.
   *
   * By default (unset), a `mergeInsert` on a table with an LSM write spec is
   * routed through Lance's MemWAL shard writer, and a table without one uses
   * the standard path. Pass `false` to force the standard path even when a
   * spec is set. Pass `true` to require a spec — `mergeInsert` rejects if none
   * is installed.
   *
   * @param useLsmWrite - Whether to use the LSM write path.
   */
  useLsmWrite(useLsmWrite: boolean): MergeInsertBuilder {
    return new MergeInsertBuilder(
      this.#native.useLsmWrite(useLsmWrite),
      this.#schema,
    );
  }
  /**
   * Controls how an LSM merge checks that its input targets a single shard.
   *
   * When a table has an LSM write spec, every row in a `mergeInsert` call must
   * route to the same shard. When `true` (the default), every row is inspected
   * to verify this. When `false`, only the first row is inspected and the
   * shard it routes to is used for the whole input — a faster path for callers
   * that have already pre-sharded their input. Has no effect on tables without
   * an LSM write spec.
   *
   * @param validateSingleShard - Whether to check every row routes to one shard. Defaults to `true`.
   */
  validateSingleShard(validateSingleShard: boolean): MergeInsertBuilder {
    return new MergeInsertBuilder(
      this.#native.validateSingleShard(validateSingleShard),
      this.#schema,
    );
  }
  /**
   * Executes the merge insert operation
   *
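The `validateSingleShard` doc comment above describes two modes: inspect every row, or trust the first row. The sketch below illustrates the difference under loud assumptions: `bucketOf` is a made-up string hash and does not reproduce Lance's bucket routing, and `resolveShard` is a hypothetical name. Only the control flow (check all rows versus check the first row) is taken from the doc comment.

```typescript
// Hypothetical routing function: a simple polynomial string hash modulo the
// bucket count. The real routing used by Lance is not shown in this diff.
function bucketOf(key: string, numBuckets: number): number {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % numBuckets;
}

// Hypothetical shard resolution mirroring the documented behaviour:
// - validateSingleShard = true: every key must route to the same bucket.
// - validateSingleShard = false: only the first key is inspected.
function resolveShard(
  keys: string[],
  numBuckets: number,
  validateSingleShard = true,
): number {
  const [firstKey] = keys;
  if (firstKey === undefined) {
    // Assumption for this sketch only: empty input is treated as an error.
    throw new Error("empty input has no shard");
  }
  const shard = bucketOf(firstKey, numBuckets);
  if (!validateSingleShard) {
    return shard;
  }
  for (const key of keys) {
    if (bucketOf(key, numBuckets) !== shard) {
      throw new Error(`key "${key}" routes to a different shard than "${firstKey}"`);
    }
  }
  return shard;
}

// With a single bucket every key routes to bucket 0, as in the tests above.
const singleBucket = resolveShard(["c", "d"], 1);

// With two buckets, "a" and "b" land in different buckets under this hash.
let mixedRejected = false;
try {
  resolveShard(["a", "b"], 2);
} catch {
  mixedRejected = true;
}
// Skipping validation trusts the first row and does not notice the mismatch.
const trusted = resolveShard(["a", "b"], 2, false);
```

The trade-off is the one the doc comment names: skipping validation avoids a pass over the input, but a caller that has not pre-sharded its data will not get an error.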
@@ -79,6 +79,12 @@ export interface QueryExecutionOptions {
  timeoutMs?: number;
}

export interface ColumnOrdering {
  columnName: string;
  ascending?: boolean;
  nullsFirst?: boolean;
}

/**
 * Options that control the behavior of a full text search
 */
@@ -417,6 +423,21 @@ export class StandardQueryBase<
    return this;
  }

  /**
   * Sort the results by the specified column(s).
   * @returns This query builder.
   */
  orderBy(ordering: ColumnOrdering | ColumnOrdering[]): this {
    const orderings = Array.isArray(ordering) ? ordering : [ordering];
    const normalized = orderings.map((o) => ({
      columnName: o.columnName,
      ascending: o.ascending ?? true,
      nullsFirst: o.nullsFirst ?? false,
    }));
    this.doCall((inner) => inner.orderBy(normalized));
    return this;
  }

  /**
   * Skip searching un-indexed data. This can make search faster, but will miss
   * any data that is not yet indexed.
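The `orderBy` method above accepts either one `ColumnOrdering` or an array, and fills in defaults before handing the list to the native query. The normalization step can be tried in isolation with the sketch below. `normalizeOrderings` is a hypothetical standalone name for logic that lives inline in `orderBy`, and the interface is copied locally so the snippet runs without the library.

```typescript
// Local copy of the ColumnOrdering shape from the diff.
interface ColumnOrdering {
  columnName: string;
  ascending?: boolean;
  nullsFirst?: boolean;
}

// Hypothetical standalone version of the normalization inside orderBy():
// a single ordering becomes a one-element list, `ascending` defaults to
// true and `nullsFirst` defaults to false.
function normalizeOrderings(
  ordering: ColumnOrdering | ColumnOrdering[],
): Required<ColumnOrdering>[] {
  const orderings = Array.isArray(ordering) ? ordering : [ordering];
  return orderings.map((o) => ({
    columnName: o.columnName,
    ascending: o.ascending ?? true,
    nullsFirst: o.nullsFirst ?? false,
  }));
}

// Single ordering, as in the docs example: descending by id.
const single = normalizeOrderings({ columnName: "id", ascending: false });

// Multiple orderings with mixed explicit and defaulted fields.
const multi = normalizeOrderings([
  { columnName: "category" },
  { columnName: "score", ascending: false, nullsFirst: true },
]);
```

Using `??` instead of `||` matters here: an explicit `ascending: false` must be kept, and only `undefined` falls back to the default.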
274
nodejs/lancedb/scannable.ts
Normal file
@@ -0,0 +1,274 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors

import {
  Table as ArrowTable,
  RecordBatch,
  RecordBatchReader,
  Schema,
} from "apache-arrow";
import {
  fromRecordBatchToStreamBuffer,
  fromTableToBuffer,
  makeEmptyTable,
} from "./arrow";
import { NapiScannable } from "./native.js";

export interface ScannableOptions {
  /** Hint about the number of rows. Not validated against the stream. */
  numRows?: number;
  /**
   * Whether the source can be scanned more than once. Defaults to `true` for
   * `fromTable` / `fromFactory` and `false` for `fromIterable` /
   * `fromRecordBatchReader`.
   */
  rescannable?: boolean;
}

/**
 * A data source that can be scanned as a stream of Arrow `RecordBatch`es.
 *
 * `Scannable` wraps the schema + optional row count + rescannable flag and
 * a callback that yields batches one at a time. It is passed to consumers
 * (e.g. `Table.add`, `createTable`, `mergeInsert` — follow-up work) that
 * need to pull data without materializing the full dataset in JS memory.
 *
 * Batches cross the JS↔Rust boundary as Arrow IPC Stream messages; a fresh
 * writer serializes each batch, and the Rust side decodes it with
 * `arrow_ipc::reader::StreamReader`. One batch is in flight at a time.
 */
export class Scannable {
  readonly schema: Schema;
  readonly numRows: number | null;
  readonly rescannable: boolean;

  /** @hidden */
  private readonly native: NapiScannable;

  private constructor(
    native: NapiScannable,
    schema: Schema,
    numRows: number | null,
    rescannable: boolean,
  ) {
    this.native = native;
    this.schema = schema;
    this.numRows = numRows;
    this.rescannable = rescannable;
  }

  /** @hidden Access the native handle for passing through to Rust consumers. */
  get inner(): NapiScannable {
    return this.native;
  }

  /**
   * Build a Scannable from an explicit schema and a factory that returns a
   * fresh batch iterator on each call.
   *
   * The factory is invoked once per scan. Each iterator yields
   * `RecordBatch`es matching the declared schema. Use this when you need
   * direct control over the pull loop — for example, to wrap a streaming
   * source whose batches are produced lazily.
   *
   * @param schema - The Arrow schema of the produced batches.
   * @param factory - Called at the start of each scan to produce a batch
   *   iterator. Must be idempotent when `rescannable` is true.
   * @param opts - Optional hints. `rescannable` defaults to `true`; set to
   *   `false` if calling `factory()` twice would not reproduce the same data.
   */
  static async fromFactory(
    schema: Schema,
    factory: () =>
      | AsyncIterable<RecordBatch>
      | Iterable<RecordBatch>
      | AsyncIterator<RecordBatch>
      | Iterator<RecordBatch>,
    opts: ScannableOptions = {},
  ): Promise<Scannable> {
    const numRows = opts.numRows ?? null;
    if (numRows != null && !Number.isInteger(numRows)) {
      throw new TypeError("numRows must be an integer");
    }
    const rescannable = opts.rescannable ?? true;

    let iter: AsyncIterator<RecordBatch> | Iterator<RecordBatch> | null = null;
    const getNextBatch = async (isStart: boolean): Promise<Buffer | null> => {
      // `isStart` is true on the first pull of every new scan_as_stream.
      // Drop any cached iterator so factory() is re-invoked for the next scan
      if (isStart) {
        iter = null;
      }
      if (iter === null) {
        iter = normalizeIterator(factory());
      }
      const result = await iter.next();
      if (result.done) {
        iter = null;
        return null;
      }
      return fromRecordBatchToStreamBuffer(result.value);
    };

    const schemaBuf = await fromTableToBuffer(makeEmptyTable(schema));
    const native = new NapiScannable(
      schemaBuf,
      numRows,
      rescannable,
      getNextBatch,
    );
    return new Scannable(native, schema, numRows, rescannable);
  }

  /**
   * Build a Scannable from an in-memory Arrow `Table`. Always rescannable;
   * the table's batches are replayed on each scan.
   *
   * The table's row count is authoritative: `opts.numRows` must either be
   * omitted or equal to `table.numRows`. `opts.rescannable` of `false` is
   * rejected because in-memory Tables are always rescannable.
   */
  static async fromTable(
    table: ArrowTable,
    opts: ScannableOptions = {},
  ): Promise<Scannable> {
    if (opts.numRows != null && opts.numRows !== table.numRows) {
      throw new TypeError(
        `opts.numRows (${opts.numRows}) does not match table.numRows (${table.numRows}). ` +
          `The table's row count is authoritative; omit numRows or pass the matching value.`,
      );
    }
    if (opts.rescannable === false) {
      throw new TypeError(
        `fromTable does not accept rescannable: false. ` +
          `In-memory Arrow Tables are always rescannable; omit the option or pass true.`,
      );
    }
    return Scannable.fromFactory(table.schema, () => table.batches, {
      numRows: table.numRows,
      rescannable: true,
    });
  }

  /**
   * Build a Scannable from an iterable of `RecordBatch`es. `rescannable`
   * defaults to `false`. Pass an explicit schema so the consumer can
   * validate before any batch is pulled.
   *
   * `opts.rescannable: true` is honest for replayable iterables (Arrays,
   * Sets, or custom iterables whose `[Symbol.iterator]()` returns a fresh
   * iterator each call). It is rejected for one-shot iterables (generators,
   * async generators, or already-an-iterator inputs) because their
   * `[Symbol.iterator]()` returns the same exhausted object on the second
   * scan. For replayable sources outside this shape, use
   * `fromFactory(schema, () => createIter(), { rescannable: true })`.
   *
   * Note: when `opts.rescannable` is `true`, the constructor calls
   * `[Symbol.iterator]()` once on the input to perform the structural check.
   */
  static async fromIterable(
    schema: Schema,
    iter: AsyncIterable<RecordBatch> | Iterable<RecordBatch>,
    opts: ScannableOptions = {},
  ): Promise<Scannable> {
    if (opts.rescannable === true && isOneShotIterable(iter)) {
      throw new TypeError(
        `fromIterable: rescannable: true is not honest for one-shot iterables ` +
          `(generators, async generators, or iterators where [Symbol.iterator]() ` +
          `returns the same object). The source would be exhausted after the first scan. ` +
          `Use fromFactory(schema, () => createIter(), { rescannable: true }) for sources ` +
          `where each call mints a fresh iterator.`,
      );
    }
    return Scannable.fromFactory(schema, () => iter, {
      numRows: opts.numRows,
      rescannable: opts.rescannable ?? false,
    });
  }

/**
|
||||
* Build a Scannable from an Arrow `RecordBatchReader`. A reader can only
|
||||
* be consumed once; `rescannable` defaults to `false`.
|
||||
*
|
||||
* The reader must already be opened (via `.open()`) so its `.schema` is
|
||||
* populated. `RecordBatchReader.from(...)` returns an unopened reader.
|
||||
*
|
||||
* `opts.rescannable: true` is rejected because `RecordBatchReader` is a
|
||||
* self-iterator (its `[Symbol.iterator]()` returns itself), and this
|
||||
* constructor does not call `reader.reset()` between scans, so a second
|
||||
* scan would always see an exhausted reader. For genuinely replayable
|
||||
* sources, use
|
||||
* `fromFactory(schema, () => openReader(), { rescannable: true })`,
|
||||
* which mints a fresh reader on each scan.
|
||||
*/
|
||||
static async fromRecordBatchReader(
|
||||
reader: RecordBatchReader,
|
||||
opts: ScannableOptions = {},
|
||||
): Promise<Scannable> {
|
||||
if (opts.rescannable === true) {
|
||||
throw new TypeError(
|
||||
`fromRecordBatchReader does not accept rescannable: true. ` +
|
||||
`RecordBatchReader is a self-iterator (its [Symbol.iterator]() ` +
|
||||
`returns itself) and would be exhausted after the first scan. ` +
|
||||
`Use fromFactory(schema, () => openReader(), { rescannable: true }) ` +
|
||||
`for sources where each call mints a fresh reader.`,
|
||||
);
|
||||
}
|
||||
return Scannable.fromFactory(reader.schema, () => reader, {
|
||||
numRows: opts.numRows,
|
||||
rescannable: false,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
function normalizeIterator<T>(
|
||||
source: AsyncIterable<T> | Iterable<T> | AsyncIterator<T> | Iterator<T>,
|
||||
): AsyncIterator<T> | Iterator<T> {
|
||||
if (source == null) {
|
||||
throw new TypeError("Scannable factory returned null/undefined");
|
||||
}
|
||||
if (
|
||||
typeof (source as AsyncIterable<T>)[Symbol.asyncIterator] === "function"
|
||||
) {
|
||||
return (source as AsyncIterable<T>)[Symbol.asyncIterator]();
|
||||
}
|
||||
if (typeof (source as Iterable<T>)[Symbol.iterator] === "function") {
|
||||
return (source as Iterable<T>)[Symbol.iterator]();
|
||||
}
|
||||
// Already an iterator (has `.next`).
|
||||
if (typeof (source as Iterator<T>).next === "function") {
|
||||
return source as Iterator<T>;
|
||||
}
|
||||
throw new TypeError("Scannable factory returned a non-iterable value");
|
||||
}
|
||||
|
||||
// A "self-iterator" returns the same object from `[Symbol.iterator]()` /
|
||||
// `[Symbol.asyncIterator]()`. Generators behave this way, so they exhaust
|
||||
// after one pass. Replayable iterables (Array, Set, custom) return a fresh
|
||||
// iterator each call. Detection mirrors `normalizeIterator`'s ordering so
|
||||
// classification matches scan-time behavior.
|
||||
function isOneShotIterable(
|
||||
source: AsyncIterable<unknown> | Iterable<unknown>,
|
||||
): boolean {
|
||||
// null/undefined are not one-shot in any meaningful sense; let
|
||||
// `normalizeIterator` raise the actual error at scan time.
|
||||
if (source == null) return false;
|
||||
const ref = source as unknown;
|
||||
if (
|
||||
typeof (source as AsyncIterable<unknown>)[Symbol.asyncIterator] ===
|
||||
"function"
|
||||
) {
|
||||
const it = (source as AsyncIterable<unknown>)[
|
||||
Symbol.asyncIterator
|
||||
]() as unknown;
|
||||
return it === ref;
|
||||
}
|
||||
if (typeof (source as Iterable<unknown>)[Symbol.iterator] === "function") {
|
||||
const it = (source as Iterable<unknown>)[Symbol.iterator]() as unknown;
|
||||
return it === ref;
|
||||
}
|
||||
// Already-an-iterator (has `.next` but no `Symbol.iterator`) is by
|
||||
// definition one-shot.
|
||||
if (typeof (source as { next?: unknown }).next === "function") return true;
|
||||
return false;
|
||||
}
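The self-iterator check above can be tried in isolation on built-in iterables. This is a minimal sketch, not the LanceDB implementation: `returnsSelf` is a hypothetical helper name, and it covers only the synchronous `[Symbol.iterator]` branch.

```typescript
// Hypothetical helper (not part of LanceDB): reports whether a synchronous
// iterable hands back itself from [Symbol.iterator](), i.e. is one-shot.
function returnsSelf(source: Iterable<unknown>): boolean {
  const it: unknown = source[Symbol.iterator]();
  return it === source;
}

function* gen(): Generator<number> {
  yield 1;
  yield 2;
}

const generatorObject = gen();
const array = [1, 2];

// A generator object is its own iterator, so it cannot be replayed.
const generatorIsOneShot = returnsSelf(generatorObject);
// An Array mints a fresh iterator on every call, so it can be replayed.
const arrayIsOneShot = returnsSelf(array);

// Draining the generator twice shows the practical consequence.
const firstPass = Array.from(generatorObject);
const secondPass = Array.from(generatorObject);
```

Calling `[Symbol.iterator]()` on a generator object does not advance it, which is why the check is safe to run before the first scan.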
|
||||
@@ -32,6 +32,7 @@ import {
|
||||
OptimizeStats,
|
||||
TableStatistics,
|
||||
Tags,
|
||||
UpdateFieldMetadataResult,
|
||||
UpdateResult,
|
||||
Table as _NativeTable,
|
||||
} from "./native";
|
||||
@@ -46,6 +47,33 @@ import { sanitizeType } from "./sanitize";
|
||||
import { IntoSql, toSQL } from "./util";
|
||||
export { IndexConfig } from "./native";
|
||||
|
||||
/**
|
||||
* Progress snapshot for a write operation, delivered to the `progress`
|
||||
* callback passed to {@link Table.add}.
|
||||
*/
|
||||
export interface WriteProgress {
|
||||
/** Number of rows written so far. */
|
||||
outputRows: number;
|
||||
/** Number of bytes written so far. */
|
||||
outputBytes: number;
|
||||
/**
|
||||
* Total rows expected, when the input source reports it.
|
||||
*
|
||||
* Always set on the final callback (the one with `done: true`), falling
|
||||
* back to the actual number of rows written when the source could not
|
||||
* report a row count up front.
|
||||
*/
|
||||
totalRows?: number;
|
||||
/** Wall-clock seconds since the write started. */
|
||||
elapsedSeconds: number;
|
||||
/** Number of parallel write tasks currently in flight. */
|
||||
activeTasks: number;
|
||||
/** Total number of parallel write tasks (the write parallelism). */
|
||||
totalTasks: number;
|
||||
/** `true` for the final callback; `false` otherwise. */
|
||||
done: boolean;
|
||||
}
|
||||
|
||||
/**
|
||||
* Options for adding data to a table.
|
||||
*/
|
||||
@@ -56,6 +84,28 @@ export interface AddDataOptions {
|
||||
* If "overwrite" then the new data will replace the existing data in the table.
|
||||
*/
|
||||
mode: "append" | "overwrite";
|
||||
|
||||
/**
|
||||
* Optional callback invoked periodically with write progress.
|
||||
*
|
||||
* The callback is fired once per batch written and once more with
|
||||
* `done: true` when the write completes. Calls are dispatched
|
||||
* asynchronously to the JS event loop and never block the write — a slow
|
||||
* callback will queue events rather than back-pressure the writer.
|
||||
*
|
||||
* Errors thrown from the callback are logged with `console.warn` and
|
||||
* swallowed — they do not abort the write.
|
||||
*
|
||||
* @example
|
||||
* ```ts
|
||||
* await table.add(data, {
|
||||
* progress: (p) => {
|
||||
* console.log(`${p.outputRows}/${p.totalRows ?? "?"} rows`);
|
||||
* },
|
||||
* });
|
||||
* ```
|
||||
*/
|
||||
progress: (progress: WriteProgress) => void;
|
||||
}
|
||||
|
||||
export interface UpdateOptions {
|
||||
@@ -106,6 +156,30 @@ export interface Version {
|
||||
metadata: Record<string, string>;
|
||||
}
|
||||
|
||||
/**
|
||||
* Specification selecting Lance's MemWAL LSM-style write path for
|
||||
* `mergeInsert`.
|
||||
*
|
||||
* `specType` is `"bucket"`, `"identity"`, or `"unsharded"`. For `"bucket"`,
|
||||
* `column` and `numBuckets` are required; for `"identity"`, `column` is
|
||||
* required and must be a deterministic function of the unenforced primary
|
||||
* key (every row with a given primary key must always produce the same
|
||||
* `column` value, or upserts of that key can land in different shards and a
|
||||
* stale version can win).
|
||||
*/
|
||||
export interface LsmWriteSpec {
|
||||
/** One of `"bucket"`, `"identity"`, or `"unsharded"`. */
|
||||
specType: "bucket" | "identity" | "unsharded";
|
||||
/** Bucket and identity variants: the sharding column. */
|
||||
column?: string;
|
||||
/** Bucket variant: the number of buckets, in `[1, 1024]`. */
|
||||
numBuckets?: number;
|
||||
/** Names of indexes the MemWAL should keep up to date during writes. */
|
||||
maintainedIndexes?: string[];
|
||||
/** Default `ShardWriter` configuration recorded in the MemWAL index. */
|
||||
writerConfigDefaults?: Record<string, string>;
|
||||
}
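The field requirements documented for `LsmWriteSpec` can be expressed as a small client-side check. This is a sketch under assumptions: `SpecLike` and `checkSpec` are hypothetical names, the integer check on `numBuckets` is an assumption, and the authoritative validation happens in the native layer, which may enforce more than is shown here.

```typescript
// Local mirror of the interface above, trimmed to the fields checked here.
interface SpecLike {
  specType: "bucket" | "identity" | "unsharded";
  column?: string;
  numBuckets?: number;
}

// Hypothetical pre-flight check; returns a list of problems (empty if none).
function checkSpec(spec: SpecLike): string[] {
  const problems: string[] = [];
  // Both "bucket" and "identity" need a sharding column.
  if (spec.specType !== "unsharded" && !spec.column) {
    problems.push(`"${spec.specType}" requires column`);
  }
  // "bucket" additionally needs a bucket count in [1, 1024].
  if (spec.specType === "bucket") {
    const n = spec.numBuckets;
    if (n === undefined || !Number.isInteger(n) || n < 1 || n > 1024) {
      problems.push(`"bucket" requires numBuckets in [1, 1024]`);
    }
  }
  return problems;
}

const okBucket = checkSpec({ specType: "bucket", column: "id", numBuckets: 16 });
const emptyBucket = checkSpec({ specType: "bucket" });
const tooManyBuckets = checkSpec({ specType: "bucket", column: "id", numBuckets: 2048 });
const bareIdentity = checkSpec({ specType: "identity" });
const unsharded = checkSpec({ specType: "unsharded" });
```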
|
||||
|
||||
/**
|
||||
* A Table is a collection of Records in a LanceDB Database.
|
||||
*
|
||||
@@ -435,6 +509,18 @@ export abstract class Table {
|
||||
abstract alterColumns(
|
||||
columnAlterations: ColumnAlteration[],
|
||||
): Promise<AlterColumnsResult>;
|
||||
|
||||
/**
|
||||
* Update per-field (column) metadata.
|
||||
* @param {FieldMetadataUpdate[]} updates One or more per-field updates. Each
|
||||
* update's metadata is merged into the field's existing metadata by default;
|
||||
* a value of `null` deletes that key, and `replace: true` swaps the whole map.
|
||||
* @returns {Promise<UpdateFieldMetadataResult>} resolves to the new table version.
|
||||
*/
|
||||
abstract updateFieldMetadata(
|
||||
updates: FieldMetadataUpdate[],
|
||||
): Promise<UpdateFieldMetadataResult>;
|
||||
|
||||
/**
|
||||
* Drop one or more columns from the dataset
|
||||
*
|
||||
@@ -449,6 +535,64 @@ export abstract class Table {
|
||||
* containing the new version number of the table after dropping the columns.
|
||||
*/
|
||||
abstract dropColumns(columnNames: string[]): Promise<DropColumnsResult>;
|
||||
/**
|
||||
* Set the unenforced primary key for this table to a single column.
|
||||
*
|
||||
* "Unenforced" means LanceDB does not check uniqueness on writes; the
|
||||
* column is recorded in the schema as the primary key for use by features
|
||||
* such as `mergeInsert`. Only single-column primary keys are supported,
|
||||
* and the key cannot be changed once set.
|
||||
* @param {string | string[]} columns The primary key column. A one-element
|
||||
* array is also accepted; passing more than one column is rejected.
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
abstract setUnenforcedPrimaryKey(columns: string | string[]): Promise<void>;
|
||||
/**
|
||||
* Install an {@link LsmWriteSpec} on this table, selecting Lance's MemWAL
|
||||
* LSM-style write path for future `mergeInsert` calls.
|
||||
*
|
||||
* `LsmWriteSpec` chooses one of three sharding strategies via `specType`:
|
||||
*
|
||||
* - `"bucket"` — hash-bucket writes by the single-column unenforced primary
|
||||
* key (`column` and `numBuckets` required).
|
||||
* - `"identity"` — shard by the raw value of a scalar `column`.
|
||||
* - `"unsharded"` — route every write to a single shard.
|
||||
*
|
||||
* All variants require the table to have an unenforced primary key
|
||||
* ({@link Table#setUnenforcedPrimaryKey}); bucket sharding additionally
|
||||
* requires it to be the single column being bucketed.
|
||||
* @param {LsmWriteSpec} spec The sharding spec to install.
|
||||
* @returns {Promise<void>}
|
||||
* @example
|
||||
* ```ts
|
||||
* await table.setUnenforcedPrimaryKey("id");
|
||||
* await table.setLsmWriteSpec({
|
||||
* specType: "bucket",
|
||||
* column: "id",
|
||||
* numBuckets: 16,
|
||||
* maintainedIndexes: ["id_idx"],
|
||||
* });
|
||||
* ```
|
||||
*/
|
||||
abstract setLsmWriteSpec(spec: LsmWriteSpec): Promise<void>;
|
||||
/**
|
||||
* Remove the {@link LsmWriteSpec} from this table, reverting to the standard
|
||||
* `mergeInsert` write path.
|
||||
*
|
||||
* Errors if no spec is currently set.
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
abstract unsetLsmWriteSpec(): Promise<void>;
|
||||
/**
|
||||
* Drain and close any cached MemWAL shard writers held for this table.
|
||||
*
|
||||
* When an {@link LsmWriteSpec} is installed, `mergeInsert` opens MemWAL
|
||||
* shard writers and caches them for reuse across calls. This closes them,
|
||||
* flushing pending data; writers reopen lazily on the next `mergeInsert`.
|
||||
* It is a no-op when no writers are cached.
|
||||
* @returns {Promise<void>}
|
||||
*/
|
||||
abstract closeLsmWriters(): Promise<void>;
|
||||
/** Retrieve the version of the table */
|
||||
|
||||
abstract version(): Promise<number>;
|
||||
@@ -636,7 +780,20 @@ export class LocalTable extends Table {
|
||||
const schema = await this.schema();
|
||||
|
||||
const buffer = await fromDataToBuffer(data, undefined, schema);
|
||||
return await this.inner.add(buffer, mode);
|
||||
// Wrap the user callback so a thrown error doesn't surface as an
|
||||
// unhandled exception (the callback fires from a napi threadsafe
|
||||
// function — exceptions there crash the process).
|
||||
const userProgress = options?.progress;
|
||||
const progress = userProgress
|
||||
? (p: WriteProgress) => {
|
||||
try {
|
||||
userProgress(p);
|
||||
} catch (e) {
|
||||
console.warn("Table.add progress callback threw:", e);
|
||||
}
|
||||
}
|
||||
: undefined;
|
||||
return await this.inner.add(buffer, mode, progress);
|
||||
}
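The callback-wrapping pattern used in `add` above can be sketched on its own. The names `guardCallback` and `ProgressLike` are hypothetical and exist only for this illustration; the diff itself reports errors with `console.warn`, whereas this sketch takes an `onError` handler so the behavior is observable.

```typescript
// Shape assumed for illustration; a trimmed stand-in for WriteProgress.
interface ProgressLike {
  outputRows: number;
  done: boolean;
}

// Hypothetical helper: wraps a callback so a thrown error is reported
// through `onError` instead of propagating to the caller.
function guardCallback<T>(
  callback: ((value: T) => void) | undefined,
  onError: (error: unknown) => void,
): ((value: T) => void) | undefined {
  if (callback === undefined) return undefined;
  return (value: T) => {
    try {
      callback(value);
    } catch (error) {
      onError(error);
    }
  };
}

const caught: unknown[] = [];
const seen: number[] = [];
const guarded = guardCallback<ProgressLike>(
  (p) => {
    seen.push(p.outputRows);
    if (p.done) throw new Error("boom");
  },
  (e) => {
    caught.push(e);
  },
);

guarded?.({ outputRows: 10, done: false });
guarded?.({ outputRows: 20, done: true }); // throws inside, but is contained
```

Returning `undefined` when no callback is supplied keeps the optional argument optional all the way down to the native call.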
|
||||
|
||||
async update(
|
||||
@@ -893,10 +1050,33 @@ export class LocalTable extends Table {
|
||||
return await this.inner.alterColumns(processedAlterations);
|
||||
}
|
||||
|
||||
async updateFieldMetadata(
|
||||
updates: FieldMetadataUpdate[],
|
||||
): Promise<UpdateFieldMetadataResult> {
|
||||
return await this.inner.updateFieldMetadata(updates);
|
||||
}
|
||||
|
||||
async dropColumns(columnNames: string[]): Promise<DropColumnsResult> {
|
||||
return await this.inner.dropColumns(columnNames);
|
||||
}
|
||||
|
||||
async setUnenforcedPrimaryKey(columns: string | string[]): Promise<void> {
|
||||
const cols = typeof columns === "string" ? [columns] : columns;
|
||||
return await this.inner.setUnenforcedPrimaryKey(cols);
|
||||
}
|
||||
|
||||
async setLsmWriteSpec(spec: LsmWriteSpec): Promise<void> {
|
||||
return await this.inner.setLsmWriteSpec(spec);
|
||||
}
|
||||
|
||||
async unsetLsmWriteSpec(): Promise<void> {
|
||||
return await this.inner.unsetLsmWriteSpec();
|
||||
}
|
||||
|
||||
async closeLsmWriters(): Promise<void> {
|
||||
return await this.inner.closeLsmWriters();
|
||||
}
|
||||
|
||||
async version(): Promise<number> {
|
||||
return await this.inner.version();
|
||||
}
|
||||
@@ -1042,3 +1222,19 @@ export interface ColumnAlteration {
|
||||
/** Set the new nullability. Note that a nullable column cannot be made non-nullable. */
|
||||
nullable?: boolean;
|
||||
}
|
||||
|
||||
/** A per-field metadata update, addressed by dot-path. */
|
||||
export interface FieldMetadataUpdate {
|
||||
/**
|
||||
* Dot-separated path to the field. For a top-level column this is just its
|
||||
* name; for a nested field it's the path, e.g. "a.b.c".
|
||||
*/
|
||||
path: string;
|
||||
/**
|
||||
* Metadata key/value pairs. Merged into the field's existing metadata by
|
||||
* default; a value of `null` deletes that key.
|
||||
*/
|
||||
metadata: Record<string, string | null>;
|
||||
/** If true, replace the field's entire metadata map instead of merging. */
|
||||
replace?: boolean;
|
||||
}
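The merge, delete, and replace semantics described for `FieldMetadataUpdate` can be modelled with a pure function. This is a hedged sketch: `applyMetadataUpdate` and `MetadataUpdate` are hypothetical names, the real update is applied by the native layer, and dropping `null` entries when `replace` is `true` is an assumption rather than documented behavior.

```typescript
// Local copy of the update shape described above, minus `path`.
interface MetadataUpdate {
  metadata: Record<string, string | null>;
  replace?: boolean;
}

// Hypothetical pure function modelling the documented semantics.
function applyMetadataUpdate(
  existing: Record<string, string>,
  update: MetadataUpdate,
): Record<string, string> {
  // `replace: true` starts from an empty map; otherwise merge into a copy.
  const result: Record<string, string> = update.replace ? {} : { ...existing };
  for (const key of Object.keys(update.metadata)) {
    const value = update.metadata[key];
    if (value == null) {
      delete result[key]; // a null value deletes the key
    } else {
      result[key] = value;
    }
  }
  return result;
}

const before = { unit: "ms", owner: "team-a" };
// Merge: overwrite `unit`, delete `owner`.
const merged = applyMetadataUpdate(before, {
  metadata: { unit: "s", owner: null },
});
// Replace: discard everything that was there before.
const replaced = applyMetadataUpdate(before, {
  metadata: { note: "x" },
  replace: true,
});
```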
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-darwin-arm64",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": ["darwin"],
|
||||
"cpu": ["arm64"],
|
||||
"main": "lancedb.darwin-arm64.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-arm64-gnu",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": ["linux"],
|
||||
"cpu": ["arm64"],
|
||||
"main": "lancedb.linux-arm64-gnu.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-arm64-musl",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": ["linux"],
|
||||
"cpu": ["arm64"],
|
||||
"main": "lancedb.linux-arm64-musl.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-x64-gnu",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": ["linux"],
|
||||
"cpu": ["x64"],
|
||||
"main": "lancedb.linux-x64-gnu.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-x64-musl",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": ["linux"],
|
||||
"cpu": ["x64"],
|
||||
"main": "lancedb.linux-x64-musl.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-win32-arm64-msvc",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": [
|
||||
"win32"
|
||||
],
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-win32-x64-msvc",
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"os": ["win32"],
|
||||
"cpu": ["x64"],
|
||||
"main": "lancedb.win32-x64-msvc.node",
|
||||
|
||||
11029
nodejs/package-lock.json
generated
Normal file
File diff suppressed because it is too large
@@ -11,7 +11,7 @@
|
||||
"ann"
|
||||
],
|
||||
"private": false,
|
||||
"version": "0.28.0-beta.11",
|
||||
"version": "0.30.1-beta.2",
|
||||
"main": "dist/index.js",
|
||||
"exports": {
|
||||
".": "./dist/index.js",
|
||||
|
||||
@@ -459,4 +459,23 @@ impl Connection {
|
||||
transaction_id: resp.transaction_id,
|
||||
})
|
||||
}
|
||||
|
||||
/// Rename a table. `current_namespace_path` and `new_namespace_path` default to
|
||||
/// the root namespace when omitted; the caller is expected to either pass both
|
||||
/// or pass neither.
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn rename_table(
|
||||
&self,
|
||||
current_name: String,
|
||||
new_name: String,
|
||||
current_namespace_path: Option<Vec<String>>,
|
||||
new_namespace_path: Option<Vec<String>>,
|
||||
) -> napi::Result<()> {
|
||||
let cur_ns = current_namespace_path.unwrap_or_default();
|
||||
let new_ns = new_namespace_path.unwrap_or_default();
|
||||
self.get_inner()?
|
||||
.rename_table(&current_name, &new_name, &cur_ns, &new_ns)
|
||||
.await
|
||||
.default_error()
|
||||
}
|
||||
}
|
||||
|
||||
@@ -16,6 +16,7 @@ pub mod permutation;
|
||||
mod query;
|
||||
pub mod remote;
|
||||
mod rerankers;
|
||||
mod scannable;
|
||||
mod session;
|
||||
mod table;
|
||||
mod util;
|
||||
@@ -23,15 +24,19 @@ mod util;
|
||||
#[napi(object)]
|
||||
#[derive(Debug)]
|
||||
pub struct ConnectionOptions {
|
||||
/// (For LanceDB OSS only): The interval, in seconds, at which to check for
|
||||
/// updates to the table from other processes. If None, then consistency is not
|
||||
/// checked. For performance reasons, this is the default. For strong
|
||||
/// consistency, set this to zero seconds. Then every read will check for
|
||||
/// updates from other processes. As a compromise, you can set this to a
|
||||
/// non-zero value for eventual consistency. If more than that interval
|
||||
/// has passed since the last check, then the table will be checked for updates.
|
||||
/// Note: this consistency only applies to read operations. Write operations are
|
||||
/// The interval, in seconds, at which to check for updates to the table
|
||||
/// from other processes. If None, then consistency is not checked. For
|
||||
/// performance reasons, this is the default. For strong consistency, set
|
||||
/// this to zero seconds. Then every read will check for updates from other
|
||||
/// processes. As a compromise, you can set this to a non-zero value for
|
||||
/// eventual consistency. If more than that interval has passed since the
|
||||
/// last check, then the table will be checked for updates. Note: this
|
||||
/// consistency only applies to read operations. Write operations are
|
||||
/// always consistent.
|
||||
///
|
||||
/// Stronger consistency is not free. The smaller the interval, the more
|
||||
/// often each read pays the cost of checking for updates against object
|
||||
/// storage, raising per-read latency and cost.
|
||||
pub read_consistency_interval: Option<f64>,
|
||||
/// (For LanceDB OSS only): configuration for object storage.
|
||||
///
|
||||
|
||||
@@ -50,6 +50,20 @@ impl NativeMergeInsertBuilder {
|
||||
this
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub fn use_lsm_write(&self, use_lsm_write: bool) -> Self {
|
||||
let mut this = self.clone();
|
||||
this.inner.use_lsm_write(use_lsm_write);
|
||||
this
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub fn validate_single_shard(&self, validate_single_shard: bool) -> Self {
|
||||
let mut this = self.clone();
|
||||
this.inner.validate_single_shard(validate_single_shard);
|
||||
this
|
||||
}
|
||||
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn execute(&self, buf: Buffer) -> napi::Result<MergeResult> {
|
||||
let data = ipc_file_to_batches(buf.to_vec())
|
||||
|
||||
@@ -3,6 +3,12 @@
|
||||
|
||||
use std::sync::Arc;
|
||||
|
||||
use crate::error::NapiErrorExt;
|
||||
use crate::error::convert_error;
|
||||
use crate::iterator::RecordBatchIterator;
|
||||
use crate::rerankers::RerankHybridCallbackArgs;
|
||||
use crate::rerankers::Reranker;
|
||||
use crate::util::{parse_distance_type, schema_to_buffer};
|
||||
use arrow_array::{
|
||||
Array, Float16Array as ArrowFloat16Array, Float32Array as ArrowFloat32Array,
|
||||
Float64Array as ArrowFloat64Array, UInt8Array as ArrowUInt8Array,
|
||||
@@ -19,16 +25,27 @@ use lancedb::query::QueryBase;
|
||||
use lancedb::query::QueryExecutionOptions;
|
||||
use lancedb::query::Select;
|
||||
use lancedb::query::TakeQuery as LanceDbTakeQuery;
|
||||
use lancedb::query::VectorQuery as LanceDbVectorQuery;
|
||||
use lancedb::query::{ColumnOrdering as LanceDbColumnOrdering, VectorQuery as LanceDbVectorQuery};
|
||||
use napi::bindgen_prelude::*;
|
||||
use napi_derive::napi;
|
||||
|
||||
use crate::error::NapiErrorExt;
|
||||
use crate::error::convert_error;
|
||||
use crate::iterator::RecordBatchIterator;
|
||||
use crate::rerankers::RerankHybridCallbackArgs;
|
||||
use crate::rerankers::Reranker;
|
||||
use crate::util::{parse_distance_type, schema_to_buffer};
|
||||
#[napi(object)]
|
||||
pub struct ColumnOrdering {
|
||||
pub ascending: bool,
|
||||
pub nulls_first: bool,
|
||||
pub column_name: String,
|
||||
}
|
||||
|
||||
impl From<ColumnOrdering> for LanceDbColumnOrdering {
|
||||
fn from(value: ColumnOrdering) -> Self {
|
||||
match (value.ascending, value.nulls_first) {
|
||||
(true, true) => Self::asc_nulls_first(value.column_name),
|
||||
(true, false) => Self::asc_nulls_last(value.column_name),
|
||||
(false, true) => Self::desc_nulls_first(value.column_name),
|
||||
(false, false) => Self::desc_nulls_last(value.column_name),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn bytes_to_arrow_array(data: Uint8Array, dtype: String) -> napi::Result<Arc<dyn Array>> {
|
||||
let buf = arrow_buffer::Buffer::from(data.to_vec());
|
||||
@@ -128,6 +145,18 @@ impl Query {
|
||||
self.inner = self.inner.clone().with_row_id();
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub fn order_by(&mut self, ordering: Option<Vec<ColumnOrdering>>) -> napi::Result<()> {
|
||||
let ordering = ordering.map(|ordering| {
|
||||
ordering
|
||||
.into_iter()
|
||||
.map(LanceDbColumnOrdering::from)
|
||||
.collect()
|
||||
});
|
||||
self.inner = self.inner.clone().order_by(ordering);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn output_schema(&self) -> napi::Result<Buffer> {
|
||||
let schema = self.inner.output_schema().await.default_error()?;
|
||||
@@ -328,6 +357,18 @@ impl VectorQuery {
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub fn order_by(&mut self, ordering: Option<Vec<ColumnOrdering>>) -> napi::Result<()> {
|
||||
let ordering = ordering.map(|ordering| {
|
||||
ordering
|
||||
.into_iter()
|
||||
.map(LanceDbColumnOrdering::from)
|
||||
.collect()
|
||||
});
|
||||
self.inner = self.inner.clone().order_by(ordering);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn output_schema(&self) -> napi::Result<Buffer> {
|
||||
let schema = self.inner.output_schema().await.default_error()?;
|
||||
|
||||
253
nodejs/src/scannable.rs
Normal file
@@ -0,0 +1,253 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
//! NodeJS binding for the [`lancedb::data::scannable::Scannable`] trait.
|
||||
//!
|
||||
//! The JS side supplies a `getNextBatch(isStart)` callback that returns the
|
||||
//! next Arrow `RecordBatch` encoded as a self-contained Arrow IPC Stream
|
||||
//! message (schema message + record batch message + EOS marker) wrapped in a
|
||||
//! `Buffer`, or `null` when the stream is exhausted. The Rust side parses
|
||||
//! each buffer with `arrow_ipc::reader::StreamReader`, validates every
|
||||
//! standalone batch stream against the declared schema, and yields decoded
|
||||
//! `RecordBatch`es as a [`SendableRecordBatchStream`].
|
||||
//!
|
||||
//! `isStart` is `true` on the first `getNextBatch` call of each new
|
||||
//! `scan_as_stream` and `false` thereafter. JS uses it to drop any cached
|
||||
//! iterator and re-invoke its factory at scan boundaries, so retries
|
||||
//! triggered by mid-stream failures restart at batch 0.
|
||||
|
||||
use std::io::Cursor;
|
||||
use std::sync::Arc;
|
||||
|
||||
use arrow_array::RecordBatch;
|
||||
use arrow_ipc::reader::StreamReader;
|
||||
use arrow_schema::SchemaRef;
|
||||
use futures::stream::once;
|
||||
use lancedb::arrow::{SendableRecordBatchStream, SimpleRecordBatchStream};
|
||||
use lancedb::data::scannable::Scannable as LanceScannable;
|
||||
use lancedb::ipc::ipc_file_to_schema;
|
||||
use lancedb::{Error, Result as LanceResult};
|
||||
use napi::bindgen_prelude::*;
|
||||
use napi::threadsafe_function::ThreadsafeFunction;
|
||||
use napi_derive::napi;
|
||||
|
||||
/// Threadsafe handle to the JS `getNextBatch` callback. The callback takes a
|
||||
/// single boolean `isStart` (`true` on the first call of each new scan) and
|
||||
/// returns a Promise that resolves to a `Buffer` containing one IPC Stream
|
||||
/// message, or `null` at end-of-stream.
|
||||
type GetNextBatchFn = ThreadsafeFunction<bool, Promise<Option<Buffer>>, bool, Status, false>;
|
||||
|
||||
/// A Rust-side view of a JS-constructed `Scannable`.
|
||||
///
|
||||
/// Held in JS as the return value of the `Scannable` class constructor. When
|
||||
/// passed to a consumer that accepts `impl lancedb::data::scannable::Scannable`,
|
||||
/// the consumer invokes `scan_as_stream()` to pull batches through the JS
|
||||
/// callback.
|
||||
#[napi]
|
||||
pub struct NapiScannable {
|
||||
schema: SchemaRef,
|
||||
num_rows: Option<usize>,
|
||||
rescannable: bool,
|
||||
// `ThreadsafeFunction` is not `Clone`; wrap in `Arc` so the stream
|
||||
// returned by `scan_as_stream` can own a handle independent of `self`.
|
||||
get_next_batch: Arc<GetNextBatchFn>,
|
||||
// Tracks whether a scan has already started; used to enforce one-shot
|
||||
// semantics on non-rescannable sources.
|
||||
scanned: bool,
|
||||
}
|
||||
|
||||
#[napi]
|
||||
impl NapiScannable {
|
||||
/// Construct a new `NapiScannable`.
|
||||
///
|
||||
/// - `schema_buf` — Arrow IPC File buffer carrying only the schema (no batches).
|
||||
/// - `num_rows` — optional row count hint; not validated against the stream.
|
||||
/// - `rescannable` — whether `get_next_batch` may be re-driven after the
|
||||
/// scan completes.
|
||||
/// - `get_next_batch` -- JS callback that yields the next batch as an Arrow
|
||||
/// IPC Stream message wrapped in a `Buffer`, or `null` at EOF. The
|
||||
/// `isStart` argument is `true` on the first call of each new scan;
|
||||
/// JS uses it to discard any cached iterator before pulling.
|
||||
#[napi(constructor)]
|
||||
pub fn new(
|
||||
schema_buf: Buffer,
|
||||
num_rows: Option<i64>,
|
||||
rescannable: bool,
|
||||
get_next_batch: Function<bool, Promise<Option<Buffer>>>,
|
||||
) -> napi::Result<Self> {
|
||||
let schema = ipc_file_to_schema(schema_buf.to_vec())
|
||||
.map_err(|e| napi::Error::from_reason(format!("Invalid schema buffer: {}", e)))?;
|
||||
let num_rows = num_rows
|
||||
.map(|n| {
|
||||
usize::try_from(n)
|
||||
.map_err(|_| napi::Error::from_reason("num_rows must be non-negative"))
|
||||
})
|
||||
.transpose()?;
|
||||
let get_next_batch = Arc::new(get_next_batch.build_threadsafe_function().build()?);
|
||||
Ok(Self {
|
||||
schema,
|
||||
num_rows,
|
||||
rescannable,
|
||||
get_next_batch,
|
||||
scanned: false,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl std::fmt::Debug for NapiScannable {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
f.debug_struct("NapiScannable")
|
||||
.field("schema", &self.schema)
|
||||
.field("num_rows", &self.num_rows)
|
||||
.field("rescannable", &self.rescannable)
|
||||
.finish()
|
||||
}
|
||||
}
|
||||
|
||||
impl LanceScannable for NapiScannable {
|
||||
fn schema(&self) -> SchemaRef {
|
||||
self.schema.clone()
|
||||
}
|
||||
|
||||
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
|
||||
let schema = self.schema.clone();
|
||||
|
||||
// One-shot enforcement for non-rescannable sources: return a stream
|
||||
// whose first item is an error.
|
||||
if self.scanned && !self.rescannable {
|
||||
let err_stream = once(async {
|
||||
Err(Error::InvalidInput {
|
||||
message: "Scannable has already been consumed (non-rescannable source)"
|
||||
.to_string(),
|
||||
})
|
||||
});
|
||||
return Box::pin(SimpleRecordBatchStream::new(err_stream, schema));
|
||||
}
|
||||
self.scanned = true;
|
||||
|
||||
let tsfn = Arc::clone(&self.get_next_batch);
|
||||
let declared_schema = schema.clone();
|
||||
|
||||
// State threaded through the unfold. `is_first_pull` starts true so
|
||||
// the first call into JS signals a new-scan boundary; JS uses it to
|
||||
// reset any cached iterator before factory()-ing a fresh one.
|
||||
let initial = State {
|
||||
tsfn,
|
||||
batch_index: 0,
|
||||
declared_schema,
|
||||
errored: false,
|
||||
is_first_pull: true,
|
||||
};
|
||||
|
||||
let stream = futures::stream::unfold(initial, |mut state| async move {
|
||||
if state.errored {
|
||||
return None;
|
||||
}
|
||||
|
||||
// Pull the next IPC Stream buffer from JS. `is_first_pull` is
|
||||
// consumed here and cleared so subsequent pulls continue the
|
||||
// same scan rather than restarting it.
|
||||
let is_start = state.is_first_pull;
|
||||
state.is_first_pull = false;
|
            let buf = match pull_next(&state.tsfn, is_start).await {
                Ok(Some(buf)) => buf,
                Ok(None) => return None,
                Err(e) => {
                    state.errored = true;
                    return Some((Err(e), state));
                }
            };

            match decode_one_batch(buf.as_ref(), &state.declared_schema) {
                Ok(batch) => {
                    state.batch_index += 1;
                    Some((Ok(batch), state))
                }
                Err(e) => {
                    let tagged = Error::Runtime {
                        message: format!(
                            "[scannable/rust-bridge] failure at batch index {}: {}",
                            state.batch_index, e
                        ),
                    };
                    state.errored = true;
                    Some((Err(tagged), state))
                }
            }
        });

        Box::pin(SimpleRecordBatchStream::new(stream, schema))
    }

    fn num_rows(&self) -> Option<usize> {
        self.num_rows
    }

    fn rescannable(&self) -> bool {
        self.rescannable
    }
}

struct State {
    tsfn: Arc<GetNextBatchFn>,
    batch_index: usize,
    declared_schema: SchemaRef,
    errored: bool,
    /// True for the very first pull of a new scan. Forwarded to JS so the
    /// callback can drop any cached iterator and call its factory fresh,
    /// which makes rescannable sources restart at batch 0 even when the
    /// previous scan ended mid-stream.
    is_first_pull: bool,
}

/// Invoke the JS callback and await its Promise. `is_start` is forwarded to
/// the JS side as the `isStart` argument so it can reset its iterator at the
/// scan boundary. Errors on the JS side surface here as rejected promises
/// and are tunneled back as `lancedb::Error::Runtime`.
async fn pull_next(tsfn: &GetNextBatchFn, is_start: bool) -> LanceResult<Option<Buffer>> {
    let promise = tsfn
        .call_async(is_start)
        .await
        .map_err(|e| Error::Runtime {
            message: format!(
                "[scannable/js-factory] napi error status={}, reason={}",
                e.status, e.reason
            ),
        })?;
    promise.await.map_err(|e| Error::Runtime {
        message: format!(
            "[scannable/js-iterator] napi error status={}, reason={}",
            e.status, e.reason
        ),
    })
}

/// Decode one IPC Stream buffer (schema + batch + EOS) into a `RecordBatch`.
/// Each buffer is a standalone IPC stream, so every decoded stream schema must
/// match the one declared at construction.
fn decode_one_batch(buf: &[u8], declared: &SchemaRef) -> LanceResult<RecordBatch> {
    let reader = StreamReader::try_new(Cursor::new(buf), None).map_err(|e| Error::Runtime {
        message: format!("failed to open IPC stream reader: {}", e),
    })?;

    let actual = reader.schema();
    if actual.as_ref() != declared.as_ref() {
        return Err(Error::InvalidInput {
            message: format!(
                "declared schema does not match stream schema: declared={:?} actual={:?}",
                declared, actual
            ),
        });
    }

    let mut iter = reader;
    let batch = iter
        .next()
        .ok_or_else(|| Error::Runtime {
            message: "IPC stream contained schema but no record batch".to_string(),
        })?
        .map_err(|e| Error::Runtime {
            message: format!("failed to decode record batch: {}", e),
        })?;
    Ok(batch)
}
@@ -5,10 +5,12 @@ use std::collections::HashMap;

 use lancedb::ipc::{ipc_file_to_batches, ipc_file_to_schema};
 use lancedb::table::{
-    AddDataMode, ColumnAlteration as LanceColumnAlteration, Duration, NewColumnTransform,
-    OptimizeAction, OptimizeOptions, Table as LanceDbTable,
+    AddDataMode, ColumnAlteration as LanceColumnAlteration, Duration,
+    FieldMetadataUpdate as LanceFieldMetadataUpdate, NewColumnTransform, OptimizeAction,
+    OptimizeOptions, Table as LanceDbTable,
 };
 use napi::bindgen_prelude::*;
+use napi::threadsafe_function::{ThreadsafeFunction, ThreadsafeFunctionCallMode};
 use napi_derive::napi;

 use crate::error::NapiErrorExt;
@@ -67,8 +69,16 @@ impl Table {
         schema_to_buffer(&schema)
     }

-    #[napi(catch_unwind)]
-    pub async fn add(&self, buf: Buffer, mode: String) -> napi::Result<AddResult> {
+    #[napi(
+        catch_unwind,
+        ts_args_type = "buf: Buffer, mode: string, progressCallback?: (progress: WriteProgressInfo) => void"
+    )]
+    pub async fn add(
+        &self,
+        buf: Buffer,
+        mode: String,
+        progress_callback: Option<ProgressFn>,
+    ) -> napi::Result<AddResult> {
         let batches = ipc_file_to_batches(buf.to_vec())
             .map_err(|e| napi::Error::from_reason(format!("Failed to read IPC file: {}", e)))?;
         let batches = batches
@@ -92,6 +102,19 @@ impl Table {
             return Err(napi::Error::from_reason(format!("Invalid mode: {}", mode)));
         };

+        if let Some(tsfn) = progress_callback {
+            op = op.progress(move |p| {
+                // NonBlocking: dispatch onto the JS event loop without
+                // blocking the writer thread. With napi-rs's default
+                // unbounded queue, events are not dropped — a slow JS
+                // callback will just queue them.
+                tsfn.call(
+                    WriteProgressInfo::from(p),
+                    ThreadsafeFunctionCallMode::NonBlocking,
+                );
+            });
+        }
+
         let res = op.execute().await.default_error()?;
         Ok(res.into())
     }
@@ -333,6 +356,23 @@ impl Table {
         Ok(res.into())
     }

+    #[napi(catch_unwind)]
+    pub async fn update_field_metadata(
+        &self,
+        updates: Vec<FieldMetadataUpdate>,
+    ) -> napi::Result<UpdateFieldMetadataResult> {
+        let updates = updates
+            .into_iter()
+            .map(LanceFieldMetadataUpdate::from)
+            .collect::<Vec<_>>();
+        let res = self
+            .inner_ref()?
+            .update_field_metadata(&updates)
+            .await
+            .default_error()?;
+        Ok(res.into())
+    }
+
     #[napi(catch_unwind)]
     pub async fn drop_columns(&self, columns: Vec<String>) -> napi::Result<DropColumnsResult> {
         let col_refs = columns.iter().map(String::as_str).collect::<Vec<_>>();
@@ -344,6 +384,36 @@ impl Table {
         Ok(res.into())
     }

+    #[napi(catch_unwind)]
+    pub async fn set_unenforced_primary_key(&self, columns: Vec<String>) -> napi::Result<()> {
+        self.inner_ref()?
+            .set_unenforced_primary_key(columns)
+            .await
+            .default_error()
+    }
+
+    #[napi(catch_unwind)]
+    pub async fn set_lsm_write_spec(&self, spec: LsmWriteSpec) -> napi::Result<()> {
+        let native_spec = lancedb::table::LsmWriteSpec::try_from(spec)?;
+        self.inner_ref()?
+            .set_lsm_write_spec(native_spec)
+            .await
+            .default_error()
+    }
+
+    #[napi(catch_unwind)]
+    pub async fn unset_lsm_write_spec(&self) -> napi::Result<()> {
+        self.inner_ref()?
+            .unset_lsm_write_spec()
+            .await
+            .default_error()
+    }
+
+    #[napi(catch_unwind)]
+    pub async fn close_lsm_writers(&self) -> napi::Result<()> {
+        self.inner_ref()?.close_lsm_writers().await.default_error()
+    }
+
     #[napi(catch_unwind)]
     pub async fn version(&self) -> napi::Result<i64> {
         self.inner_ref()?
@@ -538,6 +608,63 @@ impl From<lancedb::index::IndexConfig> for IndexConfig {
     }
 }

+/// Specification selecting Lance's MemWAL LSM-style write path for
+/// `mergeInsert`.
+///
+/// `specType` must be `"bucket"`, `"identity"`, or `"unsharded"`. For
+/// `"bucket"`, `column` and `numBuckets` are required; for `"identity"`,
+/// `column` is required.
+#[napi(object)]
+#[derive(Clone, Debug)]
+pub struct LsmWriteSpec {
+    /// One of `"bucket"`, `"identity"`, or `"unsharded"`.
+    pub spec_type: String,
+    /// Bucket and identity variants: the sharding column.
+    pub column: Option<String>,
+    /// Bucket variant: the number of buckets, in `[1, 1024]`.
+    pub num_buckets: Option<u32>,
+    /// Names of indexes the MemWAL should keep up to date during writes.
+    pub maintained_indexes: Option<Vec<String>>,
+    /// Default `ShardWriter` configuration recorded in the MemWAL index.
+    pub writer_config_defaults: Option<HashMap<String, String>>,
+}
+
+impl TryFrom<LsmWriteSpec> for lancedb::table::LsmWriteSpec {
+    type Error = napi::Error;
+
+    fn try_from(value: LsmWriteSpec) -> napi::Result<Self> {
+        let maintained = value.maintained_indexes.unwrap_or_default();
+        let writer_config_defaults = value.writer_config_defaults.unwrap_or_default();
+        let spec = match value.spec_type.as_str() {
+            "bucket" => {
+                let column = value.column.ok_or_else(|| {
+                    napi::Error::from_reason("LsmWriteSpec bucket requires `column`")
+                })?;
+                let num_buckets = value.num_buckets.ok_or_else(|| {
+                    napi::Error::from_reason("LsmWriteSpec bucket requires `numBuckets`")
+                })?;
+                Self::bucket(column, num_buckets)
+            }
+            "identity" => {
+                let column = value.column.ok_or_else(|| {
+                    napi::Error::from_reason("LsmWriteSpec identity requires `column`")
+                })?;
+                Self::identity(column)
+            }
+            "unsharded" => Self::unsharded(),
+            other => {
+                return Err(napi::Error::from_reason(format!(
+                    "LsmWriteSpec `specType` must be 'bucket', 'identity', or 'unsharded', got '{}'",
+                    other
+                )));
+            }
+        };
+        Ok(spec
+            .with_maintained_indexes(maintained)
+            .with_writer_config_defaults(writer_config_defaults))
+    }
+}
+
 /// Statistics about a compaction operation.
 #[napi(object)]
 #[derive(Clone, Debug)]
@@ -572,6 +699,44 @@ pub struct OptimizeStats {
     pub prune: RemovalStats,
 }

+/// Progress snapshot for a write operation, delivered to the JS callback
+/// passed to `Table.add`.
+#[napi(object)]
+#[derive(Clone, Debug)]
+pub struct WriteProgressInfo {
+    /// Number of rows written so far.
+    pub output_rows: i64,
+    /// Number of bytes written so far.
+    pub output_bytes: i64,
+    /// Total rows expected, if the input source reports it.
+    /// Always set on the final callback (where `done` is `true`).
+    pub total_rows: Option<i64>,
+    /// Wall-clock seconds since monitoring started.
+    pub elapsed_seconds: f64,
+    /// Number of parallel write tasks currently in flight.
+    pub active_tasks: i64,
+    /// Total number of parallel write tasks (the write parallelism).
+    pub total_tasks: i64,
+    /// `true` for the final callback; `false` otherwise.
+    pub done: bool,
+}
+
+impl From<&lancedb::table::write_progress::WriteProgress> for WriteProgressInfo {
+    fn from(p: &lancedb::table::write_progress::WriteProgress) -> Self {
+        Self {
+            output_rows: p.output_rows() as i64,
+            output_bytes: p.output_bytes() as i64,
+            total_rows: p.total_rows().map(|n| n as i64),
+            elapsed_seconds: p.elapsed().as_secs_f64(),
+            active_tasks: p.active_tasks() as i64,
+            total_tasks: p.total_tasks() as i64,
+            done: p.done(),
+        }
+    }
+}
+
+type ProgressFn = ThreadsafeFunction<WriteProgressInfo, (), WriteProgressInfo, Status, false>;
+
 /// A definition of a column alteration. The alteration changes the column at
 /// `path` to have the new name `name`, to be nullable if `nullable` is true,
 /// and to have the data type `data_type`. At least one of `rename` or `nullable`
@@ -600,6 +765,29 @@ pub struct ColumnAlteration {
     pub nullable: Option<bool>,
 }

+/// A per-field metadata update, addressed by dot-path. Merges into the field's
+/// existing metadata by default; a `null` value deletes a key, and `replace`
+/// swaps the field's entire metadata map.
+#[napi(object)]
+pub struct FieldMetadataUpdate {
+    /// Dot-separated path to the field (e.g. "embedding" or "a.b.c").
+    pub path: String,
+    /// Metadata keys to set; a `null` value deletes that key.
+    pub metadata: HashMap<String, Option<String>>,
+    /// If true, replace the field's entire metadata map instead of merging.
+    pub replace: Option<bool>,
+}
+
+impl From<FieldMetadataUpdate> for LanceFieldMetadataUpdate {
+    fn from(js: FieldMetadataUpdate) -> Self {
+        Self {
+            path: js.path,
+            metadata: js.metadata,
+            replace: js.replace.unwrap_or(false),
+        }
+    }
+}
+
 impl TryFrom<ColumnAlteration> for LanceColumnAlteration {
     type Error = String;
     fn try_from(js: ColumnAlteration) -> std::result::Result<Self, Self::Error> {
@@ -650,9 +838,6 @@ pub struct IndexStatistics {
     pub distance_type: Option<String>,
     /// The number of parts this index is split into.
     pub num_indices: Option<u32>,
-    /// The KMeans loss value of the index,
-    /// it is only present for vector indices.
-    pub loss: Option<f64>,
 }
 impl From<lancedb::index::IndexStatistics> for IndexStatistics {
     fn from(value: lancedb::index::IndexStatistics) -> Self {
@@ -662,7 +847,6 @@ impl From<lancedb::index::IndexStatistics> for IndexStatistics {
             index_type: value.index_type.to_string(),
             distance_type: value.distance_type.map(|d| d.to_string()),
             num_indices: value.num_indices,
-            loss: value.loss,
         }
     }
 }
@@ -798,6 +982,7 @@ pub struct MergeResult {
     pub num_updated_rows: i64,
     pub num_deleted_rows: i64,
     pub num_attempts: i64,
+    pub num_rows: i64,
 }

 impl From<lancedb::table::MergeResult> for MergeResult {
@@ -808,6 +993,7 @@ impl From<lancedb::table::MergeResult> for MergeResult {
             num_updated_rows: value.num_updated_rows as i64,
             num_deleted_rows: value.num_deleted_rows as i64,
             num_attempts: value.num_attempts as i64,
+            num_rows: value.num_rows as i64,
         }
     }
 }
@@ -838,6 +1024,19 @@ impl From<lancedb::table::AlterColumnsResult> for AlterColumnsResult {
     }
 }

+#[napi(object)]
+pub struct UpdateFieldMetadataResult {
+    pub version: i64,
+}
+
+impl From<lancedb::table::UpdateFieldMetadataResult> for UpdateFieldMetadataResult {
+    fn from(value: lancedb::table::UpdateFieldMetadataResult) -> Self {
+        Self {
+            version: value.version as i64,
+        }
+    }
+}
+
 #[napi(object)]
 pub struct DropColumnsResult {
     pub version: i64,

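As a reading aid for the `FieldMetadataUpdate` doc comment in the diff above (merge by default, a `null` value deletes a key, `replace` swaps the whole map), the semantics can be sketched in plain Python. This is only an illustration under stated assumptions: `apply_field_metadata_update` is a hypothetical helper, not a LanceDB function, and the choice to silently drop `None` values when `replace` is set is an assumption the diff does not confirm.

```python
from typing import Dict, Optional


def apply_field_metadata_update(
    existing: Dict[str, str],
    metadata: Dict[str, Optional[str]],
    replace: bool = False,
) -> Dict[str, str]:
    """Hypothetical helper mirroring the documented semantics only."""
    # With replace=True the old map is discarded; otherwise start from a copy.
    result: Dict[str, str] = {} if replace else dict(existing)
    for key, value in metadata.items():
        if value is None:
            # None (JS `null`) deletes the key if it is present.
            result.pop(key, None)
        else:
            result[key] = value
    return result


merged = apply_field_metadata_update({"a": "1", "b": "2"}, {"b": None, "c": "3"})
replaced = apply_field_metadata_update(
    {"a": "1", "b": "2"}, {"c": "3"}, replace=True
)
```

Here `merged` keeps `a`, drops `b`, and gains `c`, while `replaced` contains only `c`.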
@@ -1,5 +1,5 @@
 [tool.bumpversion]
-current_version = "0.31.0-beta.11"
+current_version = "0.33.1-beta.2"
 parse = """(?x)
     (?P<major>0|[1-9]\\d*)\\.
     (?P<minor>0|[1-9]\\d*)\\.

@@ -4,16 +4,26 @@ code is in the `src/` directory and the Python bindings are in the `lancedb/` di

 Common commands:

+* Bootstrap dev env: `uv run --extra tests --extra dev maturin develop --extras tests,dev`
 * Build: `make develop`
 * Format: `make format`
 * Lint: `make check`
 * Fix lints: `make fix`
-* Test: `make test`
-* Doc test: `make doctest`
+* Test: `uv run --extra tests pytest python/tests -vv --durations=10 -m "not slow and not s3_test"`
+* Run specific test: `uv run --extra tests pytest python/tests/<test_file>.py::<test_name> -q`
+* Doc test: `uv run --extra tests pytest --doctest-modules python/lancedb`
+
+Use the uv-managed environment declared by `uv.lock` for Python validation. Do
+not treat system `python`, global `pytest`, or missing editable-install errors
+as final blockers; bootstrap or enter the uv environment instead. `make test`
+and `make doctest` assume the development environment is already prepared.

 Before committing changes, run lints and then formatting.

-When you change the Rust code, you will need to recompile the Python bindings: `make develop`.
+When you change the Rust code, PyO3 binding code, or see a missing/stale
+`lancedb._lancedb`, recompile the Python bindings with
+`uv run --extra tests --extra dev maturin develop --extras tests,dev` before
+running tests.

 When you export new types from Rust to Python, you must manually update `python/lancedb/_lancedb.pyi`
 with the corresponding type hints. You can run `pyright` to check for type errors in the Python code.

@@ -1,6 +1,6 @@
 [package]
 name = "lancedb-python"
-version = "0.31.0-beta.11"
+version = "0.33.1-beta.2"
 publish = false
 edition.workspace = true
 description = "Python bindings for LanceDB"
@@ -19,6 +19,7 @@ arrow = { version = "58.0.0", features = ["pyarrow"] }
 async-trait = "0.1"
 bytes = "1"
 lancedb = { path = "../rust/lancedb", default-features = false }
+datafusion-common.workspace = true
 lance-core.workspace = true
 lance-namespace.workspace = true
 lance-namespace-impls.workspace = true

@@ -94,7 +94,6 @@ def connect(
     host_override: str, optional
         The override url for LanceDB Cloud.
     read_consistency_interval: timedelta, default None
-        (For LanceDB OSS only)
         The interval at which to check for updates to the table from other
         processes. If None, then consistency is not checked. For performance
         reasons, this is the default. For strong consistency, set this to
@@ -104,6 +103,10 @@ def connect(
         the last check, then the table will be checked for updates. Note: this
         consistency only applies to read operations. Write operations are
         always consistent.
+
+        Stronger consistency is not free. The smaller the interval, the more
+        often each read pays the cost of checking for updates against object
+        storage, raising per-read latency and cost.
     client_config: ClientConfig or dict, optional
         Configuration options for the LanceDB Cloud HTTP client. If a dict, then
         the keys are the attributes of the ClientConfig class. If None, then the
@@ -147,6 +150,13 @@ def connect(
     >>> db = lancedb.connect("s3://my-bucket/lancedb",
     ...                      storage_options={"aws_access_key_id": "***"})

+    For tests and temporary data, use an in-memory database:
+
+    >>> db = lancedb.connect("memory://")
+
+    In-memory databases are not persisted. Tables are dropped when the last
+    connection or table handle referencing them is closed.
+
     Connect to LanceDB cloud:

     >>> db = lancedb.connect("db://my_database", api_key="ldb_...",
@@ -210,6 +220,7 @@ def connect(
             request_thread_pool=request_thread_pool,
             client_config=client_config,
             storage_options=storage_options,
+            read_consistency_interval=read_consistency_interval,
             **kwargs,
         )
     _check_s3_bucket_with_dots(str(uri), storage_options)
@@ -304,6 +315,15 @@ def deserialize_conn(
             manifest_enabled=parsed.get("manifest_enabled", False),
             namespace_client_properties=parsed.get("namespace_client_properties"),
         )
+    elif connection_type == "remote":
+        return RemoteDBConnection(
+            parsed["db_url"],
+            parsed["api_key"],
+            parsed.get("region", "us-east-1"),
+            host_override=parsed.get("host_override"),
+            client_config=parsed.get("client_config"),
+            storage_options=storage_options,
+        )
     else:
         raise ValueError(f"Unknown connection_type: {connection_type}")

@@ -336,7 +356,6 @@ async def connect_async(
     host_override: str, optional
         The override url for LanceDB Cloud.
     read_consistency_interval: timedelta, default None
-        (For LanceDB OSS only)
         The interval at which to check for updates to the table from other
         processes. If None, then consistency is not checked. For performance
         reasons, this is the default. For strong consistency, set this to
@@ -346,6 +365,10 @@ async def connect_async(
         the last check, then the table will be checked for updates. Note: this
         consistency only applies to read operations. Write operations are
         always consistent.
+
+        Stronger consistency is not free. The smaller the interval, the more
+        often each read pays the cost of checking for updates against object
+        storage, raising per-read latency and cost.
     client_config: ClientConfig or dict, optional
         Configuration options for the LanceDB Cloud HTTP client. If a dict, then
         the keys are the attributes of the ClientConfig class. If None, then the
@@ -378,6 +401,8 @@ async def connect_async(
     ...     db = await lancedb.connect_async("s3://my-bucket/lancedb",
     ...                                      storage_options={
     ...                                          "aws_access_key_id": "***"})
+    ...     # For tests and temporary data, use an in-memory database
+    ...     db = await lancedb.connect_async("memory://")
     ...     # Connect to LanceDB cloud
     ...     db = await lancedb.connect_async("db://my_database", api_key="ldb_...",
     ...                                      client_config={

@@ -51,7 +51,7 @@ class PyExpr:
     def to_sql(self) -> str: ...

 def expr_col(name: str) -> PyExpr: ...
-def expr_lit(value: Union[bool, int, float, str]) -> PyExpr: ...
+def expr_lit(value: Union[bool, int, float, str, bytes]) -> PyExpr: ...
 def expr_func(name: str, args: List[PyExpr]) -> PyExpr: ...

 class Session:
@@ -208,6 +208,9 @@ class Table:
     async def alter_columns(
         self, columns: list[dict[str, Any]]
     ) -> AlterColumnsResult: ...
+    async def update_field_metadata(
+        self, updates: list[dict[str, Any]]
+    ) -> UpdateFieldMetadataResult: ...
     async def optimize(
         self,
         *,
@@ -217,6 +220,10 @@ class Table:
     async def uri(self) -> str: ...
     async def initial_storage_options(self) -> Optional[Dict[str, str]]: ...
     async def latest_storage_options(self) -> Optional[Dict[str, str]]: ...
+    async def set_unenforced_primary_key(self, columns: List[str]) -> None: ...
+    async def set_lsm_write_spec(self, spec: LsmWriteSpec) -> None: ...
+    async def unset_lsm_write_spec(self) -> None: ...
+    async def close_lsm_writers(self) -> None: ...
     @property
     def tags(self) -> Tags: ...
     def query(self) -> Query: ...
@@ -255,6 +262,11 @@ class RecordBatchStream:
     def __aiter__(self) -> "RecordBatchStream": ...
     async def __anext__(self) -> pa.RecordBatch: ...

+class ColumnOrdering(TypedDict):
+    column_name: str
+    ascending: bool
+    nulls_first: bool
+
 class Query:
     def where(self, filter: str): ...
     def where_expr(self, expr: PyExpr): ...
@@ -268,6 +280,7 @@ class Query:
     def postfilter(self): ...
     def nearest_to(self, query_vec: pa.Array) -> VectorQuery: ...
     def nearest_to_text(self, query: dict) -> FTSQuery: ...
+    def order_by(self, ordering: Optional[List[ColumnOrdering]]): ...
     async def output_schema(self) -> pa.Schema: ...
     async def execute(
         self, max_batch_length: Optional[int], timeout: Optional[timedelta]
@@ -296,6 +309,7 @@ class FTSQuery:
     def get_query(self) -> str: ...
     def add_query_vector(self, query_vec: pa.Array) -> None: ...
     def nearest_to(self, query_vec: pa.Array) -> HybridQuery: ...
+    def order_by(self, ordering: Optional[List[ColumnOrdering]]): ...
     async def output_schema(self) -> pa.Schema: ...
     async def execute(
         self, max_batch_length: Optional[int], timeout: Optional[timedelta]
@@ -321,6 +335,7 @@ class VectorQuery:
     def maximum_nprobes(self, maximum_nprobes: int): ...
     def bypass_vector_index(self): ...
     def nearest_to_text(self, query: dict) -> HybridQuery: ...
+    def order_by(self, ordering: Optional[List[ColumnOrdering]]): ...
     def to_query_request(self) -> PyQueryRequest: ...

 class HybridQuery:
@@ -339,6 +354,7 @@ class HybridQuery:
     def minimum_nprobes(self, minimum_nprobes: int): ...
     def maximum_nprobes(self, maximum_nprobes: int): ...
     def bypass_vector_index(self): ...
+    def order_by(self, ordering: Optional[List[ColumnOrdering]]): ...
     def to_vector_query(self) -> VectorQuery: ...
     def to_fts_query(self) -> FTSQuery: ...
     def get_limit(self) -> int: ...
@@ -368,6 +384,7 @@ class PyQueryRequest:
     bypass_vector_index: Optional[bool]
     postfilter: Optional[bool]
     norm: Optional[str]
+    order_by: Optional[List[ColumnOrdering]]

 class CompactionStats:
     fragments_removed: int
@@ -407,6 +424,38 @@ class MergeResult:
     num_inserted_rows: int
     num_deleted_rows: int
     num_attempts: int
+    num_rows: int
+
+class LsmWriteSpec:
+    """Specification selecting Lance's MemWAL LSM-style write path for
+    `merge_insert`."""
+
+    @staticmethod
+    def bucket(column: str, num_buckets: int) -> "LsmWriteSpec": ...
+    @staticmethod
+    def identity(column: str) -> "LsmWriteSpec": ...
+    @staticmethod
+    def unsharded() -> "LsmWriteSpec": ...
+    def with_maintained_indexes(self, indexes: List[str]) -> "LsmWriteSpec":
+        """Return a copy of this spec asking the MemWAL to keep the named
+        indexes up to date as rows are appended."""
+        ...
+    def with_writer_config_defaults(self, defaults: Dict[str, str]) -> "LsmWriteSpec":
+        """Return a copy of this spec recording the given default
+        `ShardWriter` configuration in the MemWAL index."""
+        ...
+    @property
+    def spec_type(self) -> str:
+        """One of 'bucket', 'identity', or 'unsharded'."""
+        ...
+    @property
+    def column(self) -> Optional[str]: ...
+    @property
+    def num_buckets(self) -> Optional[int]: ...
+    @property
+    def maintained_indexes(self) -> List[str]: ...
+    @property
+    def writer_config_defaults(self) -> Dict[str, str]: ...

 class AddColumnsResult:
     version: int
@@ -414,6 +463,9 @@ class AddColumnsResult:
 class AlterColumnsResult:
     version: int

+class UpdateFieldMetadataResult:
+    version: int
+
 class DropColumnsResult:
     version: int


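The `LsmWriteSpec` variants in this diff come with per-variant requirements: `bucket` needs a column and a bucket count, `identity` needs a column, `unsharded` needs neither. A minimal Python sketch of those rules follows. `validate_lsm_write_spec` is a hypothetical function written for illustration; it is not the binding's code, and enforcing the documented `[1, 1024]` range at this point is an assumption, since the conversion shown in the diff only documents that range.

```python
from typing import Optional, Tuple

MAX_BUCKETS = 1024  # upper bound quoted in the `num_buckets` doc comment


def validate_lsm_write_spec(
    spec_type: str,
    column: Optional[str] = None,
    num_buckets: Optional[int] = None,
) -> Tuple[str, Optional[str], Optional[int]]:
    """Hypothetical validator returning a normalized (type, column, buckets)."""
    if spec_type == "bucket":
        if column is None:
            raise ValueError("LsmWriteSpec bucket requires `column`")
        if num_buckets is None:
            raise ValueError("LsmWriteSpec bucket requires `num_buckets`")
        # Assumption: the documented range is checked here.
        if not 1 <= num_buckets <= MAX_BUCKETS:
            raise ValueError("`num_buckets` must be in [1, 1024]")
        return ("bucket", column, num_buckets)
    if spec_type == "identity":
        if column is None:
            raise ValueError("LsmWriteSpec identity requires `column`")
        return ("identity", column, None)
    if spec_type == "unsharded":
        return ("unsharded", None, None)
    raise ValueError(
        "LsmWriteSpec `spec_type` must be 'bucket', 'identity', or "
        f"'unsharded', got '{spec_type}'"
    )


bucket_spec = validate_lsm_write_spec("bucket", column="user_id", num_buckets=16)
```

The column name `user_id` is a placeholder for the example.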
@@ -2,6 +2,7 @@
 # SPDX-FileCopyrightText: Copyright The LanceDB Authors

 import asyncio
+import concurrent.futures
 import os
 import threading
 import warnings
@@ -37,6 +38,24 @@ class BackgroundEventLoop:

 LOOP = BackgroundEventLoop()

+
+def _new_embedding_executor() -> concurrent.futures.ThreadPoolExecutor:
+    return concurrent.futures.ThreadPoolExecutor(thread_name_prefix="lancedb-embedding")
+
+
+# Embedding functions can block for a long time -- a heavy local model or an
+# HTTP request to a remote embeddings API. Running them on asyncio's default
+# executor lets them starve the unrelated blocking I/O that shares that pool,
+# so they get a dedicated one. See
+# https://github.com/lancedb/lancedb/issues/3310.
+_EMBEDDING_EXECUTOR = _new_embedding_executor()
+
+
+def embedding_executor() -> concurrent.futures.ThreadPoolExecutor:
+    """Return the executor dedicated to running blocking embedding calls."""
+    return _EMBEDDING_EXECUTOR
+
+
 _FORK_WARNED = False


@@ -47,6 +66,12 @@ def _reset_after_fork():
     # the new state. The Rust-side tokio runtime is reset analogously by a
     # pthread_atfork hook installed in the _lancedb extension.
     LOOP._start()
+    # The embedding executor's worker threads are dead in the child as well.
+    # Replace it with a fresh pool (threads are spawned lazily, so this is
+    # cheap); we don't shut down the old one, since joining its dead workers
+    # could hang.
+    global _EMBEDDING_EXECUTOR
+    _EMBEDDING_EXECUTOR = _new_embedding_executor()
     global _FORK_WARNED
     if not _FORK_WARNED:
         _FORK_WARNED = True

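The comment in this file explains why blocking embedding calls get their own thread pool instead of asyncio's default executor. A minimal sketch of that pattern, using only the standard library, might look like the following. The names `demo_executor`, `slow_embed`, and `embed_async` are made up for the example and do not appear in LanceDB; the short `time.sleep` stands in for a slow model or HTTP call.

```python
import asyncio
import concurrent.futures
import time
from typing import List

# A dedicated pool, so slow embedding work cannot occupy the default executor.
demo_executor = concurrent.futures.ThreadPoolExecutor(
    thread_name_prefix="embedding-demo"
)


def slow_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a blocking embedding call (hypothetical)."""
    time.sleep(0.01)
    return [[float(len(t))] for t in texts]


async def embed_async(texts: List[str]) -> List[List[float]]:
    loop = asyncio.get_running_loop()
    # Passing an explicit executor keeps this work off the loop's default pool.
    return await loop.run_in_executor(demo_executor, slow_embed, texts)


result = asyncio.run(embed_async(["a", "bcd"]))
```

Other `run_in_executor(None, ...)` calls in the same process keep using the default pool and are not queued behind the embedding work.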
@@ -8,7 +8,17 @@ from abc import abstractmethod
 from datetime import timedelta
 from pathlib import Path
 import sys
-from typing import TYPE_CHECKING, Any, Dict, Iterable, List, Literal, Optional, Union
+from typing import (
+    TYPE_CHECKING,
+    Any,
+    Dict,
+    Generator,
+    Iterable,
+    List,
+    Literal,
+    Optional,
+    Union,
+)

 if sys.version_info >= (3, 12):
     from typing import override
@@ -313,7 +323,7 @@ class DBConnection(EnforceOverrides):
         >>> data = [{"vector": [1.1, 1.2], "lat": 45.5, "long": -122.7},
         ...         {"vector": [0.2, 1.8], "lat": 40.1, "long": -74.1}]
         >>> db.create_table("my_table", data)
-        LanceTable(name='my_table', version=1, ...)
+        LanceTable(name='my_table', ...)
         >>> db["my_table"].head()
         pyarrow.Table
         vector: fixed_size_list<item: float>[2]
@@ -334,7 +344,7 @@ class DBConnection(EnforceOverrides):
         ...    "long": [-122.7, -74.1]
         ... })
         >>> db.create_table("table2", data)
-        LanceTable(name='table2', version=1, ...)
+        LanceTable(name='table2', ...)
         >>> db["table2"].head()
         pyarrow.Table
         vector: fixed_size_list<item: float>[2]
@@ -357,7 +367,7 @@ class DBConnection(EnforceOverrides):
         ...   pa.field("long", pa.float32())
         ... ])
         >>> db.create_table("table3", data, schema = custom_schema)
-        LanceTable(name='table3', version=1, ...)
+        LanceTable(name='table3', ...)
         >>> db["table3"].head()
         pyarrow.Table
         vector: fixed_size_list<item: float>[2]
@@ -391,7 +401,7 @@ class DBConnection(EnforceOverrides):
         ...     pa.field("price", pa.float32()),
         ... ])
         >>> db.create_table("table4", make_batches(), schema=schema)
-        LanceTable(name='table4', version=1, ...)
+        LanceTable(name='table4', ...)

         """
         raise NotImplementedError
@@ -568,15 +578,15 @@ class LanceDBConnection(DBConnection):
     >>> db = lancedb.connect("./.lancedb")
     >>> db.create_table("my_table", data=[{"vector": [1.1, 1.2], "b": 2},
     ...                                   {"vector": [0.5, 1.3], "b": 4}])
-    LanceTable(name='my_table', version=1, ...)
+    LanceTable(name='my_table', ...)
     >>> db.create_table("another_table", data=[{"vector": [0.4, 0.4], "b": 6}])
-    LanceTable(name='another_table', version=1, ...)
+    LanceTable(name='another_table', ...)
     >>> sorted(db.table_names())
     ['another_table', 'my_table']
     >>> len(db)
     2
     >>> db["my_table"]
-    LanceTable(name='my_table', version=1, ...)
+    LanceTable(name='my_table', ...)
     >>> "my_table" in db
     True
     >>> db.drop_table("my_table")
@@ -847,11 +857,20 @@ class LanceDBConnection(DBConnection):
             )
         )

+    def _all_table_names(self) -> Generator[str, None, None]:
+        page_token = None
+        while True:
+            response = self.list_tables(page_token=page_token)
+            yield from response.tables
+            page_token = response.page_token
+            if not page_token:
+                return
+
     def __len__(self) -> int:
-        return len(self.table_names())
+        return sum(1 for _ in self._all_table_names())

     def __contains__(self, name: str) -> bool:
-        return name in self.table_names()
+        return name in self._all_table_names()

     @override
     def create_table(

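The `_all_table_names` generator in this hunk follows page tokens until the listing is exhausted, so `len(db)` and `name in db` see every table and not only the first page. The pattern can be exercised against a stand-in connection, as in the sketch below. `FakeConnection` and `FakeListTablesResponse` are invented for the example, and the integer-offset page token is an assumption about nothing more than this fake; only the `tables` and `page_token` attribute names are taken from the diff.

```python
from dataclasses import dataclass
from typing import Generator, List, Optional


@dataclass
class FakeListTablesResponse:
    tables: List[str]
    page_token: Optional[str]


class FakeConnection:
    """Hypothetical stand-in that serves table names a page at a time."""

    def __init__(self, names: List[str], page_size: int = 2) -> None:
        self._names = names
        self._page_size = page_size

    def list_tables(self, page_token: Optional[str] = None) -> FakeListTablesResponse:
        start = int(page_token) if page_token else 0
        end = start + self._page_size
        # No token on the last page signals the end of the listing.
        next_token = str(end) if end < len(self._names) else None
        return FakeListTablesResponse(self._names[start:end], next_token)

    def _all_table_names(self) -> Generator[str, None, None]:
        page_token = None
        while True:
            response = self.list_tables(page_token=page_token)
            yield from response.tables
            page_token = response.page_token
            if not page_token:
                return

    def __len__(self) -> int:
        return sum(1 for _ in self._all_table_names())

    def __contains__(self, name: str) -> bool:
        # Stops fetching pages as soon as the name is found.
        return name in self._all_table_names()


conn = FakeConnection(["a", "b", "c", "d", "e"])
```

With a page size of 2, counting the five names takes three `list_tables` calls, while a membership test for `"a"` needs only the first page.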
@@ -63,7 +63,7 @@ def _coerce(value: "ExprLike") -> "Expr":


 # Type alias used in annotations.
-ExprLike = Union["Expr", bool, int, float, str]
+ExprLike = Union["Expr", bool, int, float, str, bytes]


 class Expr:
@@ -261,13 +261,13 @@ def col(name: str) -> Expr:
     return Expr(expr_col(name))


-def lit(value: Union[bool, int, float, str]) -> Expr:
+def lit(value: Union[bool, int, float, str, bytes]) -> Expr:
     """Create a literal (constant) value expression.

     Parameters
     ----------
     value:
-        A Python ``bool``, ``int``, ``float``, or ``str``.
+        A Python ``bool``, ``int``, ``float``, ``str``, or ``bytes``.

     Examples
     --------

@@ -281,6 +281,9 @@ class HnswPq:
     m: int = 20
     ef_construction: int = 300
     target_partition_size: Optional[int] = None
+    # Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
+    # create_index() dispatches to pylance to build the index on the accelerator.
+    accelerator: Optional[str] = None


 @dataclass
@@ -386,6 +389,9 @@ class HnswSq:
     m: int = 20
     ef_construction: int = 300
     target_partition_size: Optional[int] = None
+    # Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
+    # create_index() dispatches to pylance to build the index on the accelerator.
+    accelerator: Optional[str] = None


 @dataclass
@@ -579,6 +585,9 @@ class IvfFlat:
     max_iterations: int = 50
     sample_rate: int = 256
     target_partition_size: Optional[int] = None
+    # Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
+    # create_index() dispatches to pylance to build the index on the accelerator.
+    accelerator: Optional[str] = None


 @dataclass
@@ -609,6 +618,9 @@ class IvfSq:
|
||||
max_iterations: int = 50
|
||||
sample_rate: int = 256
|
||||
target_partition_size: Optional[int] = None
|
||||
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
|
||||
# create_index() dispatches to pylance to build the index on the accelerator.
|
||||
accelerator: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -739,6 +751,9 @@ class IvfPq:
|
||||
max_iterations: int = 50
|
||||
sample_rate: int = 256
|
||||
target_partition_size: Optional[int] = None
|
||||
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
|
||||
# create_index() dispatches to pylance to build the index on the accelerator.
|
||||
accelerator: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -792,6 +807,9 @@ class IvfRq:
|
||||
max_iterations: int = 50
|
||||
sample_rate: int = 256
|
||||
target_partition_size: Optional[int] = None
|
||||
# Name of the accelerator (e.g. "cuda") to use for IVF training. When set,
|
||||
# create_index() dispatches to pylance to build the index on the accelerator.
|
||||
accelerator: Optional[str] = None
|
||||
|
||||
|
||||
__all__ = [
|
||||
|
||||
@@ -34,6 +34,8 @@ class LanceMergeInsertBuilder(object):
|
||||
self._when_not_matched_by_source_condition = None
|
||||
self._timeout = None
|
||||
self._use_index = True
|
||||
self._use_lsm_write = None
|
||||
self._validate_single_shard = None
|
||||
|
||||
def when_matched_update_all(
|
||||
self, *, where: Optional[str] = None
|
||||
@@ -96,6 +98,46 @@ class LanceMergeInsertBuilder(object):
|
||||
self._use_index = use_index
|
||||
return self
|
||||
|
||||
def use_lsm_write(self, use_lsm_write: bool) -> LanceMergeInsertBuilder:
|
||||
"""
|
||||
Controls whether the merge uses the MemWAL LSM write path.
|
||||
|
||||
By default (unset), a `merge_insert` on a table with an LSM write spec
|
||||
is routed through Lance's MemWAL shard writer, and a table without one
|
||||
uses the standard path. Pass `False` to force the standard path even
|
||||
when a spec is set. Pass `True` to require a spec — `merge_insert`
|
||||
raises an error if none is installed.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
use_lsm_write: bool
|
||||
Whether to use the LSM write path.
|
||||
"""
|
||||
self._use_lsm_write = use_lsm_write
|
||||
return self
|
||||
|
||||
def validate_single_shard(
|
||||
self, validate_single_shard: bool
|
||||
) -> LanceMergeInsertBuilder:
|
||||
"""
|
||||
Controls how an LSM merge checks that its input targets a single shard.
|
||||
|
||||
When a table has an LSM write spec, every row in a `merge_insert` call
|
||||
must route to the same shard. When `True` (the default), every row is
|
||||
inspected to verify this. When `False`, only the first row is inspected
|
||||
and the shard it routes to is used for the whole input — a faster path
|
||||
for callers that have already pre-sharded their input.
|
||||
|
||||
Has no effect on tables without an LSM write spec.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
validate_single_shard: bool
|
||||
Whether to check every row routes to one shard. Defaults to `True`.
|
||||
"""
|
||||
self._validate_single_shard = validate_single_shard
|
||||
return self
|
||||
|
||||
def execute(
|
||||
self,
|
||||
new_data: DATA,
|
||||
|
||||
@@ -6,22 +6,44 @@
|
||||
from typing import Optional
|
||||
|
||||
|
||||
_CREATE_NAMESPACE_MODES = frozenset({"create", "exist_ok", "overwrite"})
|
||||
_DROP_NAMESPACE_MODES = frozenset({"SKIP", "FAIL"})
|
||||
_DROP_NAMESPACE_BEHAVIORS = frozenset({"RESTRICT", "CASCADE"})
|
||||
|
||||
|
||||
def _normalize_create_namespace_mode(mode: Optional[str]) -> Optional[str]:
|
||||
"""Normalize create namespace mode to lowercase (API expects lowercase)."""
|
||||
if mode is None:
|
||||
return None
|
||||
return mode.lower()
|
||||
normalized = mode.lower()
|
||||
if normalized not in _CREATE_NAMESPACE_MODES:
|
||||
raise ValueError(
|
||||
f"Invalid create namespace mode {mode!r}: "
|
||||
f"expected one of 'create', 'exist_ok', 'overwrite'"
|
||||
)
|
||||
return normalized
|
||||
|
||||
|
||||
def _normalize_drop_namespace_mode(mode: Optional[str]) -> Optional[str]:
|
||||
"""Normalize drop namespace mode to uppercase (API expects uppercase)."""
|
||||
if mode is None:
|
||||
return None
|
||||
return mode.upper()
|
||||
normalized = mode.upper()
|
||||
if normalized not in _DROP_NAMESPACE_MODES:
|
||||
raise ValueError(
|
||||
f"Invalid drop namespace mode {mode!r}: expected one of 'skip', 'fail'"
|
||||
)
|
||||
return normalized
|
||||
|
||||
|
||||
def _normalize_drop_namespace_behavior(behavior: Optional[str]) -> Optional[str]:
|
||||
"""Normalize drop namespace behavior to uppercase (API expects uppercase)."""
|
||||
if behavior is None:
|
||||
return None
|
||||
return behavior.upper()
|
||||
normalized = behavior.upper()
|
||||
if normalized not in _DROP_NAMESPACE_BEHAVIORS:
|
||||
raise ValueError(
|
||||
f"Invalid drop namespace behavior {behavior!r}: "
|
||||
f"expected one of 'restrict', 'cascade'"
|
||||
)
|
||||
return normalized
|
||||
|
||||
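As a quick illustration of the stricter normalization helpers in the hunk above, the sketch below re-implements the drop-mode check in standalone form. The names `DROP_MODES`, `normalize_drop_mode`, and `try_normalize` are stand-ins invented for this example, not the module's private helpers. The logic mirrors the diff: change the case of the input, then reject anything outside the allowed set instead of passing it through.

```python
from typing import Optional

# Stand-in for _DROP_NAMESPACE_MODES in the diff (assumed name for this sketch).
DROP_MODES = frozenset({"SKIP", "FAIL"})


def normalize_drop_mode(mode: Optional[str]) -> Optional[str]:
    """Uppercase the mode and reject values outside DROP_MODES."""
    if mode is None:
        # None means "not specified" and is passed through unchanged.
        return None
    normalized = mode.upper()
    if normalized not in DROP_MODES:
        raise ValueError(
            f"Invalid drop namespace mode {mode!r}: expected one of 'skip', 'fail'"
        )
    return normalized


def try_normalize(mode: Optional[str]) -> str:
    """Return the normalized value as text, or the error text for bad input."""
    try:
        return repr(normalize_drop_mode(mode))
    except ValueError as err:
        return f"ValueError: {err}"


if __name__ == "__main__":
    for candidate in (None, "skip", "Fail", "purge"):
        print(candidate, "->", try_normalize(candidate))
```

Before this change a typo such as `"purge"` would have been uppercased and returned as-is; with the membership check it fails early with a `ValueError`.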
@@ -3,12 +3,13 @@
|
||||
|
||||
import copy
|
||||
import json
|
||||
import os
|
||||
|
||||
from deprecation import deprecated
|
||||
import pyarrow as pa
|
||||
|
||||
from ._lancedb import async_permutation_builder, PermutationReader
|
||||
from .table import LanceTable
|
||||
from .table import LanceTable, Table
|
||||
from .background_loop import LOOP
|
||||
from .util import batch_to_tensor, batch_to_tensor_rows
|
||||
from typing import Any, Callable, Iterator, Literal, Optional, TYPE_CHECKING, Union
|
||||
@@ -354,6 +355,49 @@ class Transforms:
|
||||
DEFAULT_BATCH_SIZE = 100
|
||||
|
||||
|
||||
def _table_to_pickle_state(table: Table) -> dict[str, Any]:
|
||||
from .remote.table import RemoteTable
|
||||
|
||||
if isinstance(table, RemoteTable):
|
||||
return {
|
||||
"kind": "remote",
|
||||
"table": table,
|
||||
}
|
||||
|
||||
if not isinstance(table, LanceTable):
|
||||
raise ValueError(f"Cannot pickle table of type {type(table)!r}")
|
||||
|
||||
base_uri = table._conn.uri
|
||||
if base_uri.startswith("memory://"):
|
||||
return {
|
||||
"kind": "memory",
|
||||
"name": table.name,
|
||||
"data": table.to_arrow(),
|
||||
}
|
||||
|
||||
return {
|
||||
"kind": "local",
|
||||
"name": table.name,
|
||||
"uri": base_uri,
|
||||
"namespace": table._namespace_path,
|
||||
"storage_options": table._conn.storage_options,
|
||||
}
|
||||
|
||||
|
||||
def _table_from_pickle_state(state: dict[str, Any]) -> Table:
|
||||
from . import connect
|
||||
|
||||
kind = state["kind"]
|
||||
if kind == "remote":
|
||||
return state["table"]
|
||||
if kind == "memory":
|
||||
return connect("memory://").create_table(state["name"], state["data"])
|
||||
if kind == "local":
|
||||
db = connect(state["uri"], storage_options=state["storage_options"])
|
||||
return db.open_table(state["name"], namespace_path=state["namespace"] or None)
|
||||
raise ValueError(f"Unknown table pickle state kind: {kind}")
|
||||
|
||||
|
||||
class Permutation:
|
||||
"""
|
||||
A Permutation is a view of a dataset that can be used as input to model training
|
||||
@@ -369,15 +413,15 @@ class Permutation:
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
base_table: LanceTable,
|
||||
permutation_table: Optional[LanceTable],
|
||||
base_table: Table,
|
||||
permutation_table: Optional[Table],
|
||||
split: int,
|
||||
selection: dict[str, str],
|
||||
batch_size: int,
|
||||
transform_fn: Callable[[pa.RecordBatch], Any],
|
||||
offset: Optional[int] = None,
|
||||
limit: Optional[int] = None,
|
||||
connection_factory: Optional[Callable[[str], LanceTable]] = None,
|
||||
connection_factory: Optional[Callable[[str], Table]] = None,
|
||||
_reader: Optional[PermutationReader] = None,
|
||||
):
|
||||
"""
|
||||
@@ -397,6 +441,7 @@ class Permutation:
|
||||
if _reader is None:
|
||||
_reader = LOOP.run(self._build_reader())
|
||||
self.reader: PermutationReader = _reader
|
||||
self._pid = os.getpid()
|
||||
|
||||
async def _build_reader(self) -> PermutationReader:
|
||||
reader = await PermutationReader.from_tables(
|
||||
@@ -428,29 +473,25 @@ class Permutation:
|
||||
return new
|
||||
|
||||
def with_connection_factory(
|
||||
self, connection_factory: Callable[[str], LanceTable]
|
||||
self, connection_factory: Callable[[str], Table]
|
||||
) -> "Permutation":
|
||||
"""
|
||||
Creates a new permutation that will use ``connection_factory`` to reopen
|
||||
the base table when this permutation is unpickled in a worker process.
|
||||
|
||||
The factory is a callable that takes a single argument — the base table
|
||||
name — and returns a [LanceTable]. It must be picklable; the worker
|
||||
The factory is a callable that takes a single argument, the base table
|
||||
name, and returns a LanceDB table. It must be picklable; the worker
|
||||
will unpickle it via standard ``pickle`` and call it to recover the base
|
||||
table. In practice, picklable callables are top-level (module-level)
|
||||
functions, ``functools.partial`` of such functions, or instances of
|
||||
picklable classes implementing ``__call__``. Lambdas and closures over
|
||||
local variables don't pickle with the default protocol.
|
||||
|
||||
Setting a factory is necessary when the URI alone is not enough to
|
||||
re-open the connection — most importantly for LanceDB Cloud (``db://``)
|
||||
connections, where ``api_key`` and ``region`` aren't recoverable from
|
||||
the connection object after construction.
|
||||
|
||||
For local file or cloud-storage paths the factory is optional: if not
|
||||
set, ``__getstate__`` falls back to capturing
|
||||
``(uri, storage_options, namespace_path)`` and re-opening via
|
||||
``lancedb.connect(uri, storage_options=...)``.
|
||||
A factory is optional for normal local and remote LanceDB connections:
|
||||
if not set, ``__getstate__`` captures the table's own picklable reopen
|
||||
state. Use a factory when that default state is not enough, for example
|
||||
when credentials should be loaded from the worker environment instead
|
||||
of being embedded in the pickle.
|
||||
|
||||
Examples
|
||||
--------
|
||||
@@ -508,7 +549,7 @@ class Permutation:
|
||||
return new
|
||||
|
||||
@classmethod
|
||||
def identity(cls, table: LanceTable) -> "Permutation":
|
||||
def identity(cls, table: Table) -> "Permutation":
|
||||
"""
|
||||
Creates an identity permutation for the given table.
|
||||
"""
|
||||
@@ -517,8 +558,8 @@ class Permutation:
|
||||
@classmethod
|
||||
def from_tables(
|
||||
cls,
|
||||
base_table: LanceTable,
|
||||
permutation_table: Optional[LanceTable] = None,
|
||||
base_table: Table,
|
||||
permutation_table: Optional[Table] = None,
|
||||
split: Optional[Union[str, int]] = None,
|
||||
) -> "Permutation":
|
||||
"""
|
||||
@@ -594,11 +635,10 @@ class Permutation:
|
||||
|
||||
The base table is captured either via a user-supplied
|
||||
``connection_factory`` (see [with_connection_factory]) or, as a
|
||||
fallback, by introspecting ``(uri, storage_options, namespace_path)``
|
||||
on the connection. The permutation table — always an in-memory
|
||||
LanceDB table — is captured as a pyarrow Table (which pickles via
|
||||
Arrow IPC natively). The reader is dropped from the wire format;
|
||||
``__setstate__`` rebuilds it from the restored tables.
|
||||
fallback, by the table's own picklable reopen state. The permutation
|
||||
table is captured as a pyarrow Table (which pickles via Arrow IPC
|
||||
natively). The reader is dropped from the wire format and rebuilt
|
||||
lazily on first use.
|
||||
"""
|
||||
permutation_data: Optional[pa.Table] = None
|
||||
if self.permutation_table is not None:
|
||||
@@ -622,39 +662,9 @@ class Permutation:
|
||||
# namespace from the existing connection.
|
||||
return common
|
||||
|
||||
# URI-introspection fallback: only viable for native (OSS) connections
|
||||
# where (uri, storage_options) is enough to reopen. Remote / cloud
|
||||
# connections don't expose recoverable api_key / region — those users
|
||||
# must call with_connection_factory().
|
||||
try:
|
||||
base_uri = self.base_table._conn.uri
|
||||
storage_options = self.base_table._conn.storage_options
|
||||
except AttributeError as e:
|
||||
raise ValueError(
|
||||
"Cannot pickle this Permutation: the base table's connection "
|
||||
"does not expose a uri/storage_options, which usually means it "
|
||||
"is a remote (LanceDB Cloud) connection. Call "
|
||||
"Permutation.with_connection_factory(...) first to provide a "
|
||||
"picklable callable that re-opens the base table from a worker "
|
||||
"process."
|
||||
) from e
|
||||
|
||||
if base_uri.startswith("memory://"):
|
||||
# In-memory base tables don't exist in any worker process by
|
||||
# default, so dump the entire base table into the pickle. This
|
||||
# can be expensive for large datasets — users with large
|
||||
# in-memory base tables should either persist them or set a
|
||||
# connection_factory.
|
||||
return {
|
||||
**common,
|
||||
"base_table_data": self.base_table.to_arrow(),
|
||||
}
|
||||
|
||||
return {
|
||||
**common,
|
||||
"base_table_uri": base_uri,
|
||||
"base_table_namespace": self.base_table._namespace_path,
|
||||
"base_table_storage_options": storage_options,
|
||||
"base_table_state": _table_to_pickle_state(self.base_table),
|
||||
}
|
||||
|
||||
def __setstate__(self, state: dict[str, Any]) -> None:
|
||||
@@ -663,6 +673,8 @@ class Permutation:
|
||||
connection_factory = state["connection_factory"]
|
||||
if connection_factory is not None:
|
||||
base_table = connection_factory(state["base_table_name"])
|
||||
elif "base_table_state" in state:
|
||||
base_table = _table_from_pickle_state(state["base_table_state"])
|
||||
elif "base_table_data" in state:
|
||||
# In-memory base table inlined into the pickle; rebuild the same
|
||||
# way we rebuild the in-memory permutation table.
|
||||
@@ -680,7 +692,7 @@ class Permutation:
|
||||
namespace_path=state["base_table_namespace"] or None,
|
||||
)
|
||||
|
||||
permutation_table: Optional[LanceTable] = None
|
||||
permutation_table: Optional[Table] = None
|
||||
if state["permutation_data"] is not None:
|
||||
mem_db = connect("memory://")
|
||||
permutation_table = mem_db.create_table(
|
||||
@@ -696,10 +708,28 @@ class Permutation:
|
||||
self.offset = state["offset"]
|
||||
self.limit = state["limit"]
|
||||
self.connection_factory = connection_factory
|
||||
self.reader = None
|
||||
self._pid = None
|
||||
|
||||
def _ensure_open(self) -> None:
|
||||
pid = os.getpid()
|
||||
if self.reader is not None and getattr(self, "_pid", None) == pid:
|
||||
return
|
||||
# The reader owns Rust-side table handles. Rebuild it after unpickle or
|
||||
# fork even though the Python table wrappers reopen themselves.
|
||||
if hasattr(self.base_table, "_ensure_open"):
|
||||
self.base_table._ensure_open()
|
||||
if self.permutation_table is not None and hasattr(
|
||||
self.permutation_table, "_ensure_open"
|
||||
):
|
||||
self.permutation_table._ensure_open()
|
||||
self.reader = LOOP.run(self._build_reader())
|
||||
self._pid = pid
|
||||
|
||||
@property
|
||||
def schema(self) -> pa.Schema:
|
||||
self._ensure_open()
|
||||
|
||||
async def do_output_schema():
|
||||
return await self.reader.output_schema(self.selection)
|
||||
|
||||
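The `_ensure_open` method added in the hunk above follows a general pattern: cache a process-bound handle together with the PID that created it, and rebuild the handle whenever it is missing or the PID differs. The sketch below shows that pattern in isolation. `ProcessLocalHandle`, `build`, and `build_count` are hypothetical names for this example and are not part of LanceDB.

```python
import os
from typing import Callable, Optional


class ProcessLocalHandle:
    """Caches a resource and rebuilds it when the owning process changes."""

    def __init__(self, build: Callable[[], object]) -> None:
        self._build = build
        self._handle: Optional[object] = None
        self._pid: Optional[int] = None
        self.build_count = 0  # exposed only so the example can observe rebuilds

    def get(self) -> object:
        pid = os.getpid()
        # Reuse the handle only if it was created by this very process.
        if self._handle is not None and self._pid == pid:
            return self._handle
        # Missing handle (e.g. after unpickling) or a different PID (e.g. after
        # fork): build a fresh handle and remember which process owns it.
        self._handle = self._build()
        self._pid = pid
        self.build_count += 1
        return self._handle


if __name__ == "__main__":
    holder = ProcessLocalHandle(lambda: object())
    first = holder.get()
    assert holder.get() is first  # same process: cached handle is reused
    holder._pid = -1  # simulate a handle that was created in another process
    assert holder.get() is not first
    print("rebuilds:", holder.build_count)
```

Checking the PID on every access is cheap, and it avoids relying on fork hooks to invalidate handles that are not safe to share across processes.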
@@ -717,6 +747,7 @@ class Permutation:
|
||||
"""
|
||||
The number of rows in the permutation
|
||||
"""
|
||||
self._ensure_open()
|
||||
return self.reader.count_rows()
|
||||
|
||||
@property
|
||||
@@ -875,6 +906,7 @@ class Permutation:
|
||||
If skip_last_batch is True, the last batch will be skipped if it contains
|
||||
fewer than batch_size rows.
|
||||
"""
|
||||
self._ensure_open()
|
||||
|
||||
async def get_iter():
|
||||
return await self.reader.read(self.selection, batch_size=batch_size)
|
||||
@@ -968,22 +1000,33 @@ class Permutation:
|
||||
new.transform_fn = transform
|
||||
return new
|
||||
|
||||
def take_offsets(self, offsets: list[int]) -> Any:
|
||||
"""
|
||||
Take rows from the permutation by offset
|
||||
|
||||
The returned value is passed through the permutation's current transform,
|
||||
so `with_format` and `with_transform` affect this method in the same way
|
||||
they affect iteration.
|
||||
"""
|
||||
self._ensure_open()
|
||||
|
||||
async def do_take_offsets():
|
||||
return await self.reader.take_offsets(offsets, selection=self.selection)
|
||||
|
||||
batch = LOOP.run(do_take_offsets())
|
||||
return self.transform_fn(batch)
|
||||
|
||||
def __getitem__(self, index: int) -> Any:
|
||||
"""
|
||||
Returns a single row from the permutation by offset
|
||||
"""
|
||||
return self.__getitems__([index])
|
||||
return self.take_offsets([index])
|
||||
|
||||
def __getitems__(self, indices: list[int]) -> Any:
|
||||
"""
|
||||
Returns rows from the permutation by offset
|
||||
"""
|
||||
|
||||
async def do_getitems():
|
||||
return await self.reader.take_offsets(indices, selection=self.selection)
|
||||
|
||||
batch = LOOP.run(do_getitems())
|
||||
return self.transform_fn(batch)
|
||||
return self.take_offsets(indices)
|
||||
|
||||
@deprecated(details="Use with_skip instead")
|
||||
def skip(self, skip: int) -> "Permutation":
|
||||
@@ -1001,9 +1044,11 @@ class Permutation:
|
||||
"""
|
||||
Skip the first `skip` rows of the permutation
|
||||
"""
|
||||
self._ensure_open()
|
||||
new = copy.copy(self)
|
||||
new.offset = skip
|
||||
new.reader = LOOP.run(new._build_reader())
|
||||
new._pid = os.getpid()
|
||||
return new
|
||||
|
||||
@deprecated(details="Use with_take instead")
|
||||
@@ -1022,9 +1067,11 @@ class Permutation:
|
||||
"""
|
||||
Limit the permutation to `limit` rows (following any `skip`)
|
||||
"""
|
||||
self._ensure_open()
|
||||
new = copy.copy(self)
|
||||
new.limit = limit
|
||||
new.reader = LOOP.run(new._build_reader())
|
||||
new._pid = os.getpid()
|
||||
return new
|
||||
|
||||
@deprecated(details="Use with_repeat instead")
|
||||
|
||||
@@ -3,12 +3,14 @@
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
from abc import ABC, abstractmethod
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from enum import Enum
|
||||
from datetime import timedelta
|
||||
from enum import Enum
|
||||
from typing import (
|
||||
TYPE_CHECKING,
|
||||
Any,
|
||||
Dict,
|
||||
List,
|
||||
Literal,
|
||||
@@ -17,44 +19,51 @@ from typing import (
|
||||
Type,
|
||||
TypeVar,
|
||||
Union,
|
||||
Any,
|
||||
)
|
||||
|
||||
import asyncio
|
||||
import deprecation
|
||||
import numpy as np
|
||||
import pyarrow as pa
|
||||
import pyarrow.compute as pc
|
||||
import pydantic
|
||||
from typing_extensions import Annotated
|
||||
|
||||
from lancedb.pydantic import PYDANTIC_VERSION
|
||||
from lancedb._lancedb import fts_query_to_json
|
||||
from lancedb.background_loop import LOOP
|
||||
from lancedb.pydantic import PYDANTIC_VERSION
|
||||
|
||||
from . import __version__
|
||||
from .arrow import AsyncRecordBatchReader
|
||||
from .dependencies import pandas as pd
|
||||
from .expr import Expr
|
||||
from .rerankers.base import Reranker
|
||||
from .rerankers.rrf import RRFReranker
|
||||
from .rerankers.util import check_reranker_result
|
||||
from .util import flatten_columns
|
||||
from .expr import Expr
|
||||
from lancedb._lancedb import fts_query_to_json
|
||||
from typing_extensions import Annotated
|
||||
|
||||
BlobMode = Literal["lazy", "bytes", "descriptions"]
|
||||
|
||||
_BLOB_MODE_TO_HANDLING = {
|
||||
"lazy": "blobs_descriptions",
|
||||
"bytes": "all_binary",
|
||||
"descriptions": "blobs_descriptions",
|
||||
}
|
||||
|
||||
if TYPE_CHECKING:
|
||||
import sys
|
||||
|
||||
import PIL
|
||||
import polars as pl
|
||||
|
||||
from ._lancedb import Query as LanceQuery
|
||||
from ._lancedb import FTSQuery as LanceFTSQuery
|
||||
from ._lancedb import HybridQuery as LanceHybridQuery
|
||||
from ._lancedb import VectorQuery as LanceVectorQuery
|
||||
from ._lancedb import TakeQuery as LanceTakeQuery
|
||||
from ._lancedb import PyQueryRequest
|
||||
from ._lancedb import Query as LanceQuery
|
||||
from ._lancedb import TakeQuery as LanceTakeQuery
|
||||
from ._lancedb import VectorQuery as LanceVectorQuery
|
||||
from .common import VEC
|
||||
from .pydantic import LanceModel
|
||||
from .table import Table
|
||||
from .table import AsyncTable, Table
|
||||
|
||||
if sys.version_info >= (3, 11):
|
||||
from typing import Self
|
||||
@@ -64,6 +73,179 @@ if TYPE_CHECKING:
|
||||
T = TypeVar("T", bound="LanceModel")
|
||||
|
||||
|
||||
def _validate_blob_mode(blob_mode: BlobMode) -> None:
|
||||
if blob_mode not in _BLOB_MODE_TO_HANDLING:
|
||||
modes = ", ".join(repr(mode) for mode in _BLOB_MODE_TO_HANDLING)
|
||||
raise ValueError(f"blob_mode must be one of {modes}, got {blob_mode!r}")
|
||||
|
||||
|
||||
def _field_is_blob(field: pa.Field) -> bool:
|
||||
metadata = field.metadata or {}
|
||||
return metadata.get(b"lance-encoding:blob") == b"true" or (
|
||||
metadata.get("lance-encoding:blob") == "true"
|
||||
)
|
||||
|
||||
|
||||
def _schema_has_blob_field(schema: pa.Schema) -> bool:
|
||||
return any(_field_is_blob(field) for field in schema)
|
||||
|
||||
|
||||
def _blob_mode_requires_native_pandas(blob_mode: BlobMode, schema: pa.Schema) -> bool:
|
||||
return blob_mode in _BLOB_MODE_TO_HANDLING and _schema_has_blob_field(schema)
|
||||
|
||||
|
||||
def _unsupported_blob_pandas_error(reason: str) -> RuntimeError:
|
||||
return RuntimeError(
|
||||
"blob columns require Lance native scanner conversion for query "
|
||||
f"to_pandas(), but {reason}. Use a plain scan query or remove blob "
|
||||
"columns from the projection."
|
||||
)
|
||||
|
||||
|
||||
def _query_is_plain_scan(query: Query) -> bool:
|
||||
return (
|
||||
query.vector is None
|
||||
and query.full_text_query is None
|
||||
and not query.postfilter
|
||||
and not query.order_by
|
||||
)
|
||||
|
||||
|
||||
def _filter_to_sql(filter: Optional[Union[str, Expr]]) -> Optional[str]:
|
||||
if filter is None:
|
||||
return None
|
||||
if isinstance(filter, Expr):
|
||||
return filter.to_sql()
|
||||
return filter
|
||||
|
||||
|
||||
def _projection_to_scanner_kwargs(
|
||||
columns: Optional[
|
||||
Union[
|
||||
List[str], List[Tuple[str, Union[str, Expr]]], Dict[str, Union[str, Expr]]
|
||||
]
|
||||
],
|
||||
) -> Dict[str, Any]:
|
||||
if columns is None:
|
||||
return {}
|
||||
if isinstance(columns, list):
|
||||
if all(isinstance(column, str) for column in columns):
|
||||
return {"columns": columns}
|
||||
if all(isinstance(column, tuple) and len(column) == 2 for column in columns):
|
||||
return {
|
||||
"columns": {
|
||||
name: expr.to_sql() if isinstance(expr, Expr) else expr
|
||||
for name, expr in columns
|
||||
}
|
||||
}
|
||||
# Let Lance raise the detailed projection validation error.
|
||||
return {"columns": columns}
|
||||
|
||||
projection = {}
|
||||
for name, expr in columns.items():
|
||||
if isinstance(expr, Expr):
|
||||
expr = expr.to_sql()
|
||||
projection[name] = expr
|
||||
return {"columns": projection}
|
||||
|
||||
|
||||
def _scanner_kwargs_for_query(
|
||||
query: Query, blob_mode: BlobMode, dataset: Optional[Any] = None
|
||||
) -> Dict[str, Any]:
|
||||
fragments = _scanner_fragments_for_query(query, dataset)
|
||||
kwargs = {
|
||||
**_projection_to_scanner_kwargs(query.columns),
|
||||
"filter": _filter_to_sql(query.filter),
|
||||
"limit": query.limit,
|
||||
"offset": query.offset,
|
||||
"with_row_id": query.with_row_id,
|
||||
"with_row_address": query.with_row_address,
|
||||
"fast_search": query.fast_search,
|
||||
"blob_handling": _BLOB_MODE_TO_HANDLING[blob_mode],
|
||||
"fragments": fragments,
|
||||
}
|
||||
return {key: value for key, value in kwargs.items() if value is not None}
|
||||
|
||||
|
||||
def _scanner_fragments_for_query(query: Query, dataset: Optional[Any]) -> Optional[Any]:
|
||||
if query.fragments is not None and query.fragment_ids is not None:
|
||||
raise ValueError("fragments and fragment_ids cannot both be set")
|
||||
if query.fragments is not None:
|
||||
return query.fragments
|
||||
if query.fragment_ids is None:
|
||||
return None
|
||||
if dataset is None:
|
||||
raise ValueError("fragment_ids require a Lance dataset")
|
||||
|
||||
requested = set(query.fragment_ids)
|
||||
fragments = [
|
||||
fragment
|
||||
for fragment in dataset.get_fragments()
|
||||
if fragment.fragment_id in requested
|
||||
]
|
||||
found = {fragment.fragment_id for fragment in fragments}
|
||||
missing = requested - found
|
||||
if missing:
|
||||
missing_ids = ", ".join(str(fragment_id) for fragment_id in sorted(missing))
|
||||
raise ValueError(f"fragment_ids not found in dataset: {missing_ids}")
|
||||
return fragments
|
||||
|
||||
|
||||
def _ensure_lazy_blob_frame(
|
||||
df: "pd.DataFrame", schema: pa.Schema, blob_mode: BlobMode
|
||||
) -> "pd.DataFrame":
|
||||
if blob_mode != "lazy" or not _schema_has_blob_field(schema) or len(df) == 0:
|
||||
return df
|
||||
|
||||
for field in schema:
|
||||
if not _field_is_blob(field) or field.name not in df.columns:
|
||||
continue
|
||||
value = df[field.name].iloc[0]
|
||||
if value is not None and not hasattr(value, "readall"):
|
||||
raise _unsupported_blob_pandas_error(
|
||||
"the Lance scanner did not return lazy blob files"
|
||||
)
|
||||
return df
|
||||
|
||||
|
||||
def _scanner_to_table(scanner: Any) -> pa.Table:
|
||||
if hasattr(scanner, "to_pyarrow"):
|
||||
reader = scanner.to_pyarrow()
|
||||
return reader.read_all()
|
||||
if hasattr(scanner, "to_table"):
|
||||
return scanner.to_table()
|
||||
reader = scanner.to_reader()
|
||||
return reader.read_all()
|
||||
|
||||
|
||||
def _scanner_to_pandas(scanner: Any, blob_mode: BlobMode, **kwargs) -> "pd.DataFrame":
|
||||
schema = getattr(scanner, "projected_schema", None)
|
||||
if schema is None:
|
||||
schema = getattr(scanner, "schema", None)
|
||||
if schema is None:
|
||||
schema = getattr(scanner, "dataset_schema", None)
|
||||
if callable(schema):
|
||||
schema = schema()
|
||||
if hasattr(scanner, "to_pandas"):
|
||||
try:
|
||||
df = scanner.to_pandas(blob_mode=blob_mode, **kwargs)
|
||||
except TypeError as err:
|
||||
message = str(err)
|
||||
if "blob_mode" not in message and "unexpected keyword" not in message:
|
||||
raise
|
||||
df = scanner.to_pandas(**kwargs)
|
||||
if schema is not None:
|
||||
return _ensure_lazy_blob_frame(df, schema, blob_mode)
|
||||
return df
|
||||
|
||||
tbl = _scanner_to_table(scanner)
|
||||
if blob_mode == "lazy" and _schema_has_blob_field(tbl.schema):
|
||||
raise _unsupported_blob_pandas_error(
|
||||
"the Lance scanner does not expose to_pandas"
|
||||
)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
|
||||
|
||||
# Pydantic validation function for vector queries
|
||||
def ensure_vector_query(
|
||||
val: Any,
|
||||
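The `_scanner_fragments_for_query` helper in the hunk above resolves a list of fragment ids against the dataset's fragments and fails if any id is unknown. The sketch below reproduces that resolution step without a Lance dataset. `FakeFragment` and `resolve_fragment_ids` are hypothetical stand-ins for this example; only the `fragment_id` attribute is taken from the diff.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FakeFragment:
    """Hypothetical stand-in for a Lance fragment; only the id is modeled."""

    fragment_id: int


def resolve_fragment_ids(
    available: Iterable[FakeFragment], fragment_ids: List[int]
) -> List[FakeFragment]:
    """Return the fragments whose ids were requested, or raise if any is missing."""
    requested = set(fragment_ids)
    # Keep fragments in the order the dataset yields them, not request order.
    fragments = [frag for frag in available if frag.fragment_id in requested]
    missing = requested - {frag.fragment_id for frag in fragments}
    if missing:
        missing_ids = ", ".join(str(fid) for fid in sorted(missing))
        raise ValueError(f"fragment_ids not found in dataset: {missing_ids}")
    return fragments


if __name__ == "__main__":
    dataset_fragments = [FakeFragment(0), FakeFragment(1), FakeFragment(2)]
    picked = resolve_fragment_ids(dataset_fragments, [2, 0])
    print([frag.fragment_id for frag in picked])
```

Because the requested ids are collapsed into a set, duplicates are ignored and the result follows the dataset's fragment order rather than the order of the request.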
@@ -92,6 +274,12 @@ def ensure_vector_query(
|
||||
return val
|
||||
|
||||
|
||||
class ColumnOrdering(pydantic.BaseModel):
|
||||
column_name: str
|
||||
ascending: bool = True
|
||||
nulls_first: bool = False
|
||||
|
||||
|
||||
class FullTextQueryType(str, Enum):
|
||||
MATCH = "match"
|
||||
MATCH_PHRASE = "match_phrase"
|
||||
@@ -492,6 +680,13 @@ class Query(pydantic.BaseModel):
|
||||
# if true, include the row id in the results
|
||||
with_row_id: Optional[bool] = None
|
||||
|
||||
# if true, include the row address in the results
|
||||
with_row_address: Optional[bool] = None
|
||||
|
||||
# Lance fragments or fragment ids to scan on scanner-backed plain queries
|
||||
fragments: Optional[Any] = None
|
||||
fragment_ids: Optional[List[int]] = None
|
||||
|
||||
# offset to start fetching results from
|
||||
offset: Optional[int] = None
|
||||
|
||||
@@ -504,6 +699,8 @@ class Query(pydantic.BaseModel):
|
||||
# Bypass the vector index and use a brute force search
|
||||
bypass_vector_index: Optional[bool] = None
|
||||
|
||||
order_by: Optional[List[ColumnOrdering]] = None
|
||||
|
||||
@classmethod
|
||||
def from_inner(cls, req: PyQueryRequest) -> Self:
|
||||
query = cls()
|
||||
@@ -524,6 +721,8 @@ class Query(pydantic.BaseModel):
|
||||
query.refine_factor = req.refine_factor
|
||||
query.bypass_vector_index = req.bypass_vector_index
|
||||
query.postfilter = req.postfilter
|
||||
if req.order_by is not None:
|
||||
query.order_by = [ColumnOrdering(**o) for o in req.order_by]
|
||||
if req.full_text_search is not None:
|
||||
query.full_text_query = FullTextSearchQuery(
|
||||
columns=None,
|
||||
@@ -572,9 +771,22 @@ class LanceQueryBuilder(ABC):
|
||||
If "auto", the query type is inferred based on the query.
|
||||
vector_column_name: str
|
||||
The name of the vector column to use for vector search.
|
||||
ordering_field_name: Optional[str]
|
||||
.. deprecated:: 0.27.0
|
||||
Use ``order_by()`` method instead.
|
||||
fts_columns: Optional[Union[str, List[str]]]
|
||||
The columns to search in for full text search.
|
||||
fast_search: bool
|
||||
Skip flat search of unindexed data.
|
||||
"""
|
||||
if ordering_field_name is not None:
|
||||
import warnings
|
||||
|
||||
warnings.warn(
|
||||
"ordering_field_name is deprecated, use .order_by() method instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
# Check hybrid search first as it supports empty query pattern
|
||||
if query_type == "hybrid":
|
||||
# hybrid fts and vector query
|
||||
@@ -667,10 +879,14 @@ class LanceQueryBuilder(ABC):
|
||||
self._where = None
|
||||
self._postfilter = None
|
||||
self._with_row_id = None
|
||||
self._with_row_address = None
|
||||
self._fragments = None
|
||||
self._fragment_ids = None
|
||||
self._vector = None
|
||||
self._text = None
|
||||
self._ef = None
|
||||
self._bypass_vector_index = None
|
||||
self._order_by = None
|
||||
|
||||
@deprecation.deprecated(
|
||||
deprecated_in="0.3.1",
|
||||
@@ -693,7 +909,9 @@ class LanceQueryBuilder(ABC):
|
||||
self,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
*,
|
||||
blob_mode: BlobMode = "lazy",
|
||||
timeout: Optional[timedelta] = None,
|
||||
**kwargs,
|
||||
) -> "pd.DataFrame":
|
||||
"""
|
||||
Execute the query and return the results as a pandas DataFrame.
|
||||
@@ -711,9 +929,42 @@ class LanceQueryBuilder(ABC):
|
||||
timeout: Optional[timedelta]
|
||||
The maximum time to wait for the query to complete.
|
||||
If None, wait indefinitely.
|
||||
blob_mode: str, default "lazy"
|
||||
Controls how blob columns are returned for plain scan queries.
|
||||
Vector, FTS, hybrid, and other non-native query shapes keep the
|
||||
existing Arrow conversion path and only support blob descriptions.
|
||||
**kwargs
|
||||
Forwarded to pyarrow.Table.to_pandas after query execution and
|
||||
optional flattening.
|
||||
"""
|
||||
_validate_blob_mode(blob_mode)
|
||||
output_schema = getattr(self, "output_schema", None)
|
||||
if output_schema is not None:
|
||||
schema = output_schema()
|
||||
if _blob_mode_requires_native_pandas(blob_mode, schema):
|
||||
native_error = None
|
||||
if (flatten is None or blob_mode == "descriptions") and timeout is None:
|
||||
try:
|
||||
df = self._plain_scan_to_pandas(
|
||||
blob_mode, flatten=flatten, **kwargs
|
||||
)
|
||||
if df is not None:
|
||||
return df
|
||||
except Exception as err:
|
||||
native_error = err
|
||||
reason = (
|
||||
"this query shape cannot use Lance native pandas conversion"
|
||||
if native_error is None
|
||||
else str(native_error)
|
||||
)
|
||||
raise _unsupported_blob_pandas_error(reason) from native_error
|
||||
|
||||
tbl = flatten_columns(self.to_arrow(timeout=timeout), flatten)
|
||||
return tbl.to_pandas()
|
||||
if _blob_mode_requires_native_pandas(blob_mode, tbl.schema):
|
||||
raise _unsupported_blob_pandas_error(
|
||||
"this query shape cannot use Lance native pandas conversion"
|
||||
)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
|
||||
@abstractmethod
|
||||
def to_arrow(self, *, timeout: Optional[timedelta] = None) -> pa.Table:
|
||||
@@ -918,6 +1169,32 @@ class LanceQueryBuilder(ABC):
|
||||
self._with_row_id = with_row_id
|
||||
return self
|
||||
|
||||
def with_row_address(self, with_row_address: bool = True) -> Self:
|
||||
"""Set whether to return row addresses.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
with_row_address: bool, default True
|
||||
If True, return the _rowaddr column in the results.
|
||||
|
||||
Returns
|
||||
-------
|
||||
LanceQueryBuilder
|
||||
The LanceQueryBuilder object.
|
||||
"""
|
||||
self._with_row_address = with_row_address
|
||||
return self
|
||||
|
||||
def with_fragments(self, fragments: Any) -> Self:
|
||||
"""Set the Lance fragments to scan for plain scanner-backed queries."""
|
||||
self._fragments = fragments
|
||||
return self
|
||||
|
||||
def fragment_ids(self, fragment_ids: List[int]) -> Self:
|
||||
"""Set the Lance fragment ids to scan for plain scanner-backed queries."""
|
||||
self._fragment_ids = fragment_ids
|
||||
return self
|
||||
|
||||
def explain_plan(self, verbose: Optional[bool] = False) -> str:
|
||||
"""Return the execution plan for this query.
|
||||
|
||||
@@ -947,6 +1224,24 @@ class LanceQueryBuilder(ABC):
|
||||
""" # noqa: E501
|
||||
return self._table._explain_plan(self.to_query_object(), verbose=verbose)
|
||||
|
||||
def order_by(self, ordering: Optional[List[ColumnOrdering]]) -> Self:
|
||||
"""
|
||||
Set the ordering for the results.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
ordering: Optional[List[ColumnOrdering]]
|
||||
The ordering to use for the results. If None, then the default ordering
|
||||
will be used.
|
||||
|
||||
Returns
|
||||
-------
|
||||
LanceQueryBuilder
|
||||
The LanceQueryBuilder object.
|
||||
"""
|
||||
self._order_by = ordering
|
||||
return self
|
||||
|
||||
def analyze_plan(self) -> str:
|
||||
"""
|
||||
Run the query and return its execution plan with runtime metrics.
|
||||
@@ -1039,6 +1334,25 @@ class LanceQueryBuilder(ABC):
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def _plain_scan_to_pandas(
|
||||
self,
|
||||
blob_mode: BlobMode,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
**kwargs,
|
||||
) -> Optional["pd.DataFrame"]:
|
||||
query = self.to_query_object()
|
||||
if not _query_is_plain_scan(query):
|
||||
return None
|
||||
|
||||
dataset = self._table.to_lance()
|
||||
scanner = dataset.scanner(
|
||||
**_scanner_kwargs_for_query(query, blob_mode, dataset)
|
||||
)
|
||||
if flatten is not None:
|
||||
tbl = flatten_columns(_scanner_to_table(scanner), flatten)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
return _scanner_to_pandas(scanner, blob_mode, **kwargs)
|
||||
|
||||
@abstractmethod
|
||||
def to_query_object(self) -> Query:
|
||||
"""Return a serializable representation of the query
|
||||
@@ -1310,10 +1624,14 @@ class LanceVectorQueryBuilder(LanceQueryBuilder):
|
||||
refine_factor=self._refine_factor,
|
||||
vector_column=self._vector_column,
|
||||
with_row_id=self._with_row_id,
|
||||
with_row_address=self._with_row_address,
|
||||
fragments=self._fragments,
|
||||
fragment_ids=self._fragment_ids,
|
||||
offset=self._offset,
|
||||
fast_search=self._fast_search,
|
||||
ef=self._ef,
|
||||
bypass_vector_index=self._bypass_vector_index,
|
||||
order_by=self._order_by,
|
||||
)
|
||||
|
||||
def to_batches(
|
||||
@@ -1465,7 +1783,9 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
|
||||
super().__init__(table)
|
||||
self._query = query
|
||||
self._phrase_query = False
|
||||
self.ordering_field_name = ordering_field_name
|
||||
# Deprecated compatibility parameter. Native FTS ordering is now
|
||||
# configured through order_by(); LanceQueryBuilder.create emits the warning.
|
||||
_ = ordering_field_name
|
||||
self._reranker = None
|
||||
self._fast_search = fast_search
|
||||
if isinstance(fts_columns, str):
|
||||
@@ -1509,11 +1829,15 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
|
||||
limit=self._limit,
|
||||
postfilter=self._postfilter,
|
||||
with_row_id=self._with_row_id,
|
||||
with_row_address=self._with_row_address,
|
||||
fragments=self._fragments,
|
||||
fragment_ids=self._fragment_ids,
|
||||
full_text_query=FullTextSearchQuery(
|
||||
query=self._query, columns=self._fts_columns
|
||||
),
|
||||
offset=self._offset,
|
||||
fast_search=self._fast_search,
|
||||
order_by=self._order_by,
|
||||
)
|
||||
|
||||
def output_schema(self) -> pa.Schema:
|
||||
@@ -1578,7 +1902,11 @@ class LanceEmptyQueryBuilder(LanceQueryBuilder):
|
||||
filter=self._where,
|
||||
limit=self._limit,
|
||||
with_row_id=self._with_row_id,
|
||||
with_row_address=self._with_row_address,
|
||||
fragments=self._fragments,
|
||||
fragment_ids=self._fragment_ids,
|
||||
offset=self._offset,
|
||||
order_by=self._order_by,
|
||||
)
|
||||
|
||||
def output_schema(self) -> pa.Schema:
|
||||
@@ -2155,7 +2483,11 @@ class AsyncQueryBase(object):
|
||||
Base class for all async queries (take, scan, vector, fts, hybrid)
|
||||
"""
|
||||
|
||||
def __init__(self, inner: Union[LanceQuery, LanceVectorQuery, LanceTakeQuery]):
|
||||
def __init__(
|
||||
self,
|
||||
inner: Union[LanceQuery, LanceVectorQuery, LanceTakeQuery],
|
||||
table: Optional["AsyncTable"] = None,
|
||||
):
|
||||
"""
|
||||
Construct an AsyncQueryBase
|
||||
|
||||
@@ -2163,6 +2495,10 @@ class AsyncQueryBase(object):
|
||||
[AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
|
||||
"""
|
||||
self._inner = inner
|
||||
self._table = table
|
||||
self._with_row_address = None
|
||||
self._fragments = None
|
||||
self._fragment_ids = None
|
||||
|
||||
def to_query_object(self) -> Query:
|
||||
"""
|
||||
@@ -2171,7 +2507,11 @@ class AsyncQueryBase(object):
|
||||
This is currently experimental but can be useful as the query object is pure
|
||||
python and more easily serializable.
|
||||
"""
|
||||
return Query.from_inner(self._inner.to_query_request())
|
||||
query = Query.from_inner(self._inner.to_query_request())
|
||||
query.with_row_address = self._with_row_address
|
||||
query.fragments = self._fragments
|
||||
query.fragment_ids = self._fragment_ids
|
||||
return query
|
||||
|
||||
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
|
||||
"""
|
||||
@@ -2228,6 +2568,27 @@ class AsyncQueryBase(object):
|
||||
self._inner.with_row_id()
|
||||
return self
|
||||
|
||||
def with_row_address(self, with_row_address: bool = True) -> Self:
|
||||
"""
|
||||
Include the _rowaddr column in scanner-backed plain query results.
|
||||
"""
|
||||
self._with_row_address = with_row_address
|
||||
return self
|
||||
|
||||
def with_fragments(self, fragments: Any) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragments.
|
||||
"""
|
||||
self._fragments = fragments
|
||||
return self
|
||||
|
||||
def fragment_ids(self, fragment_ids: List[int]) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragment ids.
|
||||
"""
|
||||
self._fragment_ids = fragment_ids
|
||||
return self
|
||||
|
||||
async def to_batches(
|
||||
self,
|
||||
*,
|
||||
@@ -2305,6 +2666,9 @@ class AsyncQueryBase(object):
|
||||
self,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
timeout: Optional[timedelta] = None,
|
||||
*,
|
||||
blob_mode: BlobMode = "lazy",
|
||||
**kwargs,
|
||||
) -> "pd.DataFrame":
|
||||
"""
|
||||
Execute the query and collect the results into a pandas DataFrame.
|
||||
@@ -2337,10 +2701,63 @@ class AsyncQueryBase(object):
|
||||
The maximum time to wait for the query to complete.
|
||||
If not specified, no timeout is applied. If the query does not
|
||||
complete within the specified time, an error will be raised.
|
||||
blob_mode: str, default "lazy"
|
||||
Controls how blob columns are returned for plain scan queries.
|
||||
Vector, FTS, hybrid, and other non-native query shapes keep the
|
||||
existing Arrow conversion path and only support blob descriptions.
|
||||
**kwargs
|
||||
Forwarded to pyarrow.Table.to_pandas after query execution and
|
||||
optional flattening.
|
||||
"""
|
||||
return (
|
||||
flatten_columns(await self.to_arrow(timeout=timeout), flatten)
|
||||
).to_pandas()
|
||||
_validate_blob_mode(blob_mode)
|
||||
if hasattr(self._inner, "output_schema"):
|
||||
schema = await self.output_schema()
|
||||
if _blob_mode_requires_native_pandas(blob_mode, schema):
|
||||
native_error = None
|
||||
if (flatten is None or blob_mode == "descriptions") and timeout is None:
|
||||
try:
|
||||
df = await self._plain_scan_to_pandas(
|
||||
blob_mode, flatten=flatten, **kwargs
|
||||
)
|
||||
if df is not None:
|
||||
return df
|
||||
except Exception as err:
|
||||
native_error = err
|
||||
reason = (
|
||||
"this query shape cannot use Lance native pandas conversion"
|
||||
if native_error is None
|
||||
else str(native_error)
|
||||
)
|
||||
raise _unsupported_blob_pandas_error(reason) from native_error
|
||||
|
||||
tbl = flatten_columns(await self.to_arrow(timeout=timeout), flatten)
|
||||
if _blob_mode_requires_native_pandas(blob_mode, tbl.schema):
|
||||
raise _unsupported_blob_pandas_error(
|
||||
"this query shape cannot use Lance native pandas conversion"
|
||||
)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
|
||||
async def _plain_scan_to_pandas(
|
||||
self,
|
||||
blob_mode: BlobMode,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
**kwargs,
|
||||
) -> Optional["pd.DataFrame"]:
|
||||
if self._table is None:
|
||||
return None
|
||||
|
||||
query = self.to_query_object()
|
||||
if not _query_is_plain_scan(query):
|
||||
return None
|
||||
|
||||
dataset = await self._table._to_lance()
|
||||
scanner = dataset.scanner(
|
||||
**_scanner_kwargs_for_query(query, blob_mode, dataset)
|
||||
)
|
||||
if flatten is not None:
|
||||
tbl = flatten_columns(_scanner_to_table(scanner), flatten)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
return _scanner_to_pandas(scanner, blob_mode, **kwargs)
|
||||
|
||||
async def to_polars(
|
||||
self,
|
||||
@@ -2447,14 +2864,18 @@ class AsyncStandardQuery(AsyncQueryBase):
|
||||
Base class for "standard" async queries (all but take currently)
|
||||
"""
|
||||
|
||||
def __init__(self, inner: Union[LanceQuery, LanceVectorQuery]):
|
||||
def __init__(
|
||||
self,
|
||||
inner: Union[LanceQuery, LanceVectorQuery],
|
||||
table: Optional["AsyncTable"] = None,
|
||||
):
|
||||
"""
|
||||
Construct an AsyncStandardQuery
|
||||
|
||||
This method is not intended to be called directly. Instead, use the
|
||||
[AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
|
||||
"""
|
||||
super().__init__(inner)
|
||||
super().__init__(inner, table)
|
||||
|
||||
def where(self, predicate: Union[str, Expr]) -> Self:
|
||||
"""
|
||||
@@ -2502,6 +2923,27 @@ class AsyncStandardQuery(AsyncQueryBase):
|
||||
self._inner.offset(offset)
|
||||
return self
|
||||
|
||||
def order_by(self, ordering: Optional[List[ColumnOrdering]]) -> Self:
|
||||
"""
|
||||
Set the ordering for the results.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
ordering: Optional[List[ColumnOrdering]]
|
||||
The ordering to use for the results. If None, then the default ordering
|
||||
will be used.
|
||||
"""
|
||||
if ordering is None:
|
||||
self._inner.order_by(None)
|
||||
else:
|
||||
self._inner.order_by(
|
||||
[
|
||||
o.model_dump() if hasattr(o, "model_dump") else o.dict()
|
||||
for o in ordering
|
||||
]
|
||||
)
|
||||
return self
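The `model_dump()`/`dict()` fallback in `order_by` above looks like the usual way to support both Pydantic v2 (`model_dump`) and Pydantic v1 (`dict`) models. A minimal sketch of that duck-typed conversion follows. `V2Style`, `V1Style`, and `to_plain_dict` are hypothetical names used only for illustration, and the field names are assumptions rather than the real `ColumnOrdering` schema.

```python
from typing import Any, Dict, List


class V2Style:
    """Hypothetical object exposing the Pydantic v2 style method."""

    def model_dump(self) -> Dict[str, Any]:
        return {"column_name": "price", "ascending": True}


class V1Style:
    """Hypothetical object exposing only the Pydantic v1 style method."""

    def dict(self) -> Dict[str, Any]:
        return {"column_name": "name", "ascending": False}


def to_plain_dict(obj: Any) -> Dict[str, Any]:
    # Prefer the v2 method when present, otherwise fall back to the v1 one.
    return obj.model_dump() if hasattr(obj, "model_dump") else obj.dict()


orderings: List[Dict[str, Any]] = [to_plain_dict(o) for o in (V2Style(), V1Style())]
```

Either kind of object ends up as a plain dictionary, which is the shape the inner query receives in the hunk above.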
|
||||
|
||||
def fast_search(self) -> Self:
|
||||
"""
|
||||
Skip searching un-indexed data.
|
||||
@@ -2539,14 +2981,14 @@ class AsyncStandardQuery(AsyncQueryBase):
|
||||
|
||||
|
||||
class AsyncQuery(AsyncStandardQuery):
|
||||
def __init__(self, inner: LanceQuery):
|
||||
def __init__(self, inner: LanceQuery, table: Optional["AsyncTable"] = None):
|
||||
"""
|
||||
Construct an AsyncQuery
|
||||
|
||||
This method is not intended to be called directly. Instead, use the
|
||||
[AsyncTable.query][lancedb.table.AsyncTable.query] method to create a query.
|
||||
"""
|
||||
super().__init__(inner)
|
||||
super().__init__(inner, table)
|
||||
self._inner = inner
|
||||
|
||||
@classmethod
|
||||
@@ -2630,10 +3072,11 @@ class AsyncQuery(AsyncStandardQuery):
|
||||
new_self = self._inner.nearest_to(query_vectors[0])
|
||||
for v in query_vectors[1:]:
|
||||
new_self.add_query_vector(v)
|
||||
return AsyncVectorQuery(new_self)
|
||||
return AsyncVectorQuery(new_self, self._table)
|
||||
else:
|
||||
return AsyncVectorQuery(
|
||||
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
|
||||
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)),
|
||||
self._table,
|
||||
)
|
||||
|
||||
def nearest_to_text(
|
||||
@@ -2666,17 +3109,18 @@ class AsyncQuery(AsyncStandardQuery):
|
||||
|
||||
if isinstance(query, str):
|
||||
return AsyncFTSQuery(
|
||||
self._inner.nearest_to_text({"query": query, "columns": columns})
|
||||
self._inner.nearest_to_text({"query": query, "columns": columns}),
|
||||
self._table,
|
||||
)
|
||||
# FullTextQuery object
|
||||
return AsyncFTSQuery(self._inner.nearest_to_text({"query": query}))
|
||||
return AsyncFTSQuery(self._inner.nearest_to_text({"query": query}), self._table)
|
||||
|
||||
|
||||
class AsyncFTSQuery(AsyncStandardQuery):
|
||||
"""A query for full text search for LanceDB."""
|
||||
|
||||
def __init__(self, inner: LanceFTSQuery):
|
||||
super().__init__(inner)
|
||||
def __init__(self, inner: LanceFTSQuery, table: Optional["AsyncTable"] = None):
|
||||
super().__init__(inner, table)
|
||||
self._inner = inner
|
||||
self._reranker = None
|
||||
|
||||
@@ -2758,10 +3202,11 @@ class AsyncFTSQuery(AsyncStandardQuery):
|
||||
new_self = self._inner.nearest_to(query_vectors[0])
|
||||
for v in query_vectors[1:]:
|
||||
new_self.add_query_vector(v)
|
||||
return AsyncHybridQuery(new_self)
|
||||
return AsyncHybridQuery(new_self, self._table)
|
||||
else:
|
||||
return AsyncHybridQuery(
|
||||
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector))
|
||||
self._inner.nearest_to(AsyncQuery._query_vec_to_array(query_vector)),
|
||||
self._table,
|
||||
)
|
||||
|
||||
async def to_batches(
|
||||
@@ -2952,7 +3397,7 @@ class AsyncVectorQueryBase:
|
||||
|
||||
|
||||
class AsyncVectorQuery(AsyncStandardQuery, AsyncVectorQueryBase):
|
||||
def __init__(self, inner: LanceVectorQuery):
|
||||
def __init__(self, inner: LanceVectorQuery, table: Optional["AsyncTable"] = None):
|
||||
"""
|
||||
Construct an AsyncVectorQuery
|
||||
|
||||
@@ -2962,7 +3407,7 @@ class AsyncVectorQuery(AsyncStandardQuery, AsyncVectorQueryBase):
|
||||
a vector query. Or you can use
|
||||
[AsyncTable.vector_search][lancedb.table.AsyncTable.vector_search]
|
||||
"""
|
||||
super().__init__(inner)
|
||||
super().__init__(inner, table)
|
||||
self._inner = inner
|
||||
self._reranker = None
|
||||
self._query_string = None
|
||||
@@ -3016,10 +3461,13 @@ class AsyncVectorQuery(AsyncStandardQuery, AsyncVectorQueryBase):
|
||||
|
||||
if isinstance(query, str):
|
||||
return AsyncHybridQuery(
|
||||
self._inner.nearest_to_text({"query": query, "columns": columns})
|
||||
self._inner.nearest_to_text({"query": query, "columns": columns}),
|
||||
self._table,
|
||||
)
|
||||
# FullTextQuery object
|
||||
return AsyncHybridQuery(self._inner.nearest_to_text({"query": query}))
|
||||
return AsyncHybridQuery(
|
||||
self._inner.nearest_to_text({"query": query}), self._table
|
||||
)
|
||||
|
||||
async def to_batches(
|
||||
self,
|
||||
@@ -3046,8 +3494,8 @@ class AsyncHybridQuery(AsyncStandardQuery, AsyncVectorQueryBase):
|
||||
in the `rerank` method to convert the scores to ranks and then normalize them.
|
||||
"""
|
||||
|
||||
def __init__(self, inner: LanceHybridQuery):
|
||||
super().__init__(inner)
|
||||
def __init__(self, inner: LanceHybridQuery, table: Optional["AsyncTable"] = None):
|
||||
super().__init__(inner, table)
|
||||
self._inner = inner
|
||||
self._norm = "score"
|
||||
self._reranker = RRFReranker()
|
||||
@@ -3088,8 +3536,8 @@ class AsyncHybridQuery(AsyncStandardQuery, AsyncVectorQueryBase):
|
||||
max_batch_length: Optional[int] = None,
|
||||
timeout: Optional[timedelta] = None,
|
||||
) -> AsyncRecordBatchReader:
|
||||
fts_query = AsyncFTSQuery(self._inner.to_fts_query())
|
||||
vec_query = AsyncVectorQuery(self._inner.to_vector_query())
|
||||
fts_query = AsyncFTSQuery(self._inner.to_fts_query(), self._table)
|
||||
vec_query = AsyncVectorQuery(self._inner.to_vector_query(), self._table)
|
||||
|
||||
# save the row ID choice that was made on the query builder and force it
|
||||
# to actually fetch the row ids because we need this for reranking
|
||||
@@ -3189,8 +3637,16 @@ class AsyncTakeQuery(AsyncQueryBase):
|
||||
Builder for parameterizing and executing take queries.
|
||||
"""
|
||||
|
||||
def __init__(self, inner: LanceTakeQuery):
|
||||
super().__init__(inner)
|
||||
def __init__(self, inner: LanceTakeQuery, table: Optional["AsyncTable"] = None):
|
||||
super().__init__(inner, table)
|
||||
|
||||
async def _plain_scan_to_pandas(
|
||||
self,
|
||||
blob_mode: BlobMode,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
**kwargs,
|
||||
) -> Optional["pd.DataFrame"]:
|
||||
return None
|
||||
|
||||
|
||||
class BaseQueryBuilder(object):
|
||||
@@ -3242,6 +3698,27 @@ class BaseQueryBuilder(object):
|
||||
self._inner.with_row_id()
|
||||
return self
|
||||
|
||||
def with_row_address(self, with_row_address: bool = True) -> Self:
|
||||
"""
|
||||
Include the _rowaddr column in scanner-backed plain query results.
|
||||
"""
|
||||
self._inner.with_row_address(with_row_address)
|
||||
return self
|
||||
|
||||
def with_fragments(self, fragments: Any) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragments.
|
||||
"""
|
||||
self._inner.with_fragments(fragments)
|
||||
return self
|
||||
|
||||
def fragment_ids(self, fragment_ids: List[int]) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragment ids.
|
||||
"""
|
||||
self._inner.fragment_ids(fragment_ids)
|
||||
return self
|
||||
|
||||
def output_schema(self) -> pa.Schema:
|
||||
"""
|
||||
Return the output schema for the query
|
||||
@@ -3272,16 +3749,18 @@ class BaseQueryBuilder(object):
|
||||
If not specified, no timeout is applied. If the query does not
|
||||
complete within the specified time, an error will be raised.
|
||||
"""
|
||||
async_iter = LOOP.run(self._inner.execute(max_batch_length, timeout))
|
||||
async_reader = LOOP.run(
|
||||
self._inner.to_batches(max_batch_length=max_batch_length, timeout=timeout)
|
||||
)
|
||||
|
||||
def iter_sync():
|
||||
try:
|
||||
while True:
|
||||
yield LOOP.run(async_iter.__anext__())
|
||||
yield LOOP.run(async_reader.__anext__())
|
||||
except StopAsyncIteration:
|
||||
return
|
||||
|
||||
return pa.RecordBatchReader.from_batches(async_iter.schema, iter_sync())
|
||||
return pa.RecordBatchReader.from_batches(async_reader.schema, iter_sync())
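The `iter_sync` generator above drains an async reader from synchronous code by repeatedly running `__anext__()` until `StopAsyncIteration` is raised. The sketch below reproduces that pattern with the standard library only. It assumes a private event loop driven by `run_until_complete` in place of the project's `LOOP.run` helper, and `numbers` is a hypothetical async source standing in for the batch reader.

```python
import asyncio
from typing import Any, AsyncIterator, Iterator


async def numbers() -> AsyncIterator[int]:
    # Hypothetical async source standing in for an async batch reader.
    for i in range(3):
        await asyncio.sleep(0)
        yield i


def iter_sync(loop: asyncio.AbstractEventLoop, async_iter: Any) -> Iterator[Any]:
    async def next_item() -> Any:
        return await async_iter.__anext__()

    try:
        while True:
            # Block until the next item is ready, then hand it to the caller.
            yield loop.run_until_complete(next_item())
    except StopAsyncIteration:
        # The async source is exhausted; end the synchronous generator too.
        return


loop = asyncio.new_event_loop()
try:
    collected = list(iter_sync(loop, numbers()))
finally:
    loop.close()
```

Because the wrapper is itself a generator, items are pulled one at a time and nothing is buffered beyond the current item.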
|
||||
|
||||
def to_arrow(self, timeout: Optional[timedelta] = None) -> pa.Table:
|
||||
"""
|
||||
@@ -3321,6 +3800,9 @@ class BaseQueryBuilder(object):
|
||||
self,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
timeout: Optional[timedelta] = None,
|
||||
*,
|
||||
blob_mode: BlobMode = "lazy",
|
||||
**kwargs,
|
||||
) -> "pd.DataFrame":
|
||||
"""
|
||||
Execute the query and collect the results into a pandas DataFrame.
|
||||
@@ -3353,8 +3835,15 @@ class BaseQueryBuilder(object):
|
||||
The maximum time to wait for the query to complete.
|
||||
If not specified, no timeout is applied. If the query does not
|
||||
complete within the specified time, an error will be raised.
|
||||
blob_mode: str, default "lazy"
|
||||
Controls how blob columns are returned for plain scan queries.
|
||||
**kwargs
|
||||
Forwarded to pyarrow.Table.to_pandas after query execution and
|
||||
optional flattening.
|
||||
"""
|
||||
return LOOP.run(self._inner.to_pandas(flatten, timeout))
|
||||
return LOOP.run(
|
||||
self._inner.to_pandas(flatten, timeout, blob_mode=blob_mode, **kwargs)
|
||||
)
|
||||
|
||||
def to_polars(
|
||||
self,
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
|
||||
|
||||
from datetime import timedelta
|
||||
import json
|
||||
import logging
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
import sys
|
||||
@@ -17,7 +18,7 @@ else:
|
||||
|
||||
# Remove this import to fix circular dependency
|
||||
# from lancedb import connect_async
|
||||
from lancedb.remote import ClientConfig
|
||||
from lancedb.remote import ClientConfig, RetryConfig, TimeoutConfig, TlsConfig
|
||||
import pyarrow as pa
|
||||
|
||||
from ..common import DATA
|
||||
@@ -36,6 +37,64 @@ from ..table import Table
|
||||
from ..util import validate_table_name
|
||||
|
||||
|
||||
def _duration_seconds(value: Optional[timedelta]) -> Optional[float]:
|
||||
return value.total_seconds() if value is not None else None
|
||||
|
||||
|
||||
def _timeout_config_to_dict(
|
||||
config: Optional[TimeoutConfig],
|
||||
) -> Optional[dict[str, Any]]:
|
||||
if config is None:
|
||||
return None
|
||||
return {
|
||||
"timeout": _duration_seconds(config.timeout),
|
||||
"connect_timeout": _duration_seconds(config.connect_timeout),
|
||||
"read_timeout": _duration_seconds(config.read_timeout),
|
||||
"pool_idle_timeout": _duration_seconds(config.pool_idle_timeout),
|
||||
}
|
||||
|
||||
|
||||
def _retry_config_to_dict(config: RetryConfig) -> dict[str, Any]:
|
||||
return {
|
||||
"retries": config.retries,
|
||||
"connect_retries": config.connect_retries,
|
||||
"read_retries": config.read_retries,
|
||||
"backoff_factor": config.backoff_factor,
|
||||
"backoff_jitter": config.backoff_jitter,
|
||||
"statuses": config.statuses,
|
||||
}
|
||||
|
||||
|
||||
def _tls_config_to_dict(config: Optional[TlsConfig]) -> Optional[dict[str, Any]]:
|
||||
if config is None:
|
||||
return None
|
||||
return {
|
||||
"cert_file": config.cert_file,
|
||||
"key_file": config.key_file,
|
||||
"ssl_ca_cert": config.ssl_ca_cert,
|
||||
"assert_hostname": config.assert_hostname,
|
||||
}
|
||||
|
||||
|
||||
def _client_config_to_dict(config: ClientConfig) -> dict[str, Any]:
|
||||
if config.header_provider is not None:
|
||||
raise ValueError(
|
||||
"Cannot serialize a remote connection with a header_provider. "
|
||||
"Use static api_key/extra_headers or provide a worker-side "
|
||||
"connection factory instead."
|
||||
)
|
||||
return {
|
||||
"user_agent": config.user_agent,
|
||||
"retry_config": _retry_config_to_dict(config.retry_config),
|
||||
"timeout_config": _timeout_config_to_dict(config.timeout_config),
|
||||
"extra_headers": config.extra_headers,
|
||||
"id_delimiter": config.id_delimiter,
|
||||
"tls_config": _tls_config_to_dict(config.tls_config),
|
||||
"header_provider": None,
|
||||
"user_id": config.user_id,
|
||||
}
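The helpers above flatten config objects into JSON-friendly dictionaries, converting each `timedelta` to seconds and passing `None` through unchanged. A minimal sketch of the same idea follows. `DemoTimeouts` is a hypothetical stand-in dataclass, not the real `TimeoutConfig`, and it carries only two of the fields.

```python
import json
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Dict, Optional


@dataclass
class DemoTimeouts:
    """Hypothetical stand-in for a timeout config (not the LanceDB class)."""

    connect_timeout: Optional[timedelta] = None
    read_timeout: Optional[timedelta] = None


def duration_seconds(value: Optional[timedelta]) -> Optional[float]:
    # timedelta is not JSON serializable; a float number of seconds is.
    return value.total_seconds() if value is not None else None


def timeouts_to_dict(config: Optional[DemoTimeouts]) -> Optional[Dict[str, Any]]:
    if config is None:
        return None
    return {
        "connect_timeout": duration_seconds(config.connect_timeout),
        "read_timeout": duration_seconds(config.read_timeout),
    }


config = DemoTimeouts(connect_timeout=timedelta(seconds=1.5))
payload = json.dumps(timeouts_to_dict(config))
```

Unset durations serialize as JSON `null`, so a reader of the payload can tell "not configured" apart from a zero-second timeout.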
|
||||
|
||||
|
||||
class RemoteDBConnection(DBConnection):
|
||||
"""A connection to a remote LanceDB database."""
|
||||
|
||||
@@ -50,6 +109,7 @@ class RemoteDBConnection(DBConnection):
|
||||
connection_timeout: Optional[float] = None,
|
||||
read_timeout: Optional[float] = None,
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
read_consistency_interval: Optional[timedelta] = None,
|
||||
):
|
||||
"""Connect to a remote LanceDB database."""
|
||||
if isinstance(client_config, dict):
|
||||
@@ -88,6 +148,11 @@ class RemoteDBConnection(DBConnection):
|
||||
parsed = urlparse(db_url)
|
||||
if parsed.scheme != "db":
|
||||
raise ValueError(f"Invalid scheme: {parsed.scheme}, only accepts db://")
|
||||
self.db_url = db_url
|
||||
self.api_key = api_key
|
||||
self.region = region
|
||||
self.host_override = host_override
|
||||
self.storage_options = storage_options
|
||||
self.db_name = parsed.netloc
|
||||
|
||||
self.client_config = client_config
|
||||
@@ -103,12 +168,27 @@ class RemoteDBConnection(DBConnection):
|
||||
host_override=host_override,
|
||||
client_config=client_config,
|
||||
storage_options=storage_options,
|
||||
read_consistency_interval=read_consistency_interval,
|
||||
)
|
||||
)
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"RemoteConnect(name={self.db_name})"
|
||||
|
||||
@override
|
||||
def serialize(self) -> str:
|
||||
return json.dumps(
|
||||
{
|
||||
"connection_type": "remote",
|
||||
"db_url": self.db_url,
|
||||
"api_key": self.api_key,
|
||||
"region": self.region,
|
||||
"host_override": self.host_override,
|
||||
"client_config": _client_config_to_dict(self.client_config),
|
||||
"storage_options": self.storage_options,
|
||||
}
|
||||
)
|
||||
|
||||
@override
|
||||
def list_namespaces(
|
||||
self,
|
||||
@@ -329,7 +409,12 @@ class RemoteDBConnection(DBConnection):
|
||||
)
|
||||
|
||||
table = LOOP.run(self._conn.open_table(name, namespace_path=namespace_path))
|
||||
return RemoteTable(table, self.db_name)
|
||||
return RemoteTable(
|
||||
table,
|
||||
self.db_name,
|
||||
connection_state=self.serialize,
|
||||
namespace_path=namespace_path,
|
||||
)
|
||||
|
||||
def clone_table(
|
||||
self,
|
||||
@@ -378,7 +463,12 @@ class RemoteDBConnection(DBConnection):
|
||||
is_shallow=is_shallow,
|
||||
)
|
||||
)
|
||||
return RemoteTable(table, self.db_name)
|
||||
return RemoteTable(
|
||||
table,
|
||||
self.db_name,
|
||||
connection_state=self.serialize,
|
||||
namespace_path=target_namespace_path,
|
||||
)
|
||||
|
||||
@override
|
||||
def create_table(
|
||||
@@ -523,7 +613,12 @@ class RemoteDBConnection(DBConnection):
|
||||
fill_value=fill_value,
|
||||
)
|
||||
)
|
||||
return RemoteTable(table, self.db_name)
|
||||
return RemoteTable(
|
||||
table,
|
||||
self.db_name,
|
||||
connection_state=self.serialize,
|
||||
namespace_path=namespace_path,
|
||||
)
|
||||
|
||||
@override
|
||||
def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):
|
||||
|
||||
@@ -27,6 +27,9 @@ class LanceDBClientError(RuntimeError):
|
||||
self.request_id = request_id
|
||||
self.status_code = status_code
|
||||
|
||||
def __reduce__(self) -> tuple[type, tuple]:
|
||||
return (self.__class__, (str(self), self.request_id, self.status_code))
|
||||
|
||||
|
||||
class HttpError(LanceDBClientError):
|
||||
"""An error that occurred during an HTTP request.
|
||||
@@ -101,3 +104,19 @@ class RetryError(LanceDBClientError):
|
||||
self.max_request_failures = max_request_failures
|
||||
self.max_connect_failures = max_connect_failures
|
||||
self.max_read_failures = max_read_failures
|
||||
|
||||
def __reduce__(self) -> tuple[type, tuple]:
|
||||
return (
|
||||
self.__class__,
|
||||
(
|
||||
str(self),
|
||||
self.request_id,
|
||||
self.request_failures,
|
||||
self.connect_failures,
|
||||
self.read_failures,
|
||||
self.max_request_failures,
|
||||
self.max_connect_failures,
|
||||
self.max_read_failures,
|
||||
self.status_code,
|
||||
),
|
||||
)
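The two `__reduce__` methods above return the class together with every constructor argument. The default exception reduction replays only `self.args`, which here would hold just the message, so an exception whose constructor needs more than that cannot be rebuilt. The sketch below shows the pattern on a hypothetical `ClientError` whose signature is an assumption, not the real `LanceDBClientError` one. It uses `copy.copy`, which goes through the same `__reduce__` hook as `pickle`, to keep the example self-contained.

```python
import copy
from typing import Optional


class ClientError(RuntimeError):
    """Hypothetical error with a required extra constructor argument."""

    def __init__(self, message: str, request_id: str, status_code: Optional[int] = None):
        super().__init__(message)  # self.args only holds the message
        self.request_id = request_id
        self.status_code = status_code

    def __reduce__(self):
        # Supply every constructor argument so the object can be rebuilt.
        return (self.__class__, (str(self), self.request_id, self.status_code))


err = ClientError("boom", "req-1", 503)
clone = copy.copy(err)  # rebuilt as ClientError("boom", "req-1", 503)
```

Without the override, reconstruction would call `ClientError("boom")` and fail with a `TypeError` for the missing `request_id`.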
|
||||
|
||||
@@ -2,18 +2,34 @@
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
from datetime import timedelta
|
||||
import deprecation
|
||||
import logging
|
||||
from functools import cached_property
|
||||
from typing import Any, Callable, Dict, Iterable, List, Optional, Union, Literal
|
||||
import os
|
||||
from typing import (
|
||||
Any,
|
||||
Callable,
|
||||
Dict,
|
||||
Iterable,
|
||||
List,
|
||||
Optional,
|
||||
Union,
|
||||
Literal,
|
||||
overload,
|
||||
)
|
||||
import warnings
|
||||
|
||||
from lancedb import __version__
|
||||
|
||||
from lancedb._lancedb import (
|
||||
AddColumnsResult,
|
||||
AddResult,
|
||||
AlterColumnsResult,
|
||||
UpdateFieldMetadataResult,
|
||||
DeleteResult,
|
||||
DropColumnsResult,
|
||||
IndexConfig,
|
||||
LsmWriteSpec,
|
||||
MergeResult,
|
||||
UpdateResult,
|
||||
)
|
||||
@@ -31,6 +47,7 @@ from lancedb.index import (
|
||||
LabelList,
|
||||
)
|
||||
from lancedb.remote.db import LOOP
|
||||
from lancedb.table import IndexConfigType, KNOWN_METRICS
|
||||
import pyarrow as pa
|
||||
|
||||
from lancedb.common import DATA, VEC, VECTOR_COLUMN_NAME
|
||||
@@ -39,7 +56,7 @@ from lancedb.embeddings import EmbeddingFunctionRegistry
|
||||
from lancedb.table import _normalize_progress
|
||||
|
||||
from ..query import LanceVectorQueryBuilder, LanceQueryBuilder, LanceTakeQueryBuilder
|
||||
from ..table import AsyncTable, IndexStatistics, Query, Table, Tags
|
||||
from ..table import AsyncTable, BlobMode, IndexStatistics, Query, Table, Tags
|
||||
from ..types import BaseTokenizerType
|
||||
|
||||
|
||||
@@ -48,14 +65,80 @@ class RemoteTable(Table):
|
||||
self,
|
||||
table: AsyncTable,
|
||||
db_name: str,
|
||||
*,
|
||||
connection_state: Optional[Union[str, Callable[[], str]]] = None,
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
):
|
||||
self._table = table
|
||||
self._table_handle = table
|
||||
self._name = table.name
|
||||
self.db_name = db_name
|
||||
self._connection_state = connection_state
|
||||
self._namespace_path = list(namespace_path or [])
|
||||
self._checkout_version: Optional[int] = None
|
||||
self._pid = os.getpid()
|
||||
|
||||
def _serialized_connection_state(self) -> str:
|
||||
if self._connection_state is None:
|
||||
raise RuntimeError(
|
||||
"Cannot reopen this remote table because it does not carry "
|
||||
"serialized connection state"
|
||||
)
|
||||
if callable(self._connection_state):
|
||||
self._connection_state = self._connection_state()
|
||||
return self._connection_state
|
||||
|
||||
@property
|
||||
def _table(self) -> AsyncTable:
|
||||
self._ensure_open()
|
||||
assert self._table_handle is not None
|
||||
return self._table_handle
|
||||
|
||||
@_table.setter
|
||||
def _table(self, table: AsyncTable) -> None:
|
||||
self._table_handle = table
|
||||
self._name = table.name
|
||||
self._pid = os.getpid()
|
||||
|
||||
def _ensure_open(self) -> None:
|
||||
pid = os.getpid()
|
||||
if self._table_handle is not None and self._pid == pid:
|
||||
return
|
||||
|
||||
# Pickle clears the handle; fork inherits a handle created in the
|
||||
# parent process. In both cases reopen before touching the Rust client.
|
||||
from lancedb import deserialize_conn
|
||||
|
||||
db = deserialize_conn(self._serialized_connection_state(), for_worker=True)
|
||||
table = db.open_table(self._name, namespace_path=self._namespace_path)
|
||||
if self._checkout_version is not None:
|
||||
table.checkout(self._checkout_version)
|
||||
|
||||
self._table_handle = table._table
|
||||
self.db_name = table.db_name
|
||||
self._pid = pid
|
||||
|
||||
def __getstate__(self) -> dict:
|
||||
return {
|
||||
"connection_state": self._serialized_connection_state(),
|
||||
"db_name": self.db_name,
|
||||
"name": self.name,
|
||||
"namespace_path": self._namespace_path,
|
||||
"checkout_version": self._checkout_version,
|
||||
}
|
||||
|
||||
def __setstate__(self, state: dict) -> None:
|
||||
self._table_handle = None
|
||||
self._name = state["name"]
|
||||
self.db_name = state["db_name"]
|
||||
self._connection_state = state["connection_state"]
|
||||
self._namespace_path = state["namespace_path"]
|
||||
self._checkout_version = state["checkout_version"]
|
||||
self._pid = None
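The `__getstate__`/`__setstate__` pair above serializes only plain data, drops the live handle, and relies on `_ensure_open` to reconnect lazily when no handle exists or the process id has changed. A minimal sketch of that pattern follows. `LazyHandle` is hypothetical, and a formatted string stands in for the real table handle. `copy.copy` is used because it drives the same state hooks as `pickle`.

```python
import copy
import os
from typing import Callable, Optional, Union


class LazyHandle:
    """Hypothetical wrapper that reopens its handle on demand."""

    def __init__(self, name: str, state: Union[str, Callable[[], str]]):
        self._name = name
        self._state = state
        self._handle: Optional[str] = None
        self._pid: Optional[int] = None

    def _resolved_state(self) -> str:
        # A callable provider is resolved once and replaced by its result.
        if callable(self._state):
            self._state = self._state()
        return self._state

    @property
    def handle(self) -> str:
        pid = os.getpid()
        if self._handle is None or self._pid != pid:
            # Reopen after unpickling (no handle) or in a forked child (new pid).
            self._handle = f"{self._resolved_state()}/{self._name}@{pid}"
            self._pid = pid
        return self._handle

    def __getstate__(self) -> dict:
        # Only plain, serializable data leaves the process.
        return {"name": self._name, "state": self._resolved_state()}

    def __setstate__(self, state: dict) -> None:
        self._name = state["name"]
        self._state = state["state"]
        self._handle = None  # force a reopen on first use
        self._pid = None


original = LazyHandle("items", lambda: "db://demo")
_ = original.handle          # opens the handle in this process
clone = copy.copy(original)  # goes through __getstate__/__setstate__
```

The clone starts without a handle and rebuilds one the first time `clone.handle` is read, which mirrors how the table above reopens itself in a worker process.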
|
||||
|
||||
@property
|
||||
def name(self) -> str:
|
||||
"""The name of the table"""
|
||||
return self._table.name
|
||||
return self._name
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"RemoteTable({self.db_name}.{self.name})"
|
||||
@@ -100,18 +183,24 @@ class RemoteTable(Table):
|
||||
"""to_arrow() is not yet supported on LanceDB cloud."""
|
||||
raise NotImplementedError("to_arrow() is not yet supported on LanceDB cloud.")
|
||||
|
||||
def to_pandas(self):
|
||||
def to_pandas(self, blob_mode: BlobMode = "lazy", **kwargs):
|
||||
"""to_pandas() is not yet supported on LanceDB cloud."""
|
||||
raise NotImplementedError("to_pandas() is not yet supported on LanceDB cloud.")
|
||||
|
||||
def checkout(self, version: Union[int, str]):
|
||||
return LOOP.run(self._table.checkout(version))
|
||||
result = LOOP.run(self._table.checkout(version))
|
||||
self._checkout_version = self.version
|
||||
return result
|
||||
|
||||
def checkout_latest(self):
|
||||
return LOOP.run(self._table.checkout_latest())
|
||||
result = LOOP.run(self._table.checkout_latest())
|
||||
self._checkout_version = None
|
||||
return result
|
||||
|
||||
def restore(self, version: Optional[Union[int, str]] = None):
|
||||
return LOOP.run(self._table.restore(version))
|
||||
result = LOOP.run(self._table.restore(version))
|
||||
self._checkout_version = None
|
||||
return result
|
||||
|
||||
def list_indices(self) -> Iterable[IndexConfig]:
|
||||
"""List all the indices on the table"""
|
||||
@@ -121,6 +210,11 @@ class RemoteTable(Table):
|
||||
"""List all the stats of a specified index"""
|
||||
return LOOP.run(self._table.index_stats(index_uuid))
|
||||
|
||||
@deprecation.deprecated(
|
||||
deprecated_in="0.25.0",
|
||||
current_version=__version__,
|
||||
details="Use create_index() with config=BTree()/Bitmap()/LabelList() instead.",
|
||||
)
|
||||
def create_scalar_index(
|
||||
self,
|
||||
column: str,
|
||||
@@ -130,7 +224,12 @@ class RemoteTable(Table):
|
||||
wait_timeout: Optional[timedelta] = None,
|
||||
name: Optional[str] = None,
|
||||
):
|
||||
"""Creates a scalar index
|
||||
"""Creates a scalar index.
|
||||
|
||||
.. deprecated:: 0.25.0
|
||||
Use :meth:`create_index` with a BTree, Bitmap, or LabelList config instead.
|
||||
Example: ``table.create_index("column", config=BTree())``
|
||||
|
||||
Parameters
|
||||
----------
|
||||
column : str
|
||||
@@ -161,6 +260,11 @@ class RemoteTable(Table):
|
||||
)
|
||||
)
|
||||
|
||||
@deprecation.deprecated(
|
||||
deprecated_in="0.25.0",
|
||||
current_version=__version__,
|
||||
details="Use create_index() with config=FTS() instead.",
|
||||
)
|
||||
def create_fts_index(
|
||||
self,
|
||||
column: str,
|
||||
@@ -181,6 +285,12 @@ class RemoteTable(Table):
|
||||
prefix_only: bool = False,
|
||||
name: Optional[str] = None,
|
||||
):
|
||||
"""Create a full-text search index on a column.
|
||||
|
||||
.. deprecated:: 0.25.0
|
||||
Use :meth:`create_index` with an FTS config instead.
|
||||
Example: ``table.create_index("text_column", config=FTS())``
|
||||
"""
|
||||
config = FTS(
|
||||
with_position=with_position,
|
||||
base_tokenizer=base_tokenizer,
|
||||
@@ -204,9 +314,43 @@ class RemoteTable(Table):
|
||||
)
|
||||
)
|
||||
|
||||
# New unified API overload
|
||||
@overload
|
||||
def create_index(
|
||||
self,
|
||||
metric="l2",
|
||||
column: str,
|
||||
/,
|
||||
*,
|
||||
config: IndexConfigType,
|
||||
wait_timeout: Optional[timedelta] = ...,
|
||||
name: Optional[str] = ...,
|
||||
train: bool = ...,
|
||||
) -> None: ...
|
||||
|
||||
# Legacy API overload (deprecated)
|
||||
@overload
|
||||
def create_index(
|
||||
self,
|
||||
metric: Literal["l2", "cosine", "dot", "hamming"] = ...,
|
||||
vector_column_name: str = ...,
|
||||
index_cache_size: Optional[int] = ...,
|
||||
num_partitions: Optional[int] = ...,
|
||||
num_sub_vectors: Optional[int] = ...,
|
||||
replace: Optional[bool] = ...,
|
||||
accelerator: Optional[str] = ...,
|
||||
index_type: Literal[
|
||||
"VECTOR", "IVF_FLAT", "IVF_SQ", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
|
||||
] = ...,
|
||||
wait_timeout: Optional[timedelta] = ...,
|
||||
*,
|
||||
num_bits: int = ...,
|
||||
name: Optional[str] = ...,
|
||||
train: bool = ...,
|
||||
) -> None: ...
|
||||
|
||||
def create_index(
|
||||
self,
|
||||
metric: str = "l2",
|
||||
vector_column_name: str = VECTOR_COLUMN_NAME,
|
||||
index_cache_size: Optional[int] = None,
|
||||
num_partitions: Optional[int] = None,
|
||||
@@ -217,89 +361,113 @@ class RemoteTable(Table):
|
||||
wait_timeout: Optional[timedelta] = None,
|
||||
*,
|
||||
num_bits: int = 8,
|
||||
config: Optional[IndexConfigType] = None,
|
||||
name: Optional[str] = None,
|
||||
train: bool = True,
|
||||
):
|
||||
"""Create an index on the table.
|
||||
"""Create an index on a column.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
metric : str
|
||||
The metric to use for the index. Default is "l2".
|
||||
vector_column_name : str
|
||||
The name of the vector column. Default is "vector".
|
||||
This method supports both the new unified API and the legacy API
|
||||
for backwards compatibility. The new API takes the column name as the
|
||||
first positional argument and an index configuration object via
|
||||
``config``; the legacy API takes the distance metric as the first
|
||||
argument plus separate ``vector_column_name`` / ``num_partitions`` /
|
||||
etc. parameters, and emits a ``DeprecationWarning``.
|
||||
|
||||
Examples
|
||||
--------
|
||||
>>> import lancedb
|
||||
>>> import uuid
|
||||
>>> from lancedb.schema import vector
|
||||
>>> db = lancedb.connect("db://...", api_key="...", # doctest: +SKIP
|
||||
... region="...") # doctest: +SKIP
|
||||
>>> table_name = uuid.uuid4().hex
|
||||
>>> schema = pa.schema(
|
||||
... [
|
||||
... pa.field("id", pa.uint32(), False),
|
||||
... pa.field("vector", vector(128), False),
|
||||
... pa.field("s", pa.string(), False),
|
||||
... ]
|
||||
New API (recommended):
|
||||
|
||||
>>> table.create_index( # doctest: +SKIP
|
||||
... "vector", config=IvfPq(distance_type="l2")
|
||||
... )
|
||||
>>> table = db.create_table( # doctest: +SKIP
|
||||
... table_name, # doctest: +SKIP
|
||||
... schema=schema, # doctest: +SKIP
|
||||
>>> table.create_index("category", config=BTree()) # doctest: +SKIP
|
||||
>>> table.create_index("content", config=FTS()) # doctest: +SKIP
|
||||
|
||||
Legacy API (deprecated):
|
||||
|
||||
>>> table.create_index( # doctest: +SKIP
|
||||
... "l2", vector_column_name="vector"
|
||||
... )
|
||||
>>> table.create_index("l2", "vector") # doctest: +SKIP
|
||||
"""
|
||||
# Detect whether this is a legacy API call
|
||||
is_legacy = self._is_legacy_create_index_call(
|
||||
metric,
|
||||
config,
|
||||
num_partitions,
|
||||
num_sub_vectors,
|
||||
vector_column_name,
|
||||
accelerator,
|
||||
index_cache_size,
|
||||
replace,
|
||||
)
|
||||
|
||||
if accelerator is not None:
|
||||
logging.warning(
|
||||
"GPU accelerator is not yet supported on LanceDB cloud. "
|
||||
"If you have 100M+ vectors to index, "
|
||||
"please contact us at contact@lancedb.com"
|
||||
)
|
||||
if replace is not None:
|
||||
logging.warning(
|
||||
"replace is not supported on LanceDB cloud. "
|
||||
"Existing indexes will always be replaced."
|
||||
if is_legacy:
|
||||
warnings.warn(
|
||||
"The create_index() API with metric/num_partitions parameters is "
|
||||
"deprecated and will be removed in a future version. "
|
||||
"Please migrate to the new unified API:\n"
|
||||
" # Old (deprecated):\n"
|
||||
" table.create_index('l2', vector_column_name='my_vector')\n"
|
||||
" # New (recommended):\n"
|
||||
" table.create_index('my_vector', config=IvfPq(distance_type='l2'))",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
index_type = index_type.upper()
|
||||
if index_type == "VECTOR" or index_type == "IVF_PQ":
|
||||
config = IvfPq(
|
||||
distance_type=metric,
|
||||
num_partitions=num_partitions,
|
||||
num_sub_vectors=num_sub_vectors,
|
||||
num_bits=num_bits,
|
||||
)
|
||||
elif index_type == "IVF_RQ":
|
||||
config = IvfRq(
|
||||
distance_type=metric,
|
||||
num_partitions=num_partitions,
|
||||
num_bits=num_bits,
|
||||
)
|
||||
elif index_type == "IVF_SQ":
|
||||
config = IvfSq(distance_type=metric, num_partitions=num_partitions)
|
||||
elif index_type == "IVF_HNSW_PQ":
|
||||
raise ValueError(
|
||||
"IVF_HNSW_PQ is not supported on LanceDB cloud. "
|
||||
"Please use IVF_HNSW_SQ instead."
|
||||
)
|
||||
elif index_type == "IVF_HNSW_SQ":
|
||||
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
|
||||
elif index_type == "IVF_HNSW_FLAT":
|
||||
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
|
||||
elif index_type == "IVF_FLAT":
|
||||
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
|
||||
column = vector_column_name
|
||||
|
||||
if accelerator is not None:
|
||||
logging.warning(
|
||||
"GPU accelerator is not yet supported on LanceDB cloud. "
|
||||
"If you have 100M+ vectors to index, "
|
||||
"please contact us at contact@lancedb.com"
|
||||
)
|
||||
if replace is not None:
|
||||
logging.warning(
|
||||
"replace is not supported on LanceDB cloud. "
|
||||
"Existing indexes will always be replaced."
|
||||
)
|
||||
|
||||
idx_type = index_type.upper()
|
||||
if idx_type == "VECTOR" or idx_type == "IVF_PQ":
|
||||
config = IvfPq(
|
||||
distance_type=metric,
|
||||
num_partitions=num_partitions,
|
||||
num_sub_vectors=num_sub_vectors,
|
||||
num_bits=num_bits,
|
||||
)
|
||||
elif idx_type == "IVF_RQ":
|
||||
config = IvfRq(
|
||||
distance_type=metric,
|
||||
num_partitions=num_partitions,
|
||||
num_bits=num_bits,
|
||||
)
|
||||
elif idx_type == "IVF_SQ":
|
||||
config = IvfSq(distance_type=metric, num_partitions=num_partitions)
|
||||
elif idx_type == "IVF_HNSW_PQ":
|
||||
raise ValueError(
|
||||
"IVF_HNSW_PQ is not supported on LanceDB cloud. "
|
||||
"Please use IVF_HNSW_SQ instead."
|
||||
)
|
||||
elif idx_type == "IVF_HNSW_SQ":
|
||||
config = HnswSq(distance_type=metric, num_partitions=num_partitions)
|
||||
elif idx_type == "IVF_HNSW_FLAT":
|
||||
config = HnswFlat(distance_type=metric, num_partitions=num_partitions)
|
||||
elif idx_type == "IVF_FLAT":
|
||||
config = IvfFlat(distance_type=metric, num_partitions=num_partitions)
|
||||
else:
|
||||
raise ValueError(
|
||||
f"Unknown vector index type: {idx_type}. Valid options are"
|
||||
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
|
||||
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
|
||||
)
|
||||
else:
|
||||
raise ValueError(
|
||||
f"Unknown vector index type: {index_type}. Valid options are"
|
||||
" 'IVF_FLAT', 'IVF_PQ', 'IVF_RQ', 'IVF_SQ',"
|
||||
" 'IVF_HNSW_PQ', 'IVF_HNSW_SQ', 'IVF_HNSW_FLAT'"
|
||||
)
|
||||
column = metric
|
||||
|
||||
LOOP.run(
|
||||
self._table.create_index(
|
||||
vector_column_name,
|
||||
column,
|
||||
config=config,
|
||||
wait_timeout=wait_timeout,
|
||||
name=name,
|
||||
@@ -307,6 +475,37 @@ class RemoteTable(Table):
|
||||
)
|
||||
)
|
||||
|
||||
def _is_legacy_create_index_call(
|
||||
self,
|
||||
first_arg: str,
|
||||
config: Optional[IndexConfigType],
|
||||
num_partitions: Optional[int],
|
||||
num_sub_vectors: Optional[int],
|
||||
vector_column_name: str,
|
||||
accelerator: Optional[str],
|
||||
index_cache_size: Optional[int],
|
||||
replace: Optional[bool],
|
||||
) -> bool:
|
||||
"""Detect if this is a legacy create_index call."""
|
||||
if config is not None:
|
||||
return False
|
||||
if any(
|
||||
x is not None
|
||||
for x in (
|
||||
num_partitions,
|
||||
num_sub_vectors,
|
||||
accelerator,
|
||||
index_cache_size,
|
||||
replace,
|
||||
)
|
||||
):
|
||||
return True
|
||||
if vector_column_name != VECTOR_COLUMN_NAME:
|
||||
return True
|
||||
if first_arg.lower() in KNOWN_METRICS:
|
||||
return True
|
||||
return False
|
||||
|
||||
def add(
|
||||
self,
|
||||
data: DATA,
|
||||
@@ -652,9 +851,30 @@ class RemoteTable(Table):
|
||||
) -> AlterColumnsResult:
|
||||
return LOOP.run(self._table.alter_columns(*alterations))
|
||||
|
||||
def update_field_metadata(
|
||||
self, *updates: dict[str, Any]
|
||||
) -> UpdateFieldMetadataResult:
|
||||
return LOOP.run(self._table.update_field_metadata(*updates))
|
||||
|
||||
def drop_columns(self, columns: Iterable[str]) -> DropColumnsResult:
|
||||
return LOOP.run(self._table.drop_columns(columns))
|
||||
|
||||
def set_unenforced_primary_key(self, columns: Union[str, Iterable[str]]) -> None:
|
||||
"""Not supported on LanceDB Cloud."""
|
||||
return LOOP.run(self._table.set_unenforced_primary_key(columns))
|
||||
|
||||
def set_lsm_write_spec(self, spec: "LsmWriteSpec") -> None:
|
||||
"""Not supported on LanceDB Cloud."""
|
||||
return LOOP.run(self._table.set_lsm_write_spec(spec))
|
||||
|
||||
def unset_lsm_write_spec(self) -> None:
|
||||
"""Not supported on LanceDB Cloud."""
|
||||
return LOOP.run(self._table.unset_lsm_write_spec())
|
||||
|
||||
def close_lsm_writers(self) -> None:
|
||||
"""No-op on LanceDB Cloud (no local shard writers)."""
|
||||
return LOOP.run(self._table.close_lsm_writers())
|
||||
|
||||
def drop_index(self, index_name: str):
|
||||
return LOOP.run(self._table.drop_index(index_name))
|
||||
|
||||
|
||||
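The `_is_legacy_create_index_call` helper in the hunk above decides which calling convention was used by checking a few signals in order. The sketch below is a simplified, standalone version of that ordering. The values of `VECTOR_COLUMN_NAME` and `KNOWN_METRICS` are assumptions (taken from the docstring default and the metric literals in the legacy overload); the real constants are defined elsewhere in the package and may differ.

```python
from typing import Optional

# Assumed values for illustration only.
VECTOR_COLUMN_NAME = "vector"
KNOWN_METRICS = {"l2", "cosine", "dot", "hamming"}


def is_legacy_call(
    first_arg: str,
    config: Optional[object] = None,
    num_partitions: Optional[int] = None,
    vector_column_name: str = VECTOR_COLUMN_NAME,
) -> bool:
    """Simplified version of the checks shown in the diff."""
    # An explicit config always selects the new API.
    if config is not None:
        return False
    # Any legacy-only tuning parameter selects the legacy API.
    if num_partitions is not None:
        return True
    # A non-default vector column name is only meaningful in the legacy API.
    if vector_column_name != VECTOR_COLUMN_NAME:
        return True
    # Otherwise fall back to whether the first argument looks like a metric.
    return first_arg.lower() in KNOWN_METRICS
```

In this sketch a column that happens to be named `l2` or `cosine` is treated as a metric unless `config` is passed, so passing `config` explicitly is the unambiguous way to select the new API.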
@@ -102,8 +102,15 @@ class LinearCombinationReranker(Reranker):
|
||||
|
||||
combined_list = []
|
||||
for row_id, result in results.items():
|
||||
# Convert vector distance to a relevance score in [0, 1] where
|
||||
# higher is better. Missing vector entries are penalised with
|
||||
# `_invert_score(fill)` = 1 - fill (= 0.0 for the default fill=1).
|
||||
vector_score = self._invert_score(result.get("_distance", fill))
|
||||
fts_score = result.get("_score", fill)
|
||||
# FTS scores (BM25) are already in a "higher = more relevant" space.
|
||||
# Missing FTS entries are penalised symmetrically: we use
|
||||
# `1 - fill` so that the same `fill` value drives both missing-vector
|
||||
# and missing-FTS penalties in the same direction.
|
||||
fts_score = result.get("_score", 1 - fill)
|
||||
result["_relevance_score"] = self._combine_score(vector_score, fts_score)
|
||||
combined_list.append(result)
|
||||
|
||||
@@ -123,8 +130,12 @@ class LinearCombinationReranker(Reranker):
|
||||
return tbl
|
||||
|
||||
def _combine_score(self, vector_score, fts_score):
|
||||
# these scores represent distance
|
||||
return 1 - (self.weight * vector_score + (1 - self.weight) * fts_score)
|
||||
# Both vector_score (inverted distance) and fts_score are in a
|
||||
# "higher = more relevant" space. A straight weighted average gives
|
||||
# higher _relevance_score to better matches, as expected.
|
||||
# Previously this returned `1 - (...)` which inverted the final
|
||||
# ranking so that the *least* relevant document ranked first.
|
||||
return self.weight * vector_score + (1 - self.weight) * fts_score
|
||||
|
||||
def _invert_score(self, dist: float):
|
||||
# Invert the score between relevance and distance
|
||||
|
||||
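The comments in the two hunks above describe the corrected scoring: distances are inverted into a "higher is better" score, a missing entry on either side is penalised via `fill`, and the two scores are averaged by weight. The following standalone sketch works through that arithmetic. The function names and the default `weight=0.7` are illustrative assumptions, and `invert_score` follows the `1 - fill` description in the comment rather than the method body, which is not shown here.

```python
def invert_score(dist: float) -> float:
    # As described in the comment above: 1 - distance, so higher is better.
    return 1 - dist


def combine_score(weight: float, vector_score: float, fts_score: float) -> float:
    # Straight weighted average of two "higher = more relevant" scores.
    return weight * vector_score + (1 - weight) * fts_score


def relevance(result: dict, weight: float = 0.7, fill: float = 1.0) -> float:
    # A missing _distance becomes invert_score(fill); a missing _score
    # becomes 1 - fill, so both absences are penalised the same way.
    vector_score = invert_score(result.get("_distance", fill))
    fts_score = result.get("_score", 1 - fill)
    return combine_score(weight, vector_score, fts_score)


both = relevance({"_distance": 0.1, "_score": 0.8})
vector_only = relevance({"_distance": 0.1})
fts_only = relevance({"_score": 0.8})
```

With these inputs a row found by both searches scores higher than a row found by only one, which is the ordering the removed `1 - (...)` expression inverted.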
@@ -125,6 +125,9 @@ class MRRReranker(Reranker):
|
||||
This cannot reuse rerank_hybrid because MRR semantics require treating
|
||||
each vector result as a separate ranking system.
|
||||
"""
|
||||
if not vector_results:
|
||||
raise ValueError("vector_results must not be empty")
|
||||
|
||||
if not all(isinstance(v, type(vector_results[0])) for v in vector_results):
|
||||
raise ValueError(
|
||||
"All elements in vector_results should be of the same type"
|
||||
|
||||
@@ -82,6 +82,9 @@ class RRFReranker(Reranker):
|
||||
results from multiple vector searches as it doesn't support reranking
|
||||
vector results individually.
|
||||
"""
|
||||
if not vector_results:
|
||||
raise ValueError("vector_results must not be empty")
|
||||
|
||||
# Make sure all elements are of the same type
|
||||
if not all(isinstance(v, type(vector_results[0])) for v in vector_results):
|
||||
raise ValueError(
|
||||
|
||||
File diff suppressed because it is too large
@@ -10,7 +10,7 @@ import pathlib
|
||||
import warnings
|
||||
from datetime import date, datetime
|
||||
from functools import singledispatch
|
||||
from typing import Tuple, Union, Optional, Any
|
||||
from typing import Tuple, Union, Optional, Any, List
|
||||
from urllib.parse import urlparse
|
||||
|
||||
import numpy as np
|
||||
@@ -189,7 +189,33 @@ def flatten_columns(tbl: pa.Table, flatten: Optional[Union[int, bool]] = None):
|
||||
return tbl
|
||||
|
||||
|
||||
def inf_vector_column_query(schema: pa.Schema) -> str:
|
||||
def _format_field_path(path: List[str]) -> str:
|
||||
def format_segment(segment: str) -> str:
|
||||
if all(char.isalnum() or char == "_" for char in segment):
|
||||
return segment
|
||||
return f"`{segment.replace('`', '``')}`"
|
||||
|
||||
return ".".join(format_segment(segment) for segment in path)
|
||||
|
||||
|
||||
def _iter_vector_columns(
|
||||
field: pa.Field, path: List[str], dim: Optional[int] = None
|
||||
) -> List[str]:
|
||||
field_path = [*path, field.name]
|
||||
if is_vector_column(field.type):
|
||||
vector_dim = infer_vector_column_dim(field.type)
|
||||
if dim is None or vector_dim == dim:
|
||||
return [_format_field_path(field_path)]
|
||||
return []
|
||||
if pa.types.is_struct(field.type):
|
||||
columns = []
|
||||
for idx in range(field.type.num_fields):
|
||||
columns.extend(_iter_vector_columns(field.type.field(idx), field_path, dim))
|
||||
return columns
|
||||
return []
|
||||
|
||||
|
||||
def inf_vector_column_query(schema: pa.Schema, dim: Optional[int] = None) -> str:
|
||||
"""
|
||||
Get the vector column name
|
||||
|
||||
@@ -202,26 +228,21 @@ def inf_vector_column_query(schema: pa.Schema) -> str:
|
||||
-------
|
||||
str: the vector column name.
|
||||
"""
|
||||
vector_col_name = ""
|
||||
vector_col_count = 0
|
||||
for field_name in schema.names:
|
||||
field = schema.field(field_name)
|
||||
if is_vector_column(field.type):
|
||||
vector_col_count += 1
|
||||
if vector_col_count > 1:
|
||||
raise ValueError(
|
||||
"Schema has more than one vector column. "
|
||||
"Please specify the vector column name "
|
||||
"for vector search"
|
||||
)
|
||||
elif vector_col_count == 1:
|
||||
vector_col_name = field_name
|
||||
if vector_col_count == 0:
|
||||
vector_col_names = []
|
||||
for field in schema:
|
||||
vector_col_names.extend(_iter_vector_columns(field, [], dim))
|
||||
if len(vector_col_names) > 1:
|
||||
raise ValueError(
|
||||
"Schema has more than one vector column. "
|
||||
"Please specify the vector column name "
|
||||
f"for vector search. Candidates: {vector_col_names}"
|
||||
)
|
||||
if len(vector_col_names) == 0:
|
||||
raise ValueError(
|
||||
"There is no vector column in the data. "
|
||||
"Please specify the vector column name for vector search"
|
||||
)
|
||||
return vector_col_name
|
||||
return vector_col_names[0]
|
||||
|
||||
|
||||
def is_vector_column(data_type: pa.DataType) -> bool:
|
||||
@@ -247,6 +268,29 @@ def is_vector_column(data_type: pa.DataType) -> bool:
|
||||
return False
|
||||
|
||||
|
||||
def infer_vector_column_dim(data_type: pa.DataType) -> Optional[int]:
|
||||
if pa.types.is_fixed_size_list(data_type):
|
||||
return data_type.list_size
|
||||
if pa.types.is_list(data_type):
|
||||
return infer_vector_column_dim(data_type.value_type)
|
||||
return None
|
||||
|
||||
|
||||
def _query_vector_dim(query: Optional[Any]) -> Optional[int]:
|
||||
if query is None:
|
||||
return None
|
||||
if isinstance(query, np.ndarray):
|
||||
if query.ndim == 0:
|
||||
return None
|
||||
return query.shape[-1]
|
||||
if isinstance(query, list) and query:
|
||||
first = query[0]
|
||||
if isinstance(first, (list, tuple, np.ndarray)):
|
||||
return len(first)
|
||||
return len(query)
|
||||
return None
|
||||
|
||||
|
||||
def infer_vector_column_name(
|
||||
schema: pa.Schema,
|
||||
query_type: str,
|
||||
@@ -262,7 +306,9 @@ def infer_vector_column_name(
|
||||
|
||||
if query is not None or query_type == "hybrid":
|
||||
try:
|
||||
vector_column_name = inf_vector_column_query(schema)
|
||||
vector_column_name = inf_vector_column_query(
|
||||
schema, dim=_query_vector_dim(query)
|
||||
)
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
||||
|
||||
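The `_format_field_path` helper added in this file joins nested field names with dots and backtick-quotes any segment that contains characters other than letters, digits, or underscores. The sketch below copies that logic into a standalone function (named `format_field_path` here purely for illustration) so the quoting rule can be tried without a PyArrow schema.

```python
from typing import List


def format_field_path(path: List[str]) -> str:
    def format_segment(segment: str) -> str:
        # Plain identifiers pass through unchanged.
        if all(char.isalnum() or char == "_" for char in segment):
            return segment
        # Anything else is wrapped in backticks, doubling embedded backticks.
        return "`" + segment.replace("`", "``") + "`"

    return ".".join(format_segment(segment) for segment in path)


plain = format_field_path(["meta", "embedding"])
spaced = format_field_path(["meta", "my vec"])
dotted = format_field_path(["a.b"])
ticked = format_field_path(["x`y"])
```

Quoting a segment that itself contains a dot keeps it distinguishable from a genuine nested path, which matters now that vector columns inside structs are reported as candidates.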
@@ -57,7 +57,7 @@ async def test_upsert_async(mem_db_async):
|
||||
await table.count_rows() # 3
|
||||
res
|
||||
# MergeResult(version=2, num_updated_rows=1,
|
||||
# num_inserted_rows=1, num_deleted_rows=0)
|
||||
# num_inserted_rows=1, num_deleted_rows=0, num_rows=2)
|
||||
# --8<-- [end:upsert_basic_async]
|
||||
assert await table.count_rows() == 3
|
||||
assert res.version == 2
|
||||
@@ -86,7 +86,7 @@ def test_insert_if_not_exists(mem_db):
|
||||
table.count_rows() # 3
|
||||
res
|
||||
# MergeResult(version=2, num_updated_rows=0,
|
||||
# num_inserted_rows=1, num_deleted_rows=0)
|
||||
# num_inserted_rows=1, num_deleted_rows=0, num_rows=1)
|
||||
# --8<-- [end:insert_if_not_exists]
|
||||
assert table.count_rows() == 3
|
||||
assert res.version == 2
|
||||
@@ -116,7 +116,7 @@ async def test_insert_if_not_exists_async(mem_db_async):
|
||||
await table.count_rows() # 3
|
||||
res
|
||||
# MergeResult(version=2, num_updated_rows=0,
|
||||
# num_inserted_rows=1, num_deleted_rows=0)
|
||||
# num_inserted_rows=1, num_deleted_rows=0, num_rows=1)
|
||||
# --8<-- [end:insert_if_not_exists]
|
||||
assert await table.count_rows() == 3
|
||||
assert res.version == 2
|
||||
@@ -150,7 +150,7 @@ def test_replace_range(mem_db):
|
||||
table.count_rows("doc_id = 1") # 1
|
||||
res
|
||||
# MergeResult(version=2, num_updated_rows=1,
|
||||
# num_inserted_rows=0, num_deleted_rows=1)
|
||||
# num_inserted_rows=0, num_deleted_rows=1, num_rows=1)
|
||||
# --8<-- [end:insert_if_not_exists]
|
||||
assert table.count_rows("doc_id = 1") == 1
|
||||
assert res.version == 2
|
||||
@@ -185,7 +185,7 @@ async def test_replace_range_async(mem_db_async):
|
||||
await table.count_rows("doc_id = 1") # 1
|
||||
res
|
||||
# MergeResult(version=2, num_updated_rows=1,
|
||||
# num_inserted_rows=0, num_deleted_rows=1)
|
||||
# num_inserted_rows=0, num_deleted_rows=1, num_rows=1)
|
||||
# --8<-- [end:insert_if_not_exists]
|
||||
assert await table.count_rows("doc_id = 1") == 1
|
||||
assert res.version == 2
|
||||
|
||||
@@ -1,4 +1,3 @@
|
||||
segmenter:
|
||||
mode: "normal"
|
||||
dictionary:
|
||||
path: "./python/tests/models/lindera/ipadic/main"
|
||||
dictionary: "./python/tests/models/lindera/ipadic/main"
|
||||
|
||||
Binary file not shown.
@@ -6,6 +6,7 @@ import re
|
||||
import sys
|
||||
from datetime import timedelta
|
||||
import os
|
||||
from types import SimpleNamespace
|
||||
|
||||
import lancedb
|
||||
import numpy as np
|
||||
@@ -188,6 +189,43 @@ def test_table_names(tmp_db: lancedb.DBConnection):
|
||||
assert len(result) == 3
|
||||
|
||||
|
||||
def test_db_contains_and_len_include_all_table_name_pages(tmp_db: lancedb.DBConnection):
|
||||
for idx in range(20):
|
||||
tmp_db.create_table(f"table_{idx}", data=[{"id": idx}])
|
||||
|
||||
assert len(tmp_db) == 20
|
||||
for idx in range(20):
|
||||
assert f"table_{idx}" in tmp_db
|
||||
assert "does_not_exist" not in tmp_db
|
||||
|
||||
|
||||
def test_db_contains_stops_after_matching_table_page(
|
||||
tmp_db: lancedb.DBConnection, monkeypatch
|
||||
):
|
||||
calls = []
|
||||
pages = {
|
||||
None: SimpleNamespace(tables=["table_0", "table_1"], page_token="next"),
|
||||
"next": SimpleNamespace(tables=["table_2"], page_token=None),
|
||||
}
|
||||
|
||||
def list_tables(*, page_token=None, **_kwargs):
|
||||
calls.append(page_token)
|
||||
return pages[page_token]
|
||||
|
||||
monkeypatch.setattr(tmp_db, "list_tables", list_tables)
|
||||
|
||||
assert "table_1" in tmp_db
|
||||
assert calls == [None]
|
||||
|
||||
calls.clear()
|
||||
assert "table_2" in tmp_db
|
||||
assert calls == [None, "next"]
|
||||
|
||||
calls.clear()
|
||||
assert len(tmp_db) == 3
|
||||
assert calls == [None, "next"]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_table_names_async(tmp_path):
|
||||
db = lancedb.connect(tmp_path)
|
||||
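The two tests above pin down the paging behaviour of `in` and `len()` on a connection: membership stops at the first page that contains the name, while the length walks every page. A minimal sketch of one way to get that behaviour follows, assuming a `list_tables(page_token=...)` callable that returns an object with `tables` and `page_token` attributes, as the monkeypatched stub in the test does. The helper names are hypothetical and this is not necessarily how LanceDB implements it.

```python
from types import SimpleNamespace
from typing import Callable, Iterator, List


def iter_table_pages(
    list_tables: Callable[..., SimpleNamespace],
) -> Iterator[List[str]]:
    # Lazily fetch one page at a time, following page_token until it is empty.
    page_token = None
    while True:
        page = list_tables(page_token=page_token)
        yield page.tables
        page_token = page.page_token
        if not page_token:
            break


def contains(list_tables: Callable[..., SimpleNamespace], name: str) -> bool:
    # any() stops consuming the generator at the first matching page.
    return any(name in tables for tables in iter_table_pages(list_tables))


def count(list_tables: Callable[..., SimpleNamespace]) -> int:
    # Counting has to visit every page.
    return sum(len(tables) for tables in iter_table_pages(list_tables))
```

Because the pages come from a generator, the early exit needs no extra bookkeeping: once `any()` returns, the next page is never requested.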
@@ -428,7 +466,8 @@ async def test_create_table_v2_manifest_paths_async(tmp_path):
|
||||
assert await tbl.uses_v2_manifest_paths()
|
||||
manifests_dir = tmp_path / "test_v2_manifest_paths.lance" / "_versions"
|
||||
for manifest in os.listdir(manifests_dir):
|
||||
assert re.match(r"\d{20}\.manifest", manifest)
|
||||
if manifest.endswith(".manifest"):
|
||||
assert re.match(r"\d{20}\.manifest", manifest)
|
||||
|
||||
# Start a table in V1 mode then migrate
|
||||
tbl = await db_no_v2_paths.create_table(
|
||||
@@ -438,13 +477,15 @@ async def test_create_table_v2_manifest_paths_async(tmp_path):
|
||||
assert not await tbl.uses_v2_manifest_paths()
|
||||
manifests_dir = tmp_path / "test_v2_migration.lance" / "_versions"
|
||||
for manifest in os.listdir(manifests_dir):
|
||||
assert re.match(r"\d\.manifest", manifest)
|
||||
if manifest.endswith(".manifest"):
|
||||
assert re.match(r"\d\.manifest", manifest)
|
||||
|
||||
await tbl.migrate_manifest_paths_v2()
|
||||
assert await tbl.uses_v2_manifest_paths()
|
||||
|
||||
for manifest in os.listdir(manifests_dir):
|
||||
assert re.match(r"\d{20}\.manifest", manifest)
|
||||
if manifest.endswith(".manifest"):
|
||||
assert re.match(r"\d{20}\.manifest", manifest)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
@@ -914,6 +955,29 @@ def test_local_namespace_operations(tmp_path):
|
||||
assert db.list_namespaces().namespaces == []
|
||||
|
||||
|
||||
def test_create_namespace_invalid_mode_raises(tmp_path):
|
||||
"""Unrecognized create namespace modes raise a clear error."""
|
||||
db = lancedb.connect(tmp_path)
|
||||
with pytest.raises(ValueError, match="Invalid create namespace mode"):
|
||||
db.create_namespace(["child"], mode="frobnicate")
|
||||
|
||||
|
||||
def test_drop_namespace_invalid_mode_raises(tmp_path):
|
||||
"""Unrecognized drop namespace modes raise a clear error."""
|
||||
db = lancedb.connect(tmp_path)
|
||||
db.create_namespace(["child"])
|
||||
with pytest.raises(ValueError, match="Invalid drop namespace mode"):
|
||||
db.drop_namespace(["child"], mode="frobnicate")
|
||||
|
||||
|
||||
def test_drop_namespace_invalid_behavior_raises(tmp_path):
|
||||
"""Unrecognized drop namespace behaviors raise a clear error."""
|
||||
db = lancedb.connect(tmp_path)
|
||||
db.create_namespace(["child"])
|
||||
with pytest.raises(ValueError, match="Invalid drop namespace behavior"):
|
||||
db.drop_namespace(["child"], behavior="frobnicate")
|
||||
|
||||
|
||||
def test_clone_table_latest_version(tmp_path):
|
||||
"""Test cloning a table with the latest version (default behavior)"""
|
||||
import os
|
||||
|
||||
56
python/python/tests/test_errors.py
Normal file
@@ -0,0 +1,56 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
import pickle
|
||||
|
||||
from lancedb.remote.errors import HttpError, LanceDBClientError, RetryError
|
||||
|
||||
|
||||
def test_pickle_lancedb_client_error():
|
||||
err = LanceDBClientError("something went wrong", "req-123", 400)
|
||||
restored = pickle.loads(pickle.dumps(err))
|
||||
assert str(restored) == "something went wrong"
|
||||
assert restored.request_id == "req-123"
|
||||
assert restored.status_code == 400
|
||||
|
||||
|
||||
def test_pickle_lancedb_client_error_no_status_code():
|
||||
err = LanceDBClientError("fail", "req-456")
|
||||
restored = pickle.loads(pickle.dumps(err))
|
||||
assert str(restored) == "fail"
|
||||
assert restored.request_id == "req-456"
|
||||
assert restored.status_code is None
|
||||
|
||||
|
||||
def test_pickle_http_error():
|
||||
err = HttpError("not found", "req-789", 404)
|
||||
restored = pickle.loads(pickle.dumps(err))
|
||||
assert isinstance(restored, HttpError)
|
||||
assert str(restored) == "not found"
|
||||
assert restored.request_id == "req-789"
|
||||
assert restored.status_code == 404
|
||||
|
||||
|
||||
def test_pickle_retry_error():
|
||||
err = RetryError(
|
||||
"max retries exceeded",
|
||||
"req-abc",
|
||||
request_failures=3,
|
||||
connect_failures=1,
|
||||
read_failures=2,
|
||||
max_request_failures=5,
|
||||
max_connect_failures=3,
|
||||
max_read_failures=3,
|
||||
status_code=503,
|
||||
)
|
||||
restored = pickle.loads(pickle.dumps(err))
|
||||
assert isinstance(restored, RetryError)
|
||||
assert str(restored) == "max retries exceeded"
|
||||
assert restored.request_id == "req-abc"
|
||||
assert restored.request_failures == 3
|
||||
assert restored.connect_failures == 1
|
||||
assert restored.read_failures == 2
|
||||
assert restored.max_request_failures == 5
|
||||
assert restored.max_connect_failures == 3
|
||||
assert restored.max_read_failures == 3
|
||||
assert restored.status_code == 503
|
||||
@@ -29,6 +29,7 @@ from lancedb.query import (
|
||||
MultiMatchQuery,
|
||||
PhraseQuery,
|
||||
BooleanQuery,
|
||||
ColumnOrdering,
|
||||
Occur,
|
||||
LanceFtsQueryBuilder,
|
||||
)
|
||||
@@ -116,8 +117,7 @@ def lindera_ipadic(language_model_home):
|
||||
config_path.write_text(
|
||||
"segmenter:\n"
|
||||
' mode: "normal"\n'
|
||||
" dictionary:\n"
|
||||
f' path: "{extracted_model.resolve().as_posix()}"\n',
|
||||
f' dictionary: "{extracted_model.resolve().as_posix()}"\n',
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
@@ -215,11 +215,12 @@ def test_reject_legacy_tantivy_index(table):
|
||||
|
||||
@pytest.mark.parametrize("with_position", [True, False])
|
||||
def test_create_inverted_index(table, with_position):
|
||||
table.create_fts_index(
|
||||
"text",
|
||||
with_position=with_position,
|
||||
name="custom_fts_index",
|
||||
)
|
||||
with pytest.warns(DeprecationWarning, match="create_fts_index"):
|
||||
table.create_fts_index(
|
||||
"text",
|
||||
with_position=with_position,
|
||||
name="custom_fts_index",
|
||||
)
|
||||
indices = table.list_indices()
|
||||
fts_indices = [i for i in indices if i.index_type == "FTS"]
|
||||
assert any(i.name == "custom_fts_index" for i in fts_indices)
|
||||
@@ -500,6 +501,36 @@ async def test_search_fts_specify_column_async(async_table):
|
||||
pass
|
||||
|
||||
|
||||
def test_search_order_by_descending(table):
|
||||
table.create_fts_index("text")
|
||||
rows = (
|
||||
table.search("puppy")
|
||||
.order_by([ColumnOrdering(column_name="count", ascending=False)])
|
||||
.limit(20)
|
||||
.select(["text", "count"])
|
||||
.to_list()
|
||||
)
|
||||
|
||||
for r in rows:
|
||||
assert "puppy" in r["text"]
|
||||
assert sorted(rows, key=lambda x: x["count"], reverse=True) == rows
|
||||
|
||||
|
||||
def test_search_order_by_ascending(table):
|
||||
table.create_fts_index("text")
|
||||
rows = (
|
||||
table.search("puppy")
|
||||
.order_by([ColumnOrdering(column_name="count", ascending=True)])
|
||||
.limit(20)
|
||||
.select(["text", "count"])
|
||||
.to_list()
|
||||
)
|
||||
|
||||
for r in rows:
|
||||
assert "puppy" in r["text"]
|
||||
assert sorted(rows, key=lambda x: x["count"]) == rows
|
||||
|
||||
|
||||
def test_create_index_from_table(tmp_path, table):
|
||||
table.create_fts_index("text")
|
||||
df = table.search("puppy").limit(5).select(["text"]).to_pandas()
|
||||
@@ -533,8 +564,111 @@ def test_create_index_multiple_columns(tmp_path, table):


def test_nested_schema(tmp_path, table):
    with pytest.raises(ValueError, match="top-level fields"):
        table.create_fts_index("nested.text")
    table.create_fts_index("nested.text", with_position=True)
    indices = table.list_indices()
    assert len(indices) == 1
    assert indices[0].index_type == "FTS"
    assert indices[0].columns == ["nested.text"]

    results = (
        table.search("puppy", query_type="fts", fts_columns="nested.text")
        .limit(5)
        .to_list()
    )
    assert len(results) > 0
    assert all("puppy" in row["nested"]["text"] for row in results)

    results = table.search(MatchQuery("puppy", "nested.text")).limit(5).to_list()
    assert len(results) > 0
    assert all("puppy" in row["nested"]["text"] for row in results)

    phrase_results = (
        table.search(PhraseQuery("puppy runs", "nested.text")).limit(5).to_list()
    )
    assert len(phrase_results) > 0
    assert all("puppy runs" in row["nested"]["text"] for row in phrase_results)

    hybrid_results = (
        table.search(query_type="hybrid", fts_columns="nested.text")
        .vector([0 for _ in range(128)])
        .text("puppy")
        .limit(5)
        .to_list()
    )
    assert len(hybrid_results) > 0


@pytest.mark.asyncio
async def test_nested_schema_async(async_table):
    await async_table.create_index("nested.text", config=FTS(with_position=True))
    indices = await async_table.list_indices()
    assert len(indices) == 1
    assert indices[0].index_type == "FTS"
    assert indices[0].columns == ["nested.text"]

    results = await (
        async_table.query()
        .nearest_to_text("puppy", columns="nested.text")
        .limit(5)
        .to_list()
    )
    assert len(results) > 0
    assert all("puppy" in row["nested"]["text"] for row in results)

    results = await (
        async_table.query()
        .nearest_to_text(MatchQuery("puppy", "nested.text"))
        .limit(5)
        .to_list()
    )
    assert len(results) > 0
    assert all("puppy" in row["nested"]["text"] for row in results)

    phrase_results = await (
        async_table.query()
        .nearest_to_text(PhraseQuery("puppy runs", "nested.text"))
        .limit(5)
        .to_list()
    )
    assert len(phrase_results) > 0
    assert all("puppy runs" in row["nested"]["text"] for row in phrase_results)

    hybrid_results = await (
        async_table.query()
        .nearest_to([0 for _ in range(128)])
        .nearest_to_text("puppy", columns="nested.text")
        .limit(5)
        .to_list()
    )
    assert len(hybrid_results) > 0


def test_nested_schema_rejects_invalid_fts_fields(tmp_path):
    db = ldb.connect(tmp_path)
    data = pa.table(
        {
            "payload": pa.array(
                [
                    {"text": "puppy runs", "count": 1},
                    {"text": "car drives", "count": 2},
                ]
            ),
            "vector": pa.array(
                [[0.1, 0.1], [0.2, 0.2]],
                type=pa.list_(pa.float32(), list_size=2),
            ),
        }
    )
    table = db.create_table("test", data=data)

    with pytest.raises(ValueError, match="FTS index cannot be created.*payload"):
        table.create_fts_index("payload")

    with pytest.raises(ValueError, match="FTS index cannot be created.*count"):
        table.create_fts_index("payload.count")

    with pytest.raises(ValueError, match="Field path `payload.missing` not found"):
        table.create_fts_index("payload.missing")


def test_search_index_with_filter(table):
@@ -105,6 +105,46 @@ async def test_create_scalar_index(some_table: AsyncTable):
    assert len(indices) == 0


@pytest.mark.asyncio
async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
    metadata_type = pa.struct(
        [
            pa.field("user_id", pa.int32()),
            pa.field("user.id", pa.int32()),
        ]
    )
    data = pa.Table.from_arrays(
        [
            pa.array([1, 2, 3], type=pa.int32()),
            pa.array(
                [
                    {"user_id": 10, "user.id": 100},
                    {"user_id": 20, "user.id": 200},
                    {"user_id": 30, "user.id": 300},
                ],
                type=metadata_type,
            ),
        ],
        names=["user_id", "metadata"],
    )
    table = await db_async.create_table("nested_scalar_index", data)

    await table.create_index("user_id", config=BTree(), name="top_user_id_idx")
    await table.create_index(
        "metadata.user_id", config=BTree(), name="nested_user_id_idx"
    )
    await table.create_index(
        "metadata.`user.id`", config=BTree(), name="escaped_user_id_idx"
    )

    columns_by_name = {
        index.name: index.columns for index in await table.list_indices()
    }
    assert columns_by_name["top_user_id_idx"] == ["user_id"]
    assert columns_by_name["nested_user_id_idx"] == ["metadata.user_id"]
    assert columns_by_name["escaped_user_id_idx"] == ["metadata.`user.id`"]


@pytest.mark.asyncio
async def test_create_fixed_size_binary_index(some_table: AsyncTable):
    await some_table.create_index("fsb", config=BTree())
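The nested scalar index test above addresses struct children with dotted paths and wraps a child whose own name contains a dot in backticks, as in ``metadata.`user.id` ``. As a rough illustration of that notation only, the sketch below splits such a path into its segments. `split_field_path` is a hypothetical helper written for this example; it is not a LanceDB function, and the real path parser may treat escapes differently.

```python
def split_field_path(path: str) -> list[str]:
    """Split a dotted field path, treating backtick-quoted segments as literal.

    Hypothetical helper for illustration; not part of the LanceDB API.
    """
    segments: list[str] = []
    current: list[str] = []
    in_backticks = False
    for ch in path:
        if ch == "`":
            # Toggle quoting; the backtick itself is not part of the name.
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            # An unquoted dot ends the current segment.
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_backticks:
        raise ValueError(f"unterminated backtick in field path: {path!r}")
    segments.append("".join(current))
    return segments


print(split_field_path("metadata.user_id"))    # ['metadata', 'user_id']
print(split_field_path("metadata.`user.id`"))  # ['metadata', 'user.id']
```

Under this reading, `metadata.user_id` and ``metadata.`user.id` `` name two different children of the same struct, which is why the test expects the escaped form to come back unchanged from `list_indices()`.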
@@ -122,12 +162,13 @@ async def test_create_bitmap_index(some_table: AsyncTable):
    await some_table.create_index("data", config=Bitmap())
    indices = await some_table.list_indices()
    assert len(indices) == 3
    # list_indices returns indices in alphabetical order by name
    assert indices[0].index_type == "Bitmap"
    assert indices[0].columns == ["id"]
    assert indices[0].columns == ["data"]
    assert indices[1].index_type == "Bitmap"
    assert indices[1].columns == ["is_active"]
    assert indices[1].columns == ["id"]
    assert indices[2].index_type == "Bitmap"
    assert indices[2].columns == ["data"]
    assert indices[2].columns == ["is_active"]

    index_name = indices[0].name
    stats = await some_table.index_stats(index_name)
@@ -185,7 +226,6 @@ async def test_create_vector_index(some_table: AsyncTable):
    assert stats.num_indexed_rows == await some_table.count_rows()
    assert stats.num_unindexed_rows == 0
    assert stats.num_indices == 1
    assert stats.loss >= 0.0


@pytest.mark.asyncio
@@ -209,7 +249,6 @@ async def test_create_4bit_ivfpq_index(some_table: AsyncTable):
    assert stats.num_indexed_rows == await some_table.count_rows()
    assert stats.num_unindexed_rows == 0
    assert stats.num_indices == 1
    assert stats.loss >= 0.0


@pytest.mark.asyncio
138
python/python/tests/test_lsm_write_spec.py
Normal file
@@ -0,0 +1,138 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors

"""Tests for installing and clearing an LsmWriteSpec via
`Table.set_lsm_write_spec` / `Table.unset_lsm_write_spec`.
"""

from datetime import timedelta

import lancedb
import pyarrow as pa
import pytest
from lancedb._lancedb import LsmWriteSpec

SCHEMA = pa.schema(
    [
        pa.field("id", pa.utf8(), nullable=False),
        pa.field("v", pa.int32(), nullable=False),
    ]
)


def _batch(ids, vs):
    return pa.RecordBatch.from_arrays(
        [pa.array(ids, type=pa.utf8()), pa.array(vs, type=pa.int32())],
        schema=SCHEMA,
    )


def _reader(ids, vs):
    return pa.RecordBatchReader.from_batches(SCHEMA, [_batch(ids, vs)])


def _make_table(tmp_path):
    db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
    table = db.create_table("t", _reader(["seed"], [0]))
    return db, table


def test_set_lsm_write_spec_validates(tmp_path):
    _db, table = _make_table(tmp_path)

    # Out-of-range num_buckets.
    with pytest.raises(Exception, match="num_buckets"):
        table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 0))
    with pytest.raises(Exception, match="num_buckets"):
        table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1025))

    # Happy path then mutation rejected.
    table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 4))
    with pytest.raises(Exception, match="mutation"):
        table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 8))


def test_unset_lsm_write_spec(tmp_path):
    _db, table = _make_table(tmp_path)

    # unset errors when no spec is set.
    with pytest.raises(Exception, match="no LSM write spec"):
        table.unset_lsm_write_spec()

    # Install a spec, then remove it; afterwards a fresh spec can be set.
    table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 4))
    table.unset_lsm_write_spec()
    # A second unset errors: there is no spec left to remove.
    with pytest.raises(Exception, match="no LSM write spec"):
        table.unset_lsm_write_spec()
    table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 8))


def test_set_unsharded_spec(tmp_path):
    _db, table = _make_table(tmp_path)
    # Lance MemWAL still requires a primary key on the dataset; Unsharded
    # just skips per-row hashing.
    table.set_unenforced_primary_key("id")
    table.set_lsm_write_spec(LsmWriteSpec.unsharded())
    table.unset_lsm_write_spec()


def test_lsm_write_spec_repr():
    s = LsmWriteSpec.bucket("id", 4)
    assert s.spec_type == "bucket"
    assert s.column == "id"
    assert s.num_buckets == 4
    assert s.maintained_indexes == []
    assert "bucket" in repr(s)
    assert "id" in repr(s)
    assert "4" in repr(s)

    u = LsmWriteSpec.unsharded()
    assert u.spec_type == "unsharded"
    assert u.column is None
    assert u.num_buckets is None
    assert "unsharded" in repr(u)


def test_lsm_write_spec_with_maintained_indexes():
    s = LsmWriteSpec.bucket("id", 4).with_maintained_indexes(["idx_a", "idx_b"])
    assert s.maintained_indexes == ["idx_a", "idx_b"]


@pytest.mark.asyncio
async def test_async_set_unset_lsm_write_spec(tmp_path):
    db = await lancedb.connect_async(
        tmp_path, read_consistency_interval=timedelta(seconds=0)
    )
    table = await db.create_table(
        "t",
        pa.RecordBatchReader.from_batches(SCHEMA, [_batch(["seed"], [0])]),
    )

    await table.set_unenforced_primary_key("id")
    await table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 4))
    await table.unset_lsm_write_spec()
    # A second unset errors.
    with pytest.raises(Exception, match="no LSM write spec"):
        await table.unset_lsm_write_spec()


def test_set_identity_spec(tmp_path):
    _db, table = _make_table(tmp_path)
    # Identity sharding still requires an unenforced primary key on the
    # table; it shards by the raw value of the given column.
    table.set_unenforced_primary_key("id")
    table.set_lsm_write_spec(LsmWriteSpec.identity("v"))
    table.unset_lsm_write_spec()


def test_lsm_write_spec_identity_and_writer_config_defaults():
    s = LsmWriteSpec.identity("v")
    assert s.spec_type == "identity"
    assert s.column == "v"
    assert s.num_buckets is None
    assert "identity" in repr(s)

    s = s.with_writer_config_defaults({"durable_write": "false"})
    assert s.writer_config_defaults == {"durable_write": "false"}
    assert "durable_write" in repr(s)
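The tests in this file pin down a small lifecycle: a spec with an out-of-range `num_buckets` is rejected, a second `set` while a spec is installed is rejected as a mutation, and `unset` fails when nothing is installed. The toy model below restates those rules in plain Python so they can be read in one place. `ToySpec` and `ToyTable` are hypothetical stand-ins, not the LanceDB classes, and the 1 to 1024 bucket range is an assumption: the tests only show that 0 and 1025 are rejected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToySpec:
    """Stand-in for a write spec; field names mirror those the tests read."""

    spec_type: str
    column: Optional[str] = None
    num_buckets: Optional[int] = None


class ToyTable:
    """In-memory model of the set/unset rules; not the LanceDB Table class."""

    # Assumed bounds: the tests only show that 0 and 1025 are rejected.
    MIN_BUCKETS = 1
    MAX_BUCKETS = 1024

    def __init__(self) -> None:
        self._spec: Optional[ToySpec] = None

    def set_lsm_write_spec(self, spec: ToySpec) -> None:
        # Validate the spec itself before looking at table state.
        if spec.spec_type == "bucket":
            n = spec.num_buckets
            if n is None or not (self.MIN_BUCKETS <= n <= self.MAX_BUCKETS):
                raise ValueError(f"num_buckets out of range: {n}")
        # An installed spec cannot be replaced in place.
        if self._spec is not None:
            raise ValueError("spec mutation is not allowed; unset it first")
        self._spec = spec

    def unset_lsm_write_spec(self) -> None:
        if self._spec is None:
            raise ValueError("no LSM write spec is set")
        self._spec = None


table = ToyTable()
table.set_lsm_write_spec(ToySpec("bucket", "id", 4))
table.unset_lsm_write_spec()
# After unset, a different spec can be installed.
table.set_lsm_write_spec(ToySpec("bucket", "id", 8))
```

The model checks the spec before the table state, which matches the first test: `num_buckets` errors are raised on a table that has no spec installed yet.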
196
python/python/tests/test_merge_insert_lsm.py
Normal file
@@ -0,0 +1,196 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors

"""Tests for the MemWAL LSM ``merge_insert`` dispatch."""

from datetime import timedelta

import lancedb
import pyarrow as pa
import pytest
from lancedb._lancedb import LsmWriteSpec

SCHEMA = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("value", pa.int64(), nullable=False),
    ]
)

REGION_SCHEMA = pa.schema(
    [
        pa.field("id", pa.int64(), nullable=False),
        pa.field("region", pa.utf8(), nullable=False),
    ]
)


def _reader(ids):
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array(ids, type=pa.int64()),
            pa.array(list(range(len(ids))), type=pa.int64()),
        ],
        schema=SCHEMA,
    )
    return pa.RecordBatchReader.from_batches(SCHEMA, [batch])


def _region_reader(rows):
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array([row[0] for row in rows], type=pa.int64()),
            pa.array([row[1] for row in rows], type=pa.utf8()),
        ],
        schema=REGION_SCHEMA,
    )
    return pa.RecordBatchReader.from_batches(REGION_SCHEMA, [batch])


def _bucket_table(tmp_path):
    """A table with ``id`` as the primary key and a single-bucket LSM spec."""
    db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
    table = db.create_table("t", _reader([1, 2, 3]))
    table.set_unenforced_primary_key("id")
    # num_buckets = 1: every row routes to the single bucket.
    table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1))
    return table


def test_lsm_merge_insert_bucket(tmp_path):
    table = _bucket_table(tmp_path)
    # Empty `on` defaults to the primary key.
    result = (
        table.merge_insert([])
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute(_reader([3, 4, 5]))
    )
    # LSM path: rows go to the MemWAL, so only num_rows is populated.
    assert result.num_rows == 3
    assert result.version == 0
    assert result.num_inserted_rows == 0
    assert result.num_updated_rows == 0


def test_lsm_merge_insert_unsharded(tmp_path):
    db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
    table = db.create_table("t", _reader([1, 2, 3]))
    table.set_unenforced_primary_key("id")
    table.set_lsm_write_spec(LsmWriteSpec.unsharded())
    result = (
        table.merge_insert("id")
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute(_reader([10, 11, 12, 13]))
    )
    assert result.num_rows == 4


def test_lsm_merge_insert_identity(tmp_path):
    db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
    table = db.create_table("t", _region_reader([(1, "us"), (2, "us")]))
    table.set_unenforced_primary_key("id")
    table.set_lsm_write_spec(LsmWriteSpec.identity("region"))
    # All rows share one identity value, so they route to one shard.
    result = (
        table.merge_insert([])
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute(_region_reader([(3, "us"), (4, "us")]))
    )
    assert result.num_rows == 2


def test_lsm_merge_insert_use_lsm_write_false(tmp_path):
    table = _bucket_table(tmp_path)  # rows id = 1, 2, 3
    # use_lsm_write(False) opts out: the standard path runs and commits.
    result = (
        table.merge_insert("id")
        .when_not_matched_insert_all()
        .use_lsm_write(False)
        .execute(_reader([3, 4, 5]))
    )
    assert result.num_inserted_rows == 2
    assert table.count_rows() == 5


def test_lsm_merge_insert_validate_single_shard_off(tmp_path):
    table = _bucket_table(tmp_path)
    result = (
        table.merge_insert([])
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .validate_single_shard(False)
        .execute(_reader([6, 7, 8]))
    )
    assert result.num_rows == 3


def test_lsm_merge_insert_use_lsm_write_true_requires_spec(tmp_path):
    # A table with a primary key but no LSM write spec installed.
    db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(seconds=0))
    table = db.create_table("t", _reader([1, 2, 3]))
    table.set_unenforced_primary_key("id")
    with pytest.raises(Exception, match="use_lsm_write"):
        (
            table.merge_insert("id")
            .when_matched_update_all()
            .when_not_matched_insert_all()
            .use_lsm_write(True)
            .execute(_reader([4]))
        )


def test_lsm_merge_insert_rejects_on_not_primary_key(tmp_path):
    table = _bucket_table(tmp_path)
    with pytest.raises(Exception, match="primary key"):
        (
            table.merge_insert("value")
            .when_matched_update_all()
            .when_not_matched_insert_all()
            .execute(_reader([1]))
        )


def test_lsm_merge_insert_rejects_non_upsert(tmp_path):
    table = _bucket_table(tmp_path)
    # Insert-only (no when_matched_update_all) is not the upsert shape.
    with pytest.raises(Exception, match="upsert"):
        table.merge_insert([]).when_not_matched_insert_all().execute(_reader([4]))


def test_lsm_close_writers(tmp_path):
    table = _bucket_table(tmp_path)
    (
        table.merge_insert([])
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute(_reader([7, 8]))
    )
    table.close_lsm_writers()
    # The writer reopens lazily on the next merge_insert.
    result = (
        table.merge_insert([])
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute(_reader([9]))
    )
    assert result.num_rows == 1


@pytest.mark.asyncio
async def test_async_lsm_merge_insert(tmp_path):
    db = await lancedb.connect_async(
        tmp_path, read_consistency_interval=timedelta(seconds=0)
    )
    table = await db.create_table("t", _reader([1, 2, 3]))
    await table.set_unenforced_primary_key("id")
    await table.set_lsm_write_spec(LsmWriteSpec.bucket("id", 1))

    builder = (
        table.merge_insert([]).when_matched_update_all().when_not_matched_insert_all()
    )
    result = await builder.execute(_reader([3, 4, 5]))
    assert result.num_rows == 3
    await table.close_lsm_writers()
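The comments in these tests describe three routing modes: an unsharded spec skips per-row hashing, an identity spec shards by the raw column value, and a bucket spec with `num_buckets = 1` sends every row to the single bucket. The sketch below models that routing in plain Python, together with a single-shard check of the kind `validate_single_shard` presumably toggles. All names here are hypothetical, `zlib.crc32` is only a deterministic stand-in for whatever hash the engine uses, and nothing in it calls LanceDB.

```python
import zlib
from typing import Any, Iterable, Optional


def shard_key(spec_type: str, value: Any, num_buckets: Optional[int] = None) -> Any:
    """Return a toy shard key for one row under the given spec type."""
    if spec_type == "unsharded":
        return 0  # every row shares one shard
    if spec_type == "identity":
        return value  # shard by the raw column value
    if spec_type == "bucket":
        if not num_buckets or num_buckets < 1:
            raise ValueError("num_buckets must be positive")
        # crc32 is only a deterministic stand-in for the real hash function.
        return zlib.crc32(repr(value).encode("utf-8")) % num_buckets
    raise ValueError(f"unknown spec type: {spec_type}")


def single_shard(
    spec_type: str, values: Iterable[Any], num_buckets: Optional[int] = None
) -> Any:
    """Return the one shard key shared by all values, or raise ValueError."""
    keys = {shard_key(spec_type, v, num_buckets) for v in values}
    if len(keys) != 1:
        raise ValueError(f"batch spans {len(keys)} shards")
    return keys.pop()


print(single_shard("bucket", [3, 4, 5], num_buckets=1))  # 0
print(single_shard("identity", ["us", "us"]))            # us
```

With one bucket, any hash reduces to 0 modulo 1, so the bucket tests never need to care how rows are hashed. A batch that mixed `"us"` and `"eu"` under the identity mode would fail this check, which is the case the identity test avoids by writing only `"us"` rows.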
Some files were not shown because too many files have changed in this diff