Compare commits

26 Commits

Author SHA1 Message Date
Jack Ye
85098c7435 feat: use declare_table instead of deprecated create_empty_table
Replace create_empty_table calls with declare_table in the namespace
module. create_empty_table is deprecated and being removed in Lance 3.0.0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-20 22:24:21 -08:00
lancedb automation
08b8be2ea6 chore: update lance dependency to v3.0.0-beta.4 2026-02-20 23:22:55 +00:00
Will Jones
48ddc833dd feat: check for dataset updates in the background (#3021)
This updates `DatasetConsistencyWrapper` to block less:

1. `DatasetConsistencyWrapper::get()` just returns `Arc<Dataset>` now,
instead of a guard that blocks writes.
`DatasetConsistencyWrapper::get_mut()` is gone; now write methods just
use `get()` and then later call `update()` with the new version. This
means a given table handle can do concurrent reads **and** writes.
2. In weak consistency mode, it checks for dataset updates in the
background instead of blocking calls to `get()`.
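
The handle pattern described in the list above can be sketched in Python. The real `DatasetConsistencyWrapper` is Rust; the `ConsistencyWrapper` class, the `Snapshot` type, and the forward-only rule in `update()` are illustrative assumptions, not the actual implementation.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for an immutable dataset at one version."""
    version: int


class ConsistencyWrapper:
    """Hands out shared snapshots; the lock is held only to read or swap the reference."""

    def __init__(self, snapshot: Snapshot):
        self._lock = threading.Lock()
        self._current = snapshot

    def get(self) -> Snapshot:
        # Readers and writers both take a reference and release the lock right away.
        with self._lock:
            return self._current

    def update(self, snapshot: Snapshot) -> None:
        # Assumption: only move forward, so a slow writer cannot roll the handle back.
        with self._lock:
            if snapshot.version > self._current.version:
                self._current = snapshot
```

A reader that called `get()` before a write keeps its old snapshot, while later callers see the version passed to `update()`.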

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-20 11:18:33 -08:00
Varun Chawla
2802764092 fix(embeddings): stop retrying OpenAI 401 authentication errors (#2995)
## Summary
Fixes #1679

This PR prevents the OpenAI embedding function from retrying when
receiving a 401 Unauthorized error. Authentication errors are permanent
failures that won't be fixed by retrying, yet the current implementation
retries all exceptions up to 7 times by default.

## Changes
- Modified `retry_with_exponential_backoff` in `utils.py` to check for
non-retryable errors before retrying
- Added `_is_non_retryable_error` helper function that detects:
  - Exceptions with name `AuthenticationError` (OpenAI's 401 error)
  - Exceptions with `status_code` attribute of 401 or 403
- Enhanced OpenAI embeddings to explicitly catch and re-raise
`AuthenticationError` with better logging
- Added unit test `test_openai_no_retry_on_401` to verify authentication
errors don't trigger retries
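
A minimal sketch of the check described in the list above. The function names come from this PR, but the bodies, the parameter names, and the injectable `sleep` argument are assumptions made for illustration, not the code in `utils.py`.

```python
import random
import time


def _is_non_retryable_error(exc: Exception) -> bool:
    """Return True for failures that retrying cannot fix (authentication errors)."""
    if type(exc).__name__ == "AuthenticationError":
        return True
    return getattr(exc, "status_code", None) in (401, 403)


def retry_with_exponential_backoff(func, initial_delay=1.0, exponential_base=2.0,
                                   jitter=True, max_retries=7, sleep=time.sleep):
    """Wrap func so transient errors are retried and permanent ones are raised at once."""

    def wrapper(*args, **kwargs):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Give up immediately on auth errors, or once the retry budget is spent.
                if _is_non_retryable_error(exc) or attempt == max_retries:
                    raise
                delay *= exponential_base * (1 + jitter * random.random())
                sleep(delay)

    return wrapper
```

Matching on the exception's class name avoids importing the OpenAI client in a shared utility module.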

## Test Plan
- Added test that verifies:
  1. A function raising `AuthenticationError` is only called once
  2. No retry delays occur (sleep is never called)
- Existing tests continue to pass
- Formatting applied via `make format`

## Example Behavior

**Before**: With an invalid API key, users would see 7 retry attempts
over ~2 minutes:
```
WARNING:root:Error occurred: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}}
 Retrying in 3.97 seconds (retry 1 of 7)
WARNING:root:Error occurred: Error code: 401...
 Retrying in 7.94 seconds (retry 2 of 7)
...
```

**After**: With an invalid API key, the error is raised immediately:
```
ERROR:root:Authentication failed: Invalid API key provided
AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}}
```

This provides better UX and prevents unnecessary API calls that would
fail anyway.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2026-02-19 09:20:54 -08:00
Weston Pace
37bbb0dba1 fix: allow permutation reader to work with remote tables as well (#3047)
Fixed one more spot that was relying on `_inner`.
2026-02-19 00:41:41 +05:30
Prashanth Rao
155ec16161 fix: deprecate outdated files for embedding registry (#3037)
There are old and outdated files in our embedding registry that can
confuse coding agents. This PR deprecates the following files that have
newer, more modern methods to generate such embeddings.

- Deprecate `embeddings/siglip.py` 
- Deprecate `embeddings/gte.py` 

## Why this change?

Per a discussion with @AyushExel, the [embedding registry directory
](1840aa7edc/python/python/lancedb/embeddings)
in the LanceDB repo has a number of outdated files that need to be
deprecated.

See https://github.com/lancedb/docs/issues/85 for the docs gaps that
prompted this change.
- Add note in `openclip` docs that it can be used for SigLip embeddings,
which it now supports
- Add note in the `sentence-transformers` page that ALL text embedding
models on Hugging Face can be used
2026-02-18 12:04:39 -05:00
Weston Pace
636b8b5bbd fix: allow permutation reader to be used with remote tables (#3019)
There were two issues:

1. The Python code needs access to the underlying Rust table to set up
the permutation reader, and the attributes involved differ between the
Python local table and remote table objects.
~~2. The remote table was sending projection dictionaries as arrays of
tuples and (on LanceDB cloud at least) it does not appear this is how
rest servers are setup to receive them.~~ (this is now fixed as #3023)

~~Leaving as draft as this is built on
https://github.com/lancedb/lancedb/pull/3016~~
2026-02-18 05:44:08 -08:00
Omair Afzal
715b81c86b fix(python): graceful handling of empty result sets in hybrid search (#3030)
## Problem

When applying hard filters that result in zero matches, hybrid search
crashes with `IndexError: list index out of range` during reranking.
This happens because empty result tables are passed through the full
reranker pipeline, which expects at least one result.

Traceback from the issue:
```
lancedb/query.py: in _combine_hybrid_results
    results = reranker.rerank_hybrid(fts_query, vector_results, fts_results)
lancedb/rerankers/answerdotai.py: in rerank_hybrid
    combined_results = self._rerank(combined_results, query)
...
IndexError: list index out of range
```

## Fix

Added an early return in `_combine_hybrid_results` when both vector and
FTS results are empty. Instead of passing empty tables through
normalization, reranking, and score restoration (which can fail in
various ways), we now build a properly-typed empty result table with the
`_relevance_score` column and return it directly.
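
The early return can be sketched without Arrow, assuming tables are modeled as plain `{column_name: [values]}` dicts. The function name `combine_hybrid_results`, the `num_rows` helper, and the `rerank` callable are hypothetical stand-ins; the real `_combine_hybrid_results` works on Arrow tables.

```python
def num_rows(table):
    """Row count of a column-major table modeled as {column_name: [values]}."""
    return len(next(iter(table.values()), []))


def combine_hybrid_results(vector_results, fts_results, rerank, with_row_ids=False):
    """Skip the reranker entirely when there is nothing to rerank."""
    if num_rows(vector_results) == 0 and num_rows(fts_results) == 0:
        # Build an empty result that still carries the expected columns.
        columns = [name for name in vector_results
                   if with_row_ids or name != "_rowid"]
        empty = {name: [] for name in columns}
        empty["_relevance_score"] = []
        return empty
    return rerank(vector_results, fts_results)
```

Callers can then rely on `_relevance_score` being present even when a filter matches nothing.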

## Test

Added `test_empty_hybrid_result_reranker` that exercises
`_combine_hybrid_results` directly with empty vector and FTS tables,
verifying:
- Returns empty table with correct schema  
- Includes `_relevance_score` column
- Respects `with_row_ids` flag

Closes #2425
2026-02-17 11:37:10 -08:00
Omair Afzal
7e1616376e refactor: extract merge_insert into table/merge.rs submodule (#3031)
Completes the **merge_insert.rs** checklist item from #2949.

## Changes

- Moved `MergeResult` struct from `table.rs` to `table/merge.rs`
- Moved the `NativeTable::merge_insert` implementation into
`merge::execute_merge_insert()`, with the trait impl now delegating to
it (same pattern as `delete.rs`)
- Moved `test_merge_insert` and `test_merge_insert_use_index` tests into
`table/merge.rs`
- Improved moved tests to use `memory://` URIs instead of temporary
directories
- Cleaned up unused imports from `table.rs` (`FutureExt`,
`TryFutureExt`, `Either`, `WhenMatched`, `WhenNotMatchedBySource`,
`LanceMergeInsertBuilder`)
- `MergeResult` is re-exported from `table.rs` so the public API is
unchanged

## Testing

`cargo build -p lancedb` compiles cleanly with no warnings.
2026-02-17 11:36:53 -08:00
ChinmayGowda71
d5ac5b949a refactor(rust): extract query logic to src/table/query.rs (#3035)
References #2949. Moved query logic and helpers from table.rs to
query.rs. Refactored tests using guidelines and added coverage for
multi-vector plan structure.
2026-02-17 09:04:21 -08:00
Lance Release
7be6f45e0b Bump version: 0.26.2 → 0.27.0-beta.0 2026-02-17 00:28:24 +00:00
Lance Release
d9e2d51f51 Bump version: 0.29.2 → 0.30.0-beta.0 2026-02-17 00:27:45 +00:00
LuQQiu
e081708cce fix: non-stopping dataset version check after passing the first consistency check interval (#3034)
When a table has a read consistency interval, queries within the
interval skip the version check. Once the interval expires, a list call
checks for new versions. If the version hasn't changed, the timer should
reset so the next interval begins, but it didn't. The timer stayed
expired, so every query after that triggered a list call, even though
nothing changed.

This affects all read operations (queries, schema lookups, searches) on
tables with read_consistency_interval set. Each operation adds a
list("_versions/") call to object storage, adding latency proportional
to the store's list performance. For high-QPS workloads, this can
saturate object store list throughput and significantly degrade query
latency.

Bug flow:
1. Every read operation (query, schema, search) calls
ensure_up_to_date()
2. ensure_up_to_date() calls is_up_to_date(), which compares
last_consistency_check.elapsed() against read_consistency_interval
3. If the interval has expired, it calls reload()
4. reload() calls need_reload(), which calls latest_version_id(); this
is the list IOP (list("_versions/"))
5. If no new version, reload() returns early without resetting
last_consistency_check
6. On the next query, step 2 sees the stale timer again → step 3 → step
4 → another list IOP
7. This repeats on every query forever
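
The corrected behavior for the bug flow above can be sketched in Python. The real code is Rust; `ConsistencyChecker`, the `latest_version` callable (standing in for the `list("_versions/")` call), and the injectable `clock` are assumptions made for illustration.

```python
import time


class ConsistencyChecker:
    """Checks for a new version at most once per consistency interval."""

    def __init__(self, interval, latest_version, clock=time.monotonic):
        self.interval = interval
        self.latest_version = latest_version  # stands in for the object store list call
        self.clock = clock
        self.version = latest_version()
        self.last_check = clock()

    def ensure_up_to_date(self):
        if self.clock() - self.last_check < self.interval:
            return  # still inside the interval: no list call
        latest = self.latest_version()
        if latest != self.version:
            self.version = latest
        # The fix: reset the timer even when the version did not change.
        self.last_check = self.clock()
```

With the reset in place, an unchanged table costs one list call per interval instead of one per query.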
2026-02-16 15:49:14 -08:00
Will Jones
2d60ea6938 perf(remote): cache schema of remote tables (#3015)
Caches the schema of remote tables and invalidates the cache when:

1. The 30-second TTL expires
2. We do an operation that changes the schema (e.g. add_columns) or
checks out a different version (e.g. checkout_version)
3. We get a 400, 404, or 500 response

If the schema is retrieved close to the TTL, we optimistically fetch the
schema in the background. This means a continuous stream of queries will
never have the schema fetch on the critical path.
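
A simplified sketch of the invalidation rules, assuming a hypothetical `fetch` callable that asks the server for the schema. The class and method names are illustrative, and the background prefetch near the TTL is left out to keep the example short.

```python
import time


class SchemaCache:
    """TTL cache for a remote table schema (background prefetch omitted)."""

    INVALIDATING_STATUSES = frozenset({400, 404, 500})

    def __init__(self, fetch, ttl=30.0, clock=time.monotonic):
        self._fetch = fetch  # hypothetical callable that requests the schema
        self._ttl = ttl
        self._clock = clock
        self._schema = None
        self._fetched_at = 0.0

    def get(self):
        expired = self._clock() - self._fetched_at >= self._ttl
        if self._schema is None or expired:
            self._schema = self._fetch()
            self._fetched_at = self._clock()
        return self._schema

    def invalidate(self):
        # Call after schema-changing operations or a version checkout.
        self._schema = None

    def on_response(self, status_code):
        # Error responses may mean the cached schema no longer matches the server.
        if status_code in self.INVALIDATING_STATUSES:
            self.invalidate()
```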

Closes #3014

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-13 15:21:04 -08:00
Jack Ye
dcb1443143 ci: add codex fix ci workflow (#3022)
Similar to the lance one added recently:
https://github.com/lance-format/lance/actions/workflows/codex-fix-ci.yml
2026-02-13 14:20:02 -08:00
Will Jones
c0230f91d2 feat(rust)!: accept RecordBatch, Vec<RecordBatch> in create_table() and Table.add() (#2948)
BREAKING CHANGE: Arbitrary `impl RecordBatchReader` is no longer
accepted, it must be made into `Box<dyn RecordBatchReader>`.

This PR replaces `IntoArrow` with a new trait `Scannable` to define
input row data. This provides the following advantages:

1. **We can implement `Scannable` for more types than `IntoArrow`, such
as `RecordBatch` and `Vec<RecordBatch>`.** The `IntoArrow` trait was
implemented for arbitrary `T: RecordBatchReader`, and the Rust compiler
would prevent us from implementing it for foreign types like
`RecordBatch` because (theoretically) those types might implement
`RecordBatchReader` in the future. That's why we implement `Scannable`
for `Box<dyn RecordBatchReader>` instead; since it's a concrete type it
doesn't block implementing for other foreign types.
2. **We can potentially replay `Scannable` values**. Previously, we had
to choose between buffering all data in memory and supporting retries of
writes. But because `Scannable` things can optionally support
re-scanning, we now have a way of supporting retries while also
streaming.
3. **`Scannable` can provide hints like `num_rows`, which can be used to
schedule parallel writers.** Without knowing the total number of rows,
it's difficult to know whether it's worth writing multiple files in
parallel.

We don't yet fully take advantage of (2) and (3), but will in future
PRs. For (2), in order to be ready to leverage this, we need to hook the
`Scannable` implementation up to Python and NodeJS bindings. Right now
they always pass down a stream, but we want to make sure they support
retries when possible. And for (3), this will need to be hooked up to
#2939 and to a pipeline for running pre-processing steps (like embedding
generation).

## Other changes

* Moved `create_table` and `add_data` into their own modules. I've
created a follow up issue to split up `table.rs` further, as it's by far
the largest file: https://github.com/lancedb/lancedb/issues/2949
* Eliminated the `HAS_DATA` generic for `CreateTableBuilder`. I didn't
see any public-facing places where we differentiated methods, which is
why I felt this simplification was okay.
* Added an `Error::External` variant and integrated some conversions to
allow certain errors to pass through transparently. This will fully work
once we upgrade Lance and get to take advantage of changes in
https://github.com/lance-format/lance/pull/5606
* Added LZ4 compression support for write requests to remote endpoints.
I checked and this has been supported on the server for > 1 year.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 14:18:36 -08:00
LanceDB Robot
5d629c9ecb feat: update lance dependency to v2.0.1 (#3027)
## Summary
- Bump Lance Rust workspace dependencies to v2.0.1 and update Java
`lance-core` version.
- Verified `cargo clippy --workspace --tests --all-features -- -D
warnings` and `cargo fmt --all`.
- Triggering tag: https://github.com/lancedb/lance/releases/tag/v2.0.1
2026-02-13 13:53:02 -08:00
Weston Pace
14973ac9d1 fix: support dynamic projection on remote table (#3023)
The remote server expects an object (`{"alias": "col"}`) and the client
was previously sending a list of tuples `[["alias", "col"]]`
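
The difference between the two wire formats is easy to see with the standard `json` module. The helper name `projection_to_json` is hypothetical; the actual client code is not shown in this commit message.

```python
import json


def projection_to_json(projection):
    """Serialize a projection as a JSON object mapping alias to column expression."""
    # dict() accepts either a mapping or an iterable of (alias, expression) pairs.
    return json.dumps(dict(projection))


pairs = [("alias", "col")]
old_payload = json.dumps(pairs)          # a list of tuples becomes nested arrays
new_payload = projection_to_json(pairs)  # the object form the server expects
```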
2026-02-13 10:10:56 -08:00
Weston Pace
70cbee6293 feat: improve Permutation pytorch integration (#3016)
This changes around the output format of `Permutation` in some breaking
ways but I think the API is still new enough to be considered
experimental.

1. In order to align with both huggingface's dataset and torch's
expectations the default output format is now a list of dicts
(row-major) instead of a dict of lists (column-major). I've added a
python_col option which will return the dict of lists.
2. In order to align with pytorch's expectation the `torch` format is
now a list of tensors (row-major) instead of a 2D tensor (column-major).
I've added a torch_col option which will return the 2D tensor instead.
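
The row-major and column-major layouts in item 1 hold the same data in different orientations. A small sketch of the conversion, using hypothetical helper names that are not part of the `Permutation` API:

```python
def columns_to_rows(batch):
    """Turn a dict of lists (column-major) into a list of dicts (row-major)."""
    names = list(batch)
    return [dict(zip(names, values)) for values in zip(*batch.values())]


def rows_to_columns(rows):
    """Turn a list of dicts (row-major) back into a dict of lists (column-major)."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}
```

The row-major form is what a per-item `Dataset` consumer typically expects, since each element is one complete example.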

Added tests for torch integration with Permutation

~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013
merges as this is built on top of that~~
2026-02-12 13:41:14 -08:00
Weston Pace
02783bf440 feat: add a getitems implementation for the permutation (#3013) 2026-02-12 05:36:11 -08:00
Dhruv
4323ca0147 feat: show reranker info in hybrid search explain plan (#3006)
Closes #3000

The hybrid search `explain_plan` now shows the reranker as the top-level
node with
the vector and FTS sub-plans indented underneath, instead of just
listing them
separately with no reranker context.

**Before:**
```
Vector Search Plan:
ProjectionExec: ...
FTS Search Plan:
ProjectionExec: ...
```

**After:**
```
RRFReranker(K=60)
  Vector Search Plan:
  ProjectionExec: ...
  FTS Search Plan:
  ProjectionExec: ...
```

Other rerankers display similarly, e.g.
`LinearCombinationReranker(weight=0.7, fill=1.0)`,
`MRRReranker(weight_vector=0.5, weight_fts=0.5)`,
`CohereReranker(model_name=name)`.
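
The nesting shown in the "After" output can be produced with `textwrap.indent`. This is only a sketch of the layout; `format_hybrid_plan` is a hypothetical helper, not the function used by `explain_plan`.

```python
import textwrap


def format_hybrid_plan(reranker_repr, vector_plan, fts_plan):
    """Put the reranker on top and indent both sub-plans beneath it."""
    body = "\n".join(
        ["Vector Search Plan:", vector_plan, "FTS Search Plan:", fts_plan]
    )
    return reranker_repr + "\n" + textwrap.indent(body, "  ")
```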

---------

Signed-off-by: dask-58 <googldhruv@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2026-02-10 11:45:39 -08:00
Dhruv
bd3dd6a8e5 fix: improve error message for multi-field FTS index creation (#3005)
Fixes #2999

The error message previously said `"field_names must be a string when
use_tantivy=False"`, implying that users should switch to the
soon-to-be-deprecated tantivy backend (#2998).

Updated the error message and docstring to instead guide users to create
a separate FTS index for each field.

Signed-off-by: dask-58 <googldhruv@gmail.com>
2026-02-09 16:28:50 -08:00
Abhishek
3c1162612e refactor: extract optimize logic from table.rs into submodule (#2979)
## Summary

Continues the modularization effort of table operations as outlined in
#2949.

- Extracts optimization operations (`OptimizeAction`, `OptimizeStats`,
`execute_optimize`, `compact_files_impl`, `cleanup_old_versions`,
`optimize_indices`) from `table.rs` into `table/optimize.rs`
- Public API remains unchanged via re-exports
- Adds comprehensive tests including error cases with message assertions

## Test plan

- [x] All new optimization tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2026-02-09 16:22:57 -08:00
Jack Ye
53c7c560c9 feat: add third party licenses lists (#3010)
The files are generated with `make licenses`, which currently has to be
run manually. This could be automated in the future.
2026-02-09 16:16:46 -08:00
Lance Release
de4f77800d Bump version: 0.26.2-beta.0 → 0.26.2 2026-02-09 06:06:22 +00:00
Lance Release
b6ab721cf7 Bump version: 0.26.1 → 0.26.2-beta.0 2026-02-09 06:06:03 +00:00
107 changed files with 58430 additions and 3442 deletions

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.26.1"
current_version = "0.27.0-beta.0"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

173
.github/workflows/codex-fix-ci.yml vendored Normal file

@@ -0,0 +1,173 @@
name: Codex Fix CI
on:
  workflow_dispatch:
    inputs:
      workflow_run_url:
        description: "Failing CI workflow run URL (e.g., https://github.com/lancedb/lancedb/actions/runs/12345678)"
        required: true
        type: string
      branch:
        description: "Branch to fix (e.g., main, release/v2.0, or feature-branch)"
        required: true
        type: string
      guidelines:
        description: "Additional guidelines for the fix (optional)"
        required: false
        type: string
permissions:
  contents: write
  pull-requests: write
  actions: read
jobs:
  fix-ci:
    runs-on: warp-ubuntu-latest-x64-4x
    timeout-minutes: 60
    env:
      CC: clang
      CXX: clang++
    steps:
      - name: Show inputs
        run: |
          echo "workflow_run_url = ${{ inputs.workflow_run_url }}"
          echo "branch = ${{ inputs.branch }}"
          echo "guidelines = ${{ inputs.guidelines }}"
      - name: Checkout Repo
        uses: actions/checkout@v4
        with:
          ref: ${{ inputs.branch }}
          fetch-depth: 0
          persist-credentials: true
      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install Codex CLI
        run: npm install -g @openai/codex
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
        with:
          toolchain: stable
          components: clippy, rustfmt
      - uses: Swatinem/rust-cache@v2
      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y protobuf-compiler libssl-dev
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Python dependencies
        run: |
          pip install maturin ruff pytest pyarrow pandas polars
      - name: Set up Java
        uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '11'
          cache: maven
      - name: Install Node.js dependencies for TypeScript bindings
        run: |
          cd nodejs
          npm ci
      - name: Configure git user
        run: |
          git config user.name "lancedb automation"
          git config user.email "robot@lancedb.com"
      - name: Run Codex to fix CI failure
        env:
          WORKFLOW_RUN_URL: ${{ inputs.workflow_run_url }}
          BRANCH: ${{ inputs.branch }}
          GUIDELINES: ${{ inputs.guidelines }}
          GITHUB_TOKEN: ${{ secrets.ROBOT_TOKEN }}
          GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
          OPENAI_API_KEY: ${{ secrets.CODEX_TOKEN }}
        run: |
          set -euo pipefail
          cat <<EOF >/tmp/codex-prompt.txt
          You are running inside the lancedb repository on a GitHub Actions runner. Your task is to fix a CI failure.
          Input parameters:
          - Failing workflow run URL: ${WORKFLOW_RUN_URL}
          - Branch to fix: ${BRANCH}
          - Additional guidelines: ${GUIDELINES:-"None provided"}
          Follow these steps exactly:
          1. Extract the run ID from the workflow URL. The URL format is https://github.com/lancedb/lancedb/actions/runs/<run_id>.
          2. Use "gh run view <run_id> --json jobs,conclusion,name" to get information about the failed run.
          3. Identify which jobs failed. For each failed job, use "gh run view <run_id> --job <job_id> --log-failed" to get the failure logs.
          4. Analyze the failure logs to understand what went wrong. Common failures include:
             - Compilation errors
             - Test failures
             - Clippy warnings treated as errors
             - Formatting issues
             - Dependency issues
          5. Based on the analysis, fix the issues in the codebase:
             - For compilation errors: Fix the code that doesn't compile
             - For test failures: Fix the failing tests or the code they test
             - For clippy warnings: Apply the suggested fixes
             - For formatting issues: Run "cargo fmt --all"
             - For other issues: Apply appropriate fixes
          6. After making fixes, verify them locally:
             - Run "cargo fmt --all" to ensure formatting is correct
             - Run "cargo clippy --workspace --tests --all-features -- -D warnings" to check for issues
             - Run ONLY the specific failing tests to confirm they pass now:
               - For Rust test failures: Run the specific test with "cargo test -p <crate> <test_name>"
               - For Python test failures: Build with "cd python && maturin develop" then run "pytest <specific_test_file>::<test_name>"
               - For Java test failures: Run "cd java && mvn test -Dtest=<TestClass>#<testMethod>"
               - For TypeScript test failures: Run "cd nodejs && npm run build && npm test -- --testNamePattern='<test_name>'"
             - Do NOT run the full test suite - only run the tests that were failing
          7. If the additional guidelines are provided, follow them as well.
          8. Inspect "git status --short" and "git diff" to review your changes.
          9. Create a fix branch: "git checkout -b codex/fix-ci-<run_id>".
          10. Stage all changes with "git add -A" and commit with message "fix: resolve CI failures from run <run_id>".
          11. Push the branch: "git push origin codex/fix-ci-<run_id>". If the remote branch exists, delete it first with "gh api -X DELETE repos/lancedb/lancedb/git/refs/heads/codex/fix-ci-<run_id>" then push. Do NOT use "git push --force" or "git push -f".
          12. Create a pull request targeting "${BRANCH}":
             - Title: "ci: <short summary describing the fix>" (e.g., "ci: fix clippy warnings in lancedb" or "ci: resolve test flakiness in vector search")
             - First, write the PR body to /tmp/pr-body.md using a heredoc (cat <<'PREOF' > /tmp/pr-body.md). The body should include:
               - Link to the failing workflow run
               - Summary of what failed
               - Description of the fixes applied
             - Then run "gh pr create --base ${BRANCH} --body-file /tmp/pr-body.md".
          13. Display the new PR URL, "git status --short", and a summary of what was fixed.
          Constraints:
          - Use bash commands for all operations.
          - Do not merge the PR.
          - Do not modify GitHub workflow files unless they are the cause of the failure.
          - If any command fails, diagnose and attempt to fix the issue instead of aborting immediately.
          - If you cannot fix the issue automatically, create the PR anyway with a clear explanation of what you tried and what remains to be fixed.
          - env "GH_TOKEN" is available, use "gh" tools for GitHub-related operations.
          EOF
          printenv OPENAI_API_KEY | codex login --with-api-key
          codex --config shell_environment_policy.ignore_default_excludes=true exec --dangerously-bypass-approvals-and-sandbox "$(cat /tmp/codex-prompt.txt)"

123
Cargo.lock generated

@@ -1389,9 +1389,9 @@ checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
[[package]]
name = "bytes"
version = "1.10.1"
version = "1.11.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d71b6127be86fdcfddb610f7182ac57211d4b18a3e9c82eb2d17662f2227ad6a"
checksum = "1e748733b7cbc798e1434b6ac524f0c1ff2ab456fe201501e6497c8417a4fc33"
[[package]]
name = "bytes-utils"
@@ -1783,6 +1783,16 @@ dependencies = [
"crossbeam-utils",
]
[[package]]
name = "crossbeam-skiplist"
version = "0.1.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "df29de440c58ca2cc6e587ec3d22347551a32435fbde9d2bff64e78a9ffa151b"
dependencies = [
"crossbeam-epoch",
"crossbeam-utils",
]
[[package]]
name = "crossbeam-utils"
version = "0.8.21"
@@ -3072,9 +3082,8 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
[[package]]
name = "fsst"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0f03a771ab914e207dd26bd2f12666839555ec8ecc7e1770e1ed6f9900d899a4"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-array",
"rand 0.9.2",
@@ -4405,9 +4414,8 @@ dependencies = [
[[package]]
name = "lance"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "47b685aca3f97ee02997c83ded16f59c747ccb69e74c8abbbae4aa3d22cf1301"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-arith",
@@ -4426,6 +4434,7 @@ dependencies = [
"byteorder",
"bytes",
"chrono",
"crossbeam-skiplist",
"dashmap",
"datafusion",
"datafusion-expr",
@@ -4465,6 +4474,7 @@ dependencies = [
"tantivy",
"tokio",
"tokio-stream",
"tokio-util",
"tracing",
"url",
"uuid",
@@ -4472,9 +4482,8 @@ dependencies = [
[[package]]
name = "lance-arrow"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "daf00c7537df524cc518a089f0d156a036d95ca3f5bc2bc1f0a9f9293e9b62ef"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4493,9 +4502,8 @@ dependencies = [
[[package]]
name = "lance-bitpacking"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "46752e4ac8fc5590a445e780b63a8800adc7a770bd74770a8dc66963778e4e77"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrayref",
"paste",
@@ -4504,9 +4512,8 @@ dependencies = [
[[package]]
name = "lance-core"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "3d13d87d07305c6d4b4dc7780fb1107babf782a0e5b1dc7872e17ae1f8fd11ca"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4543,9 +4550,8 @@ dependencies = [
[[package]]
name = "lance-datafusion"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6451b5af876eaef8bec4b38a39dadac9d44621e1ecf85d0cdf6097a5d0aa8721"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-array",
@@ -4568,6 +4574,7 @@ dependencies = [
"log",
"pin-project",
"prost",
"prost-build",
"snafu",
"tokio",
"tracing",
@@ -4575,9 +4582,8 @@ dependencies = [
[[package]]
name = "lance-datagen"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e1736708dd7867dfbab8fcc930b21c96717c6c00be73b7d9a240336a4ed80375"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-array",
@@ -4595,9 +4601,8 @@ dependencies = [
[[package]]
name = "lance-encoding"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d6b6ca4ff94833240d5ba4a94a742cba786d1949b3c3fa7e11d6f0050443432a"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4634,9 +4639,8 @@ dependencies = [
[[package]]
name = "lance-file"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "55fbe959bffe185543aed3cbeb14484f1aa2e55886034fdb1ea3d8cc9b70aad8"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4668,9 +4672,8 @@ dependencies = [
[[package]]
name = "lance-geo"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a52b0adabc953d457f336a784a3b37353a180e6a79905f544949746e0d4c6483"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"datafusion",
"geo-traits",
@@ -4684,9 +4687,8 @@ dependencies = [
[[package]]
name = "lance-index"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6b67654bf86fd942dd2cf08294ee7e91053427cd148225f49c9ff398ff9a40fd"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-arith",
@@ -4753,9 +4755,8 @@ dependencies = [
[[package]]
name = "lance-io"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8eb0ccc1c414e31687d83992d546af0a0237c8d2f4bf2ae3d347d539fd0fc141"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-arith",
@@ -4788,6 +4789,7 @@ dependencies = [
"serde",
"shellexpand",
"snafu",
"tempfile",
"tokio",
"tracing",
"url",
@@ -4795,9 +4797,8 @@ dependencies = [
[[package]]
name = "lance-linalg"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "083404cf12dcdb1a7df98fb58f9daf626b6e43a2f794b37b6b89b4012a0e1f78"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4813,9 +4814,8 @@ dependencies = [
[[package]]
name = "lance-namespace"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c12778d2aabf9c2bfd16e2509ebe120e562a288d8ae630ec6b6b204868df41b2"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"async-trait",
@@ -4827,9 +4827,8 @@ dependencies = [
[[package]]
name = "lance-namespace-impls"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8863aababdd13a6d2c8d6179dc6981f4f8f49d8b66a00c5dd75115aec4cadc99"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-ipc",
@@ -4844,6 +4843,7 @@ dependencies = [
"lance-index",
"lance-io",
"lance-namespace",
"lance-table",
"log",
"object_store",
"rand 0.9.2",
@@ -4859,9 +4859,9 @@ dependencies = [
[[package]]
name = "lance-namespace-reqwest-client"
version = "0.4.5"
version = "0.5.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a2acdba67f84190067532fce07b51a435dd390d7cdc1129a05003e5cb3274cf0"
checksum = "3ad4c947349acd6e37e984eba0254588bd894e6128434338b9e6904e56fb4633"
dependencies = [
"reqwest",
"serde",
@@ -4872,9 +4872,8 @@ dependencies = [
[[package]]
name = "lance-table"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0fcc83f197ce2000c4abe4f5e0873490ab1f41788fa76571c4209b87d4daf50"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow",
"arrow-array",
@@ -4913,9 +4912,8 @@ dependencies = [
[[package]]
name = "lance-testing"
version = "2.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7fb1f7c7e06f91360e141ecee1cf2110f858c231705f69f2cd2fda9e30c1e9f4"
version = "3.0.0-beta.4"
source = "git+https://github.com/lance-format/lance.git?tag=v3.0.0-beta.4#649b4871dbcbc4f26586e33677b1f9535d9a7f63"
dependencies = [
"arrow-array",
"arrow-schema",
@@ -4926,7 +4924,7 @@ dependencies = [
[[package]]
name = "lancedb"
version = "0.26.1"
version = "0.27.0-beta.0"
dependencies = [
"ahash",
"anyhow",
@@ -5006,7 +5004,7 @@ dependencies = [
[[package]]
name = "lancedb-nodejs"
version = "0.26.1"
version = "0.27.0-beta.0"
dependencies = [
"arrow-array",
"arrow-ipc",
@@ -5026,7 +5024,7 @@ dependencies = [
[[package]]
name = "lancedb-python"
version = "0.29.1"
version = "0.30.0-beta.0"
dependencies = [
"arrow",
"async-trait",
@@ -5628,11 +5626,10 @@ dependencies = [
[[package]]
name = "num-bigint-dig"
version = "0.8.4"
version = "0.8.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "dc84195820f291c7697304f3cbdadd1cb7199c0efc917ff5eafd71225c136151"
checksum = "e661dda6640fad38e827a6d4a310ff4763082116fe217f279885c97f511bb0b7"
dependencies = [
"byteorder",
"lazy_static",
"libm",
"num-integer",
@@ -7274,9 +7271,9 @@ dependencies = [
[[package]]
name = "roaring"
version = "0.10.12"
version = "0.11.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "19e8d2cfa184d94d0726d650a9f4a1be7f9b76ac9fdb954219878dc00c1c1e7b"
checksum = "8ba9ce64a8f45d7fc86358410bb1a82e8c987504c0d4900e9141d69a9f26c885"
dependencies = [
"bytemuck",
"byteorder",


@@ -15,20 +15,20 @@ categories = ["database-implementations"]
rust-version = "1.88.0"
[workspace.dependencies]
lance = { "version" = "=2.0.0", default-features = false }
lance-core = "=2.0.0"
lance-datagen = "=2.0.0"
lance-file = "=2.0.0"
lance-io = { "version" = "=2.0.0", default-features = false }
lance-index = "=2.0.0"
lance-linalg = "=2.0.0"
lance-namespace = "=2.0.0"
lance-namespace-impls = { "version" = "=2.0.0", default-features = false }
lance-table = "=2.0.0"
lance-testing = "=2.0.0"
lance-datafusion = "=2.0.0"
lance-encoding = "=2.0.0"
lance-arrow = "=2.0.0"
lance = { "version" = "=3.0.0-beta.4", default-features = false, "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=3.0.0-beta.4", default-features = false, "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=3.0.0-beta.4", default-features = false, "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=3.0.0-beta.4", "tag" = "v3.0.0-beta.4", "git" = "https://github.com/lance-format/lance.git" }
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "57.2", optional = false }

Makefile (new file, 9 lines)

@@ -0,0 +1,9 @@
.PHONY: licenses
licenses:
	cargo about generate about.hbs -o RUST_THIRD_PARTY_LICENSES.html -c about.toml
	cd python && cargo about generate ../about.hbs -o RUST_THIRD_PARTY_LICENSES.html -c ../about.toml
	cd python && uv sync --all-extras && uv tool run pip-licenses --python .venv/bin/python --format=markdown --with-urls --output-file=PYTHON_THIRD_PARTY_LICENSES.md
	cd nodejs && cargo about generate ../about.hbs -o RUST_THIRD_PARTY_LICENSES.html -c ../about.toml
	cd nodejs && npx license-checker --markdown --out NODEJS_THIRD_PARTY_LICENSES.md
	cd java && ./mvnw license:aggregate-add-third-party -q

RUST_THIRD_PARTY_LICENSES.html (new file, 15276 lines)

File diff suppressed because it is too large.

about.hbs (new file, 70 lines)

@@ -0,0 +1,70 @@
<html>
<head>
<style>
@media (prefers-color-scheme: dark) {
body {
background: #333;
color: white;
}
a {
color: skyblue;
}
}
.container {
font-family: sans-serif;
max-width: 800px;
margin: 0 auto;
}
.intro {
text-align: center;
}
.licenses-list {
list-style-type: none;
margin: 0;
padding: 0;
}
.license-used-by {
margin-top: -10px;
}
.license-text {
max-height: 200px;
overflow-y: scroll;
white-space: pre-wrap;
}
</style>
</head>
<body>
<main class="container">
<div class="intro">
<h1>Third Party Licenses</h1>
<p>This page lists the licenses of the projects used in cargo-about.</p>
</div>
<h2>Overview of licenses:</h2>
<ul class="licenses-overview">
{{#each overview}}
<li><a href="#{{id}}">{{name}}</a> ({{count}})</li>
{{/each}}
</ul>
<h2>All license text:</h2>
<ul class="licenses-list">
{{#each licenses}}
<li class="license">
<h3 id="{{id}}">{{name}}</h3>
<h4>Used by:</h4>
<ul class="license-used-by">
{{#each used_by}}
<li><a href="{{#if crate.repository}} {{crate.repository}} {{else}} https://crates.io/crates/{{crate.name}} {{/if}}">{{crate.name}} {{crate.version}}</a></li>
{{/each}}
</ul>
<pre class="license-text">{{text}}</pre>
</li>
{{/each}}
</ul>
</main>
</body>
</html>

about.toml (new file, 18 lines)

@@ -0,0 +1,18 @@
accepted = [
"0BSD",
"Apache-2.0",
"Apache-2.0 WITH LLVM-exception",
"BSD-2-Clause",
"BSD-3-Clause",
"BSL-1.0",
"bzip2-1.0.6",
"CC0-1.0",
"CDDL-1.0",
"CDLA-Permissive-2.0",
"ISC",
"MIT",
"MPL-2.0",
"OpenSSL",
"Unicode-3.0",
"Zlib",
]


@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
<dependency>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-core</artifactId>
<version>0.26.1</version>
<version>0.27.0-beta.0</version>
</dependency>
```


@@ -0,0 +1,71 @@
List of third-party dependencies grouped by their license type.
Apache 2.0:
* error-prone annotations (com.google.errorprone:error_prone_annotations:2.28.0 - https://errorprone.info/error_prone_annotations)
Apache License 2.0:
* JsonNullable Jackson module (org.openapitools:jackson-databind-nullable:0.2.6 - https://github.com/OpenAPITools/jackson-databind-nullable)
Apache License V2.0:
* FlatBuffers Java API (com.google.flatbuffers:flatbuffers-java:23.5.26 - https://github.com/google/flatbuffers)
Apache License, Version 2.0:
* Apache Commons Codec (commons-codec:commons-codec:1.15 - https://commons.apache.org/proper/commons-codec/)
* Apache HttpClient (org.apache.httpcomponents.client5:httpclient5:5.2.1 - https://hc.apache.org/httpcomponents-client-5.0.x/5.2.1/httpclient5/)
* Apache HttpComponents Core HTTP/1.1 (org.apache.httpcomponents.core5:httpcore5:5.2 - https://hc.apache.org/httpcomponents-core-5.2.x/5.2/httpcore5/)
* Apache HttpComponents Core HTTP/2 (org.apache.httpcomponents.core5:httpcore5-h2:5.2 - https://hc.apache.org/httpcomponents-core-5.2.x/5.2/httpcore5-h2/)
* Arrow Format (org.apache.arrow:arrow-format:15.0.0 - https://arrow.apache.org/arrow-format/)
* Arrow Java C Data Interface (org.apache.arrow:arrow-c-data:15.0.0 - https://arrow.apache.org/arrow-c-data/)
* Arrow Java Dataset (org.apache.arrow:arrow-dataset:15.0.0 - https://arrow.apache.org/arrow-dataset/)
* Arrow Memory - Core (org.apache.arrow:arrow-memory-core:15.0.0 - https://arrow.apache.org/arrow-memory/arrow-memory-core/)
* Arrow Memory - Netty (org.apache.arrow:arrow-memory-netty:15.0.0 - https://arrow.apache.org/arrow-memory/arrow-memory-netty/)
* Arrow Vectors (org.apache.arrow:arrow-vector:15.0.0 - https://arrow.apache.org/arrow-vector/)
* Guava: Google Core Libraries for Java (com.google.guava:guava:33.3.1-jre - https://github.com/google/guava)
* J2ObjC Annotations (com.google.j2objc:j2objc-annotations:3.0.0 - https://github.com/google/j2objc/)
* Netty/Buffer (io.netty:netty-buffer:4.1.104.Final - https://netty.io/netty-buffer/)
* Netty/Common (io.netty:netty-common:4.1.104.Final - https://netty.io/netty-common/)
Apache-2.0:
* Apache Commons Lang (org.apache.commons:commons-lang3:3.18.0 - https://commons.apache.org/proper/commons-lang/)
* lance-namespace-apache-client (org.lance:lance-namespace-apache-client:0.4.5 - https://github.com/openapitools/openapi-generator)
* lance-namespace-core (org.lance:lance-namespace-core:0.4.5 - https://lance.org/format/namespace/lance-namespace-core/)
EDL 1.0:
* Jakarta Activation API jar (jakarta.activation:jakarta.activation-api:1.2.2 - https://github.com/eclipse-ee4j/jaf/jakarta.activation-api)
Eclipse Distribution License - v 1.0:
* Eclipse Collections API (org.eclipse.collections:eclipse-collections-api:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections-api)
* Eclipse Collections Main Library (org.eclipse.collections:eclipse-collections:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections)
* Jakarta XML Binding API (jakarta.xml.bind:jakarta.xml.bind-api:2.3.3 - https://github.com/eclipse-ee4j/jaxb-api/jakarta.xml.bind-api)
Eclipse Public License - v 1.0:
* Eclipse Collections API (org.eclipse.collections:eclipse-collections-api:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections-api)
* Eclipse Collections Main Library (org.eclipse.collections:eclipse-collections:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections)
The Apache Software License, Version 2.0:
* FindBugs-jsr305 (com.google.code.findbugs:jsr305:3.0.2 - http://findbugs.sourceforge.net/)
* Guava InternalFutureFailureAccess and InternalFutures (com.google.guava:failureaccess:1.0.2 - https://github.com/google/guava/failureaccess)
* Guava ListenableFuture only (com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava - https://github.com/google/guava/listenablefuture)
* Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.16.0 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
* Jackson module: Old JAXB Annotations (javax.xml.bind) (com.fasterxml.jackson.module:jackson-module-jaxb-annotations:2.17.1 - https://github.com/FasterXML/jackson-modules-base)
* Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.16.0 - https://github.com/FasterXML/jackson)
* Jackson-core (com.fasterxml.jackson.core:jackson-core:2.16.0 - https://github.com/FasterXML/jackson-core)
* jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.15.2 - https://github.com/FasterXML/jackson)
* Jackson-JAXRS: base (com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.17.1 - https://github.com/FasterXML/jackson-jaxrs-providers/jackson-jaxrs-base)
* Jackson-JAXRS: JSON (com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.17.1 - https://github.com/FasterXML/jackson-jaxrs-providers/jackson-jaxrs-json-provider)
* JAR JNI Loader (org.questdb:jar-jni:1.1.1 - https://github.com/questdb/rust-maven-plugin)
* Lance Core (org.lance:lance-core:2.0.0 - https://lance.org/)
The MIT License:
* Checker Qual (org.checkerframework:checker-qual:3.43.0 - https://checkerframework.org/)


@@ -0,0 +1,71 @@
List of third-party dependencies grouped by their license type.
Apache 2.0:
* error-prone annotations (com.google.errorprone:error_prone_annotations:2.28.0 - https://errorprone.info/error_prone_annotations)
Apache License 2.0:
* JsonNullable Jackson module (org.openapitools:jackson-databind-nullable:0.2.6 - https://github.com/OpenAPITools/jackson-databind-nullable)
Apache License V2.0:
* FlatBuffers Java API (com.google.flatbuffers:flatbuffers-java:23.5.26 - https://github.com/google/flatbuffers)
Apache License, Version 2.0:
* Apache Commons Codec (commons-codec:commons-codec:1.15 - https://commons.apache.org/proper/commons-codec/)
* Apache HttpClient (org.apache.httpcomponents.client5:httpclient5:5.2.1 - https://hc.apache.org/httpcomponents-client-5.0.x/5.2.1/httpclient5/)
* Apache HttpComponents Core HTTP/1.1 (org.apache.httpcomponents.core5:httpcore5:5.2 - https://hc.apache.org/httpcomponents-core-5.2.x/5.2/httpcore5/)
* Apache HttpComponents Core HTTP/2 (org.apache.httpcomponents.core5:httpcore5-h2:5.2 - https://hc.apache.org/httpcomponents-core-5.2.x/5.2/httpcore5-h2/)
* Arrow Format (org.apache.arrow:arrow-format:15.0.0 - https://arrow.apache.org/arrow-format/)
* Arrow Java C Data Interface (org.apache.arrow:arrow-c-data:15.0.0 - https://arrow.apache.org/arrow-c-data/)
* Arrow Java Dataset (org.apache.arrow:arrow-dataset:15.0.0 - https://arrow.apache.org/arrow-dataset/)
* Arrow Memory - Core (org.apache.arrow:arrow-memory-core:15.0.0 - https://arrow.apache.org/arrow-memory/arrow-memory-core/)
* Arrow Memory - Netty (org.apache.arrow:arrow-memory-netty:15.0.0 - https://arrow.apache.org/arrow-memory/arrow-memory-netty/)
* Arrow Vectors (org.apache.arrow:arrow-vector:15.0.0 - https://arrow.apache.org/arrow-vector/)
* Guava: Google Core Libraries for Java (com.google.guava:guava:33.3.1-jre - https://github.com/google/guava)
* J2ObjC Annotations (com.google.j2objc:j2objc-annotations:3.0.0 - https://github.com/google/j2objc/)
* Netty/Buffer (io.netty:netty-buffer:4.1.104.Final - https://netty.io/netty-buffer/)
* Netty/Common (io.netty:netty-common:4.1.104.Final - https://netty.io/netty-common/)
Apache-2.0:
* Apache Commons Lang (org.apache.commons:commons-lang3:3.18.0 - https://commons.apache.org/proper/commons-lang/)
* lance-namespace-apache-client (org.lance:lance-namespace-apache-client:0.4.5 - https://github.com/openapitools/openapi-generator)
* lance-namespace-core (org.lance:lance-namespace-core:0.4.5 - https://lance.org/format/namespace/lance-namespace-core/)
EDL 1.0:
* Jakarta Activation API jar (jakarta.activation:jakarta.activation-api:1.2.2 - https://github.com/eclipse-ee4j/jaf/jakarta.activation-api)
Eclipse Distribution License - v 1.0:
* Eclipse Collections API (org.eclipse.collections:eclipse-collections-api:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections-api)
* Eclipse Collections Main Library (org.eclipse.collections:eclipse-collections:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections)
* Jakarta XML Binding API (jakarta.xml.bind:jakarta.xml.bind-api:2.3.3 - https://github.com/eclipse-ee4j/jaxb-api/jakarta.xml.bind-api)
Eclipse Public License - v 1.0:
* Eclipse Collections API (org.eclipse.collections:eclipse-collections-api:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections-api)
* Eclipse Collections Main Library (org.eclipse.collections:eclipse-collections:11.1.0 - https://github.com/eclipse/eclipse-collections/eclipse-collections)
The Apache Software License, Version 2.0:
* FindBugs-jsr305 (com.google.code.findbugs:jsr305:3.0.2 - http://findbugs.sourceforge.net/)
* Guava InternalFutureFailureAccess and InternalFutures (com.google.guava:failureaccess:1.0.2 - https://github.com/google/guava/failureaccess)
* Guava ListenableFuture only (com.google.guava:listenablefuture:9999.0-empty-to-avoid-conflict-with-guava - https://github.com/google/guava/listenablefuture)
* Jackson datatype: JSR310 (com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.16.0 - https://github.com/FasterXML/jackson-modules-java8/jackson-datatype-jsr310)
* Jackson module: Old JAXB Annotations (javax.xml.bind) (com.fasterxml.jackson.module:jackson-module-jaxb-annotations:2.17.1 - https://github.com/FasterXML/jackson-modules-base)
* Jackson-annotations (com.fasterxml.jackson.core:jackson-annotations:2.16.0 - https://github.com/FasterXML/jackson)
* Jackson-core (com.fasterxml.jackson.core:jackson-core:2.16.0 - https://github.com/FasterXML/jackson-core)
* jackson-databind (com.fasterxml.jackson.core:jackson-databind:2.15.2 - https://github.com/FasterXML/jackson)
* Jackson-JAXRS: base (com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.17.1 - https://github.com/FasterXML/jackson-jaxrs-providers/jackson-jaxrs-base)
* Jackson-JAXRS: JSON (com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.17.1 - https://github.com/FasterXML/jackson-jaxrs-providers/jackson-jaxrs-json-provider)
* JAR JNI Loader (org.questdb:jar-jni:1.1.1 - https://github.com/questdb/rust-maven-plugin)
* Lance Core (org.lance:lance-core:2.0.0 - https://lance.org/)
The MIT License:
* Checker Qual (org.checkerframework:checker-qual:3.43.0 - https://checkerframework.org/)


@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.26.1-final.0</version>
<version>0.27.0-beta.0</version>
<relativePath>../pom.xml</relativePath>
</parent>


@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.26.1-final.0</version>
<version>0.27.0-beta.0</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<arrow.version>15.0.0</arrow.version>
<lance-core.version>2.0.0</lance-core.version>
<lance-core.version>3.0.0-beta.4</lance-core.version>
<spotless.skip>false</spotless.skip>
<spotless.version>2.30.0</spotless.version>
<spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>
@@ -160,6 +160,19 @@
<groupId>com.diffplug.spotless</groupId>
<artifactId>spotless-maven-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>license-maven-plugin</artifactId>
<version>2.4.0</version>
<configuration>
<outputDirectory>${project.basedir}</outputDirectory>
<thirdPartyFilename>JAVA_THIRD_PARTY_LICENSES.md</thirdPartyFilename>
<fileTemplate>/org/codehaus/mojo/license/third-party-file-groupByLicense.ftl</fileTemplate>
<includedScopes>compile,runtime</includedScopes>
<excludedScopes>test,provided</excludedScopes>
<sortArtifactByName>true</sortArtifactByName>
</configuration>
</plugin>
</plugins>
<pluginManagement>
<plugins>


@@ -1,7 +1,7 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.26.1"
version = "0.27.0-beta.0"
license.workspace = true
description.workspace = true
repository.workspace = true


@@ -0,0 +1,668 @@
[@75lb/deep-merge@1.1.2](https://github.com/75lb/deep-merge) - MIT
[@aashutoshrathi/word-wrap@1.2.6](https://github.com/aashutoshrathi/word-wrap) - MIT
[@ampproject/remapping@2.2.1](https://github.com/ampproject/remapping) - Apache-2.0
[@aws-crypto/crc32@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/crc32c@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/ie11-detection@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/sha1-browser@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/sha256-browser@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/sha256-browser@5.2.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/sha256-js@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/sha256-js@5.2.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/supports-web-crypto@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/supports-web-crypto@5.2.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/util@3.0.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-crypto/util@5.2.0](https://github.com/aws/aws-sdk-js-crypto-helpers) - Apache-2.0
[@aws-sdk/client-dynamodb@3.602.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-kms@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-s3@3.550.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-sso-oidc@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-sso-oidc@3.600.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-sso@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-sso@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-sts@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/client-sts@3.600.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/core@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/core@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-env@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-env@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-http@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-http@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-ini@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-ini@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-node@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-node@3.600.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-process@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-process@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-sso@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-sso@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-web-identity@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/credential-provider-web-identity@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/endpoint-cache@3.572.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-bucket-endpoint@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-endpoint-discovery@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-expect-continue@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-flexible-checksums@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-host-header@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-host-header@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-location-constraint@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-logger@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-logger@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-recursion-detection@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-recursion-detection@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-sdk-s3@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-signing@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-ssec@3.537.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-user-agent@3.540.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/middleware-user-agent@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/region-config-resolver@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/region-config-resolver@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/signature-v4-multi-region@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/token-providers@3.549.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/token-providers@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/types@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/types@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-arn-parser@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-endpoints@3.540.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-endpoints@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-locate-window@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-user-agent-browser@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-user-agent-browser@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-user-agent-node@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-user-agent-node@3.598.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/util-utf8-browser@3.259.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@aws-sdk/xml-builder@3.535.0](https://github.com/aws/aws-sdk-js-v3) - Apache-2.0
[@babel/code-frame@7.26.2](https://github.com/babel/babel) - MIT
[@babel/compat-data@7.23.5](https://github.com/babel/babel) - MIT
[@babel/core@7.23.7](https://github.com/babel/babel) - MIT
[@babel/generator@7.23.6](https://github.com/babel/babel) - MIT
[@babel/helper-compilation-targets@7.23.6](https://github.com/babel/babel) - MIT
[@babel/helper-environment-visitor@7.22.20](https://github.com/babel/babel) - MIT
[@babel/helper-function-name@7.23.0](https://github.com/babel/babel) - MIT
[@babel/helper-hoist-variables@7.22.5](https://github.com/babel/babel) - MIT
[@babel/helper-module-imports@7.22.15](https://github.com/babel/babel) - MIT
[@babel/helper-module-transforms@7.23.3](https://github.com/babel/babel) - MIT
[@babel/helper-plugin-utils@7.22.5](https://github.com/babel/babel) - MIT
[@babel/helper-simple-access@7.22.5](https://github.com/babel/babel) - MIT
[@babel/helper-split-export-declaration@7.22.6](https://github.com/babel/babel) - MIT
[@babel/helper-string-parser@7.25.9](https://github.com/babel/babel) - MIT
[@babel/helper-validator-identifier@7.25.9](https://github.com/babel/babel) - MIT
[@babel/helper-validator-option@7.23.5](https://github.com/babel/babel) - MIT
[@babel/helpers@7.27.0](https://github.com/babel/babel) - MIT
[@babel/parser@7.27.0](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-async-generators@7.8.4](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-async-generators) - MIT
[@babel/plugin-syntax-bigint@7.8.3](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-bigint) - MIT
[@babel/plugin-syntax-class-properties@7.12.13](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-import-meta@7.10.4](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-json-strings@7.8.3](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-json-strings) - MIT
[@babel/plugin-syntax-jsx@7.23.3](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-logical-assignment-operators@7.10.4](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-nullish-coalescing-operator@7.8.3](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-nullish-coalescing-operator) - MIT
[@babel/plugin-syntax-numeric-separator@7.10.4](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-object-rest-spread@7.8.3](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-object-rest-spread) - MIT
[@babel/plugin-syntax-optional-catch-binding@7.8.3](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-optional-catch-binding) - MIT
[@babel/plugin-syntax-optional-chaining@7.8.3](https://github.com/babel/babel/tree/master/packages/babel-plugin-syntax-optional-chaining) - MIT
[@babel/plugin-syntax-top-level-await@7.14.5](https://github.com/babel/babel) - MIT
[@babel/plugin-syntax-typescript@7.23.3](https://github.com/babel/babel) - MIT
[@babel/template@7.27.0](https://github.com/babel/babel) - MIT
[@babel/traverse@7.23.7](https://github.com/babel/babel) - MIT
[@babel/types@7.27.0](https://github.com/babel/babel) - MIT
[@bcoe/v8-coverage@0.2.3](https://github.com/demurgos/v8-coverage) - MIT
[@biomejs/biome@1.8.3](https://github.com/biomejs/biome) - MIT OR Apache-2.0
[@biomejs/cli-darwin-arm64@1.8.3](https://github.com/biomejs/biome) - MIT OR Apache-2.0
[@eslint-community/eslint-utils@4.4.0](https://github.com/eslint-community/eslint-utils) - MIT
[@eslint-community/regexpp@4.10.0](https://github.com/eslint-community/regexpp) - MIT
[@eslint/eslintrc@2.1.4](https://github.com/eslint/eslintrc) - MIT
[@eslint/js@8.57.0](https://github.com/eslint/eslint) - MIT
[@huggingface/jinja@0.3.2](https://github.com/huggingface/huggingface.js) - MIT
[@huggingface/transformers@3.0.2](https://github.com/huggingface/transformers.js) - Apache-2.0
[@humanwhocodes/config-array@0.11.14](https://github.com/humanwhocodes/config-array) - Apache-2.0
[@humanwhocodes/module-importer@1.0.1](https://github.com/humanwhocodes/module-importer) - Apache-2.0
[@humanwhocodes/object-schema@2.0.2](https://github.com/humanwhocodes/object-schema) - BSD-3-Clause
[@img/sharp-darwin-arm64@0.33.5](https://github.com/lovell/sharp) - Apache-2.0
[@img/sharp-libvips-darwin-arm64@1.0.4](https://github.com/lovell/sharp-libvips) - LGPL-3.0-or-later
[@isaacs/cliui@8.0.2](https://github.com/yargs/cliui) - ISC
[@isaacs/fs-minipass@4.0.1](https://github.com/npm/fs-minipass) - ISC
[@istanbuljs/load-nyc-config@1.1.0](https://github.com/istanbuljs/load-nyc-config) - ISC
[@istanbuljs/schema@0.1.3](https://github.com/istanbuljs/schema) - MIT
[@jest/console@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/core@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/environment@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/expect-utils@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/expect@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/fake-timers@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/globals@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/reporters@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/schemas@29.6.3](https://github.com/jestjs/jest) - MIT
[@jest/source-map@29.6.3](https://github.com/jestjs/jest) - MIT
[@jest/test-result@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/test-sequencer@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/transform@29.7.0](https://github.com/jestjs/jest) - MIT
[@jest/types@29.6.3](https://github.com/jestjs/jest) - MIT
[@jridgewell/gen-mapping@0.3.3](https://github.com/jridgewell/gen-mapping) - MIT
[@jridgewell/resolve-uri@3.1.1](https://github.com/jridgewell/resolve-uri) - MIT
[@jridgewell/set-array@1.1.2](https://github.com/jridgewell/set-array) - MIT
[@jridgewell/sourcemap-codec@1.4.15](https://github.com/jridgewell/sourcemap-codec) - MIT
[@jridgewell/trace-mapping@0.3.22](https://github.com/jridgewell/trace-mapping) - MIT
[@lancedb/lancedb@0.26.2](https://github.com/lancedb/lancedb) - Apache-2.0
[@napi-rs/cli@2.18.3](https://github.com/napi-rs/napi-rs) - MIT
[@nodelib/fs.scandir@2.1.5](https://github.com/nodelib/nodelib/tree/master/packages/fs/fs.scandir) - MIT
[@nodelib/fs.stat@2.0.5](https://github.com/nodelib/nodelib/tree/master/packages/fs/fs.stat) - MIT
[@nodelib/fs.walk@1.2.8](https://github.com/nodelib/nodelib/tree/master/packages/fs/fs.walk) - MIT
[@pkgjs/parseargs@0.11.0](https://github.com/pkgjs/parseargs) - MIT
[@protobufjs/aspromise@1.1.2](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/base64@1.1.2](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/codegen@2.0.4](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/eventemitter@1.1.0](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/fetch@1.1.0](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/float@1.0.2](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/inquire@1.1.0](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/path@1.1.2](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/pool@1.1.0](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@protobufjs/utf8@1.1.0](https://github.com/dcodeIO/protobuf.js) - BSD-3-Clause
[@shikijs/core@1.10.3](https://github.com/shikijs/shiki) - MIT
[@sinclair/typebox@0.27.8](https://github.com/sinclairzx81/typebox) - MIT
[@sinonjs/commons@3.0.1](https://github.com/sinonjs/commons) - BSD-3-Clause
[@sinonjs/fake-timers@10.3.0](https://github.com/sinonjs/fake-timers) - BSD-3-Clause
[@smithy/abort-controller@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/abort-controller@3.1.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/chunked-blob-reader-native@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/chunked-blob-reader@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/config-resolver@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/config-resolver@3.0.3](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/core@1.4.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/core@2.2.3](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/credential-provider-imds@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/credential-provider-imds@3.1.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/eventstream-codec@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/eventstream-serde-browser@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/eventstream-serde-config-resolver@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/eventstream-serde-node@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/eventstream-serde-universal@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/fetch-http-handler@2.5.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/fetch-http-handler@3.1.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/hash-blob-browser@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/hash-node@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/hash-node@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/hash-stream-node@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/invalid-dependency@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/invalid-dependency@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/is-array-buffer@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/is-array-buffer@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/md5-js@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-content-length@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-content-length@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-endpoint@2.5.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-endpoint@3.0.3](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-retry@2.3.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-retry@3.0.6](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-serde@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-serde@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-stack@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/middleware-stack@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/node-config-provider@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/node-config-provider@3.1.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/node-http-handler@2.5.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/node-http-handler@3.1.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/property-provider@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/property-provider@3.1.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/protocol-http@3.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/protocol-http@4.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/querystring-builder@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/querystring-builder@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/querystring-parser@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/querystring-parser@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/service-error-classification@2.1.5](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/service-error-classification@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/shared-ini-file-loader@2.4.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/shared-ini-file-loader@3.1.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/signature-v4@2.2.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/signature-v4@3.1.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/smithy-client@2.5.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/smithy-client@3.1.4](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/types@2.12.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/types@3.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/url-parser@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/url-parser@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-base64@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-base64@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-body-length-browser@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-body-length-browser@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-body-length-node@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-body-length-node@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-buffer-from@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-buffer-from@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-config-provider@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-config-provider@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-defaults-mode-browser@2.2.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-defaults-mode-browser@3.0.6](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-defaults-mode-node@2.3.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-defaults-mode-node@3.0.6](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-endpoints@1.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-endpoints@2.0.3](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-hex-encoding@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-hex-encoding@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-middleware@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-middleware@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-retry@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-retry@3.0.2](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-stream@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-stream@3.0.4](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-uri-escape@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-uri-escape@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-utf8@2.3.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-utf8@3.0.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-waiter@2.2.0](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@smithy/util-waiter@3.1.1](https://github.com/awslabs/smithy-typescript) - Apache-2.0
[@swc/helpers@0.5.12](https://github.com/swc-project/swc) - Apache-2.0
[@types/axios@0.14.0](https://github.com/mzabriskie/axios) - MIT
[@types/babel__core@7.20.5](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/babel__generator@7.6.8](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/babel__template@7.4.4](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/babel__traverse@7.20.5](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/command-line-args@5.2.3](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/command-line-usage@5.0.2](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/command-line-usage@5.0.4](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/graceful-fs@4.1.9](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/hast@3.0.4](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/istanbul-lib-coverage@2.0.6](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/istanbul-lib-report@3.0.3](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/istanbul-reports@3.0.4](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/jest@29.5.12](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/json-schema@7.0.15](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/node-fetch@2.6.11](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/node@18.19.26](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/node@20.16.10](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/node@20.17.9](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/node@22.7.4](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/semver@7.5.6](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/stack-utils@2.0.3](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/tmp@0.2.6](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/unist@3.0.2](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/yargs-parser@21.0.3](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@types/yargs@17.0.32](https://github.com/DefinitelyTyped/DefinitelyTyped) - MIT
[@typescript-eslint/eslint-plugin@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[@typescript-eslint/parser@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - BSD-2-Clause
[@typescript-eslint/scope-manager@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[@typescript-eslint/type-utils@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[@typescript-eslint/types@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[@typescript-eslint/typescript-estree@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - BSD-2-Clause
[@typescript-eslint/utils@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[@typescript-eslint/visitor-keys@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[@ungap/structured-clone@1.2.0](https://github.com/ungap/structured-clone) - ISC
[abort-controller@3.0.0](https://github.com/mysticatea/abort-controller) - MIT
[acorn-jsx@5.3.2](https://github.com/acornjs/acorn-jsx) - MIT
[acorn@8.11.3](https://github.com/acornjs/acorn) - MIT
[agentkeepalive@4.5.0](https://github.com/node-modules/agentkeepalive) - MIT
[ajv@6.12.6](https://github.com/ajv-validator/ajv) - MIT
[ansi-escapes@4.3.2](https://github.com/sindresorhus/ansi-escapes) - MIT
[ansi-regex@5.0.1](https://github.com/chalk/ansi-regex) - MIT
[ansi-regex@6.1.0](https://github.com/chalk/ansi-regex) - MIT
[ansi-styles@4.3.0](https://github.com/chalk/ansi-styles) - MIT
[ansi-styles@5.2.0](https://github.com/chalk/ansi-styles) - MIT
[ansi-styles@6.2.1](https://github.com/chalk/ansi-styles) - MIT
[anymatch@3.1.3](https://github.com/micromatch/anymatch) - ISC
[apache-arrow@15.0.0](https://github.com/apache/arrow) - Apache-2.0
[apache-arrow@16.0.0](https://github.com/apache/arrow) - Apache-2.0
[apache-arrow@17.0.0](https://github.com/apache/arrow) - Apache-2.0
[apache-arrow@18.0.0](https://github.com/apache/arrow) - Apache-2.0
[argparse@1.0.10](https://github.com/nodeca/argparse) - MIT
[argparse@2.0.1](https://github.com/nodeca/argparse) - Python-2.0
[array-back@3.1.0](https://github.com/75lb/array-back) - MIT
[array-back@6.2.2](https://github.com/75lb/array-back) - MIT
[array-union@2.1.0](https://github.com/sindresorhus/array-union) - MIT
[asynckit@0.4.0](https://github.com/alexindigo/asynckit) - MIT
[axios@1.8.4](https://github.com/axios/axios) - MIT
[babel-jest@29.7.0](https://github.com/jestjs/jest) - MIT
[babel-plugin-istanbul@6.1.1](https://github.com/istanbuljs/babel-plugin-istanbul) - BSD-3-Clause
[babel-plugin-jest-hoist@29.6.3](https://github.com/jestjs/jest) - MIT
[babel-preset-current-node-syntax@1.0.1](https://github.com/nicolo-ribaudo/babel-preset-current-node-syntax) - MIT
[babel-preset-jest@29.6.3](https://github.com/jestjs/jest) - MIT
[balanced-match@1.0.2](https://github.com/juliangruber/balanced-match) - MIT
[base-64@0.1.0](https://github.com/mathiasbynens/base64) - MIT
[bowser@2.11.0](https://github.com/lancedikson/bowser) - MIT
[brace-expansion@1.1.11](https://github.com/juliangruber/brace-expansion) - MIT
[brace-expansion@2.0.1](https://github.com/juliangruber/brace-expansion) - MIT
[braces@3.0.3](https://github.com/micromatch/braces) - MIT
[browserslist@4.22.2](https://github.com/browserslist/browserslist) - MIT
[bs-logger@0.2.6](https://github.com/huafu/bs-logger) - MIT
[bser@2.1.1](https://github.com/facebook/watchman) - Apache-2.0
[buffer-from@1.1.2](https://github.com/LinusU/buffer-from) - MIT
[callsites@3.1.0](https://github.com/sindresorhus/callsites) - MIT
[camelcase@5.3.1](https://github.com/sindresorhus/camelcase) - MIT
[camelcase@6.3.0](https://github.com/sindresorhus/camelcase) - MIT
[caniuse-lite@1.0.30001579](https://github.com/browserslist/caniuse-lite) - CC-BY-4.0
[chalk-template@0.4.0](https://github.com/chalk/chalk-template) - MIT
[chalk@4.1.2](https://github.com/chalk/chalk) - MIT
[char-regex@1.0.2](https://github.com/Richienb/char-regex) - MIT
[charenc@0.0.2](https://github.com/pvorb/node-charenc) - BSD-3-Clause
[chownr@3.0.0](https://github.com/isaacs/chownr) - BlueOak-1.0.0
[ci-info@3.9.0](https://github.com/watson/ci-info) - MIT
[cjs-module-lexer@1.2.3](https://github.com/nodejs/cjs-module-lexer) - MIT
[cliui@8.0.1](https://github.com/yargs/cliui) - ISC
[co@4.6.0](https://github.com/tj/co) - MIT
[collect-v8-coverage@1.0.2](https://github.com/SimenB/collect-v8-coverage) - MIT
[color-convert@2.0.1](https://github.com/Qix-/color-convert) - MIT
[color-name@1.1.4](https://github.com/colorjs/color-name) - MIT
[color-string@1.9.1](https://github.com/Qix-/color-string) - MIT
[color@4.2.3](https://github.com/Qix-/color) - MIT
[combined-stream@1.0.8](https://github.com/felixge/node-combined-stream) - MIT
[command-line-args@5.2.1](https://github.com/75lb/command-line-args) - MIT
[command-line-usage@7.0.1](https://github.com/75lb/command-line-usage) - MIT
[concat-map@0.0.1](https://github.com/substack/node-concat-map) - MIT
[convert-source-map@2.0.0](https://github.com/thlorenz/convert-source-map) - MIT
[create-jest@29.7.0](https://github.com/jestjs/jest) - MIT
[cross-spawn@7.0.6](https://github.com/moxystudio/node-cross-spawn) - MIT
[crypt@0.0.2](https://github.com/pvorb/node-crypt) - BSD-3-Clause
[debug@4.3.4](https://github.com/debug-js/debug) - MIT
[dedent@1.5.1](https://github.com/dmnd/dedent) - MIT
[deep-is@0.1.4](https://github.com/thlorenz/deep-is) - MIT
[deepmerge@4.3.1](https://github.com/TehShrike/deepmerge) - MIT
[delayed-stream@1.0.0](https://github.com/felixge/node-delayed-stream) - MIT
[detect-libc@2.0.3](https://github.com/lovell/detect-libc) - Apache-2.0
[detect-newline@3.1.0](https://github.com/sindresorhus/detect-newline) - MIT
[diff-sequences@29.6.3](https://github.com/jestjs/jest) - MIT
[digest-fetch@1.3.0](https://github.com/devfans/digest-fetch) - ISC
[dir-glob@3.0.1](https://github.com/kevva/dir-glob) - MIT
[doctrine@3.0.0](https://github.com/eslint/doctrine) - Apache-2.0
[eastasianwidth@0.2.0](https://github.com/komagata/eastasianwidth) - MIT
[electron-to-chromium@1.4.642](https://github.com/kilian/electron-to-chromium) - ISC
[emittery@0.13.1](https://github.com/sindresorhus/emittery) - MIT
[emoji-regex@8.0.0](https://github.com/mathiasbynens/emoji-regex) - MIT
[emoji-regex@9.2.2](https://github.com/mathiasbynens/emoji-regex) - MIT
[entities@4.5.0](https://github.com/fb55/entities) - BSD-2-Clause
[error-ex@1.3.2](https://github.com/qix-/node-error-ex) - MIT
[escalade@3.1.1](https://github.com/lukeed/escalade) - MIT
[escape-string-regexp@2.0.0](https://github.com/sindresorhus/escape-string-regexp) - MIT
[escape-string-regexp@4.0.0](https://github.com/sindresorhus/escape-string-regexp) - MIT
[eslint-scope@7.2.2](https://github.com/eslint/eslint-scope) - BSD-2-Clause
[eslint-visitor-keys@3.4.3](https://github.com/eslint/eslint-visitor-keys) - Apache-2.0
[eslint@8.57.0](https://github.com/eslint/eslint) - MIT
[espree@9.6.1](https://github.com/eslint/espree) - BSD-2-Clause
[esprima@4.0.1](https://github.com/jquery/esprima) - BSD-2-Clause
[esquery@1.5.0](https://github.com/estools/esquery) - BSD-3-Clause
[esrecurse@4.3.0](https://github.com/estools/esrecurse) - BSD-2-Clause
[estraverse@5.3.0](https://github.com/estools/estraverse) - BSD-2-Clause
[esutils@2.0.3](https://github.com/estools/esutils) - BSD-2-Clause
[event-target-shim@5.0.1](https://github.com/mysticatea/event-target-shim) - MIT
[execa@5.1.1](https://github.com/sindresorhus/execa) - MIT
[exit@0.1.2](https://github.com/cowboy/node-exit) - MIT
[expect@29.7.0](https://github.com/jestjs/jest) - MIT
[fast-deep-equal@3.1.3](https://github.com/epoberezkin/fast-deep-equal) - MIT
[fast-glob@3.3.2](https://github.com/mrmlnc/fast-glob) - MIT
[fast-json-stable-stringify@2.1.0](https://github.com/epoberezkin/fast-json-stable-stringify) - MIT
[fast-levenshtein@2.0.6](https://github.com/hiddentao/fast-levenshtein) - MIT
[fast-xml-parser@4.2.5](https://github.com/NaturalIntelligence/fast-xml-parser) - MIT
[fastq@1.16.0](https://github.com/mcollina/fastq) - ISC
[fb-watchman@2.0.2](https://github.com/facebook/watchman) - Apache-2.0
[file-entry-cache@6.0.1](https://github.com/royriojas/file-entry-cache) - MIT
[fill-range@7.1.1](https://github.com/jonschlinkert/fill-range) - MIT
[find-replace@3.0.0](https://github.com/75lb/find-replace) - MIT
[find-up@4.1.0](https://github.com/sindresorhus/find-up) - MIT
[find-up@5.0.0](https://github.com/sindresorhus/find-up) - MIT
[flat-cache@3.2.0](https://github.com/jaredwray/flat-cache) - MIT
[flatbuffers@1.12.0](https://github.com/google/flatbuffers) - Apache*
[flatbuffers@23.5.26](https://github.com/google/flatbuffers) - Apache*
[flatbuffers@24.3.25](https://github.com/google/flatbuffers) - Apache-2.0
[flatted@3.2.9](https://github.com/WebReflection/flatted) - ISC
[follow-redirects@1.15.6](https://github.com/follow-redirects/follow-redirects) - MIT
[foreground-child@3.3.0](https://github.com/tapjs/foreground-child) - ISC
[form-data-encoder@1.7.2](https://github.com/octet-stream/form-data-encoder) - MIT
[form-data@4.0.0](https://github.com/form-data/form-data) - MIT
[formdata-node@4.4.1](https://github.com/octet-stream/form-data) - MIT
[fs.realpath@1.0.0](https://github.com/isaacs/fs.realpath) - ISC
[fsevents@2.3.3](https://github.com/fsevents/fsevents) - MIT
[function-bind@1.1.2](https://github.com/Raynos/function-bind) - MIT
[gensync@1.0.0-beta.2](https://github.com/loganfsmyth/gensync) - MIT
[get-caller-file@2.0.5](https://github.com/stefanpenner/get-caller-file) - ISC
[get-package-type@0.1.0](https://github.com/cfware/get-package-type) - MIT
[get-stream@6.0.1](https://github.com/sindresorhus/get-stream) - MIT
[glob-parent@5.1.2](https://github.com/gulpjs/glob-parent) - ISC
[glob-parent@6.0.2](https://github.com/gulpjs/glob-parent) - ISC
[glob@10.4.5](https://github.com/isaacs/node-glob) - ISC
[glob@7.2.3](https://github.com/isaacs/node-glob) - ISC
[globals@11.12.0](https://github.com/sindresorhus/globals) - MIT
[globals@13.24.0](https://github.com/sindresorhus/globals) - MIT
[globby@11.1.0](https://github.com/sindresorhus/globby) - MIT
[graceful-fs@4.2.11](https://github.com/isaacs/node-graceful-fs) - ISC
[graphemer@1.4.0](https://github.com/flmnt/graphemer) - MIT
[guid-typescript@1.0.9](https://github.com/NicolasDeveloper/guid-typescript) - ISC
[has-flag@4.0.0](https://github.com/sindresorhus/has-flag) - MIT
[hasown@2.0.0](https://github.com/inspect-js/hasOwn) - MIT
[html-escaper@2.0.2](https://github.com/WebReflection/html-escaper) - MIT
[human-signals@2.1.0](https://github.com/ehmicky/human-signals) - Apache-2.0
[humanize-ms@1.2.1](https://github.com/node-modules/humanize-ms) - MIT
[ignore@5.3.0](https://github.com/kaelzhang/node-ignore) - MIT
[import-fresh@3.3.0](https://github.com/sindresorhus/import-fresh) - MIT
[import-local@3.1.0](https://github.com/sindresorhus/import-local) - MIT
[imurmurhash@0.1.4](https://github.com/jensyt/imurmurhash-js) - MIT
[inflight@1.0.6](https://github.com/npm/inflight) - ISC
[inherits@2.0.4](https://github.com/isaacs/inherits) - ISC
[interpret@1.4.0](https://github.com/gulpjs/interpret) - MIT
[is-arrayish@0.2.1](https://github.com/qix-/node-is-arrayish) - MIT
[is-arrayish@0.3.2](https://github.com/qix-/node-is-arrayish) - MIT
[is-buffer@1.1.6](https://github.com/feross/is-buffer) - MIT
[is-core-module@2.13.1](https://github.com/inspect-js/is-core-module) - MIT
[is-extglob@2.1.1](https://github.com/jonschlinkert/is-extglob) - MIT
[is-fullwidth-code-point@3.0.0](https://github.com/sindresorhus/is-fullwidth-code-point) - MIT
[is-generator-fn@2.1.0](https://github.com/sindresorhus/is-generator-fn) - MIT
[is-glob@4.0.3](https://github.com/micromatch/is-glob) - MIT
[is-number@7.0.0](https://github.com/jonschlinkert/is-number) - MIT
[is-path-inside@3.0.3](https://github.com/sindresorhus/is-path-inside) - MIT
[is-stream@2.0.1](https://github.com/sindresorhus/is-stream) - MIT
[isexe@2.0.0](https://github.com/isaacs/isexe) - ISC
[istanbul-lib-coverage@3.2.2](https://github.com/istanbuljs/istanbuljs) - BSD-3-Clause
[istanbul-lib-instrument@5.2.1](https://github.com/istanbuljs/istanbuljs) - BSD-3-Clause
[istanbul-lib-instrument@6.0.1](https://github.com/istanbuljs/istanbuljs) - BSD-3-Clause
[istanbul-lib-report@3.0.1](https://github.com/istanbuljs/istanbuljs) - BSD-3-Clause
[istanbul-lib-source-maps@4.0.1](https://github.com/istanbuljs/istanbuljs) - BSD-3-Clause
[istanbul-reports@3.1.6](https://github.com/istanbuljs/istanbuljs) - BSD-3-Clause
[jackspeak@3.4.3](https://github.com/isaacs/jackspeak) - BlueOak-1.0.0
[jest-changed-files@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-circus@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-cli@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-config@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-diff@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-docblock@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-each@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-environment-node@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-get-type@29.6.3](https://github.com/jestjs/jest) - MIT
[jest-haste-map@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-leak-detector@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-matcher-utils@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-message-util@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-mock@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-pnp-resolver@1.2.3](https://github.com/arcanis/jest-pnp-resolver) - MIT
[jest-regex-util@29.6.3](https://github.com/jestjs/jest) - MIT
[jest-resolve-dependencies@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-resolve@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-runner@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-runtime@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-snapshot@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-util@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-validate@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-watcher@29.7.0](https://github.com/jestjs/jest) - MIT
[jest-worker@29.7.0](https://github.com/jestjs/jest) - MIT
[jest@29.7.0](https://github.com/jestjs/jest) - MIT
[js-tokens@4.0.0](https://github.com/lydell/js-tokens) - MIT
[js-yaml@3.14.1](https://github.com/nodeca/js-yaml) - MIT
[js-yaml@4.1.0](https://github.com/nodeca/js-yaml) - MIT
[jsesc@2.5.2](https://github.com/mathiasbynens/jsesc) - MIT
[json-bignum@0.0.3](https://github.com/datalanche/json-bignum) - MIT
[json-buffer@3.0.1](https://github.com/dominictarr/json-buffer) - MIT
[json-parse-even-better-errors@2.3.1](https://github.com/npm/json-parse-even-better-errors) - MIT
[json-schema-traverse@0.4.1](https://github.com/epoberezkin/json-schema-traverse) - MIT
[json-stable-stringify-without-jsonify@1.0.1](https://github.com/samn/json-stable-stringify) - MIT
[json5@2.2.3](https://github.com/json5/json5) - MIT
[keyv@4.5.4](https://github.com/jaredwray/keyv) - MIT
[kleur@3.0.3](https://github.com/lukeed/kleur) - MIT
[leven@3.1.0](https://github.com/sindresorhus/leven) - MIT
[levn@0.4.1](https://github.com/gkz/levn) - MIT
[lines-and-columns@1.2.4](https://github.com/eventualbuddha/lines-and-columns) - MIT
[linkify-it@5.0.0](https://github.com/markdown-it/linkify-it) - MIT
[locate-path@5.0.0](https://github.com/sindresorhus/locate-path) - MIT
[locate-path@6.0.0](https://github.com/sindresorhus/locate-path) - MIT
[lodash.camelcase@4.3.0](https://github.com/lodash/lodash) - MIT
[lodash.memoize@4.1.2](https://github.com/lodash/lodash) - MIT
[lodash.merge@4.6.2](https://github.com/lodash/lodash) - MIT
[lodash@4.17.21](https://github.com/lodash/lodash) - MIT
[long@5.2.3](https://github.com/dcodeIO/long.js) - Apache-2.0
[lru-cache@10.4.3](https://github.com/isaacs/node-lru-cache) - ISC
[lru-cache@5.1.1](https://github.com/isaacs/node-lru-cache) - ISC
[lunr@2.3.9](https://github.com/olivernn/lunr.js) - MIT
[make-dir@4.0.0](https://github.com/sindresorhus/make-dir) - MIT
[make-error@1.3.6](https://github.com/JsCommunity/make-error) - ISC
[makeerror@1.0.12](https://github.com/daaku/nodejs-makeerror) - BSD-3-Clause
[markdown-it@14.1.0](https://github.com/markdown-it/markdown-it) - MIT
[md5@2.3.0](https://github.com/pvorb/node-md5) - BSD-3-Clause
[mdurl@2.0.0](https://github.com/markdown-it/mdurl) - MIT
[merge-stream@2.0.0](https://github.com/grncdr/merge-stream) - MIT
[merge2@1.4.1](https://github.com/teambition/merge2) - MIT
[micromatch@4.0.8](https://github.com/micromatch/micromatch) - MIT
[mime-db@1.52.0](https://github.com/jshttp/mime-db) - MIT
[mime-types@2.1.35](https://github.com/jshttp/mime-types) - MIT
[mimic-fn@2.1.0](https://github.com/sindresorhus/mimic-fn) - MIT
[minimatch@3.1.2](https://github.com/isaacs/minimatch) - ISC
[minimatch@9.0.3](https://github.com/isaacs/minimatch) - ISC
[minimatch@9.0.5](https://github.com/isaacs/minimatch) - ISC
[minimist@1.2.8](https://github.com/minimistjs/minimist) - MIT
[minipass@7.1.2](https://github.com/isaacs/minipass) - ISC
[minizlib@3.0.1](https://github.com/isaacs/minizlib) - MIT
[mkdirp@3.0.1](https://github.com/isaacs/node-mkdirp) - MIT
[mnemonist@0.38.3](https://github.com/yomguithereal/mnemonist) - MIT
[ms@2.1.2](https://github.com/zeit/ms) - MIT
[ms@2.1.3](https://github.com/vercel/ms) - MIT
[natural-compare@1.4.0](https://github.com/litejs/natural-compare-lite) - MIT
[node-domexception@1.0.0](https://github.com/jimmywarting/node-domexception) - MIT
[node-fetch@2.7.0](https://github.com/bitinn/node-fetch) - MIT
[node-int64@0.4.0](https://github.com/broofa/node-int64) - MIT
[node-releases@2.0.14](https://github.com/chicoxyzzy/node-releases) - MIT
[normalize-path@3.0.0](https://github.com/jonschlinkert/normalize-path) - MIT
[npm-run-path@4.0.1](https://github.com/sindresorhus/npm-run-path) - MIT
[obliterator@1.6.1](https://github.com/yomguithereal/obliterator) - MIT
[once@1.4.0](https://github.com/isaacs/once) - ISC
[onetime@5.1.2](https://github.com/sindresorhus/onetime) - MIT
[onnxruntime-common@1.19.2](https://github.com/Microsoft/onnxruntime) - MIT
[onnxruntime-common@1.20.0-dev.20241016-2b8fc5529b](https://github.com/Microsoft/onnxruntime) - MIT
[onnxruntime-node@1.19.2](https://github.com/Microsoft/onnxruntime) - MIT
[onnxruntime-web@1.21.0-dev.20241024-d9ca84ef96](https://github.com/Microsoft/onnxruntime) - MIT
[openai@4.29.2](https://github.com/openai/openai-node) - Apache-2.0
[optionator@0.9.3](https://github.com/gkz/optionator) - MIT
[p-limit@2.3.0](https://github.com/sindresorhus/p-limit) - MIT
[p-limit@3.1.0](https://github.com/sindresorhus/p-limit) - MIT
[p-locate@4.1.0](https://github.com/sindresorhus/p-locate) - MIT
[p-locate@5.0.0](https://github.com/sindresorhus/p-locate) - MIT
[p-try@2.2.0](https://github.com/sindresorhus/p-try) - MIT
[package-json-from-dist@1.0.1](https://github.com/isaacs/package-json-from-dist) - BlueOak-1.0.0
[parent-module@1.0.1](https://github.com/sindresorhus/parent-module) - MIT
[parse-json@5.2.0](https://github.com/sindresorhus/parse-json) - MIT
[path-exists@4.0.0](https://github.com/sindresorhus/path-exists) - MIT
[path-is-absolute@1.0.1](https://github.com/sindresorhus/path-is-absolute) - MIT
[path-key@3.1.1](https://github.com/sindresorhus/path-key) - MIT
[path-parse@1.0.7](https://github.com/jbgutierrez/path-parse) - MIT
[path-scurry@1.11.1](https://github.com/isaacs/path-scurry) - BlueOak-1.0.0
[path-type@4.0.0](https://github.com/sindresorhus/path-type) - MIT
[picocolors@1.0.0](https://github.com/alexeyraspopov/picocolors) - ISC
[picomatch@2.3.1](https://github.com/micromatch/picomatch) - MIT
[pirates@4.0.6](https://github.com/danez/pirates) - MIT
[pkg-dir@4.2.0](https://github.com/sindresorhus/pkg-dir) - MIT
[platform@1.3.6](https://github.com/bestiejs/platform.js) - MIT
[prelude-ls@1.2.1](https://github.com/gkz/prelude-ls) - MIT
[pretty-format@29.7.0](https://github.com/jestjs/jest) - MIT
[prompts@2.4.2](https://github.com/terkelg/prompts) - MIT
[protobufjs@7.4.0](https://github.com/protobufjs/protobuf.js) - BSD-3-Clause
[proxy-from-env@1.1.0](https://github.com/Rob--W/proxy-from-env) - MIT
[punycode.js@2.3.1](https://github.com/mathiasbynens/punycode.js) - MIT
[punycode@2.3.1](https://github.com/mathiasbynens/punycode.js) - MIT
[pure-rand@6.0.4](https://github.com/dubzzz/pure-rand) - MIT
[queue-microtask@1.2.3](https://github.com/feross/queue-microtask) - MIT
[react-is@18.2.0](https://github.com/facebook/react) - MIT
[rechoir@0.6.2](https://github.com/tkellen/node-rechoir) - MIT
[reflect-metadata@0.2.2](https://github.com/rbuckton/reflect-metadata) - Apache-2.0
[require-directory@2.1.1](https://github.com/troygoode/node-require-directory) - MIT
[resolve-cwd@3.0.0](https://github.com/sindresorhus/resolve-cwd) - MIT
[resolve-from@4.0.0](https://github.com/sindresorhus/resolve-from) - MIT
[resolve-from@5.0.0](https://github.com/sindresorhus/resolve-from) - MIT
[resolve.exports@2.0.2](https://github.com/lukeed/resolve.exports) - MIT
[resolve@1.22.8](https://github.com/browserify/resolve) - MIT
[reusify@1.0.4](https://github.com/mcollina/reusify) - MIT
[rimraf@3.0.2](https://github.com/isaacs/rimraf) - ISC
[rimraf@5.0.10](https://github.com/isaacs/rimraf) - ISC
[run-parallel@1.2.0](https://github.com/feross/run-parallel) - MIT
[semver@6.3.1](https://github.com/npm/node-semver) - ISC
[semver@7.6.3](https://github.com/npm/node-semver) - ISC
[sharp@0.33.5](https://github.com/lovell/sharp) - Apache-2.0
[shebang-command@2.0.0](https://github.com/kevva/shebang-command) - MIT
[shebang-regex@3.0.0](https://github.com/sindresorhus/shebang-regex) - MIT
[shelljs@0.8.5](https://github.com/shelljs/shelljs) - BSD-3-Clause
[shiki@1.10.3](https://github.com/shikijs/shiki) - MIT
[shx@0.3.4](https://github.com/shelljs/shx) - MIT
[signal-exit@3.0.7](https://github.com/tapjs/signal-exit) - ISC
[signal-exit@4.1.0](https://github.com/tapjs/signal-exit) - ISC
[simple-swizzle@0.2.2](https://github.com/qix-/node-simple-swizzle) - MIT
[sisteransi@1.0.5](https://github.com/terkelg/sisteransi) - MIT
[slash@3.0.0](https://github.com/sindresorhus/slash) - MIT
[source-map-support@0.5.13](https://github.com/evanw/node-source-map-support) - MIT
[source-map@0.6.1](https://github.com/mozilla/source-map) - BSD-3-Clause
[sprintf-js@1.0.3](https://github.com/alexei/sprintf.js) - BSD-3-Clause
[stack-utils@2.0.6](https://github.com/tapjs/stack-utils) - MIT
[stream-read-all@3.0.1](https://github.com/75lb/stream-read-all) - MIT
[string-length@4.0.2](https://github.com/sindresorhus/string-length) - MIT
[string-width@4.2.3](https://github.com/sindresorhus/string-width) - MIT
[string-width@5.1.2](https://github.com/sindresorhus/string-width) - MIT
[strip-ansi@6.0.1](https://github.com/chalk/strip-ansi) - MIT
[strip-ansi@7.1.0](https://github.com/chalk/strip-ansi) - MIT
[strip-bom@4.0.0](https://github.com/sindresorhus/strip-bom) - MIT
[strip-final-newline@2.0.0](https://github.com/sindresorhus/strip-final-newline) - MIT
[strip-json-comments@3.1.1](https://github.com/sindresorhus/strip-json-comments) - MIT
[strnum@1.0.5](https://github.com/NaturalIntelligence/strnum) - MIT
[supports-color@7.2.0](https://github.com/chalk/supports-color) - MIT
[supports-color@8.1.1](https://github.com/chalk/supports-color) - MIT
[supports-preserve-symlinks-flag@1.0.0](https://github.com/inspect-js/node-supports-preserve-symlinks-flag) - MIT
[table-layout@3.0.2](https://github.com/75lb/table-layout) - MIT
[tar@7.4.3](https://github.com/isaacs/node-tar) - ISC
[test-exclude@6.0.0](https://github.com/istanbuljs/test-exclude) - ISC
[text-table@0.2.0](https://github.com/substack/text-table) - MIT
[tmp@0.2.3](https://github.com/raszi/node-tmp) - MIT
[tmpl@1.0.5](https://github.com/daaku/nodejs-tmpl) - BSD-3-Clause
[to-regex-range@5.0.1](https://github.com/micromatch/to-regex-range) - MIT
[tr46@0.0.3](https://github.com/Sebmaster/tr46.js) - MIT
[ts-api-utils@1.0.3](https://github.com/JoshuaKGoldberg/ts-api-utils) - MIT
[ts-jest@29.1.2](https://github.com/kulshekhar/ts-jest) - MIT
[tslib@1.14.1](https://github.com/Microsoft/tslib) - 0BSD
[tslib@2.6.2](https://github.com/Microsoft/tslib) - 0BSD
[type-check@0.4.0](https://github.com/gkz/type-check) - MIT
[type-detect@4.0.8](https://github.com/chaijs/type-detect) - MIT
[type-fest@0.20.2](https://github.com/sindresorhus/type-fest) - (MIT OR CC0-1.0)
[type-fest@0.21.3](https://github.com/sindresorhus/type-fest) - (MIT OR CC0-1.0)
[typedoc-plugin-markdown@4.2.1](https://github.com/typedoc2md/typedoc-plugin-markdown) - MIT
[typedoc@0.26.4](https://github.com/TypeStrong/TypeDoc) - Apache-2.0
[typescript-eslint@7.1.0](https://github.com/typescript-eslint/typescript-eslint) - MIT
[typescript@5.5.4](https://github.com/Microsoft/TypeScript) - Apache-2.0
[typical@4.0.0](https://github.com/75lb/typical) - MIT
[typical@7.1.1](https://github.com/75lb/typical) - MIT
[uc.micro@2.1.0](https://github.com/markdown-it/uc.micro) - MIT
[undici-types@5.26.5](https://github.com/nodejs/undici) - MIT
[undici-types@6.19.8](https://github.com/nodejs/undici) - MIT
[update-browserslist-db@1.0.13](https://github.com/browserslist/update-db) - MIT
[uri-js@4.4.1](https://github.com/garycourt/uri-js) - BSD-2-Clause
[uuid@9.0.1](https://github.com/uuidjs/uuid) - MIT
[v8-to-istanbul@9.2.0](https://github.com/istanbuljs/v8-to-istanbul) - ISC
[walker@1.0.8](https://github.com/daaku/nodejs-walker) - Apache-2.0
[web-streams-polyfill@3.3.3](https://github.com/MattiasBuelens/web-streams-polyfill) - MIT
[web-streams-polyfill@4.0.0-beta.3](https://github.com/MattiasBuelens/web-streams-polyfill) - MIT
[webidl-conversions@3.0.1](https://github.com/jsdom/webidl-conversions) - BSD-2-Clause
[whatwg-url@5.0.0](https://github.com/jsdom/whatwg-url) - MIT
[which@2.0.2](https://github.com/isaacs/node-which) - ISC
[wordwrapjs@5.1.0](https://github.com/75lb/wordwrapjs) - MIT
[wrap-ansi@7.0.0](https://github.com/chalk/wrap-ansi) - MIT
[wrap-ansi@8.1.0](https://github.com/chalk/wrap-ansi) - MIT
[wrappy@1.0.2](https://github.com/npm/wrappy) - ISC
[write-file-atomic@4.0.2](https://github.com/npm/write-file-atomic) - ISC
[y18n@5.0.8](https://github.com/yargs/y18n) - ISC
[yallist@3.1.1](https://github.com/isaacs/yallist) - ISC
[yallist@5.0.0](https://github.com/isaacs/yallist) - BlueOak-1.0.0
[yaml@2.4.5](https://github.com/eemeli/yaml) - ISC
[yargs-parser@21.1.1](https://github.com/yargs/yargs-parser) - ISC
[yargs@17.7.2](https://github.com/yargs/yargs) - MIT
[yocto-queue@0.1.0](https://github.com/sindresorhus/yocto-queue) - MIT

File diff suppressed because it is too large


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": [
"win32"
],


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",


@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.26.1",
"version": "0.27.0-beta.0",
"cpu": [
"x64",
"arm64"


@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.26.1",
"version": "0.27.0-beta.0",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",


@@ -13,6 +13,7 @@ use crate::header::JsHeaderProvider;
use crate::table::Table;
use crate::ConnectionOptions;
use lancedb::connection::{ConnectBuilder, Connection as LanceDBConnection};
use lancedb::ipc::{ipc_file_to_batches, ipc_file_to_schema};
#[napi]


@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.29.2"
current_version = "0.30.0-beta.0"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.


@@ -1,6 +1,6 @@
[package]
name = "lancedb-python"
version = "0.29.2"
version = "0.30.0-beta.0"
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true


@@ -0,0 +1,206 @@
| Name | Version | License | URL |
|--------------------------------|-----------------|--------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| InstructorEmbedding | 1.0.1 | Apache License 2.0 | https://github.com/HKUNLP/instructor-embedding |
| Jinja2 | 3.1.6 | BSD License | https://github.com/pallets/jinja/ |
| Markdown | 3.10.2 | BSD-3-Clause | https://Python-Markdown.github.io/ |
| MarkupSafe | 3.0.3 | BSD-3-Clause | https://github.com/pallets/markupsafe/ |
| PyJWT | 2.11.0 | MIT | https://github.com/jpadilla/pyjwt |
| PyYAML | 6.0.3 | MIT License | https://pyyaml.org/ |
| Pygments | 2.19.2 | BSD License | https://pygments.org |
| accelerate | 1.12.0 | Apache Software License | https://github.com/huggingface/accelerate |
| adlfs | 2026.2.0 | BSD License | UNKNOWN |
| aiohappyeyeballs | 2.6.1 | Python Software Foundation License | https://github.com/aio-libs/aiohappyeyeballs |
| aiohttp | 3.13.3 | Apache-2.0 AND MIT | https://github.com/aio-libs/aiohttp |
| aiosignal | 1.4.0 | Apache Software License | https://github.com/aio-libs/aiosignal |
| annotated-types | 0.7.0 | MIT License | https://github.com/annotated-types/annotated-types |
| anyio | 4.12.1 | MIT | https://anyio.readthedocs.io/en/stable/versionhistory.html |
| appnope | 0.1.4 | BSD License | http://github.com/minrk/appnope |
| asttokens | 3.0.1 | Apache 2.0 | https://github.com/gristlabs/asttokens |
| attrs | 25.4.0 | MIT | https://www.attrs.org/en/stable/changelog.html |
| awscli | 1.44.35 | Apache Software License | http://aws.amazon.com/cli/ |
| azure-core | 1.38.0 | MIT License | https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/core/azure-core |
| azure-datalake-store | 0.0.53 | MIT License | https://github.com/Azure/azure-data-lake-store-python |
| azure-identity | 1.25.1 | MIT | https://github.com/Azure/azure-sdk-for-python |
| azure-storage-blob | 12.28.0 | MIT License | https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/storage/azure-storage-blob |
| babel | 2.18.0 | BSD License | https://babel.pocoo.org/ |
| backrefs | 6.1 | MIT | https://github.com/facelessuser/backrefs |
| beautifulsoup4 | 4.14.3 | MIT License | https://www.crummy.com/software/BeautifulSoup/bs4/ |
| bleach | 6.3.0 | Apache Software License | https://github.com/mozilla/bleach |
| boto3 | 1.42.45 | Apache-2.0 | https://github.com/boto/boto3 |
| botocore | 1.42.45 | Apache-2.0 | https://github.com/boto/botocore |
| cachetools | 7.0.0 | MIT | https://github.com/tkem/cachetools/ |
| certifi | 2026.1.4 | Mozilla Public License 2.0 (MPL 2.0) | https://github.com/certifi/python-certifi |
| cffi | 2.0.0 | MIT | https://cffi.readthedocs.io/en/latest/whatsnew.html |
| cfgv | 3.5.0 | MIT | https://github.com/asottile/cfgv |
| charset-normalizer | 3.4.4 | MIT | https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md |
| click | 8.3.1 | BSD-3-Clause | https://github.com/pallets/click/ |
| cohere | 5.20.4 | MIT License | https://github.com/cohere-ai/cohere-python |
| colorama | 0.4.6 | BSD License | https://github.com/tartley/colorama |
| colpali_engine | 0.3.13 | MIT License | https://github.com/illuin-tech/colpali |
| comm | 0.2.3 | BSD License | https://github.com/ipython/comm |
| cryptography | 46.0.4 | Apache-2.0 OR BSD-3-Clause | https://github.com/pyca/cryptography |
| datafusion | 51.0.0 | Apache Software License | https://datafusion.apache.org/python |
| debugpy | 1.8.20 | MIT License | https://aka.ms/debugpy |
| decorator | 5.2.1 | BSD License | UNKNOWN |
| defusedxml | 0.7.1 | Python Software Foundation License | https://github.com/tiran/defusedxml |
| deprecation | 2.1.0 | Apache Software License | http://deprecation.readthedocs.io/ |
| distlib | 0.4.0 | Python Software Foundation License | https://github.com/pypa/distlib |
| distro | 1.9.0 | Apache Software License | https://github.com/python-distro/distro |
| docutils | 0.19 | BSD License; GNU General Public License (GPL); Public Domain; Python Software Foundation License | https://docutils.sourceforge.io/ |
| duckdb | 1.4.4 | MIT License | https://github.com/duckdb/duckdb-python |
| executing | 2.2.1 | MIT License | https://github.com/alexmojaki/executing |
| fastavro | 1.12.1 | MIT | https://github.com/fastavro/fastavro |
| fastjsonschema | 2.21.2 | BSD License | https://github.com/horejsek/python-fastjsonschema |
| filelock | 3.20.3 | Unlicense | https://github.com/tox-dev/py-filelock |
| frozenlist | 1.8.0 | Apache-2.0 | https://github.com/aio-libs/frozenlist |
| fsspec | 2026.2.0 | BSD-3-Clause | https://github.com/fsspec/filesystem_spec |
| ftfy | 6.3.1 | Apache-2.0 | https://ftfy.readthedocs.io/en/latest/ |
| ghp-import | 2.1.0 | Apache Software License | https://github.com/c-w/ghp-import |
| google-ai-generativelanguage | 0.6.15 | Apache Software License | https://github.com/googleapis/google-cloud-python/tree/main/packages/google-ai-generativelanguage |
| google-api-core | 2.25.2 | Apache Software License | https://github.com/googleapis/python-api-core |
| google-api-python-client | 2.189.0 | Apache Software License | https://github.com/googleapis/google-api-python-client/ |
| google-auth | 2.48.0 | Apache Software License | https://github.com/googleapis/google-auth-library-python |
| google-auth-httplib2 | 0.3.0 | Apache Software License | https://github.com/GoogleCloudPlatform/google-auth-library-python-httplib2 |
| google-generativeai | 0.8.6 | Apache Software License | https://github.com/google/generative-ai-python |
| googleapis-common-protos | 1.72.0 | Apache Software License | https://github.com/googleapis/google-cloud-python/tree/main/packages/googleapis-common-protos |
| griffe | 2.0.0 | ISC | https://mkdocstrings.github.io/griffe |
| griffecli | 2.0.0 | ISC | UNKNOWN |
| griffelib | 2.0.0 | ISC | UNKNOWN |
| grpcio | 1.78.0 | Apache-2.0 | https://grpc.io |
| grpcio-status | 1.71.2 | Apache Software License | https://grpc.io |
| h11 | 0.16.0 | MIT License | https://github.com/python-hyper/h11 |
| hf-xet | 1.2.0 | Apache-2.0 | https://github.com/huggingface/xet-core |
| httpcore | 1.0.9 | BSD-3-Clause | https://www.encode.io/httpcore/ |
| httplib2 | 0.31.2 | MIT License | https://github.com/httplib2/httplib2 |
| httpx | 0.28.1 | BSD License | https://github.com/encode/httpx |
| huggingface_hub | 0.36.2 | Apache Software License | https://github.com/huggingface/huggingface_hub |
| ibm-cos-sdk | 2.14.3 | Apache Software License | https://github.com/ibm/ibm-cos-sdk-python |
| ibm-cos-sdk-core | 2.14.3 | Apache Software License | https://github.com/ibm/ibm-cos-sdk-python-core |
| ibm-cos-sdk-s3transfer | 2.14.3 | Apache Software License | https://github.com/IBM/ibm-cos-sdk-python-s3transfer |
| ibm_watsonx_ai | 1.5.1 | BSD License | https://ibm.github.io/watsonx-ai-python-sdk/changelog.html |
| identify | 2.6.16 | MIT | https://github.com/pre-commit/identify |
| idna | 3.11 | BSD-3-Clause | https://github.com/kjd/idna |
| iniconfig | 2.3.0 | MIT | https://github.com/pytest-dev/iniconfig |
| ipykernel | 6.31.0 | BSD-3-Clause | https://ipython.org |
| ipython | 9.10.0 | BSD-3-Clause | https://ipython.org |
| ipython_pygments_lexers | 1.1.1 | BSD License | https://github.com/ipython/ipython-pygments-lexers |
| isodate | 0.7.2 | BSD License | https://github.com/gweis/isodate/ |
| jedi | 0.19.2 | MIT License | https://github.com/davidhalter/jedi |
| jiter | 0.13.0 | MIT License | https://github.com/pydantic/jiter/ |
| jmespath | 1.0.1 | MIT License | https://github.com/jmespath/jmespath.py |
| joblib | 1.5.3 | BSD-3-Clause | https://joblib.readthedocs.io |
| jsonschema | 4.26.0 | MIT | https://github.com/python-jsonschema/jsonschema |
| jsonschema-specifications | 2025.9.1 | MIT | https://github.com/python-jsonschema/jsonschema-specifications |
| jupyter_client | 8.8.0 | BSD License | https://jupyter.org |
| jupyter_core | 5.9.1 | BSD-3-Clause | https://jupyter.org |
| jupyterlab_pygments | 0.3.0 | BSD License | https://github.com/jupyterlab/jupyterlab_pygments |
| jupytext | 1.19.1 | MIT License | https://github.com/mwouts/jupytext |
| lance-namespace | 0.4.5 | Apache-2.0 | https://github.com/lance-format/lance-namespace |
| lance-namespace-urllib3-client | 0.4.5 | Apache-2.0 | https://github.com/lance-format/lance-namespace |
| lancedb | 0.29.2 | Apache Software License | https://github.com/lancedb/lancedb |
| lomond | 0.3.3 | BSD License | https://github.com/wildfoundry/dataplicity-lomond |
| markdown-it-py | 4.0.0 | MIT License | https://github.com/executablebooks/markdown-it-py |
| matplotlib-inline | 0.2.1 | UNKNOWN | https://github.com/ipython/matplotlib-inline |
| mdit-py-plugins | 0.5.0 | MIT License | https://github.com/executablebooks/mdit-py-plugins |
| mdurl | 0.1.2 | MIT License | https://github.com/executablebooks/mdurl |
| mergedeep | 1.3.4 | MIT License | https://github.com/clarketm/mergedeep |
| mistune | 3.2.0 | BSD License | https://github.com/lepture/mistune |
| mkdocs | 1.6.1 | BSD-2-Clause | https://github.com/mkdocs/mkdocs |
| mkdocs-autorefs | 1.4.3 | ISC | https://mkdocstrings.github.io/autorefs |
| mkdocs-get-deps | 0.2.0 | MIT | https://github.com/mkdocs/get-deps |
| mkdocs-jupyter | 0.25.1 | Apache-2.0 | https://github.com/danielfrg/mkdocs-jupyter |
| mkdocs-material | 9.7.1 | MIT | https://github.com/squidfunk/mkdocs-material |
| mkdocs-material-extensions | 1.3.1 | MIT | https://github.com/facelessuser/mkdocs-material-extensions |
| mkdocstrings | 1.0.3 | ISC | https://mkdocstrings.github.io |
| mkdocstrings-python | 2.0.2 | ISC | https://mkdocstrings.github.io/python |
| mpmath | 1.3.0 | BSD License | http://mpmath.org/ |
| msal | 1.34.0 | MIT License | https://github.com/AzureAD/microsoft-authentication-library-for-python |
| msal-extensions | 1.3.1 | MIT License | https://github.com/AzureAD/microsoft-authentication-extensions-for-python/releases |
| multidict | 6.7.1 | Apache License 2.0 | https://github.com/aio-libs/multidict |
| nbclient | 0.10.4 | BSD License | https://jupyter.org |
| nbconvert | 7.17.0 | BSD License | https://jupyter.org |
| nbformat | 5.10.4 | BSD License | https://jupyter.org |
| nest-asyncio | 1.6.0 | BSD License | https://github.com/erdewit/nest_asyncio |
| networkx | 3.6.1 | BSD-3-Clause | https://networkx.org/ |
| nodeenv | 1.10.0 | BSD License | https://github.com/ekalinin/nodeenv |
| numpy | 2.4.2 | BSD-3-Clause AND 0BSD AND MIT AND Zlib AND CC0-1.0 | https://numpy.org |
| ollama | 0.6.1 | MIT | https://ollama.com |
| open_clip_torch | 3.2.0 | MIT License | https://github.com/mlfoundations/open_clip |
| openai | 2.18.0 | Apache Software License | https://github.com/openai/openai-python |
| packaging | 26.0 | Apache-2.0 OR BSD-2-Clause | https://github.com/pypa/packaging |
| paginate | 0.5.7 | MIT License | https://github.com/Signum/paginate |
| pandas | 2.3.3 | BSD License | https://pandas.pydata.org |
| pandocfilters | 1.5.1 | BSD License | http://github.com/jgm/pandocfilters |
| parso | 0.8.6 | MIT License | https://github.com/davidhalter/parso |
| pathspec | 1.0.4 | Mozilla Public License 2.0 (MPL 2.0) | UNKNOWN |
| peft | 0.17.1 | Apache Software License | https://github.com/huggingface/peft |
| pexpect | 4.9.0 | ISC License (ISCL) | https://pexpect.readthedocs.io/ |
| pillow | 12.1.0 | MIT-CMU | https://python-pillow.github.io |
| platformdirs | 4.5.1 | MIT | https://github.com/tox-dev/platformdirs |
| pluggy | 1.6.0 | MIT License | UNKNOWN |
| polars | 1.3.0 | MIT License | https://www.pola.rs/ |
| pre_commit | 4.5.1 | MIT | https://github.com/pre-commit/pre-commit |
| prompt_toolkit | 3.0.52 | BSD License | https://github.com/prompt-toolkit/python-prompt-toolkit |
| propcache | 0.4.1 | Apache Software License | https://github.com/aio-libs/propcache |
| proto-plus | 1.27.1 | Apache Software License | https://github.com/googleapis/proto-plus-python |
| protobuf | 5.29.6 | 3-Clause BSD License | https://developers.google.com/protocol-buffers/ |
| psutil | 7.2.2 | BSD-3-Clause | https://github.com/giampaolo/psutil |
| ptyprocess | 0.7.0 | ISC License (ISCL) | https://github.com/pexpect/ptyprocess |
| pure_eval | 0.2.3 | MIT License | http://github.com/alexmojaki/pure_eval |
| pyarrow | 23.0.0 | Apache-2.0 | https://arrow.apache.org/ |
| pyarrow-stubs | 20.0.0.20251215 | BSD-2-Clause | https://github.com/zen-xu/pyarrow-stubs |
| pyasn1 | 0.6.2 | BSD-2-Clause | https://github.com/pyasn1/pyasn1 |
| pyasn1_modules | 0.4.2 | BSD License | https://github.com/pyasn1/pyasn1-modules |
| pycparser | 3.0 | BSD-3-Clause | https://github.com/eliben/pycparser |
| pydantic | 2.12.5 | MIT | https://github.com/pydantic/pydantic |
| pydantic_core | 2.41.5 | MIT | https://github.com/pydantic/pydantic-core |
| pylance | 2.0.0 | Apache Software License | UNKNOWN |
| pymdown-extensions | 10.20.1 | MIT | https://github.com/facelessuser/pymdown-extensions |
| pyparsing | 3.3.2 | MIT | https://github.com/pyparsing/pyparsing/ |
| pyright | 1.1.408 | MIT | https://github.com/RobertCraigie/pyright-python |
| pytest | 9.0.2 | MIT | https://docs.pytest.org/en/latest/ |
| pytest-asyncio | 1.3.0 | Apache-2.0 | https://github.com/pytest-dev/pytest-asyncio |
| pytest-mock | 3.15.1 | MIT License | https://github.com/pytest-dev/pytest-mock/ |
| python-dateutil | 2.9.0.post0 | Apache Software License; BSD License | https://github.com/dateutil/dateutil |
| pytz | 2025.2 | MIT License | http://pythonhosted.org/pytz |
| pyyaml_env_tag | 1.1 | MIT | https://github.com/waylan/pyyaml-env-tag |
| pyzmq | 27.1.0 | BSD License | https://pyzmq.readthedocs.org |
| referencing | 0.37.0 | MIT | https://github.com/python-jsonschema/referencing |
| regex | 2026.1.15 | Apache-2.0 AND CNRI-Python | https://github.com/mrabarnett/mrab-regex |
| requests | 2.32.5 | Apache Software License | https://requests.readthedocs.io |
| rpds-py | 0.30.0 | MIT | https://github.com/crate-py/rpds |
| rsa | 4.7.2 | Apache Software License | https://stuvel.eu/rsa |
| ruff | 0.15.0 | MIT License | https://docs.astral.sh/ruff |
| s3transfer | 0.16.0 | Apache Software License | https://github.com/boto/s3transfer |
| safetensors | 0.7.0 | Apache Software License | https://github.com/huggingface/safetensors |
| scikit-learn | 1.8.0 | BSD-3-Clause | https://scikit-learn.org |
| scipy | 1.17.0 | BSD License | https://scipy.org/ |
| sentence-transformers | 5.2.2 | Apache Software License | https://www.SBERT.net |
| sentencepiece | 0.2.1 | UNKNOWN | https://github.com/google/sentencepiece |
| six | 1.17.0 | MIT License | https://github.com/benjaminp/six |
| sniffio | 1.3.1 | Apache Software License; MIT License | https://github.com/python-trio/sniffio |
| soupsieve | 2.8.3 | MIT | https://github.com/facelessuser/soupsieve |
| stack-data | 0.6.3 | MIT License | http://github.com/alexmojaki/stack_data |
| sympy | 1.14.0 | BSD License | https://sympy.org |
| tabulate | 0.9.0 | MIT License | https://github.com/astanin/python-tabulate |
| tantivy | 0.25.1 | UNKNOWN | UNKNOWN |
| threadpoolctl | 3.6.0 | BSD License | https://github.com/joblib/threadpoolctl |
| timm | 1.0.24 | Apache Software License | https://github.com/huggingface/pytorch-image-models |
| tinycss2 | 1.4.0 | BSD License | https://www.courtbouillon.org/tinycss2 |
| tokenizers | 0.22.2 | Apache Software License | https://github.com/huggingface/tokenizers |
| torch | 2.8.0 | BSD License | https://pytorch.org/ |
| torchvision | 0.23.0 | BSD | https://github.com/pytorch/vision |
| tornado | 6.5.4 | Apache Software License | http://www.tornadoweb.org/ |
| tqdm | 4.67.3 | MPL-2.0 AND MIT | https://tqdm.github.io |
| traitlets | 5.14.3 | BSD License | https://github.com/ipython/traitlets |
| transformers | 4.57.6 | Apache Software License | https://github.com/huggingface/transformers |
| types-requests | 2.32.4.20260107 | Apache-2.0 | https://github.com/python/typeshed |
| typing-inspection | 0.4.2 | MIT | https://github.com/pydantic/typing-inspection |
| typing_extensions | 4.15.0 | PSF-2.0 | https://github.com/python/typing_extensions |
| tzdata | 2025.3 | Apache-2.0 | https://github.com/python/tzdata |
| uritemplate | 4.2.0 | BSD 3-Clause OR Apache-2.0 | https://uritemplate.readthedocs.org |
| urllib3 | 2.6.3 | MIT | https://github.com/urllib3/urllib3/blob/main/CHANGES.rst |
| virtualenv | 20.36.1 | MIT | https://github.com/pypa/virtualenv |
| watchdog | 6.0.0 | Apache Software License | https://github.com/gorakhargosh/watchdog |
| webencodings | 0.5.1 | BSD License | https://github.com/SimonSapin/python-webencodings |
| yarl | 1.22.0 | Apache Software License | https://github.com/aio-libs/yarl |

File diff suppressed because it is too large


@@ -2,6 +2,7 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import warnings
from typing import List, Union
import numpy as np
@@ -15,6 +16,8 @@ from .utils import weak_lru
@register("gte-text")
class GteEmbeddings(TextEmbeddingFunction):
"""
Deprecated: GTE embeddings should be used through sentence-transformers.
An embedding function that uses GTE-LARGE MLX format(for Apple silicon devices only)
as well as the standard cpu/gpu version from: https://huggingface.co/thenlper/gte-large.
@@ -61,6 +64,13 @@ class GteEmbeddings(TextEmbeddingFunction):
def __init__(self, **kwargs):
super().__init__(**kwargs)
warnings.warn(
"GTE embeddings as a standalone embedding function are deprecated. "
"Use the 'sentence-transformers' embedding function with a GTE model "
"instead.",
DeprecationWarning,
stacklevel=3,
)
self._ndims = None
if kwargs:
self.mlx = kwargs.get("mlx", False)
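The deprecation added in this hunk relies on `warnings.warn` with `DeprecationWarning`, which Python's default filters hide unless the warning is attributed to code running in `__main__`. The sketch below shows how such a warning can be surfaced and inspected. `LegacyEmbeddings` and `build_and_capture` are hypothetical names introduced for illustration and are not part of LanceDB; the sketch uses `stacklevel=2` (the direct caller), whereas the diff uses `stacklevel=3`, presumably because instances are created through an extra layer of indirection.

```python
import warnings


class LegacyEmbeddings:
    """Hypothetical stand-in for a deprecated embedding function."""

    def __init__(self):
        # stacklevel=2 attributes the warning to whoever constructed the object.
        warnings.warn(
            "LegacyEmbeddings is deprecated. Use a supported function instead.",
            DeprecationWarning,
            stacklevel=2,
        )


def build_and_capture():
    """Construct the deprecated class and return the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Override the default filters so DeprecationWarning is always recorded.
        warnings.simplefilter("always")
        LegacyEmbeddings()
    return caught
```

In a test suite, the same effect is usually achieved by enabling warnings globally (for example with `python -W always`), so deprecated call sites show up before the function is removed.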


@@ -110,6 +110,9 @@ class OpenAIEmbeddings(TextEmbeddingFunction):
valid_embeddings = {
idx: v.embedding for v, idx in zip(rs.data, valid_indices)
}
except openai.AuthenticationError:
logging.error("Authentication failed: Invalid API key provided")
raise
except openai.BadRequestError:
logging.exception("Bad request: %s", texts)
return [None] * len(texts)


@@ -6,6 +6,7 @@ import io
import os
from typing import TYPE_CHECKING, List, Union
import urllib.parse as urlparse
import warnings
import numpy as np
import pyarrow as pa
@@ -24,6 +25,7 @@ if TYPE_CHECKING:
@register("siglip")
class SigLipEmbeddings(EmbeddingFunction):
# Deprecated: prefer CLIP embeddings via `open-clip`.
model_name: str = "google/siglip-base-patch16-224"
device: str = "cpu"
batch_size: int = 64
@@ -36,6 +38,12 @@ class SigLipEmbeddings(EmbeddingFunction):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
warnings.warn(
"SigLip embeddings are deprecated. Use CLIP embeddings via the "
"'open-clip' embedding function instead.",
DeprecationWarning,
stacklevel=3,
)
transformers = attempt_import_or_raise("transformers")
self._torch = attempt_import_or_raise("torch")


@@ -269,6 +269,11 @@ def retry_with_exponential_backoff(
# and say that it is assumed that if this portion errors out, it's due
# to rate limit but the user should check the error message to be sure.
except Exception as e: # noqa: PERF203
# Don't retry on authentication errors (e.g., OpenAI 401)
# These are permanent failures that won't be fixed by retrying
if _is_non_retryable_error(e):
raise
num_retries += 1
if num_retries > max_retries:
@@ -289,6 +294,29 @@ def retry_with_exponential_backoff(
return wrapper
def _is_non_retryable_error(error: Exception) -> bool:
"""Check if an error should not be retried.
Args:
error: The exception to check
Returns:
True if the error should not be retried, False otherwise
"""
# Check for OpenAI authentication errors
error_type = type(error).__name__
if error_type == "AuthenticationError":
return True
# Check for other common non-retryable HTTP status codes
# 401 Unauthorized, 403 Forbidden
if hasattr(error, "status_code"):
if error.status_code in (401, 403):
return True
return False
def url_retrieve(url: str):
"""
Parameters
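The effect of the non-retryable check can be seen in a small standalone retry loop. This is a minimal sketch, not the LanceDB implementation: `retry_call`, `ForbiddenError`, and the delay parameters are hypothetical names introduced for illustration. Only the two conditions (an exception class named `AuthenticationError`, or a `status_code` of 401 or 403) are taken from the diff.

```python
import time


def is_non_retryable_error(error: Exception) -> bool:
    """Same conditions as the diff: authentication failures are permanent."""
    if type(error).__name__ == "AuthenticationError":
        return True
    # getattr with a default covers exceptions that carry no status code.
    return getattr(error, "status_code", None) in (401, 403)


class ForbiddenError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    status_code = 403


def retry_call(func, max_retries=3, initial_delay=0.0):
    """Hypothetical helper: retry func with exponential backoff,
    but re-raise permanent failures immediately."""
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            return func()
        except Exception as e:
            # Give up at once on permanent errors, or when retries run out.
            if is_non_retryable_error(e) or attempts > max_retries:
                raise
            time.sleep(delay)
            delay *= 2
```

With this structure a transient failure is retried up to `max_retries` times, while a 401 or 403 reaches the caller after a single attempt.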


@@ -44,7 +44,7 @@ from lance_namespace import (
ListNamespacesRequest,
CreateNamespaceRequest,
DropNamespaceRequest,
CreateEmptyTableRequest,
DeclareTableRequest,
)
from lancedb.table import AsyncTable, LanceTable, Table
from lancedb.util import validate_table_name
@@ -318,20 +318,20 @@ class LanceNamespaceDBConnection(DBConnection):
if location is None:
# Table doesn't exist or mode is "create", reserve a new location
create_empty_request = CreateEmptyTableRequest(
declare_request = DeclareTableRequest(
id=table_id,
location=None,
properties=self.storage_options if self.storage_options else None,
)
create_empty_response = self._ns.create_empty_table(create_empty_request)
declare_response = self._ns.declare_table(declare_request)
if not create_empty_response.location:
if not declare_response.location:
raise ValueError(
"Table location is missing from create_empty_table response"
"Table location is missing from declare_table response"
)
location = create_empty_response.location
namespace_storage_options = create_empty_response.storage_options
location = declare_response.location
namespace_storage_options = declare_response.storage_options
# Merge storage options: self.storage_options < user options < namespace options
merged_storage_options = dict(self.storage_options)
@@ -759,20 +759,20 @@ class AsyncLanceNamespaceDBConnection:
if location is None:
# Table doesn't exist or mode is "create", reserve a new location
create_empty_request = CreateEmptyTableRequest(
declare_request = DeclareTableRequest(
id=table_id,
location=None,
properties=self.storage_options if self.storage_options else None,
)
create_empty_response = self._ns.create_empty_table(create_empty_request)
declare_response = self._ns.declare_table(declare_request)
if not create_empty_response.location:
if not declare_response.location:
raise ValueError(
"Table location is missing from create_empty_table response"
"Table location is missing from declare_table response"
)
location = create_empty_response.location
namespace_storage_options = create_empty_response.storage_options
location = declare_response.location
namespace_storage_options = declare_response.storage_options
# Merge storage options: self.storage_options < user options < namespace options
merged_storage_options = dict(self.storage_options)
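The precedence stated in the comment (`self.storage_options < user options < namespace options`) amounts to successive dictionary updates, where later sources overwrite earlier ones. The sketch below illustrates that ordering only; `merge_storage_options` and its parameter names are hypothetical and do not appear in the diff.

```python
def merge_storage_options(connection_options, user_options, namespace_options):
    """Hypothetical helper: merge three option sources, lowest precedence first."""
    merged = dict(connection_options or {})
    if user_options:
        # User-supplied options override connection-level defaults.
        merged.update(user_options)
    if namespace_options:
        # Options returned by the namespace win over everything else.
        merged.update(namespace_options)
    return merged
```

For example, if all three sources define `region`, the namespace value is the one that survives, while keys defined in only one source are carried through unchanged.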


@@ -9,7 +9,7 @@ import json
from ._lancedb import async_permutation_builder, PermutationReader
from .table import LanceTable
from .background_loop import LOOP
from .util import batch_to_tensor
from .util import batch_to_tensor, batch_to_tensor_rows
from typing import Any, Callable, Iterator, Literal, Optional, TYPE_CHECKING, Union
if TYPE_CHECKING:
@@ -333,7 +333,11 @@ class Transforms:
"""
@staticmethod
def arrow2python(batch: pa.RecordBatch) -> dict[str, list[Any]]:
def arrow2python(batch: pa.RecordBatch) -> list[dict[str, Any]]:
return batch.to_pylist()
@staticmethod
def arrow2pythoncol(batch: pa.RecordBatch) -> dict[str, list[Any]]:
return batch.to_pydict()
@staticmethod
@@ -687,7 +691,17 @@ class Permutation:
return
def with_format(
self, format: Literal["numpy", "python", "pandas", "arrow", "torch", "polars"]
self,
format: Literal[
"numpy",
"python",
"python_col",
"pandas",
"arrow",
"torch",
"torch_col",
"polars",
],
) -> "Permutation":
"""
Set the format for batches
@@ -696,16 +710,18 @@ class Permutation:
The format can be one of:
- "numpy" - the batch will be a dict of numpy arrays (one per column)
- "python" - the batch will be a dict of lists (one per column)
- "python" - the batch will be a list of dicts (one per row)
- "python_col" - the batch will be a dict of lists (one entry per column)
- "pandas" - the batch will be a pandas DataFrame
- "arrow" - the batch will be a pyarrow RecordBatch
- "torch" - the batch will be a two dimensional torch tensor
- "torch" - the batch will be a list of tensors, one per row
- "torch_col" - the batch will be a 2D torch tensor (first dim indexes columns)
- "polars" - the batch will be a polars DataFrame
Conversion may or may not involve a data copy. Lance uses Arrow internally
and so it is able to zero-copy to the arrow and polars.
and so it is able to zero-copy to the arrow and polars formats.
Conversion to torch will be zero-copy but will only support a subset of data
Conversion to torch_col will be zero-copy but will only support a subset of data
types (numeric types).
Conversion to numpy and/or pandas will typically be zero-copy for numeric
@@ -718,6 +734,8 @@ class Permutation:
assert format is not None, "format is required"
if format == "python":
return self.with_transform(Transforms.arrow2python)
elif format == "python_col":
return self.with_transform(Transforms.arrow2pythoncol)
elif format == "numpy":
return self.with_transform(Transforms.arrow2numpy)
elif format == "pandas":
@@ -725,6 +743,8 @@ class Permutation:
elif format == "arrow":
return self.with_transform(Transforms.arrow2arrow)
elif format == "torch":
return self.with_transform(batch_to_tensor_rows)
elif format == "torch_col":
return self.with_transform(batch_to_tensor)
elif format == "polars":
return self.with_transform(Transforms.arrow2polars())
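Review note: this hunk changes `"python"` from a column-oriented batch to a row-oriented one and keeps the old layout under `"python_col"`. The sketch below shows the two shapes using only the standard library. The helper names `to_rows` and `to_columns` are hypothetical and not part of lancedb; they only mirror what `RecordBatch.to_pylist()` and `RecordBatch.to_pydict()` return.

```python
from typing import Any


def to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Row-oriented -> column-oriented (the "python_col" shape)."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Column-oriented -> row-oriented (the "python" shape)."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Illustrative data only.
col_batch = {"id": [1, 2, 3], "value": [10, 20, 30]}
row_batch = to_rows(col_batch)
```

Callers that relied on the old `"python"` layout would change `batch["id"][i]` to `batch[i]["id"]`, or switch to `with_format("python_col")` to keep the dict-of-lists shape.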
@@ -746,15 +766,20 @@ class Permutation:
def __getitem__(self, index: int) -> Any:
"""
Return a single row from the permutation
The output will always be a python dictionary regardless of the format.
This method is mostly useful for debugging and exploration. For actual
processing use [iter](#iter) or a torch data loader to perform batched
processing.
Returns a single row from the permutation by offset
"""
pass
return self.__getitems__([index])
def __getitems__(self, indices: list[int]) -> Any:
"""
Returns rows from the permutation by offset
"""
async def do_getitems():
return await self.reader.take_offsets(indices, selection=self.selection)
batch = LOOP.run(do_getitems())
return self.transform_fn(batch)
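Review note: `__getitems__` is the batched-lookup hook that recent PyTorch `DataLoader` versions appear to prefer when a map-style dataset defines it, so one call fetches a whole batch of offsets. A minimal stdlib-only sketch of that pattern follows. `OffsetDataset` and its `fetches` counter are hypothetical stand-ins, not the `Permutation` class, and unwrapping the single row in `__getitem__` is a choice made for this sketch.

```python
from typing import Any


class OffsetDataset:
    """Hypothetical map-style dataset backed by an in-memory list of rows."""

    def __init__(self, rows: list[dict[str, Any]]):
        self._rows = rows
        self.fetches = 0  # counts batched lookups against the backing store

    def __len__(self) -> int:
        return len(self._rows)

    def __getitems__(self, indices: list[int]) -> list[dict[str, Any]]:
        # One batched lookup for all offsets, returned in the requested order.
        self.fetches += 1
        return [self._rows[i] for i in indices]

    def __getitem__(self, index: int) -> dict[str, Any]:
        # Single-row access reuses the batched path and unwraps the result.
        return self.__getitems__([index])[0]


ds = OffsetDataset([{"id": i, "value": i * 10} for i in range(10)])
picked = ds.__getitems__([4, 0, 9])
```

Routing both entry points through one batched call keeps ordering and error behaviour identical for single-row and multi-row access.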
@deprecated(details="Use with_skip instead")
def skip(self, skip: int) -> "Permutation":

View File

@@ -1782,6 +1782,26 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
vector_results = LanceHybridQueryBuilder._rank(vector_results, "_distance")
fts_results = LanceHybridQueryBuilder._rank(fts_results, "_score")
# If both result sets are empty (e.g. after hard filtering),
# return early to avoid errors in reranking or score restoration.
if vector_results.num_rows == 0 and fts_results.num_rows == 0:
# Build a minimal empty table with the _relevance_score column
combined_schema = pa.unify_schemas(
[vector_results.schema, fts_results.schema],
)
empty = pa.table(
{
col: pa.array([], type=combined_schema.field(col).type)
for col in combined_schema.names
}
)
empty = empty.append_column(
"_relevance_score", pa.array([], type=pa.float32())
)
if not with_row_ids and "_rowid" in empty.column_names:
empty = empty.drop(["_rowid"])
return empty
original_distances = None
original_scores = None
original_distance_row_ids = None
@@ -2118,19 +2138,17 @@ class LanceHybridQueryBuilder(LanceQueryBuilder):
""" # noqa: E501
self._create_query_builders()
results = ["Vector Search Plan:"]
results.append(
self._table._explain_plan(
self._vector_query.to_query_object(), verbose=verbose
)
reranker_label = str(self._reranker) if self._reranker else "No reranker"
vector_plan = self._table._explain_plan(
self._vector_query.to_query_object(), verbose=verbose
)
results.append("FTS Search Plan:")
results.append(
self._table._explain_plan(
self._fts_query.to_query_object(), verbose=verbose
)
fts_plan = self._table._explain_plan(
self._fts_query.to_query_object(), verbose=verbose
)
return "\n".join(results)
# Indent sub-plans under the reranker
indented_vector = "\n".join(" " + line for line in vector_plan.splitlines())
indented_fts = "\n".join(" " + line for line in fts_plan.splitlines())
return f"{reranker_label}\n {indented_vector}\n {indented_fts}"
def analyze_plan(self):
"""Execute the query and display with runtime metrics.
@@ -3164,23 +3182,20 @@ class AsyncHybridQuery(AsyncStandardQuery, AsyncVectorQueryBase):
... plan = await table.query().nearest_to([1.0, 2.0]).nearest_to_text("hello").explain_plan(True)
... print(plan)
>>> asyncio.run(doctest_example()) # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
Vector Search Plan:
ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
Take: columns="vector, _rowid, _distance, (text)"
CoalesceBatchesExec: target_batch_size=1024
GlobalLimitExec: skip=0, fetch=10
FilterExec: _distance@2 IS NOT NULL
SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST, _rowid@1 ASC NULLS LAST], preserve_partitioning=[false]
KNNVectorDistance: metric=l2
LanceRead: uri=..., projection=[vector], ...
<BLANKLINE>
FTS Search Plan:
ProjectionExec: expr=[vector@2 as vector, text@3 as text, _score@1 as _score]
Take: columns="_rowid, _score, (vector), (text)"
CoalesceBatchesExec: target_batch_size=1024
GlobalLimitExec: skip=0, fetch=10
MatchQuery: column=text, query=hello
<BLANKLINE>
RRFReranker(K=60)
ProjectionExec: expr=[vector@0 as vector, text@3 as text, _distance@2 as _distance]
Take: columns="vector, _rowid, _distance, (text)"
CoalesceBatchesExec: target_batch_size=1024
GlobalLimitExec: skip=0, fetch=10
FilterExec: _distance@2 IS NOT NULL
SortExec: TopK(fetch=10), expr=[_distance@2 ASC NULLS LAST, _rowid@1 ASC NULLS LAST], preserve_partitioning=[false]
KNNVectorDistance: metric=l2
LanceRead: uri=..., projection=[vector], ...
ProjectionExec: expr=[vector@2 as vector, text@3 as text, _score@1 as _score]
Take: columns="_rowid, _score, (vector), (text)"
CoalesceBatchesExec: target_batch_size=1024
GlobalLimitExec: skip=0, fetch=10
MatchQuery: column=text, query=hello
Parameters
----------
@@ -3192,12 +3207,12 @@ class AsyncHybridQuery(AsyncStandardQuery, AsyncVectorQueryBase):
plan : str
""" # noqa: E501
results = ["Vector Search Plan:"]
results.append(await self._inner.to_vector_query().explain_plan(verbose))
results.append("FTS Search Plan:")
results.append(await self._inner.to_fts_query().explain_plan(verbose))
return "\n".join(results)
vector_plan = await self._inner.to_vector_query().explain_plan(verbose)
fts_plan = await self._inner.to_fts_query().explain_plan(verbose)
# Indent sub-plans under the reranker
indented_vector = "\n".join(" " + line for line in vector_plan.splitlines())
indented_fts = "\n".join(" " + line for line in fts_plan.splitlines())
return f"{self._reranker}\n {indented_vector}\n {indented_fts}"
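Review note: the new output drops the "Vector Search Plan:" and "FTS Search Plan:" headers and nests both sub-plans under a single reranker line. The sketch below reproduces that layout in isolation. The function name `format_hybrid_plan` and the two-space indent width are assumptions for illustration (the whitespace in this diff view is not reliable), and the plan strings are shortened placeholders.

```python
def format_hybrid_plan(reranker_label: str, vector_plan: str, fts_plan: str) -> str:
    """Nest both sub-plans under a single reranker header line."""
    indent = "  "  # assumed width, for illustration only

    def indent_block(plan: str) -> str:
        # Prefix every line so nested operators keep their relative depth.
        return "\n".join(indent + line for line in plan.splitlines())

    parts = [reranker_label, indent_block(vector_plan), indent_block(fts_plan)]
    return "\n".join(parts)


plan = format_hybrid_plan(
    "RRFReranker(K=60)",
    "ProjectionExec: ...\n  KNNVectorDistance: metric=l2",
    "ProjectionExec: ...\n  MatchQuery: column=text, query=hello",
)
print(plan)
```

Code that previously asserted on the literal section headers would need to match on operator names such as `KNNVectorDistance` instead, as the updated `test_explain_plan` does.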
async def analyze_plan(self):
"""

View File

@@ -42,10 +42,18 @@ class AnswerdotaiRerankers(Reranker):
rerankers = attempt_import_or_raise(
"rerankers"
) # import here for faster ops later
self.model_name = model_name
self.model_type = model_type
self.reranker = rerankers.Reranker(
model_name=model_name, model_type=model_type, **kwargs
)
def __str__(self):
return (
f"AnswerdotaiRerankers(model_type={self.model_type}, "
f"model_name={self.model_name})"
)
def _rerank(self, result_set: pa.Table, query: str):
result_set = self._handle_empty_results(result_set)
if len(result_set) == 0:

View File

@@ -40,6 +40,9 @@ class Reranker(ABC):
if ARROW_VERSION.major <= 13:
self._concat_tables_args = {"promote": True}
def __str__(self):
return self.__class__.__name__
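Review note: the base-class `__str__` gives every reranker a usable label in `explain_plan`, and subclasses override it to include their parameters. A small sketch of that fallback follows. The classes here are stand-ins defined for illustration only (`RRFLike` and `CustomReranker` are hypothetical names), not the real lancedb classes.

```python
class Reranker:
    """Stand-in for the abstract base class; not the real lancedb class."""

    def __str__(self) -> str:
        # Default label: just the concrete class name.
        return self.__class__.__name__


class RRFLike(Reranker):
    """Overrides __str__ to expose its tuning parameter."""

    def __init__(self, K: int = 60):
        self.K = K

    def __str__(self) -> str:
        return f"RRFLike(K={self.K})"


class CustomReranker(Reranker):
    """A user-defined reranker that does not override __str__."""


labels = [str(RRFLike()), str(CustomReranker())]
```

With this default in place, a third-party reranker that never defines `__str__` still produces a readable header line rather than the generic object repr.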
def rerank_vector(
self,
query: str,

View File

@@ -44,6 +44,9 @@ class CohereReranker(Reranker):
self.top_n = top_n
self.api_key = api_key
def __str__(self):
return f"CohereReranker(model_name={self.model_name})"
@cached_property
def _client(self):
cohere = attempt_import_or_raise("cohere")

View File

@@ -50,6 +50,9 @@ class CrossEncoderReranker(Reranker):
if self.device is None:
self.device = "cuda" if torch.cuda.is_available() else "cpu"
def __str__(self):
return f"CrossEncoderReranker(model_name={self.model_name})"
@cached_property
def model(self):
sbert = attempt_import_or_raise("sentence_transformers")

View File

@@ -45,6 +45,9 @@ class JinaReranker(Reranker):
self.top_n = top_n
self.api_key = api_key
def __str__(self):
return f"JinaReranker(model_name={self.model_name})"
@cached_property
def _client(self):
import requests

View File

@@ -38,6 +38,9 @@ class LinearCombinationReranker(Reranker):
self.weight = weight
self.fill = fill
def __str__(self):
return f"LinearCombinationReranker(weight={self.weight}, fill={self.fill})"
def rerank_hybrid(
self,
query: str, # noqa: F821

View File

@@ -54,6 +54,12 @@ class MRRReranker(Reranker):
self.weight_vector = weight_vector
self.weight_fts = weight_fts
def __str__(self):
return (
f"MRRReranker(weight_vector={self.weight_vector}, "
f"weight_fts={self.weight_fts})"
)
def rerank_hybrid(
self,
query: str, # noqa: F821

View File

@@ -43,6 +43,9 @@ class OpenaiReranker(Reranker):
self.column = column
self.api_key = api_key
def __str__(self):
return f"OpenaiReranker(model_name={self.model_name})"
def _rerank(self, result_set: pa.Table, query: str):
result_set = self._handle_empty_results(result_set)
if len(result_set) == 0:

View File

@@ -36,6 +36,9 @@ class RRFReranker(Reranker):
super().__init__(return_score)
self.K = K
def __str__(self):
return f"RRFReranker(K={self.K})"
def rerank_hybrid(
self,
query: str, # noqa: F821

View File

@@ -52,6 +52,9 @@ class VoyageAIReranker(Reranker):
self.api_key = api_key
self.truncation = truncation
def __str__(self):
return f"VoyageAIReranker(model_name={self.model_name})"
@cached_property
def _client(self):
voyageai = attempt_import_or_raise("voyageai")

View File

@@ -904,7 +904,9 @@ class Table(ABC):
----------
field_names: str or list of str
The name(s) of the field to index.
can be only str if use_tantivy=True for now.
If ``use_tantivy`` is False (default), only a single field name
(str) is supported. To index multiple fields, create a separate
FTS index for each field.
replace: bool, default False
If True, replace the existing index if it exists. Note that this is
not yet an atomic operation; the index will be temporarily
@@ -2298,7 +2300,11 @@ class LanceTable(Table):
):
if not use_tantivy:
if not isinstance(field_names, str):
raise ValueError("field_names must be a string when use_tantivy=False")
raise ValueError(
"Native FTS indexes can only be created on a single field "
"at a time. To search over multiple text fields, create a "
"separate FTS index for each field."
)
if tokenizer_name is None:
tokenizer_configs = {

View File

@@ -419,3 +419,22 @@ def batch_to_tensor(batch: pa.RecordBatch):
"""
torch = attempt_import_or_raise("torch", "torch")
return torch.stack([torch.from_dlpack(col) for col in batch.columns])
def batch_to_tensor_rows(batch: pa.RecordBatch):
"""
Convert a PyArrow RecordBatch to a list of PyTorch tensors, one per row.
Each column is converted to a numpy array (which may copy the data),
the columns are stacked into a single 2D tensor, and that tensor is
then split into a list of 1D tensors, one per row.
Fails if torch or numpy is not installed.
Fails if a column's data type is not supported by PyTorch.
"""
torch = attempt_import_or_raise("torch", "torch")
numpy = attempt_import_or_raise("numpy", "numpy")
columns = [col.to_numpy(zero_copy_only=False) for col in batch.columns]
stacked = torch.tensor(numpy.column_stack(columns))
rows = list(stacked.unbind(dim=0))
return rows

View File

@@ -515,3 +515,34 @@ def test_openai_propagates_api_key(monkeypatch):
query = "greetings"
actual = table.search(query).limit(1).to_pydantic(Words)[0]
assert len(actual.text) > 0
@patch("time.sleep")
def test_openai_no_retry_on_401(mock_sleep):
"""
Test that OpenAI embedding function does not retry on 401 authentication
errors.
"""
from lancedb.embeddings.utils import retry_with_exponential_backoff
# Create a mock that raises an AuthenticationError
class MockAuthenticationError(Exception):
"""Mock OpenAI AuthenticationError"""
pass
MockAuthenticationError.__name__ = "AuthenticationError"
mock_func = MagicMock(side_effect=MockAuthenticationError("Invalid API key"))
# Wrap the function with retry logic
wrapped_func = retry_with_exponential_backoff(mock_func, max_retries=3)
# Should raise without retrying
with pytest.raises(MockAuthenticationError):
wrapped_func()
# Verify that the function was only called once (no retries)
assert mock_func.call_count == 1
# Verify that sleep was never called (no retries)
assert mock_sleep.call_count == 0
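Review note: the behaviour under test is "retry transient failures with exponential backoff, but re-raise authentication failures immediately". The sketch below follows the rules listed in the commit message (exception named `AuthenticationError`, or a `status_code` of 401 or 403). It is not the code in `lancedb.embeddings.utils`; the names `is_non_retryable` and `retry_with_backoff`, the `sleep` parameter, and the delay schedule are assumptions made for illustration.

```python
import time
from typing import Any, Callable


def is_non_retryable(exc: Exception) -> bool:
    """Hypothetical check: auth failures are permanent, so do not retry."""
    if type(exc).__name__ == "AuthenticationError":
        return True
    return getattr(exc, "status_code", None) in (401, 403)


def retry_with_backoff(
    func: Callable[..., Any],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> Callable[..., Any]:
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        delay = initial_delay
        attempts = 0
        while True:
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                attempts += 1
                # Give up at once on permanent errors or when out of retries.
                if is_non_retryable(exc) or attempts > max_retries:
                    raise
                sleep(delay)
                delay *= 2  # exponential backoff

    return wrapper


class AuthenticationError(Exception):
    """Local stand-in for a 401 error; not the OpenAI class."""

    status_code = 401


calls: list[int] = []
delays: list[float] = []


def always_unauthorized() -> None:
    calls.append(1)
    raise AuthenticationError("Invalid API key")


try:
    retry_with_backoff(always_unauthorized, sleep=delays.append)()
except AuthenticationError:
    pass
```

Passing `sleep` as a parameter is a design choice of this sketch; it lets a test observe the backoff schedule without patching `time.sleep`.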

View File

@@ -163,9 +163,7 @@ async def test_explain_plan(table: AsyncTable):
table.query().nearest_to_text("dog").nearest_to([0.1, 0.1]).explain_plan(True)
)
assert "Vector Search Plan" in plan
assert "KNNVectorDistance" in plan
assert "FTS Search Plan" in plan
assert "LanceRead" in plan

View File

@@ -664,23 +664,20 @@ def test_iter_basic(some_permutation: Permutation):
expected_batches = (950 + batch_size - 1) // batch_size # ceiling division
assert len(batches) == expected_batches
# Check that all batches are dicts (default python format)
assert all(isinstance(batch, dict) for batch in batches)
# Check that all batches are lists of dicts (default python format)
assert all(isinstance(batch, list) for batch in batches)
# Check that batches have the correct structure
for batch in batches:
assert "id" in batch
assert "value" in batch
assert isinstance(batch["id"], list)
assert isinstance(batch["value"], list)
assert "id" in batch[0]
assert "value" in batch[0]
# Check that all batches except the last have the correct size
for batch in batches[:-1]:
assert len(batch["id"]) == batch_size
assert len(batch["value"]) == batch_size
assert len(batch) == batch_size
# Last batch might be smaller
assert len(batches[-1]["id"]) <= batch_size
assert len(batches[-1]) <= batch_size
def test_iter_skip_last_batch(some_permutation: Permutation):
@@ -699,11 +696,11 @@ def test_iter_skip_last_batch(some_permutation: Permutation):
if 950 % batch_size != 0:
assert len(batches_without_skip) == num_full_batches + 1
# Last batch should be smaller
assert len(batches_without_skip[-1]["id"]) == 950 % batch_size
assert len(batches_without_skip[-1]) == 950 % batch_size
# All batches with skip_last_batch should be full size
for batch in batches_with_skip:
assert len(batch["id"]) == batch_size
assert len(batch) == batch_size
def test_iter_different_batch_sizes(some_permutation: Permutation):
@@ -720,12 +717,12 @@ def test_iter_different_batch_sizes(some_permutation: Permutation):
# Test with batch size equal to total rows
single_batch = list(some_permutation.iter(950, skip_last_batch=False))
assert len(single_batch) == 1
assert len(single_batch[0]["id"]) == 950
assert len(single_batch[0]) == 950
# Test with batch size larger than total rows
oversized_batch = list(some_permutation.iter(10000, skip_last_batch=False))
assert len(oversized_batch) == 1
assert len(oversized_batch[0]["id"]) == 950
assert len(oversized_batch[0]) == 950
def test_dunder_iter(some_permutation: Permutation):
@@ -738,15 +735,13 @@ def test_dunder_iter(some_permutation: Permutation):
# All batches should be full size
for batch in batches:
assert len(batch["id"]) == 100
assert len(batch["value"]) == 100
assert len(batch) == 100
some_permutation = some_permutation.with_batch_size(400)
batches = list(some_permutation)
assert len(batches) == 2 # floor(950 / 400) since skip_last_batch=True
for batch in batches:
assert len(batch["id"]) == 400
assert len(batch["value"]) == 400
assert len(batch) == 400
def test_iter_with_different_formats(some_permutation: Permutation):
@@ -761,7 +756,7 @@ def test_iter_with_different_formats(some_permutation: Permutation):
# Test with python format (default)
python_perm = some_permutation.with_format("python")
python_batches = list(python_perm.iter(batch_size, skip_last_batch=False))
assert all(isinstance(batch, dict) for batch in python_batches)
assert all(isinstance(batch, list) for batch in python_batches)
# Test with pandas format
pandas_perm = some_permutation.with_format("pandas")
@@ -780,8 +775,8 @@ def test_iter_with_column_selection(some_permutation: Permutation):
# Check that batches only contain the id column
for batch in batches:
assert "id" in batch
assert "value" not in batch
assert "id" in batch[0]
assert "value" not in batch[0]
def test_iter_with_column_rename(some_permutation: Permutation):
@@ -791,9 +786,9 @@ def test_iter_with_column_rename(some_permutation: Permutation):
# Check that batches have the renamed column
for batch in batches:
assert "id" in batch
assert "data" in batch
assert "value" not in batch
assert "id" in batch[0]
assert "data" in batch[0]
assert "value" not in batch[0]
def test_iter_with_limit_offset(some_permutation: Permutation):
@@ -812,14 +807,14 @@ def test_iter_with_limit_offset(some_permutation: Permutation):
assert len(limit_batches) == 5
no_skip = some_permutation.iter(101, skip_last_batch=False)
row_100 = next(no_skip)["id"][100]
row_100 = next(no_skip)[100]["id"]
# Test with both limit and offset
limited_perm = some_permutation.with_skip(100).with_take(300)
limited_batches = list(limited_perm.iter(100, skip_last_batch=False))
# Should have 3 batches (300 / 100)
assert len(limited_batches) == 3
assert limited_batches[0]["id"][0] == row_100
assert limited_batches[0][0]["id"] == row_100
def test_iter_empty_permutation(mem_db):
@@ -842,7 +837,7 @@ def test_iter_single_row(mem_db):
# With skip_last_batch=False, should get one batch
batches = list(perm.iter(10, skip_last_batch=False))
assert len(batches) == 1
assert len(batches[0]["id"]) == 1
assert len(batches[0]) == 1
# With skip_last_batch=True, should skip the single row (since it's < batch_size)
batches_skip = list(perm.iter(10, skip_last_batch=True))
@@ -860,8 +855,7 @@ def test_identity_permutation(mem_db):
batches = list(permutation.iter(10, skip_last_batch=False))
assert len(batches) == 1
assert len(batches[0]["id"]) == 10
assert len(batches[0]["value"]) == 10
assert len(batches[0]) == 10
permutation = permutation.remove_columns(["value"])
assert permutation.num_columns == 1
@@ -904,10 +898,10 @@ def test_transform_fn(mem_db):
py_result = list(permutation.with_format("python").iter(10, skip_last_batch=False))[
0
]
assert len(py_result) == 2
assert len(py_result["id"]) == 10
assert len(py_result["value"]) == 10
assert isinstance(py_result, dict)
assert len(py_result) == 10
assert "id" in py_result[0]
assert "value" in py_result[0]
assert isinstance(py_result, list)
try:
import torch
@@ -915,9 +909,11 @@ def test_transform_fn(mem_db):
torch_result = list(
permutation.with_format("torch").iter(10, skip_last_batch=False)
)[0]
assert torch_result.shape == (2, 10)
assert torch_result.dtype == torch.int64
assert isinstance(torch_result, torch.Tensor)
assert isinstance(torch_result, list)
assert len(torch_result) == 10
assert isinstance(torch_result[0], torch.Tensor)
assert torch_result[0].shape == (2,)
assert torch_result[0].dtype == torch.int64
except ImportError:
# Skip check if torch is not installed
pass
@@ -945,3 +941,113 @@ def test_custom_transform(mem_db):
batch = batches[0]
assert batch == pa.record_batch([range(10)], ["id"])
def test_getitems_basic(some_permutation: Permutation):
"""Test __getitems__ returns correct rows by offset."""
result = some_permutation.__getitems__([0, 1, 2])
assert isinstance(result, list)
assert "id" in result[0]
assert "value" in result[0]
assert len(result) == 3
def test_getitems_single_index(some_permutation: Permutation):
"""Test __getitems__ with a single index."""
result = some_permutation.__getitems__([0])
assert len(result) == 1
def test_getitems_preserves_order(some_permutation: Permutation):
"""Test __getitems__ returns rows in the requested order."""
# Get rows in forward order
forward = some_permutation.__getitems__([0, 1, 2, 3, 4])
# Get the same rows in reverse order
reverse = some_permutation.__getitems__([4, 3, 2, 1, 0])
assert [r["id"] for r in forward] == list(reversed([r["id"] for r in reverse]))
assert [r["value"] for r in forward] == list(
reversed([r["value"] for r in reverse])
)
def test_getitems_non_contiguous(some_permutation: Permutation):
"""Test __getitems__ with non-contiguous indices."""
result = some_permutation.__getitems__([0, 10, 50, 100, 500])
assert len(result) == 5
# Each id/value pair should match what we'd get individually
for i, offset in enumerate([0, 10, 50, 100, 500]):
single = some_permutation.__getitems__([offset])
assert result[i]["id"] == single[0]["id"]
assert result[i]["value"] == single[0]["value"]
def test_getitems_with_column_selection(some_permutation: Permutation):
"""Test __getitems__ respects column selection."""
id_only = some_permutation.select_columns(["id"])
result = id_only.__getitems__([0, 1, 2])
assert "id" in result[0]
assert "value" not in result[0]
assert len(result) == 3
def test_getitems_with_column_rename(some_permutation: Permutation):
"""Test __getitems__ respects column renames."""
renamed = some_permutation.rename_column("value", "data")
result = renamed.__getitems__([0, 1])
assert "data" in result[0]
assert "value" not in result[0]
assert len(result) == 2
def test_getitems_with_format(some_permutation: Permutation):
"""Test __getitems__ applies the transform function."""
arrow_perm = some_permutation.with_format("arrow")
result = arrow_perm.__getitems__([0, 1, 2])
assert isinstance(result, pa.RecordBatch)
assert result.num_rows == 3
def test_getitems_with_custom_transform(some_permutation: Permutation):
"""Test __getitems__ with a custom transform."""
def transform(batch: pa.RecordBatch) -> list:
return batch.column("id").to_pylist()
custom = some_permutation.with_transform(transform)
result = custom.__getitems__([0, 1, 2])
assert isinstance(result, list)
assert len(result) == 3
def test_getitems_identity_permutation(mem_db):
"""Test __getitems__ on an identity permutation."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(10), "value": range(10)})
)
perm = Permutation.identity(tbl)
result = perm.__getitems__([0, 5, 9])
assert [r["id"] for r in result] == [0, 5, 9]
assert [r["value"] for r in result] == [0, 5, 9]
def test_getitems_with_limit_offset(some_permutation: Permutation):
"""Test __getitems__ on a permutation with skip/take applied."""
limited = some_permutation.with_skip(100).with_take(200)
# Should be able to access offsets within the limited range
result = limited.__getitems__([0, 1, 199])
assert len(result) == 3
# The first item of the limited permutation should match offset 100 of original
full_result = some_permutation.__getitems__([100])
limited_result = limited.__getitems__([0])
assert limited_result[0]["id"] == full_result[0]["id"]
def test_getitems_invalid_offset(some_permutation: Permutation):
"""Test __getitems__ with an out-of-range offset raises an error."""
with pytest.raises(Exception):
some_permutation.__getitems__([999999])

View File

@@ -531,6 +531,78 @@ def test_empty_result_reranker():
)
def test_empty_hybrid_result_reranker():
"""Test that hybrid search with empty results after filtering doesn't crash.
Regression test for https://github.com/lancedb/lancedb/issues/2425
"""
from lancedb.query import LanceHybridQueryBuilder
# Simulate empty vector and FTS results with the expected schema
vector_schema = pa.schema(
[
("text", pa.string()),
("vector", pa.list_(pa.float32(), 4)),
("_rowid", pa.uint64()),
("_distance", pa.float32()),
]
)
fts_schema = pa.schema(
[
("text", pa.string()),
("vector", pa.list_(pa.float32(), 4)),
("_rowid", pa.uint64()),
("_score", pa.float32()),
]
)
empty_vector = pa.table(
{
"text": pa.array([], type=pa.string()),
"vector": pa.array([], type=pa.list_(pa.float32(), 4)),
"_rowid": pa.array([], type=pa.uint64()),
"_distance": pa.array([], type=pa.float32()),
},
schema=vector_schema,
)
empty_fts = pa.table(
{
"text": pa.array([], type=pa.string()),
"vector": pa.array([], type=pa.list_(pa.float32(), 4)),
"_rowid": pa.array([], type=pa.uint64()),
"_score": pa.array([], type=pa.float32()),
},
schema=fts_schema,
)
for reranker in [LinearCombinationReranker(), RRFReranker()]:
result = LanceHybridQueryBuilder._combine_hybrid_results(
fts_results=empty_fts,
vector_results=empty_vector,
norm="score",
fts_query="nonexistent query",
reranker=reranker,
limit=10,
with_row_ids=False,
)
assert len(result) == 0
assert "_relevance_score" in result.column_names
assert "_rowid" not in result.column_names
# Also test with with_row_ids=True
result = LanceHybridQueryBuilder._combine_hybrid_results(
fts_results=empty_fts,
vector_results=empty_vector,
norm="score",
fts_query="nonexistent query",
reranker=LinearCombinationReranker(),
limit=10,
with_row_ids=True,
)
assert len(result) == 0
assert "_relevance_score" in result.column_names
assert "_rowid" in result.column_names
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_cross_encoder_reranker_return_all(tmp_path, use_tantivy):
pytest.importorskip("sentence_transformers")

View File

@@ -4,6 +4,7 @@
import pyarrow as pa
import pytest
from lancedb.util import tbl_to_tensor
from lancedb.permutation import Permutation
torch = pytest.importorskip("torch")
@@ -16,3 +17,26 @@ def test_table_dataloader(mem_db):
for batch in dataloader:
assert batch.size(0) == 1
assert batch.size(1) == 10
def test_permutation_dataloader(mem_db):
table = mem_db.create_table("test_table", pa.table({"a": range(1000)}))
permutation = Permutation.identity(table)
dataloader = torch.utils.data.DataLoader(permutation, batch_size=10, shuffle=True)
for batch in dataloader:
assert batch["a"].size(0) == 10
permutation = permutation.with_format("torch")
dataloader = torch.utils.data.DataLoader(permutation, batch_size=10, shuffle=True)
for batch in dataloader:
assert batch.size(0) == 10
assert batch.size(1) == 1
permutation = permutation.with_format("torch_col")
dataloader = torch.utils.data.DataLoader(
permutation, collate_fn=lambda x: x, batch_size=10, shuffle=True
)
for batch in dataloader:
assert batch.size(0) == 1
assert batch.size(1) == 10

View File

@@ -292,18 +292,14 @@ class TestModel(lancedb.pydantic.LanceModel):
lambda: pa.table({"a": [1], "b": [2]}),
lambda: pa.table({"a": [1], "b": [2]}).to_reader(),
lambda: iter(pa.table({"a": [1], "b": [2]}).to_batches()),
lambda: (
lance.write_dataset(
pa.table({"a": [1], "b": [2]}),
"memory://test",
)
),
lambda: (
lance.write_dataset(
pa.table({"a": [1], "b": [2]}),
"memory://test",
).scanner()
lambda: lance.write_dataset(
pa.table({"a": [1], "b": [2]}),
"memory://test",
),
lambda: lance.write_dataset(
pa.table({"a": [1], "b": [2]}),
"memory://test",
).scanner(),
lambda: pd.DataFrame({"a": [1], "b": [2]}),
lambda: pl.DataFrame({"a": [1], "b": [2]}),
lambda: pl.LazyFrame({"a": [1], "b": [2]}),

View File

@@ -121,7 +121,8 @@ impl Connection {
let mode = Self::parse_create_mode_str(mode)?;
let batches = ArrowArrayStreamReader::from_pyarrow_bound(&data)?;
let batches: Box<dyn arrow::array::RecordBatchReader + Send> =
Box::new(ArrowArrayStreamReader::from_pyarrow_bound(&data)?);
let mut builder = inner.create_table(name, batches).mode(mode);

View File

@@ -6,7 +6,7 @@ use std::sync::{Arc, Mutex};
use crate::{
arrow::RecordBatchStream, connection::Connection, error::PythonErrorExt, table::Table,
};
use arrow::pyarrow::ToPyArrow;
use arrow::pyarrow::{PyArrowType, ToPyArrow};
use lancedb::{
dataloader::permutation::{
builder::{PermutationBuilder as LancePermutationBuilder, ShuffleStrategy},
@@ -23,10 +23,25 @@ use pyo3::{
};
use pyo3_async_runtimes::tokio::future_into_py;
fn table_from_py<'a>(table: Bound<'a, PyAny>) -> PyResult<Bound<'a, Table>> {
if table.hasattr("_inner")? {
Ok(table.getattr("_inner")?.downcast_into::<Table>()?)
} else if table.hasattr("_table")? {
Ok(table
.getattr("_table")?
.getattr("_inner")?
.downcast_into::<Table>()?)
} else {
Err(PyRuntimeError::new_err(
"Provided table does not appear to be a Table or RemoteTable instance",
))
}
}
/// Create a permutation builder for the given table
#[pyo3::pyfunction]
pub fn async_permutation_builder(table: Bound<'_, PyAny>) -> PyResult<PyAsyncPermutationBuilder> {
let table = table.getattr("_inner")?.downcast_into::<Table>()?;
let table = table_from_py(table)?;
let inner_table = table.borrow().inner_ref()?.clone();
let inner_builder = LancePermutationBuilder::new(inner_table);
@@ -250,10 +265,8 @@ impl PyPermutationReader {
permutation_table: Option<Bound<'py, PyAny>>,
split: u64,
) -> PyResult<Bound<'py, PyAny>> {
let base_table = base_table.getattr("_inner")?.downcast_into::<Table>()?;
let permutation_table = permutation_table
.map(|p| PyResult::Ok(p.getattr("_inner")?.downcast_into::<Table>()?))
.transpose()?;
let base_table = table_from_py(base_table)?;
let permutation_table = permutation_table.map(table_from_py).transpose()?;
let base_table = base_table.borrow().inner_ref()?.base_table().clone();
let permutation_table = permutation_table
@@ -328,4 +341,21 @@ impl PyPermutationReader {
Ok(RecordBatchStream::new(stream))
})
}
#[pyo3(signature = (indices, *, selection=None))]
pub fn take_offsets<'py>(
slf: PyRef<'py, Self>,
indices: Vec<u64>,
selection: Option<Bound<'py, PyAny>>,
) -> PyResult<Bound<'py, PyAny>> {
let selection = Self::parse_selection(selection)?;
let reader = slf.reader.clone();
future_into_py(slf.py(), async move {
let batch = reader
.take_offsets(&indices, selection)
.await
.infer_error()?;
Ok(PyArrowType(batch))
})
}
}

View File

@@ -296,7 +296,8 @@ impl Table {
data: Bound<'_, PyAny>,
mode: String,
) -> PyResult<Bound<'a, PyAny>> {
let batches = ArrowArrayStreamReader::from_pyarrow_bound(&data)?;
let batches: Box<dyn arrow::array::RecordBatchReader + Send> =
Box::new(ArrowArrayStreamReader::from_pyarrow_bound(&data)?);
let mut op = self_.inner_ref()?.add(batches);
if mode == "append" {
op = op.mode(AddDataMode::Append);

5349
python/uv.lock generated Normal file

File diff suppressed because it is too large

View File

@@ -1,2 +1,2 @@
[toolchain]
channel = "1.90.0"
channel = "1.91.0"

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.26.1"
version = "0.27.0-beta.0"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true

View File

@@ -3,13 +3,12 @@
use std::{iter::once, sync::Arc};
use arrow_array::{Float64Array, Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_array::{Float64Array, Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use aws_config::Region;
use aws_sdk_bedrockruntime::Client;
use futures::StreamExt;
use lancedb::{
arrow::IntoArrow,
connect,
embeddings::{bedrock::BedrockEmbeddingFunction, EmbeddingDefinition, EmbeddingFunction},
query::{ExecutableQuery, QueryBase},
@@ -67,7 +66,7 @@ async fn main() -> Result<()> {
Ok(())
}
fn make_data() -> impl IntoArrow {
fn make_data() -> RecordBatch {
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Field::new("text", DataType::Utf8, false),
@@ -83,10 +82,9 @@ fn make_data() -> impl IntoArrow {
]);
let price = Float64Array::from(vec![10.0, 50.0, 100.0, 30.0]);
let schema = Arc::new(schema);
let rb = RecordBatch::try_new(
RecordBatch::try_new(
schema.clone(),
vec![Arc::new(id), Arc::new(text), Arc::new(price)],
)
.unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(rb)], schema))
.unwrap()
}

View File

@@ -3,12 +3,13 @@
use std::sync::Arc;
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, RecordBatchReader, StringArray};
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use lance_index::scalar::FullTextSearchQuery;
use lancedb::connection::Connection;
use lancedb::index::scalar::FtsIndexBuilder;
use lancedb::index::Index;
use lancedb::query::{ExecutableQuery, QueryBase};
@@ -29,7 +30,7 @@ async fn main() -> Result<()> {
Ok(())
}
fn create_some_records() -> Result<Box<dyn RecordBatchReader + Send>> {
fn create_some_records() -> Result<Box<dyn arrow_array::RecordBatchReader + Send>> {
const TOTAL: usize = 1000;
let schema = Arc::new(Schema::new(vec![
@@ -66,7 +67,7 @@ fn create_some_records() -> Result<Box<dyn RecordBatchReader + Send>> {
}
async fn create_table(db: &Connection) -> Result<Table> {
let initial_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
let initial_data = create_some_records()?;
let tbl = db.create_table("my_table", initial_data).execute().await?;
Ok(tbl)
}

View File

@@ -1,14 +1,13 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use arrow_array::{RecordBatch, RecordBatchIterator, StringArray};
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use lance_index::scalar::FullTextSearchQuery;
use lancedb::index::scalar::FtsIndexBuilder;
use lancedb::index::Index;
use lancedb::{
arrow::IntoArrow,
connect,
embeddings::{
sentence_transformers::SentenceTransformersEmbeddings, EmbeddingDefinition,
@@ -70,7 +69,7 @@ async fn main() -> Result<()> {
Ok(())
}
fn make_data() -> impl IntoArrow {
fn make_data() -> RecordBatch {
let schema = Schema::new(vec![Field::new("facts", DataType::Utf8, false)]);
let facts = StringArray::from_iter_values(vec![
@@ -101,8 +100,7 @@ fn make_data() -> impl IntoArrow {
"The first chatbot was ELIZA, created in the 1960s.",
]);
let schema = Arc::new(schema);
let rb = RecordBatch::try_new(schema.clone(), vec![Arc::new(facts)]).unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(rb)], schema))
RecordBatch::try_new(schema.clone(), vec![Arc::new(facts)]).unwrap()
}
async fn create_index(table: &Table) -> Result<()> {


@@ -8,13 +8,12 @@
use std::sync::Arc;
use arrow_array::types::Float32Type;
use arrow_array::{
FixedSizeListArray, Int32Array, RecordBatch, RecordBatchIterator, RecordBatchReader,
};
use arrow_array::{FixedSizeListArray, Int32Array, RecordBatch, RecordBatchIterator};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use lancedb::connection::Connection;
use lancedb::index::vector::IvfPqIndexBuilder;
use lancedb::index::Index;
use lancedb::query::{ExecutableQuery, QueryBase};
@@ -34,7 +33,7 @@ async fn main() -> Result<()> {
Ok(())
}
fn create_some_records() -> Result<Box<dyn RecordBatchReader + Send>> {
fn create_some_records() -> Result<Box<dyn arrow_array::RecordBatchReader + Send>> {
const TOTAL: usize = 1000;
const DIM: usize = 128;
@@ -73,9 +72,9 @@ fn create_some_records() -> Result<Box<dyn RecordBatchReader + Send>> {
}
async fn create_table(db: &Connection) -> Result<Table> {
let initial_data: Box<dyn RecordBatchReader + Send> = create_some_records()?;
let initial_data = create_some_records()?;
let tbl = db
.create_table("my_table", Box::new(initial_data))
.create_table("my_table", initial_data)
.execute()
.await
.unwrap();


@@ -5,11 +5,9 @@
use std::{iter::once, sync::Arc};
use arrow_array::{Float64Array, Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_schema::{DataType, Field, Schema};
use arrow_array::{RecordBatch, StringArray};
use futures::StreamExt;
use lancedb::{
arrow::IntoArrow,
connect,
embeddings::{openai::OpenAIEmbeddingFunction, EmbeddingDefinition, EmbeddingFunction},
query::{ExecutableQuery, QueryBase},
@@ -64,26 +62,20 @@ async fn main() -> Result<()> {
}
// --8<-- [end:openai_embeddings]
fn make_data() -> impl IntoArrow {
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Field::new("text", DataType::Utf8, false),
Field::new("price", DataType::Float64, false),
]);
let id = Int32Array::from(vec![1, 2, 3, 4]);
let text = StringArray::from_iter_values(vec![
"Black T-Shirt",
"Leather Jacket",
"Winter Parka",
"Hooded Sweatshirt",
]);
let price = Float64Array::from(vec![10.0, 50.0, 100.0, 30.0]);
let schema = Arc::new(schema);
let rb = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(id), Arc::new(text), Arc::new(price)],
fn make_data() -> RecordBatch {
arrow_array::record_batch!(
("id", Int32, [1, 2, 3, 4]),
(
"text",
Utf8,
[
"Black T-Shirt",
"Leather Jacket",
"Winter Parka",
"Hooded Sweatshirt"
]
),
("price", Float64, [10.0, 50.0, 100.0, 30.0])
)
.unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(rb)], schema))
.unwrap()
}


@@ -3,11 +3,10 @@
use std::{iter::once, sync::Arc};
use arrow_array::{RecordBatch, RecordBatchIterator, StringArray};
use arrow_array::{RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::StreamExt;
use lancedb::{
arrow::IntoArrow,
connect,
embeddings::{
sentence_transformers::SentenceTransformersEmbeddings, EmbeddingDefinition,
@@ -59,7 +58,7 @@ async fn main() -> Result<()> {
Ok(())
}
fn make_data() -> impl IntoArrow {
fn make_data() -> RecordBatch {
let schema = Schema::new(vec![Field::new("facts", DataType::Utf8, false)]);
let facts = StringArray::from_iter_values(vec![
@@ -90,6 +89,5 @@ fn make_data() -> impl IntoArrow {
"The first chatbot was ELIZA, created in the 1960s.",
]);
let schema = Arc::new(schema);
let rb = RecordBatch::try_new(schema.clone(), vec![Arc::new(facts)]).unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(rb)], schema))
RecordBatch::try_new(schema.clone(), vec![Arc::new(facts)]).unwrap()
}


@@ -8,11 +8,9 @@
use std::sync::Arc;
use arrow_array::types::Float32Type;
use arrow_array::{FixedSizeListArray, Int32Array, RecordBatch, RecordBatchIterator};
use arrow_array::{FixedSizeListArray, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use lancedb::arrow::IntoArrow;
use lancedb::connection::Connection;
use lancedb::index::Index;
use lancedb::query::{ExecutableQuery, QueryBase};
@@ -59,7 +57,7 @@ async fn open_with_existing_tbl() -> Result<()> {
Ok(())
}
fn create_some_records() -> Result<impl IntoArrow> {
fn create_some_records() -> Result<RecordBatch> {
const TOTAL: usize = 1000;
const DIM: usize = 128;
@@ -76,25 +74,18 @@ fn create_some_records() -> Result<impl IntoArrow> {
]));
// Create a RecordBatch stream.
let batches = RecordBatchIterator::new(
vec![RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from_iter_values(0..TOTAL as i32)),
Arc::new(
FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
(0..TOTAL).map(|_| Some(vec![Some(1.0); DIM])),
DIM as i32,
),
),
],
)
.unwrap()]
.into_iter()
.map(Ok),
Ok(RecordBatch::try_new(
schema.clone(),
);
Ok(Box::new(batches))
vec![
Arc::new(Int32Array::from_iter_values(0..TOTAL as i32)),
Arc::new(
FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
(0..TOTAL).map(|_| Some(vec![Some(1.0); DIM])),
DIM as i32,
),
),
],
)?)
}
async fn create_table(db: &Connection) -> Result<LanceDbTable> {


@@ -6,8 +6,8 @@
use std::collections::HashMap;
use std::sync::Arc;
use arrow_array::RecordBatchReader;
use arrow_schema::{Field, SchemaRef};
use arrow_array::RecordBatch;
use arrow_schema::SchemaRef;
use lance::dataset::ReadParams;
use lance_namespace::models::{
CreateNamespaceRequest, CreateNamespaceResponse, DescribeNamespaceRequest,
@@ -17,24 +17,20 @@ use lance_namespace::models::{
#[cfg(feature = "aws")]
use object_store::aws::AwsCredential;
use crate::arrow::{IntoArrow, IntoArrowStream, SendableRecordBatchStream};
use crate::database::listing::{
ListingDatabase, OPT_NEW_TABLE_STORAGE_VERSION, OPT_NEW_TABLE_V2_MANIFEST_PATHS,
};
use crate::connection::create_table::CreateTableBuilder;
use crate::data::scannable::Scannable;
use crate::database::listing::ListingDatabase;
use crate::database::{
CloneTableRequest, CreateTableData, CreateTableMode, CreateTableRequest, Database,
DatabaseOptions, OpenTableRequest, ReadConsistency, TableNamesRequest,
};
use crate::embeddings::{
EmbeddingDefinition, EmbeddingFunction, EmbeddingRegistry, MemoryRegistry, WithEmbeddings,
CloneTableRequest, Database, DatabaseOptions, OpenTableRequest, ReadConsistency,
TableNamesRequest,
};
use crate::embeddings::{EmbeddingRegistry, MemoryRegistry};
use crate::error::{Error, Result};
#[cfg(feature = "remote")]
use crate::remote::{
client::ClientConfig,
db::{OPT_REMOTE_API_KEY, OPT_REMOTE_HOST_OVERRIDE, OPT_REMOTE_REGION},
};
use crate::table::{TableDefinition, WriteOptions};
use crate::Table;
use lance::io::ObjectStoreParams;
pub use lance_encoding::version::LanceFileVersion;
@@ -42,6 +38,8 @@ pub use lance_encoding::version::LanceFileVersion;
use lance_io::object_store::StorageOptions;
use lance_io::object_store::{StorageOptionsAccessor, StorageOptionsProvider};
mod create_table;
fn merge_storage_options(
store_params: &mut ObjectStoreParams,
pairs: impl IntoIterator<Item = (String, String)>,
@@ -116,337 +114,6 @@ impl TableNamesBuilder {
}
}
pub struct NoData {}
impl IntoArrow for NoData {
fn into_arrow(self) -> Result<Box<dyn arrow_array::RecordBatchReader + Send>> {
unreachable!("NoData should never be converted to Arrow")
}
}
// Stores the value given from the initial CreateTableBuilder::new call
// and defers errors until `execute` is called
enum CreateTableBuilderInitialData {
None,
Iterator(Result<Box<dyn RecordBatchReader + Send>>),
Stream(Result<SendableRecordBatchStream>),
}
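
The comment above says the old builder stored the result of the initial conversion and deferred any error until `execute`. A minimal std-only sketch of that pattern follows. `DeferredBuilder` and `ParseError` are hypothetical names invented for illustration, and string parsing stands in for the Arrow conversion.

```rust
// Hypothetical stand-ins; none of these names come from the LanceDB source.
#[derive(Debug, PartialEq)]
struct ParseError(String);

/// Builder that stores a fallible conversion result instead of failing in `new`.
struct DeferredBuilder {
    name: String,
    data: Result<Vec<i32>, ParseError>,
}

impl DeferredBuilder {
    /// Infallible constructor: the conversion runs here, but its error is kept.
    fn new(name: &str, raw: &[&str]) -> Self {
        let data = raw
            .iter()
            .map(|s| s.parse::<i32>().map_err(|_| ParseError(s.to_string())))
            .collect::<Result<Vec<_>, _>>();
        Self {
            name: name.to_string(),
            data,
        }
    }

    /// The stored error, if any, surfaces only at this point.
    fn execute(self) -> Result<(String, usize), ParseError> {
        let rows = self.data?;
        Ok((self.name, rows.len()))
    }
}
```

The trade-off is that `new` keeps an infallible signature while the caller still sees the failure, just later in the chain.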
/// A builder for configuring a [`Connection::create_table`] operation
pub struct CreateTableBuilder<const HAS_DATA: bool> {
parent: Arc<dyn Database>,
embeddings: Vec<(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)>,
embedding_registry: Arc<dyn EmbeddingRegistry>,
request: CreateTableRequest,
// This is a bit clumsy but we defer errors until `execute` is called
// to maintain backwards compatibility
data: CreateTableBuilderInitialData,
}
// Builder methods that only apply when we have initial data
impl CreateTableBuilder<true> {
fn new<T: IntoArrow>(
parent: Arc<dyn Database>,
name: String,
data: T,
embedding_registry: Arc<dyn EmbeddingRegistry>,
) -> Self {
let dummy_schema = Arc::new(arrow_schema::Schema::new(Vec::<Field>::default()));
Self {
parent,
request: CreateTableRequest::new(
name,
CreateTableData::Empty(TableDefinition::new_from_schema(dummy_schema)),
),
embeddings: Vec::new(),
embedding_registry,
data: CreateTableBuilderInitialData::Iterator(data.into_arrow()),
}
}
fn new_streaming<T: IntoArrowStream>(
parent: Arc<dyn Database>,
name: String,
data: T,
embedding_registry: Arc<dyn EmbeddingRegistry>,
) -> Self {
let dummy_schema = Arc::new(arrow_schema::Schema::new(Vec::<Field>::default()));
Self {
parent,
request: CreateTableRequest::new(
name,
CreateTableData::Empty(TableDefinition::new_from_schema(dummy_schema)),
),
embeddings: Vec::new(),
embedding_registry,
data: CreateTableBuilderInitialData::Stream(data.into_arrow()),
}
}
/// Execute the create table operation
pub async fn execute(self) -> Result<Table> {
let embedding_registry = self.embedding_registry.clone();
let parent = self.parent.clone();
let request = self.into_request()?;
Ok(Table::new_with_embedding_registry(
parent.create_table(request).await?,
parent,
embedding_registry,
))
}
fn into_request(self) -> Result<CreateTableRequest> {
if self.embeddings.is_empty() {
match self.data {
CreateTableBuilderInitialData::Iterator(maybe_iter) => {
let data = maybe_iter?;
Ok(CreateTableRequest {
data: CreateTableData::Data(data),
..self.request
})
}
CreateTableBuilderInitialData::None => {
unreachable!("No data provided for CreateTableBuilder<true>")
}
CreateTableBuilderInitialData::Stream(maybe_stream) => {
let data = maybe_stream?;
Ok(CreateTableRequest {
data: CreateTableData::StreamingData(data),
..self.request
})
}
}
} else {
let CreateTableBuilderInitialData::Iterator(maybe_iter) = self.data else {
return Err(Error::NotSupported { message: "Creating a table with embeddings is currently not supported when the input is streaming".to_string() });
};
let data = maybe_iter?;
let data = Box::new(WithEmbeddings::new(data, self.embeddings));
Ok(CreateTableRequest {
data: CreateTableData::Data(data),
..self.request
})
}
}
}
// Builder methods that only apply when we do not have initial data
impl CreateTableBuilder<false> {
fn new(
parent: Arc<dyn Database>,
name: String,
schema: SchemaRef,
embedding_registry: Arc<dyn EmbeddingRegistry>,
) -> Self {
let table_definition = TableDefinition::new_from_schema(schema);
Self {
parent,
request: CreateTableRequest::new(name, CreateTableData::Empty(table_definition)),
data: CreateTableBuilderInitialData::None,
embeddings: Vec::default(),
embedding_registry,
}
}
/// Execute the create table operation
pub async fn execute(self) -> Result<Table> {
let parent = self.parent.clone();
let embedding_registry = self.embedding_registry.clone();
let request = self.into_request()?;
Ok(Table::new_with_embedding_registry(
parent.create_table(request).await?,
parent,
embedding_registry,
))
}
fn into_request(self) -> Result<CreateTableRequest> {
if self.embeddings.is_empty() {
return Ok(self.request);
}
let CreateTableData::Empty(table_def) = self.request.data else {
unreachable!("CreateTableBuilder<false> should always have Empty data")
};
let schema = table_def.schema.clone();
let empty_batch = arrow_array::RecordBatch::new_empty(schema.clone());
let reader = Box::new(std::iter::once(Ok(empty_batch)).collect::<Vec<_>>());
let reader = arrow_array::RecordBatchIterator::new(reader.into_iter(), schema);
let with_embeddings = WithEmbeddings::new(reader, self.embeddings);
let table_definition = with_embeddings.table_definition()?;
Ok(CreateTableRequest {
data: CreateTableData::Empty(table_definition),
..self.request
})
}
}
impl<const HAS_DATA: bool> CreateTableBuilder<HAS_DATA> {
/// Set the mode for creating the table
///
/// This controls what happens if a table with the given name already exists
pub fn mode(mut self, mode: CreateTableMode) -> Self {
self.request.mode = mode;
self
}
/// Apply the given write options when writing the initial data
pub fn write_options(mut self, write_options: WriteOptions) -> Self {
self.request.write_options = write_options;
self
}
/// Set an option for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.com/docs/storage/>
pub fn storage_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default());
merge_storage_options(store_params, [(key.into(), value.into())]);
self
}
/// Set multiple options for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.com/docs/storage/>
pub fn storage_options(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default());
let updates = pairs
.into_iter()
.map(|(key, value)| (key.into(), value.into()));
merge_storage_options(store_params, updates);
self
}
/// Add an embedding definition to the table.
///
/// The `embedding_name` must match the name of an embedding function that
/// was previously registered with the connection's [`EmbeddingRegistry`].
pub fn add_embedding(mut self, definition: EmbeddingDefinition) -> Result<Self> {
// Early verification of the embedding name
let embedding_func = self
.embedding_registry
.get(&definition.embedding_name)
.ok_or_else(|| Error::EmbeddingFunctionNotFound {
name: definition.embedding_name.clone(),
reason: "No embedding function found in the connection's embedding_registry"
.to_string(),
})?;
self.embeddings.push((definition, embedding_func));
Ok(self)
}
/// Set whether to use V2 manifest paths for the table. (default: false)
///
/// These paths provide more efficient opening of tables with many
/// versions on object stores.
///
/// <div class="warning">Turning this on will make the dataset unreadable
/// for older versions of LanceDB (prior to 0.10.0).</div>
///
/// To migrate an existing dataset, instead use the
/// [`NativeTable::migrate_manifest_paths_v2`].
///
/// This has no effect in LanceDB Cloud.
#[deprecated(since = "0.15.1", note = "Use `database_options` instead")]
pub fn enable_v2_manifest_paths(mut self, use_v2_manifest_paths: bool) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert_with(Default::default)
.store_params
.get_or_insert_with(Default::default);
let value = if use_v2_manifest_paths {
"true".to_string()
} else {
"false".to_string()
};
merge_storage_options(
store_params,
[(OPT_NEW_TABLE_V2_MANIFEST_PATHS.to_string(), value)],
);
self
}
/// Set the data storage version.
///
/// The default is `LanceFileVersion::Stable`.
#[deprecated(since = "0.15.1", note = "Use `database_options` instead")]
pub fn data_storage_version(mut self, data_storage_version: LanceFileVersion) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert_with(Default::default)
.store_params
.get_or_insert_with(Default::default);
merge_storage_options(
store_params,
[(
OPT_NEW_TABLE_STORAGE_VERSION.to_string(),
data_storage_version.to_string(),
)],
);
self
}
/// Set the namespace for the table
pub fn namespace(mut self, namespace: Vec<String>) -> Self {
self.request.namespace = namespace;
self
}
/// Set a custom location for the table.
///
/// If not set, the database will derive a location from its URI and the table name.
/// This is useful when integrating with namespace systems that manage table locations.
pub fn location(mut self, location: impl Into<String>) -> Self {
self.request.location = Some(location.into());
self
}
/// Set a storage options provider for automatic credential refresh.
///
/// This allows tables to automatically refresh cloud storage credentials
/// when they expire, enabling long-running operations on remote storage.
pub fn storage_options_provider(mut self, provider: Arc<dyn StorageOptionsProvider>) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default());
set_storage_options_provider(store_params, provider);
self
}
}
#[derive(Clone, Debug)]
pub struct OpenTableBuilder {
parent: Arc<dyn Database>,
@@ -684,35 +351,17 @@ impl Connection {
///
/// * `name` - The name of the table
/// * `initial_data` - The initial data to write to the table
pub fn create_table<T: IntoArrow>(
pub fn create_table<T: Scannable + 'static>(
&self,
name: impl Into<String>,
initial_data: T,
) -> CreateTableBuilder<true> {
CreateTableBuilder::<true>::new(
) -> CreateTableBuilder {
let initial_data = Box::new(initial_data);
CreateTableBuilder::new(
self.internal.clone(),
self.embedding_registry.clone(),
name.into(),
initial_data,
self.embedding_registry.clone(),
)
}
/// Create a new table from a stream of data
///
/// # Parameters
///
/// * `name` - The name of the table
/// * `initial_data` - The initial data to write to the table
pub fn create_table_streaming<T: IntoArrowStream>(
&self,
name: impl Into<String>,
initial_data: T,
) -> CreateTableBuilder<true> {
CreateTableBuilder::<true>::new_streaming(
self.internal.clone(),
name.into(),
initial_data,
self.embedding_registry.clone(),
)
}
@@ -726,13 +375,9 @@ impl Connection {
&self,
name: impl Into<String>,
schema: SchemaRef,
) -> CreateTableBuilder<false> {
CreateTableBuilder::<false>::new(
self.internal.clone(),
name.into(),
schema,
self.embedding_registry.clone(),
)
) -> CreateTableBuilder {
let empty_batch = RecordBatch::new_empty(schema);
self.create_table(name, empty_batch)
}
/// Open an existing table in the database
@@ -1349,20 +994,11 @@ mod test_utils {
#[cfg(test)]
mod tests {
use crate::database::listing::{ListingDatabaseOptions, NewTableConfig};
use crate::query::QueryBase;
use crate::query::{ExecutableQuery, QueryExecutionOptions};
use crate::test_utils::connection::new_test_connection;
use arrow::compute::concat_batches;
use arrow_array::RecordBatchReader;
use arrow_schema::{DataType, Field, Schema};
use datafusion_physical_plan::stream::RecordBatchStreamAdapter;
use futures::{stream, TryStreamExt};
use lance_core::error::{ArrowResult, DataFusionResult};
use lance_testing::datagen::{BatchGenerator, IncrementingInt32};
use tempfile::tempdir;
use crate::arrow::SimpleRecordBatchStream;
use crate::test_utils::connection::new_test_connection;
use super::*;
@@ -1478,139 +1114,6 @@ mod tests {
assert_eq!(tables, vec!["table1".to_owned()]);
}
fn make_data() -> Box<dyn RecordBatchReader + Send + 'static> {
let id = Box::new(IncrementingInt32::new().named("id".to_string()));
Box::new(BatchGenerator::new().col(id).batches(10, 2000))
}
#[tokio::test]
async fn test_create_table_v2() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri)
.database_options(&ListingDatabaseOptions {
new_table_config: NewTableConfig {
data_storage_version: Some(LanceFileVersion::Legacy),
..Default::default()
},
..Default::default()
})
.execute()
.await
.unwrap();
let tbl = db
.create_table("v1_test", make_data())
.execute()
.await
.unwrap();
// In v1 the row group size will trump max_batch_length
let batches = tbl
.query()
.limit(20000)
.execute_with_options(QueryExecutionOptions {
max_batch_length: 50000,
..Default::default()
})
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
assert_eq!(batches.len(), 20);
let db = connect(uri)
.database_options(&ListingDatabaseOptions {
new_table_config: NewTableConfig {
data_storage_version: Some(LanceFileVersion::Stable),
..Default::default()
},
..Default::default()
})
.execute()
.await
.unwrap();
let tbl = db
.create_table("v2_test", make_data())
.execute()
.await
.unwrap();
// In v2 the page size is much bigger than 50k so we should get a single batch
let batches = tbl
.query()
.execute_with_options(QueryExecutionOptions {
max_batch_length: 50000,
..Default::default()
})
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
assert_eq!(batches.len(), 1);
}
#[tokio::test]
async fn test_create_table_streaming() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
let batches = make_data().collect::<ArrowResult<Vec<_>>>().unwrap();
let schema = batches.first().unwrap().schema();
let one_batch = concat_batches(&schema, batches.iter()).unwrap();
let ldb_stream = stream::iter(batches.clone().into_iter().map(Result::Ok));
let ldb_stream: SendableRecordBatchStream =
Box::pin(SimpleRecordBatchStream::new(ldb_stream, schema.clone()));
let tbl1 = db
.create_table_streaming("one", ldb_stream)
.execute()
.await
.unwrap();
let df_stream = stream::iter(batches.into_iter().map(DataFusionResult::Ok));
let df_stream: datafusion_physical_plan::SendableRecordBatchStream =
Box::pin(RecordBatchStreamAdapter::new(schema.clone(), df_stream));
let tbl2 = db
.create_table_streaming("two", df_stream)
.execute()
.await
.unwrap();
let tbl1_data = tbl1
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let tbl1_data = concat_batches(&schema, tbl1_data.iter()).unwrap();
assert_eq!(tbl1_data, one_batch);
let tbl2_data = tbl2
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let tbl2_data = concat_batches(&schema, tbl2_data.iter()).unwrap();
assert_eq!(tbl2_data, one_batch);
}
#[tokio::test]
async fn drop_table() {
let tc = new_test_connection().await.unwrap();
@@ -1640,41 +1143,6 @@ mod tests {
assert_eq!(tables.len(), 0);
}
#[tokio::test]
async fn test_create_table_already_exists() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
db.create_empty_table("test", schema.clone())
.execute()
.await
.unwrap();
// TODO: None of the open table options are "inspectable" right now but once one is we
// should assert we are passing these options in correctly
db.create_empty_table("test", schema)
.mode(CreateTableMode::exist_ok(|mut req| {
req.index_cache_size = Some(16);
req
}))
.execute()
.await
.unwrap();
let other_schema = Arc::new(Schema::new(vec![Field::new("y", DataType::Int32, false)]));
assert!(db
.create_empty_table("test", other_schema.clone())
.execute()
.await
.is_err());
let overwritten = db
.create_empty_table("test", other_schema.clone())
.mode(CreateTableMode::Overwrite)
.execute()
.await
.unwrap();
assert_eq!(other_schema, overwritten.schema().await.unwrap());
}
#[tokio::test]
async fn test_clone_table() {
let tmp_dir = tempdir().unwrap();
@@ -1685,7 +1153,8 @@ mod tests {
let mut batch_gen = BatchGenerator::new()
.col(Box::new(IncrementingInt32::new().named("id")))
.col(Box::new(IncrementingInt32::new().named("value")));
let reader = batch_gen.batches(5, 100);
let reader: Box<dyn arrow_array::RecordBatchReader + Send> =
Box::new(batch_gen.batches(5, 100));
let source_table = db
.create_table("source_table", reader)
@@ -1720,128 +1189,4 @@ mod tests {
let cloned_count = cloned_table.count_rows(None).await.unwrap();
assert_eq!(source_count, cloned_count);
}
#[tokio::test]
async fn test_create_empty_table_with_embeddings() {
use crate::embeddings::{EmbeddingDefinition, EmbeddingFunction};
use arrow_array::{
Array, FixedSizeListArray, Float32Array, RecordBatch, RecordBatchIterator, StringArray,
};
use std::borrow::Cow;
#[derive(Debug, Clone)]
struct MockEmbedding {
dim: usize,
}
impl EmbeddingFunction for MockEmbedding {
fn name(&self) -> &str {
"test_embedding"
}
fn source_type(&self) -> Result<Cow<'_, DataType>> {
Ok(Cow::Owned(DataType::Utf8))
}
fn dest_type(&self) -> Result<Cow<'_, DataType>> {
Ok(Cow::Owned(DataType::new_fixed_size_list(
DataType::Float32,
self.dim as i32,
true,
)))
}
fn compute_source_embeddings(&self, source: Arc<dyn Array>) -> Result<Arc<dyn Array>> {
let len = source.len();
let values = vec![1.0f32; len * self.dim];
let values = Arc::new(Float32Array::from(values));
let field = Arc::new(Field::new("item", DataType::Float32, true));
Ok(Arc::new(FixedSizeListArray::new(
field,
self.dim as i32,
values,
None,
)))
}
fn compute_query_embeddings(&self, _input: Arc<dyn Array>) -> Result<Arc<dyn Array>> {
unimplemented!()
}
}
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
let embed_func = Arc::new(MockEmbedding { dim: 128 });
db.embedding_registry()
.register("test_embedding", embed_func.clone())
.unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, true)]));
let ed = EmbeddingDefinition {
source_column: "name".to_owned(),
dest_column: Some("name_embedding".to_owned()),
embedding_name: "test_embedding".to_owned(),
};
let table = db
.create_empty_table("test", schema)
.mode(CreateTableMode::Overwrite)
.add_embedding(ed)
.unwrap()
.execute()
.await
.unwrap();
let table_schema = table.schema().await.unwrap();
assert!(table_schema.column_with_name("name").is_some());
assert!(table_schema.column_with_name("name_embedding").is_some());
let embedding_field = table_schema.field_with_name("name_embedding").unwrap();
assert_eq!(
embedding_field.data_type(),
&DataType::new_fixed_size_list(DataType::Float32, 128, true)
);
let input_schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, true)]));
let input_batch = RecordBatch::try_new(
input_schema.clone(),
vec![Arc::new(StringArray::from(vec![
Some("Alice"),
Some("Bob"),
Some("Charlie"),
]))],
)
.unwrap();
let input_reader = Box::new(RecordBatchIterator::new(
vec![Ok(input_batch)].into_iter(),
input_schema,
));
table.add(input_reader).execute().await.unwrap();
let results = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
assert_eq!(results.len(), 1);
let batch = &results[0];
assert_eq!(batch.num_rows(), 3);
assert!(batch.column_by_name("name_embedding").is_some());
let embedding_col = batch
.column_by_name("name_embedding")
.unwrap()
.as_any()
.downcast_ref::<FixedSizeListArray>()
.unwrap();
assert_eq!(embedding_col.len(), 3);
}
}
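
In this file, `create_table` changes from `T: IntoArrow` to `T: Scannable + 'static`, and `create_empty_table` now just passes an empty `RecordBatch` to it. The sketch below models that shape with std-only toy types. `Batch`, `Source`, and both functions are hypothetical stand-ins, not the arrow or LanceDB definitions, and the real `Scannable` trait may look quite different.

```rust
// Toy model only: `Batch`, `Source`, and `create_table` are hypothetical
// stand-ins, not the arrow or LanceDB types.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<String>,
    rows: Vec<Vec<i64>>,
}

/// Anything that can describe its schema and hand over its batches.
trait Source {
    fn schema(&self) -> Vec<String>;
    fn batches(self: Box<Self>) -> Vec<Batch>;
}

impl Source for Batch {
    fn schema(&self) -> Vec<String> {
        self.columns.clone()
    }
    fn batches(self: Box<Self>) -> Vec<Batch> {
        vec![*self]
    }
}

impl Source for Vec<Batch> {
    fn schema(&self) -> Vec<String> {
        self.first().map(|b| b.columns.clone()).unwrap_or_default()
    }
    fn batches(self: Box<Self>) -> Vec<Batch> {
        *self
    }
}

/// Single entry point: box the input once, then work with the trait object.
fn create_table<T: Source + 'static>(data: T) -> (Vec<String>, usize) {
    let data: Box<dyn Source> = Box::new(data);
    let schema = data.schema();
    let row_count: usize = data.batches().iter().map(|b| b.rows.len()).sum();
    (schema, row_count)
}

/// An "empty table" is the same call with a zero-row batch.
fn create_empty_table(columns: Vec<String>) -> (Vec<String>, usize) {
    create_table(Batch {
        columns,
        rows: Vec::new(),
    })
}
```

With one trait covering single batches, vectors of batches, and empty input, the builder no longer needs a `HAS_DATA` const parameter or a separate streaming constructor.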


@@ -0,0 +1,612 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::Arc;
use lance_io::object_store::StorageOptionsProvider;
use crate::{
connection::{merge_storage_options, set_storage_options_provider},
data::scannable::{Scannable, WithEmbeddingsScannable},
database::{CreateTableMode, CreateTableRequest, Database},
embeddings::{EmbeddingDefinition, EmbeddingFunction, EmbeddingRegistry},
table::WriteOptions,
Error, Result, Table,
};
pub struct CreateTableBuilder {
parent: Arc<dyn Database>,
embeddings: Vec<(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)>,
embedding_registry: Arc<dyn EmbeddingRegistry>,
request: CreateTableRequest,
}
impl CreateTableBuilder {
pub(super) fn new(
parent: Arc<dyn Database>,
embedding_registry: Arc<dyn EmbeddingRegistry>,
name: String,
data: Box<dyn Scannable>,
) -> Self {
Self {
parent,
embeddings: Vec::new(),
embedding_registry,
request: CreateTableRequest::new(name, data),
}
}
/// Set the mode for creating the table
///
/// This controls what happens if a table with the given name already exists
pub fn mode(mut self, mode: CreateTableMode) -> Self {
self.request.mode = mode;
self
}
/// Apply the given write options when writing the initial data
pub fn write_options(mut self, write_options: WriteOptions) -> Self {
self.request.write_options = write_options;
self
}
/// Set an option for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.com/docs/storage/>
pub fn storage_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default());
merge_storage_options(store_params, [(key.into(), value.into())]);
self
}
/// Set multiple options for the storage layer.
///
/// Options already set on the connection will be inherited by the table,
/// but can be overridden here.
///
/// See available options at <https://lancedb.com/docs/storage/>
pub fn storage_options(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default());
let updates = pairs
.into_iter()
.map(|(key, value)| (key.into(), value.into()));
merge_storage_options(store_params, updates);
self
}
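
Both storage-option setters walk two nested `Option` fields and create defaults on the way before merging key/value pairs. Here is a std-only sketch of that chain. The struct layout is simplified from the field names in the diff, and the overwrite-on-duplicate merge rule is an assumption, since the body of `merge_storage_options` is not shown here.

```rust
use std::collections::HashMap;

// Hypothetical mirrors of the nested optional structs; the field names follow
// the diff, but the types here are simplified stand-ins.
#[derive(Default, Debug)]
struct StoreParams {
    storage_options: HashMap<String, String>,
}

#[derive(Default, Debug)]
struct WriteParams {
    store_params: Option<StoreParams>,
}

#[derive(Default, Debug)]
struct WriteOptions {
    lance_write_params: Option<WriteParams>,
}

/// Assumed merge semantics: later pairs overwrite earlier values for a key.
fn merge_storage_options(
    params: &mut StoreParams,
    pairs: impl IntoIterator<Item = (String, String)>,
) {
    params.storage_options.extend(pairs);
}

/// Create each missing level with its default, then merge one pair.
fn set_option(opts: &mut WriteOptions, key: impl Into<String>, value: impl Into<String>) {
    let store_params = opts
        .lance_write_params
        .get_or_insert_with(Default::default)
        .store_params
        .get_or_insert_with(Default::default);
    merge_storage_options(store_params, [(key.into(), value.into())]);
}
```

`get_or_insert_with` only builds the default when the field is `None`, whereas `get_or_insert(Default::default())` builds it eagerly on every call. For cheap defaults the difference is negligible.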
/// Add an embedding definition to the table.
///
/// The `embedding_name` must match the name of an embedding function that
/// was previously registered with the connection's [`EmbeddingRegistry`].
pub fn add_embedding(mut self, definition: EmbeddingDefinition) -> Result<Self> {
// Early verification of the embedding name
let embedding_func = self
.embedding_registry
.get(&definition.embedding_name)
.ok_or_else(|| Error::EmbeddingFunctionNotFound {
name: definition.embedding_name.clone(),
reason: "No embedding function found in the connection's embedding_registry"
.to_string(),
})?;
self.embeddings.push((definition, embedding_func));
Ok(self)
}
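
`add_embedding` returns `Result<Self>`, so an unknown embedding name fails at the call site rather than inside `execute`. A std-only sketch of that early check follows. `Builder`, `BuildError`, and the name-to-dimension registry are hypothetical simplifications of the types in the diff.

```rust
use std::collections::HashMap;

// Hypothetical registry and error type, simplified from the shape in the diff.
#[derive(Debug, PartialEq)]
enum BuildError {
    FunctionNotFound { name: String },
}

struct Builder<'a> {
    registry: &'a HashMap<String, usize>, // name -> embedding dimension
    embeddings: Vec<(String, usize)>,
}

impl<'a> Builder<'a> {
    fn new(registry: &'a HashMap<String, usize>) -> Self {
        Self {
            registry,
            embeddings: Vec::new(),
        }
    }

    /// Fails immediately on an unknown name, so the chain stops before `execute`.
    fn add_embedding(mut self, name: &str) -> Result<Self, BuildError> {
        let dim = *self
            .registry
            .get(name)
            .ok_or_else(|| BuildError::FunctionNotFound {
                name: name.to_string(),
            })?;
        self.embeddings.push((name.to_string(), dim));
        Ok(self)
    }
}
```

Because the method consumes and returns the builder inside a `Result`, callers chain it with `?` or `unwrap()` before continuing, as the tests in this diff do.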
/// Set the namespace for the table
pub fn namespace(mut self, namespace: Vec<String>) -> Self {
self.request.namespace = namespace;
self
}
/// Set a custom location for the table.
///
/// If not set, the database will derive a location from its URI and the table name.
/// This is useful when integrating with namespace systems that manage table locations.
pub fn location(mut self, location: impl Into<String>) -> Self {
self.request.location = Some(location.into());
self
}
/// Set a storage options provider for automatic credential refresh.
///
/// This allows tables to automatically refresh cloud storage credentials
/// when they expire, enabling long-running operations on remote storage.
pub fn storage_options_provider(mut self, provider: Arc<dyn StorageOptionsProvider>) -> Self {
let store_params = self
.request
.write_options
.lance_write_params
.get_or_insert(Default::default())
.store_params
.get_or_insert(Default::default());
set_storage_options_provider(store_params, provider);
self
}
/// Execute the create table operation
pub async fn execute(mut self) -> Result<Table> {
let embedding_registry = self.embedding_registry.clone();
let parent = self.parent.clone();
// If embeddings were configured via add_embedding(), wrap the data
if !self.embeddings.is_empty() {
let wrapped_data: Box<dyn Scannable> = Box::new(WithEmbeddingsScannable::try_new(
self.request.data,
self.embeddings,
)?);
self.request.data = wrapped_data;
}
Ok(Table::new_with_embedding_registry(
parent.create_table(self.request).await?,
parent,
embedding_registry,
))
}
}
#[cfg(test)]
mod tests {
use arrow_array::{
record_batch, Array, FixedSizeListArray, Float32Array, RecordBatch, RecordBatchIterator,
};
use arrow_schema::{ArrowError, DataType, Field, Schema};
use futures::TryStreamExt;
use lance_file::version::LanceFileVersion;
use tempfile::tempdir;
use crate::{
arrow::{SendableRecordBatchStream, SimpleRecordBatchStream},
connect,
database::listing::{ListingDatabaseOptions, NewTableConfig},
embeddings::{EmbeddingDefinition, EmbeddingFunction, MemoryRegistry},
query::{ExecutableQuery, QueryBase, Select},
test_utils::embeddings::MockEmbed,
};
use std::borrow::Cow;
use super::*;
#[tokio::test]
async fn create_empty_table() {
let db = connect("memory://").execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int64, false),
Field::new("value", DataType::Float64, false),
]));
db.create_empty_table("name", schema.clone())
.execute()
.await
.unwrap();
let table = db.open_table("name").execute().await.unwrap();
assert_eq!(table.schema().await.unwrap(), schema);
assert_eq!(table.count_rows(None).await.unwrap(), 0);
}
async fn test_create_table_with_data<T>(data: T)
where
T: Scannable + 'static,
{
let db = connect("memory://").execute().await.unwrap();
let schema = data.schema();
db.create_table("data_table", data).execute().await.unwrap();
let table = db.open_table("data_table").execute().await.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 3);
assert_eq!(table.schema().await.unwrap(), schema);
}
#[tokio::test]
async fn create_table_with_batch() {
let batch = record_batch!(("id", Int64, [1, 2, 3])).unwrap();
test_create_table_with_data(batch).await;
}
#[tokio::test]
async fn test_create_table_with_vec_batch() {
let data = vec![
record_batch!(("id", Int64, [1, 2])).unwrap(),
record_batch!(("id", Int64, [3])).unwrap(),
];
test_create_table_with_data(data).await;
}
#[tokio::test]
async fn test_create_table_with_record_batch_reader() {
let data = vec![
record_batch!(("id", Int64, [1, 2])).unwrap(),
record_batch!(("id", Int64, [3])).unwrap(),
];
let schema = data[0].schema();
let reader: Box<dyn arrow_array::RecordBatchReader + Send> = Box::new(
RecordBatchIterator::new(data.into_iter().map(Ok), schema.clone()),
);
test_create_table_with_data(reader).await;
}
#[tokio::test]
async fn test_create_table_with_stream() {
let data = vec![
record_batch!(("id", Int64, [1, 2])).unwrap(),
record_batch!(("id", Int64, [3])).unwrap(),
];
let schema = data[0].schema();
let inner = futures::stream::iter(data.into_iter().map(Ok));
let stream: SendableRecordBatchStream = Box::pin(SimpleRecordBatchStream {
schema,
stream: inner,
});
test_create_table_with_data(stream).await;
}
#[derive(Debug)]
struct MyError;
impl std::fmt::Display for MyError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "MyError occurred")
}
}
impl std::error::Error for MyError {}
#[tokio::test]
async fn test_create_preserves_reader_error() {
let first_batch = record_batch!(("id", Int64, [1, 2])).unwrap();
let schema = first_batch.schema();
let iterator = vec![
Ok(first_batch),
Err(ArrowError::ExternalError(Box::new(MyError))),
];
let reader: Box<dyn arrow_array::RecordBatchReader + Send> = Box::new(
RecordBatchIterator::new(iterator.into_iter(), schema.clone()),
);
let db = connect("memory://").execute().await.unwrap();
let result = db.create_table("failing_table", reader).execute().await;
assert!(result.is_err());
// TODO: when we upgrade to Lance 2.0.0, this should pass
// assert!(matches!(result, Err(Error::External { source})
// if source.downcast_ref::<MyError>().is_some()
// ));
}
#[tokio::test]
async fn test_create_preserves_stream_error() {
let first_batch = record_batch!(("id", Int64, [1, 2])).unwrap();
let schema = first_batch.schema();
let iterator = vec![
Ok(first_batch),
Err(Error::External {
source: Box::new(MyError),
}),
];
let stream = futures::stream::iter(iterator);
let stream: SendableRecordBatchStream = Box::pin(SimpleRecordBatchStream {
schema: schema.clone(),
stream,
});
let db = connect("memory://").execute().await.unwrap();
let result = db
.create_table("failing_stream_table", stream)
.execute()
.await;
assert!(result.is_err());
// TODO: when we upgrade to Lance 2.0.0, this should pass
// assert!(matches!(result, Err(Error::External { source})
// if source.downcast_ref::<MyError>().is_some()
// ));
}
#[tokio::test]
#[allow(deprecated)]
async fn test_create_table_with_storage_options() {
let batch = record_batch!(("id", Int64, [1, 2, 3])).unwrap();
let db = connect("memory://").execute().await.unwrap();
let table = db
.create_table("options_table", batch)
.storage_option("timeout", "30s")
.storage_options([("retry_count", "3")])
.execute()
.await
.unwrap();
let final_options = table.storage_options().await.unwrap();
assert_eq!(final_options.get("timeout"), Some(&"30s".to_string()));
assert_eq!(final_options.get("retry_count"), Some(&"3".to_string()));
}
#[tokio::test]
async fn test_create_table_unregistered_embedding() {
let db = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("text", Utf8, ["hello", "world"])).unwrap();
// Try to add an embedding that doesn't exist in the registry
let result = db
.create_table("embed_table", batch)
.add_embedding(EmbeddingDefinition::new(
"text",
"nonexistent_embedding_function",
None::<&str>,
));
match result {
Err(Error::EmbeddingFunctionNotFound { name, .. }) => {
assert_eq!(name, "nonexistent_embedding_function");
}
Err(other) => panic!("Expected EmbeddingFunctionNotFound error, got: {:?}", other),
Ok(_) => panic!("Expected error, but got Ok"),
}
}
#[tokio::test]
async fn test_create_table_already_exists() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
db.create_empty_table("test", schema.clone())
.execute()
.await
.unwrap();
db.create_empty_table("test", schema)
.mode(CreateTableMode::exist_ok(|mut req| {
req.index_cache_size = Some(16);
req
}))
.execute()
.await
.unwrap();
let other_schema = Arc::new(Schema::new(vec![Field::new("y", DataType::Int32, false)]));
assert!(db
.create_empty_table("test", other_schema.clone())
.execute()
.await
.is_err()); // TODO: assert what this error is
let overwritten = db
.create_empty_table("test", other_schema.clone())
.mode(CreateTableMode::Overwrite)
.execute()
.await
.unwrap();
assert_eq!(other_schema, overwritten.schema().await.unwrap());
}
#[tokio::test]
#[rstest::rstest]
#[case(LanceFileVersion::Legacy)]
#[case(LanceFileVersion::Stable)]
async fn test_create_table_with_storage_version(
#[case] data_storage_version: LanceFileVersion,
) {
let db = connect("memory://")
.database_options(&ListingDatabaseOptions {
new_table_config: NewTableConfig {
data_storage_version: Some(data_storage_version),
..Default::default()
},
..Default::default()
})
.execute()
.await
.unwrap();
let batch = record_batch!(("id", Int64, [1, 2, 3])).unwrap();
let table = db
.create_table("legacy_table", batch)
.execute()
.await
.unwrap();
let native_table = table.as_native().unwrap();
let storage_format = native_table
.manifest()
.await
.unwrap()
.data_storage_format
.lance_file_version()
.unwrap();
// Compare resolved versions since Stable/Next are aliases that resolve at storage time
assert_eq!(storage_format.resolve(), data_storage_version.resolve());
}
#[tokio::test]
async fn test_create_table_with_embedding() {
// Register the mock embedding function
let registry = Arc::new(MemoryRegistry::new());
let mock_embedding: Arc<dyn EmbeddingFunction> = Arc::new(MockEmbed::new("mock", 4));
registry.register("mock", mock_embedding).unwrap();
// Connect with the custom registry
let conn = connect("memory://")
.embedding_registry(registry)
.execute()
.await
.unwrap();
// Create data without the embedding column
let batch = record_batch!(("text", Utf8, ["hello", "world", "test"])).unwrap();
// Create table with add_embedding - embeddings should be computed automatically
let table = conn
.create_table("embed_test", batch)
.add_embedding(EmbeddingDefinition::new(
"text",
"mock",
Some("text_embedding"),
))
.unwrap()
.execute()
.await
.unwrap();
// Verify row count
assert_eq!(table.count_rows(None).await.unwrap(), 3);
// Verify the schema includes the embedding column
let result_schema = table.schema().await.unwrap();
assert_eq!(result_schema.fields().len(), 2);
assert_eq!(result_schema.field(0).name(), "text");
assert_eq!(result_schema.field(1).name(), "text_embedding");
// Verify the embedding column has the correct type
assert!(matches!(
result_schema.field(1).data_type(),
DataType::FixedSizeList(_, 4)
));
// Query to verify the embeddings were computed
let results: Vec<RecordBatch> = table
.query()
.select(Select::columns(&["text", "text_embedding"]))
.execute()
.await
.unwrap()
.try_collect()
.await
.unwrap();
let total_rows: usize = results.iter().map(|b| b.num_rows()).sum();
assert_eq!(total_rows, 3);
// Check that all rows have embedding values (not null)
for batch in &results {
let embedding_col = batch.column(1);
assert_eq!(embedding_col.null_count(), 0);
assert_eq!(embedding_col.len(), batch.num_rows());
}
// Verify the schema metadata contains the column definitions
assert!(
result_schema
.metadata
.contains_key("lancedb::column_definitions"),
"Schema metadata should contain column definitions"
);
}
#[tokio::test]
async fn test_create_empty_table_with_embeddings() {
#[derive(Debug, Clone)]
struct MockEmbedding {
dim: usize,
}
impl EmbeddingFunction for MockEmbedding {
fn name(&self) -> &str {
"test_embedding"
}
fn source_type(&self) -> Result<Cow<'_, DataType>> {
Ok(Cow::Owned(DataType::Utf8))
}
fn dest_type(&self) -> Result<Cow<'_, DataType>> {
Ok(Cow::Owned(DataType::new_fixed_size_list(
DataType::Float32,
self.dim as i32,
true,
)))
}
fn compute_source_embeddings(&self, source: Arc<dyn Array>) -> Result<Arc<dyn Array>> {
let len = source.len();
let values = vec![1.0f32; len * self.dim];
let values = Arc::new(Float32Array::from(values));
let field = Arc::new(Field::new("item", DataType::Float32, true));
Ok(Arc::new(FixedSizeListArray::new(
field,
self.dim as i32,
values,
None,
)))
}
fn compute_query_embeddings(&self, _input: Arc<dyn Array>) -> Result<Arc<dyn Array>> {
unimplemented!()
}
}
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
let embed_func = Arc::new(MockEmbedding { dim: 128 });
db.embedding_registry()
.register("test_embedding", embed_func.clone())
.unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, true)]));
let ed = EmbeddingDefinition {
source_column: "name".to_owned(),
dest_column: Some("name_embedding".to_owned()),
embedding_name: "test_embedding".to_owned(),
};
let table = db
.create_empty_table("test", schema)
.mode(CreateTableMode::Overwrite)
.add_embedding(ed)
.unwrap()
.execute()
.await
.unwrap();
let table_schema = table.schema().await.unwrap();
assert!(table_schema.column_with_name("name").is_some());
assert!(table_schema.column_with_name("name_embedding").is_some());
let embedding_field = table_schema.field_with_name("name_embedding").unwrap();
assert_eq!(
embedding_field.data_type(),
&DataType::new_fixed_size_list(DataType::Float32, 128, true)
);
let input_batch = record_batch!(("name", Utf8, ["Alice", "Bob", "Charlie"])).unwrap();
table.add(input_batch).execute().await.unwrap();
let results = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
assert_eq!(results.len(), 1);
let batch = &results[0];
assert_eq!(batch.num_rows(), 3);
assert!(batch.column_by_name("name_embedding").is_some());
let embedding_col = batch
.column_by_name("name_embedding")
.unwrap()
.as_any()
.downcast_ref::<FixedSizeListArray>()
.unwrap();
assert_eq!(embedding_col.len(), 3);
}
}
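
The `add_embedding` builder method above returns `Result<Self>` so that an unknown embedding name fails at configuration time rather than inside `execute()`. The following is a minimal, std-only sketch of that pattern, not the LanceDB implementation: `Registry`, `TableBuilder`, and `BuildError` are hypothetical stand-ins, and the registry stores only an output dimension instead of an embedding function.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for `Error::EmbeddingFunctionNotFound`.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    FunctionNotFound { name: String },
}

/// Hypothetical registry: maps a function name to its output dimension.
#[derive(Default)]
pub struct Registry {
    dims: HashMap<String, usize>,
}

impl Registry {
    pub fn register(&mut self, name: &str, dim: usize) {
        self.dims.insert(name.to_string(), dim);
    }

    pub fn get(&self, name: &str) -> Option<usize> {
        self.dims.get(name).copied()
    }
}

/// Hypothetical builder; `embeddings` holds (source column, resolved dimension).
pub struct TableBuilder<'a> {
    registry: &'a Registry,
    embeddings: Vec<(String, usize)>,
}

impl<'a> TableBuilder<'a> {
    pub fn new(registry: &'a Registry) -> Self {
        Self {
            registry,
            embeddings: Vec::new(),
        }
    }

    /// Resolve the function name immediately, so a typo is reported here
    /// and not later when the table is actually created.
    pub fn add_embedding(
        mut self,
        source_column: &str,
        function_name: &str,
    ) -> Result<Self, BuildError> {
        let dim = self
            .registry
            .get(function_name)
            .ok_or_else(|| BuildError::FunctionNotFound {
                name: function_name.to_string(),
            })?;
        self.embeddings.push((source_column.to_string(), dim));
        Ok(self)
    }

    /// Stand-in for `execute()`: returns the resolved configuration.
    pub fn execute(self) -> Vec<(String, usize)> {
        self.embeddings
    }
}
```

Because `add_embedding` consumes and returns the builder inside a `Result`, callers chain it with `?` or `.unwrap()` before `execute()`, which is the same shape the tests above use.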

View File

@@ -5,3 +5,4 @@
pub mod inspect;
pub mod sanitize;
+pub mod scannable;

View File

@@ -0,0 +1,580 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! Data source abstraction for LanceDB.
//!
//! This module provides a [`Scannable`] trait that allows input data sources to express
//! capabilities (row count, rescannability) so the insert pipeline can make
//! better decisions about write parallelism and retry strategies.
use std::sync::Arc;
use arrow_array::{RecordBatch, RecordBatchIterator, RecordBatchReader};
use arrow_schema::{ArrowError, SchemaRef};
use async_trait::async_trait;
use futures::stream::once;
use futures::StreamExt;
use lance_datafusion::utils::StreamingWriteSource;
use crate::arrow::{
SendableRecordBatchStream, SendableRecordBatchStreamExt, SimpleRecordBatchStream,
};
use crate::embeddings::{
compute_embeddings_for_batch, compute_output_schema, EmbeddingDefinition, EmbeddingFunction,
EmbeddingRegistry,
};
use crate::table::{ColumnDefinition, ColumnKind, TableDefinition};
use crate::{Error, Result};
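/// A source of Arrow record batches that can be written to a table.
///
/// Implementations report optional capabilities (row count, rescannability)
/// that the insert pipeline can use when planning a write.
pub trait Scannable: Send {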
/// Returns the schema of the data.
fn schema(&self) -> SchemaRef;
/// Read data as a stream of record batches.
///
/// For rescannable sources (in-memory data like `RecordBatch`, `Vec<RecordBatch>`),
/// this can be called multiple times and returns cloned data each time.
///
/// For non-rescannable sources (streams, readers), this can only be called once.
/// Calling it a second time returns a stream whose first item is an error.
fn scan_as_stream(&mut self) -> SendableRecordBatchStream;
/// Optional hint about the number of rows.
///
/// When available, this allows the pipeline to estimate total data size
/// and choose appropriate partitioning.
fn num_rows(&self) -> Option<usize> {
None
}
/// Whether the source can be re-read from the beginning.
///
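/// `true` for in-memory data (Tables, DataFrames) and disk-based sources (Datasets).
/// `false` for streaming sources (DuckDB results, network streams).
/// In this module, `RecordBatch` and `Vec<RecordBatch>` return `true`; boxed
/// readers and streams keep the default of `false`.
///
/// When `true`, the pipeline can retry failed writes by rescanning.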
fn rescannable(&self) -> bool {
false
}
}
impl std::fmt::Debug for dyn Scannable {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("Scannable")
.field("schema", &self.schema())
.field("num_rows", &self.num_rows())
.field("rescannable", &self.rescannable())
.finish()
}
}
impl Scannable for RecordBatch {
fn schema(&self) -> SchemaRef {
Self::schema(self)
}
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
let batch = self.clone();
let schema = batch.schema();
Box::pin(SimpleRecordBatchStream {
schema,
stream: once(async move { Ok(batch) }),
})
}
fn num_rows(&self) -> Option<usize> {
Some(Self::num_rows(self))
}
fn rescannable(&self) -> bool {
true
}
}
impl Scannable for Vec<RecordBatch> {
fn schema(&self) -> SchemaRef {
if self.is_empty() {
Arc::new(arrow_schema::Schema::empty())
} else {
self[0].schema()
}
}
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
if self.is_empty() {
let schema = Scannable::schema(self);
return Box::pin(SimpleRecordBatchStream {
schema,
stream: once(async {
Err(Error::InvalidInput {
message: "Cannot scan an empty Vec<RecordBatch>".to_string(),
})
}),
});
}
let schema = Scannable::schema(self);
let batches = self.clone();
let stream = futures::stream::iter(batches.into_iter().map(Ok));
Box::pin(SimpleRecordBatchStream { schema, stream })
}
fn num_rows(&self) -> Option<usize> {
Some(self.iter().map(|b| b.num_rows()).sum())
}
fn rescannable(&self) -> bool {
true
}
}
impl Scannable for Box<dyn RecordBatchReader + Send> {
fn schema(&self) -> SchemaRef {
RecordBatchReader::schema(self.as_ref())
}
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
let schema = Scannable::schema(self);
// Swap self with a reader that errors on iteration, so a second call
// produces a clear error instead of silently returning empty data.
let err_reader: Box<dyn RecordBatchReader + Send> = Box::new(RecordBatchIterator::new(
vec![Err(ArrowError::InvalidArgumentError(
"Reader has already been consumed".into(),
))],
schema.clone(),
));
let reader = std::mem::replace(self, err_reader);
// Bridge the blocking RecordBatchReader to an async stream via a channel.
let (tx, rx) = tokio::sync::mpsc::channel::<crate::Result<RecordBatch>>(2);
tokio::task::spawn_blocking(move || {
for batch_result in reader {
let result = batch_result.map_err(Into::into);
if tx.blocking_send(result).is_err() {
break;
}
}
});
let stream = futures::stream::unfold(rx, |mut rx| async move {
rx.recv().await.map(|batch| (batch, rx))
})
.fuse();
Box::pin(SimpleRecordBatchStream { schema, stream })
}
}
impl Scannable for SendableRecordBatchStream {
fn schema(&self) -> SchemaRef {
self.as_ref().schema()
}
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
let schema = Scannable::schema(self);
// Swap self with an error stream so a second call produces a clear error.
let error_stream = Box::pin(SimpleRecordBatchStream {
schema: schema.clone(),
stream: once(async {
Err(Error::InvalidInput {
message: "Stream has already been consumed".to_string(),
})
}),
});
std::mem::replace(self, error_stream)
}
}
#[async_trait]
impl StreamingWriteSource for Box<dyn Scannable> {
fn arrow_schema(&self) -> SchemaRef {
self.schema()
}
fn into_stream(mut self) -> datafusion_physical_plan::SendableRecordBatchStream {
self.scan_as_stream().into_df_stream()
}
}
/// A scannable that applies embeddings to the stream.
pub struct WithEmbeddingsScannable {
inner: Box<dyn Scannable>,
embeddings: Vec<(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)>,
output_schema: SchemaRef,
}
impl WithEmbeddingsScannable {
/// Create a new WithEmbeddingsScannable.
///
/// The embeddings are applied to the inner scannable's data as new columns.
pub fn try_new(
inner: Box<dyn Scannable>,
embeddings: Vec<(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)>,
) -> Result<Self> {
let output_schema = compute_output_schema(&inner.schema(), &embeddings)?;
// Build column definitions: Physical for base columns, Embedding for new ones
let base_col_count = inner.schema().fields().len();
let column_definitions: Vec<ColumnDefinition> = (0..base_col_count)
.map(|_| ColumnDefinition {
kind: ColumnKind::Physical,
})
.chain(embeddings.iter().map(|(ed, _)| ColumnDefinition {
kind: ColumnKind::Embedding(ed.clone()),
}))
.collect();
let table_definition = TableDefinition::new(output_schema, column_definitions);
let output_schema = table_definition.into_rich_schema();
Ok(Self {
inner,
embeddings,
output_schema,
})
}
}
impl Scannable for WithEmbeddingsScannable {
fn schema(&self) -> SchemaRef {
self.output_schema.clone()
}
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
let inner_stream = self.inner.scan_as_stream();
let embeddings = self.embeddings.clone();
let output_schema = self.output_schema.clone();
let mapped_stream = inner_stream.then(move |batch_result| {
let embeddings = embeddings.clone();
async move {
let batch = batch_result?;
let result = tokio::task::spawn_blocking(move || {
compute_embeddings_for_batch(batch, &embeddings)
})
.await
.map_err(|e| Error::Runtime {
message: format!("Task panicked during embedding computation: {}", e),
})??;
Ok(result)
}
});
Box::pin(SimpleRecordBatchStream {
schema: output_schema,
stream: mapped_stream,
})
}
fn num_rows(&self) -> Option<usize> {
self.inner.num_rows()
}
fn rescannable(&self) -> bool {
self.inner.rescannable()
}
}
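/// Wrap `inner` so that embedding columns declared in `table_definition` are
/// computed as the data is scanned.
///
/// Returns `inner` unchanged when `registry` is `None` or the definition has no
/// embedding columns. Returns [`Error::EmbeddingFunctionNotFound`] when a
/// declared embedding function is missing from the registry.
pub fn scannable_with_embeddings(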
inner: Box<dyn Scannable>,
table_definition: &TableDefinition,
registry: Option<&Arc<dyn EmbeddingRegistry>>,
) -> Result<Box<dyn Scannable>> {
if let Some(registry) = registry {
let mut embeddings = Vec::with_capacity(table_definition.column_definitions.len());
for cd in table_definition.column_definitions.iter() {
if let ColumnKind::Embedding(embedding_def) = &cd.kind {
match registry.get(&embedding_def.embedding_name) {
Some(func) => {
embeddings.push((embedding_def.clone(), func));
}
None => {
return Err(Error::EmbeddingFunctionNotFound {
name: embedding_def.embedding_name.clone(),
reason: format!(
"Table was defined with an embedding column `{}` but no embedding function was found with that name within the registry.",
embedding_def.embedding_name
),
});
}
}
}
}
if !embeddings.is_empty() {
return Ok(Box::new(WithEmbeddingsScannable::try_new(
inner, embeddings,
)?));
}
}
Ok(inner)
}
#[cfg(test)]
mod tests {
use super::*;
use arrow_array::record_batch;
use futures::TryStreamExt;
#[tokio::test]
async fn test_record_batch_rescannable() {
let mut batch = record_batch!(("id", Int64, [0, 1, 2])).unwrap();
let stream1 = batch.scan_as_stream();
let batches1: Vec<RecordBatch> = stream1.try_collect().await.unwrap();
assert_eq!(batches1.len(), 1);
assert_eq!(batches1[0], batch);
assert!(batch.rescannable());
let stream2 = batch.scan_as_stream();
let batches2: Vec<RecordBatch> = stream2.try_collect().await.unwrap();
assert_eq!(batches2.len(), 1);
assert_eq!(batches2[0], batch);
}
#[tokio::test]
async fn test_vec_batch_rescannable() {
let mut batches = vec![
record_batch!(("id", Int64, [0, 1])).unwrap(),
record_batch!(("id", Int64, [2, 3, 4])).unwrap(),
];
let stream1 = batches.scan_as_stream();
let result1: Vec<RecordBatch> = stream1.try_collect().await.unwrap();
assert_eq!(result1.len(), 2);
assert_eq!(result1[0], batches[0]);
assert_eq!(result1[1], batches[1]);
assert!(batches.rescannable());
let stream2 = batches.scan_as_stream();
let result2: Vec<RecordBatch> = stream2.try_collect().await.unwrap();
assert_eq!(result2.len(), 2);
assert_eq!(result2[0], batches[0]);
assert_eq!(result2[1], batches[1]);
}
#[tokio::test]
async fn test_vec_batch_empty_errors() {
let mut empty: Vec<RecordBatch> = vec![];
let mut stream = empty.scan_as_stream();
let result = stream.next().await;
assert!(result.is_some());
assert!(result.unwrap().is_err());
}
#[tokio::test]
async fn test_reader_not_rescannable() {
let batch = record_batch!(("id", Int64, [0, 1, 2])).unwrap();
let schema = batch.schema();
let mut reader: Box<dyn arrow_array::RecordBatchReader + Send> = Box::new(
RecordBatchIterator::new(vec![Ok(batch.clone())], schema.clone()),
);
let stream1 = reader.scan_as_stream();
let result1: Vec<RecordBatch> = stream1.try_collect().await.unwrap();
assert_eq!(result1.len(), 1);
assert_eq!(result1[0], batch);
assert!(!reader.rescannable());
// Second call returns a stream whose first item is an error
let mut stream2 = reader.scan_as_stream();
let result2 = stream2.next().await;
assert!(result2.is_some());
assert!(result2.unwrap().is_err());
}
#[tokio::test]
async fn test_stream_not_rescannable() {
let batch = record_batch!(("id", Int64, [0, 1, 2])).unwrap();
let schema = batch.schema();
let inner_stream = futures::stream::iter(vec![Ok(batch.clone())]);
let mut stream: SendableRecordBatchStream = Box::pin(SimpleRecordBatchStream {
schema: schema.clone(),
stream: inner_stream,
});
let stream1 = stream.scan_as_stream();
let result1: Vec<RecordBatch> = stream1.try_collect().await.unwrap();
assert_eq!(result1.len(), 1);
assert_eq!(result1[0], batch);
assert!(!stream.rescannable());
// Second call returns a stream whose first item is an error
let mut stream2 = stream.scan_as_stream();
let result2 = stream2.next().await;
assert!(result2.is_some());
assert!(result2.unwrap().is_err());
}
mod embedding_tests {
use super::*;
use crate::embeddings::MemoryRegistry;
use crate::table::{ColumnDefinition, ColumnKind};
use crate::test_utils::embeddings::MockEmbed;
use arrow_array::Array as _;
use arrow_array::{ArrayRef, StringArray};
use arrow_schema::{DataType, Field, Schema};
#[tokio::test]
async fn test_with_embeddings_scannable() {
let schema = Arc::new(Schema::new(vec![Field::new("text", DataType::Utf8, false)]));
let text_array = StringArray::from(vec!["hello", "world", "test"]);
let batch =
RecordBatch::try_new(schema.clone(), vec![Arc::new(text_array) as ArrayRef])
.unwrap();
let mock_embedding: Arc<dyn EmbeddingFunction> = Arc::new(MockEmbed::new("mock", 4));
let embedding_def = EmbeddingDefinition::new("text", "mock", Some("text_embedding"));
let mut scannable = WithEmbeddingsScannable::try_new(
Box::new(batch.clone()),
vec![(embedding_def, mock_embedding)],
)
.unwrap();
// Check that schema has the embedding column
let output_schema = scannable.schema();
assert_eq!(output_schema.fields().len(), 2);
assert_eq!(output_schema.field(0).name(), "text");
assert_eq!(output_schema.field(1).name(), "text_embedding");
// Check num_rows and rescannable are preserved
assert_eq!(scannable.num_rows(), Some(3));
assert!(scannable.rescannable());
// Read the data
let stream = scannable.scan_as_stream();
let results: Vec<RecordBatch> = stream.try_collect().await.unwrap();
assert_eq!(results.len(), 1);
let result_batch = &results[0];
assert_eq!(result_batch.num_rows(), 3);
assert_eq!(result_batch.num_columns(), 2);
// Verify the embedding column is present and has the right shape
let embedding_col = result_batch.column(1);
assert_eq!(embedding_col.len(), 3);
}
#[tokio::test]
async fn test_maybe_embedded_scannable_no_embeddings() {
let batch = record_batch!(("id", Int64, [1, 2, 3])).unwrap();
// Create a table definition with no embedding columns
let table_def = TableDefinition::new_from_schema(batch.schema());
// Even with a registry, if there are no embedding columns, it's a passthrough
let registry: Arc<dyn EmbeddingRegistry> = Arc::new(MemoryRegistry::new());
let mut scannable =
scannable_with_embeddings(Box::new(batch.clone()), &table_def, Some(&registry))
.unwrap();
// Check that data passes through unchanged
let stream = scannable.scan_as_stream();
let results: Vec<RecordBatch> = stream.try_collect().await.unwrap();
assert_eq!(results.len(), 1);
assert_eq!(results[0], batch);
}
#[tokio::test]
async fn test_maybe_embedded_scannable_with_embeddings() {
let schema = Arc::new(Schema::new(vec![Field::new("text", DataType::Utf8, false)]));
let text_array = StringArray::from(vec!["hello", "world"]);
let batch =
RecordBatch::try_new(schema.clone(), vec![Arc::new(text_array) as ArrayRef])
.unwrap();
// Create a table definition with an embedding column
let embedding_def = EmbeddingDefinition::new("text", "mock", Some("text_embedding"));
let embedding_schema = Arc::new(Schema::new(vec![
Field::new("text", DataType::Utf8, false),
Field::new(
"text_embedding",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float32, true)),
4,
),
false,
),
]));
let table_def = TableDefinition::new(
embedding_schema,
vec![
ColumnDefinition {
kind: ColumnKind::Physical,
},
ColumnDefinition {
kind: ColumnKind::Embedding(embedding_def.clone()),
},
],
);
// Register the mock embedding function
let registry: Arc<dyn EmbeddingRegistry> = Arc::new(MemoryRegistry::new());
let mock_embedding: Arc<dyn EmbeddingFunction> = Arc::new(MockEmbed::new("mock", 4));
registry.register("mock", mock_embedding).unwrap();
let mut scannable =
scannable_with_embeddings(Box::new(batch), &table_def, Some(&registry)).unwrap();
// Read and verify the data has embeddings
let stream = scannable.scan_as_stream();
let results: Vec<RecordBatch> = stream.try_collect().await.unwrap();
assert_eq!(results.len(), 1);
let result_batch = &results[0];
assert_eq!(result_batch.num_columns(), 2);
assert_eq!(result_batch.schema().field(1).name(), "text_embedding");
}
#[tokio::test]
async fn test_maybe_embedded_scannable_missing_function() {
let schema = Arc::new(Schema::new(vec![Field::new("text", DataType::Utf8, false)]));
let text_array = StringArray::from(vec!["hello"]);
let batch =
RecordBatch::try_new(schema.clone(), vec![Arc::new(text_array) as ArrayRef])
.unwrap();
// Create a table definition with an embedding column
let embedding_def =
EmbeddingDefinition::new("text", "nonexistent", Some("text_embedding"));
let embedding_schema = Arc::new(Schema::new(vec![
Field::new("text", DataType::Utf8, false),
Field::new(
"text_embedding",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float32, true)),
4,
),
false,
),
]));
let table_def = TableDefinition::new(
embedding_schema,
vec![
ColumnDefinition {
kind: ColumnKind::Physical,
},
ColumnDefinition {
kind: ColumnKind::Embedding(embedding_def),
},
],
);
// Registry has no embedding functions registered
let registry: Arc<dyn EmbeddingRegistry> = Arc::new(MemoryRegistry::new());
let result = scannable_with_embeddings(Box::new(batch), &table_def, Some(&registry));
// Should fail because the embedding function is not found
assert!(result.is_err());
let err = result.err().unwrap();
assert!(
matches!(err, Error::EmbeddingFunctionNotFound { .. }),
"Expected EmbeddingFunctionNotFound"
);
}
}
}
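
The reader and stream implementations above make a second `scan_as_stream()` call fail loudly by swapping the consumed source for one that yields an error. Below is a simplified, std-only sketch of that `std::mem::replace` technique under stated assumptions: it is synchronous, uses `i64` rows instead of record batches, and the names `Source`, `RowIter`, and `scan_rows` are hypothetical and not part of LanceDB.

```rust
/// Hypothetical stand-in for a record batch stream: a boxed iterator of results.
pub type RowIter = Box<dyn Iterator<Item = Result<i64, String>> + Send>;

/// Simplified, synchronous analogue of the `Scannable` trait.
pub trait Source {
    /// Hand out the data; single-use sources only succeed once.
    fn scan_rows(&mut self) -> RowIter;

    /// Defaults to `false`, as in the trait above.
    fn rescannable(&self) -> bool {
        false
    }
}

/// In-memory data: every scan clones the rows, so it can be repeated.
impl Source for Vec<i64> {
    fn scan_rows(&mut self) -> RowIter {
        Box::new(self.clone().into_iter().map(Ok::<i64, String>))
    }

    fn rescannable(&self) -> bool {
        true
    }
}

/// Single-use source: the first scan takes the real iterator and leaves
/// behind one whose first item is an error.
impl Source for RowIter {
    fn scan_rows(&mut self) -> RowIter {
        let consumed: RowIter = Box::new(std::iter::once(Err::<i64, String>(
            "source has already been consumed".to_string(),
        )));
        std::mem::replace(self, consumed)
    }
}
```

The design choice here is that a repeated scan produces an explicit error item rather than an empty iterator, so a retry against a consumed source cannot be mistaken for a successful write of zero rows.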

View File

@@ -18,12 +18,7 @@ use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
-use arrow_array::RecordBatchReader;
-use async_trait::async_trait;
-use datafusion_physical_plan::stream::RecordBatchStreamAdapter;
-use futures::stream;
use lance::dataset::ReadParams;
-use lance_datafusion::utils::StreamingWriteSource;
use lance_namespace::models::{
CreateNamespaceRequest, CreateNamespaceResponse, DescribeNamespaceRequest,
DescribeNamespaceResponse, DropNamespaceRequest, DropNamespaceResponse, ListNamespacesRequest,
@@ -31,9 +26,9 @@ use lance_namespace::models::{
};
use lance_namespace::LanceNamespace;
-use crate::arrow::{SendableRecordBatchStream, SendableRecordBatchStreamExt};
+use crate::data::scannable::Scannable;
use crate::error::Result;
-use crate::table::{BaseTable, TableDefinition, WriteOptions};
+use crate::table::{BaseTable, WriteOptions};
pub mod listing;
pub mod namespace;
@@ -90,8 +85,10 @@ pub type TableBuilderCallback = Box<dyn FnOnce(OpenTableRequest) -> OpenTableReq
/// Describes what happens when creating a table and a table with
/// the same name already exists
+#[derive(Default)]
pub enum CreateTableMode {
/// If the table already exists, an error is returned
+#[default]
Create,
/// If the table already exists, it is opened. Any provided data is
/// ignored. The function will be passed an OpenTableBuilder to customize
@@ -109,57 +106,14 @@ impl CreateTableMode {
}
}
-impl Default for CreateTableMode {
-fn default() -> Self {
-Self::Create
-}
-}
-/// The data to start a table or a schema to create an empty table
-pub enum CreateTableData {
-/// Creates a table using an iterator of data, the schema will be obtained from the data
-Data(Box<dyn RecordBatchReader + Send>),
-/// Creates a table using a stream of data, the schema will be obtained from the data
-StreamingData(SendableRecordBatchStream),
-/// Creates an empty table, the definition / schema must be provided separately
-Empty(TableDefinition),
-}
-impl CreateTableData {
-pub fn schema(&self) -> Arc<arrow_schema::Schema> {
-match self {
-Self::Data(reader) => reader.schema(),
-Self::StreamingData(stream) => stream.schema(),
-Self::Empty(definition) => definition.schema.clone(),
-}
-}
-}
-#[async_trait]
-impl StreamingWriteSource for CreateTableData {
-fn arrow_schema(&self) -> Arc<arrow_schema::Schema> {
-self.schema()
-}
-fn into_stream(self) -> datafusion_physical_plan::SendableRecordBatchStream {
-match self {
-Self::Data(reader) => reader.into_stream(),
-Self::StreamingData(stream) => stream.into_df_stream(),
-Self::Empty(table_definition) => {
-let schema = table_definition.schema.clone();
-Box::pin(RecordBatchStreamAdapter::new(schema, stream::empty()))
-}
-}
-}
-}
/// A request to create a table
pub struct CreateTableRequest {
/// The name of the new table
pub name: String,
/// The namespace to create the table in. Empty list represents root namespace.
pub namespace: Vec<String>,
-/// Initial data to write to the table, can be None to create an empty table
-pub data: CreateTableData,
+/// Initial data to write to the table, can be empty.
+pub data: Box<dyn Scannable>,
/// The mode to use when creating the table
pub mode: CreateTableMode,
/// Options to use when writing data (only used if `data` is not None)
@@ -173,7 +127,7 @@ pub struct CreateTableRequest {
}
impl CreateTableRequest {
-pub fn new(name: String, data: CreateTableData) -> Self {
+pub fn new(name: String, data: Box<dyn Scannable>) -> Self {
Self {
name,
namespace: vec![],
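
The `CreateTableMode` hunk above replaces a hand-written `impl Default` with `#[derive(Default)]` and a `#[default]` variant attribute. As a small sketch of why that still compiles when another variant carries a non-`Default` payload, consider the hypothetical `Mode` and `Request` types below (illustrative names only; the closure signature is an assumption, not the real `exist_ok` callback type).

```rust
/// Hypothetical mirror of `CreateTableMode`.
#[derive(Default)]
pub enum Mode {
    /// The derive only requires the variant marked `#[default]` to be a unit variant.
    #[default]
    Create,
    /// A payload that does not implement `Default` is fine on other variants.
    ExistOk(Box<dyn FnOnce(u32) -> u32 + Send>),
    Overwrite,
}

/// Hypothetical request struct: deriving `Default` picks up `Mode::Create`.
#[derive(Default)]
pub struct Request {
    pub name: String,
    pub mode: Mode,
}
```

With this shape, `Request::default().mode` is `Mode::Create`, matching what the removed manual implementation returned.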

View File

@@ -922,7 +922,7 @@ impl Database for ListingDatabase {
.with_read_params(read_params.clone())
.load()
.await
.map_err(|e| Error::Lance { source: e })?;
.map_err(|e| -> Error { e.into() })?;
let version_ref = match (request.source_version, request.source_tag) {
(Some(v), None) => Ok(Ref::Version(None, Some(v))),
@@ -937,7 +937,7 @@ impl Database for ListingDatabase {
source_dataset
.shallow_clone(&target_uri, version_ref, Some(storage_params))
.await
.map_err(|e| Error::Lance { source: e })?;
.map_err(|e| -> Error { e.into() })?;
let cloned_table = NativeTable::open_with_params(
&target_uri,
@@ -1098,8 +1098,10 @@ impl Database for ListingDatabase {
mod tests {
use super::*;
use crate::connection::ConnectRequest;
use crate::database::{CreateTableData, CreateTableMode, CreateTableRequest, WriteOptions};
use crate::table::{Table, TableDefinition};
use crate::data::scannable::Scannable;
use crate::database::{CreateTableMode, CreateTableRequest};
use crate::table::WriteOptions;
use crate::Table;
use arrow_array::{Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use std::path::PathBuf;
@@ -1139,7 +1141,7 @@ mod tests {
.create_table(CreateTableRequest {
name: "source_table".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema.clone())),
data: Box::new(RecordBatch::new_empty(schema.clone())) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1196,16 +1198,11 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch)],
schema.clone(),
));
let source_table = db
.create_table(CreateTableRequest {
name: "source_with_data".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1264,7 +1261,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "source".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema)),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1300,7 +1297,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "source".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema)),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1340,7 +1337,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "source".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema)),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1380,7 +1377,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "source".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema)),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1435,7 +1432,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "source".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema)),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1484,16 +1481,11 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch1)],
schema.clone(),
));
let source_table = db
.create_table(CreateTableRequest {
name: "versioned_source".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch1) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1517,14 +1509,7 @@ mod tests {
let db = Arc::new(db);
let source_table_obj = Table::new(source_table.clone(), db.clone());
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch2)],
schema.clone(),
)))
.execute()
.await
.unwrap();
source_table_obj.add(batch2).execute().await.unwrap();
// Verify source table now has 4 rows
assert_eq!(source_table.count_rows(None).await.unwrap(), 4);
@@ -1570,16 +1555,11 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch1)],
schema.clone(),
));
let source_table = db
.create_table(CreateTableRequest {
name: "tagged_source".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch1),
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1607,14 +1587,7 @@ mod tests {
.unwrap();
let source_table_obj = Table::new(source_table.clone(), db.clone());
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch2)],
schema.clone(),
)))
.execute()
.await
.unwrap();
source_table_obj.add(batch2).execute().await.unwrap();
// Source table should have 4 rows
assert_eq!(source_table.count_rows(None).await.unwrap(), 4);
@@ -1657,16 +1630,11 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch1)],
schema.clone(),
));
let source_table = db
.create_table(CreateTableRequest {
name: "independent_source".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch1),
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1706,14 +1674,7 @@ mod tests {
let db = Arc::new(db);
let cloned_table_obj = Table::new(cloned_table.clone(), db.clone());
cloned_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch_clone)],
schema.clone(),
)))
.execute()
.await
.unwrap();
cloned_table_obj.add(batch_clone).execute().await.unwrap();
// Add different data to the source table
let batch_source = RecordBatch::try_new(
@@ -1726,14 +1687,7 @@ mod tests {
.unwrap();
let source_table_obj = Table::new(source_table.clone(), db);
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch_source)],
schema.clone(),
)))
.execute()
.await
.unwrap();
source_table_obj.add(batch_source).execute().await.unwrap();
// Verify they have evolved independently
assert_eq!(source_table.count_rows(None).await.unwrap(), 4); // 2 + 2
@@ -1751,16 +1705,11 @@ mod tests {
RecordBatch::try_new(schema.clone(), vec![Arc::new(Int32Array::from(vec![1, 2]))])
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch1)],
schema.clone(),
));
let source_table = db
.create_table(CreateTableRequest {
name: "latest_version_source".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch1),
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1779,14 +1728,7 @@ mod tests {
.unwrap();
let source_table_obj = Table::new(source_table.clone(), db.clone());
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch)],
schema.clone(),
)))
.execute()
.await
.unwrap();
source_table_obj.add(batch).execute().await.unwrap();
}
// Source should have 8 rows total (2 + 2 + 2 + 2)
@@ -1849,16 +1791,11 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch)],
schema.clone(),
));
let table = db
.create_table(CreateTableRequest {
name: "test_stable".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch),
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -1887,11 +1824,6 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch)],
schema.clone(),
));
let mut storage_options = HashMap::new();
storage_options.insert(
OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS.to_string(),
@@ -1914,7 +1846,7 @@ mod tests {
.create_table(CreateTableRequest {
name: "test_stable_table_level".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch),
mode: CreateTableMode::Create,
write_options,
location: None,
@@ -1963,11 +1895,6 @@ mod tests {
)
.unwrap();
let reader = Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch)],
schema.clone(),
));
let mut storage_options = HashMap::new();
storage_options.insert(
OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS.to_string(),
@@ -1990,7 +1917,7 @@ mod tests {
.create_table(CreateTableRequest {
name: "test_override".to_string(),
namespace: vec![],
data: CreateTableData::Data(reader),
data: Box::new(batch),
mode: CreateTableMode::Create,
write_options,
location: None,
@@ -2108,7 +2035,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "table1".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema.clone())),
data: Box::new(RecordBatch::new_empty(schema.clone())) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,
@@ -2120,7 +2047,7 @@ mod tests {
db.create_table(CreateTableRequest {
name: "table2".to_string(),
namespace: vec![],
data: CreateTableData::Empty(TableDefinition::new_from_schema(schema)),
data: Box::new(RecordBatch::new_empty(schema)) as Box<dyn Scannable>,
mode: CreateTableMode::Create,
write_options: Default::default(),
location: None,

View File

@@ -354,15 +354,13 @@ mod tests {
use super::*;
use crate::connect_namespace;
use crate::query::ExecutableQuery;
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_array::{Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use tempfile::tempdir;
/// Helper function to create test data
fn create_test_data() -> RecordBatchIterator<
std::vec::IntoIter<std::result::Result<RecordBatch, arrow_schema::ArrowError>>,
> {
fn create_test_data() -> RecordBatch {
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
@@ -371,12 +369,7 @@ mod tests {
let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
let name_array = StringArray::from(vec!["Alice", "Bob", "Charlie", "David", "Eve"]);
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(id_array), Arc::new(name_array)],
)
.unwrap();
RecordBatchIterator::new(vec![std::result::Result::Ok(batch)].into_iter(), schema)
RecordBatch::try_new(schema, vec![Arc::new(id_array), Arc::new(name_array)]).unwrap()
}
#[tokio::test]
@@ -618,13 +611,7 @@ mod tests {
// Test: Overwrite the table
let table2 = conn
.create_table(
"overwrite_test",
RecordBatchIterator::new(
vec![std::result::Result::Ok(test_data2)].into_iter(),
schema,
),
)
.create_table("overwrite_test", test_data2)
.namespace(vec!["test_ns".into()])
.mode(CreateTableMode::Overwrite)
.execute()

View File

@@ -13,7 +13,7 @@ use lance_datafusion::exec::SessionContextExt;
use crate::{
arrow::{SendableRecordBatchStream, SendableRecordBatchStreamExt, SimpleRecordBatchStream},
connect,
database::{CreateTableData, CreateTableRequest, Database},
database::{CreateTableRequest, Database},
dataloader::permutation::{
shuffle::{Shuffler, ShufflerConfig},
split::{SplitStrategy, Splitter, SPLIT_ID_COLUMN},
@@ -57,7 +57,7 @@ pub struct PermutationConfig {
}
/// Strategy for shuffling the data.
#[derive(Debug, Clone)]
#[derive(Debug, Clone, Default)]
pub enum ShuffleStrategy {
/// The data is randomly shuffled
///
@@ -78,15 +78,10 @@ pub enum ShuffleStrategy {
/// The data is not shuffled
///
/// This is useful for debugging and testing.
#[default]
None,
}
impl Default for ShuffleStrategy {
fn default() -> Self {
Self::None
}
}
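Note: this hunk swaps a hand-written `impl Default` for the `#[default]` variant attribute, which `#[derive(Default)]` supports on unit variants of an enum. A minimal sketch of the equivalence follows; `Strategy` and `StrategyManual` are hypothetical enums, and the `Random { seed }` variant is an assumed shape used only for illustration.

```rust
/// Derived form: the `#[default]` attribute marks the unit variant to use.
#[derive(Debug, Clone, Default, PartialEq)]
enum Strategy {
    Random { seed: Option<u64> },
    #[default]
    None,
}

/// Hand-written form that the derive replaces.
#[derive(Debug, Clone, PartialEq)]
enum StrategyManual {
    Random { seed: Option<u64> },
    None,
}

impl Default for StrategyManual {
    fn default() -> Self {
        Self::None
    }
}
```

Both forms return the `None` variant from `default()`; the derived one keeps the default next to the variant's documentation.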
/// Builder for creating a permutation table.
///
/// A permutation table is a table that stores split assignments and a shuffled order of rows. This
@@ -313,10 +308,8 @@ impl PermutationBuilder {
}
};
let create_table_request = CreateTableRequest::new(
name.to_string(),
CreateTableData::StreamingData(streaming_data),
);
let create_table_request =
CreateTableRequest::new(name.to_string(), Box::new(streaming_data));
let table = database.create_table(create_table_request).await?;
@@ -347,7 +340,7 @@ mod tests {
.col("col_b", lance_datagen::array::step::<Int32Type>())
.into_ldb_stream(RowCount::from(100), BatchCount::from(10));
let data_table = db
.create_table_streaming("base_tbl", initial_data)
.create_table("base_tbl", initial_data)
.execute()
.await
.unwrap();
@@ -387,7 +380,7 @@ mod tests {
.col("some_value", lance_datagen::array::step::<Int32Type>())
.into_ldb_stream(RowCount::from(100), BatchCount::from(10));
let data_table = db
.create_table_streaming("mytbl", initial_data)
.create_table("mytbl", initial_data)
.execute()
.await
.unwrap();

View File

@@ -39,6 +39,9 @@ pub struct PermutationReader {
limit: Option<u64>,
available_rows: u64,
split: u64,
// Cached map of offset to row id for the split
#[allow(clippy::type_complexity)]
offset_map: Arc<tokio::sync::Mutex<Option<Arc<HashMap<u64, u64>>>>>,
}
impl std::fmt::Debug for PermutationReader {
@@ -72,6 +75,7 @@ impl PermutationReader {
limit: None,
available_rows: 0,
split,
offset_map: Arc::new(tokio::sync::Mutex::new(None)),
};
slf.validate().await?;
// Calculate the number of available rows
@@ -157,6 +161,7 @@ impl PermutationReader {
let available_rows = self.verify_limit_offset(self.limit, Some(offset)).await?;
self.offset = Some(offset);
self.available_rows = available_rows;
self.offset_map = Arc::new(tokio::sync::Mutex::new(None));
Ok(self)
}
@@ -164,6 +169,7 @@ impl PermutationReader {
let available_rows = self.verify_limit_offset(Some(limit), self.offset).await?;
self.available_rows = available_rows;
self.limit = Some(limit);
self.offset_map = Arc::new(tokio::sync::Mutex::new(None));
Ok(self)
}
@@ -180,8 +186,9 @@ impl PermutationReader {
base_table: &Arc<dyn BaseTable>,
row_ids: RecordBatch,
selection: Select,
has_row_id: bool,
) -> Result<RecordBatch> {
let has_row_id = Self::has_row_id(&selection)?;
let num_rows = row_ids.num_rows();
let row_ids = row_ids
.column(0)
@@ -282,14 +289,13 @@ impl PermutationReader {
row_ids: DatasetRecordBatchStream,
selection: Select,
) -> Result<SendableRecordBatchStream> {
let has_row_id = Self::has_row_id(&selection)?;
let mut stream = row_ids
.map_err(Error::from)
.try_filter_map(move |batch| {
let selection = selection.clone();
let base_table = base_table.clone();
async move {
Self::load_batch(&base_table, batch, selection, has_row_id)
Self::load_batch(&base_table, batch, selection)
.await
.map(Some)
}
@@ -397,6 +403,82 @@ impl PermutationReader {
Self::row_ids_to_batches(self.base_table.clone(), row_ids, selection).await
}
/// If we are going to use `take` then we load the offset -> row id map once for the split and cache it
///
/// This method fetches the map with find-or-create semantics.
async fn get_offset_map(
&self,
permutation_table: &Arc<dyn BaseTable>,
) -> Result<Arc<HashMap<u64, u64>>> {
let mut offset_map_ref = self.offset_map.lock().await;
if let Some(offset_map) = &*offset_map_ref {
return Ok(offset_map.clone());
}
let mut offset_map = HashMap::new();
let mut row_ids_query = Table::from(permutation_table.clone())
.query()
.select(Select::Columns(vec![SRC_ROW_ID_COL.to_string()]))
.only_if(format!("{} = {}", SPLIT_ID_COLUMN, self.split));
if let Some(offset) = self.offset {
row_ids_query = row_ids_query.offset(offset as usize);
}
if let Some(limit) = self.limit {
row_ids_query = row_ids_query.limit(limit as usize);
}
let mut row_ids = row_ids_query.execute().await?;
while let Some(batch) = row_ids.try_next().await? {
let row_ids = batch
.column(0)
.as_primitive::<UInt64Type>()
.values()
.to_vec();
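// Offsets are cumulative across batches, so continue from the current map size
let base = offset_map.len() as u64;
for (i, row_id) in row_ids.iter().enumerate() {
offset_map.insert(base + i as u64, *row_id);
}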
}
let offset_map = Arc::new(offset_map);
*offset_map_ref = Some(offset_map.clone());
Ok(offset_map)
}
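Note: `get_offset_map` above uses find-or-create semantics: take the lock, return the cached `Arc` if one exists, otherwise build the map and store it. The sketch below shows the same pattern under simplifying assumptions: it uses the blocking `std::sync::Mutex` instead of `tokio::sync::Mutex`, and a hypothetical `load_row_ids` closure stands in for the query against the permutation table. `OffsetCache` is an invented name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Lazily built, shareable map from offset to row id.
struct OffsetCache {
    map: Mutex<Option<Arc<HashMap<u64, u64>>>>,
}

impl OffsetCache {
    fn new() -> Self {
        Self { map: Mutex::new(None) }
    }

    /// Return the cached map, or build it from `load_row_ids` and cache it.
    /// The lock is held while loading, so concurrent callers load only once.
    fn get_or_load<F>(&self, load_row_ids: F) -> Arc<HashMap<u64, u64>>
    where
        F: FnOnce() -> Vec<u64>,
    {
        let mut slot = self.map.lock().unwrap();
        if let Some(existing) = &*slot {
            return Arc::clone(existing);
        }
        // The position in the loaded sequence is the offset.
        let built: HashMap<u64, u64> = load_row_ids()
            .into_iter()
            .enumerate()
            .map(|(offset, row_id)| (offset as u64, row_id))
            .collect();
        let built = Arc::new(built);
        *slot = Some(Arc::clone(&built));
        built
    }

    /// Drop the cached map, for example after the offset or limit changes.
    fn invalidate(&self) {
        *self.map.lock().unwrap() = None;
    }
}
```

Handing out an `Arc` means callers can keep using the map after the lock is released, and resetting the slot to `None` mirrors how the reader discards its cache when the offset or limit is changed.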
pub async fn take_offsets(&self, offsets: &[u64], selection: Select) -> Result<RecordBatch> {
if let Some(permutation_table) = &self.permutation_table {
let offset_map = self.get_offset_map(permutation_table).await?;
let row_ids = offsets
.iter()
.map(|o| offset_map.get(o).copied().expect_ok().map_err(Error::from))
.collect::<Result<Vec<_>>>()?;
let row_ids = RecordBatch::try_new(
Arc::new(arrow_schema::Schema::new(vec![arrow_schema::Field::new(
"row_id",
arrow_schema::DataType::UInt64,
false,
)])),
vec![Arc::new(UInt64Array::from(row_ids))],
)?;
Self::load_batch(&self.base_table, row_ids, selection).await
} else {
let table = Table::from(self.base_table.clone());
let batches = table
.take_offsets(offsets.to_vec())
.select(selection.clone())
.execute()
.await?
.try_collect::<Vec<_>>()
.await?;
if let Some(first_batch) = batches.first() {
let schema = first_batch.schema();
let batch = arrow::compute::concat_batches(&schema, &batches)?;
Ok(batch)
} else {
// No rows matched: an empty batch still needs one (empty) column per schema field
Ok(RecordBatch::new_empty(self.output_schema(selection).await?))
}
}
}
pub async fn output_schema(&self, selection: Select) -> Result<SchemaRef> {
let table = Table::from(self.base_table.clone());
table.query().select(selection).output_schema().await
@@ -543,4 +625,224 @@ mod tests {
check_batch(&mut stream, &row_ids[7..9]).await;
assert!(stream.try_next().await.unwrap().is_none());
}
/// Helper to create a base table and permutation table for take_offsets tests.
/// Returns (base_table, row_ids_table, shuffled_row_ids).
async fn setup_permutation_tables(num_rows: usize) -> (Table, Table, Vec<u64>) {
let base_table = lance_datagen::gen_batch()
.col("idx", lance_datagen::array::step::<Int32Type>())
.col("other_col", lance_datagen::array::step::<UInt64Type>())
.into_mem_table("tbl", RowCount::from(num_rows as u64), BatchCount::from(1))
.await;
let mut row_ids = collect_column::<UInt64Type>(&base_table, "_rowid").await;
row_ids.shuffle(&mut rand::rng());
let split_ids = UInt64Array::from_iter_values(std::iter::repeat_n(0u64, row_ids.len()));
let permutation_batch = RecordBatch::try_new(
Arc::new(Schema::new(vec![
Field::new("row_id", DataType::UInt64, false),
Field::new(SPLIT_ID_COLUMN, DataType::UInt64, false),
])),
vec![
Arc::new(UInt64Array::from(row_ids.clone())),
Arc::new(split_ids),
],
)
.unwrap();
let row_ids_table = virtual_table("row_ids", &permutation_batch).await;
(base_table, row_ids_table, row_ids)
}
#[tokio::test]
async fn test_take_offsets_with_permutation_table() {
let (base_table, row_ids_table, row_ids) = setup_permutation_tables(10).await;
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
// Take specific offsets and verify the returned rows match the permutation order
let offsets = vec![0, 2, 4];
let batch = reader.take_offsets(&offsets, Select::All).await.unwrap();
assert_eq!(batch.num_rows(), 3);
let idx_values = batch
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
let expected: Vec<i32> = offsets
.iter()
.map(|&o| row_ids[o as usize] as i32)
.collect();
assert_eq!(idx_values, expected);
}
#[tokio::test]
async fn test_take_offsets_preserves_order() {
let (base_table, row_ids_table, row_ids) = setup_permutation_tables(10).await;
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
// Take offsets in reverse order and verify returned rows match that order
let offsets = vec![5, 3, 1, 0];
let batch = reader.take_offsets(&offsets, Select::All).await.unwrap();
assert_eq!(batch.num_rows(), 4);
let idx_values = batch
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
let expected: Vec<i32> = offsets
.iter()
.map(|&o| row_ids[o as usize] as i32)
.collect();
assert_eq!(idx_values, expected);
}
#[tokio::test]
async fn test_take_offsets_with_column_selection() {
let (base_table, row_ids_table, row_ids) = setup_permutation_tables(10).await;
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
let offsets = vec![1, 3];
let batch = reader
.take_offsets(&offsets, Select::Columns(vec!["idx".to_string()]))
.await
.unwrap();
assert_eq!(batch.num_rows(), 2);
assert_eq!(batch.num_columns(), 1);
assert_eq!(batch.schema().field(0).name(), "idx");
let idx_values = batch
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
let expected: Vec<i32> = offsets
.iter()
.map(|&o| row_ids[o as usize] as i32)
.collect();
assert_eq!(idx_values, expected);
}
#[tokio::test]
async fn test_take_offsets_invalid_offset() {
let (base_table, row_ids_table, _) = setup_permutation_tables(5).await;
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
// Offset 999 doesn't exist in the offset map
let result = reader.take_offsets(&[0, 999], Select::All).await;
assert!(result.is_err());
}
#[tokio::test]
async fn test_take_offsets_identity_reader() {
let base_table = lance_datagen::gen_batch()
.col("idx", lance_datagen::array::step::<Int32Type>())
.into_mem_table("tbl", RowCount::from(10), BatchCount::from(1))
.await;
let reader = PermutationReader::identity(base_table.base_table().clone()).await;
// With no permutation table, take_offsets uses the base table directly
let offsets = vec![0, 2, 4, 6];
let batch = reader.take_offsets(&offsets, Select::All).await.unwrap();
assert_eq!(batch.num_rows(), 4);
let idx_values = batch
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
assert_eq!(idx_values, vec![0, 2, 4, 6]);
}
#[tokio::test]
async fn test_take_offsets_caches_offset_map() {
let (base_table, row_ids_table, row_ids) = setup_permutation_tables(10).await;
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
// First call populates the cache
let batch1 = reader.take_offsets(&[0, 1], Select::All).await.unwrap();
// Second call should use the cached offset map and produce consistent results
let batch2 = reader.take_offsets(&[0, 1], Select::All).await.unwrap();
let values1 = batch1
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
let values2 = batch2
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
assert_eq!(values1, values2);
let expected: Vec<i32> = vec![row_ids[0] as i32, row_ids[1] as i32];
assert_eq!(values1, expected);
}
#[tokio::test]
async fn test_take_offsets_single_offset() {
let (base_table, row_ids_table, row_ids) = setup_permutation_tables(5).await;
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
let batch = reader.take_offsets(&[2], Select::All).await.unwrap();
assert_eq!(batch.num_rows(), 1);
let idx_values = batch
.column(0)
.as_primitive::<Int32Type>()
.values()
.to_vec();
assert_eq!(idx_values, vec![row_ids[2] as i32]);
}
}

View File

@@ -27,9 +27,10 @@ use crate::{
pub const SPLIT_ID_COLUMN: &str = "split_id";
/// Strategy for assigning rows to splits
#[derive(Debug, Clone)]
#[derive(Debug, Clone, Default)]
pub enum SplitStrategy {
/// All rows will have split id 0
#[default]
NoSplit,
/// Rows will be randomly assigned to splits
///
@@ -73,15 +74,6 @@ pub enum SplitStrategy {
Calculated { calculation: String },
}
// The default is not to split the data
//
// All data will be assigned to a single split.
impl Default for SplitStrategy {
fn default() -> Self {
Self::NoSplit
}
}
impl SplitStrategy {
pub fn validate(&self, num_rows: u64) -> Result<()> {
match self {

View File

@@ -18,7 +18,7 @@ use std::{
};
use arrow_array::{Array, RecordBatch, RecordBatchReader};
use arrow_schema::{DataType, Field, SchemaBuilder};
use arrow_schema::{DataType, Field, SchemaBuilder, SchemaRef};
// use async_trait::async_trait;
use serde::{Deserialize, Serialize};
@@ -190,6 +190,112 @@ impl<R: RecordBatchReader> WithEmbeddings<R> {
}
}
/// Compute embedding arrays for a batch.
///
/// When multiple embedding functions are defined, they are computed in parallel using
/// scoped threads. For a single embedding function, computation is done inline.
fn compute_embedding_arrays(
batch: &RecordBatch,
embeddings: &[(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)],
) -> Result<Vec<Arc<dyn Array>>> {
if embeddings.len() == 1 {
let (fld, func) = &embeddings[0];
let src_column =
batch
.column_by_name(&fld.source_column)
.ok_or_else(|| Error::InvalidInput {
message: format!("Source column '{}' not found", fld.source_column),
})?;
return Ok(vec![func.compute_source_embeddings(src_column.clone())?]);
}
// Parallel path: multiple embeddings
std::thread::scope(|s| {
let handles: Vec<_> = embeddings
.iter()
.map(|(fld, func)| {
let src_column = batch.column_by_name(&fld.source_column).ok_or_else(|| {
Error::InvalidInput {
message: format!("Source column '{}' not found", fld.source_column),
}
})?;
let handle = s.spawn(move || func.compute_source_embeddings(src_column.clone()));
Ok(handle)
})
.collect::<Result<_>>()?;
handles
.into_iter()
.map(|h| {
h.join().map_err(|e| Error::Runtime {
message: format!("Thread panicked during embedding computation: {:?}", e),
})?
})
.collect()
})
}
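Note: the parallel path in `compute_embedding_arrays` relies on `std::thread::scope`, which lets spawned threads borrow data from the caller's stack because every thread is joined before the scope returns. Below is a minimal sketch of the same structure using plain `f32` slices instead of Arrow arrays. `apply_all`, `double`, and `negate` are hypothetical names, and the `String` error type is a simplification.

```rust
use std::thread;

/// Apply every function in `funcs` to `input`, one scoped thread per function.
/// Results come back in the same order as `funcs`.
fn apply_all<F>(input: &[f32], funcs: &[F]) -> Result<Vec<Vec<f32>>, String>
where
    F: Fn(&[f32]) -> Result<Vec<f32>, String> + Sync,
{
    // Single function: run inline and skip the thread machinery.
    if let [only] = funcs {
        return Ok(vec![only(input)?]);
    }
    thread::scope(|s| {
        // Spawn first, so all workers run concurrently...
        let handles: Vec<_> = funcs
            .iter()
            .map(|f| s.spawn(move || f(input)))
            .collect();
        // ...then join in order, turning a panic into an ordinary error.
        handles
            .into_iter()
            .map(|h| {
                h.join()
                    .map_err(|e| format!("worker panicked: {:?}", e))?
            })
            .collect()
    })
}

/// Example worker: doubles every value.
fn double(x: &[f32]) -> Result<Vec<f32>, String> {
    Ok(x.iter().map(|v| v * 2.0).collect())
}

/// Example worker: negates every value.
fn negate(x: &[f32]) -> Result<Vec<f32>, String> {
    Ok(x.iter().map(|v| -v).collect())
}
```

Collecting the handles into a `Vec` before joining matters: joining inside the same iterator chain that spawns would serialize the work.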
/// Compute the output schema when embeddings are applied to a base schema.
///
/// This returns the schema with embedding columns appended.
pub fn compute_output_schema(
base_schema: &SchemaRef,
embeddings: &[(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)],
) -> Result<SchemaRef> {
let mut sb: SchemaBuilder = base_schema.as_ref().into();
for (ed, func) in embeddings {
let src_field = base_schema
.field_with_name(&ed.source_column)
.map_err(|_| Error::InvalidInput {
message: format!("Source column '{}' not found in schema", ed.source_column),
})?;
let field_name = ed
.dest_column
.clone()
.unwrap_or_else(|| format!("{}_embedding", &ed.source_column));
sb.push(Field::new(
field_name,
func.dest_type()?.into_owned(),
src_field.is_nullable(),
));
}
Ok(Arc::new(sb.finish()))
}
/// Compute embeddings for a batch and append as new columns.
///
/// This function computes embeddings using the provided embedding functions and
/// appends them as new columns to the batch.
pub fn compute_embeddings_for_batch(
batch: RecordBatch,
embeddings: &[(EmbeddingDefinition, Arc<dyn EmbeddingFunction>)],
) -> Result<RecordBatch> {
let embedding_arrays = compute_embedding_arrays(&batch, embeddings)?;
let mut result = batch;
for ((fld, _), embedding) in embeddings.iter().zip(embedding_arrays.iter()) {
let dst_field_name = fld
.dest_column
.clone()
.unwrap_or_else(|| format!("{}_embedding", &fld.source_column));
let dst_field = Field::new(
dst_field_name,
embedding.data_type().clone(),
embedding.nulls().is_some(),
);
result = result.try_with_column(dst_field, embedding.clone())?;
}
Ok(result)
}
impl<R: RecordBatchReader> WithEmbeddings<R> {
fn dest_fields(&self) -> Result<Vec<Field>> {
let schema = self.inner.schema();
@@ -240,48 +346,6 @@ impl<R: RecordBatchReader> WithEmbeddings<R> {
column_definitions,
})
}
fn compute_embeddings_parallel(&self, batch: &RecordBatch) -> Result<Vec<Arc<dyn Array>>> {
if self.embeddings.len() == 1 {
let (fld, func) = &self.embeddings[0];
let src_column =
batch
.column_by_name(&fld.source_column)
.ok_or_else(|| Error::InvalidInput {
message: format!("Source column '{}' not found", fld.source_column),
})?;
return Ok(vec![func.compute_source_embeddings(src_column.clone())?]);
}
// Parallel path: multiple embeddings
std::thread::scope(|s| {
let handles: Vec<_> = self
.embeddings
.iter()
.map(|(fld, func)| {
let src_column = batch.column_by_name(&fld.source_column).ok_or_else(|| {
Error::InvalidInput {
message: format!("Source column '{}' not found", fld.source_column),
}
})?;
let handle =
s.spawn(move || func.compute_source_embeddings(src_column.clone()));
Ok(handle)
})
.collect::<Result<_>>()?;
handles
.into_iter()
.map(|h| {
h.join().map_err(|e| Error::Runtime {
message: format!("Thread panicked during embedding computation: {:?}", e),
})?
})
.collect()
})
}
}
impl<R: RecordBatchReader> Iterator for MaybeEmbedded<R> {
@@ -309,37 +373,13 @@ impl<R: RecordBatchReader> Iterator for WithEmbeddings<R> {
fn next(&mut self) -> Option<Self::Item> {
let batch = self.inner.next()?;
match batch {
Ok(batch) => {
let embeddings = match self.compute_embeddings_parallel(&batch) {
Ok(emb) => emb,
Err(e) => {
return Some(Err(arrow_schema::ArrowError::ComputeError(format!(
"Error computing embedding: {}",
e
))))
}
};
let mut batch = batch;
for ((fld, _), embedding) in self.embeddings.iter().zip(embeddings.iter()) {
let dst_field_name = fld
.dest_column
.clone()
.unwrap_or_else(|| format!("{}_embedding", &fld.source_column));
let dst_field = Field::new(
dst_field_name,
embedding.data_type().clone(),
embedding.nulls().is_some(),
);
match batch.try_with_column(dst_field.clone(), embedding.clone()) {
Ok(b) => batch = b,
Err(e) => return Some(Err(e)),
};
}
Some(Ok(batch))
}
Ok(batch) => match compute_embeddings_for_batch(batch, &self.embeddings) {
Ok(batch_with_embeddings) => Some(Ok(batch_with_embeddings)),
Err(e) => Some(Err(arrow_schema::ArrowError::ComputeError(format!(
"Error computing embedding: {}",
e
)))),
},
Err(e) => Some(Err(e)),
}
}

View File

@@ -6,7 +6,7 @@ use std::sync::PoisonError;
use arrow_schema::ArrowError;
use snafu::Snafu;
type BoxError = Box<dyn std::error::Error + Send + Sync>;
pub(crate) type BoxError = Box<dyn std::error::Error + Send + Sync>;
#[derive(Debug, Snafu)]
#[snafu(visibility(pub(crate)))]
@@ -80,6 +80,9 @@ pub enum Error {
Arrow { source: ArrowError },
#[snafu(display("LanceDBError: not supported: {message}"))]
NotSupported { message: String },
/// External error passed through from user code.
#[snafu(transparent)]
External { source: BoxError },
#[snafu(whatever, display("{message}"))]
Other {
message: String,
@@ -92,15 +95,26 @@ pub type Result<T> = std::result::Result<T, Error>;
impl From<ArrowError> for Error {
fn from(source: ArrowError) -> Self {
Self::Arrow { source }
match source {
ArrowError::ExternalError(source) => match source.downcast::<Self>() {
Ok(e) => *e,
Err(source) => Self::External { source },
},
_ => Self::Arrow { source },
}
}
}
impl From<lance::Error> for Error {
fn from(source: lance::Error) -> Self {
// TODO: Once Lance is changed to preserve ObjectStore, DataFusion, and Arrow errors, we can
// pass those variants through here as well.
Self::Lance { source }
// Try to unwrap external errors that were wrapped by lance
match source {
lance::Error::Wrapped { error, .. } => match error.downcast::<Self>() {
Ok(e) => *e,
Err(source) => Self::External { source },
},
_ => Self::Lance { source },
}
}
}
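Note: both `From` impls above try to recover a crate-level error that an intermediate layer boxed, and fall back to an `External` variant when the boxed error is something else. The sketch below shows that downcast-and-unwrap pattern with the standard library only. `AppError` and `LibError` are hypothetical stand-ins for the crate's `Error` and for `ArrowError` / `lance::Error`; their variants are assumptions made for illustration.

```rust
use std::error::Error as StdError;
use std::fmt;

type BoxError = Box<dyn StdError + Send + Sync>;

/// Hypothetical stand-in for the crate's own error type.
#[derive(Debug)]
enum AppError {
    NotSupported(String),
    /// An error from user code that is not one of ours.
    External(BoxError),
    /// A genuine error from the lower-level library.
    Lib(String),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::NotSupported(m) => write!(f, "not supported: {m}"),
            AppError::External(e) => write!(f, "{e}"),
            AppError::Lib(m) => write!(f, "library error: {m}"),
        }
    }
}

impl StdError for AppError {}

/// Hypothetical stand-in for a lower-level library error that can wrap
/// arbitrary boxed errors.
#[derive(Debug)]
enum LibError {
    Wrapped(BoxError),
    Io(String),
}

impl From<LibError> for AppError {
    fn from(source: LibError) -> Self {
        match source {
            // If the boxed error is one of ours, hand it back unchanged;
            // otherwise keep the box and mark it as external.
            LibError::Wrapped(inner) => match inner.downcast::<AppError>() {
                Ok(e) => *e,
                Err(other) => AppError::External(other),
            },
            LibError::Io(msg) => AppError::Lib(msg),
        }
    }
}
```

`downcast` returns the original box on failure, so no information is lost when the wrapped error turns out to be foreign.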

View File

@@ -218,8 +218,9 @@ mod test {
datagen = datagen.col(Box::<IncrementingInt32>::default());
datagen = datagen.col(Box::new(RandomVector::default().named("vector".into())));
let data: Box<dyn arrow_array::RecordBatchReader + Send> = Box::new(datagen.batch(100));
let res = db
.create_table("test", Box::new(datagen.batch(100)))
.create_table("test", data)
.write_options(WriteOptions {
lance_write_params: Some(param),
})

View File

@@ -12,10 +12,10 @@ use arrow_schema::Schema;
use crate::{Error, Result};
/// Convert an Arrow IPC file to a batch reader
pub fn ipc_file_to_batches(buf: Vec<u8>) -> Result<impl RecordBatchReader> {
pub fn ipc_file_to_batches(buf: Vec<u8>) -> Result<Box<dyn RecordBatchReader + Send>> {
let buf_reader = Cursor::new(buf);
let reader = FileReader::try_new(buf_reader, None)?;
Ok(reader)
Ok(Box::new(reader))
}
/// Convert record batches to Arrow IPC file

View File

@@ -39,7 +39,6 @@
//! #### Connect to a database.
//!
//! ```rust
//! # use arrow_schema::{Field, Schema};
//! # tokio::runtime::Runtime::new().unwrap().block_on(async {
//! let db = lancedb::connect("data/sample-lancedb").execute().await.unwrap();
//! # });
@@ -74,7 +73,10 @@
//!
//! #### Create a table
//!
//! To create a Table, you need to provide a [`arrow_schema::Schema`] and a [`arrow_array::RecordBatch`] stream.
//! To create a Table, you need to provide an [`arrow_array::RecordBatch`]. The
//! schema of the `RecordBatch` determines the schema of the table.
//!
//! Vector columns should be represented as `FixedSizeList<Float16/Float32>` data type.
//!
//! ```rust
//! # use std::sync::Arc;
@@ -85,34 +87,29 @@
//! # tokio::runtime::Runtime::new().unwrap().block_on(async {
//! # let tmpdir = tempfile::tempdir().unwrap();
//! # let db = lancedb::connect(tmpdir.path().to_str().unwrap()).execute().await.unwrap();
//! let ndims = 128;
//! let schema = Arc::new(Schema::new(vec![
//! Field::new("id", DataType::Int32, false),
//! Field::new(
//! "vector",
//! DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), 128),
//! DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), ndims),
//! true,
//! ),
//! ]));
//! // Create a RecordBatch stream.
//! let batches = RecordBatchIterator::new(
//! vec![RecordBatch::try_new(
//! let data = RecordBatch::try_new(
//! schema.clone(),
//! vec![
//! Arc::new(Int32Array::from_iter_values(0..256)),
//! Arc::new(
//! FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
//! (0..256).map(|_| Some(vec![Some(1.0); 128])),
//! 128,
//! (0..256).map(|_| Some(vec![Some(1.0); ndims as usize])),
//! ndims,
//! ),
//! ),
//! ],
//! )
//! .unwrap()]
//! .into_iter()
//! .map(Ok),
//! schema.clone(),
//! );
//! db.create_table("my_table", Box::new(batches))
//! .unwrap();
//! db.create_table("my_table", data)
//! .execute()
//! .await
//! .unwrap();
@@ -151,42 +148,18 @@
//! #### Open table and search
//!
//! ```rust
//! # use std::sync::Arc;
//! # use futures::TryStreamExt;
//! # use arrow_schema::{DataType, Schema, Field};
//! # use arrow_array::{RecordBatch, RecordBatchIterator};
//! # use arrow_array::{FixedSizeListArray, Float32Array, Int32Array, types::Float32Type};
//! # use lancedb::query::{ExecutableQuery, QueryBase};
//! # tokio::runtime::Runtime::new().unwrap().block_on(async {
//! # let tmpdir = tempfile::tempdir().unwrap();
//! # let db = lancedb::connect(tmpdir.path().to_str().unwrap()).execute().await.unwrap();
//! # let schema = Arc::new(Schema::new(vec![
//! # Field::new("id", DataType::Int32, false),
//! # Field::new("vector", DataType::FixedSizeList(
//! # Arc::new(Field::new("item", DataType::Float32, true)), 128), true),
//! # ]));
//! # let batches = RecordBatchIterator::new(vec![
//! # RecordBatch::try_new(schema.clone(),
//! # vec![
//! # Arc::new(Int32Array::from_iter_values(0..10)),
//! # Arc::new(FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>(
//! # (0..10).map(|_| Some(vec![Some(1.0); 128])), 128)),
//! # ]).unwrap()
//! # ].into_iter().map(Ok),
//! # schema.clone());
//! # db.create_table("my_table", Box::new(batches)).execute().await.unwrap();
//! # let table = db.open_table("my_table").execute().await.unwrap();
//! # async fn example(table: &lancedb::Table) -> lancedb::Result<()> {
//! let results = table
//! .query()
//! .nearest_to(&[1.0; 128])
//! .unwrap()
//! .nearest_to(&[1.0; 128])?
//! .execute()
//! .await
//! .unwrap()
//! .await?
//! .try_collect::<Vec<_>>()
//! .await
//! .unwrap();
//! # });
//! .await?;
//! # Ok(())
//! # }
//! ```
pub mod arrow;
@@ -219,13 +192,14 @@ pub use error::{Error, Result};
use lance_linalg::distance::DistanceType as LanceDistanceType;
pub use table::Table;
#[derive(Debug, Copy, Clone, PartialEq, Serialize, Deserialize)]
#[derive(Debug, Copy, Clone, PartialEq, Serialize, Deserialize, Default)]
#[non_exhaustive]
#[serde(rename_all = "lowercase")]
pub enum DistanceType {
/// Euclidean distance. This is a very common distance metric that
/// accounts for both magnitude and direction when determining the distance
/// between vectors. l2 distance has a range of [0, ∞).
#[default]
L2,
/// Cosine distance. Cosine distance is a distance metric
/// calculated from the cosine similarity between two vectors. Cosine
@@ -247,12 +221,6 @@ pub enum DistanceType {
Hamming,
}
impl Default for DistanceType {
fn default() -> Self {
Self::L2
}
}
impl From<DistanceType> for LanceDistanceType {
fn from(value: DistanceType) -> Self {
match value {
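
The hunk above replaces the hand-written `impl Default for DistanceType` with `#[derive(Default)]` plus a `#[default]` attribute on the `L2` variant. A small sketch of the equivalence, using hypothetical `Metric` and `ManualMetric` enums rather than the real `DistanceType`:

```rust
/// Hypothetical enum using the derive with a `#[default]` variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Metric {
    #[default]
    L2,
    Cosine,
    Dot,
}

/// The same default written by hand, which the derive makes unnecessary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ManualMetric {
    L2,
    Cosine,
}

impl Default for ManualMetric {
    fn default() -> Self {
        Self::L2
    }
}
```

Both `Metric::default()` and `ManualMetric::default()` return the `L2` variant; the derived form keeps the default next to the variant it names.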

View File

@@ -1381,7 +1381,7 @@ mod tests {
use arrow::{array::downcast_array, compute::concat_batches, datatypes::Int32Type};
use arrow_array::{
cast::AsArray, types::Float32Type, FixedSizeListArray, Float32Array, Int32Array,
RecordBatch, RecordBatchIterator, RecordBatchReader, StringArray,
RecordBatch, StringArray,
};
use arrow_schema::{DataType, Field as ArrowField, Schema as ArrowSchema};
use futures::{StreamExt, TryStreamExt};
@@ -1402,7 +1402,7 @@ mod tests {
let batches = make_test_batches();
let conn = connect(uri).execute().await.unwrap();
let table = conn
.create_table("my_table", Box::new(batches))
.create_table("my_table", batches)
.execute()
.await
.unwrap();
@@ -1463,7 +1463,7 @@ mod tests {
let batches = make_non_empty_batches();
let conn = connect(uri).execute().await.unwrap();
let table = conn
.create_table("my_table", Box::new(batches))
.create_table("my_table", batches)
.execute()
.await
.unwrap();
@@ -1525,7 +1525,7 @@ mod tests {
let batches = make_non_empty_batches();
let conn = connect(uri).execute().await.unwrap();
let table = conn
.create_table("my_table", Box::new(batches))
.create_table("my_table", batches)
.execute()
.await
.unwrap();
@@ -1578,7 +1578,7 @@ mod tests {
let batches = make_non_empty_batches();
let conn = connect(uri).execute().await.unwrap();
let table = conn
.create_table("my_table", Box::new(batches))
.create_table("my_table", batches)
.execute()
.await
.unwrap();
@@ -1599,13 +1599,13 @@ mod tests {
assert!(result.is_err());
}
fn make_non_empty_batches() -> impl RecordBatchReader + Send + 'static {
fn make_non_empty_batches() -> Box<dyn arrow_array::RecordBatchReader + Send> {
let vec = Box::new(RandomVector::new().named("vector".to_string()));
let id = Box::new(IncrementingInt32::new().named("id".to_string()));
BatchGenerator::new().col(vec).col(id).batch(512)
Box::new(BatchGenerator::new().col(vec).col(id).batch(512))
}
fn make_test_batches() -> impl RecordBatchReader + Send + 'static {
fn make_test_batches() -> RecordBatch {
let dim: usize = 128;
let schema = Arc::new(ArrowSchema::new(vec![
ArrowField::new("key", DataType::Int32, false),
@@ -1619,12 +1619,7 @@ mod tests {
),
ArrowField::new("uri", DataType::Utf8, true),
]));
RecordBatchIterator::new(
vec![RecordBatch::new_empty(schema.clone())]
.into_iter()
.map(Ok),
schema,
)
RecordBatch::new_empty(schema)
}
async fn make_test_table(tmp_dir: &tempfile::TempDir) -> Table {
@@ -1633,7 +1628,7 @@ mod tests {
let batches = make_non_empty_batches();
let conn = connect(uri).execute().await.unwrap();
conn.create_table("my_table", Box::new(batches))
conn.create_table("my_table", batches)
.execute()
.await
.unwrap()
@@ -1862,10 +1857,8 @@ mod tests {
let record_batch =
RecordBatch::try_new(schema.clone(), vec![Arc::new(text), Arc::new(vector)]).unwrap();
let record_batch_iter =
RecordBatchIterator::new(vec![record_batch].into_iter().map(Ok), schema.clone());
let table = conn
.create_table("my_table", record_batch_iter)
.create_table("my_table", record_batch)
.execute()
.await
.unwrap();
@@ -1949,10 +1942,8 @@ mod tests {
],
)
.unwrap();
let record_batch_iter =
RecordBatchIterator::new(vec![record_batch].into_iter().map(Ok), schema.clone());
let table = conn
.create_table("my_table", record_batch_iter)
.create_table("my_table", record_batch)
.mode(CreateTableMode::Overwrite)
.execute()
.await
@@ -2062,8 +2053,6 @@ mod tests {
async fn test_pagination_with_fts() {
let db = connect("memory://test").execute().await.unwrap();
let data = fts_test_data(400);
let schema = data.schema();
let data = RecordBatchIterator::new(vec![Ok(data)], schema);
let table = db.create_table("test_table", data).execute().await.unwrap();
table

View File

@@ -491,7 +491,7 @@ impl<S: HttpSend> RestfulLanceDbClient<S> {
}
/// Apply dynamic headers from the header provider if configured
async fn apply_dynamic_headers(&self, mut request: Request) -> Result<Request> {
pub(crate) async fn apply_dynamic_headers(&self, mut request: Request) -> Result<Request> {
if let Some(ref provider) = self.header_provider {
let headers = provider.get_headers().await?;
let request_headers = request.headers_mut();
@@ -555,7 +555,9 @@ impl<S: HttpSend> RestfulLanceDbClient<S> {
message: "Attempted to retry a request that cannot be cloned".to_string(),
})?;
let (_, r) = tmp_req.build_split();
let mut r = r.unwrap();
let mut r = r.map_err(|e| Error::Runtime {
message: format!("Failed to build request: {}", e),
})?;
let request_id = self.extract_request_id(&mut r);
let mut retry_counter = RetryCounter::new(retry_config, request_id.clone());
@@ -571,7 +573,9 @@ impl<S: HttpSend> RestfulLanceDbClient<S> {
}
let (c, request) = req_builder.build_split();
let mut request = request.unwrap();
let mut request = request.map_err(|e| Error::Runtime {
message: format!("Failed to build request: {}", e),
})?;
self.set_request_id(&mut request, &request_id.clone());
// Apply dynamic headers before each retry attempt
@@ -621,7 +625,7 @@ impl<S: HttpSend> RestfulLanceDbClient<S> {
}
}
fn log_request(&self, request: &Request, request_id: &String) {
pub(crate) fn log_request(&self, request: &Request, request_id: &String) {
if log::log_enabled!(log::Level::Debug) {
let content_type = request
.headers()
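
The retry path above now converts a failed request build into `Error::Runtime` with `map_err(...)?` instead of panicking through `unwrap()`. A minimal sketch of the same conversion, assuming a hypothetical `ClientError` type and using integer parsing as a stand-in for the fallible build step:

```rust
use std::fmt;

/// Hypothetical error type with a single runtime variant.
#[derive(Debug, PartialEq)]
enum ClientError {
    Runtime { message: String },
}

impl fmt::Display for ClientError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Runtime { message } => write!(f, "runtime error: {}", message),
        }
    }
}

impl std::error::Error for ClientError {}

/// Stand-in for a fallible build step: the failure is mapped into the
/// caller's error type and returned, rather than unwrapped.
fn build_port(raw: &str) -> Result<u16, ClientError> {
    let port = raw.parse::<u16>().map_err(|e| ClientError::Runtime {
        message: format!("Failed to build request: {}", e),
    })?;
    Ok(port)
}
```

A call such as `build_port("nope")` returns `Err(ClientError::Runtime { .. })`, so the caller decides how to handle the failure.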

View File

@@ -4,13 +4,11 @@
use std::collections::HashMap;
use std::sync::Arc;
use arrow_array::RecordBatchIterator;
use async_trait::async_trait;
use http::StatusCode;
use lance_io::object_store::StorageOptions;
use moka::future::Cache;
use reqwest::header::CONTENT_TYPE;
use tokio::task::spawn_blocking;
use lance_namespace::models::{
CreateNamespaceRequest, CreateNamespaceResponse, DescribeNamespaceRequest,
@@ -19,16 +17,17 @@ use lance_namespace::models::{
};
use crate::database::{
CloneTableRequest, CreateTableData, CreateTableMode, CreateTableRequest, Database,
DatabaseOptions, OpenTableRequest, ReadConsistency, TableNamesRequest,
CloneTableRequest, CreateTableMode, CreateTableRequest, Database, DatabaseOptions,
OpenTableRequest, ReadConsistency, TableNamesRequest,
};
use crate::error::Result;
use crate::remote::util::stream_as_body;
use crate::table::BaseTable;
use crate::Error;
use super::client::{ClientConfig, HttpSend, RequestResultExt, RestfulLanceDbClient, Sender};
use super::table::RemoteTable;
use super::util::{batches_to_ipc_bytes, parse_server_version};
use super::util::parse_server_version;
use super::ARROW_STREAM_CONTENT_TYPE;
// Request structure for the remote clone table API
@@ -436,26 +435,8 @@ impl<S: HttpSend> Database for RemoteDatabase<S> {
Ok(response)
}
async fn create_table(&self, request: CreateTableRequest) -> Result<Arc<dyn BaseTable>> {
let data = match request.data {
CreateTableData::Data(data) => data,
CreateTableData::StreamingData(_) => {
return Err(Error::NotSupported {
message: "Creating a remote table from a streaming source".to_string(),
})
}
CreateTableData::Empty(table_definition) => {
let schema = table_definition.schema.clone();
Box::new(RecordBatchIterator::new(vec![], schema))
}
};
// TODO: https://github.com/lancedb/lancedb/issues/1026
// We should accept data from an async source. In the meantime, spawn this as blocking
// to make sure we don't block the tokio runtime if the source is slow.
let data_buffer = spawn_blocking(move || batches_to_ipc_bytes(data))
.await
.unwrap()?;
async fn create_table(&self, mut request: CreateTableRequest) -> Result<Arc<dyn BaseTable>> {
let body = stream_as_body(request.data.scan_as_stream())?;
let identifier =
build_table_identifier(&request.name, &request.namespace, &self.client.id_delimiter);
@@ -463,7 +444,7 @@ impl<S: HttpSend> Database for RemoteDatabase<S> {
.client
.post(&format!("/v1/table/{}/create/", identifier))
.query(&[("mode", Into::<&str>::into(&request.mode))])
.body(data_buffer)
.body(body)
.header(CONTENT_TYPE, ARROW_STREAM_CONTENT_TYPE);
let (request_id, rsp) = self.client.send(req).await?;
@@ -813,7 +794,7 @@ mod tests {
use std::collections::HashMap;
use std::sync::{Arc, OnceLock};
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator};
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use crate::connection::ConnectBuilder;
@@ -993,8 +974,7 @@ mod tests {
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap();
let reader = RecordBatchIterator::new([Ok(data.clone())], data.schema());
let table = conn.create_table("table1", reader).execute().await.unwrap();
let table = conn.create_table("table1", data).execute().await.unwrap();
assert_eq!(table.name(), "table1");
}
@@ -1011,8 +991,7 @@ mod tests {
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap();
let reader = RecordBatchIterator::new([Ok(data.clone())], data.schema());
let result = conn.create_table("table1", reader).execute().await;
let result = conn.create_table("table1", data).execute().await;
assert!(result.is_err());
assert!(
matches!(result, Err(crate::Error::TableAlreadyExists { name }) if name == "table1")
@@ -1045,8 +1024,7 @@ mod tests {
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap();
let reader = RecordBatchIterator::new([Ok(data.clone())], data.schema());
let mut builder = conn.create_table("table1", reader);
let mut builder = conn.create_table("table1", data.clone());
if let Some(mode) = mode {
builder = builder.mode(mode);
}
@@ -1071,9 +1049,8 @@ mod tests {
.unwrap();
let called: Arc<OnceLock<bool>> = Arc::new(OnceLock::new());
let reader = RecordBatchIterator::new([Ok(data.clone())], data.schema());
let called_in_cb = called.clone();
conn.create_table("table1", reader)
conn.create_table("table1", data)
.mode(CreateTableMode::ExistOk(Box::new(move |b| {
called_in_cb.clone().set(true).unwrap();
b
@@ -1262,9 +1239,8 @@ mod tests {
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap();
let reader = RecordBatchIterator::new([Ok(data.clone())], data.schema());
let table = conn
.create_table("table1", reader)
.create_table("table1", data)
.namespace(vec!["ns1".to_string()])
.execute()
.await
@@ -1730,10 +1706,8 @@ mod tests {
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap();
let reader = RecordBatchIterator::new([Ok(data.clone())], schema.clone());
let table = conn
.create_table("test_table", reader)
.create_table("test_table", data)
.namespace(namespace.clone())
.execute()
.await;
@@ -1806,9 +1780,7 @@ mod tests {
let data =
RecordBatch::try_new(schema.clone(), vec![Arc::new(Int32Array::from(vec![i]))])
.unwrap();
let reader = RecordBatchIterator::new([Ok(data.clone())], schema.clone());
conn.create_table(format!("table{}", i), reader)
conn.create_table(format!("table{}", i), data)
.namespace(namespace.clone())
.execute()
.await

File diff suppressed because it is too large

View File

@@ -1,29 +1,50 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::io::Cursor;
use arrow_array::RecordBatchReader;
use arrow_ipc::CompressionType;
use futures::{Stream, StreamExt};
use reqwest::Response;
use crate::Result;
use crate::{arrow::SendableRecordBatchStream, Result};
use super::db::ServerVersion;
pub fn batches_to_ipc_bytes(batches: impl RecordBatchReader) -> Result<Vec<u8>> {
pub fn stream_as_ipc(
data: SendableRecordBatchStream,
) -> Result<impl Stream<Item = Result<bytes::Bytes>>> {
let options = arrow_ipc::writer::IpcWriteOptions::default()
.try_with_compression(Some(CompressionType::LZ4_FRAME))?;
const WRITE_BUF_SIZE: usize = 4096;
let buf = Vec::with_capacity(WRITE_BUF_SIZE);
let mut buf = Cursor::new(buf);
{
let mut writer = arrow_ipc::writer::StreamWriter::try_new(&mut buf, &batches.schema())?;
let writer =
arrow_ipc::writer::StreamWriter::try_new_with_options(buf, &data.schema(), options)?;
let stream = futures::stream::try_unfold(
(data, writer, false),
move |(mut data, mut writer, finished)| async move {
if finished {
return Ok(None);
}
match data.next().await {
Some(Ok(batch)) => {
writer.write(&batch)?;
let buffer = std::mem::take(writer.get_mut());
Ok(Some((bytes::Bytes::from(buffer), (data, writer, false))))
}
Some(Err(e)) => Err(e),
None => {
writer.finish()?;
let buffer = std::mem::take(writer.get_mut());
Ok(Some((bytes::Bytes::from(buffer), (data, writer, true))))
}
}
},
);
Ok(stream)
}
for batch in batches {
let batch = batch?;
writer.write(&batch)?;
}
writer.finish()?;
}
Ok(buf.into_inner())
pub fn stream_as_body(data: SendableRecordBatchStream) -> Result<reqwest::Body> {
let stream = stream_as_ipc(data)?;
Ok(reqwest::Body::wrap_stream(stream))
}
pub fn parse_server_version(req_id: &str, rsp: &Response) -> Result<ServerVersion> {
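
`stream_as_ipc` above encodes one batch at a time and hands the writer's buffer downstream with `std::mem::take`, so the whole payload is never held in memory at once; a `finished` flag lets it emit one last chunk after `writer.finish()`. The sketch below shows the same drain-per-item idea with a synchronous iterator and only the standard library. `FramedWriter` and `encode_chunks` are hypothetical names, and the length-prefixed framing is an illustration, not the Arrow IPC format.

```rust
use std::io::{self, Write};

/// Hypothetical writer: length-prefixes each record and writes a zero-length
/// end marker on finish.
#[derive(Default)]
struct FramedWriter {
    buf: Vec<u8>,
}

impl FramedWriter {
    fn write_record(&mut self, record: &[u8]) -> io::Result<()> {
        self.buf.write_all(&(record.len() as u32).to_le_bytes())?;
        self.buf.write_all(record)
    }

    fn finish(&mut self) -> io::Result<()> {
        self.buf.write_all(&0u32.to_le_bytes())
    }

    /// Hand out everything written so far and leave an empty buffer behind.
    fn take_buffer(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buf)
    }
}

/// Turn a fallible source of records into a fallible source of encoded chunks.
fn encode_chunks<I>(mut records: I) -> impl Iterator<Item = io::Result<Vec<u8>>>
where
    I: Iterator<Item = io::Result<Vec<u8>>>,
{
    let mut writer = FramedWriter::default();
    let mut finished = false;
    std::iter::from_fn(move || {
        if finished {
            return None;
        }
        match records.next() {
            Some(Ok(record)) => {
                if let Err(e) = writer.write_record(&record) {
                    finished = true;
                    return Some(Err(e));
                }
                Some(Ok(writer.take_buffer()))
            }
            // Surface the source error as the final item.
            Some(Err(e)) => {
                finished = true;
                Some(Err(e))
            }
            // Source exhausted: flush the end marker as the final chunk.
            None => {
                finished = true;
                Some(writer.finish().map(|()| writer.take_buffer()))
            }
        }
    })
}
```

Two input records yield three chunks here: one per record plus the end marker. An error from the source ends the sequence at that point.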

File diff suppressed because it is too large

View File

@@ -0,0 +1,343 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::Arc;
use serde::{Deserialize, Serialize};
use crate::data::scannable::Scannable;
use crate::embeddings::EmbeddingRegistry;
use crate::Result;
use super::{BaseTable, WriteOptions};
#[derive(Debug, Clone, Default)]
pub enum AddDataMode {
/// Rows will be appended to the table (the default)
#[default]
Append,
/// The existing table will be overwritten with the new data
Overwrite,
}
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize, Default)]
pub struct AddResult {
/// The commit version associated with the operation.
/// A version of `0` indicates compatibility with legacy servers that do not return
/// a commit version.
#[serde(default)]
pub version: u64,
}
/// A builder for configuring a [`crate::table::Table::add`] operation
pub struct AddDataBuilder {
pub(crate) parent: Arc<dyn BaseTable>,
pub(crate) data: Box<dyn Scannable>,
pub(crate) mode: AddDataMode,
pub(crate) write_options: WriteOptions,
pub(crate) embedding_registry: Option<Arc<dyn EmbeddingRegistry>>,
}
impl std::fmt::Debug for AddDataBuilder {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("AddDataBuilder")
.field("parent", &self.parent)
.field("mode", &self.mode)
.field("write_options", &self.write_options)
.finish()
}
}
impl AddDataBuilder {
pub(crate) fn new(
parent: Arc<dyn BaseTable>,
data: Box<dyn Scannable>,
embedding_registry: Option<Arc<dyn EmbeddingRegistry>>,
) -> Self {
Self {
parent,
data,
mode: AddDataMode::Append,
write_options: WriteOptions::default(),
embedding_registry,
}
}
pub fn mode(mut self, mode: AddDataMode) -> Self {
self.mode = mode;
self
}
pub fn write_options(mut self, options: WriteOptions) -> Self {
self.write_options = options;
self
}
pub async fn execute(self) -> Result<AddResult> {
self.parent.clone().add(self).await
}
}
#[cfg(test)]
mod tests {
use std::sync::Arc;
use arrow_array::{record_batch, RecordBatch, RecordBatchIterator};
use arrow_schema::{ArrowError, DataType, Field, Schema};
use futures::TryStreamExt;
use lance::dataset::{WriteMode, WriteParams};
use crate::arrow::{SendableRecordBatchStream, SimpleRecordBatchStream};
use crate::connect;
use crate::data::scannable::Scannable;
use crate::embeddings::{
EmbeddingDefinition, EmbeddingFunction, EmbeddingRegistry, MemoryRegistry,
};
use crate::query::{ExecutableQuery, QueryBase, Select};
use crate::table::{ColumnDefinition, ColumnKind, Table, TableDefinition, WriteOptions};
use crate::test_utils::embeddings::MockEmbed;
use crate::Error;
use super::AddDataMode;
async fn create_test_table() -> Table {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("id", Int64, [1, 2, 3])).unwrap();
conn.create_table("test", batch).execute().await.unwrap()
}
async fn test_add_with_data<T>(data: T)
where
T: Scannable + 'static,
{
let table = create_test_table().await;
let schema = data.schema();
table.add(data).execute().await.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 5); // 3 initial + 2 added
assert_eq!(table.schema().await.unwrap(), schema);
}
#[tokio::test]
async fn test_add_with_batch() {
let batch = record_batch!(("id", Int64, [4, 5])).unwrap();
test_add_with_data(batch).await;
}
#[tokio::test]
async fn test_add_with_vec_batch() {
let data = vec![
record_batch!(("id", Int64, [4])).unwrap(),
record_batch!(("id", Int64, [5])).unwrap(),
];
test_add_with_data(data).await;
}
#[tokio::test]
async fn test_add_with_record_batch_reader() {
let data = vec![
record_batch!(("id", Int64, [4])).unwrap(),
record_batch!(("id", Int64, [5])).unwrap(),
];
let schema = data[0].schema();
let reader: Box<dyn arrow_array::RecordBatchReader + Send> = Box::new(
RecordBatchIterator::new(data.into_iter().map(Ok), schema.clone()),
);
test_add_with_data(reader).await;
}
#[tokio::test]
async fn test_add_with_stream() {
let data = vec![
record_batch!(("id", Int64, [4])).unwrap(),
record_batch!(("id", Int64, [5])).unwrap(),
];
let schema = data[0].schema();
let inner = futures::stream::iter(data.into_iter().map(Ok));
let stream: SendableRecordBatchStream = Box::pin(SimpleRecordBatchStream {
schema,
stream: inner,
});
test_add_with_data(stream).await;
}
#[derive(Debug)]
struct MyError;
impl std::fmt::Display for MyError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "MyError occurred")
}
}
impl std::error::Error for MyError {}
#[tokio::test]
async fn test_add_preserves_reader_error() {
let table = create_test_table().await;
let first_batch = record_batch!(("id", Int64, [4])).unwrap();
let schema = first_batch.schema();
let iterator = vec![
Ok(first_batch),
Err(ArrowError::ExternalError(Box::new(MyError))),
];
let reader: Box<dyn arrow_array::RecordBatchReader + Send> = Box::new(
RecordBatchIterator::new(iterator.into_iter(), schema.clone()),
);
let result = table.add(reader).execute().await;
assert!(result.is_err());
}
#[tokio::test]
async fn test_add_preserves_stream_error() {
let table = create_test_table().await;
let first_batch = record_batch!(("id", Int64, [4])).unwrap();
let schema = first_batch.schema();
let iterator = vec![
Ok(first_batch),
Err(Error::External {
source: Box::new(MyError),
}),
];
let stream = futures::stream::iter(iterator);
let stream: SendableRecordBatchStream = Box::pin(SimpleRecordBatchStream {
schema: schema.clone(),
stream,
});
let result = table.add(stream).execute().await;
assert!(result.is_err());
}
#[tokio::test]
async fn test_add() {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("i", Int32, [0, 1, 2])).unwrap();
let table = conn
.create_table("test", batch.clone())
.execute()
.await
.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 3);
let new_batch = record_batch!(("i", Int32, [3])).unwrap();
table.add(new_batch).execute().await.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 4);
assert_eq!(table.schema().await.unwrap(), batch.schema());
}
#[tokio::test]
async fn test_add_overwrite() {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("i", Int32, [0, 1, 2])).unwrap();
let table = conn
.create_table("test", batch.clone())
.execute()
.await
.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), batch.num_rows());
let new_batch = record_batch!(("x", Float32, [0.0, 1.0])).unwrap();
let res = table
.add(new_batch.clone())
.mode(AddDataMode::Overwrite)
.execute()
.await
.unwrap();
assert_eq!(res.version, table.version().await.unwrap());
assert_eq!(table.count_rows(None).await.unwrap(), new_batch.num_rows());
assert_eq!(table.schema().await.unwrap(), new_batch.schema());
// Can overwrite using underlying WriteParams (which
// take precedence over AddDataMode)
let param: WriteParams = WriteParams {
mode: WriteMode::Overwrite,
..Default::default()
};
table
.add(new_batch.clone())
.write_options(WriteOptions {
lance_write_params: Some(param),
})
.mode(AddDataMode::Append)
.execute()
.await
.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), new_batch.num_rows());
}
#[tokio::test]
async fn test_add_with_embeddings() {
let registry = Arc::new(MemoryRegistry::new());
let mock_embedding: Arc<dyn EmbeddingFunction> = Arc::new(MockEmbed::new("mock", 4));
registry.register("mock", mock_embedding).unwrap();
let conn = connect("memory://")
.embedding_registry(registry)
.execute()
.await
.unwrap();
let schema = Arc::new(Schema::new(vec![
Field::new("text", DataType::Utf8, false),
Field::new(
"text_embedding",
DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), 4),
false,
),
]));
// Add embedding metadata to the schema
let embedding_def = EmbeddingDefinition::new("text", "mock", Some("text_embedding"));
let table_def = TableDefinition::new(
schema.clone(),
vec![
ColumnDefinition {
kind: ColumnKind::Physical,
},
ColumnDefinition {
kind: ColumnKind::Embedding(embedding_def),
},
],
);
let rich_schema = table_def.into_rich_schema();
let table = conn
.create_empty_table("embed_test", rich_schema)
.execute()
.await
.unwrap();
// Now add new data WITHOUT the embedding column - it should be computed automatically
let new_batch = record_batch!(("text", Utf8, ["hello", "world"])).unwrap();
table.add(new_batch).execute().await.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 2);
// Query to verify the embeddings were computed for the new rows
let results: Vec<RecordBatch> = table
.query()
.select(Select::columns(&["text", "text_embedding"]))
.execute()
.await
.unwrap()
.try_collect()
.await
.unwrap();
let total_rows: usize = results.iter().map(|b| b.num_rows()).sum();
assert_eq!(total_rows, 2);
// Check that all rows have embedding values (not null)
for batch in &results {
let embedding_col = batch.column(1);
assert_eq!(embedding_col.null_count(), 0);
}
}
}
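
`AddDataBuilder` in this file is a consuming builder: each setter takes `self` by value and returns it, and `execute` uses the builder up. The comment in `test_add_overwrite` also says that the underlying write parameters take precedence over `AddDataMode`. A rough std-only sketch of both ideas follows. `Mode`, `Options`, `AddBuilder`, and the in-memory `Vec<i64>` table are hypothetical, and the precedence rule is modelled only as that test comment describes it.

```rust
/// Hypothetical high-level write mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Mode {
    #[default]
    Append,
    Overwrite,
}

/// Hypothetical low-level options that may carry their own mode.
#[derive(Debug, Clone, Default)]
struct Options {
    low_level_mode: Option<Mode>,
}

struct AddBuilder {
    rows: Vec<i64>,
    mode: Mode,
    options: Options,
}

impl AddBuilder {
    fn new(rows: Vec<i64>) -> Self {
        Self { rows, mode: Mode::default(), options: Options::default() }
    }

    fn mode(mut self, mode: Mode) -> Self {
        self.mode = mode;
        self
    }

    fn options(mut self, options: Options) -> Self {
        self.options = options;
        self
    }

    /// Assumed rule: an explicit low-level mode wins over the builder's mode.
    fn effective_mode(&self) -> Mode {
        self.options.low_level_mode.unwrap_or(self.mode)
    }

    /// Apply the write to an in-memory table and return the new row count.
    fn execute(self, table: &mut Vec<i64>) -> usize {
        if self.effective_mode() == Mode::Overwrite {
            table.clear();
        }
        table.extend(self.rows);
        table.len()
    }
}
```

For example, `AddBuilder::new(vec![7]).options(Options { low_level_mode: Some(Mode::Overwrite) }).mode(Mode::Append)` still overwrites under this rule, which mirrors the second half of `test_add_overwrite`.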

View File

@@ -287,8 +287,7 @@ pub mod tests {
use arrow::array::AsArray;
use arrow_array::{
BinaryArray, Float64Array, Int32Array, Int64Array, RecordBatch, RecordBatchIterator,
RecordBatchReader, StringArray, UInt32Array,
BinaryArray, Float64Array, Int32Array, Int64Array, RecordBatch, StringArray, UInt32Array,
};
use arrow_schema::{DataType, Field, Schema};
use datafusion::{
@@ -308,7 +307,7 @@ pub mod tests {
table::datafusion::BaseTableAdapter,
};
fn make_test_batches() -> impl RecordBatchReader + Send + Sync + 'static {
fn make_test_batches() -> RecordBatch {
let metadata = HashMap::from_iter(vec![("foo".to_string(), "bar".to_string())]);
let schema = Arc::new(
Schema::new(vec![
@@ -317,19 +316,17 @@ pub mod tests {
])
.with_metadata(metadata),
);
RecordBatchIterator::new(
vec![RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from_iter_values(0..10)),
Arc::new(UInt32Array::from_iter_values(0..10)),
],
)],
RecordBatch::try_new(
schema,
vec![
Arc::new(Int32Array::from_iter_values(0..10)),
Arc::new(UInt32Array::from_iter_values(0..10)),
],
)
.unwrap()
}
fn make_tbl_two_test_batches() -> impl RecordBatchReader + Send + Sync + 'static {
fn make_tbl_two_test_batches() -> RecordBatch {
let metadata = HashMap::from_iter(vec![("foo".to_string(), "bar".to_string())]);
let schema = Arc::new(
Schema::new(vec![
@@ -342,28 +339,26 @@ pub mod tests {
])
.with_metadata(metadata),
);
RecordBatchIterator::new(
vec![RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int64Array::from_iter_values(0..1000)),
Arc::new(StringArray::from_iter_values(
(0..1000).map(|i| i.to_string()),
)),
Arc::new(Float64Array::from_iter_values((0..1000).map(|i| i as f64))),
Arc::new(StringArray::from_iter_values(
(0..1000).map(|i| format!("{{\"i\":{}}}", i)),
)),
Arc::new(BinaryArray::from_iter_values(
(0..1000).map(|i| (i as u32).to_be_bytes().to_vec()),
)),
Arc::new(StringArray::from_iter_values(
(0..1000).map(|i| i.to_string()),
)),
],
)],
RecordBatch::try_new(
schema,
vec![
Arc::new(Int64Array::from_iter_values(0..1000)),
Arc::new(StringArray::from_iter_values(
(0..1000).map(|i| i.to_string()),
)),
Arc::new(Float64Array::from_iter_values((0..1000).map(|i| i as f64))),
Arc::new(StringArray::from_iter_values(
(0..1000).map(|i| format!("{{\"i\":{}}}", i)),
)),
Arc::new(BinaryArray::from_iter_values(
(0..1000).map(|i| (i as u32).to_be_bytes().to_vec()),
)),
Arc::new(StringArray::from_iter_values(
(0..1000).map(|i| i.to_string()),
)),
],
)
.unwrap()
}
struct TestFixture {

View File

@@ -200,7 +200,7 @@ impl ExecutionPlan for InsertExec {
let new_dataset = CommitBuilder::new(dataset.clone())
.execute(merged_txn)
.await?;
ds_wrapper.set_latest(new_dataset).await;
ds_wrapper.update(new_dataset);
}
}
@@ -222,7 +222,7 @@ mod tests {
use std::vec;
use super::*;
use arrow_array::{record_batch, Int32Array, RecordBatchIterator};
use arrow_array::{record_batch, RecordBatchIterator};
use datafusion::prelude::SessionContext;
use datafusion_catalog::MemTable;
use tempfile::tempdir;
@@ -238,11 +238,8 @@ mod tests {
// Create initial table
let batch = record_batch!(("id", Int32, [1, 2, 3])).unwrap();
let schema = batch.schema();
let reader = RecordBatchIterator::new(vec![Ok(batch)], schema);
let table = db
.create_table("test_insert", Box::new(reader))
.create_table("test_insert", batch)
.execute()
.await
.unwrap();
@@ -279,11 +276,8 @@ mod tests {
// Create initial table with 3 rows
let batch = record_batch!(("id", Int32, [1, 2, 3])).unwrap();
let schema = batch.schema();
let reader = RecordBatchIterator::new(vec![Ok(batch)], schema);
let table = db
.create_table("test_overwrite", Box::new(reader))
.create_table("test_overwrite", batch)
.execute()
.await
.unwrap();
@@ -318,20 +312,9 @@ mod tests {
let db = connect(uri).execute().await.unwrap();
// Create initial table
let schema = Arc::new(ArrowSchema::new(vec![Field::new(
"id",
DataType::Int32,
false,
)]));
let batches = vec![RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap()];
let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
let batch = record_batch!(("id", Int32, [1, 2, 3])).unwrap();
let table = db
.create_table("test_empty", Box::new(reader))
.create_table("test_empty", batch)
.execute()
.await
.unwrap();
@@ -352,12 +335,13 @@ mod tests {
false,
)]));
// Empty batches
let source_reader = RecordBatchIterator::new(
std::iter::empty::<Result<RecordBatch, arrow_schema::ArrowError>>(),
source_schema,
);
let source_reader: Box<dyn arrow_array::RecordBatchReader + Send> =
Box::new(RecordBatchIterator::new(
std::iter::empty::<Result<RecordBatch, arrow_schema::ArrowError>>(),
source_schema,
));
let source_table = db
.create_table("empty_source", Box::new(source_reader))
.create_table("empty_source", source_reader)
.execute()
.await
.unwrap();
@@ -389,20 +373,10 @@ mod tests {
let db = connect(uri).execute().await.unwrap();
// Create initial table
let schema = Arc::new(ArrowSchema::new(vec![Field::new(
"id",
DataType::Int32,
true,
)]));
let batches =
vec![
RecordBatch::try_new(schema.clone(), vec![Arc::new(Int32Array::from(vec![1]))])
.unwrap(),
];
let reader = RecordBatchIterator::new(batches.into_iter().map(Ok), schema.clone());
let batch = record_batch!(("id", Int32, [1])).unwrap();
let schema = batch.schema();
let table = db
.create_table("test_multi_batch", Box::new(reader))
.create_table("test_multi_batch", batch)
.execute()
.await
.unwrap();

View File

@@ -97,7 +97,7 @@ mod tests {
table::datafusion::BaseTableAdapter,
Connection, Table,
};
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_array::{Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use datafusion::prelude::SessionContext;
@@ -173,14 +173,7 @@ mod tests {
// Create LanceDB database and table
let db = crate::connect("memory://test").execute().await.unwrap();
let table = db
.create_table(
"foo",
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table("foo", batch).execute().await.unwrap();
// Create FTS index
table
@@ -323,13 +316,7 @@ mod tests {
RecordBatch::try_new(metadata_schema.clone(), vec![metadata_col, extra_col]).unwrap();
let _metadata_table = db
.create_table(
"metadata",
RecordBatchIterator::new(
vec![Ok(metadata_batch.clone())].into_iter(),
metadata_schema.clone(),
),
)
.create_table("metadata", metadata_batch.clone())
.execute()
.await
.unwrap();
@@ -393,14 +380,7 @@ mod tests {
let batch =
RecordBatch::try_new(schema.clone(), vec![id_col, text_col, category_col]).unwrap();
let table = db
.create_table(
table_name,
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table(table_name, batch).execute().await.unwrap();
// Create FTS index
table
@@ -546,14 +526,7 @@ mod tests {
]));
let batch = RecordBatch::try_new(schema.clone(), vec![id_col, text_col]).unwrap();
let table = db
.create_table(
"docs",
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table("docs", batch).execute().await.unwrap();
// Create FTS index with position information for phrase queries
table
@@ -691,14 +664,7 @@ mod tests {
let batch =
RecordBatch::try_new(schema.clone(), vec![id_col, title_col, content_col]).unwrap();
let table = db
.create_table(
"multi_col",
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table("multi_col", batch).execute().await.unwrap();
// Create FTS indices on both columns
table
@@ -963,13 +929,7 @@ mod tests {
let metadata_batch =
RecordBatch::try_new(metadata_schema.clone(), vec![metadata_id, extra_info]).unwrap();
let _metadata_table = db
.create_table(
"metadata",
RecordBatchIterator::new(
vec![Ok(metadata_batch.clone())].into_iter(),
metadata_schema,
),
)
.create_table("metadata", metadata_batch.clone())
.execute()
.await
.unwrap();
@@ -1358,14 +1318,7 @@ mod tests {
]));
let batch = RecordBatch::try_new(schema.clone(), vec![id_col, text_col]).unwrap();
let table = db
.create_table(
"docs",
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table("docs", batch).execute().await.unwrap();
// Create FTS index with position information
table
@@ -1510,14 +1463,7 @@ mod tests {
let batch =
RecordBatch::try_new(schema.clone(), vec![id_col, title_col, content_col]).unwrap();
let table = db
.create_table(
"docs",
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table("docs", batch).execute().await.unwrap();
// Create FTS indices on both columns
table
@@ -1591,14 +1537,7 @@ mod tests {
let batch =
RecordBatch::try_new(schema.clone(), vec![id_col, title_col, content_col]).unwrap();
let table = db
.create_table(
"docs",
RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema),
)
.execute()
.await
.unwrap();
let table = db.create_table("docs", batch).execute().await.unwrap();
// Create FTS indices
table
@@ -1724,36 +1663,23 @@ mod tests {
.unwrap();
// Create table with simple text for n-gram testing
let data = RecordBatchIterator::new(
vec![RecordBatch::try_new(
Arc::new(ArrowSchema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("text", DataType::Utf8, false),
])),
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
Arc::new(StringArray::from(vec![
"hello world",
"lance database",
"lance is cool",
])),
],
)
.unwrap()]
.into_iter()
.map(Ok),
let data = RecordBatch::try_new(
Arc::new(ArrowSchema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("text", DataType::Utf8, false),
])),
);
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
Arc::new(StringArray::from(vec![
"hello world",
"lance database",
"lance is cool",
])),
],
)
.unwrap();
let table = Arc::new(
db.create_table("docs", Box::new(data))
.execute()
.await
.unwrap(),
);
let table = Arc::new(db.create_table("docs", data).execute().await.unwrap());
// Create FTS index with n-gram tokenizer (default min_ngram_length=3)
table
@@ -1876,43 +1802,29 @@ mod tests {
.unwrap();
// Create table with two text columns
let data = RecordBatchIterator::new(
vec![RecordBatch::try_new(
Arc::new(ArrowSchema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("title", DataType::Utf8, false),
Field::new("content", DataType::Utf8, false),
])),
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
Arc::new(StringArray::from(vec![
"Important Document",
"Another Document",
"Random Text",
])),
Arc::new(StringArray::from(vec![
"This is important information",
"This has details",
"Nothing special here",
])),
],
)
.unwrap()]
.into_iter()
.map(Ok),
let data = RecordBatch::try_new(
Arc::new(ArrowSchema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("title", DataType::Utf8, false),
Field::new("content", DataType::Utf8, false),
])),
);
vec![
Arc::new(Int32Array::from(vec![1, 2, 3])),
Arc::new(StringArray::from(vec![
"Important Document",
"Another Document",
"Random Text",
])),
Arc::new(StringArray::from(vec![
"This is important information",
"This has details",
"Nothing special here",
])),
],
)
.unwrap();
let table = Arc::new(
db.create_table("docs", Box::new(data))
.execute()
.await
.unwrap(),
);
let table = Arc::new(db.create_table("docs", data).execute().await.unwrap());
// Create FTS indices on both columns
table


@@ -2,321 +2,499 @@
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{
ops::{Deref, DerefMut},
sync::Arc,
time::{self, Duration, Instant},
sync::{Arc, Mutex},
time::Duration,
};
use lance::{dataset::refs, Dataset};
use tokio::sync::{RwLock, RwLockReadGuard, RwLockWriteGuard};
use crate::error::Result;
/// A wrapper around a [Dataset] that provides lazy-loading and consistency checks.
///
/// This can be cloned cheaply. It supports concurrent reads or exclusive writes.
#[derive(Debug, Clone)]
pub struct DatasetConsistencyWrapper(Arc<RwLock<DatasetRef>>);
use crate::{error::Result, utils::background_cache::BackgroundCache, Error};
/// A wrapper around a [Dataset] that provides consistency checks.
///
/// The dataset is lazily loaded, and starts off as None. On the first access,
/// the dataset is loaded.
/// This can be cloned cheaply. Callers get an [`Arc<Dataset>`] from [`get()`](Self::get)
/// and call [`update()`](Self::update) after writes to store the new version.
#[derive(Debug, Clone)]
enum DatasetRef {
/// In this mode, the dataset is always the latest version.
Latest {
dataset: Dataset,
read_consistency_interval: Option<Duration>,
last_consistency_check: Option<time::Instant>,
},
/// In this mode, the dataset is a specific version. It cannot be mutated.
TimeTravel { dataset: Dataset, version: u64 },
pub struct DatasetConsistencyWrapper {
state: Arc<Mutex<DatasetState>>,
consistency: ConsistencyMode,
}
impl DatasetRef {
/// Reload the dataset to the appropriate version.
async fn reload(&mut self) -> Result<()> {
match self {
Self::Latest {
dataset,
last_consistency_check,
..
} => {
dataset.checkout_latest().await?;
last_consistency_check.replace(Instant::now());
}
Self::TimeTravel { dataset, version } => {
dataset.checkout_version(*version).await?;
}
}
Ok(())
}
/// The current dataset and whether it is pinned to a specific version.
#[derive(Debug, Clone)]
struct DatasetState {
dataset: Arc<Dataset>,
/// `Some(version)` = pinned to a specific version (time travel),
/// `None` = tracking latest.
pinned_version: Option<u64>,
}
fn is_latest(&self) -> bool {
matches!(self, Self::Latest { .. })
}
async fn need_reload(&self) -> Result<bool> {
Ok(match self {
Self::Latest { dataset, .. } => {
dataset.latest_version_id().await? != dataset.version().version
}
Self::TimeTravel { dataset, version } => dataset.version().version != *version,
})
}
async fn as_latest(&mut self, read_consistency_interval: Option<Duration>) -> Result<()> {
match self {
Self::Latest { .. } => Ok(()),
Self::TimeTravel { dataset, .. } => {
dataset
.checkout_version(dataset.latest_version_id().await?)
.await?;
*self = Self::Latest {
dataset: dataset.clone(),
read_consistency_interval,
last_consistency_check: Some(Instant::now()),
};
Ok(())
}
}
}
async fn as_time_travel(&mut self, target_version: impl Into<refs::Ref>) -> Result<()> {
let target_ref = target_version.into();
match self {
Self::Latest { dataset, .. } => {
let new_dataset = dataset.checkout_version(target_ref.clone()).await?;
let version_value = new_dataset.version().version;
*self = Self::TimeTravel {
dataset: new_dataset,
version: version_value,
};
}
Self::TimeTravel { dataset, version } => {
let should_checkout = match &target_ref {
refs::Ref::Version(_, Some(target_ver)) => version != target_ver,
refs::Ref::Version(_, None) => true, // No specific version, always checkout
refs::Ref::VersionNumber(target_ver) => version != target_ver,
refs::Ref::Tag(_) => true, // Always checkout for tags
};
if should_checkout {
let new_dataset = dataset.checkout_version(target_ref).await?;
let version_value = new_dataset.version().version;
*self = Self::TimeTravel {
dataset: new_dataset,
version: version_value,
};
}
}
}
Ok(())
}
fn time_travel_version(&self) -> Option<u64> {
match self {
Self::Latest { .. } => None,
Self::TimeTravel { version, .. } => Some(*version),
}
}
fn set_latest(&mut self, dataset: Dataset) {
match self {
Self::Latest {
dataset: ref mut ds,
..
} => {
if dataset.manifest().version > ds.manifest().version {
*ds = dataset;
}
}
_ => unreachable!("Dataset should be in latest mode at this point"),
}
}
#[derive(Debug, Clone)]
enum ConsistencyMode {
/// Only update table state when explicitly asked.
Lazy,
/// Always check for a new version on every read.
Strong,
/// Periodically check for new version in the background. If the table is being
/// regularly accessed, refresh will happen in the background. If the table is idle for a while,
/// the next access will trigger a refresh before returning the dataset.
///
/// read_consistency_interval = TTL
/// refresh_window = min(3s, TTL/4)
///
/// | t < TTL - refresh_window | t < TTL | t >= TTL |
/// | Return value | Background refresh & return value | synchronous refresh |
Eventual(BackgroundCache<Arc<Dataset>, Error>),
}
impl DatasetConsistencyWrapper {
/// Create a new wrapper in the latest version mode.
pub fn new_latest(dataset: Dataset, read_consistency_interval: Option<Duration>) -> Self {
Self(Arc::new(RwLock::new(DatasetRef::Latest {
dataset,
read_consistency_interval,
last_consistency_check: Some(Instant::now()),
})))
let dataset = Arc::new(dataset);
let consistency = match read_consistency_interval {
Some(d) if d == Duration::ZERO => ConsistencyMode::Strong,
Some(d) => {
let refresh_window = std::cmp::min(std::time::Duration::from_secs(3), d / 4);
let cache = BackgroundCache::new(d, refresh_window);
cache.seed(dataset.clone());
ConsistencyMode::Eventual(cache)
}
None => ConsistencyMode::Lazy,
};
Self {
state: Arc::new(Mutex::new(DatasetState {
dataset,
pinned_version: None,
})),
consistency,
}
}
/// Get an immutable reference to the dataset.
pub async fn get(&self) -> Result<DatasetReadGuard<'_>> {
self.ensure_up_to_date().await?;
Ok(DatasetReadGuard {
guard: self.0.read().await,
})
}
/// Get a mutable reference to the dataset.
/// Get the current dataset.
///
/// If the dataset is in time travel mode this will fail
pub async fn get_mut(&self) -> Result<DatasetWriteGuard<'_>> {
self.ensure_mutable().await?;
self.ensure_up_to_date().await?;
Ok(DatasetWriteGuard {
guard: self.0.write().await,
})
}
/// Get a mutable reference to the dataset without requiring the
/// dataset to be in a Latest mode.
pub async fn get_mut_unchecked(&self) -> Result<DatasetWriteGuard<'_>> {
self.ensure_up_to_date().await?;
Ok(DatasetWriteGuard {
guard: self.0.write().await,
})
}
/// Convert into a wrapper in latest version mode
pub async fn as_latest(&self, read_consistency_interval: Option<Duration>) -> Result<()> {
if self.0.read().await.is_latest() {
return Ok(());
}
let mut write_guard = self.0.write().await;
if write_guard.is_latest() {
return Ok(());
}
write_guard.as_latest(read_consistency_interval).await
}
pub async fn as_time_travel(&self, target_version: impl Into<refs::Ref>) -> Result<()> {
self.0.write().await.as_time_travel(target_version).await
}
/// Provide a known latest version of the dataset.
/// Behavior depends on the consistency mode:
/// - **Lazy** (`None`): returns the cached dataset immediately.
/// - **Strong** (`Some(ZERO)`): checks for a new version before returning.
/// - **Eventual** (`Some(d)` where `d > 0`): returns a cached value immediately
/// while refreshing in the background when the TTL expires.
///
/// This is usually done after some write operation, which inherently will
/// have the latest version.
pub async fn set_latest(&self, dataset: Dataset) {
self.0.write().await.set_latest(dataset);
}
pub async fn reload(&self) -> Result<()> {
if !self.0.read().await.need_reload().await? {
return Ok(());
/// If pinned to a specific version (time travel), always returns the
/// pinned dataset regardless of consistency mode.
pub async fn get(&self) -> Result<Arc<Dataset>> {
{
let state = self.state.lock().unwrap();
if state.pinned_version.is_some() {
return Ok(state.dataset.clone());
}
}
let mut write_guard = self.0.write().await;
// on lock escalation -- check if someone else has already reloaded
if !write_guard.need_reload().await? {
return Ok(());
}
// actually need reloading
write_guard.reload().await
}
/// Returns the version, if in time travel mode, or None otherwise
pub async fn time_travel_version(&self) -> Option<u64> {
self.0.read().await.time_travel_version()
}
pub async fn ensure_mutable(&self) -> Result<()> {
let dataset_ref = self.0.read().await;
match &*dataset_ref {
DatasetRef::Latest { .. } => Ok(()),
DatasetRef::TimeTravel { .. } => Err(crate::Error::InvalidInput {
message: "table cannot be modified when a specific version is checked out"
.to_string(),
}),
}
}
async fn is_up_to_date(&self) -> Result<bool> {
let dataset_ref = self.0.read().await;
match &*dataset_ref {
DatasetRef::Latest {
read_consistency_interval,
last_consistency_check,
..
} => match (read_consistency_interval, last_consistency_check) {
(None, _) => Ok(true),
(Some(_), None) => Ok(false),
(Some(read_consistency_interval), Some(last_consistency_check)) => {
if &last_consistency_check.elapsed() < read_consistency_interval {
Ok(true)
} else {
Ok(false)
}
match &self.consistency {
ConsistencyMode::Eventual(bg_cache) => {
if let Some(dataset) = bg_cache.try_get() {
return Ok(dataset);
}
},
DatasetRef::TimeTravel { dataset, version } => {
Ok(dataset.version().version == *version)
let state = self.state.clone();
bg_cache
.get(move || refresh_latest(state))
.await
.map_err(unwrap_shared_error)
}
ConsistencyMode::Strong => refresh_latest(self.state.clone()).await,
ConsistencyMode::Lazy => {
let state = self.state.lock().unwrap();
Ok(state.dataset.clone())
}
}
}
/// Ensures that the dataset is loaded and up-to-date with consistency and
/// version parameters.
async fn ensure_up_to_date(&self) -> Result<()> {
if !self.is_up_to_date().await? {
self.reload().await?;
/// Store a new dataset version after a write operation.
///
/// Only stores the dataset if its version is newer than the current one.
/// If the wrapper has since transitioned to time-travel mode (e.g. via a
/// concurrent [`as_time_travel`](Self::as_time_travel) call), the update
/// is silently ignored — the write already committed to storage.
pub fn update(&self, dataset: Dataset) {
let mut state = self.state.lock().unwrap();
if state.pinned_version.is_some() {
// A concurrent as_time_travel() beat us here. The write succeeded
// in storage, but since we're now pinned we don't advance the
// cached pointer.
return;
}
if dataset.manifest().version > state.dataset.manifest().version {
state.dataset = Arc::new(dataset);
}
drop(state);
if let ConsistencyMode::Eventual(bg_cache) = &self.consistency {
bg_cache.invalidate();
}
}
/// Checkout a branch and track its HEAD for new versions.
pub async fn as_branch(&self, _branch: impl Into<String>) -> Result<()> {
todo!("Branch support not yet implemented")
}
/// Check that the dataset is in a mutable mode (Latest).
pub fn ensure_mutable(&self) -> Result<()> {
let state = self.state.lock().unwrap();
if state.pinned_version.is_some() {
Err(crate::Error::InvalidInput {
message: "table cannot be modified when a specific version is checked out"
.to_string(),
})
} else {
Ok(())
}
}
/// Returns the version, if in time travel mode, or None otherwise.
pub fn time_travel_version(&self) -> Option<u64> {
self.state.lock().unwrap().pinned_version
}
/// Convert into a wrapper in latest version mode.
pub async fn as_latest(&self) -> Result<()> {
let dataset = {
let state = self.state.lock().unwrap();
if state.pinned_version.is_none() {
return Ok(());
}
state.dataset.clone()
};
let latest_version = dataset.latest_version_id().await?;
let new_dataset = dataset.checkout_version(latest_version).await?;
let mut state = self.state.lock().unwrap();
if state.pinned_version.is_some() {
state.dataset = Arc::new(new_dataset);
state.pinned_version = None;
}
drop(state);
if let ConsistencyMode::Eventual(bg_cache) = &self.consistency {
bg_cache.invalidate();
}
Ok(())
}
pub async fn as_time_travel(&self, target_version: impl Into<refs::Ref>) -> Result<()> {
let target_ref = target_version.into();
let (should_checkout, dataset) = {
let state = self.state.lock().unwrap();
let should = match state.pinned_version {
None => true,
Some(version) => match &target_ref {
refs::Ref::Version(_, Some(target_ver)) => version != *target_ver,
refs::Ref::Version(_, None) => true,
refs::Ref::VersionNumber(target_ver) => version != *target_ver,
refs::Ref::Tag(_) => true,
},
};
(should, state.dataset.clone())
};
if !should_checkout {
return Ok(());
}
let new_dataset = dataset.checkout_version(target_ref).await?;
let version_value = new_dataset.version().version;
let mut state = self.state.lock().unwrap();
state.dataset = Arc::new(new_dataset);
state.pinned_version = Some(version_value);
Ok(())
}
pub async fn reload(&self) -> Result<()> {
let (dataset, pinned_version) = {
let state = self.state.lock().unwrap();
(state.dataset.clone(), state.pinned_version)
};
match pinned_version {
None => {
refresh_latest(self.state.clone()).await?;
if let ConsistencyMode::Eventual(bg_cache) = &self.consistency {
bg_cache.invalidate();
}
}
Some(version) => {
if dataset.version().version == version {
return Ok(());
}
let new_dataset = dataset.checkout_version(version).await?;
let mut state = self.state.lock().unwrap();
if state.pinned_version == Some(version) {
state.dataset = Arc::new(new_dataset);
}
}
}
Ok(())
}
}
pub struct DatasetReadGuard<'a> {
guard: RwLockReadGuard<'a, DatasetRef>,
}
async fn refresh_latest(state: Arc<Mutex<DatasetState>>) -> Result<Arc<Dataset>> {
let dataset = { state.lock().unwrap().dataset.clone() };
impl Deref for DatasetReadGuard<'_> {
type Target = Dataset;
let mut ds = (*dataset).clone();
ds.checkout_latest().await?;
let new_arc = Arc::new(ds);
fn deref(&self) -> &Self::Target {
match &*self.guard {
DatasetRef::Latest { dataset, .. } => dataset,
DatasetRef::TimeTravel { dataset, .. } => dataset,
{
let mut state = state.lock().unwrap();
if state.pinned_version.is_none()
&& new_arc.manifest().version >= state.dataset.manifest().version
{
state.dataset = new_arc.clone();
}
}
Ok(new_arc)
}
pub struct DatasetWriteGuard<'a> {
guard: RwLockWriteGuard<'a, DatasetRef>,
}
impl Deref for DatasetWriteGuard<'_> {
type Target = Dataset;
fn deref(&self) -> &Self::Target {
match &*self.guard {
DatasetRef::Latest { dataset, .. } => dataset,
DatasetRef::TimeTravel { dataset, .. } => dataset,
}
}
}
impl DerefMut for DatasetWriteGuard<'_> {
fn deref_mut(&mut self) -> &mut Self::Target {
match &mut *self.guard {
DatasetRef::Latest { dataset, .. } => dataset,
DatasetRef::TimeTravel { dataset, .. } => dataset,
}
fn unwrap_shared_error(arc: Arc<Error>) -> Error {
match Arc::try_unwrap(arc) {
Ok(err) => err,
Err(arc) => Error::Runtime {
message: arc.to_string(),
},
}
}
#[cfg(test)]
mod tests {
use std::time::Instant;
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator};
use arrow_schema::{DataType, Field, Schema};
use lance::{dataset::WriteParams, io::ObjectStoreParams};
use lance::{
dataset::{WriteMode, WriteParams},
io::ObjectStoreParams,
};
use super::*;
use crate::{connect, io::object_store::io_tracking::IoStatsHolder, table::WriteOptions};
async fn create_test_dataset(uri: &str) -> Dataset {
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
)
.unwrap();
Dataset::write(
RecordBatchIterator::new(vec![Ok(batch)], schema),
uri,
Some(WriteParams::default()),
)
.await
.unwrap()
}
async fn append_to_dataset(uri: &str) -> Dataset {
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from(vec![4, 5, 6]))],
)
.unwrap();
Dataset::write(
RecordBatchIterator::new(vec![Ok(batch)], schema),
uri,
Some(WriteParams {
mode: WriteMode::Append,
..Default::default()
}),
)
.await
.unwrap()
}
#[tokio::test]
async fn test_get_returns_dataset() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let version = ds.version().version;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, None);
let ds1 = wrapper.get().await.unwrap();
let ds2 = wrapper.get().await.unwrap();
assert_eq!(ds1.version().version, version);
assert_eq!(ds2.version().version, version);
// Arc<Dataset> is independent — not borrowing from wrapper
drop(wrapper);
assert_eq!(ds1.version().version, version);
}
#[tokio::test]
async fn test_update_stores_newer_version() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds_v1 = create_test_dataset(uri).await;
assert_eq!(ds_v1.version().version, 1);
let wrapper = DatasetConsistencyWrapper::new_latest(ds_v1, None);
let ds_v2 = append_to_dataset(uri).await;
assert_eq!(ds_v2.version().version, 2);
wrapper.update(ds_v2);
let ds = wrapper.get().await.unwrap();
assert_eq!(ds.version().version, 2);
}
#[tokio::test]
async fn test_update_ignores_older_version() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds_v1 = create_test_dataset(uri).await;
let ds_v2 = append_to_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds_v2, None);
wrapper.update(ds_v1);
let ds = wrapper.get().await.unwrap();
assert_eq!(ds.version().version, 2);
}
#[tokio::test]
async fn test_ensure_mutable_allows_latest() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, None);
assert!(wrapper.ensure_mutable().is_ok());
}
#[tokio::test]
async fn test_ensure_mutable_rejects_time_travel() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, None);
wrapper.as_time_travel(1u64).await.unwrap();
assert!(wrapper.ensure_mutable().is_err());
}
#[tokio::test]
async fn test_time_travel_version() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, None);
assert_eq!(wrapper.time_travel_version(), None);
wrapper.as_time_travel(1u64).await.unwrap();
assert_eq!(wrapper.time_travel_version(), Some(1));
}
#[tokio::test]
async fn test_as_latest_from_time_travel() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, None);
wrapper.as_time_travel(1u64).await.unwrap();
assert!(wrapper.ensure_mutable().is_err());
wrapper.as_latest().await.unwrap();
assert!(wrapper.ensure_mutable().is_ok());
assert_eq!(wrapper.time_travel_version(), None);
}
#[tokio::test]
async fn test_lazy_consistency_never_refreshes() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, None);
let v1 = wrapper.get().await.unwrap().version().version;
// External write
append_to_dataset(uri).await;
// Lazy consistency should not pick up external write
let v_after = wrapper.get().await.unwrap().version().version;
assert_eq!(v1, v_after);
}
#[tokio::test]
async fn test_strong_consistency_always_refreshes() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, Some(Duration::ZERO));
let v1 = wrapper.get().await.unwrap().version().version;
// External write
append_to_dataset(uri).await;
// Strong consistency should pick up external write
let v_after = wrapper.get().await.unwrap().version().version;
assert_eq!(v_after, v1 + 1);
}
#[tokio::test]
async fn test_eventual_consistency_background_refresh() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds, Some(Duration::from_millis(200)));
// Populate the cache
let v1 = wrapper.get().await.unwrap().version().version;
assert_eq!(v1, 1);
// External write
append_to_dataset(uri).await;
// Should return cached value immediately (within TTL)
let v_cached = wrapper.get().await.unwrap().version().version;
assert_eq!(v_cached, 1);
// Wait for TTL to expire, then get() should trigger a refresh
tokio::time::sleep(Duration::from_millis(300)).await;
let v_after = wrapper.get().await.unwrap().version().version;
assert_eq!(v_after, 2);
}
#[tokio::test]
async fn test_eventual_consistency_update_invalidates_cache() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds_v1 = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds_v1, Some(Duration::from_secs(60)));
// Simulate a write that produces v2
let ds_v2 = append_to_dataset(uri).await;
wrapper.update(ds_v2);
// get() should return v2 immediately (update invalidated the bg_cache,
// and the mutex state was updated)
let v = wrapper.get().await.unwrap().version().version;
assert_eq!(v, 2);
}
#[tokio::test]
async fn test_iops_open_strong_consistency() {
let db = connect("memory://")
@@ -332,7 +510,7 @@ mod tests {
.create_empty_table("test", schema)
.write_options(WriteOptions {
lance_write_params: Some(WriteParams {
store_params: Some(ObjectStoreParams {
store_params: Some(lance::io::ObjectStoreParams {
object_store_wrapper: Some(Arc::new(io_stats.clone())),
..Default::default()
}),
@@ -351,4 +529,85 @@ mod tests {
let stats = io_stats.incremental_stats();
assert_eq!(stats.read_iops, 1);
}
/// Regression test: a write that races with as_time_travel() must not panic.
///
/// Sequence: ensure_mutable() passes → as_time_travel() completes → write
/// calls update(). Previously the assert!() in update() would fire.
#[tokio::test]
async fn test_update_after_concurrent_time_travel_does_not_panic() {
let dir = tempfile::tempdir().unwrap();
let uri = dir.path().to_str().unwrap();
let ds_v1 = create_test_dataset(uri).await;
let wrapper = DatasetConsistencyWrapper::new_latest(ds_v1, None);
// Simulate: as_time_travel() completes just before the write's update().
wrapper.as_time_travel(1u64).await.unwrap();
assert_eq!(wrapper.time_travel_version(), Some(1));
// The write already committed to storage; now it calls update().
// This must not panic, and the wrapper must stay pinned.
let ds_v2 = append_to_dataset(uri).await;
wrapper.update(ds_v2);
let ds = wrapper.get().await.unwrap();
assert_eq!(ds.version().version, 1);
}
/// Regression test: before the fix, the reload fast-path (no version change)
/// did not reset `last_consistency_check`, causing a list call on every
/// subsequent query once the interval expired.
#[tokio::test]
async fn test_reload_resets_consistency_timer() {
let db = connect("memory://")
.read_consistency_interval(Duration::from_secs(1))
.execute()
.await
.unwrap();
let io_stats = IoStatsHolder::default();
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let table = db
.create_empty_table("test", schema)
.write_options(WriteOptions {
lance_write_params: Some(WriteParams {
store_params: Some(ObjectStoreParams {
object_store_wrapper: Some(Arc::new(io_stats.clone())),
..Default::default()
}),
..Default::default()
}),
})
.execute()
.await
.unwrap();
let start = Instant::now();
io_stats.incremental_stats(); // reset
// Step 1: within interval — no list
table.schema().await.unwrap();
let s = io_stats.incremental_stats();
assert_eq!(s.read_iops, 0, "step 1, elapsed={:?}", start.elapsed());
// Step 2: still within interval — no list
table.schema().await.unwrap();
let s = io_stats.incremental_stats();
assert_eq!(s.read_iops, 0, "step 2, elapsed={:?}", start.elapsed());
// Step 3: sleep past the 1s boundary
tokio::time::sleep(Duration::from_secs(1)).await;
// Step 4: interval expired — exactly 1 list, timer resets
table.schema().await.unwrap();
let s = io_stats.incremental_stats();
assert_eq!(s.read_iops, 1, "step 4, elapsed={:?}", start.elapsed());
// Step 5: 10 more calls — timer just reset, no lists (THIS is the regression test).
for _ in 0..10 {
table.schema().await.unwrap();
}
let s = io_stats.incremental_stats();
assert_eq!(s.read_iops, 0, "step 5, elapsed={:?}", start.elapsed());
}
}
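
The mode selection in `new_latest` and the three-zone table in the `ConsistencyMode::Eventual` doc comment can be sketched in isolation. This is only an illustration under stated assumptions: `Mode`, `ReadAction`, `mode_for`, and `action_for` are hypothetical names that do not exist in the diff, and the diff does not show how `BackgroundCache` actually decides when to refresh, so the zone logic below follows the doc comment rather than the real implementation.

```rust
use std::time::Duration;

/// Hypothetical mirror of the private `ConsistencyMode` enum, minus the cache payload.
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    Lazy,
    Strong,
    Eventual { ttl: Duration, refresh_window: Duration },
}

/// What a read is expected to do in Eventual mode, per the doc-comment table.
#[derive(Debug, PartialEq, Eq)]
enum ReadAction {
    ReturnCached,
    ReturnCachedAndRefreshInBackground,
    RefreshSynchronously,
}

/// Same mapping as the `match` in `new_latest`:
/// `None` is Lazy, a zero interval is Strong, anything else is Eventual.
fn mode_for(read_consistency_interval: Option<Duration>) -> Mode {
    match read_consistency_interval {
        None => Mode::Lazy,
        Some(d) if d == Duration::ZERO => Mode::Strong,
        Some(ttl) => Mode::Eventual {
            ttl,
            // refresh_window = min(3s, TTL / 4), as in the diff.
            refresh_window: std::cmp::min(Duration::from_secs(3), ttl / 4),
        },
    }
}

/// Assumed zone logic; `age` is the time since the cached value was stored.
fn action_for(age: Duration, ttl: Duration, refresh_window: Duration) -> ReadAction {
    if age >= ttl {
        ReadAction::RefreshSynchronously
    } else if age >= ttl.saturating_sub(refresh_window) {
        ReadAction::ReturnCachedAndRefreshInBackground
    } else {
        ReadAction::ReturnCached
    }
}
```

With a 60 s interval the refresh window is capped at 3 s, while the 200 ms interval used in `test_eventual_consistency_background_refresh` gives a 50 ms window.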


@@ -18,23 +18,18 @@ pub struct DeleteResult {
///
/// This logic was moved from NativeTable::delete to keep table.rs clean.
pub(crate) async fn execute_delete(table: &NativeTable, predicate: &str) -> Result<DeleteResult> {
// We access the dataset from the table. Since this is in the same module hierarchy (super),
// and 'dataset' is pub(crate), we can access it.
let mut dataset = table.dataset.get_mut().await?;
// Perform the actual delete on the Lance dataset
table.dataset.ensure_mutable()?;
let mut dataset = (*table.dataset.get().await?).clone();
dataset.delete(predicate).await?;
// Return the result with the new version
Ok(DeleteResult {
version: dataset.version().version,
})
let version = dataset.version().version;
table.dataset.update(dataset);
Ok(DeleteResult { version })
}
#[cfg(test)]
mod tests {
use crate::connect;
use arrow_array::{record_batch, Int32Array, RecordBatch, RecordBatchIterator};
use arrow_array::{record_batch, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;
@@ -53,10 +48,7 @@ mod tests {
.unwrap();
let table = conn
.create_table(
"test_delete",
RecordBatchIterator::new(vec![Ok(batch)], schema),
)
.create_table("test_delete", batch)
.execute()
.await
.unwrap();
@@ -102,10 +94,7 @@ mod tests {
let original_schema = batch.schema();
let table = conn
.create_table(
"test_delete_all",
RecordBatchIterator::new(vec![Ok(batch)], original_schema.clone()),
)
.create_table("test_delete_all", batch)
.execute()
.await
.unwrap();
@@ -126,13 +115,8 @@ mod tests {
// Create a table with 5 rows
let batch = record_batch!(("id", Int32, [1, 2, 3, 4, 5])).unwrap();
let schema = batch.schema();
let table = conn
.create_table(
"test_delete_noop",
RecordBatchIterator::new(vec![Ok(batch)], schema),
)
.create_table("test_delete_noop", batch)
.execute()
.await
.unwrap();


@@ -1,13 +1,45 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{sync::Arc, time::Duration};
use std::sync::Arc;
use std::time::Duration;
use arrow_array::RecordBatchReader;
use futures::future::Either;
use futures::{FutureExt, TryFutureExt};
use lance::dataset::{
MergeInsertBuilder as LanceMergeInsertBuilder, WhenMatched, WhenNotMatchedBySource,
};
use serde::{Deserialize, Serialize};
use crate::Result;
use crate::error::{Error, Result};
use super::{BaseTable, MergeResult};
use super::{BaseTable, NativeTable};
#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize, Default)]
pub struct MergeResult {
/// The commit version associated with the operation.
/// A version of `0` indicates compatibility with legacy servers that do not return
/// a commit version.
#[serde(default)]
pub version: u64,
/// Number of inserted rows (for user statistics)
#[serde(default)]
pub num_inserted_rows: u64,
/// Number of updated rows (for user statistics)
#[serde(default)]
pub num_updated_rows: u64,
/// Number of deleted rows (for user statistics)
/// Note: This differs from internal references to 'deleted_rows', since we technically "delete" updated rows during processing.
/// However, those rows are not reported to the user.
#[serde(default)]
pub num_deleted_rows: u64,
/// Number of attempts performed during the merge operation.
/// This includes the initial attempt plus any retries due to transaction conflicts.
/// A value of 1 means the operation succeeded on the first try.
#[serde(default)]
pub num_attempts: u32,
}
/// A builder used to create and run a merge insert operation
///
@@ -124,3 +156,172 @@ impl MergeInsertBuilder {
self.table.clone().merge_insert(self, new_data).await
}
}
/// Internal implementation of the merge insert logic
///
/// This logic was moved from NativeTable::merge_insert to keep table.rs clean.
pub(crate) async fn execute_merge_insert(
table: &NativeTable,
params: MergeInsertBuilder,
new_data: Box<dyn RecordBatchReader + Send>,
) -> Result<MergeResult> {
let dataset = table.dataset.get().await?;
let mut builder = LanceMergeInsertBuilder::try_new(dataset.clone(), params.on)?;
match (
params.when_matched_update_all,
params.when_matched_update_all_filt,
) {
(false, _) => builder.when_matched(WhenMatched::DoNothing),
(true, None) => builder.when_matched(WhenMatched::UpdateAll),
(true, Some(filt)) => builder.when_matched(WhenMatched::update_if(&dataset, &filt)?),
};
if params.when_not_matched_insert_all {
builder.when_not_matched(lance::dataset::WhenNotMatched::InsertAll);
} else {
builder.when_not_matched(lance::dataset::WhenNotMatched::DoNothing);
}
if params.when_not_matched_by_source_delete {
let behavior = if let Some(filter) = params.when_not_matched_by_source_delete_filt {
WhenNotMatchedBySource::delete_if(dataset.as_ref(), &filter)?
} else {
WhenNotMatchedBySource::Delete
};
builder.when_not_matched_by_source(behavior);
} else {
builder.when_not_matched_by_source(WhenNotMatchedBySource::Keep);
}
builder.use_index(params.use_index);
let future = if let Some(timeout) = params.timeout {
let future = builder
.retry_timeout(timeout)
.try_build()?
.execute_reader(new_data);
Either::Left(tokio::time::timeout(timeout, future).map(|res| match res {
Ok(Ok((new_dataset, stats))) => Ok((new_dataset, stats)),
Ok(Err(e)) => Err(e.into()),
Err(_) => Err(Error::Runtime {
message: "merge insert timed out".to_string(),
}),
}))
} else {
let job = builder.try_build()?;
Either::Right(job.execute_reader(new_data).map_err(|e| e.into()))
};
let (new_dataset, stats) = future.await?;
let version = new_dataset.manifest().version;
table.dataset.update(new_dataset.as_ref().clone());
Ok(MergeResult {
version,
num_updated_rows: stats.num_updated_rows,
num_inserted_rows: stats.num_inserted_rows,
num_deleted_rows: stats.num_deleted_rows,
num_attempts: stats.num_attempts,
})
}
#[cfg(test)]
mod tests {
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, RecordBatchReader};
use arrow_schema::{DataType, Field, Schema};
use std::sync::Arc;
use crate::connect;
fn merge_insert_test_batches(offset: i32, age: i32) -> Box<dyn RecordBatchReader + Send> {
let schema = Arc::new(Schema::new(vec![
Field::new("i", DataType::Int32, false),
Field::new("age", DataType::Int32, false),
]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from_iter_values(offset..(offset + 10))),
Arc::new(Int32Array::from_iter_values(std::iter::repeat_n(age, 10))),
],
)
.unwrap();
Box::new(RecordBatchIterator::new(vec![Ok(batch)], schema))
}
#[tokio::test]
async fn test_merge_insert() {
let conn = connect("memory://").execute().await.unwrap();
// Create a dataset with i=0..10
let batches = merge_insert_test_batches(0, 0);
let table = conn
.create_table("my_table", batches)
.execute()
.await
.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 10);
// Create new data with i=5..15
let new_batches = merge_insert_test_batches(5, 1);
// Perform an "insert if not exists"
let mut merge_insert_builder = table.merge_insert(&["i"]);
merge_insert_builder.when_not_matched_insert_all();
let result = merge_insert_builder.execute(new_batches).await.unwrap();
// Only 5 rows should actually be inserted
assert_eq!(table.count_rows(None).await.unwrap(), 15);
assert_eq!(result.num_inserted_rows, 5);
assert_eq!(result.num_updated_rows, 0);
assert_eq!(result.num_deleted_rows, 0);
assert_eq!(result.num_attempts, 1);
// Create new data with i=15..25 (no id matches)
let new_batches = merge_insert_test_batches(15, 2);
// Perform a "bulk update" (should not affect anything)
let mut merge_insert_builder = table.merge_insert(&["i"]);
merge_insert_builder.when_matched_update_all(None);
merge_insert_builder.execute(new_batches).await.unwrap();
// No new rows should have been inserted
assert_eq!(table.count_rows(None).await.unwrap(), 15);
assert_eq!(
table.count_rows(Some("age = 2".to_string())).await.unwrap(),
0
);
// Conditional update that only replaces the age=0 data
let new_batches = merge_insert_test_batches(5, 3);
let mut merge_insert_builder = table.merge_insert(&["i"]);
merge_insert_builder.when_matched_update_all(Some("target.age = 0".to_string()));
merge_insert_builder.execute(new_batches).await.unwrap();
assert_eq!(
table.count_rows(Some("age = 3".to_string())).await.unwrap(),
5
);
}
#[tokio::test]
async fn test_merge_insert_use_index() {
let conn = connect("memory://").execute().await.unwrap();
// Create a dataset with i=0..10
let batches = merge_insert_test_batches(0, 0);
let table = conn
.create_table("my_table", batches)
.execute()
.await
.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 10);
// Test use_index=true (default behavior)
let new_batches = merge_insert_test_batches(5, 1);
let mut merge_insert_builder = table.merge_insert(&["i"]);
merge_insert_builder.when_not_matched_insert_all();
merge_insert_builder.use_index(true);
merge_insert_builder.execute(new_batches).await.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 15);
// Test use_index=false (force table scan)
let new_batches = merge_insert_test_batches(15, 2);
let mut merge_insert_builder = table.merge_insert(&["i"]);
merge_insert_builder.when_not_matched_insert_all();
merge_insert_builder.use_index(false);
merge_insert_builder.execute(new_batches).await.unwrap();
assert_eq!(table.count_rows(None).await.unwrap(), 25);
}
}
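
The counts that `test_merge_insert` asserts can be reproduced with a small in-memory model. The sketch below is a toy stand-in, not the Lance implementation: `merge_insert`, `WhenMatched::UpdateIfTargetAge`, `MergeStats`, and `rows` are hypothetical names, the table is a `BTreeMap` from `i` to `age`, and the filter is simplified to an equality check on the target's `age`.

```rust
use std::collections::BTreeMap;

/// What to do with a source row whose key already exists in the target.
/// (Hypothetical enum for this sketch; it only mirrors the shape of the real options.)
#[derive(Clone, Copy)]
enum WhenMatched {
    DoNothing,
    UpdateAll,
    /// Update only when the existing (target) age equals this value.
    UpdateIfTargetAge(i32),
}

/// Subset of the statistics reported by the merge (toy version).
#[derive(Debug, Default, PartialEq)]
struct MergeStats {
    num_inserted_rows: u64,
    num_updated_rows: u64,
}

/// Toy merge insert keyed on `i`; rows are `(i, age)` pairs.
fn merge_insert(
    target: &mut BTreeMap<i32, i32>,
    source: impl IntoIterator<Item = (i32, i32)>,
    when_matched: WhenMatched,
    insert_not_matched: bool,
) -> MergeStats {
    let mut stats = MergeStats::default();
    for (key, age) in source {
        match target.get_mut(&key) {
            // Key exists in the target: apply the "when matched" rule.
            Some(existing) => {
                let update = match when_matched {
                    WhenMatched::DoNothing => false,
                    WhenMatched::UpdateAll => true,
                    WhenMatched::UpdateIfTargetAge(a) => *existing == a,
                };
                if update {
                    *existing = age;
                    stats.num_updated_rows += 1;
                }
            }
            // Key is new: insert only if "when not matched insert all" is set.
            None => {
                if insert_not_matched {
                    target.insert(key, age);
                    stats.num_inserted_rows += 1;
                }
            }
        }
    }
    stats
}

/// Ten rows with keys `offset..offset + 10`, all with the same age.
fn rows(offset: i32, age: i32) -> Vec<(i32, i32)> {
    (offset..offset + 10).map(|i| (i, age)).collect()
}

fn main() {
    let mut table: BTreeMap<i32, i32> = rows(0, 0).into_iter().collect();
    let stats = merge_insert(&mut table, rows(5, 1), WhenMatched::DoNothing, true);
    println!("{:?}, rows = {}", stats, table.len());
}
```

Keys 5..10 already exist, so only keys 10..15 are inserted; a later conditional update on `age = 0` then touches only keys 5..10, because the inserted rows carry `age = 1`.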

@@ -0,0 +1,724 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! Table optimization operations for compaction, pruning, and index optimization.
//!
//! This module contains the implementation of optimization operations that help
//! maintain good performance for LanceDB tables.
use std::sync::Arc;
use lance::dataset::cleanup::RemovalStats;
use lance::dataset::optimize::{compact_files, CompactionMetrics, IndexRemapperOptions};
use lance_index::optimize::OptimizeOptions;
use lance_index::DatasetIndexExt;
use log::info;
pub use chrono::Duration;
pub use lance::dataset::optimize::CompactionOptions;
use super::NativeTable;
use crate::error::Result;
/// Optimize the dataset.
///
/// Similar to `VACUUM` in PostgreSQL, it offers different options to
/// optimize different parts of the table on disk.
///
/// By default, it optimizes everything, as [`OptimizeAction::All`].
#[derive(Default)]
pub enum OptimizeAction {
/// Run all optimizations with default values
#[default]
All,
/// Compacts files in the dataset
///
/// LanceDB uses a readonly filesystem for performance and safe concurrency. Every time
/// new data is added it will be added into new files. Small files
/// can hurt both read and write performance. Compaction will merge small files
/// into larger ones.
///
/// All operations that modify data (add, delete, update, merge insert, etc.) will create
/// new files. If these operations are run frequently then compaction should run frequently.
///
/// If these operations are never run (search only) then compaction is not necessary.
Compact {
options: CompactionOptions,
remap_options: Option<Arc<dyn IndexRemapperOptions>>,
},
/// Prune old versions of the dataset
///
/// Every change in LanceDB is additive. When data is removed from a dataset a new version is
/// created that doesn't contain the removed data. However, the old version, which does contain
/// the removed data, is left in place. This is necessary for consistency and concurrency and
/// also enables time travel functionality like the ability to checkout an older version of the
/// dataset to undo changes.
///
/// Over time, these old versions can consume a lot of disk space. The prune operation will
/// remove versions of the dataset that are older than a certain age. This will free up the
/// space used by that old data.
///
/// Once a version is pruned it can no longer be checked out.
Prune {
/// The duration of time to keep versions of the dataset.
older_than: Option<Duration>,
/// Because they may be part of an in-progress transaction, files less than 7 days old are not deleted by default.
/// If you are sure that there are no in-progress transactions, then you can set this to `Some(true)` to delete all files older than `older_than`.
delete_unverified: Option<bool>,
/// If true, an error will be returned if there are any old versions that are still tagged.
error_if_tagged_old_versions: Option<bool>,
},
/// Optimize the indices
///
/// This operation optimizes all indices in the table. When new data is added to LanceDB
/// it is not added to the indices. However, it can still turn up in searches because the search
/// function will scan both the indexed data and the unindexed data in parallel. Over time, the
/// unindexed data can become large enough that the search performance is slow. This operation
/// will add the unindexed data to the indices without rerunning the full index creation process.
///
/// Optimizing an index is faster than re-training the index but it does not typically adjust the
/// underlying model relied upon by the index. This can eventually lead to poor search accuracy
/// and so users may still want to occasionally retrain the index after adding a large amount of
/// data.
///
/// For example, when using IVF, an index will create clusters. Optimizing an index assigns unindexed
/// data to the existing clusters, but it does not move the clusters or create new clusters.
Index(OptimizeOptions),
}
/// Statistics about the optimization.
#[derive(Debug, Default)]
pub struct OptimizeStats {
/// Stats of the file compaction.
pub compaction: Option<CompactionMetrics>,
/// Stats of the version pruning
pub prune: Option<RemovalStats>,
}
/// Internal implementation of optimize_indices
///
/// This logic was moved from NativeTable to keep table.rs clean.
pub(crate) async fn optimize_indices(table: &NativeTable, options: &OptimizeOptions) -> Result<()> {
info!("LanceDB: optimizing indices: {:?}", options);
table.dataset.ensure_mutable()?;
let mut dataset = (*table.dataset.get().await?).clone();
dataset.optimize_indices(options).await?;
table.dataset.update(dataset);
Ok(())
}
/// Remove old versions of the dataset from disk.
///
/// # Arguments
/// * `older_than` - The duration of time to keep versions of the dataset.
/// * `delete_unverified` - Because they may be part of an in-progress
/// transaction, files less than 7 days old are not deleted by default.
/// If you are sure that there are no in-progress transactions, then you
/// can set this to `Some(true)` to delete all files older than `older_than`.
/// * `error_if_tagged_old_versions` - If true, an error will be returned if
/// there are any old versions that are still tagged.
///
/// This calls into [lance::dataset::Dataset::cleanup_old_versions] and
/// returns the result.
pub(crate) async fn cleanup_old_versions(
table: &NativeTable,
older_than: Duration,
delete_unverified: Option<bool>,
error_if_tagged_old_versions: Option<bool>,
) -> Result<RemovalStats> {
table.dataset.ensure_mutable()?;
let dataset = table.dataset.get().await?;
Ok(dataset
.cleanup_old_versions(older_than, delete_unverified, error_if_tagged_old_versions)
.await?)
}
/// Compact files in the dataset.
///
/// This can be run after making several small appends to optimize the table
/// for faster reads.
///
/// This calls into [lance::dataset::optimize::compact_files].
pub(crate) async fn compact_files_impl(
table: &NativeTable,
options: CompactionOptions,
remap_options: Option<Arc<dyn IndexRemapperOptions>>,
) -> Result<CompactionMetrics> {
table.dataset.ensure_mutable()?;
let mut dataset = (*table.dataset.get().await?).clone();
let metrics = compact_files(&mut dataset, options, remap_options).await?;
table.dataset.update(dataset);
Ok(metrics)
}
/// Execute the optimize operation on the table.
///
/// This is the main entry point for all optimization operations.
pub(crate) async fn execute_optimize(
table: &NativeTable,
action: OptimizeAction,
) -> Result<OptimizeStats> {
let mut stats = OptimizeStats {
compaction: None,
prune: None,
};
match action {
OptimizeAction::All => {
// Call helper functions directly to avoid async recursion issues
stats.compaction =
Some(compact_files_impl(table, CompactionOptions::default(), None).await?);
stats.prune = Some(
cleanup_old_versions(
table,
Duration::try_days(7).expect("valid delta"),
None,
None,
)
.await?,
);
optimize_indices(table, &OptimizeOptions::default()).await?;
}
OptimizeAction::Compact {
options,
remap_options,
} => {
stats.compaction = Some(compact_files_impl(table, options, remap_options).await?);
}
OptimizeAction::Prune {
older_than,
delete_unverified,
error_if_tagged_old_versions,
} => {
stats.prune = Some(
cleanup_old_versions(
table,
older_than.unwrap_or(Duration::try_days(7).expect("valid delta")),
delete_unverified,
error_if_tagged_old_versions,
)
.await?,
);
}
OptimizeAction::Index(options) => {
optimize_indices(table, &options).await?;
}
}
Ok(stats)
}
#[cfg(test)]
mod tests {
use arrow_array::{Int32Array, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use rstest::rstest;
use std::sync::Arc;
use crate::connect;
use crate::index::{scalar::BTreeIndexBuilder, Index};
use crate::query::ExecutableQuery;
use crate::table::{CompactionOptions, OptimizeAction, OptimizeStats};
use futures::TryStreamExt;
#[tokio::test]
async fn test_optimize_compact_simple() {
let conn = connect("memory://").execute().await.unwrap();
// Create a table with initial data
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..100))],
)
.unwrap();
let table = conn
.create_table("test_compact", batch)
.execute()
.await
.unwrap();
// Add more data to create multiple fragments
for i in 0..5 {
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(
(i * 100 + 100)..((i + 1) * 100 + 100),
))],
)
.unwrap();
table.add(batch).execute().await.unwrap();
}
// Verify we have multiple fragments before compaction
let initial_row_count = table.count_rows(None).await.unwrap();
assert_eq!(initial_row_count, 600);
// Run compaction
let stats = table
.optimize(OptimizeAction::Compact {
options: CompactionOptions {
target_rows_per_fragment: 1000,
..Default::default()
},
remap_options: None,
})
.await
.unwrap();
// Verify compaction occurred
assert!(stats.compaction.is_some());
let compaction_metrics = stats.compaction.unwrap();
assert!(compaction_metrics.fragments_removed > 0);
// Verify data integrity after compaction
let final_row_count = table.count_rows(None).await.unwrap();
assert_eq!(final_row_count, 600);
// Verify data content is correct
let batches = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
assert_eq!(total_rows, 600);
// Verify the values are as expected
let mut all_values: Vec<i32> = Vec::new();
for batch in &batches {
let array = batch["i"].as_any().downcast_ref::<Int32Array>().unwrap();
all_values.extend(array.values().iter().copied());
}
all_values.sort();
let expected: Vec<i32> = (0..600).collect();
assert_eq!(all_values, expected);
}
#[tokio::test]
async fn test_optimize_prune_versions() {
let conn = connect("memory://").execute().await.unwrap();
// Create a table
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..10))],
)
.unwrap();
let table = conn
.create_table("test_prune", batch)
.execute()
.await
.unwrap();
// Make several modifications to create versions
for i in 0..5 {
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(
(i * 10 + 10)..((i + 1) * 10 + 10),
))],
)
.unwrap();
table.add(batch).execute().await.unwrap();
}
// Verify multiple versions exist
let versions = table.list_versions().await.unwrap();
assert!(versions.len() > 1);
// Run prune with a zero-day cutoff and delete_unverified, so every old version is eligible for removal
let stats = table
.optimize(OptimizeAction::Prune {
older_than: Some(chrono::Duration::try_days(0).unwrap()),
delete_unverified: Some(true),
error_if_tagged_old_versions: None,
})
.await
.unwrap();
// Prune-only operation should not have compaction stats
assert!(stats.compaction.is_none());
// Verify prune stats
let prune_stats = stats.prune.unwrap();
assert!(prune_stats.bytes_removed > 0);
assert_eq!(prune_stats.old_versions, 5);
// Verify data is still intact
let final_row_count = table.count_rows(None).await.unwrap();
assert_eq!(final_row_count, 60);
// Verify data content is correct
let batches = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let mut all_values: Vec<i32> = Vec::new();
for batch in &batches {
let array = batch["i"].as_any().downcast_ref::<Int32Array>().unwrap();
all_values.extend(array.values().iter().copied());
}
all_values.sort();
let expected: Vec<i32> = (0..60).collect();
assert_eq!(all_values, expected);
}
#[tokio::test]
async fn test_optimize_index() {
let conn = connect("memory://").execute().await.unwrap();
// Create a table with data
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..100))],
)
.unwrap();
let table = conn
.create_table("test_index_optimize", batch)
.execute()
.await
.unwrap();
// Create an index
table
.create_index(&["i"], Index::BTree(BTreeIndexBuilder::default()))
.execute()
.await
.unwrap();
// Add more data (unindexed)
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(100..200))],
)
.unwrap();
table.add(batch).execute().await.unwrap();
// Verify index stats before optimization
let indices = table.list_indices().await.unwrap();
assert_eq!(indices.len(), 1);
let index_name = indices[0].name.clone();
let stats_before = table.index_stats(&index_name).await.unwrap().unwrap();
assert_eq!(stats_before.num_indexed_rows, 100);
assert_eq!(stats_before.num_unindexed_rows, 100);
// Run index optimization
let stats = table
.optimize(OptimizeAction::Index(Default::default()))
.await
.unwrap();
// For index optimization, compaction and prune stats should be None
assert!(stats.compaction.is_none());
assert!(stats.prune.is_none());
// Verify index stats after optimization
let stats_after = table.index_stats(&index_name).await.unwrap().unwrap();
assert_eq!(stats_after.num_indexed_rows, 200);
assert_eq!(stats_after.num_unindexed_rows, 0);
assert!(stats_after.num_indices.is_some());
// Verify data integrity
let final_row_count = table.count_rows(None).await.unwrap();
assert_eq!(final_row_count, 200);
}
#[tokio::test]
async fn test_optimize_all() {
let conn = connect("memory://").execute().await.unwrap();
// Create a table with data
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..100))],
)
.unwrap();
let table = conn
.create_table("test_optimize_all", batch)
.execute()
.await
.unwrap();
// Add more data
for i in 0..3 {
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(
(i * 100 + 100)..((i + 1) * 100 + 100),
))],
)
.unwrap();
table.add(batch).execute().await.unwrap();
}
// Run all optimizations
let stats = table.optimize(OptimizeAction::All).await.unwrap();
// Verify stats from both compaction and prune
assert!(stats.compaction.is_some());
assert!(stats.prune.is_some());
// Verify data integrity
let final_row_count = table.count_rows(None).await.unwrap();
assert_eq!(final_row_count, 400);
// Verify data content
let batches = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let mut all_values: Vec<i32> = Vec::new();
for batch in &batches {
let array = batch["i"].as_any().downcast_ref::<Int32Array>().unwrap();
all_values.extend(array.values().iter().copied());
}
all_values.sort();
let expected: Vec<i32> = (0..400).collect();
assert_eq!(all_values, expected);
}
#[tokio::test]
async fn test_optimize_default_action() {
// Verify that default action is All
let action: OptimizeAction = Default::default();
assert!(matches!(action, OptimizeAction::All));
}
#[tokio::test]
async fn test_optimize_stats_default() {
// Verify OptimizeStats default values
let stats: OptimizeStats = Default::default();
assert!(stats.compaction.is_none());
assert!(stats.prune.is_none());
}
#[tokio::test]
async fn test_compact_with_deferred_index_remap() {
// Smoke test: verifies compaction with deferred index remap doesn't error.
// We don't currently assert that remap is actually deferred.
let conn = connect("memory://").execute().await.unwrap();
// Create a table with data
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..100))],
)
.unwrap();
let table = conn
.create_table("test_deferred_remap", batch.clone())
.execute()
.await
.unwrap();
// Add more data
table.add(batch).execute().await.unwrap();
// Create an index
table
.create_index(&["id"], Index::BTree(BTreeIndexBuilder::default()))
.execute()
.await
.unwrap();
// Run compaction with deferred index remap
let stats = table
.optimize(OptimizeAction::Compact {
options: CompactionOptions {
target_rows_per_fragment: 2000,
defer_index_remap: true,
..Default::default()
},
remap_options: None,
})
.await
.unwrap();
assert!(stats.compaction.is_some());
// Verify data integrity after compaction
let final_row_count = table.count_rows(None).await.unwrap();
assert_eq!(final_row_count, 200);
// Verify data content is correct
let batches = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let mut all_values: Vec<i32> = Vec::new();
for batch in &batches {
let array = batch["id"].as_any().downcast_ref::<Int32Array>().unwrap();
all_values.extend(array.values().iter().copied());
}
all_values.sort();
// Since we added the same data twice (0..100 twice), we expect 200 values
// with values 0-99 appearing twice
let mut expected: Vec<i32> = (0..100).chain(0..100).collect();
expected.sort();
assert_eq!(all_values, expected);
}
#[tokio::test]
async fn test_compaction_preserves_schema() {
let conn = connect("memory://").execute().await.unwrap();
// Create a table with multiple columns
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, true),
]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![
Arc::new(Int32Array::from_iter_values(0..10)),
Arc::new(StringArray::from(
(0..10).map(|i| format!("name_{}", i)).collect::<Vec<_>>(),
)),
],
)
.unwrap();
let original_schema = batch.schema();
let table = conn
.create_table("test_schema_preserved", batch.clone())
.execute()
.await
.unwrap();
// Add more data
table.add(batch).execute().await.unwrap();
// Run compaction
table
.optimize(OptimizeAction::Compact {
options: CompactionOptions::default(),
remap_options: None,
})
.await
.unwrap();
// Verify schema is preserved
let current_schema = table.schema().await.unwrap();
assert_eq!(current_schema, original_schema);
// Verify data is intact and correct
let batches = table
.query()
.execute()
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
assert_eq!(total_rows, 20);
}
#[tokio::test]
async fn test_optimize_empty_table() {
let conn = connect("memory://").execute().await.unwrap();
// Create a table and delete all data
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..10))],
)
.unwrap();
let table = conn
.create_table("test_empty_optimize", batch)
.execute()
.await
.unwrap();
// Delete all rows
table.delete("true").await.unwrap();
// Verify table is empty
assert_eq!(table.count_rows(None).await.unwrap(), 0);
// Optimize should work on empty table
let stats = table.optimize(OptimizeAction::All).await.unwrap();
assert!(stats.compaction.is_some());
assert!(stats.prune.is_some());
// Verify table is still empty but schema is preserved
assert_eq!(table.count_rows(None).await.unwrap(), 0);
let current_schema = table.schema().await.unwrap();
assert_eq!(current_schema, schema);
}
#[rstest]
#[case::all(OptimizeAction::All)]
#[case::compact(OptimizeAction::Compact {
options: CompactionOptions::default(),
remap_options: None,
})]
#[case::prune(OptimizeAction::Prune {
older_than: Some(chrono::Duration::try_days(0).unwrap()),
delete_unverified: Some(true),
error_if_tagged_old_versions: None,
})]
#[case::index(OptimizeAction::Index(Default::default()))]
#[tokio::test]
async fn test_optimize_fails_on_checked_out_table(#[case] action: OptimizeAction) {
let conn = connect("memory://").execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from_iter_values(0..10))],
)
.unwrap();
let table = conn
.create_table("test_checkout_optimize", batch.clone())
.execute()
.await
.unwrap();
table.add(batch).execute().await.unwrap();
table.checkout(1).await.unwrap();
let result = table.optimize(action).await;
assert!(result.is_err());
let err_msg = result.unwrap_err().to_string();
assert!(
err_msg.contains("cannot be modified when a specific version is checked out"),
"Expected error message about checked out table, got: {}",
err_msg
);
}
}
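
The write paths in this module (`optimize_indices`, `compact_files_impl`) share one shape: take the current snapshot with `get()`, clone it, mutate the clone, then publish it with `update()`. The sketch below models that shape with standard-library types only. `Snapshot`, `SnapshotCell`, and `compact` are hypothetical stand-ins and make no claim about how `DatasetConsistencyWrapper` is actually implemented.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a dataset: a version plus per-fragment row counts.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    version: u64,
    fragments: Vec<u32>,
}

/// Hypothetical wrapper: readers receive a cheap `Arc` handle,
/// writers publish a whole new snapshot.
struct SnapshotCell {
    current: Mutex<Arc<Snapshot>>,
}

impl SnapshotCell {
    fn new(snapshot: Snapshot) -> Self {
        Self { current: Mutex::new(Arc::new(snapshot)) }
    }

    /// Return the current snapshot; the lock is held only while cloning the `Arc`.
    fn get(&self) -> Arc<Snapshot> {
        self.current.lock().unwrap().clone()
    }

    /// Publish a new snapshot. Handles returned earlier by `get` stay valid.
    fn update(&self, new: Snapshot) {
        *self.current.lock().unwrap() = Arc::new(new);
    }
}

/// Toy "compaction": merge all fragments into one and bump the version.
/// Returns the number of fragments removed.
fn compact(cell: &SnapshotCell) -> usize {
    // Clone the snapshot so the mutation never touches what readers hold.
    let mut working = (*cell.get()).clone();
    let removed = working.fragments.len().saturating_sub(1);
    let total: u32 = working.fragments.iter().sum();
    working.fragments = vec![total];
    working.version += 1;
    cell.update(working);
    removed
}

fn main() {
    let cell = SnapshotCell::new(Snapshot { version: 1, fragments: vec![100, 100, 100] });
    let before = cell.get();
    let removed = compact(&cell);
    println!("removed {} fragments; reader still sees v{}, table is at v{}",
        removed, before.version, cell.get().version);
}
```

A reader that called `get()` before the write keeps its old snapshot, which is why reads and writes on the same handle do not block each other in this model.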

@@ -0,0 +1,739 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::Arc;
use super::NativeTable;
use crate::error::{Error, Result};
use crate::query::{
QueryExecutionOptions, QueryFilter, QueryRequest, Select, VectorQueryRequest, DEFAULT_TOP_K,
};
use crate::utils::{default_vector_column, TimeoutStream};
use arrow::array::{AsArray, FixedSizeListBuilder, Float32Builder};
use arrow::datatypes::{Float32Type, UInt8Type};
use arrow_array::Array;
use arrow_schema::{DataType, Schema};
use datafusion_physical_plan::projection::ProjectionExec;
use datafusion_physical_plan::repartition::RepartitionExec;
use datafusion_physical_plan::union::UnionExec;
use datafusion_physical_plan::ExecutionPlan;
use futures::future::try_join_all;
use lance::dataset::scanner::DatasetRecordBatchStream;
use lance::dataset::scanner::Scanner;
use lance_datafusion::exec::{analyze_plan as lance_analyze_plan, execute_plan};
use lance_namespace::models::{
QueryTableRequest as NsQueryTableRequest, QueryTableRequestColumns,
QueryTableRequestFullTextQuery, QueryTableRequestVector, StringFtsQuery,
};
use lance_namespace::LanceNamespace;
#[derive(Debug, Clone)]
pub enum AnyQuery {
Query(QueryRequest),
VectorQuery(VectorQueryRequest),
}
// Dispatch to namespace (server-side) or local query execution
pub async fn execute_query(
table: &NativeTable,
query: &AnyQuery,
options: QueryExecutionOptions,
) -> Result<DatasetRecordBatchStream> {
// If namespace client is configured, use server-side query execution
if let Some(ref namespace_client) = table.namespace_client {
return execute_namespace_query(table, namespace_client.clone(), query, options).await;
}
execute_generic_query(table, query, options).await
}
pub async fn analyze_query_plan(
table: &NativeTable,
query: &AnyQuery,
options: QueryExecutionOptions,
) -> Result<String> {
let plan = create_plan(table, query, options).await?;
Ok(lance_analyze_plan(plan, Default::default()).await?)
}
/// Local Execution Path (DataFusion)
async fn execute_generic_query(
table: &NativeTable,
query: &AnyQuery,
options: QueryExecutionOptions,
) -> Result<DatasetRecordBatchStream> {
let plan = create_plan(table, query, options.clone()).await?;
let inner = execute_plan(plan, Default::default())?;
let inner = if let Some(timeout) = options.timeout {
TimeoutStream::new_boxed(inner, timeout)
} else {
inner
};
Ok(DatasetRecordBatchStream::new(inner))
}
pub async fn create_plan(
table: &NativeTable,
query: &AnyQuery,
options: QueryExecutionOptions,
) -> Result<Arc<dyn ExecutionPlan>> {
let query = match query {
AnyQuery::VectorQuery(query) => query.clone(),
AnyQuery::Query(query) => VectorQueryRequest::from_plain_query(query.clone()),
};
let ds_ref = table.dataset.get().await?;
let schema = ds_ref.schema();
let mut column = query.column.clone();
let mut query_vector = query.query_vector.first().cloned();
if query.query_vector.len() > 1 {
if column.is_none() {
// Infer a vector column with the same dimension as the query vector.
let arrow_schema = Schema::from(ds_ref.schema());
column = Some(default_vector_column(
&arrow_schema,
Some(query.query_vector[0].len() as i32),
)?);
}
let vector_field = schema.field(column.as_ref().unwrap()).unwrap();
if let DataType::List(_) = vector_field.data_type() {
// Multivector handling: concatenate into FixedSizeList<FixedSizeList<_>>
let vectors = query
.query_vector
.iter()
.map(|arr| arr.as_ref())
.collect::<Vec<_>>();
let dim = vectors[0].len();
let mut fsl_builder = FixedSizeListBuilder::with_capacity(
Float32Builder::with_capacity(dim),
dim as i32,
vectors.len(),
);
for vec in vectors {
fsl_builder
.values()
.append_slice(vec.as_primitive::<Float32Type>().values());
fsl_builder.append(true);
}
query_vector = Some(Arc::new(fsl_builder.finish()));
} else {
// Multiple query vectors: create a plan for each and union them
let query_vecs = query.query_vector.clone();
let plan_futures = query_vecs
.into_iter()
.map(|query_vector| {
let mut sub_query = query.clone();
sub_query.query_vector = vec![query_vector];
let options_ref = options.clone();
async move {
create_plan(table, &AnyQuery::VectorQuery(sub_query), options_ref).await
}
})
.collect::<Vec<_>>();
let plans = try_join_all(plan_futures).await?;
return create_multi_vector_plan(plans);
}
}
let mut scanner: Scanner = ds_ref.scan();
if let Some(query_vector) = query_vector {
let column = if let Some(col) = column {
col
} else {
let arrow_schema = Schema::from(ds_ref.schema());
default_vector_column(&arrow_schema, Some(query_vector.len() as i32))?
};
let (_, element_type) = lance::index::vector::utils::get_vector_type(schema, &column)?;
let is_binary = matches!(element_type, DataType::UInt8);
let top_k = query.base.limit.unwrap_or(DEFAULT_TOP_K) + query.base.offset.unwrap_or(0);
if is_binary {
let query_vector = arrow::compute::cast(&query_vector, &DataType::UInt8)?;
let query_vector = query_vector.as_primitive::<UInt8Type>();
scanner.nearest(&column, query_vector, top_k)?;
} else {
scanner.nearest(&column, query_vector.as_ref(), top_k)?;
}
scanner.minimum_nprobes(query.minimum_nprobes);
if let Some(maximum_nprobes) = query.maximum_nprobes {
scanner.maximum_nprobes(maximum_nprobes);
}
}
scanner.limit(
query.base.limit.map(|limit| limit as i64),
query.base.offset.map(|offset| offset as i64),
)?;
if let Some(ef) = query.ef {
scanner.ef(ef);
}
scanner.distance_range(query.lower_bound, query.upper_bound);
scanner.use_index(query.use_index);
scanner.prefilter(query.base.prefilter);
match query.base.select {
Select::Columns(ref columns) => {
scanner.project(columns.as_slice())?;
}
Select::Dynamic(ref select_with_transform) => {
scanner.project_with_transform(select_with_transform.as_slice())?;
}
Select::All => {}
}
if query.base.with_row_id {
scanner.with_row_id();
}
scanner.batch_size(options.max_batch_length as usize);
if query.base.fast_search {
scanner.fast_search();
}
if let Some(filter) = &query.base.filter {
match filter {
QueryFilter::Sql(sql) => {
scanner.filter(sql)?;
}
QueryFilter::Substrait(substrait) => {
scanner.filter_substrait(substrait)?;
}
QueryFilter::Datafusion(expr) => {
scanner.filter_expr(expr.clone());
}
}
}
if let Some(fts) = &query.base.full_text_search {
scanner.full_text_search(fts.clone())?;
}
if let Some(refine_factor) = query.refine_factor {
scanner.refine(refine_factor);
}
if let Some(distance_type) = query.distance_type {
scanner.distance_metric(distance_type.into());
}
if query.base.disable_scoring_autoprojection {
scanner.disable_scoring_autoprojection();
}
Ok(scanner.create_plan().await?)
}
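The `top_k` computation above asks the nearest-neighbour scan for `limit + offset` candidates, so that the later `scanner.limit(limit, offset)` call still has `limit` rows left after skipping. A minimal std-only sketch of that arithmetic follows. `ASSUMED_DEFAULT_TOP_K`, `effective_top_k` and `page` are hypothetical names, and the value 10 is an assumption because the real `DEFAULT_TOP_K` is not shown here.

```rust
/// Hypothetical stand-in for the crate's `DEFAULT_TOP_K` (actual value not shown in this diff).
const ASSUMED_DEFAULT_TOP_K: usize = 10;

/// Number of neighbours to request so that `offset` rows can be skipped afterwards.
fn effective_top_k(limit: Option<usize>, offset: Option<usize>) -> usize {
    limit.unwrap_or(ASSUMED_DEFAULT_TOP_K) + offset.unwrap_or(0)
}

/// Apply offset, then limit, to an already ranked candidate list.
fn page<T: Clone>(ranked: &[T], limit: Option<usize>, offset: Option<usize>) -> Vec<T> {
    let skipped = ranked.iter().skip(offset.unwrap_or(0));
    match limit {
        Some(n) => skipped.take(n).cloned().collect(),
        None => skipped.cloned().collect(),
    }
}
```

With `limit = 10` and `offset = 5`, the scan would be asked for 15 candidates and the page would hold ranks 5 through 14.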
// Helper functions below
// Take many execution plans and map them into a single plan that adds
// a query_index column and unions them.
pub(crate) fn create_multi_vector_plan(
plans: Vec<Arc<dyn ExecutionPlan>>,
) -> Result<Arc<dyn ExecutionPlan>> {
if plans.is_empty() {
return Err(Error::InvalidInput {
message: "No plans provided".to_string(),
});
}
// Projection that keeps all existing columns
let first_plan = plans[0].clone();
let project_all_columns = first_plan
.schema()
.fields()
.iter()
.enumerate()
.map(|(i, field)| {
let expr = datafusion_physical_plan::expressions::Column::new(field.name().as_str(), i);
let expr = Arc::new(expr) as Arc<dyn datafusion_physical_plan::PhysicalExpr>;
(expr, field.name().clone())
})
.collect::<Vec<_>>();
let projected_plans = plans
.into_iter()
.enumerate()
.map(|(plan_i, plan)| {
let query_index = datafusion_common::ScalarValue::Int32(Some(plan_i as i32));
let query_index_expr = datafusion_physical_plan::expressions::Literal::new(query_index);
let query_index_expr =
Arc::new(query_index_expr) as Arc<dyn datafusion_physical_plan::PhysicalExpr>;
let mut projections = vec![(query_index_expr, "query_index".to_string())];
projections.extend_from_slice(&project_all_columns);
let projection = ProjectionExec::try_new(projections, plan).unwrap();
Arc::new(projection) as Arc<dyn datafusion_physical_plan::ExecutionPlan>
})
.collect::<Vec<_>>();
let unioned = UnionExec::try_new(projected_plans).map_err(|err| Error::Runtime {
message: err.to_string(),
})?;
// We require 1 partition in the final output
let repartitioned = RepartitionExec::try_new(
unioned,
datafusion_physical_plan::Partitioning::RoundRobinBatch(1),
)
.unwrap();
Ok(Arc::new(repartitioned))
}
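`create_multi_vector_plan` prepends a literal `query_index` column to each sub-plan and then unions the results, so every output row records which query vector produced it. The sketch below shows the same idea on plain vectors rather than DataFusion plans. `Row`, `TaggedRow` and `union_with_query_index` are hypothetical names used only for illustration.

```rust
/// One result row; a stand-in for a row of a record batch.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    id: u32,
    distance: f32,
}

/// A row tagged with the position of the query vector that produced it.
#[derive(Debug, Clone, PartialEq)]
struct TaggedRow {
    query_index: i32,
    row: Row,
}

/// Tag each per-query result set with its index, then concatenate them.
fn union_with_query_index(per_query: Vec<Vec<Row>>) -> Result<Vec<TaggedRow>, String> {
    if per_query.is_empty() {
        // Mirrors the empty-input check in the function above.
        return Err("No plans provided".to_string());
    }
    Ok(per_query
        .into_iter()
        .enumerate()
        .flat_map(|(i, rows)| {
            rows.into_iter().map(move |row| TaggedRow {
                query_index: i as i32,
                row,
            })
        })
        .collect())
}
```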
/// Execute a query on the namespace server instead of locally.
async fn execute_namespace_query(
table: &NativeTable,
namespace_client: Arc<dyn LanceNamespace>,
query: &AnyQuery,
_options: QueryExecutionOptions,
) -> Result<DatasetRecordBatchStream> {
// Build table_id from namespace + table name
let mut table_id = table.namespace.clone();
table_id.push(table.name.clone());
// Convert AnyQuery to namespace QueryTableRequest
let mut ns_request = convert_to_namespace_query(query)?;
// Set the table ID on the request
ns_request.id = Some(table_id);
// Call the namespace query_table API
let response_bytes = namespace_client
.query_table(ns_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to execute server-side query: {}", e),
})?;
// Parse the Arrow IPC response into a RecordBatchStream
parse_arrow_ipc_response(response_bytes).await
}
/// Convert an AnyQuery to the namespace QueryTableRequest format.
fn convert_to_namespace_query(query: &AnyQuery) -> Result<NsQueryTableRequest> {
match query {
AnyQuery::VectorQuery(vq) => {
// Extract the query vector(s)
let vector = extract_query_vector(&vq.query_vector)?;
// Convert filter to SQL string
let filter = match &vq.base.filter {
Some(f) => Some(filter_to_sql(f)?),
None => None,
};
// Convert select to columns list
let columns = match &vq.base.select {
Select::All => None,
Select::Columns(cols) => Some(Box::new(QueryTableRequestColumns {
column_names: Some(cols.clone()),
column_aliases: None,
})),
Select::Dynamic(_) => {
return Err(Error::NotSupported {
message:
"Dynamic column selection is not supported for server-side queries"
.to_string(),
});
}
};
// Check for unsupported features
if vq.base.reranker.is_some() {
return Err(Error::NotSupported {
message: "Reranker is not supported for server-side queries".to_string(),
});
}
// Convert FTS query if present
let full_text_query = vq.base.full_text_search.as_ref().map(|fts| {
let columns = fts.columns();
let columns_vec = if columns.is_empty() {
None
} else {
Some(columns.into_iter().collect())
};
Box::new(QueryTableRequestFullTextQuery {
string_query: Some(Box::new(StringFtsQuery {
query: fts.query.to_string(),
columns: columns_vec,
})),
structured_query: None,
})
});
Ok(NsQueryTableRequest {
id: None, // Will be set by the caller (execute_namespace_query)
k: vq.base.limit.unwrap_or(10) as i32,
vector: Box::new(vector),
vector_column: vq.column.clone(),
filter,
columns,
offset: vq.base.offset.map(|o| o as i32),
distance_type: vq.distance_type.map(|dt| dt.to_string()),
nprobes: Some(vq.minimum_nprobes as i32),
ef: vq.ef.map(|e| e as i32),
refine_factor: vq.refine_factor.map(|r| r as i32),
lower_bound: vq.lower_bound,
upper_bound: vq.upper_bound,
prefilter: Some(vq.base.prefilter),
fast_search: Some(vq.base.fast_search),
with_row_id: Some(vq.base.with_row_id),
bypass_vector_index: Some(!vq.use_index),
full_text_query,
..Default::default()
})
}
AnyQuery::Query(q) => {
// For non-vector queries, pass an empty vector (similar to remote table implementation)
if q.reranker.is_some() {
return Err(Error::NotSupported {
message: "Reranker is not supported for server-side query execution"
.to_string(),
});
}
let filter = q.filter.as_ref().map(filter_to_sql).transpose()?;
let columns = match &q.select {
Select::All => None,
Select::Columns(cols) => Some(Box::new(QueryTableRequestColumns {
column_names: Some(cols.clone()),
column_aliases: None,
})),
Select::Dynamic(_) => {
return Err(Error::NotSupported {
message: "Dynamic columns are not supported for server-side query"
.to_string(),
});
}
};
// Handle full text search if present
let full_text_query = q.full_text_search.as_ref().map(|fts| {
let columns_vec = if fts.columns().is_empty() {
None
} else {
Some(fts.columns().iter().cloned().collect())
};
Box::new(QueryTableRequestFullTextQuery {
string_query: Some(Box::new(StringFtsQuery {
query: fts.query.to_string(),
columns: columns_vec,
})),
structured_query: None,
})
});
// Empty vector for non-vector queries
let vector = Box::new(QueryTableRequestVector {
single_vector: Some(vec![]),
multi_vector: None,
});
Ok(NsQueryTableRequest {
id: None, // Will be set by caller
vector,
k: q.limit.unwrap_or(10) as i32,
filter,
columns,
prefilter: Some(q.prefilter),
offset: q.offset.map(|o| o as i32),
vector_column: None, // No vector column for plain queries
with_row_id: Some(q.with_row_id),
bypass_vector_index: Some(true), // No vector index for plain queries
full_text_query,
..Default::default()
})
}
}
}
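The two branches of `convert_to_namespace_query` convert the optional filter in different styles: the vector branch uses a `match` with `?`, while the plain-query branch uses `map(...).transpose()?`. A small sketch, assuming a hypothetical `Filter` enum and `to_sql` helper in place of `QueryFilter` and `filter_to_sql`, suggests the two forms behave the same:

```rust
/// Hypothetical filter type; only the variant names echo `QueryFilter`.
enum Filter {
    Sql(String),
    Substrait,
}

/// Only SQL filters can be rendered as a string in this sketch.
fn to_sql(filter: &Filter) -> Result<String, String> {
    match filter {
        Filter::Sql(sql) => Ok(sql.clone()),
        Filter::Substrait => Err("Substrait filters are not supported".to_string()),
    }
}

/// Style used in the vector-query branch: explicit match plus `?`.
fn convert_with_match(filter: &Option<Filter>) -> Result<Option<String>, String> {
    let sql = match filter {
        Some(f) => Some(to_sql(f)?),
        None => None,
    };
    Ok(sql)
}

/// Style used in the plain-query branch: Option<Result<_>> flipped into Result<Option<_>>.
fn convert_with_transpose(filter: &Option<Filter>) -> Result<Option<String>, String> {
    filter.as_ref().map(to_sql).transpose()
}
```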
fn filter_to_sql(filter: &QueryFilter) -> Result<String> {
match filter {
QueryFilter::Sql(sql) => Ok(sql.clone()),
QueryFilter::Substrait(_) => Err(Error::NotSupported {
message: "Substrait filters are not supported for server-side queries".to_string(),
}),
QueryFilter::Datafusion(_) => Err(Error::NotSupported {
message: "Datafusion expression filters are not supported for server-side queries. Use SQL filter instead.".to_string(),
}),
}
}
/// Extract query vector(s) from Arrow arrays into the namespace format.
fn extract_query_vector(
query_vectors: &[Arc<dyn arrow_array::Array>],
) -> Result<QueryTableRequestVector> {
if query_vectors.is_empty() {
return Err(Error::InvalidInput {
message: "Query vector is required for vector search".to_string(),
});
}
// Handle single vector case
if query_vectors.len() == 1 {
let arr = &query_vectors[0];
let single_vector = array_to_f32_vec(arr)?;
Ok(QueryTableRequestVector {
single_vector: Some(single_vector),
multi_vector: None,
})
} else {
// Handle multi-vector case
let multi_vector: Result<Vec<Vec<f32>>> =
query_vectors.iter().map(array_to_f32_vec).collect();
Ok(QueryTableRequestVector {
single_vector: None,
multi_vector: Some(multi_vector?),
})
}
}
/// Convert an Arrow array to a Vec<f32>.
fn array_to_f32_vec(arr: &Arc<dyn arrow_array::Array>) -> Result<Vec<f32>> {
// Handle FixedSizeList (common for vectors)
if let Some(fsl) = arr
.as_any()
.downcast_ref::<arrow_array::FixedSizeListArray>()
{
let values = fsl.values();
if let Some(f32_arr) = values.as_any().downcast_ref::<arrow_array::Float32Array>() {
return Ok(f32_arr.values().to_vec());
}
}
// Handle direct Float32Array
if let Some(f32_arr) = arr.as_any().downcast_ref::<arrow_array::Float32Array>() {
return Ok(f32_arr.values().to_vec());
}
Err(Error::InvalidInput {
message: "Query vector must be Float32 type".to_string(),
})
}
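`array_to_f32_vec` tries a sequence of downcasts and returns on the first one that succeeds, falling through to an error otherwise. The sketch below reproduces that control flow with `std::any::Any` instead of Arrow arrays. `to_f32_vec` is a hypothetical name, and `Vec<Vec<f32>>` is only an assumed stand-in for a `FixedSizeListArray` whose child values are flattened.

```rust
use std::any::Any;

/// Try each supported concrete type in turn; reject anything else.
fn to_f32_vec(value: &dyn Any) -> Result<Vec<f32>, String> {
    // Nested case (stand-in for FixedSizeList): flatten the child values.
    if let Some(nested) = value.downcast_ref::<Vec<Vec<f32>>>() {
        return Ok(nested.iter().flatten().copied().collect());
    }
    // Flat case (stand-in for Float32Array).
    if let Some(flat) = value.downcast_ref::<Vec<f32>>() {
        return Ok(flat.clone());
    }
    Err("Query vector must be Float32 type".to_string())
}
```

As in the function above, a value of any other element type (for example `f64`) is rejected rather than cast.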
/// Parse Arrow IPC response from the namespace server.
async fn parse_arrow_ipc_response(bytes: bytes::Bytes) -> Result<DatasetRecordBatchStream> {
use arrow_ipc::reader::StreamReader;
use std::io::Cursor;
let cursor = Cursor::new(bytes);
let reader = StreamReader::try_new(cursor, None).map_err(|e| Error::Runtime {
message: format!("Failed to parse Arrow IPC response: {}", e),
})?;
// Collect all record batches
let schema = reader.schema();
let batches: Vec<_> = reader
.into_iter()
.collect::<std::result::Result<Vec<_>, _>>()
.map_err(|e| Error::Runtime {
message: format!("Failed to read Arrow IPC batches: {}", e),
})?;
// Create a stream from the batches
let stream = futures::stream::iter(batches.into_iter().map(Ok));
let record_batch_stream =
Box::pin(datafusion_physical_plan::stream::RecordBatchStreamAdapter::new(schema, stream));
Ok(DatasetRecordBatchStream::new(record_batch_stream))
}
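`parse_arrow_ipc_response` collects the reader into `Result<Vec<_>, _>`, which stops at the first failing batch and buffers every batch in memory before the stream is built. A std-only sketch of that collect-and-map-error step, with the hypothetical name `collect_batches`:

```rust
use std::fmt::Display;

/// Collect fallible items, stopping at the first error and wrapping its message.
fn collect_batches<T, E, I>(items: I) -> Result<Vec<T>, String>
where
    I: IntoIterator<Item = Result<T, E>>,
    E: Display,
{
    items
        .into_iter()
        // Collecting into Result short-circuits on the first Err.
        .collect::<Result<Vec<_>, _>>()
        .map_err(|e| format!("Failed to read Arrow IPC batches: {}", e))
}
```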
#[cfg(test)]
#[allow(deprecated)]
mod tests {
use arrow_array::Float32Array;
use futures::TryStreamExt;
use std::sync::Arc;
use super::*;
use crate::query::QueryExecutionOptions;
#[test]
fn test_convert_to_namespace_query_vector() {
let query_vector = Arc::new(Float32Array::from(vec![1.0, 2.0, 3.0, 4.0]));
let vq = VectorQueryRequest {
base: QueryRequest {
limit: Some(10),
offset: Some(5),
filter: Some(QueryFilter::Sql("id > 0".to_string())),
select: Select::Columns(vec!["id".to_string()]),
..Default::default()
},
column: Some("vector".to_string()),
// We cast here to satisfy the struct definition
query_vector: vec![query_vector as Arc<dyn Array>],
minimum_nprobes: 20,
distance_type: Some(crate::DistanceType::L2),
..Default::default()
};
let any_query = AnyQuery::VectorQuery(vq);
let ns_request = convert_to_namespace_query(&any_query).unwrap();
assert_eq!(ns_request.k, 10);
assert_eq!(ns_request.offset, Some(5));
assert_eq!(ns_request.filter, Some("id > 0".to_string()));
assert_eq!(
ns_request
.columns
.as_ref()
.and_then(|c| c.column_names.as_ref()),
Some(&vec!["id".to_string()])
);
assert_eq!(ns_request.vector_column, Some("vector".to_string()));
assert_eq!(ns_request.distance_type, Some("l2".to_string()));
// Verify the vector data was extracted correctly
assert!(ns_request.vector.single_vector.is_some());
assert_eq!(
ns_request.vector.single_vector.as_ref().unwrap(),
&vec![1.0, 2.0, 3.0, 4.0]
);
}
#[test]
fn test_convert_to_namespace_query_plain_query() {
let q = QueryRequest {
limit: Some(20),
offset: Some(5),
filter: Some(QueryFilter::Sql("id > 5".to_string())),
select: Select::Columns(vec!["id".to_string()]),
with_row_id: true,
..Default::default()
};
let any_query = AnyQuery::Query(q);
let ns_request = convert_to_namespace_query(&any_query).unwrap();
assert_eq!(ns_request.k, 20);
assert_eq!(ns_request.offset, Some(5));
assert_eq!(ns_request.filter, Some("id > 5".to_string()));
assert_eq!(
ns_request
.columns
.as_ref()
.and_then(|c| c.column_names.as_ref()),
Some(&vec!["id".to_string()])
);
assert_eq!(ns_request.with_row_id, Some(true));
assert_eq!(ns_request.bypass_vector_index, Some(true));
assert!(ns_request.vector_column.is_none());
assert!(ns_request.vector.single_vector.as_ref().unwrap().is_empty());
}
#[tokio::test]
async fn test_execute_query_local_routing() {
use crate::connect;
use crate::table::query::execute_query;
use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
let conn = connect("memory://").execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(Int32Array::from(vec![1, 2, 3, 4, 5]))],
)
.unwrap();
let table = conn
.create_table("test_routing", vec![batch])
.execute()
.await
.unwrap();
let native_table = table.as_native().unwrap();
// Setup a request
let req = QueryRequest {
filter: Some(QueryFilter::Sql("id > 3".to_string())),
..Default::default()
};
let query = AnyQuery::Query(req);
// Action: Call execute_query directly
// This validates that execute_query correctly routes to the local DataFusion engine
// when table.namespace_client is None.
let stream = execute_query(native_table, &query, QueryExecutionOptions::default())
.await
.unwrap();
// Verify results
let batches = stream.try_collect::<Vec<_>>().await.unwrap();
let count: usize = batches.iter().map(|b| b.num_rows()).sum();
assert_eq!(count, 2); // 4 and 5
}
#[tokio::test]
async fn test_create_plan_multivector_structure() {
use arrow_array::{Float32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use datafusion_physical_plan::display::DisplayableExecutionPlan;
use crate::table::query::create_plan;
use crate::connect;
let conn = connect("memory://").execute().await.unwrap();
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new(
"vector",
DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), 2),
false,
),
]));
let batch = RecordBatch::new_empty(schema.clone());
let table = conn
.create_table("test_plan", vec![batch])
.execute()
.await
.unwrap();
let native_table = table.as_native().unwrap();
// This triggers the "create_multi_vector_plan" logic branch
let q1 = Arc::new(Float32Array::from(vec![1.0, 2.0]));
let q2 = Arc::new(Float32Array::from(vec![3.0, 4.0]));
let req = VectorQueryRequest {
column: Some("vector".to_string()),
query_vector: vec![q1, q2],
..Default::default()
};
let query = AnyQuery::VectorQuery(req);
// Create the Plan
let plan = create_plan(native_table, &query, QueryExecutionOptions::default())
.await
.unwrap();
// Formatting the plan lets us inspect its hierarchy
let display = DisplayableExecutionPlan::new(plan.as_ref())
.indent(true)
.to_string();
// We expect a RepartitionExec wrapping a UnionExec
assert!(
display.contains("RepartitionExec"),
"Plan should include Repartitioning"
);
assert!(
display.contains("UnionExec"),
"Plan should include a Union of multiple searches"
);
// We expect the projection to add the 'query_index' column (logic inside create_multi_vector_plan)
assert!(
display.contains("query_index"),
"Plan should add query_index column"
);
}
}

@@ -52,11 +52,12 @@ pub(crate) async fn execute_add_columns(
transforms: NewColumnTransform,
read_columns: Option<Vec<String>>,
) -> Result<AddColumnsResult> {
-let mut dataset = table.dataset.get_mut().await?;
+table.dataset.ensure_mutable()?;
+let mut dataset = (*table.dataset.get().await?).clone();
dataset.add_columns(transforms, read_columns, None).await?;
-Ok(AddColumnsResult {
-version: dataset.version().version,
-})
+let version = dataset.version().version;
+table.dataset.update(dataset);
+Ok(AddColumnsResult { version })
}
/// Internal implementation of the alter columns logic.
@@ -66,11 +67,12 @@ pub(crate) async fn execute_alter_columns(
table: &NativeTable,
alterations: &[ColumnAlteration],
) -> Result<AlterColumnsResult> {
-let mut dataset = table.dataset.get_mut().await?;
+table.dataset.ensure_mutable()?;
+let mut dataset = (*table.dataset.get().await?).clone();
dataset.alter_columns(alterations).await?;
-Ok(AlterColumnsResult {
-version: dataset.version().version,
-})
+let version = dataset.version().version;
+table.dataset.update(dataset);
+Ok(AlterColumnsResult { version })
}
/// Internal implementation of the drop columns logic.
@@ -80,16 +82,17 @@ pub(crate) async fn execute_drop_columns(
table: &NativeTable,
columns: &[&str],
) -> Result<DropColumnsResult> {
-let mut dataset = table.dataset.get_mut().await?;
+table.dataset.ensure_mutable()?;
+let mut dataset = (*table.dataset.get().await?).clone();
dataset.drop_columns(columns).await?;
-Ok(DropColumnsResult {
-version: dataset.version().version,
-})
+let version = dataset.version().version;
+table.dataset.update(dataset);
+Ok(DropColumnsResult { version })
}
#[cfg(test)]
mod tests {
-use arrow_array::{record_batch, Int32Array, RecordBatchIterator, StringArray};
+use arrow_array::{record_batch, Int32Array, StringArray};
use arrow_schema::DataType;
use futures::TryStreamExt;
use lance::dataset::ColumnAlteration;
@@ -105,13 +108,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("id", Int32, [1, 2, 3, 4, 5])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_add_columns",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_add_columns", batch)
.execute()
.await
.unwrap();
@@ -169,13 +168,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("x", Int32, [10, 20, 30])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_add_multi_columns",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_add_multi_columns", batch)
.execute()
.await
.unwrap();
@@ -205,13 +200,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("id", Int32, [1, 2, 3])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_add_const_column",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_add_const_column", batch)
.execute()
.await
.unwrap();
@@ -255,13 +246,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("old_name", Int32, [1, 2, 3])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_alter_rename",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_alter_rename", batch)
.execute()
.await
.unwrap();
@@ -304,10 +291,7 @@ mod tests {
.unwrap();
let table = conn
-.create_table(
-"test_alter_nullable",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_alter_nullable", batch)
.execute()
.await
.unwrap();
@@ -332,13 +316,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("num", Int32, [1, 2, 3])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_cast_type",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_cast_type", batch)
.execute()
.await
.unwrap();
@@ -379,13 +359,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("num", Int32, [1, 2, 3])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_invalid_cast",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_invalid_cast", batch)
.execute()
.await
.unwrap();
@@ -407,13 +383,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("a", Int32, [1, 2, 3]), ("b", Int32, [4, 5, 6])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_alter_multi",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_alter_multi", batch)
.execute()
.await
.unwrap();
@@ -441,13 +413,9 @@ mod tests {
let batch =
record_batch!(("keep", Int32, [1, 2, 3]), ("remove", Int32, [4, 5, 6])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_drop_single",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_drop_single", batch)
.execute()
.await
.unwrap();
@@ -478,13 +446,9 @@ mod tests {
("d", Int32, [7, 8])
)
.unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_drop_multi",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_drop_multi", batch)
.execute()
.await
.unwrap();
@@ -511,13 +475,9 @@ mod tests {
("extra", Int32, [10, 20, 30])
)
.unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_drop_preserves",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_drop_preserves", batch)
.execute()
.await
.unwrap();
@@ -567,13 +527,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("existing", Int32, [1, 2, 3])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_drop_nonexistent",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_drop_nonexistent", batch)
.execute()
.await
.unwrap();
@@ -593,13 +549,9 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("existing", Int32, [1, 2, 3])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_alter_nonexistent",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_alter_nonexistent", batch)
.execute()
.await
.unwrap();
@@ -623,13 +575,8 @@ mod tests {
let conn = connect("memory://").execute().await.unwrap();
let batch = record_batch!(("a", Int32, [1, 2, 3]), ("b", Int32, [4, 5, 6])).unwrap();
-let schema = batch.schema();
let table = conn
-.create_table(
-"test_version_increment",
-RecordBatchIterator::new(vec![Ok(batch)], schema),
-)
+.create_table("test_version_increment", batch)
.execute()
.await
.unwrap();
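The column operations in this file now follow a snapshot, clone, mutate, publish sequence: `get()` hands out a shared `Arc`, the write works on a private clone, and `update()` publishes the result. The sketch below illustrates that pattern with std types only. `Snapshot`, `Handle` and `add_column` are hypothetical names, and the "only replace if newer" check inside `update` is an assumption of this sketch, not something the diff confirms about `DatasetConsistencyWrapper`.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for a dataset at a particular version.
#[derive(Debug, Clone, PartialEq)]
struct Snapshot {
    version: u64,
    columns: Vec<String>,
}

/// Holds the latest snapshot; readers never block writers for long.
struct Handle {
    current: RwLock<Arc<Snapshot>>,
}

impl Handle {
    fn new(snapshot: Snapshot) -> Self {
        Self { current: RwLock::new(Arc::new(snapshot)) }
    }

    /// Cheap read: clone the Arc, not the data.
    fn get(&self) -> Arc<Snapshot> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Publish a new snapshot (assumption: ignore anything that is not newer).
    fn update(&self, snapshot: Snapshot) {
        let mut guard = self.current.write().unwrap();
        if snapshot.version > guard.version {
            *guard = Arc::new(snapshot);
        }
    }
}

/// Mutate a private clone, then publish it and report the new version.
fn add_column(handle: &Handle, name: &str) -> u64 {
    let mut working = (*handle.get()).clone();
    working.columns.push(name.to_string());
    working.version += 1;
    let version = working.version;
    handle.update(working);
    version
}
```

A reader that called `get()` before the write keeps its old snapshot unchanged, which is what allows concurrent reads and writes on one handle.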

@@ -78,11 +78,13 @@ pub(crate) async fn execute_update(
table: &NativeTable,
update: UpdateBuilder,
) -> Result<UpdateResult> {
+table.dataset.ensure_mutable()?;
// 1. Snapshot the current dataset
-let dataset = table.dataset.get().await?.clone();
+let dataset = table.dataset.get().await?;
// 2. Initialize the Lance Core builder
-let mut builder = LanceUpdateBuilder::new(Arc::new(dataset));
+let mut builder = LanceUpdateBuilder::new(dataset);
// 3. Apply the filter (WHERE clause)
if let Some(predicate) = update.filter {
@@ -99,10 +101,7 @@ pub(crate) async fn execute_update(
let res = operation.execute().await?;
// 6. Update the table's view of the latest version
-table
-.dataset
-.set_latest(res.new_dataset.as_ref().clone())
-.await;
+table.dataset.update(res.new_dataset.as_ref().clone());
Ok(UpdateResult {
rows_updated: res.rows_updated,
@@ -117,9 +116,8 @@ mod tests {
use crate::query::{ExecutableQuery, Select};
use arrow_array::{
record_batch, Array, BooleanArray, Date32Array, FixedSizeListArray, Float32Array,
-Float64Array, Int32Array, Int64Array, LargeStringArray, RecordBatch, RecordBatchIterator,
-RecordBatchReader, StringArray, TimestampMillisecondArray, TimestampNanosecondArray,
-UInt32Array,
+Float64Array, Int32Array, Int64Array, LargeStringArray, RecordBatch, StringArray,
+TimestampMillisecondArray, TimestampNanosecondArray, UInt32Array,
};
use arrow_data::ArrayDataBuilder;
use arrow_schema::{ArrowError, DataType, Field, Schema, TimeUnit};
@@ -167,51 +165,46 @@ mod tests {
),
]));
-let record_batch_iter = RecordBatchIterator::new(
-vec![RecordBatch::try_new(
-schema.clone(),
-vec![
-Arc::new(Int32Array::from_iter_values(0..10)),
-Arc::new(Int64Array::from_iter_values(0..10)),
-Arc::new(UInt32Array::from_iter_values(0..10)),
-Arc::new(StringArray::from_iter_values(vec![
-"a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
-])),
-Arc::new(LargeStringArray::from_iter_values(vec![
-"a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
-])),
-Arc::new(Float32Array::from_iter_values((0..10).map(|i| i as f32))),
-Arc::new(Float64Array::from_iter_values((0..10).map(|i| i as f64))),
-Arc::new(Into::<BooleanArray>::into(vec![
-true, false, true, false, true, false, true, false, true, false,
-])),
-Arc::new(Date32Array::from_iter_values(0..10)),
-Arc::new(TimestampNanosecondArray::from_iter_values(0..10)),
-Arc::new(TimestampMillisecondArray::from_iter_values(0..10)),
-Arc::new(
-create_fixed_size_list(
-Float32Array::from_iter_values((0..20).map(|i| i as f32)),
-2,
-)
-.unwrap(),
-),
-Arc::new(
-create_fixed_size_list(
-Float64Array::from_iter_values((0..20).map(|i| i as f64)),
-2,
-)
-.unwrap(),
-),
-],
-)
-.unwrap()]
-.into_iter()
-.map(Ok),
+let batch = RecordBatch::try_new(
schema.clone(),
-);
+vec![
+Arc::new(Int32Array::from_iter_values(0..10)),
+Arc::new(Int64Array::from_iter_values(0..10)),
+Arc::new(UInt32Array::from_iter_values(0..10)),
+Arc::new(StringArray::from_iter_values(vec![
+"a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
+])),
+Arc::new(LargeStringArray::from_iter_values(vec![
+"a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
+])),
+Arc::new(Float32Array::from_iter_values((0..10).map(|i| i as f32))),
+Arc::new(Float64Array::from_iter_values((0..10).map(|i| i as f64))),
+Arc::new(Into::<BooleanArray>::into(vec![
+true, false, true, false, true, false, true, false, true, false,
+])),
+Arc::new(Date32Array::from_iter_values(0..10)),
+Arc::new(TimestampNanosecondArray::from_iter_values(0..10)),
+Arc::new(TimestampMillisecondArray::from_iter_values(0..10)),
+Arc::new(
+create_fixed_size_list(
+Float32Array::from_iter_values((0..20).map(|i| i as f32)),
+2,
+)
+.unwrap(),
+),
+Arc::new(
+create_fixed_size_list(
+Float64Array::from_iter_values((0..20).map(|i| i as f64)),
+2,
+)
+.unwrap(),
+),
+],
+)
+.unwrap();
let table = conn
-.create_table("my_table", record_batch_iter)
+.create_table("my_table", batch)
.execute()
.await
.unwrap();
@@ -338,15 +331,13 @@ mod tests {
Ok(FixedSizeListArray::from(data))
}
-fn make_test_batches() -> impl RecordBatchReader + Send + Sync + 'static {
+fn make_test_batch() -> RecordBatch {
let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, false)]));
-RecordBatchIterator::new(
-vec![RecordBatch::try_new(
-schema.clone(),
-vec![Arc::new(Int32Array::from_iter_values(0..10))],
-)],
-schema,
+RecordBatch::try_new(
+schema.clone(),
+vec![Arc::new(Int32Array::from_iter_values(0..10))],
)
+.unwrap()
}
#[tokio::test]
@@ -367,12 +358,8 @@ mod tests {
)
.unwrap();
-let schema = batch.schema();
-// need the iterator for create table
-let record_batch_iter = RecordBatchIterator::new(vec![Ok(batch)], schema);
let table = conn
-.create_table("my_table", record_batch_iter)
+.create_table("my_table", batch)
.execute()
.await
.unwrap();
@@ -430,7 +417,7 @@ mod tests {
.await
.unwrap();
let tbl = conn
-.create_table("my_table", make_test_batches())
+.create_table("my_table", make_test_batch())
.execute()
.await
.unwrap();

Some files were not shown because too many files have changed in this diff.