Compare commits

29 Commits

Author SHA1 Message Date
Lance Release
4f7b24d1a9 Bump version: 0.25.3-beta.6 → 0.25.3 2025-11-07 04:57:55 +00:00
Lance Release
f9540724b7 Bump version: 0.25.3-beta.5 → 0.25.3-beta.6 2025-11-07 04:57:54 +00:00
Weston Pace
aeac9c7644 feat: add python Permutation class to mimic hugging face dataset and provide pytorch dataloader (#2725) 2025-11-06 16:15:33 -08:00
Mark
6ddd271627 fix: relax bytemuck and crunchy version pins (#2768)
Closes #2767
2025-11-05 14:07:35 -08:00
LanceDB Robot
f0d7520bdf chore: update lance dependency to v0.39.0 (#2766)
## Summary
- bump Lance crates to v0.39.0 with ci/set_lance_version.py and refresh
Cargo.lock
- keep namespace feature set intact while moving off git dependencies
- verified cargo clippy --workspace --tests --all-features -- -D
warnings
- ran cargo fmt --all

## References
- https://github.com/lancedb/lance/releases/tag/v0.39.0
2025-11-05 21:25:05 +08:00
Will Jones
7ef8bafd51 feat: add source to TableNotFound errors (#2765)
This will make it easier to see if there are underlying problems. We
should see the actual object store HTTP request error within the error
chain after this.
2025-11-04 15:31:45 -08:00
Lance Release
aed4a7c98e Bump version: 0.22.3-beta.4 → 0.22.3-beta.5 2025-10-31 17:08:56 +00:00
Lance Release
273ba18426 Bump version: 0.25.3-beta.4 → 0.25.3-beta.5 2025-10-31 17:07:31 +00:00
LuQQiu
8b94308cf2 feat: add fts udtf in sql (#2755)
Support FTS feature parity in SQL to match current Python API
capability.
Add `.to_json()` method to FTS query classes to enable usage with SQL
`fts()` UDTF.
Related: https://github.com/lancedb/blog-lancedb/pull/147

query = MatchQuery("puppy", "text", fuzziness=2)
result = client.execute(f"SELECT * FROM fts('table', '{query.to_json()}')")

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-31 10:06:19 -07:00
Lance Release
0b7b27481e Bump version: 0.22.3-beta.3 → 0.22.3-beta.4 2025-10-31 01:14:39 +00:00
Lance Release
e1f9b011f8 Bump version: 0.25.3-beta.3 → 0.25.3-beta.4 2025-10-31 01:13:18 +00:00
Wyatt Alt
d664b8739f chore: update lance to 0.38.3 stable (#2757) 2025-10-30 16:44:10 -07:00
S.A.N
20bec61ecb refactor(node): async generator for RecordBatchIterator (#2744)
JS native Async Generator, more efficient asynchronous iteration, fewer
synthetic promises, and the ability to handle `catch` or `break` of
parent loop in `finally` block
2025-10-30 14:36:24 -07:00
Will Jones
45255be42c ci: add agents and add reviewing instructions (#2754) 2025-10-29 17:28:26 -07:00
fzowl
93c2cf2f59 feat(voyageai): update voyage integration (#2713)
Adding multimodal usage guide
VoyageAI integration changes:
 - Adding voyage-3.5 and voyage-3.5-lite models
 - Adding voyage-context-3 model
 - Adding rerank-2.5 and rerank-2.5-lite models
2025-10-29 16:49:07 +05:30
Oz Katz
9d29c83f81 docs: remove DynamoDB commit store section (#2715)
This PR removes the section about needing the DynamoDB Commit Store.
Reasoning:

* S3 now supports [conditional
writes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/conditional-writes.html)
* Upstream lance was updated to use this capability in
https://github.com/lancedb/lance/issues/2793
* lanceDB itself was updated to include this (see @wjones127's comment
[here](https://github.com/lancedb/lancedb/issues/1614#issuecomment-2725687260))
2025-10-29 02:12:50 +08:00
Lance Release
2a6143b5bd Bump version: 0.22.3-beta.2 → 0.22.3-beta.3 2025-10-28 02:12:20 +00:00
Lance Release
b2242886e0 Bump version: 0.25.3-beta.2 → 0.25.3-beta.3 2025-10-28 02:11:17 +00:00
LuQQiu
199904ab35 chore: update lance dependency to v0.38.3-beta.11 (#2749)
## Summary

- Updated all Lance dependencies from v0.38.3-beta.9 to v0.38.3-beta.11
- Migrated `lance-namespace-impls` to use new granular cloud provider
features (`dir-aws`, `dir-gcp`, `dir-azure`, `dir-oss`) instead of
deprecated `dir` feature
- Updated namespace connection API to use `ConnectBuilder` instead of
deprecated `connect()` function

## API Changes

The Lance team refactored the `lance-namespace-impls` package in
v0.38.3-beta.11:

1. **Feature flags**: The single `dir` feature was split into cloud
provider-specific features:
   - `dir-aws` for AWS S3 support
   - `dir-gcp` for Google Cloud Storage support
   - `dir-azure` for Azure Blob Storage support
   - `dir-oss` for Alibaba Cloud OSS support

2. **Connection API**: The `connect()` function was replaced with a
`ConnectBuilder` pattern for more flexibility

## Testing

- Ran `cargo clippy --workspace --tests --all-features -- -D warnings`: no warnings
- Ran `cargo fmt --all`: code formatted
- All changes verified and committed

## Related

This update was triggered by the Lance release:
https://github.com/lancedb/lance/releases/tag/v0.38.3-beta.11

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-27 19:10:26 -07:00
Lance Release
1fa888615f Bump version: 0.22.3-beta.1 → 0.22.3-beta.2 2025-10-21 20:14:20 +00:00
Lance Release
40967f3baa Bump version: 0.25.3-beta.1 → 0.25.3-beta.2 2025-10-21 20:13:10 +00:00
Jack Ye
0bfc7de32c feat: expose storage options in table (#2736)
Pending https://github.com/lancedb/lance/pull/5016
2025-10-21 16:10:40 -04:00
LanceDB Robot
d43880a585 ci: polish codex prompt for better behavior (#2739) 2025-10-22 03:49:25 +08:00
LanceDB Robot
59a886958b ci: make sure GH_TOKEN included in codex env (#2738) 2025-10-21 17:51:41 +08:00
github-actions[bot]
c36f6746d1 chore: update lance dependency to v0.38.3-beta.8 (#2737)
## Summary
- bump Lance dependencies to v0.38.3-beta.8
- ran `cargo clippy --workspace --tests --all-features -- -D warnings`
- ran `cargo fmt --all`

## Links
- https://github.com/lancedb/lance/releases/tag/v0.38.3-beta.8

Co-authored-by: lancedb automation <robot@lancedb.com>
2025-10-21 17:29:08 +08:00
LanceDB Robot
25ce6d311f ci: add instruct for codex to use gh with token (#2734) 2025-10-21 17:12:15 +08:00
github-actions[bot]
92a4e46f9f chore: update lance dependency to v0.38.3-beta.7 (#2735)
## Summary
- bump Lance dependencies to v0.38.3-beta.7
- ran cargo clippy --workspace --tests --all-features -- -D warnings
- ran cargo fmt --all

Triggered by tag
[v0.38.3-beta.7](https://github.com/lancedb/lance/releases/tag/v0.38.3-beta.7).

---------

Co-authored-by: LanceDB Robot <robot@lancedb.com>
2025-10-21 17:04:57 +08:00
LanceDB Robot
845641c480 ci: use robot token instead of github's own token (#2732) 2025-10-21 02:38:14 +08:00
Lance Release
d96404c635 Bump version: 0.22.3-beta.0 → 0.22.3-beta.1 2025-10-19 23:41:46 +00:00
82 changed files with 5036 additions and 603 deletions

@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.22.3-beta.0"
current_version = "0.22.3-beta.5"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.
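
Reviewer note on the `parse` pattern above: inside the TOML multi-line string the backslashes are doubled (`\\d`, `\\.`), so the regex engine receives `\d` and `\.`. The following is a minimal sketch, assuming only the two groups visible in this hunk; the rest of the pattern is truncated in this view and is deliberately not reconstructed. The name `PARTIAL_PARSE` is made up for illustration.

```python
import re

# Only the first two groups shown in the hunk; the remainder of the real
# pattern is truncated in this view and intentionally left out.
PARTIAL_PARSE = re.compile(
    r"""(?x)
    (?P<major>0|[1-9]\d*)\.
    (?P<minor>0|[1-9]\d*)\.
    """
)

# Match the leading "major.minor." part of the new current_version value.
match = PARTIAL_PARSE.match("0.22.3-beta.5")
assert match is not None
print(match.group("major"), match.group("minor"))
```

The `0|[1-9]\d*` alternation rejects components with a leading zero, so a string such as `01.2.3` would not match this fragment.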

@@ -61,7 +61,7 @@ jobs:
- name: Configure git user
run: |
git config user.name "lancedb automation"
git config user.email "automation@lancedb.com"
git config user.email "robot@lancedb.com"
- name: Configure Codex authentication
env:
@@ -77,8 +77,8 @@ jobs:
- name: Run Codex to update Lance dependency
env:
TAG: ${{ inputs.tag }}
GITHUB_TOKEN: ${{ github.token }}
GH_TOKEN: ${{ github.token }}
GITHUB_TOKEN: ${{ secrets.ROBOT_TOKEN }}
GH_TOKEN: ${{ secrets.ROBOT_TOKEN }}
run: |
set -euo pipefail
VERSION="${TAG#refs/tags/}"
@@ -88,19 +88,20 @@ jobs:
You are running inside the lancedb repository on a GitHub Actions runner. Update the Lance dependency to version ${VERSION} and prepare a pull request for maintainers to review.
Follow these steps exactly:
1. Use script `ci/set_lance_version.py` to update Lance dependencies. The script already refreshes Cargo metadata, so allow it to finish even if it takes time.
1. Use script "ci/set_lance_version.py" to update Lance dependencies. The script already refreshes Cargo metadata, so allow it to finish even if it takes time.
2. Run "cargo clippy --workspace --tests --all-features -- -D warnings". If diagnostics appear, fix them yourself and rerun clippy until it exits cleanly. Do not skip any warnings.
3. After clippy succeeds, run "cargo fmt --all" to format the workspace.
4. Ensure the repository is clean except for intentional changes. Inspect "git status --short" and "git diff" to confirm the dependency update and any required fixes.
5. Create and switch to a new branch named "${BRANCH_NAME}" (replace any duplicated hyphens if necessary).
6. Stage all relevant files with "git add -A". Commit using the message "chore: update lance dependency to v${VERSION}".
7. Push the branch to origin. If the branch already exists, force-push your changes.
8. Create a pull request targeting "main" with title "chore: update lance dependency to v${VERSION}". In the body, summarize the dependency bump, clippy/fmt verification, and link the triggering tag (${TAG}). Use the GitHub CLI if helpful.
9. After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
8. env "GH_TOKEN" is available, use "gh" tools for github related operations like creating pull request.
9. Create a pull request targeting "main" with title "chore: update lance dependency to v${VERSION}". In the body, summarize the dependency bump, clippy/fmt verification, and link the triggering tag (${TAG}).
10. After creating the PR, display the PR URL, "git status --short", and a concise summary of the commands run and their results.
Constraints:
- Use bash commands; avoid modifying GitHub workflow files other than through the scripted task above.
- Do not merge the PR.
- If any command fails, diagnose and fix the issue instead of aborting.
EOF
codex exec --dangerously-bypass-approvals-and-sandbox "$(cat /tmp/codex-prompt.txt)"
codex --config shell_environment_policy.ignore_default_excludes=true exec --dangerously-bypass-approvals-and-sandbox "$(cat /tmp/codex-prompt.txt)"

101
AGENTS.md Normal file

@@ -0,0 +1,101 @@
LanceDB is a database designed for retrieval, including vector, full-text, and hybrid search.
It is a wrapper around Lance. There are two backends: local (in-process like SQLite) and
remote (against LanceDB Cloud).
The core of LanceDB is written in Rust. There are bindings in Python, Typescript, and Java.
Project layout:
* `rust/lancedb`: The LanceDB core Rust implementation.
* `python`: The Python bindings, using PyO3.
* `nodejs`: The Typescript bindings, using napi-rs
* `java`: The Java bindings
Common commands:
* Check for compiler errors: `cargo check --quiet --features remote --tests --examples`
* Run tests: `cargo test --quiet --features remote --tests`
* Run specific test: `cargo test --quiet --features remote -p <package_name> --test <test_name>`
* Lint: `cargo clippy --quiet --features remote --tests --examples`
* Format: `cargo fmt --all`
Before committing changes, run formatting.
## Coding tips
* When writing Rust doctests for things that require a connection or table reference,
write them as a function instead of a fully executable test. This allows type checking
to run but avoids needing a full test environment. For example:
```rust
/// ```
/// use lance_index::scalar::FullTextSearchQuery;
/// use lancedb::query::{QueryBase, ExecutableQuery};
///
/// # use lancedb::Table;
/// # async fn query(table: &Table) -> Result<(), Box<dyn std::error::Error>> {
/// let results = table.query()
///     .full_text_search(FullTextSearchQuery::new("hello world".into()))
///     .execute()
///     .await?;
/// # Ok(())
/// # }
/// ```
```
## Example plan: adding a new method on Table
Adding a new method involves first adding it to the Rust core, then exposing it
in the Python and TypeScript bindings. There are both local and remote tables.
Remote tables are implemented via a HTTP API and require the `remote` cargo
feature flag to be enabled. Python has both sync and async methods.
Rust core changes:
1. Add method on `Table` struct in `rust/lancedb/src/table.rs` (calls `BaseTable` trait).
2. Add method to `BaseTable` trait in `rust/lancedb/src/table.rs`.
3. Implement new trait method on `NativeTable` in `rust/lancedb/src/table.rs`.
* Test with unit test in `rust/lancedb/src/table.rs`.
4. Implement new trait method on `RemoteTable` in `rust/lancedb/src/remote/table.rs`.
* Test with unit test in `rust/lancedb/src/remote/table.rs` against mocked endpoint.
Python bindings changes:
1. Add PyO3 method binding in `python/src/table.rs`. Run `make develop` to compile bindings.
2. Add types for PyO3 method in `python/python/lancedb/_lancedb.pyi`.
3. Add method to `AsyncTable` class in `python/python/lancedb/table.py`.
4. Add abstract method to `Table` abstract base class in `python/python/lancedb/table.py`.
5. Add concrete sync method to `LanceTable` class in `python/python/lancedb/table.py`.
* Should use `LOOP.run()` to call the corresponding `AsyncTable` method.
6. Add concrete sync method to `RemoteTable` class in `python/python/lancedb/remote/table.py`.
7. Add unit test in `python/tests/test_table.py`.
TypeScript bindings changes:
1. Add napi-rs method binding on `Table` in `nodejs/src/table.rs`.
2. Run `npm run build` to generate TypeScript definitions.
3. Add typescript method on abstract class `Table` in `nodejs/src/table.ts`.
4. Add concrete method on `LocalTable` class in `nodejs/src/native_table.ts`.
* Note: despite the name, this class is also used for remote tables.
5. Add test in `nodejs/__test__/table.test.ts`.
6. Run `npm run docs` to generate TypeScript documentation.
## Review Guidelines
Please consider the following when reviewing code contributions.
### Rust API design
* Design public APIs so they can be evolved easily in the future without breaking
changes. Often this means using builder patterns or options structs instead of
long argument lists.
* For public APIs, prefer inputs that use `Into<T>` or `AsRef<T>` traits to allow
more flexible inputs. For example, use `name: Into<String>` instead of `name: String`,
so we don't have to write `func("my_string".to_string())`.
### Testing
* Ensure all new public APIs have documentation and examples.
* Ensure that all bugfixes and features have corresponding tests. **We do not merge
code without tests.**
### Documentation
* New features must include updates to the rust documentation comments. Link to
relevant structs and methods to increase the value of documentation.

@@ -1,80 +0,0 @@
LanceDB is a database designed for retrieval, including vector, full-text, and hybrid search.
It is a wrapper around Lance. There are two backends: local (in-process like SQLite) and
remote (against LanceDB Cloud).
The core of LanceDB is written in Rust. There are bindings in Python, Typescript, and Java.
Project layout:
* `rust/lancedb`: The LanceDB core Rust implementation.
* `python`: The Python bindings, using PyO3.
* `nodejs`: The Typescript bindings, using napi-rs
* `java`: The Java bindings
Common commands:
* Check for compiler errors: `cargo check --quiet --features remote --tests --examples`
* Run tests: `cargo test --quiet --features remote --tests`
* Run specific test: `cargo test --quiet --features remote -p <package_name> --test <test_name>`
* Lint: `cargo clippy --quiet --features remote --tests --examples`
* Format: `cargo fmt --all`
Before committing changes, run formatting.
## Coding tips
* When writing Rust doctests for things that require a connection or table reference,
write them as a function instead of a fully executable test. This allows type checking
to run but avoids needing a full test environment. For example:
```rust
/// ```
/// use lance_index::scalar::FullTextSearchQuery;
/// use lancedb::query::{QueryBase, ExecutableQuery};
///
/// # use lancedb::Table;
/// # async fn query(table: &Table) -> Result<(), Box<dyn std::error::Error>> {
/// let results = table.query()
///     .full_text_search(FullTextSearchQuery::new("hello world".into()))
///     .execute()
///     .await?;
/// # Ok(())
/// # }
/// ```
```
## Example plan: adding a new method on Table
Adding a new method involves first adding it to the Rust core, then exposing it
in the Python and TypeScript bindings. There are both local and remote tables.
Remote tables are implemented via a HTTP API and require the `remote` cargo
feature flag to be enabled. Python has both sync and async methods.
Rust core changes:
1. Add method on `Table` struct in `rust/lancedb/src/table.rs` (calls `BaseTable` trait).
2. Add method to `BaseTable` trait in `rust/lancedb/src/table.rs`.
3. Implement new trait method on `NativeTable` in `rust/lancedb/src/table.rs`.
* Test with unit test in `rust/lancedb/src/table.rs`.
4. Implement new trait method on `RemoteTable` in `rust/lancedb/src/remote/table.rs`.
* Test with unit test in `rust/lancedb/src/remote/table.rs` against mocked endpoint.
Python bindings changes:
1. Add PyO3 method binding in `python/src/table.rs`. Run `make develop` to compile bindings.
2. Add types for PyO3 method in `python/python/lancedb/_lancedb.pyi`.
3. Add method to `AsyncTable` class in `python/python/lancedb/table.py`.
4. Add abstract method to `Table` abstract base class in `python/python/lancedb/table.py`.
5. Add concrete sync method to `LanceTable` class in `python/python/lancedb/table.py`.
* Should use `LOOP.run()` to call the corresponding `AsyncTable` method.
6. Add concrete sync method to `RemoteTable` class in `python/python/lancedb/remote/table.py`.
7. Add unit test in `python/tests/test_table.py`.
TypeScript bindings changes:
1. Add napi-rs method binding on `Table` in `nodejs/src/table.rs`.
2. Run `npm run build` to generate TypeScript definitions.
3. Add typescript method on abstract class `Table` in `nodejs/src/table.ts`.
4. Add concrete method on `LocalTable` class in `nodejs/src/native_table.ts`.
* Note: despite the name, this class is also used for remote tables.
5. Add test in `nodejs/__test__/table.test.ts`.
6. Run `npm run docs` to generate TypeScript documentation.

1
CLAUDE.md Symbolic link

@@ -0,0 +1 @@
AGENTS.md

132
Cargo.lock generated

@@ -1139,7 +1139,7 @@ dependencies = [
"bitflags 2.9.4",
"cexpr",
"clang-sys",
"itertools 0.11.0",
"itertools 0.12.1",
"lazy_static",
"lazycell",
"log",
@@ -2933,18 +2933,6 @@ version = "0.2.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f8eb564c5c7423d25c886fb561d1e4ee69f72354d16918afa32c08811f6b6a55"
[[package]]
name = "fastbloom"
version = "0.14.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "18c1ddb9231d8554c2d6bdf4cfaabf0c59251658c68b6c95cd52dd0c513a912a"
dependencies = [
"getrandom 0.3.3",
"libm",
"rand 0.9.2",
"siphasher",
]
[[package]]
name = "fastdivide"
version = "0.4.2"
@@ -3044,8 +3032,9 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
[[package]]
name = "fsst"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1d2475ce218217196b161b025598f77e2b405d5e729f7c37bfff145f5df00a41"
dependencies = [
"arrow-array",
"rand 0.9.2",
@@ -4229,8 +4218,9 @@ dependencies = [
[[package]]
name = "lance"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a2f0ca022d0424d991933a62d2898864cf5621873962bd84e65e7d1f023f9c36"
dependencies = [
"arrow",
"arrow-arith",
@@ -4280,6 +4270,7 @@ dependencies = [
"prost-types",
"rand 0.9.2",
"roaring",
"semver",
"serde",
"serde_json",
"snafu",
@@ -4293,8 +4284,9 @@ dependencies = [
[[package]]
name = "lance-arrow"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7552f8d528775bf0ab21e1f75dcb70bdb2a828eeae58024a803b5a4655fd9a11"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4312,8 +4304,9 @@ dependencies = [
[[package]]
name = "lance-bitpacking"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a2ea14583cc6fa0bb190bcc2d3bc364b0aa545b345702976025f810e4740e8ce"
dependencies = [
"arrayref",
"paste",
@@ -4322,8 +4315,9 @@ dependencies = [
[[package]]
name = "lance-core"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "69c752dedd207384892006c40930f898d6634e05e3d489e89763abfe4b9307e7"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4359,8 +4353,9 @@ dependencies = [
[[package]]
name = "lance-datafusion"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "21e1e98ca6e5cd337bdda2d9fb66063f295c0c2852d2bc6831366fea833ee608"
dependencies = [
"arrow",
"arrow-array",
@@ -4389,8 +4384,9 @@ dependencies = [
[[package]]
name = "lance-datagen"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "483c643fc2806ed1a2766edf4d180511bbd1d549bcc60373e33f4785c6185891"
dependencies = [
"arrow",
"arrow-array",
@@ -4407,8 +4403,9 @@ dependencies = [
[[package]]
name = "lance-encoding"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a199d1fa3487529c5ffc433fbd1721231330b9350c2ff9b0c7b7dbdb98f0806a"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4445,8 +4442,9 @@ dependencies = [
[[package]]
name = "lance-file"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b57def2279465232cf5a8cd996300c632442e368745768bbed661c7f0a35334b"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4471,7 +4469,6 @@ dependencies = [
"prost",
"prost-build",
"prost-types",
"roaring",
"snafu",
"tokio",
"tracing",
@@ -4479,8 +4476,9 @@ dependencies = [
[[package]]
name = "lance-index"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "a75938c61e986aef8c615dc44c92e4c19e393160a59e2b57402ccfe08c5e63af"
dependencies = [
"arrow",
"arrow-arith",
@@ -4502,7 +4500,6 @@ dependencies = [
"datafusion-sql",
"deepsize",
"dirs",
"fastbloom",
"fst",
"futures",
"half",
@@ -4542,8 +4539,9 @@ dependencies = [
[[package]]
name = "lance-io"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fa6c3b5b28570d6c951206c5b043f1b35c936928af14fca6f2ac25b0097e4c32"
dependencies = [
"arrow",
"arrow-arith",
@@ -4583,32 +4581,27 @@ dependencies = [
[[package]]
name = "lance-linalg"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b3cbc7e85a89ff9cb3a4627559dea3fd1c1fb16c0d8bc46ede75eefef51eec06"
dependencies = [
"arrow-array",
"arrow-buffer",
"arrow-ord",
"arrow-schema",
"bitvec",
"cc",
"deepsize",
"futures",
"half",
"lance-arrow",
"lance-core",
"log",
"num-traits",
"rand 0.9.2",
"rayon",
"tokio",
"tracing",
]
[[package]]
name = "lance-namespace"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "897dd6726816515bb70a698ce7cda44670dca5761637696d7905b45f405a8cd9"
dependencies = [
"arrow",
"async-trait",
@@ -4620,8 +4613,9 @@ dependencies = [
[[package]]
name = "lance-namespace-impls"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "5e3cfcd3ba369de2719abf6fb6233f69cda639eb5cbcb328487a790e745ab988"
dependencies = [
"arrow",
"arrow-ipc",
@@ -4630,8 +4624,9 @@ dependencies = [
"bytes",
"lance",
"lance-core",
"lance-io",
"lance-namespace",
"opendal",
"object_store",
"reqwest",
"serde_json",
"snafu",
@@ -4653,8 +4648,9 @@ dependencies = [
[[package]]
name = "lance-table"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c8facc13760ba034b6c38767b16adba85e44cbcbea8124dc0c63c43865c60630"
dependencies = [
"arrow",
"arrow-array",
@@ -4681,6 +4677,7 @@ dependencies = [
"rand 0.9.2",
"rangemap",
"roaring",
"semver",
"serde",
"serde_json",
"snafu",
@@ -4692,8 +4689,9 @@ dependencies = [
[[package]]
name = "lance-testing"
version = "0.38.3-beta.6"
source = "git+https://github.com/lancedb/lance.git?tag=v0.38.3-beta.6#affff28b7a9ae6d60b1dbb40103e91f3574c3555"
version = "0.39.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b05052ef86188d6ae6339bdd9f2c5d77190e8ad1158f3dc8a42fa91bde9e5246"
dependencies = [
"arrow-array",
"arrow-schema",
@@ -4704,7 +4702,7 @@ dependencies = [
[[package]]
name = "lancedb"
version = "0.22.3-beta.0"
version = "0.22.3-beta.5"
dependencies = [
"ahash",
"anyhow",
@@ -4724,13 +4722,11 @@ dependencies = [
"aws-sdk-kms",
"aws-sdk-s3",
"aws-smithy-runtime",
"bytemuck_derive",
"bytes",
"candle-core",
"candle-nn",
"candle-transformers",
"chrono",
"crunchy",
"datafusion",
"datafusion-catalog",
"datafusion-common",
@@ -4743,6 +4739,7 @@ dependencies = [
"http 1.3.1",
"http-body 1.0.1",
"lance",
"lance-arrow",
"lance-core",
"lance-datafusion",
"lance-datagen",
@@ -4800,7 +4797,7 @@ dependencies = [
[[package]]
name = "lancedb-nodejs"
version = "0.22.3-beta.0"
version = "0.22.3-beta.5"
dependencies = [
"arrow-array",
"arrow-ipc",
@@ -4820,7 +4817,7 @@ dependencies = [
[[package]]
name = "lancedb-python"
version = "0.25.3-beta.0"
version = "0.25.3-beta.5"
dependencies = [
"arrow",
"async-trait",
@@ -5180,12 +5177,9 @@ dependencies = [
[[package]]
name = "mock_instant"
version = "0.3.2"
version = "0.6.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9366861eb2a2c436c20b12c8dbec5f798cea6b47ad99216be0282942e2c81ea0"
dependencies = [
"once_cell",
]
checksum = "dce6dd36094cac388f119d2e9dc82dc730ef91c32a6222170d630e5414b956e6"
[[package]]
name = "moka"
@@ -6415,8 +6409,8 @@ version = "0.13.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "be769465445e8c1474e9c5dac2018218498557af32d9ed057325ec9a41ae81bf"
dependencies = [
"heck 0.4.1",
"itertools 0.11.0",
"heck 0.5.0",
"itertools 0.14.0",
"log",
"multimap",
"once_cell",
@@ -6436,7 +6430,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8a56d757972c98b346a9b766e3f02746cde6dd1cd1d1d563472929fdd74bec4d"
dependencies = [
"anyhow",
"itertools 0.11.0",
"itertools 0.14.0",
"proc-macro2",
"quote",
"syn 2.0.106",
@@ -7744,7 +7738,7 @@ version = "0.8.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c1c97747dbf44bb1ca44a561ece23508e99cb592e862f22222dcf42f51d1e451"
dependencies = [
"heck 0.4.1",
"heck 0.5.0",
"proc-macro2",
"quote",
"syn 2.0.106",

@@ -15,19 +15,20 @@ categories = ["database-implementations"]
rust-version = "1.78.0"
[workspace.dependencies]
lance = { "version" = "=0.38.3-beta.6", default-features = false, "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-core = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-datagen = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-file = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-io = { "version" = "=0.38.3-beta.6", default-features = false, "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-index = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-linalg = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-namespace = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-namespace-impls = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-table = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-testing = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-datafusion = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance-encoding = { "version" = "=0.38.3-beta.6", "tag" = "v0.38.3-beta.6", "git" = "https://github.com/lancedb/lance.git" }
lance = { "version" = "=0.39.0", default-features = false }
lance-core = "=0.39.0"
lance-datagen = "=0.39.0"
lance-file = "=0.39.0"
lance-io = { "version" = "=0.39.0", default-features = false }
lance-index = "=0.39.0"
lance-linalg = "=0.39.0"
lance-namespace = "=0.39.0"
lance-namespace-impls = { "version" = "=0.39.0", "features" = ["dir-aws", "dir-gcp", "dir-azure", "dir-oss", "rest"] }
lance-table = "=0.39.0"
lance-testing = "=0.39.0"
lance-datafusion = "=0.39.0"
lance-encoding = "=0.39.0"
lance-arrow = "=0.39.0"
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "56.2", optional = false }
@@ -61,7 +62,4 @@ num-traits = "0.2"
regex = "1.10"
lazy_static = "1"
semver = "1.0.25"
crunchy = "0.2.4"
chrono = "0.4"
# Workaround for: https://github.com/Lokathor/bytemuck/issues/306
bytemuck_derive = ">=1.8.1, <1.9.0"
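
Reviewer note: the hunk above moves every Lance crate from a git tag to an exact registry pin (`=0.39.0`). A quick way to confirm that all pins agree is sketched below. This is an illustration only, not part of the change: the `PIN` pattern and the `lance_pins` helper are hypothetical names, and the input is a shortened excerpt of the new lines.

```python
import re

# Excerpt of the new [workspace.dependencies] lines from the hunk above.
WORKSPACE_DEPS = """
lance = { "version" = "=0.39.0", default-features = false }
lance-core = "=0.39.0"
lance-io = { "version" = "=0.39.0", default-features = false }
lance-namespace-impls = { "version" = "=0.39.0", "features" = ["dir-aws", "dir-gcp", "dir-azure", "dir-oss", "rest"] }
lance-arrow = "=0.39.0"
ahash = "0.8"
"""

# Hypothetical pattern: a crate named lance or lance-*, pinned with "=X.Y.Z",
# either as a bare string or inside an inline table's "version" key.
PIN = re.compile(
    r'^(lance(?:-[a-z]+)*)\s*=\s*(?:\{.*?"version"\s*=\s*)?"=([^"]+)"',
    re.MULTILINE,
)


def lance_pins(text: str) -> dict:
    """Map each exactly pinned lance crate to its version string."""
    return dict(PIN.findall(text))


pins = lance_pins(WORKSPACE_DEPS)
versions = set(pins.values())
print(pins)
print("consistent" if len(versions) == 1 else f"mismatch: {versions}")
```

Unpinned entries such as `ahash = "0.8"` are ignored because they neither start with `lance` nor use the `=` exact-version prefix.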

@@ -55,7 +55,7 @@ def extract_features(line: str) -> list:
match = re.search(r'"features"\s*=\s*\[\s*(.*?)\s*\]', line, re.DOTALL)
if match:
features_str = match.group(1)
return [f.strip('"') for f in features_str.split(",") if len(f) > 0]
return [f.strip().strip('"') for f in features_str.split(",") if f.strip()]
return []
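
Reviewer note: the one-line change above matters once a `features` list has more than one entry, because splitting on `,` leaves a leading space on every entry after the first. The sketch below reproduces both list comprehensions from the hunk side by side; the function names `extract_features_old` and `extract_features_new` and the shared `FEATURES_RE` constant are introduced here for comparison only and do not exist in the script.

```python
import re

# Same pattern as in the hunk, compiled once for both variants.
FEATURES_RE = re.compile(r'"features"\s*=\s*\[\s*(.*?)\s*\]', re.DOTALL)


def extract_features_old(line: str) -> list:
    """Behaviour before the change: strips quotes only."""
    match = FEATURES_RE.search(line)
    if match:
        return [f.strip('"') for f in match.group(1).split(",") if len(f) > 0]
    return []


def extract_features_new(line: str) -> list:
    """Behaviour after the change: strips whitespace first, then quotes."""
    match = FEATURES_RE.search(line)
    if match:
        return [f.strip().strip('"') for f in match.group(1).split(",") if f.strip()]
    return []


line = 'lance-namespace-impls = { "version" = "=0.39.0", "features" = ["dir-aws", "dir-gcp", "rest"] }'
print(extract_features_old(line))
print(extract_features_new(line))
```

With the old expression, `str.strip('"')` cannot remove the opening quote of the second and later entries because a space precedes it, so those entries keep both the space and the quote.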

@@ -0,0 +1,97 @@
# VoyageAI Embeddings : Multimodal
VoyageAI embeddings can also be used to embed both text and image data. Only some of the models support image data; the list is available
at [https://docs.voyageai.com/docs/multimodal-embeddings](https://docs.voyageai.com/docs/multimodal-embeddings)
Supported parameters (to be passed in `create` method) are:
| Parameter | Type | Default Value | Description |
|---|---|-------------------------|-------------------------------------------|
| `name` | `str` | `"voyage-multimodal-3"` | The model ID of the VoyageAI model to use |
Usage Example:
```python
import base64
import os
from io import BytesIO
import requests
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
import pandas as pd
os.environ['VOYAGE_API_KEY'] = 'YOUR_VOYAGE_API_KEY'
db = lancedb.connect(".lancedb")
func = get_registry().get("voyageai").create(name="voyage-multimodal-3")
def image_to_base64(image_bytes: bytes):
    buffered = BytesIO(image_bytes)
    img_str = base64.b64encode(buffered.getvalue())
    return img_str.decode("utf-8")
class Images(LanceModel):
    label: str
    image_uri: str = func.SourceField()  # image uri as the source
    image_bytes: str = func.SourceField()  # image bytes base64 encoded as the source
    vector: Vector(func.ndims()) = func.VectorField()  # vector column
    vec_from_bytes: Vector(func.ndims()) = func.VectorField()  # Another vector column
if "images" in db.table_names():
    db.drop_table("images")
table = db.create_table("images", schema=Images)
labels = ["cat", "cat", "dog", "dog", "horse", "horse"]
uris = [
"http://farm1.staticflickr.com/53/167798175_7c7845bbbd_z.jpg",
"http://farm1.staticflickr.com/134/332220238_da527d8140_z.jpg",
"http://farm9.staticflickr.com/8387/8602747737_2e5c2a45d4_z.jpg",
"http://farm5.staticflickr.com/4092/5017326486_1f46057f5f_z.jpg",
"http://farm9.staticflickr.com/8216/8434969557_d37882c42d_z.jpg",
"http://farm6.staticflickr.com/5142/5835678453_4f3a4edb45_z.jpg",
]
# get each uri as bytes
images_bytes = [image_to_base64(requests.get(uri).content) for uri in uris]
table.add(
pd.DataFrame({"label": labels, "image_uri": uris, "image_bytes": images_bytes})
)
```
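As a quick sanity check on the `image_to_base64` helper, the sketch below shows that the encoded string round-trips back to the original bytes. The payload is made up for illustration; a real call would pass the downloaded image content instead.

```python
import base64
from io import BytesIO


def image_to_base64(image_bytes: bytes) -> str:
    # Same logic as the helper in the usage example: bytes -> base64 -> UTF-8 text.
    buffered = BytesIO(image_bytes)
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


payload = b"\x89PNG\r\n\x1a\n illustrative bytes, not a real image"
encoded = image_to_base64(payload)
decoded = base64.b64decode(encoded)
print(decoded == payload)
```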
Now we can search using text. Both queries here target the custom `vec_from_bytes` vector column, first by passing the column name positionally and then via the `vector_column_name` keyword:
```python
# text search
actual = table.search("man's best friend", "vec_from_bytes").limit(1).to_pydantic(Images)[0]
print(actual.label) # prints "dog"
frombytes = (
table.search("man's best friend", vector_column_name="vec_from_bytes")
.limit(1)
.to_pydantic(Images)[0]
)
print(frombytes.label)
```
Because we're using a multi-modal embedding function, we can also search using images
```python
# image search
from PIL import Image  # Pillow, needed to open the downloaded bytes as an image
query_image_uri = "http://farm1.staticflickr.com/200/467715466_ed4a31801f_z.jpg"
image_bytes = requests.get(query_image_uri).content
query_image = Image.open(BytesIO(image_bytes))
actual = table.search(query_image, "vec_from_bytes").limit(1).to_pydantic(Images)[0]
print(actual.label == "dog")
# image search using a custom vector column
other = (
    table.search(query_image, vector_column_name="vec_from_bytes")
    .limit(1)
    .to_pydantic(Images)[0]
)
print(other.label)
```


@@ -397,117 +397,6 @@ For **read-only access**, LanceDB will need a policy such as:
}
```
#### DynamoDB Commit Store for concurrent writes
By default, S3 does not support concurrent writes. Having two or more processes
writing to the same table at the same time can lead to data corruption. This is
because S3, unlike other object stores, does not have any atomic put or copy
operation.
To enable concurrent writes, you can configure LanceDB to use a DynamoDB table
as a commit store. This table will be used to coordinate writes between
different processes. To enable this feature, you must modify your connection
URI to use the `s3+ddb` scheme and add a query parameter `ddbTableName` with the
name of the table to use.
=== "Python"
=== "Sync API"
```python
import lancedb
db = lancedb.connect(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
)
```
=== "Async API"
```python
import lancedb
async_db = await lancedb.connect_async(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
)
```
=== "JavaScript"
```javascript
const lancedb = require("lancedb");
const db = await lancedb.connect(
"s3+ddb://bucket/path?ddbTableName=my-dynamodb-table",
);
```
The DynamoDB table must be created with the following schema:
- Hash key: `base_uri` (string)
- Range key: `version` (number)
You can create this programmatically with:
=== "Python"
<!-- skip-test -->
```python
import boto3
dynamodb = boto3.client("dynamodb")
table = dynamodb.create_table(
TableName=table_name,
KeySchema=[
{"AttributeName": "base_uri", "KeyType": "HASH"},
{"AttributeName": "version", "KeyType": "RANGE"},
],
AttributeDefinitions=[
{"AttributeName": "base_uri", "AttributeType": "S"},
{"AttributeName": "version", "AttributeType": "N"},
],
ProvisionedThroughput={"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
)
```
=== "JavaScript"
<!-- skip-test -->
```javascript
import {
CreateTableCommand,
DynamoDBClient,
} from "@aws-sdk/client-dynamodb";
const dynamodb = new DynamoDBClient({
region: CONFIG.awsRegion,
credentials: {
accessKeyId: CONFIG.awsAccessKeyId,
secretAccessKey: CONFIG.awsSecretAccessKey,
},
endpoint: CONFIG.awsEndpoint,
});
const command = new CreateTableCommand({
TableName: table_name,
AttributeDefinitions: [
{
AttributeName: "base_uri",
AttributeType: "S",
},
{
AttributeName: "version",
AttributeType: "N",
},
],
KeySchema: [
{ AttributeName: "base_uri", KeyType: "HASH" },
{ AttributeName: "version", KeyType: "RANGE" },
],
ProvisionedThroughput: {
ReadCapacityUnits: 1,
WriteCapacityUnits: 1,
},
});
await dynamodb.send(command);
```
#### S3-compatible stores


@@ -64,6 +64,36 @@ builder.filter("age > 18 AND status = 'active'");
***
### persist()
```ts
persist(connection, tableName): PermutationBuilder
```
Configure the permutation to be persisted.
#### Parameters
* **connection**: [`Connection`](Connection.md)
The connection to persist the permutation to
* **tableName**: `string`
The name of the table to create
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
builder.persist(connection, "permutation_table");
```
***
### shuffle()
```ts
@@ -98,15 +128,15 @@ builder.shuffle({ seed: 42, clumpSize: 10 });
### splitCalculated()
```ts
splitCalculated(calculation): PermutationBuilder
splitCalculated(options): PermutationBuilder
```
Configure calculated splits for the permutation.
#### Parameters
* **calculation**: `string`
SQL expression for calculating splits
* **options**: [`SplitCalculatedOptions`](../interfaces/SplitCalculatedOptions.md)
Configuration for calculated splitting
#### Returns


@@ -80,7 +80,7 @@ AnalyzeExec verbose=true, metrics=[]
### execute()
```ts
protected execute(options?): RecordBatchIterator
protected execute(options?): AsyncGenerator<RecordBatch<any>, void, unknown>
```
Execute the query and return the results as an
@@ -91,7 +91,7 @@ Execute the query and return the results as an
#### Returns
[`RecordBatchIterator`](RecordBatchIterator.md)
`AsyncGenerator`&lt;`RecordBatch`&lt;`any`&gt;, `void`, `unknown`&gt;
#### See


@@ -81,7 +81,7 @@ AnalyzeExec verbose=true, metrics=[]
### execute()
```ts
protected execute(options?): RecordBatchIterator
protected execute(options?): AsyncGenerator<RecordBatch<any>, void, unknown>
```
Execute the query and return the results as an
@@ -92,7 +92,7 @@ Execute the query and return the results as an
#### Returns
[`RecordBatchIterator`](RecordBatchIterator.md)
`AsyncGenerator`&lt;`RecordBatch`&lt;`any`&gt;, `void`, `unknown`&gt;
#### See


@@ -1,43 +0,0 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / RecordBatchIterator
# Class: RecordBatchIterator
## Implements
- `AsyncIterator`&lt;`RecordBatch`&gt;
## Constructors
### new RecordBatchIterator()
```ts
new RecordBatchIterator(promise?): RecordBatchIterator
```
#### Parameters
* **promise?**: `Promise`&lt;`RecordBatchIterator`&gt;
#### Returns
[`RecordBatchIterator`](RecordBatchIterator.md)
## Methods
### next()
```ts
next(): Promise<IteratorResult<RecordBatch<any>, any>>
```
#### Returns
`Promise`&lt;`IteratorResult`&lt;`RecordBatch`&lt;`any`&gt;, `any`&gt;&gt;
#### Implementation of
`AsyncIterator.next`


@@ -76,7 +76,7 @@ AnalyzeExec verbose=true, metrics=[]
### execute()
```ts
protected execute(options?): RecordBatchIterator
protected execute(options?): AsyncGenerator<RecordBatch<any>, void, unknown>
```
Execute the query and return the results as an
@@ -87,7 +87,7 @@ Execute the query and return the results as an
#### Returns
[`RecordBatchIterator`](RecordBatchIterator.md)
`AsyncGenerator`&lt;`RecordBatch`&lt;`any`&gt;, `void`, `unknown`&gt;
#### See


@@ -221,7 +221,7 @@ also increase the latency of your query. The default value is 1.5*limit.
### execute()
```ts
protected execute(options?): RecordBatchIterator
protected execute(options?): AsyncGenerator<RecordBatch<any>, void, unknown>
```
Execute the query and return the results as an
@@ -232,7 +232,7 @@ Execute the query and return the results as an
#### Returns
[`RecordBatchIterator`](RecordBatchIterator.md)
`AsyncGenerator`&lt;`RecordBatch`&lt;`any`&gt;, `void`, `unknown`&gt;
#### See


@@ -0,0 +1,19 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / RecordBatchIterator
# Function: RecordBatchIterator()
```ts
function RecordBatchIterator(promisedInner): AsyncGenerator<RecordBatch<any>, void, unknown>
```
## Parameters
* **promisedInner**: `Promise`&lt;`RecordBatchIterator`&gt;
## Returns
`AsyncGenerator`&lt;`RecordBatch`&lt;`any`&gt;, `void`, `unknown`&gt;


@@ -32,7 +32,6 @@
- [PhraseQuery](classes/PhraseQuery.md)
- [Query](classes/Query.md)
- [QueryBase](classes/QueryBase.md)
- [RecordBatchIterator](classes/RecordBatchIterator.md)
- [Session](classes/Session.md)
- [StaticHeaderProvider](classes/StaticHeaderProvider.md)
- [Table](classes/Table.md)
@@ -78,6 +77,7 @@
- [RemovalStats](interfaces/RemovalStats.md)
- [RetryConfig](interfaces/RetryConfig.md)
- [ShuffleOptions](interfaces/ShuffleOptions.md)
- [SplitCalculatedOptions](interfaces/SplitCalculatedOptions.md)
- [SplitHashOptions](interfaces/SplitHashOptions.md)
- [SplitRandomOptions](interfaces/SplitRandomOptions.md)
- [SplitSequentialOptions](interfaces/SplitSequentialOptions.md)
@@ -105,6 +105,7 @@
## Functions
- [RecordBatchIterator](functions/RecordBatchIterator.md)
- [connect](functions/connect.md)
- [makeArrowTable](functions/makeArrowTable.md)
- [packBits](functions/packBits.md)


@@ -0,0 +1,23 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / SplitCalculatedOptions
# Interface: SplitCalculatedOptions
## Properties
### calculation
```ts
calculation: string;
```
***
### splitNames?
```ts
optional splitNames: string[];
```


@@ -24,6 +24,14 @@ optional discardWeight: number;
***
### splitNames?
```ts
optional splitNames: string[];
```
***
### splitWeights
```ts


@@ -37,3 +37,11 @@ optional ratios: number[];
```ts
optional seed: number;
```
***
### splitNames?
```ts
optional splitNames: string[];
```


@@ -29,3 +29,11 @@ optional fixed: number;
```ts
optional ratios: number[];
```
***
### splitNames?
```ts
optional splitNames: string[];
```


@@ -51,8 +51,11 @@ pub enum Error {
DatasetAlreadyExists { uri: String, location: Location },
#[snafu(display("Table '{name}' already exists"))]
TableAlreadyExists { name: String },
#[snafu(display("Table '{name}' was not found"))]
TableNotFound { name: String },
#[snafu(display("Table '{name}' was not found: {source}"))]
TableNotFound {
name: String,
source: Box<dyn std::error::Error + Send + Sync>,
},
#[snafu(display("Invalid table name '{name}': {reason}"))]
InvalidTableName { name: String, reason: String },
#[snafu(display("Embedding function '{name}' was not found: {reason}, {location}"))]
@@ -191,7 +194,7 @@ impl From<lancedb::Error> for Error {
message,
location: std::panic::Location::caller().to_snafu_location(),
},
lancedb::Error::TableNotFound { name } => Self::TableNotFound { name },
lancedb::Error::TableNotFound { name, source } => Self::TableNotFound { name, source },
lancedb::Error::TableAlreadyExists { name } => Self::TableAlreadyExists { name },
lancedb::Error::EmbeddingFunctionNotFound { name, reason } => {
Self::EmbeddingFunctionNotFound {


@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.22.3-beta.0</version>
<version>0.22.3-beta.5</version>
<relativePath>../pom.xml</relativePath>
</parent>


@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.22.3-beta.0</version>
<version>0.22.3-beta.5</version>
<relativePath>../pom.xml</relativePath>
</parent>


@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.22.3-beta.0</version>
<version>0.22.3-beta.5</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>

nodejs/AGENTS.md Normal file

@@ -0,0 +1,13 @@
These are the typescript bindings of LanceDB.
The core Rust library is in the `../rust/lancedb` directory, the rust binding
code is in the `src/` directory and the typescript bindings are in
the `lancedb/` directory.
Whenever you change the Rust code, you will need to recompile: `npm run build`.
Common commands:
* Build: `npm run build`
* Lint: `npm run lint`
* Fix lints: `npm run lint-fix`
* Test: `npm test`
* Run single test file: `npm test __test__/arrow.test.ts`


@@ -1,13 +0,0 @@
These are the typescript bindings of LanceDB.
The core Rust library is in the `../rust/lancedb` directory, the rust binding
code is in the `src/` directory and the typescript bindings are in
the `lancedb/` directory.
Whenever you change the Rust code, you will need to recompile: `npm run build`.
Common commands:
* Build: `npm run build`
* Lint: `npm run lint`
* Fix lints: `npm run lint-fix`
* Test: `npm test`
* Run single test file: `npm test __test__/arrow.test.ts`

nodejs/CLAUDE.md Symbolic link

@@ -0,0 +1 @@
AGENTS.md


@@ -1,7 +1,7 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.22.3-beta.0"
version = "0.22.3-beta.5"
license.workspace = true
description.workspace = true
repository.workspace = true


@@ -138,7 +138,9 @@ describe("PermutationBuilder", () => {
});
test("should create permutation with calculated splits", async () => {
const builder = permutationBuilder(table).splitCalculated("id % 2");
const builder = permutationBuilder(table).splitCalculated({
calculation: "id % 2",
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
@@ -224,4 +226,146 @@ describe("PermutationBuilder", () => {
// Should throw error on second execution
await expect(builder.execute()).rejects.toThrow("Builder already consumed");
});
test("should accept custom split names with random splits", async () => {
const builder = permutationBuilder(table).splitRandom({
ratios: [0.3, 0.7],
seed: 42,
splitNames: ["train", "test"],
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Split names are provided but split_id is still numeric (0, 1, etc.)
// The names are metadata that can be used by higher-level APIs
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
});
test("should accept custom split names with hash splits", async () => {
const builder = permutationBuilder(table).splitHash({
columns: ["id"],
splitWeights: [50, 50],
discardWeight: 0,
splitNames: ["set_a", "set_b"],
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Split names are provided but split_id is still numeric
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
});
test("should accept custom split names with sequential splits", async () => {
const builder = permutationBuilder(table).splitSequential({
ratios: [0.5, 0.5],
splitNames: ["first", "second"],
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Split names are provided but split_id is still numeric
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBe(5);
expect(split1Count).toBe(5);
});
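The 5/5 expectation in the sequential test above can be modeled with a few lines of Python. This is only a sketch of the idea (rows assigned to splits in order, sized by ratio); the function name and the rounding rule are assumptions for illustration, not LanceDB's actual implementation.

```python
from typing import List


def sequential_split(num_rows: int, ratios: List[float]) -> List[int]:
    """Assign a numeric split id to each row position, in order."""
    # Assumption: each split gets floor(ratio * num_rows) rows and any
    # rounding remainder goes to the last split.
    sizes = [int(r * num_rows) for r in ratios]
    sizes[-1] += num_rows - sum(sizes)
    split_ids: List[int] = []
    for split_id, size in enumerate(sizes):
        split_ids.extend([split_id] * size)
    return split_ids


ids = sequential_split(10, [0.5, 0.5])
print(ids.count(0), ids.count(1))
```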
test("should accept custom split names with calculated splits", async () => {
const builder = permutationBuilder(table).splitCalculated({
calculation: "id % 2",
splitNames: ["even", "odd"],
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Split names are provided but split_id is still numeric
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
});
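As the test comments note, `split_id` stays numeric and the names are metadata layered on top. A rough Python model of the `id % 2` calculation is sketched here; the row ids 1 through 10 and the name lookup are assumptions made for the illustration, not something the test data confirms.

```python
from collections import Counter

# Hypothetical table contents: ten rows with ids 1..10.
rows = [{"id": i, "value": i * 10} for i in range(1, 11)]
split_names = ["even", "odd"]


def calculation(row: dict) -> int:
    # Stands in for the SQL expression "id % 2".
    return row["id"] % 2


# The numeric split_id is the value of the calculation for each row.
split_ids = [calculation(row) for row in rows]
counts = Counter(split_ids)

# Names are looked up by position; the stored split_id is still 0 or 1.
by_name = {split_names[split_id]: n for split_id, n in sorted(counts.items())}
print(by_name)
```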
test("should persist permutation to a new table", async () => {
const db = await connect(tmpDir.name);
const builder = permutationBuilder(table)
.splitRandom({
ratios: [0.7, 0.3],
seed: 42,
splitNames: ["train", "validation"],
})
.persist(db, "my_permutation");
// Execute the builder which will persist the table
const permutationTable = await builder.execute();
// Verify the persisted table exists and can be opened
const persistedTable = await db.openTable("my_permutation");
expect(persistedTable).toBeDefined();
// Verify the persisted table has the correct number of rows
const rowCount = await persistedTable.countRows();
expect(rowCount).toBe(10);
// Verify splits exist (numeric split_id values)
const split0Count = await persistedTable.countRows("split_id = 0");
const split1Count = await persistedTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
// Verify the table returned by execute is the same as the persisted one
const executedRowCount = await permutationTable.countRows();
expect(executedRowCount).toBe(10);
});
test("should persist permutation with multiple operations", async () => {
const db = await connect(tmpDir.name);
const builder = permutationBuilder(table)
.filter("value > 30")
.splitRandom({ ratios: [0.5, 0.5], seed: 123, splitNames: ["a", "b"] })
.shuffle({ seed: 456 })
.persist(db, "filtered_permutation");
// Execute the builder
const permutationTable = await builder.execute();
// Verify the persisted table
const persistedTable = await db.openTable("filtered_permutation");
const rowCount = await persistedTable.countRows();
expect(rowCount).toBe(7); // Values 40, 50, 60, 70, 80, 90, 100
// Verify splits exist (numeric split_id values)
const split0Count = await persistedTable.countRows("split_id = 0");
const split1Count = await persistedTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(7);
// Verify the executed table matches
const executedRowCount = await permutationTable.countRows();
expect(executedRowCount).toBe(7);
});
});


@@ -43,6 +43,7 @@ export {
DeleteResult,
DropColumnsResult,
UpdateResult,
SplitCalculatedOptions,
SplitRandomOptions,
SplitHashOptions,
SplitSequentialOptions,


@@ -1,10 +1,12 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
import { Connection, LocalConnection } from "./connection.js";
import {
PermutationBuilder as NativePermutationBuilder,
Table as NativeTable,
ShuffleOptions,
SplitCalculatedOptions,
SplitHashOptions,
SplitRandomOptions,
SplitSequentialOptions,
@@ -29,6 +31,23 @@ export class PermutationBuilder {
this.inner = inner;
}
/**
* Configure the permutation to be persisted.
*
* @param connection - The connection to persist the permutation to
* @param tableName - The name of the table to create
* @returns A new PermutationBuilder instance
* @example
* ```ts
* builder.persist(connection, "permutation_table");
* ```
*/
persist(connection: Connection, tableName: string): PermutationBuilder {
const localConnection = connection as LocalConnection;
const newInner = this.inner.persist(localConnection.inner, tableName);
return new PermutationBuilder(newInner);
}
/**
* Configure random splits for the permutation.
*
@@ -95,15 +114,15 @@ export class PermutationBuilder {
/**
* Configure calculated splits for the permutation.
*
* @param calculation - SQL expression for calculating splits
* @param options - Configuration for calculated splitting
* @returns A new PermutationBuilder instance
* @example
* ```ts
* builder.splitCalculated({ calculation: "user_id % 3" });
* ```
*/
splitCalculated(calculation: string): PermutationBuilder {
const newInner = this.inner.splitCalculated(calculation);
splitCalculated(options: SplitCalculatedOptions): PermutationBuilder {
const newInner = this.inner.splitCalculated(options);
return new PermutationBuilder(newInner);
}


@@ -20,35 +20,25 @@ import {
} from "./native";
import { Reranker } from "./rerankers";
export class RecordBatchIterator implements AsyncIterator<RecordBatch> {
private promisedInner?: Promise<NativeBatchIterator>;
private inner?: NativeBatchIterator;
export async function* RecordBatchIterator(
promisedInner: Promise<NativeBatchIterator>,
) {
const inner = await promisedInner;
constructor(promise?: Promise<NativeBatchIterator>) {
// TODO: check promise reliably so we dont need to pass two arguments.
this.promisedInner = promise;
if (inner === undefined) {
throw new Error("Invalid iterator state");
}
// biome-ignore lint/suspicious/noExplicitAny: skip
async next(): Promise<IteratorResult<RecordBatch<any>>> {
if (this.inner === undefined) {
this.inner = await this.promisedInner;
}
if (this.inner === undefined) {
throw new Error("Invalid iterator state state");
}
const n = await this.inner.next();
if (n == null) {
return Promise.resolve({ done: true, value: null });
}
const tbl = tableFromIPC(n);
if (tbl.batches.length != 1) {
for (let buffer = await inner.next(); buffer; buffer = await inner.next()) {
const { batches } = tableFromIPC(buffer);
if (batches.length !== 1) {
throw new Error("Expected only one batch");
}
return Promise.resolve({ done: false, value: tbl.batches[0] });
yield batches[0];
}
}
/* eslint-enable */
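The refactor above replaces a hand-written `AsyncIterator` class with an async generator function. The same shape can be sketched in Python as an analogy. `FakeNativeIterator` and `make_native` are hypothetical stand-ins for the native batch iterator and the promise returned by the native execute call, and the string items stand in for decoded record batches.

```python
import asyncio


class FakeNativeIterator:
    """Hypothetical native iterator: next() returns None once exhausted."""

    def __init__(self, items):
        self._items = list(items)

    async def next(self):
        return self._items.pop(0) if self._items else None


async def make_native(items):
    # Plays the role of the promised inner iterator.
    return FakeNativeIterator(items)


async def record_batches(promised_inner):
    # Await the inner iterator once, then yield until it reports exhaustion.
    inner = await promised_inner
    if inner is None:
        raise RuntimeError("Invalid iterator state")
    while (item := await inner.next()) is not None:
        yield item


async def collect():
    return [b async for b in record_batches(make_native(["batch-0", "batch-1"]))]


result = asyncio.run(collect())
print(result)
```

The generator keeps the lazily resolved state in local variables, so the optional fields and the explicit `done` bookkeeping of the class version are no longer needed.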
class RecordBatchIterable<
NativeQueryType extends NativeQuery | NativeVectorQuery | NativeTakeQuery,
@@ -64,7 +54,7 @@ class RecordBatchIterable<
// biome-ignore lint/suspicious/noExplicitAny: skip
[Symbol.asyncIterator](): AsyncIterator<RecordBatch<any>, any, undefined> {
return new RecordBatchIterator(
return RecordBatchIterator(
this.inner.execute(this.options?.maxBatchLength, this.options?.timeoutMs),
);
}
@@ -231,10 +221,8 @@ export class QueryBase<
* single query)
*
*/
protected execute(
options?: Partial<QueryExecutionOptions>,
): RecordBatchIterator {
return new RecordBatchIterator(this.nativeExecute(options));
protected execute(options?: Partial<QueryExecutionOptions>) {
return RecordBatchIterator(this.nativeExecute(options));
}
/**
@@ -242,8 +230,7 @@ export class QueryBase<
*/
// biome-ignore lint/suspicious/noExplicitAny: skip
[Symbol.asyncIterator](): AsyncIterator<RecordBatch<any>> {
const promise = this.nativeExecute();
return new RecordBatchIterator(promise);
return RecordBatchIterator(this.nativeExecute());
}
/** Collect the results as an Arrow @see {@link ArrowTable}. */


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-x64",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["darwin"],
"cpu": ["x64"],
"main": "lancedb.darwin-x64.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": [
"win32"
],


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",


@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"cpu": [
"x64",
"arm64"


@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.22.3-beta.0",
"version": "0.22.3-beta.5",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",


@@ -4,7 +4,7 @@
use std::collections::HashMap;
use std::sync::Arc;
use lancedb::database::CreateTableMode;
use lancedb::database::{CreateTableMode, Database};
use napi::bindgen_prelude::*;
use napi_derive::*;
@@ -41,6 +41,10 @@ impl Connection {
_ => Err(napi::Error::from_reason(format!("Invalid mode {}", mode))),
}
}
pub fn database(&self) -> napi::Result<Arc<dyn Database>> {
Ok(self.get_inner()?.database().clone())
}
}
#[napi]


@@ -16,6 +16,7 @@ pub struct SplitRandomOptions {
pub counts: Option<Vec<i64>>,
pub fixed: Option<i64>,
pub seed: Option<i64>,
pub split_names: Option<Vec<String>>,
}
#[napi(object)]
@@ -23,6 +24,7 @@ pub struct SplitHashOptions {
pub columns: Vec<String>,
pub split_weights: Vec<i64>,
pub discard_weight: Option<i64>,
pub split_names: Option<Vec<String>>,
}
#[napi(object)]
@@ -30,6 +32,13 @@ pub struct SplitSequentialOptions {
pub ratios: Option<Vec<f64>>,
pub counts: Option<Vec<i64>>,
pub fixed: Option<i64>,
pub split_names: Option<Vec<String>>,
}
#[napi(object)]
pub struct SplitCalculatedOptions {
pub calculation: String,
pub split_names: Option<Vec<String>>,
}
#[napi(object)]
@@ -76,6 +85,16 @@ impl PermutationBuilder {
#[napi]
impl PermutationBuilder {
#[napi]
pub fn persist(
&self,
connection: &crate::connection::Connection,
table_name: String,
) -> napi::Result<Self> {
let database = connection.database()?;
self.modify(|builder| builder.persist(database, table_name))
}
/// Configure random splits
#[napi]
pub fn split_random(&self, options: SplitRandomOptions) -> napi::Result<Self> {
@@ -107,7 +126,12 @@ impl PermutationBuilder {
let seed = options.seed.map(|s| s as u64);
self.modify(|builder| builder.with_split_strategy(SplitStrategy::Random { seed, sizes }))
self.modify(|builder| {
builder.with_split_strategy(
SplitStrategy::Random { seed, sizes },
options.split_names.clone(),
)
})
}
/// Configure hash-based splits
@@ -120,12 +144,15 @@ impl PermutationBuilder {
.collect();
let discard_weight = options.discard_weight.unwrap_or(0) as u64;
self.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Hash {
columns: options.columns,
split_weights,
discard_weight,
})
self.modify(move |builder| {
builder.with_split_strategy(
SplitStrategy::Hash {
columns: options.columns,
split_weights,
discard_weight,
},
options.split_names,
)
})
}
@@ -158,14 +185,21 @@ impl PermutationBuilder {
unreachable!("One of the split arguments must be provided");
};
self.modify(|builder| builder.with_split_strategy(SplitStrategy::Sequential { sizes }))
self.modify(move |builder| {
builder.with_split_strategy(SplitStrategy::Sequential { sizes }, options.split_names)
})
}
/// Configure calculated splits
#[napi]
pub fn split_calculated(&self, calculation: String) -> napi::Result<Self> {
self.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Calculated { calculation })
pub fn split_calculated(&self, options: SplitCalculatedOptions) -> napi::Result<Self> {
self.modify(move |builder| {
builder.with_split_strategy(
SplitStrategy::Calculated {
calculation: options.calculation,
},
options.split_names,
)
})
}


@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.25.3-beta.1"
current_version = "0.25.3"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.

python/AGENTS.md Normal file

@@ -0,0 +1,19 @@
These are the Python bindings of LanceDB.
The core Rust library is in the `../rust/lancedb` directory, the rust binding
code is in the `src/` directory and the Python bindings are in the `lancedb/` directory.
Common commands:
* Build: `make develop`
* Format: `make format`
* Lint: `make check`
* Fix lints: `make fix`
* Test: `make test`
* Doc test: `make doctest`
Before committing changes, run lints and then formatting.
When you change the Rust code, you will need to recompile the Python bindings: `make develop`.
When you export new types from Rust to Python, you must manually update `python/lancedb/_lancedb.pyi`
with the corresponding type hints. You can run `pyright` to check for type errors in the Python code.


@@ -1,19 +0,0 @@
These are the Python bindings of LanceDB.
The core Rust library is in the `../rust/lancedb` directory, the rust binding
code is in the `src/` directory and the Python bindings are in the `lancedb/` directory.
Common commands:
* Build: `make develop`
* Format: `make format`
* Lint: `make check`
* Fix lints: `make fix`
* Test: `make test`
* Doc test: `make doctest`
Before committing changes, run lints and then formatting.
When you change the Rust code, you will need to recompile the Python bindings: `make develop`.
When you export new types from Rust to Python, you must manually update `python/lancedb/_lancedb.pyi`
with the corresponding type hints. You can run `pyright` to check for type errors in the Python code.

python/CLAUDE.md Symbolic link

@@ -0,0 +1 @@
AGENTS.md


@@ -1,6 +1,6 @@
[package]
name = "lancedb-python"
version = "0.25.3-beta.1"
version = "0.25.3"
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true


@@ -17,7 +17,7 @@ from .db import AsyncConnection, DBConnection, LanceDBConnection
from .remote import ClientConfig
from .remote.db import RemoteDBConnection
from .schema import vector
from .table import AsyncTable
from .table import AsyncTable, Table
from ._lancedb import Session
from .namespace import connect_namespace, LanceNamespaceDBConnection
@@ -233,6 +233,7 @@ __all__ = [
"LanceNamespaceDBConnection",
"RemoteDBConnection",
"Session",
"Table",
"__version__",
]


@@ -339,3 +339,7 @@ class AsyncPermutationBuilder:
def async_permutation_builder(
table: Table, dest_table_name: str
) -> AsyncPermutationBuilder: ...
def fts_query_to_json(query: Any) -> str: ...
class PermutationReader:
def __init__(self, base_table: Table, permutation_table: Table): ...


@@ -2,7 +2,7 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import base64
import os
from typing import ClassVar, TYPE_CHECKING, List, Union, Any
from typing import ClassVar, TYPE_CHECKING, List, Union, Any, Generator
from pathlib import Path
from urllib.parse import urlparse
@@ -19,6 +19,23 @@ from .utils import api_key_not_found_help, IMAGES, TEXT
if TYPE_CHECKING:
import PIL
# Token limits for different VoyageAI models
VOYAGE_TOTAL_TOKEN_LIMITS = {
"voyage-context-3": 32_000,
"voyage-3.5-lite": 1_000_000,
"voyage-3.5": 320_000,
"voyage-3-lite": 120_000,
"voyage-3": 120_000,
"voyage-multimodal-3": 120_000,
"voyage-finance-2": 120_000,
"voyage-multilingual-2": 120_000,
"voyage-law-2": 120_000,
"voyage-code-2": 120_000,
}
# Batch size for embedding requests (max number of items per batch)
BATCH_SIZE = 1000
def is_valid_url(text):
try:
@@ -120,6 +137,9 @@ class VoyageAIEmbeddingFunction(EmbeddingFunction):
name: str
The name of the model to use. List of acceptable models:
* voyage-context-3
* voyage-3.5
* voyage-3.5-lite
* voyage-3
* voyage-3-lite
* voyage-multimodal-3
@@ -157,25 +177,35 @@ class VoyageAIEmbeddingFunction(EmbeddingFunction):
name: str
client: ClassVar = None
text_embedding_models: list = [
"voyage-3.5",
"voyage-3.5-lite",
"voyage-3",
"voyage-3-lite",
"voyage-finance-2",
"voyage-multilingual-2",
"voyage-law-2",
"voyage-code-2",
]
multimodal_embedding_models: list = ["voyage-multimodal-3"]
contextual_embedding_models: list = ["voyage-context-3"]
def _is_multimodal_model(self, model_name: str):
return (
model_name in self.multimodal_embedding_models or "multimodal" in model_name
)
def _is_contextual_model(self, model_name: str):
return model_name in self.contextual_embedding_models or "context" in model_name
def ndims(self):
if self.name == "voyage-3-lite":
return 512
elif self.name == "voyage-code-2":
return 1536
elif self.name in [
"voyage-context-3",
"voyage-3.5",
"voyage-3.5-lite",
"voyage-3",
"voyage-multimodal-3",
"voyage-finance-2",
@@ -207,6 +237,11 @@ class VoyageAIEmbeddingFunction(EmbeddingFunction):
result = client.multimodal_embed(
inputs=[[query]], model=self.name, input_type="query", **kwargs
)
elif self._is_contextual_model(self.name):
result = client.contextualized_embed(
inputs=[[query]], model=self.name, input_type="query", **kwargs
)
result = result.results[0]
else:
result = client.embed(
texts=[query], model=self.name, input_type="query", **kwargs
@@ -231,18 +266,164 @@ class VoyageAIEmbeddingFunction(EmbeddingFunction):
List[np.array]: the list of embeddings
"""
client = VoyageAIEmbeddingFunction._get_client()
# For multimodal models, check if inputs contain images
if self._is_multimodal_model(self.name):
inputs = sanitize_multimodal_input(inputs)
result = client.multimodal_embed(
inputs=inputs, model=self.name, input_type="document", **kwargs
sanitized = sanitize_multimodal_input(inputs)
has_images = any(
inp["content"][0].get("type") != "text" for inp in sanitized
)
if has_images:
# Use non-batched API for images
result = client.multimodal_embed(
inputs=sanitized, model=self.name, input_type="document", **kwargs
)
return result.embeddings
# Extract texts for batching
inputs = [inp["content"][0]["text"] for inp in sanitized]
else:
inputs = sanitize_text_input(inputs)
result = client.embed(
texts=inputs, model=self.name, input_type="document", **kwargs
)
return result.embeddings
# Use batching for all text inputs
return self._embed_with_batching(
client, inputs, input_type="document", **kwargs
)
def _build_batches(
self, client, texts: List[str]
) -> Generator[List[str], None, None]:
"""
Generate batches of texts based on token limits using a generator.
Parameters
----------
client : voyageai.Client
The VoyageAI client instance.
texts : List[str]
List of texts to batch.
Yields
------
List[str]: Batches of texts.
"""
if not texts:
return
max_tokens_per_batch = VOYAGE_TOTAL_TOKEN_LIMITS.get(self.name, 120_000)
current_batch: List[str] = []
current_batch_tokens = 0
# Tokenize all texts in one API call
token_lists = client.tokenize(texts, model=self.name)
token_counts = [len(token_list) for token_list in token_lists]
for i, text in enumerate(texts):
n_tokens = token_counts[i]
# Check if adding this text would exceed limits
if current_batch and (
len(current_batch) >= BATCH_SIZE
or (current_batch_tokens + n_tokens > max_tokens_per_batch)
):
# Yield the current batch and start a new one
yield current_batch
current_batch = []
current_batch_tokens = 0
current_batch.append(text)
current_batch_tokens += n_tokens
# Yield the last batch (always has at least one text)
if current_batch:
yield current_batch
def _get_embed_function(
self, client, input_type: str = "document", **kwargs
) -> callable:
"""
Get the appropriate embedding function based on model type.
Parameters
----------
client : voyageai.Client
The VoyageAI client instance.
input_type : str
Either "query" or "document"
**kwargs
Additional arguments to pass to the embedding API
Returns
-------
callable: A function that takes a batch of texts and returns embeddings.
"""
if self._is_multimodal_model(self.name):
def embed_batch(batch: List[str]) -> List[np.array]:
batch_inputs = sanitize_multimodal_input(batch)
result = client.multimodal_embed(
inputs=batch_inputs,
model=self.name,
input_type=input_type,
**kwargs,
)
return result.embeddings
return embed_batch
elif self._is_contextual_model(self.name):
def embed_batch(batch: List[str]) -> List[np.array]:
result = client.contextualized_embed(
inputs=[batch], model=self.name, input_type=input_type, **kwargs
)
return result.results[0].embeddings
return embed_batch
else:
def embed_batch(batch: List[str]) -> List[np.array]:
result = client.embed(
texts=batch, model=self.name, input_type=input_type, **kwargs
)
return result.embeddings
return embed_batch
def _embed_with_batching(
self, client, texts: List[str], input_type: str = "document", **kwargs
) -> List[np.array]:
"""
Embed texts with automatic batching based on token limits.
Parameters
----------
client : voyageai.Client
The VoyageAI client instance.
texts : List[str]
List of texts to embed.
input_type : str
Either "query" or "document"
**kwargs
Additional arguments to pass to the embedding API
Returns
-------
List[np.array]: List of embeddings.
"""
if not texts:
return []
# Get the appropriate embedding function for this model type
embed_fn = self._get_embed_function(client, input_type=input_type, **kwargs)
# Process each batch
all_embeddings = []
for batch in self._build_batches(client, texts):
batch_embeddings = embed_fn(batch)
all_embeddings.extend(batch_embeddings)
return all_embeddings
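The batching logic above can be sketched without the VoyageAI client. This is a minimal standalone version, assuming token counts are already known; `build_batches`, `max_items`, and `max_tokens` are illustrative names, not part of the library.

```python
from typing import Iterator, List, Sequence


def build_batches(
    texts: Sequence[str],
    token_counts: Sequence[int],
    max_items: int,
    max_tokens: int,
) -> Iterator[List[str]]:
    """Greedily group texts so no batch exceeds max_items or max_tokens."""
    batch: List[str] = []
    batch_tokens = 0
    for text, n_tokens in zip(texts, token_counts):
        # Close the current batch if adding this text would break a limit.
        if batch and (
            len(batch) >= max_items or batch_tokens + n_tokens > max_tokens
        ):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += n_tokens
    # Emit whatever is left over.
    if batch:
        yield batch


# Hypothetical token counts; real counts would come from the tokenizer.
batches = list(
    build_batches(["a", "b", "c", "d"], [5, 4, 3, 9], max_items=3, max_tokens=10)
)
```

Note that a single text larger than `max_tokens` still gets a batch of its own, since the limit check only runs when the current batch is non-empty.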
@staticmethod
def _get_client():

View File

@@ -1,18 +1,63 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
from ._lancedb import async_permutation_builder
from deprecation import deprecated
from lancedb import AsyncConnection, DBConnection
import pyarrow as pa
import json
from ._lancedb import async_permutation_builder, PermutationReader
from .table import LanceTable
from .background_loop import LOOP
from typing import Optional
from .util import batch_to_tensor
from typing import Any, Callable, Iterator, Literal, Optional, TYPE_CHECKING, Union
if TYPE_CHECKING:
from lancedb.dependencies import pandas as pd, numpy as np, polars as pl
class PermutationBuilder:
"""
A utility for creating a "permutation table" which is a table that defines an
ordering on a base table.
The permutation table does not store the actual data. It only stores row
ids and split ids to define the ordering. The [Permutation] class can be used to
read the data from the base table in the order defined by the permutation table.
Permutations can split, shuffle, and filter the data in the base table.
A filter limits the rows that are included in the permutation.
Splits divide the data into subsets (for example, a test/train split, or K
different splits for cross-validation).
Shuffling randomizes the order of the rows in the permutation.
Splits can optionally be named. Named splits can later be referenced by name.
Unnamed splits can only be referenced by their ordinal index. There is no
requirement to name every split.
By default, the permutation will be stored in memory and will be lost when the
program exits. To persist the permutation (for very large datasets or to share
the permutation across multiple workers) use the [persist](#persist) method to
create a permanent table.
"""
def __init__(self, table: LanceTable):
"""
Creates a new permutation builder for the given table.
By default, the permutation builder will create a single split that contains all
rows in the same order as the base table.
"""
self._async = async_permutation_builder(table)
def select(self, projections: dict[str, str]) -> "PermutationBuilder":
self._async.select(projections)
def persist(
self, database: Union[DBConnection, AsyncConnection], table_name: str
) -> "PermutationBuilder":
"""
Persist the permutation to the given database.
"""
self._async.persist(database, table_name)
return self
def split_random(
@@ -22,8 +67,38 @@ class PermutationBuilder:
counts: Optional[list[int]] = None,
fixed: Optional[int] = None,
seed: Optional[int] = None,
split_names: Optional[list[str]] = None,
) -> "PermutationBuilder":
self._async.split_random(ratios=ratios, counts=counts, fixed=fixed, seed=seed)
"""
Configure random splits for the permutation.
One of ratios, counts, or fixed must be provided.
If ratios are provided, they will be used to determine the relative size of each
split. For example, if ratios are [0.3, 0.7] then the first split will contain
30% of the rows and the second split will contain 70% of the rows.
If counts are provided, they will be used to determine the absolute number of
rows in each split. For example, if counts are [100, 200] then the first split
will contain 100 rows and the second split will contain 200 rows.
If fixed is provided, it will be used to determine the number of splits.
For example, if fixed is 3 then the permutation will be split evenly into 3
splits.
Rows will be randomly assigned to splits. The optional seed can be provided to
make the assignment deterministic.
The optional split_names can be provided to name the splits. If not provided,
the splits can only be referenced by their index.
"""
self._async.split_random(
ratios=ratios,
counts=counts,
fixed=fixed,
seed=seed,
split_names=split_names,
)
return self
def split_hash(
@@ -32,8 +107,33 @@ class PermutationBuilder:
split_weights: list[int],
*,
discard_weight: Optional[int] = None,
split_names: Optional[list[str]] = None,
) -> "PermutationBuilder":
self._async.split_hash(columns, split_weights, discard_weight=discard_weight)
"""
Configure hash-based splits for the permutation.
First, a hash will be calculated over the specified columns. The split weights
are then used to determine how many rows to assign to each split. For example,
if split weights are [1, 2] then the first split will contain 1/3 of the rows
and the second split will contain 2/3 of the rows.
The optional discard weight can be provided to determine what percentage of rows
should be discarded. For example, if split weights are [1, 2] and discard
weight is 1 then 25% of the rows will be discarded.
Hash-based splits are useful if you want the split to be more or less random but
you don't want the split assignments to change if rows are added or removed
from the table.
The optional split_names can be provided to name the splits. If not provided,
the splits can only be referenced by their index.
"""
self._async.split_hash(
columns,
split_weights,
discard_weight=discard_weight,
split_names=split_names,
)
return self
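The weight arithmetic described in the `split_hash` docstring can be worked through in plain Python. This is a sketch only: `split_fractions` and `assign_split` are hypothetical helpers, and SHA-256 stands in for whatever hash the Rust implementation actually uses.

```python
import hashlib
from fractions import Fraction
from typing import List, Optional, Tuple


def split_fractions(
    split_weights: List[int], discard_weight: Optional[int] = None
) -> Tuple[List[Fraction], Fraction]:
    """Return (fraction per split, discarded fraction) for the given weights."""
    discard = discard_weight or 0
    total = sum(split_weights) + discard
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [Fraction(w, total) for w in split_weights], Fraction(discard, total)


def assign_split(
    key: str, split_weights: List[int], discard_weight: int = 0
) -> Optional[int]:
    """Map a key to a split index, or None if it lands in the discard range."""
    total = sum(split_weights) + discard_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Assumption: any stable hash works for illustration; this is not the
    # hash function used by the library.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for idx, weight in enumerate(split_weights):
        if bucket < weight:
            return idx
        bucket -= weight
    return None
```

With weights `[1, 2]` the fractions are 1/3 and 2/3. Adding a discard weight of 1 makes them 1/4 and 1/2, with 1/4 discarded. Because the assignment depends only on the key, adding or removing other rows does not move a row to a different split.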
def split_sequential(
@@ -42,25 +142,85 @@ class PermutationBuilder:
ratios: Optional[list[float]] = None,
counts: Optional[list[int]] = None,
fixed: Optional[int] = None,
split_names: Optional[list[str]] = None,
) -> "PermutationBuilder":
self._async.split_sequential(ratios=ratios, counts=counts, fixed=fixed)
"""
Configure sequential splits for the permutation.
One of ratios, counts, or fixed must be provided.
If ratios are provided, they will be used to determine the relative size of each
split. For example, if ratios are [0.3, 0.7] then the first split will contain
30% of the rows and the second split will contain 70% of the rows.
If counts are provided, they will be used to determine the absolute number of
rows in each split. For example, if counts are [100, 200] then the first split
will contain 100 rows and the second split will contain 200 rows.
If fixed is provided, it will be used to determine the number of splits.
For example, if fixed is 3 then the permutation will be split evenly into 3
splits.
Rows will be assigned to splits sequentially. The first N1 rows are assigned to
split 1, the next N2 rows are assigned to split 2, etc.
The optional split_names can be provided to name the splits. If not provided,
the splits can only be referenced by their index.
"""
self._async.split_sequential(
ratios=ratios, counts=counts, fixed=fixed, split_names=split_names
)
return self
def split_calculated(self, calculation: str) -> "PermutationBuilder":
self._async.split_calculated(calculation)
def split_calculated(
self, calculation: str, split_names: Optional[list[str]] = None
) -> "PermutationBuilder":
"""
Use pre-calculated splits for the permutation.
The calculation should be an SQL expression that returns an integer value between
0 and the number of splits - 1. For example, if you have 3 splits then the
calculation should return 0 for the first split, 1 for the second split, and 2
for the third split.
This can be used to implement any kind of user-defined split strategy.
The optional split_names can be provided to name the splits. If not provided,
the splits can only be referenced by their index.
"""
self._async.split_calculated(calculation, split_names=split_names)
return self
def shuffle(
self, *, seed: Optional[int] = None, clump_size: Optional[int] = None
) -> "PermutationBuilder":
"""
Randomly shuffle the rows in the permutation.
An optional seed can be provided to make the shuffle deterministic.
If a clump size is provided, then data will be shuffled as small "clumps"
of contiguous rows. This allows for a balance between randomization and
I/O performance. It can be useful when reading from cloud storage.
"""
self._async.shuffle(seed=seed, clump_size=clump_size)
return self
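One way to picture the clump-based shuffle described in the docstring: rows are grouped into contiguous clumps, and the clumps are shuffled instead of individual rows. The sketch below illustrates that idea on row indices; `clump_shuffle` is a hypothetical helper and is not claimed to match the library's implementation.

```python
import random
from typing import List, Optional


def clump_shuffle(
    num_rows: int, clump_size: int, seed: Optional[int] = None
) -> List[int]:
    """Shuffle row indices in contiguous clumps of up to clump_size rows."""
    if clump_size <= 0:
        raise ValueError("clump_size must be positive")
    # Cut the row range into contiguous clumps; the last one may be shorter.
    clumps = [
        list(range(start, min(start + clump_size, num_rows)))
        for start in range(0, num_rows, clump_size)
    ]
    # Shuffle the clumps, leaving the order inside each clump untouched.
    random.Random(seed).shuffle(clumps)
    return [row for clump in clumps for row in clump]


order = clump_shuffle(10, clump_size=4, seed=7)
```

Rows inside a clump stay adjacent, so a reader can fetch each clump with one contiguous read. Larger clumps mean fewer reads but less randomness.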
def filter(self, filter: str) -> "PermutationBuilder":
"""
Configure a filter for the permutation.
The filter should be an SQL expression that returns a boolean value for each row.
Only rows where the filter is true will be included in the permutation.
"""
self._async.filter(filter)
return self
def execute(self) -> LanceTable:
"""
Execute the configuration and create the permutation table.
"""
async def do_execute():
inner_tbl = await self._async.execute()
return LanceTable.from_inner(inner_tbl)
@@ -70,3 +230,592 @@ class PermutationBuilder:
def permutation_builder(table: LanceTable) -> PermutationBuilder:
return PermutationBuilder(table)
class Permutations:
"""
A collection of permutations indexed by name or ordinal index.
Splits are defined when the permutation is created. Splits can always be referenced
by their ordinal index. If names were provided when the permutation was created
then they can also be referenced by name.
Each permutation or "split" is a view of a portion of the base table. For more
details see [Permutation].
Attributes
----------
base_table: LanceTable
The base table that the permutations are based on.
permutation_table: LanceTable
The permutation table that defines the splits.
split_names: list[str]
The names of the splits.
split_dict: dict[str, int]
A dictionary mapping split names to their ordinal index.
Examples
--------
>>> # Initial data
>>> import lancedb
>>> db = lancedb.connect("memory:///")
>>> tbl = db.create_table("tbl", data=[{"x": x} for x in range(1000)])
>>> # Create a permutation
>>> perm_tbl = (
... permutation_builder(tbl)
... .split_random(ratios=[0.95, 0.05], split_names=["train", "test"])
... .shuffle()
... .execute()
... )
>>> # Read the permutations
>>> permutations = Permutations(tbl, perm_tbl)
>>> permutations["train"]
<lancedb.permutation.Permutation ...>
>>> permutations[0]
<lancedb.permutation.Permutation ...>
>>> permutations.split_names
['train', 'test']
>>> permutations.split_dict
{'train': 0, 'test': 1}
"""
def __init__(self, base_table: LanceTable, permutation_table: LanceTable):
self.base_table = base_table
self.permutation_table = permutation_table
if permutation_table.schema.metadata is not None:
# Look up the raw bytes first; decoding a missing key would raise
split_names = permutation_table.schema.metadata.get(
b"split_names", None
)
if split_names is not None:
self.split_names = json.loads(split_names.decode("utf-8"))
self.split_dict = {
name: idx for idx, name in enumerate(self.split_names)
}
else:
# No split names are defined in the permutation table
self.split_names = []
self.split_dict = {}
else:
# No metadata is defined in the permutation table
self.split_names = []
self.split_dict = {}
def get_by_name(self, name: str) -> "Permutation":
"""
Get a permutation by name.
If no split named `name` is found then an error will be raised.
"""
idx = self.split_dict.get(name, None)
if idx is None:
raise ValueError(f"No split named `{name}` found")
return self.get_by_index(idx)
def get_by_index(self, index: int) -> "Permutation":
"""
Get a permutation by index.
"""
return Permutation.from_tables(self.base_table, self.permutation_table, index)
def __getitem__(self, name: Union[str, int]) -> "Permutation":
if isinstance(name, str):
return self.get_by_name(name)
elif isinstance(name, int):
return self.get_by_index(name)
else:
raise TypeError(f"Invalid split name or index: {name}")
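The split-name lookup reads a JSON list from the `split_names` key of the schema metadata. A minimal standalone sketch of that parsing is shown below, assuming the metadata is either `None` or a `dict` of bytes to bytes (as PyArrow exposes it); `read_split_names` is an illustrative helper name.

```python
import json
from typing import Dict, List, Optional, Tuple


def read_split_names(
    metadata: Optional[Dict[bytes, bytes]],
) -> Tuple[List[str], Dict[str, int]]:
    """Return (split names, name -> ordinal index) from schema metadata."""
    raw = None if metadata is None else metadata.get(b"split_names")
    if raw is None:
        # No metadata, or no names were recorded: only ordinal access works.
        return [], {}
    names = json.loads(raw.decode("utf-8"))
    return names, {name: idx for idx, name in enumerate(names)}


names, index_by_name = read_split_names({b"split_names": b'["train", "test"]'})
```

Checking for the missing key before decoding keeps unnamed permutation tables usable by index.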
class Transforms:
"""
Namespace for common transformation functions
"""
@staticmethod
def arrow2python(batch: pa.RecordBatch) -> dict[str, list[Any]]:
return batch.to_pydict()
@staticmethod
def arrow2arrow(batch: pa.RecordBatch) -> pa.RecordBatch:
return batch
@staticmethod
def arrow2numpy(batch: pa.RecordBatch) -> "np.ndarray":
return batch.to_pandas().to_numpy()
@staticmethod
def arrow2pandas(batch: pa.RecordBatch) -> "pd.DataFrame":
return batch.to_pandas()
@staticmethod
def arrow2polars() -> Callable[[pa.RecordBatch], "pl.DataFrame"]:
import polars as pl
def impl(batch: pa.RecordBatch) -> pl.DataFrame:
return pl.from_arrow(batch)
return impl
# HuggingFace uses 10 which is pretty small
DEFAULT_BATCH_SIZE = 100
class Permutation:
"""
A Permutation is a view of a dataset that can be used as input to model training
and evaluation.
A Permutation fulfills the PyTorch Dataset contract and is loosely modeled after the
Hugging Face Dataset so it should be easy to use with existing code.
A permutation is not a "materialized view" or copy of the underlying data. It is
calculated on the fly from the base table. As a result, it is truly "lazy" and does
not require materializing the entire dataset in memory.
"""
def __init__(
self,
reader: PermutationReader,
selection: dict[str, str],
batch_size: int,
transform_fn: Callable[[pa.RecordBatch], Any],
):
"""
Internal constructor. Use [from_tables](#from_tables) instead.
"""
assert reader is not None, "reader is required"
assert selection is not None, "selection is required"
self.reader = reader
self.selection = selection
self.transform_fn = transform_fn
self.batch_size = batch_size
def _with_selection(self, selection: dict[str, str]) -> "Permutation":
"""
Creates a new permutation with the given selection
Does not validate the selection and replaces it entirely. This is not
intended for public use.
"""
return Permutation(self.reader, selection, self.batch_size, self.transform_fn)
def _with_reader(self, reader: PermutationReader) -> "Permutation":
"""
Creates a new permutation with the given reader
This is an internal method and should not be used directly.
"""
return Permutation(reader, self.selection, self.batch_size, self.transform_fn)
def with_batch_size(self, batch_size: int) -> "Permutation":
"""
Creates a new permutation with the given batch size
"""
return Permutation(self.reader, self.selection, batch_size, self.transform_fn)
@classmethod
def identity(cls, table: LanceTable) -> "Permutation":
"""
Creates an identity permutation for the given table.
"""
return Permutation.from_tables(table, None, None)
@classmethod
def from_tables(
cls,
base_table: LanceTable,
permutation_table: Optional[LanceTable] = None,
split: Optional[Union[str, int]] = None,
) -> "Permutation":
"""
Creates a permutation from the given base table and permutation table.
A permutation table identifies which rows, and in what order, the data should
be read from the base table. For more details see the [PermutationBuilder]
class.
If no permutation table is provided, then the identity permutation will be
created. An identity permutation is a permutation that reads all rows in the
base table in the order they are stored.
The split parameter identifies which split to use. If no split is provided
then the first split will be used.
"""
assert base_table is not None, "base_table is required"
if split is not None:
if permutation_table is None:
raise ValueError(
f"Cannot create a permutation on split `{split}`"
" because no permutation table is provided"
)
if isinstance(split, str):
if permutation_table.schema.metadata is None:
raise ValueError(
f"Cannot create a permutation on split `{split}`"
" because no split names are defined in the permutation table"
)
# Look up the raw bytes first; decoding a missing key would raise
split_names = permutation_table.schema.metadata.get(
b"split_names", None
)
if split_names is None:
raise ValueError(
f"Cannot create a permutation on split `{split}`"
" because no split names are defined in the permutation table"
)
split_names = json.loads(split_names.decode("utf-8"))
try:
split = split_names.index(split)
except ValueError:
raise ValueError(
f"Cannot create a permutation on split `{split}`"
f" because split `{split}` is not defined in the "
"permutation table"
)
elif isinstance(split, int):
split = split
else:
raise TypeError(f"Invalid split: {split}")
else:
split = 0
async def do_from_tables():
reader = await PermutationReader.from_tables(
base_table, permutation_table, split
)
schema = await reader.output_schema(None)
initial_selection = {name: name for name in schema.names}
return cls(
reader, initial_selection, DEFAULT_BATCH_SIZE, Transforms.arrow2python
)
return LOOP.run(do_from_tables())
@property
def schema(self) -> pa.Schema:
async def do_output_schema():
return await self.reader.output_schema(self.selection)
return LOOP.run(do_output_schema())
@property
def num_columns(self) -> int:
"""
The number of columns in the permutation
"""
return len(self.schema)
@property
def num_rows(self) -> int:
"""
The number of rows in the permutation
"""
return self.reader.count_rows()
@property
def column_names(self) -> list[str]:
"""
The names of the columns in the permutation
"""
return self.schema.names
@property
def shape(self) -> tuple[int, int]:
"""
The shape of the permutation
This will return self.num_rows, self.num_columns
"""
return self.num_rows, self.num_columns
def __len__(self) -> int:
"""
The number of rows in the permutation
This is an alias for [num_rows][lancedb.permutation.Permutation.num_rows]
"""
return self.num_rows
def unique(self, _column: str) -> list[Any]:
"""
Get the unique values in the given column
"""
raise NotImplementedError("unique is not yet implemented")
def flatten(self) -> "Permutation":
"""
Flatten the permutation
Each column with a struct type will be flattened into multiple columns.
This flattening operation happens at read time as a post-processing step
so this call is cheap and no data is copied or modified in the underlying
dataset.
"""
raise NotImplementedError("flatten is not yet implemented")
def remove_columns(self, columns: list[str]) -> "Permutation":
"""
Remove the given columns from the permutation
Note: this does not actually modify the underlying dataset. It only changes
which columns are visible from this permutation. Also, this does not introduce
a post-processing step. Instead, we simply do not read those columns in the
first place.
If any of the provided columns does not exist in the current permutation then it
will be ignored (no error is raised for missing columns)
Returns a new permutation with the given columns removed. This does not modify
self.
"""
assert columns is not None, "columns is required"
new_selection = {
name: value for name, value in self.selection.items() if name not in columns
}
if len(new_selection) == 0:
raise ValueError("Cannot remove all columns")
return self._with_selection(new_selection)
def rename_column(self, old_name: str, new_name: str) -> "Permutation":
"""
Rename a column in the permutation
If there is no column named old_name then an error will be raised
If there is already a column named new_name then an error will be raised
Note: this does not actually modify the underlying dataset. It only changes
the name of the column that is visible from this permutation. This is a
post-processing step but done at the batch level and so it is very cheap.
No data will be copied.
"""
assert old_name is not None, "old_name is required"
assert new_name is not None, "new_name is required"
if old_name not in self.selection:
raise ValueError(
f"Cannot rename column `{old_name}` because it does not exist"
)
if new_name in self.selection:
raise ValueError(
f"Cannot rename column `{old_name}` to `{new_name}` because a column "
"with that name already exists"
)
new_selection = self.selection.copy()
new_selection[new_name] = new_selection[old_name]
del new_selection[old_name]
return self._with_selection(new_selection)
def rename_columns(self, column_map: dict[str, str]) -> "Permutation":
"""
Rename the given columns in the permutation
If any of the columns do not exist then an error will be raised
If any of the new names already exist then an error will be raised
Note: this does not actually modify the underlying dataset. It only changes
the name of the column that is visible from this permutation. This is a
post-processing step but done at the batch level and so it is very cheap.
No data will be copied.
"""
assert column_map is not None, "column_map is required"
new_permutation = self
for old_name, new_name in column_map.items():
new_permutation = new_permutation.rename_column(old_name, new_name)
return new_permutation
def select_columns(self, columns: list[str]) -> "Permutation":
"""
Select the given columns from the permutation
This method refines the current selection, potentially removing columns. It
will not add back columns that were previously removed.
If any of the columns do not exist then an error will be raised
This does not introduce a post-processing step. It simply reduces the amount
of data we read.
"""
assert columns is not None, "columns is required"
if len(columns) == 0:
raise ValueError("Must select at least one column")
new_selection = {}
for name in columns:
value = self.selection.get(name, None)
if value is None:
raise ValueError(
f"Cannot select column `{name}` because it does not exist"
)
new_selection[name] = value
return self._with_selection(new_selection)
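The column operations above all work on a selection mapping from output column name to source column. The sketch below mirrors that behavior on plain dictionaries; the function names match the methods for readability, but this is a standalone illustration and not the class itself.

```python
from typing import Dict, List

Selection = Dict[str, str]  # output column name -> source column


def rename_column(selection: Selection, old_name: str, new_name: str) -> Selection:
    """Return a copy of the selection with one output name changed."""
    if old_name not in selection:
        raise ValueError(f"column `{old_name}` does not exist")
    if new_name in selection:
        raise ValueError(f"column `{new_name}` already exists")
    renamed = dict(selection)
    renamed[new_name] = renamed.pop(old_name)
    return renamed


def select_columns(selection: Selection, columns: List[str]) -> Selection:
    """Return a copy restricted to the given output names."""
    if not columns:
        raise ValueError("Must select at least one column")
    missing = [name for name in columns if name not in selection]
    if missing:
        raise ValueError(f"columns do not exist: {missing}")
    return {name: selection[name] for name in columns}


base = {"id": "id", "text": "text", "label": "label"}
renamed = rename_column(base, "label", "target")
picked = select_columns(renamed, ["target", "id"])
```

Each operation returns a new mapping and leaves its input untouched, which matches the "returns a new permutation" contract described in the docstrings.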
def __iter__(self) -> Iterator[dict[str, Any]]:
"""
Iterate over the permutation
"""
return self.iter(self.batch_size, skip_last_batch=True)
def iter(
self, batch_size: int, skip_last_batch: bool = False
) -> Iterator[dict[str, Any]]:
"""
Iterate over the permutation in batches
If skip_last_batch is True, the last batch will be skipped if it contains
fewer than batch_size rows.
"""
async def get_iter():
return await self.reader.read(self.selection, batch_size=batch_size)
async_iter = LOOP.run(get_iter())
async def get_next():
return await async_iter.__anext__()
try:
while True:
batch = LOOP.run(get_next())
if batch.num_rows == batch_size or not skip_last_batch:
yield self.transform_fn(batch)
except StopAsyncIteration:
return
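The effect of `skip_last_batch` can be shown on an in-memory sequence. This is a minimal sketch with a hypothetical `iter_batches` helper; the real method reads batches from the permutation reader.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    rows: Sequence[T], batch_size: int, skip_last_batch: bool = False
) -> Iterator[List[T]]:
    """Yield fixed-size batches, optionally dropping a short final batch."""
    for start in range(0, len(rows), batch_size):
        batch = list(rows[start : start + batch_size])
        # Only a trailing batch can be short; drop it when asked to.
        if len(batch) == batch_size or not skip_last_batch:
            yield batch


all_batches = list(iter_batches(range(10), batch_size=4))
full_batches = list(iter_batches(range(10), batch_size=4, skip_last_batch=True))
```

Since `__iter__` passes `skip_last_batch=True`, plain iteration over a permutation only yields full batches, which keeps batch shapes uniform for training loops.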
def with_format(
self, format: Literal["numpy", "python", "pandas", "arrow", "torch", "polars"]
) -> "Permutation":
"""
Set the format for batches
If this method is not called, the "python" format will be used.
The format can be one of:
- "numpy" - the batch will be a dict of numpy arrays (one per column)
- "python" - the batch will be a dict of lists (one per column)
- "pandas" - the batch will be a pandas DataFrame
- "arrow" - the batch will be a pyarrow RecordBatch
- "torch" - the batch will be a two dimensional torch tensor
- "polars" - the batch will be a polars DataFrame
Conversion may or may not involve a data copy. Lance uses Arrow internally,
so conversion to the arrow and polars formats is zero-copy.
Conversion to torch will be zero-copy but will only support a subset of data
types (numeric types).
Conversion to numpy and/or pandas will typically be zero-copy for numeric
types. Conversion of strings, lists, and structs will require creating python
objects and this is not zero-copy.
For custom formatting, use [with_transform](#with_transform), which replaces
any format set by this method.
"""
assert format is not None, "format is required"
if format == "python":
return self.with_transform(Transforms.arrow2python)
elif format == "numpy":
return self.with_transform(Transforms.arrow2numpy)
elif format == "pandas":
return self.with_transform(Transforms.arrow2pandas)
elif format == "arrow":
return self.with_transform(Transforms.arrow2arrow)
elif format == "torch":
return self.with_transform(batch_to_tensor)
elif format == "polars":
return self.with_transform(Transforms.arrow2polars())
else:
raise ValueError(f"Invalid format: {format}")
def with_transform(self, transform: Callable[[pa.RecordBatch], Any]) -> "Permutation":
"""
Set a custom transform for the permutation
The transform is a callable that will be invoked with each record batch. The
return value will be used as the batch for iteration.
Note: transforms are not invoked in parallel. This method is not a good place
for expensive operations such as image decoding.
"""
assert transform is not None, "transform is required"
return Permutation(self.reader, self.selection, self.batch_size, transform)
def __getitem__(self, index: int) -> Any:
"""
Return a single row from the permutation
The output will always be a python dictionary regardless of the format.
This method is mostly useful for debugging and exploration. For actual
processing use [iter](#iter) or a torch data loader to perform batched
processing.
"""
pass
@deprecated(details="Use with_skip instead")
def skip(self, skip: int) -> "Permutation":
"""
Skip the first `skip` rows of the permutation
Note: this method returns a new permutation and does not modify `self`
It is provided for compatibility with the huggingface Dataset API.
Use [with_skip](#with_skip) instead to avoid confusion.
"""
return self.with_skip(skip)
def with_skip(self, skip: int) -> "Permutation":
"""
Skip the first `skip` rows of the permutation
"""
async def do_with_skip():
reader = await self.reader.with_offset(skip)
return self._with_reader(reader)
return LOOP.run(do_with_skip())
@deprecated(details="Use with_take instead")
def take(self, limit: int) -> "Permutation":
"""
Limit the permutation to `limit` rows (following any `skip`)
Note: this method returns a new permutation and does not modify `self`
It is provided for compatibility with the huggingface Dataset API.
Use [with_take](#with_take) instead to avoid confusion.
"""
return self.with_take(limit)
def with_take(self, limit: int) -> "Permutation":
"""
Limit the permutation to `limit` rows (following any `skip`)
"""
async def do_with_take():
reader = await self.reader.with_limit(limit)
return self._with_reader(reader)
return LOOP.run(do_with_take())
@deprecated(details="Use with_repeat instead")
def repeat(self, times: int) -> "Permutation":
"""
Repeat the permutation `times` times
Note: this method returns a new permutation and does not modify `self`
It is provided for compatibility with the huggingface Dataset API.
Use [with_repeat](#with_repeat) instead to avoid confusion.
"""
return self.with_repeat(times)
def with_repeat(self, times: int) -> "Permutation":
"""
Repeat the permutation `times` times
"""
raise NotImplementedError("with_repeat is not yet implemented")

View File

@@ -37,7 +37,7 @@ from .rerankers.base import Reranker
from .rerankers.rrf import RRFReranker
from .rerankers.util import check_reranker_result
from .util import flatten_columns
from lancedb._lancedb import fts_query_to_json
from typing_extensions import Annotated
if TYPE_CHECKING:
@@ -124,6 +124,24 @@ class FullTextQuery(ABC):
"""
pass
def to_json(self) -> str:
"""
Convert the query to a JSON string.
Returns
-------
str
A JSON string representation of the query.
Examples
--------
>>> from lancedb.query import MatchQuery
>>> query = MatchQuery("puppy", "text", fuzziness=2)
>>> query.to_json()
'{"match":{"column":"text","terms":"puppy","boost":1.0,"fuzziness":2,"max_expansions":50,"operator":"Or","prefix_length":0}}'
"""
return fts_query_to_json(self)
def __and__(self, other: "FullTextQuery") -> "FullTextQuery":
"""
Combine two queries with a logical AND operation.
@@ -288,6 +306,8 @@ class BooleanQuery(FullTextQuery):
----------
queries : list[tuple[Occur, FullTextQuery]]
The list of queries with their occurrence requirements.
Each tuple contains an Occur value (MUST, SHOULD, or MUST_NOT)
and a FullTextQuery to apply.
"""
queries: list[tuple[Occur, FullTextQuery]]
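
The compact layout shown in the `to_json` docstring can be reproduced with the standard library, which helps when checking or parsing the output by hand. This is a sketch only: the real serialization is done by `fts_query_to_json` in the native extension, and `match_query_json` below is a hypothetical helper that mimics the field order and defaults visible in the docstring example.

```python
import json


def match_query_json(terms: str, column: str, *, boost: float = 1.0,
                     fuzziness: int = 0, max_expansions: int = 50,
                     operator: str = "Or", prefix_length: int = 0) -> str:
    """Hypothetical helper that mimics the compact layout in the docstring."""
    payload = {
        "match": {
            "column": column,
            "terms": terms,
            "boost": boost,
            "fuzziness": fuzziness,
            "max_expansions": max_expansions,
            "operator": operator,
            "prefix_length": prefix_length,
        }
    }
    # Compact separators: no spaces after "," or ":".
    return json.dumps(payload, separators=(",", ":"))


json_str = match_query_json("puppy", "text", fuzziness=2)
print(json_str)

# The string round-trips through a standard JSON parser.
parsed = json.loads(json_str)
print(parsed["match"]["fuzziness"])  # 2
```

Because the result is plain JSON, any consumer that accepts a JSON string can parse it without depending on the Python query classes.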


@@ -21,6 +21,8 @@ class VoyageAIReranker(Reranker):
----------
model_name : str, default "rerank-english-v2.0"
The name of the cross encoder model to use. Available voyageai models are:
- rerank-2.5
- rerank-2.5-lite
- rerank-2
- rerank-2-lite
column : str, default "text"


@@ -366,3 +366,56 @@ def add_note(base_exception: BaseException, note: str):
)
else:
raise ValueError("Cannot add note to exception")
def tbl_to_tensor(tbl: pa.Table):
"""
Convert a PyArrow Table to a PyTorch Tensor.
Each column is converted to a tensor (using zero-copy via DLPack)
and the columns are then stacked into a single tensor.
Fails if torch is not installed.
Fails if any column has more than one chunk.
Fails if a column's data type is not supported by PyTorch.
Parameters
----------
tbl : pa.Table
The table to convert to a tensor.
Returns
-------
torch.Tensor: The tensor containing the columns of the table.
"""
torch = attempt_import_or_raise("torch", "torch")
def to_tensor(col: pa.ChunkedArray):
if col.num_chunks > 1:
raise Exception("Single batch was too large to fit into a one-chunk table")
return torch.from_dlpack(col.chunk(0))
return torch.stack([to_tensor(tbl.column(i)) for i in range(tbl.num_columns)])
def batch_to_tensor(batch: pa.RecordBatch):
"""
Convert a PyArrow RecordBatch to a PyTorch Tensor.
Each column is converted to a tensor (using zero-copy via DLPack)
and the columns are then stacked into a single tensor.
Fails if torch is not installed.
Fails if a column's data type is not supported by PyTorch.
Parameters
----------
batch : pa.RecordBatch
The record batch to convert to a tensor.
Returns
-------
torch.Tensor: The tensor containing the columns of the record batch.
"""
torch = attempt_import_or_raise("torch", "torch")
return torch.stack([torch.from_dlpack(col) for col in batch.columns])
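
Both helpers stack one tensor per column, so the result is laid out as `(num_columns, num_rows)` and not as `(num_rows, num_columns)`; the torch assertion in `test_transform_fn` checks for shape `(2, 10)` on a 10-row, 2-column table. The sketch below illustrates that layout with plain lists so it runs without torch or pyarrow. `stack_columns` and `transpose` are hypothetical stand-ins, not part of the library.

```python
from typing import Dict, List


def stack_columns(columns: Dict[str, List[int]]) -> List[List[int]]:
    """Stand-in for stacking per-column tensors: one output row per column."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return [list(values) for values in columns.values()]


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    """Turn the column-major result into one row per record."""
    return [list(row) for row in zip(*matrix)]


table = {"id": [0, 1, 2], "value": [10, 11, 12]}
stacked = stack_columns(table)
print(len(stacked), len(stacked[0]))   # 2 3
print(transpose(stacked)[0])           # [0, 10]
```

With real tensors, transposing the stacked result gives the more common one-row-per-record orientation.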


@@ -532,6 +532,27 @@ def test_voyageai_embedding_function():
assert len(tbl.to_pandas()["vector"][0]) == voyageai.ndims()
@pytest.mark.slow
@pytest.mark.skipif(
os.environ.get("VOYAGE_API_KEY") is None, reason="VOYAGE_API_KEY not set"
)
def test_voyageai_embedding_function_contextual_model():
voyageai = (
get_registry().get("voyageai").create(name="voyage-context-3", max_retries=0)
)
class TextModel(LanceModel):
text: str = voyageai.SourceField()
vector: Vector(voyageai.ndims()) = voyageai.VectorField()
df = pd.DataFrame({"text": ["hello world", "goodbye world"]})
db = lancedb.connect("~/lancedb")
tbl = db.create_table("test", schema=TextModel, mode="overwrite")
tbl.add(df)
assert len(tbl.to_pandas()["vector"][0]) == voyageai.ndims()
@pytest.mark.slow
@pytest.mark.skipif(
os.environ.get("VOYAGE_API_KEY") is None, reason="VOYAGE_API_KEY not set"


@@ -20,7 +20,14 @@ from unittest import mock
import lancedb as ldb
from lancedb.db import DBConnection
from lancedb.index import FTS
from lancedb.query import BoostQuery, MatchQuery, MultiMatchQuery, PhraseQuery
from lancedb.query import (
BoostQuery,
MatchQuery,
MultiMatchQuery,
PhraseQuery,
BooleanQuery,
Occur,
)
import numpy as np
import pyarrow as pa
import pandas as pd
@@ -727,3 +734,146 @@ def test_fts_ngram(mem_db: DBConnection):
results = table.search("la", query_type="fts").limit(10).to_list()
assert len(results) == 2
assert set(r["text"] for r in results) == {"lance database", "lance is cool"}
def test_fts_query_to_json():
"""Test that FTS query to_json() produces valid JSON strings with exact format."""
# Test MatchQuery - basic
match_query = MatchQuery("hello world", "text")
json_str = match_query.to_json()
expected = (
'{"match":{"column":"text","terms":"hello world","boost":1.0,'
'"fuzziness":0,"max_expansions":50,"operator":"Or","prefix_length":0}}'
)
assert json_str == expected
# Test MatchQuery with options
match_query = MatchQuery("puppy", "text", fuzziness=2, boost=1.5, prefix_length=3)
json_str = match_query.to_json()
expected = (
'{"match":{"column":"text","terms":"puppy","boost":1.5,"fuzziness":2,'
'"max_expansions":50,"operator":"Or","prefix_length":3}}'
)
assert json_str == expected
# Test PhraseQuery
phrase_query = PhraseQuery("quick brown fox", "title")
json_str = phrase_query.to_json()
expected = '{"phrase":{"column":"title","terms":"quick brown fox","slop":0}}'
assert json_str == expected
# Test PhraseQuery with slop
phrase_query = PhraseQuery("quick brown", "title", slop=2)
json_str = phrase_query.to_json()
expected = '{"phrase":{"column":"title","terms":"quick brown","slop":2}}'
assert json_str == expected
# Test BooleanQuery with MUST
must_query = BooleanQuery(
[
(Occur.MUST, MatchQuery("puppy", "text")),
(Occur.MUST, MatchQuery("runs", "text")),
]
)
json_str = must_query.to_json()
expected = (
'{"boolean":{"should":[],"must":[{"match":{"column":"text","terms":"puppy",'
'"boost":1.0,"fuzziness":0,"max_expansions":50,"operator":"Or",'
'"prefix_length":0}},{"match":{"column":"text","terms":"runs","boost":1.0,'
'"fuzziness":0,"max_expansions":50,"operator":"Or","prefix_length":0}}],'
'"must_not":[]}}'
)
assert json_str == expected
# Test BooleanQuery with SHOULD
should_query = BooleanQuery(
[
(Occur.SHOULD, MatchQuery("cat", "text")),
(Occur.SHOULD, MatchQuery("dog", "text")),
]
)
json_str = should_query.to_json()
expected = (
'{"boolean":{"should":[{"match":{"column":"text","terms":"cat","boost":1.0,'
'"fuzziness":0,"max_expansions":50,"operator":"Or","prefix_length":0}},'
'{"match":{"column":"text","terms":"dog","boost":1.0,"fuzziness":0,'
'"max_expansions":50,"operator":"Or","prefix_length":0}}],"must":[],'
'"must_not":[]}}'
)
assert json_str == expected
# Test BooleanQuery with MUST_NOT
must_not_query = BooleanQuery(
[
(Occur.MUST, MatchQuery("puppy", "text")),
(Occur.MUST_NOT, MatchQuery("training", "text")),
]
)
json_str = must_not_query.to_json()
expected = (
'{"boolean":{"should":[],"must":[{"match":{"column":"text","terms":"puppy",'
'"boost":1.0,"fuzziness":0,"max_expansions":50,"operator":"Or",'
'"prefix_length":0}}],"must_not":[{"match":{"column":"text",'
'"terms":"training","boost":1.0,"fuzziness":0,"max_expansions":50,'
'"operator":"Or","prefix_length":0}}]}}'
)
assert json_str == expected
# Test BoostQuery
positive = MatchQuery("puppy", "text")
negative = MatchQuery("training", "text")
boost_query = BoostQuery(positive, negative, negative_boost=0.3)
json_str = boost_query.to_json()
expected = (
'{"boost":{"positive":{"match":{"column":"text","terms":"puppy",'
'"boost":1.0,"fuzziness":0,"max_expansions":50,"operator":"Or",'
'"prefix_length":0}},"negative":{"match":{"column":"text",'
'"terms":"training","boost":1.0,"fuzziness":0,"max_expansions":50,'
'"operator":"Or","prefix_length":0}},"negative_boost":0.3}}'
)
assert json_str == expected
# Test MultiMatchQuery
multi_match = MultiMatchQuery("python", ["tags", "title"])
json_str = multi_match.to_json()
expected = (
'{"multi_match":{"query":"python","columns":["tags","title"],'
'"boost":[1.0,1.0]}}'
)
assert json_str == expected
# Test complex nested BooleanQuery
inner1 = BooleanQuery(
[
(Occur.MUST, MatchQuery("python", "tags")),
(Occur.MUST, MatchQuery("tutorial", "title")),
]
)
inner2 = BooleanQuery(
[
(Occur.MUST, MatchQuery("rust", "tags")),
(Occur.MUST, MatchQuery("guide", "title")),
]
)
complex_query = BooleanQuery(
[
(Occur.SHOULD, inner1),
(Occur.SHOULD, inner2),
]
)
json_str = complex_query.to_json()
expected = (
'{"boolean":{"should":[{"boolean":{"should":[],"must":[{"match":'
'{"column":"tags","terms":"python","boost":1.0,"fuzziness":0,'
'"max_expansions":50,"operator":"Or","prefix_length":0}},{"match":'
'{"column":"title","terms":"tutorial","boost":1.0,"fuzziness":0,'
'"max_expansions":50,"operator":"Or","prefix_length":0}}],"must_not":[]}}'
',{"boolean":{"should":[],"must":[{"match":{"column":"tags",'
'"terms":"rust","boost":1.0,"fuzziness":0,"max_expansions":50,'
'"operator":"Or","prefix_length":0}},{"match":{"column":"title",'
'"terms":"guide","boost":1.0,"fuzziness":0,"max_expansions":50,'
'"operator":"Or","prefix_length":0}}],"must_not":[]}}],"must":[],'
'"must_not":[]}}'
)
assert json_str == expected


@@ -59,6 +59,14 @@ class TempNamespace(LanceNamespace):
root
] # Reference to shared namespaces
def namespace_id(self) -> str:
"""Return a human-readable unique identifier for this namespace instance.
Returns:
A unique identifier string based on the root directory
"""
return f"TempNamespace {{ root: '{self.config.root}' }}"
def list_tables(self, request: ListTablesRequest) -> ListTablesResponse:
"""List all tables in the namespace."""
if not request.id:
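
The doubled braces in `namespace_id` are f-string escapes, so the returned identifier contains single literal braces. A minimal standalone illustration, with a free function in place of the method and `/tmp/example` as a made-up root:

```python
def namespace_id(root: str) -> str:
    # "{{" and "}}" are literal braces; "{root}" is interpolated.
    return f"TempNamespace {{ root: '{root}' }}"


print(namespace_id("/tmp/example"))  # TempNamespace { root: '/tmp/example' }
```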


@@ -2,9 +2,26 @@
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import pyarrow as pa
import math
import pytest
from lancedb.permutation import permutation_builder
from lancedb import DBConnection, Table, connect
from lancedb.permutation import Permutation, Permutations, permutation_builder
def test_permutation_persistence(tmp_path):
db = connect(tmp_path)
tbl = db.create_table("test_table", pa.table({"x": range(100), "y": range(100)}))
permutation_tbl = (
permutation_builder(tbl).shuffle().persist(db, "test_permutation").execute()
)
assert permutation_tbl.count_rows() == 100
re_open = db.open_table("test_permutation")
assert re_open.count_rows() == 100
assert permutation_tbl.to_arrow() == re_open.to_arrow()
def test_split_random_ratios(mem_db):
@@ -195,21 +212,33 @@ def test_split_error_cases(mem_db):
tbl = mem_db.create_table("test_table", pa.table({"x": range(10), "y": range(10)}))
# Test split_random with no parameters
with pytest.raises(Exception):
with pytest.raises(
ValueError,
match="Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
):
permutation_builder(tbl).split_random().execute()
# Test split_random with multiple parameters
with pytest.raises(Exception):
with pytest.raises(
ValueError,
match="Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
):
permutation_builder(tbl).split_random(
ratios=[0.5, 0.5], counts=[5, 5]
).execute()
# Test split_sequential with no parameters
with pytest.raises(Exception):
with pytest.raises(
ValueError,
match="Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
):
permutation_builder(tbl).split_sequential().execute()
# Test split_sequential with multiple parameters
with pytest.raises(Exception):
with pytest.raises(
ValueError,
match="Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
):
permutation_builder(tbl).split_sequential(ratios=[0.5, 0.5], fixed=2).execute()
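
The tightened assertions above pin down an "exactly one of" rule for the split size arguments. The actual check lives in the native binding; the following is a hypothetical pure-Python equivalent, shown only to make the rule concrete. `check_split_args` is not a library function.

```python
from typing import List, Optional


def check_split_args(ratios: Optional[List[float]] = None,
                     counts: Optional[List[int]] = None,
                     fixed: Optional[int] = None) -> str:
    """Hypothetical validator: return the name of the single argument given."""
    provided = [
        name
        for name, value in (("ratios", ratios), ("counts", counts), ("fixed", fixed))
        if value is not None
    ]
    if len(provided) != 1:
        raise ValueError(
            "Exactly one of 'ratios', 'counts', or 'fixed' must be provided"
        )
    return provided[0]


print(check_split_args(counts=[5, 5]))  # counts
```

Passing none of the arguments, or more than one, raises `ValueError` with the message the tests match against.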
@@ -460,3 +489,455 @@ def test_filter_empty_result(mem_db):
)
assert permutation_tbl.count_rows() == 0
@pytest.fixture
def mem_db() -> DBConnection:
return connect("memory:///")
@pytest.fixture
def some_table(mem_db: DBConnection) -> Table:
data = pa.table(
{
"id": range(1000),
"value": range(1000),
}
)
return mem_db.create_table("some_table", data)
def test_no_split_names(some_table: Table):
perm_tbl = (
permutation_builder(some_table).split_sequential(counts=[500, 500]).execute()
)
permutations = Permutations(some_table, perm_tbl)
assert permutations.split_names == []
assert permutations.split_dict == {}
assert permutations[0].num_rows == 500
assert permutations[1].num_rows == 500
@pytest.fixture
def some_perm_table(some_table: Table) -> Table:
return (
permutation_builder(some_table)
.split_random(ratios=[0.95, 0.05], seed=42, split_names=["train", "test"])
.shuffle(seed=42)
.execute()
)
def test_nonexistent_split(some_table: Table, some_perm_table: Table):
# Reference by name and name does not exist
with pytest.raises(ValueError, match="split `nonexistent` is not defined"):
Permutation.from_tables(some_table, some_perm_table, "nonexistent")
# Reference by ordinal and there are no rows
with pytest.raises(ValueError, match="No rows found"):
Permutation.from_tables(some_table, some_perm_table, 5)
def test_permutations(some_table: Table, some_perm_table: Table):
permutations = Permutations(some_table, some_perm_table)
assert permutations.split_names == ["train", "test"]
assert permutations.split_dict == {"train": 0, "test": 1}
assert permutations["train"].num_rows == 950
assert permutations[0].num_rows == 950
assert permutations["test"].num_rows == 50
assert permutations[1].num_rows == 50
with pytest.raises(ValueError, match="No split named `nonexistent` found"):
permutations["nonexistent"]
with pytest.raises(ValueError, match="No rows found"):
permutations[5]
@pytest.fixture
def some_permutation(some_table: Table, some_perm_table: Table) -> Permutation:
return Permutation.from_tables(some_table, some_perm_table)
def test_num_rows(some_permutation: Permutation):
assert some_permutation.num_rows == 950
def test_num_columns(some_permutation: Permutation):
assert some_permutation.num_columns == 2
def test_column_names(some_permutation: Permutation):
assert some_permutation.column_names == ["id", "value"]
def test_shape(some_permutation: Permutation):
assert some_permutation.shape == (950, 2)
def test_schema(some_permutation: Permutation):
assert some_permutation.schema == pa.schema(
[("id", pa.int64()), ("value", pa.int64())]
)
def test_limit_offset(some_permutation: Permutation):
assert some_permutation.with_take(100).num_rows == 100
assert some_permutation.with_skip(100).num_rows == 850
assert some_permutation.with_take(100).with_skip(100).num_rows == 100
with pytest.raises(Exception):
some_permutation.with_take(1000000).num_rows
with pytest.raises(Exception):
some_permutation.with_skip(1000000).num_rows
with pytest.raises(Exception):
some_permutation.with_take(500).with_skip(500).num_rows
with pytest.raises(Exception):
some_permutation.with_skip(500).with_take(500).num_rows
def test_remove_columns(some_permutation: Permutation):
assert some_permutation.remove_columns(["value"]).schema == pa.schema(
[("id", pa.int64())]
)
# Should not modify the original permutation
assert some_permutation.schema.names == ["id", "value"]
# Cannot remove all columns
with pytest.raises(ValueError, match="Cannot remove all columns"):
some_permutation.remove_columns(["id", "value"])
def test_rename_column(some_permutation: Permutation):
assert some_permutation.rename_column("value", "new_value").schema == pa.schema(
[("id", pa.int64()), ("new_value", pa.int64())]
)
# Should not modify the original permutation
assert some_permutation.schema.names == ["id", "value"]
# Cannot rename to an existing column
with pytest.raises(
ValueError,
match="a column with that name already exists",
):
some_permutation.rename_column("value", "id")
# Cannot rename a non-existent column
with pytest.raises(
ValueError,
match="does not exist",
):
some_permutation.rename_column("non_existent", "new_value")
def test_rename_columns(some_permutation: Permutation):
assert some_permutation.rename_columns({"value": "new_value"}).schema == pa.schema(
[("id", pa.int64()), ("new_value", pa.int64())]
)
# Should not modify the original permutation
assert some_permutation.schema.names == ["id", "value"]
# Cannot rename to an existing column
with pytest.raises(ValueError, match="a column with that name already exists"):
some_permutation.rename_columns({"value": "id"})
def test_select_columns(some_permutation: Permutation):
assert some_permutation.select_columns(["id"]).schema == pa.schema(
[("id", pa.int64())]
)
# Should not modify the original permutation
assert some_permutation.schema.names == ["id", "value"]
# Cannot select a non-existent column
with pytest.raises(ValueError, match="does not exist"):
some_permutation.select_columns(["non_existent"])
# Empty selection is not allowed
with pytest.raises(ValueError, match="select at least one column"):
some_permutation.select_columns([])
def test_iter_basic(some_permutation: Permutation):
"""Test basic iteration with custom batch size."""
batch_size = 100
batches = list(some_permutation.iter(batch_size, skip_last_batch=False))
# Check that we got the expected number of batches
expected_batches = (950 + batch_size - 1) // batch_size # ceiling division
assert len(batches) == expected_batches
# Check that all batches are dicts (default python format)
assert all(isinstance(batch, dict) for batch in batches)
# Check that batches have the correct structure
for batch in batches:
assert "id" in batch
assert "value" in batch
assert isinstance(batch["id"], list)
assert isinstance(batch["value"], list)
# Check that all batches except the last have the correct size
for batch in batches[:-1]:
assert len(batch["id"]) == batch_size
assert len(batch["value"]) == batch_size
# Last batch might be smaller
assert len(batches[-1]["id"]) <= batch_size
def test_iter_skip_last_batch(some_permutation: Permutation):
"""Test iteration with skip_last_batch=True."""
batch_size = 300
batches_with_skip = list(some_permutation.iter(batch_size, skip_last_batch=True))
batches_without_skip = list(
some_permutation.iter(batch_size, skip_last_batch=False)
)
# With skip_last_batch=True, we should have fewer batches if the last one is partial
num_full_batches = 950 // batch_size
assert len(batches_with_skip) == num_full_batches
# Without skip_last_batch, we should have one more batch if there's a remainder
if 950 % batch_size != 0:
assert len(batches_without_skip) == num_full_batches + 1
# Last batch should be smaller
assert len(batches_without_skip[-1]["id"]) == 950 % batch_size
# All batches with skip_last_batch should be full size
for batch in batches_with_skip:
assert len(batch["id"]) == batch_size
def test_iter_different_batch_sizes(some_permutation: Permutation):
"""Test iteration with different batch sizes."""
# Test with small batch size
small_batches = list(some_permutation.iter(100, skip_last_batch=False))
assert len(small_batches) == 10 # ceiling(950 / 100)
# Test with large batch size
large_batches = list(some_permutation.iter(400, skip_last_batch=False))
assert len(large_batches) == 3 # ceiling(950 / 400)
# Test with batch size equal to total rows
single_batch = list(some_permutation.iter(950, skip_last_batch=False))
assert len(single_batch) == 1
assert len(single_batch[0]["id"]) == 950
# Test with batch size larger than total rows
oversized_batch = list(some_permutation.iter(10000, skip_last_batch=False))
assert len(oversized_batch) == 1
assert len(oversized_batch[0]["id"]) == 950
def test_dunder_iter(some_permutation: Permutation):
"""Test the __iter__ method."""
# __iter__ should use DEFAULT_BATCH_SIZE (100) and skip_last_batch=True
batches = list(some_permutation)
# With DEFAULT_BATCH_SIZE=100 and skip_last_batch=True, we should get 9 batches
assert len(batches) == 9 # floor(950 / 100) since skip_last_batch=True
# All batches should be full size
for batch in batches:
assert len(batch["id"]) == 100
assert len(batch["value"]) == 100
some_permutation = some_permutation.with_batch_size(400)
batches = list(some_permutation)
assert len(batches) == 2 # floor(950 / 400) since skip_last_batch=True
for batch in batches:
assert len(batch["id"]) == 400
assert len(batch["value"]) == 400
def test_iter_with_different_formats(some_permutation: Permutation):
"""Test iteration with different output formats."""
batch_size = 100
# Test with arrow format
arrow_perm = some_permutation.with_format("arrow")
arrow_batches = list(arrow_perm.iter(batch_size, skip_last_batch=False))
assert all(isinstance(batch, pa.RecordBatch) for batch in arrow_batches)
# Test with python format (default)
python_perm = some_permutation.with_format("python")
python_batches = list(python_perm.iter(batch_size, skip_last_batch=False))
assert all(isinstance(batch, dict) for batch in python_batches)
# Test with pandas format
pandas_perm = some_permutation.with_format("pandas")
pandas_batches = list(pandas_perm.iter(batch_size, skip_last_batch=False))
# Import pandas to check the type
import pandas as pd
assert all(isinstance(batch, pd.DataFrame) for batch in pandas_batches)
def test_iter_with_column_selection(some_permutation: Permutation):
"""Test iteration after column selection."""
# Select only the id column
id_only = some_permutation.select_columns(["id"])
batches = list(id_only.iter(100, skip_last_batch=False))
# Check that batches only contain the id column
for batch in batches:
assert "id" in batch
assert "value" not in batch
def test_iter_with_column_rename(some_permutation: Permutation):
"""Test iteration after renaming columns."""
renamed = some_permutation.rename_column("value", "data")
batches = list(renamed.iter(100, skip_last_batch=False))
# Check that batches have the renamed column
for batch in batches:
assert "id" in batch
assert "data" in batch
assert "value" not in batch
def test_iter_with_limit_offset(some_permutation: Permutation):
"""Test iteration with limit and offset."""
# Test with offset
offset_perm = some_permutation.with_skip(100)
offset_batches = list(offset_perm.iter(100, skip_last_batch=False))
# Should have 850 rows (950 - 100)
expected_batches = math.ceil(850 / 100)
assert len(offset_batches) == expected_batches
# Test with limit
limit_perm = some_permutation.with_take(500)
limit_batches = list(limit_perm.iter(100, skip_last_batch=False))
# Should have 5 batches (500 / 100)
assert len(limit_batches) == 5
no_skip = some_permutation.iter(101, skip_last_batch=False)
row_100 = next(no_skip)["id"][100]
# Test with both limit and offset
limited_perm = some_permutation.with_skip(100).with_take(300)
limited_batches = list(limited_perm.iter(100, skip_last_batch=False))
# Should have 3 batches (300 / 100)
assert len(limited_batches) == 3
assert limited_batches[0]["id"][0] == row_100
def test_iter_empty_permutation(mem_db):
"""Test iteration over an empty permutation."""
# Create a table and filter it to be empty
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(10), "value": range(10)})
)
permutation_tbl = permutation_builder(tbl).filter("value > 100").execute()
with pytest.raises(ValueError, match="No rows found"):
Permutation.from_tables(tbl, permutation_tbl)
def test_iter_single_row(mem_db):
"""Test iteration over a permutation with a single row."""
tbl = mem_db.create_table("test_table", pa.table({"id": [42], "value": [100]}))
permutation_tbl = permutation_builder(tbl).execute()
perm = Permutation.from_tables(tbl, permutation_tbl)
# With skip_last_batch=False, should get one batch
batches = list(perm.iter(10, skip_last_batch=False))
assert len(batches) == 1
assert len(batches[0]["id"]) == 1
# With skip_last_batch=True, should skip the single row (since it's < batch_size)
batches_skip = list(perm.iter(10, skip_last_batch=True))
assert len(batches_skip) == 0
def test_identity_permutation(mem_db):
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(10), "value": range(10)})
)
permutation = Permutation.identity(tbl)
assert permutation.num_rows == 10
assert permutation.num_columns == 2
batches = list(permutation.iter(10, skip_last_batch=False))
assert len(batches) == 1
assert len(batches[0]["id"]) == 10
assert len(batches[0]["value"]) == 10
permutation = permutation.remove_columns(["value"])
assert permutation.num_columns == 1
assert permutation.schema == pa.schema([("id", pa.int64())])
assert permutation.column_names == ["id"]
assert permutation.shape == (10, 1)
def test_transform_fn(mem_db):
import numpy as np
import pandas as pd
import polars as pl
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(10), "value": range(10)})
)
permutation = Permutation.identity(tbl)
np_result = list(permutation.with_format("numpy").iter(10, skip_last_batch=False))[
0
]
assert np_result.shape == (10, 2)
assert np_result.dtype == np.int64
assert isinstance(np_result, np.ndarray)
pd_result = list(permutation.with_format("pandas").iter(10, skip_last_batch=False))[
0
]
assert pd_result.shape == (10, 2)
assert pd_result.dtypes.tolist() == [np.int64, np.int64]
assert isinstance(pd_result, pd.DataFrame)
pl_result = list(permutation.with_format("polars").iter(10, skip_last_batch=False))[
0
]
assert pl_result.shape == (10, 2)
assert pl_result.dtypes == [pl.Int64, pl.Int64]
assert isinstance(pl_result, pl.DataFrame)
py_result = list(permutation.with_format("python").iter(10, skip_last_batch=False))[
0
]
assert len(py_result) == 2
assert len(py_result["id"]) == 10
assert len(py_result["value"]) == 10
assert isinstance(py_result, dict)
try:
import torch
torch_result = list(
permutation.with_format("torch").iter(10, skip_last_batch=False)
)[0]
assert torch_result.shape == (2, 10)
assert torch_result.dtype == torch.int64
assert isinstance(torch_result, torch.Tensor)
except ImportError:
# Skip check if torch is not installed
pass
arrow_result = list(
permutation.with_format("arrow").iter(10, skip_last_batch=False)
)[0]
assert arrow_result.shape == (10, 2)
assert arrow_result.schema == pa.schema([("id", pa.int64()), ("value", pa.int64())])
assert isinstance(arrow_result, pa.RecordBatch)
def test_custom_transform(mem_db):
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(10), "value": range(10)})
)
permutation = Permutation.identity(tbl)
def transform(batch: pa.RecordBatch) -> pa.RecordBatch:
return batch.select(["id"])
transformed = permutation.with_transform(transform)
batches = list(transformed.iter(10, skip_last_batch=False))
assert len(batches) == 1
batch = batches[0]
assert batch == pa.record_batch([range(10)], ["id"])
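
The batch counts asserted in the iteration tests above follow from simple arithmetic: floor division when the trailing partial batch is skipped, ceiling division otherwise. A small sketch, where `expected_batches` is a hypothetical helper and not part of the library:

```python
import math


def expected_batches(num_rows: int, batch_size: int, skip_last_batch: bool) -> int:
    """Number of batches an iterator should yield for the given settings."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if skip_last_batch:
        # Only full batches are yielded.
        return num_rows // batch_size
    # A trailing partial batch counts as one more batch.
    return math.ceil(num_rows / batch_size)


print(expected_batches(950, 100, skip_last_batch=False))  # 10
print(expected_batches(950, 100, skip_last_batch=True))   # 9
print(expected_batches(1, 10, skip_last_batch=True))      # 0
```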


@@ -484,7 +484,7 @@ def test_jina_reranker(tmp_path, use_tantivy):
@pytest.mark.parametrize("use_tantivy", [True, False])
def test_voyageai_reranker(tmp_path, use_tantivy):
pytest.importorskip("voyageai")
reranker = VoyageAIReranker(model_name="rerank-2")
reranker = VoyageAIReranker(model_name="rerank-2.5")
table, schema = get_test_table(tmp_path, use_tantivy)
_run_test_reranker(reranker, table, "single player experience", None, schema)


@@ -3,19 +3,11 @@
import pyarrow as pa
import pytest
from lancedb.util import tbl_to_tensor
torch = pytest.importorskip("torch")
def tbl_to_tensor(tbl):
def to_tensor(col: pa.ChunkedArray):
if col.num_chunks > 1:
raise Exception("Single batch was too large to fit into a one-chunk table")
return torch.from_dlpack(col.chunk(0))
return torch.stack([to_tensor(tbl.column(i)) for i in range(tbl.num_columns)])
def test_table_dataloader(mem_db):
table = mem_db.create_table("test_table", pa.table({"a": range(1000)}))
dataloader = torch.utils.data.DataLoader(


@@ -6,7 +6,7 @@ use std::{collections::HashMap, sync::Arc, time::Duration};
use arrow::{datatypes::Schema, ffi_stream::ArrowArrayStreamReader, pyarrow::FromPyArrow};
use lancedb::{
connection::Connection as LanceConnection,
database::{CreateTableMode, ReadConsistency},
database::{CreateTableMode, Database, ReadConsistency},
};
use pyo3::{
exceptions::{PyRuntimeError, PyValueError},
@@ -42,6 +42,10 @@ impl Connection {
_ => Err(PyValueError::new_err(format!("Invalid mode {}", mode))),
}
}
pub fn database(&self) -> PyResult<Arc<dyn Database>> {
Ok(self.get_inner()?.database().clone())
}
}
#[pymethods]


@@ -5,7 +5,7 @@ use arrow::RecordBatchStream;
use connection::{connect, Connection};
use env_logger::Env;
use index::IndexConfig;
use permutation::PyAsyncPermutationBuilder;
use permutation::{PyAsyncPermutationBuilder, PyPermutationReader};
use pyo3::{
pymodule,
types::{PyModule, PyModuleMethods},
@@ -52,9 +52,11 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_class::<DropColumnsResult>()?;
m.add_class::<UpdateResult>()?;
m.add_class::<PyAsyncPermutationBuilder>()?;
m.add_class::<PyPermutationReader>()?;
m.add_function(wrap_pyfunction!(connect, m)?)?;
m.add_function(wrap_pyfunction!(permutation::async_permutation_builder, m)?)?;
m.add_function(wrap_pyfunction!(util::validate_table_name, m)?)?;
m.add_function(wrap_pyfunction!(query::fts_query_to_json, m)?)?;
m.add("__version__", env!("CARGO_PKG_VERSION"))?;
Ok(())
}


@@ -3,14 +3,23 @@
use std::sync::{Arc, Mutex};
use crate::{error::PythonErrorExt, table::Table};
use lancedb::dataloader::{
permutation::builder::{PermutationBuilder as LancePermutationBuilder, ShuffleStrategy},
permutation::split::{SplitSizes, SplitStrategy},
use crate::{
arrow::RecordBatchStream, connection::Connection, error::PythonErrorExt, table::Table,
};
use arrow::pyarrow::ToPyArrow;
use lancedb::{
dataloader::permutation::{
builder::{PermutationBuilder as LancePermutationBuilder, ShuffleStrategy},
reader::PermutationReader,
split::{SplitSizes, SplitStrategy},
},
query::Select,
};
use pyo3::{
exceptions::PyRuntimeError, pyclass, pymethods, types::PyAnyMethods, Bound, PyAny, PyRefMut,
PyResult,
exceptions::PyRuntimeError,
pyclass, pymethods,
types::{PyAnyMethods, PyDict, PyDictMethods, PyType},
Bound, PyAny, PyRef, PyRefMut, PyResult, Python,
};
use pyo3_async_runtimes::tokio::future_into_py;
@@ -56,13 +65,32 @@ impl PyAsyncPermutationBuilder {
#[pymethods]
impl PyAsyncPermutationBuilder {
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None, seed=None))]
#[pyo3(signature = (database, table_name))]
pub fn persist(
slf: PyRefMut<'_, Self>,
database: Bound<'_, PyAny>,
table_name: String,
) -> PyResult<Self> {
let conn = if database.hasattr("_conn")? {
database
.getattr("_conn")?
.getattr("_inner")?
.downcast_into::<Connection>()?
} else {
database.getattr("_inner")?.downcast_into::<Connection>()?
};
let database = conn.borrow().database()?;
slf.modify(|builder| builder.persist(database, table_name))
}
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None, seed=None, split_names=None))]
pub fn split_random(
slf: PyRefMut<'_, Self>,
ratios: Option<Vec<f64>>,
counts: Option<Vec<u64>>,
fixed: Option<u64>,
seed: Option<u64>,
split_names: Option<Vec<String>>,
) -> PyResult<Self> {
// Check that exactly one split type is provided
let split_args_count = [ratios.is_some(), counts.is_some(), fixed.is_some()]
@@ -86,31 +114,38 @@ impl PyAsyncPermutationBuilder {
unreachable!("One of the split arguments must be provided");
};
slf.modify(|builder| builder.with_split_strategy(SplitStrategy::Random { seed, sizes }))
slf.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Random { seed, sizes }, split_names)
})
}
#[pyo3(signature = (columns, split_weights, *, discard_weight=0))]
#[pyo3(signature = (columns, split_weights, *, discard_weight=0, split_names=None))]
pub fn split_hash(
slf: PyRefMut<'_, Self>,
columns: Vec<String>,
split_weights: Vec<u64>,
discard_weight: u64,
split_names: Option<Vec<String>>,
) -> PyResult<Self> {
slf.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Hash {
columns,
split_weights,
discard_weight,
})
builder.with_split_strategy(
SplitStrategy::Hash {
columns,
split_weights,
discard_weight,
},
split_names,
)
})
}
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None))]
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None, split_names=None))]
pub fn split_sequential(
slf: PyRefMut<'_, Self>,
ratios: Option<Vec<f64>>,
counts: Option<Vec<u64>>,
fixed: Option<u64>,
split_names: Option<Vec<String>>,
) -> PyResult<Self> {
// Check that exactly one split type is provided
let split_args_count = [ratios.is_some(), counts.is_some(), fixed.is_some()]
@@ -134,11 +169,19 @@ impl PyAsyncPermutationBuilder {
unreachable!("One of the split arguments must be provided");
};
slf.modify(|builder| builder.with_split_strategy(SplitStrategy::Sequential { sizes }))
slf.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Sequential { sizes }, split_names)
})
}
pub fn split_calculated(slf: PyRefMut<'_, Self>, calculation: String) -> PyResult<Self> {
slf.modify(|builder| builder.with_split_strategy(SplitStrategy::Calculated { calculation }))
pub fn split_calculated(
slf: PyRefMut<'_, Self>,
calculation: String,
split_names: Option<Vec<String>>,
) -> PyResult<Self> {
slf.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Calculated { calculation }, split_names)
})
}
pub fn shuffle(
@@ -168,3 +211,121 @@ impl PyAsyncPermutationBuilder {
})
}
}
#[pyclass(name = "PermutationReader")]
pub struct PyPermutationReader {
reader: Arc<PermutationReader>,
}
impl PyPermutationReader {
fn from_reader(reader: PermutationReader) -> Self {
Self {
reader: Arc::new(reader),
}
}
fn parse_selection(selection: Option<Bound<'_, PyAny>>) -> PyResult<Select> {
let Some(selection) = selection else {
return Ok(Select::All);
};
let selection = selection.downcast_into::<PyDict>()?;
let selection = selection
.iter()
.map(|(key, value)| {
let key = key.extract::<String>()?;
let value = value.extract::<String>()?;
Ok((key, value))
})
.collect::<PyResult<Vec<_>>>()?;
Ok(Select::dynamic(&selection))
}
}
#[pymethods]
impl PyPermutationReader {
#[classmethod]
pub fn from_tables<'py>(
cls: &Bound<'py, PyType>,
base_table: Bound<'py, PyAny>,
permutation_table: Option<Bound<'py, PyAny>>,
split: u64,
) -> PyResult<Bound<'py, PyAny>> {
let base_table = base_table.getattr("_inner")?.downcast_into::<Table>()?;
let permutation_table = permutation_table
.map(|p| PyResult::Ok(p.getattr("_inner")?.downcast_into::<Table>()?))
.transpose()?;
let base_table = base_table.borrow().inner_ref()?.base_table().clone();
let permutation_table = permutation_table
.map(|p| PyResult::Ok(p.borrow().inner_ref()?.base_table().clone()))
.transpose()?;
future_into_py(cls.py(), async move {
let reader = if let Some(permutation_table) = permutation_table {
PermutationReader::try_from_tables(base_table, permutation_table, split)
.await
.infer_error()?
} else {
PermutationReader::identity(base_table).await
};
Ok(Self::from_reader(reader))
})
}
#[pyo3(signature = (selection=None))]
pub fn output_schema<'py>(
slf: PyRef<'py, Self>,
selection: Option<Bound<'py, PyAny>>,
) -> PyResult<Bound<'py, PyAny>> {
let selection = Self::parse_selection(selection)?;
let reader = slf.reader.clone();
future_into_py(slf.py(), async move {
let schema = reader.output_schema(selection).await.infer_error()?;
Python::with_gil(|py| schema.to_pyarrow(py))
})
}
#[pyo3(signature = ())]
pub fn count_rows<'py>(slf: PyRef<'py, Self>) -> u64 {
slf.reader.count_rows()
}
#[pyo3(signature = (offset))]
pub fn with_offset<'py>(slf: PyRef<'py, Self>, offset: u64) -> PyResult<Bound<'py, PyAny>> {
let reader = slf.reader.as_ref().clone();
future_into_py(slf.py(), async move {
let reader = reader.with_offset(offset).await.infer_error()?;
Ok(Self::from_reader(reader))
})
}
#[pyo3(signature = (limit))]
pub fn with_limit<'py>(slf: PyRef<'py, Self>, limit: u64) -> PyResult<Bound<'py, PyAny>> {
let reader = slf.reader.as_ref().clone();
future_into_py(slf.py(), async move {
let reader = reader.with_limit(limit).await.infer_error()?;
Ok(Self::from_reader(reader))
})
}
#[pyo3(signature = (selection=None, *, batch_size=None))]
pub fn read<'py>(
slf: PyRef<'py, Self>,
selection: Option<Bound<'py, PyAny>>,
batch_size: Option<u32>,
) -> PyResult<Bound<'py, PyAny>> {
let selection = Self::parse_selection(selection)?;
let reader = slf.reader.clone();
let batch_size = batch_size.unwrap_or(1024);
future_into_py(slf.py(), async move {
use lancedb::query::QueryExecutionOptions;
let mut execution_options = QueryExecutionOptions::default();
execution_options.max_batch_length = batch_size;
let stream = reader
.read(selection, execution_options)
.await
.infer_error()?;
Ok(RecordBatchStream::new(stream))
})
}
}
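
Review note: `split_random` and `split_sequential` both start by checking that exactly one of `ratios`, `counts`, or `fixed` was passed. The body of that check is elided by the hunk boundaries, so the following is only a minimal std-only sketch of the idea. The function name `resolve_split_sizes`, the `String` error type, and the `Counts` and `Fixed` variants are assumptions made for illustration; only `SplitSizes::Percentages` appears in this diff.

```rust
/// Hypothetical stand-in for the crate's split size description.
/// Only `Percentages` is visible in the diff; the other variants are assumed.
#[derive(Debug, PartialEq)]
enum SplitSizes {
    Percentages(Vec<f64>),
    Counts(Vec<u64>),
    Fixed(u64),
}

/// Accepts exactly one of the three optional arguments and rejects
/// every other combination with a descriptive message.
fn resolve_split_sizes(
    ratios: Option<Vec<f64>>,
    counts: Option<Vec<u64>>,
    fixed: Option<u64>,
) -> Result<SplitSizes, String> {
    // Count how many of the mutually exclusive arguments were supplied.
    let provided = [ratios.is_some(), counts.is_some(), fixed.is_some()]
        .iter()
        .filter(|&&is_set| is_set)
        .count();
    if provided != 1 {
        return Err(format!(
            "expected exactly one of ratios, counts, or fixed, got {}",
            provided
        ));
    }
    match (ratios, counts, fixed) {
        (Some(ratios), None, None) => Ok(SplitSizes::Percentages(ratios)),
        (None, Some(counts), None) => Ok(SplitSizes::Counts(counts)),
        (None, None, Some(fixed)) => Ok(SplitSizes::Fixed(fixed)),
        // The count check above guarantees one of the arms matched.
        _ => unreachable!("exactly one argument was verified above"),
    }
}
```

Counting first and matching second keeps the error message independent of which combination was passed, which is presumably why the bindings end in an `unreachable!` arm.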


@@ -23,6 +23,7 @@ use lancedb::query::{
};
use lancedb::table::AnyQuery;
use pyo3::prelude::{PyAnyMethods, PyDictMethods};
use pyo3::pyfunction;
use pyo3::pymethods;
use pyo3::types::PyList;
use pyo3::types::{PyDict, PyString};
@@ -982,3 +983,15 @@ impl HybridQuery {
req
}
}
/// Convert a Python FTS query to JSON string
#[pyfunction]
pub fn fts_query_to_json(query_obj: &Bound<'_, PyAny>) -> PyResult<String> {
let wrapped: PyLanceDB<FtsQuery> = query_obj.extract()?;
lancedb::table::datafusion::udtf::fts::to_json(&wrapped.0).map_err(|e| {
PyErr::new::<pyo3::exceptions::PyValueError, _>(format!(
"Failed to serialize FTS query to JSON: {}",
e
))
})
}


@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.22.3-beta.0"
version = "0.22.3-beta.5"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -42,8 +42,9 @@ lance-table = { workspace = true }
lance-linalg = { workspace = true }
lance-testing = { workspace = true }
lance-encoding = { workspace = true }
lance-arrow = { workspace = true }
lance-namespace = { workspace = true }
lance-namespace-impls = { workspace = true, features = ["dir", "rest"] }
lance-namespace-impls = { workspace = true }
moka = { workspace = true }
pin-project = { workspace = true }
tokio = { version = "1.23", features = ["rt-multi-thread"] }
@@ -85,10 +86,6 @@ candle-nn = { version = "0.9.1", optional = true }
tokenizers = { version = "0.19.1", optional = true }
semver = { workspace = true }
# For a workaround, see workspace Cargo.toml
crunchy.workspace = true
bytemuck_derive.workspace = true
[dev-dependencies]
anyhow = "1"
tempfile = "3.5.0"


@@ -1188,7 +1188,7 @@ mod tests {
use arrow_schema::{DataType, Field, Schema};
use datafusion_physical_plan::stream::RecordBatchStreamAdapter;
use futures::{stream, TryStreamExt};
use lance::error::{ArrowResult, DataFusionResult};
use lance_core::error::{ArrowResult, DataFusionResult};
use lance_testing::datagen::{BatchGenerator, IncrementingInt32};
use tempfile::tempdir;


@@ -12,7 +12,7 @@ use arrow_array::{
use arrow_cast::{can_cast_types, cast};
use arrow_schema::{ArrowError, DataType, Field, Schema};
use half::f16;
use lance::arrow::{DataTypeExt, FixedSizeListArrayExt};
use lance_arrow::{DataTypeExt, FixedSizeListArrayExt};
use log::warn;
use num_traits::cast::AsPrimitive;
@@ -189,7 +189,7 @@ mod tests {
};
use arrow_schema::Field;
use half::f16;
use lance::arrow::FixedSizeListArrayExt;
use lance_arrow::FixedSizeListArrayExt;
#[test]
fn test_coerce_list_to_fixed_size_list() {


@@ -455,6 +455,7 @@ impl ListingDatabase {
// `remove_dir_all` may be used to remove something that is not a dataset
lance::Error::NotFound { .. } => Error::TableNotFound {
name: name.to_owned(),
source: Box::new(err),
},
_ => Error::from(err),
})?;


@@ -14,7 +14,7 @@ use lance_namespace::{
},
LanceNamespace,
};
use lance_namespace_impls::connect::connect as connect_namespace;
use lance_namespace_impls::ConnectBuilder;
use crate::database::listing::ListingDatabase;
use crate::error::{Error, Result};
@@ -48,11 +48,16 @@ impl LanceNamespaceDatabase {
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
) -> Result<Self> {
let namespace = connect_namespace(ns_impl, ns_properties.clone())
.await
.map_err(|e| Error::InvalidInput {
message: format!("Failed to connect to namespace: {:?}", e),
})?;
let mut builder = ConnectBuilder::new(ns_impl);
for (key, value) in ns_properties.clone() {
builder = builder.property(key, value);
}
if let Some(ref sess) = session {
builder = builder.session(sess.clone());
}
let namespace = builder.connect().await.map_err(|e| Error::InvalidInput {
message: format!("Failed to connect to namespace: {:?}", e),
})?;
Ok(Self {
namespace,


@@ -1,7 +1,7 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::Arc;
use std::{collections::HashMap, sync::Arc};
use datafusion::prelude::{SessionConfig, SessionContext};
use datafusion_execution::{disk_manager::DiskManagerBuilder, runtime_env::RuntimeEnvBuilder};
@@ -25,6 +25,8 @@ use crate::{
pub const SRC_ROW_ID_COL: &str = "row_id";
pub const SPLIT_NAMES_CONFIG_KEY: &str = "split_names";
/// Where to store the permutation table
#[derive(Debug, Clone, Default)]
enum PermutationDestination {
@@ -40,6 +42,8 @@ enum PermutationDestination {
pub struct PermutationConfig {
/// Splitting configuration
split_strategy: SplitStrategy,
/// Optional names for the splits
split_names: Option<Vec<String>>,
/// Shuffle strategy
shuffle_strategy: ShuffleStrategy,
/// Optional filter to apply to the base table
@@ -112,8 +116,16 @@ impl PermutationBuilder {
/// multiple processes and multiple nodes.
///
/// The default is a single split that contains all rows.
pub fn with_split_strategy(mut self, split_strategy: SplitStrategy) -> Self {
///
/// An optional list of names can be provided for the splits. This is for convenience and the names
/// will be stored in the permutation table's config metadata.
pub fn with_split_strategy(
mut self,
split_strategy: SplitStrategy,
split_names: Option<Vec<String>>,
) -> Self {
self.config.split_strategy = split_strategy;
self.config.split_names = split_names;
self
}
@@ -193,6 +205,30 @@ impl PermutationBuilder {
Ok(Box::pin(SimpleRecordBatchStream { schema, stream }))
}
fn add_split_names(
data: SendableRecordBatchStream,
split_names: &[String],
) -> Result<SendableRecordBatchStream> {
let schema = data
.schema()
.as_ref()
.clone()
.with_metadata(HashMap::from([(
SPLIT_NAMES_CONFIG_KEY.to_string(),
serde_json::to_string(split_names).map_err(|e| Error::Other {
message: format!("Failed to serialize split names: {}", e),
source: Some(e.into()),
})?,
)]));
let schema = Arc::new(schema);
let schema_clone = schema.clone();
let stream = data.map_ok(move |batch| batch.with_schema(schema.clone()).unwrap());
Ok(Box::pin(SimpleRecordBatchStream {
schema: schema_clone,
stream,
}))
}
/// Builds the permutation table and stores it in the given database.
pub async fn build(self) -> Result<Table> {
// First pass, apply filter and load row ids
@@ -249,6 +285,12 @@ impl PermutationBuilder {
// Rename _rowid to row_id
let renamed = rename_column(sorted, ROW_ID, SRC_ROW_ID_COL)?;
let streaming_data = if let Some(split_names) = &self.config.split_names {
Self::add_split_names(renamed, split_names)?
} else {
renamed
};
let (name, database) = match &self.config.destination {
PermutationDestination::Permanent(database, table_name) => {
(table_name.as_str(), database.clone())
@@ -259,10 +301,13 @@ impl PermutationBuilder {
}
};
let create_table_request =
CreateTableRequest::new(name.to_string(), CreateTableData::StreamingData(renamed));
let create_table_request = CreateTableRequest::new(
name.to_string(),
CreateTableData::StreamingData(streaming_data),
);
let table = database.create_table(create_table_request).await?;
Ok(Table::new(table, database))
}
}
@@ -296,10 +341,13 @@ mod tests {
let permutation_table = PermutationBuilder::new(data_table.clone())
.with_filter("some_value > 57".to_string())
.with_split_strategy(SplitStrategy::Random {
seed: Some(42),
sizes: SplitSizes::Percentages(vec![0.05, 0.30]),
})
.with_split_strategy(
SplitStrategy::Random {
seed: Some(42),
sizes: SplitSizes::Percentages(vec![0.05, 0.30]),
},
None,
)
.build()
.await
.unwrap();


@@ -11,58 +11,160 @@ use crate::arrow::{SendableRecordBatchStream, SimpleRecordBatchStream};
use crate::dataloader::permutation::builder::SRC_ROW_ID_COL;
use crate::dataloader::permutation::split::SPLIT_ID_COLUMN;
use crate::error::Error;
use crate::query::{QueryExecutionOptions, QueryFilter, QueryRequest, Select};
use crate::table::{AnyQuery, BaseTable};
use crate::Result;
use crate::query::{
ExecutableQuery, QueryBase, QueryExecutionOptions, QueryFilter, QueryRequest, Select,
};
use crate::table::{AnyQuery, BaseTable, Filter};
use crate::{Result, Table};
use arrow::array::AsArray;
use arrow::compute::concat_batches;
use arrow::datatypes::UInt64Type;
use arrow_array::{RecordBatch, UInt64Array};
use arrow_schema::SchemaRef;
use futures::{StreamExt, TryStreamExt};
use lance::arrow::RecordBatchExt;
use lance::dataset::scanner::DatasetRecordBatchStream;
use lance::error::LanceOptionExt;
use lance::io::RecordBatchStream;
use lance_arrow::RecordBatchExt;
use lance_core::error::LanceOptionExt;
use lance_core::ROW_ID;
use std::collections::HashMap;
use std::sync::Arc;
/// Reads a permutation of a source table based on row IDs stored in a separate table
#[derive(Clone)]
pub struct PermutationReader {
base_table: Arc<dyn BaseTable>,
permutation_table: Arc<dyn BaseTable>,
permutation_table: Option<Arc<dyn BaseTable>>,
offset: Option<u64>,
limit: Option<u64>,
available_rows: u64,
split: u64,
}
impl std::fmt::Debug for PermutationReader {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(
f,
"PermutationReader(base={}, permutation={})",
"PermutationReader(base={}, permutation={}, split={}, offset={:?}, limit={:?})",
self.base_table.name(),
self.permutation_table.name(),
self.permutation_table
.as_ref()
.map(|t| t.name())
.unwrap_or("--"),
self.split,
self.offset,
self.limit,
)
}
}
impl PermutationReader {
/// Create a new PermutationReader
pub async fn try_new(
pub async fn inner_new(
base_table: Arc<dyn BaseTable>,
permutation_table: Arc<dyn BaseTable>,
permutation_table: Option<Arc<dyn BaseTable>>,
split: u64,
) -> Result<Self> {
let schema = permutation_table.schema().await?;
if schema.column_with_name(SRC_ROW_ID_COL).is_none() {
return Err(Error::InvalidInput {
message: "Permutation table must contain a column named row_id".to_string(),
});
}
if schema.column_with_name(SPLIT_ID_COLUMN).is_none() {
return Err(Error::InvalidInput {
message: "Permutation table must contain a column named split_id".to_string(),
});
}
Ok(Self {
let mut slf = Self {
base_table,
permutation_table,
})
offset: None,
limit: None,
available_rows: 0,
split,
};
slf.validate().await?;
// Calculate the number of available rows
slf.available_rows = slf.verify_limit_offset(None, None).await?;
if slf.available_rows == 0 {
return Err(Error::InvalidInput {
message: "No rows found in the permutation table for the given split".to_string(),
});
}
Ok(slf)
}
pub async fn try_from_tables(
base_table: Arc<dyn BaseTable>,
permutation_table: Arc<dyn BaseTable>,
split: u64,
) -> Result<Self> {
Self::inner_new(base_table, Some(permutation_table), split).await
}
pub async fn identity(base_table: Arc<dyn BaseTable>) -> Self {
Self::inner_new(base_table, None, 0).await.unwrap()
}
/// Validates the limit and offset and returns the number of rows that will be read
fn validate_limit_offset(
limit: Option<u64>,
offset: Option<u64>,
available_rows: u64,
) -> Result<u64> {
match (limit, offset) {
(Some(limit), Some(offset)) => {
if offset + limit > available_rows {
Err(Error::InvalidInput {
message: "Offset + limit is greater than the number of rows in the permutation table"
.to_string(),
})
} else {
Ok(limit)
}
}
(None, Some(offset)) => {
if offset > available_rows {
Err(Error::InvalidInput {
message:
"Offset is greater than the number of rows in the permutation table"
.to_string(),
})
} else {
Ok(available_rows - offset)
}
}
(Some(limit), None) => {
if limit > available_rows {
Err(Error::InvalidInput {
message:
"Limit is greater than the number of rows in the permutation table"
.to_string(),
})
} else {
Ok(limit)
}
}
(None, None) => Ok(available_rows),
}
}
async fn verify_limit_offset(&self, limit: Option<u64>, offset: Option<u64>) -> Result<u64> {
let available_rows = if let Some(permutation_table) = &self.permutation_table {
permutation_table
.count_rows(Some(Filter::Sql(format!(
"{} = {}",
SPLIT_ID_COLUMN, self.split
))))
.await? as u64
} else {
self.base_table.count_rows(None).await? as u64
};
Self::validate_limit_offset(limit, offset, available_rows)
}
pub async fn with_offset(mut self, offset: u64) -> Result<Self> {
let available_rows = self.verify_limit_offset(self.limit, Some(offset)).await?;
self.offset = Some(offset);
self.available_rows = available_rows;
Ok(self)
}
pub async fn with_limit(mut self, limit: u64) -> Result<Self> {
let available_rows = self.verify_limit_offset(Some(limit), self.offset).await?;
self.available_rows = available_rows;
self.limit = Some(limit);
Ok(self)
}
fn is_sorted_already<'a, T: Iterator<Item = &'a u64>>(iter: T) -> bool {
@@ -103,7 +205,7 @@ impl PermutationReader {
..Default::default()
};
let mut data = base_table
let data = base_table
.query(
&AnyQuery::Query(base_query),
QueryExecutionOptions {
@@ -112,25 +214,29 @@ impl PermutationReader {
},
)
.await?;
let schema = data.schema();
let Some(batch) = data.try_next().await? else {
let batches = data.try_collect::<Vec<_>>().await?;
if batches.is_empty() {
return Err(Error::InvalidInput {
message: "Base table returned no batches".to_string(),
});
};
if data.try_next().await?.is_some() {
return Err(Error::InvalidInput {
message: "Base table returned more than one batch".to_string(),
});
}
if batch.num_rows() != num_rows {
if batches.iter().map(|b| b.num_rows()).sum::<usize>() != num_rows {
return Err(Error::InvalidInput {
message: "Base table returned different number of rows than the number of row IDs"
.to_string(),
});
}
let batch = if batches.len() == 1 {
batches.into_iter().next().unwrap()
} else {
concat_batches(&schema, &batches)?
};
// There is no guarantee the result order will match the order provided
// so may need to restore order
let actual_row_ids = batch
@@ -230,26 +336,75 @@ impl PermutationReader {
}
}
pub async fn read_split(
async fn validate(&self) -> Result<()> {
if let Some(permutation_table) = &self.permutation_table {
let schema = permutation_table.schema().await?;
if schema.column_with_name(SRC_ROW_ID_COL).is_none() {
return Err(Error::InvalidInput {
message: "Permutation table must contain a column named row_id".to_string(),
});
}
if schema.column_with_name(SPLIT_ID_COLUMN).is_none() {
return Err(Error::InvalidInput {
message: "Permutation table must contain a column named split_id".to_string(),
});
}
}
let avail_rows = if let Some(permutation_table) = &self.permutation_table {
permutation_table.count_rows(None).await? as u64
} else {
self.base_table.count_rows(None).await? as u64
};
Self::validate_limit_offset(self.limit, self.offset, avail_rows)?;
Ok(())
}
pub async fn read(
&self,
split: u64,
selection: Select,
execution_options: QueryExecutionOptions,
) -> Result<SendableRecordBatchStream> {
let row_ids = self
.permutation_table
.query(
&AnyQuery::Query(QueryRequest {
select: Select::Columns(vec![SRC_ROW_ID_COL.to_string()]),
filter: Some(QueryFilter::Sql(format!("{} = {}", SPLIT_ID_COLUMN, split))),
..Default::default()
}),
execution_options,
)
.await?;
// Note: this relies on the row ids query here being returned in consistent order
let row_ids = if let Some(permutation_table) = &self.permutation_table {
permutation_table
.query(
&AnyQuery::Query(QueryRequest {
select: Select::Columns(vec![SRC_ROW_ID_COL.to_string()]),
filter: Some(QueryFilter::Sql(format!(
"{} = {}",
SPLIT_ID_COLUMN, self.split
))),
offset: self.offset.map(|o| o as usize),
limit: self.limit.map(|l| l as usize),
..Default::default()
}),
execution_options,
)
.await?
} else {
self.base_table
.query(
&AnyQuery::Query(QueryRequest {
select: Select::Columns(vec![ROW_ID.to_string()]),
offset: self.offset.map(|o| o as usize),
limit: self.limit.map(|l| l as usize),
..Default::default()
}),
execution_options,
)
.await?
};
Self::row_ids_to_batches(self.base_table.clone(), row_ids, selection).await
}
pub async fn output_schema(&self, selection: Select) -> Result<SchemaRef> {
let table = Table::from(self.base_table.clone());
table.query().select(selection).output_schema().await
}
pub fn count_rows(&self) -> u64 {
self.available_rows
}
}
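
Review note: `validate_limit_offset` decides how many rows a reader will yield for a given limit and offset. The sketch below is a std-only restatement of the same four cases, not the crate's code. The name `rows_to_read` and the `String` error type are assumptions. It subtracts instead of computing `offset + limit`, so very large `u64` inputs cannot overflow.

```rust
/// Returns the number of rows that would be read, or an error when the
/// requested window does not fit inside `available_rows`.
fn rows_to_read(
    limit: Option<u64>,
    offset: Option<u64>,
    available_rows: u64,
) -> Result<u64, String> {
    // A missing offset behaves like an offset of zero.
    let offset = offset.unwrap_or(0);
    if offset > available_rows {
        return Err("offset is greater than the number of available rows".to_string());
    }
    // Safe subtraction: offset <= available_rows was checked above.
    let remaining = available_rows - offset;
    match limit {
        // `limit > remaining` is equivalent to `offset + limit > available_rows`
        // but cannot overflow.
        Some(limit) if limit > remaining => {
            Err("offset + limit is greater than the number of available rows".to_string())
        }
        Some(limit) => Ok(limit),
        None => Ok(remaining),
    }
}
```

An offset equal to the row count is accepted and yields zero rows, matching the `offset > available_rows` comparison in the diff.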
#[cfg(test)]
@@ -321,17 +476,17 @@ mod tests {
.unwrap();
let row_ids_table = virtual_table("row_ids", &permutation_batch).await;
let reader = PermutationReader::try_new(
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
0,
)
.await
.unwrap();
// Read split 0
let mut stream = reader
.read_split(
0,
.read(
Select::All,
QueryExecutionOptions {
max_batch_length: 3,
@@ -366,9 +521,16 @@ mod tests {
assert!(stream.try_next().await.unwrap().is_none());
// Read split 1
let reader = PermutationReader::try_from_tables(
base_table.base_table().clone(),
row_ids_table.base_table().clone(),
1,
)
.await
.unwrap();
let mut stream = reader
.read_split(
1,
.read(
Select::All,
QueryExecutionOptions {
max_batch_length: 3,
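
Review note: the reader comments that the base table may return rows in a different order than the requested row IDs, so the order "may need to be restored". The restoring code is outside the visible hunks, so this is only a hedged, std-only sketch of one way to do it. The function `restore_order` is hypothetical, and it assumes the row IDs within one request are unique.

```rust
use std::collections::HashMap;

/// For each requested row id, returns the index of that id inside
/// `returned`. Applying these indices as a "take" over the returned
/// batch puts the rows back into the requested order.
/// Returns `None` if the lengths differ or a requested id is missing.
/// Assumption: ids are unique within a single request.
fn restore_order(requested: &[u64], returned: &[u64]) -> Option<Vec<usize>> {
    if requested.len() != returned.len() {
        return None;
    }
    // Map each returned id to its position in the returned batch.
    let positions: HashMap<u64, usize> = returned
        .iter()
        .enumerate()
        .map(|(index, &id)| (id, index))
        .collect();
    // Collecting into Option short-circuits on the first missing id.
    requested
        .iter()
        .map(|id| positions.get(id).copied())
        .collect()
}
```

With Arrow, the resulting indices would typically be turned into an index array and passed to a take kernel. That step is left out here to keep the sketch dependency-free.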


@@ -13,7 +13,7 @@ use arrow_array::{Array, BooleanArray, RecordBatch, UInt64Array};
use arrow_schema::{DataType, Field, Schema};
use datafusion_common::hash_utils::create_hashes;
use futures::{StreamExt, TryStreamExt};
use lance::arrow::SchemaExt;
use lance_arrow::SchemaExt;
use crate::{
arrow::{SendableRecordBatchStream, SimpleRecordBatchStream},


@@ -10,7 +10,7 @@ pub mod sentence_transformers;
#[cfg(feature = "bedrock")]
pub mod bedrock;
use lance::arrow::RecordBatchExt;
use lance_arrow::RecordBatchExt;
use std::{
borrow::Cow,
collections::{HashMap, HashSet},


@@ -6,6 +6,8 @@ use std::sync::PoisonError;
use arrow_schema::ArrowError;
use snafu::Snafu;
type BoxError = Box<dyn std::error::Error + Send + Sync>;
#[derive(Debug, Snafu)]
#[snafu(visibility(pub(crate)))]
pub enum Error {
@@ -14,7 +16,7 @@ pub enum Error {
#[snafu(display("Invalid input, {message}"))]
InvalidInput { message: String },
#[snafu(display("Table '{name}' was not found"))]
TableNotFound { name: String },
TableNotFound { name: String, source: BoxError },
#[snafu(display("Database '{name}' was not found"))]
DatabaseNotFound { name: String },
#[snafu(display("Database '{name}' already exists."))]
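
Review note: adding `source: BoxError` to `TableNotFound` means the underlying cause becomes reachable through the standard `std::error::Error::source` chain. The sketch below shows that mechanism with the standard library only. `DbError` and `error_chain` are hypothetical names, and the `Display` and `Error` impls are hand-written stand-ins for what `snafu` derives in the real crate.

```rust
use std::error::Error;
use std::fmt;

type BoxError = Box<dyn Error + Send + Sync>;

/// Hypothetical stand-in for the crate's error enum.
#[derive(Debug)]
enum DbError {
    TableNotFound { name: String, source: BoxError },
}

impl fmt::Display for DbError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DbError::TableNotFound { name, .. } => {
                write!(f, "Table '{}' was not found", name)
            }
        }
    }
}

impl Error for DbError {
    // Expose the boxed cause so callers can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            DbError::TableNotFound { source, .. } => {
                let cause: &(dyn Error + 'static) = &**source;
                Some(cause)
            }
        }
    }
}

/// Collects the message of an error and of every cause beneath it.
fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```

A caller that only matches on the variant can ignore the new field with `..`, as the updated tests in this change do. A caller that logs the full chain now also sees the object store or HTTP failure behind the missing table.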


@@ -11,10 +11,8 @@ use datafusion_expr::Expr;
use datafusion_physical_plan::ExecutionPlan;
use futures::{stream, try_join, FutureExt, TryFutureExt, TryStreamExt};
use half::f16;
use lance::{
arrow::RecordBatchExt,
dataset::{scanner::DatasetRecordBatchStream, ROW_ID},
};
use lance::dataset::{scanner::DatasetRecordBatchStream, ROW_ID};
use lance_arrow::RecordBatchExt;
use lance_datafusion::exec::execute_plan;
use lance_index::scalar::inverted::SCORE_COL;
use lance_index::scalar::FullTextSearchQuery;
@@ -36,7 +34,7 @@ pub(crate) const DEFAULT_TOP_K: usize = 10;
/// Which columns should be retrieved from the database
#[derive(Debug, Clone)]
pub enum Select {
/// Select all columns
/// Select all non-system columns
///
/// Warning: This will always be slower than selecting only the columns you need.
All,
@@ -669,6 +667,12 @@ pub struct QueryRequest {
/// Configure how query results are normalized when doing hybrid search
pub norm: Option<NormalizeMethod>,
/// If set to true, disables automatic projection of scoring columns (_score, _distance).
/// When disabled, these columns are only included if explicitly requested in the projection.
///
/// By default, this is false (scoring columns are auto-projected for backward compatibility).
pub disable_scoring_autoprojection: bool,
}
impl Default for QueryRequest {
@@ -684,6 +688,7 @@ impl Default for QueryRequest {
prefilter: true,
reranker: None,
norm: None,
disable_scoring_autoprojection: false,
}
}
}


@@ -515,11 +515,8 @@ impl<S: HttpSend> Database for RemoteDatabase<S> {
.client
.post(&format!("/v1/table/{}/describe/", identifier));
let (request_id, rsp) = self.client.send_with_retry(req, None, true).await?;
if rsp.status() == StatusCode::NOT_FOUND {
return Err(crate::Error::TableNotFound {
name: identifier.clone(),
});
}
let rsp =
RemoteTable::<S>::handle_table_not_found(&request.name, rsp, &request_id).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let version = parse_server_version(&request_id, &rsp)?;
let table_identifier = build_table_identifier(


@@ -336,16 +336,33 @@ impl<S: HttpSend> RemoteTable<S> {
Ok(res)
}
pub(super) async fn handle_table_not_found(
table_name: &str,
response: reqwest::Response,
request_id: &str,
) -> Result<reqwest::Response> {
let status = response.status();
if status == StatusCode::NOT_FOUND {
let body = response.text().await.ok().unwrap_or_default();
let request_error = Error::Http {
source: body.into(),
request_id: request_id.into(),
status_code: Some(status),
};
return Err(Error::TableNotFound {
name: table_name.to_string(),
source: Box::new(request_error),
});
}
Ok(response)
}
async fn check_table_response(
&self,
request_id: &str,
response: reqwest::Response,
) -> Result<reqwest::Response> {
if response.status() == StatusCode::NOT_FOUND {
return Err(Error::TableNotFound {
name: self.identifier.clone(),
});
}
let response = Self::handle_table_not_found(&self.name, response, request_id).await?;
self.client.check_response(request_id, response).await
}
@@ -681,8 +698,9 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
.map_err(|e| match e {
// try to map the error to a more user-friendly error telling them
// specifically that the version does not exist
Error::TableNotFound { name } => Error::TableNotFound {
Error::TableNotFound { name, source } => Error::TableNotFound {
name: format!("{} (version: {})", name, version),
source,
},
e => e,
})?;
@@ -1427,6 +1445,10 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
"NOT_SUPPORTED"
}
async fn storage_options(&self) -> Option<HashMap<String, String>> {
None
}
async fn stats(&self) -> Result<TableStatistics> {
let request = self
.client
@@ -1571,7 +1593,11 @@ mod tests {
for result in results {
let result = result.await;
assert!(result.is_err());
assert!(matches!(result, Err(Error::TableNotFound { name }) if name == "my_table"));
assert!(
matches!(&result, &Err(Error::TableNotFound { ref name, .. }) if name == "my_table")
);
let full_error_report = snafu::Report::from_error(result.unwrap_err()).to_string();
assert!(full_error_report.contains("table my_table not found"));
}
}
@@ -2880,7 +2906,7 @@ mod tests {
let res = table.checkout(43).await;
println!("{:?}", res);
assert!(
matches!(res, Err(Error::TableNotFound { name }) if name == "my_table (version: 43)")
matches!(res, Err(Error::TableNotFound { name, .. }) if name == "my_table (version: 43)")
);
}


@@ -601,6 +601,8 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
async fn table_definition(&self) -> Result<TableDefinition>;
/// Get the table URI
fn dataset_uri(&self) -> &str;
/// Get the storage options used when opening this table, if any.
async fn storage_options(&self) -> Option<HashMap<String, String>>;
/// Poll until the columns are fully indexed. Will return Error::Timeout if the columns
/// are not fully indexed within the timeout.
async fn wait_for_index(
@@ -618,7 +620,7 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
#[derive(Clone, Debug)]
pub struct Table {
inner: Arc<dyn BaseTable>,
database: Arc<dyn Database>,
database: Option<Arc<dyn Database>>,
embedding_registry: Arc<dyn EmbeddingRegistry>,
}
@@ -642,7 +644,7 @@ mod test_utils {
let database = Arc::new(crate::remote::db::RemoteDatabase::new_mock(handler));
Self {
inner,
database,
database: Some(database),
// Registry is unused.
embedding_registry: Arc::new(MemoryRegistry::new()),
}
@@ -664,7 +666,7 @@ mod test_utils {
let database = Arc::new(crate::remote::db::RemoteDatabase::new_mock(handler));
Self {
inner,
database,
database: Some(database),
// Registry is unused.
embedding_registry: Arc::new(MemoryRegistry::new()),
}
@@ -678,11 +680,21 @@ impl std::fmt::Display for Table {
}
}
impl From<Arc<dyn BaseTable>> for Table {
fn from(inner: Arc<dyn BaseTable>) -> Self {
Self {
inner,
database: None,
embedding_registry: Arc::new(MemoryRegistry::new()),
}
}
}
impl Table {
pub fn new(inner: Arc<dyn BaseTable>, database: Arc<dyn Database>) -> Self {
Self {
inner,
database,
database: Some(database),
embedding_registry: Arc::new(MemoryRegistry::new()),
}
}
@@ -692,7 +704,7 @@ impl Table {
}
pub fn database(&self) -> &Arc<dyn Database> {
&self.database
self.database.as_ref().unwrap()
}
pub fn embedding_registry(&self) -> &Arc<dyn EmbeddingRegistry> {
@@ -706,7 +718,7 @@ impl Table {
) -> Self {
Self {
inner,
database,
database: Some(database),
embedding_registry,
}
}
@@ -1293,6 +1305,13 @@ impl Table {
self.inner.dataset_uri()
}
/// Get the storage options used when opening this table, if any.
///
/// Warning: This is an internal API and the return value is subject to change.
pub async fn storage_options(&self) -> Option<HashMap<String, String>> {
self.inner.storage_options().await
}
/// Get statistics about an index.
/// Returns None if the index does not exist.
pub async fn index_stats(
@@ -1525,6 +1544,7 @@ impl NativeTable {
.map_err(|e| match e {
lance::Error::DatasetNotFound { .. } => Error::TableNotFound {
name: name.to_string(),
source: Box::new(e),
},
source => Error::Lance { source },
})?;
@@ -1545,6 +1565,7 @@ impl NativeTable {
.file_stem()
.ok_or(Error::TableNotFound {
name: uri.to_string(),
source: format!("Could not extract table name from URI: '{}'", uri).into(),
})?
.to_str()
.ok_or(Error::InvalidTableName {
@@ -2382,6 +2403,10 @@ impl BaseTable for NativeTable {
scanner.distance_metric(distance_type.into());
}
if query.base.disable_scoring_autoprojection {
scanner.disable_scoring_autoprojection();
}
Ok(scanner.create_plan().await?)
}
@@ -2617,6 +2642,14 @@ impl BaseTable for NativeTable {
self.uri.as_str()
}
async fn storage_options(&self) -> Option<HashMap<String, String>> {
self.dataset
.get()
.await
.ok()
.and_then(|dataset| dataset.storage_options().cloned())
}
async fn index_stats(&self, index_name: &str) -> Result<Option<IndexStatistics>> {
let stats = match self
.dataset
@@ -2626,7 +2659,7 @@ impl BaseTable for NativeTable {
.await
{
Ok(stats) => stats,
Err(lance::error::Error::IndexNotFound { .. }) => return Ok(None),
Err(lance_core::Error::IndexNotFound { .. }) => return Ok(None),
Err(e) => return Err(Error::from(e)),
};


@@ -2,6 +2,9 @@
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! This module contains adapters to allow LanceDB tables to be used as DataFusion table providers.
pub mod udtf;
use std::{collections::HashMap, sync::Arc};
use arrow_array::RecordBatch;
@@ -21,6 +24,8 @@ use crate::{
query::{QueryExecutionOptions, QueryFilter, QueryRequest, Select},
Result,
};
use arrow_schema::{DataType, Field};
use lance_index::scalar::FullTextSearchQuery;
/// Datafusion attempts to maintain batch metadata
///
@@ -135,19 +140,38 @@ impl ExecutionPlan for MetadataEraserExec {
pub struct BaseTableAdapter {
table: Arc<dyn BaseTable>,
schema: Arc<ArrowSchema>,
fts_query: Option<FullTextSearchQuery>,
}
impl BaseTableAdapter {
pub async fn try_new(table: Arc<dyn BaseTable>) -> Result<Self> {
let schema = Arc::new(
table
.schema()
.await?
.as_ref()
.clone()
.with_metadata(HashMap::default()),
);
Ok(Self { table, schema })
let schema = table
.schema()
.await?
.as_ref()
.clone()
.with_metadata(HashMap::default());
Ok(Self {
table,
schema: Arc::new(schema),
fts_query: None,
})
}
/// Create a new adapter with an FTS query applied.
pub fn with_fts_query(&self, fts_query: FullTextSearchQuery) -> Self {
// Add _score column to the schema
let score_field = Field::new("_score", DataType::Float32, true);
let mut fields = self.schema.fields().to_vec();
fields.push(Arc::new(score_field));
let schema = Arc::new(ArrowSchema::new(fields));
Self {
table: self.table.clone(),
schema,
fts_query: Some(fts_query),
}
}
}
@@ -172,7 +196,15 @@ impl TableProvider for BaseTableAdapter {
filters: &[Expr],
limit: Option<usize>,
) -> DataFusionResult<Arc<dyn ExecutionPlan>> {
let mut query = QueryRequest::default();
// For FTS queries, disable auto-projection of _score to match DataFusion expectations
let disable_scoring = self.fts_query.is_some() && projection.is_some();
let mut query = QueryRequest {
full_text_search: self.fts_query.clone(),
disable_scoring_autoprojection: disable_scoring,
..Default::default()
};
if let Some(projection) = projection {
let field_names = projection
.iter()


@@ -0,0 +1,6 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! User-Defined Table Functions (UDTFs) for DataFusion integration
pub mod fts;

File diff suppressed because it is too large