Compare commits


41 Commits

Author SHA1 Message Date
Lance Release
143184c0ae Bump version: 0.25.2 → 0.25.3-beta.0 2025-10-14 02:25:16 +00:00
Jack Ye
dadb042978 feat: bump lance to 0.38.3-beta.2 and rust to 1.90.0 (#2714) 2025-10-10 14:02:41 -07:00
Weston Pace
5a19cf15a6 feat: a utility for creating "permutation views" (#2552)
I'm working on a lancedb version of pytorch data loading (and hopefully
addressing https://github.com/lancedb/lance/issues/3727).

However, rather than rely on pytorch for everything I'm moving some of
the things that pytorch does into rust. This gives us more control over
data loading (e.g. using shards or a hash-based split) and it allows
permutations to be persistent. In particular I hope to be able to:

* Create a persistent permutation
* This permutation can handle splits, filtering, shuffling, and sharding
* Create a rust data loader that can read a permutation (one or more
splits), or a subset of a permutation (for DDP)
* Create a python data loader that delegates to the rust data loader

Eventually create integrations for other data loading libraries,
including rust & node
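The split/shuffle/shard mechanics described above can be sketched in plain Python. This is only an illustration of the idea (a seeded, reproducible permutation carved into splits, then strided shards for DDP); the function names and layout are assumptions, not the actual lancedb API:

```python
import random

def build_permutation(n, seed=0, splits=(0.8, 0.2)):
    """Deterministically shuffle row indices and carve them into splits.

    The fixed seed makes the permutation reproducible, which is what
    allows it to be persisted and re-read later.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    out, start = [], 0
    for frac in splits:
        end = start + round(n * frac)
        out.append(idx[start:end])
        start = end
    return out

def shard(split, rank, world_size):
    # Each DDP rank reads a strided, disjoint subset of one split.
    return split[rank::world_size]

train, val = build_permutation(100, seed=42)
```

A rust data loader would then read one or more such splits (or a shard of one), and a python loader could delegate to it.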
2025-10-09 18:07:31 -07:00
Will Jones
3dcec724b7 chore: loosen pin on chrono (#2710)
Fixes #2709
2025-10-09 14:23:56 -07:00
LuQQiu
86a6bb9fcb chore: supports limit push down through MetadataEraserExec (#2679)
For limit to successfully push down to FilteredReadExec, see
https://github.com/lancedb/lance/pull/4795/
2025-10-09 09:33:38 -07:00
BubbleCal
b59d1007d3 feat(index): add IVF_RQ index type (#2687)
This exposes the IVF_RQ (RaBitQ quantization) index type to lancedb.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
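As rough intuition for what RaBitQ-style quantization does (a flavor sketch only, not lance's actual implementation): vectors are randomly rotated and reduced to one sign bit per dimension, so Hamming distance between bit codes cheaply approximates angular closeness.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Random orthonormal rotation (QR decomposition of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(v):
    # Rotate, then keep only the sign of each coordinate: one bit per dim.
    return (Q @ v) > 0

def hamming(x, y):
    return int(np.sum(quantize(x) != quantize(y)))

a = rng.standard_normal(d)
b = a + 0.01 * rng.standard_normal(d)  # near-duplicate of a
c = -a                                 # maximally far from a
```

The real index adds rescaling factors and IVF partitioning on top of this; the sketch only shows why one bit per dimension is enough to rank near from far.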
2025-10-09 15:46:18 +08:00
Lance Release
56a16b1728 Bump version: 0.22.2-beta.3 → 0.22.2 2025-10-08 18:13:08 +00:00
Lance Release
b7afed9beb Bump version: 0.22.2-beta.2 → 0.22.2-beta.3 2025-10-08 18:12:23 +00:00
Lance Release
5cbbaa2e4a Bump version: 0.25.2-beta.3 → 0.25.2 2025-10-08 18:11:45 +00:00
Lance Release
1b6bd2498e Bump version: 0.25.2-beta.2 → 0.25.2-beta.3 2025-10-08 18:11:45 +00:00
Jack Ye
285da9db1d feat: upgrade lance to 0.38.2 (#2705) 2025-10-08 09:59:28 -07:00
Ayush Chaurasia
ad8306c96b docs: add custom redirect for storage page (#2706)
Expand the custom redirection links list to include storage page
2025-10-08 21:35:48 +05:30
Wyatt Alt
3594538509 fix: add name to index config and fix create_index typing (#2660)
Co-authored-by: Mark McCaskey <markm@harvey.ai>
2025-10-08 04:41:30 -07:00
Tom LaMarre
917aabd077 fix(node): support specifying arrow field types by name (#2704)
The [`FieldLike` type in
arrow.ts](5ec12c9971/nodejs/lancedb/arrow.ts (L71-L78))
can have a `type: string` property, but before this change, actually
trying to create a table that has a schema that specifies field types by
name results in an error:

```
Error: Expected a Type but object was null/undefined
```

This change adds support for mapping some type name strings to arrow
`DataType`s, so that passing `FieldLike`s with a `type: string` property
to `sanitizeField` does not throw an error.

The type names that can be passed are upper/lowercase variations of the
keys of the `constructorsByTypeName` object. This does not support
mapping types that need parameters, such as timestamps which need
timezones.

With this, it is possible to create empty tables from `SchemaLike`
objects without instantiating arrow types, e.g.:

```
    import { SchemaLike } from "../lancedb/arrow"
    // ...
    const schemaLike = {
      fields: [
        {
          name: "id",
          type: "int64",
          nullable: true,
        },
        {
          name: "vector",
          type: "float64",
          nullable: true,
        },
      ],
    // ...
    } satisfies SchemaLike;
    const table = await con.createEmptyTable("test", schemaLike);
 ```

This change also makes `FieldLike.nullable` required since the `sanitizeField` function throws if it is undefined.
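The name-to-constructor lookup this change describes can be sketched as follows. The dictionary contents and return values here are illustrative assumptions; the real `constructorsByTypeName` table lives in `nodejs/lancedb/arrow.ts` and maps to arrow `DataType` constructors:

```python
# Hypothetical mapping from lowercase type names to constructors.
CONSTRUCTORS_BY_TYPE_NAME = {
    "int64": lambda: "Int64",
    "float64": lambda: "Float64",
    "utf8": lambda: "Utf8",
}

def resolve_type(name: str):
    # Upper/lowercase variations of the keys are accepted.
    ctor = CONSTRUCTORS_BY_TYPE_NAME.get(name.lower())
    if ctor is None:
        # Parameterized types (e.g. timestamps, which need a timezone)
        # are intentionally unsupported by a bare name lookup.
        raise ValueError(f"unsupported type name: {name}")
    return ctor()
```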
2025-10-08 04:40:06 -07:00
Jack Ye
5ec12c9971 fix: federated database should not pass namespace to listing database (#2702)
Fixes an error where, when converting a federated database operation to a
listing database operation, the namespace parameter is no longer correct
and should be dropped.

Note that with the testing infra we have today, we don't have a good way
to test these changes. I will do a quick follow up on
https://github.com/lancedb/lancedb/issues/2701 but would be great to get
this in first to resolve the related issues.
2025-10-06 14:12:41 -07:00
Ed Rogers
d0ce489b21 fix: use stdlib override when possible (#2699)
## Description of changes

Fixes #2698  

This PR uses
[`typing.override`](https://docs.python.org/3/library/typing.html#typing.override)
in favor of the [`overrides`](https://pypi.org/project/overrides/)
dependency when possible. As of Python 3.12, the standard library offers
`typing.override` to perform a static check on overridden methods.

### Motivation

Currently, `overrides` is incompatible with Python 3.14. As a result,
any package that attempts to import `overrides` using Python 3.14+ will
raise an `AttributeError`. An
[issue](https://github.com/mkorpela/overrides/issues/127) has been
raised and a [pull
request](https://github.com/mkorpela/overrides/pull/133) has been
submitted to the GitHub repo for the `overrides` project. But the
maintainer has been unresponsive.

To ensure readiness for Python 3.14, this package (and any other package
directly depending on `overrides`) should consider using
`typing.override` instead.

### Impact

The standard library added `typing.override` as of 3.12. As a result,
this change will affect only users of Python 3.12+. Previous versions
will continue to rely on `overrides`. Notably, the standard library
implementation is slightly different from that of `overrides`. A
thorough discussion of those differences is shown in [PEP
698](https://peps.python.org/pep-0698/), and it is also summarized
nicely by the maintainer of `overrides`
[here](https://github.com/mkorpela/overrides/issues/126#issuecomment-2401327116).

There are 2 main ways that switching from `overrides` to
`typing.override` will have an impact on developers of this repo.
1. `typing.override` does not implement any runtime checking. Instead,
it provides information to type checkers.
2. The stdlib does not provide a mixin class to enforce override
decorators on child classes. (The reasoning for this is explained in
[the PEP](https://peps.python.org/pep-0698/).) This PR removes that
behavior entirely by replacing `EnforceOverrides`.
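The conditional import this switch implies can be sketched as follows (with a no-op fallback added so the sketch stays self-contained; the real code would fall back to the `overrides` dependency on older Pythons):

```python
import sys

if sys.version_info >= (3, 12):
    from typing import override  # stdlib: static type-checker hint only
else:
    try:
        from overrides import override  # third-party: also checks at runtime
    except ImportError:
        def override(func):  # no-op fallback so this sketch runs anywhere
            return func

class Base:
    def run(self) -> str:
        return "base"

class Child(Base):
    @override  # flags run() as an intentional override of Base.run
    def run(self) -> str:
        return "child"
```

Note the behavioral difference called out above: on 3.12+ the decorator is purely informational for type checkers, whereas `overrides` would have raised at class-definition time if `Base.run` did not exist.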
2025-10-06 11:23:20 -07:00
Lance Release
d7e02c8181 Bump version: 0.22.2-beta.1 → 0.22.2-beta.2 2025-10-06 18:10:40 +00:00
Lance Release
70958f6366 Bump version: 0.25.2-beta.1 → 0.25.2-beta.2 2025-10-06 18:09:24 +00:00
Will Jones
1ac745eb18 ci: fix Python and Node CI on main (#2700)
Example failure:
https://github.com/lancedb/lancedb/actions/runs/18237024283/job/51932651993
2025-10-06 09:40:08 -07:00
Will Jones
1357fe8aa1 ci: run remote tests on PRs only if they aren't a fork (#2697) 2025-10-03 17:38:40 -07:00
LuQQiu
0d78929893 feat: upgrade lance to 0.38.0 (#2695)
https://github.com/lancedb/lance/releases/tag/v0.38.0

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-10-03 16:47:05 -07:00
Neha Prasad
9e2a68541e fix(node): allow undefined/omitted values for nullable vector fields (#2656)
**Problem**: When a vector field is marked as nullable, users should be
able to omit it or pass `undefined`, but this was throwing an error:
"Table has embeddings: 'vector', but no embedding function was provided"

fixes: #2646

**Solution**: Modified `validateSchemaEmbeddings` to check
`field.nullable` before treating `undefined` values as missing embedding
fields.

**Changes**:
- Fixed validation logic in `nodejs/lancedb/arrow.ts`
- Enabled previously skipped test for nullable fields
- Added reproduction test case

**Behavior**:
-  `{ vector: undefined }` now works for nullable fields
-  `{}` (omitted field) now works for nullable fields  
-  `{ vector: null }` still works (unchanged)
-  Non-nullable fields still properly throw errors (unchanged)
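The gist of the fixed check, sketched in Python rather than the actual TypeScript in `nodejs/lancedb/arrow.ts` (field and helper names here are illustrative):

```python
def missing_embeddings(row, embedding_fields):
    """Return embedding columns that are absent/None but NOT nullable.

    Nullable fields may be omitted or set to None without raising, which
    is the behavior this fix enables; non-nullable fields still error.
    """
    return [
        name
        for name, nullable in embedding_fields.items()
        if row.get(name) is None and not nullable
    ]

fields = {"vector": True, "required_vec": False}  # name -> nullable?
```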

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: neha <neha@posthog.com>
2025-10-02 10:53:05 -07:00
Will Jones
1aa0fd16e7 ci: automatic issue creation for failed publish workflows (#2694)
## Summary
- Created custom GitHub Action that creates issues when workflow jobs
fail
- Added report-failure jobs to cargo-publish.yml, java-publish.yml,
npm-publish.yml, and pypi-publish.yml
- Issues are created automatically with workflow name, failed job names,
and run URL

## Test plan
- Workflows will only create issues on actual release or
workflow_dispatch events
- Can be tested by triggering workflow_dispatch on a publish workflow

Based on lancedb/lance#4873

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-02 08:24:16 -07:00
Lance Release
fec2a05629 Bump version: 0.22.2-beta.0 → 0.22.2-beta.1 2025-09-30 19:31:44 +00:00
Lance Release
79a1cd60ee Bump version: 0.25.2-beta.0 → 0.25.2-beta.1 2025-09-30 19:30:39 +00:00
Colin Patrick McCabe
88807a59a4 fix: have CI download from ci-support-binaries (#2692)
Have CI download from ci-support-binaries to fix the build.
2025-09-30 11:54:43 -07:00
Jack Ye
e0e7e01ea8 fix: inflated release size due to lance-namespace transitive dependency (#2691)
Fixed the issue on lance-namespace side to avoid pinning to a specific
lance version. This should fix the issue of the increased release
artifact size and build time.
2025-09-30 11:18:32 -07:00
Ayush Chaurasia
a416ebc11d fix: use correct nodejs path for ci (#2689) 2025-09-30 14:18:42 +05:30
Ayush Chaurasia
f941054baf docs: fix doc deployment and remove recipes workflow trigger (#2688) 2025-09-30 13:10:39 +05:30
Ayush Chaurasia
1a81c46505 docs: transition to new docs (#2681) 2025-09-29 11:37:08 +05:30
Colin Patrick McCabe
82b25a71e9 feat: add support for test_remote_connections (#2666)
Add a new test feature which allows for running the lancedb tests
against a remote server. Convert over a few tests in src/connection.rs
as a proof of concept.

To make local development easier, the remote tests can be run locally
from a Makefile. This file can also be used to run the feature tests,
with a single invocation of 'make'. (The feature tests require bringing
up a docker compose environment.)
2025-09-26 11:24:43 -07:00
Jack Ye
13c613d45f chore: upgrade lance to v0.37.1-beta.1 (#2682) 2025-09-25 23:12:09 -07:00
Weston Pace
e07389a36c feat: allow bitmap indexes on large-string, binary, large-binary, and bitmap (#2678)
The underlying `pylance` already supported this; it was just blocked by
an over-eager validation function.

Closes #1981
2025-09-25 09:46:42 -07:00
Lance Release
e7e9e80b1d Bump version: 0.22.1 → 0.22.2-beta.0 2025-09-24 22:54:54 +00:00
Lance Release
247fb58400 Bump version: 0.25.1 → 0.25.2-beta.0 2025-09-24 22:54:09 +00:00
Jack Ye
504bdc471c feat(rust): support namespace backed database (#2664)
This PR adds support for namespace-backed databases through
lance-namespace integration, enabling centralized table management
through namespace APIs.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-24 15:33:31 -07:00
Will Jones
d617cdef4a feat: add use_index parameter to merge insert operations (#2674)
## Summary

Exposes `use_index` Merge Insert parameter, which was created upstream
in https://github.com/lancedb/lance/pull/4688.

## API Examples

### Python
```python
# Force table scan
table.merge_insert(["id"]) \
    .when_not_matched_insert_all() \
    .use_index(False) \
    .execute(data)
```

### Node.js/TypeScript
```typescript
// Force table scan  
await table.mergeInsert("id")
    .whenNotMatchedInsertAll()
    .useIndex(false)
    .execute(data);
```

### Rust
```rust
// Force table scan
let mut builder = table.merge_insert(&["id"]);
builder.when_not_matched_insert_all()
       .use_index(false);
builder.execute(data).await?;
```

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-24 12:50:21 -07:00
Will Jones
356d7046fd ci: fix test failure on main (#2677)
Test was in the wrong position.
2025-09-24 09:46:04 -07:00
Will Jones
48e5caabda ci(nodejs): lint for unused imports (#2673) 2025-09-23 18:49:42 -07:00
Lance Release
d6cc68f671 Bump version: 0.22.1-beta.4 → 0.22.1 2025-09-23 22:07:31 +00:00
Lance Release
55eacfa685 Bump version: 0.22.1-beta.3 → 0.22.1-beta.4 2025-09-23 22:06:45 +00:00
110 changed files with 7169 additions and 846 deletions


@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.22.1-beta.3"
current_version = "0.22.2"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.


@@ -0,0 +1,45 @@
name: Create Failure Issue
description: Creates a GitHub issue if any jobs in the workflow failed
inputs:
job-results:
description: 'JSON string of job results from needs context'
required: true
workflow-name:
description: 'Name of the workflow'
required: true
runs:
using: composite
steps:
- name: Check for failures and create issue
shell: bash
env:
JOB_RESULTS: ${{ inputs.job-results }}
WORKFLOW_NAME: ${{ inputs.workflow-name }}
RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
GH_TOKEN: ${{ github.token }}
run: |
# Check if any job failed
if echo "$JOB_RESULTS" | jq -e 'to_entries | any(.value.result == "failure")' > /dev/null; then
echo "Detected job failures, creating issue..."
# Extract failed job names
FAILED_JOBS=$(echo "$JOB_RESULTS" | jq -r 'to_entries | map(select(.value.result == "failure")) | map(.key) | join(", ")')
# Create issue with workflow name, failed jobs, and run URL
gh issue create \
--title "$WORKFLOW_NAME Failed ($FAILED_JOBS)" \
--body "The workflow **$WORKFLOW_NAME** failed during execution.
**Failed jobs:** $FAILED_JOBS
**Run URL:** $RUN_URL
Please investigate the failed jobs and address any issues." \
--label "ci"
echo "Issue created successfully"
else
echo "No job failures detected, skipping issue creation"
fi
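The jq-based failure check above can be mirrored in Python, which may be easier to poke at locally (a sketch only; the workflow itself uses jq over the `needs` context):

```python
import json

def failed_jobs(job_results_json: str) -> list[str]:
    # Mirrors: to_entries | map(select(.value.result == "failure")) | map(.key)
    results = json.loads(job_results_json)
    return [name for name, info in results.items()
            if info.get("result") == "failure"]

# Shape of toJSON(needs): job name -> { "result": ..., "outputs": ... }
needs = json.dumps({
    "build": {"result": "failure"},
    "test": {"result": "success"},
})
```

An empty return list corresponds to the "No job failures detected" branch; a non-empty one would drive the `gh issue create` call.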


@@ -38,3 +38,17 @@ jobs:
- name: Publish the package
run: |
cargo publish -p lancedb --all-features --token ${{ steps.auth.outputs.token }}
report-failure:
name: Report Workflow Failure
runs-on: ubuntu-latest
needs: [build]
if: always() && (github.event_name == 'release' || github.event_name == 'workflow_dispatch')
permissions:
contents: read
issues: write
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/create-failure-issue
with:
job-results: ${{ toJSON(needs) }}
workflow-name: ${{ github.workflow }}


@@ -56,8 +56,9 @@ jobs:
with:
node-version: 20
cache: 'npm'
cache-dependency-path: docs/package-lock.json
- name: Install node dependencies
working-directory: node
working-directory: nodejs
run: |
sudo apt update
sudo apt install -y protobuf-compiler libssl-dev


@@ -43,7 +43,6 @@ jobs:
- uses: Swatinem/rust-cache@v2
- uses: actions-rust-lang/setup-rust-toolchain@v1
with:
toolchain: "1.81.0"
cache-workspaces: "./java/core/lancedb-jni"
# Disable full debug symbol generation to speed up CI build and keep memory down
# "1" means line tables only, which is useful for panic tracebacks.
@@ -112,3 +111,17 @@ jobs:
env:
SONATYPE_USER: ${{ secrets.SONATYPE_USER }}
SONATYPE_TOKEN: ${{ secrets.SONATYPE_TOKEN }}
report-failure:
name: Report Workflow Failure
runs-on: ubuntu-latest
needs: [linux-arm64, linux-x86, macos-arm64]
if: always() && (github.event_name == 'release' || github.event_name == 'workflow_dispatch')
permissions:
contents: read
issues: write
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/create-failure-issue
with:
job-results: ${{ toJSON(needs) }}
workflow-name: ${{ github.workflow }}


@@ -6,6 +6,7 @@ on:
- main
pull_request:
paths:
- Cargo.toml
- nodejs/**
- .github/workflows/nodejs.yml
- docker-compose.yml
@@ -116,7 +117,7 @@ jobs:
set -e
npm ci
npm run docs
if ! git diff --exit-code -- . ':(exclude)Cargo.lock'; then
if ! git diff --exit-code -- ../ ':(exclude)Cargo.lock'; then
echo "Docs need to be updated"
echo "Run 'npm run docs', fix any warnings, and commit the changes."
exit 1


@@ -365,3 +365,17 @@ jobs:
ARGS="$ARGS --tag preview"
fi
npm publish $ARGS
report-failure:
name: Report Workflow Failure
runs-on: ubuntu-latest
needs: [build-lancedb, test-lancedb, publish]
if: always() && (github.event_name == 'release' || github.event_name == 'workflow_dispatch')
permissions:
contents: read
issues: write
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/create-failure-issue
with:
job-results: ${{ toJSON(needs) }}
workflow-name: ${{ github.workflow }}


@@ -173,3 +173,17 @@ jobs:
generate_release_notes: false
name: Python LanceDB v${{ steps.extract_version.outputs.version }}
body: ${{ steps.python_release_notes.outputs.changelog }}
report-failure:
name: Report Workflow Failure
runs-on: ubuntu-latest
needs: [linux, mac, windows]
permissions:
contents: read
issues: write
if: always() && (github.event_name == 'release' || github.event_name == 'workflow_dispatch')
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/create-failure-issue
with:
job-results: ${{ toJSON(needs) }}
workflow-name: ${{ github.workflow }}


@@ -6,6 +6,7 @@ on:
- main
pull_request:
paths:
- Cargo.toml
- python/**
- .github/workflows/python.yml


@@ -96,6 +96,7 @@ jobs:
# Need up-to-date compilers for kernels
CC: clang-18
CXX: clang++-18
GH_TOKEN: ${{ secrets.SOPHON_READ_TOKEN }}
steps:
- uses: actions/checkout@v4
with:
@@ -117,15 +118,17 @@ jobs:
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
- name: Start S3 integration test environment
working-directory: .
run: docker compose up --detach --wait
- name: Build
run: cargo build --all-features --tests --locked --examples
- name: Run tests
run: cargo test --all-features --locked
- name: Run feature tests
run: make -C ./lancedb feature-tests
- name: Run examples
run: cargo run --example simple --locked
- name: Run remote tests
# Running this requires access to secrets, so skip if this is
# a PR from a fork.
if: github.event_name != 'pull_request' || !github.event.pull_request.head.repo.fork
run: make -C ./lancedb remote-tests
macos:
timeout-minutes: 30


@@ -1,26 +0,0 @@
name: Trigger vectordb-recipers workflow
on:
push:
branches: [ main ]
pull_request:
paths:
- .github/workflows/trigger-vectordb-recipes.yml
workflow_dispatch:
jobs:
build:
runs-on: ubuntu-latest
steps:
- name: Trigger vectordb-recipes workflow
uses: actions/github-script@v6
with:
github-token: ${{ secrets.VECTORDB_RECIPES_ACTION_TOKEN }}
script: |
const result = await github.rest.actions.createWorkflowDispatch({
owner: 'lancedb',
repo: 'vectordb-recipes',
workflow_id: 'examples-test.yml',
ref: 'main'
});
console.log(result);

Cargo.lock (generated, 1299 lines): diff suppressed because it is too large.


@@ -15,30 +15,34 @@ categories = ["database-implementations"]
rust-version = "1.78.0"
[workspace.dependencies]
lance = { "version" = "=0.37.0", default-features = false, "features" = ["dynamodb"] }
lance-io = { "version" = "=0.37.0", default-features = false }
lance-index = "=0.37.0"
lance-linalg = "=0.37.0"
lance-table = "=0.37.0"
lance-testing = "=0.37.0"
lance-datafusion = "=0.37.0"
lance-encoding = "=0.37.0"
lance = { "version" = "=0.38.2", default-features = false, "features" = ["dynamodb"], "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-core = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-datagen = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-file = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-io = { "version" = "=0.38.2", default-features = false, "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-index = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-linalg = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-table = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-testing = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-datafusion = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-encoding = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-namespace = "0.0.18"
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "55.1", optional = false }
arrow-array = "55.1"
arrow-data = "55.1"
arrow-ipc = "55.1"
arrow-ord = "55.1"
arrow-schema = "55.1"
arrow-arith = "55.1"
arrow-cast = "55.1"
arrow = { version = "56.2", optional = false }
arrow-array = "56.2"
arrow-data = "56.2"
arrow-ipc = "56.2"
arrow-ord = "56.2"
arrow-schema = "56.2"
arrow-cast = "56.2"
async-trait = "0"
datafusion = { version = "49.0", default-features = false }
datafusion-catalog = "49.0"
datafusion-common = { version = "49.0", default-features = false }
datafusion-execution = "49.0"
datafusion-expr = "49.0"
datafusion-physical-plan = "49.0"
datafusion = { version = "50.1", default-features = false }
datafusion-catalog = "50.1"
datafusion-common = { version = "50.1", default-features = false }
datafusion-execution = "50.1"
datafusion-expr = "50.1"
datafusion-physical-plan = "50.1"
env_logger = "0.11"
half = { "version" = "2.6.0", default-features = false, features = [
"num-traits",
@@ -48,18 +52,26 @@ log = "0.4"
moka = { version = "0.12", features = ["future"] }
object_store = "0.12.0"
pin-project = "1.0.7"
rand = "0.9"
snafu = "0.8"
url = "2"
num-traits = "0.2"
rand = "0.9"
regex = "1.10"
lazy_static = "1"
semver = "1.0.25"
crunchy = "0.2.4"
# Temporary pins to work around downstream issues
# https://github.com/apache/arrow-rs/commit/2fddf85afcd20110ce783ed5b4cdeb82293da30b
chrono = "=0.4.41"
# https://github.com/RustCrypto/formats/issues/1684
base64ct = "=1.6.0"
chrono = "0.4"
# Workaround for: https://github.com/Lokathor/bytemuck/issues/306
bytemuck_derive = ">=1.8.1, <1.9.0"
# This is only needed when we reference preview releases of lance
# Force to use the same lance version as the rest of the project to avoid duplicate dependencies
[patch.crates-io]
lance = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-io = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-index = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-linalg = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-table = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-testing = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-datafusion = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }
lance-encoding = { "version" = "=0.38.2", "tag" = "v0.38.3-beta.2", "git" = "https://github.com/lancedb/lance.git" }


@@ -0,0 +1,4 @@
#!/usr/bin/env bash
export RUST_LOG=info
exec ./lancedb server --port 0 --sql-port 0 --data-dir "${1}"

ci/run_with_docker_compose.sh (executable file, 18 lines)

@@ -0,0 +1,18 @@
#!/usr/bin/env bash
#
# A script for running the given command together with a docker compose environment.
#
# Bring down the docker setup once the command is done running.
tear_down() {
docker compose -p fixture down
}
trap tear_down EXIT
set +xe
# Clean up any existing docker setup and bring up a new one.
docker compose -p fixture up --detach --wait || exit 1
"${@}"

ci/run_with_test_connection.sh (executable file, 68 lines)

@@ -0,0 +1,68 @@
#!/usr/bin/env bash
#
# A script for running the given command together with the lancedb cli.
#
die() {
echo $?
exit 1
}
check_command_exists() {
command="${1}"
which ${command} &> /dev/null || \
die "Unable to locate command: ${command}. Did you install it?"
}
if [[ ! -e ./lancedb ]]; then
if [[ -v SOPHON_READ_TOKEN ]]; then
INPUT="lancedb-linux-x64"
gh release \
--repo lancedb/lancedb \
download ci-support-binaries \
--pattern "${INPUT}" \
|| die "failed to fetch cli."
check_command_exists openssl
openssl enc -aes-256-cbc \
-d -pbkdf2 \
-pass "env:SOPHON_READ_TOKEN" \
-in "${INPUT}" \
-out ./lancedb-linux-x64.tar.gz \
|| die "openssl failed"
TARGET="${INPUT}.tar.gz"
else
ARCH="x64"
if [[ $OSTYPE == 'darwin'* ]]; then
UNAME=$(uname -m)
if [[ $UNAME == 'arm64' ]]; then
ARCH='arm64'
fi
OSTYPE="macos"
elif [[ $OSTYPE == 'linux'* ]]; then
if [[ $UNAME == 'aarch64' ]]; then
ARCH='arm64'
fi
OSTYPE="linux"
else
die "unknown OSTYPE: $OSTYPE"
fi
check_command_exists gh
TARGET="lancedb-${OSTYPE}-${ARCH}.tar.gz"
gh release \
--repo lancedb/sophon \
download lancedb-cli-v0.0.3 \
--pattern "${TARGET}" \
|| die "failed to fetch cli."
fi
check_command_exists tar
tar xvf "${TARGET}" || die "tar failed."
[[ -e ./lancedb ]] || die "failed to extract lancedb."
fi
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
export CREATE_LANCEDB_TEST_CONNECTION_SCRIPT="${SCRIPT_DIR}/create_lancedb_test_connection.sh"
"${@}"


@@ -117,7 +117,7 @@ def update_cargo_toml(line_updater):
lance_line = ""
is_parsing_lance_line = False
for line in lines:
if line.startswith("lance"):
if line.startswith("lance") and not line.startswith("lance-namespace"):
# Check if this is a single-line or multi-line entry
# Single-line entries either:
# 1. End with } (complete inline table)


@@ -70,6 +70,23 @@ plugins:
- mkdocs-jupyter
- render_swagger:
allow_arbitrary_locations: true
- redirects:
redirect_maps:
# Redirect the home page and other top-level markdown files. This enables maximum SEO benefit
# other sub-pages are handled by the injected js in overrides/partials/header.html
'index.md': 'https://lancedb.com/docs/'
'guides/tables.md': 'https://lancedb.com/docs/tables/'
'ann_indexes.md': 'https://lancedb.com/docs/indexing/'
'basic.md': 'https://lancedb.com/docs/quickstart/'
'faq.md': 'https://lancedb.com/docs/faq/'
'embeddings/understanding_embeddings.md': 'https://lancedb.com/docs/embedding/'
'integrations.md': 'https://lancedb.com/docs/integrations/'
'examples.md': 'https://lancedb.com/docs/tutorials/'
'concepts/vector_search.md': 'https://lancedb.com/docs/search/vector-search/'
'troubleshooting.md': 'https://lancedb.com/docs/troubleshooting/'
'guides/storage.md': 'https://lancedb.com/docs/storage/integrations'
markdown_extensions:
- admonition


@@ -19,7 +19,13 @@
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN THE SOFTWARE.
-->
<div id="deprecation-banner" style="background-color: #f8d7da; color: #721c24; padding: 1em; text-align: center;">
<p style="margin: 0; font-size: 1.1em;">
<strong>This documentation site is deprecated.</strong>
Please visit our new documentation site at <a href="https://lancedb.com/docs" style="color: #721c24; text-decoration: underline;">
lancedb.com/docs</a> for the latest information.
</p>
</div>
{% set class = "md-header" %}
{% if "navigation.tabs.sticky" in features %}
{% set class = class ~ " md-header--shadow md-header--lifted" %}
@@ -150,9 +156,9 @@
<div style="margin-left: 10px; margin-right: 5px;">
<a href="https://discord.com/invite/zMM32dvNtd" target="_blank" rel="noopener noreferrer">
<svg fill="#FFFFFF" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 50 50" width="25px" height="25px"><path d="M 41.625 10.769531 C 37.644531 7.566406 31.347656 7.023438 31.078125 7.003906 C 30.660156 6.96875 30.261719 7.203125 30.089844 7.589844 C 30.074219 7.613281 29.9375 7.929688 29.785156 8.421875 C 32.417969 8.867188 35.652344 9.761719 38.578125 11.578125 C 39.046875 11.867188 39.191406 12.484375 38.902344 12.953125 C 38.710938 13.261719 38.386719 13.429688 38.050781 13.429688 C 37.871094 13.429688 37.6875 13.378906 37.523438 13.277344 C 32.492188 10.15625 26.210938 10 25 10 C 23.789063 10 17.503906 10.15625 12.476563 13.277344 C 12.007813 13.570313 11.390625 13.425781 11.101563 12.957031 C 10.808594 12.484375 10.953125 11.871094 11.421875 11.578125 C 14.347656 9.765625 17.582031 8.867188 20.214844 8.425781 C 20.0625 7.929688 19.925781 7.617188 19.914063 7.589844 C 19.738281 7.203125 19.34375 6.960938 18.921875 7.003906 C 18.652344 7.023438 12.355469 7.566406 8.320313 10.8125 C 6.214844 12.761719 2 24.152344 2 34 C 2 34.175781 2.046875 34.34375 2.132813 34.496094 C 5.039063 39.605469 12.972656 40.941406 14.78125 41 C 14.789063 41 14.800781 41 14.8125 41 C 15.132813 41 15.433594 40.847656 15.621094 40.589844 L 17.449219 38.074219 C 12.515625 36.800781 9.996094 34.636719 9.851563 34.507813 C 9.4375 34.144531 9.398438 33.511719 9.765625 33.097656 C 10.128906 32.683594 10.761719 32.644531 11.175781 33.007813 C 11.234375 33.0625 15.875 37 25 37 C 34.140625 37 38.78125 33.046875 38.828125 33.007813 C 39.242188 32.648438 39.871094 32.683594 40.238281 33.101563 C 40.601563 33.515625 40.5625 34.144531 40.148438 34.507813 C 40.003906 34.636719 37.484375 36.800781 32.550781 38.074219 L 34.378906 40.589844 C 34.566406 40.847656 34.867188 41 35.1875 41 C 35.199219 41 35.210938 41 35.21875 41 C 37.027344 40.941406 44.960938 39.605469 47.867188 34.496094 C 47.953125 34.34375 48 34.175781 48 34 C 48 24.152344 43.785156 12.761719 41.625 10.769531 Z M 18.5 30 C 16.566406 30 15 
28.210938 15 26 C 15 23.789063 16.566406 22 18.5 22 C 20.433594 22 22 23.789063 22 26 C 22 28.210938 20.433594 30 18.5 30 Z M 31.5 30 C 29.566406 30 28 28.210938 28 26 C 28 23.789063 29.566406 22 31.5 22 C 33.433594 22 35 23.789063 35 26 C 35 28.210938 33.433594 30 31.5 30 Z"/></svg>
</a>
</div>
<svg fill="#FFFFFF" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 50 50" width="25px" height="25px"><path d="M 41.625 10.769531 C 37.644531 7.566406 31.347656 7.023438 31.078125 7.003906 C 30.660156 6.96875 30.261719 7.203125 30.089844 7.589844 C 30.074219 7.613281 29.9375 7.929688 29.785156 8.421875 C 32.417969 8.867188 35.652344 9.761719 38.578125 11.578125 C 39.046875 11.867188 39.191406 12.484375 38.902344 12.953125 C 38.710938 13.261719 38.386719 13.429688 38.050781 13.429688 C 37.871094 13.429688 37.6875 13.378906 37.523438 13.277344 C 32.492188 10.15625 26.210938 10 25 10 C 23.789063 10 17.503906 10.15625 12.476563 13.277344 C 12.007813 13.570313 11.390625 13.425781 11.101563 12.957031 C 10.808594 12.484375 10.953125 11.871094 11.421875 11.578125 C 14.347656 9.765625 17.582031 8.867188 20.214844 8.425781 C 20.0625 7.929688 19.925781 7.617188 19.914063 7.589844 C 19.738281 7.203125 19.34375 6.960938 18.921875 7.003906 C 18.652344 7.023438 12.355469 7.566406 8.320313 10.8125 C 6.214844 12.761719 2 24.152344 2 34 C 2 34.175781 2.046875 34.34375 2.132813 34.496094 C 5.039063 39.605469 12.972656 40.941406 14.78125 41 C 14.789063 41 14.800781 41 14.8125 41 C 15.132813 41 15.433594 40.847656 15.621094 40.589844 L 17.449219 38.074219 C 12.515625 36.800781 9.996094 34.636719 9.851563 34.507813 C 9.4375 34.144531 9.398438 33.511719 9.765625 33.097656 C 10.128906 32.683594 10.761719 32.644531 11.175781 33.007813 C 11.234375 33.0625 15.875 37 25 37 C 34.140625 37 38.78125 33.046875 38.828125 33.007813 C 39.242188 32.648438 39.871094 32.683594 40.238281 33.101563 C 40.601563 33.515625 40.5625 34.144531 40.148438 34.507813 C 40.003906 34.636719 37.484375 36.800781 32.550781 38.074219 L 34.378906 40.589844 C 34.566406 40.847656 34.867188 41 35.1875 41 C 35.199219 41 35.210938 41 35.21875 41 C 37.027344 40.941406 44.960938 39.605469 47.867188 34.496094 C 47.953125 34.34375 48 34.175781 48 34 C 48 24.152344 43.785156 12.761719 41.625 10.769531 Z M 18.5 30 C 16.566406 30 15 
28.210938 15 26 C 15 23.789063 16.566406 22 18.5 22 C 20.433594 22 22 23.789063 22 26 C 22 28.210938 20.433594 30 18.5 30 Z M 31.5 30 C 29.566406 30 28 28.210938 28 26 C 28 23.789063 29.566406 22 31.5 22 C 33.433594 22 35 23.789063 35 26 C 35 28.210938 33.433594 30 31.5 30 Z"/></svg>
</a>
</div>
<div style="margin-left: 5px; margin-right: 5px;">
<a href="https://twitter.com/lancedb" target="_blank" rel="noopener noreferrer">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0,0,256,256" width="25px" height="25px" fill-rule="nonzero"><g fill-opacity="0" fill="#ffffff" fill-rule="nonzero" stroke="none" stroke-width="1" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="10" stroke-dasharray="" stroke-dashoffset="0" font-family="none" font-weight="none" font-size="none" text-anchor="none" style="mix-blend-mode: normal"><path d="M0,256v-256h256v256z" id="bgRectangle"></path></g><g fill="#ffffff" fill-rule="nonzero" stroke="none" stroke-width="1" stroke-linecap="butt" stroke-linejoin="miter" stroke-miterlimit="10" stroke-dasharray="" stroke-dashoffset="0" font-family="none" font-weight="none" font-size="none" text-anchor="none" style="mix-blend-mode: normal"><g transform="scale(4,4)"><path d="M57,17.114c-1.32,1.973 -2.991,3.707 -4.916,5.097c0.018,0.423 0.028,0.847 0.028,1.274c0,13.013 -9.902,28.018 -28.016,28.018c-5.562,0 -12.81,-1.948 -15.095,-4.423c0.772,0.092 1.556,0.138 2.35,0.138c4.615,0 8.861,-1.575 12.23,-4.216c-4.309,-0.079 -7.946,-2.928 -9.199,-6.84c1.96,0.308 4.447,-0.17 4.447,-0.17c0,0 -7.7,-1.322 -7.899,-9.779c2.226,1.291 4.46,1.231 4.46,1.231c0,0 -4.441,-2.734 -4.379,-8.195c0.037,-3.221 1.331,-4.953 1.331,-4.953c8.414,10.361 20.298,10.29 20.298,10.29c0,0 -0.255,-1.471 -0.255,-2.243c0,-5.437 4.408,-9.847 9.847,-9.847c2.832,0 5.391,1.196 7.187,3.111c2.245,-0.443 4.353,-1.263 6.255,-2.391c-0.859,3.44 -4.329,5.448 -4.329,5.448c0,0 2.969,-0.329 5.655,-1.55z"></path></g></g></svg>
@@ -173,4 +179,77 @@
{% include "partials/tabs.html" %}
{% endif %}
{% endif %}
</header>
</header>
<script>
(function() {
function checkPathAndRedirect() {
var banner = document.getElementById('deprecation-banner');
if (document.querySelector('meta[http-equiv="refresh"]')) {
return; // The redirects plugin is already handling this page.
}
var currentPath = window.location.pathname;
var cleanPath = currentPath.endsWith('/') && currentPath.length > 1
? currentPath.slice(0, -1)
: currentPath;
// These are the ONLY paths that should remain on the old site
var apiPaths = [
'/lancedb/python',
'/lancedb/javascript',
'/lancedb/js',
'/lancedb/api_reference'
];
var isApiPage = apiPaths.some(function(apiPath) {
return cleanPath.startsWith(apiPath);
});
if (isApiPage) {
if (banner) {
banner.style.display = 'none';
}
} else {
if (banner) {
banner.style.display = 'block';
}
// Add a noindex meta tag so search engines stop indexing the old docs (SEO)
var noindexMeta = document.createElement('meta');
noindexMeta.setAttribute('name', 'robots');
noindexMeta.setAttribute('content', 'noindex, follow');
document.head.appendChild(noindexMeta);
// Add a canonical link so search engines credit ranking signals to the new docs site (SEO)
var canonicalLink = document.createElement('link');
canonicalLink.setAttribute('rel', 'canonical');
canonicalLink.setAttribute('href', 'https://lancedb.com/docs');
document.head.appendChild(canonicalLink);
window.location.replace('https://lancedb.com/docs');
}
}
// Run the check as soon as the document is ready, so the initial page load
// is caught and redirected.
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', checkPathAndRedirect);
} else {
checkPathAndRedirect();
}
// Use an interval to handle subsequent navigation clicks.
var lastPath = window.location.pathname;
setInterval(function() {
if (window.location.pathname !== lastPath) {
lastPath = window.location.pathname;
checkPathAndRedirect();
}
}, 2000); // Poll every 2 seconds: frequent enough to catch client-side
// navigation, infrequent enough to stay cheap.
})();
</script>

View File

@@ -5,3 +5,4 @@ mkdocstrings[python]==0.25.2
griffe
mkdocs-render-swagger-plugin
pydantic
mkdocs-redirects

View File

@@ -25,6 +25,51 @@ the underlying connection has been closed.
## Methods
### cloneTable()
```ts
abstract cloneTable(
targetTableName,
sourceUri,
options?): Promise<Table>
```
Clone a table from a source table.
A shallow clone creates a new table that shares the underlying data files
with the source table but has its own independent manifest. This allows
both the source and cloned tables to evolve independently while initially
sharing the same data, deletion, and index files.
#### Parameters
* **targetTableName**: `string`
The name of the target table to create.
* **sourceUri**: `string`
The URI of the source table to clone from.
* **options?**
Clone options.
* **options.isShallow?**: `boolean`
Whether to perform a shallow clone (defaults to true).
* **options.sourceTag?**: `string`
The tag of the source table to clone.
* **options.sourceVersion?**: `number`
The version of the source table to clone.
* **options.targetNamespace?**: `string`[]
The namespace for the target table (defaults to root namespace).
#### Returns
`Promise`&lt;[`Table`](Table.md)&gt;
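#### Example
The shallow-clone behavior described above can be sketched with plain objects. This is an illustrative toy model only — `ToyManifest` and `shallowClone` are hypothetical names, not LanceDB's actual implementation:

```typescript
// Toy model of a shallow clone: the clone shares data files with the
// source but gets its own manifest, so the two can diverge afterwards.
interface ToyManifest {
  version: number;
  dataFiles: string[]; // paths into shared storage
}

function shallowClone(source: ToyManifest): ToyManifest {
  // Copy the file *list* (cheap) instead of the file contents (expensive).
  return { version: 1, dataFiles: [...source.dataFiles] };
}

const source: ToyManifest = { version: 7, dataFiles: ["data/0.lance"] };
const clone = shallowClone(source);

// Both tables initially reference the same data files...
console.log(clone.dataFiles[0] === source.dataFiles[0]); // true

// ...but appending to the clone does not affect the source.
clone.dataFiles.push("data/1.lance");
clone.version += 1;
console.log(source.dataFiles.length); // 1
```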
***
### close()
```ts

View File

@@ -194,6 +194,37 @@ currently is also a memory intensive operation.
***
### ivfRq()
```ts
static ivfRq(options?): Index
```
Create an IvfRq index
IVF-RQ compresses vectors using RabitQ quantization and organizes them into IVF partitions.
Each dimension is quantized into a small number of bits. The `num_bits` parameter controls
this process, providing a tradeoff between index size (and thus search speed) and index accuracy.
The partitioning process is called IVF and the `num_partitions` parameter controls how
many groups to create.
Note that training an IVF RQ index on a large dataset is a slow operation and
currently is also a memory intensive operation.
#### Parameters
* **options?**: `Partial`&lt;[`IvfRqOptions`](../interfaces/IvfRqOptions.md)&gt;
#### Returns
[`Index`](Index.md)
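#### Example
To make the size tradeoff concrete, here is a rough back-of-the-envelope sketch. The formula is an approximation that ignores per-vector metadata such as IVF partition ids:

```typescript
// Approximate storage for one RabitQ-quantized vector: numBits bits per
// dimension, rounded up to whole bytes. An uncompressed float32 vector
// needs 4 bytes per dimension.
function rqBytesPerVector(dims: number, numBits: number): number {
  return Math.ceil((dims * numBits) / 8);
}

const dims = 768;
console.log(dims * 4);                  // 3072 bytes uncompressed (float32)
console.log(rqBytesPerVector(dims, 1)); // 96 bytes at 1 bit per dimension
console.log(rqBytesPerVector(dims, 4)); // 384 bytes at 4 bits per dimension
```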
***
### labelList()
```ts

View File

@@ -52,6 +52,30 @@ the merge result
***
### useIndex()
```ts
useIndex(useIndex): MergeInsertBuilder
```
Controls whether to use indexes for the merge operation.
When set to `true` (the default), the operation will use an index if available
on the join key for improved performance. When set to `false`, it forces a full
table scan even if an index exists. This can be useful for benchmarking or when
the query optimizer chooses a suboptimal path.
#### Parameters
* **useIndex**: `boolean`
Whether to use indices for the merge operation. Defaults to `true`.
#### Returns
[`MergeInsertBuilder`](MergeInsertBuilder.md)
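#### Example
The tradeoff can be illustrated with a toy join. This is a sketch of the concept only, not LanceDB's actual merge path: with an index, each per-row match is a hash lookup; without one, it is a linear scan:

```typescript
interface Row {
  id: number;
  value: string;
}

const target: Row[] = [
  { id: 1, value: "a" },
  { id: 2, value: "b" },
  { id: 3, value: "c" },
];

// Analogous to useIndex(false): full scan, O(n) per lookup.
function findByScan(rows: Row[], id: number): Row | undefined {
  return rows.find((r) => r.id === id);
}

// Analogous to useIndex(true): index on the join key, O(1) per lookup.
const index = new Map(target.map((r) => [r.id, r]));
function findByIndex(id: number): Row | undefined {
  return index.get(id);
}

// Both strategies return the same match; only the cost differs.
console.log(findByScan(target, 2)?.value); // "b"
console.log(findByIndex(2)?.value);        // "b"
```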
***
### whenMatchedUpdateAll()
```ts

View File

@@ -0,0 +1,220 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / PermutationBuilder
# Class: PermutationBuilder
A PermutationBuilder for creating data permutations with splits, shuffling, and filtering.
This class provides a TypeScript wrapper around the native Rust PermutationBuilder,
offering methods to configure data splits, shuffling, and filtering before executing
the permutation to create a new table.
## Methods
### execute()
```ts
execute(): Promise<Table>
```
Execute the permutation and create the destination table.
#### Returns
`Promise`&lt;[`Table`](Table.md)&gt;
A Promise that resolves to the new Table instance
#### Example
```ts
const permutationTable = await builder.execute();
console.log(`Created table: ${permutationTable.name}`);
```
***
### filter()
```ts
filter(filter): PermutationBuilder
```
Configure filtering for the permutation.
#### Parameters
* **filter**: `string`
SQL filter expression
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
builder.filter("age > 18 AND status = 'active'");
```
***
### shuffle()
```ts
shuffle(options): PermutationBuilder
```
Configure shuffling for the permutation.
#### Parameters
* **options**: [`ShuffleOptions`](../interfaces/ShuffleOptions.md)
Configuration for shuffling
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
// Basic shuffle
builder.shuffle({ seed: 42 });
// Shuffle with clump size
builder.shuffle({ seed: 42, clumpSize: 10 });
```
***
### splitCalculated()
```ts
splitCalculated(calculation): PermutationBuilder
```
Configure calculated splits for the permutation.
#### Parameters
* **calculation**: `string`
SQL expression for calculating splits
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
builder.splitCalculated("user_id % 3");
```
***
### splitHash()
```ts
splitHash(options): PermutationBuilder
```
Configure hash-based splits for the permutation.
#### Parameters
* **options**: [`SplitHashOptions`](../interfaces/SplitHashOptions.md)
Configuration for hash-based splitting
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
builder.splitHash({
columns: ["user_id"],
splitWeights: [70, 30],
discardWeight: 0
});
```
***
### splitRandom()
```ts
splitRandom(options): PermutationBuilder
```
Configure random splits for the permutation.
#### Parameters
* **options**: [`SplitRandomOptions`](../interfaces/SplitRandomOptions.md)
Configuration for random splitting
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
// Split by ratios
builder.splitRandom({ ratios: [0.7, 0.3], seed: 42 });
// Split by counts
builder.splitRandom({ counts: [1000, 500], seed: 42 });
// Split with fixed size
builder.splitRandom({ fixed: 100, seed: 42 });
```
***
### splitSequential()
```ts
splitSequential(options): PermutationBuilder
```
Configure sequential splits for the permutation.
#### Parameters
* **options**: [`SplitSequentialOptions`](../interfaces/SplitSequentialOptions.md)
Configuration for sequential splitting
#### Returns
[`PermutationBuilder`](PermutationBuilder.md)
A new PermutationBuilder instance
#### Example
```ts
// Split by ratios
builder.splitSequential({ ratios: [0.8, 0.2] });
// Split by counts
builder.splitSequential({ counts: [800, 200] });
// Split with fixed size
builder.splitSequential({ fixed: 1000 });
```

View File

@@ -13,7 +13,7 @@ function makeArrowTable(
metadata?): ArrowTable
```
An enhanced version of the makeTable function from Apache Arrow
An enhanced version of the apache-arrow `makeTable` function
that supports nested fields and embeddings columns.
(typically you do not need to call this function. It will be called automatically

View File

@@ -0,0 +1,37 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / permutationBuilder
# Function: permutationBuilder()
```ts
function permutationBuilder(table, destTableName): PermutationBuilder
```
Create a permutation builder for the given table.
## Parameters
* **table**: [`Table`](../classes/Table.md)
The source table to create a permutation from
* **destTableName**: `string`
The name for the destination permutation table
## Returns
[`PermutationBuilder`](../classes/PermutationBuilder.md)
A PermutationBuilder instance
## Example
```ts
const builder = permutationBuilder(sourceTable, "training_data")
.splitRandom({ ratios: [0.8, 0.2], seed: 42 })
.shuffle({ seed: 123 });
const trainingTable = await builder.execute();
```

View File

@@ -28,6 +28,7 @@
- [MultiMatchQuery](classes/MultiMatchQuery.md)
- [NativeJsHeaderProvider](classes/NativeJsHeaderProvider.md)
- [OAuthHeaderProvider](classes/OAuthHeaderProvider.md)
- [PermutationBuilder](classes/PermutationBuilder.md)
- [PhraseQuery](classes/PhraseQuery.md)
- [Query](classes/Query.md)
- [QueryBase](classes/QueryBase.md)
@@ -68,6 +69,7 @@
- [IndexStatistics](interfaces/IndexStatistics.md)
- [IvfFlatOptions](interfaces/IvfFlatOptions.md)
- [IvfPqOptions](interfaces/IvfPqOptions.md)
- [IvfRqOptions](interfaces/IvfRqOptions.md)
- [MergeResult](interfaces/MergeResult.md)
- [OpenTableOptions](interfaces/OpenTableOptions.md)
- [OptimizeOptions](interfaces/OptimizeOptions.md)
@@ -75,9 +77,14 @@
- [QueryExecutionOptions](interfaces/QueryExecutionOptions.md)
- [RemovalStats](interfaces/RemovalStats.md)
- [RetryConfig](interfaces/RetryConfig.md)
- [ShuffleOptions](interfaces/ShuffleOptions.md)
- [SplitHashOptions](interfaces/SplitHashOptions.md)
- [SplitRandomOptions](interfaces/SplitRandomOptions.md)
- [SplitSequentialOptions](interfaces/SplitSequentialOptions.md)
- [TableNamesOptions](interfaces/TableNamesOptions.md)
- [TableStatistics](interfaces/TableStatistics.md)
- [TimeoutConfig](interfaces/TimeoutConfig.md)
- [TlsConfig](interfaces/TlsConfig.md)
- [TokenResponse](interfaces/TokenResponse.md)
- [UpdateOptions](interfaces/UpdateOptions.md)
- [UpdateResult](interfaces/UpdateResult.md)
@@ -101,3 +108,4 @@
- [connect](functions/connect.md)
- [makeArrowTable](functions/makeArrowTable.md)
- [packBits](functions/packBits.md)
- [permutationBuilder](functions/permutationBuilder.md)

View File

@@ -40,6 +40,14 @@ optional timeoutConfig: TimeoutConfig;
***
### tlsConfig?
```ts
optional tlsConfig: TlsConfig;
```
***
### userAgent?
```ts

View File

@@ -0,0 +1,23 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / ShuffleOptions
# Interface: ShuffleOptions
## Properties
### clumpSize?
```ts
optional clumpSize: number;
```
***
### seed?
```ts
optional seed: number;
```

View File

@@ -0,0 +1,31 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / SplitHashOptions
# Interface: SplitHashOptions
## Properties
### columns
```ts
columns: string[];
```
***
### discardWeight?
```ts
optional discardWeight: number;
```
***
### splitWeights
```ts
splitWeights: number[];
```

View File

@@ -0,0 +1,39 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / SplitRandomOptions
# Interface: SplitRandomOptions
## Properties
### counts?
```ts
optional counts: number[];
```
***
### fixed?
```ts
optional fixed: number;
```
***
### ratios?
```ts
optional ratios: number[];
```
***
### seed?
```ts
optional seed: number;
```

View File

@@ -0,0 +1,31 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / SplitSequentialOptions
# Interface: SplitSequentialOptions
## Properties
### counts?
```ts
optional counts: number[];
```
***
### fixed?
```ts
optional fixed: number;
```
***
### ratios?
```ts
optional ratios: number[];
```

View File

@@ -0,0 +1,49 @@
[**@lancedb/lancedb**](../README.md) • **Docs**
***
[@lancedb/lancedb](../globals.md) / TlsConfig
# Interface: TlsConfig
TLS/mTLS configuration for the remote HTTP client.
## Properties
### assertHostname?
```ts
optional assertHostname: boolean;
```
Whether to verify the hostname in the server's certificate.
***
### certFile?
```ts
optional certFile: string;
```
Path to the client certificate file (PEM format) for mTLS authentication.
***
### keyFile?
```ts
optional keyFile: string;
```
Path to the client private key file (PEM format) for mTLS authentication.
***
### sslCaCert?
```ts
optional sslCaCert: string;
```
Path to the CA certificate file (PEM format) for server verification.
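## Example
Put together, an mTLS configuration might look like the sketch below. The file paths are placeholders, and the interface is declared locally here so the snippet stands alone; in real code you would import `TlsConfig` from `@lancedb/lancedb` and pass the object via the connection options:

```typescript
// Local mirror of the TlsConfig shape documented above.
interface TlsConfig {
  assertHostname?: boolean;
  certFile?: string;
  keyFile?: string;
  sslCaCert?: string;
}

const tlsConfig: TlsConfig = {
  sslCaCert: "/etc/ssl/certs/my-ca.pem",  // verify the server against this CA
  certFile: "/etc/ssl/client/client.pem", // client certificate for mTLS
  keyFile: "/etc/ssl/client/client.key",  // matching client private key
  assertHostname: true,                   // keep hostname verification on
};
console.log(tlsConfig.assertHostname); // true
```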

View File

@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.22.1-beta.3</version>
<version>0.22.2-final.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

View File

@@ -8,7 +8,7 @@
<parent>
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.22.1-beta.3</version>
<version>0.22.2-final.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

View File

@@ -6,7 +6,7 @@
<groupId>com.lancedb</groupId>
<artifactId>lancedb-parent</artifactId>
<version>0.22.1-beta.3</version>
<version>0.22.2-final.0</version>
<packaging>pom</packaging>
<name>${project.artifactId}</name>
<description>LanceDB Java SDK Parent POM</description>

View File

@@ -1,7 +1,7 @@
[package]
name = "lancedb-nodejs"
edition.workspace = true
version = "0.22.1-beta.3"
version = "0.22.2"
license.workspace = true
description.workspace = true
repository.workspace = true

View File

@@ -1,17 +1,5 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
import {
Bool,
Field,
Int32,
List,
Schema,
Struct,
Uint8,
Utf8,
} from "apache-arrow";
import * as arrow15 from "apache-arrow-15";
import * as arrow16 from "apache-arrow-16";
import * as arrow17 from "apache-arrow-17";
@@ -25,11 +13,9 @@ import {
fromTableToBuffer,
makeArrowTable,
makeEmptyTable,
tableFromIPC,
} from "../lancedb/arrow";
import {
EmbeddingFunction,
FieldOptions,
FunctionOptions,
} from "../lancedb/embedding/embedding_function";
import { EmbeddingFunctionConfig } from "../lancedb/embedding/registry";
@@ -1037,35 +1023,35 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
expect(table.getChild("test")!.get(2)).toBe(false);
});
});
// Test for the undefined values bug fix
describe("undefined values handling", () => {
it("should handle mixed undefined and actual values", () => {
const schema = new Schema([
new Field("text", new Utf8(), true), // nullable
new Field("number", new Int32(), true), // nullable
new Field("bool", new Bool(), true), // nullable
]);
const data = [
{ text: undefined, number: 42, bool: true },
{ text: "hello", number: undefined, bool: false },
{ text: "world", number: 123, bool: undefined },
];
const table = makeArrowTable(data, { schema });
const result = table.toArray();
expect(result).toHaveLength(3);
expect(result[0].text).toBe(null);
expect(result[0].number).toBe(42);
expect(result[0].bool).toBe(true);
expect(result[1].text).toBe("hello");
expect(result[1].number).toBe(null);
expect(result[1].bool).toBe(false);
expect(result[2].text).toBe("world");
expect(result[2].number).toBe(123);
expect(result[2].bool).toBe(null);
});
});
},
);
// Test for the undefined values bug fix
describe("undefined values handling", () => {
it("should handle mixed undefined and actual values", () => {
const schema = new Schema([
new Field("text", new Utf8(), true), // nullable
new Field("number", new Int32(), true), // nullable
new Field("bool", new Bool(), true), // nullable
]);
const data = [
{ text: undefined, number: 42, bool: true },
{ text: "hello", number: undefined, bool: false },
{ text: "world", number: 123, bool: undefined },
];
const table = makeArrowTable(data, { schema });
const result = table.toArray();
expect(result).toHaveLength(3);
expect(result[0].text).toBe(null);
expect(result[0].number).toBe(42);
expect(result[0].bool).toBe(true);
expect(result[1].text).toBe("hello");
expect(result[1].number).toBe(null);
expect(result[1].bool).toBe(false);
expect(result[2].text).toBe("world");
expect(result[2].number).toBe(123);
expect(result[2].bool).toBe(null);
});
});

View File

@@ -0,0 +1,234 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
import * as tmp from "tmp";
import { Table, connect, permutationBuilder } from "../lancedb";
import { makeArrowTable } from "../lancedb/arrow";
describe("PermutationBuilder", () => {
let tmpDir: tmp.DirResult;
let table: Table;
beforeEach(async () => {
tmpDir = tmp.dirSync({ unsafeCleanup: true });
const db = await connect(tmpDir.name);
// Create test data
const data = makeArrowTable(
[
{ id: 1, value: 10 },
{ id: 2, value: 20 },
{ id: 3, value: 30 },
{ id: 4, value: 40 },
{ id: 5, value: 50 },
{ id: 6, value: 60 },
{ id: 7, value: 70 },
{ id: 8, value: 80 },
{ id: 9, value: 90 },
{ id: 10, value: 100 },
],
{ vectorColumns: {} },
);
table = await db.createTable("test_table", data);
});
afterEach(() => {
tmpDir.removeCallback();
});
test("should create permutation builder", () => {
const builder = permutationBuilder(table, "permutation_table");
expect(builder).toBeDefined();
});
test("should execute basic permutation", async () => {
const builder = permutationBuilder(table, "permutation_table");
const permutationTable = await builder.execute();
expect(permutationTable).toBeDefined();
expect(permutationTable.name).toBe("permutation_table");
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
});
test("should create permutation with random splits", async () => {
const builder = permutationBuilder(table, "permutation_table").splitRandom({
ratios: [1.0],
seed: 42,
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
});
test("should create permutation with percentage splits", async () => {
const builder = permutationBuilder(table, "permutation_table").splitRandom({
ratios: [0.3, 0.7],
seed: 42,
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Check split distribution
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
});
test("should create permutation with count splits", async () => {
const builder = permutationBuilder(table, "permutation_table").splitRandom({
counts: [3, 7],
seed: 42,
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Check split distribution
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBe(3);
expect(split1Count).toBe(7);
});
test("should create permutation with hash splits", async () => {
const builder = permutationBuilder(table, "permutation_table").splitHash({
columns: ["id"],
splitWeights: [50, 50],
discardWeight: 0,
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Check that splits exist
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
});
test("should create permutation with sequential splits", async () => {
const builder = permutationBuilder(
table,
"permutation_table",
).splitSequential({ ratios: [0.5, 0.5] });
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Check split distribution - sequential should give exactly 5 and 5
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBe(5);
expect(split1Count).toBe(5);
});
test("should create permutation with calculated splits", async () => {
const builder = permutationBuilder(
table,
"permutation_table",
).splitCalculated("id % 2");
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
// Check split distribution
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(10);
});
test("should create permutation with shuffle", async () => {
const builder = permutationBuilder(table, "permutation_table").shuffle({
seed: 42,
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
});
test("should create permutation with shuffle and clump size", async () => {
const builder = permutationBuilder(table, "permutation_table").shuffle({
seed: 42,
clumpSize: 2,
});
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(10);
});
test("should create permutation with filter", async () => {
const builder = permutationBuilder(table, "permutation_table").filter(
"value > 50",
);
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(5); // Values 60, 70, 80, 90, 100
});
test("should chain multiple operations", async () => {
const builder = permutationBuilder(table, "permutation_table")
.filter("value <= 80")
.splitRandom({ ratios: [0.5, 0.5], seed: 42 })
.shuffle({ seed: 123 });
const permutationTable = await builder.execute();
const rowCount = await permutationTable.countRows();
expect(rowCount).toBe(8); // Values 10, 20, 30, 40, 50, 60, 70, 80
// Check split distribution
const split0Count = await permutationTable.countRows("split_id = 0");
const split1Count = await permutationTable.countRows("split_id = 1");
expect(split0Count).toBeGreaterThan(0);
expect(split1Count).toBeGreaterThan(0);
expect(split0Count + split1Count).toBe(8);
});
test("should throw error for invalid split arguments", () => {
const builder = permutationBuilder(table, "permutation_table");
// Test no arguments provided
expect(() => builder.splitRandom({})).toThrow(
"Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
);
// Test multiple arguments provided
expect(() =>
builder.splitRandom({ ratios: [0.5, 0.5], counts: [3, 7], seed: 42 }),
).toThrow("Exactly one of 'ratios', 'counts', or 'fixed' must be provided");
});
test("should throw error when builder is consumed", async () => {
const builder = permutationBuilder(table, "permutation_table");
// Execute once
await builder.execute();
// Should throw error on second execution
await expect(builder.execute()).rejects.toThrow("Builder already consumed");
});
});

View File

@@ -7,7 +7,6 @@ import {
ClientConfig,
Connection,
ConnectionOptions,
NativeJsHeaderProvider,
TlsConfig,
connect,
} from "../lancedb";

View File

@@ -0,0 +1,184 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
import * as arrow from "../lancedb/arrow";
import { sanitizeField, sanitizeType } from "../lancedb/sanitize";
describe("sanitize", function () {
describe("sanitizeType function", function () {
it("should handle type objects", function () {
const type = new arrow.Int32();
const result = sanitizeType(type);
expect(result.typeId).toBe(arrow.Type.Int);
expect((result as arrow.Int).bitWidth).toBe(32);
expect((result as arrow.Int).isSigned).toBe(true);
const floatType = {
typeId: 3, // Type.Float = 3
precision: 2,
toString: () => "Float",
isFloat: true,
isFixedWidth: true,
};
const floatResult = sanitizeType(floatType);
expect(floatResult).toBeInstanceOf(arrow.DataType);
expect(floatResult.typeId).toBe(arrow.Type.Float);
const floatResult2 = sanitizeType({ ...floatType, typeId: () => 3 });
expect(floatResult2).toBeInstanceOf(arrow.DataType);
expect(floatResult2.typeId).toBe(arrow.Type.Float);
});
const allTypeNameTestCases = [
["null", new arrow.Null()],
["binary", new arrow.Binary()],
["utf8", new arrow.Utf8()],
["bool", new arrow.Bool()],
["int8", new arrow.Int8()],
["int16", new arrow.Int16()],
["int32", new arrow.Int32()],
["int64", new arrow.Int64()],
["uint8", new arrow.Uint8()],
["uint16", new arrow.Uint16()],
["uint32", new arrow.Uint32()],
["uint64", new arrow.Uint64()],
["float16", new arrow.Float16()],
["float32", new arrow.Float32()],
["float64", new arrow.Float64()],
["datemillisecond", new arrow.DateMillisecond()],
["dateday", new arrow.DateDay()],
["timenanosecond", new arrow.TimeNanosecond()],
["timemicrosecond", new arrow.TimeMicrosecond()],
["timemillisecond", new arrow.TimeMillisecond()],
["timesecond", new arrow.TimeSecond()],
["intervaldaytime", new arrow.IntervalDayTime()],
["intervalyearmonth", new arrow.IntervalYearMonth()],
["durationnanosecond", new arrow.DurationNanosecond()],
["durationmicrosecond", new arrow.DurationMicrosecond()],
["durationmillisecond", new arrow.DurationMillisecond()],
["durationsecond", new arrow.DurationSecond()],
] as const;
it.each(allTypeNameTestCases)(
'should map type name "%s" to %s',
function (name, expected) {
const result = sanitizeType(name);
expect(result).toBeInstanceOf(expected.constructor);
},
);
const caseVariationTestCases = [
["NULL", new arrow.Null()],
["Utf8", new arrow.Utf8()],
["FLOAT32", new arrow.Float32()],
["DaTedAy", new arrow.DateDay()],
] as const;
it.each(caseVariationTestCases)(
'should be case insensitive for type name "%s" mapped to %s',
function (name, expected) {
const result = sanitizeType(name);
expect(result).toBeInstanceOf(expected.constructor);
},
);
it("should throw error for unrecognized type name", function () {
expect(() => sanitizeType("invalid_type")).toThrow(
"Unrecognized type name in schema: invalid_type",
);
});
});
describe("sanitizeField function", function () {
it("should handle field with string type name", function () {
const field = sanitizeField({
name: "string_field",
type: "utf8",
nullable: true,
metadata: new Map([["key", "value"]]),
});
expect(field).toBeInstanceOf(arrow.Field);
expect(field.name).toBe("string_field");
expect(field.type).toBeInstanceOf(arrow.Utf8);
expect(field.nullable).toBe(true);
expect(field.metadata?.get("key")).toBe("value");
});
it("should handle field with type object", function () {
const floatType = {
typeId: 3, // Float
precision: 32,
};
const field = sanitizeField({
name: "float_field",
type: floatType,
nullable: false,
});
expect(field).toBeInstanceOf(arrow.Field);
expect(field.name).toBe("float_field");
expect(field.type).toBeInstanceOf(arrow.DataType);
expect(field.type.typeId).toBe(arrow.Type.Float);
expect((field.type as arrow.Float64).precision).toBe(32);
expect(field.nullable).toBe(false);
});
it("should handle field with direct Type instance", function () {
const field = sanitizeField({
name: "bool_field",
type: new arrow.Bool(),
nullable: true,
});
expect(field).toBeInstanceOf(arrow.Field);
expect(field.name).toBe("bool_field");
expect(field.type).toBeInstanceOf(arrow.Bool);
expect(field.nullable).toBe(true);
});
it("should throw error for invalid field object", function () {
expect(() =>
sanitizeField({
type: "int32",
nullable: true,
}),
).toThrow(
"The field passed in is missing a `type`/`name`/`nullable` property",
);
// Invalid type
expect(() =>
sanitizeField({
name: "invalid",
type: { invalid: true },
nullable: true,
}),
).toThrow("Expected a Type to have a typeId property");
// Invalid nullable
expect(() =>
sanitizeField({
name: "invalid_nullable",
type: "int32",
nullable: "not a boolean",
}),
).toThrow("The field passed in had a non-boolean `nullable` property");
});
it("should report error for invalid type name", function () {
expect(() =>
sanitizeField({
name: "invalid_field",
type: "invalid_type",
nullable: true,
}),
).toThrow(
"Unable to sanitize type for field: invalid_field due to error: Error: Unrecognized type name in schema: invalid_type",
);
});
});
});

View File

@@ -10,7 +10,13 @@ import * as arrow16 from "apache-arrow-16";
import * as arrow17 from "apache-arrow-17";
import * as arrow18 from "apache-arrow-18";
import { MatchQuery, PhraseQuery, Table, connect } from "../lancedb";
import {
Connection,
MatchQuery,
PhraseQuery,
Table,
connect,
} from "../lancedb";
import {
Table as ArrowTable,
Field,
@@ -21,6 +27,8 @@ import {
Int64,
List,
Schema,
SchemaLike,
Type,
Uint8,
Utf8,
makeArrowTable,
@@ -39,7 +47,6 @@ import {
Operator,
instanceOfFullTextQuery,
} from "../lancedb/query";
import exp = require("constants");
describe.each([arrow15, arrow16, arrow17, arrow18])(
"Given a table",
@@ -212,8 +219,7 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
},
);
// TODO: https://github.com/lancedb/lancedb/issues/1832
it.skip("should be able to omit nullable fields", async () => {
it("should be able to omit nullable fields", async () => {
const db = await connect(tmpDir.name);
const schema = new arrow.Schema([
new arrow.Field(
@@ -237,23 +243,36 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
await table.add([data3]);
let res = await table.query().limit(10).toArray();
const resVector = res.map((r) => r.get("vector").toArray());
const resVector = res.map((r) =>
r.vector ? Array.from(r.vector) : null,
);
expect(resVector).toEqual([null, data2.vector, data3.vector]);
const resItem = res.map((r) => r.get("item").toArray());
const resItem = res.map((r) => r.item);
expect(resItem).toEqual(["foo", null, "bar"]);
const resPrice = res.map((r) => r.get("price").toArray());
const resPrice = res.map((r) => r.price);
expect(resPrice).toEqual([10.0, 2.0, 3.0]);
const data4 = { item: "foo" };
// We can't omit a column if it's not nullable
await expect(table.add([data4])).rejects.toThrow("Invalid user input");
await expect(table.add([data4])).rejects.toThrow(
"Append with different schema",
);
// But we can alter columns to make them nullable
await table.alterColumns([{ path: "price", nullable: true }]);
await table.add([data4]);
res = (await table.query().limit(10).toArray()).map((r) => r.toJSON());
expect(res).toEqual([data1, data2, data3, data4]);
res = (await table.query().limit(10).toArray()).map((r) => ({
...r.toJSON(),
vector: r.vector ? Array.from(r.vector) : null,
}));
// Rust fills missing nullable fields with null
expect(res).toEqual([
{ ...data1, vector: null },
{ ...data2, item: null },
data3,
{ ...data4, price: null, vector: null },
]);
});
it("should be able to insert nullable data for non-nullable fields", async () => {
@@ -331,6 +350,43 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
const table = await db.createTable("my_table", data);
expect(await table.countRows()).toEqual(2);
});
it("should allow undefined and omitted nullable vector fields", async () => {
// Test for the bug: can't pass undefined or omit vector column
const db = await connect("memory://");
const schema = new arrow.Schema([
new arrow.Field("id", new arrow.Int32(), true),
new arrow.Field(
"vector",
new arrow.FixedSizeList(
32,
new arrow.Field("item", new arrow.Float32(), true),
),
true, // nullable = true
),
]);
const table = await db.createEmptyTable("test_table", schema);
// Should not throw error for undefined value
await table.add([{ id: 0, vector: undefined }]);
// Should not throw error for omitted field
await table.add([{ id: 1 }]);
// Should still work for null
await table.add([{ id: 2, vector: null }]);
// Should still work for actual vector
const testVector = new Array(32).fill(0.5);
await table.add([{ id: 3, vector: testVector }]);
expect(await table.countRows()).toEqual(4);
const res = await table.query().limit(10).toArray();
const resVector = res.map((r) =>
r.vector ? Array.from(r.vector) : null,
);
expect(resVector).toEqual([null, null, null, testVector]);
});
},
);
@@ -488,6 +544,32 @@ describe("merge insert", () => {
.execute(newData, { timeoutMs: 0 }),
).rejects.toThrow("merge insert timed out");
});
test("useIndex", async () => {
const newData = [
{ a: 2, b: "x" },
{ a: 4, b: "z" },
];
// Test with useIndex(true) - should work fine
const result1 = await table
.mergeInsert("a")
.whenNotMatchedInsertAll()
.useIndex(true)
.execute(newData);
expect(result1.numInsertedRows).toBe(1); // Only a=4 should be inserted
// Test with useIndex(false) - should also work fine
const newData2 = [{ a: 5, b: "w" }];
const result2 = await table
.mergeInsert("a")
.whenNotMatchedInsertAll()
.useIndex(false)
.execute(newData2);
expect(result2.numInsertedRows).toBe(1); // a=5 should be inserted
});
});
describe("When creating an index", () => {
@@ -779,6 +861,15 @@ describe("When creating an index", () => {
});
});
it("should be able to create IVF_RQ", async () => {
await tbl.createIndex("vec", {
config: Index.ivfRq({
numPartitions: 10,
numBits: 1,
}),
});
});
it("should allow me to replace (or not) an existing index", async () => {
await tbl.createIndex("id");
// Default is replace=true
@@ -1429,7 +1520,9 @@ describe("when optimizing a dataset", () => {
it("delete unverified", async () => {
const version = await table.version();
const versionFile = `${tmpDir.name}/${table.name}.lance/_versions/${version - 1}.manifest`;
const versionFile = `${tmpDir.name}/${table.name}.lance/_versions/${
version - 1
}.manifest`;
fs.rmSync(versionFile);
let stats = await table.optimize({ deleteUnverified: false });
@@ -1943,3 +2036,52 @@ describe("column name options", () => {
expect(results2.length).toBe(10);
});
});
describe("when creating an empty table", () => {
let con: Connection;
beforeEach(async () => {
const tmpDir = tmp.dirSync({ unsafeCleanup: true });
con = await connect(tmpDir.name);
});
afterEach(() => {
con.close();
});
it("can create an empty table from an arrow Schema", async () => {
const schema = new Schema([
new Field("id", new Int64()),
new Field("vector", new Float64()),
]);
const table = await con.createEmptyTable("test", schema);
const actualSchema = await table.schema();
expect(actualSchema.fields[0].type.typeId).toBe(Type.Int);
expect((actualSchema.fields[0].type as Int64).bitWidth).toBe(64);
expect(actualSchema.fields[1].type.typeId).toBe(Type.Float);
expect((actualSchema.fields[1].type as Float64).precision).toBe(2);
});
it("can create an empty table from schema that specifies field types by name", async () => {
const schemaLike = {
fields: [
{
name: "id",
type: "int64",
nullable: true,
},
{
name: "vector",
type: "float64",
nullable: true,
},
],
metadata: new Map(),
names: ["id", "vector"],
} satisfies SchemaLike;
const table = await con.createEmptyTable("test", schemaLike);
const actualSchema = await table.schema();
expect(actualSchema.fields[0].type.typeId).toBe(Type.Int);
expect((actualSchema.fields[0].type as Int64).bitWidth).toBe(64);
expect(actualSchema.fields[1].type.typeId).toBe(Type.Float);
expect((actualSchema.fields[1].type as Float64).precision).toBe(2);
});
});


@@ -48,6 +48,7 @@
"noUnreachableSuper": "error",
"noUnsafeFinally": "error",
"noUnsafeOptionalChaining": "error",
"noUnusedImports": "error",
"noUnusedLabels": "error",
"noUnusedVariables": "warn",
"useIsNan": "error",


@@ -41,7 +41,6 @@ import {
vectorFromArray as badVectorFromArray,
makeBuilder,
makeData,
makeTable,
} from "apache-arrow";
import { Buffers } from "apache-arrow/data";
import { type EmbeddingFunction } from "./embedding/embedding_function";
@@ -74,7 +73,7 @@ export type FieldLike =
| {
type: string;
name: string;
nullable?: boolean;
nullable: boolean;
metadata?: Map<string, string>;
};
@@ -279,7 +278,7 @@ export class MakeArrowTableOptions {
}
/**
* An enhanced version of the {@link makeTable} function from Apache Arrow
 * An enhanced version of the apache-arrow `makeTable` function
* that supports nested fields and embeddings columns.
*
* (typically you do not need to call this function. It will be called automatically
@@ -1286,19 +1285,36 @@ function validateSchemaEmbeddings(
if (isFixedSizeList(field.type)) {
field = sanitizeField(field);
if (data.length !== 0 && data?.[0]?.[field.name] === undefined) {
// Check if there's an embedding function registered for this field
let hasEmbeddingFunction = false;
// Check schema metadata for embedding functions
if (schema.metadata.has("embedding_functions")) {
const embeddings = JSON.parse(
schema.metadata.get("embedding_functions")!,
);
if (
// biome-ignore lint/suspicious/noExplicitAny: we don't know the type of `f`
embeddings.find((f: any) => f["vectorColumn"] === field.name) ===
undefined
) {
// biome-ignore lint/suspicious/noExplicitAny: we don't know the type of `f`
if (embeddings.find((f: any) => f["vectorColumn"] === field.name)) {
hasEmbeddingFunction = true;
}
}
// Check passed embedding function parameter
if (embeddings && embeddings.vectorColumn === field.name) {
hasEmbeddingFunction = true;
}
// If the field is nullable AND there's no embedding function, allow undefined/omitted values
if (field.nullable && !hasEmbeddingFunction) {
fields.push(field);
} else {
// Either not nullable OR has embedding function - require explicit values
if (hasEmbeddingFunction) {
// Don't add to missingEmbeddingFields since this is expected to be filled by embedding function
fields.push(field);
} else {
missingEmbeddingFields.push(field);
}
} else {
missingEmbeddingFields.push(field);
}
} else {
fields.push(field);


@@ -3,7 +3,6 @@
import {
Data,
Schema,
SchemaLike,
TableLike,
fromTableToStreamBuffer,


@@ -43,6 +43,10 @@ export {
DeleteResult,
DropColumnsResult,
UpdateResult,
SplitRandomOptions,
SplitHashOptions,
SplitSequentialOptions,
ShuffleOptions,
} from "./native.js";
export {
@@ -85,6 +89,7 @@ export {
Index,
IndexOptions,
IvfPqOptions,
IvfRqOptions,
IvfFlatOptions,
HnswPqOptions,
HnswSqOptions,
@@ -110,6 +115,7 @@ export {
export { MergeInsertBuilder, WriteExecutionOptions } from "./merge";
export * as embedding from "./embedding";
export { permutationBuilder, PermutationBuilder } from "./permutation";
export * as rerankers from "./rerankers";
export {
SchemaLike,


@@ -112,6 +112,77 @@ export interface IvfPqOptions {
sampleRate?: number;
}
export interface IvfRqOptions {
/**
* The number of IVF partitions to create.
*
* This value should generally scale with the number of rows in the dataset.
* By default the number of partitions is the square root of the number of
* rows.
*
* If this value is too large then the first part of the search (picking the
* right partition) will be slow. If this value is too small then the second
* part of the search (searching within a partition) will be slow.
*/
numPartitions?: number;
/**
* Number of bits per dimension for residual quantization.
*
 * This value controls how much each residual component is compressed. More
 * bits make the index more accurate but make searches slower. Typical values
* are small integers; the default is 1 bit per dimension.
*/
numBits?: number;
/**
* Distance type to use to build the index.
*
* Default value is "l2".
*
* This is used when training the index to calculate the IVF partitions
* (vectors are grouped in partitions with similar vectors according to this
* distance type) and during quantization.
*
* The distance type used to train an index MUST match the distance type used
* to search the index. Failure to do so will yield inaccurate results.
*
* The following distance types are available:
*
* "l2" - Euclidean distance.
* "cosine" - Cosine distance.
* "dot" - Dot product.
*/
distanceType?: "l2" | "cosine" | "dot";
/**
* Max iterations to train IVF kmeans.
*
* When training an IVF index we use kmeans to calculate the partitions. This parameter
* controls how many iterations of kmeans to run.
*
* The default value is 50.
*/
maxIterations?: number;
/**
* The number of vectors, per partition, to sample when training IVF kmeans.
*
* When an IVF index is trained, we need to calculate partitions. These are groups
* of vectors that are similar to each other. To do this we use an algorithm called kmeans.
*
* Running kmeans on a large dataset can be slow. To speed this up we run kmeans on a
* random sample of the data. This parameter controls the size of the sample. The total
* number of vectors used to train the index is `sample_rate * num_partitions`.
*
* Increasing this value might improve the quality of the index but in most cases the
* default should be sufficient.
*
* The default value is 256.
*/
sampleRate?: number;
}
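The sizing guidance above can be made concrete with a little arithmetic (a sketch with an arbitrary example row count; only the defaults stated in the docs above are assumed):

```typescript
// Illustrative arithmetic for the documented defaults: numPartitions
// defaults to roughly the square root of the row count, and IVF kmeans
// trains on sampleRate * numPartitions sampled vectors.
const rowCount = 1_000_000; // example dataset size
const numPartitions = Math.round(Math.sqrt(rowCount));
const sampleRate = 256; // documented default
const trainingVectors = sampleRate * numPartitions;
console.log(numPartitions); // 1000
console.log(trainingVectors); // 256000
```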
/**
* Options to create an `HNSW_PQ` index
*/
@@ -523,6 +594,35 @@ export class Index {
options?.distanceType,
options?.numPartitions,
options?.numSubVectors,
options?.numBits,
options?.maxIterations,
options?.sampleRate,
),
);
}
/**
* Create an IvfRq index
*
   * IVF-RQ compresses vectors using RabitQ quantization and organizes them
   * into IVF partitions.
*
   * In RabitQ quantization, each dimension is quantized into a small number of bits.
* The parameters `num_bits` and `num_partitions` control this process, providing a tradeoff
* between index size (and thus search speed) and index accuracy.
*
* The partitioning process is called IVF and the `num_partitions` parameter controls how
* many groups to create.
*
   * Note that training an IVF_RQ index on a large dataset is a slow operation and
   * currently is also a memory-intensive operation.
*/
static ivfRq(options?: Partial<IvfRqOptions>) {
return new Index(
LanceDbIndex.ivfRq(
options?.distanceType,
options?.numPartitions,
options?.numBits,
options?.maxIterations,
options?.sampleRate,
),


@@ -70,6 +70,23 @@ export class MergeInsertBuilder {
this.#schema,
);
}
/**
* Controls whether to use indexes for the merge operation.
*
* When set to `true` (the default), the operation will use an index if available
* on the join key for improved performance. When set to `false`, it forces a full
* table scan even if an index exists. This can be useful for benchmarking or when
* the query optimizer chooses a suboptimal path.
*
* @param useIndex - Whether to use indices for the merge operation. Defaults to `true`.
*/
useIndex(useIndex: boolean): MergeInsertBuilder {
return new MergeInsertBuilder(
this.#native.useIndex(useIndex),
this.#schema,
);
}
/**
* Executes the merge insert operation
*


@@ -0,0 +1,188 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
import {
PermutationBuilder as NativePermutationBuilder,
Table as NativeTable,
ShuffleOptions,
SplitHashOptions,
SplitRandomOptions,
SplitSequentialOptions,
permutationBuilder as nativePermutationBuilder,
} from "./native.js";
import { LocalTable, Table } from "./table";
/**
* A PermutationBuilder for creating data permutations with splits, shuffling, and filtering.
*
* This class provides a TypeScript wrapper around the native Rust PermutationBuilder,
* offering methods to configure data splits, shuffling, and filtering before executing
* the permutation to create a new table.
*/
export class PermutationBuilder {
private inner: NativePermutationBuilder;
/**
* @hidden
*/
constructor(inner: NativePermutationBuilder) {
this.inner = inner;
}
/**
* Configure random splits for the permutation.
*
* @param options - Configuration for random splitting
* @returns A new PermutationBuilder instance
* @example
* ```ts
* // Split by ratios
* builder.splitRandom({ ratios: [0.7, 0.3], seed: 42 });
*
* // Split by counts
* builder.splitRandom({ counts: [1000, 500], seed: 42 });
*
* // Split with fixed size
* builder.splitRandom({ fixed: 100, seed: 42 });
* ```
*/
splitRandom(options: SplitRandomOptions): PermutationBuilder {
const newInner = this.inner.splitRandom(options);
return new PermutationBuilder(newInner);
}
/**
* Configure hash-based splits for the permutation.
*
* @param options - Configuration for hash-based splitting
* @returns A new PermutationBuilder instance
* @example
* ```ts
* builder.splitHash({
* columns: ["user_id"],
* splitWeights: [70, 30],
* discardWeight: 0
* });
* ```
*/
splitHash(options: SplitHashOptions): PermutationBuilder {
const newInner = this.inner.splitHash(options);
return new PermutationBuilder(newInner);
}
/**
* Configure sequential splits for the permutation.
*
* @param options - Configuration for sequential splitting
* @returns A new PermutationBuilder instance
* @example
* ```ts
* // Split by ratios
* builder.splitSequential({ ratios: [0.8, 0.2] });
*
* // Split by counts
* builder.splitSequential({ counts: [800, 200] });
*
* // Split with fixed size
* builder.splitSequential({ fixed: 1000 });
* ```
*/
splitSequential(options: SplitSequentialOptions): PermutationBuilder {
const newInner = this.inner.splitSequential(options);
return new PermutationBuilder(newInner);
}
/**
* Configure calculated splits for the permutation.
*
* @param calculation - SQL expression for calculating splits
* @returns A new PermutationBuilder instance
* @example
* ```ts
* builder.splitCalculated("user_id % 3");
* ```
*/
splitCalculated(calculation: string): PermutationBuilder {
const newInner = this.inner.splitCalculated(calculation);
return new PermutationBuilder(newInner);
}
/**
* Configure shuffling for the permutation.
*
* @param options - Configuration for shuffling
* @returns A new PermutationBuilder instance
* @example
* ```ts
* // Basic shuffle
* builder.shuffle({ seed: 42 });
*
* // Shuffle with clump size
* builder.shuffle({ seed: 42, clumpSize: 10 });
* ```
*/
shuffle(options: ShuffleOptions): PermutationBuilder {
const newInner = this.inner.shuffle(options);
return new PermutationBuilder(newInner);
}
/**
* Configure filtering for the permutation.
*
* @param filter - SQL filter expression
* @returns A new PermutationBuilder instance
* @example
* ```ts
* builder.filter("age > 18 AND status = 'active'");
* ```
*/
filter(filter: string): PermutationBuilder {
const newInner = this.inner.filter(filter);
return new PermutationBuilder(newInner);
}
/**
* Execute the permutation and create the destination table.
*
* @returns A Promise that resolves to the new Table instance
* @example
* ```ts
* const permutationTable = await builder.execute();
* console.log(`Created table: ${permutationTable.name}`);
* ```
*/
async execute(): Promise<Table> {
const nativeTable: NativeTable = await this.inner.execute();
return new LocalTable(nativeTable);
}
}
/**
* Create a permutation builder for the given table.
*
* @param table - The source table to create a permutation from
* @param destTableName - The name for the destination permutation table
* @returns A PermutationBuilder instance
* @example
* ```ts
* const builder = permutationBuilder(sourceTable, "training_data")
* .splitRandom({ ratios: [0.8, 0.2], seed: 42 })
* .shuffle({ seed: 123 });
*
* const trainingTable = await builder.execute();
* ```
*/
export function permutationBuilder(
table: Table,
destTableName: string,
): PermutationBuilder {
// Extract the inner native table from the TypeScript wrapper
const localTable = table as LocalTable;
// Access inner through type assertion since it's private
const nativeBuilder = nativePermutationBuilder(
// biome-ignore lint/suspicious/noExplicitAny: need access to private variable
(localTable as any).inner,
destTableName,
);
return new PermutationBuilder(nativeBuilder);
}
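The weight-based hash split configured by `splitHash` above (e.g. `splitWeights: [70, 30]`, `discardWeight: 0`) can be illustrated with a small sketch. This is illustrative bucket math only, not the actual Lance hashing:

```typescript
// Illustrative sketch of weight-based split assignment: a row's hash is
// reduced modulo the total weight, and the bucket it lands in picks the
// split. With splitWeights [70, 30] and discardWeight 0, ~70% of hashes go
// to split 0 and ~30% to split 1.
function assignSplit(
  hash: number,
  splitWeights: number[],
  discardWeight: number,
): number {
  const total = splitWeights.reduce((a, b) => a + b, 0) + discardWeight;
  let bucket = hash % total;
  for (let i = 0; i < splitWeights.length; i++) {
    if (bucket < splitWeights[i]) return i;
    bucket -= splitWeights[i];
  }
  return -1; // hash fell in the discard range
}

console.log(assignSplit(137, [70, 30], 0)); // 137 % 100 = 37 → split 0
console.log(assignSplit(85, [70, 30], 0)); // bucket 85 → split 1
console.log(assignSplit(105, [70, 30], 10)); // bucket 105 → discarded (-1)
```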


@@ -326,6 +326,9 @@ export function sanitizeDictionary(typeLike: object) {
// biome-ignore lint/suspicious/noExplicitAny: skip
export function sanitizeType(typeLike: unknown): DataType<any> {
if (typeof typeLike === "string") {
return dataTypeFromName(typeLike);
}
if (typeof typeLike !== "object" || typeLike === null) {
throw Error("Expected a Type but object was null/undefined");
}
@@ -447,7 +450,7 @@ export function sanitizeType(typeLike: unknown): DataType<any> {
case Type.DurationSecond:
return new DurationSecond();
default:
throw new Error("Unrecoginized type id in schema: " + typeId);
throw new Error("Unrecognized type id in schema: " + typeId);
}
}
@@ -467,7 +470,15 @@ export function sanitizeField(fieldLike: unknown): Field {
"The field passed in is missing a `type`/`name`/`nullable` property",
);
}
const type = sanitizeType(fieldLike.type);
let type: DataType;
try {
type = sanitizeType(fieldLike.type);
} catch (error: unknown) {
throw Error(
`Unable to sanitize type for field: ${fieldLike.name} due to error: ${error}`,
{ cause: error },
);
}
const name = fieldLike.name;
if (!(typeof name === "string")) {
throw Error("The field passed in had a non-string `name` property");
@@ -581,3 +592,46 @@ function sanitizeData(
},
);
}
const constructorsByTypeName = {
null: () => new Null(),
binary: () => new Binary(),
utf8: () => new Utf8(),
bool: () => new Bool(),
int8: () => new Int8(),
int16: () => new Int16(),
int32: () => new Int32(),
int64: () => new Int64(),
uint8: () => new Uint8(),
uint16: () => new Uint16(),
uint32: () => new Uint32(),
uint64: () => new Uint64(),
float16: () => new Float16(),
float32: () => new Float32(),
float64: () => new Float64(),
datemillisecond: () => new DateMillisecond(),
dateday: () => new DateDay(),
timenanosecond: () => new TimeNanosecond(),
timemicrosecond: () => new TimeMicrosecond(),
timemillisecond: () => new TimeMillisecond(),
timesecond: () => new TimeSecond(),
intervaldaytime: () => new IntervalDayTime(),
intervalyearmonth: () => new IntervalYearMonth(),
durationnanosecond: () => new DurationNanosecond(),
durationmicrosecond: () => new DurationMicrosecond(),
durationmillisecond: () => new DurationMillisecond(),
durationsecond: () => new DurationSecond(),
} as const;
type MappableTypeName = keyof typeof constructorsByTypeName;
export function dataTypeFromName(typeName: string): DataType {
const normalizedTypeName = typeName.toLowerCase() as MappableTypeName;
const _constructor = constructorsByTypeName[normalizedTypeName];
if (!_constructor) {
throw new Error("Unrecognized type name in schema: " + typeName);
}
return _constructor();
}


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-arm64",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["darwin"],
"cpu": ["arm64"],
"main": "lancedb.darwin-arm64.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-darwin-x64",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["darwin"],
"cpu": ["x64"],
"main": "lancedb.darwin-x64.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-gnu",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-arm64-musl",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["linux"],
"cpu": ["arm64"],
"main": "lancedb.linux-arm64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-gnu",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-gnu.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-linux-x64-musl",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["linux"],
"cpu": ["x64"],
"main": "lancedb.linux-x64-musl.node",


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-arm64-msvc",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": [
"win32"
],


@@ -1,6 +1,6 @@
{
"name": "@lancedb/lancedb-win32-x64-msvc",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"os": ["win32"],
"cpu": ["x64"],
"main": "lancedb.win32-x64-msvc.node",


@@ -1,12 +1,12 @@
{
"name": "@lancedb/lancedb",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"lockfileVersion": 3,
"requires": true,
"packages": {
"": {
"name": "@lancedb/lancedb",
"version": "0.22.1-beta.3",
"version": "0.22.2",
"cpu": [
"x64",
"arm64"


@@ -11,7 +11,7 @@
"ann"
],
"private": false,
"version": "0.22.1-beta.3",
"version": "0.22.2",
"main": "dist/index.js",
"exports": {
".": "./dist/index.js",


@@ -6,6 +6,7 @@ use std::sync::Mutex;
use lancedb::index::scalar::{BTreeIndexBuilder, FtsIndexBuilder};
use lancedb::index::vector::{
IvfFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder,
IvfRqIndexBuilder,
};
use lancedb::index::Index as LanceDbIndex;
use napi_derive::napi;
@@ -65,6 +66,36 @@ impl Index {
})
}
#[napi(factory)]
pub fn ivf_rq(
distance_type: Option<String>,
num_partitions: Option<u32>,
num_bits: Option<u32>,
max_iterations: Option<u32>,
sample_rate: Option<u32>,
) -> napi::Result<Self> {
let mut ivf_rq_builder = IvfRqIndexBuilder::default();
if let Some(distance_type) = distance_type {
let distance_type = parse_distance_type(distance_type)?;
ivf_rq_builder = ivf_rq_builder.distance_type(distance_type);
}
if let Some(num_partitions) = num_partitions {
ivf_rq_builder = ivf_rq_builder.num_partitions(num_partitions);
}
if let Some(num_bits) = num_bits {
ivf_rq_builder = ivf_rq_builder.num_bits(num_bits);
}
if let Some(max_iterations) = max_iterations {
ivf_rq_builder = ivf_rq_builder.max_iterations(max_iterations);
}
if let Some(sample_rate) = sample_rate {
ivf_rq_builder = ivf_rq_builder.sample_rate(sample_rate);
}
Ok(Self {
inner: Mutex::new(Some(LanceDbIndex::IvfRq(ivf_rq_builder))),
})
}
#[napi(factory)]
pub fn ivf_flat(
distance_type: Option<String>,


@@ -12,6 +12,7 @@ mod header;
mod index;
mod iterator;
pub mod merge;
pub mod permutation;
mod query;
pub mod remote;
mod rerankers;


@@ -43,6 +43,13 @@ impl NativeMergeInsertBuilder {
self.inner.timeout(Duration::from_millis(timeout as u64));
}
#[napi]
pub fn use_index(&self, use_index: bool) -> Self {
let mut this = self.clone();
this.inner.use_index(use_index);
this
}
#[napi(catch_unwind)]
pub async fn execute(&self, buf: Buffer) -> napi::Result<MergeResult> {
let data = ipc_file_to_batches(buf.to_vec())

nodejs/src/permutation.rs (new file)

@@ -0,0 +1,222 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::{Arc, Mutex};
use crate::{error::NapiErrorExt, table::Table};
use lancedb::dataloader::{
permutation::{PermutationBuilder as LancePermutationBuilder, ShuffleStrategy},
split::{SplitSizes, SplitStrategy},
};
use napi_derive::napi;
#[napi(object)]
pub struct SplitRandomOptions {
pub ratios: Option<Vec<f64>>,
pub counts: Option<Vec<i64>>,
pub fixed: Option<i64>,
pub seed: Option<i64>,
}
#[napi(object)]
pub struct SplitHashOptions {
pub columns: Vec<String>,
pub split_weights: Vec<i64>,
pub discard_weight: Option<i64>,
}
#[napi(object)]
pub struct SplitSequentialOptions {
pub ratios: Option<Vec<f64>>,
pub counts: Option<Vec<i64>>,
pub fixed: Option<i64>,
}
#[napi(object)]
pub struct ShuffleOptions {
pub seed: Option<i64>,
pub clump_size: Option<i64>,
}
pub struct PermutationBuilderState {
pub builder: Option<LancePermutationBuilder>,
pub dest_table_name: String,
}
#[napi]
pub struct PermutationBuilder {
state: Arc<Mutex<PermutationBuilderState>>,
}
impl PermutationBuilder {
pub fn new(builder: LancePermutationBuilder, dest_table_name: String) -> Self {
Self {
state: Arc::new(Mutex::new(PermutationBuilderState {
builder: Some(builder),
dest_table_name,
})),
}
}
}
impl PermutationBuilder {
fn modify(
&self,
func: impl FnOnce(LancePermutationBuilder) -> LancePermutationBuilder,
) -> napi::Result<Self> {
let mut state = self.state.lock().unwrap();
let builder = state
.builder
.take()
.ok_or_else(|| napi::Error::from_reason("Builder already consumed"))?;
state.builder = Some(func(builder));
Ok(Self {
state: self.state.clone(),
})
}
}
#[napi]
impl PermutationBuilder {
/// Configure random splits
#[napi]
pub fn split_random(&self, options: SplitRandomOptions) -> napi::Result<Self> {
// Check that exactly one split type is provided
let split_args_count = [
options.ratios.is_some(),
options.counts.is_some(),
options.fixed.is_some(),
]
.iter()
.filter(|&&x| x)
.count();
if split_args_count != 1 {
return Err(napi::Error::from_reason(
"Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
));
}
let sizes = if let Some(ratios) = options.ratios {
SplitSizes::Percentages(ratios)
} else if let Some(counts) = options.counts {
SplitSizes::Counts(counts.into_iter().map(|c| c as u64).collect())
} else if let Some(fixed) = options.fixed {
SplitSizes::Fixed(fixed as u64)
} else {
unreachable!("One of the split arguments must be provided");
};
let seed = options.seed.map(|s| s as u64);
self.modify(|builder| builder.with_split_strategy(SplitStrategy::Random { seed, sizes }))
}
/// Configure hash-based splits
#[napi]
pub fn split_hash(&self, options: SplitHashOptions) -> napi::Result<Self> {
let split_weights = options
.split_weights
.into_iter()
.map(|w| w as u64)
.collect();
let discard_weight = options.discard_weight.unwrap_or(0) as u64;
self.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Hash {
columns: options.columns,
split_weights,
discard_weight,
})
})
}
/// Configure sequential splits
#[napi]
pub fn split_sequential(&self, options: SplitSequentialOptions) -> napi::Result<Self> {
// Check that exactly one split type is provided
let split_args_count = [
options.ratios.is_some(),
options.counts.is_some(),
options.fixed.is_some(),
]
.iter()
.filter(|&&x| x)
.count();
if split_args_count != 1 {
return Err(napi::Error::from_reason(
"Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
));
}
let sizes = if let Some(ratios) = options.ratios {
SplitSizes::Percentages(ratios)
} else if let Some(counts) = options.counts {
SplitSizes::Counts(counts.into_iter().map(|c| c as u64).collect())
} else if let Some(fixed) = options.fixed {
SplitSizes::Fixed(fixed as u64)
} else {
unreachable!("One of the split arguments must be provided");
};
self.modify(|builder| builder.with_split_strategy(SplitStrategy::Sequential { sizes }))
}
/// Configure calculated splits
#[napi]
pub fn split_calculated(&self, calculation: String) -> napi::Result<Self> {
self.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Calculated { calculation })
})
}
/// Configure shuffling
#[napi]
pub fn shuffle(&self, options: ShuffleOptions) -> napi::Result<Self> {
let seed = options.seed.map(|s| s as u64);
let clump_size = options.clump_size.map(|c| c as u64);
self.modify(|builder| {
builder.with_shuffle_strategy(ShuffleStrategy::Random { seed, clump_size })
})
}
/// Configure filtering
#[napi]
pub fn filter(&self, filter: String) -> napi::Result<Self> {
self.modify(|builder| builder.with_filter(filter))
}
/// Execute the permutation builder and create the table
#[napi]
pub async fn execute(&self) -> napi::Result<Table> {
let (builder, dest_table_name) = {
let mut state = self.state.lock().unwrap();
let builder = state
.builder
.take()
.ok_or_else(|| napi::Error::from_reason("Builder already consumed"))?;
let dest_table_name = std::mem::take(&mut state.dest_table_name);
(builder, dest_table_name)
};
let table = builder.build(&dest_table_name).await.default_error()?;
Ok(Table::new(table))
}
}
/// Create a permutation builder for the given table
#[napi]
pub fn permutation_builder(
table: &crate::table::Table,
dest_table_name: String,
) -> napi::Result<PermutationBuilder> {
use lancedb::dataloader::permutation::PermutationBuilder as LancePermutationBuilder;
let inner_table = table.inner_ref()?.clone();
let inner_builder = LancePermutationBuilder::new(inner_table);
Ok(PermutationBuilder::new(inner_builder, dest_table_name))
}
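Both `split_random` and `split_sequential` above validate that exactly one of `ratios`, `counts`, or `fixed` is provided before resolving it to split sizes. The same check can be sketched in TypeScript; how `fixed` maps to row counts here is an assumption for illustration only:

```typescript
// Illustrative resolution of split sizes from exactly one of three
// mutually-exclusive options (mirrors the validation in the Rust binding).
interface SplitSizeOptions {
  ratios?: number[];
  counts?: number[];
  fixed?: number;
}

function resolveSplitSizes(totalRows: number, opts: SplitSizeOptions): number[] {
  const provided = [opts.ratios, opts.counts, opts.fixed].filter(
    (v) => v !== undefined,
  ).length;
  if (provided !== 1) {
    throw new Error(
      "Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
    );
  }
  if (opts.ratios) {
    return opts.ratios.map((r) => Math.floor(totalRows * r));
  }
  if (opts.counts) {
    return opts.counts;
  }
  // Assumed semantics for illustration: a single split of `fixed` rows.
  return [opts.fixed!];
}

console.log(resolveSplitSizes(1000, { ratios: [0.7, 0.3] })); // [700, 300]
```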


@@ -26,7 +26,7 @@ pub struct Table {
}
impl Table {
fn inner_ref(&self) -> napi::Result<&LanceDbTable> {
pub(crate) fn inner_ref(&self) -> napi::Result<&LanceDbTable> {
self.inner
.as_ref()
.ok_or_else(|| napi::Error::from_reason(format!("Table {} is closed", self.name)))


@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.25.1"
current_version = "0.25.3-beta.0"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.
@@ -24,6 +24,19 @@ commit = true
message = "Bump version: {current_version} → {new_version}"
commit_args = ""
# Update Cargo.lock after version bump
pre_commit_hooks = [
"""
cd python && cargo update -p lancedb-python
if git diff --quiet ../Cargo.lock; then
echo "Cargo.lock unchanged"
else
git add ../Cargo.lock
echo "Updated and staged Cargo.lock"
fi
""",
]
[tool.bumpversion.parts.pre_l]
values = ["beta", "final"]
optional_value = "final"


@@ -1,6 +1,6 @@
[package]
name = "lancedb-python"
version = "0.25.1"
version = "0.25.3-beta.0"
edition.workspace = true
description = "Python bindings for LanceDB"
license.workspace = true
@@ -14,12 +14,12 @@ name = "_lancedb"
crate-type = ["cdylib"]
[dependencies]
arrow = { version = "55.1", features = ["pyarrow"] }
arrow = { version = "56.2", features = ["pyarrow"] }
async-trait = "0.1"
lancedb = { path = "../rust/lancedb", default-features = false }
env_logger.workspace = true
pyo3 = { version = "0.24", features = ["extension-module", "abi3-py39"] }
pyo3-async-runtimes = { version = "0.24", features = [
pyo3 = { version = "0.25", features = ["extension-module", "abi3-py39"] }
pyo3-async-runtimes = { version = "0.25", features = [
"attributes",
"tokio-runtime",
] }
@@ -28,7 +28,7 @@ futures.workspace = true
tokio = { version = "1.40", features = ["sync"] }
[build-dependencies]
pyo3-build-config = { version = "0.24", features = [
pyo3-build-config = { version = "0.25", features = [
"extension-module",
"abi3-py39",
] }

View File

@@ -5,12 +5,12 @@ dynamic = ["version"]
dependencies = [
"deprecation",
"numpy",
"overrides>=0.7",
"overrides>=0.7; python_version<'3.12'",
"packaging",
"pyarrow>=16",
"pydantic>=1.10",
"tqdm>=4.27.0",
"lance-namespace==0.0.6"
"lance-namespace>=0.0.16"
]
description = "lancedb"
authors = [{ name = "LanceDB Devs", email = "dev@lancedb.com" }]

View File

@@ -133,6 +133,7 @@ class Tags:
async def update(self, tag: str, version: int): ...
class IndexConfig:
name: str
index_type: str
columns: List[str]
@@ -295,3 +296,34 @@ class AlterColumnsResult:
class DropColumnsResult:
version: int
class AsyncPermutationBuilder:
def select(self, projections: Dict[str, str]) -> "AsyncPermutationBuilder": ...
def split_random(
self,
*,
ratios: Optional[List[float]] = None,
counts: Optional[List[int]] = None,
fixed: Optional[int] = None,
seed: Optional[int] = None,
) -> "AsyncPermutationBuilder": ...
def split_hash(
self, columns: List[str], split_weights: List[int], *, discard_weight: int = 0
) -> "AsyncPermutationBuilder": ...
def split_sequential(
self,
*,
ratios: Optional[List[float]] = None,
counts: Optional[List[int]] = None,
fixed: Optional[int] = None,
) -> "AsyncPermutationBuilder": ...
def split_calculated(self, calculation: str) -> "AsyncPermutationBuilder": ...
def shuffle(
self, seed: Optional[int], clump_size: Optional[int]
) -> "AsyncPermutationBuilder": ...
def filter(self, filter: str) -> "AsyncPermutationBuilder": ...
async def execute(self) -> Table: ...
def async_permutation_builder(
table: Table, dest_table_name: str
) -> AsyncPermutationBuilder: ...

View File

@@ -5,11 +5,20 @@
from __future__ import annotations
from abc import abstractmethod
from datetime import timedelta
from pathlib import Path
import sys
from typing import TYPE_CHECKING, Dict, Iterable, List, Literal, Optional, Union
if sys.version_info >= (3, 12):
from typing import override
class EnforceOverrides:
pass
else:
from overrides import EnforceOverrides, override # type: ignore
from lancedb.embeddings.registry import EmbeddingFunctionRegistry
from overrides import EnforceOverrides, override # type: ignore
from lancedb.common import data_to_reader, sanitize_uri, validate_schema
from lancedb.background_loop import LOOP
@@ -32,7 +41,6 @@ import deprecation
if TYPE_CHECKING:
import pyarrow as pa
from .pydantic import LanceModel
from datetime import timedelta
from ._lancedb import Connection as LanceDbConnection
from .common import DATA, URI
@@ -444,7 +452,12 @@ class LanceDBConnection(DBConnection):
read_consistency_interval: Optional[timedelta] = None,
storage_options: Optional[Dict[str, str]] = None,
session: Optional[Session] = None,
_inner: Optional[LanceDbConnection] = None,
):
if _inner is not None:
self._conn = _inner
return
if not isinstance(uri, Path):
scheme = get_uri_scheme(uri)
is_local = isinstance(uri, Path) or scheme == "file"
@@ -453,11 +466,6 @@ class LanceDBConnection(DBConnection):
uri = Path(uri)
uri = uri.expanduser().absolute()
Path(uri).mkdir(parents=True, exist_ok=True)
self._uri = str(uri)
self._entered = False
self.read_consistency_interval = read_consistency_interval
self.storage_options = storage_options
self.session = session
if read_consistency_interval is not None:
read_consistency_interval_secs = read_consistency_interval.total_seconds()
@@ -476,10 +484,32 @@ class LanceDBConnection(DBConnection):
session,
)
# TODO: It would be nice if we didn't store self.storage_options but it is
# currently used by the LanceTable.to_lance method. This doesn't _really_
# work because some paths like LanceDBConnection.from_inner will lose the
# storage_options. Also, this class really shouldn't be holding any state
# beyond _conn.
self.storage_options = storage_options
self._conn = AsyncConnection(LOOP.run(do_connect()))
@property
def read_consistency_interval(self) -> Optional[timedelta]:
return LOOP.run(self._conn.get_read_consistency_interval())
@property
def session(self) -> Optional[Session]:
return self._conn.session
@property
def uri(self) -> str:
return self._conn.uri
@classmethod
def from_inner(cls, inner: LanceDbConnection):
return cls(None, _inner=inner)
def __repr__(self) -> str:
val = f"{self.__class__.__name__}(uri={self._uri!r}"
val = f"{self.__class__.__name__}(uri={self._conn.uri!r}"
if self.read_consistency_interval is not None:
val += f", read_consistency_interval={repr(self.read_consistency_interval)}"
val += ")"
@@ -489,6 +519,10 @@ class LanceDBConnection(DBConnection):
conn = AsyncConnection(await lancedb_connect(self.uri))
return await conn.table_names(start_after=start_after, limit=limit)
@property
def _inner(self) -> LanceDbConnection:
return self._conn._inner
@override
def list_namespaces(
self,
@@ -848,6 +882,13 @@ class AsyncConnection(object):
def uri(self) -> str:
return self._inner.uri
async def get_read_consistency_interval(self) -> Optional[timedelta]:
interval_secs = await self._inner.get_read_consistency_interval()
if interval_secs is not None:
return timedelta(seconds=interval_secs)
else:
return None
async def list_namespaces(
self,
namespace: List[str] = [],

View File
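The diff above moves `read_consistency_interval` behind a property that asks the native connection for the interval in seconds. The conversion it performs can be sketched in isolation (a minimal stand-in, not the actual `AsyncConnection` class): the native layer reports `None` for manual consistency, `0.0` for strong consistency, and a positive float for eventual consistency, and the Python side wraps any non-`None` value in a `timedelta`.

```python
from datetime import timedelta
from typing import Optional

def interval_from_secs(secs: Optional[float]) -> Optional[timedelta]:
    # None -> manual consistency (no interval); 0.0 -> strong;
    # positive float -> eventual consistency with that period.
    return None if secs is None else timedelta(seconds=secs)

assert interval_from_secs(None) is None
assert interval_from_secs(0.0) == timedelta(0)
assert interval_from_secs(5.0) == timedelta(seconds=5)
```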

@@ -605,9 +605,53 @@ class IvfPq:
target_partition_size: Optional[int] = None
@dataclass
class IvfRq:
"""Describes an IVF RQ Index
IVF-RQ (RabitQ quantization) stores a compressed copy of each vector using
RabitQ quantization and organizes them into IVF partitions. Parameters
largely mirror IVF-PQ for consistency.
Attributes
----------
distance_type: str, default "l2"
Distance metric used to train the index and for quantization.
The following distance types are available:
"l2" - Euclidean distance.
"cosine" - Cosine distance.
"dot" - Dot product.
num_partitions: int, default sqrt(num_rows)
Number of IVF partitions to create.
num_bits: int, default 1
Number of bits to encode each dimension.
max_iterations: int, default 50
Max iterations to train kmeans when computing IVF partitions.
sample_rate: int, default 256
Controls the number of training vectors: sample_rate * num_partitions.
target_partition_size: int, default 8192
Target size of each partition.
"""
distance_type: Literal["l2", "cosine", "dot"] = "l2"
num_partitions: Optional[int] = None
num_bits: int = 1
max_iterations: int = 50
sample_rate: int = 256
target_partition_size: Optional[int] = None
__all__ = [
"BTree",
"IvfPq",
"IvfRq",
"IvfFlat",
"HnswPq",
"HnswSq",

View File
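Since `num_bits` controls how many bits encode each vector dimension, a back-of-envelope helper shows the storage trade-off the docstring implies. This is a rough sketch for reasoning about code size only (`rq_code_bytes` is a hypothetical name, and real index entries carry additional per-vector metadata):

```python
def rq_code_bytes(dim: int, num_bits: int) -> int:
    # num_bits per dimension, rounded up to whole bytes; ignores any
    # per-vector metadata the real index stores alongside the code.
    return (dim * num_bits + 7) // 8

assert rq_code_bytes(128, 1) == 16   # default num_bits=1: 128 dims -> 16 bytes
assert rq_code_bytes(128, 8) == 128  # 8 bits/dim for comparison
```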

@@ -33,6 +33,7 @@ class LanceMergeInsertBuilder(object):
self._when_not_matched_by_source_delete = False
self._when_not_matched_by_source_condition = None
self._timeout = None
self._use_index = True
def when_matched_update_all(
self, *, where: Optional[str] = None
@@ -78,6 +79,23 @@ class LanceMergeInsertBuilder(object):
self._when_not_matched_by_source_condition = condition
return self
def use_index(self, use_index: bool) -> LanceMergeInsertBuilder:
"""
Controls whether to use indexes for the merge operation.
When set to `True` (the default), the operation will use an index if available
on the join key for improved performance. When set to `False`, it forces a full
table scan even if an index exists. This can be useful for benchmarking or when
the query optimizer chooses a suboptimal path.
Parameters
----------
use_index: bool
Whether to use indices for the merge operation. Defaults to `True`.
"""
self._use_index = use_index
return self
def execute(
self,
new_data: DATA,

View File
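The `use_index` change follows the builder pattern the class already uses: a flag that defaults to on, one chainable setter, and the stored value read at execute time. A minimal stand-in (hypothetical `MergeBuilder`/`plan` names, not the LanceDB API) illustrates the shape:

```python
class MergeBuilder:
    # The flag defaults to True; the setter flips it and returns self
    # so it can be chained like the other builder methods.
    def __init__(self):
        self._use_index = True

    def use_index(self, use_index: bool) -> "MergeBuilder":
        self._use_index = use_index
        return self

    def plan(self) -> str:
        # At execute time the stored flag decides the join strategy.
        return "index join" if self._use_index else "full scan"

assert MergeBuilder().plan() == "index join"
assert MergeBuilder().use_index(False).plan() == "full scan"
```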

@@ -12,13 +12,18 @@ from __future__ import annotations
from typing import Dict, Iterable, List, Optional, Union
import os
import sys
if sys.version_info >= (3, 12):
from typing import override
else:
from overrides import override
from lancedb.db import DBConnection
from lancedb.table import LanceTable, Table
from lancedb.util import validate_table_name
from lancedb.common import validate_schema
from lancedb.table import sanitize_create_table
from overrides import override
from lance_namespace import LanceNamespace, connect as namespace_connect
from lance_namespace_urllib3_client.models import (

View File

@@ -0,0 +1,72 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
from ._lancedb import async_permutation_builder
from .table import LanceTable
from .background_loop import LOOP
from typing import Optional
class PermutationBuilder:
def __init__(self, table: LanceTable, dest_table_name: str):
self._async = async_permutation_builder(table, dest_table_name)
def select(self, projections: dict[str, str]) -> "PermutationBuilder":
self._async.select(projections)
return self
def split_random(
self,
*,
ratios: Optional[list[float]] = None,
counts: Optional[list[int]] = None,
fixed: Optional[int] = None,
seed: Optional[int] = None,
) -> "PermutationBuilder":
self._async.split_random(ratios=ratios, counts=counts, fixed=fixed, seed=seed)
return self
def split_hash(
self,
columns: list[str],
split_weights: list[int],
*,
discard_weight: Optional[int] = None,
) -> "PermutationBuilder":
self._async.split_hash(columns, split_weights, discard_weight=discard_weight)
return self
def split_sequential(
self,
*,
ratios: Optional[list[float]] = None,
counts: Optional[list[int]] = None,
fixed: Optional[int] = None,
) -> "PermutationBuilder":
self._async.split_sequential(ratios=ratios, counts=counts, fixed=fixed)
return self
def split_calculated(self, calculation: str) -> "PermutationBuilder":
self._async.split_calculated(calculation)
return self
def shuffle(
self, *, seed: Optional[int] = None, clump_size: Optional[int] = None
) -> "PermutationBuilder":
self._async.shuffle(seed=seed, clump_size=clump_size)
return self
def filter(self, filter: str) -> "PermutationBuilder":
self._async.filter(filter)
return self
def execute(self) -> LanceTable:
async def do_execute():
inner_tbl = await self._async.execute()
return LanceTable.from_inner(inner_tbl)
return LOOP.run(do_execute())
def permutation_builder(table: LanceTable, dest_table_name: str) -> PermutationBuilder:
return PermutationBuilder(table, dest_table_name)

View File
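The new `permutation.py` wraps the async builder in a sync facade: every method forwards to the async twin and returns `self`, and only `execute()` drives the event loop. A stripped-down sketch of that delegation pattern (LanceDB runs a persistent background loop via `LOOP.run` rather than `asyncio.run`, and the real builders record far more state):

```python
import asyncio

class AsyncBuilder:
    # Stand-in for the native async builder: records each configuration step.
    def __init__(self):
        self.steps = []

    def shuffle(self, seed=None):
        self.steps.append(("shuffle", seed))

    async def execute(self):
        return self.steps

class SyncBuilder:
    # Each method forwards to the async twin and returns self so calls
    # chain fluently; execute() drives the async side to completion.
    def __init__(self):
        self._async = AsyncBuilder()

    def shuffle(self, seed=None):
        self._async.shuffle(seed)
        return self

    def execute(self):
        return asyncio.run(self._async.execute())

assert SyncBuilder().shuffle(seed=42).execute() == [("shuffle", 42)]
```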

@@ -5,15 +5,20 @@
from datetime import timedelta
import logging
from concurrent.futures import ThreadPoolExecutor
import sys
from typing import Any, Dict, Iterable, List, Optional, Union
from urllib.parse import urlparse
import warnings
if sys.version_info >= (3, 12):
from typing import override
else:
from overrides import override
# Remove this import to fix circular dependency
# from lancedb import connect_async
from lancedb.remote import ClientConfig
import pyarrow as pa
from overrides import override
from ..common import DATA
from ..db import DBConnection, LOOP

View File

@@ -114,7 +114,7 @@ class RemoteTable(Table):
index_type: Literal["BTREE", "BITMAP", "LABEL_LIST", "scalar"] = "scalar",
*,
replace: bool = False,
wait_timeout: timedelta = None,
wait_timeout: Optional[timedelta] = None,
name: Optional[str] = None,
):
"""Creates a scalar index
@@ -153,7 +153,7 @@ class RemoteTable(Table):
column: str,
*,
replace: bool = False,
wait_timeout: timedelta = None,
wait_timeout: Optional[timedelta] = None,
with_position: bool = False,
# tokenizer configs:
base_tokenizer: str = "simple",

View File

@@ -44,7 +44,7 @@ import numpy as np
from .common import DATA, VEC, VECTOR_COLUMN_NAME
from .embeddings import EmbeddingFunctionConfig, EmbeddingFunctionRegistry
from .index import BTree, IvfFlat, IvfPq, Bitmap, LabelList, HnswPq, HnswSq, FTS
from .index import BTree, IvfFlat, IvfPq, Bitmap, IvfRq, LabelList, HnswPq, HnswSq, FTS
from .merge import LanceMergeInsertBuilder
from .pydantic import LanceModel, model_to_dict
from .query import (
@@ -74,6 +74,7 @@ from .index import lang_mapping
if TYPE_CHECKING:
from .db import LanceDBConnection
from ._lancedb import (
Table as LanceDBTable,
OptimizeStats,
@@ -88,7 +89,6 @@ if TYPE_CHECKING:
MergeResult,
UpdateResult,
)
from .db import LanceDBConnection
from .index import IndexConfig
import pandas
import PIL
@@ -1707,22 +1707,38 @@ class LanceTable(Table):
namespace: List[str] = [],
storage_options: Optional[Dict[str, str]] = None,
index_cache_size: Optional[int] = None,
_async: AsyncTable = None,
):
self._conn = connection
self._namespace = namespace
self._table = LOOP.run(
connection._conn.open_table(
name,
namespace=namespace,
storage_options=storage_options,
index_cache_size=index_cache_size,
if _async is not None:
self._table = _async
else:
self._table = LOOP.run(
connection._conn.open_table(
name,
namespace=namespace,
storage_options=storage_options,
index_cache_size=index_cache_size,
)
)
)
@property
def name(self) -> str:
return self._table.name
@classmethod
def from_inner(cls, tbl: LanceDBTable):
from .db import LanceDBConnection
async_tbl = AsyncTable(tbl)
conn = LanceDBConnection.from_inner(tbl.database())
return cls(
conn,
async_tbl.name,
_async=async_tbl,
)
@classmethod
def open(cls, db, name, *, namespace: List[str] = [], **kwargs):
tbl = cls(db, name, namespace=namespace, **kwargs)
@@ -1991,7 +2007,7 @@ class LanceTable(Table):
index_cache_size: Optional[int] = None,
num_bits: int = 8,
index_type: Literal[
"IVF_FLAT", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
"IVF_FLAT", "IVF_PQ", "IVF_RQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"
] = "IVF_PQ",
max_iterations: int = 50,
sample_rate: int = 256,
@@ -2039,6 +2055,15 @@ class LanceTable(Table):
sample_rate=sample_rate,
target_partition_size=target_partition_size,
)
elif index_type == "IVF_RQ":
config = IvfRq(
distance_type=metric,
num_partitions=num_partitions,
num_bits=num_bits,
max_iterations=max_iterations,
sample_rate=sample_rate,
target_partition_size=target_partition_size,
)
elif index_type == "IVF_HNSW_PQ":
config = HnswPq(
distance_type=metric,
@@ -2747,6 +2772,10 @@ class LanceTable(Table):
self._table._do_merge(merge, new_data, on_bad_vectors, fill_value)
)
@property
def _inner(self) -> LanceDBTable:
return self._table._inner
@deprecation.deprecated(
deprecated_in="0.21.0",
current_version=__version__,
@@ -3330,7 +3359,7 @@ class AsyncTable:
*,
replace: Optional[bool] = None,
config: Optional[
Union[IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]
Union[IvfFlat, IvfPq, IvfRq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS]
] = None,
wait_timeout: Optional[timedelta] = None,
name: Optional[str] = None,
@@ -3369,11 +3398,12 @@ class AsyncTable:
"""
if config is not None:
if not isinstance(
config, (IvfFlat, IvfPq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS)
config,
(IvfFlat, IvfPq, IvfRq, HnswPq, HnswSq, BTree, Bitmap, LabelList, FTS),
):
raise TypeError(
"config must be an instance of IvfPq, HnswPq, HnswSq, BTree,"
" Bitmap, LabelList, or FTS"
"config must be an instance of IvfPq, IvfRq, HnswPq, HnswSq, BTree,"
" Bitmap, LabelList, or FTS, but got " + str(type(config))
)
try:
await self._inner.create_index(
@@ -3920,6 +3950,7 @@ class AsyncTable:
when_not_matched_by_source_delete=merge._when_not_matched_by_source_delete,
when_not_matched_by_source_condition=merge._when_not_matched_by_source_condition,
timeout=merge._timeout,
use_index=merge._use_index,
),
)

View File

@@ -18,10 +18,17 @@ AddMode = Literal["append", "overwrite"]
CreateMode = Literal["create", "overwrite"]
# Index type literals
VectorIndexType = Literal["IVF_FLAT", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ"]
VectorIndexType = Literal["IVF_FLAT", "IVF_PQ", "IVF_HNSW_SQ", "IVF_HNSW_PQ", "IVF_RQ"]
ScalarIndexType = Literal["BTREE", "BITMAP", "LABEL_LIST"]
IndexType = Literal[
"IVF_PQ", "IVF_HNSW_PQ", "IVF_HNSW_SQ", "FTS", "BTREE", "BITMAP", "LABEL_LIST"
"IVF_PQ",
"IVF_HNSW_PQ",
"IVF_HNSW_SQ",
"FTS",
"BTREE",
"BITMAP",
"LABEL_LIST",
"IVF_RQ",
]
# Tokenizer literals

View File

@@ -8,7 +8,17 @@ import pyarrow as pa
import pytest
import pytest_asyncio
from lancedb import AsyncConnection, AsyncTable, connect_async
from lancedb.index import BTree, IvfFlat, IvfPq, Bitmap, LabelList, HnswPq, HnswSq, FTS
from lancedb.index import (
BTree,
IvfFlat,
IvfPq,
IvfRq,
Bitmap,
LabelList,
HnswPq,
HnswSq,
FTS,
)
@pytest_asyncio.fixture
@@ -35,6 +45,8 @@ async def some_table(db_async):
"tags": [
[f"tag{random.randint(0, 8)}" for _ in range(2)] for _ in range(NROWS)
],
"is_active": [random.choice([True, False]) for _ in range(NROWS)],
"data": [random.randbytes(random.randint(0, 128)) for _ in range(NROWS)],
}
)
return await db_async.create_table(
@@ -99,10 +111,17 @@ async def test_create_fixed_size_binary_index(some_table: AsyncTable):
@pytest.mark.asyncio
async def test_create_bitmap_index(some_table: AsyncTable):
await some_table.create_index("id", config=Bitmap())
await some_table.create_index("is_active", config=Bitmap())
await some_table.create_index("data", config=Bitmap())
indices = await some_table.list_indices()
assert str(indices) == '[Index(Bitmap, columns=["id"], name="id_idx")]'
indices = await some_table.list_indices()
assert len(indices) == 1
assert len(indices) == 3
assert indices[0].index_type == "Bitmap"
assert indices[0].columns == ["id"]
assert indices[1].index_type == "Bitmap"
assert indices[1].columns == ["is_active"]
assert indices[2].index_type == "Bitmap"
assert indices[2].columns == ["data"]
index_name = indices[0].name
stats = await some_table.index_stats(index_name)
assert stats.index_type == "BITMAP"
@@ -111,6 +130,11 @@ async def test_create_bitmap_index(some_table: AsyncTable):
assert stats.num_unindexed_rows == 0
assert stats.num_indices == 1
assert (
"ScalarIndexQuery"
in await some_table.query().where("is_active = TRUE").explain_plan()
)
@pytest.mark.asyncio
async def test_create_label_list_index(some_table: AsyncTable):
@@ -181,6 +205,16 @@ async def test_create_4bit_ivfpq_index(some_table: AsyncTable):
assert stats.loss >= 0.0
@pytest.mark.asyncio
async def test_create_ivfrq_index(some_table: AsyncTable):
await some_table.create_index("vector", config=IvfRq(num_bits=1))
indices = await some_table.list_indices()
assert len(indices) == 1
assert indices[0].index_type == "IvfRq"
assert indices[0].columns == ["vector"]
assert indices[0].name == "vector_idx"
@pytest.mark.asyncio
async def test_create_hnswpq_index(some_table: AsyncTable):
await some_table.create_index("vector", config=HnswPq(num_partitions=10))

View File

@@ -0,0 +1,496 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
import pyarrow as pa
import pytest
from lancedb.permutation import permutation_builder
def test_split_random_ratios(mem_db):
"""Test random splitting with ratios."""
tbl = mem_db.create_table(
"test_table", pa.table({"x": range(100), "y": range(100)})
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_random(ratios=[0.3, 0.7])
.execute()
)
# Check that the table was created and has data
assert permutation_tbl.count_rows() == 100
# Check that split_id column exists and has correct values
data = permutation_tbl.search(None).to_arrow().to_pydict()
split_ids = data["split_id"]
assert set(split_ids) == {0, 1}
# Check approximate split sizes (allowing for rounding)
split_0_count = split_ids.count(0)
split_1_count = split_ids.count(1)
assert 25 <= split_0_count <= 35 # ~30% ± tolerance
assert 65 <= split_1_count <= 75 # ~70% ± tolerance
def test_split_random_counts(mem_db):
"""Test random splitting with absolute counts."""
tbl = mem_db.create_table(
"test_table", pa.table({"x": range(100), "y": range(100)})
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_random(counts=[20, 30])
.execute()
)
# Check that we have exactly the requested counts
assert permutation_tbl.count_rows() == 50
data = permutation_tbl.search(None).to_arrow().to_pydict()
split_ids = data["split_id"]
assert split_ids.count(0) == 20
assert split_ids.count(1) == 30
def test_split_random_fixed(mem_db):
"""Test random splitting with fixed number of splits."""
tbl = mem_db.create_table(
"test_table", pa.table({"x": range(100), "y": range(100)})
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation").split_random(fixed=4).execute()
)
# Check that we have 4 splits with 25 rows each
assert permutation_tbl.count_rows() == 100
data = permutation_tbl.search(None).to_arrow().to_pydict()
split_ids = data["split_id"]
assert set(split_ids) == {0, 1, 2, 3}
for split_id in range(4):
assert split_ids.count(split_id) == 25
def test_split_random_with_seed(mem_db):
"""Test that seeded random splits are reproducible."""
tbl = mem_db.create_table("test_table", pa.table({"x": range(50), "y": range(50)}))
# Create two identical permutations with same seed
perm1 = (
permutation_builder(tbl, "perm1")
.split_random(ratios=[0.6, 0.4], seed=42)
.execute()
)
perm2 = (
permutation_builder(tbl, "perm2")
.split_random(ratios=[0.6, 0.4], seed=42)
.execute()
)
# Results should be identical
data1 = perm1.search(None).to_arrow().to_pydict()
data2 = perm2.search(None).to_arrow().to_pydict()
assert data1["row_id"] == data2["row_id"]
assert data1["split_id"] == data2["split_id"]
def test_split_hash(mem_db):
"""Test hash-based splitting."""
tbl = mem_db.create_table(
"test_table",
pa.table(
{
"id": range(100),
"category": (["A", "B", "C"] * 34)[:100], # Repeating pattern
"value": range(100),
}
),
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_hash(["category"], [1, 1], discard_weight=0)
.execute()
)
# Should have all 100 rows (no discard)
assert permutation_tbl.count_rows() == 100
data = permutation_tbl.search(None).to_arrow().to_pydict()
split_ids = data["split_id"]
assert set(split_ids) == {0, 1}
# Verify that each split has roughly 50 rows (allowing for hash variance)
split_0_count = split_ids.count(0)
split_1_count = split_ids.count(1)
assert 30 <= split_0_count <= 70 # ~50 ± 20 tolerance for hash distribution
assert 30 <= split_1_count <= 70 # ~50 ± 20 tolerance for hash distribution
# Hash splits should be deterministic - same category should go to same split
# Let's verify by creating another permutation and checking consistency
perm2 = (
permutation_builder(tbl, "test_permutation2")
.split_hash(["category"], [1, 1], discard_weight=0)
.execute()
)
data2 = perm2.search(None).to_arrow().to_pydict()
assert data["split_id"] == data2["split_id"] # Should be identical
def test_split_hash_with_discard(mem_db):
"""Test hash-based splitting with discard weight."""
tbl = mem_db.create_table(
"test_table",
pa.table({"id": range(100), "category": ["A", "B"] * 50, "value": range(100)}),
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_hash(["category"], [1, 1], discard_weight=2) # Should discard ~50%
.execute()
)
# Should have fewer than 100 rows due to discard
row_count = permutation_tbl.count_rows()
assert row_count < 100
assert row_count > 0 # But not empty
def test_split_sequential(mem_db):
"""Test sequential splitting."""
tbl = mem_db.create_table(
"test_table", pa.table({"x": range(100), "y": range(100)})
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_sequential(counts=[30, 40])
.execute()
)
assert permutation_tbl.count_rows() == 70
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
split_ids = data["split_id"]
# Sequential should maintain order
assert row_ids == sorted(row_ids)
# First 30 should be split 0, next 40 should be split 1
assert split_ids[:30] == [0] * 30
assert split_ids[30:] == [1] * 40
def test_split_calculated(mem_db):
"""Test calculated splitting."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(100), "value": range(100)})
)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_calculated("id % 3") # Split based on id modulo 3
.execute()
)
assert permutation_tbl.count_rows() == 100
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
split_ids = data["split_id"]
# Verify the calculation: each row's split_id should equal row_id % 3
for i, (row_id, split_id) in enumerate(zip(row_ids, split_ids)):
assert split_id == row_id % 3
def test_split_error_cases(mem_db):
"""Test error handling for invalid split parameters."""
tbl = mem_db.create_table("test_table", pa.table({"x": range(10), "y": range(10)}))
# Test split_random with no parameters
with pytest.raises(Exception):
permutation_builder(tbl, "error1").split_random().execute()
# Test split_random with multiple parameters
with pytest.raises(Exception):
permutation_builder(tbl, "error2").split_random(
ratios=[0.5, 0.5], counts=[5, 5]
).execute()
# Test split_sequential with no parameters
with pytest.raises(Exception):
permutation_builder(tbl, "error3").split_sequential().execute()
# Test split_sequential with multiple parameters
with pytest.raises(Exception):
permutation_builder(tbl, "error4").split_sequential(
ratios=[0.5, 0.5], fixed=2
).execute()
def test_shuffle_no_seed(mem_db):
"""Test shuffling without a seed."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(100), "value": range(100)})
)
# Create a permutation with shuffling (no seed)
permutation_tbl = permutation_builder(tbl, "test_permutation").shuffle().execute()
assert permutation_tbl.count_rows() == 100
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
# Row IDs should not be in sequential order due to shuffling
# This is probabilistic but with 100 rows, it's extremely unlikely they'd stay
# in order
assert row_ids != list(range(100))
def test_shuffle_with_seed(mem_db):
"""Test that shuffling with a seed is reproducible."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(50), "value": range(50)})
)
# Create two identical permutations with same shuffle seed
perm1 = permutation_builder(tbl, "perm1").shuffle(seed=42).execute()
perm2 = permutation_builder(tbl, "perm2").shuffle(seed=42).execute()
# Results should be identical due to same seed
data1 = perm1.search(None).to_arrow().to_pydict()
data2 = perm2.search(None).to_arrow().to_pydict()
assert data1["row_id"] == data2["row_id"]
assert data1["split_id"] == data2["split_id"]
def test_shuffle_with_clump_size(mem_db):
"""Test shuffling with clump size."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(100), "value": range(100)})
)
# Create a permutation with shuffling using clumps
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.shuffle(clump_size=10) # 10-row clumps
.execute()
)
assert permutation_tbl.count_rows() == 100
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
for i in range(10):
start = row_ids[i * 10]
assert row_ids[i * 10 : (i + 1) * 10] == list(range(start, start + 10))
def test_shuffle_different_seeds(mem_db):
"""Test that different seeds produce different shuffle orders."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(50), "value": range(50)})
)
# Create two permutations with different shuffle seeds
perm1 = (
permutation_builder(tbl, "perm1")
.split_random(fixed=2)
.shuffle(seed=42)
.execute()
)
perm2 = (
permutation_builder(tbl, "perm2")
.split_random(fixed=2)
.shuffle(seed=123)
.execute()
)
# Results should be different due to different seeds
data1 = perm1.search(None).to_arrow().to_pydict()
data2 = perm2.search(None).to_arrow().to_pydict()
# Row order should be different
assert data1["row_id"] != data2["row_id"]
def test_shuffle_combined_with_splits(mem_db):
"""Test shuffling combined with different split strategies."""
tbl = mem_db.create_table(
"test_table",
pa.table(
{
"id": range(100),
"category": (["A", "B", "C"] * 34)[:100],
"value": range(100),
}
),
)
# Test shuffle with random splits
perm_random = (
permutation_builder(tbl, "perm_random")
.split_random(ratios=[0.6, 0.4], seed=42)
.shuffle(seed=123, clump_size=None)
.execute()
)
# Test shuffle with hash splits
perm_hash = (
permutation_builder(tbl, "perm_hash")
.split_hash(["category"], [1, 1], discard_weight=0)
.shuffle(seed=456, clump_size=5)
.execute()
)
# Test shuffle with sequential splits
perm_sequential = (
permutation_builder(tbl, "perm_sequential")
.split_sequential(counts=[40, 35])
.shuffle(seed=789, clump_size=None)
.execute()
)
# Verify all permutations work and have expected properties
assert perm_random.count_rows() == 100
assert perm_hash.count_rows() == 100
assert perm_sequential.count_rows() == 75
# Verify shuffle affected the order
data_random = perm_random.search(None).to_arrow().to_pydict()
data_sequential = perm_sequential.search(None).to_arrow().to_pydict()
assert data_random["row_id"] != list(range(100))
assert data_sequential["row_id"] != list(range(75))
def test_no_shuffle_maintains_order(mem_db):
"""Test that not calling shuffle maintains the original order."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(50), "value": range(50)})
)
# Create permutation without shuffle (should maintain some order)
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.split_sequential(counts=[25, 25]) # Sequential maintains order
.execute()
)
assert permutation_tbl.count_rows() == 50
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
# With sequential splits and no shuffle, should maintain order
assert row_ids == list(range(50))
def test_filter_basic(mem_db):
"""Test basic filtering functionality."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(100), "value": range(100, 200)})
)
# Filter to only include rows where id < 50
permutation_tbl = (
permutation_builder(tbl, "test_permutation").filter("id < 50").execute()
)
assert permutation_tbl.count_rows() == 50
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
# All row_ids should be less than 50
assert all(row_id < 50 for row_id in row_ids)
def test_filter_with_splits(mem_db):
"""Test filtering combined with split strategies."""
tbl = mem_db.create_table(
"test_table",
pa.table(
{
"id": range(100),
"category": (["A", "B", "C"] * 34)[:100],
"value": range(100),
}
),
)
# Filter to only category A and B, then split
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.filter("category IN ('A', 'B')")
.split_random(ratios=[0.5, 0.5])
.execute()
)
# Should have fewer than 100 rows due to filtering
row_count = permutation_tbl.count_rows()
assert row_count == 67
data = permutation_tbl.search(None).to_arrow().to_pydict()
categories = data["category"]
# All categories should be A or B
assert all(cat in ["A", "B"] for cat in categories)
def test_filter_with_shuffle(mem_db):
"""Test filtering combined with shuffling."""
tbl = mem_db.create_table(
"test_table",
pa.table(
{
"id": range(100),
"category": (["A", "B", "C", "D"] * 25)[:100],
"value": range(100),
}
),
)
# Filter and shuffle
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.filter("category IN ('A', 'C')")
.shuffle(seed=42)
.execute()
)
row_count = permutation_tbl.count_rows()
assert row_count == 50 # Should have 50 rows (A and C categories)
data = permutation_tbl.search(None).to_arrow().to_pydict()
row_ids = data["row_id"]
assert row_ids != sorted(row_ids)
def test_filter_empty_result(mem_db):
"""Test filtering that results in empty set."""
tbl = mem_db.create_table(
"test_table", pa.table({"id": range(10), "value": range(10)})
)
# Filter that matches nothing
permutation_tbl = (
permutation_builder(tbl, "test_permutation")
.filter("value > 100") # No values > 100 in our data
.execute()
)
assert permutation_tbl.count_rows() == 0

View File
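Two of the behaviors these tests pin down are easy to sketch in plain Python (conceptual illustrations, not LanceDB's implementation). Sequential splits assign the first `counts[0]` rows to split 0, the next `counts[1]` to split 1, and drop the rest; hash splits bucket a key deterministically across `sum(split_weights) + discard_weight` buckets, so the same category always lands in the same split (the function names here are hypothetical):

```python
import hashlib
from typing import List, Optional

def split_sequential(n_rows: int, counts: List[int]) -> List[int]:
    # First counts[0] rows -> split 0, next counts[1] -> split 1, ...;
    # rows beyond sum(counts) are dropped from the permutation.
    ids: List[int] = []
    for split_id, count in enumerate(counts):
        ids.extend([split_id] * count)
    return ids[:n_rows]

def assign_split(key: str, split_weights: List[int], discard_weight: int = 0) -> Optional[int]:
    # Hash the key into sum(split_weights) + discard_weight buckets; the
    # same key always lands in the same bucket, so splits are stable.
    total = sum(split_weights) + discard_weight
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % total
    for split_id, weight in enumerate(split_weights):
        if bucket < weight:
            return split_id
        bucket -= weight
    return None  # landed in the discard range

# Mirrors test_split_sequential: 100 rows, counts=[30, 40] keeps 70 rows.
assert split_sequential(100, [30, 40]) == [0] * 30 + [1] * 40

# Mirrors test_split_hash: determinism per key, no discard at weight 0.
categories = (["A", "B", "C"] * 34)[:100]
splits = [assign_split(c, [1, 1]) for c in categories]
assert splits == [assign_split(c, [1, 1]) for c in categories]
assert set(splits) <= {0, 1}
```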

@@ -4,7 +4,10 @@
use std::{collections::HashMap, sync::Arc, time::Duration};
use arrow::{datatypes::Schema, ffi_stream::ArrowArrayStreamReader, pyarrow::FromPyArrow};
use lancedb::{connection::Connection as LanceConnection, database::CreateTableMode};
use lancedb::{
connection::Connection as LanceConnection,
database::{CreateTableMode, ReadConsistency},
};
use pyo3::{
exceptions::{PyRuntimeError, PyValueError},
pyclass, pyfunction, pymethods, Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
@@ -23,7 +26,7 @@ impl Connection {
Self { inner: Some(inner) }
}
fn get_inner(&self) -> PyResult<&LanceConnection> {
pub(crate) fn get_inner(&self) -> PyResult<&LanceConnection> {
self.inner
.as_ref()
.ok_or_else(|| PyRuntimeError::new_err("Connection is closed"))
@@ -63,6 +66,18 @@ impl Connection {
self.get_inner().map(|inner| inner.uri().to_string())
}
#[pyo3(signature = ())]
pub fn get_read_consistency_interval(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
Ok(match inner.read_consistency().await.infer_error()? {
ReadConsistency::Manual => None,
ReadConsistency::Eventual(duration) => Some(duration.as_secs_f64()),
ReadConsistency::Strong => Some(0.0_f64),
})
})
}
#[pyo3(signature = (namespace=vec![], start_after=None, limit=None))]
pub fn table_names(
self_: PyRef<'_, Self>,

View File
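The `get_read_consistency_interval` binding above surfaces read consistency to Python as an optional float: `None` for manual, the interval in seconds for eventual, and `0.0` for strong. A minimal sketch of interpreting that value on the Python side (the helper name is hypothetical, not part of the API):

```python
def describe_read_consistency(interval_secs):
    """Interpret the value returned by get_read_consistency_interval().

    None     -> manual: changes propagate only via checkout_latest
    0.0      -> strong: every read checks for updates from other processes
    positive -> eventual: updates are checked at most every interval_secs
    """
    if interval_secs is None:
        return "manual"
    if interval_secs == 0.0:
        return "strong"
    return f"eventual (every {interval_secs:g}s)"
```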

@@ -1,7 +1,7 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use lancedb::index::vector::IvfFlatIndexBuilder;
use lancedb::index::vector::{IvfFlatIndexBuilder, IvfRqIndexBuilder};
use lancedb::index::{
scalar::{BTreeIndexBuilder, FtsIndexBuilder},
vector::{IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder},
@@ -87,6 +87,22 @@ pub fn extract_index_params(source: &Option<Bound<'_, PyAny>>) -> PyResult<Lance
}
Ok(LanceDbIndex::IvfPq(ivf_pq_builder))
},
"IvfRq" => {
let params = source.extract::<IvfRqParams>()?;
let distance_type = parse_distance_type(params.distance_type)?;
let mut ivf_rq_builder = IvfRqIndexBuilder::default()
.distance_type(distance_type)
.max_iterations(params.max_iterations)
.sample_rate(params.sample_rate)
.num_bits(params.num_bits);
if let Some(num_partitions) = params.num_partitions {
ivf_rq_builder = ivf_rq_builder.num_partitions(num_partitions);
}
if let Some(target_partition_size) = params.target_partition_size {
ivf_rq_builder = ivf_rq_builder.target_partition_size(target_partition_size);
}
Ok(LanceDbIndex::IvfRq(ivf_rq_builder))
},
"HnswPq" => {
let params = source.extract::<IvfHnswPqParams>()?;
let distance_type = parse_distance_type(params.distance_type)?;
@@ -170,6 +186,16 @@ struct IvfPqParams {
target_partition_size: Option<u32>,
}
#[derive(FromPyObject)]
struct IvfRqParams {
distance_type: String,
num_partitions: Option<u32>,
num_bits: u32,
max_iterations: u32,
sample_rate: u32,
target_partition_size: Option<u32>,
}
#[derive(FromPyObject)]
struct IvfHnswPqParams {
distance_type: String,

View File

@@ -5,6 +5,7 @@ use arrow::RecordBatchStream;
use connection::{connect, Connection};
use env_logger::Env;
use index::IndexConfig;
use permutation::PyAsyncPermutationBuilder;
use pyo3::{
pymodule,
types::{PyModule, PyModuleMethods},
@@ -22,6 +23,7 @@ pub mod connection;
pub mod error;
pub mod header;
pub mod index;
pub mod permutation;
pub mod query;
pub mod session;
pub mod table;
@@ -49,7 +51,9 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
m.add_class::<DeleteResult>()?;
m.add_class::<DropColumnsResult>()?;
m.add_class::<UpdateResult>()?;
m.add_class::<PyAsyncPermutationBuilder>()?;
m.add_function(wrap_pyfunction!(connect, m)?)?;
m.add_function(wrap_pyfunction!(permutation::async_permutation_builder, m)?)?;
m.add_function(wrap_pyfunction!(util::validate_table_name, m)?)?;
m.add("__version__", env!("CARGO_PKG_VERSION"))?;
Ok(())

python/src/permutation.rs (new file, 177 lines)
View File

@@ -0,0 +1,177 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::{Arc, Mutex};
use crate::{error::PythonErrorExt, table::Table};
use lancedb::dataloader::{
permutation::{PermutationBuilder as LancePermutationBuilder, ShuffleStrategy},
split::{SplitSizes, SplitStrategy},
};
use pyo3::{
exceptions::PyRuntimeError, pyclass, pymethods, types::PyAnyMethods, Bound, PyAny, PyRefMut,
PyResult,
};
use pyo3_async_runtimes::tokio::future_into_py;
/// Create a permutation builder for the given table
#[pyo3::pyfunction]
pub fn async_permutation_builder(
table: Bound<'_, PyAny>,
dest_table_name: String,
) -> PyResult<PyAsyncPermutationBuilder> {
let table = table.getattr("_inner")?.downcast_into::<Table>()?;
let inner_table = table.borrow().inner_ref()?.clone();
let inner_builder = LancePermutationBuilder::new(inner_table);
Ok(PyAsyncPermutationBuilder {
state: Arc::new(Mutex::new(PyAsyncPermutationBuilderState {
builder: Some(inner_builder),
dest_table_name,
})),
})
}
struct PyAsyncPermutationBuilderState {
builder: Option<LancePermutationBuilder>,
dest_table_name: String,
}
#[pyclass(name = "AsyncPermutationBuilder")]
pub struct PyAsyncPermutationBuilder {
state: Arc<Mutex<PyAsyncPermutationBuilderState>>,
}
impl PyAsyncPermutationBuilder {
fn modify(
&self,
func: impl FnOnce(LancePermutationBuilder) -> LancePermutationBuilder,
) -> PyResult<Self> {
let mut state = self.state.lock().unwrap();
let builder = state
.builder
.take()
.ok_or_else(|| PyRuntimeError::new_err("Builder already consumed"))?;
state.builder = Some(func(builder));
Ok(Self {
state: self.state.clone(),
})
}
}
#[pymethods]
impl PyAsyncPermutationBuilder {
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None, seed=None))]
pub fn split_random(
slf: PyRefMut<'_, Self>,
ratios: Option<Vec<f64>>,
counts: Option<Vec<u64>>,
fixed: Option<u64>,
seed: Option<u64>,
) -> PyResult<Self> {
// Check that exactly one split type is provided
let split_args_count = [ratios.is_some(), counts.is_some(), fixed.is_some()]
.iter()
.filter(|&&x| x)
.count();
if split_args_count != 1 {
return Err(pyo3::exceptions::PyValueError::new_err(
"Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
));
}
let sizes = if let Some(ratios) = ratios {
SplitSizes::Percentages(ratios)
} else if let Some(counts) = counts {
SplitSizes::Counts(counts)
} else if let Some(fixed) = fixed {
SplitSizes::Fixed(fixed)
} else {
unreachable!("One of the split arguments must be provided");
};
slf.modify(|builder| builder.with_split_strategy(SplitStrategy::Random { seed, sizes }))
}
#[pyo3(signature = (columns, split_weights, *, discard_weight=0))]
pub fn split_hash(
slf: PyRefMut<'_, Self>,
columns: Vec<String>,
split_weights: Vec<u64>,
discard_weight: u64,
) -> PyResult<Self> {
slf.modify(|builder| {
builder.with_split_strategy(SplitStrategy::Hash {
columns,
split_weights,
discard_weight,
})
})
}
#[pyo3(signature = (*, ratios=None, counts=None, fixed=None))]
pub fn split_sequential(
slf: PyRefMut<'_, Self>,
ratios: Option<Vec<f64>>,
counts: Option<Vec<u64>>,
fixed: Option<u64>,
) -> PyResult<Self> {
// Check that exactly one split type is provided
let split_args_count = [ratios.is_some(), counts.is_some(), fixed.is_some()]
.iter()
.filter(|&&x| x)
.count();
if split_args_count != 1 {
return Err(pyo3::exceptions::PyValueError::new_err(
"Exactly one of 'ratios', 'counts', or 'fixed' must be provided",
));
}
let sizes = if let Some(ratios) = ratios {
SplitSizes::Percentages(ratios)
} else if let Some(counts) = counts {
SplitSizes::Counts(counts)
} else if let Some(fixed) = fixed {
SplitSizes::Fixed(fixed)
} else {
unreachable!("One of the split arguments must be provided");
};
slf.modify(|builder| builder.with_split_strategy(SplitStrategy::Sequential { sizes }))
}
pub fn split_calculated(slf: PyRefMut<'_, Self>, calculation: String) -> PyResult<Self> {
slf.modify(|builder| builder.with_split_strategy(SplitStrategy::Calculated { calculation }))
}
pub fn shuffle(
slf: PyRefMut<'_, Self>,
seed: Option<u64>,
clump_size: Option<u64>,
) -> PyResult<Self> {
slf.modify(|builder| {
builder.with_shuffle_strategy(ShuffleStrategy::Random { seed, clump_size })
})
}
pub fn filter(slf: PyRefMut<'_, Self>, filter: String) -> PyResult<Self> {
slf.modify(|builder| builder.with_filter(filter))
}
pub fn execute(slf: PyRefMut<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let mut state = slf.state.lock().unwrap();
let builder = state
.builder
.take()
.ok_or_else(|| PyRuntimeError::new_err("Builder already consumed"))?;
let dest_table_name = std::mem::take(&mut state.dest_table_name);
future_into_py(slf.py(), async move {
let table = builder.build(&dest_table_name).await.infer_error()?;
Ok(Table::new(table))
})
}
}
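Both `split_random` and `split_sequential` above enforce that exactly one of `ratios`, `counts`, or `fixed` is supplied before mapping it to a `SplitSizes` variant. The same validation, sketched in Python (the helper name is hypothetical, for illustration only):

```python
def resolve_split_sizes(ratios=None, counts=None, fixed=None):
    """Mirror the Rust-side check: exactly one split specification is allowed."""
    provided = {"ratios": ratios, "counts": counts, "fixed": fixed}
    given = {k: v for k, v in provided.items() if v is not None}
    if len(given) != 1:
        raise ValueError(
            "Exactly one of 'ratios', 'counts', or 'fixed' must be provided"
        )
    # Return the chosen specification, analogous to building a SplitSizes variant.
    return next(iter(given.items()))
```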

View File

@@ -3,6 +3,7 @@
use std::{collections::HashMap, sync::Arc};
use crate::{
connection::Connection,
error::PythonErrorExt,
index::{extract_index_params, IndexConfig},
query::{Query, TakeQuery},
@@ -249,7 +250,7 @@ impl Table {
}
impl Table {
fn inner_ref(&self) -> PyResult<&LanceDbTable> {
pub(crate) fn inner_ref(&self) -> PyResult<&LanceDbTable> {
self.inner
.as_ref()
.ok_or_else(|| PyRuntimeError::new_err(format!("Table {} is closed", self.name)))
@@ -272,6 +273,13 @@ impl Table {
self.inner.take();
}
pub fn database(&self) -> PyResult<Connection> {
let inner = self.inner_ref()?.clone();
let inner_connection =
lancedb::Connection::new(inner.database().clone(), inner.embedding_registry().clone());
Ok(Connection::new(inner_connection))
}
pub fn schema(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
@@ -672,6 +680,9 @@ impl Table {
if let Some(timeout) = parameters.timeout {
builder.timeout(timeout);
}
if let Some(use_index) = parameters.use_index {
builder.use_index(use_index);
}
future_into_py(self_.py(), async move {
let res = builder.execute(Box::new(batches)).await.infer_error()?;
@@ -831,6 +842,7 @@ pub struct MergeInsertParams {
when_not_matched_by_source_delete: bool,
when_not_matched_by_source_condition: Option<String>,
timeout: Option<std::time::Duration>,
use_index: Option<bool>,
}
#[pyclass]

View File

@@ -1,2 +1,2 @@
[toolchain]
channel = "1.86.0"
channel = "1.90.0"

View File

@@ -1,6 +1,6 @@
[package]
name = "lancedb"
version = "0.22.1-beta.3"
version = "0.22.2"
edition.workspace = true
description = "LanceDB: A serverless, low-latency vector database for AI applications"
license.workspace = true
@@ -11,6 +11,7 @@ rust-version.workspace = true
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
ahash = { workspace = true }
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-data = { workspace = true }
@@ -24,18 +25,23 @@ datafusion-common.workspace = true
datafusion-execution.workspace = true
datafusion-expr.workspace = true
datafusion-physical-plan.workspace = true
datafusion.workspace = true
object_store = { workspace = true }
snafu = { workspace = true }
half = { workspace = true }
lazy_static.workspace = true
lance = { workspace = true }
lance-core = { workspace = true }
lance-datafusion.workspace = true
lance-datagen = { workspace = true }
lance-file = { workspace = true }
lance-io = { workspace = true }
lance-index = { workspace = true }
lance-table = { workspace = true }
lance-linalg = { workspace = true }
lance-testing = { workspace = true }
lance-encoding = { workspace = true }
lance-namespace = { workspace = true }
moka = { workspace = true }
pin-project = { workspace = true }
tokio = { version = "1.23", features = ["rt-multi-thread"] }
@@ -45,11 +51,13 @@ bytes = "1"
futures.workspace = true
num-traits.workspace = true
url.workspace = true
rand.workspace = true
regex.workspace = true
serde = { version = "^1" }
serde_json = { version = "1" }
async-openai = { version = "0.20.0", optional = true }
serde_with = { version = "3.8.1" }
tempfile = "3.5.0"
aws-sdk-bedrockruntime = { version = "1.27.0", optional = true }
# For remote feature
reqwest = { version = "0.12.0", default-features = false, features = [
@@ -60,9 +68,8 @@ reqwest = { version = "0.12.0", default-features = false, features = [
"macos-system-configuration",
"stream",
], optional = true }
rand = { version = "0.9", features = ["small_rng"], optional = true }
http = { version = "1", optional = true } # Matching what is in reqwest
uuid = { version = "1.7.0", features = ["v4"], optional = true }
uuid = { version = "1.7.0", features = ["v4"] }
polars-arrow = { version = ">=0.37,<0.40.0", optional = true }
polars = { version = ">=0.37,<0.40.0", optional = true }
hf-hub = { version = "0.4.1", optional = true, default-features = false, features = [
@@ -81,8 +88,8 @@ crunchy.workspace = true
bytemuck_derive.workspace = true
[dev-dependencies]
anyhow = "1"
tempfile = "3.5.0"
rand = { version = "0.9", features = ["small_rng"] }
random_word = { version = "0.4.3", features = ["en"] }
uuid = { version = "1.7.0", features = ["v4"] }
walkdir = "2"
@@ -94,6 +101,7 @@ aws-smithy-runtime = { version = "1.9.1" }
datafusion.workspace = true
http-body = "1" # Matching reqwest
rstest = "0.23.0"
test-log = "0.2"
[features]
@@ -103,7 +111,7 @@ oss = ["lance/oss", "lance-io/oss"]
gcs = ["lance/gcp", "lance-io/gcp"]
azure = ["lance/azure", "lance-io/azure"]
dynamodb = ["lance/dynamodb", "aws"]
remote = ["dep:reqwest", "dep:http", "dep:rand", "dep:uuid"]
remote = ["dep:reqwest", "dep:http"]
fp16kernels = ["lance-linalg/fp16kernels"]
s3-test = []
bedrock = ["dep:aws-sdk-bedrockruntime"]

rust/lancedb/Makefile (new file, 19 lines)
View File

@@ -0,0 +1,19 @@
#
# Makefile for running tests.
#
# Run all tests.
all-tests: feature-tests remote-tests
# Run tests for every feature. This requires using docker compose to set up
# the environment.
feature-tests:
../../ci/run_with_docker_compose.sh \
cargo test --all-features --tests --locked --examples
.PHONY: feature-tests
# Run tests against remote endpoints.
remote-tests:
../../ci/run_with_test_connection.sh \
cargo test --features remote --locked
.PHONY: remote-tests

View File

@@ -7,6 +7,7 @@ pub use arrow_schema;
use datafusion_common::DataFusionError;
use datafusion_physical_plan::stream::RecordBatchStreamAdapter;
use futures::{Stream, StreamExt, TryStreamExt};
use lance_datagen::{BatchCount, BatchGeneratorBuilder, RowCount};
#[cfg(feature = "polars")]
use {crate::polars_arrow_convertors, polars::frame::ArrowChunk, polars::prelude::DataFrame};
@@ -161,6 +162,26 @@ impl IntoArrowStream for datafusion_physical_plan::SendableRecordBatchStream {
}
}
pub trait LanceDbDatagenExt {
fn into_ldb_stream(
self,
batch_size: RowCount,
num_batches: BatchCount,
) -> SendableRecordBatchStream;
}
impl LanceDbDatagenExt for BatchGeneratorBuilder {
fn into_ldb_stream(
self,
batch_size: RowCount,
num_batches: BatchCount,
) -> SendableRecordBatchStream {
let (stream, schema) = self.into_reader_stream(batch_size, num_batches);
let stream = stream.map_err(|err| Error::Arrow { source: err });
Box::pin(SimpleRecordBatchStream::new(stream, schema))
}
}
#[cfg(feature = "polars")]
/// An iterator of record batches formed from a Polars DataFrame.
pub struct PolarsDataFrameRecordBatchReader {

View File

@@ -19,7 +19,7 @@ use crate::database::listing::{
use crate::database::{
CloneTableRequest, CreateNamespaceRequest, CreateTableData, CreateTableMode,
CreateTableRequest, Database, DatabaseOptions, DropNamespaceRequest, ListNamespacesRequest,
OpenTableRequest, TableNamesRequest,
OpenTableRequest, ReadConsistency, TableNamesRequest,
};
use crate::embeddings::{
EmbeddingDefinition, EmbeddingFunction, EmbeddingRegistry, MemoryRegistry, WithEmbeddings,
@@ -152,6 +152,7 @@ impl CreateTableBuilder<true> {
let request = self.into_request()?;
Ok(Table::new_with_embedding_registry(
parent.create_table(request).await?,
parent,
embedding_registry,
))
}
@@ -211,9 +212,9 @@ impl CreateTableBuilder<false> {
/// Execute the create table operation
pub async fn execute(self) -> Result<Table> {
Ok(Table::new(
self.parent.clone().create_table(self.request).await?,
))
let parent = self.parent.clone();
let table = parent.create_table(self.request).await?;
Ok(Table::new(table, parent))
}
}
@@ -462,8 +463,10 @@ impl OpenTableBuilder {
/// Open the table
pub async fn execute(self) -> Result<Table> {
let table = self.parent.open_table(self.request).await?;
Ok(Table::new_with_embedding_registry(
self.parent.clone().open_table(self.request).await?,
table,
self.parent,
self.embedding_registry,
))
}
@@ -519,16 +522,15 @@ impl CloneTableBuilder {
/// Execute the clone operation
pub async fn execute(self) -> Result<Table> {
Ok(Table::new(
self.parent.clone().clone_table(self.request).await?,
))
let parent = self.parent.clone();
let table = parent.clone_table(self.request).await?;
Ok(Table::new(table, parent))
}
}
/// A connection to LanceDB
#[derive(Clone)]
pub struct Connection {
uri: String,
internal: Arc<dyn Database>,
embedding_registry: Arc<dyn EmbeddingRegistry>,
}
@@ -540,9 +542,19 @@ impl std::fmt::Display for Connection {
}
impl Connection {
pub fn new(
internal: Arc<dyn Database>,
embedding_registry: Arc<dyn EmbeddingRegistry>,
) -> Self {
Self {
internal,
embedding_registry,
}
}
/// Get the URI of the connection
pub fn uri(&self) -> &str {
self.uri.as_str()
self.internal.uri()
}
/// Get access to the underlying database
@@ -675,6 +687,11 @@ impl Connection {
.await
}
/// Get the read consistency of the connection
pub async fn read_consistency(&self) -> Result<ReadConsistency> {
self.internal.read_consistency().await
}
/// Drop a table in the database.
///
/// # Arguments
@@ -973,7 +990,6 @@ impl ConnectBuilder {
)?);
Ok(Connection {
internal,
uri: self.request.uri,
embedding_registry: self
.embedding_registry
.unwrap_or_else(|| Arc::new(MemoryRegistry::new())),
@@ -996,7 +1012,6 @@ impl ConnectBuilder {
let internal = Arc::new(ListingDatabase::connect_with_options(&self.request).await?);
Ok(Connection {
internal,
uri: self.request.uri,
embedding_registry: self
.embedding_registry
.unwrap_or_else(|| Arc::new(MemoryRegistry::new())),
@@ -1015,6 +1030,116 @@ pub fn connect(uri: &str) -> ConnectBuilder {
ConnectBuilder::new(uri)
}
pub struct ConnectNamespaceBuilder {
ns_impl: String,
properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
embedding_registry: Option<Arc<dyn EmbeddingRegistry>>,
session: Option<Arc<lance::session::Session>>,
}
impl ConnectNamespaceBuilder {
fn new(ns_impl: &str, properties: HashMap<String, String>) -> Self {
Self {
ns_impl: ns_impl.to_string(),
properties,
storage_options: HashMap::new(),
read_consistency_interval: None,
embedding_registry: None,
session: None,
}
}
/// Set an option for the storage layer.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_option(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
self.storage_options.insert(key.into(), value.into());
self
}
/// Set multiple options for the storage layer.
///
/// See available options at <https://lancedb.github.io/lancedb/guides/storage/>
pub fn storage_options(
mut self,
pairs: impl IntoIterator<Item = (impl Into<String>, impl Into<String>)>,
) -> Self {
for (key, value) in pairs {
self.storage_options.insert(key.into(), value.into());
}
self
}
/// The interval at which to check for updates from other processes.
///
/// If left unset, consistency is not checked. For maximum read
/// performance, this is the default. For strong consistency, set this to
/// zero seconds. Then every read will check for updates from other processes.
/// As a compromise, set this to a non-zero duration for eventual consistency.
pub fn read_consistency_interval(
mut self,
read_consistency_interval: std::time::Duration,
) -> Self {
self.read_consistency_interval = Some(read_consistency_interval);
self
}
/// Provide a custom [`EmbeddingRegistry`] to use for this connection.
pub fn embedding_registry(mut self, registry: Arc<dyn EmbeddingRegistry>) -> Self {
self.embedding_registry = Some(registry);
self
}
/// Set a custom session for object stores and caching.
///
/// By default, a new session with default configuration will be created.
/// This method allows you to provide a custom session with your own
/// configuration for object store registries, caching, etc.
pub fn session(mut self, session: Arc<lance::session::Session>) -> Self {
self.session = Some(session);
self
}
/// Execute the connection
pub async fn execute(self) -> Result<Connection> {
use crate::database::namespace::LanceNamespaceDatabase;
let internal = Arc::new(
LanceNamespaceDatabase::connect(
&self.ns_impl,
self.properties,
self.storage_options,
self.read_consistency_interval,
self.session,
)
.await?,
);
Ok(Connection {
internal,
embedding_registry: self
.embedding_registry
.unwrap_or_else(|| Arc::new(MemoryRegistry::new())),
})
}
}
/// Connect to a LanceDB database through a namespace.
///
/// # Arguments
///
/// * `ns_impl` - The namespace implementation to use (e.g., "dir" for directory-based, "rest" for REST API)
/// * `properties` - Configuration properties for the namespace implementation
pub fn connect_namespace(
ns_impl: &str,
properties: HashMap<String, String>,
) -> ConnectNamespaceBuilder {
ConnectNamespaceBuilder::new(ns_impl, properties)
}
#[cfg(all(test, feature = "remote"))]
mod test_utils {
use super::*;
@@ -1028,7 +1153,6 @@ mod test_utils {
let internal = Arc::new(crate::remote::db::RemoteDatabase::new_mock(handler));
Self {
internal,
uri: "db://test".to_string(),
embedding_registry: Arc::new(MemoryRegistry::new()),
}
}
@@ -1045,7 +1169,6 @@ mod test_utils {
));
Self {
internal,
uri: "db://test".to_string(),
embedding_registry: Arc::new(MemoryRegistry::new()),
}
}
@@ -1059,6 +1182,7 @@ mod tests {
use crate::database::listing::{ListingDatabaseOptions, NewTableConfig};
use crate::query::QueryBase;
use crate::query::{ExecutableQuery, QueryExecutionOptions};
use crate::test_connection::test_utils::new_test_connection;
use arrow::compute::concat_batches;
use arrow_array::RecordBatchReader;
use arrow_schema::{DataType, Field, Schema};
@@ -1074,11 +1198,8 @@ mod tests {
#[tokio::test]
async fn test_connect() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
assert_eq!(db.uri, uri);
let tc = new_test_connection().await.unwrap();
assert_eq!(tc.connection.uri(), tc.uri);
}
#[cfg(not(windows))]
@@ -1099,7 +1220,7 @@ mod tests {
.await
.unwrap();
assert_eq!(db.uri, relative_uri.to_str().unwrap().to_string());
assert_eq!(db.uri(), relative_uri.to_str().unwrap().to_string());
}
#[tokio::test]
@@ -1144,16 +1265,10 @@ mod tests {
assert_eq!(tables, names[..7]);
}
#[tokio::test]
async fn test_connect_s3() {
// let db = Database::connect("s3://bucket/path/to/database").await.unwrap();
}
#[tokio::test]
async fn test_open_table() {
let tmp_dir = tempdir().unwrap();
let uri = tmp_dir.path().to_str().unwrap();
let db = connect(uri).execute().await.unwrap();
let tc = new_test_connection().await.unwrap();
let db = tc.connection;
assert_eq!(db.table_names().execute().await.unwrap().len(), 0);
// open non-exist table

View File

@@ -52,13 +52,13 @@ pub fn infer_vector_columns(
for field in reader.schema().fields() {
match field.data_type() {
DataType::FixedSizeList(sub_field, _) if sub_field.data_type().is_floating() => {
columns.push(field.name().to_string());
columns.push(field.name().clone());
}
DataType::List(sub_field) if sub_field.data_type().is_floating() && !strict => {
columns_to_infer.insert(field.name().to_string(), None);
columns_to_infer.insert(field.name().clone(), None);
}
DataType::LargeList(sub_field) if sub_field.data_type().is_floating() && !strict => {
columns_to_infer.insert(field.name().to_string(), None);
columns_to_infer.insert(field.name().clone(), None);
}
_ => {}
}

View File

@@ -16,6 +16,7 @@
use std::collections::HashMap;
use std::sync::Arc;
use std::time::Duration;
use arrow_array::RecordBatchReader;
use async_trait::async_trait;
@@ -29,6 +30,7 @@ use crate::error::Result;
use crate::table::{BaseTable, TableDefinition, WriteOptions};
pub mod listing;
pub mod namespace;
pub trait DatabaseOptions {
fn serialize_into_map(&self, map: &mut HashMap<String, String>);
@@ -212,6 +214,20 @@ impl CloneTableRequest {
}
}
/// How long until a change is reflected from one Table instance to another
///
/// Tables are always internally consistent. If a write method is called on
/// a table instance it will be immediately visible in that same table instance.
pub enum ReadConsistency {
/// Changes will not be automatically propagated until the checkout_latest
/// method is called on the target table
Manual,
/// Changes will be propagated automatically within the given duration
Eventual(Duration),
/// Changes are immediately visible in target tables
Strong,
}
/// The `Database` trait defines the interface for database implementations.
///
/// A database is responsible for managing tables and their metadata.
@@ -219,6 +235,10 @@ impl CloneTableRequest {
pub trait Database:
Send + Sync + std::any::Any + std::fmt::Debug + std::fmt::Display + 'static
{
/// Get the uri of the database
fn uri(&self) -> &str;
/// Get the read consistency of the database
async fn read_consistency(&self) -> Result<ReadConsistency>;
/// List immediate child namespace names in the given namespace
async fn list_namespaces(&self, request: ListNamespacesRequest) -> Result<Vec<String>>;
/// Create a new namespace

View File

@@ -17,6 +17,7 @@ use object_store::local::LocalFileSystem;
use snafu::ResultExt;
use crate::connection::ConnectRequest;
use crate::database::ReadConsistency;
use crate::error::{CreateDirSnafu, Error, Result};
use crate::io::object_store::MirroringObjectStoreWrapper;
use crate::table::NativeTable;
@@ -598,6 +599,22 @@ impl Database for ListingDatabase {
Ok(Vec::new())
}
fn uri(&self) -> &str {
&self.uri
}
async fn read_consistency(&self) -> Result<ReadConsistency> {
if let Some(read_consistency_interval) = self.read_consistency_interval {
if read_consistency_interval.is_zero() {
Ok(ReadConsistency::Strong)
} else {
Ok(ReadConsistency::Eventual(read_consistency_interval))
}
} else {
Ok(ReadConsistency::Manual)
}
}
async fn create_namespace(&self, _request: CreateNamespaceRequest) -> Result<()> {
Err(Error::NotSupported {
message: "Namespace operations are not supported for listing database".into(),
@@ -718,9 +735,9 @@ impl Database for ListingDatabase {
.map_err(|e| Error::Lance { source: e })?;
let version_ref = match (request.source_version, request.source_tag) {
(Some(v), None) => Ok(Ref::Version(v)),
(Some(v), None) => Ok(Ref::Version(None, Some(v))),
(None, Some(tag)) => Ok(Ref::Tag(tag)),
(None, None) => Ok(Ref::Version(source_dataset.version().version)),
(None, None) => Ok(Ref::Version(None, Some(source_dataset.version().version))),
_ => Err(Error::InvalidInput {
message: "Cannot specify both source_version and source_tag".to_string(),
}),
@@ -728,7 +745,7 @@ impl Database for ListingDatabase {
let target_uri = self.table_uri(&request.target_table_name)?;
source_dataset
.shallow_clone(&target_uri, version_ref, storage_params)
.shallow_clone(&target_uri, version_ref, Some(storage_params))
.await
.map_err(|e| Error::Lance { source: e })?;
@@ -1249,7 +1266,8 @@ mod tests {
)
.unwrap();
let source_table_obj = Table::new(source_table.clone());
let db = Arc::new(db);
let source_table_obj = Table::new(source_table.clone(), db.clone());
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch2)],
@@ -1320,7 +1338,8 @@ mod tests {
.unwrap();
// Create a tag for the current version
let source_table_obj = Table::new(source_table.clone());
let db = Arc::new(db);
let source_table_obj = Table::new(source_table.clone(), db.clone());
let mut tags = source_table_obj.tags().await.unwrap();
tags.create("v1.0", source_table.version().await.unwrap())
.await
@@ -1336,7 +1355,7 @@ mod tests {
)
.unwrap();
let source_table_obj = Table::new(source_table.clone());
let source_table_obj = Table::new(source_table.clone(), db.clone());
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch2)],
@@ -1432,7 +1451,8 @@ mod tests {
)
.unwrap();
let cloned_table_obj = Table::new(cloned_table.clone());
let db = Arc::new(db);
let cloned_table_obj = Table::new(cloned_table.clone(), db.clone());
cloned_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch_clone)],
@@ -1452,7 +1472,7 @@ mod tests {
)
.unwrap();
let source_table_obj = Table::new(source_table.clone());
let source_table_obj = Table::new(source_table.clone(), db);
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch_source)],
@@ -1495,6 +1515,7 @@ mod tests {
.unwrap();
// Add more data to create new versions
let db = Arc::new(db);
for i in 0..3 {
let batch = RecordBatch::try_new(
schema.clone(),
@@ -1502,7 +1523,7 @@ mod tests {
)
.unwrap();
let source_table_obj = Table::new(source_table.clone());
let source_table_obj = Table::new(source_table.clone(), db.clone());
source_table_obj
.add(Box::new(arrow_array::RecordBatchIterator::new(
vec![Ok(batch)],

View File

@@ -0,0 +1,872 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! Namespace-based database implementation that delegates table management to lance-namespace
use std::collections::HashMap;
use std::sync::Arc;
use async_trait::async_trait;
use lance_namespace::{
connect as connect_namespace,
models::{
CreateEmptyTableRequest, CreateNamespaceRequest, DescribeTableRequest,
DropNamespaceRequest, DropTableRequest, ListNamespacesRequest, ListTablesRequest,
},
LanceNamespace,
};
use crate::database::listing::ListingDatabase;
use crate::error::{Error, Result};
use crate::{connection::ConnectRequest, database::ReadConsistency};
use super::{
BaseTable, CloneTableRequest, CreateNamespaceRequest as DbCreateNamespaceRequest,
CreateTableMode, CreateTableRequest as DbCreateTableRequest, Database,
DropNamespaceRequest as DbDropNamespaceRequest,
ListNamespacesRequest as DbListNamespacesRequest, OpenTableRequest, TableNamesRequest,
};
/// A database implementation that uses lance-namespace for table management
pub struct LanceNamespaceDatabase {
namespace: Arc<dyn LanceNamespace>,
// Storage options to be inherited by tables
storage_options: HashMap<String, String>,
// Read consistency interval for tables
read_consistency_interval: Option<std::time::Duration>,
// Optional session for object stores and caching
session: Option<Arc<lance::session::Session>>,
// database URI
uri: String,
}
impl LanceNamespaceDatabase {
pub async fn connect(
ns_impl: &str,
ns_properties: HashMap<String, String>,
storage_options: HashMap<String, String>,
read_consistency_interval: Option<std::time::Duration>,
session: Option<Arc<lance::session::Session>>,
) -> Result<Self> {
let namespace = connect_namespace(ns_impl, ns_properties.clone())
.await
.map_err(|e| Error::InvalidInput {
message: format!("Failed to connect to namespace: {:?}", e),
})?;
Ok(Self {
namespace,
storage_options,
read_consistency_interval,
session,
uri: format!("namespace://{}", ns_impl),
})
}
/// Helper method to create a ListingDatabase from a table location
///
/// This method:
/// 1. Validates that the location ends with <table_name>.lance
/// 2. Extracts the parent directory from the location
/// 3. Creates a ListingDatabase at that parent directory
async fn create_listing_database(
&self,
table_name: &str,
location: &str,
additional_storage_options: Option<HashMap<String, String>>,
) -> Result<Arc<ListingDatabase>> {
let expected_suffix = format!("{}.lance", table_name);
if !location.ends_with(&expected_suffix) {
return Err(Error::Runtime {
message: format!(
"Invalid table location '{}': expected to end with '{}'",
location, expected_suffix
),
});
}
let parent_dir = location
.rsplit_once('/')
.map(|(parent, _)| parent.to_string())
.ok_or_else(|| Error::Runtime {
message: format!("Invalid table location '{}': no parent directory", location),
})?;
let mut merged_storage_options = self.storage_options.clone();
if let Some(opts) = additional_storage_options {
merged_storage_options.extend(opts);
}
let connect_request = ConnectRequest {
uri: parent_dir,
options: merged_storage_options,
read_consistency_interval: self.read_consistency_interval,
session: self.session.clone(),
#[cfg(feature = "remote")]
client_config: Default::default(),
};
let listing_db = ListingDatabase::connect_with_options(&connect_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to create listing database: {}", e),
})?;
Ok(Arc::new(listing_db))
}
}
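The `create_listing_database` helper above derives the listing-database root from a namespace-provided table location: it validates the `<table_name>.lance` suffix, then takes the parent directory. The same derivation, sketched in Python (the function name is hypothetical):

```python
def listing_root(table_name, location):
    """Validate that location ends with '<table_name>.lance' and return its parent."""
    suffix = f"{table_name}.lance"
    if not location.endswith(suffix):
        raise ValueError(
            f"Invalid table location '{location}': expected to end with '{suffix}'"
        )
    # rpartition mirrors Rust's rsplit_once('/'): split at the last separator.
    parent, sep, _ = location.rpartition("/")
    if not sep:
        raise ValueError(f"Invalid table location '{location}': no parent directory")
    return parent
```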
impl std::fmt::Debug for LanceNamespaceDatabase {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
f.debug_struct("LanceNamespaceDatabase")
.field("storage_options", &self.storage_options)
.field("read_consistency_interval", &self.read_consistency_interval)
.finish()
}
}
impl std::fmt::Display for LanceNamespaceDatabase {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "LanceNamespaceDatabase")
}
}
#[async_trait]
impl Database for LanceNamespaceDatabase {
fn uri(&self) -> &str {
&self.uri
}
async fn read_consistency(&self) -> Result<ReadConsistency> {
if let Some(read_consistency_interval) = self.read_consistency_interval {
if read_consistency_interval.is_zero() {
Ok(ReadConsistency::Strong)
} else {
Ok(ReadConsistency::Eventual(read_consistency_interval))
}
} else {
Ok(ReadConsistency::Manual)
}
}
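The mapping above is small but easy to get backwards: a *zero* interval means strong consistency, a *positive* interval means eventual consistency with that refresh period, and no interval means the caller refreshes manually. A self-contained sketch (the `Consistency` enum here is a local stand-in, not the crate's `ReadConsistency` type):

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Consistency {
    Strong,             // interval of zero: re-check on every read
    Eventual(Duration), // positive interval: re-check at most this often
    Manual,             // no interval: caller refreshes explicitly
}

// Mirrors the branch structure of read_consistency above.
fn consistency_for(interval: Option<Duration>) -> Consistency {
    match interval {
        Some(i) if i.is_zero() => Consistency::Strong,
        Some(i) => Consistency::Eventual(i),
        None => Consistency::Manual,
    }
}

fn main() {
    assert_eq!(consistency_for(Some(Duration::ZERO)), Consistency::Strong);
    assert_eq!(
        consistency_for(Some(Duration::from_secs(5))),
        Consistency::Eventual(Duration::from_secs(5))
    );
    assert_eq!(consistency_for(None), Consistency::Manual);
}
```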
async fn list_namespaces(&self, request: DbListNamespacesRequest) -> Result<Vec<String>> {
let ns_request = ListNamespacesRequest {
id: if request.namespace.is_empty() {
None
} else {
Some(request.namespace)
},
page_token: request.page_token,
limit: request.limit.map(|l| l as i32),
};
let response = self
.namespace
.list_namespaces(ns_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to list namespaces: {}", e),
})?;
Ok(response.namespaces)
}
async fn create_namespace(&self, request: DbCreateNamespaceRequest) -> Result<()> {
let ns_request = CreateNamespaceRequest {
id: if request.namespace.is_empty() {
None
} else {
Some(request.namespace)
},
mode: None,
properties: None,
};
self.namespace
.create_namespace(ns_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to create namespace: {}", e),
})?;
Ok(())
}
async fn drop_namespace(&self, request: DbDropNamespaceRequest) -> Result<()> {
let ns_request = DropNamespaceRequest {
id: if request.namespace.is_empty() {
None
} else {
Some(request.namespace)
},
mode: None,
behavior: None,
};
self.namespace
.drop_namespace(ns_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to drop namespace: {}", e),
})?;
Ok(())
}
async fn table_names(&self, request: TableNamesRequest) -> Result<Vec<String>> {
let ns_request = ListTablesRequest {
id: if request.namespace.is_empty() {
None
} else {
Some(request.namespace)
},
page_token: request.start_after,
limit: request.limit.map(|l| l as i32),
};
let response =
self.namespace
.list_tables(ns_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to list tables: {}", e),
})?;
Ok(response.tables)
}
async fn create_table(&self, request: DbCreateTableRequest) -> Result<Arc<dyn BaseTable>> {
let mut table_id = request.namespace.clone();
table_id.push(request.name.clone());
let describe_request = DescribeTableRequest {
id: Some(table_id.clone()),
version: None,
};
let describe_result = self.namespace.describe_table(describe_request).await;
match request.mode {
CreateTableMode::Create => {
if describe_result.is_ok() {
return Err(Error::TableAlreadyExists {
name: request.name.clone(),
});
}
}
CreateTableMode::Overwrite => {
if describe_result.is_ok() {
// Drop the existing table - must succeed
let drop_request = DropTableRequest {
id: Some(table_id.clone()),
};
self.namespace
.drop_table(drop_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to drop existing table for overwrite: {}", e),
})?;
}
}
CreateTableMode::ExistOk(_) => {
if let Ok(response) = describe_result {
let location = response.location.ok_or_else(|| Error::Runtime {
message: "Table location is missing from namespace response".to_string(),
})?;
let listing_db = self
.create_listing_database(&request.name, &location, response.storage_options)
.await?;
return listing_db
.open_table(OpenTableRequest {
name: request.name.clone(),
namespace: vec![],
index_cache_size: None,
lance_read_params: None,
})
.await;
}
}
}
let create_empty_request = CreateEmptyTableRequest {
id: Some(table_id),
location: None,
properties: if self.storage_options.is_empty() {
None
} else {
Some(self.storage_options.clone())
},
};
let create_empty_response = self
.namespace
.create_empty_table(create_empty_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to create empty table: {}", e),
})?;
let location = create_empty_response
.location
.ok_or_else(|| Error::Runtime {
message: "Table location is missing from create_empty_table response".to_string(),
})?;
let listing_db = self
.create_listing_database(
&request.name,
&location,
create_empty_response.storage_options,
)
.await?;
let create_request = DbCreateTableRequest {
name: request.name,
namespace: vec![],
data: request.data,
mode: request.mode,
write_options: request.write_options,
};
listing_db.create_table(create_request).await
}
async fn open_table(&self, request: OpenTableRequest) -> Result<Arc<dyn BaseTable>> {
let mut table_id = request.namespace.clone();
table_id.push(request.name.clone());
let describe_request = DescribeTableRequest {
id: Some(table_id),
version: None,
};
let response = self
.namespace
.describe_table(describe_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to describe table: {}", e),
})?;
let location = response.location.ok_or_else(|| Error::Runtime {
message: "Table location is missing from namespace response".to_string(),
})?;
let listing_db = self
.create_listing_database(&request.name, &location, response.storage_options)
.await?;
let open_request = OpenTableRequest {
name: request.name.clone(),
namespace: vec![],
index_cache_size: request.index_cache_size,
lance_read_params: request.lance_read_params,
};
listing_db.open_table(open_request).await
}
async fn clone_table(&self, _request: CloneTableRequest) -> Result<Arc<dyn BaseTable>> {
Err(Error::NotSupported {
message: "clone_table is not supported for namespace connections".to_string(),
})
}
async fn rename_table(
&self,
_cur_name: &str,
_new_name: &str,
_cur_namespace: &[String],
_new_namespace: &[String],
) -> Result<()> {
Err(Error::NotSupported {
message: "rename_table is not supported for namespace connections".to_string(),
})
}
async fn drop_table(&self, name: &str, namespace: &[String]) -> Result<()> {
let mut table_id = namespace.to_vec();
table_id.push(name.to_string());
let drop_request = DropTableRequest { id: Some(table_id) };
self.namespace
.drop_table(drop_request)
.await
.map_err(|e| Error::Runtime {
message: format!("Failed to drop table: {}", e),
})?;
Ok(())
}
async fn drop_all_tables(&self, namespace: &[String]) -> Result<()> {
let tables = self
.table_names(TableNamesRequest {
namespace: namespace.to_vec(),
start_after: None,
limit: None,
})
.await?;
for table in tables {
self.drop_table(&table, namespace).await?;
}
Ok(())
}
fn as_any(&self) -> &dyn std::any::Any {
self
}
}
#[cfg(test)]
#[cfg(not(windows))] // TODO: support windows for lance-namespace
mod tests {
use super::*;
use crate::connect_namespace;
use crate::query::ExecutableQuery;
use arrow_array::{Int32Array, RecordBatch, RecordBatchIterator, StringArray};
use arrow_schema::{DataType, Field, Schema};
use futures::TryStreamExt;
use tempfile::tempdir;
/// Helper function to create test data
fn create_test_data() -> RecordBatchIterator<
std::vec::IntoIter<std::result::Result<RecordBatch, arrow_schema::ArrowError>>,
> {
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
]));
let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
let name_array = StringArray::from(vec!["Alice", "Bob", "Charlie", "David", "Eve"]);
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(id_array), Arc::new(name_array)],
)
.unwrap();
RecordBatchIterator::new(vec![std::result::Result::Ok(batch)].into_iter(), schema)
}
#[tokio::test]
async fn test_namespace_connection_simple() {
// Test that namespace connections work with simple connect_namespace(impl_type, properties)
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
// This should succeed with directory-based namespace
let result = connect_namespace("dir", properties).execute().await;
assert!(result.is_ok());
}
#[tokio::test]
async fn test_namespace_connection_with_storage_options() {
// Test namespace connections with storage options
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
// This should succeed with directory-based namespace and storage options
let result = connect_namespace("dir", properties)
.storage_option("timeout", "30s")
.execute()
.await;
assert!(result.is_ok());
}
#[tokio::test]
async fn test_namespace_connection_with_all_options() {
use crate::embeddings::MemoryRegistry;
use std::time::Duration;
// Test namespace connections with all configuration options
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let embedding_registry = Arc::new(MemoryRegistry::new());
let session = Arc::new(lance::session::Session::default());
// Test with all options set
let result = connect_namespace("dir", properties)
.storage_option("timeout", "30s")
.storage_options([("cache_size", "1gb"), ("region", "us-east-1")])
.read_consistency_interval(Duration::from_secs(5))
.embedding_registry(embedding_registry.clone())
.session(session.clone())
.execute()
.await;
assert!(result.is_ok());
let conn = result.unwrap();
// Verify embedding registry is set correctly
assert!(std::ptr::eq(
conn.embedding_registry() as *const _,
embedding_registry.as_ref() as *const _
));
}
#[tokio::test]
async fn test_namespace_create_table_basic() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
// Connect to namespace using DirectoryNamespace
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Test: Create a table
let test_data = create_test_data();
let table = conn
.create_table("test_table", test_data)
.execute()
.await
.expect("Failed to create table");
// Verify: Table was created and can be queried
let results = table
.query()
.execute()
.await
.expect("Failed to query table")
.try_collect::<Vec<_>>()
.await
.expect("Failed to collect results");
assert_eq!(results.len(), 1);
assert_eq!(results[0].num_rows(), 5);
// Verify: Table appears in table_names
let table_names = conn
.table_names()
.execute()
.await
.expect("Failed to list tables");
assert!(table_names.contains(&"test_table".to_string()));
}
#[tokio::test]
async fn test_namespace_describe_table() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
// Connect to namespace
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Create a table first
let test_data = create_test_data();
let _table = conn
.create_table("describe_test", test_data)
.execute()
.await
.expect("Failed to create table");
// Test: Open the table (which internally uses describe_table)
let opened_table = conn
.open_table("describe_test")
.execute()
.await
.expect("Failed to open table");
// Verify: Can query the opened table
let results = opened_table
.query()
.execute()
.await
.expect("Failed to query table")
.try_collect::<Vec<_>>()
.await
.expect("Failed to collect results");
assert_eq!(results.len(), 1);
assert_eq!(results[0].num_rows(), 5);
// Verify schema matches
let schema = opened_table.schema().await.expect("Failed to get schema");
assert_eq!(schema.fields.len(), 2);
assert_eq!(schema.field(0).name(), "id");
assert_eq!(schema.field(1).name(), "name");
}
#[tokio::test]
async fn test_namespace_create_table_overwrite_mode() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Create initial table with 5 rows
let test_data1 = create_test_data();
let _table1 = conn
.create_table("overwrite_test", test_data1)
.execute()
.await
.expect("Failed to create table");
// Create new data with 3 rows
let schema = Arc::new(Schema::new(vec![
Field::new("id", DataType::Int32, false),
Field::new("name", DataType::Utf8, false),
]));
let id_array = Int32Array::from(vec![10, 20, 30]);
let name_array = StringArray::from(vec!["New1", "New2", "New3"]);
let test_data2 = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(id_array), Arc::new(name_array)],
)
.unwrap();
// Test: Overwrite the table
let table2 = conn
.create_table(
"overwrite_test",
RecordBatchIterator::new(
vec![std::result::Result::Ok(test_data2)].into_iter(),
schema,
),
)
.mode(CreateTableMode::Overwrite)
.execute()
.await
.expect("Failed to overwrite table");
// Verify: Table has new data (3 rows instead of 5)
let results = table2
.query()
.execute()
.await
.expect("Failed to query table")
.try_collect::<Vec<_>>()
.await
.expect("Failed to collect results");
assert_eq!(results.len(), 1);
assert_eq!(results[0].num_rows(), 3);
// Verify the data is actually the new data
let id_col = results[0]
.column(0)
.as_any()
.downcast_ref::<Int32Array>()
.unwrap();
assert_eq!(id_col.value(0), 10);
assert_eq!(id_col.value(1), 20);
assert_eq!(id_col.value(2), 30);
}
#[tokio::test]
async fn test_namespace_create_table_exist_ok_mode() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Create initial table with test data
let test_data1 = create_test_data();
let _table1 = conn
.create_table("exist_ok_test", test_data1)
.execute()
.await
.expect("Failed to create table");
// Try to create again with exist_ok mode
let test_data2 = create_test_data();
let table2 = conn
.create_table("exist_ok_test", test_data2)
.mode(CreateTableMode::exist_ok(|req| req))
.execute()
.await
.expect("Failed with exist_ok mode");
// Verify: Table still has original data (5 rows)
let results = table2
.query()
.execute()
.await
.expect("Failed to query table")
.try_collect::<Vec<_>>()
.await
.expect("Failed to collect results");
assert_eq!(results.len(), 1);
assert_eq!(results[0].num_rows(), 5);
}
#[tokio::test]
async fn test_namespace_create_multiple_tables() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Create first table
let test_data1 = create_test_data();
let _table1 = conn
.create_table("table1", test_data1)
.execute()
.await
.expect("Failed to create first table");
// Create second table
let test_data2 = create_test_data();
let _table2 = conn
.create_table("table2", test_data2)
.execute()
.await
.expect("Failed to create second table");
// Verify: Both tables appear in table list
let table_names = conn
.table_names()
.execute()
.await
.expect("Failed to list tables");
assert!(table_names.contains(&"table1".to_string()));
assert!(table_names.contains(&"table2".to_string()));
// Verify: Can open both tables
let opened_table1 = conn
.open_table("table1")
.execute()
.await
.expect("Failed to open table1");
let opened_table2 = conn
.open_table("table2")
.execute()
.await
.expect("Failed to open table2");
// Verify both tables work
let count1 = opened_table1
.count_rows(None)
.await
.expect("Failed to count rows in table1");
assert_eq!(count1, 5);
let count2 = opened_table2
.count_rows(None)
.await
.expect("Failed to count rows in table2");
assert_eq!(count2, 5);
}
#[tokio::test]
async fn test_namespace_table_not_found() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Test: Try to open a non-existent table
let result = conn.open_table("non_existent_table").execute().await;
// Verify: Should return an error
assert!(result.is_err());
}
#[tokio::test]
async fn test_namespace_drop_table() {
// Setup: Create a temporary directory for the namespace
let tmp_dir = tempdir().unwrap();
let root_path = tmp_dir.path().to_str().unwrap().to_string();
let mut properties = HashMap::new();
properties.insert("root".to_string(), root_path);
let conn = connect_namespace("dir", properties)
.execute()
.await
.expect("Failed to connect to namespace");
// Create a table first
let test_data = create_test_data();
let _table = conn
.create_table("drop_test", test_data)
.execute()
.await
.expect("Failed to create table");
// Verify table exists
let table_names_before = conn
.table_names()
.execute()
.await
.expect("Failed to list tables");
assert!(table_names_before.contains(&"drop_test".to_string()));
// Test: Drop the table
conn.drop_table("drop_test", &[])
.await
.expect("Failed to drop table");
// Verify: Table no longer exists
let table_names_after = conn
.table_names()
.execute()
.await
.expect("Failed to list tables");
assert!(!table_names_after.contains(&"drop_test".to_string()));
// Verify: Cannot open dropped table
let open_result = conn.open_table("drop_test").execute().await;
assert!(open_result.is_err());
}
}


@@ -0,0 +1,7 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
pub mod permutation;
pub mod shuffle;
pub mod split;
pub mod util;


@@ -0,0 +1,294 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! Contains the [PermutationBuilder] to create a permutation "view" of an existing table.
//!
//! A permutation view can apply a filter, divide the data into splits, and shuffle the data.
//! The permutation table only stores the split ids and row ids. It is not a materialized copy of
//! the underlying data and can be very lightweight.
//!
//! Building a permutation table should be fairly quick and memory efficient, even for billions or
//! trillions of rows.
use datafusion::prelude::{SessionConfig, SessionContext};
use datafusion_execution::{disk_manager::DiskManagerBuilder, runtime_env::RuntimeEnvBuilder};
use datafusion_expr::col;
use futures::TryStreamExt;
use lance_datafusion::exec::SessionContextExt;
use crate::{
arrow::{SendableRecordBatchStream, SendableRecordBatchStreamExt, SimpleRecordBatchStream},
dataloader::{
shuffle::{Shuffler, ShufflerConfig},
split::{SplitStrategy, Splitter, SPLIT_ID_COLUMN},
util::{rename_column, TemporaryDirectory},
},
query::{ExecutableQuery, QueryBase},
Connection, Error, Result, Table,
};
/// Configuration for creating a permutation table
#[derive(Debug, Default)]
pub struct PermutationConfig {
/// Splitting configuration
pub split_strategy: SplitStrategy,
/// Shuffle strategy
pub shuffle_strategy: ShuffleStrategy,
/// Optional filter to apply to the base table
pub filter: Option<String>,
/// Directory to use for temporary files
pub temp_dir: TemporaryDirectory,
}
/// Strategy for shuffling the data.
#[derive(Debug, Clone)]
pub enum ShuffleStrategy {
/// The data is randomly shuffled
///
/// A seed can be provided to make the shuffle deterministic.
///
/// If a clump size is provided, then data will be shuffled in small blocks of contiguous rows.
/// This decreases the overall randomization but can improve I/O performance when reading from
/// cloud storage.
///
/// For example, a clump size of 16 means we shuffle blocks of 16 contiguous rows. This
/// requires 16x fewer IOPS, but those 16 rows will always stay close together, which can
/// influence the performance of the model. Note: shuffling within clumps can still be done at
/// read time, but that only provides a local shuffle, not a global one.
Random {
seed: Option<u64>,
clump_size: Option<u64>,
},
/// The data is not shuffled
///
/// This is useful for debugging and testing.
None,
}
impl Default for ShuffleStrategy {
fn default() -> Self {
Self::None
}
}
/// Builder for creating a permutation table.
///
/// A permutation table stores split assignments and a shuffled ordering of row ids. It can be
/// used to iterate the base table in a filtered, split-aware, shuffled order without
/// materializing a copy of the underlying data.
pub struct PermutationBuilder {
config: PermutationConfig,
base_table: Table,
}
impl PermutationBuilder {
pub fn new(base_table: Table) -> Self {
Self {
config: PermutationConfig::default(),
base_table,
}
}
/// Configures the strategy for assigning rows to splits.
///
/// For example, it is common to create a test/train split of the data. Splits can also be used
/// to limit the number of rows. For example, to only use 10% of the data in a permutation you can
/// create a single split with 10% of the data.
///
/// Splits are _not_ required for parallel processing. A single split can be loaded in parallel across
/// multiple processes and multiple nodes.
///
/// The default is a single split that contains all rows.
pub fn with_split_strategy(mut self, split_strategy: SplitStrategy) -> Self {
self.config.split_strategy = split_strategy;
self
}
/// Configures the strategy for shuffling the data.
///
/// The default is [`ShuffleStrategy::None`], i.e. no shuffling. Use [`ShuffleStrategy::Random`]
/// to shuffle, optionally with a fixed seed and a clump size.
pub fn with_shuffle_strategy(mut self, shuffle_strategy: ShuffleStrategy) -> Self {
self.config.shuffle_strategy = shuffle_strategy;
self
}
/// Configures a filter to apply to the base table.
///
/// Only rows matching the filter will be included in the permutation.
pub fn with_filter(mut self, filter: String) -> Self {
self.config.filter = Some(filter);
self
}
/// Configures the directory to use for temporary files.
///
/// The default is to use the operating system's default temporary directory.
pub fn with_temp_dir(mut self, temp_dir: TemporaryDirectory) -> Self {
self.config.temp_dir = temp_dir;
self
}
async fn sort_by_split_id(
&self,
data: SendableRecordBatchStream,
) -> Result<SendableRecordBatchStream> {
let ctx = SessionContext::new_with_config_rt(
SessionConfig::default(),
RuntimeEnvBuilder::new()
.with_memory_limit(100 * 1024 * 1024, 1.0)
.with_disk_manager_builder(
DiskManagerBuilder::default()
.with_mode(self.config.temp_dir.to_disk_manager_mode()),
)
.build_arc()
.unwrap(),
);
let df = ctx
.read_one_shot(data.into_df_stream())
.map_err(|e| Error::Other {
message: format!("Failed to setup sort by split id: {}", e),
source: Some(e.into()),
})?;
let df_stream = df
.sort_by(vec![col(SPLIT_ID_COLUMN)])
.map_err(|e| Error::Other {
message: format!("Failed to plan sort by split id: {}", e),
source: Some(e.into()),
})?
.execute_stream()
.await
.map_err(|e| Error::Other {
message: format!("Failed to sort by split id: {}", e),
source: Some(e.into()),
})?;
let schema = df_stream.schema();
let stream = df_stream.map_err(|e| Error::Other {
message: format!("Failed to execute sort by split id: {}", e),
source: Some(e.into()),
});
Ok(Box::pin(SimpleRecordBatchStream { schema, stream }))
}
/// Builds the permutation table and stores it in the given database.
pub async fn build(self, dest_table_name: &str) -> Result<Table> {
// First pass, apply filter and load row ids
let mut rows = self.base_table.query().with_row_id();
if let Some(filter) = &self.config.filter {
rows = rows.only_if(filter);
}
let splitter = Splitter::new(
self.config.temp_dir.clone(),
self.config.split_strategy.clone(),
);
let mut needs_sort = !splitter.orders_by_split_id();
// Might need to load additional columns to calculate splits (e.g. hash columns or calculated
// split id)
rows = splitter.project(rows);
let num_rows = self
.base_table
.count_rows(self.config.filter.clone())
.await? as u64;
// Apply splits
let rows = rows.execute().await?;
let split_data = splitter.apply(rows, num_rows).await?;
// Shuffle data if requested
let shuffled = match self.config.shuffle_strategy {
ShuffleStrategy::None => split_data,
ShuffleStrategy::Random { seed, clump_size } => {
let shuffler = Shuffler::new(ShufflerConfig {
seed,
clump_size,
temp_dir: self.config.temp_dir.clone(),
max_rows_per_file: 10 * 1024 * 1024,
});
shuffler.shuffle(split_data, num_rows).await?
}
};
// We want the final permutation to be sorted by the split id. If we shuffled or if
// the split was not assigned sequentially then we need to sort the data.
needs_sort |= !matches!(self.config.shuffle_strategy, ShuffleStrategy::None);
let sorted = if needs_sort {
self.sort_by_split_id(shuffled).await?
} else {
shuffled
};
// Rename _rowid to row_id
let renamed = rename_column(sorted, "_rowid", "row_id")?;
// Create permutation table
let conn = Connection::new(
self.base_table.database().clone(),
self.base_table.embedding_registry().clone(),
);
conn.create_table_streaming(dest_table_name, renamed)
.execute()
.await
}
}
#[cfg(test)]
mod tests {
use arrow::datatypes::Int32Type;
use lance_datagen::{BatchCount, RowCount};
use crate::{arrow::LanceDbDatagenExt, connect, dataloader::split::SplitSizes};
use super::*;
#[tokio::test]
async fn test_permutation_builder() {
let temp_dir = tempfile::tempdir().unwrap();
let db = connect(temp_dir.path().to_str().unwrap())
.execute()
.await
.unwrap();
let initial_data = lance_datagen::gen_batch()
.col("some_value", lance_datagen::array::step::<Int32Type>())
.into_ldb_stream(RowCount::from(100), BatchCount::from(10));
let data_table = db
.create_table_streaming("mytbl", initial_data)
.execute()
.await
.unwrap();
let permutation_table = PermutationBuilder::new(data_table)
.with_filter("some_value > 57".to_string())
.with_split_strategy(SplitStrategy::Random {
seed: Some(42),
sizes: SplitSizes::Percentages(vec![0.05, 0.30]),
})
.build("permutation")
.await
.unwrap();
// Potentially brittle seed-dependent values below
assert_eq!(permutation_table.count_rows(None).await.unwrap(), 330);
assert_eq!(
permutation_table
.count_rows(Some("split_id = 0".to_string()))
.await
.unwrap(),
47
);
assert_eq!(
permutation_table
.count_rows(Some("split_id = 1".to_string()))
.await
.unwrap(),
283
);
}
}


@@ -0,0 +1,475 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::sync::{Arc, Mutex};
use arrow::compute::concat_batches;
use arrow_array::{RecordBatch, UInt64Array};
use futures::{StreamExt, TryStreamExt};
use lance::io::ObjectStore;
use lance_core::{cache::LanceCache, utils::futures::FinallyStreamExt};
use lance_encoding::decoder::DecoderPlugins;
use lance_file::v2::{
reader::{FileReader, FileReaderOptions},
writer::{FileWriter, FileWriterOptions},
};
use lance_index::scalar::IndexReader;
use lance_io::{
scheduler::{ScanScheduler, SchedulerConfig},
utils::CachedFileSize,
};
use rand::{seq::SliceRandom, Rng, RngCore};
use crate::{
arrow::{SendableRecordBatchStream, SimpleRecordBatchStream},
dataloader::util::{non_crypto_rng, TemporaryDirectory},
Error, Result,
};
#[derive(Debug, Clone)]
pub struct ShufflerConfig {
/// An optional seed to make the shuffle deterministic
pub seed: Option<u64>,
/// The maximum number of rows to write to a single file
///
/// The shuffler will need to hold at least this many rows in memory. Setting this value
/// extremely large could cause the shuffler to use a lot of memory (depending on row size).
///
/// However, the shuffler will also need to hold total_num_rows / max_rows_per_file file
/// writers in memory. Each of these will consume some amount of data for column write buffers.
/// So setting this value too small could _also_ cause the shuffler to use a lot of memory and
/// open file handles.
pub max_rows_per_file: u64,
/// The temporary directory to use for writing files
pub temp_dir: TemporaryDirectory,
/// The size of the clumps to shuffle within
///
/// If a clump size is provided, then data will be shuffled in small blocks of contiguous rows.
/// This decreases the overall randomization but can improve I/O performance when reading from
/// cloud storage.
pub clump_size: Option<u64>,
}
impl Default for ShufflerConfig {
fn default() -> Self {
Self {
max_rows_per_file: 1024 * 1024,
seed: Option::default(),
temp_dir: TemporaryDirectory::default(),
clump_size: None,
}
}
}
/// A shuffler that can shuffle a stream of record batches
///
/// To do this the stream is consumed and written to temporary files. A new stream is returned
/// which returns the shuffled data from the temporary files.
///
/// If there are fewer than max_rows_per_file rows in the input stream, then the shuffler will not
/// write any files and will instead perform an in-memory shuffle.
///
/// The number of rows in the input stream must be known in advance.
pub struct Shuffler {
config: ShufflerConfig,
id: String,
}
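Per the doc comment above, the shuffler spills to temporary files only when the input exceeds one file's worth of rows; otherwise it shuffles in memory. A std-only sketch of that planning decision (the `plan` helper is illustrative, not the crate's API):

```rust
// Spill decision: fewer than max_rows_per_file rows -> in-memory shuffle;
// otherwise spill clumps across ceil(num_rows / max_rows_per_file) files.
fn plan(num_rows: u64, max_rows_per_file: u64) -> (bool, u64) {
    let in_memory = num_rows < max_rows_per_file;
    let num_files = num_rows.div_ceil(max_rows_per_file);
    (in_memory, num_files)
}

fn main() {
    // Defaults use max_rows_per_file = 1024 * 1024 = 1_048_576.
    assert_eq!(plan(500_000, 1_048_576), (true, 1)); // fits: in-memory shuffle
    assert_eq!(plan(5_000_000, 1_048_576), (false, 5)); // spills to 5 files
}
```

Note the tension described in the config docs: larger files mean more rows held in memory per file, while smaller files mean more simultaneously open writers, each with its own column buffers.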
impl Shuffler {
pub fn new(config: ShufflerConfig) -> Self {
let id = uuid::Uuid::new_v4().to_string();
Self { config, id }
}
/// Shuffles a single batch of data in memory
fn shuffle_batch(
batch: &RecordBatch,
rng: &mut dyn RngCore,
clump_size: u64,
) -> Result<RecordBatch> {
let num_clumps = (batch.num_rows() as u64).div_ceil(clump_size);
let mut indices = (0..num_clumps).collect::<Vec<_>>();
indices.shuffle(rng);
let indices = if clump_size == 1 {
UInt64Array::from(indices)
} else {
UInt64Array::from_iter_values(indices.iter().flat_map(|&clump_index| {
if clump_index == num_clumps - 1 {
clump_index * clump_size..batch.num_rows() as u64
} else {
clump_index * clump_size..(clump_index + 1) * clump_size
}
}))
};
Ok(arrow::compute::take_record_batch(batch, &indices)?)
}
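The index construction in `shuffle_batch` can be sketched without an RNG dependency: given an already-shuffled order of clump indices, expand each clump back into its row range, letting the final clump be a "runt" with fewer than `clump_size` rows (the `expand_clumps` helper is illustrative, not the crate's API):

```rust
// Expand shuffled clump indices into row indices, mirroring the flat_map
// in shuffle_batch. The last clump stops at num_rows, not a clump boundary.
fn expand_clumps(shuffled_clumps: &[u64], num_rows: u64, clump_size: u64) -> Vec<u64> {
    let num_clumps = num_rows.div_ceil(clump_size);
    shuffled_clumps
        .iter()
        .flat_map(|&clump| {
            let start = clump * clump_size;
            let end = if clump == num_clumps - 1 {
                num_rows // runt clump: fewer than clump_size rows
            } else {
                start + clump_size
            };
            start..end
        })
        .collect()
}

fn main() {
    // 10 rows, clump size 4 -> clumps [0..4), [4..8), [8..10) (runt).
    let order = [2u64, 0, 1]; // a fixed "shuffled" order instead of an RNG
    assert_eq!(
        expand_clumps(&order, 10, 4),
        vec![8, 9, 0, 1, 2, 3, 4, 5, 6, 7]
    );
}
```

This is why contiguous rows inside a clump stay adjacent after the shuffle: only clump positions are permuted, trading global randomness for fewer, larger reads.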
async fn in_memory_shuffle(
&self,
data: SendableRecordBatchStream,
mut rng: Box<dyn RngCore + Send>,
) -> Result<SendableRecordBatchStream> {
let schema = data.schema();
let batches = data.try_collect::<Vec<_>>().await?;
let batch = concat_batches(&schema, &batches)?;
let shuffled = Self::shuffle_batch(&batch, &mut rng, self.config.clump_size.unwrap_or(1))?;
log::debug!("Shuffle job {}: in-memory shuffle complete", self.id);
Ok(Box::pin(SimpleRecordBatchStream::new(
futures::stream::once(async move { Ok(shuffled) }),
schema,
)))
}
async fn do_shuffle(
&self,
mut data: SendableRecordBatchStream,
num_rows: u64,
mut rng: Box<dyn RngCore + Send>,
) -> Result<SendableRecordBatchStream> {
let num_files = num_rows.div_ceil(self.config.max_rows_per_file);
let temp_dir = self.config.temp_dir.create_temp_dir()?;
let tmp_dir = temp_dir.path().to_path_buf();
let clump_size = self.config.clump_size.unwrap_or(1);
if clump_size == 0 {
return Err(Error::InvalidInput {
message: "clump size must be greater than 0".to_string(),
});
}
let object_store = ObjectStore::local();
let arrow_schema = data.schema();
let schema = lance::datatypes::Schema::try_from(arrow_schema.as_ref())?;
// Create file writers
let mut file_writers = Vec::with_capacity(num_files as usize);
for file_index in 0..num_files {
let path = tmp_dir.join(format!("shuffle_{}_{file_index}.lance", self.id));
let path =
object_store::path::Path::from_absolute_path(path).map_err(|err| Error::Other {
message: format!("Failed to create temporary file: {}", err),
source: None,
})?;
let object_writer = object_store.create(&path).await?;
let writer =
FileWriter::try_new(object_writer, schema.clone(), FileWriterOptions::default())?;
file_writers.push(writer);
}
let mut num_rows_seen = 0;
// Randomly distribute clumps to files
while let Some(batch) = data.try_next().await? {
num_rows_seen += batch.num_rows() as u64;
let is_last = num_rows_seen == num_rows;
if num_rows_seen > num_rows {
return Err(Error::Runtime {
message: format!("Expected {} rows but saw {} rows", num_rows, num_rows_seen),
});
}
// This is an annoying limitation, but if we allowed runt clumps mid-stream then
// clump boundaries would drift out of alignment and the in-memory shuffle step
// would split clumps incorrectly. If this becomes a problem we can look for a
// better approach.
if !is_last && batch.num_rows() as u64 % clump_size != 0 {
return Err(Error::Runtime {
message: format!(
"Expected batch size ({}) to be divisible by clump size ({})",
batch.num_rows(),
clump_size
),
});
}
let num_clumps = (batch.num_rows() as u64).div_ceil(clump_size);
let mut batch_offsets_for_files =
vec![Vec::<u64>::with_capacity(batch.num_rows()); num_files as usize];
// Partition the batch randomly and write to the appropriate accumulator
for clump_offset in 0..num_clumps {
let clump_start = clump_offset * clump_size;
let num_rows_in_clump = clump_size.min(batch.num_rows() as u64 - clump_start);
let clump_end = clump_start + num_rows_in_clump;
let file_index = rng.random_range(0..num_files);
batch_offsets_for_files[file_index as usize].extend(clump_start..clump_end);
}
for (file_index, batch_offsets) in batch_offsets_for_files.into_iter().enumerate() {
if batch_offsets.is_empty() {
continue;
}
let indices = UInt64Array::from(batch_offsets);
let partition = arrow::compute::take_record_batch(&batch, &indices)?;
file_writers[file_index].write_batch(&partition).await?;
}
}
// Finish writing files
for (file_idx, mut writer) in file_writers.into_iter().enumerate() {
let num_written = writer.finish().await?;
log::debug!(
"Shuffle job {}: wrote {} rows to file {}",
self.id,
num_written,
file_idx
);
}
let scheduler_config = SchedulerConfig::max_bandwidth(&object_store);
let scan_scheduler = ScanScheduler::new(Arc::new(object_store), scheduler_config);
let job_id = self.id.clone();
let rng = Arc::new(Mutex::new(rng));
// Second pass, read each file as a single batch and shuffle
let stream = futures::stream::iter(0..num_files)
.then(move |file_index| {
let scan_scheduler = scan_scheduler.clone();
let rng = rng.clone();
let tmp_dir = tmp_dir.clone();
let job_id = job_id.clone();
async move {
let path = tmp_dir.join(format!("shuffle_{}_{file_index}.lance", job_id));
let path = object_store::path::Path::from_absolute_path(path).unwrap();
let file_scheduler = scan_scheduler
.open_file(&path, &CachedFileSize::unknown())
.await?;
let reader = FileReader::try_open(
file_scheduler,
None,
Arc::<DecoderPlugins>::default(),
&LanceCache::no_cache(),
FileReaderOptions::default(),
)
.await?;
// Need to read the entire file in a single batch for in-memory shuffling
let batch = reader.read_record_batch(0, reader.num_rows()).await?;
let mut rng = rng.lock().unwrap();
Self::shuffle_batch(&batch, &mut rng, clump_size)
}
})
.finally(move || drop(temp_dir))
.boxed();
Ok(Box::pin(SimpleRecordBatchStream::new(stream, arrow_schema)))
}
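The two-pass structure of `do_shuffle` (scatter whole clumps across temporary files, then shuffle each file independently) can be sketched with std only. This is an illustrative toy, not the crate's implementation: `toy_external_shuffle` is a hypothetical name, buckets stand in for the temporary Lance files, and a small LCG replaces the real rng.

```rust
// Sketch of the two-pass external shuffle under those assumptions.
fn toy_external_shuffle(rows: &[u64], clump_size: usize, num_buckets: usize) -> Vec<u64> {
    // Toy deterministic LCG in place of a real rng
    let mut state = 42u64;
    let mut next_rand = move |bound: usize| -> usize {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((state >> 33) as usize) % bound
    };
    // Pass 1: scatter whole clumps into buckets ("files"); chunks() naturally
    // leaves a runt clump at the end
    let mut buckets: Vec<Vec<&[u64]>> = vec![Vec::new(); num_buckets];
    for clump in rows.chunks(clump_size) {
        let b = next_rand(num_buckets);
        buckets[b].push(clump);
    }
    // Pass 2: per bucket, Fisher-Yates the clump order, then emit rows
    let mut out = Vec::with_capacity(rows.len());
    for mut clumps in buckets {
        for i in (1..clumps.len()).rev() {
            let j = next_rand(i + 1);
            clumps.swap(i, j);
        }
        for clump in clumps {
            out.extend_from_slice(clump);
        }
    }
    out
}
```

Because clumps move as whole units in both passes, rows within a clump stay adjacent in the output, which is the invariant the batch-divisibility check above protects.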
pub async fn shuffle(
self,
data: SendableRecordBatchStream,
num_rows: u64,
) -> Result<SendableRecordBatchStream> {
log::debug!(
"Shuffle job {}: shuffling {} rows and {} columns",
self.id,
num_rows,
data.schema().fields.len()
);
let rng = non_crypto_rng(&self.config.seed);
if num_rows < self.config.max_rows_per_file {
return self.in_memory_shuffle(data, rng).await;
}
self.do_shuffle(data, num_rows, rng).await
}
}
#[cfg(test)]
mod tests {
use crate::arrow::LanceDbDatagenExt;
use super::*;
use arrow::{array::AsArray, datatypes::Int32Type};
use datafusion::prelude::SessionContext;
use datafusion_expr::col;
use futures::TryStreamExt;
use lance_datagen::{BatchCount, BatchGeneratorBuilder, ByteCount, RowCount, Seed};
use rand::{rngs::SmallRng, SeedableRng};
fn test_gen() -> BatchGeneratorBuilder {
lance_datagen::gen_batch()
.with_seed(Seed::from(42))
.col("id", lance_datagen::array::step::<Int32Type>())
.col(
"name",
lance_datagen::array::rand_utf8(ByteCount::from(10), false),
)
}
fn create_test_batch(size: RowCount) -> RecordBatch {
test_gen().into_batch_rows(size).unwrap()
}
fn create_test_stream(
num_batches: BatchCount,
batch_size: RowCount,
) -> SendableRecordBatchStream {
test_gen().into_ldb_stream(batch_size, num_batches)
}
#[test]
fn test_shuffle_batch_deterministic() {
let batch = create_test_batch(RowCount::from(10));
let mut rng1 = SmallRng::seed_from_u64(42);
let mut rng2 = SmallRng::seed_from_u64(42);
let shuffled1 = Shuffler::shuffle_batch(&batch, &mut rng1, 1).unwrap();
let shuffled2 = Shuffler::shuffle_batch(&batch, &mut rng2, 1).unwrap();
// Same seed should produce same shuffle
assert_eq!(shuffled1, shuffled2);
}
#[test]
fn test_shuffle_with_clumps() {
let batch = create_test_batch(RowCount::from(10));
let mut rng = SmallRng::seed_from_u64(42);
let shuffled = Shuffler::shuffle_batch(&batch, &mut rng, 3).unwrap();
let values = shuffled.column(0).as_primitive::<Int32Type>();
let mut iter = values.into_iter().map(|o| o.unwrap());
let mut frag_seen = false;
let mut clumps_seen = 0;
while let Some(first) = iter.next() {
// 9 is the last value and not a full clump
if first != 9 {
// Otherwise we should have a full clump
let second = iter.next().unwrap();
let third = iter.next().unwrap();
assert_eq!(first + 1, second);
assert_eq!(first + 2, third);
clumps_seen += 1;
} else {
frag_seen = true;
}
}
assert_eq!(clumps_seen, 3);
assert!(frag_seen);
}
async fn sort_batch(batch: RecordBatch) -> RecordBatch {
let ctx = SessionContext::new();
let df = ctx.read_batch(batch).unwrap();
let sorted = df.sort_by(vec![col("id")]).unwrap();
let batches = sorted.collect().await.unwrap();
let schema = batches[0].schema();
concat_batches(&schema, &batches).unwrap()
}
#[tokio::test]
async fn test_shuffle_batch_preserves_data() {
let batch = create_test_batch(RowCount::from(100));
let mut rng = SmallRng::seed_from_u64(42);
let shuffled = Shuffler::shuffle_batch(&batch, &mut rng, 1).unwrap();
assert_ne!(shuffled, batch);
let sorted = sort_batch(shuffled).await;
assert_eq!(sorted, batch);
}
#[test]
fn test_shuffle_batch_empty() {
let batch = create_test_batch(RowCount::from(0));
let mut rng = SmallRng::seed_from_u64(42);
let shuffled = Shuffler::shuffle_batch(&batch, &mut rng, 1).unwrap();
assert_eq!(shuffled.num_rows(), 0);
}
#[tokio::test]
async fn test_in_memory_shuffle() {
let config = ShufflerConfig {
temp_dir: TemporaryDirectory::None,
..Default::default()
};
let shuffler = Shuffler::new(config);
let stream = create_test_stream(BatchCount::from(5), RowCount::from(20));
let result_stream = shuffler.shuffle(stream, 100).await.unwrap();
let result_batches: Vec<RecordBatch> = result_stream.try_collect().await.unwrap();
assert_eq!(result_batches.len(), 1);
let result_batch = result_batches.into_iter().next().unwrap();
let unshuffled_batches = create_test_stream(BatchCount::from(5), RowCount::from(20))
.try_collect::<Vec<_>>()
.await
.unwrap();
let schema = unshuffled_batches[0].schema();
let unshuffled_batch = concat_batches(&schema, &unshuffled_batches).unwrap();
let sorted = sort_batch(result_batch).await;
assert_eq!(unshuffled_batch, sorted);
}
#[tokio::test]
async fn test_external_shuffle() {
let config = ShufflerConfig {
max_rows_per_file: 100,
..Default::default()
};
let shuffler = Shuffler::new(config);
let stream = create_test_stream(BatchCount::from(5), RowCount::from(1000));
let result_stream = shuffler.shuffle(stream, 5000).await.unwrap();
let result_batches: Vec<RecordBatch> = result_stream.try_collect().await.unwrap();
let unshuffled_batches = create_test_stream(BatchCount::from(5), RowCount::from(1000))
.try_collect::<Vec<_>>()
.await
.unwrap();
let schema = unshuffled_batches[0].schema();
let unshuffled_batch = concat_batches(&schema, &unshuffled_batches).unwrap();
assert_eq!(result_batches.len(), 50);
let result_batch = concat_batches(&schema, &result_batches).unwrap();
let sorted = sort_batch(result_batch).await;
assert_eq!(unshuffled_batch, sorted);
}
#[test_log::test(tokio::test)]
async fn test_external_clump_shuffle() {
let config = ShufflerConfig {
max_rows_per_file: 100,
clump_size: Some(30),
..Default::default()
};
let shuffler = Shuffler::new(config);
// Batch size (900) must be multiple of clump size (30)
let stream = create_test_stream(BatchCount::from(5), RowCount::from(900));
let schema = stream.schema();
// Remove 10 rows from the last batch to simulate ending with partial clump
let mut batches = stream.try_collect::<Vec<_>>().await.unwrap();
let last_index = batches.len() - 1;
let sliced_last = batches[last_index].slice(0, 890);
batches[last_index] = sliced_last;
let stream = Box::pin(SimpleRecordBatchStream::new(
futures::stream::iter(batches).map(Ok).boxed(),
schema.clone(),
));
let result_stream = shuffler.shuffle(stream, 4490).await.unwrap();
let result_batches: Vec<RecordBatch> = result_stream.try_collect().await.unwrap();
let result_batch = concat_batches(&schema, &result_batches).unwrap();
let ids = result_batch.column(0).as_primitive::<Int32Type>();
let mut iter = ids.into_iter().map(|o| o.unwrap());
while let Some(first) = iter.next() {
let rows_left_in_clump = if first == 4470 { 19 } else { 29 };
let mut expected_next = first + 1;
for _ in 0..rows_left_in_clump {
let next = iter.next().unwrap();
assert_eq!(next, expected_next);
expected_next += 1;
}
}
}
}


@@ -0,0 +1,804 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{
iter,
sync::{
atomic::{AtomicBool, AtomicU64, AtomicUsize, Ordering},
Arc,
},
};
use arrow_array::{Array, BooleanArray, RecordBatch, UInt64Array};
use arrow_schema::{DataType, Field, Schema};
use datafusion_common::hash_utils::create_hashes;
use futures::{StreamExt, TryStreamExt};
use lance::arrow::SchemaExt;
use crate::{
arrow::{SendableRecordBatchStream, SimpleRecordBatchStream},
dataloader::{
shuffle::{Shuffler, ShufflerConfig},
util::TemporaryDirectory,
},
query::{Query, QueryBase, Select},
Error, Result,
};
pub const SPLIT_ID_COLUMN: &str = "split_id";
/// Strategy for assigning rows to splits
#[derive(Debug, Clone)]
pub enum SplitStrategy {
/// All rows will have split id 0
NoSplit,
/// Rows will be randomly assigned to splits
///
/// A seed can be provided to make the assignment deterministic.
Random {
seed: Option<u64>,
sizes: SplitSizes,
},
/// Rows will be assigned to splits based on the values in the specified columns.
///
/// This will ensure rows are always assigned to the same split if the given columns do not change.
///
/// The `split_weights` are used to determine the approximate number of rows in each split. This
/// controls how we divide up the u64 hash space. However, it does not guarantee any particular division
/// of rows. For example, if all rows have identical hash values then all rows will be assigned to the same split
/// regardless of the weights.
///
/// The `discard_weight` controls what percentage of rows should be thrown away. For example, if you want your
/// first split to have ~5% of your rows and the second split to have ~10% of your rows then you would set
/// split_weights to [1, 2] and discard weight to 17 (or you could set split_weights to [5, 10] and discard_weight
/// to 85). If you set discard_weight to 0 then all rows will be assigned to a split.
Hash {
columns: Vec<String>,
split_weights: Vec<u64>,
discard_weight: u64,
},
/// Rows will be assigned to splits sequentially.
///
/// The first N1 rows are assigned to split 1, the next N2 rows are assigned to split 2, etc.
///
/// This is mainly useful for debugging and testing.
Sequential { sizes: SplitSizes },
/// Rows will be assigned to splits based on a calculation of one or more columns.
///
/// This is useful when the splits already exist in the base table.
///
/// The provided `calculation` should be an SQL statement that returns an integer value between
/// 0 and the number of splits - 1 (the number of splits is defined by the `splits` configuration).
///
/// If this strategy is used then the counts/percentages in the SplitSizes are ignored.
Calculated { calculation: String },
}
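The hash-space division described for the `Hash` strategy can be sketched with std only. `assign_split` is an illustrative name (not the crate's API): with `split_weights` `[1, 2]` and `discard_weight` `1`, the cumulative thresholds are `[1, 3]` over a total weight of 4, and a hash lands in split 0, split 1, or the discard tail.

```rust
// Illustrative sketch of mapping a hash to a split id via cumulative
// weight thresholds; None means the row is discarded.
fn assign_split(hash: u64, thresholds: &[u64], total_weight: u64) -> Option<u64> {
    let h = hash % total_weight;
    let split_id = match thresholds.binary_search(&h) {
        // An exact threshold hit belongs to the next bucket up
        Ok(i) => (i + 1) as u64,
        Err(i) => i as u64,
    };
    // Landing at or past the last threshold means the row is discarded
    if split_id == thresholds.len() as u64 {
        None
    } else {
        Some(split_id)
    }
}
```

So hash residue 0 goes to split 0 (weight 1), residues 1 and 2 go to split 1 (weight 2), and residue 3 is discarded (weight 1).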
// The default is not to split the data
//
// All data will be assigned to a single split.
impl Default for SplitStrategy {
fn default() -> Self {
Self::NoSplit
}
}
impl SplitStrategy {
pub fn validate(&self, num_rows: u64) -> Result<()> {
match self {
Self::NoSplit => Ok(()),
Self::Random { sizes, .. } => sizes.validate(num_rows),
Self::Hash {
split_weights,
columns,
..
} => {
if columns.is_empty() {
return Err(Error::InvalidInput {
message: "Hash strategy requires at least one column".to_string(),
});
}
if split_weights.is_empty() {
return Err(Error::InvalidInput {
message: "Hash strategy requires at least one split weight".to_string(),
});
}
if split_weights.contains(&0) {
return Err(Error::InvalidInput {
message: "Split weights must be greater than 0".to_string(),
});
}
Ok(())
}
Self::Sequential { sizes } => sizes.validate(num_rows),
Self::Calculated { .. } => Ok(()),
}
}
}
pub struct Splitter {
temp_dir: TemporaryDirectory,
strategy: SplitStrategy,
}
impl Splitter {
pub fn new(temp_dir: TemporaryDirectory, strategy: SplitStrategy) -> Self {
Self { temp_dir, strategy }
}
fn sequential_split_id(
num_rows: u64,
split_sizes: &[u64],
split_index: &AtomicUsize,
counter_in_split: &AtomicU64,
exhausted: &AtomicBool,
) -> UInt64Array {
let mut split_ids = Vec::<u64>::with_capacity(num_rows as usize);
while split_ids.len() < num_rows as usize {
let split_id = split_index.load(Ordering::Relaxed);
let counter = counter_in_split.load(Ordering::Relaxed);
let split_size = split_sizes[split_id];
let remaining_in_split = split_size - counter;
let remaining_in_batch = num_rows - split_ids.len() as u64;
let mut done = false;
let rows_to_add = if remaining_in_batch < remaining_in_split {
counter_in_split.fetch_add(remaining_in_batch, Ordering::Relaxed);
remaining_in_batch
} else {
split_index.fetch_add(1, Ordering::Relaxed);
counter_in_split.store(0, Ordering::Relaxed);
if split_id == split_sizes.len() - 1 {
exhausted.store(true, Ordering::Relaxed);
done = true;
}
remaining_in_split
};
split_ids.extend(iter::repeat(split_id as u64).take(rows_to_add as usize));
if done {
// Quit early if we've run out of splits
break;
}
}
UInt64Array::from(split_ids)
}
async fn apply_sequential(
&self,
source: SendableRecordBatchStream,
num_rows: u64,
sizes: &SplitSizes,
) -> Result<SendableRecordBatchStream> {
let split_sizes = sizes.to_counts(num_rows);
let split_index = AtomicUsize::new(0);
let counter_in_split = AtomicU64::new(0);
let exhausted = AtomicBool::new(false);
let schema = source.schema();
let new_schema = Arc::new(schema.try_with_column(Field::new(
SPLIT_ID_COLUMN,
DataType::UInt64,
false,
))?);
let new_schema_clone = new_schema.clone();
let stream = source.filter_map(move |batch| {
let batch = match batch {
Ok(batch) => batch,
Err(e) => {
return std::future::ready(Some(Err(e)));
}
};
if exhausted.load(Ordering::Relaxed) {
return std::future::ready(None);
}
let split_ids = Self::sequential_split_id(
batch.num_rows() as u64,
&split_sizes,
&split_index,
&counter_in_split,
&exhausted,
);
let mut arrays = batch.columns().to_vec();
// This can happen if we exhaust all splits in the middle of a batch
if split_ids.len() < batch.num_rows() {
arrays = arrays
.iter()
.map(|arr| arr.slice(0, split_ids.len()))
.collect();
}
arrays.push(Arc::new(split_ids));
std::future::ready(Some(Ok(
RecordBatch::try_new(new_schema.clone(), arrays).unwrap()
)))
});
Ok(Box::pin(SimpleRecordBatchStream::new(
stream,
new_schema_clone,
)))
}
fn hash_split_id(batch: &RecordBatch, thresholds: &[u64], total_weight: u64) -> UInt64Array {
let arrays = batch
.columns()
.iter()
// Don't hash the last column which should always be the row id
.take(batch.columns().len() - 1)
.cloned()
.collect::<Vec<_>>();
let mut hashes = vec![0; batch.num_rows()];
let random_state = ahash::RandomState::with_seeds(0, 0, 0, 0);
create_hashes(&arrays, &random_state, &mut hashes).unwrap();
// As an example, let's assume the weights are 1, 2. Our total weight is 3.
//
// Our thresholds are [1, 3]
// Our modulo output will be 0, 1, or 2.
//
// thresholds.binary_search(0) => Err(0) => 0
// thresholds.binary_search(1) => Ok(0) => 1
// thresholds.binary_search(2) => Err(1) => 1
let split_ids = hashes
.iter()
.map(|h| {
let h = h % total_weight;
let split_id = match thresholds.binary_search(&h) {
Ok(i) => (i + 1) as u64,
Err(i) => i as u64,
};
if split_id == thresholds.len() as u64 {
// If we're at the last threshold then we discard the row (indicated by setting
// the split_id to null)
None
} else {
Some(split_id)
}
})
.collect::<Vec<_>>();
UInt64Array::from(split_ids)
}
async fn apply_hash(
&self,
source: SendableRecordBatchStream,
weights: &[u64],
discard_weight: u64,
) -> Result<SendableRecordBatchStream> {
let row_id_index = source.schema().fields.len() - 1;
let new_schema = Arc::new(Schema::new(vec![
source.schema().field(row_id_index).clone(),
Field::new(SPLIT_ID_COLUMN, DataType::UInt64, false),
]));
let total_weight = weights.iter().sum::<u64>() + discard_weight;
// Thresholds are the cumulative sum of the weights
let mut offset = 0;
let thresholds = weights
.iter()
.map(|w| {
let value = offset + w;
offset = value;
value
})
.collect::<Vec<_>>();
let new_schema_clone = new_schema.clone();
let stream = source.map_ok(move |batch| {
let split_ids = Self::hash_split_id(&batch, &thresholds, total_weight);
if split_ids.null_count() > 0 {
let is_valid = split_ids.nulls().unwrap().inner();
let is_valid_mask = BooleanArray::new(is_valid.clone(), None);
let split_ids = arrow::compute::filter(&split_ids, &is_valid_mask).unwrap();
let row_ids = batch.column(row_id_index);
let row_ids = arrow::compute::filter(row_ids.as_ref(), &is_valid_mask).unwrap();
RecordBatch::try_new(new_schema.clone(), vec![row_ids, split_ids]).unwrap()
} else {
RecordBatch::try_new(
new_schema.clone(),
vec![batch.column(row_id_index).clone(), Arc::new(split_ids)],
)
.unwrap()
}
});
Ok(Box::pin(SimpleRecordBatchStream::new(
stream,
new_schema_clone,
)))
}
pub async fn apply(
&self,
source: SendableRecordBatchStream,
num_rows: u64,
) -> Result<SendableRecordBatchStream> {
self.strategy.validate(num_rows)?;
match &self.strategy {
// For consistency, even if no-split, we still give a split id column of all 0s
SplitStrategy::NoSplit => {
self.apply_sequential(source, num_rows, &SplitSizes::Counts(vec![num_rows]))
.await
}
SplitStrategy::Random { seed, sizes } => {
let shuffler = Shuffler::new(ShufflerConfig {
seed: *seed,
// In this case we are only shuffling row ids so we can use a large max_rows_per_file
max_rows_per_file: 10 * 1024 * 1024,
temp_dir: self.temp_dir.clone(),
clump_size: None,
});
let shuffled = shuffler.shuffle(source, num_rows).await?;
self.apply_sequential(shuffled, num_rows, sizes).await
}
SplitStrategy::Sequential { sizes } => {
self.apply_sequential(source, num_rows, sizes).await
}
// Nothing to do, split is calculated in projection
SplitStrategy::Calculated { .. } => Ok(source),
SplitStrategy::Hash {
split_weights,
discard_weight,
..
} => {
self.apply_hash(source, split_weights, *discard_weight)
.await
}
}
}
pub fn project(&self, query: Query) -> Query {
match &self.strategy {
SplitStrategy::Calculated { calculation } => query.select(Select::Dynamic(vec![(
SPLIT_ID_COLUMN.to_string(),
calculation.clone(),
)])),
SplitStrategy::Hash { columns, .. } => query.select(Select::Columns(columns.clone())),
_ => query,
}
}
pub fn orders_by_split_id(&self) -> bool {
match &self.strategy {
SplitStrategy::Hash { .. } | SplitStrategy::Calculated { .. } => true,
SplitStrategy::NoSplit
| SplitStrategy::Sequential { .. }
// It may be strange, but for random we shuffle first and then assign split ids
// sequentially, so the output is already ordered by split id
| SplitStrategy::Random { .. } => false,
}
}
}
/// Split configuration - either percentages or absolute counts
///
/// If the percentages do not sum to 1.0 (or the counts do not sum to the total number of rows)
/// the remaining rows will not be included in the permutation.
///
/// The default implementation assigns all rows to a single split.
#[derive(Debug, Clone)]
pub enum SplitSizes {
/// Percentage splits (must sum to <= 1.0)
///
/// The number of rows in each split is the nearest integer to the percentage multiplied by
/// the total number of rows.
Percentages(Vec<f64>),
/// Absolute row counts per split
///
/// If the dataset doesn't contain enough matching rows to fill all splits then an error
/// will be raised.
Counts(Vec<u64>),
/// Divides data into a fixed number of splits
///
/// Will divide the data evenly.
///
/// If the number of rows is not divisible by the number of splits then the rows per split
/// is rounded down.
Fixed(u64),
}
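Rounding each percentage independently can overshoot or undershoot `num_rows`, which is why `to_counts` below redistributes the excess. A minimal std-only version of the undershoot correction for percentages summing to ~1.0 (the name `percentages_to_counts` is illustrative, and the overshoot case is omitted here):

```rust
// Simplified sketch of converting split percentages to row counts, handing
// rows lost to rounding back out one split at a time.
fn percentages_to_counts(percentages: &[f64], num_rows: u64) -> Vec<u64> {
    let mut counts: Vec<u64> = percentages
        .iter()
        .map(|p| (p * num_rows as f64).round() as u64)
        .collect();
    let sum: u64 = counts.iter().sum();
    let sums_to_one = (percentages.iter().sum::<f64>() - 1.0).abs() < 1e-9;
    if sums_to_one && sum < num_rows {
        let mut deficit = num_rows - sum;
        let mut idx = 0;
        while deficit > 0 {
            counts[idx] += 1;
            deficit -= 1;
            idx = (idx + 1) % counts.len();
        }
    }
    counts
}
```

For example, three equal thirds of 100 rows each round to 33 (sum 99), so one split is topped up to 34 to avoid losing a row.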
impl Default for SplitSizes {
fn default() -> Self {
Self::Percentages(vec![1.0])
}
}
impl SplitSizes {
pub fn validate(&self, num_rows: u64) -> Result<()> {
match self {
Self::Percentages(percentages) => {
for percentage in percentages {
if *percentage < 0.0 || *percentage > 1.0 {
return Err(Error::InvalidInput {
message: "Split percentages must be between 0.0 and 1.0".to_string(),
});
}
if percentage * (num_rows as f64) < 1.0 {
return Err(Error::InvalidInput {
message: format!(
"One of the splits has {}% of {} rows which rounds to 0 rows",
percentage * 100.0,
num_rows
),
});
}
}
if percentages.iter().sum::<f64>() > 1.0 {
return Err(Error::InvalidInput {
message: "Split percentages must sum to 1.0 or less".to_string(),
});
}
}
Self::Counts(counts) => {
if counts.iter().sum::<u64>() > num_rows {
return Err(Error::InvalidInput {
message: format!(
"Split counts specified {} rows but only {} are available",
counts.iter().sum::<u64>(),
num_rows
),
});
}
if counts.contains(&0) {
return Err(Error::InvalidInput {
message: "Split counts must be greater than 0".to_string(),
});
}
}
Self::Fixed(num_splits) => {
if *num_splits > num_rows {
return Err(Error::InvalidInput {
message: format!(
"Split fixed config specified {} splits but only {} rows are available. Must have at least 1 row per split.",
*num_splits, num_rows
),
});
}
if (num_rows / num_splits) == 0 {
return Err(Error::InvalidInput {
message: format!(
"Split fixed config specified {} splits but only {} rows are available. Must have at least 1 row per split.",
*num_splits, num_rows
),
});
}
}
}
Ok(())
}
pub fn to_counts(&self, num_rows: u64) -> Vec<u64> {
let sizes = match self {
Self::Percentages(percentages) => {
let mut percentage_sum = 0.0_f64;
let mut counts = percentages
.iter()
.map(|p| {
let count = (p * (num_rows as f64)).round() as u64;
percentage_sum += p;
count
})
.collect::<Vec<_>>();
let sum = counts.iter().sum::<u64>();
let is_basically_one =
(num_rows as f64 - percentage_sum * num_rows as f64).abs() < 0.5;
// If the sum of percentages is close to 1.0 then rounding errors can add up
// to more or less than num_rows
//
// Drop items from buckets until we have the correct number of rows
let mut excess = sum as i64 - num_rows as i64;
let mut drop_idx = 0;
while excess > 0 {
if counts[drop_idx] > 0 {
counts[drop_idx] -= 1;
excess -= 1;
}
drop_idx += 1;
if drop_idx == counts.len() {
drop_idx = 0;
}
}
// On the other hand, if the percentages sum to ~1.0 then we also shouldn't _lose_
// rows due to rounding errors
let mut add_idx = 0;
while is_basically_one && excess < 0 {
counts[add_idx] += 1;
add_idx += 1;
excess += 1;
if add_idx == counts.len() {
add_idx = 0;
}
}
counts
}
Self::Counts(counts) => counts.clone(),
Self::Fixed(num_splits) => {
let rows_per_split = num_rows / *num_splits;
vec![rows_per_split; *num_splits as usize]
}
};
assert!(sizes.iter().sum::<u64>() <= num_rows);
sizes
}
}
#[cfg(test)]
mod tests {
use crate::arrow::LanceDbDatagenExt;
use super::*;
use arrow::{
array::AsArray,
compute::concat_batches,
datatypes::{Int32Type, UInt64Type},
};
use arrow_array::Int32Array;
use futures::TryStreamExt;
use lance_datagen::{BatchCount, ByteCount, RowCount, Seed};
use std::sync::Arc;
const ID_COLUMN: &str = "id";
#[test]
fn test_split_sizes_percentages_validation() {
// Valid percentages
let sizes = SplitSizes::Percentages(vec![0.7, 0.3]);
assert!(sizes.validate(100).is_ok());
// Sum > 1.0
let sizes = SplitSizes::Percentages(vec![0.7, 0.4]);
assert!(sizes.validate(100).is_err());
// Negative percentage
let sizes = SplitSizes::Percentages(vec![-0.1, 0.5]);
assert!(sizes.validate(100).is_err());
// Percentage > 1.0
let sizes = SplitSizes::Percentages(vec![1.5]);
assert!(sizes.validate(100).is_err());
// Percentage rounds to 0 rows
let sizes = SplitSizes::Percentages(vec![0.001]);
assert!(sizes.validate(100).is_err());
}
#[test]
fn test_split_sizes_counts_validation() {
// Valid counts
let sizes = SplitSizes::Counts(vec![30, 70]);
assert!(sizes.validate(100).is_ok());
// Sum > num_rows
let sizes = SplitSizes::Counts(vec![60, 50]);
assert!(sizes.validate(100).is_err());
// Counts are 0
let sizes = SplitSizes::Counts(vec![0, 100]);
assert!(sizes.validate(100).is_err());
}
#[test]
fn test_split_sizes_fixed_validation() {
// Valid fixed splits
let sizes = SplitSizes::Fixed(5);
assert!(sizes.validate(100).is_ok());
// More splits than rows
let sizes = SplitSizes::Fixed(150);
assert!(sizes.validate(100).is_err());
}
#[test]
fn test_split_sizes_to_sizes_percentages() {
let sizes = SplitSizes::Percentages(vec![0.3, 0.7]);
let result = sizes.to_counts(100);
assert_eq!(result, vec![30, 70]);
// Test rounding
let sizes = SplitSizes::Percentages(vec![0.3, 0.41]);
let result = sizes.to_counts(70);
assert_eq!(result, vec![21, 29]);
}
#[test]
fn test_split_sizes_to_sizes_fixed() {
let sizes = SplitSizes::Fixed(4);
let result = sizes.to_counts(100);
assert_eq!(result, vec![25, 25, 25, 25]);
// Test with remainder
let sizes = SplitSizes::Fixed(3);
let result = sizes.to_counts(10);
assert_eq!(result, vec![3, 3, 3]);
}
fn test_data() -> SendableRecordBatchStream {
lance_datagen::gen_batch()
.with_seed(Seed::from(42))
.col(ID_COLUMN, lance_datagen::array::step::<Int32Type>())
.into_ldb_stream(RowCount::from(10), BatchCount::from(5))
}
async fn verify_splitter(
splitter: Splitter,
data: SendableRecordBatchStream,
num_rows: u64,
expected_split_sizes: &[u64],
row_ids_in_order: bool,
) {
let split_batches = splitter
.apply(data, num_rows)
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let schema = split_batches[0].schema();
let split_batch = concat_batches(&schema, &split_batches).unwrap();
let total_split_sizes = expected_split_sizes.iter().sum::<u64>();
assert_eq!(split_batch.num_rows(), total_split_sizes as usize);
let mut expected = Vec::with_capacity(total_split_sizes as usize);
for (i, size) in expected_split_sizes.iter().enumerate() {
expected.extend(iter::repeat(i as u64).take(*size as usize));
}
let expected = Arc::new(UInt64Array::from(expected)) as Arc<dyn Array>;
assert_eq!(&expected, split_batch.column(1));
let expected_row_ids =
Arc::new(Int32Array::from_iter_values(0..total_split_sizes as i32)) as Arc<dyn Array>;
if row_ids_in_order {
assert_eq!(&expected_row_ids, split_batch.column(0));
} else {
assert_ne!(&expected_row_ids, split_batch.column(0));
}
}
#[tokio::test]
async fn test_fixed_sequential_split() {
let splitter = Splitter::new(
// Sequential splitting doesn't need a temp dir
TemporaryDirectory::None,
SplitStrategy::Sequential {
sizes: SplitSizes::Fixed(3),
},
);
verify_splitter(splitter, test_data(), 50, &[16, 16, 16], true).await;
}
#[tokio::test]
async fn test_fixed_random_split() {
let splitter = Splitter::new(
TemporaryDirectory::None,
SplitStrategy::Random {
seed: Some(42),
sizes: SplitSizes::Fixed(3),
},
);
verify_splitter(splitter, test_data(), 50, &[16, 16, 16], false).await;
}
#[tokio::test]
async fn test_counts_sequential_split() {
let splitter = Splitter::new(
// Sequential splitting doesn't need a temp dir
TemporaryDirectory::None,
SplitStrategy::Sequential {
sizes: SplitSizes::Counts(vec![5, 15, 10]),
},
);
verify_splitter(splitter, test_data(), 50, &[5, 15, 10], true).await;
}
#[tokio::test]
async fn test_counts_random_split() {
let splitter = Splitter::new(
TemporaryDirectory::None,
SplitStrategy::Random {
seed: Some(42),
sizes: SplitSizes::Counts(vec![5, 15, 10]),
},
);
verify_splitter(splitter, test_data(), 50, &[5, 15, 10], false).await;
}
#[tokio::test]
async fn test_percentages_sequential_split() {
let splitter = Splitter::new(
// Sequential splitting doesn't need a temp dir
TemporaryDirectory::None,
SplitStrategy::Sequential {
sizes: SplitSizes::Percentages(vec![0.217, 0.168, 0.17]),
},
);
verify_splitter(splitter, test_data(), 50, &[11, 8, 9], true).await;
}
#[tokio::test]
async fn test_percentages_random_split() {
let splitter = Splitter::new(
TemporaryDirectory::None,
SplitStrategy::Random {
seed: Some(42),
sizes: SplitSizes::Percentages(vec![0.217, 0.168, 0.17]),
},
);
verify_splitter(splitter, test_data(), 50, &[11, 8, 9], false).await;
}
#[tokio::test]
async fn test_hash_split() {
let data = lance_datagen::gen_batch()
.with_seed(Seed::from(42))
.col(
"hash1",
lance_datagen::array::rand_utf8(ByteCount::from(10), false),
)
.col("hash2", lance_datagen::array::step::<Int32Type>())
.col(ID_COLUMN, lance_datagen::array::step::<Int32Type>())
.into_ldb_stream(RowCount::from(10), BatchCount::from(5));
let splitter = Splitter::new(
TemporaryDirectory::None,
SplitStrategy::Hash {
columns: vec!["hash1".to_string(), "hash2".to_string()],
split_weights: vec![1, 2],
discard_weight: 1,
},
);
let split_batches = splitter
.apply(data, 10)
.await
.unwrap()
.try_collect::<Vec<_>>()
.await
.unwrap();
let schema = split_batches[0].schema();
let split_batch = concat_batches(&schema, &split_batches).unwrap();
// These assertions are all based on fixed seed in data generation but they match
// up roughly to what we expect (25% discarded, 25% in split 0, 50% in split 1)
// 14 rows (28%) are discarded because discard_weight is 1
assert_eq!(split_batch.num_rows(), 36);
assert_eq!(split_batch.num_columns(), 2);
let split_ids = split_batch.column(1).as_primitive::<UInt64Type>().values();
let num_in_split_0 = split_ids.iter().filter(|v| **v == 0).count();
let num_in_split_1 = split_ids.iter().filter(|v| **v == 1).count();
assert_eq!(num_in_split_0, 11); // 22%
assert_eq!(num_in_split_1, 25); // 50%
}
}


@@ -0,0 +1,98 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
use std::{path::PathBuf, sync::Arc};
use arrow_array::RecordBatch;
use arrow_schema::{Fields, Schema};
use datafusion_execution::disk_manager::DiskManagerMode;
use futures::TryStreamExt;
use rand::{rngs::SmallRng, RngCore, SeedableRng};
use tempfile::TempDir;
use crate::{
arrow::{SendableRecordBatchStream, SimpleRecordBatchStream},
Error, Result,
};
/// Directory to use for temporary files
#[derive(Debug, Clone, Default)]
pub enum TemporaryDirectory {
/// Use the operating system's default temporary directory (e.g. /tmp)
#[default]
OsDefault,
/// Use the specified directory (must be an absolute path)
Specific(PathBuf),
/// If spilling is required, then error out
None,
}
impl TemporaryDirectory {
pub fn create_temp_dir(&self) -> Result<TempDir> {
match self {
Self::OsDefault => tempfile::tempdir(),
Self::Specific(path) => tempfile::Builder::default().tempdir_in(path),
Self::None => {
return Err(Error::Runtime {
message: "No temporary directory was supplied and this operation requires spilling to disk".to_string(),
});
}
}
.map_err(|err| Error::Other {
message: "Failed to create temporary directory".to_string(),
source: Some(err.into()),
})
}
pub fn to_disk_manager_mode(&self) -> DiskManagerMode {
match self {
Self::OsDefault => DiskManagerMode::OsTmpDirectory,
Self::Specific(path) => DiskManagerMode::Directories(vec![path.clone()]),
Self::None => DiskManagerMode::Disabled,
}
}
}
pub fn non_crypto_rng(seed: &Option<u64>) -> Box<dyn RngCore + Send> {
Box::new(
seed.as_ref()
.map(|seed| SmallRng::seed_from_u64(*seed))
.unwrap_or_else(SmallRng::from_os_rng),
)
}
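The seed parameter is what makes a permutation persistent: the same seed must replay the same random sequence. A minimal std-only illustration, using a toy LCG as a stand-in for the `rand` crate's `SmallRng`:

```rust
// Toy linear congruential generator standing in for SmallRng: the point
// is only that a fixed seed yields an identical sequence on every run.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn main() {
    let mut a = Lcg(42);
    let mut b = Lcg(42);
    // Same seed => identical stream, so a shuffle driven by it is replayable.
    assert_eq!(a.next_u64(), b.next_u64());
    assert_eq!(a.next_u64(), b.next_u64());
    println!("deterministic");
}
```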
/// Rename a single column in a record batch stream, preserving field order and schema metadata
pub fn rename_column(
stream: SendableRecordBatchStream,
old_name: &str,
new_name: &str,
) -> Result<SendableRecordBatchStream> {
let schema = stream.schema();
let field_index = schema.index_of(old_name)?;
let new_fields = schema
.fields
.iter()
.cloned()
.enumerate()
.map(|(idx, f)| {
if idx == field_index {
Arc::new(f.as_ref().clone().with_name(new_name))
} else {
f
}
})
.collect::<Fields>();
let new_schema = Arc::new(Schema::new(new_fields).with_metadata(schema.metadata().clone()));
let new_schema_clone = new_schema.clone();
let renamed_stream = stream.and_then(move |batch| {
let renamed_batch =
RecordBatch::try_new(new_schema.clone(), batch.columns().to_vec()).map_err(Error::from);
std::future::ready(renamed_batch)
});
Ok(Box::pin(SimpleRecordBatchStream::new(
renamed_stream,
new_schema_clone,
)))
}
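The core of `rename_column` is a map over the field list that swaps only the matching name and passes the rest through untouched. A hypothetical std-only sketch of that shape (plain strings in place of Arrow `Field`s):

```rust
// Hypothetical sketch of the rename_column approach: rebuild the field
// list, replacing only the field whose name matches `old`.
fn rename_field(fields: &[&str], old: &str, new: &str) -> Vec<String> {
    fields
        .iter()
        .map(|f| if *f == old { new.to_string() } else { f.to_string() })
        .collect()
}

fn main() {
    let renamed = rename_field(&["id", "vector", "text"], "vector", "embedding");
    assert_eq!(renamed, vec!["id", "embedding", "text"]);
    println!("renamed");
}
```

In the real function the column *data* never moves; only the schema attached to each `RecordBatch` is rebuilt.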


@@ -8,6 +8,7 @@ use std::sync::Arc;
use std::time::Duration;
use vector::IvfFlatIndexBuilder;
use crate::index::vector::IvfRqIndexBuilder;
use crate::{table::BaseTable, DistanceType, Error, Result};
use self::{
@@ -53,6 +54,9 @@ pub enum Index {
/// IVF index with Product Quantization
IvfPq(IvfPqIndexBuilder),
/// IVF index with RabitQ Quantization
IvfRq(IvfRqIndexBuilder),
/// IVF-HNSW index with Product Quantization
/// It is a variant of the HNSW algorithm that uses product quantization to compress the vectors.
IvfHnswPq(IvfHnswPqIndexBuilder),
@@ -275,6 +279,8 @@ pub enum IndexType {
IvfFlat,
#[serde(alias = "IVF_PQ")]
IvfPq,
#[serde(alias = "IVF_RQ")]
IvfRq,
#[serde(alias = "IVF_HNSW_PQ")]
IvfHnswPq,
#[serde(alias = "IVF_HNSW_SQ")]
@@ -296,6 +302,7 @@ impl std::fmt::Display for IndexType {
match self {
Self::IvfFlat => write!(f, "IVF_FLAT"),
Self::IvfPq => write!(f, "IVF_PQ"),
Self::IvfRq => write!(f, "IVF_RQ"),
Self::IvfHnswPq => write!(f, "IVF_HNSW_PQ"),
Self::IvfHnswSq => write!(f, "IVF_HNSW_SQ"),
Self::BTree => write!(f, "BTREE"),
@@ -317,6 +324,7 @@ impl std::str::FromStr for IndexType {
"FTS" | "INVERTED" => Ok(Self::FTS),
"IVF_FLAT" => Ok(Self::IvfFlat),
"IVF_PQ" => Ok(Self::IvfPq),
"IVF_RQ" => Ok(Self::IvfRq),
"IVF_HNSW_PQ" => Ok(Self::IvfHnswPq),
"IVF_HNSW_SQ" => Ok(Self::IvfHnswSq),
_ => Err(Error::InvalidInput {
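The hunks above keep `Display` and `FromStr` in lockstep so the new `IVF_RQ` name round-trips. A self-contained sketch of that pattern (a trimmed two-variant enum with a plain `String` error, not the crate's actual types):

```rust
use std::str::FromStr;

// Trimmed-down sketch of the IndexType Display/FromStr round-trip
// extended by this diff; only two variants shown for brevity.
#[derive(Debug, PartialEq)]
enum IndexType {
    IvfPq,
    IvfRq,
}

impl std::fmt::Display for IndexType {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::IvfPq => write!(f, "IVF_PQ"),
            Self::IvfRq => write!(f, "IVF_RQ"),
        }
    }
}

impl FromStr for IndexType {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "IVF_PQ" => Ok(Self::IvfPq),
            "IVF_RQ" => Ok(Self::IvfRq),
            other => Err(format!("unknown index type: {other}")),
        }
    }
}

fn main() {
    let rq: IndexType = "IVF_RQ".parse().unwrap();
    assert_eq!(rq, IndexType::IvfRq);
    // Display emits the same string FromStr accepts.
    assert_eq!(rq.to_string(), "IVF_RQ");
    assert!("IVF_XX".parse::<IndexType>().is_err());
    println!("round-trip ok");
}
```

Keeping both impls exhaustive over the same variant list is what prevents a new index type from parsing but printing wrong, or vice versa.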

Some files were not shown because too many files have changed in this diff.