mirror of
https://github.com/lancedb/lancedb.git
synced 2026-06-24 14:40:41 +00:00
Compare commits
2 Commits
python-v0.
...
will/index
| Author | SHA1 | Date | |
|---|---|---|---|
| | f66285dd3c | | |
| | 7ef49beaa8 | | |
@@ -1,178 +0,0 @@
|
||||
---
|
||||
name: lancedb-column-metadata
|
||||
description: Column metadata authoring for LanceDB tables via the REST API. This skill is required for tasks like writing field descriptions, setting tags on columns (field_type, model, project_id, version), classifying columns as embeddings vs labels vs eval metrics, or grouping versioned columns into logical families — because it has the API integration needed to read the schema and persist metadata back. Invoke whenever someone wants to document, annotate, tag, or classify what their table columns ARE. Trigger even without an explicit "LanceDB" mention, as long as the context is column-level documentation or tagging for an ML or vector database table.
|
||||
metadata:
|
||||
short-description: Write column descriptions, tags, and logical groupings to a LanceDB table
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
This skill authors column-level metadata for a LanceDB table. It connects to a LanceDB deployment over its REST API, inspects the table schema, generates appropriate metadata, and writes it back.
|
||||
|
||||
## Step 0: Establish the connection
|
||||
|
||||
Use the `lancedb-connect` skill (invoke it via the Skill tool) to resolve the base URL and auth headers (`x-api-key`, `x-lancedb-database`) for whichever deployment the user is working against — enterprise/self-hosted or a local dev server. Skip it only if the connection details are already established in the conversation.
|
||||
|
||||
All examples below use `{base_url}` — substitute the resolved endpoint and include the resolved headers on every request.
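As a rough illustration of "include the resolved headers on every request", the sketch below builds a request object without sending it. The helper name `build_request` and the sample values are hypothetical; only the two header names and the describe path come from this document.

```python
import json
import urllib.request


def build_request(base_url, path, api_key, database, body=None):
    """Build (but do not send) a POST request carrying the resolved headers."""
    url = base_url.rstrip("/") + path
    # An empty JSON object is sent when no body is given.
    data = json.dumps(body if body is not None else {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "x-lancedb-database": database,
        },
    )


# Hypothetical values; substitute whatever lancedb-connect resolved.
req = build_request(
    "http://localhost:2333", "/v1/table/my_table/describe", "my-key", "my_db"
)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch stays side-effect free.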
|
||||
|
||||
## Metadata keys
|
||||
|
||||
All metadata uses namespaced keys:
|
||||
|
||||
| Key | Purpose | Example value |
|-----|---------|---------------|
| `lancedb:description` | Human-readable explanation of what the column contains | `"CLIP ViT-L/14 image embedding, L2-normalized (768-dim)"` |
| `lancedb:tag:<name>` | Flexible key-value tag; the suffix names the tag category | `lancedb:tag:field_type: "embedding"`, `lancedb:tag:model: "clip"`, `lancedb:tag:project_id: "foo"` |
| `lancedb:logical-column` | Logical group/family this column belongs to | `"clip_features"` |
|
||||
|
||||
Tags are open-ended — use whatever key suffix and value make sense given the user's intent. The tag suffix should describe *what is being classified* (e.g., `field_type`, `model`, `project_id`) and the value describes *how*.
|
||||
|
||||
## Step 1: Resolve the table identifier
|
||||
|
||||
You need:
|
||||
- **Table name** (required) — e.g., `my_table` or `my_namespace.my_table`
|
||||
- **Database name** — ask if not provided and not inferable from context; it goes in the `x-lancedb-database` header, never in the URL path
|
||||
|
||||
The table identifier in the URL path is typically `table_name` for a top-level table, or `namespace$table_name` if the table lives in a namespace. The API accepts a `delimiter` query parameter to parse compound identifiers (default `$`).
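A minimal sketch of how the identifier and the `delimiter` parameter might be combined into a describe path. The function name `table_path` and the list-of-strings namespace argument are assumptions made for illustration, not part of the API.

```python
from urllib.parse import urlencode


def table_path(name, namespace=(), delimiter="$"):
    """Return the describe path for a table, joining any namespace parts."""
    # Top-level tables have no namespace parts, so the identifier is just the name.
    ident = delimiter.join([*namespace, name])
    path = f"/v1/table/{ident}/describe"
    # Only pass `delimiter` explicitly when it differs from the default `$`.
    if delimiter != "$":
        path += "?" + urlencode({"delimiter": delimiter})
    return path
```

For example, `table_path("my_table", ["my_namespace"])` gives `/v1/table/my_namespace$my_table/describe`.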
|
||||
|
||||
## Step 2: Describe the table
|
||||
|
||||
```http
POST {base_url}/v1/table/{table_id}/describe
Content-Type: application/json

{}
```
|
||||
|
||||
The response contains `schema.fields` — an array of field objects:
|
||||
|
||||
```json
{
  "schema": {
    "fields": [
      {
        "name": "clip_embedding_v3",
        "type": { "type": "FixedSizeList", "fields": [...], "listSize": 768 },
        "nullable": true,
        "metadata": { "lancedb:description": "..." }
      }
    ]
  }
}
```
|
||||
|
||||
Each field has:
|
||||
- `name` — field name
|
||||
- `type` — Arrow data type (check `type.type` for the type string)
|
||||
- `nullable` — boolean
|
||||
- `metadata` — existing key-value metadata (read this before writing to avoid redundant updates)
|
||||
|
||||
For struct/nested fields, recurse into `type.fields` and represent them as dot-notation paths (e.g., `parent.child`).
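The recursion into nested fields can be sketched as below. It assumes struct fields report a `type.type` string of `"Struct"` (compared case-insensitively), which the example response above does not confirm; list types are deliberately not descended into, since their child is the list item rather than a named column.

```python
def field_paths(fields, prefix=""):
    """Yield (dot_path, field) for every field, descending into struct children."""
    for field in fields:
        path = f"{prefix}.{field['name']}" if prefix else field["name"]
        yield path, field
        type_info = field.get("type", {})
        # Assumption: the type string for nested records is "Struct".
        if str(type_info.get("type", "")).lower() == "struct":
            yield from field_paths(type_info.get("fields", []), path)
```

Calling `dict(field_paths(schema["fields"]))` gives a path-to-field mapping that later steps can use to look up existing metadata.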
|
||||
|
||||
If the user hasn't specified which columns to update, work with all columns.
|
||||
|
||||
## Step 3: Generate metadata
|
||||
|
||||
Decide what to generate based on the user's request.
|
||||
|
||||
### Writing descriptions (`lancedb:description`)
|
||||
|
||||
Base descriptions on:
|
||||
- The column name and Arrow type (e.g., `FixedSizeList` of floats → likely an embedding)
|
||||
- User-supplied context (upstream pipeline, sample values, domain knowledge)
|
||||
- Name patterns: `_embedding`/`_vec`/`_embed` → vector; `_label`/`_class` → label; `_score`/`_eval`/`_metric` → evaluation metric
|
||||
|
||||
Be specific and concise. Good: `"Sentence-BERT embedding of the query text (768-dim)."` Not: `"An embedding column."`
|
||||
|
||||
### Tagging columns (`lancedb:tag:<name>`)
|
||||
|
||||
Choose tag key names that match what the user asked to annotate. Common patterns:
|
||||
|
||||
- Semantic field type → `lancedb:tag:field_type: "embedding"` / `"text"` / `"image"` / `"label"` / `"eval"` / `"id"` / `"metadata"`
|
||||
- Model or source → `lancedb:tag:model: "clip"` / `"bert"` / `"vit"`
|
||||
- Project affiliation → `lancedb:tag:project_id: "<name>"`
|
||||
- Version → `lancedb:tag:version: "v3"` (and `lancedb:tag:latest: "true"` for the newest)
|
||||
|
||||
Use Arrow type as a hint: `FixedSizeList` + float → embedding; `Utf8`/`LargeUtf8` → text; `Binary` → image or blob.
|
||||
|
||||
Multiple tags on the same column are fine — each is a separate key.
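One possible way to turn the name patterns and Arrow type hints into a first guess for `lancedb:tag:field_type` is sketched here. The function name and the precedence (name hints before type hints) are assumptions; the sketch also skips the float check on `FixedSizeList` and leaves `Binary` unclassified because image versus blob cannot be told from the type alone.

```python
# Substrings from the name patterns, mapped to field_type values.
NAME_HINTS = [
    (("_embedding", "_vec", "_embed"), "embedding"),
    (("_label", "_class"), "label"),
    (("_score", "_eval", "_metric"), "eval"),
]

# Arrow type strings used as a fallback when the name gives no hint.
TYPE_HINTS = {
    "FixedSizeList": "embedding",
    "Utf8": "text",
    "LargeUtf8": "text",
}


def guess_field_type(name, arrow_type):
    """Return a suggested field_type, or None when nothing matches."""
    lowered = name.lower()
    for tokens, field_type in NAME_HINTS:
        if any(token in lowered for token in tokens):
            return field_type
    return TYPE_HINTS.get(arrow_type)
```

A `None` result is a signal to ask the user or fall back on supplied context rather than to write a tag.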
|
||||
|
||||
### Grouping into logical columns (`lancedb:logical-column`)
|
||||
|
||||
Look for naming patterns across columns:
|
||||
- `clip_v1`, `clip_v2`, `clip_v3` → logical column `"clip"`, latest is `v3`
|
||||
- `text_embed_20240101`, `text_embed_20240601` → logical column `"text_embed"`, latest is the most recent date suffix
|
||||
|
||||
Write `lancedb:logical-column` on all members of a group. Mark the newest with `lancedb:tag:latest: "true"` (in addition to its version tag).
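The grouping rule can be sketched as follows, assuming version suffixes look like `_v<N>` or an eight-digit date as in the two examples above. The regular expression, the function name, and the choice to reuse a date suffix as the version tag value are illustrative assumptions.

```python
import re
from collections import defaultdict

# Matches names such as clip_v3 or text_embed_20240601.
SUFFIX = re.compile(r"^(?P<base>.+)_(?P<version>v\d+|\d{8})$")


def group_versions(column_names):
    """Map each versioned column to the metadata that marks its family."""
    groups = defaultdict(list)
    for name in column_names:
        match = SUFFIX.match(name)
        if match:
            groups[match["base"]].append((match["version"], name))

    updates = {}
    for base, members in groups.items():
        if len(members) < 2:
            continue  # a single column is not a family
        # Compare versions numerically: v3 > v2, 20240601 > 20240101.
        latest = max(members, key=lambda m: int(m[0].lstrip("v")))[1]
        for version, name in members:
            metadata = {
                "lancedb:logical-column": base,
                "lancedb:tag:version": version,
            }
            if name == latest:
                metadata["lancedb:tag:latest"] = "true"
            updates[name] = metadata
    return updates
```

Columns that do not match the pattern, or that have no sibling versions, are left out of the result.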
|
||||
|
||||
## Step 4: Write the metadata
|
||||
|
||||
```http
POST {base_url}/v1/table/{table_id}/update_field_metadata
Content-Type: application/json

{
  "updates": [
    {
      "path": "clip_v3",
      "metadata": {
        "lancedb:description": "CLIP ViT-L/14 image embedding, L2-normalized (768-dim).",
        "lancedb:tag:field_type": "embedding",
        "lancedb:tag:model": "clip",
        "lancedb:tag:version": "v3",
        "lancedb:tag:latest": "true",
        "lancedb:logical-column": "clip"
      },
      "replace": false
    },
    {
      "path": "clip_v2",
      "metadata": {
        "lancedb:description": "CLIP ViT-B/32 image embedding (512-dim), superseded by v3.",
        "lancedb:tag:field_type": "embedding",
        "lancedb:tag:model": "clip",
        "lancedb:tag:version": "v2",
        "lancedb:logical-column": "clip"
      },
      "replace": false
    }
  ]
}
```
|
||||
|
||||
Rules:
|
||||
- **Use `"replace": false`** (merge) by default — this preserves existing metadata the user didn't ask to change
|
||||
- Use `"replace": true` only if the user explicitly asks to overwrite all existing metadata on a column
|
||||
- Set a value to `null` to delete a specific key
|
||||
- Batch all updates in a single request when possible
|
||||
|
||||
The response includes `version` (new table version) and `fields` (the updated metadata per field).
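To show how the rules above interact with the advice to read existing metadata first, here is a client-side sketch that builds the `updates` array and drops values the table already holds. The helper name and its dictionary inputs are hypothetical; the server's merge behavior is taken from the rules above and not modeled here.

```python
def build_updates(desired, existing, replace=False):
    """Return an `updates` list that skips values the table already has.

    desired:  {path: {key: value_or_None}} metadata to write
    existing: {path: {key: value}} metadata read from the describe response
    """
    updates = []
    for path, metadata in desired.items():
        current = existing.get(path, {})
        if replace:
            # Overwrite mode: send the full metadata exactly as given.
            changed = dict(metadata)
        else:
            # Merge mode: keep only keys whose stored value differs. A None
            # (delete) for a key that is already absent is dropped as well.
            changed = {k: v for k, v in metadata.items() if current.get(k) != v}
        if changed:
            updates.append({"path": path, "metadata": changed, "replace": replace})
    return updates
```

Paths that end up with nothing to change are the ones to report as skipped in the confirmation step.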
|
||||
|
||||
## Step 5: Confirm
|
||||
|
||||
Report back:
|
||||
- Which columns were updated and what was written
|
||||
- The new table version number
|
||||
- Any columns skipped (e.g., already had up-to-date metadata)
|
||||
|
||||
---
|
||||
|
||||
## Quick examples
|
||||
|
||||
**"Write descriptions for all columns in the `product_embeddings` table"**
|
||||
1. POST `/v1/table/product_embeddings/describe` → get all fields
|
||||
2. Generate a `lancedb:description` for each column based on name + type
|
||||
3. POST `update_field_metadata` with descriptions
|
||||
4. Report
|
||||
|
||||
**"Tag the columns in `model_outputs` with their field type and model"**
|
||||
1. Describe `model_outputs`
|
||||
2. For each field, classify by name + Arrow type → set `lancedb:tag:field_type` and `lancedb:tag:model` where applicable
|
||||
3. POST `update_field_metadata`
|
||||
4. Report
|
||||
|
||||
**"Group the feature columns in `training_features` into logical families and mark the latest version"**
|
||||
1. Describe the table
|
||||
2. Find version patterns → assign `lancedb:logical-column` and `lancedb:tag:version`; mark newest with `lancedb:tag:latest: "true"`
|
||||
3. POST `update_field_metadata`
|
||||
4. Show the grouping
|
||||
@@ -1,42 +0,0 @@
|
||||
---
|
||||
name: lancedb-connect
|
||||
description: Resolve how to connect to a LanceDB deployment over the REST API — figure out the base URL, API key, and database header. Use this before making any REST requests to a LanceDB table, whenever the endpoint or auth setup is not already known. Also useful on its own when someone asks how to connect, authenticate, or curl their LanceDB instance.
|
||||
metadata:
|
||||
short-description: Resolve the base URL and auth headers for a LanceDB deployment
|
||||
---
|
||||
|
||||
## Goal
|
||||
|
||||
Produce two things every REST request needs:
|
||||
|
||||
1. **Base URL** — the endpoint
|
||||
2. **Headers** — `x-api-key`, and usually `x-lancedb-database`
|
||||
|
||||
## Resolution steps
|
||||
|
||||
1. If the user already gave a URL and API key (or said which environment they're working against), use that.
|
||||
2. Otherwise, look for credentials already available in the environment:
|
||||
- Env vars like `LANCEDB_URI` / `LANCEDB_HOST` / `LANCEDB_API_KEY`
|
||||
- A LanceDB endpoint already running or port-forwarded locally (the REST default port is 2333, i.e. `http://localhost:2333`)
|
||||
3. If you didn't find both pieces, ask the user directly: **"What's your LanceDB endpoint's URL, and what's your API key?"** Also ask which database to use if it isn't obvious. Don't guess or probe further — the user knows their deployment.
|
||||
|
||||
## Validating the connection
|
||||
|
||||
Make a cheap authenticated request and check the status:
|
||||
|
||||
```bash
curl -s -w "\n%{http_code}" "{base_url}/v1/table/?limit=1" \
  -H "x-api-key: <key>" \
  -H "x-lancedb-database: <database>"
```
|
||||
|
||||
- `200` — connection, key, and database header all good
|
||||
- `401` — API key missing or wrong
|
||||
- `400` mentioning a database header — this deployment expects `x-lancedb-database`
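The status checks above could be folded into a small helper like the one below. The return labels and the substring test on the response body are assumptions for illustration; the actual error text of a given deployment may differ.

```python
def diagnose(status, body=""):
    """Translate a validation response into a short diagnosis label."""
    if status == 200:
        return "ok"
    if status == 401:
        return "bad_api_key"
    # Assumption: the 400 body mentions the word "database" in this case.
    if status == 400 and "database" in body.lower():
        return "missing_database_header"
    return "unknown"
```

Anything that comes back as `"unknown"` is a cue to ask the user about their deployment instead of probing further.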
|
||||
|
||||
## Non-REST equivalents
|
||||
|
||||
If the caller would rather use the SDK or CLI than raw REST, the same credentials work:
|
||||
|
||||
- Python SDK: `lancedb.connect("db://<database>", api_key="<key>", host_override="<base_url>")`
|
||||
- `lancedb` CLI: a `[profiles.<name>]` entry in `~/.lancedb/config.toml` with `http_server_url`, `api_key`, `database`
|
||||
@@ -1,5 +1,5 @@
|
||||
[tool.bumpversion]
|
||||
current_version = "0.31.0-beta.1"
|
||||
current_version = "0.30.1-beta.2"
|
||||
parse = """(?x)
|
||||
(?P<major>0|[1-9]\\d*)\\.
|
||||
(?P<minor>0|[1-9]\\d*)\\.
|
||||
@@ -23,8 +23,6 @@ allow_dirty = true
|
||||
commit = true
|
||||
message = "Bump version: {current_version} → {new_version}"
|
||||
commit_args = ""
|
||||
# bump-my-version >=1.4.0 rejects pre_commit_hooks containing shell syntax unless opted in.
|
||||
allow_shell_hooks = true
|
||||
|
||||
# Java maven files
|
||||
pre_commit_hooks = [
|
||||
|
||||
163
Cargo.lock
generated
@@ -1376,7 +1376,18 @@ checksum = "d640d25bc63c50fb1f0b545ffd80207d2e10a4c965530809b40ba3386825c391"
|
||||
dependencies = [
|
||||
"alloc-no-stdlib",
|
||||
"alloc-stdlib",
|
||||
"brotli-decompressor",
|
||||
"brotli-decompressor 2.5.1",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "brotli"
|
||||
version = "8.0.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "4bd8b9603c7aa97359dbd97ecf258968c95f3adddd6db2f7e7a5bef101c84560"
|
||||
dependencies = [
|
||||
"alloc-no-stdlib",
|
||||
"alloc-stdlib",
|
||||
"brotli-decompressor 5.0.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -1389,6 +1400,16 @@ dependencies = [
|
||||
"alloc-stdlib",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "brotli-decompressor"
|
||||
version = "5.0.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "874bb8112abecc98cbd6d81ea4fa7e94fb9449648c93cc89aa40c81c24d7de03"
|
||||
dependencies = [
|
||||
"alloc-no-stdlib",
|
||||
"alloc-stdlib",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "bs58"
|
||||
version = "0.5.1"
|
||||
@@ -1472,9 +1493,9 @@ checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b"
|
||||
|
||||
[[package]]
|
||||
name = "bytes"
|
||||
version = "1.12.0"
|
||||
version = "1.11.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "8ae3f5d315924270530207e2a68396c3cc547f6dca3fbdca317cfb1a51edb593"
|
||||
checksum = "1e748733b7cbc798e1434b6ac524f0c1ff2ab456fe201501e6497c8417a4fc33"
|
||||
|
||||
[[package]]
|
||||
name = "bytes-utils"
|
||||
@@ -3432,8 +3453,8 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
|
||||
|
||||
[[package]]
|
||||
name = "fsst"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"rand 0.9.4",
|
||||
@@ -4735,8 +4756,8 @@ checksum = "e037a2e1d8d5fdbd49b16a4ea09d5d6401c1f29eca5ff29d03d3824dba16256a"
|
||||
|
||||
[[package]]
|
||||
name = "lance"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arc-swap",
|
||||
"arrow",
|
||||
@@ -4810,8 +4831,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-arrow"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -4832,7 +4853,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-arrow-scalar"
|
||||
version = "58.0.0"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -4846,7 +4867,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-arrow-stats"
|
||||
version = "58.0.0"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-schema",
|
||||
@@ -4855,8 +4876,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-bitpacking"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrayref",
|
||||
"paste",
|
||||
@@ -4865,8 +4886,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-core"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -4904,8 +4925,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-datafusion"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-array",
|
||||
@@ -4935,8 +4956,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-datagen"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-array",
|
||||
@@ -4949,12 +4970,13 @@ dependencies = [
|
||||
"rand 0.9.4",
|
||||
"rand_distr 0.5.1",
|
||||
"rand_xoshiro",
|
||||
"random_word 0.5.2",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lance-derive"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
@@ -4963,8 +4985,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-encoding"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-arith",
|
||||
"arrow-array",
|
||||
@@ -4999,8 +5021,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-file"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-arith",
|
||||
"arrow-array",
|
||||
@@ -5030,8 +5052,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-index"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arc-swap",
|
||||
"arrow",
|
||||
@@ -5083,7 +5105,6 @@ dependencies = [
|
||||
"rand_distr 0.5.1",
|
||||
"rangemap",
|
||||
"rayon",
|
||||
"regex-syntax",
|
||||
"roaring",
|
||||
"serde",
|
||||
"serde_json",
|
||||
@@ -5096,8 +5117,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-io"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-arith",
|
||||
@@ -5138,8 +5159,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-linalg"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -5150,13 +5171,12 @@ dependencies = [
|
||||
"lance-core",
|
||||
"num-traits",
|
||||
"rand 0.9.4",
|
||||
"rayon",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lance-namespace"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"async-trait",
|
||||
@@ -5168,8 +5188,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-namespace-impls"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-ipc",
|
||||
@@ -5179,8 +5199,6 @@ dependencies = [
|
||||
"base64 0.22.1",
|
||||
"bytes",
|
||||
"chrono",
|
||||
"datafusion-common",
|
||||
"datafusion-physical-plan",
|
||||
"futures",
|
||||
"hmac 0.12.1",
|
||||
"lance",
|
||||
@@ -5195,23 +5213,20 @@ dependencies = [
|
||||
"quick-xml 0.38.4",
|
||||
"rand 0.9.4",
|
||||
"reqwest 0.12.28",
|
||||
"roaring",
|
||||
"serde",
|
||||
"serde_json",
|
||||
"sha2 0.10.9",
|
||||
"time",
|
||||
"tokio",
|
||||
"tower",
|
||||
"tower-http 0.5.2",
|
||||
"url",
|
||||
"uuid",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lance-namespace-reqwest-client"
|
||||
version = "0.8.6"
|
||||
version = "0.8.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "ba3f0a235e3ed5f8805205649ccc7d7d0f3df23ce1294242c9265ad488d7f19d"
|
||||
checksum = "7a09733325812e046cb217d548afc4864dedb59545389d45cd498b3d8ecb0d20"
|
||||
dependencies = [
|
||||
"reqwest 0.12.28",
|
||||
"serde",
|
||||
@@ -5223,8 +5238,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-select"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -5239,8 +5254,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-table"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-array",
|
||||
@@ -5279,8 +5294,8 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-testing"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-schema",
|
||||
@@ -5293,21 +5308,20 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lance-tokenizer"
|
||||
version = "9.0.0-beta.2"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.2#23211989de648fefc4454f5eee09ec176f0a465b"
|
||||
version = "8.0.0-beta.11"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v8.0.0-beta.11#739ef902201c90b3f8c6d005762d7fd161782bf2"
|
||||
dependencies = [
|
||||
"icu_segmenter",
|
||||
"jieba-rs",
|
||||
"lindera",
|
||||
"rust-stemmers",
|
||||
"serde",
|
||||
"stop-words",
|
||||
"unicode-normalization",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lancedb"
|
||||
version = "0.31.0-beta.1"
|
||||
version = "0.30.1-beta.2"
|
||||
dependencies = [
|
||||
"ahash",
|
||||
"anyhow",
|
||||
@@ -5369,7 +5383,7 @@ dependencies = [
|
||||
"polars",
|
||||
"polars-arrow",
|
||||
"rand 0.9.4",
|
||||
"random_word",
|
||||
"random_word 0.4.3",
|
||||
"regex",
|
||||
"reqwest 0.12.28",
|
||||
"rstest",
|
||||
@@ -5390,7 +5404,7 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "lancedb-nodejs"
|
||||
version = "0.31.0-beta.1"
|
||||
version = "0.30.1-beta.2"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -5399,7 +5413,6 @@ dependencies = [
|
||||
"async-trait",
|
||||
"aws-lc-rs",
|
||||
"aws-lc-sys",
|
||||
"chrono",
|
||||
"env_logger",
|
||||
"futures",
|
||||
"half",
|
||||
@@ -5410,17 +5423,15 @@ dependencies = [
|
||||
"napi",
|
||||
"napi-build",
|
||||
"napi-derive",
|
||||
"serde_json",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "lancedb-python"
|
||||
version = "0.34.0-beta.1"
|
||||
version = "0.33.1-beta.2"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"async-trait",
|
||||
"bytes",
|
||||
"chrono",
|
||||
"datafusion-common",
|
||||
"env_logger",
|
||||
"futures",
|
||||
@@ -5958,20 +5969,17 @@ dependencies = [
|
||||
|
||||
[[package]]
|
||||
name = "napi"
|
||||
version = "3.9.3"
|
||||
version = "3.9.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "fbd9f9295f3ff5921e78a71222c3361a8216f7760b1a99a6ad4e8441de18bbb9"
|
||||
checksum = "ad513ff22558f1830b595ea6eb4091da48145d09a222ce157e781896f78be0b9"
|
||||
dependencies = [
|
||||
"bitflags 2.11.1",
|
||||
"chrono",
|
||||
"ctor 1.0.5",
|
||||
"futures",
|
||||
"napi-build",
|
||||
"napi-sys",
|
||||
"nohash-hasher",
|
||||
"rustc-hash",
|
||||
"serde",
|
||||
"serde_json",
|
||||
"tokio",
|
||||
]
|
||||
|
||||
@@ -7501,7 +7509,6 @@ version = "0.28.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "91fd8e38a3b50ed1167fb981cd6fd60147e091784c427b8f7183a7ee32c31c12"
|
||||
dependencies = [
|
||||
"chrono",
|
||||
"libc",
|
||||
"once_cell",
|
||||
"portable-atomic",
|
||||
@@ -7812,13 +7819,26 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "07eed67a16dde2cc3c7f65c072acd8d5b2e53d4aab95067c320db851c7651f29"
|
||||
dependencies = [
|
||||
"ahash",
|
||||
"brotli",
|
||||
"brotli 3.5.0",
|
||||
"once_cell",
|
||||
"paste",
|
||||
"rand 0.8.6",
|
||||
"unicase",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "random_word"
|
||||
version = "0.5.2"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e47a395bdb55442b883c89062d6bcff25dc90fa5f8369af81e0ac6d49d78cf81"
|
||||
dependencies = [
|
||||
"ahash",
|
||||
"brotli 8.0.2",
|
||||
"paste",
|
||||
"rand 0.9.4",
|
||||
"unicase",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "rangemap"
|
||||
version = "1.7.1"
|
||||
@@ -9207,15 +9227,6 @@ version = "0.2.7"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "e51f1e89f093f99e7432c491c382b88a6860a5adbe6bf02574bf0a08efff1978"
|
||||
|
||||
[[package]]
|
||||
name = "stop-words"
|
||||
version = "0.10.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "d68df56303396bcfb639455b3c166804aeb7994005010aab5e9e8a1277b8871d"
|
||||
dependencies = [
|
||||
"serde_json",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "str_stack"
|
||||
version = "0.1.1"
|
||||
|
||||
28
Cargo.toml
@@ -13,20 +13,20 @@ categories = ["database-implementations"]
|
||||
rust-version = "1.91.0"
|
||||
|
||||
[workspace.dependencies]
|
||||
lance = { "version" = "=9.0.0-beta.2", default-features = false, "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-core = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-datagen = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-file = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-io = { "version" = "=9.0.0-beta.2", default-features = false, "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-index = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-linalg = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-namespace = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-namespace-impls = { "version" = "=9.0.0-beta.2", default-features = false, "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
|
||||
-lance-table = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
-lance-testing = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
-lance-datafusion = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
-lance-encoding = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
-lance-arrow = { "version" = "=9.0.0-beta.2", "tag" = "v9.0.0-beta.2", "git" = "https://github.com/lance-format/lance.git" }
+lance = { "version" = "=8.0.0-beta.11", default-features = false, "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-core = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-datagen = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-file = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-io = { "version" = "=8.0.0-beta.11", default-features = false, "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-index = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-linalg = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-namespace = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-namespace-impls = { "version" = "=8.0.0-beta.11", default-features = false, "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-table = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-testing = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-datafusion = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-encoding = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
+lance-arrow = { "version" = "=8.0.0-beta.11", "tag" = "v8.0.0-beta.11", "git" = "https://github.com/lance-format/lance.git" }
 ahash = "0.8"
 # Note that this one does not include pyarrow
 arrow = { version = "58.0.0", optional = false }
@@ -113,12 +113,6 @@ ignore = [
 # rand from a custom logger; upgrade once all pinned chains accept 0.8.6+.
 # https://rustsec.org/advisories/RUSTSEC-2026-0097
 { id = "RUSTSEC-2026-0097", reason = "transitive rand 0.8.5; LanceDB does not call ThreadRng from custom logging" },
-
-# pyo3 advisories in the Python bindings; tracked pending a patched pyo3 release.
-# https://rustsec.org/advisories/RUSTSEC-2026-0176
-# https://rustsec.org/advisories/RUSTSEC-2026-0177
-{ id = "RUSTSEC-2026-0176", reason = "pyo3 in Python bindings; awaiting patched pyo3 release" },
-{ id = "RUSTSEC-2026-0177", reason = "pyo3 in Python bindings; awaiting patched pyo3 release" },
 ]
 
 # ---------------------------------------------------------------------------
@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
 <dependency>
 <groupId>com.lancedb</groupId>
 <artifactId>lancedb-core</artifactId>
-<version>0.31.0-beta.1</version>
+<version>0.30.1-beta.2</version>
 </dependency>
 ```
 
@@ -295,23 +295,6 @@ await table.createIndex("my_float_col");
 
 ***
 
-### currentBranch()
-
-```ts
-abstract currentBranch(): null | string
-```
-
-The branch this table handle is scoped to, or `null` for the main branch.
-
-A handle returned by [Branches.create](Branches.md#create) or [Branches.checkout](Branches.md#checkout)
-reports the branch it targets; a handle opened normally reports `null`.
-
-#### Returns
-
-`null` \| `string`
-
-***
-
 ### delete()
 
 ```ts
@@ -23,31 +23,6 @@ be more columns to represent composite indices.
 
 ***
 
-### createdAt?
-
-```ts
-optional createdAt: Date;
-```
-
-When the index was created.
-
-`undefined` for remote tables or indices created before timestamps were tracked.
-
-***
-
-### indexDetails?
-
-```ts
-optional indexDetails: any;
-```
-
-Index-type-specific details parsed as a JavaScript object.
-
-Falls back to a raw string if JSON parsing fails. `undefined` for
-remote tables or when details are unavailable.
-
-***
-
 ### indexType
 
 ```ts
@@ -58,30 +33,6 @@ The type of the index
 
 ***
 
-### indexUuid?
-
-```ts
-optional indexUuid: string;
-```
-
-The UUID of the first segment of the index.
-
-`undefined` for remote tables, which do not yet surface this.
-
-***
-
-### indexVersion?
-
-```ts
-optional indexVersion: number;
-```
-
-The on-disk index format version.
-
-`undefined` for remote tables.
-
-***
-
 ### name
 
 ```ts
@@ -89,63 +40,3 @@ name: string;
 ```
 
 The name of the index
-
-***
-
-### numIndexedRows?
-
-```ts
-optional numIndexedRows: number;
-```
-
-The number of rows indexed, across all segments.
-
-`undefined` for remote tables.
-
-***
-
-### numSegments?
-
-```ts
-optional numSegments: number;
-```
-
-The number of segments that make up the index.
-
-`undefined` for remote tables.
-
-***
-
-### numUnindexedRows?
-
-```ts
-optional numUnindexedRows: number;
-```
-
-The number of rows not yet covered by this index.
-
-`undefined` for remote tables.
-
-***
-
-### sizeBytes?
-
-```ts
-optional sizeBytes: number;
-```
-
-The total size in bytes of all index files across all segments.
-
-`undefined` for remote tables or indices without size tracking.
-
-***
-
-### typeUrl?
-
-```ts
-optional typeUrl: string;
-```
-
-The protobuf type URL, a precise type identifier for the index.
-
-`undefined` for remote tables.
@@ -8,7 +8,7 @@
 <parent>
 <groupId>com.lancedb</groupId>
 <artifactId>lancedb-parent</artifactId>
-<version>0.31.0-beta.1</version>
+<version>0.30.1-beta.2</version>
 <relativePath>../pom.xml</relativePath>
 </parent>
 
@@ -6,7 +6,7 @@
 
 <groupId>com.lancedb</groupId>
 <artifactId>lancedb-parent</artifactId>
-<version>0.31.0-beta.1</version>
+<version>0.30.1-beta.2</version>
 <packaging>pom</packaging>
 <name>${project.artifactId}</name>
 <description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
 <properties>
 <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
 <arrow.version>15.0.0</arrow.version>
-<lance-core.version>9.0.0-beta.2</lance-core.version>
+<lance-core.version>8.0.0-beta.11</lance-core.version>
 <spotless.skip>false</spotless.skip>
 <spotless.version>2.30.0</spotless.version>
 <spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>
@@ -1,7 +1,7 @@
 [package]
 name = "lancedb-nodejs"
 edition.workspace = true
-version = "0.31.0-beta.1"
+version = "0.30.1-beta.2"
 publish = false
 license.workspace = true
 description.workspace = true
@@ -25,12 +25,8 @@ lancedb = { path = "../rust/lancedb", default-features = false }
 lance-namespace.workspace = true
 napi = { version = "3.8.3", default-features = false, features = [
 "napi9",
-"async",
-"chrono_date",
-"serde-json",
+"async"
 ] }
-chrono = { version = "0.4", default-features = false, features = ["clock"] }
-serde_json = "1"
 napi-derive = "3.5.2"
 # Prevent dynamic linking of lzma, which comes from datafusion
 lzma-sys = { version = "0.1", features = ["static"] }
@@ -191,36 +191,30 @@ describe("remote connection", () => {
 );
 });
 
-it("supports version time-travel and branches on remote", async () => {
+it("allows version on remote but rejects a non-main branch", async () => {
 await withMockDatabase(
-(req, res) => {
-const body = req.url?.includes("/branches/list")
-? JSON.stringify({
-branches: {
-exp: { parentVersion: 1, createAt: 1, manifestSize: 1 },
-},
-})
-: JSON.stringify({ name: "t", version: 2, schema: { fields: [] } });
+(_req, res) => {
+// describe (table open + version validation) always succeeds
+const body = JSON.stringify({
+name: "t",
+version: 2,
+schema: { fields: [] },
+});
 res.writeHead(200, { "Content-Type": "application/json" }).end(body);
 },
 async (db) => {
-// version-only (and "main" + version) time-travel the main chain
-const v2 = await db.openTable("t", undefined, { version: 2 });
-expect(v2.currentBranch()).toBeNull();
-const mainV2 = await db.openTable("t", undefined, {
-branch: "main",
-version: 2,
-});
-expect(mainV2.currentBranch()).toBeNull();
+// version-only (and "main" + version) is allowed: remote supports
+// version time-travel even though it has no branches
+await db.openTable("t", undefined, { version: 2 });
+await db.openTable("t", undefined, { branch: "main", version: 2 });
 
-// a non-main branch opens a handle scoped to that branch
-const exp = await db.openTable("t", undefined, { branch: "exp" });
-expect(exp.currentBranch()).toBe("exp");
-const expV2 = await db.openTable("t", undefined, {
-branch: "exp",
-version: 2,
-});
-expect(expV2.currentBranch()).toBe("exp");
+// a non-main branch is rejected, with or without a version
+await expect(
+db.openTable("t", undefined, { branch: "exp" }),
+).rejects.toThrow(/branching/);
+await expect(
+db.openTable("t", undefined, { branch: "exp", version: 2 }),
+).rejects.toThrow(/branching/);
 },
 );
 });
@@ -89,11 +89,8 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
 await table.add([{ id: 1 }]);
 expect(await table.countRows()).toBe(1);
 
-expect(table.currentBranch()).toBeNull();
-
 // fork an isolated, writable branch from main
 const branch = await (await table.branches()).create("exp");
-expect(branch.currentBranch()).toBe("exp");
 expect(await branch.countRows()).toBe(1);
 await branch.add([{ id: 2 }]);
 expect(await branch.countRows()).toBe(2);
@@ -112,7 +109,6 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
 
 // checkout returns a handle scoped to the branch's latest
 const checkedOut = await (await table.branches()).checkout("exp");
-expect(checkedOut.currentBranch()).toBe("exp");
 expect(await checkedOut.countRows()).toBe(2);
 
 // delete removes it
@@ -849,13 +845,11 @@ describe("When creating an index", () => {
 expect(fs.readdirSync(indexDir)).toHaveLength(1);
 const indices = await tbl.listIndices();
 expect(indices.length).toBe(1);
-expect(indices[0]).toEqual(
-expect.objectContaining({
-name: "vec_idx",
-indexType: "IvfPq",
-columns: ["vec"],
-}),
-);
+expect(indices[0]).toEqual({
+name: "vec_idx",
+indexType: "IvfPq",
+columns: ["vec"],
+});
 const stats = await tbl.indexStats("vec_idx");
 expect(stats).toBeDefined();
 
@@ -1017,51 +1011,51 @@ describe("When creating an index", () => {
 const indices = await nestedTable.listIndices();
 expect(indices).toEqual(
 expect.arrayContaining([
-expect.objectContaining({
+{
 name: "row_id_idx",
 indexType: "BTree",
 columns: ["rowId"],
-}),
-expect.objectContaining({
+},
+{
 name: "row_dash_id_idx",
 indexType: "BTree",
 columns: ["`row-id`"],
-}),
-expect.objectContaining({
+},
+{
 name: "top_user_id_idx",
 indexType: "BTree",
 columns: ["userId"],
-}),
-expect.objectContaining({
+},
+{
 name: "nested_user_id_idx",
 indexType: "BTree",
 columns: ["metadata.user_id"],
-}),
-expect.objectContaining({
+},
+{
 name: "mixed_case_metadata_user_id_idx",
 indexType: "BTree",
 columns: ["MetaData.userId"],
-}),
-expect.objectContaining({
+},
+{
 name: "escaped_names_idx",
 indexType: "BTree",
 columns: ["`meta-data`.`user-id`"],
-}),
-expect.objectContaining({
+},
+{
 name: "literal_dot_idx",
 indexType: "BTree",
 columns: ["literal.`a.b`"],
-}),
-expect.objectContaining({
+},
+{
 name: "image_embedding_idx",
 indexType: "IvfPq",
 columns: ["image.embedding"],
-}),
-expect.objectContaining({
+},
+{
 name: "payload_text_idx",
 indexType: "FTS",
 columns: ["payload.text"],
-}),
+},
 ]),
 );
 
@@ -1115,16 +1109,16 @@ describe("When creating an index", () => {
 const indicesAfterOptimize = await nestedTable.listIndices();
 expect(indicesAfterOptimize).toEqual(
 expect.arrayContaining([
-expect.objectContaining({
+{
 name: "mixed_case_metadata_user_id_idx",
 indexType: "BTree",
 columns: ["MetaData.userId"],
-}),
-expect.objectContaining({
+},
+{
 name: "image_embedding_idx",
 indexType: "IvfPq",
 columns: ["image.embedding"],
-}),
+},
 ]),
 );
 });
@@ -1260,13 +1254,11 @@ describe("When creating an index", () => {
 expect(fs.readdirSync(indexDir)).toHaveLength(1);
 const indices = await tbl.listIndices();
 expect(indices.length).toBe(1);
-expect(indices[0]).toEqual(
-expect.objectContaining({
-name: "vec_idx",
-indexType: "IvfHnswSq",
-columns: ["vec"],
-}),
-);
+expect(indices[0]).toEqual({
+name: "vec_idx",
+indexType: "IvfHnswSq",
+columns: ["vec"],
+});
 
 // Search without specifying the column
 let rst = await tbl
@@ -1612,35 +1604,6 @@ describe("When creating an index", () => {
 expect(rst64Query.toString()).toEqual(rst64Search.toString());
 expect(rst64Query.numRows).toBe(2);
 });
-
-it("should expose rich metadata fields on IndexConfig", async () => {
-await tbl.createIndex("id", { config: Index.btree() });
-await tbl.createIndex("vec");
-
-const indicesByName = Object.fromEntries(
-(await tbl.listIndices()).map((idx) => [idx.name, idx]),
-);
-
-const scalarIdx = indicesByName["id_idx"];
-expect(scalarIdx).toBeDefined();
-expect(typeof scalarIdx.indexUuid).toBe("string");
-expect(scalarIdx.numIndexedRows).toBe(300);
-expect(scalarIdx.numUnindexedRows).toBe(0);
-expect(scalarIdx.numSegments).toBeGreaterThanOrEqual(1);
-expect(scalarIdx.sizeBytes).toBeGreaterThan(0);
-// Use toString check to avoid cross-realm instanceof failures with native Date objects
-expect(Object.prototype.toString.call(scalarIdx.createdAt)).toBe(
-"[object Date]",
-);
-expect((scalarIdx.createdAt as Date).getTime()).toBeGreaterThan(0);
-expect(typeof scalarIdx.indexDetails).toBe("object");
-
-const vectorIdx = indicesByName["vec_idx"];
-expect(vectorIdx).toBeDefined();
-expect(typeof vectorIdx.indexUuid).toBe("string");
-expect(vectorIdx.numIndexedRows).toBe(300);
-expect(typeof vectorIdx.indexDetails).toBe("object");
-});
 });
 
 describe("When querying a table", () => {
@@ -663,14 +663,6 @@ export abstract class Table {
 */
 abstract branches(): Promise<Branches>;
 
-/**
-* The branch this table handle is scoped to, or `null` for the main branch.
-*
-* A handle returned by {@link Branches.create} or {@link Branches.checkout}
-* reports the branch it targets; a handle opened normally reports `null`.
-*/
-abstract currentBranch(): string | null;
-
 /**
 * Restore the table to the currently checked out version
 *
@@ -1130,10 +1122,6 @@ export class LocalTable extends Table {
 return new Branches(await this.inner.branches());
 }
 
-currentBranch(): string | null {
-return this.inner.currentBranch() ?? null;
-}
-
 async optimize(options?: Partial<OptimizeOptions>): Promise<OptimizeStats> {
 let cleanupOlderThanMs;
 if (
@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-darwin-arm64",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": ["darwin"],
 "cpu": ["arm64"],
 "main": "lancedb.darwin-arm64.node",

@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-linux-arm64-gnu",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": ["linux"],
 "cpu": ["arm64"],
 "main": "lancedb.linux-arm64-gnu.node",

@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-linux-arm64-musl",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": ["linux"],
 "cpu": ["arm64"],
 "main": "lancedb.linux-arm64-musl.node",

@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-linux-x64-gnu",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": ["linux"],
 "cpu": ["x64"],
 "main": "lancedb.linux-x64-gnu.node",

@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-linux-x64-musl",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": ["linux"],
 "cpu": ["x64"],
 "main": "lancedb.linux-x64-musl.node",

@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-win32-arm64-msvc",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": [
 "win32"
 ],

@@ -1,6 +1,6 @@
 {
 "name": "@lancedb/lancedb-win32-x64-msvc",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "os": ["win32"],
 "cpu": ["x64"],
 "main": "lancedb.win32-x64-msvc.node",

4
nodejs/package-lock.json
generated
@@ -1,12 +1,12 @@
 {
 "name": "@lancedb/lancedb",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "lockfileVersion": 3,
 "requires": true,
 "packages": {
 "": {
 "name": "@lancedb/lancedb",
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "cpu": [
 "x64",
 "arm64"

@@ -11,7 +11,7 @@
 "ann"
 ],
 "private": false,
-"version": "0.31.0-beta.1",
+"version": "0.30.1-beta.2",
 "main": "dist/index.js",
 "exports": {
 ".": "./dist/index.js",

@@ -3,8 +3,6 @@
 
 use std::collections::HashMap;
 
-use chrono::{DateTime, Utc};
-
 use lancedb::ipc::{ipc_file_to_batches, ipc_file_to_schema};
 use lancedb::table::{
 AddDataMode, ColumnAlteration as LanceColumnAlteration, Duration,
@@ -487,12 +485,6 @@ impl Table {
 })
 }
 
-/// The branch this handle is scoped to, or `null` for the main branch.
-#[napi]
-pub fn current_branch(&self) -> napi::Result<Option<String>> {
-Ok(self.inner_ref()?.current_branch())
-}
-
 #[napi(catch_unwind)]
 pub async fn optimize(
 &self,
@@ -610,43 +602,6 @@ pub struct IndexConfig {
 /// Currently this is always an array of size 1. In the future there may
 /// be more columns to represent composite indices.
 pub columns: Vec<String>,
-/// The UUID of the first segment of the index.
-///
-/// `undefined` for remote tables, which do not yet surface this.
-pub index_uuid: Option<String>,
-/// The protobuf type URL, a precise type identifier for the index.
-///
-/// `undefined` for remote tables.
-pub type_url: Option<String>,
-/// When the index was created.
-///
-/// `undefined` for remote tables or indices created before timestamps were tracked.
-pub created_at: Option<DateTime<Utc>>,
-/// The number of rows indexed, across all segments.
-///
-/// `undefined` for remote tables.
-pub num_indexed_rows: Option<i64>,
-/// The number of rows not yet covered by this index.
-///
-/// `undefined` for remote tables.
-pub num_unindexed_rows: Option<i64>,
-/// The total size in bytes of all index files across all segments.
-///
-/// `undefined` for remote tables or indices without size tracking.
-pub size_bytes: Option<i64>,
-/// The number of segments that make up the index.
-///
-/// `undefined` for remote tables.
-pub num_segments: Option<i32>,
-/// The on-disk index format version.
-///
-/// `undefined` for remote tables.
-pub index_version: Option<i32>,
-/// Index-type-specific details parsed as a JavaScript object.
-///
-/// Falls back to a raw string if JSON parsing fails. `undefined` for
-/// remote tables or when details are unavailable.
-pub index_details: Option<serde_json::Value>,
 }
 
 impl From<lancedb::index::IndexConfig> for IndexConfig {
@@ -656,17 +611,6 @@ impl From<lancedb::index::IndexConfig> for IndexConfig {
 index_type,
 columns: value.columns,
 name: value.name,
-index_uuid: value.index_uuid,
-type_url: value.type_url,
-created_at: value.created_at,
-num_indexed_rows: value.num_indexed_rows.map(|n| n as i64),
-num_unindexed_rows: value.num_unindexed_rows.map(|n| n as i64),
-size_bytes: value.size_bytes.map(|n| n as i64),
-num_segments: value.num_segments.map(|n| n as i32),
-index_version: value.index_version,
-index_details: value
-.index_details
-.and_then(|s| serde_json::from_str(&s).ok()),
 }
 }
 }
@@ -1,5 +1,5 @@
 [tool.bumpversion]
-current_version = "0.34.0-beta.2"
+current_version = "0.33.1-beta.2"
 parse = """(?x)
 (?P<major>0|[1-9]\\d*)\\.
 (?P<minor>0|[1-9]\\d*)\\.
@@ -23,8 +23,6 @@ allow_dirty = true
 commit = true
 message = "Bump version: {current_version} → {new_version}"
 commit_args = ""
-# bump-my-version >=1.4.0 rejects pre_commit_hooks containing shell syntax unless opted in.
-allow_shell_hooks = true
 
 # Update Cargo.lock after version bump
 pre_commit_hooks = [

@@ -1,6 +1,6 @@
 [package]
 name = "lancedb-python"
-version = "0.34.0-beta.2"
+version = "0.33.1-beta.2"
 publish = false
 edition.workspace = true
 description = "Python bindings for LanceDB"
@@ -26,8 +26,7 @@ lance-namespace-impls.workspace = true
 lance-io.workspace = true
 env_logger.workspace = true
 log.workspace = true
-pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39", "chrono"] }
-chrono = { version = "0.4", default-features = false, features = ["clock"] }
+pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39"] }
 pyo3-async-runtimes = { version = "0.28", features = [
 "attributes",
 "tokio-runtime",
@@ -1,4 +1,4 @@
-from datetime import datetime, timedelta
+from datetime import timedelta
 from typing import Dict, List, Optional, Tuple, Any, TypedDict, Union, Literal
 
 import pyarrow as pa
@@ -205,7 +205,7 @@ class Table:
 async def prewarm_index(self, index_name: str) -> None: ...
 async def prewarm_data(self, columns: Optional[List[str]] = None) -> None: ...
 async def list_indices(self) -> list[IndexConfig]: ...
-async def delete(self, filter: Union[str, PyExpr]) -> DeleteResult: ...
+async def delete(self, filter: str) -> DeleteResult: ...
 async def add_columns(self, columns: list[tuple[str, str]]) -> AddColumnsResult: ...
 async def add_columns_with_schema(self, schema: pa.Schema) -> AddColumnsResult: ...
 async def alter_columns(
@@ -259,15 +259,6 @@ class IndexConfig:
 name: str
 index_type: str
 columns: List[str]
-index_uuid: Optional[str]
-type_url: Optional[str]
-created_at: Optional[datetime]
-num_indexed_rows: Optional[int]
-num_unindexed_rows: Optional[int]
-size_bytes: Optional[int]
-num_segments: Optional[int]
-index_version: Optional[int]
-index_details: Optional[Any]
 
 async def connect(
 uri: str,
@@ -5,9 +5,7 @@
 from __future__ import annotations
 
 from datetime import timedelta
-from typing import TYPE_CHECKING, List, Optional, Union
-
-from .expr import Expr
+from typing import TYPE_CHECKING, List, Optional
 
 if TYPE_CHECKING:
 from .common import DATA
@@ -34,7 +32,6 @@ class LanceMergeInsertBuilder(object):
 self._when_not_matched_insert_all = False
 self._when_not_matched_by_source_delete = False
 self._when_not_matched_by_source_condition = None
-self._when_not_matched_by_source_condition_expr = None
 self._timeout = None
 self._use_index = True
 self._use_lsm_write = None
@@ -65,7 +62,7 @@ class LanceMergeInsertBuilder(object):
 return self
 
 def when_not_matched_by_source_delete(
-self, condition: Union[str, Expr, None] = None
+self, condition: Optional[str] = None
 ) -> LanceMergeInsertBuilder:
 """
 Rows that exist only in the target table (old data) will be
@@ -74,16 +71,13 @@ class LanceMergeInsertBuilder(object):
 
 Parameters
 ----------
-condition: str or :class:`~lancedb.expr.Expr` or None, default None
+condition: Optional[str], default None
 If None then all such rows will be deleted. Otherwise the
-condition will be used as a filter to limit what rows are deleted.
-Can be a SQL string or a type-safe :class:`~lancedb.expr.Expr`
-built with :func:`~lancedb.expr.col` and :func:`~lancedb.expr.lit`.
+condition will be used as an SQL filter to limit what rows
+are deleted.
 """
 self._when_not_matched_by_source_delete = True
-if isinstance(condition, Expr):
-self._when_not_matched_by_source_condition_expr = condition._inner
-elif condition is not None:
+if condition is not None:
 self._when_not_matched_by_source_condition = condition
 return self
 
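Reviewer note on the `when_not_matched_by_source_delete` hunk above: one side of the diff accepts either a SQL string or an `Expr` and stores them in separate fields, while the other accepts only a string. The two dispatch shapes can be compared in a minimal, self-contained sketch. `FakeExpr` and `DeleteConditionHolder` are hypothetical stand-ins invented for illustration; they are not the `lancedb.expr.Expr` or `LanceMergeInsertBuilder` classes.

```python
from typing import Any, Optional, Union


class FakeExpr:
    """Hypothetical stand-in for an expression object wrapping an inner value."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner


class DeleteConditionHolder:
    """Hypothetical holder mirroring the two condition fields seen in the hunk."""

    def __init__(self) -> None:
        self.delete_enabled = False
        self.condition: Optional[str] = None   # SQL string form
        self.condition_expr: Any = None        # expression form

    def when_not_matched_by_source_delete(
        self, condition: Union[str, FakeExpr, None] = None
    ) -> "DeleteConditionHolder":
        self.delete_enabled = True
        if isinstance(condition, FakeExpr):
            # Expression objects are unwrapped and kept apart from SQL text.
            self.condition_expr = condition._inner
        elif condition is not None:
            self.condition = condition
        # condition=None leaves both fields unset: delete every unmatched row.
        return self
```

Keeping the two forms in separate fields lets a later step decide how to serialise each one, at the cost of one extra attribute on the builder.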
@@ -58,7 +58,6 @@ from lance_namespace import (
 ListTablesRequest,
 DescribeNamespaceRequest,
 DropTableRequest,
-RenameTableRequest,
 ListNamespacesRequest,
 CreateNamespaceRequest,
 DropNamespaceRequest,
@@ -71,9 +70,6 @@ from lancedb.embeddings import EmbeddingFunctionConfig
 from ._lancedb import Session
 
 
-_MAX_QUERY_K = 2**31 - 1
-
-
 def _query_to_namespace_request(
 table_id: List[str],
 query: "Query",
@@ -151,8 +147,7 @@ def _query_to_namespace_request(
 if query.limit is not None:
 k = query.limit
 elif query.vector is None and query.full_text_query is None:
-# limit k to max i32 value to avoid client overflows
-k = _MAX_QUERY_K
+k = sys.maxsize
 else:
 k = 10
 
@@ -609,14 +604,9 @@ class LanceNamespaceDBConnection(DBConnection):
 cur_namespace_path = []
 if new_namespace_path is None:
 new_namespace_path = []
-cur_table_id = cur_namespace_path + [cur_name]
-new_namespace_id = new_namespace_path if new_namespace_path else None
-request = RenameTableRequest(
-id=cur_table_id,
-new_table_name=new_name,
-new_namespace_id=new_namespace_id,
+raise NotImplementedError(
+"rename_table is not supported for namespace connections"
 )
-self._namespace_client.rename_table(request)
 
 @override
 def drop_database(self):
@@ -1046,19 +1036,14 @@ class AsyncLanceNamespaceDBConnection:
 cur_namespace_path: Optional[List[str]] = None,
 new_namespace_path: Optional[List[str]] = None,
 ):
-"""Rename a table in the namespace."""
+"""Rename is not supported for namespace connections."""
 if cur_namespace_path is None:
 cur_namespace_path = []
 if new_namespace_path is None:
 new_namespace_path = []
-cur_table_id = cur_namespace_path + [cur_name]
-new_namespace_id = new_namespace_path if new_namespace_path else None
-request = RenameTableRequest(
-id=cur_table_id,
-new_table_name=new_name,
-new_namespace_id=new_namespace_id,
+raise NotImplementedError(
+"rename_table is not supported for namespace connections"
 )
-self._namespace_client.rename_table(request)
 
 async def drop_database(self):
 """Deprecated method."""
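Reviewer note on the `_query_to_namespace_request` hunk above: the two sides differ only in the fallback `k` used for a plain scan (no vector, no full-text query). One side clamps to the largest signed 32-bit value, with a comment citing client overflows; the other uses `sys.maxsize`. The sketch below reproduces that three-way selection in isolation. `pick_k` and its `clamp_to_i32` flag are names invented here for comparison, not part of the library.

```python
import sys
from typing import Optional

# Largest value representable in a signed 32-bit integer (2**31 - 1).
MAX_I32 = 2**31 - 1


def pick_k(
    limit: Optional[int],
    has_vector: bool,
    has_full_text: bool,
    clamp_to_i32: bool,
) -> int:
    """Choose the `k` sent with a query, following the branches in the hunk."""
    if limit is not None:
        # An explicit limit always wins.
        return limit
    if not has_vector and not has_full_text:
        # Plain scan: ask for "everything". The clamped variant stays
        # within i32 range; the unclamped one uses the platform maximum.
        return MAX_I32 if clamp_to_i32 else sys.maxsize
    # Vector or full-text search without a limit defaults to 10 results.
    return 10
```

On a typical 64-bit interpreter `sys.maxsize` is far larger than `MAX_I32`, which is why a receiver that stores `k` in a 32-bit field could overflow without the clamp.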
@@ -275,18 +275,7 @@ def _py_type_to_arrow_type(py_type: Type[Any], field: FieldInfo) -> pa.DataType:
 tz = get_extras(field, "tz")
 return pa.timestamp("us", tz=tz)
 elif getattr(py_type, "__origin__", None) in (list, tuple):
-# A bare, unparameterised ``typing.List`` / ``typing.Tuple`` matches this
-# branch (its ``__origin__`` is ``list`` / ``tuple``) but has no
-# ``__args__``, so we cannot infer the element type. Raise a clear
-# ``TypeError`` instead of crashing with an opaque ``AttributeError``.
-args = getattr(py_type, "__args__", None)
-if not args:
-raise TypeError(
-"Converting Pydantic type to Arrow Type: unsupported type "
-f"{py_type}. Specify the element type, e.g. List[int] instead "
-"of a bare List."
-)
-child = args[0]
+child = py_type.__args__[0]
 return _pydantic_list_child_to_arrow(child, field)
 raise TypeError(
 f"Converting Pydantic type to Arrow Type: unsupported type {py_type}."
@@ -396,13 +396,13 @@ class RemoteDBConnection(DBConnection):
 The namespace to open the table from.
 None or empty list represents root namespace.
 branch: str, optional
-If provided, open a handle scoped to this branch instead of the
-default branch. Reads and writes operate in the branch's context.
+Branching is not yet supported on remote tables, so only the
+default branch is accepted (``None`` or ``"main"``); any other
+value raises ``NotImplementedError``.
 version: int, optional
 If provided, open the table pinned to this version, producing a
-read-only handle. Composes with ``branch``: when both are given,
-opens that branch at the version; otherwise opens ``main`` at the
-version. Call ``checkout_latest`` to return to a writable state.
+read-only handle. Call ``checkout_latest`` to return to a writable
+state.
 
 Returns
 -------
@@ -410,6 +410,11 @@ class RemoteDBConnection(DBConnection):
 """
 from .table import RemoteTable
 
+# Remote supports version time-travel but not branches: reject a non-main
+# branch, but allow a version-only open (or "main").
+if branch is not None and branch != "main":
+raise NotImplementedError("branching is not yet supported on remote tables")
+
 if namespace_path is None:
 namespace_path = []
 if index_cache_size is not None:
@@ -425,9 +430,7 @@ class RemoteDBConnection(DBConnection):
 connection_state=self.serialize,
 namespace_path=namespace_path,
 )
-if branch is not None:
-tbl = tbl.branches.checkout(branch, version)
-elif version is not None:
+if version is not None:
 tbl.checkout(version)
 return tbl
 
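Reviewer note on the `RemoteDBConnection.open_table` hunks above: the guard added on one side accepts `None` or `"main"` as the branch, allows an optional version with either, and rejects every other branch before any request is made. The sketch below isolates that decision. `resolve_remote_open` is a hypothetical helper and its return strings are illustrative only.

```python
from typing import Optional


def resolve_remote_open(branch: Optional[str], version: Optional[int]) -> str:
    """Describe what a remote open would target, or raise for unsupported input."""
    # Only the default branch is accepted; the check runs before anything else
    # so an unsupported branch fails fast, with or without a version.
    if branch is not None and branch != "main":
        raise NotImplementedError("branching is not yet supported on remote tables")
    if version is not None:
        # Version-only opens (or "main" + version) pin a read-only snapshot.
        return f"main@v{version}"
    return "main@latest"
```

This mirrors the expectations in the `remote.test.ts` hunk earlier in the diff, where `{ version: 2 }` and `{ branch: "main", version: 2 }` succeed and `{ branch: "exp" }` is rejected with a message matching `/branching/`.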
@@ -56,7 +56,7 @@ from lancedb.embeddings import EmbeddingFunctionRegistry
|
||||
from lancedb.table import _normalize_progress
|
||||
|
||||
from ..query import LanceVectorQueryBuilder, LanceQueryBuilder, LanceTakeQueryBuilder
|
||||
from ..table import AsyncTable, BlobMode, Branches, IndexStatistics, Query, Table, Tags
|
||||
from ..table import AsyncTable, BlobMode, IndexStatistics, Query, Table, Tags
|
||||
from ..types import BaseTokenizerType
|
||||
|
||||
|
||||
@@ -75,9 +75,6 @@ class RemoteTable(Table):
|
||||
self._connection_state = connection_state
|
||||
self._namespace_path = list(namespace_path or [])
|
||||
self._checkout_version: Optional[int] = None
|
||||
# The branch this handle is scoped to (None == main). Persisted so a
|
||||
# fork/pickle reopen restores the branch instead of reverting to main.
|
||||
self._branch: Optional[str] = None
|
||||
self._pid = os.getpid()
|
||||
|
||||
def _serialized_connection_state(self) -> str:
|
||||
@@ -112,14 +109,9 @@ class RemoteTable(Table):
|
||||
from lancedb import deserialize_conn
|
||||
|
||||
db = deserialize_conn(self._serialized_connection_state(), for_worker=True)
|
||||
# Reopen on the same branch and pinned version (branch=None / version=None
|
||||
# reproduce the plain main-latest open).
|
||||
table = db.open_table(
|
||||
self._name,
|
||||
namespace_path=self._namespace_path,
|
||||
branch=self._branch,
|
||||
version=self._checkout_version,
|
||||
)
|
||||
table = db.open_table(self._name, namespace_path=self._namespace_path)
|
||||
if self._checkout_version is not None:
|
||||
table.checkout(self._checkout_version)
|
||||
|
||||
self._table_handle = table._table
|
||||
self.db_name = table.db_name
|
||||
@@ -132,7 +124,6 @@ class RemoteTable(Table):
|
||||
"name": self.name,
|
||||
"namespace_path": self._namespace_path,
|
||||
"checkout_version": self._checkout_version,
|
||||
"branch": self._branch,
|
||||
}
|
||||
|
||||
def __setstate__(self, state: dict) -> None:
|
||||
@@ -142,7 +133,6 @@ class RemoteTable(Table):
|
||||
self._connection_state = state["connection_state"]
|
||||
self._namespace_path = state["namespace_path"]
|
||||
self._checkout_version = state["checkout_version"]
|
||||
self._branch = state.get("branch")
|
||||
self._pid = None
|
||||
|
||||
@property
|
||||
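Review note: the `__getstate__` / `__setstate__` hunks above differ only in whether a `"branch"` key is persisted, and the `__setstate__` side reads it with `state.get("branch")` rather than `state["branch"]`. The sketch below illustrates why that matters for state dictionaries written before the key existed. `Handle` and its fields are hypothetical stand-ins, not the real `RemoteTable`.

```python
from typing import Optional


class Handle:
    """Hypothetical stand-in for a table handle that survives pickling."""

    def __init__(self, name: str, checkout_version: Optional[int] = None,
                 branch: Optional[str] = None) -> None:
        self.name = name
        self.checkout_version = checkout_version
        self.branch = branch

    def __getstate__(self) -> dict:
        # Persist only what is needed to reopen the handle later.
        return {
            "name": self.name,
            "checkout_version": self.checkout_version,
            "branch": self.branch,
        }

    def __setstate__(self, state: dict) -> None:
        self.name = state["name"]
        self.checkout_version = state["checkout_version"]
        # .get() tolerates state written before "branch" was added;
        # a missing key falls back to None (interpreted here as "main").
        self.branch = state.get("branch")
```

With `state["branch"]` instead of `state.get("branch")`, restoring an older state dictionary that lacks the key would raise `KeyError`.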
@@ -170,34 +160,6 @@ class RemoteTable(Table):
|
||||
def tags(self) -> Tags:
|
||||
return Tags(self._table)
|
||||
|
||||
@property
|
||||
def branches(self) -> Branches:
|
||||
"""Branch management for the table.
|
||||
|
||||
``create``/``checkout`` return a new table handle scoped to the branch;
|
||||
writes on it do not affect ``main``.
|
||||
"""
|
||||
return Branches(self)
|
||||
|
||||
def current_branch(self) -> Optional[str]:
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
return self._table.current_branch()
|
||||
|
||||
def _wrap_branch_handle(
|
||||
self, async_table: AsyncTable, version: Optional[int] = None
|
||||
) -> "RemoteTable":
|
||||
# A branch handle stays a RemoteTable with the same connection context.
|
||||
# Record the branch and version pin so a fork/pickle reopen restores both.
|
||||
handle = RemoteTable(
|
||||
async_table,
|
||||
self.db_name,
|
||||
connection_state=self._connection_state,
|
||||
namespace_path=self._namespace_path,
|
||||
)
|
||||
handle._branch = async_table.current_branch()
|
||||
handle._checkout_version = version
|
||||
return handle
|
||||
|
||||
@cached_property
|
||||
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
|
||||
"""
|
||||
|
||||
@@ -86,10 +86,7 @@ def _from_list(data: list) -> Scannable:
|
||||
|
||||
@to_scannable.register(dict)
|
||||
def _from_dict(data: dict) -> Scannable:
|
||||
raise ValueError(
|
||||
"Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
)
|
||||
raise ValueError("Cannot add a single dictionary to a table. Use a list.")
|
||||
|
||||
|
||||
@to_scannable.register(LanceModel)
|
||||
|
||||
@@ -61,7 +61,6 @@ from .index import (
|
||||
HnswFlat,
|
||||
FTS,
|
||||
)
|
||||
from .expr import Expr
|
||||
from .merge import LanceMergeInsertBuilder
|
||||
from .pydantic import LanceModel, model_to_dict
|
||||
from .query import (
|
||||
@@ -243,10 +242,7 @@ def _into_pyarrow_reader(
|
||||
raise ValueError("Cannot add a single LanceModel to a table. Use a list.")
|
||||
|
||||
if isinstance(data, dict):
|
||||
raise ValueError(
|
||||
"Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
)
|
||||
raise ValueError("Cannot add a single dictionary to a table. Use a list.")
|
||||
|
||||
if isinstance(data, list):
|
||||
# Handle empty list case
|
||||
@@ -799,10 +795,6 @@ class Table(ABC):
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def current_branch(self) -> Optional[str]:
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
raise NotImplementedError
|
||||
|
||||
def __len__(self) -> int:
|
||||
"""The number of rows in this Table"""
|
||||
return self.count_rows(None)
|
||||
@@ -1541,7 +1533,7 @@ class Table(ABC):
|
||||
) -> MergeResult: ...
|
||||
|
||||
@abstractmethod
|
||||
def delete(self, where: Union[str, Expr]) -> DeleteResult:
|
||||
def delete(self, where: str) -> DeleteResult:
|
||||
"""Delete rows from the table.
|
||||
|
||||
This can be used to delete a single row, many rows, all rows, or
|
||||
@@ -1549,10 +1541,10 @@ class Table(ABC):
|
||||
|
||||
Parameters
|
||||
----------
|
||||
where: str or :class:`~lancedb.expr.Expr`
|
||||
The filter condition. Can be a SQL string or a type-safe
|
||||
:class:`~lancedb.expr.Expr` built with :func:`~lancedb.expr.col`
|
||||
and :func:`~lancedb.expr.lit`.
|
||||
where: str
|
||||
The SQL where clause to use when deleting rows.
|
||||
|
||||
- For example, 'x = 2' or 'x IN (1, 2, 3)'.
|
||||
|
||||
The filter must not be empty, or it will error.
|
||||
|
||||
@@ -2230,21 +2222,6 @@ class LanceTable(Table):
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
return self._table.current_branch()
|
||||
|
||||
def _wrap_branch_handle(
|
||||
self, async_table: "AsyncTable", version: Optional[int] = None
|
||||
) -> "LanceTable":
|
||||
# version is unused locally: the pin already lives on async_table and a
|
||||
# local handle is not reopened via a serialized connection.
|
||||
return LanceTable(
|
||||
self._conn,
|
||||
async_table.name,
|
||||
namespace_path=self._namespace_path,
|
||||
namespace_client=self._namespace_client,
|
||||
pushdown_operations=self._pushdown_operations,
|
||||
location=self._location,
|
||||
_async=async_table,
|
||||
)
|
||||
|
||||
def checkout(self, version: Union[int, str]):
|
||||
"""Checkout a version of the table. This is an in-place operation.
|
||||
|
||||
@@ -3446,9 +3423,8 @@ class LanceTable(Table):
|
||||
)
|
||||
return self
|
||||
|
||||
def delete(self, where: Union[str, Expr]) -> DeleteResult:
|
||||
predicate = where._inner if isinstance(where, Expr) else where
|
||||
return LOOP.run(self._table.delete(predicate))
|
||||
def delete(self, where: str) -> DeleteResult:
|
||||
return LOOP.run(self._table.delete(where))
|
||||
|
||||
def update(
|
||||
self,
|
||||
@@ -5238,7 +5214,6 @@ class AsyncTable:
|
||||
when_not_matched_insert_all=merge._when_not_matched_insert_all,
|
||||
when_not_matched_by_source_delete=merge._when_not_matched_by_source_delete,
|
||||
when_not_matched_by_source_condition=merge._when_not_matched_by_source_condition,
|
||||
when_not_matched_by_source_condition_expr=merge._when_not_matched_by_source_condition_expr,
|
||||
timeout=merge._timeout,
|
||||
use_index=merge._use_index,
|
||||
use_lsm_write=merge._use_lsm_write,
|
||||
@@ -5246,7 +5221,7 @@ class AsyncTable:
|
||||
),
|
||||
)
|
||||
|
||||
async def delete(self, where: Union[str, Expr]) -> DeleteResult:
|
||||
async def delete(self, where: str) -> DeleteResult:
|
||||
"""Delete rows from the table.
|
||||
|
||||
This can be used to delete a single row, many rows, all rows, or
|
||||
@@ -5254,10 +5229,10 @@ class AsyncTable:
|
||||
|
||||
Parameters
|
||||
----------
|
||||
where: str or :class:`~lancedb.expr.Expr`
|
||||
The filter condition. Can be a SQL string or a type-safe
|
||||
:class:`~lancedb.expr.Expr` built with :func:`~lancedb.expr.col`
|
||||
and :func:`~lancedb.expr.lit`.
|
||||
where: str
|
||||
The SQL where clause to use when deleting rows.
|
||||
|
||||
- For example, 'x = 2' or 'x IN (1, 2, 3)'.
|
||||
|
||||
The filter must not be empty, or it will error.
|
||||
|
||||
@@ -5296,8 +5271,7 @@ class AsyncTable:
|
||||
x vector
|
||||
0 3 [5.0, 6.0]
|
||||
"""
|
||||
predicate = where._inner if isinstance(where, Expr) else where
|
||||
return await self._inner.delete(predicate)
|
||||
return await self._inner.delete(where)
|
||||
|
||||
async def update(
|
||||
self,
|
||||
@@ -5956,7 +5930,7 @@ class Branches:
|
||||
name: str,
|
||||
from_ref: Optional[str] = None,
|
||||
from_version: Optional[int] = None,
|
||||
) -> "Table":
|
||||
) -> "LanceTable":
|
||||
"""Create a branch and return a handle scoped to it.
|
||||
|
||||
Parameters
|
||||
@@ -5973,7 +5947,7 @@ class Branches:
|
||||
)
|
||||
return self._wrap(async_table)
|
||||
|
||||
def checkout(self, name: str, version: Optional[int] = None) -> "Table":
|
||||
def checkout(self, name: str, version: Optional[int] = None) -> "LanceTable":
|
||||
"""Check out an existing branch and return a handle scoped to it.
|
||||
|
||||
Parameters
|
||||
@@ -5986,19 +5960,25 @@ class Branches:
|
||||
the branch's latest and stays writable.
|
||||
"""
|
||||
async_table = LOOP.run(self._table.branches.checkout(name, version))
|
||||
return self._wrap(async_table, version)
|
||||
return self._wrap(async_table)
|
||||
|
||||
def delete(self, name: str) -> None:
|
||||
"""Delete a branch."""
|
||||
LOOP.run(self._table.branches.delete(name))
|
||||
|
||||
def _wrap(
|
||||
self, async_table: "AsyncTable", version: Optional[int] = None
|
||||
) -> "Table":
|
||||
# Delegate to the parent so the branch handle keeps its concrete type
|
||||
# (LanceTable / RemoteTable) and connection context; `version` is the
|
||||
# explicit pin so a remote handle can restore branch+version on reopen.
|
||||
return self._parent._wrap_branch_handle(async_table, version)
|
||||
def _wrap(self, async_table: "AsyncTable") -> "LanceTable":
|
||||
# Reuse the parent's connection + namespace context; from_inner would drop
|
||||
# it and break identity/query routing for namespace-backed tables.
|
||||
parent = self._parent
|
||||
return LanceTable(
|
||||
parent._conn,
|
||||
async_table.name,
|
||||
namespace_path=parent._namespace_path,
|
||||
namespace_client=parent._namespace_client,
|
||||
pushdown_operations=parent._pushdown_operations,
|
||||
location=parent._location,
|
||||
_async=async_table,
|
||||
)
|
||||
|
||||
|
||||
class AsyncTags:
|
||||
|
||||
@@ -373,15 +373,9 @@ def _(value: list):
|
||||
@value_to_sql.register(dict)
|
||||
def _(value: dict):
|
||||
# https://datafusion.apache.org/user-guide/sql/scalar_functions.html#named-struct
|
||||
# Render the field name through value_to_sql(str(...)) as well so that keys
|
||||
# containing characters meaningful in SQL (e.g. a single quote) are escaped
|
||||
# the same way string values are. A bare f"'{k}'" would emit invalid SQL for
|
||||
# a key like "it's".
|
||||
return (
|
||||
"named_struct("
|
||||
+ ", ".join(
|
||||
f"{value_to_sql(str(k))}, {value_to_sql(v)}" for k, v in value.items()
|
||||
)
|
||||
+ ", ".join(f"'{k}', {value_to_sql(v)}" for k, v in value.items())
|
||||
+ ")"
|
||||
)
|
||||
|
||||
|
||||
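Review note: the two renderings of the `dict` handler in the hunk above differ in how field names are quoted. One passes the key through `value_to_sql(str(k))`, the other interpolates it as `f"'{k}'"`. The sketch below is a minimal, self-contained illustration of the difference. It assumes string values are escaped by doubling single quotes, since the string handler is not shown in this diff. `to_sql`, `struct_escaped`, and `struct_bare` are hypothetical names.

```python
from functools import singledispatch


@singledispatch
def to_sql(value) -> str:
    raise NotImplementedError(f"unsupported type: {type(value)}")


@to_sql.register(str)
def _(value: str) -> str:
    # Assumption: SQL string literals escape ' by doubling it.
    return "'" + value.replace("'", "''") + "'"


@to_sql.register(int)
def _(value: int) -> str:
    return str(value)


def struct_escaped(value: dict) -> str:
    # Keys go through the same escaping as string values.
    fields = ", ".join(f"{to_sql(str(k))}, {to_sql(v)}" for k, v in value.items())
    return f"named_struct({fields})"


def struct_bare(value: dict) -> str:
    # Keys are wrapped in quotes without escaping.
    fields = ", ".join(f"'{k}', {to_sql(v)}" for k, v in value.items())
    return f"named_struct({fields})"
```

For ordinary keys the two functions agree. For a key such as `it's`, only `struct_escaped` keeps the quote inside the literal. `struct_bare` emits an unbalanced quote.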
@@ -91,9 +91,7 @@ async def test_create_scalar_index(some_table: AsyncTable):
|
||||
# Can recreate if replace=True
|
||||
await some_table.create_index("id", replace=True)
|
||||
indices = await some_table.list_indices()
|
||||
assert str(indices).startswith(
|
||||
'[IndexConfig(name="id_idx", index_type="BTree", columns=["id"]'
|
||||
)
|
||||
assert str(indices) == '[Index(BTree, columns=["id"], name="id_idx")]'
|
||||
assert len(indices) == 1
|
||||
assert indices[0].index_type == "BTree"
|
||||
assert indices[0].columns == ["id"]
|
||||
@@ -108,27 +106,6 @@ async def test_create_scalar_index(some_table: AsyncTable):
|
||||
assert len(indices) == 0
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_index_config_repr(db_async):
|
||||
# Use >= 1000 rows so the thousands separator in the repr is exercised.
|
||||
nrows = 1500
|
||||
table = await db_async.create_table(
|
||||
"repr_table", pa.Table.from_pydict({"id": list(range(nrows))})
|
||||
)
|
||||
await table.create_index("id", config=BTree())
|
||||
indices = await table.list_indices()
|
||||
assert len(indices) == 1
|
||||
|
||||
r = repr(indices[0])
|
||||
assert r.startswith('IndexConfig(name="id_idx", index_type="BTree", columns=["id"]')
|
||||
# Integer counts use `_` thousands separators (valid Python int syntax).
|
||||
assert "num_indexed_rows=1_500" in r
|
||||
assert "num_unindexed_rows=0" in r
|
||||
# created_at renders as a datetime so the value round-trips.
|
||||
assert "created_at=datetime.datetime(" in r
|
||||
assert r.endswith(")")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
|
||||
metadata_type = pa.struct(
|
||||
@@ -221,9 +198,7 @@ async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
|
||||
async def test_create_fixed_size_binary_index(some_table: AsyncTable):
|
||||
await some_table.create_index("fsb", config=BTree())
|
||||
indices = await some_table.list_indices()
|
||||
assert str(indices).startswith(
|
||||
'[IndexConfig(name="fsb_idx", index_type="BTree", columns=["fsb"]'
|
||||
)
|
||||
assert str(indices) == '[Index(BTree, columns=["fsb"], name="fsb_idx")]'
|
||||
assert len(indices) == 1
|
||||
assert indices[0].index_type == "BTree"
|
||||
assert indices[0].columns == ["fsb"]
|
||||
@@ -272,9 +247,7 @@ async def test_create_bitmap_index(some_table: AsyncTable):
|
||||
async def test_create_label_list_index(some_table: AsyncTable):
|
||||
await some_table.create_index("tags", config=LabelList())
|
||||
indices = await some_table.list_indices()
|
||||
assert str(indices).startswith(
|
||||
'[IndexConfig(name="tags_idx", index_type="LabelList", columns=["tags"]'
|
||||
)
|
||||
assert str(indices) == '[Index(LabelList, columns=["tags"], name="tags_idx")]'
|
||||
plan = await some_table.query().where("array_has(tags, 'tag0')").explain_plan()
|
||||
assert "ScalarIndexQuery" in plan
|
||||
|
||||
@@ -289,9 +262,7 @@ async def test_create_large_list_label_list_index(db_async):
|
||||
|
||||
await table.create_index("tags", config=LabelList())
|
||||
indices = await table.list_indices()
|
||||
assert str(indices).startswith(
|
||||
'[IndexConfig(name="tags_idx", index_type="LabelList", columns=["tags"]'
|
||||
)
|
||||
assert str(indices) == '[Index(LabelList, columns=["tags"], name="tags_idx")]'
|
||||
plan = await table.query().where("array_has(tags, 'shared')").explain_plan()
|
||||
assert "ScalarIndexQuery" in plan
|
||||
|
||||
@@ -328,9 +299,7 @@ async def test_create_label_list_index_rejects_list_struct(db_async):
|
||||
async def test_full_text_search_index(some_table: AsyncTable):
|
||||
await some_table.create_index("tags", config=FTS(with_position=False))
|
||||
indices = await some_table.list_indices()
|
||||
assert str(indices).startswith(
|
||||
'[IndexConfig(name="tags_idx", index_type="FTS", columns=["tags"]'
|
||||
)
|
||||
assert str(indices) == '[Index(FTS, columns=["tags"], name="tags_idx")]'
|
||||
|
||||
await some_table.prewarm_index("tags_idx")
|
||||
|
||||
|
||||
@@ -5,11 +5,11 @@
|
||||
|
||||
import tempfile
|
||||
import shutil
|
||||
import sys
|
||||
import pytest
|
||||
import pyarrow as pa
|
||||
import lancedb
|
||||
from lance_namespace.errors import NamespaceNotEmptyError, TableNotFoundError
|
||||
from lancedb.namespace import _MAX_QUERY_K
|
||||
from lancedb.table import AsyncTable, LanceTable
|
||||
|
||||
|
||||
@@ -257,15 +257,8 @@ class TestNamespaceConnection:
|
||||
assert table_schema.field("id").type == pa.int64()
|
||||
assert table_schema.field("text").type == pa.string()
|
||||
|
||||
def test_rename_table(self):
|
||||
"""Test that rename_table renames a table in the namespace.
|
||||
|
||||
The `dir` namespace implementation in lance-namespace-impls does not
|
||||
implement `rename_table` yet (only the `rest` backend does), so it
|
||||
currently falls back to the default trait method which raises
|
||||
NotSupported. This is expected to start passing once the `dir`
|
||||
backend gains rename_table support upstream.
|
||||
"""
|
||||
def test_rename_table_not_supported(self):
|
||||
"""Test that rename_table raises NotImplementedError."""
|
||||
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
|
||||
|
||||
# Create a child namespace first
|
||||
@@ -280,14 +273,9 @@ class TestNamespaceConnection:
|
||||
)
|
||||
db.create_table("old_name", schema=schema, namespace_path=["test_ns"])
|
||||
|
||||
# Rename the table within the same namespace
|
||||
with pytest.raises(NotImplementedError, match="rename_table not implemented"):
|
||||
db.rename_table(
|
||||
"old_name",
|
||||
"new_name",
|
||||
cur_namespace_path=["test_ns"],
|
||||
new_namespace_path=["test_ns"],
|
||||
)
|
||||
# Rename should raise NotImplementedError
|
||||
with pytest.raises(NotImplementedError, match="rename_table is not supported"):
|
||||
db.rename_table("old_name", "new_name")
|
||||
|
||||
def test_drop_all_tables(self):
|
||||
"""Test dropping all tables through namespace."""
|
||||
@@ -816,13 +804,10 @@ class TestPushdownOperations:
|
||||
["geneva", "hist"],
|
||||
["geneva", "hist"],
|
||||
]
|
||||
# Unlimited reads cap k at i32::MAX (the namespace query_table `k`
|
||||
# field is i32); sys.maxsize would overflow the Rust binding.
|
||||
assert [request.k for request in namespace_client.requests] == [
|
||||
_MAX_QUERY_K,
|
||||
_MAX_QUERY_K,
|
||||
sys.maxsize,
|
||||
sys.maxsize,
|
||||
]
|
||||
assert all(r.k <= 2**31 - 1 for r in namespace_client.requests)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
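Review note: the comment in the hunk above states that the namespace `query_table` `k` field is an i32 and that `sys.maxsize` would overflow the Rust binding. A minimal sketch of the clamping this implies follows. `I32_MAX` and `clamp_k` are hypothetical names, and the real `_MAX_QUERY_K` constant is not shown in this diff.

```python
import sys
from typing import Optional

# Largest value representable in a signed 32-bit integer.
I32_MAX = 2**31 - 1


def clamp_k(limit: Optional[int]) -> int:
    """Return a row limit that fits in an i32; None means 'unlimited'."""
    if limit is None:
        return I32_MAX
    return min(limit, I32_MAX)
```

With this shape, an unlimited read sends `I32_MAX` and not `sys.maxsize`, which is what the `r.k <= 2**31 - 1` assertion in the hunk checks.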
@@ -877,13 +862,10 @@ class TestAsyncPushdownOperations:
|
||||
["geneva", "hist"],
|
||||
["geneva", "hist"],
|
||||
]
|
||||
# Unlimited reads cap k at i32::MAX (the namespace query_table `k`
|
||||
# field is i32); sys.maxsize would overflow the Rust binding.
|
||||
assert [request.k for request in namespace_client.requests] == [
|
||||
_MAX_QUERY_K,
|
||||
_MAX_QUERY_K,
|
||||
sys.maxsize,
|
||||
sys.maxsize,
|
||||
]
|
||||
assert all(r.k <= 2**31 - 1 for r in namespace_client.requests)
|
||||
|
||||
|
||||
def test_local_table_to_arrow_and_to_pandas_are_unchanged(tmp_path):
|
||||
|
||||
@@ -1,686 +0,0 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
"""Regression matrix for nested field support across LanceDB Python APIs.
|
||||
|
||||
Covers the lifecycle described in lancedb/lancedb#3406:
|
||||
- Nested scalar, vector, and FTS index creation with full dotted paths
|
||||
- list_indices / index_stats return canonical full paths (not leaf names)
|
||||
- search, filter, append, optimize behaviour
|
||||
- Field-name edge cases: mixed case, literal-dot field names, same-name leaves
|
||||
- Both sync and async Python table APIs
|
||||
|
||||
The matrix uses the following field-name variants from the acceptance criteria:
|
||||
- rowId (camelCase top-level)
|
||||
- `row-id` (hyphenated top-level, escaped)
|
||||
- parent.`leaf.name` (struct leaf whose name contains a literal dot)
|
||||
- MetaData.userId (mixed-case nested path)
|
||||
- `meta-data`.`user-id` (hyphenated struct with hyphenated leaf)
|
||||
|
||||
Note: Lance forbids top-level field names that contain a '.', so the literal-dot
|
||||
edge case is exercised via a struct leaf field (parent.`leaf.name`) instead.
|
||||
"""
|
||||
|
||||
from datetime import timedelta
|
||||
|
||||
import pyarrow as pa
|
||||
import pytest
|
||||
import pytest_asyncio
|
||||
|
||||
import lancedb
|
||||
from lancedb.db import AsyncConnection, DBConnection
|
||||
from lancedb.index import BTree, FTS, IvfPq
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
DIM = 8
|
||||
# IvfPq requires at least num_partitions * 256 rows by default; keeping rows
|
||||
# small means we must drop num_sub_vectors and num_partitions very low.
|
||||
NROWS = 256
|
||||
|
||||
|
||||
def _vec(row: int) -> list:
|
||||
return [float((row * DIM + i) % 256) for i in range(DIM)]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fixtures
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sync_db(tmp_path) -> DBConnection:
|
||||
return lancedb.connect(tmp_path)
|
||||
|
||||
|
||||
@pytest_asyncio.fixture
|
||||
async def async_db(tmp_path) -> AsyncConnection:
|
||||
return await lancedb.connect_async(
|
||||
tmp_path, read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Schema / data builders
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _nested_scalar_schema() -> pa.Schema:
|
||||
"""Schema with nested scalar fields covering the acceptance-criteria names.
|
||||
|
||||
Top-level columns:
|
||||
- rowId int32 (camelCase top-level)
|
||||
- row-id int32 (hyphenated top-level name)
|
||||
- MetaData struct{userId int32} (mixed-case nested path)
|
||||
- meta-data struct{user-id int32} (hyphenated struct + hyphenated leaf)
|
||||
|
||||
Lance disallows top-level field names that contain '.' (e.g. a field
|
||||
literally named 'a.b'), so that edge case is tested separately using
|
||||
_literal_dot_schema() below.
|
||||
"""
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("rowId", pa.int32()),
|
||||
pa.field("row-id", pa.int32()),
|
||||
pa.field(
|
||||
"MetaData",
|
||||
pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
pa.field(
|
||||
"meta-data",
|
||||
pa.struct([pa.field("user-id", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _nested_scalar_data(nrows: int = NROWS) -> pa.Table:
|
||||
schema = _nested_scalar_schema()
|
||||
return pa.table(
|
||||
{
|
||||
"rowId": pa.array(list(range(nrows)), pa.int32()),
|
||||
"row-id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"MetaData": pa.array(
|
||||
[{"userId": i} for i in range(nrows)],
|
||||
type=pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
"meta-data": pa.array(
|
||||
[{"user-id": i} for i in range(nrows)],
|
||||
type=pa.struct([pa.field("user-id", pa.int32())]),
|
||||
),
|
||||
},
|
||||
schema=schema,
|
||||
)
|
||||
|
||||
|
||||
def _literal_dot_schema() -> pa.Schema:
|
||||
"""Schema where a struct *leaf* field is named with a literal dot.
|
||||
|
||||
The path used in the index API is ``parent.`leaf.name` ``.
|
||||
"""
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("id", pa.int32()),
|
||||
pa.field(
|
||||
"parent",
|
||||
pa.struct([pa.field("leaf.name", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _literal_dot_data(nrows: int = NROWS) -> pa.Table:
|
||||
parent_type = pa.struct([pa.field("leaf.name", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"parent": pa.array(
|
||||
[{"leaf.name": i} for i in range(nrows)],
|
||||
type=parent_type,
|
||||
),
|
||||
},
|
||||
schema=_literal_dot_schema(),
|
||||
)
|
||||
|
||||
|
||||
def _same_leaf_schema() -> pa.Schema:
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("StructA", pa.struct([pa.field("userId", pa.int32())])),
|
||||
pa.field("StructB", pa.struct([pa.field("userId", pa.int32())])),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _same_leaf_data(nrows: int = NROWS) -> pa.Table:
|
||||
t = pa.struct([pa.field("userId", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"StructA": pa.array([{"userId": i} for i in range(nrows)], type=t),
|
||||
"StructB": pa.array([{"userId": i * 10} for i in range(nrows)], type=t),
|
||||
},
|
||||
schema=_same_leaf_schema(),
|
||||
)
|
||||
|
||||
|
||||
def _nested_vector_schema() -> pa.Schema:
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("id", pa.int32()),
|
||||
pa.field(
|
||||
"image",
|
||||
pa.struct([pa.field("embedding", pa.list_(pa.float32(), DIM))]),
|
||||
),
|
||||
pa.field(
|
||||
"MetaData",
|
||||
pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _nested_vector_data(nrows: int = NROWS) -> pa.Table:
|
||||
embedding_type = pa.list_(pa.float32(), DIM)
|
||||
image_type = pa.struct([pa.field("embedding", embedding_type)])
|
||||
meta_type = pa.struct([pa.field("userId", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"image": pa.array(
|
||||
[{"embedding": _vec(i)} for i in range(nrows)],
|
||||
type=image_type,
|
||||
),
|
||||
"MetaData": pa.array(
|
||||
[{"userId": i} for i in range(nrows)],
|
||||
type=meta_type,
|
||||
),
|
||||
},
|
||||
schema=_nested_vector_schema(),
|
||||
)
|
||||
|
||||
|
||||
def _nested_fts_schema() -> pa.Schema:
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("id", pa.int32()),
|
||||
pa.field(
|
||||
"payload",
|
||||
pa.struct([pa.field("text", pa.utf8())]),
|
||||
),
|
||||
pa.field(
|
||||
"MetaData",
|
||||
pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _nested_fts_data(nrows: int = NROWS) -> pa.Table:
|
||||
words = ["alpha", "bravo", "charlie", "delta", "echo"]
|
||||
payload_type = pa.struct([pa.field("text", pa.utf8())])
|
||||
meta_type = pa.struct([pa.field("userId", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"payload": pa.array(
|
||||
[{"text": words[i % len(words)]} for i in range(nrows)],
|
||||
type=payload_type,
|
||||
),
|
||||
"MetaData": pa.array(
|
||||
[{"userId": i} for i in range(nrows)],
|
||||
type=meta_type,
|
||||
),
|
||||
},
|
||||
schema=_nested_fts_schema(),
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _columns_by_name_sync(tbl) -> dict:
|
||||
return {idx.name: idx.columns for idx in tbl.list_indices()}
|
||||
|
||||
|
||||
async def _columns_by_name_async(tbl) -> dict:
|
||||
return {idx.name: idx.columns for idx in await tbl.list_indices()}
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# SYNC TESTS
|
||||
# ===========================================================================
|
||||
#
|
||||
# The sync LanceTable API uses:
|
||||
# - create_scalar_index(column, ...) for scalar (BTree/Bitmap/LabelList) indices
|
||||
# - create_fts_index(column, ...) for full-text-search indices
|
||||
# - create_index(...) for vector indices (older positional API)
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestNestedScalarIndexSync:
|
||||
"""Sync regression matrix for nested scalar (BTree) indices."""
|
||||
|
||||
def test_top_level_camelcase_field(self, sync_db):
|
||||
"""list_indices must return the full camelCase field name."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index("rowId", index_type="BTREE", name="rowid_idx")
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["rowid_idx"] == ["rowId"], (
|
||||
"list_indices must return 'rowId', not a truncated leaf name"
|
||||
)
|
||||
|
||||
def test_top_level_hyphenated_field_escaped(self, sync_db):
|
||||
"""Top-level field 'row-id' (hyphenated) accessed via escaped path."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index("`row-id`", index_type="BTREE", name="rowid_hyph_idx")
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["rowid_hyph_idx"] == ["`row-id`"], (
|
||||
"list_indices must return escaped path '`row-id`'"
|
||||
)
|
||||
|
||||
def test_struct_leaf_literal_dot_field_escaped(self, sync_db):
|
||||
"""Struct leaf with a literal-dot name: parent.`leaf.name`.
|
||||
|
||||
The index listing must use the full escaped path, not just the leaf.
|
||||
"""
|
||||
tbl = sync_db.create_table("t", _literal_dot_data())
|
||||
tbl.create_scalar_index(
|
||||
"parent.`leaf.name`", index_type="BTREE", name="leaf_dot_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["leaf_dot_idx"] == ["parent.`leaf.name`"], (
|
||||
"list_indices must return 'parent.`leaf.name`', not just '`leaf.name`'"
|
||||
)
|
||||
|
||||
def test_nested_mixed_case_path(self, sync_db):
|
||||
"""Nested path MetaData.userId (mixed case) must appear as full path."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="metadata_userid_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["metadata_userid_idx"] == ["MetaData.userId"], (
|
||||
"list_indices must return 'MetaData.userId', not leaf 'userId'"
|
||||
)
|
||||
|
||||
def test_nested_hyphenated_path_escaped(self, sync_db):
|
||||
"""`meta-data`.`user-id` path with both parts escaped."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"`meta-data`.`user-id`", index_type="BTREE", name="metauid_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["metauid_idx"] == ["`meta-data`.`user-id`"], (
|
||||
"list_indices must return '`meta-data`.`user-id`', not 'user-id'"
|
||||
)
|
||||
|
||||
def test_filter_on_nested_mixed_case(self, sync_db):
|
||||
"""WHERE filter on a nested dotted path works after index creation."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="metadata_userid_idx"
|
||||
)
|
||||
rows = tbl.search().where("MetaData.userId = 5").to_list()
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["MetaData"]["userId"] == 5
|
||||
|
||||
def test_append_and_list_indices_stable(self, sync_db):
|
||||
"""After appending rows the index listing must remain unchanged."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
|
||||
)
|
||||
tbl.add(_nested_scalar_data(nrows=4))
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
def test_optimize_and_list_indices_stable(self, tmp_path):
|
||||
"""After optimize the index listing must still show full paths."""
|
||||
db = lancedb.connect(tmp_path / "opt_db")
|
||||
tbl = db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
|
||||
)
|
||||
tbl.add(_nested_scalar_data(nrows=4))
|
||||
tbl.optimize()
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
def test_same_name_leaves_are_distinct(self, sync_db):
|
||||
"""Two structs sharing a leaf name must produce distinct index paths."""
|
||||
tbl = sync_db.create_table("same_leaf", _same_leaf_data())
|
||||
tbl.create_scalar_index(
|
||||
"StructA.userId", index_type="BTREE", name="a_userid_idx"
|
||||
)
|
||||
tbl.create_scalar_index(
|
||||
"StructB.userId", index_type="BTREE", name="b_userid_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["a_userid_idx"] == ["StructA.userId"]
|
||||
assert col_map["b_userid_idx"] == ["StructB.userId"]
|
||||
|
||||
def test_index_stats_canonical_path(self, sync_db):
|
||||
"""index_stats round-trip: create on nested field, verify row count."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
|
||||
)
|
||||
stats = tbl.index_stats("meta_uid_idx")
|
||||
assert stats is not None
|
||||
assert stats.index_type == "BTREE"
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
|
||||
class TestNestedVectorIndexSync:
|
||||
"""Sync regression matrix for nested vector (IvfPq) indices."""
|
||||
|
||||
def test_nested_vector_index_full_path(self, sync_db):
|
||||
"""Listing after vector index creation must use the full dotted path."""
|
||||
tbl = sync_db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"], (
|
||||
"list_indices must return 'image.embedding', not leaf 'embedding'"
|
||||
)
|
||||
|
||||
def test_nested_vector_search(self, sync_db):
|
||||
"""Vector search on nested embedding field must return results."""
|
||||
tbl = sync_db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
results = (
|
||||
tbl.search(_vec(0), vector_column_name="image.embedding").limit(5).to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
|
||||
def test_nested_vector_index_stats(self, sync_db):
|
||||
"""index_stats for a nested vector index must reflect correct row count."""
|
||||
tbl = sync_db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
stats = tbl.index_stats("image_emb_idx")
|
||||
assert stats is not None
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
def test_nested_vector_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the vector index listing must be stable."""
|
||||
db = lancedb.connect(tmp_path / "vec_opt_db")
|
||||
tbl = db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
tbl.add(_nested_vector_data(nrows=4))
|
||||
tbl.optimize()
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"]
|
||||
|
||||
|
||||
class TestNestedFTSIndexSync:
|
||||
"""Sync regression matrix for nested FTS indices."""
|
||||
|
||||
def test_nested_fts_index_full_path(self, sync_db):
|
||||
"""FTS index on payload.text must be listed with the full path."""
|
||||
tbl = sync_db.create_table("ft", _nested_fts_data())
|
||||
tbl.create_fts_index("payload.text", name="payload_text_idx")
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"], (
|
||||
"list_indices must return 'payload.text', not leaf 'text'"
|
||||
)
|
||||
|
||||
def test_nested_fts_search(self, sync_db):
|
||||
"""FTS search on a nested text field must return correct results."""
|
||||
tbl = sync_db.create_table("ft", _nested_fts_data())
|
||||
tbl.create_fts_index("payload.text", name="payload_text_idx")
|
||||
results = (
|
||||
tbl.search("alpha", query_type="fts", fts_columns="payload.text")
|
||||
.limit(10)
|
||||
.to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
assert all(row["payload"]["text"] == "alpha" for row in results)
|
||||
|
||||
def test_nested_fts_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the FTS index listing must be stable."""
|
||||
db = lancedb.connect(tmp_path / "fts_opt_db")
|
||||
tbl = db.create_table("ft", _nested_fts_data())
|
||||
tbl.create_fts_index("payload.text", name="payload_text_idx")
|
||||
tbl.add(_nested_fts_data(nrows=4))
|
||||
tbl.optimize()
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"]
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# ASYNC TESTS
|
||||
# ===========================================================================
|
||||
#
|
||||
# The async AsyncTable API uses create_index(column, config=...) uniformly
|
||||
# for scalar, vector, and FTS indices.
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestNestedScalarIndexAsync:
|
||||
"""Async regression matrix for nested scalar (BTree) indices."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_top_level_camelcase_field(self, async_db):
|
||||
"""list_indices must return the full camelCase field name."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("rowId", config=BTree(), name="rowid_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["rowid_idx"] == ["rowId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_top_level_hyphenated_field_escaped(self, async_db):
|
||||
"""Hyphenated top-level field accessed via escaped path."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("`row-id`", config=BTree(), name="rowid_hyph_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["rowid_hyph_idx"] == ["`row-id`"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_struct_leaf_literal_dot_field_escaped(self, async_db):
|
||||
"""Struct leaf with a literal-dot name: parent.`leaf.name`."""
|
||||
tbl = await async_db.create_table("t", _literal_dot_data())
|
||||
await tbl.create_index(
|
||||
"parent.`leaf.name`", config=BTree(), name="leaf_dot_idx"
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["leaf_dot_idx"] == ["parent.`leaf.name`"]
|
||||
|
||||
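The escaped-path tests above rely on a convention where backticks wrap a segment that contains characters such as `-` or a literal `.`. The sketch below illustrates how such a path could be split into segments. It is not LanceDB's parser: `split_field_path` is a hypothetical helper written for illustration, and it assumes backticks never appear inside a field name.

```python
from __future__ import annotations


def split_field_path(path: str) -> list[str]:
    """Split a dotted field path into segments, honoring backtick escapes.

    Hypothetical helper for illustration only; assumes a field name never
    contains a backtick itself.
    """
    segments: list[str] = []
    current: list[str] = []
    in_backticks = False
    for ch in path:
        if ch == "`":
            # Toggle escaped mode; the backtick itself is not part of the name.
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            # An unescaped dot ends the current segment.
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_backticks:
        raise ValueError(f"unterminated backtick in path: {path!r}")
    segments.append("".join(current))
    return segments
```

Under these assumptions, a literal-dot leaf stays a single segment, which is why the tests expect the listing to keep the backticks rather than collapse the path to its leaf name.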
@pytest.mark.asyncio
|
||||
async def test_nested_mixed_case_path(self, async_db):
|
||||
"""Mixed-case nested path MetaData.userId must appear as full path."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index(
|
||||
"MetaData.userId", config=BTree(), name="metadata_userid_idx"
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["metadata_userid_idx"] == ["MetaData.userId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_hyphenated_path_escaped(self, async_db):
|
||||
"""`meta-data`.`user-id` path with both parts escaped."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index(
|
||||
"`meta-data`.`user-id`", config=BTree(), name="metauid_idx"
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["metauid_idx"] == ["`meta-data`.`user-id`"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_filter_on_nested_mixed_case(self, async_db):
|
||||
"""WHERE filter on a nested dotted path works after index creation."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index(
|
||||
"MetaData.userId", config=BTree(), name="metadata_userid_idx"
|
||||
)
|
||||
rows = await tbl.query().where("MetaData.userId = 5").to_list()
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["MetaData"]["userId"] == 5
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_index_stats_canonical_path(self, async_db):
|
||||
"""index_stats round-trip: create on nested field, verify stats."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
|
||||
stats = await tbl.index_stats("meta_uid_idx")
|
||||
assert stats is not None
|
||||
assert stats.index_type == "BTREE"
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_append_and_list_indices_stable(self, async_db):
|
||||
"""After appending rows the index listing must remain unchanged."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
|
||||
await tbl.add(_nested_scalar_data(nrows=4))
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_optimize_and_list_indices_stable(self, tmp_path):
|
||||
"""After optimize the index listing must still show full paths."""
|
||||
db = await lancedb.connect_async(
|
||||
tmp_path / "opt_db", read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
tbl = await db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
|
||||
await tbl.add(_nested_scalar_data(nrows=4))
|
||||
await tbl.optimize()
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_same_name_leaves_are_distinct(self, async_db):
|
||||
"""Two structs sharing a leaf name must produce distinct index paths."""
|
||||
tbl = await async_db.create_table("same_leaf", _same_leaf_data())
|
||||
await tbl.create_index("StructA.userId", config=BTree(), name="a_userid_idx")
|
||||
await tbl.create_index("StructB.userId", config=BTree(), name="b_userid_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["a_userid_idx"] == ["StructA.userId"]
|
||||
assert col_map["b_userid_idx"] == ["StructB.userId"]
|
||||
|
||||
|
||||
class TestNestedVectorIndexAsync:
|
||||
"""Async regression matrix for nested vector (IvfPq) indices."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_index_full_path(self, async_db):
|
||||
"""Listing after vector index creation must use the full dotted path."""
|
||||
tbl = await async_db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_search(self, async_db):
|
||||
"""Vector search on nested embedding field must return results."""
|
||||
tbl = await async_db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
results = (
|
||||
await tbl.query()
|
||||
.nearest_to(_vec(0))
|
||||
.column("image.embedding")
|
||||
.limit(5)
|
||||
.to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_index_stats(self, async_db):
|
||||
"""index_stats for a nested vector index must reflect correct row count."""
|
||||
tbl = await async_db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
stats = await tbl.index_stats("image_emb_idx")
|
||||
assert stats is not None
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the vector index listing must be stable."""
|
||||
db = await lancedb.connect_async(
|
||||
tmp_path / "vec_opt_db", read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
tbl = await db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
await tbl.add(_nested_vector_data(nrows=4))
|
||||
await tbl.optimize()
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"]
|
||||
|
||||
|
||||
class TestNestedFTSIndexAsync:
|
||||
"""Async regression matrix for nested FTS indices."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_fts_index_full_path(self, async_db):
|
||||
"""FTS index on payload.text must be listed with the full path."""
|
||||
tbl = await async_db.create_table("ft", _nested_fts_data())
|
||||
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_fts_search(self, async_db):
|
||||
"""FTS search on a nested text field must return correct results."""
|
||||
tbl = await async_db.create_table("ft", _nested_fts_data())
|
||||
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
|
||||
results = (
|
||||
await tbl.query()
|
||||
.nearest_to_text("alpha", columns="payload.text")
|
||||
.limit(10)
|
||||
.to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
assert all(row["payload"]["text"] == "alpha" for row in results)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_fts_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the FTS index listing must be stable."""
|
||||
db = await lancedb.connect_async(
|
||||
tmp_path / "fts_opt_db", read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
tbl = await db.create_table("ft", _nested_fts_data())
|
||||
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
|
||||
await tbl.add(_nested_fts_data(nrows=4))
|
||||
await tbl.optimize()
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"]
|
||||
@@ -188,18 +188,6 @@ def test_nested_struct_list():
|
||||
assert schema == expect_schema
|
||||
|
||||
|
||||
def test_bare_generic_raises_type_error():
|
||||
# A bare, unparameterised List/Tuple has no element type to map to Arrow.
|
||||
# It should raise a clear TypeError, not crash with AttributeError: __args__.
|
||||
for bare in (List, Tuple):
|
||||
|
||||
class TestModel(pydantic.BaseModel):
|
||||
items: bare
|
||||
|
||||
with pytest.raises(TypeError, match="unsupported type"):
|
||||
pydantic_to_schema(TestModel)
|
||||
|
||||
|
||||
def test_nested_struct_list_optional():
|
||||
class SplitInfo(pydantic.BaseModel):
|
||||
start_frame: int
|
||||
|
||||
@@ -154,116 +154,50 @@ async def test_async_checkout():
|
||||
assert await table.count_rows() == 300
|
||||
|
||||
|
||||
def _branch_open_handler(request):
|
||||
if "/branches/list" in request.path:
|
||||
body = json.dumps(
|
||||
{
|
||||
"branches": {
|
||||
"exp": {
|
||||
"parentBranch": None,
|
||||
"parentVersion": 1,
|
||||
"createAt": 1,
|
||||
"manifestSize": 1,
|
||||
}
|
||||
}
|
||||
}
|
||||
).encode()
|
||||
else:
|
||||
# describe (table open + version/branch validation)
|
||||
body = json.dumps({"version": 2, "schema": {"fields": []}}).encode()
|
||||
request.send_response(200)
|
||||
request.send_header("Content-Type", "application/json")
|
||||
request.end_headers()
|
||||
request.wfile.write(body)
|
||||
|
||||
|
||||
def test_remote_open_table_branch_and_version():
|
||||
with mock_lancedb_connection(_branch_open_handler) as db:
|
||||
# version-only (and "main" + version) time-travels the main chain
|
||||
assert db.open_table("test", version=2) is not None
|
||||
assert db.open_table("test", branch="main", version=2).current_branch() is None
|
||||
|
||||
# a non-main branch opens a handle scoped to that branch, with or
|
||||
# without a version
|
||||
assert db.open_table("test", branch="exp").current_branch() == "exp"
|
||||
assert db.open_table("test", branch="exp", version=2).current_branch() == "exp"
|
||||
|
||||
|
||||
def test_remote_table_branches_sync():
|
||||
# Branch CRUD + current_branch on the sync RemoteTable. The handle returned
|
||||
# by create/checkout must stay a RemoteTable scoped to the branch.
|
||||
from lancedb.remote.table import RemoteTable
|
||||
|
||||
def handler(request):
|
||||
if "/branches/list" in request.path:
|
||||
body = json.dumps(
|
||||
{
|
||||
"branches": {
|
||||
"exp": {
|
||||
"parentBranch": None,
|
||||
"parentVersion": 1,
|
||||
"createAt": 1,
|
||||
"manifestSize": 1,
|
||||
}
|
||||
}
|
||||
}
|
||||
).encode()
|
||||
elif "/branches/create" in request.path or "/branches/delete" in request.path:
|
||||
body = b"{}"
|
||||
else:
|
||||
# describe (table open + checkout validation)
|
||||
body = json.dumps({"version": 1, "schema": {"fields": []}}).encode()
|
||||
# describe (table open + version validation) always succeeds
|
||||
request.send_response(200)
|
||||
request.send_header("Content-Type", "application/json")
|
||||
request.end_headers()
|
||||
request.wfile.write(body)
|
||||
request.wfile.write(
|
||||
json.dumps({"version": 2, "schema": {"fields": []}}).encode()
|
||||
)
|
||||
|
||||
with mock_lancedb_connection(handler) as db:
|
||||
table = db.open_table("test")
|
||||
assert isinstance(table, RemoteTable)
|
||||
assert table.current_branch() is None
|
||||
# version-only (and "main" + version) is allowed: remote supports
|
||||
# version time-travel even though it has no branches
|
||||
assert db.open_table("test", version=2) is not None
|
||||
assert db.open_table("test", branch="main", version=2) is not None
|
||||
|
||||
branch = table.branches.create("exp")
|
||||
assert isinstance(branch, RemoteTable)
|
||||
assert branch.current_branch() == "exp"
|
||||
|
||||
# list + checkout round trip; checkout also yields a branch-scoped handle
|
||||
assert "exp" in table.branches.list()
|
||||
checked = table.branches.checkout("exp")
|
||||
assert isinstance(checked, RemoteTable)
|
||||
assert checked.current_branch() == "exp"
|
||||
|
||||
table.branches.delete("exp")
|
||||
# a non-main branch is rejected, with or without a version
|
||||
with pytest.raises(NotImplementedError, match="branching"):
|
||||
db.open_table("test", branch="exp")
|
||||
with pytest.raises(NotImplementedError, match="branching"):
|
||||
db.open_table("test", branch="exp", version=2)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_remote_open_table_branch_and_version():
|
||||
async with mock_lancedb_connection_async(_branch_open_handler) as db:
|
||||
# version-only (and "main" + version) time-travels the main chain
|
||||
def handler(request):
|
||||
request.send_response(200)
|
||||
request.send_header("Content-Type", "application/json")
|
||||
request.end_headers()
|
||||
request.wfile.write(
|
||||
json.dumps({"version": 2, "schema": {"fields": []}}).encode()
|
||||
)
|
||||
|
||||
async with mock_lancedb_connection_async(handler) as db:
|
||||
# version-only (and "main" + version) is allowed: "main" is the default
|
||||
# branch, so it must not hit the unsupported remote branch path
|
||||
assert await db.open_table("test", version=2) is not None
|
||||
main_v2 = await db.open_table("test", branch="main", version=2)
|
||||
assert main_v2.current_branch() is None
|
||||
assert await db.open_table("test", branch="main", version=2) is not None
|
||||
|
||||
# a non-main branch opens a handle scoped to that branch
|
||||
exp = await db.open_table("test", branch="exp")
|
||||
assert exp.current_branch() == "exp"
|
||||
exp_v2 = await db.open_table("test", branch="exp", version=2)
|
||||
assert exp_v2.current_branch() == "exp"
|
||||
|
||||
|
||||
def test_remote_table_branch_survives_pickle():
|
||||
# Regression: a branch-scoped handle must keep its branch across a
|
||||
# pickle/fork round-trip (it used to reopen on main).
|
||||
with mock_lancedb_connection(_branch_open_handler) as db:
|
||||
branch = db.open_table("test", branch="exp")
|
||||
assert branch.current_branch() == "exp"
|
||||
restored = pickle.loads(pickle.dumps(branch))
|
||||
assert restored.current_branch() == "exp"
|
||||
|
||||
# the pinned version is carried through as well
|
||||
branch_v2 = db.open_table("test", branch="exp", version=2)
|
||||
restored_v2 = pickle.loads(pickle.dumps(branch_v2))
|
||||
assert restored_v2.current_branch() == "exp"
|
||||
# a non-main branch is rejected, with or without a version
|
||||
with pytest.raises(NotImplementedError, match="branching"):
|
||||
await db.open_table("test", branch="exp")
|
||||
with pytest.raises(NotImplementedError, match="branching"):
|
||||
await db.open_table("test", branch="exp", version=2)
|
||||
|
||||
|
||||
def test_table_len_sync():
|
||||
|
||||
@@ -22,7 +22,6 @@ import pytest
|
||||
from lancedb.conftest import MockTextEmbeddingFunction
|
||||
from lancedb.db import AsyncConnection, DBConnection
|
||||
from lancedb.embeddings import EmbeddingFunctionConfig, EmbeddingFunctionRegistry
|
||||
from lancedb.expr import col, lit
|
||||
from lancedb.pydantic import LanceModel, Vector
|
||||
from lancedb.table import LanceTable
|
||||
from pydantic import BaseModel
|
||||
@@ -301,16 +300,6 @@ def test_create_table(mem_db: DBConnection):
|
||||
assert expected == tbl
|
||||
|
||||
|
||||
def test_create_table_rejects_single_dictionary(mem_db: DBConnection):
|
||||
data = {"vector": [3.1, 4.1], "item": "foo", "price": 10.0}
|
||||
with pytest.raises(ValueError) as excep_info:
|
||||
mem_db.create_table("test", data=data)
|
||||
assert (
|
||||
str(excep_info.value) == "Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
)
|
||||
|
||||
|
||||
def test_empty_table(mem_db: DBConnection):
|
||||
schema = pa.schema(
|
||||
[
|
||||
@@ -340,8 +329,8 @@ def test_add_dictionary(mem_db: DBConnection):
|
||||
with pytest.raises(ValueError) as excep_info:
|
||||
tbl.add(data=data)
|
||||
assert (
|
||||
str(excep_info.value) == "Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
str(excep_info.value)
|
||||
== "Cannot add a single dictionary to a table. Use a list."
|
||||
)
|
||||
|
||||
|
||||
@@ -1977,38 +1966,6 @@ def test_delete(mem_db: DBConnection):
|
||||
assert table.to_arrow()["id"].to_pylist() == [1]
|
||||
|
||||
|
||||
def test_delete_expr(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
"my_table",
|
||||
data=[
|
||||
{"vector": [1.1, 0.9], "id": 0},
|
||||
{"vector": [1.2, 1.9], "id": 1},
|
||||
{"vector": [1.3, 2.9], "id": 2},
|
||||
],
|
||||
)
|
||||
assert len(table) == 3
|
||||
delete_res = table.delete(col("id") == lit(0))
|
||||
assert delete_res.version == 2
|
||||
assert len(table) == 2
|
||||
assert sorted(table.to_arrow()["id"].to_pylist()) == [1, 2]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_delete_expr_async(mem_db_async: AsyncConnection):
|
||||
table = await mem_db_async.create_table(
|
||||
"my_table",
|
||||
data=[
|
||||
{"vector": [1.1, 0.9], "id": 0},
|
||||
{"vector": [1.2, 1.9], "id": 1},
|
||||
{"vector": [1.3, 2.9], "id": 2},
|
||||
],
|
||||
)
|
||||
assert await table.count_rows() == 3
|
||||
await table.delete(col("id") == lit(0))
|
||||
assert await table.count_rows() == 2
|
||||
assert sorted((await table.to_arrow())["id"].to_pylist()) == [1, 2]
|
||||
|
||||
|
||||
def test_update(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
"my_table",
|
||||
@@ -2194,50 +2151,6 @@ def test_merge_insert(mem_db: DBConnection):
|
||||
)
|
||||
|
||||
|
||||
def test_merge_insert_by_source_delete_expr(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
"my_table",
|
||||
data=pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]}),
|
||||
)
|
||||
new_data = pa.table({"a": [2, 4], "b": ["x", "z"]})
|
||||
|
||||
# replace-range, limiting the source-absent delete with an Expr condition
|
||||
merge_insert_res = (
|
||||
table.merge_insert("a")
|
||||
.when_matched_update_all()
|
||||
.when_not_matched_insert_all()
|
||||
.when_not_matched_by_source_delete(col("a") > lit(2))
|
||||
.execute(new_data)
|
||||
)
|
||||
assert merge_insert_res.num_inserted_rows == 1
|
||||
assert merge_insert_res.num_updated_rows == 1
|
||||
assert merge_insert_res.num_deleted_rows == 1
|
||||
|
||||
expected = pa.table({"a": [1, 2, 4], "b": ["a", "x", "z"]})
|
||||
assert table.to_arrow().sort_by("a") == expected
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_merge_insert_by_source_delete_expr_async(
|
||||
mem_db_async: AsyncConnection,
|
||||
):
|
||||
data = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
|
||||
table = await mem_db_async.create_table("some_table", data=data)
|
||||
new_data = pa.table({"a": [2, 4], "b": ["x", "z"]})
|
||||
|
||||
# replace-range, limiting the source-absent delete with an Expr condition
|
||||
await (
|
||||
table.merge_insert("a")
|
||||
.when_matched_update_all()
|
||||
.when_not_matched_insert_all()
|
||||
.when_not_matched_by_source_delete(col("a") > lit(2))
|
||||
.execute(new_data)
|
||||
)
|
||||
|
||||
expected = pa.table({"a": [1, 2, 4], "b": ["a", "x", "z"]})
|
||||
assert (await table.to_arrow()).sort_by("a") == expected
|
||||
|
||||
|
||||
# We vary the data format because there are slight differences in how
|
||||
# subschemas are handled in different formats
|
||||
@pytest.mark.parametrize(
|
||||
@@ -2576,55 +2489,6 @@ def test_create_index_nested_field_paths(mem_db: DBConnection):
|
||||
assert fts_results[0]["payload"]["text"] == "document 44"
|
||||
|
||||
|
||||
def test_index_config_fields(mem_db: DBConnection):
|
||||
"""Test that IndexConfig exposes the new rich metadata fields."""
|
||||
vec_array = pa.array(
|
||||
[[float(i), float(i + 1)] for i in range(300)], pa.list_(pa.float32(), 2)
|
||||
)
|
||||
data = pa.Table.from_pydict({"x": list(range(300)), "vector": vec_array})
|
||||
table = mem_db.create_table("index_config_fields", data=data)
|
||||
table.create_scalar_index("x", index_type="BTREE")
|
||||
table.create_index(
|
||||
vector_column_name="vector",
|
||||
num_partitions=1,
|
||||
num_sub_vectors=1,
|
||||
)
|
||||
|
||||
indices = {idx.name: idx for idx in table.list_indices()}
|
||||
|
||||
scalar_idx = indices["x_idx"]
|
||||
assert scalar_idx.index_uuid is not None
|
||||
assert isinstance(scalar_idx.index_uuid, str)
|
||||
assert scalar_idx.num_indexed_rows is not None
|
||||
assert scalar_idx.num_indexed_rows == 300
|
||||
assert scalar_idx.num_unindexed_rows is not None
|
||||
assert scalar_idx.num_unindexed_rows == 0
|
||||
assert scalar_idx.num_segments is not None
|
||||
assert scalar_idx.num_segments >= 1
|
||||
assert scalar_idx.size_bytes is not None
|
||||
assert scalar_idx.size_bytes > 0
|
||||
assert scalar_idx.created_at is not None
|
||||
from datetime import datetime, timezone
|
||||
|
||||
assert isinstance(scalar_idx.created_at, datetime)
|
||||
assert scalar_idx.created_at.tzinfo == timezone.utc
|
||||
|
||||
# __getitem__ compatibility
|
||||
assert scalar_idx["index_uuid"] == scalar_idx.index_uuid
|
||||
assert scalar_idx["num_indexed_rows"] == scalar_idx.num_indexed_rows
|
||||
assert scalar_idx["created_at"] == scalar_idx.created_at
|
||||
|
||||
# index_details is parsed from JSON into a Python object
|
||||
assert scalar_idx.index_details is not None
|
||||
assert isinstance(scalar_idx.index_details, dict)
|
||||
assert scalar_idx["index_details"] == scalar_idx.index_details
|
||||
|
||||
vector_idx = indices["vector_idx"]
|
||||
assert vector_idx.index_uuid is not None
|
||||
assert vector_idx.num_indexed_rows == 300
|
||||
assert isinstance(vector_idx.index_details, dict)
|
||||
|
||||
|
||||
def test_empty_query(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
"my_table",
|
||||
|
||||
@@ -149,21 +149,6 @@ def test_value_to_sql_dict():
|
||||
assert value_to_sql({}) == "named_struct()"
|
||||
|
||||
|
||||
def test_value_to_sql_dict_key_escaping():
|
||||
# Struct field names that contain a single quote must be escaped (doubled)
|
||||
# the same way string values are, otherwise value_to_sql emits invalid SQL
|
||||
# such as named_struct('it's', 1).
|
||||
assert value_to_sql({"it's": 1}) == "named_struct('it''s', 1)"
|
||||
assert (
|
||||
value_to_sql({"o'brien": "d'angelo"}) == "named_struct('o''brien', 'd''angelo')"
|
||||
)
|
||||
# Escaping also applies to keys of nested structs.
|
||||
assert (
|
||||
value_to_sql({"outer": {"in'r": 1}})
|
||||
== "named_struct('outer', named_struct('in''r', 1))"
|
||||
)
|
||||
|
||||
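The key-escaping rule exercised by the test above can be sketched in a few lines. This is a minimal illustration, not the library's `value_to_sql`: the names `quote_sql_string` and `to_sql_literal` are hypothetical, and only strings, plain numbers, and dicts are handled.

```python
def quote_sql_string(text: str) -> str:
    """Wrap text in single quotes, doubling any embedded single quote."""
    return "'" + text.replace("'", "''") + "'"


def to_sql_literal(value) -> str:
    """Render a small subset of Python values as SQL literals.

    Hypothetical sketch: dict keys go through the same quoting as string
    values, so a key containing a single quote cannot break the statement.
    """
    if isinstance(value, bool):
        # Booleans are deliberately left out of this sketch.
        raise TypeError("bool is not handled in this sketch")
    if isinstance(value, str):
        return quote_sql_string(value)
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, dict):
        parts = []
        for key, item in value.items():
            parts.append(quote_sql_string(str(key)))  # escape the key too
            parts.append(to_sql_literal(item))        # recurse for nested structs
        return "named_struct(" + ", ".join(parts) + ")"
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Routing keys and values through one quoting helper keeps the two code paths from drifting apart, which is the failure mode the test guards against.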
|
||||
def test_value_to_sql_numpy_scalars():
|
||||
# numpy scalars (e.g. pulled from an ndarray or a pandas column) must
|
||||
# convert the same way as their native Python counterparts. np.float64
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
use chrono::{DateTime, Utc};
|
||||
use lancedb::index::vector::{
|
||||
IvfFlatIndexBuilder, IvfHnswFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder,
|
||||
IvfPqIndexBuilder, IvfRqIndexBuilder, IvfSqIndexBuilder,
|
||||
@@ -13,7 +12,7 @@ use lancedb::index::{
|
||||
use pyo3::IntoPyObject;
|
||||
use pyo3::types::PyStringMethods;
|
||||
use pyo3::{
|
||||
Bound, FromPyObject, Py, PyAny, PyResult, Python,
|
||||
Bound, FromPyObject, PyAny, PyResult, Python,
|
||||
exceptions::{PyKeyError, PyValueError},
|
||||
intern, pyclass, pymethods,
|
||||
types::{PyAnyMethods, PyString},
|
||||
@@ -295,77 +294,15 @@ pub struct IndexConfig {
|
||||
pub columns: Vec<String>,
|
||||
/// Name of the index.
|
||||
pub name: String,
|
||||
/// The UUID of the first segment of the index.
|
||||
pub index_uuid: Option<String>,
|
||||
/// The protobuf type URL, a precise type identifier for the index.
|
||||
pub type_url: Option<String>,
|
||||
/// When the index was created.
|
||||
pub created_at: Option<DateTime<Utc>>,
|
||||
/// The number of rows indexed, across all segments.
|
||||
pub num_indexed_rows: Option<u64>,
|
||||
/// The number of rows not yet covered by this index.
|
||||
pub num_unindexed_rows: Option<u64>,
|
||||
/// The total size in bytes of all index files across all segments.
|
||||
pub size_bytes: Option<u64>,
|
||||
/// The number of segments that make up the index.
|
||||
pub num_segments: Option<u32>,
|
||||
/// The on-disk index format version.
|
||||
pub index_version: Option<i32>,
|
||||
/// Index-type-specific details parsed as a Python object (dict, list, etc.).
|
||||
///
|
||||
/// Falls back to a raw string if JSON parsing fails. `None` when unavailable.
|
||||
pub index_details: Option<Py<PyAny>>,
|
||||
}
|
||||
|
||||
#[pymethods]
|
||||
impl IndexConfig {
|
||||
pub fn __repr__(&self, py: Python<'_>) -> String {
|
||||
let mut fields = vec![
|
||||
format!("name={:?}", self.name),
|
||||
format!("index_type={:?}", self.index_type),
|
||||
format!("columns={:?}", self.columns),
|
||||
];
|
||||
if let Some(v) = &self.index_uuid {
|
||||
fields.push(format!("index_uuid={:?}", v));
|
||||
}
|
||||
if let Some(v) = &self.type_url {
|
||||
fields.push(format!("type_url={:?}", v));
|
||||
}
|
||||
if let Some(v) = self.created_at {
|
||||
// Render the datetime's own Python repr so the value round-trips,
|
||||
// falling back to RFC 3339 if the conversion ever fails.
|
||||
let rendered = v
|
||||
.into_pyobject(py)
|
||||
.ok()
|
||||
.and_then(|obj| obj.into_any().repr().ok())
|
||||
.map(|r| r.to_string())
|
||||
.unwrap_or_else(|| v.to_rfc3339());
|
||||
fields.push(format!("created_at={}", rendered));
|
||||
}
|
||||
if let Some(v) = self.num_indexed_rows {
|
||||
fields.push(format!("num_indexed_rows={}", fmt_thousands(v)));
|
||||
}
|
||||
if let Some(v) = self.num_unindexed_rows {
|
||||
fields.push(format!("num_unindexed_rows={}", fmt_thousands(v)));
|
||||
}
|
||||
if let Some(v) = self.size_bytes {
|
||||
fields.push(format!("size_bytes={}", fmt_thousands(v)));
|
||||
}
|
||||
if let Some(v) = self.num_segments {
|
||||
fields.push(format!("num_segments={}", v));
|
||||
}
|
||||
if let Some(v) = self.index_version {
|
||||
fields.push(format!("index_version={}", v));
|
||||
}
|
||||
if let Some(v) = &self.index_details {
|
||||
let details = v
|
||||
.bind(py)
|
||||
.repr()
|
||||
.map(|r| r.to_string())
|
||||
.unwrap_or_else(|_| "<unavailable>".to_string());
|
||||
fields.push(format!("index_details={}", details));
|
||||
}
|
||||
format!("IndexConfig({})", fields.join(", "))
|
||||
pub fn __repr__(&self) -> String {
|
||||
format!(
|
||||
"Index({}, columns={:?}, name=\"{}\")",
|
||||
self.index_type, self.columns, self.name
|
||||
)
|
||||
}
|
||||
|
||||
// For backwards-compatibility with the old sync SDK, we also support getting
|
||||
@@ -375,66 +312,18 @@ impl IndexConfig {
|
||||
"index_type" => Ok(self.index_type.clone().into_pyobject(py)?.into_any()),
|
||||
"columns" => Ok(self.columns.clone().into_pyobject(py)?.into_any()),
|
||||
"name" | "index_name" => Ok(self.name.clone().into_pyobject(py)?.into_any()),
|
||||
"index_uuid" => Ok(self.index_uuid.clone().into_pyobject(py)?.into_any()),
|
||||
"type_url" => Ok(self.type_url.clone().into_pyobject(py)?.into_any()),
|
||||
"created_at" => Ok(self.created_at.into_pyobject(py)?.into_any()),
|
||||
"num_indexed_rows" => Ok(self.num_indexed_rows.into_pyobject(py)?.into_any()),
|
||||
"num_unindexed_rows" => Ok(self.num_unindexed_rows.into_pyobject(py)?.into_any()),
|
||||
"size_bytes" => Ok(self.size_bytes.into_pyobject(py)?.into_any()),
|
||||
"num_segments" => Ok(self.num_segments.into_pyobject(py)?.into_any()),
|
||||
"index_version" => Ok(self.index_version.into_pyobject(py)?.into_any()),
|
||||
"index_details" => Ok(self
|
||||
.index_details
|
||||
.as_ref()
|
||||
.map(|obj| obj.clone_ref(py))
|
||||
.into_pyobject(py)?
|
||||
.into_any()),
|
||||
_ => Err(PyKeyError::new_err(format!("Invalid key: {}", key))),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Format an integer with `_` thousands separators, e.g. `24_500_213`.
|
||||
///
|
||||
/// Underscores are valid Python int-literal syntax, so the repr stays
|
||||
/// copy-pasteable and machine-parseable while remaining readable.
|
||||
fn fmt_thousands(n: u64) -> String {
|
||||
let digits = n.to_string();
|
||||
let bytes = digits.as_bytes();
|
||||
let mut out = String::with_capacity(digits.len() + digits.len() / 3);
|
||||
for (i, b) in bytes.iter().enumerate() {
|
||||
if i > 0 && (bytes.len() - i).is_multiple_of(3) {
|
||||
out.push('_');
|
||||
}
|
||||
out.push(*b as char);
|
||||
}
|
||||
out
|
||||
}
|
||||
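The grouping rule in `fmt_thousands` can be checked in isolation. The following is a minimal sketch, not the code from this diff: the name `fmt_thousands_sketch` is hypothetical, and it uses `% 3 == 0` in place of `is_multiple_of` so it also builds on older toolchains.

```rust
/// Sketch: group decimal digits in threes with `_` separators.
/// `fmt_thousands_sketch` is a hypothetical name used only for this example.
fn fmt_thousands_sketch(n: u64) -> String {
    let digits = n.to_string();
    let len = digits.len();
    let mut out = String::with_capacity(len + len / 3);
    for (i, ch) in digits.chars().enumerate() {
        // Emit a separator when the count of remaining digits is a multiple
        // of three, but never before the first digit.
        if i > 0 && (len - i) % 3 == 0 {
            out.push('_');
        }
        out.push(ch);
    }
    out
}
```

Values below 1000 come back unchanged, and the separator positions are counted from the right, so a leading group may have one or two digits.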
|
||||
fn parse_index_details(py: Python<'_>, s: String) -> Py<PyAny> {
|
||||
let json = py.import("json").expect("json module is always available");
|
||||
match json.call_method1("loads", (s.as_str(),)) {
|
||||
Ok(obj) => obj.into_any().unbind(),
|
||||
Err(_) => s.into_pyobject(py).unwrap().into_any().unbind(),
|
||||
}
|
||||
}
|
||||
|
||||
impl IndexConfig {
|
||||
pub fn from_lancedb(py: Python<'_>, value: lancedb::index::IndexConfig) -> Self {
|
||||
impl From<lancedb::index::IndexConfig> for IndexConfig {
|
||||
fn from(value: lancedb::index::IndexConfig) -> Self {
|
||||
let index_type = format!("{:?}", value.index_type);
|
||||
Self {
|
||||
index_type,
|
||||
columns: value.columns,
|
||||
name: value.name,
|
||||
index_uuid: value.index_uuid,
|
||||
type_url: value.type_url,
|
||||
created_at: value.created_at,
|
||||
num_indexed_rows: value.num_indexed_rows,
|
||||
num_unindexed_rows: value.num_unindexed_rows,
|
||||
size_bytes: value.size_bytes,
|
||||
num_segments: value.num_segments,
|
||||
index_version: value.index_version,
|
||||
index_details: value.index_details.map(|s| parse_index_details(py, s)),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -6,7 +6,6 @@ use crate::runtime::future_into_py;
|
||||
use crate::{
|
||||
connection::Connection,
|
||||
error::PythonErrorExt,
|
||||
expr::PyExpr,
|
||||
index::{IndexConfig, extract_index_params},
|
||||
query::{Query, TakeQuery},
|
||||
table::scannable::PyScannable,
|
||||
@@ -29,12 +28,6 @@ use pyo3::{
|
||||
|
||||
mod scannable;
|
||||
|
||||
#[derive(FromPyObject)]
|
||||
enum PredicateArg {
|
||||
Expr(PyExpr),
|
||||
Sql(String),
|
||||
}
|
||||
|
||||
/// Statistics about a compaction operation.
|
||||
#[pyclass(get_all, from_py_object)]
|
||||
#[derive(Clone, Debug)]
|
||||
@@ -568,15 +561,10 @@ impl Table {
|
||||
})
|
||||
}
|
||||
|
||||
#[allow(private_interfaces)]
|
||||
pub fn delete(self_: PyRef<'_, Self>, condition: PredicateArg) -> PyResult<Bound<'_, PyAny>> {
|
||||
pub fn delete(self_: PyRef<'_, Self>, condition: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let result = match &condition {
|
||||
PredicateArg::Expr(e) => inner.delete(&e.0).await,
|
||||
PredicateArg::Sql(s) => inner.delete(s.as_str()).await,
|
||||
}
|
||||
.infer_error()?;
|
||||
let result = inner.delete(&condition).await.infer_error()?;
|
||||
Ok(DeleteResult::from(result))
|
||||
})
|
||||
}
|
||||
@@ -694,13 +682,13 @@ impl Table {
|
||||
pub fn list_indices(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let indices = inner.list_indices().await.infer_error()?;
|
||||
Python::attach(|py| {
|
||||
Ok(indices
|
||||
.into_iter()
|
||||
.map(|idx| IndexConfig::from_lancedb(py, idx))
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
Ok(inner
|
||||
.list_indices()
|
||||
.await
|
||||
.infer_error()?
|
||||
.into_iter()
|
||||
.map(IndexConfig::from)
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
@@ -971,13 +959,8 @@ impl Table {
|
||||
builder.when_not_matched_insert_all();
|
||||
}
|
||||
if parameters.when_not_matched_by_source_delete {
|
||||
if let Some(e) = parameters.when_not_matched_by_source_condition_expr {
|
||||
builder.when_not_matched_by_source_delete_expr(e.0);
|
||||
} else {
|
||||
builder.when_not_matched_by_source_delete(
|
||||
parameters.when_not_matched_by_source_condition,
|
||||
);
|
||||
}
|
||||
builder
|
||||
.when_not_matched_by_source_delete(parameters.when_not_matched_by_source_condition);
|
||||
}
|
||||
if let Some(timeout) = parameters.timeout {
|
||||
builder.timeout(timeout);
|
||||
@@ -1213,7 +1196,6 @@ pub struct MergeInsertParams {
|
||||
when_not_matched_insert_all: bool,
|
||||
when_not_matched_by_source_delete: bool,
|
||||
when_not_matched_by_source_condition: Option<String>,
|
||||
when_not_matched_by_source_condition_expr: Option<PyExpr>,
|
||||
timeout: Option<std::time::Duration>,
|
||||
use_index: Option<bool>,
|
||||
use_lsm_write: Option<bool>,
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "lancedb"
|
||||
version = "0.31.0-beta.1"
|
||||
version = "0.30.1-beta.2"
|
||||
edition.workspace = true
|
||||
description = "LanceDB: A serverless, low-latency vector database for AI applications"
|
||||
license.workspace = true
|
||||
|
||||
@@ -1,435 +0,0 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
//! Lance blob v2 columns store large binary payloads out of line.
|
||||
//!
|
||||
//! Declare a column with [`blob`]. On write, [`crate::table::Table::add`] coerces
|
||||
//! raw `Binary` / `LargeBinary` into the blob struct layout. Queries return
|
||||
//! small descriptors, not bytes.
|
||||
//!
|
||||
//! Blob tables require Lance file format >= 2.2 and stable row ids at create.
|
||||
|
||||
use std::sync::Arc;
|
||||
|
||||
use arrow_array::builder::LargeBinaryBuilder;
|
||||
use arrow_array::{Array, LargeBinaryArray, RecordBatch, StructArray, UInt8Array, UInt64Array};
|
||||
use arrow_schema::{DataType, Field, Schema};
|
||||
use lance::dataset::{Dataset, WriteParams};
|
||||
use lance_arrow::FieldExt;
|
||||
use lance_core::datatypes::parse_field_path;
|
||||
use lance_encoding::version::LanceFileVersion;
|
||||
|
||||
use crate::error::{Error, Result};
|
||||
|
||||
pub use lance::dataset::BlobFile;
|
||||
|
||||
/// Creates an Arrow field for a Lance blob v2 column.
|
||||
///
|
||||
/// `Struct<data, uri>` with the `lance.blob.v2` marker. Same layout Lance
|
||||
/// expects on write.
|
||||
///
|
||||
/// A blob column may be top-level or nested inside a struct or list. Nested
|
||||
/// blobs are addressed by a dotted path (e.g. `info.blob`) in the read APIs.
|
||||
///
|
||||
/// ```
|
||||
/// use arrow_schema::{DataType, Field, Schema};
|
||||
///
|
||||
/// let schema = Schema::new(vec![
|
||||
/// Field::new("id", DataType::Int64, false),
|
||||
/// lancedb::blob("image", true),
|
||||
/// ]);
|
||||
/// ```
|
||||
pub fn blob(name: impl AsRef<str>, nullable: bool) -> Field {
|
||||
lance::blob::blob_field(name.as_ref(), nullable)
|
||||
}
|
||||
|
||||
/// Returns true if `field` is a blob v2 column.
|
||||
///
|
||||
/// ```
|
||||
/// let field = lancedb::blob("image", true);
|
||||
/// assert!(lancedb::blob::is_blob(&field));
|
||||
/// ```
|
||||
pub fn is_blob(field: &Field) -> bool {
|
||||
field.is_blob_v2()
|
||||
}
|
||||
|
||||
/// Returns true if `field`, or any field nested under it, is a blob v2 column.
|
||||
fn field_tree_has_blob_v2(field: &Field) -> bool {
|
||||
if field.is_blob_v2() {
|
||||
return true;
|
||||
}
|
||||
match field.data_type() {
|
||||
DataType::Struct(children) => children.iter().any(|c| field_tree_has_blob_v2(c)),
|
||||
DataType::List(child) | DataType::LargeList(child) | DataType::FixedSizeList(child, _) => {
|
||||
field_tree_has_blob_v2(child)
|
||||
}
|
||||
_ => false,
|
||||
}
|
||||
}
|
||||
|
||||
/// Collects the dotted paths of blob v2 columns under `field`, into `paths`.
|
||||
fn collect_blob_paths(field: &Field, prefix: &str, paths: &mut Vec<String>) {
|
||||
let path = if prefix.is_empty() {
|
||||
field.name().clone()
|
||||
} else {
|
||||
format!("{prefix}.{}", field.name())
|
||||
};
|
||||
if field.is_blob_v2() {
|
||||
paths.push(path);
|
||||
return;
|
||||
}
|
||||
match field.data_type() {
|
||||
DataType::Struct(children) => {
|
||||
for child in children {
|
||||
collect_blob_paths(child, &path, paths);
|
||||
}
|
||||
}
|
||||
DataType::List(child) | DataType::LargeList(child) | DataType::FixedSizeList(child, _) => {
|
||||
collect_blob_paths(child, &path, paths)
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
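The dotted-path traversal in `collect_blob_paths` can be illustrated without Arrow. This is a sketch under stated assumptions: `Node` is a hypothetical stand-in for an Arrow `Field` (a name, a blob marker, and child fields), and it does not model list item fields separately from struct children.

```rust
/// Hypothetical stand-in for an Arrow field; not a type from this diff.
struct Node {
    name: &'static str,
    is_blob: bool,
    children: Vec<Node>,
}

/// Depth-first walk that records the dotted path of every blob node.
fn collect_paths(node: &Node, prefix: &str, paths: &mut Vec<String>) {
    let path = if prefix.is_empty() {
        node.name.to_string()
    } else {
        format!("{prefix}.{}", node.name)
    };
    if node.is_blob {
        // A blob is a leaf for this purpose: record it and stop descending.
        paths.push(path);
        return;
    }
    for child in &node.children {
        collect_paths(child, &path, paths);
    }
}

/// Collects blob paths for all top-level nodes, declaration order preserved.
fn blob_paths(roots: &[Node]) -> Vec<String> {
    let mut paths = Vec::new();
    for root in roots {
        collect_paths(root, "", &mut paths);
    }
    paths
}

/// Example schema: `id`, `info { name, blob }`, and a top-level `image` blob.
fn sample_schema() -> Vec<Node> {
    vec![
        Node { name: "id", is_blob: false, children: vec![] },
        Node {
            name: "info",
            is_blob: false,
            children: vec![
                Node { name: "name", is_blob: false, children: vec![] },
                Node { name: "blob", is_blob: true, children: vec![] },
            ],
        },
        Node { name: "image", is_blob: true, children: vec![] },
    ]
}
```

For `sample_schema()`, `blob_paths` yields `info.blob` followed by `image`, matching the declaration order.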
|
||||
/// Returns true if `schema` declares any blob v2 column, including nested ones.
|
||||
pub(crate) fn has_blob_columns(schema: &Schema) -> bool {
|
||||
schema.fields().iter().any(|f| field_tree_has_blob_v2(f))
|
||||
}
|
||||
|
||||
/// Blob v2 column paths in `schema`, declaration order preserved. Nested blobs
|
||||
/// are dotted paths (e.g. `info.blob`).
|
||||
pub(crate) fn blob_column_names(schema: &Schema) -> Vec<String> {
|
||||
let mut paths = Vec::new();
|
||||
for field in schema.fields() {
|
||||
collect_blob_paths(field, "", &mut paths);
|
||||
}
|
||||
paths
|
||||
}
|
||||
|
||||
/// Bumps storage format to at least [`LanceFileVersion::V2_2`] for blob schemas.
|
||||
pub(crate) fn ensure_blob_storage_version(schema: &Schema, params: &mut WriteParams) {
|
||||
if !has_blob_columns(schema) {
|
||||
return;
|
||||
}
|
||||
|
||||
let resolved = params
|
||||
.data_storage_version
|
||||
.unwrap_or(LanceFileVersion::Stable)
|
||||
.resolve();
|
||||
if resolved < LanceFileVersion::V2_2 {
|
||||
params.data_storage_version = Some(LanceFileVersion::V2_2);
|
||||
}
|
||||
}
|
||||
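The "bump to at least V2_2" rule can be sketched with a plain ordered enum. `FileVersion` and `ensure_min_version` are hypothetical names for this example; the real code works on `LanceFileVersion` and `WriteParams`, and what `Stable` resolves to is passed in here as the `default` argument instead of being assumed.

```rust
/// Hypothetical ordered version enum; the derive order defines `<`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum FileVersion {
    V2_0,
    V2_1,
    V2_2,
    V2_3,
}

/// Returns the storage version to record after applying the blob rule.
/// `default` is what an unset request would resolve to (an assumption
/// supplied by the caller in this sketch).
fn ensure_min_version(
    has_blob: bool,
    requested: Option<FileVersion>,
    default: FileVersion,
) -> Option<FileVersion> {
    if !has_blob {
        // Schemas without blob columns are left untouched.
        return requested;
    }
    let resolved = requested.unwrap_or(default);
    if resolved < FileVersion::V2_2 {
        Some(FileVersion::V2_2)
    } else {
        // Already new enough: keep the caller's choice, including "unset".
        requested
    }
}
```

A lower explicit version is overridden, a higher one is kept, and a non-blob schema never changes the parameter, which mirrors the three `storage_version_*` tests further down in this file.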
|
||||
/// Validate that `column` exists and is a blob v2 column.
|
||||
///
|
||||
/// Legacy v1 columns (`lance-encoding:blob`) error with a migration hint.
|
||||
pub(crate) fn ensure_blob_v2_column(
|
||||
schema: &lance_core::datatypes::Schema,
|
||||
column: &str,
|
||||
) -> Result<()> {
|
||||
match schema.field(column) {
|
||||
Some(field) if field.is_blob_v2() => Ok(()),
|
||||
Some(field) if field.is_blob() => Err(Error::InvalidInput {
|
||||
message: format!(
|
||||
"column '{column}' is a legacy blob column; blob APIs require blob v2 columns \
|
||||
(ARROW:extension:name = \"lance.blob.v2\")"
|
||||
),
|
||||
}),
|
||||
Some(_) => Err(Error::InvalidInput {
|
||||
message: format!("column '{column}' is not a blob column"),
|
||||
}),
|
||||
None => Err(Error::InvalidInput {
|
||||
message: format!("no column named '{column}' in this table"),
|
||||
}),
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the leaf descriptor `StructArray` for `column` in a descriptor batch.
|
||||
fn leaf_descriptor_struct<'a>(batch: &'a RecordBatch, column: &str) -> Result<&'a StructArray> {
|
||||
let path = parse_field_path(column).map_err(|e| Error::InvalidInput {
|
||||
message: format!("invalid blob column path '{column}': {e}"),
|
||||
})?;
|
||||
let not_struct = || Error::Runtime {
|
||||
message: format!("blob column '{column}' did not read back as a descriptor struct"),
|
||||
};
|
||||
let mut current = batch
|
||||
.column_by_name(&path[0])
|
||||
.and_then(|c| c.as_any().downcast_ref::<StructArray>())
|
||||
.ok_or_else(not_struct)?;
|
||||
for segment in &path[1..] {
|
||||
current = current
|
||||
.column_by_name(segment)
|
||||
.and_then(|c| c.as_any().downcast_ref::<StructArray>())
|
||||
.ok_or_else(not_struct)?;
|
||||
}
|
||||
Ok(current)
|
||||
}
|
||||
|
||||
/// Null rows in `row_ids`, from a descriptor take.
|
||||
///
|
||||
/// Lance `read_blobs` / `take_blobs` skip null rows (`kind == 0 && position == 0 && size == 0`).
|
||||
/// TODO(lance): aligned read API would drop this pass.
|
||||
async fn blob_null_mask(
|
||||
dataset: &Arc<Dataset>,
|
||||
column: &str,
|
||||
row_ids: &[u64],
|
||||
) -> Result<Vec<bool>> {
|
||||
let projection = dataset.schema().project(&[column])?;
|
||||
let descriptors = dataset.take_builder(row_ids, projection)?.execute().await?;
|
||||
if descriptors.num_rows() != row_ids.len() {
|
||||
return Err(Error::InvalidInput {
|
||||
message: format!(
|
||||
"blob take for column '{column}' requested {} row ids but only {} exist in the \
|
||||
table; pass row ids collected from this table",
|
||||
row_ids.len(),
|
||||
descriptors.num_rows()
|
||||
),
|
||||
});
|
||||
}
|
||||
let descriptor_struct = leaf_descriptor_struct(&descriptors, column)?;
|
||||
let child = |name: &str| {
|
||||
descriptor_struct
|
||||
.column_by_name(name)
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor for '{column}' is missing the '{name}' field"),
|
||||
})
|
||||
};
|
||||
let kinds = child("kind")?
|
||||
.as_any()
|
||||
.downcast_ref::<UInt8Array>()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor 'kind' for '{column}' is not a UInt8 array"),
|
||||
})?;
|
||||
let positions = child("position")?
|
||||
.as_any()
|
||||
.downcast_ref::<UInt64Array>()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor 'position' for '{column}' is not a UInt64 array"),
|
||||
})?;
|
||||
let sizes = child("size")?
|
||||
.as_any()
|
||||
.downcast_ref::<UInt64Array>()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor 'size' for '{column}' is not a UInt64 array"),
|
||||
})?;
|
||||
|
||||
// Match Lance `collect_blob_entries_v2` skip condition (`BlobKind::Inline` == 0).
|
||||
Ok((0..descriptor_struct.len())
|
||||
.map(|i| {
|
||||
descriptor_struct.is_null(i)
|
||||
|| kinds.is_null(i)
|
||||
|| (kinds.value(i) == 0 && positions.value(i) == 0 && sizes.value(i) == 0)
|
||||
})
|
||||
.collect())
|
||||
}
|
||||
|
||||
fn non_null_row_ids(row_ids: &[u64], null_mask: &[bool]) -> Vec<u64> {
|
||||
row_ids
|
||||
.iter()
|
||||
.zip(null_mask)
|
||||
.filter_map(|(row_id, is_null)| (!is_null).then_some(*row_id))
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// Materialize blob bytes for `row_ids` (same length and order, nulls preserved).
|
||||
pub(crate) async fn take_blobs_aligned(
|
||||
dataset: &Arc<Dataset>,
|
||||
column: &str,
|
||||
row_ids: &[u64],
|
||||
) -> Result<LargeBinaryArray> {
|
||||
ensure_blob_v2_column(dataset.schema(), column)?;
|
||||
if row_ids.is_empty() {
|
||||
return Ok(LargeBinaryBuilder::new().finish());
|
||||
}
|
||||
|
||||
let null_mask = blob_null_mask(dataset, column, row_ids).await?;
|
||||
let non_null_row_ids = non_null_row_ids(row_ids, &null_mask);
|
||||
let non_null_count = non_null_row_ids.len();
|
||||
let payloads = if non_null_count == 0 {
|
||||
Vec::new()
|
||||
} else {
|
||||
dataset
|
||||
.read_blobs(column)?
|
||||
.with_row_ids(non_null_row_ids)
|
||||
.preserve_order(true)
|
||||
.execute()
|
||||
.await?
|
||||
};
|
||||
|
||||
if payloads.len() != non_null_count {
|
||||
return Err(Error::Runtime {
|
||||
message: format!(
|
||||
"blob read for column '{column}' returned {} payloads for {} non-null rows",
|
||||
payloads.len(),
|
||||
non_null_count
|
||||
),
|
||||
});
|
||||
}
|
||||
|
||||
let mut builder = LargeBinaryBuilder::new();
|
||||
let mut payload_idx = 0;
|
||||
for is_null in &null_mask {
|
||||
if *is_null {
|
||||
builder.append_null();
|
||||
} else {
|
||||
builder.append_value(payloads[payload_idx].data.as_ref());
|
||||
payload_idx += 1;
|
||||
}
|
||||
}
|
||||
Ok(builder.finish())
|
||||
}
|
||||
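The alignment technique used by `take_blobs_aligned` (filter out null rows, read the rest densely, then re-expand to the original positions) can be sketched on plain vectors. The helper names `non_null_ids` and `realign` are hypothetical, and the dense read is replaced by a caller-supplied `Vec`.

```rust
/// Keeps only the row ids whose mask entry is `false` (that is, non-null).
fn non_null_ids(row_ids: &[u64], null_mask: &[bool]) -> Vec<u64> {
    row_ids
        .iter()
        .zip(null_mask)
        .filter_map(|(id, is_null)| (!is_null).then_some(*id))
        .collect()
}

/// Re-expands a dense result to the mask's length, with `None` at null rows.
/// Returns `None` overall if the dense length does not match the number of
/// non-null rows (the source reports this case as a runtime error).
fn realign<T>(null_mask: &[bool], dense: Vec<T>) -> Option<Vec<Option<T>>> {
    let expected = null_mask.iter().filter(|is_null| !**is_null).count();
    if dense.len() != expected {
        return None;
    }
    let mut dense = dense.into_iter();
    Some(
        null_mask
            .iter()
            .map(|is_null| if *is_null { None } else { dense.next() })
            .collect(),
    )
}
```

With a mask of `[false, true, false]` and ids `[10, 11, 12]`, only ids 10 and 12 are read, and the two payloads are placed back at positions 0 and 2 with a null in between.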
|
||||
/// Open lazy [`BlobFile`] handles for `row_ids` (same length and order, nulls as `None`).
|
||||
pub(crate) async fn take_blob_files_aligned(
|
||||
dataset: &Arc<Dataset>,
|
||||
column: &str,
|
||||
row_ids: &[u64],
|
||||
) -> Result<Vec<Option<BlobFile>>> {
|
||||
ensure_blob_v2_column(dataset.schema(), column)?;
|
||||
if row_ids.is_empty() {
|
||||
return Ok(Vec::new());
|
||||
}
|
||||
|
||||
let null_mask = blob_null_mask(dataset, column, row_ids).await?;
|
||||
let non_null_row_ids = non_null_row_ids(row_ids, &null_mask);
|
||||
let handles = if non_null_row_ids.is_empty() {
|
||||
Vec::new()
|
||||
} else {
|
||||
dataset.take_blobs(&non_null_row_ids, column).await?
|
||||
};
|
||||
if handles.len() != non_null_row_ids.len() {
|
||||
return Err(Error::Runtime {
|
||||
message: format!(
|
||||
"blob take for column '{column}' returned {} handles for {} non-null rows",
|
||||
handles.len(),
|
||||
non_null_row_ids.len()
|
||||
),
|
||||
});
|
||||
}
|
||||
|
||||
let mut handles = handles.into_iter();
|
||||
Ok(null_mask
|
||||
.iter()
|
||||
.map(|is_null| {
|
||||
if *is_null {
|
||||
None
|
||||
} else {
|
||||
Some(handles.next().unwrap())
|
||||
}
|
||||
})
|
||||
.collect())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use arrow_schema::DataType;
|
||||
use lance_arrow::ARROW_EXT_NAME_KEY;
|
||||
|
||||
fn blob_schema() -> Schema {
|
||||
Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
blob("image", true),
|
||||
])
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn blob_field_carries_v2_extension_marker() {
|
||||
let field = blob("image", true);
|
||||
assert_eq!(
|
||||
field.metadata().get(ARROW_EXT_NAME_KEY).map(String::as_str),
|
||||
Some("lance.blob.v2")
|
||||
);
|
||||
assert!(matches!(field.data_type(), DataType::Struct(_)));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn has_blob_columns_detects_blob_fields() {
|
||||
assert!(has_blob_columns(&blob_schema()));
|
||||
let plain = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
|
||||
assert!(!has_blob_columns(&plain));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_bumps_to_v2_2() {
|
||||
let mut params = WriteParams::default();
|
||||
ensure_blob_storage_version(&blob_schema(), &mut params);
|
||||
assert_eq!(
|
||||
params.data_storage_version.unwrap().resolve(),
|
||||
LanceFileVersion::V2_2
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_overrides_lower_explicit_version() {
|
||||
let mut params = WriteParams {
|
||||
data_storage_version: Some(LanceFileVersion::V2_0),
|
||||
..Default::default()
|
||||
};
|
||||
ensure_blob_storage_version(&blob_schema(), &mut params);
|
||||
assert_eq!(
|
||||
params.data_storage_version.unwrap().resolve(),
|
||||
LanceFileVersion::V2_2
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_keeps_higher_explicit_version() {
|
||||
let mut params = WriteParams {
|
||||
data_storage_version: Some(LanceFileVersion::V2_3),
|
||||
..Default::default()
|
||||
};
|
||||
ensure_blob_storage_version(&blob_schema(), &mut params);
|
||||
assert_eq!(params.data_storage_version.unwrap(), LanceFileVersion::V2_3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn legacy_v1_blob_column_is_rejected_with_migration_hint() {
|
||||
let legacy = Field::new("image", DataType::LargeBinary, true).with_metadata(
|
||||
std::collections::HashMap::from([(
|
||||
"lance-encoding:blob".to_string(),
|
||||
"true".to_string(),
|
||||
)]),
|
||||
);
|
||||
let arrow_schema = Schema::new(vec![legacy]);
|
||||
let lance_schema = lance_core::datatypes::Schema::try_from(&arrow_schema).unwrap();
|
||||
|
||||
let err = ensure_blob_v2_column(&lance_schema, "image").unwrap_err();
|
||||
assert!(matches!(err, Error::InvalidInput { .. }));
|
||||
assert!(err.to_string().contains("legacy blob column"));
|
||||
assert!(err.to_string().contains("lance.blob.v2"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_blob_and_unknown_columns_are_rejected_by_name() {
|
||||
let arrow_schema = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
|
||||
let lance_schema = lance_core::datatypes::Schema::try_from(&arrow_schema).unwrap();
|
||||
|
||||
let err = ensure_blob_v2_column(&lance_schema, "id").unwrap_err();
|
||||
assert!(err.to_string().contains("'id' is not a blob column"));
|
||||
|
||||
let err = ensure_blob_v2_column(&lance_schema, "missing").unwrap_err();
|
||||
assert!(err.to_string().contains("no column named 'missing'"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn blob_column_names_includes_nested_path() {
|
||||
let blob_field = blob("blob", true);
|
||||
let info = Field::new(
|
||||
"info",
|
||||
DataType::Struct(vec![Field::new("name", DataType::Utf8, false), blob_field].into()),
|
||||
true,
|
||||
);
|
||||
let schema = Schema::new(vec![Field::new("id", DataType::Int64, false), info]);
|
||||
assert_eq!(blob_column_names(&schema), vec!["info.blob"]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_noop_without_blob_columns() {
|
||||
let schema = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
|
||||
let mut params = WriteParams::default();
|
||||
ensure_blob_storage_version(&schema, &mut params);
|
||||
assert!(params.data_storage_version.is_none());
|
||||
}
|
||||
}
|
||||
@@ -32,7 +32,6 @@ use crate::table::{BaseTable, WriteOptions};
|
||||
|
||||
pub mod listing;
|
||||
pub mod namespace;
|
||||
pub(crate) mod read_freshness;
|
||||
|
||||
pub trait DatabaseOptions {
|
||||
fn serialize_into_map(&self, map: &mut HashMap<String, String>);
|
||||
|
||||
@@ -18,7 +18,6 @@ use lance_table::io::commit::commit_handler_from_url;
|
||||
use object_store::local::LocalFileSystem;
|
||||
use snafu::ResultExt;
|
||||
|
||||
use crate::blob::{ensure_blob_storage_version, has_blob_columns};
|
||||
use crate::connection::ConnectRequest;
|
||||
use crate::database::ReadConsistency;
|
||||
use crate::database::namespace::LanceNamespaceDatabase;
|
||||
@@ -839,16 +838,13 @@ impl ListingDatabase {
|
||||
write_params.enable_v2_manifest_paths = enable_v2_manifest_paths;
|
||||
}
|
||||
|
||||
let data_schema = request.data.arrow_schema();
|
||||
if let Some(enable_stable_row_ids) = stable_row_ids_override
|
||||
.or(self.new_table_config.enable_stable_row_ids)
|
||||
.or(has_blob_columns(&data_schema).then_some(true))
|
||||
// Apply enable_stable_row_ids: table-level override takes precedence over connection config
|
||||
if let Some(enable_stable_row_ids) =
|
||||
stable_row_ids_override.or(self.new_table_config.enable_stable_row_ids)
|
||||
{
|
||||
write_params.enable_stable_row_ids = enable_stable_row_ids;
|
||||
}
|
||||
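The precedence in the variant of this hunk that consults `has_blob_columns` is a chain of `Option::or` calls. A minimal sketch follows; the function name `resolve_stable_row_ids` and its parameter names are hypothetical.

```rust
/// Resolves whether stable row ids should be set, and to what value.
/// Precedence: table-level override, then connection default, then an
/// implicit `true` when the schema declares blob columns.
fn resolve_stable_row_ids(
    table_override: Option<bool>,
    connection_default: Option<bool>,
    has_blob_columns: bool,
) -> Option<bool> {
    table_override
        .or(connection_default)
        // `then_some(true)` is `Some(true)` only when blobs are present.
        .or(has_blob_columns.then_some(true))
}
```

Note that an explicit `Some(false)` from either configuration level wins over the blob default, so the blob rule only applies when nothing was configured.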
|
||||
ensure_blob_storage_version(&data_schema, &mut write_params);
|
||||
|
||||
if matches!(&request.mode, CreateTableMode::Overwrite) {
|
||||
write_params.mode = WriteMode::Overwrite;
|
||||
}
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
//! Namespace-based database implementation that delegates table management to lance-namespace
|
||||
|
||||
use std::collections::{HashMap, HashSet};
|
||||
use std::sync::{Arc, Mutex};
|
||||
use std::sync::Arc;
|
||||
|
||||
use async_trait::async_trait;
|
||||
use lance::io::commit::namespace_manifest::LanceNamespaceExternalManifestStore;
|
||||
@@ -16,23 +16,19 @@ use lance_namespace::{
|
||||
CreateNamespaceRequest, CreateNamespaceResponse, DeclareTableRequest,
|
||||
DescribeNamespaceRequest, DescribeNamespaceResponse, DescribeTableRequest,
|
||||
DropNamespaceRequest, DropNamespaceResponse, DropTableRequest, ListNamespacesRequest,
|
||||
ListNamespacesResponse, ListTablesRequest, ListTablesResponse, RenameTableRequest,
|
||||
ListNamespacesResponse, ListTablesRequest, ListTablesResponse,
|
||||
},
|
||||
};
|
||||
use lance_namespace_impls::ConnectBuilder;
|
||||
use lance_table::io::commit::CommitHandler;
|
||||
use lance_table::io::commit::external_manifest::ExternalManifestCommitHandler;
|
||||
|
||||
use crate::blob::{ensure_blob_storage_version, has_blob_columns};
|
||||
use crate::connection::NamespaceClientPushdownOperation;
|
||||
use crate::database::ReadConsistency;
|
||||
use crate::database::listing::{
|
||||
NewTableConfig, OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS, OPT_NEW_TABLE_STORAGE_VERSION,
|
||||
OPT_NEW_TABLE_V2_MANIFEST_PATHS,
|
||||
};
|
||||
use crate::database::read_freshness::{
|
||||
FreshnessBaselines, ReadFreshnessContextProvider, TableFreshness,
|
||||
};
|
||||
use crate::error::{Error, Result};
|
||||
use crate::table::{NativeTable, map_namespace_lance_error};
|
||||
use lance::dataset::WriteMode;
|
||||
@@ -55,10 +51,6 @@ fn is_table_already_exists_namespace_error(err: &lance::Error) -> bool {
|
||||
false
|
||||
}
|
||||
|
||||
/// Object-id delimiter default (matches `RestNamespaceBuilder`'s); overridable
|
||||
/// via the `delimiter` property.
|
||||
const DEFAULT_NAMESPACE_DELIMITER: &str = "$";
|
||||
|
||||
/// A database implementation that uses lance-namespace for table management
|
||||
pub struct LanceNamespaceDatabase {
|
||||
namespace: Arc<dyn LanceNamespace>,
|
||||
@@ -78,17 +70,6 @@ pub struct LanceNamespaceDatabase {
|
||||
ns_properties: HashMap<String, String>,
|
||||
// Options for tables created by this connection
|
||||
new_table_config: NewTableConfig,
|
||||
// Per-table read-freshness baselines, shared with the context provider.
|
||||
freshness_baselines: FreshnessBaselines,
|
||||
// Delimiter for building freshness keys; see `table_freshness`.
|
||||
delimiter: String,
|
||||
}
|
||||
|
||||
fn resolve_delimiter(ns_properties: &HashMap<String, String>) -> String {
|
||||
ns_properties
|
||||
.get("delimiter")
|
||||
.cloned()
|
||||
.unwrap_or_else(|| DEFAULT_NAMESPACE_DELIMITER.to_string())
|
||||
}
|
||||
|
||||
impl LanceNamespaceDatabase {
|
||||
@@ -101,9 +82,6 @@ impl LanceNamespaceDatabase {
|
||||
session: Option<Arc<lance::session::Session>>,
|
||||
namespace_client_pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
|
||||
) -> Self {
|
||||
// Client is pre-built, so we can't install the freshness provider here;
|
||||
// baselines are still tracked for a uniform bump path.
|
||||
let delimiter = resolve_delimiter(&namespace_client_properties);
|
||||
Self {
|
||||
namespace: namespace_client,
|
||||
storage_options,
|
||||
@@ -114,8 +92,6 @@ impl LanceNamespaceDatabase {
|
||||
ns_impl: namespace_client_impl,
|
||||
ns_properties: namespace_client_properties,
|
||||
new_table_config: NewTableConfig::default(),
|
||||
freshness_baselines: Arc::new(Mutex::new(HashMap::new())),
|
||||
delimiter,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -160,19 +136,10 @@ impl LanceNamespaceDatabase {
|
||||
if let Some(ref sess) = session {
|
||||
builder = builder.session(sess.clone());
|
||||
}
|
||||
|
||||
// Install the read-freshness provider before building the client.
|
||||
let freshness_baselines: FreshnessBaselines = Arc::new(Mutex::new(HashMap::new()));
|
||||
builder = builder.context_provider(Arc::new(ReadFreshnessContextProvider::new(
|
||||
freshness_baselines.clone(),
|
||||
read_consistency_interval,
|
||||
)));
|
||||
|
||||
let namespace = builder.connect().await.map_err(|e| Error::InvalidInput {
|
||||
message: format!("Failed to connect to namespace: {:?}", e),
|
||||
})?;
|
||||
|
||||
let delimiter = resolve_delimiter(&ns_properties);
|
||||
Ok(Self {
|
||||
namespace,
|
||||
storage_options,
|
||||
@@ -183,20 +150,9 @@ impl LanceNamespaceDatabase {
|
||||
ns_impl: ns_impl.to_string(),
|
||||
ns_properties,
|
||||
new_table_config,
|
||||
freshness_baselines,
|
||||
delimiter,
|
||||
})
|
||||
}
|
||||
|
||||
/// Build a table's freshness handle, keyed to match the `object_id` the
|
||||
/// namespace client sends on reads (table-id parts joined by the delimiter).
|
||||
fn table_freshness(&self, namespace_path: &[String], name: &str) -> TableFreshness {
|
||||
let mut parts = namespace_path.to_vec();
|
||||
parts.push(name.to_string());
|
||||
let key = parts.join(&self.delimiter);
|
||||
TableFreshness::new(self.freshness_baselines.clone(), key)
|
||||
}
|
||||
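Delimiter resolution and key construction can be sketched with the standard library alone. The free functions below are hypothetical equivalents of `resolve_delimiter` and the key-building part of `table_freshness`; the `"$"` default and the `delimiter` property name are taken from the code in this diff.

```rust
use std::collections::HashMap;

/// Default object-id delimiter, as declared in this diff.
const DEFAULT_DELIMITER: &str = "$";

/// Reads the `delimiter` property, falling back to the default.
fn resolve_delimiter_sketch(props: &HashMap<String, String>) -> String {
    props
        .get("delimiter")
        .cloned()
        .unwrap_or_else(|| DEFAULT_DELIMITER.to_string())
}

/// Joins the namespace path and table name into one freshness key.
fn freshness_key(namespace_path: &[String], name: &str, delimiter: &str) -> String {
    let mut parts = namespace_path.to_vec();
    parts.push(name.to_string());
    parts.join(delimiter)
}
```

A table `images` under the namespace path `["prod", "ml"]` gets the key `prod$ml$images` with the default delimiter, and a table in the root namespace is keyed by its bare name.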
|
||||
fn extract_storage_overrides(
|
||||
&self,
|
||||
request: &DbCreateTableRequest,
|
||||
@@ -258,16 +214,12 @@ impl LanceNamespaceDatabase {
|
||||
params.enable_v2_manifest_paths = enable_v2_manifest_paths;
|
||||
}
|
||||
|
||||
let data_schema = request.data.schema();
|
||||
if let Some(enable_stable_row_ids) = stable_row_ids_override
|
||||
.or(self.new_table_config.enable_stable_row_ids)
|
||||
.or(has_blob_columns(data_schema.as_ref()).then_some(true))
|
||||
if let Some(enable_stable_row_ids) =
|
||||
stable_row_ids_override.or(self.new_table_config.enable_stable_row_ids)
|
||||
{
|
||||
params.enable_stable_row_ids = enable_stable_row_ids;
|
||||
}
|
||||
|
||||
ensure_blob_storage_version(data_schema.as_ref(), params);
|
||||
|
||||
Ok(())
|
||||
}
|
||||
}
|
||||
@@ -379,8 +331,7 @@ impl Database for LanceNamespaceDatabase {
|
||||
self.pushdown_operations.clone(),
|
||||
self.session.clone(),
|
||||
)
|
||||
.await?
|
||||
.with_freshness(self.table_freshness(&request.namespace_path, &request.name));
|
||||
.await?;
|
||||
|
||||
return Ok(Arc::new(native_table));
|
||||
}
|
||||
@@ -511,8 +462,7 @@ impl Database for LanceNamespaceDatabase {
|
||||
self.pushdown_operations.clone(),
|
||||
self.session.clone(),
|
||||
)
|
||||
.await?
|
||||
.with_freshness(self.table_freshness(&request.namespace_path, &request.name));
|
||||
.await?;
|
||||
|
||||
Ok(Arc::new(native_table))
|
||||
}
|
||||
@@ -528,8 +478,7 @@ impl Database for LanceNamespaceDatabase {
|
||||
self.pushdown_operations.clone(),
|
||||
self.session.clone(),
|
||||
)
|
||||
.await?
|
||||
.with_freshness(self.table_freshness(&request.namespace_path, &request.name));
|
||||
.await?;
|
||||
|
||||
Ok(Arc::new(native_table))
|
||||
}
|
||||
@@ -542,34 +491,14 @@ impl Database for LanceNamespaceDatabase {
|
||||
|
||||
async fn rename_table(
|
||||
&self,
|
||||
cur_name: &str,
|
||||
new_name: &str,
|
||||
cur_namespace_path: &[String],
|
||||
new_namespace_path: &[String],
|
||||
_cur_name: &str,
|
||||
_new_name: &str,
|
||||
_cur_namespace_path: &[String],
|
||||
_new_namespace_path: &[String],
|
||||
) -> Result<()> {
|
||||
let mut cur_table_id = cur_namespace_path.to_vec();
|
||||
cur_table_id.push(cur_name.to_string());
|
||||
|
||||
let new_namespace_id = if new_namespace_path.is_empty() {
|
||||
None
|
||||
} else {
|
||||
Some(new_namespace_path.to_vec())
|
||||
};
|
||||
|
||||
let rename_request = RenameTableRequest {
|
||||
id: Some(cur_table_id),
|
||||
new_table_name: new_name.to_string(),
|
||||
new_namespace_id,
|
||||
..Default::default()
|
||||
};
|
||||
self.namespace
|
||||
.rename_table(rename_request)
|
||||
.await
|
||||
.map_err(|e| Error::Runtime {
|
||||
message: format!("Failed to rename table: {}", e),
|
||||
})?;
|
||||
|
||||
Ok(())
|
||||
Err(Error::NotSupported {
|
||||
message: "rename_table is not supported for namespace connections".to_string(),
|
||||
})
|
||||
}
|
||||
|
||||
async fn drop_table(&self, name: &str, namespace_path: &[String]) -> Result<()> {
|
||||
|
||||
@@ -1,312 +0,0 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
//! Read-freshness signaling for the lance-namespace path.
|
||||
//!
|
||||
//! Against a server that serves cached table metadata up to some staleness
|
||||
//! window, a handle that just wrote (or asked for the latest version via
|
||||
//! `checkout_latest`) can still read a stale snapshot. To prevent that, reads
|
||||
//! routed through the namespace client carry an `x-lancedb-min-timestamp`
|
||||
//! header naming the oldest snapshot the caller will accept.
|
||||
//!
|
||||
//! This mirrors `remote::table`: a per-table baseline is bumped to "now" on
|
||||
//! every write and on `checkout_latest()`, and reads send
|
||||
//! `max(baseline, now - read_consistency_interval)`. Since the namespace client
|
||||
//! takes no headers directly, a [`DynamicContextProvider`] injects it per request.
|
||||
|
||||
use std::collections::HashMap;
|
||||
use std::sync::{Arc, Mutex};
|
||||
use std::time::{Duration, SystemTime};
|
||||
|
||||
use lance_namespace_impls::{DynamicContextProvider, OperationInfo};
|
||||
|
||||
/// Provider context keys prefixed with `headers.` become HTTP headers (prefix
|
||||
/// stripped), so this emits the `x-lancedb-min-timestamp` header.
|
||||
const MIN_TIMESTAMP_CONTEXT_KEY: &str = "headers.x-lancedb-min-timestamp";
|
||||
|
||||
/// Per-table freshness baselines (keyed by namespace object id), shared between
|
||||
/// the provider that reads them and the table handles that bump them.
|
||||
pub type FreshnessBaselines = Arc<Mutex<HashMap<String, SystemTime>>>;
|
||||
|
||||
/// `max(baseline, now - interval)`, or `None` when neither constraint applies.
|
||||
fn compute_min_timestamp(
|
||||
baseline: Option<SystemTime>,
|
||||
interval: Option<Duration>,
|
||||
now: SystemTime,
|
||||
) -> Option<SystemTime> {
|
||||
let interval_based = match interval {
|
||||
None => None,
|
||||
Some(d) if d.is_zero() => Some(now),
|
||||
Some(d) => Some(now.checked_sub(d).unwrap_or(now)),
|
||||
};
|
||||
match (interval_based, baseline) {
|
||||
(None, None) => None,
|
||||
(Some(t), None) | (None, Some(t)) => Some(t),
|
||||
(Some(a), Some(b)) => Some(a.max(b)),
|
||||
}
|
||||
}
|
||||
|
||||
/// Advance the baseline to `now`, never backwards, so a concurrent handle's
|
||||
/// write can't lower a floor another handle already set.
|
||||
fn next_freshness_baseline(prev: Option<SystemTime>, now: SystemTime) -> SystemTime {
|
||||
match prev {
|
||||
Some(p) => p.max(now),
|
||||
None => now,
|
||||
}
|
||||
}
|
||||
|
||||
/// A handle's view of the shared baseline map for a single table.
|
||||
#[derive(Clone, Debug)]
|
||||
pub struct TableFreshness {
|
||||
baselines: FreshnessBaselines,
|
||||
/// Namespace object id for this table (matches the read's `object_id`).
|
||||
key: String,
|
||||
}
|
||||
|
||||
impl TableFreshness {
|
||||
pub fn new(baselines: FreshnessBaselines, key: String) -> Self {
|
||||
Self { baselines, key }
|
||||
}
|
||||
|
||||
pub fn bump(&self) {
|
||||
let now = SystemTime::now();
|
||||
let mut baselines = self.baselines.lock().unwrap();
|
||||
let prev = baselines.get(&self.key).copied();
|
||||
baselines.insert(self.key.clone(), next_freshness_baseline(prev, now));
|
||||
}
|
||||
}
|
||||
|
||||
/// Read ops that can be served stale and so carry the freshness floor.
|
||||
/// `list_table_versions` resolves "latest" for managed-versioning tables, so it
|
||||
/// is what makes `checkout_latest()` observe a prior write.
|
||||
fn is_read_operation(operation: &str) -> bool {
|
||||
matches!(
|
||||
operation,
|
||||
"describe_table" | "list_table_versions" | "query_table" | "list_tables"
|
||||
)
|
||||
}
|
||||
|
||||
/// Injects `x-lancedb-min-timestamp` on namespace reads, per addressed table.
|
||||
#[derive(Debug)]
|
||||
pub struct ReadFreshnessContextProvider {
|
||||
baselines: FreshnessBaselines,
|
||||
read_consistency_interval: Option<Duration>,
|
||||
}
|
||||
|
||||
impl ReadFreshnessContextProvider {
|
||||
pub fn new(baselines: FreshnessBaselines, read_consistency_interval: Option<Duration>) -> Self {
|
||||
Self {
|
||||
baselines,
|
||||
read_consistency_interval,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl DynamicContextProvider for ReadFreshnessContextProvider {
|
||||
fn provide_context(&self, info: &OperationInfo) -> HashMap<String, String> {
|
||||
if !is_read_operation(&info.operation) {
|
||||
return HashMap::new();
|
||||
}
|
||||
|
||||
let baseline = self.baselines.lock().unwrap().get(&info.object_id).copied();
|
||||
match compute_min_timestamp(baseline, self.read_consistency_interval, SystemTime::now()) {
|
||||
Some(ts) => {
|
||||
let dt: chrono::DateTime<chrono::Utc> = ts.into();
|
||||
HashMap::from([(MIN_TIMESTAMP_CONTEXT_KEY.to_string(), dt.to_rfc3339())])
|
||||
}
|
||||
None => HashMap::new(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
/// Allowed slop when comparing a header timestamp against a locally
|
||||
/// captured wall-clock bound. Tests run fast enough that 1s is plenty.
|
||||
const TOLERANCE: Duration = Duration::from_secs(1);
|
||||
|
||||
fn parse_header_ts(headers: &HashMap<String, String>) -> SystemTime {
|
||||
let value = headers
|
||||
.get(MIN_TIMESTAMP_CONTEXT_KEY)
|
||||
.expect("expected min-timestamp context key");
|
||||
chrono::DateTime::parse_from_rfc3339(value)
|
||||
.unwrap()
|
||||
.with_timezone(&chrono::Utc)
|
||||
.into()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_compute_min_timestamp_combines_baseline_and_interval() {
|
||||
let now = SystemTime::now();
|
||||
let baseline = now - Duration::from_secs(60);
|
||||
|
||||
// No interval, no baseline -> no header.
|
||||
assert_eq!(compute_min_timestamp(None, None, now), None);
|
||||
|
||||
// Baseline only -> baseline.
|
||||
assert_eq!(
|
||||
compute_min_timestamp(Some(baseline), None, now),
|
||||
Some(baseline)
|
||||
);
|
||||
|
||||
// ZERO interval, no baseline -> now (strong consistency).
|
||||
assert_eq!(
|
||||
compute_min_timestamp(None, Some(Duration::ZERO), now),
|
||||
Some(now)
|
||||
);
|
||||
|
||||
// Positive interval, no baseline -> now - interval.
|
||||
assert_eq!(
|
||||
compute_min_timestamp(None, Some(Duration::from_secs(10)), now),
|
||||
Some(now - Duration::from_secs(10))
|
||||
);
|
||||
|
||||
// Both: pick the more-recent (tighter) constraint.
|
||||
// baseline = now-60, now-interval = now-10. now-10 is newer.
|
||||
assert_eq!(
|
||||
compute_min_timestamp(Some(baseline), Some(Duration::from_secs(10)), now),
|
||||
Some(now - Duration::from_secs(10))
|
||||
);
|
||||
|
||||
// Both, baseline newer: pick baseline.
|
||||
let recent_baseline = now - Duration::from_secs(5);
|
||||
assert_eq!(
|
||||
compute_min_timestamp(Some(recent_baseline), Some(Duration::from_secs(60)), now),
|
||||
Some(recent_baseline)
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_next_freshness_baseline_is_monotonic() {
|
||||
let now = SystemTime::now();
|
||||
let earlier = now - Duration::from_secs(30);
|
||||
let later = now + Duration::from_secs(30);
|
||||
|
||||
// No prior baseline -> now.
|
||||
assert_eq!(next_freshness_baseline(None, now), now);
|
||||
// Prior baseline older than now -> now.
|
||||
assert_eq!(next_freshness_baseline(Some(earlier), now), now);
|
||||
// Prior baseline newer than now -> keep the newer baseline.
|
||||
assert_eq!(next_freshness_baseline(Some(later), now), later);
|
||||
}
|
||||
|
||||
fn provider_with(
|
||||
entries: &[(&str, SystemTime)],
|
||||
interval: Option<Duration>,
|
||||
) -> ReadFreshnessContextProvider {
|
||||
let map: HashMap<String, SystemTime> =
|
||||
entries.iter().map(|(k, v)| (k.to_string(), *v)).collect();
|
||||
ReadFreshnessContextProvider::new(Arc::new(Mutex::new(map)), interval)
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_provider_emits_header_at_or_after_bumped_baseline() {
|
||||
// A baseline set "now" with no interval: every read op must carry a
|
||||
// floor at or after that baseline. `list_table_versions` is the hook
|
||||
// that makes managed-versioning `checkout_latest()` observe a write.
|
||||
let baseline = SystemTime::now();
|
||||
let provider = provider_with(&[("ns$tbl", baseline)], None);
|
||||
|
||||
// These ops are keyed by the table id, so they pick up the per-table
|
||||
// baseline. (`list_tables` is keyed by the namespace, so it is covered
|
||||
// separately by the interval-floor test.)
|
||||
for op in ["describe_table", "list_table_versions", "query_table"] {
|
||||
let ctx = provider.provide_context(&OperationInfo::new(op, "ns$tbl"));
|
||||
let sent = parse_header_ts(&ctx);
|
||||
assert!(
|
||||
sent >= baseline - TOLERANCE && sent <= baseline + TOLERANCE,
|
||||
"operation {op} should carry a floor at the bumped baseline"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_provider_list_tables_uses_interval_floor_not_table_baseline() {
|
||||
// `list_tables` is addressed by the namespace id, which never matches a
|
||||
// per-table baseline key, so a bumped table baseline must not leak onto
|
||||
// it. With no interval it sends nothing; with one it sends now-interval.
|
||||
let provider = provider_with(&[("ns$tbl", SystemTime::now())], None);
|
||||
let ctx = provider.provide_context(&OperationInfo::new("list_tables", "ns"));
|
||||
assert!(
|
||||
ctx.is_empty(),
|
||||
"list_tables must not inherit a per-table baseline"
|
||||
);
|
||||
|
||||
let interval = Duration::from_secs(30);
|
||||
let provider = provider_with(&[("ns$tbl", SystemTime::now())], Some(interval));
|
||||
let before = SystemTime::now();
|
||||
let ctx = provider.provide_context(&OperationInfo::new("list_tables", "ns"));
|
||||
let after = SystemTime::now();
|
||||
let sent = parse_header_ts(&ctx);
|
||||
assert!(
|
||||
sent >= before - interval - TOLERANCE && sent <= after - interval + TOLERANCE,
|
||||
"list_tables should carry the interval floor"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_provider_no_header_for_empty_baseline_and_no_interval() {
|
||||
// Manual consistency (no interval) on a table that was never bumped:
|
||||
// no floor, so the server may serve from cache.
|
||||
let provider = provider_with(&[], None);
|
||||
let ctx = provider.provide_context(&OperationInfo::new("describe_table", "ns$tbl"));
|
||||
assert!(ctx.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_provider_interval_floor_applies_without_baseline() {
|
||||
// With a consistency interval and no baseline, the floor is now-interval.
|
||||
let interval = Duration::from_secs(30);
|
||||
let provider = provider_with(&[], Some(interval));
|
||||
|
||||
let before = SystemTime::now();
|
||||
let ctx = provider.provide_context(&OperationInfo::new("query_table", "ns$tbl"));
|
||||
let after = SystemTime::now();
|
||||
|
||||
let sent = parse_header_ts(&ctx);
|
||||
assert!(
|
||||
sent >= before - interval - TOLERANCE && sent <= after - interval + TOLERANCE,
|
||||
"expected floor at roughly now - interval"
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_provider_non_read_ops_emit_nothing() {
|
||||
// Even with a fresh baseline and a zero interval, a non-read operation
|
||||
// (which establishes rather than consumes a baseline) sends no header.
|
||||
let provider = provider_with(&[("ns$tbl", SystemTime::now())], Some(Duration::ZERO));
|
||||
for op in [
|
||||
"create_table",
|
||||
"register_table",
|
||||
"drop_table",
|
||||
"rename_table",
|
||||
// Pinned to an immutable version, so it cannot be served stale.
|
||||
"describe_table_version",
|
||||
] {
|
||||
let ctx = provider.provide_context(&OperationInfo::new(op, "ns$tbl"));
|
||||
assert!(
|
||||
ctx.is_empty(),
|
||||
"operation {op} must not send a freshness header"
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_provider_uses_per_table_baseline() {
|
||||
// The floor is looked up by object id, so an unrelated table's baseline
|
||||
// does not leak onto another table's read.
|
||||
let baseline = SystemTime::now();
|
||||
let provider = provider_with(&[("ns$has_baseline", baseline)], None);
|
||||
|
||||
// The bumped table gets a header.
|
||||
let hit =
|
||||
provider.provide_context(&OperationInfo::new("describe_table", "ns$has_baseline"));
|
||||
assert!(!hit.is_empty());
|
||||
|
||||
// A different table with no baseline (and no interval) gets nothing.
|
||||
let miss = provider.provide_context(&OperationInfo::new("describe_table", "ns$other"));
|
||||
assert!(miss.is_empty());
|
||||
}
|
||||
}
|
||||
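The floor logic in the hunk above (`compute_min_timestamp` plus `next_freshness_baseline`) can be followed on a fixed timeline. The sketch below is a std-only restatement, not the module's code: `Baselines`, `bump`, and `read_floor` are hypothetical names, and timestamps are built from `UNIX_EPOCH` so the outcome does not depend on the wall clock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Shared per-table baselines keyed by a table id string (illustrative alias).
type Baselines = Arc<Mutex<HashMap<String, SystemTime>>>;

/// Record a write at `now`, never moving an existing baseline backwards.
fn bump(baselines: &Baselines, key: &str, now: SystemTime) {
    let mut map = baselines.lock().unwrap();
    let next = match map.get(key) {
        Some(prev) => (*prev).max(now),
        None => now,
    };
    map.insert(key.to_string(), next);
}

/// Oldest snapshot a read accepts: max(baseline, now - interval),
/// or `None` when neither constraint applies.
fn read_floor(
    baselines: &Baselines,
    key: &str,
    interval: Option<Duration>,
    now: SystemTime,
) -> Option<SystemTime> {
    let baseline = baselines.lock().unwrap().get(key).copied();
    let interval_floor = interval.map(|d| now.checked_sub(d).unwrap_or(now));
    match (baseline, interval_floor) {
        (None, None) => None,
        (Some(t), None) | (None, Some(t)) => Some(t),
        (Some(a), Some(b)) => Some(a.max(b)),
    }
}

fn main() {
    let t0 = UNIX_EPOCH + Duration::from_secs(1_000);
    let interval = Some(Duration::from_secs(30));
    let baselines: Baselines = Arc::new(Mutex::new(HashMap::new()));
    // Two handles to the same table share one map.
    let writer = Arc::clone(&baselines);
    let reader = Arc::clone(&baselines);

    // Never written, no interval: no floor at all.
    assert_eq!(read_floor(&reader, "ns$tbl", None, t0), None);

    // A write at t0 sets the baseline.
    bump(&writer, "ns$tbl", t0);

    // Read 5 s later: now - 30 s is older than the write, so the write wins.
    let t1 = t0 + Duration::from_secs(5);
    assert_eq!(read_floor(&reader, "ns$tbl", interval, t1), Some(t0));

    // Read 60 s later: now - 30 s is newer than the write, so the interval wins.
    let t2 = t0 + Duration::from_secs(60);
    assert_eq!(
        read_floor(&reader, "ns$tbl", interval, t2),
        Some(t0 + Duration::from_secs(30))
    );

    // A bump carrying an older timestamp does not lower the baseline.
    bump(&writer, "ns$tbl", t0 - Duration::from_secs(10));
    assert_eq!(read_floor(&reader, "ns$tbl", None, t2), Some(t0));
}
```

Keeping the baseline monotonic is what lets several handles share one map safely: whichever handle wrote last, no reader is ever allowed a snapshot older than the newest recorded write.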
@@ -13,7 +13,7 @@ use serde_json::{Value, json};
|
||||
use super::EmbeddingFunction;
|
||||
use crate::{Error, Result};
|
||||
|
||||
use tokio::runtime::{Handle, RuntimeFlavor};
|
||||
use tokio::runtime::Handle;
|
||||
use tokio::task::block_in_place;
|
||||
|
||||
#[derive(Debug)]
|
||||
@@ -148,12 +148,6 @@ impl BedrockEmbeddingFunction {
|
||||
_ => unreachable!(),
|
||||
};
|
||||
|
||||
// Bedrock's SDK is async but this trait method is synchronous, so we
|
||||
// bridge with `block_in_place` + `block_on`. That requires a
|
||||
// multi-threaded Tokio runtime; return a typed error instead of
|
||||
// panicking when no compatible runtime is available.
|
||||
let handle = current_multi_thread_handle()?;
|
||||
|
||||
for text in texts {
|
||||
let request_body = match self.model {
|
||||
BedrockEmbeddingModel::TitanEmbedding => {
|
||||
@@ -169,28 +163,24 @@ impl BedrockEmbeddingFunction {
|
||||
}
|
||||
};
|
||||
|
||||
// Serialize before entering the blocking section so a serialization
|
||||
// failure surfaces as a typed error rather than an `unwrap` panic.
|
||||
let body = serde_json::to_vec(&request_body).map_err(|e| Error::Runtime {
|
||||
message: format!("Failed to serialize Bedrock request: {e}"),
|
||||
})?;
|
||||
|
||||
let client = self.client.clone();
|
||||
let model_id = self.model.model_id().to_string();
|
||||
let request_body = request_body.clone();
|
||||
|
||||
let response = block_in_place(|| {
|
||||
handle.block_on(async move {
|
||||
let response = block_in_place(move || {
|
||||
Handle::current().block_on(async move {
|
||||
client
|
||||
.invoke_model()
|
||||
.model_id(model_id)
|
||||
.body(aws_sdk_bedrockruntime::primitives::Blob::new(body))
|
||||
.body(aws_sdk_bedrockruntime::primitives::Blob::new(
|
||||
serde_json::to_vec(&request_body).unwrap(),
|
||||
))
|
||||
.send()
|
||||
.await
|
||||
.map_err(|e| Error::Runtime {
|
||||
message: format!("Bedrock invoke_model request failed: {e}"),
|
||||
})
|
||||
.map_err(Box::new)
|
||||
})
|
||||
})?;
|
||||
})
|
||||
.unwrap();
|
||||
|
||||
let response_json: Value =
|
||||
serde_json::from_slice(response.body.as_ref()).map_err(|e| Error::Runtime {
|
||||
@@ -198,12 +188,22 @@ impl BedrockEmbeddingFunction {
|
||||
})?;
|
||||
|
||||
let embedding = match self.model {
|
||||
BedrockEmbeddingModel::TitanEmbedding => {
|
||||
json_array_to_f32(&response_json["embedding"], "embedding")?
|
||||
}
|
||||
BedrockEmbeddingModel::CohereLarge => {
|
||||
json_array_to_f32(&response_json["embeddings"][0], "embeddings")?
|
||||
}
|
||||
BedrockEmbeddingModel::TitanEmbedding => response_json["embedding"]
|
||||
.as_array()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: "Missing embedding in response".to_string(),
|
||||
})?
|
||||
.iter()
|
||||
.map(|v| v.as_f64().unwrap() as f32)
|
||||
.collect::<Vec<f32>>(),
|
||||
BedrockEmbeddingModel::CohereLarge => response_json["embeddings"][0]
|
||||
.as_array()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: "Missing embeddings in response".to_string(),
|
||||
})?
|
||||
.iter()
|
||||
.map(|v| v.as_f64().unwrap() as f32)
|
||||
.collect::<Vec<f32>>(),
|
||||
};
|
||||
|
||||
builder.append_slice(&embedding);
|
||||
@@ -212,86 +212,3 @@ impl BedrockEmbeddingFunction {
|
||||
Ok(builder.finish())
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns a handle to the current multi-threaded Tokio runtime, or a typed
|
||||
/// [`Error::Runtime`] when called outside a runtime or on the current-thread
|
||||
/// runtime. This keeps the synchronous-over-async bridge in
|
||||
/// [`BedrockEmbeddingFunction::compute_inner`] from panicking on runtime
|
||||
/// configurations that cannot support `block_in_place`.
|
||||
fn current_multi_thread_handle() -> Result<Handle> {
|
||||
let handle = Handle::try_current().map_err(|e| Error::Runtime {
|
||||
message: format!("Bedrock embedding must be called from within a Tokio runtime: {e}"),
|
||||
})?;
|
||||
if handle.runtime_flavor() == RuntimeFlavor::CurrentThread {
|
||||
return Err(Error::Runtime {
|
||||
message: "Bedrock embedding requires a multi-threaded Tokio runtime; the \
|
||||
current-thread runtime cannot use `block_in_place`"
|
||||
.to_string(),
|
||||
});
|
||||
}
|
||||
Ok(handle)
|
||||
}
|
||||
|
||||
/// Converts a JSON value expected to be an array of numbers into `Vec<f32>`.
|
||||
///
|
||||
/// Returns a typed [`Error::Runtime`] (rather than panicking) when the value is
|
||||
/// not an array or contains a non-numeric element, so malformed provider
|
||||
/// responses degrade gracefully.
|
||||
fn json_array_to_f32(value: &Value, field: &str) -> Result<Vec<f32>> {
|
||||
let arr = value.as_array().ok_or_else(|| Error::Runtime {
|
||||
message: format!("Missing or non-array '{field}' field in Bedrock response"),
|
||||
})?;
|
||||
arr.iter()
|
||||
.map(|v| {
|
||||
v.as_f64().map(|f| f as f32).ok_or_else(|| Error::Runtime {
|
||||
message: format!("Non-numeric value in Bedrock '{field}' embedding: {v}"),
|
||||
})
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn json_array_to_f32_parses_numbers() {
|
||||
let v = json!([1.0, 2, -3.5]);
|
||||
let out = json_array_to_f32(&v, "embedding").unwrap();
|
||||
assert_eq!(out, vec![1.0_f32, 2.0, -3.5]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn json_array_to_f32_rejects_non_array() {
|
||||
// Missing field indexes to `Value::Null`; a malformed payload should be
|
||||
// a typed error, not a panic.
|
||||
let v = json!({"unexpected": "shape"});
|
||||
let err = json_array_to_f32(&v["embedding"], "embedding").unwrap_err();
|
||||
assert!(matches!(err, Error::Runtime { .. }), "got {err:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn json_array_to_f32_rejects_non_numeric_element() {
|
||||
let v = json!([1.0, "not-a-number", 3.0]);
|
||||
let err = json_array_to_f32(&v, "embedding").unwrap_err();
|
||||
assert!(matches!(err, Error::Runtime { .. }), "got {err:?}");
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn handle_errors_without_runtime() {
|
||||
// No Tokio runtime in scope -> typed error instead of a panic.
|
||||
let err = current_multi_thread_handle().unwrap_err();
|
||||
assert!(matches!(err, Error::Runtime { .. }), "got {err:?}");
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "current_thread")]
|
||||
async fn handle_errors_on_current_thread_runtime() {
|
||||
let err = current_multi_thread_handle().unwrap_err();
|
||||
assert!(matches!(err, Error::Runtime { .. }), "got {err:?}");
|
||||
}
|
||||
|
||||
#[tokio::test(flavor = "multi_thread")]
|
||||
async fn handle_ok_on_multi_thread_runtime() {
|
||||
current_multi_thread_handle().expect("multi-threaded runtime should be accepted");
|
||||
}
|
||||
}
|
||||
|
||||
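The `json_array_to_f32` helper in the hunk above relies on collecting an iterator of `Result`s into a single `Result<Vec<_>>`, which stops at the first failing element instead of panicking. A minimal std-only sketch of that pattern follows; it assumes string parsing as a stand-in for JSON values, and `parse_all` is a hypothetical name rather than anything from the crate.

```rust
/// Parse every element as `f32`, returning the first failure as an error
/// instead of panicking (string parsing stands in for JSON numbers here).
fn parse_all(values: &[&str], field: &str) -> Result<Vec<f32>, String> {
    values
        .iter()
        .map(|v| {
            v.trim()
                .parse::<f32>()
                .map_err(|_| format!("Non-numeric value in '{field}': {v}"))
        })
        // Collecting into `Result<Vec<_>, _>` short-circuits on the first `Err`.
        .collect()
}

fn main() {
    // All elements are numeric: the whole vector is returned.
    assert_eq!(
        parse_all(&["1.0", "2", "-3.5"], "embedding"),
        Ok(vec![1.0_f32, 2.0, -3.5])
    );

    // One bad element turns the whole call into a typed error.
    let err = parse_all(&["1.0", "not-a-number", "3.0"], "embedding").unwrap_err();
    assert_eq!(err, "Non-numeric value in 'embedding': not-a-number");
}
```

Compared with `.map(|v| v.as_f64().unwrap() as f32)`, this shape lets a malformed provider response surface as an error the caller can handle.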
@@ -163,7 +163,6 @@
|
||||
//! ```
|
||||
|
||||
pub mod arrow;
|
||||
pub mod blob;
|
||||
pub mod connection;
|
||||
pub mod data;
|
||||
pub mod database;
|
||||
@@ -185,14 +184,12 @@ pub mod table;
|
||||
pub mod test_utils;
|
||||
pub mod utils;
|
||||
|
||||
use std::{fmt::Display, str::FromStr};
|
||||
use std::fmt::Display;
|
||||
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
pub use blob::{blob, is_blob};
|
||||
pub use connection::{ConnectNamespaceBuilder, Connection};
|
||||
pub use error::{Error, Result};
|
||||
use lance_index::vector::ApproxMode as LanceApproxMode;
|
||||
use lance_linalg::distance::DistanceType as LanceDistanceType;
|
||||
pub use table::Table;
|
||||
|
||||
@@ -261,79 +258,6 @@ impl Display for DistanceType {
|
||||
}
|
||||
}
|
||||
|
||||
/// Controls the speed / accuracy tradeoff for approximate vector search.
|
||||
///
|
||||
/// This currently only affects RQ-quantized vector indexes, such as IVF_RQ.
|
||||
/// Other index types ignore this setting.
|
||||
#[derive(Debug, Copy, Clone, PartialEq, Eq, Serialize, Deserialize, Default)]
|
||||
#[non_exhaustive]
|
||||
#[serde(rename_all = "lowercase")]
|
||||
pub enum ApproxMode {
|
||||
/// Prefer lower query latency, which can reduce recall.
|
||||
Fast,
|
||||
/// Use the default balance between query latency and recall.
|
||||
#[default]
|
||||
Normal,
|
||||
/// Prefer higher recall, which can increase query latency.
|
||||
Accurate,
|
||||
}
|
||||
|
||||
impl From<ApproxMode> for LanceApproxMode {
|
||||
fn from(value: ApproxMode) -> Self {
|
||||
match value {
|
||||
ApproxMode::Fast => Self::Fast,
|
||||
ApproxMode::Normal => Self::Normal,
|
||||
ApproxMode::Accurate => Self::Accurate,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<LanceApproxMode> for ApproxMode {
|
||||
fn from(value: LanceApproxMode) -> Self {
|
||||
match value {
|
||||
LanceApproxMode::Fast => Self::Fast,
|
||||
LanceApproxMode::Normal => Self::Normal,
|
||||
LanceApproxMode::Accurate => Self::Accurate,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl TryFrom<&str> for ApproxMode {
|
||||
type Error = Error;
|
||||
|
||||
fn try_from(value: &str) -> std::prelude::v1::Result<Self, Self::Error> {
|
||||
Self::from_str(value)
|
||||
}
|
||||
}
|
||||
|
||||
impl FromStr for ApproxMode {
|
||||
type Err = Error;
|
||||
|
||||
fn from_str(value: &str) -> std::prelude::v1::Result<Self, Self::Err> {
|
||||
match value.to_ascii_lowercase().as_str() {
|
||||
"fast" => Ok(Self::Fast),
|
||||
"normal" => Ok(Self::Normal),
|
||||
"accurate" => Ok(Self::Accurate),
|
||||
_ => Err(Error::InvalidInput {
|
||||
message: format!(
|
||||
"approx_mode must be one of 'fast', 'normal', or 'accurate', got '{}'",
|
||||
value
|
||||
),
|
||||
}),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Display for ApproxMode {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
match self {
|
||||
Self::Fast => write!(f, "fast"),
|
||||
Self::Normal => write!(f, "normal"),
|
||||
Self::Accurate => write!(f, "accurate"),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Connect to a database
|
||||
pub use connection::connect;
|
||||
/// Connect to a namespace-backed database
|
||||
|
||||
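The string handling of `ApproxMode` in the hunk above (case-insensitive `FromStr`, lowercase `Display`) can be exercised in isolation. This is a std-only sketch under stated assumptions: `Mode` is a hypothetical stand-in for the enum, the error type is a plain `String` rather than the crate's `Error::InvalidInput`, and the serde derives are left out.

```rust
use std::fmt;
use std::str::FromStr;

/// Stand-in for the `ApproxMode` enum (illustrative, no serde derives).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Mode {
    Fast,
    #[default]
    Normal,
    Accurate,
}

impl FromStr for Mode {
    type Err = String;

    /// Accepts any ASCII casing of the three mode names.
    fn from_str(value: &str) -> Result<Self, Self::Err> {
        match value.to_ascii_lowercase().as_str() {
            "fast" => Ok(Self::Fast),
            "normal" => Ok(Self::Normal),
            "accurate" => Ok(Self::Accurate),
            _ => Err(format!(
                "approx_mode must be one of 'fast', 'normal', or 'accurate', got '{value}'"
            )),
        }
    }
}

impl fmt::Display for Mode {
    /// Always renders the canonical lowercase name.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            Self::Fast => "fast",
            Self::Normal => "normal",
            Self::Accurate => "accurate",
        };
        f.write_str(name)
    }
}

fn main() {
    // Parsing ignores case; displaying normalizes to lowercase.
    for text in ["fast", "Normal", "ACCURATE"] {
        let mode: Mode = text.parse().unwrap();
        assert_eq!(mode.to_string(), text.to_ascii_lowercase());
    }
    assert_eq!(Mode::default(), Mode::Normal);
    assert!("invalid".parse::<Mode>().is_err());
}
```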
@@ -20,12 +20,12 @@ use lance_index::scalar::FullTextSearchQuery;
|
||||
use lance_index::scalar::inverted::SCORE_COL;
|
||||
use lance_index::vector::DIST_COL;
|
||||
|
||||
use crate::DistanceType;
|
||||
use crate::error::{Error, Result};
|
||||
use crate::rerankers::rrf::RRFReranker;
|
||||
use crate::rerankers::{NormalizeMethod, Reranker, check_reranker_result};
|
||||
use crate::table::BaseTable;
|
||||
use crate::utils::{MaxBatchLengthStream, TimeoutStream};
|
||||
use crate::{ApproxMode, DistanceType};
|
||||
use crate::{
|
||||
arrow::{SendableRecordBatchStream, SimpleRecordBatchStream},
|
||||
table::AnyQuery,
|
||||
@@ -935,8 +935,6 @@ pub struct VectorQueryRequest {
|
||||
pub refine_factor: Option<u32>,
|
||||
/// The distance type to use for the search
|
||||
pub distance_type: Option<DistanceType>,
|
||||
/// The speed / accuracy tradeoff to use for approximate vector search
|
||||
pub approx_mode: Option<ApproxMode>,
|
||||
/// Default is true. Set to false to enforce a brute force search.
|
||||
pub use_index: bool,
|
||||
}
|
||||
@@ -954,7 +952,6 @@ impl Default for VectorQueryRequest {
|
||||
ef: None,
|
||||
refine_factor: None,
|
||||
distance_type: None,
|
||||
approx_mode: None,
|
||||
use_index: true,
|
||||
}
|
||||
}
|
||||
@@ -1195,15 +1192,6 @@ impl VectorQuery {
|
||||
self
|
||||
}
|
||||
|
||||
/// Set the speed / accuracy tradeoff for approximate vector search.
|
||||
///
|
||||
/// This setting is currently only used by RQ-quantized indexes, such as
|
||||
/// IVF_RQ. Other index types ignore this setting.
|
||||
pub fn approx_mode(mut self, approx_mode: ApproxMode) -> Self {
|
||||
self.request.approx_mode = Some(approx_mode);
|
||||
self
|
||||
}
|
||||
|
||||
/// If this is called then any vector index is skipped
|
||||
///
|
||||
/// An exhaustive (flat) search will be performed. The query vector will
|
||||
@@ -1558,7 +1546,6 @@ mod tests {
|
||||
.nprobes(1000)
|
||||
.postfilter()
|
||||
.distance_type(DistanceType::Cosine)
|
||||
.approx_mode(ApproxMode::Accurate)
|
||||
.refine_factor(999);
|
||||
|
||||
assert_eq!(
|
||||
@@ -1577,49 +1564,9 @@ mod tests {
|
||||
assert_eq!(query.request.maximum_nprobes, Some(1000));
|
||||
assert!(query.request.use_index);
|
||||
assert_eq!(query.request.distance_type, Some(DistanceType::Cosine));
|
||||
assert_eq!(query.request.approx_mode, Some(ApproxMode::Accurate));
|
||||
assert_eq!(query.request.refine_factor, Some(999));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_approx_mode_serde_parse_default_and_display() {
|
||||
assert_eq!(ApproxMode::default(), ApproxMode::Normal);
|
||||
assert_eq!(
|
||||
serde_json::to_string(&ApproxMode::Fast).unwrap(),
|
||||
"\"fast\""
|
||||
);
|
||||
assert_eq!(
|
||||
serde_json::from_str::<ApproxMode>("\"accurate\"").unwrap(),
|
||||
ApproxMode::Accurate
|
||||
);
|
||||
assert_eq!("normal".parse::<ApproxMode>().unwrap(), ApproxMode::Normal);
|
||||
assert_eq!(ApproxMode::try_from("FAST").unwrap(), ApproxMode::Fast);
|
||||
assert_eq!(ApproxMode::Accurate.to_string(), "accurate");
|
||||
assert!(ApproxMode::try_from("invalid").is_err());
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_vector_query_approx_mode_builder() {
|
||||
let tmp_dir = tempdir().unwrap();
|
||||
let dataset_path = tmp_dir.path().join("test.lance");
|
||||
let uri = dataset_path.to_str().unwrap();
|
||||
|
||||
let conn = connect(uri).execute().await.unwrap();
|
||||
let table = conn
|
||||
.create_table("my_table", make_test_batches())
|
||||
.execute()
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
let query = table
|
||||
.query()
|
||||
.nearest_to(&[0.1, 0.2])
|
||||
.unwrap()
|
||||
.approx_mode(ApproxMode::Fast);
|
||||
|
||||
assert_eq!(query.request.approx_mode, Some(ApproxMode::Fast));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_execute() {
|
||||
// TODO: Switch back to memory://foo after https://github.com/lancedb/lancedb/issues/1051
|
||||
|
||||
@@ -985,42 +985,45 @@ mod tests {
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_open_table_branch_and_version() {
|
||||
// Remote supports version time-travel but not branches. A version-only
|
||||
// open (or one on the default "main" branch) must succeed; a non-main
|
||||
// branch must be rejected, with or without a version.
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
let body = if request.url().path() == "/v1/table/t/branches/list/" {
|
||||
// checkout_branch validates the branch exists via list_branches.
|
||||
r#"{"branches":{"exp":{"parentVersion":1,"createAt":1,"manifestSize":1}}}"#
|
||||
} else {
|
||||
// describe (table open + version/branch validation)
|
||||
r#"{"table": "t", "version": 2, "schema": {"fields": [
|
||||
{"name": "a", "type": { "type": "int32" }, "nullable": false}
|
||||
]}}"#
|
||||
};
|
||||
http::Response::builder().status(200).body(body).unwrap()
|
||||
assert_eq!(request.url().path(), "/v1/table/t/describe/");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(
|
||||
r#"{"table": "t", "version": 2, "schema": {"fields": [
|
||||
{"name": "a", "type": { "type": "int32" }, "nullable": false}
|
||||
]}}"#,
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
// version-only (and "main" + version) time-travel the main chain
|
||||
let v2 = conn.open_table("t").version(2).execute().await.unwrap();
|
||||
assert_eq!(v2.current_branch(), None);
|
||||
let main_v2 = conn
|
||||
.open_table("t")
|
||||
// version-only: allowed (open + checkout(version) both round-trip)
|
||||
conn.open_table("t").version(2).execute().await.unwrap();
|
||||
|
||||
// "main" is the default branch, so it counts as no branch
|
||||
conn.open_table("t")
|
||||
.branch("main")
|
||||
.version(2)
|
||||
.execute()
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(main_v2.current_branch(), None);
|
||||
|
||||
// a non-main branch opens a handle scoped to that branch
|
||||
let exp = conn.open_table("t").branch("exp").execute().await.unwrap();
|
||||
assert_eq!(exp.current_branch(), Some("exp".to_string()));
|
||||
let exp_v2 = conn
|
||||
.open_table("t")
|
||||
.branch("exp")
|
||||
.version(2)
|
||||
.execute()
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(exp_v2.current_branch(), Some("exp".to_string()));
|
||||
// a non-main branch is rejected, with or without a version
|
||||
assert!(matches!(
|
||||
conn.open_table("t").branch("exp").execute().await,
|
||||
Err(Error::NotSupported { .. })
|
||||
));
|
||||
assert!(matches!(
|
||||
conn.open_table("t")
|
||||
.branch("exp")
|
||||
.version(2)
|
||||
.execute()
|
||||
.await,
|
||||
Err(Error::NotSupported { .. })
|
||||
));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
|
||||
File diff suppressed because it is too large
@@ -48,8 +48,6 @@ pub struct RemoteInsertExec<S: HttpSend = Sender> {
|
||||
metrics: ExecutionPlanMetricsSet,
|
||||
upload_id: Option<String>,
|
||||
tracker: Option<Arc<WriteProgressTracker>>,
|
||||
/// Branch to write to via `?branch=`. `None` targets the main branch.
|
||||
branch: Option<String>,
|
||||
}
|
||||
|
||||
impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
@@ -61,10 +59,9 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
input: Arc<dyn ExecutionPlan>,
|
||||
overwrite: bool,
|
||||
tracker: Option<Arc<WriteProgressTracker>>,
|
||||
branch: Option<String>,
|
||||
) -> Self {
|
||||
Self::new_inner(
|
||||
table_name, identifier, client, input, overwrite, None, tracker, branch,
|
||||
table_name, identifier, client, input, overwrite, None, tracker,
|
||||
)
|
||||
}
|
||||
|
||||
@@ -73,7 +70,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
/// Each partition's insert is staged under the given `upload_id` without
|
||||
/// committing. The caller is responsible for calling the complete (or abort)
|
||||
/// endpoint after all partitions finish.
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub fn new_multipart(
|
||||
table_name: String,
|
||||
identifier: String,
|
||||
@@ -82,7 +78,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
overwrite: bool,
|
||||
upload_id: String,
|
||||
tracker: Option<Arc<WriteProgressTracker>>,
|
||||
branch: Option<String>,
|
||||
) -> Self {
|
||||
Self::new_inner(
|
||||
table_name,
|
||||
@@ -92,11 +87,9 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
overwrite,
|
||||
Some(upload_id),
|
||||
tracker,
|
||||
branch,
|
||||
)
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
fn new_inner(
|
||||
table_name: String,
|
||||
identifier: String,
|
||||
@@ -105,7 +98,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
overwrite: bool,
|
||||
upload_id: Option<String>,
|
||||
tracker: Option<Arc<WriteProgressTracker>>,
|
||||
branch: Option<String>,
|
||||
) -> Self {
|
||||
let num_partitions = if upload_id.is_some() {
|
||||
input.output_partitioning().partition_count()
|
||||
@@ -131,7 +123,6 @@ impl<S: HttpSend + 'static> RemoteInsertExec<S> {
|
||||
metrics: ExecutionPlanMetricsSet::new(),
|
||||
upload_id,
|
||||
tracker,
|
||||
branch,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -282,7 +273,6 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
|
||||
self.overwrite,
|
||||
self.upload_id.clone(),
|
||||
self.tracker.clone(),
|
||||
self.branch.clone(),
|
||||
)))
|
||||
}
|
||||
|
||||
@@ -314,7 +304,6 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
|
||||
let table_name = self.table_name.clone();
|
||||
let upload_id = self.upload_id.clone();
|
||||
let tracker = self.tracker.clone();
|
||||
let branch = self.branch.clone();
|
||||
|
||||
let stream = futures::stream::once(async move {
|
||||
let mut request = client
|
||||
@@ -327,9 +316,6 @@ impl<S: HttpSend + 'static> ExecutionPlan for RemoteInsertExec<S> {
|
||||
            if let Some(ref uid) = upload_id {
                request = request.query(&[("upload_id", uid.as_str())]);
            }
            if let Some(ref b) = branch {
                request = request.query(&[("branch", b.as_str())]);
            }
|
||||
|
||||
let (error_tx, mut error_rx) = tokio::sync::oneshot::channel();
|
||||
let body = Self::stream_as_http_body(input_stream, error_tx, tracker)?;
|
||||
|
||||
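The hunk above attaches `upload_id` and `branch` to the insert request only when they are set. The following is a minimal, std-only sketch of that pattern. `insert_query_params` and `to_query_string` are hypothetical helpers, not part of the LanceDB client, and the sketch skips percent-encoding, which a real HTTP client would perform.

```rust
/// Collect the optional query pairs for an insert request.
/// Hypothetical helper: the diff attaches these through `request.query(...)`.
fn insert_query_params(
    upload_id: Option<&str>,
    branch: Option<&str>,
) -> Vec<(&'static str, String)> {
    let mut params = Vec::new();
    // Multipart uploads stage each partition under an upload id.
    if let Some(uid) = upload_id {
        params.push(("upload_id", uid.to_string()));
    }
    // `None` means the main branch, so no parameter is sent at all.
    if let Some(b) = branch {
        params.push(("branch", b.to_string()));
    }
    params
}

/// Join the pairs as `k=v&k=v`. No percent-encoding is applied here
/// (an assumption made to keep the sketch short).
fn to_query_string(params: &[(&str, String)]) -> String {
    params
        .iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join("&")
}
```

Leaving the parameter out entirely, instead of sending an empty value, keeps requests against the main branch identical to what a client without branch support would send.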
File diff suppressed because it is too large
@@ -26,9 +26,6 @@ pub enum AddDataMode {
    #[default]
    Append,
    /// The existing table will be overwritten with the new data
    ///
    /// On overwrite, raw binary is not coerced into a blob struct. The input
    /// must declare blob v2 for the column to stay a blob column.
    Overwrite,
}
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -3,7 +3,6 @@

//! This module contains adapters to allow LanceDB tables to be used as DataFusion table providers.

mod blob_coerce;
pub mod cast;
pub mod insert;
pub mod reject_nan;
|
||||
|
||||
@@ -1,495 +0,0 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
//! Coerces write-path input into blob v2 struct columns.
|
||||
//!
|
||||
//! [`super::cast::cast_to_table_schema`] calls [`coerce_blob_expr`].
|
||||
|
||||
use std::sync::Arc;
|
||||
|
||||
use arrow_schema::{DataType, Field, FieldRef};
|
||||
use datafusion::functions::core::{get_field, named_struct};
|
||||
use datafusion_common::ScalarValue;
|
||||
use datafusion_common::config::ConfigOptions;
|
||||
use datafusion_physical_expr::ScalarFunctionExpr;
|
||||
use datafusion_physical_expr::expressions::{CastExpr, Literal};
|
||||
use datafusion_physical_plan::PhysicalExpr;
|
||||
|
||||
use crate::error::{Error, Result};
|
||||
|
||||
/// Build a projection expression coercing `input_expr` into the blob struct
/// declared by `table_field`, composing `named_struct` / `get_field` / `cast`.
pub(super) fn coerce_blob_expr(
    input_expr: Arc<dyn PhysicalExpr>,
    input_field: &Field,
    table_field: &FieldRef,
    config: &Arc<ConfigOptions>,
) -> Result<(Arc<dyn PhysicalExpr>, FieldRef)> {
    let DataType::Struct(declared_fields) = table_field.data_type() else {
        return Err(Error::InvalidInput {
            message: format!(
                "blob v2 column '{}' must be a struct, table declares {}",
                table_field.name(),
                table_field.data_type()
            ),
        });
    };

    let input_struct_children = match input_field.data_type() {
        DataType::Binary | DataType::LargeBinary | DataType::BinaryView => None,
        DataType::Struct(children) => {
            if !children
                .iter()
                .any(|c| c.name() == "data" || c.name() == "uri")
            {
                return Err(Error::InvalidInput {
                    message: format!(
                        "blob struct input for column '{}' must contain a 'data' or 'uri' child",
                        table_field.name()
                    ),
                });
            }
            Some(children)
        }
        other => {
            return Err(Error::InvalidInput {
                message: format!(
                    "cannot coerce column '{}' with type {} into a blob v2 struct. \
                     expected Binary, LargeBinary, BinaryView, or a Struct with a 'data' or 'uri' child",
                    table_field.name(),
                    other,
                ),
            });
        }
    };

    let mut ns_args: Vec<Arc<dyn PhysicalExpr>> = Vec::with_capacity(declared_fields.len() * 2);
    for declared in declared_fields.iter() {
        ns_args.push(Arc::new(Literal::new(ScalarValue::from(
            declared.name().as_str(),
        ))));

        let value: Arc<dyn PhysicalExpr> = match input_struct_children {
            // Raw binary lands in `data` and everything else is a typed null.
            None => {
                if declared.name() == "data" {
                    Arc::new(CastExpr::new(
                        input_expr.clone(),
                        declared.data_type().clone(),
                        None,
                    ))
                } else {
                    typed_null(declared.data_type())?
                }
            }
            Some(children) => match children.iter().find(|c| c.name() == declared.name()) {
                Some(child) => {
                    let field_expr: Arc<dyn PhysicalExpr> = Arc::new(ScalarFunctionExpr::new(
                        &format!("get_field({})", declared.name()),
                        get_field(),
                        vec![
                            input_expr.clone(),
                            Arc::new(Literal::new(ScalarValue::from(declared.name().as_str()))),
                        ],
                        Arc::new(child.as_ref().clone()),
                        config.clone(),
                    ));
                    if child.data_type() == declared.data_type() {
                        field_expr
                    } else {
                        Arc::new(CastExpr::new(
                            field_expr,
                            declared.data_type().clone(),
                            None,
                        ))
                    }
                }
                None => typed_null(declared.data_type())?,
            },
        };
        ns_args.push(value);
    }

    let expr: Arc<dyn PhysicalExpr> = Arc::new(ScalarFunctionExpr::new(
        &format!("named_struct({})", table_field.name()),
        named_struct(),
        ns_args,
        table_field.clone(),
        config.clone(),
    ));
    Ok((expr, table_field.clone()))
}

fn typed_null(data_type: &DataType) -> Result<Arc<dyn PhysicalExpr>> {
    let scalar = ScalarValue::try_from(data_type).map_err(|e| Error::InvalidInput {
        message: format!("cannot build null literal for blob child type {data_type}: {e}"),
    })?;
    Ok(Arc::new(Literal::new(scalar)))
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::super::cast::cast_to_table_schema;
|
||||
use super::*;
|
||||
use crate::blob::blob;
|
||||
use arrow_array::{
|
||||
Array, ArrayRef, BinaryArray, BinaryViewArray, Int32Array, Int64Array, LargeBinaryArray,
|
||||
RecordBatch, StringArray, StructArray, UInt8Array, UInt64Array,
|
||||
};
|
||||
use arrow_schema::Schema;
|
||||
use datafusion::prelude::SessionContext;
|
||||
use datafusion_catalog::MemTable;
|
||||
use datafusion_physical_plan::ExecutionPlan;
|
||||
use futures::TryStreamExt;
|
||||
use lance_arrow::FieldExt;
|
||||
use std::collections::HashMap;
|
||||
|
||||
fn wide_blob_field(name: &str) -> Field {
|
||||
Field::new(
|
||||
name,
|
||||
DataType::Struct(
|
||||
vec![
|
||||
Field::new("data", DataType::LargeBinary, true),
|
||||
Field::new("uri", DataType::Utf8, true),
|
||||
Field::new("position", DataType::UInt64, true),
|
||||
Field::new("size", DataType::UInt64, true),
|
||||
]
|
||||
.into(),
|
||||
),
|
||||
true,
|
||||
)
|
||||
.with_metadata(HashMap::from([(
|
||||
"ARROW:extension:name".to_string(),
|
||||
"lance.blob.v2".to_string(),
|
||||
)]))
|
||||
}
|
||||
|
||||
fn blob_table_schema() -> Schema {
|
||||
Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
blob("image", true),
|
||||
])
|
||||
}
|
||||
|
||||
fn batch_with_image(image_field: Field, image: ArrayRef) -> RecordBatch {
|
||||
let len = image.len();
|
||||
RecordBatch::try_new(
|
||||
Arc::new(Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
image_field,
|
||||
])),
|
||||
vec![Arc::new(Int64Array::from_iter_values(0..len as i64)), image],
|
||||
)
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
fn image_struct(batch: &RecordBatch) -> &StructArray {
|
||||
batch
|
||||
.column_by_name("image")
|
||||
.unwrap()
|
||||
.as_any()
|
||||
.downcast_ref::<StructArray>()
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
async fn plan_from_batch(batch: RecordBatch) -> Arc<dyn ExecutionPlan> {
|
||||
let schema = batch.schema();
|
||||
let table = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
|
||||
let ctx = SessionContext::new();
|
||||
ctx.register_table("t", Arc::new(table)).unwrap();
|
||||
let df = ctx.table("t").await.unwrap();
|
||||
df.create_physical_plan().await.unwrap()
|
||||
}
|
||||
|
||||
async fn coerce(batch: RecordBatch, table_schema: &Schema) -> RecordBatch {
|
||||
let plan = plan_from_batch(batch).await;
|
||||
let plan = cast_to_table_schema(plan, table_schema).unwrap();
|
||||
let ctx = SessionContext::new();
|
||||
let stream = plan.execute(0, ctx.task_ctx()).unwrap();
|
||||
let batches: Vec<RecordBatch> = stream.try_collect().await.unwrap();
|
||||
arrow_select::concat::concat_batches(&plan.schema(), &batches).unwrap()
|
||||
}
|
||||
|
||||
async fn coerce_err(batch: RecordBatch, table_schema: &Schema) -> Error {
|
||||
let plan = plan_from_batch(batch).await;
|
||||
cast_to_table_schema(plan, table_schema).unwrap_err()
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn large_binary_coerces_to_declared_blob_struct() {
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", DataType::LargeBinary, true),
|
||||
Arc::new(LargeBinaryArray::from_iter_values([b"hello".as_slice()])),
|
||||
);
|
||||
let coerced = coerce(batch, &blob_table_schema()).await;
|
||||
let image_field = coerced.schema().field_with_name("image").unwrap().clone();
|
||||
assert!(image_field.is_blob_v2());
|
||||
assert!(matches!(image_field.data_type(), DataType::Struct(_)));
|
||||
let data = image_struct(&coerced).column_by_name("data").unwrap();
|
||||
let data: &LargeBinaryArray = data.as_any().downcast_ref().unwrap();
|
||||
assert_eq!(data.value(0), b"hello");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn binary_coerces_to_declared_blob_struct() {
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", DataType::Binary, true),
|
||||
Arc::new(BinaryArray::from_iter_values([b"hi".as_slice()])),
|
||||
);
|
||||
let coerced = coerce(batch, &blob_table_schema()).await;
|
||||
assert!(
|
||||
coerced
|
||||
.schema()
|
||||
.field_with_name("image")
|
||||
.unwrap()
|
||||
.is_blob_v2()
|
||||
);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn binary_view_coerces_to_declared_blob_struct() {
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", DataType::BinaryView, true),
|
||||
Arc::new(BinaryViewArray::from_iter_values([b"view".as_slice()])),
|
||||
);
|
||||
let coerced = coerce(batch, &blob_table_schema()).await;
|
||||
let data = image_struct(&coerced).column_by_name("data").unwrap();
|
||||
let data: &LargeBinaryArray = data.as_any().downcast_ref().unwrap();
|
||||
assert_eq!(data.value(0), b"view");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn binary_nulls_stay_null_after_coercion() {
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", DataType::Binary, true),
|
||||
Arc::new(BinaryArray::from_iter(vec![
|
||||
Some(b"present".as_slice()),
|
||||
None,
|
||||
])),
|
||||
);
|
||||
let coerced = coerce(batch, &blob_table_schema()).await;
|
||||
let image = image_struct(&coerced);
|
||||
let data = image.column_by_name("data").unwrap();
|
||||
assert!(!data.is_null(0));
|
||||
assert!(data.is_null(1));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn binary_coerces_into_four_child_blob_layout() {
|
||||
let table_schema = Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
wide_blob_field("image"),
|
||||
]);
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", DataType::LargeBinary, true),
|
||||
Arc::new(LargeBinaryArray::from_iter(vec![
|
||||
Some(b"alpha".as_slice()),
|
||||
None,
|
||||
])),
|
||||
);
|
||||
let coerced = coerce(batch, &table_schema).await;
|
||||
let image = image_struct(&coerced);
|
||||
assert_eq!(
|
||||
image.num_columns(),
|
||||
4,
|
||||
"coerced struct keeps the declared layout"
|
||||
);
|
||||
assert!(image.column_by_name("position").unwrap().is_null(0));
|
||||
assert!(image.column_by_name("size").unwrap().is_null(0));
|
||||
assert!(!image.column_by_name("data").unwrap().is_null(0));
|
||||
assert!(image.column_by_name("data").unwrap().is_null(1));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn prebuilt_struct_gains_blob_field_metadata() {
|
||||
let DataType::Struct(children) = blob("image", true).data_type().clone() else {
|
||||
unreachable!("blob field is a struct")
|
||||
};
|
||||
let prebuilt = StructArray::new(
|
||||
children,
|
||||
vec![
|
||||
Arc::new(LargeBinaryArray::from_iter_values([b"prebuilt".as_slice()])),
|
||||
Arc::new(StringArray::from(vec![None::<&str>])),
|
||||
],
|
||||
None,
|
||||
);
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", prebuilt.data_type().clone(), true),
|
||||
Arc::new(prebuilt),
|
||||
);
|
||||
let coerced = coerce(batch, &blob_table_schema()).await;
|
||||
assert!(
|
||||
coerced
|
||||
.schema()
|
||||
.field_with_name("image")
|
||||
.unwrap()
|
||||
.is_blob_v2()
|
||||
);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn prebuilt_narrow_struct_widens_to_declared_layout() {
|
||||
let DataType::Struct(narrow_children) = blob("image", true).data_type().clone() else {
|
||||
unreachable!("blob field is a struct")
|
||||
};
|
||||
let prebuilt = StructArray::new(
|
||||
narrow_children,
|
||||
vec![
|
||||
Arc::new(LargeBinaryArray::from_iter_values([b"prebuilt".as_slice()])),
|
||||
Arc::new(StringArray::from(vec![None::<&str>])),
|
||||
],
|
||||
None,
|
||||
);
|
||||
let table_schema = Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
wide_blob_field("image"),
|
||||
]);
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", prebuilt.data_type().clone(), true),
|
||||
Arc::new(prebuilt),
|
||||
);
|
||||
let coerced = coerce(batch, &table_schema).await;
|
||||
let image = image_struct(&coerced);
|
||||
assert_eq!(image.num_columns(), 4);
|
||||
assert!(image.column_by_name("position").unwrap().is_null(0));
|
||||
assert!(image.column_by_name("size").unwrap().is_null(0));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn external_reference_struct_preserves_uri_position_and_size() {
|
||||
let prebuilt = StructArray::new(
|
||||
vec![
|
||||
Field::new("data", DataType::LargeBinary, true),
|
||||
Field::new("uri", DataType::Utf8, true),
|
||||
Field::new("position", DataType::UInt64, true),
|
||||
Field::new("size", DataType::UInt64, true),
|
||||
]
|
||||
.into(),
|
||||
vec![
|
||||
Arc::new(LargeBinaryArray::from(vec![None::<&[u8]>])) as ArrayRef,
|
||||
Arc::new(StringArray::from(vec![Some("s3://bucket/blob.bin")])) as ArrayRef,
|
||||
Arc::new(UInt64Array::from(vec![Some(7)])) as ArrayRef,
|
||||
Arc::new(UInt64Array::from(vec![Some(6)])) as ArrayRef,
|
||||
],
|
||||
None,
|
||||
);
|
||||
let table_schema = Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
wide_blob_field("image"),
|
||||
]);
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", prebuilt.data_type().clone(), true),
|
||||
Arc::new(prebuilt),
|
||||
);
|
||||
let coerced = coerce(batch, &table_schema).await;
|
||||
let image = image_struct(&coerced);
|
||||
|
||||
let uri: &StringArray = image
|
||||
.column_by_name("uri")
|
||||
.unwrap()
|
||||
.as_any()
|
||||
.downcast_ref()
|
||||
.unwrap();
|
||||
assert_eq!(uri.value(0), "s3://bucket/blob.bin");
|
||||
let position: &UInt64Array = image
|
||||
.column_by_name("position")
|
||||
.unwrap()
|
||||
.as_any()
|
||||
.downcast_ref()
|
||||
.unwrap();
|
||||
assert_eq!(position.value(0), 7);
|
||||
let size: &UInt64Array = image
|
||||
.column_by_name("size")
|
||||
.unwrap()
|
||||
.as_any()
|
||||
.downcast_ref()
|
||||
.unwrap();
|
||||
assert_eq!(size.value(0), 6);
|
||||
assert!(image.column_by_name("data").unwrap().is_null(0));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn descriptor_struct_without_value_child_is_rejected() {
|
||||
let descriptor = StructArray::new(
|
||||
vec![
|
||||
Field::new("kind", DataType::UInt8, false),
|
||||
Field::new("position", DataType::UInt64, false),
|
||||
Field::new("size", DataType::UInt64, false),
|
||||
]
|
||||
.into(),
|
||||
vec![
|
||||
Arc::new(UInt8Array::from(vec![0])),
|
||||
Arc::new(UInt64Array::from(vec![0])),
|
||||
Arc::new(UInt64Array::from(vec![0])),
|
||||
],
|
||||
None,
|
||||
);
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", descriptor.data_type().clone(), true),
|
||||
Arc::new(descriptor),
|
||||
);
|
||||
let err = coerce_err(batch, &blob_table_schema()).await;
|
||||
assert!(err.to_string().contains("'data' or 'uri'"));
|
||||
assert!(err.to_string().contains("image"));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn unsupported_input_type_is_rejected_with_column_name() {
|
||||
let batch = batch_with_image(
|
||||
Field::new("image", DataType::Utf8, true),
|
||||
Arc::new(StringArray::from(vec!["not bytes"])),
|
||||
);
|
||||
let err = coerce_err(batch, &blob_table_schema()).await;
|
||||
assert!(matches!(err, Error::InvalidInput { .. }), "got {err:?}");
|
||||
assert!(err.to_string().contains("image"));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn blob_metadata_survives_cast_of_sibling_column() {
|
||||
let batch = RecordBatch::try_new(
|
||||
Arc::new(Schema::new(vec![
|
||||
Field::new("id", DataType::Int32, false),
|
||||
Field::new("image", DataType::LargeBinary, true),
|
||||
])),
|
||||
vec![
|
||||
Arc::new(Int32Array::from(vec![1])),
|
||||
Arc::new(LargeBinaryArray::from_iter_values([b"x".as_slice()])),
|
||||
],
|
||||
)
|
||||
.unwrap();
|
||||
let coerced = coerce(batch, &blob_table_schema()).await;
|
||||
|
||||
let image_field = coerced.schema().field_with_name("image").unwrap().clone();
|
||||
assert!(
|
||||
image_field.is_blob_v2(),
|
||||
"expected blob marker on image field, got {:?}",
|
||||
image_field.metadata()
|
||||
);
|
||||
assert_eq!(
|
||||
coerced.schema().field_with_name("id").unwrap().data_type(),
|
||||
&DataType::Int64
|
||||
);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn exact_blob_input_passes_through_unchanged() {
|
||||
let DataType::Struct(children) = blob("image", true).data_type().clone() else {
|
||||
unreachable!("blob field is a struct")
|
||||
};
|
||||
let image = StructArray::new(
|
||||
children,
|
||||
vec![
|
||||
Arc::new(LargeBinaryArray::from_iter_values([b"exact".as_slice()])),
|
||||
Arc::new(StringArray::from(vec![None::<&str>])),
|
||||
],
|
||||
None,
|
||||
);
|
||||
let batch = batch_with_image(blob("image", true), Arc::new(image));
|
||||
let table_schema = blob_table_schema();
|
||||
|
||||
let input = plan_from_batch(batch).await;
|
||||
let input_ptr = Arc::as_ptr(&input);
|
||||
let plan = cast_to_table_schema(input, &table_schema).unwrap();
|
||||
assert_eq!(Arc::as_ptr(&plan), input_ptr, "no projection inserted");
|
||||
}
|
||||
}
|
||||
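The per-child decision that `coerce_blob_expr` makes can be modelled without Arrow or DataFusion. The sketch below is a simplified, std-only illustration. `ChildPlan`, `Input`, and `plan_blob_children` are hypothetical names, and data types are represented as plain strings, so the sketch only shows which of cast, field extraction, or typed null is chosen for each declared child.

```rust
/// What to emit for one declared child of the blob struct.
#[derive(Debug, PartialEq)]
enum ChildPlan {
    /// Raw binary input is cast into the declared `data` child.
    CastInput,
    /// The input struct has this child; cast only if the types differ.
    GetField { cast: bool },
    /// The input has nothing for this child, so fill it with a typed null.
    TypedNull,
}

/// Simplified view of the input column: raw bytes or a struct of (name, type).
enum Input<'a> {
    RawBinary,
    Struct(Vec<(&'a str, &'a str)>),
}

/// Decide, per declared child, how the output struct is assembled.
fn plan_blob_children(
    declared: &[(&str, &str)],
    input: &Input<'_>,
) -> Result<Vec<(String, ChildPlan)>, String> {
    // A struct input must carry a value: either inline `data` or a `uri`.
    if let Input::Struct(children) = input {
        if !children.iter().any(|(name, _)| *name == "data" || *name == "uri") {
            return Err("blob struct input must contain a 'data' or 'uri' child".to_string());
        }
    }
    let mut plan = Vec::with_capacity(declared.len());
    for (name, ty) in declared {
        let step = match input {
            Input::RawBinary => {
                if *name == "data" {
                    ChildPlan::CastInput
                } else {
                    ChildPlan::TypedNull
                }
            }
            Input::Struct(children) => match children.iter().find(|(n, _)| n == name) {
                Some((_, child_ty)) => ChildPlan::GetField { cast: child_ty != ty },
                None => ChildPlan::TypedNull,
            },
        };
        plan.push((name.to_string(), step));
    }
    Ok(plan)
}
```

Iterating over the declared children, not the input children, is what makes the output always match the table's layout: a narrow input widens with nulls, and the order follows the table schema.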
@@ -13,10 +13,8 @@ use datafusion_physical_expr::expressions::{CastExpr, Literal};
|
||||
use datafusion_physical_plan::expressions::Column;
|
||||
use datafusion_physical_plan::projection::ProjectionExec;
|
||||
use datafusion_physical_plan::{ExecutionPlan, PhysicalExpr};
|
||||
use lance_arrow::FieldExt;
|
||||
use lance_arrow::json::{is_arrow_json_field, is_json_field};
|
||||
|
||||
use super::blob_coerce::coerce_blob_expr;
|
||||
use crate::{Error, Result};
|
||||
|
||||
pub fn cast_to_table_schema(
|
||||
@@ -79,17 +77,6 @@ fn build_field_exprs(
|
||||
continue;
|
||||
}
|
||||
|
||||
// Blob columns accept raw binary on write; exact matches pass through below.
|
||||
if table_field.is_blob_v2() && input_field.as_ref() != table_field.as_ref() {
|
||||
result.push(coerce_blob_expr(
|
||||
input_expr,
|
||||
input_field,
|
||||
table_field,
|
||||
&config,
|
||||
)?);
|
||||
continue;
|
||||
}
|
||||
|
||||
let expr = match (input_field.data_type(), table_field.data_type()) {
|
||||
// Both are structs: recurse into sub-fields to handle subschemas and casts.
|
||||
(DataType::Struct(in_children), DataType::Struct(tbl_children))
|
||||
|
||||
@@ -4,7 +4,6 @@
|
||||
//! DataFusion ExecutionPlan for inserting data into LanceDB tables.
|
||||
|
||||
use std::any::Any;
|
||||
use std::sync::atomic::{AtomicU64, Ordering};
|
||||
use std::sync::{Arc, LazyLock, Mutex};
|
||||
|
||||
use arrow_array::{RecordBatch, UInt64Array};
|
||||
@@ -21,12 +20,11 @@ use datafusion_physical_plan::{
|
||||
use futures::TryStreamExt;
|
||||
use lance::Dataset;
|
||||
use lance::dataset::transaction::{Operation, Transaction};
|
||||
use lance::dataset::{CommitBuilder, InsertBuilder, WriteParams, WriteProgressFn};
|
||||
use lance::dataset::{CommitBuilder, InsertBuilder, WriteParams};
|
||||
use lance::io::exec::utils::InstrumentedRecordBatchStreamAdapter;
|
||||
use lance_table::format::Fragment;
|
||||
|
||||
use crate::table::dataset::DatasetConsistencyWrapper;
|
||||
use crate::table::write_progress::WriteProgressTracker;
|
||||
|
||||
pub(crate) static COUNT_SCHEMA: LazyLock<SchemaRef> = LazyLock::new(|| {
|
||||
Arc::new(ArrowSchema::new(vec![Field::new(
|
||||
@@ -83,7 +81,6 @@ pub struct InsertExec {
|
||||
dataset: Arc<Dataset>,
|
||||
input: Arc<dyn ExecutionPlan>,
|
||||
write_params: WriteParams,
|
||||
tracker: Option<Arc<WriteProgressTracker>>,
|
||||
properties: Arc<PlanProperties>,
|
||||
partial_transactions: Arc<Mutex<Vec<Transaction>>>,
|
||||
metrics: ExecutionPlanMetricsSet,
|
||||
@@ -95,16 +92,6 @@ impl InsertExec {
|
||||
dataset: Arc<Dataset>,
|
||||
input: Arc<dyn ExecutionPlan>,
|
||||
write_params: WriteParams,
|
||||
) -> Self {
|
||||
Self::new_with_tracker(ds_wrapper, dataset, input, write_params, None)
|
||||
}
|
||||
|
||||
pub(crate) fn new_with_tracker(
|
||||
ds_wrapper: DatasetConsistencyWrapper,
|
||||
dataset: Arc<Dataset>,
|
||||
input: Arc<dyn ExecutionPlan>,
|
||||
write_params: WriteParams,
|
||||
tracker: Option<Arc<WriteProgressTracker>>,
|
||||
) -> Self {
|
||||
let schema = COUNT_SCHEMA.clone();
|
||||
let num_partitions = input.output_partitioning().partition_count();
|
||||
@@ -120,7 +107,6 @@ impl InsertExec {
|
||||
dataset,
|
||||
input,
|
||||
write_params,
|
||||
tracker,
|
||||
properties: Arc::new(properties),
|
||||
partial_transactions: Arc::new(Mutex::new(Vec::with_capacity(num_partitions))),
|
||||
metrics: ExecutionPlanMetricsSet::new(),
|
||||
@@ -175,12 +161,11 @@ impl ExecutionPlan for InsertExec {
|
||||
"InsertExec requires exactly one child".to_string(),
|
||||
));
|
||||
}
|
||||
Ok(Arc::new(Self::new_with_tracker(
|
||||
Ok(Arc::new(Self::new(
|
||||
self.ds_wrapper.clone(),
|
||||
self.dataset.clone(),
|
||||
children[0].clone(),
|
||||
self.write_params.clone(),
|
||||
self.tracker.clone(),
|
||||
)))
|
||||
}
|
||||
|
||||
@@ -191,11 +176,10 @@ impl ExecutionPlan for InsertExec {
|
||||
) -> DataFusionResult<SendableRecordBatchStream> {
|
||||
let input_stream = self.input.execute(partition, context)?;
|
||||
let dataset = self.dataset.clone();
|
||||
let mut write_params = self.write_params.clone();
|
||||
let write_params = self.write_params.clone();
|
||||
let partial_transactions = self.partial_transactions.clone();
|
||||
let total_partitions = self.input.output_partitioning().partition_count();
|
||||
let ds_wrapper = self.ds_wrapper.clone();
|
||||
let tracker = self.tracker.clone();
|
||||
|
||||
let output_bytes = MetricBuilder::new(&self.metrics).output_bytes(partition);
|
||||
let input_schema = input_stream.schema();
|
||||
@@ -211,20 +195,6 @@ impl ExecutionPlan for InsertExec {
|
||||
));
|
||||
|
||||
let stream = futures::stream::once(async move {
|
||||
            if let Some(tracker) = tracker
                && write_params.write_progress.is_none()
            {
                let last_bytes = Arc::new(AtomicU64::new(0));
                write_params.write_progress = Some(WriteProgressFn::new(move |stats| {
                    let previous = last_bytes.swap(stats.bytes_written, Ordering::Relaxed);
                    if stats.bytes_written > previous {
                        let delta =
                            usize::try_from(stats.bytes_written - previous).unwrap_or(usize::MAX);
                        tracker.record_bytes(delta);
                    }
                }));
            }
|
||||
|
||||
let transaction = InsertBuilder::new(dataset.clone())
|
||||
.with_params(&write_params)
|
||||
.execute_uncommitted_stream(input_stream)
|
||||
|
||||
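The progress callback in the hunk above receives a cumulative `bytes_written` figure and turns it into increments for the tracker. Below is a minimal, std-only sketch of that conversion. `ByteTracker` and `delta_reporter` are hypothetical stand-ins for `WriteProgressTracker` and the `WriteProgressFn` closure; only the swap-and-compare logic is taken from the diff.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the tracker: it only sums reported deltas.
#[derive(Default)]
struct ByteTracker {
    total: AtomicU64,
}

impl ByteTracker {
    fn record_bytes(&self, delta: usize) {
        self.total.fetch_add(delta as u64, Ordering::Relaxed);
    }

    fn total(&self) -> u64 {
        self.total.load(Ordering::Relaxed)
    }
}

/// Build a callback that accepts cumulative byte counts and forwards
/// only the growth since the previous call.
fn delta_reporter(tracker: Arc<ByteTracker>) -> impl Fn(u64) {
    let last_bytes = AtomicU64::new(0);
    move |bytes_written| {
        // `swap` stores the new cumulative value and returns the old one.
        let previous = last_bytes.swap(bytes_written, Ordering::Relaxed);
        if bytes_written > previous {
            // Saturate on targets where usize is narrower than u64.
            let delta = usize::try_from(bytes_written - previous).unwrap_or(usize::MAX);
            tracker.record_bytes(delta);
        }
    }
}
```

Comparing against the previous value before recording means a repeated or smaller cumulative figure reports nothing, so the tracker never counts the same bytes twice for that call.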
@@ -516,11 +516,11 @@ mod tests {
|
||||
let uri = dir.path().to_str().unwrap();
|
||||
let ds = create_test_dataset(uri).await;
|
||||
|
||||
let wrapper = DatasetConsistencyWrapper::new_latest(ds, Some(Duration::from_millis(200)));
|
||||
// Other tests use a thread-local mock clock. Simulate leaked state from a
|
||||
// previous test to ensure this wrapper starts from real time.
|
||||
clock::advance_by(Duration::from_secs(60));
|
||||
|
||||
// Freeze `cached_at` on the mock clock so a slow external write below can't
|
||||
// expire the TTL before the explicit advance_by() does (flake on loaded CI).
|
||||
clock::pin();
|
||||
let wrapper = DatasetConsistencyWrapper::new_latest(ds, Some(Duration::from_millis(200)));
|
||||
|
||||
// Populate the cache
|
||||
let v1 = wrapper.get().await.unwrap().version().version;
|
||||
@@ -529,13 +529,12 @@ mod tests {
|
||||
// External write
|
||||
append_to_dataset(uri).await;
|
||||
|
||||
// Should return cached value immediately (within TTL), regardless of how
|
||||
// long the external write above took on a slow CI runner.
|
||||
// Should return cached value immediately (within TTL)
|
||||
let v_cached = wrapper.get().await.unwrap().version().version;
|
||||
assert_eq!(v_cached, 1);
|
||||
|
||||
// Advance the mock clock past the TTL so the next get() triggers a refresh.
|
||||
clock::advance_by(Duration::from_millis(300));
|
||||
// Wait for TTL to expire, then get() should trigger a refresh
|
||||
tokio::time::sleep(Duration::from_millis(300)).await;
|
||||
let v_after = wrapper.get().await.unwrap().version().version;
|
||||
assert_eq!(v_after, 2);
|
||||
}
|
||||
|
||||
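The test in this hunk exists in two variants: one waits with `tokio::time::sleep`, the other advances a mock clock. The std-only sketch below illustrates why a controllable clock makes a TTL test deterministic. `MockClock` and `TtlCache` are hypothetical types written for this illustration and are not the crate's `clock` module or `DatasetConsistencyWrapper`.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Hypothetical manual clock: time moves only when the test says so.
struct MockClock {
    now: Cell<Duration>,
}

impl MockClock {
    fn new() -> Self {
        Self { now: Cell::new(Duration::ZERO) }
    }

    fn advance_by(&self, d: Duration) {
        self.now.set(self.now.get() + d);
    }

    fn now(&self) -> Duration {
        self.now.get()
    }
}

/// Caches one value and reloads it once the TTL has elapsed on the clock.
struct TtlCache<'c> {
    clock: &'c MockClock,
    ttl: Duration,
    cached: Option<(u64, Duration)>, // (value, cached_at)
}

impl<'c> TtlCache<'c> {
    fn new(clock: &'c MockClock, ttl: Duration) -> Self {
        Self { clock, ttl, cached: None }
    }

    fn get(&mut self, load: impl FnOnce() -> u64) -> u64 {
        let now = self.clock.now();
        if let Some((value, cached_at)) = self.cached {
            // Still fresh: serve the cached value without calling `load`.
            if now.saturating_sub(cached_at) < self.ttl {
                return value;
            }
        }
        let value = load();
        self.cached = Some((value, now));
        value
    }
}
```

With the clock under test control, a slow step between populating the cache and reading it cannot expire the entry; expiry happens only at the explicit `advance_by` call.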
@@ -44,35 +44,17 @@ pub async fn execute_query(
|
||||
// QueryTable pushdown runs the query server-side, but only on the main
|
||||
// branch: the namespace request carries no branch yet, so a branch handle
|
||||
// must fall through to local execution.
|
||||
if can_execute_namespace_query(table, query)
|
||||
if table
|
||||
.pushdown_operations
|
||||
.contains(&NamespaceClientPushdownOperation::QueryTable)
|
||||
&& let Some(ref namespace_client) = table.namespace_client
|
||||
&& table.dataset.current_branch().is_none()
|
||||
{
|
||||
return execute_namespace_query(table, namespace_client.clone(), query, options).await;
|
||||
}
|
||||
execute_generic_query(table, query, options).await
|
||||
}
|
||||
|
||||
fn can_execute_namespace_query(table: &NativeTable, query: &AnyQuery) -> bool {
    table
        .pushdown_operations
        .contains(&NamespaceClientPushdownOperation::QueryTable)
        && table.namespace_client.is_some()
        && table.dataset.current_branch().is_none()
        && !requires_local_namespace_execution(query)
}

fn requires_local_namespace_execution(query: &AnyQuery) -> bool {
    // The namespace QueryTable request has no approx_mode field yet, so
    // pushing this query down would silently ignore the user's setting.
    matches!(
        query,
        AnyQuery::VectorQuery(VectorQueryRequest {
            approx_mode: Some(_),
            ..
        })
    )
}
|
||||
|
||||
pub async fn analyze_query_plan(
|
||||
table: &NativeTable,
|
||||
query: &AnyQuery,
|
||||
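The pushdown gate in `can_execute_namespace_query` combines four conditions. The sketch below restates them over simplified, hypothetical types (`TableState`, `Query`, `can_push_down`) so the truth table can be checked in isolation. The `u8` payload merely stands in for the real `approx_mode` value.

```rust
/// Hypothetical, flattened view of the table fields the gate inspects.
#[derive(Clone, Copy)]
struct TableState {
    supports_query_pushdown: bool,
    has_namespace_client: bool,
    on_branch: bool,
}

/// Hypothetical query shape: only vector queries can carry an approx mode.
enum Query {
    Plain,
    Vector { approx_mode: Option<u8> },
}

/// A query that sets an approx mode must run locally, because the
/// server-side request has no field to carry that setting.
fn requires_local_execution(query: &Query) -> bool {
    matches!(query, Query::Vector { approx_mode: Some(_) })
}

/// Push down only when every condition holds.
fn can_push_down(table: &TableState, query: &Query) -> bool {
    table.supports_query_pushdown
        && table.has_namespace_client
        && !table.on_branch
        && !requires_local_execution(query)
}
```

Falling back to local execution when a setting cannot be forwarded trades some server-side work for correctness: the alternative would silently drop the user's choice.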
@@ -185,10 +167,6 @@ pub async fn create_plan(
|
||||
scanner.nearest(&column, query_vector.as_ref(), top_k)?;
|
||||
}
|
||||
|
||||
if let Some(approx_mode) = query.approx_mode {
|
||||
scanner.approx_mode(approx_mode.into());
|
||||
}
|
||||
|
||||
scanner.minimum_nprobes(query.minimum_nprobes);
|
||||
if let Some(maximum_nprobes) = query.maximum_nprobes {
|
||||
scanner.maximum_nprobes(maximum_nprobes);
|
||||
@@ -609,20 +587,12 @@ async fn parse_arrow_ipc_response(bytes: bytes::Bytes) -> Result<DatasetRecordBa
|
||||
#[cfg(test)]
|
||||
#[allow(deprecated)]
|
||||
mod tests {
|
||||
use arrow_array::{ArrayRef, FixedSizeListArray, Float32Array};
|
||||
use arrow_array::Float32Array;
|
||||
use futures::TryStreamExt;
|
||||
use lance_arrow::FixedSizeListArrayExt;
|
||||
use std::sync::{
|
||||
Arc,
|
||||
atomic::{AtomicUsize, Ordering},
|
||||
};
|
||||
use std::sync::Arc;
|
||||
|
||||
use super::*;
|
||||
use crate::query::{QueryExecutionOptions, QueryRequest};
|
||||
|
||||
fn fixed_size_list_array(values: Vec<f32>, dimension: i32) -> FixedSizeListArray {
|
||||
FixedSizeListArray::try_new_from_values(Float32Array::from(values), dimension).unwrap()
|
||||
}
|
||||
use crate::query::QueryExecutionOptions;
|
||||
|
||||
#[test]
|
||||
fn test_convert_to_namespace_query_vector() {
|
||||
@@ -745,80 +715,6 @@ mod tests {
|
||||
assert_eq!(count, 2); // 4 and 5
|
||||
}
|
||||
|
||||
#[derive(Debug, Default)]
|
||||
struct CountingNamespaceClient {
|
||||
query_table_calls: AtomicUsize,
|
||||
}
|
||||
|
||||
#[async_trait::async_trait]
|
||||
impl LanceNamespace for CountingNamespaceClient {
|
||||
fn namespace_id(&self) -> String {
|
||||
"counting".to_string()
|
||||
}
|
||||
|
||||
async fn query_table(&self, _request: NsQueryTableRequest) -> lance::Result<bytes::Bytes> {
|
||||
self.query_table_calls.fetch_add(1, Ordering::SeqCst);
|
||||
panic!("approx_mode queries must not be pushed down to namespace query_table");
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_execute_query_approx_mode_with_namespace_pushdown_runs_locally() {
|
||||
use crate::connect;
|
||||
use crate::table::query::execute_query;
|
||||
use arrow_array::{Int32Array, RecordBatch};
|
||||
use arrow_schema::{DataType, Field, Schema};
|
||||
|
||||
let conn = connect("memory://").execute().await.unwrap();
|
||||
|
||||
let vectors = Arc::new(fixed_size_list_array(
|
||||
vec![0.0, 0.0, 10.0, 10.0, 20.0, 20.0],
|
||||
2,
|
||||
));
|
||||
let schema = Arc::new(Schema::new(vec![
|
||||
Field::new("id", DataType::Int32, false),
|
||||
Field::new("vector", vectors.data_type().clone(), false),
|
||||
]));
|
||||
let batch = RecordBatch::try_new(
|
||||
schema,
|
||||
vec![Arc::new(Int32Array::from(vec![1, 2, 3])), vectors],
|
||||
)
|
||||
.unwrap();
|
||||
|
||||
let table = conn
|
||||
.create_table("test_approx_mode_namespace_fallback", batch)
|
||||
.execute()
|
||||
.await
|
||||
.unwrap();
|
||||
let namespace_client = Arc::new(CountingNamespaceClient::default());
|
||||
let mut native_table = table.as_native().unwrap().clone();
|
||||
native_table.namespace_client = Some(namespace_client.clone());
|
||||
native_table
|
||||
.pushdown_operations
|
||||
.insert(NamespaceClientPushdownOperation::QueryTable);
|
||||
|
||||
let query_vector = Arc::new(Float32Array::from(vec![0.0, 0.0]));
|
||||
let query = AnyQuery::VectorQuery(VectorQueryRequest {
|
||||
base: QueryRequest {
|
||||
limit: Some(1),
|
||||
..Default::default()
|
||||
},
|
||||
column: Some("vector".to_string()),
|
||||
query_vector: vec![query_vector as ArrayRef],
|
||||
approx_mode: Some(crate::ApproxMode::Accurate),
|
||||
..Default::default()
|
||||
});
|
||||
|
||||
let stream = execute_query(&native_table, &query, QueryExecutionOptions::default())
|
||||
.await
|
||||
.unwrap();
|
||||
let batches = stream.try_collect::<Vec<_>>().await.unwrap();
|
||||
let count: usize = batches.iter().map(|b| b.num_rows()).sum();
|
||||
|
||||
assert_eq!(count, 1);
|
||||
assert_eq!(namespace_client.query_table_calls.load(Ordering::SeqCst), 0);
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_create_plan_multivector_structure() {
|
||||
use arrow_array::{Float32Array, RecordBatch};
|
||||
@@ -883,97 +779,4 @@ mod tests {
            "Plan should add query_index column"
        );
    }

    #[tokio::test]
    async fn test_create_plan_applies_approx_mode_to_ann_query() {
        use arrow_array::RecordBatch;
        use arrow_schema::{DataType, Field, Schema};
        use datafusion_physical_plan::ExecutionPlan;
        use lance::io::exec::{ANNIvfPartitionExec, ANNIvfSubIndexExec};
        use lance_index::vector::ApproxMode;

        use crate::connect;
        use crate::index::{Index, vector::IvfRqIndexBuilder};
        use crate::table::query::create_plan;

        fn find_ann_approx_mode(plan: &dyn ExecutionPlan) -> Option<ApproxMode> {
            if let Some(ann) = plan.as_any().downcast_ref::<ANNIvfSubIndexExec>() {
                return Some(ann.query().approx_mode);
            }
            if let Some(ann) = plan.as_any().downcast_ref::<ANNIvfPartitionExec>() {
                return Some(ann.query.approx_mode);
            }
            plan.children()
                .into_iter()
                .find_map(|child| find_ann_approx_mode(child.as_ref()))
        }

        let conn = connect("memory://").execute().await.unwrap();
        let dimension = 8;
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int32, false),
            Field::new(
                "vector",
                DataType::FixedSizeList(
                    Arc::new(Field::new("item", DataType::Float32, true)),
                    dimension,
                ),
                false,
            ),
        ]));

        let vectors = Arc::new(fixed_size_list_array(
            (0..512 * dimension)
                .map(|value| value as f32 / dimension as f32)
                .collect(),
            dimension,
        ));
        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(arrow_array::Int32Array::from_iter_values(0..512)),
                vectors,
            ],
        )
        .unwrap();
        let table = conn
            .create_table("test_approx_mode_plan", vec![batch])
            .execute()
            .await
            .unwrap();
        table
            .create_index(
                &["vector"],
                Index::IvfRq(
                    IvfRqIndexBuilder::default()
                        .num_partitions(1)
                        .sample_rate(1)
                        .max_iterations(1)
                        .num_bits(1),
                ),
            )
            .execute()
            .await
            .unwrap();
        let native_table = table.as_native().unwrap();
        let query_vector = Arc::new(Float32Array::from(vec![0.0; dimension as usize]));
        let query = AnyQuery::VectorQuery(VectorQueryRequest {
            column: Some("vector".to_string()),
            query_vector: vec![query_vector as ArrayRef],
            base: QueryRequest {
                limit: Some(1),
                ..Default::default()
            },
            approx_mode: Some(crate::ApproxMode::Accurate),
            ..Default::default()
        });

        let plan = create_plan(native_table, &query, QueryExecutionOptions::default())
            .await
            .unwrap();
        assert_eq!(
            find_ann_approx_mode(plan.as_ref()),
            Some(ApproxMode::Accurate)
        );
    }
}

@@ -142,21 +142,11 @@ impl WriteProgressTracker {
         cb(&progress);
     }

-    /// Record wire bytes from the insert layer.
-    ///
-    /// These bytes may be IPC-encoded bytes for remote writes or bytes handed
-    /// to Lance's local writer. When wire bytes are recorded, they take
-    /// precedence over the in-memory Arrow bytes tracked by [`record_batch`].
+    /// Record wire bytes from the insert layer (e.g. IPC-encoded bytes for
+    /// remote writes). When wire bytes are recorded, they take precedence over
+    /// the in-memory Arrow bytes tracked by [`record_batch`].
     pub fn record_bytes(&self, bytes: usize) {
         self.wire_bytes.fetch_add(bytes, Ordering::Relaxed);
-        let mut cb = self.callback.lock().unwrap_or_else(|e| e.into_inner());
-        let guard = self
-            .rows_and_bytes
-            .lock()
-            .unwrap_or_else(|e| e.into_inner());
-        let progress = self.snapshot(guard.0, guard.1, false);
-        drop(guard);
-        cb(&progress);
     }

     /// Emit the final progress callback indicating the write is complete.
@@ -179,6 +169,8 @@ impl WriteProgressTracker {
         let wire = self.wire_bytes.load(Ordering::Relaxed);
         // Prefer wire bytes (actual I/O size) when the insert layer is
         // tracking them; fall back to in-memory Arrow size otherwise.
+        // TODO: for local writes, track actual bytes written by Lance
+        // instead of using in-memory Arrow size as a proxy.
         let output_bytes = if wire > 0 { wire } else { in_memory_bytes };
         WriteProgress {
             elapsed: self.start.elapsed(),
@@ -391,54 +383,6 @@ mod tests {
        }
    }

    #[tokio::test]
    async fn test_progress_uses_lance_write_bytes_for_local_tables() {
        let dir = tempfile::tempdir().unwrap();
        let db = connect(dir.path().to_str().unwrap())
            .execute()
            .await
            .unwrap();

        let batch = record_batch!(("id", Int32, [1, 2, 3])).unwrap();
        let table = db
            .create_table("local_write_bytes", batch)
            .execute()
            .await
            .unwrap();

        let new_data = record_batch!(("id", Int32, [4, 5, 6])).unwrap();
        let in_memory_bytes = new_data.get_array_memory_size();
        let final_bytes = Arc::new(AtomicUsize::new(0));
        let seen_non_memory_bytes = Arc::new(std::sync::atomic::AtomicBool::new(false));
        let final_bytes_cb = final_bytes.clone();
        let seen_non_memory_bytes_cb = seen_non_memory_bytes.clone();

        table
            .add(new_data)
            .write_parallelism(1)
            .progress(move |p| {
                if p.output_bytes() > 0 && p.output_bytes() != in_memory_bytes {
                    seen_non_memory_bytes_cb.store(true, Ordering::SeqCst);
                }
                if p.done() {
                    final_bytes_cb.store(p.output_bytes(), Ordering::SeqCst);
                }
            })
            .execute()
            .await
            .unwrap();

        assert!(
            seen_non_memory_bytes.load(Ordering::SeqCst),
            "progress should report Lance writer bytes, not only Arrow memory bytes"
        );
        assert_ne!(
            final_bytes.load(Ordering::SeqCst),
            in_memory_bytes,
            "final progress bytes should come from Lance write stats"
        );
    }

    #[test]
    fn test_record_batch_recovers_from_poisoned_callback_lock() {
        use super::{ProgressCallback, WriteProgressTracker};

@@ -329,15 +329,6 @@ pub mod clock {
        });
    }

    /// Start mock time at the current instant if not already pinned.
    pub fn pin() {
        MOCK_NOW.with(|mock| {
            if mock.get().is_none() {
                mock.set(Some(Instant::now()));
            }
        });
    }

    #[allow(dead_code)]
    pub fn clear_mock() {
        MOCK_NOW.with(|mock| mock.set(None));

@@ -1,949 +0,0 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors

use std::sync::Arc;

use arrow_array::{
    Array, ArrayRef, BinaryArray, Int64Array, LargeBinaryArray, RecordBatch, StringArray,
    StructArray, UInt64Array,
};
use arrow_schema::{DataType, Field, Fields, Schema};
use futures::TryStreamExt;
use lance_encoding::version::LanceFileVersion;
use lancedb::{
    Connection, Error, Result, Table,
    blob::blob,
    connect, connect_namespace,
    database::listing::OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS,
    query::{ExecutableQuery, QueryBase},
    table::{AddDataMode, CompactionOptions, OptimizeAction},
};
use tempfile::tempdir;

fn blob_table_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        blob("image", true),
    ]))
}

fn binary_input_batch(ids: &[i64], payloads: &[Option<&[u8]>]) -> RecordBatch {
    RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("image", DataType::LargeBinary, true),
        ])),
        vec![
            Arc::new(Int64Array::from(ids.to_vec())),
            Arc::new(LargeBinaryArray::from_iter(payloads.iter().copied())),
        ],
    )
    .unwrap()
}

async fn create_inline_blob_table(
    db: &Connection,
    name: &str,
    ids: &[i64],
    payloads: &[Option<&[u8]>],
) -> Result<Table> {
    let table = db
        .create_empty_table(name, blob_table_schema())
        .execute()
        .await?;
    table
        .add(binary_input_batch(ids, payloads))
        .execute()
        .await?;
    Ok(table)
}

async fn storage_format_version(table: &Table) -> LanceFileVersion {
    table
        .as_native()
        .unwrap()
        .manifest()
        .await
        .unwrap()
        .data_storage_format
        .lance_file_version()
        .unwrap()
        .resolve()
}

async fn uses_stable_row_ids(table: &Table) -> bool {
    table
        .as_native()
        .unwrap()
        .manifest()
        .await
        .unwrap()
        .uses_stable_row_ids()
}

async fn query_image_struct(table: &Table) -> StructArray {
    let batches = table
        .query()
        .execute()
        .await
        .unwrap()
        .try_collect::<Vec<_>>()
        .await
        .unwrap();
    let batch = arrow_select::concat::concat_batches(&batches[0].schema(), &batches).unwrap();
    batch
        .column_by_name("image")
        .expect("image column present")
        .as_any()
        .downcast_ref::<StructArray>()
        .expect("image column is a descriptor struct")
        .clone()
}

#[tokio::test]
async fn declaring_blob_column_bumps_format_and_enables_stable_row_ids() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = db
        .create_empty_table("t", blob_table_schema())
        .execute()
        .await?;

    assert!(storage_format_version(&table).await >= LanceFileVersion::V2_2);
    assert!(uses_stable_row_ids(&table).await);
    Ok(())
}

#[tokio::test]
async fn explicit_stable_row_id_setting_wins_over_blob_default() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = db
        .create_empty_table("t", blob_table_schema())
        .storage_option(OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS, "false")
        .execute()
        .await?;

    assert!(storage_format_version(&table).await >= LanceFileVersion::V2_2);
    assert!(!uses_stable_row_ids(&table).await);
    Ok(())
}

#[tokio::test]
async fn non_blob_table_keeps_default_format_and_row_id_setting() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let table = db.create_empty_table("t", schema).execute().await?;

    assert!(storage_format_version(&table).await < LanceFileVersion::V2_2);
    assert!(!uses_stable_row_ids(&table).await);
    Ok(())
}

#[tokio::test]
async fn creating_with_blob_data_bumps_format() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;

    let blob_field = blob("image", true);
    let DataType::Struct(children) = blob_field.data_type().clone() else {
        unreachable!("blob field is a struct")
    };
    let image = StructArray::new(
        children,
        vec![
            Arc::new(LargeBinaryArray::from_iter_values([b"payload".as_slice()])),
            Arc::new(StringArray::from(vec![None::<&str>])),
        ],
        None,
    );
    let batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            blob_field,
        ])),
        vec![Arc::new(Int64Array::from(vec![1])), Arc::new(image)],
    )
    .unwrap();
    let table = db.create_table("t", batch).execute().await?;

    assert!(storage_format_version(&table).await >= LanceFileVersion::V2_2);
    assert!(uses_stable_row_ids(&table).await);
    assert_eq!(table.count_rows(None).await?, 1);
    Ok(())
}

#[tokio::test]
async fn add_coerces_large_binary_into_blob_column() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table =
        create_inline_blob_table(&db, "t", &[1, 2], &[Some(b"cat".as_slice()), Some(b"dog")])
            .await?;

    assert_eq!(table.count_rows(None).await?, 2);
    let image = query_image_struct(&table).await;
    assert_eq!(image.len(), 2);
    let schema = table.schema().await?;
    let field = schema.field_with_name("image").unwrap();
    assert_eq!(
        field
            .metadata()
            .get("ARROW:extension:name")
            .map(String::as_str),
        Some("lance.blob.v2")
    );
    Ok(())
}

#[tokio::test]
async fn add_coerces_binary_into_blob_column() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = db
        .create_empty_table("t", blob_table_schema())
        .execute()
        .await?;

    let batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("image", DataType::Binary, true),
        ])),
        vec![
            Arc::new(Int64Array::from(vec![1])),
            Arc::new(BinaryArray::from_iter_values([b"small".as_slice()])),
        ],
    )
    .unwrap();
    table.add(batch).execute().await?;

    assert_eq!(table.count_rows(None).await?, 1);
    Ok(())
}

#[tokio::test]
async fn add_accepts_null_blob_rows() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(
        &db,
        "t",
        &[1, 2, 3],
        &[Some(b"first".as_slice()), None, Some(b"third")],
    )
    .await?;

    assert_eq!(table.count_rows(None).await?, 3);
    let image = query_image_struct(&table).await;
    assert_eq!(image.len(), 3);
    Ok(())
}

#[tokio::test]
async fn add_rejects_uncoercible_blob_input() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = db
        .create_empty_table("t", blob_table_schema())
        .execute()
        .await?;

    let batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("image", DataType::Utf8, true),
        ])),
        vec![
            Arc::new(Int64Array::from(vec![1])),
            Arc::new(StringArray::from(vec!["not bytes"])),
        ],
    )
    .unwrap();
    let err = table.add(batch).execute().await.unwrap_err();
    assert!(err.to_string().contains("image"));
    Ok(())
}

#[tokio::test]
async fn connection_level_stable_row_id_setting_wins_over_blob_default() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap())
        .storage_option(OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS, "false")
        .execute()
        .await?;
    let table = db
        .create_empty_table("t", blob_table_schema())
        .execute()
        .await?;

    assert!(storage_format_version(&table).await >= LanceFileVersion::V2_2);
    assert!(!uses_stable_row_ids(&table).await);
    Ok(())
}

#[tokio::test]
async fn namespace_create_applies_blob_defaults() -> Result<()> {
    let tmp = tempdir().unwrap();
    let mut properties = std::collections::HashMap::new();
    properties.insert("root".to_string(), tmp.path().to_str().unwrap().to_string());
    let db = connect_namespace("dir", properties).execute().await?;
    let table = db
        .create_empty_table("t", blob_table_schema())
        .execute()
        .await?;

    assert!(storage_format_version(&table).await >= LanceFileVersion::V2_2);
    assert!(uses_stable_row_ids(&table).await);
    Ok(())
}

// Overwrite takes the input schema as-is. A raw-binary overwrite drops the blob
// marker; re-declaring blob v2 in the input restores it.
#[tokio::test]
async fn overwrite_replaces_blob_schema_with_input_schema() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"blob".as_slice())]).await?;

    let raw_schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("image", DataType::LargeBinary, true),
    ]));
    let raw_batch = RecordBatch::try_new(
        raw_schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![2])),
            Arc::new(LargeBinaryArray::from_iter_values([b"plain".as_slice()])),
        ],
    )
    .unwrap();
    table
        .add(raw_batch)
        .mode(AddDataMode::Overwrite)
        .execute()
        .await?;
    let schema = table.schema().await?;
    assert_eq!(schema, raw_schema);
    assert!(
        !schema
            .field_with_name("image")
            .unwrap()
            .metadata()
            .contains_key("ARROW:extension:name")
    );

    let blob_field = blob("image", true);
    let DataType::Struct(children) = blob_field.data_type().clone() else {
        unreachable!("blob field is a struct")
    };
    let image = StructArray::new(
        children,
        vec![
            Arc::new(LargeBinaryArray::from_iter_values([b"declared".as_slice()])),
            Arc::new(StringArray::from(vec![None::<&str>])),
        ],
        None,
    );
    let declared_batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            blob_field,
        ])),
        vec![Arc::new(Int64Array::from(vec![3])), Arc::new(image)],
    )
    .unwrap();
    table
        .add(declared_batch)
        .mode(AddDataMode::Overwrite)
        .execute()
        .await?;
    let schema = table.schema().await?;
    assert_eq!(
        schema
            .field_with_name("image")
            .unwrap()
            .metadata()
            .get("ARROW:extension:name")
            .map(String::as_str),
        Some("lance.blob.v2")
    );
    Ok(())
}

async fn collect_row_ids(table: &Table) -> Result<Vec<u64>> {
    let batches = table
        .query()
        .with_row_id()
        .execute()
        .await?
        .try_collect::<Vec<_>>()
        .await?;
    let batch = arrow_select::concat::concat_batches(&batches[0].schema(), &batches).unwrap();
    Ok(batch
        .column_by_name("_rowid")
        .unwrap()
        .as_any()
        .downcast_ref::<UInt64Array>()
        .unwrap()
        .values()
        .to_vec())
}

async fn collect_id_rowid(table: &Table) -> Result<Vec<(i64, u64)>> {
    let batches = table
        .query()
        .with_row_id()
        .execute()
        .await?
        .try_collect::<Vec<_>>()
        .await?;
    let batch = arrow_select::concat::concat_batches(&batches[0].schema(), &batches).unwrap();
    let ids = batch
        .column_by_name("id")
        .unwrap()
        .as_any()
        .downcast_ref::<Int64Array>()
        .unwrap();
    let row_ids = batch
        .column_by_name("_rowid")
        .unwrap()
        .as_any()
        .downcast_ref::<UInt64Array>()
        .unwrap();
    Ok(ids
        .values()
        .iter()
        .copied()
        .zip(row_ids.values().iter().copied())
        .collect())
}

#[tokio::test]
async fn fetch_blobs_round_trips_bytes() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let payload: &[u8] = b"blob-round-trip-payload";
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(payload)]).await?;

    let ids = collect_row_ids(&table).await?;
    let bytes = table.fetch_blobs("image", &ids).await?;
    assert_eq!(bytes.len(), 1);
    assert_eq!(bytes.value(0), payload);
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_round_trips_nested_blob_column() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;

    let blob_field = blob("blob", true);
    let DataType::Struct(blob_children) = blob_field.data_type().clone() else {
        unreachable!("blob field is a struct")
    };
    let blob_array = StructArray::new(
        blob_children,
        vec![
            Arc::new(LargeBinaryArray::from_iter_values([
                b"hello".as_slice(),
                b"world".as_slice(),
            ])) as ArrayRef,
            Arc::new(StringArray::from(vec![None::<&str>, None::<&str>])) as ArrayRef,
        ],
        None,
    );
    let info_fields: Fields = vec![Field::new("name", DataType::Utf8, false), blob_field].into();
    let info_array = StructArray::new(
        info_fields.clone(),
        vec![
            Arc::new(StringArray::from(vec!["a", "b"])) as ArrayRef,
            Arc::new(blob_array) as ArrayRef,
        ],
        None,
    );
    let schema = Arc::new(Schema::new(vec![Field::new(
        "info",
        DataType::Struct(info_fields),
        true,
    )]));
    let batch = RecordBatch::try_new(schema, vec![Arc::new(info_array) as ArrayRef]).unwrap();
    let table = db.create_table("t", batch).execute().await?;

    assert!(storage_format_version(&table).await >= LanceFileVersion::V2_2);
    assert!(uses_stable_row_ids(&table).await);

    let ids = collect_row_ids(&table).await?;
    let bytes = table.fetch_blobs("info.blob", &ids).await?;
    assert_eq!(bytes.len(), 2);
    let values: std::collections::HashSet<&[u8]> =
        (0..bytes.len()).map(|i| bytes.value(i)).collect();
    assert!(values.contains(b"hello".as_slice()));
    assert!(values.contains(b"world".as_slice()));
    Ok(())
}

#[tokio::test]
async fn blob_columns_lists_nested_dotted_paths() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let blob_field = blob("blob", true);
    let info = Field::new(
        "info",
        DataType::Struct(vec![Field::new("name", DataType::Utf8, false), blob_field].into()),
        true,
    );
    let schema = Arc::new(Schema::new(vec![
        blob("thumbnail", true),
        Field::new("id", DataType::Int64, false),
        info,
    ]));
    let table = db.create_empty_table("t", schema).execute().await?;
    assert_eq!(table.blob_columns().await?, vec!["thumbnail", "info.blob"]);
    Ok(())
}

#[tokio::test]
async fn blob_columns_lists_blob_fields_in_order() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let schema = Arc::new(Schema::new(vec![
        blob("thumbnail", true),
        Field::new("id", DataType::Int64, false),
        blob("image", true),
    ]));
    let table = db.create_empty_table("t", schema).execute().await?;
    assert_eq!(table.blob_columns().await?, vec!["thumbnail", "image"]);

    let plain = db
        .create_empty_table(
            "plain",
            Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)])),
        )
        .execute()
        .await?;
    assert!(plain.blob_columns().await?.is_empty());
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_preserves_null_alignment() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(
        &db,
        "t",
        &[1, 2, 3, 4],
        &[Some(b"a".as_slice()), None, Some(b"c"), None],
    )
    .await?;

    let pairs = collect_id_rowid(&table).await?;
    let ids: Vec<u64> = pairs.iter().map(|(_, rowid)| *rowid).collect();
    let bytes = table.fetch_blobs("image", &ids).await?;
    assert_eq!(bytes.len(), ids.len());
    for (i, (id, _)) in pairs.iter().enumerate() {
        match id {
            1 => assert_eq!(bytes.value(i), b"a"),
            2 | 4 => assert!(bytes.is_null(i)),
            3 => assert_eq!(bytes.value(i), b"c"),
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_all_null_column_returns_all_nulls() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1, 2], &[None, None]).await?;

    let ids = collect_row_ids(&table).await?;
    let bytes = table.fetch_blobs("image", &ids).await?;
    assert_eq!(bytes.len(), 2);
    assert_eq!(bytes.null_count(), 2);

    let files = table.fetch_blob_files("image", &ids).await?;
    assert_eq!(files.len(), 2);
    assert!(files.iter().all(Option::is_none));
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_aligns_with_reordered_and_duplicate_ids() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(
        &db,
        "t",
        &[1, 2, 3],
        &[Some(b"one".as_slice()), Some(b"two"), Some(b"three")],
    )
    .await?;

    let pairs = collect_id_rowid(&table).await?;
    let by_id = |want: i64| pairs.iter().find(|(id, _)| *id == want).unwrap().1;
    let request = vec![by_id(3), by_id(1), by_id(3), by_id(2)];
    let bytes = table.fetch_blobs("image", &request).await?;
    assert_eq!(bytes.len(), 4);
    assert_eq!(bytes.value(0), b"three");
    assert_eq!(bytes.value(1), b"one");
    assert_eq!(bytes.value(2), b"three");
    assert_eq!(bytes.value(3), b"two");
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_empty_ids_returns_empty() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"x".as_slice())]).await?;

    assert_eq!(table.fetch_blobs("image", &[]).await?.len(), 0);
    assert!(table.fetch_blob_files("image", &[]).await?.is_empty());
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_out_of_range_id_errors_without_panic() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"x".as_slice())]).await?;

    let err = table.fetch_blobs("image", &[u64::MAX]).await.unwrap_err();
    assert!(err.to_string().contains("row ids"));
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_rejects_non_blob_column() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"x".as_slice())]).await?;

    let err = table.fetch_blobs("id", &[0]).await.unwrap_err();
    assert!(matches!(err, Error::InvalidInput { .. }));
    assert!(err.to_string().contains("'id' is not a blob column"));

    let err = table.fetch_blob_files("id", &[0]).await.unwrap_err();
    assert!(err.to_string().contains("'id' is not a blob column"));
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_rejects_unknown_column() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"x".as_slice())]).await?;

    let err = table.fetch_blobs("missing", &[0]).await.unwrap_err();
    assert!(err.to_string().contains("no column named 'missing'"));
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_rejects_legacy_v1_blob_column() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let legacy = Field::new("image", DataType::LargeBinary, true).with_metadata(
        std::collections::HashMap::from([("lance-encoding:blob".to_string(), "true".to_string())]),
    );
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        legacy,
    ]));
    let table = db.create_empty_table("t", schema).execute().await?;

    let err = table.fetch_blobs("image", &[0]).await.unwrap_err();
    assert!(err.to_string().contains("legacy blob column"));
    Ok(())
}

#[tokio::test]
async fn fetch_blob_files_reads_lazily_and_aligns_nulls() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table =
        create_inline_blob_table(&db, "t", &[1, 2], &[Some(b"lazy-bytes".as_slice()), None])
            .await?;

    let pairs = collect_id_rowid(&table).await?;
    let ids: Vec<u64> = pairs.iter().map(|(_, rowid)| *rowid).collect();
    let files = table.fetch_blob_files("image", &ids).await?;
    assert_eq!(files.len(), 2);
    for ((id, _), file) in pairs.iter().zip(&files) {
        match id {
            1 => {
                let handle = file.as_ref().unwrap();
                assert_eq!(handle.read().await.unwrap().as_ref(), b"lazy-bytes");
            }
            2 => assert!(file.is_none()),
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_reads_multiple_blob_columns_independently() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        blob("image", true),
        blob("thumbnail", true),
    ]));
    let table = db.create_empty_table("t", schema).execute().await?;
    let batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Int64, false),
            Field::new("image", DataType::LargeBinary, true),
            Field::new("thumbnail", DataType::LargeBinary, true),
        ])),
        vec![
            Arc::new(Int64Array::from(vec![1, 2])),
            Arc::new(LargeBinaryArray::from_iter(vec![
                Some(b"image-1".as_slice()),
                None,
            ])),
            Arc::new(LargeBinaryArray::from_iter(vec![
                None,
                Some(b"thumb-2".as_slice()),
            ])),
        ],
    )
    .unwrap();
    table.add(batch).execute().await?;

    let pairs = collect_id_rowid(&table).await?;
    let ids: Vec<u64> = pairs.iter().map(|(_, rowid)| *rowid).collect();
    let images = table.fetch_blobs("image", &ids).await?;
    let thumbs = table.fetch_blobs("thumbnail", &ids).await?;
    for (i, (id, _)) in pairs.iter().enumerate() {
        match id {
            1 => {
                assert_eq!(images.value(i), b"image-1");
                assert!(thumbs.is_null(i));
            }
            2 => {
                assert!(images.is_null(i));
                assert_eq!(thumbs.value(i), b"thumb-2");
            }
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_spans_fragments() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"frag-one".as_slice())]).await?;
    table
        .add(binary_input_batch(&[2], &[Some(b"frag-two".as_slice())]))
        .execute()
        .await?;

    let pairs = collect_id_rowid(&table).await?;
    let ids: Vec<u64> = pairs.iter().map(|(_, rowid)| *rowid).collect();
    let bytes = table.fetch_blobs("image", &ids).await?;
    for (i, (id, _)) in pairs.iter().enumerate() {
        match id {
            1 => assert_eq!(bytes.value(i), b"frag-one"),
            2 => assert_eq!(bytes.value(i), b"frag-two"),
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_packed_payload_round_trip() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let big = vec![0xAB_u8; 100 * 1024];
    let small = b"small".to_vec();
    let table = create_inline_blob_table(
        &db,
        "t",
        &[1, 2],
        &[Some(big.as_slice()), Some(small.as_slice())],
    )
    .await?;

    let pairs = collect_id_rowid(&table).await?;
    let ids: Vec<u64> = pairs.iter().map(|(_, rowid)| *rowid).collect();
    let bytes = table.fetch_blobs("image", &ids).await?;
    for (i, (id, _)) in pairs.iter().enumerate() {
        match id {
            1 => assert_eq!(bytes.value(i), big.as_slice()),
            2 => assert_eq!(bytes.value(i), small.as_slice()),
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_after_delete() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(
        &db,
        "t",
        &[1, 2, 3],
        &[Some(b"one".as_slice()), Some(b"two"), Some(b"three")],
    )
    .await?;

    table.delete("id = 2").await?;
    let pairs = collect_id_rowid(&table).await?;
    assert_eq!(pairs.len(), 2);
    let ids: Vec<u64> = pairs.iter().map(|(_, rowid)| *rowid).collect();
    let bytes = table.fetch_blobs("image", &ids).await?;
    for (i, (id, _)) in pairs.iter().enumerate() {
        match id {
            1 => assert_eq!(bytes.value(i), b"one"),
            3 => assert_eq!(bytes.value(i), b"three"),
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blobs_with_precompaction_row_ids_survives_compaction() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"frag-one".as_slice())]).await?;
    table
        .add(binary_input_batch(&[2], &[Some(b"frag-two".as_slice())]))
        .execute()
        .await?;

    let pairs_before = collect_id_rowid(&table).await?;
    let ids_before: Vec<u64> = pairs_before.iter().map(|(_, rowid)| *rowid).collect();

    table
        .optimize(OptimizeAction::Compact {
            options: CompactionOptions::default(),
            remap_options: None,
        })
        .await?;

    let bytes_after = table.fetch_blobs("image", &ids_before).await?;
    assert_eq!(bytes_after.len(), 2);
    for (i, (id, _)) in pairs_before.iter().enumerate() {
        match id {
            1 => assert_eq!(bytes_after.value(i), b"frag-one"),
            2 => assert_eq!(bytes_after.value(i), b"frag-two"),
            _ => unreachable!(),
        }
    }
    Ok(())
}

#[tokio::test]
async fn zero_length_blob_reads_back_as_null() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = create_inline_blob_table(&db, "t", &[1], &[Some(b"".as_slice())]).await?;

    let ids = collect_row_ids(&table).await?;
    let bytes = table.fetch_blobs("image", &ids).await?;
    assert_eq!(bytes.len(), 1);
    assert!(bytes.is_null(0));
    Ok(())
}

// Payload size (64 KiB) for the blobs used by the multi-fragment tests below.
const DEDICATED_BLOB_LEN: usize = 64 * 1024;
// Out-of-order lookup sequence; logical id 6 appears twice to exercise duplicates.
const SCRAMBLED_LOGICAL_IDS: [i64; 7] = [6, 3, 1, 4, 6, 2, 5];

// Builds a `DEDICATED_BLOB_LEN`-byte payload filled with `tag`.
fn dedicated_blob_bytes(tag: u8) -> Vec<u8> {
    vec![tag; DEDICATED_BLOB_LEN]
}

// Creates table `t` from the first row, then appends every remaining row with
// its own `add` call. Logical ids 3 and 5 carry null blobs.
async fn multi_fragment_dedicated_blob_table(db: &Connection) -> Result<Table> {
    let rows: [(i64, Option<u8>); 6] = [
        (1, Some(1)),
        (2, Some(2)),
        (3, None),
        (4, Some(4)),
        (5, None),
        (6, Some(6)),
    ];
    let mut table: Option<Table> = None;
    for (logical_id, blob_tag) in rows {
        let bytes = blob_tag.map(dedicated_blob_bytes);
        let image = [bytes.as_deref()];
        table = Some(match table {
            None => create_inline_blob_table(db, "t", &[logical_id], &image).await?,
            Some(t) => {
                t.add(binary_input_batch(&[logical_id], &image))
                    .execute()
                    .await?;
                t
            }
        });
    }
    Ok(table.unwrap())
}

// Maps each logical id to its row id, preserving input order and duplicates.
// Panics if a logical id is not present in the table.
async fn row_ids_for_logical(table: &Table, logical_ids: &[i64]) -> Result<Vec<u64>> {
    let id_rowid = collect_id_rowid(table).await?;
    Ok(logical_ids
        .iter()
        .map(|logical_id| {
            id_rowid
                .iter()
                .find(|(id, _)| id == logical_id)
                .map(|(_, row_id)| *row_id)
                .unwrap()
        })
        .collect())
}

#[tokio::test]
async fn fetch_blobs_aligns_across_fragments_with_nulls_and_dups() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = multi_fragment_dedicated_blob_table(&db).await?;
    let row_ids = row_ids_for_logical(&table, &SCRAMBLED_LOGICAL_IDS).await?;

    let bytes = table.fetch_blobs("image", &row_ids).await?;
    assert_eq!(bytes.len(), SCRAMBLED_LOGICAL_IDS.len());
    for (slot, logical_id) in SCRAMBLED_LOGICAL_IDS.iter().enumerate() {
        match logical_id {
            3 | 5 => assert!(bytes.is_null(slot)),
            id => assert_eq!(
                bytes.value(slot),
                dedicated_blob_bytes(*id as u8).as_slice()
            ),
        }
    }
    Ok(())
}

#[tokio::test]
async fn fetch_blob_files_aligns_across_fragments_with_nulls_and_dups() -> Result<()> {
    let tmp = tempdir().unwrap();
    let db = connect(tmp.path().to_str().unwrap()).execute().await?;
    let table = multi_fragment_dedicated_blob_table(&db).await?;
    let row_ids = row_ids_for_logical(&table, &SCRAMBLED_LOGICAL_IDS).await?;

    let files = table.fetch_blob_files("image", &row_ids).await?;
    assert_eq!(files.len(), SCRAMBLED_LOGICAL_IDS.len());
    for (slot, logical_id) in SCRAMBLED_LOGICAL_IDS.iter().enumerate() {
        match logical_id {
            3 | 5 => assert!(files[slot].is_none()),
            id => {
                let payload = files[slot].as_ref().unwrap().read().await?;
                assert_eq!(payload.as_ref(), dedicated_blob_bytes(*id as u8).as_slice());
            }
        }
    }
    Ok(())
}