Mirror of https://github.com/lancedb/lancedb.git, synced 2026-06-30 17:40:40 +00:00

Compare commits: python-v0. ... jack/sopho (111 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | bff911a65d | | |
| | 3a4cdb7aff | | |
| | 142ac835d3 | | |
| | 3f44f93e92 | | |
| | 9dfa43a9de | | |
| | 03e895fa5c | | |
| | c31e53088e | | |
| | 434a5be187 | | |
| | 78aa005093 | | |
| | 6191542cfe | | |
| | 6af3088b91 | | |
| | e73d4618d8 | | |
| | 3d92106394 | | |
| | 5810974b37 | | |
| | 8b38500b07 | | |
| | fd0a3b97d0 | | |
| | b9f33ba1c9 | | |
| | d4f4fef3ba | | |
| | fbe6a5a3fd | | |
| | 127054069a | | |
| | b20931b8f7 | | |
| | 396d68e490 | | |
| | ad37f87387 | | |
| | e93476f0e0 | | |
| | 2b41fce033 | | |
| | 04948fc4f6 | | |
| | ff3c7111b9 | | |
| | 10fecdf051 | | |
| | c9ae93a7fa | | |
| | 05756f0bbf | | |
| | 2a0945443e | | |
| | 39e819b6a7 | | |
| | 70126943ff | | |
| | e01777070d | | |
| | 3878adc6dc | | |
| | 3df3043563 | | |
| | 8a5cd74e48 | | |
| | 448d5ec20f | | |
| | 8718345229 | | |
| | 026fedc286 | | |
| | fe287dc98c | | |
| | 411568b72c | | |
| | ebf8d55ede | | |
| | 0ba70d96c3 | | |
| | 0749532c3c | | |
| | 26481a4b74 | | |
| | 08596f1644 | | |
| | f16da19b78 | | |
| | 41ac32a344 | | |
| | ba1ef34481 | | |
| | 85d870b397 | | |
| | c46d59d2ee | | |
| | 113f187c2d | | |
| | 3b279f5705 | | |
| | e1334954d7 | | |
| | 2f65a233fe | | |
| | e81356089a | | |
| | 4f4cce3f64 | | |
| | c1c19cd133 | | |
| | ce5dadd386 | | |
| | 1f8ebef3cd | | |
| | 217fd8491d | | |
| | 9128dbcd7a | | |
| | 394bb34fa2 | | |
| | b2ae763254 | | |
| | 1bead6960c | | |
| | 0abf641733 | | |
| | 976edeb2ff | | |
| | b46a44f873 | | |
| | f76b075d13 | | |
| | 393ec981bf | | |
| | 6219975222 | | |
| | d9f9a51668 | | |
| | c187ff7712 | | |
| | dfbe5becaa | | |
| | 49815da933 | | |
| | f8caef3aca | | |
| | 40f3e22600 | | |
| | 04480c274a | | |
| | ae7f2cbfe8 | | |
| | 4fb7c92e86 | | |
| | f03abc27e3 | | |
| | 85d9c1ce63 | | |
| | d786e39fdc | | |
| | 8373318e89 | | |
| | 8308cca05e | | |
| | 566b67a634 | | |
| | 9c12fb6437 | | |
| | f260d3bf12 | | |
| | d9018067b3 | | |
| | 53517b3aaa | | |
| | 3e25f584eb | | |
| | 59fbfd4158 | | |
| | f37e698e2f | | |
| | 09b1bbc12a | | |
| | c484b24e51 | | |
| | 3868965413 | | |
| | c13ebc6796 | | |
| | 4b287fd9c4 | | |
| | 64194ea8ad | | |
| | e6c5de1a58 | | |
| | 39a9f3e1e9 | | |
| | 952055d428 | | |
| | 927ba2c948 | | |
| | 415d199c15 | | |
| | a16676e05f | | |
| | 4e44262499 | | |
| | 632375faf1 | | |
| | 9969191d0d | | |
| | 1e7326cd8c | | |
| | 9483b534af | | |
178
.agents/skills/lancedb-column-metadata/SKILL.md
Normal file
@@ -0,0 +1,178 @@
---
name: lancedb-column-metadata
description: Column metadata authoring for LanceDB tables via the REST API. This skill is required for tasks like writing field descriptions, setting tags on columns (field_type, model, project_id, version), classifying columns as embeddings vs labels vs eval metrics, or grouping versioned columns into logical families — because it has the API integration needed to read the schema and persist metadata back. Invoke whenever someone wants to document, annotate, tag, or classify what their table columns ARE. Trigger even without an explicit "LanceDB" mention, as long as the context is column-level documentation or tagging for an ML or vector database table.
metadata:
  short-description: Write column descriptions, tags, and logical groupings to a LanceDB table
---

## Overview

This skill authors column-level metadata for a LanceDB table. It connects to a LanceDB deployment over its REST API, inspects the table schema, generates appropriate metadata, and writes it back.

## Step 0: Establish the connection

Use the `lancedb-connect` skill (invoke it via the Skill tool) to resolve the base URL and auth headers (`x-api-key`, `x-lancedb-database`) for whichever deployment the user is working against — enterprise/self-hosted or a local dev server. Skip it only if the connection details are already established in the conversation.

All examples below use `{base_url}` — substitute the resolved endpoint and include the resolved headers on every request.

## Metadata keys

All metadata uses namespaced keys:

| Key | Purpose | Example value |
|-----|---------|---------------|
| `lancedb:description` | Human-readable explanation of what the column contains | `"CLIP ViT-L/14 image embedding, L2-normalized (768-dim)"` |
| `lancedb:tag:<name>` | Flexible key-value tag; the suffix names the tag category | `lancedb:tag:field_type: "embedding"`, `lancedb:tag:model: "clip"`, `lancedb:tag:project_id: "foo"` |
| `lancedb:logical-column` | Logical group/family this column belongs to | `"clip_features"` |

Tags are open-ended — use whatever key suffix and value make sense given the user's intent. The tag suffix should describe *what is being classified* (e.g., `field_type`, `model`, `project_id`) and the value describes *how*.

## Step 1: Resolve the table identifier

You need:
- **Table name** (required) — e.g., `my_table` or `my_namespace.my_table`
- **Database name** — ask if not provided and not inferable from context; it goes in the `x-lancedb-database` header, never in the URL path

The table identifier in the URL path is typically `table_name` for a top-level table, or `namespace$table_name` if the table lives in a namespace. The API accepts a `delimiter` query parameter to parse compound identifiers (default `$`).
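
As an illustration of the path rules above, here is a minimal Python sketch that assembles a request URL. The helper name `table_url` and the conversion of a dotted name into a `$`-joined identifier are assumptions made for this example; only the `/v1/table/{table_id}/...` shape and the `delimiter` query parameter come from the text.

```python
from urllib.parse import quote, urlencode


def table_url(base_url: str, table_name: str, action: str, delimiter: str = "$") -> str:
    """Build `{base_url}/v1/table/{table_id}/{action}` (hypothetical helper)."""
    # Assumption: a dotted name such as "my_namespace.my_table" maps to the
    # compound identifier "my_namespace$my_table".
    table_id = delimiter.join(table_name.split("."))
    # Keep "$" literal in the path; other reserved characters are escaped.
    url = f"{base_url.rstrip('/')}/v1/table/{quote(table_id, safe='$')}/{action}"
    if delimiter != "$":
        # Only send the query parameter when overriding the default delimiter.
        url += "?" + urlencode({"delimiter": delimiter})
    return url


print(table_url("http://localhost:2333", "my_namespace.my_table", "describe"))
```

The database name is deliberately absent from the URL, since it travels in the `x-lancedb-database` header.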

## Step 2: Describe the table

```http
POST {base_url}/v1/table/{table_id}/describe
Content-Type: application/json

{}
```

The response contains `schema.fields` — an array of field objects:

```json
{
  "schema": {
    "fields": [
      {
        "name": "clip_embedding_v3",
        "type": { "type": "FixedSizeList", "fields": [...], "listSize": 768 },
        "nullable": true,
        "metadata": { "lancedb:description": "..." }
      }
    ]
  }
}
```

Each field has:
- `name` — field name
- `type` — Arrow data type (check `type.type` for the type string)
- `nullable` — boolean
- `metadata` — existing key-value metadata (read this before writing to avoid redundant updates)

For struct/nested fields, recurse into `type.fields` and represent them as dot-notation paths (e.g., `parent.child`).
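
The recursion into nested fields could look like the sketch below. It assumes that struct columns report `type.type` as `"Struct"` (compared case-insensitively) and that list-like types such as `FixedSizeList` carry item definitions rather than named sub-columns in `type.fields`; the type strings `"Struct"` and `"Float32"` and the sample schema are illustrative, not taken from the API.

```python
from typing import Iterator


def iter_field_paths(fields: list, prefix: str = "") -> Iterator[tuple]:
    """Yield (dot_path, field) for every field, descending into structs."""
    for field in fields:
        path = f"{prefix}{field['name']}"
        yield path, field
        field_type = field.get("type") or {}
        # Assumption: only struct types hold addressable child columns.
        if str(field_type.get("type", "")).lower() == "struct":
            yield from iter_field_paths(field_type.get("fields", []), f"{path}.")


# Hypothetical describe-response fragment used only for illustration.
schema_fields = [
    {
        "name": "clip_embedding_v3",
        "type": {
            "type": "FixedSizeList",
            "fields": [{"name": "item", "type": {"type": "Float32"}}],
            "listSize": 768,
        },
    },
    {
        "name": "parent",
        "type": {"type": "Struct", "fields": [{"name": "child", "type": {"type": "Utf8"}}]},
    },
]
paths = [path for path, _ in iter_field_paths(schema_fields)]
print(paths)
```

Yielding the field alongside its path keeps the existing `metadata` at hand, which helps when deciding whether an update would be redundant.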

If the user hasn't specified which columns to update, work with all columns.

## Step 3: Generate metadata

Decide what to generate based on the user's request.

### Writing descriptions (`lancedb:description`)

Base descriptions on:
- The column name and Arrow type (e.g., `FixedSizeList` of floats → likely an embedding)
- User-supplied context (upstream pipeline, sample values, domain knowledge)
- Name patterns: `_embedding`/`_vec`/`_embed` → vector; `_label`/`_class` → label; `_score`/`_eval`/`_metric` → evaluation metric

Be specific and concise. Good: `"Sentence-BERT embedding of the query text (768-dim)."` Not: `"An embedding column."`

### Tagging columns (`lancedb:tag:<name>`)

Choose tag key names that match what the user asked to annotate. Common patterns:

- Semantic field type → `lancedb:tag:field_type: "embedding"` / `"text"` / `"image"` / `"label"` / `"eval"` / `"id"` / `"metadata"`
- Model or source → `lancedb:tag:model: "clip"` / `"bert"` / `"vit"`
- Project affiliation → `lancedb:tag:project_id: "<name>"`
- Version → `lancedb:tag:version: "v3"` (and `lancedb:tag:latest: "true"` for the newest)

Use Arrow type as a hint: `FixedSizeList` + float → embedding; `Utf8`/`LargeUtf8` → text; `Binary` → image or blob.

Multiple tags on the same column are fine — each is a separate key.
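
The name and type hints above can be combined into a first-pass guess for `lancedb:tag:field_type`. The following is a rough sketch under stated assumptions: the function name is hypothetical, name patterns are matched as substrings, the float check on `FixedSizeList` items is skipped, and `Binary` is left undecided because the text allows either image or blob.

```python
from typing import Optional

# Name fragments from the description section, mapped to tag values.
NAME_HINTS = (
    (("_embedding", "_vec", "_embed"), "embedding"),
    (("_label", "_class"), "label"),
    (("_score", "_eval", "_metric"), "eval"),
)
# Arrow type strings (lowercased) mapped to tag values.
TYPE_HINTS = {"fixedsizelist": "embedding", "utf8": "text", "largeutf8": "text"}


def guess_field_type(name: str, arrow_type: str) -> Optional[str]:
    """Return a candidate field_type tag, or None when nothing matches."""
    lowered = name.lower()
    for fragments, tag in NAME_HINTS:
        if any(fragment in lowered for fragment in fragments):
            return tag
    # Fall back to the Arrow type; unknown types yield None.
    return TYPE_HINTS.get(arrow_type.lower())


print(guess_field_type("clip_embedding_v3", "FixedSizeList"))
```

A `None` result is a cue to ask the user or to lean on supplied context rather than to write a tag.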

### Grouping into logical columns (`lancedb:logical-column`)

Look for naming patterns across columns:
- `clip_v1`, `clip_v2`, `clip_v3` → logical column `"clip"`, latest is `v3`
- `text_embed_20240101`, `text_embed_20240601` → logical column `"text_embed"`, latest is the most recent date suffix

Write `lancedb:logical-column` on all members of a group. Mark the newest with `lancedb:tag:latest: "true"` (in addition to its version tag).
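
One way to detect such families is a suffix regex, sketched below. The pattern (`_v<N>` or an eight-digit date at the end of the name), the function name, and the choice to store a date suffix as the version tag are assumptions for illustration; real tables may need other patterns.

```python
import re
from collections import defaultdict

# Assumption: versions appear as a trailing "_v<number>" or "_<YYYYMMDD>".
VERSION_SUFFIX = re.compile(r"^(?P<family>.+)_(?P<version>v\d+|\d{8})$")


def group_logical_columns(names: list) -> dict:
    """Map column name -> metadata for columns that belong to a versioned family."""
    families = defaultdict(list)
    for name in names:
        match = VERSION_SUFFIX.match(name)
        if match:
            families[match["family"]].append((match["version"], name))

    updates = {}
    for family, members in families.items():
        if len(members) < 2:
            continue  # a single column is not treated as a family here
        # Compare numerically so that v10 sorts after v9.
        latest = max(members, key=lambda m: int(m[0].lstrip("v")))[1]
        for version, name in members:
            metadata = {"lancedb:logical-column": family, "lancedb:tag:version": version}
            if name == latest:
                metadata["lancedb:tag:latest"] = "true"
            updates[name] = metadata
    return updates


columns = ["clip_v1", "clip_v2", "clip_v3", "text_embed_20240101", "text_embed_20240601", "id"]
grouped = group_logical_columns(columns)
print(grouped["clip_v3"])
```

The returned mapping has the same shape as the `metadata` objects in the update request, so it can be passed along without reshaping.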

## Step 4: Write the metadata

```http
POST {base_url}/v1/table/{table_id}/update_field_metadata
Content-Type: application/json

{
  "updates": [
    {
      "path": "clip_v3",
      "metadata": {
        "lancedb:description": "CLIP ViT-L/14 image embedding, L2-normalized (768-dim).",
        "lancedb:tag:field_type": "embedding",
        "lancedb:tag:model": "clip",
        "lancedb:tag:version": "v3",
        "lancedb:tag:latest": "true",
        "lancedb:logical-column": "clip"
      },
      "replace": false
    },
    {
      "path": "clip_v2",
      "metadata": {
        "lancedb:description": "CLIP ViT-B/32 image embedding (512-dim), superseded by v3.",
        "lancedb:tag:field_type": "embedding",
        "lancedb:tag:model": "clip",
        "lancedb:tag:version": "v2",
        "lancedb:logical-column": "clip"
      },
      "replace": false
    }
  ]
}
```

Rules:
- **Use `"replace": false`** (merge) by default — this preserves existing metadata the user didn't ask to change
- Use `"replace": true` only if the user explicitly asks to overwrite all existing metadata on a column
- Set a value to `null` to delete a specific key
- Batch all updates in a single request when possible

The response includes `version` (new table version) and `fields` (the updated metadata per field).
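
For callers scripting this step, a minimal sketch using only the Python standard library might look as follows. The helper name `build_update_request` and the sample values are hypothetical; the endpoint path, body shape, and `replace`/`null` semantics are the ones described above. The request is built but not sent, so the example runs without a server.

```python
import json
import urllib.request


def build_update_request(base_url, table_id, updates, headers, replace=False):
    """Return an unsent POST request for `update_field_metadata`."""
    body = {
        "updates": [
            {"path": path, "metadata": metadata, "replace": replace}
            for path, metadata in updates.items()
        ]
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/table/{table_id}/update_field_metadata",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )


request = build_update_request(
    "http://localhost:2333",
    "my_table",
    # A Python None serializes to JSON null, which deletes that key.
    {"clip_v2": {"lancedb:tag:latest": None, "lancedb:tag:version": "v2"}},
    headers={"x-api-key": "<key>", "x-lancedb-database": "<database>"},
)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; the JSON response should then expose `version` and `fields` as noted above.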

## Step 5: Confirm

Report back:
- Which columns were updated and what was written
- The new table version number
- Any columns skipped (e.g., already had up-to-date metadata)

---

## Quick examples

**"Write descriptions for all columns in the `product_embeddings` table"**
1. POST `/v1/table/product_embeddings/describe` → get all fields
2. Generate a `lancedb:description` for each column based on name + type
3. POST `update_field_metadata` with descriptions
4. Report

**"Tag the columns in `model_outputs` with their field type and model"**
1. Describe `model_outputs`
2. For each field, classify by name + Arrow type → set `lancedb:tag:field_type` and `lancedb:tag:model` where applicable
3. POST `update_field_metadata`
4. Report

**"Group the feature columns in `training_features` into logical families and mark the latest version"**
1. Describe the table
2. Find version patterns → assign `lancedb:logical-column` and `lancedb:tag:version`; mark newest with `lancedb:tag:latest: "true"`
3. POST `update_field_metadata`
4. Show the grouping
42
.agents/skills/lancedb-connect/SKILL.md
Normal file
@@ -0,0 +1,42 @@
---
name: lancedb-connect
description: Resolve how to connect to a LanceDB deployment over the REST API — figure out the base URL, API key, and database header. Use this before making any REST requests to a LanceDB table, whenever the endpoint or auth setup is not already known. Also useful on its own when someone asks how to connect, authenticate, or curl their LanceDB instance.
metadata:
  short-description: Resolve the base URL and auth headers for a LanceDB deployment
---

## Goal

Produce two things every REST request needs:

1. **Base URL** — the endpoint
2. **Headers** — `x-api-key`, and usually `x-lancedb-database`

## Resolution steps

1. If the user already gave a URL and API key (or said which environment they're working against), use that.
2. Otherwise, look for credentials already available in the environment:
   - Env vars like `LANCEDB_URI` / `LANCEDB_HOST` / `LANCEDB_API_KEY`
   - A LanceDB endpoint already running or port-forwarded locally (the REST default port is 2333, i.e. `http://localhost:2333`)
3. If you didn't find both pieces, ask the user directly: **"What's your LanceDB endpoint's URL, and what's your API key?"** Also ask which database to use if it isn't obvious. Don't guess or probe further — the user knows their deployment.
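
The environment lookup in step 2 could be sketched as below. This is an illustration under assumptions: the function name is hypothetical, `LANCEDB_URI` is assumed to hold an HTTP(S) endpoint, and `LANCEDB_URI` is checked before `LANCEDB_HOST` for no deeper reason than the order in which the text lists them. A `None` result corresponds to step 3, where the user is asked directly.

```python
import os
from typing import Mapping, Optional


def resolve_connection(env: Mapping[str, str] = os.environ, database: Optional[str] = None):
    """Return (base_url, headers) from the environment, or None if incomplete."""
    base_url = env.get("LANCEDB_URI") or env.get("LANCEDB_HOST")
    api_key = env.get("LANCEDB_API_KEY")
    if not base_url or not api_key:
        # Both pieces are required; do not guess the missing one.
        return None
    headers = {"x-api-key": api_key}
    if database:
        headers["x-lancedb-database"] = database
    return base_url.rstrip("/"), headers


resolved = resolve_connection(
    {"LANCEDB_HOST": "http://localhost:2333/", "LANCEDB_API_KEY": "<key>"},
    database="<database>",
)
print(resolved)
```

Accepting the environment as a parameter keeps the lookup easy to exercise with a plain dictionary.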

## Validating the connection

Make a cheap authenticated request and check the status:

```bash
curl -s -w "\n%{http_code}" "{base_url}/v1/table/?limit=1" \
  -H "x-api-key: <key>" \
  -H "x-lancedb-database: <database>"
```

- `200` — connection, key, and database header all good
- `401` — API key missing or wrong
- `400` mentioning a database header — this deployment expects `x-lancedb-database`
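
If the check is scripted, the curl output (body, newline, status code) can be split and mapped to the cases above. The sketch below is an assumption-laden illustration: both helper names are hypothetical, the sample body is invented, and matching on the word "database" in a `400` body is a heuristic, not documented server behavior.

```python
def split_curl_output(output: str):
    """Split `curl -w "\\n%{http_code}"` output into (body, status)."""
    body, _, code = output.rpartition("\n")
    return body, int(code)


def diagnose(status: int, body: str = "") -> str:
    """Translate a status code into the outcomes listed above."""
    if status == 200:
        return "ok"
    if status == 401:
        return "API key missing or wrong"
    if status == 400 and "database" in body.lower():
        return "deployment expects the x-lancedb-database header"
    return f"unexpected status {status}"


# Hypothetical response body, used only to exercise the helpers.
body, status = split_curl_output('{"tables": []}\n200')
print(diagnose(status, body))
```

Any status outside the three documented cases is reported as unexpected instead of being interpreted.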

## Non-REST equivalents

If the caller would rather use the SDK or CLI than raw REST, the same credentials work:

- Python SDK: `lancedb.connect("db://<database>", api_key="<key>", host_override="<base_url>")`
- `lancedb` CLI: a `[profiles.<name>]` entry in `~/.lancedb/config.toml` with `http_server_url`, `api_key`, `database`
@@ -1,5 +1,5 @@
[tool.bumpversion]
current_version = "0.30.1-beta.0"
current_version = "0.31.0-beta.4"
parse = """(?x)
(?P<major>0|[1-9]\\d*)\\.
(?P<minor>0|[1-9]\\d*)\\.
@@ -23,6 +23,8 @@ allow_dirty = true
commit = true
message = "Bump version: {current_version} → {new_version}"
commit_args = ""
# bump-my-version >=1.4.0 rejects pre_commit_hooks containing shell syntax unless opted in.
allow_shell_hooks = true

# Java maven files
pre_commit_hooks = [
11
.github/dependabot.yml
vendored
@@ -21,3 +21,14 @@ updates:
        update-types:
          - minor
          - patch

  - package-ecosystem: pip
    directory: /python
    schedule:
      interval: weekly
    # Only update uv.lock, never widen version requirements in pyproject.toml.
    versioning-strategy: lockfile-only
    groups:
      python-deps:
        patterns:
          - "*"
1
.gitignore
vendored
@@ -27,6 +27,7 @@ python/dist
*.so
*.dylib
*.dll
*.pdb

## Javascript
*.node
783
Cargo.lock
generated
File diff suppressed because it is too large
28
Cargo.toml
@@ -13,20 +13,20 @@ categories = ["database-implementations"]
rust-version = "1.91.0"

[workspace.dependencies]
lance = { "version" = "=7.2.0-beta.3", default-features = false, "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=7.2.0-beta.3", default-features = false, "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=7.2.0-beta.3", default-features = false, "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=7.2.0-beta.3", "tag" = "v7.2.0-beta.3", "git" = "https://github.com/lance-format/lance.git" }
lance = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-core = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-datagen = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-file = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-io = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-index = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-linalg = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-namespace = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-namespace-impls = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-table = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-testing = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-datafusion = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-encoding = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-arrow = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "58.0.0", optional = false }
26
REVIEW.md
Normal file
@@ -0,0 +1,26 @@
# Code review guidelines

Repo-specific guidance for automated PR reviews.

## Cross-SDK parity

LanceDB exposes the same core (`rust/lancedb`) through Python, TypeScript (`nodejs`),
and Java bindings. Behavioral drift between SDKs is a recurring problem, so watch for
parity gaps when reviewing — but only flag real ones:

* If the change adds or modifies user-facing API or behavior in the shared core
  (`rust/lancedb`), check whether each binding that should expose it (`python`,
  `nodejs`) does. A core change with no corresponding binding update is worth a note.
* If the change adds or modifies a public API in one SDK but not the other, open the
  sibling SDK's corresponding module and state whether an equivalent exists. If not,
  note it as a possible parity gap and suggest a follow-up issue.
* For bug fixes, first read the sibling SDK's analogous code path to check whether the
  same bug exists there. Only raise parity if it actually does. Do not ask to "port" a
  fix for a bug that only ever existed in one binding.
* Stay silent on internal-only refactors, tests, docs, and changes with no cross-SDK
  surface.
* Parity expectations apply to the Python and TypeScript (`nodejs`) SDKs. Java currently
  implements only the remote table, not the local/embedded backend, so it is expected to
  be partial — do not flag Java for missing local-only functionality.
* Keep parity feedback to a short, clearly-labeled note (e.g. "Possible SDK parity
  gap: …"). It is advisory, not a merge blocker.
14
deny.toml
@@ -113,6 +113,12 @@ ignore = [
    # rand from a custom logger; upgrade once all pinned chains accept 0.8.6+.
    # https://rustsec.org/advisories/RUSTSEC-2026-0097
    { id = "RUSTSEC-2026-0097", reason = "transitive rand 0.8.5; LanceDB does not call ThreadRng from custom logging" },

    # pyo3 advisories in the Python bindings; tracked pending a patched pyo3 release.
    # https://rustsec.org/advisories/RUSTSEC-2026-0176
    # https://rustsec.org/advisories/RUSTSEC-2026-0177
    { id = "RUSTSEC-2026-0176", reason = "pyo3 in Python bindings; awaiting patched pyo3 release" },
    { id = "RUSTSEC-2026-0177", reason = "pyo3 in Python bindings; awaiting patched pyo3 release" },
]

# ---------------------------------------------------------------------------
@@ -147,6 +153,14 @@ allow = [
    "CDLA-Permissive-2.0",
]
confidence-threshold = 0.8
# Per-crate license exceptions: allow a license for a specific crate only,
# rather than globally via the `allow` list above.
exceptions = [
    # CDDL-1.0 (copyleft) is pulled in only as a dev/profiling dependency via
    # `inferno` -> `pprof` -> `lance-testing`; it is a test dependency that we
    # do not distribute, so scope the allowance to `inferno` alone.
    { allow = ["CDDL-1.0"], crate = "inferno" },
]
# Crates whose license cannot be determined from Cargo metadata but whose
# license we've manually confirmed from upstream. Keep this list minimal.
[[licenses.clarify]]
@@ -14,7 +14,7 @@ Add the following dependency to your `pom.xml`:
<dependency>
    <groupId>com.lancedb</groupId>
    <artifactId>lancedb-core</artifactId>
    <version>0.30.1-beta.0</version>
    <version>0.31.0-beta.4</version>
</dependency>
```

43
docs/src/js/classes/BranchContents.md
Normal file
@@ -0,0 +1,43 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / BranchContents

# Class: BranchContents

## Constructors

### new BranchContents()

```ts
new BranchContents(): BranchContents
```

#### Returns

[`BranchContents`](BranchContents.md)

## Properties

### manifestSize

```ts
manifestSize: number;
```

***

### parentBranch?

```ts
optional parentBranch: string;
```

***

### parentVersion

```ts
parentVersion: number;
```
96
docs/src/js/classes/Branches.md
Normal file
@@ -0,0 +1,96 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / Branches

# Class: Branches

Branch manager for a [Table](Table.md).

Unlike tags, `create` and `checkout` return a new [Table](Table.md) handle scoped
to the branch; writes on it do not affect `main`.

## Methods

### checkout()

```ts
checkout(name, version?): Promise<Table>
```

Check out an existing branch and return a handle scoped to it.

With `version` set, the returned handle is pinned to that version of the
branch (a read-only, detached view); otherwise it tracks the branch's
latest and stays writable.

#### Parameters

* **name**: `string`

* **version?**: `number`

#### Returns

`Promise`<[`Table`](Table.md)>

***

### create()

```ts
create(
   name,
   fromRef?,
   fromVersion?): Promise<Table>
```

Create a branch and return a handle scoped to it.

#### Parameters

* **name**: `string`
Name of the new branch.

* **fromRef?**: `string`
Source branch to fork from. Defaults to `main`.

* **fromVersion?**: `number`
A specific version on `fromRef`. Defaults to latest.

#### Returns

`Promise`<[`Table`](Table.md)>

***

### delete()

```ts
delete(name): Promise<void>
```

Delete a branch.

#### Parameters

* **name**: `string`

#### Returns

`Promise`<`void`>

***

### list()

```ts
list(): Promise<Record<string, BranchContents>>
```

List all branches, mapping name to branch metadata.

#### Returns

`Promise`<`Record`<`string`, [`BranchContents`](BranchContents.md)>>
@@ -57,6 +57,24 @@ block size may be added in the future.

***

### fm()

```ts
static fm(): Index
```

Create an FM-Index.

An FM-Index is a scalar index on string or binary columns that accelerates
substring search, i.e. `contains(col, 'needle')`. Unlike the tokenized
full-text-search index, it matches arbitrary substrings of the raw bytes.

#### Returns

[`Index`](Index.md)

***

### fts()

```ts
@@ -110,6 +110,23 @@ containing the new version number of the table after altering the columns.

***

### branches()

```ts
abstract branches(): Promise<Branches>
```

Get the branch manager for this table.

Branches are isolated, writable lines of history forked from another
branch (or version). Writes on a branch do not affect `main`.

#### Returns

`Promise`<[`Branches`](Branches.md)>

***

### checkout()

```ts
@@ -278,6 +295,23 @@ await table.createIndex("my_float_col");

***

### currentBranch()

```ts
abstract currentBranch(): null | string
```

The branch this table handle is scoped to, or `null` for the main branch.

A handle returned by [Branches.create](Branches.md#create) or [Branches.checkout](Branches.md#checkout)
reports the branch it targets; a handle opened normally reports `null`.

#### Returns

`null` \| `string`

***

### delete()

```ts
29
docs/src/js/enumerations/OAuthFlowType.md
Normal file
@@ -0,0 +1,29 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / OAuthFlowType

# Enumeration: OAuthFlowType

OAuth authentication flow types.

## Enumeration Members

### AzureManagedIdentity

```ts
AzureManagedIdentity: "azure_managed_identity";
```

Azure Managed Identity via IMDS.

***

### ClientCredentials

```ts
ClientCredentials: "client_credentials";
```

Client Credentials grant (service-to-service / M2M).
@@ -12,6 +12,7 @@
## Enumerations

- [FullTextQueryType](enumerations/FullTextQueryType.md)
- [OAuthFlowType](enumerations/OAuthFlowType.md)
- [Occur](enumerations/Occur.md)
- [Operator](enumerations/Operator.md)

@@ -19,6 +20,8 @@

- [BooleanQuery](classes/BooleanQuery.md)
- [BoostQuery](classes/BoostQuery.md)
- [BranchContents](classes/BranchContents.md)
- [Branches](classes/Branches.md)
- [Connection](classes/Connection.md)
- [HeaderProvider](classes/HeaderProvider.md)
- [Index](classes/Index.md)
@@ -83,6 +86,8 @@
- [ListNamespacesResponse](interfaces/ListNamespacesResponse.md)
- [LsmWriteSpec](interfaces/LsmWriteSpec.md)
- [MergeResult](interfaces/MergeResult.md)
- [NativeOAuthConfig](interfaces/NativeOAuthConfig.md)
- [OAuthConfig](interfaces/OAuthConfig.md)
- [OpenTableOptions](interfaces/OpenTableOptions.md)
- [OptimizeOptions](interfaces/OptimizeOptions.md)
- [OptimizeStats](interfaces/OptimizeStats.md)
@@ -64,6 +64,19 @@ client used by manifest-enabled native connections.

***

### oauthConfig?

```ts
optional oauthConfig: NativeOAuthConfig;
```

(For LanceDB cloud only): OAuth configuration for IdP-based
authentication (e.g., Azure Entra ID). When set, token acquisition
and refresh are handled entirely in Rust. TypeScript users should pass
the public `OAuthConfig` type exported from `@lancedb/lancedb`.

***

### readConsistencyInterval?

```ts
@@ -23,6 +23,31 @@ be more columns to represent composite indices.
|
||||
|
||||
***
|
||||
|
||||
### createdAt?
|
||||
|
||||
```ts
|
||||
optional createdAt: Date;
|
||||
```
|
||||
|
||||
When the index was created.
|
||||
|
||||
`undefined` for remote tables or indices created before timestamps were tracked.
|
||||
|
||||
***
|
||||
|
||||
### indexDetails?
|
||||
|
||||
```ts
|
||||
optional indexDetails: any;
|
||||
```
|
||||
|
||||
Index-type-specific details parsed as a JavaScript object.
|
||||
|
||||
Falls back to a raw string if JSON parsing fails. `undefined` for
|
||||
remote tables or when details are unavailable.

***

### indexType

```ts
@@ -33,6 +58,30 @@ The type of the index

***

### indexUuid?

```ts
optional indexUuid: string;
```

The UUID of the first segment of the index.

`undefined` for remote tables, which do not yet surface this.

***

### indexVersion?

```ts
optional indexVersion: number;
```

The on-disk index format version.

`undefined` for remote tables.

***

### name

```ts
@@ -40,3 +89,63 @@ name: string;
```

The name of the index

***

### numIndexedRows?

```ts
optional numIndexedRows: number;
```

The number of rows indexed, across all segments.

`undefined` for remote tables.

***

### numSegments?

```ts
optional numSegments: number;
```

The number of segments that make up the index.

`undefined` for remote tables.

***

### numUnindexedRows?

```ts
optional numUnindexedRows: number;
```

The number of rows not yet covered by this index.

`undefined` for remote tables.

***

### sizeBytes?

```ts
optional sizeBytes: number;
```

The total size in bytes of all index files across all segments.

`undefined` for remote tables or indices without size tracking.

***

### typeUrl?

```ts
optional typeUrl: string;
```

The protobuf type URL, a precise type identifier for the index.

`undefined` for remote tables.
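
Since most of the new metadata fields are `undefined` on remote tables, consumers need to guard every read. The sketch below shows one way to do that. `IndexConfigLike`, `indexCoverage`, and `describeIndex` are hypothetical names, and the interface is a local stand-in for a subset of the documented fields rather than the type exported by `@lancedb/lancedb`.

```typescript
// Local stand-in for a subset of the documented IndexConfig fields (hypothetical).
interface IndexConfigLike {
  name: string;
  indexType: string;
  columns: string[];
  createdAt?: Date;
  numIndexedRows?: number;
  numUnindexedRows?: number;
  sizeBytes?: number;
}

// Fraction of rows covered by the index, or undefined when the row counts
// are not available (for example on remote tables).
function indexCoverage(idx: IndexConfigLike): number | undefined {
  if (idx.numIndexedRows === undefined || idx.numUnindexedRows === undefined) {
    return undefined;
  }
  const total = idx.numIndexedRows + idx.numUnindexedRows;
  return total === 0 ? 1 : idx.numIndexedRows / total;
}

// One-line summary that includes only the fields that are actually present.
function describeIndex(idx: IndexConfigLike): string {
  const parts = [`${idx.name} (${idx.indexType}) on ${idx.columns.join(", ")}`];
  const coverage = indexCoverage(idx);
  if (coverage !== undefined) {
    parts.push(`coverage ${(coverage * 100).toFixed(1)}%`);
  }
  if (idx.sizeBytes !== undefined) {
    parts.push(`${idx.sizeBytes} bytes`);
  }
  if (idx.createdAt !== undefined) {
    parts.push(`created ${idx.createdAt.toISOString()}`);
  }
  return parts.join("; ");
}
```

Returning `undefined` from `indexCoverage` rather than `0` keeps "no data" distinguishable from "nothing indexed yet".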

@@ -30,17 +30,6 @@ The type of the index

 ***

-### loss?
-
-```ts
-optional loss: number;
-```
-
-The KMeans loss value of the index,
-it is only present for vector indices.
-
-***
-
 ### numIndexedRows

 ```ts

88
docs/src/js/interfaces/NativeOAuthConfig.md
Normal file
@@ -0,0 +1,88 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / NativeOAuthConfig

# Interface: NativeOAuthConfig

OAuth configuration for LanceDB authentication.

This is the generated napi-rs binding shape. TypeScript users should prefer
the public `OAuthConfig` type exported from `@lancedb/lancedb`.

All token acquisition and refresh is handled in the Rust layer.

## Properties

### clientId

```ts
clientId: string;
```

Application / Client ID.

***

### clientSecret?

```ts
optional clientSecret: string;
```

Client secret (required for client_credentials).

***

### flow?

```ts
optional flow: string;
```

Authentication flow: "client_credentials" or "azure_managed_identity"

***

### issuerUrl

```ts
issuerUrl: string;
```

OIDC issuer URL or OAuth authority URL.
For Azure: `https://login.microsoftonline.com/{tenant_id}/v2.0`

***

### managedIdentityClientId?

```ts
optional managedIdentityClientId: string;
```

Client ID for user-assigned managed identity (azure_managed_identity).

***

### refreshBufferSecs?

```ts
optional refreshBufferSecs: number;
```

Seconds before expiry to trigger proactive refresh (default: 300).
Keep this well below the token TTL; if it is greater than or equal to
the TTL, each request refreshes the token.
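
The warning about `refreshBufferSecs` follows from simple arithmetic, sketched below. `shouldRefresh` is a hypothetical helper written for illustration; the real decision is made in the Rust layer and may differ in detail.

```typescript
// Hypothetical helper: a token is refreshed proactively once "now" enters
// the buffer window that ends at the token's expiry time.
function shouldRefresh(
  nowSecs: number,
  expiresAtSecs: number,
  refreshBufferSecs: number = 300, // the documented default
): boolean {
  return nowSecs >= expiresAtSecs - refreshBufferSecs;
}

// A token with a 3600-second TTL, issued at t = 0.
const ttlSecs = 3600;
const justIssued = shouldRefresh(0, ttlSecs); // buffer 300 is well below the TTL
const nearExpiry = shouldRefresh(3300, ttlSecs); // inside the last 300 seconds
const bufferEqualsTtl = shouldRefresh(0, ttlSecs, ttlSecs); // buffer equals the TTL
console.log(justIssued, nearExpiry, bufferEqualsTtl);
```

With the default buffer the freshly issued token is left alone and only refreshed near expiry. When the buffer equals the TTL, the check is already true at issue time, which is why each request would trigger a refresh.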

***

### scopes

```ts
scopes: string[];
```

OAuth scopes to request. For Azure managed identity, exactly one scope
or resource is required. For example: `["api://{app_id}/.default"]`
111
docs/src/js/interfaces/OAuthConfig.md
Normal file
@@ -0,0 +1,111 @@
[**@lancedb/lancedb**](../README.md) • **Docs**

***

[@lancedb/lancedb](../globals.md) / OAuthConfig

# Interface: OAuthConfig

OAuth configuration for LanceDB authentication.

This is the public TypeScript OAuth configuration type. The generated
`NativeOAuthConfig` type has the same runtime shape but is an implementation
detail of the napi-rs binding.

All token acquisition and refresh is handled in the Rust layer.
This config is passed through to Rust via napi-rs.

## Examples

```typescript
const config: OAuthConfig = {
  issuerUrl: "https://login.microsoftonline.com/{tenant}/v2.0",
  clientId: "app-id",
  clientSecret: "secret",
  scopes: ["api://lancedb-api/.default"],
};
```

```typescript
const config: OAuthConfig = {
  issuerUrl: "https://login.microsoftonline.com/{tenant}/v2.0",
  clientId: "app-id",
  scopes: ["api://lancedb-api/.default"],
  flow: OAuthFlowType.AzureManagedIdentity,
};
```

## Properties

### clientId

```ts
clientId: string;
```

Application / Client ID.

***

### clientSecret?

```ts
optional clientSecret: string;
```

Client secret (required for ClientCredentials).

***

### flow?

```ts
optional flow: OAuthFlowType;
```

Authentication flow (default: ClientCredentials).

***

### issuerUrl

```ts
issuerUrl: string;
```

OIDC issuer URL or OAuth authority URL.
For Azure: `https://login.microsoftonline.com/{tenant_id}/v2.0`

***

### managedIdentityClientId?

```ts
optional managedIdentityClientId: string;
```

Client ID for user-assigned managed identity (AzureManagedIdentity).

***

### refreshBufferSecs?

```ts
optional refreshBufferSecs: number;
```

Seconds before expiry to trigger proactive refresh (default: 300).
Keep this well below the token TTL; if it is greater than or equal to
the TTL, each request refreshes the token.

***

### scopes

```ts
scopes: string[];
```

OAuth scopes to request.
For Azure managed identity, exactly one scope or resource is required.
For example: `["api://{app_id}/.default"]`
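
The per-flow requirements stated in the property docs (a client secret for ClientCredentials, exactly one scope for Azure managed identity) can be checked before connecting. The sketch below is an assumption-laden illustration: `checkOAuthConfig` is a hypothetical helper, the enum and interface are local mirrors of the documented shapes so the block stands alone, and the validation actually performed in the Rust layer may be stricter or different.

```typescript
// Local mirrors of the documented types, so the sketch is self-contained.
enum OAuthFlowType {
  ClientCredentials = "client_credentials",
  AzureManagedIdentity = "azure_managed_identity",
}

interface OAuthConfig {
  issuerUrl: string;
  clientId: string;
  scopes: string[];
  flow?: OAuthFlowType;
  clientSecret?: string;
  managedIdentityClientId?: string;
  refreshBufferSecs?: number;
}

// Hypothetical pre-flight check; returns a list of problems (empty when none).
function checkOAuthConfig(config: OAuthConfig): string[] {
  const problems: string[] = [];
  // The documented default flow is ClientCredentials.
  const flow = config.flow ?? OAuthFlowType.ClientCredentials;
  if (flow === OAuthFlowType.ClientCredentials && !config.clientSecret) {
    problems.push("clientSecret is required for ClientCredentials");
  }
  if (flow === OAuthFlowType.AzureManagedIdentity && config.scopes.length !== 1) {
    problems.push("AzureManagedIdentity requires exactly one scope");
  }
  return problems;
}
```

Returning a list instead of throwing on the first problem lets a caller report every misconfiguration at once.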

@@ -8,6 +8,18 @@

## Properties

### branch?

```ts
optional branch: string;
```

Open the table scoped to this branch instead of the default branch.

Reads and writes on the returned table operate in the branch's context.

***

### ~~indexCacheSize?~~

```ts
@@ -43,3 +55,17 @@ Options already set on the connection will be inherited by the table,
but can be overridden here.

The available options are described at https://docs.lancedb.com/storage/

***

### version?

```ts
optional version: number;
```

Open the table pinned to this version, producing a read-only view.

Composes with [OpenTableOptions.branch](OpenTableOptions.md#branch): when both are set, opens
that branch at the version; otherwise opens `main` at the version. Call
`checkoutLatest` to return to a writable state.
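
How `branch` and `version` compose can be summarised as a small pure function. This is a sketch based on the rules stated here and on the `LocalConnection.openTable` change in this comparison, where `"main"` is treated as no branch. `resolveOpenTarget` and `OpenTarget` are hypothetical names and are not part of the library.

```typescript
// Hypothetical description of what a given option combination opens.
interface OpenTarget {
  branch: string | null; // null means the default branch ("main")
  version: number | null; // null means the latest version of that branch
  readOnly: boolean; // a pinned version yields a read-only view
}

function resolveOpenTarget(options?: {
  branch?: string;
  version?: number;
}): OpenTarget {
  // "main" is the default branch, so it is treated the same as no branch.
  const branch =
    options?.branch != null && options.branch !== "main"
      ? options.branch
      : null;
  const version = options?.version ?? null;
  return { branch, version, readOnly: version !== null };
}
```

Under these assumptions, `{ branch: "exp", version: 2 }` resolves to version 2 of `exp`, while `{ version: 2 }` and `{ branch: "main", version: 2 }` both resolve to version 2 of `main`.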

@@ -8,7 +8,7 @@
   <parent>
     <groupId>com.lancedb</groupId>
     <artifactId>lancedb-parent</artifactId>
-    <version>0.30.1-beta.0</version>
+    <version>0.31.0-beta.4</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

@@ -6,7 +6,7 @@

   <groupId>com.lancedb</groupId>
   <artifactId>lancedb-parent</artifactId>
-  <version>0.30.1-beta.0</version>
+  <version>0.31.0-beta.4</version>
   <packaging>pom</packaging>
   <name>${project.artifactId}</name>
   <description>LanceDB Java SDK Parent POM</description>
@@ -28,7 +28,7 @@
   <properties>
     <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
     <arrow.version>15.0.0</arrow.version>
-    <lance-core.version>7.2.0-beta.1</lance-core.version>
+    <lance-core.version>9.0.0-beta.10</lance-core.version>
     <spotless.skip>false</spotless.skip>
     <spotless.version>2.30.0</spotless.version>
     <spotless.java.googlejavaformat.version>1.7</spotless.java.googlejavaformat.version>

@@ -1,7 +1,7 @@
 [package]
 name = "lancedb-nodejs"
 edition.workspace = true
-version = "0.30.1-beta.0"
+version = "0.31.0-beta.4"
 publish = false
 license.workspace = true
 description.workspace = true
@@ -25,8 +25,12 @@ lancedb = { path = "../rust/lancedb", default-features = false }
 lance-namespace.workspace = true
 napi = { version = "3.8.3", default-features = false, features = [
     "napi9",
-    "async"
+    "async",
+    "chrono_date",
+    "serde-json",
 ] }
+chrono = { version = "0.4", default-features = false, features = ["clock"] }
+serde_json = "1"
 napi-derive = "3.5.2"
 # Prevent dynamic linking of lzma, which comes from datafusion
 lzma-sys = { version = "0.1", features = ["static"] }

@@ -191,6 +191,40 @@ describe("remote connection", () => {
|
||||
);
|
||||
});
|
||||
|
||||
it("supports version time-travel and branches on remote", async () => {
|
||||
await withMockDatabase(
|
||||
(req, res) => {
|
||||
const body = req.url?.includes("/branches/list")
|
||||
? JSON.stringify({
|
||||
branches: {
|
||||
exp: { parentVersion: 1, createAt: 1, manifestSize: 1 },
|
||||
},
|
||||
})
|
||||
: JSON.stringify({ name: "t", version: 2, schema: { fields: [] } });
|
||||
res.writeHead(200, { "Content-Type": "application/json" }).end(body);
|
||||
},
|
||||
async (db) => {
|
||||
// version-only (and "main" + version) time-travel the main chain
|
||||
const v2 = await db.openTable("t", undefined, { version: 2 });
|
||||
expect(v2.currentBranch()).toBeNull();
|
||||
const mainV2 = await db.openTable("t", undefined, {
|
||||
branch: "main",
|
||||
version: 2,
|
||||
});
|
||||
expect(mainV2.currentBranch()).toBeNull();
|
||||
|
||||
// a non-main branch opens a handle scoped to that branch
|
||||
const exp = await db.openTable("t", undefined, { branch: "exp" });
|
||||
expect(exp.currentBranch()).toBe("exp");
|
||||
const expV2 = await db.openTable("t", undefined, {
|
||||
branch: "exp",
|
||||
version: 2,
|
||||
});
|
||||
expect(expV2.currentBranch()).toBe("exp");
|
||||
},
|
||||
);
|
||||
});
|
||||
|
||||
describe("TlsConfig", () => {
|
||||
it("should create TlsConfig with all fields", () => {
|
||||
const tlsConfig: TlsConfig = {
|
||||
|
||||
@@ -85,6 +85,140 @@ describe.each([arrow15, arrow16, arrow17, arrow18])(
|
||||
await expect(table.countRows()).resolves.toBe(3);
|
||||
});
|
||||
|
||||
it("should support branches", async () => {
|
||||
await table.add([{ id: 1 }]);
|
||||
expect(await table.countRows()).toBe(1);
|
||||
|
||||
expect(table.currentBranch()).toBeNull();
|
||||
|
||||
// fork an isolated, writable branch from main
|
||||
const branch = await (await table.branches()).create("exp");
|
||||
expect(branch.currentBranch()).toBe("exp");
|
||||
expect(await branch.countRows()).toBe(1);
|
||||
await branch.add([{ id: 2 }]);
|
||||
expect(await branch.countRows()).toBe(2);
|
||||
// main is untouched by branch writes
|
||||
expect(await table.countRows()).toBe(1);
|
||||
|
||||
// listed, with main (null) as the parent
|
||||
const list = await (await table.branches()).list();
|
||||
expect(Object.keys(list)).toContain("exp");
|
||||
expect(list["exp"].parentBranch).toBeNull();
|
||||
|
||||
// fromRef="main" is equivalent to the default
|
||||
await (await table.branches()).create("exp2", "main");
|
||||
const list2 = await (await table.branches()).list();
|
||||
expect(list2["exp2"].parentBranch).toBeNull();
|
||||
|
||||
// checkout returns a handle scoped to the branch's latest
|
||||
const checkedOut = await (await table.branches()).checkout("exp");
|
||||
expect(checkedOut.currentBranch()).toBe("exp");
|
||||
expect(await checkedOut.countRows()).toBe(2);
|
||||
|
||||
// delete removes it
|
||||
await (await table.branches()).delete("exp");
|
||||
await (await table.branches()).delete("exp2");
|
||||
const after = await (await table.branches()).list();
|
||||
expect(Object.keys(after)).not.toContain("exp");
|
||||
});
|
||||
|
||||
it("should open a branch via open_table", async () => {
|
||||
const db = await connect(tmpDir.name);
|
||||
await table.add([{ id: 1 }]);
|
||||
const branch = await (await table.branches()).create("exp");
|
||||
await branch.add([{ id: 2 }]);
|
||||
|
||||
// open_table(..., { branch }) returns a handle scoped to the branch
|
||||
const opened = await db.openTable("some_table", undefined, {
|
||||
branch: "exp",
|
||||
});
|
||||
expect(await opened.countRows()).toBe(2);
|
||||
// opening without branch still tracks main
|
||||
expect(await (await db.openTable("some_table")).countRows()).toBe(1);
|
||||
});
|
||||
|
||||
it("should open a branch at a version isolated from main and HEAD", async () => {
|
||||
const db = await connect(tmpDir.name);
|
||||
// main: a single fork-point row
|
||||
const t = await db.createTable("bv_table", [{ id: 0 }]);
|
||||
const mainV1 = await t.version();
|
||||
|
||||
// fork "exp", then advance exp AND main independently past the fork so
|
||||
// they diverge while sharing version numbers
|
||||
const exp = await (await t.branches()).create("exp");
|
||||
await exp.add([{ id: 1 }]); // exp: {0, 1}
|
||||
const expV2 = await exp.version();
|
||||
await exp.add([{ id: 2 }]); // exp HEAD: {0, 1, 2}
|
||||
await t.add([{ id: 100 }, { id: 101 }, { id: 102 }]); // main HEAD: {0,100,101,102}
|
||||
expect(await t.version()).toBe(expV2);
|
||||
|
||||
// open exp at the shared version: the data must be exp's, not main's.
|
||||
// count alone cannot prove this (main@v2 also exists), so assert
|
||||
// provenance by content.
|
||||
const pinned = await db.openTable("bv_table", undefined, {
|
||||
branch: "exp",
|
||||
version: expV2,
|
||||
});
|
||||
expect(await pinned.countRows()).toBe(2); // not exp HEAD (3), not main@v2 (4)
|
||||
expect(await pinned.countRows("id = 1")).toBe(1); // exp's post-fork row
|
||||
expect(await pinned.countRows("id = 100")).toBe(0); // main's rows invisible
|
||||
|
||||
// the same coordinate is reachable directly via branches().checkout(name, version)
|
||||
const pinnedDirect = await (await t.branches()).checkout("exp", expV2);
|
||||
expect(await pinnedDirect.countRows()).toBe(2);
|
||||
|
||||
// the HEADs are unaffected
|
||||
expect(
|
||||
await (
|
||||
await db.openTable("bv_table", undefined, { branch: "exp" })
|
||||
).countRows(),
|
||||
).toBe(3);
|
||||
expect(await (await db.openTable("bv_table")).countRows()).toBe(4);
|
||||
|
||||
// version-only (no branch) time-travels main itself: its fork-point
|
||||
// version holds only main's first row, and the shared version number
|
||||
// resolves to main's data, not the branch's ("opens main at the version")
|
||||
const oldMain = await db.openTable("bv_table", undefined, {
|
||||
version: mainV1,
|
||||
});
|
||||
expect(await oldMain.countRows()).toBe(1);
|
||||
const sharedOnMain = await db.openTable("bv_table", undefined, {
|
||||
version: expV2,
|
||||
});
|
||||
expect(await sharedOnMain.countRows()).toBe(4); // main@v2, not exp@v2 (2)
|
||||
|
||||
// detached head: writing to a pinned version is rejected
|
||||
await expect(pinned.add([{ id: 9 }])).rejects.toThrow(
|
||||
/cannot be modified/,
|
||||
);
|
||||
|
||||
// a nonexistent version is rejected -- on main, and on a branch (a
|
||||
// distinct resolution path, on the branch's manifests)
|
||||
await expect(
|
||||
db.openTable("bv_table", undefined, { version: 9999 }),
|
||||
).rejects.toThrow();
|
||||
await expect(
|
||||
db.openTable("bv_table", undefined, { branch: "exp", version: 9999 }),
|
||||
).rejects.toThrow();
|
||||
|
||||
// checkoutLatest re-attaches the pinned handle to the BRANCH's HEAD
|
||||
// (writable again), not main's HEAD (4), and not staying pinned (2)
|
||||
await pinned.checkoutLatest();
|
||||
expect(await pinned.countRows()).toBe(3); // exp HEAD
|
||||
await pinned.add([{ id: 3 }]);
|
||||
expect(await pinned.countRows()).toBe(4); // writable again
|
||||
});
|
||||
|
||||
it("rejects invalid branch inputs", async () => {
|
||||
const branches = await table.branches();
|
||||
await expect(branches.create("")).rejects.toThrow("non-empty");
|
||||
await expect(branches.checkout("")).rejects.toThrow("non-empty");
|
||||
await expect(branches.delete("")).rejects.toThrow("non-empty");
|
||||
await expect(branches.create("bad", "main", -1)).rejects.toThrow(
|
||||
"non-negative",
|
||||
);
|
||||
});
|
||||
|
||||
it("should show table stats", async () => {
|
||||
await table.add([{ id: 1 }, { id: 2 }]);
|
||||
await table.add([{ id: 1 }]);
|
||||
@@ -715,13 +849,15 @@ describe("When creating an index", () => {
|
||||
expect(fs.readdirSync(indexDir)).toHaveLength(1);
|
||||
const indices = await tbl.listIndices();
|
||||
expect(indices.length).toBe(1);
|
||||
-    expect(indices[0]).toEqual({
-      name: "vec_idx",
-      indexType: "IvfPq",
-      columns: ["vec"],
-    });
+    expect(indices[0]).toEqual(
+      expect.objectContaining({
+        name: "vec_idx",
+        indexType: "IvfPq",
+        columns: ["vec"],
+      }),
+    );
     const stats = await tbl.indexStats("vec_idx");
-    expect(stats?.loss).toBeDefined();
+    expect(stats).toBeDefined();
|
||||
|
||||
// Search without specifying the column
|
||||
let rst = await tbl
|
||||
@@ -781,10 +917,22 @@ describe("When creating an index", () => {
|
||||
expect(indices2.length).toBe(0);
|
||||
});
|
||||
|
||||
-  it("should create and search a nested vector index", async () => {
+  it("should preserve canonical nested field paths across index lifecycle", async () => {
     const db = await connect(tmpDir.name);
     const nestedSchema = new Schema([
-      new Field("id", new Int32(), true),
+      new Field("rowId", new Int32(), true),
|
||||
new Field("row-id", new Int32(), true),
|
||||
new Field("userId", new Int32(), true),
|
||||
new Field(
|
||||
"metadata",
|
||||
new Struct([new Field("user_id", new Int32(), true)]),
|
||||
true,
|
||||
),
|
||||
new Field(
|
||||
"MetaData",
|
||||
new Struct([new Field("userId", new Int32(), true)]),
|
||||
true,
|
||||
),
|
||||
new Field(
|
||||
"image",
|
||||
new Struct([
|
||||
@@ -796,28 +944,147 @@ describe("When creating an index", () => {
|
||||
]),
|
||||
true,
|
||||
),
|
||||
new Field(
|
||||
"payload",
|
||||
new Struct([new Field("text", new Utf8(), true)]),
|
||||
true,
|
||||
),
|
||||
new Field(
|
||||
"meta-data",
|
||||
new Struct([new Field("user-id", new Int32(), true)]),
|
||||
true,
|
||||
),
|
||||
new Field(
|
||||
"literal",
|
||||
new Struct([new Field("a.b", new Int32(), true)]),
|
||||
true,
|
||||
),
|
||||
]);
|
||||
const nestedTable = await db.createTable(
|
||||
-      "nested_vector",
+      "nested_field_index_lifecycle",
       makeArrowTable(
-        Array.from({ length: 300 }, (_, id) => ({
-          id,
-          image: { embedding: [id, id + 1] },
+        Array.from({ length: 300 }, (_, rowId) => ({
+          rowId,
+          "row-id": rowId,
+          userId: rowId,
+          metadata: { ["user_id"]: rowId },
+          ["MetaData"]: { userId: rowId },
+          image: { embedding: [rowId, rowId + 1] },
+          payload: { text: `document ${rowId}` },
+          "meta-data": { "user-id": rowId },
+          literal: { "a.b": rowId },
         })),
|
||||
{ schema: nestedSchema },
|
||||
),
|
||||
);
|
||||
|
||||
await nestedTable.createIndex("rowId", {
|
||||
config: Index.btree(),
|
||||
name: "row_id_idx",
|
||||
});
|
||||
await nestedTable.createIndex("`row-id`", {
|
||||
config: Index.btree(),
|
||||
name: "row_dash_id_idx",
|
||||
});
|
||||
await nestedTable.createIndex("userId", {
|
||||
config: Index.btree(),
|
||||
name: "top_user_id_idx",
|
||||
});
|
||||
await nestedTable.createIndex("metadata.user_id", {
|
||||
config: Index.btree(),
|
||||
name: "nested_user_id_idx",
|
||||
});
|
||||
await nestedTable.createIndex("MetaData.userId", {
|
||||
config: Index.btree(),
|
||||
name: "mixed_case_metadata_user_id_idx",
|
||||
});
|
||||
await nestedTable.createIndex("`meta-data`.`user-id`", {
|
||||
config: Index.btree(),
|
||||
name: "escaped_names_idx",
|
||||
});
|
||||
await nestedTable.createIndex("literal.`a.b`", {
|
||||
config: Index.btree(),
|
||||
name: "literal_dot_idx",
|
||||
});
|
||||
await nestedTable.createIndex("image.embedding", {
|
||||
name: "image_embedding_idx",
|
||||
});
|
||||
-    const indices = await nestedTable.listIndices();
-    expect(indices).toContainEqual({
-      name: "image_embedding_idx",
-      indexType: "IvfPq",
-      columns: ["image.embedding"],
+    await nestedTable.createIndex("payload.text", {
+      config: Index.fts({ withPosition: false }),
+      name: "payload_text_idx",
     });
|
||||
|
||||
const indices = await nestedTable.listIndices();
|
||||
expect(indices).toEqual(
|
||||
expect.arrayContaining([
|
||||
expect.objectContaining({
|
||||
name: "row_id_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["rowId"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "row_dash_id_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["`row-id`"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "top_user_id_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["userId"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "nested_user_id_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["metadata.user_id"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "mixed_case_metadata_user_id_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["MetaData.userId"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "escaped_names_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["`meta-data`.`user-id`"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "literal_dot_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["literal.`a.b`"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "image_embedding_idx",
|
||||
indexType: "IvfPq",
|
||||
columns: ["image.embedding"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "payload_text_idx",
|
||||
indexType: "FTS",
|
||||
columns: ["payload.text"],
|
||||
}),
|
||||
]),
|
||||
);
|
||||
|
||||
const stats = await nestedTable.indexStats(
|
||||
"mixed_case_metadata_user_id_idx",
|
||||
);
|
||||
expect(stats?.numIndexedRows).toEqual(300);
|
||||
expect(stats?.indexType).toEqual("BTREE");
|
||||
|
||||
const filtered = await nestedTable
|
||||
.query()
|
||||
.where("MetaData.userId = 42")
|
||||
.limit(1)
|
||||
.toArray();
|
||||
expect(filtered[0].MetaData.userId).toEqual(42);
|
||||
|
||||
const escapedFiltered = await nestedTable
|
||||
.query()
|
||||
.where("`row-id` = 43")
|
||||
.limit(1)
|
||||
.toArray();
|
||||
expect(escapedFiltered[0]["row-id"]).toEqual(43);
|
||||
|
||||
const explicit = await nestedTable
|
||||
.query()
|
||||
.nearestTo([0.0, 1.0])
|
||||
@@ -829,7 +1096,37 @@ describe("When creating an index", () => {
|
||||
.nearestTo([0.0, 1.0])
|
||||
.limit(1)
|
||||
.toArray();
|
||||
-    expect(inferred[0].id).toEqual(explicit[0].id);
+    expect(inferred[0].rowId).toEqual(explicit[0].rowId);
|
||||
|
||||
await nestedTable.add([
|
||||
{
|
||||
rowId: 300,
|
||||
"row-id": 300,
|
||||
userId: 300,
|
||||
metadata: { ["user_id"]: 300 },
|
||||
["MetaData"]: { userId: 300 },
|
||||
image: { embedding: [300.0, 301.0] },
|
||||
payload: { text: "document 300" },
|
||||
"meta-data": { "user-id": 300 },
|
||||
literal: { "a.b": 300 },
|
||||
},
|
||||
]);
|
||||
await nestedTable.optimize();
|
||||
const indicesAfterOptimize = await nestedTable.listIndices();
|
||||
expect(indicesAfterOptimize).toEqual(
|
||||
expect.arrayContaining([
|
||||
expect.objectContaining({
|
||||
name: "mixed_case_metadata_user_id_idx",
|
||||
indexType: "BTree",
|
||||
columns: ["MetaData.userId"],
|
||||
}),
|
||||
expect.objectContaining({
|
||||
name: "image_embedding_idx",
|
||||
indexType: "IvfPq",
|
||||
columns: ["image.embedding"],
|
||||
}),
|
||||
]),
|
||||
);
|
||||
});
|
||||
|
||||
it("should report multiple nested vector candidates", async () => {
|
||||
@@ -963,11 +1260,13 @@ describe("When creating an index", () => {
|
||||
expect(fs.readdirSync(indexDir)).toHaveLength(1);
|
||||
const indices = await tbl.listIndices();
|
||||
expect(indices.length).toBe(1);
|
||||
-    expect(indices[0]).toEqual({
-      name: "vec_idx",
-      indexType: "IvfHnswSq",
-      columns: ["vec"],
-    });
+    expect(indices[0]).toEqual(
+      expect.objectContaining({
+        name: "vec_idx",
+        indexType: "IvfHnswSq",
+        columns: ["vec"],
+      }),
+    );
|
||||
|
||||
// Search without specifying the column
|
||||
let rst = await tbl
|
||||
@@ -1140,6 +1439,20 @@ describe("When creating an index", () => {
|
||||
expect(fs.readdirSync(indexDir)).toHaveLength(1);
|
||||
});
|
||||
|
||||
test("create an FM index", async () => {
|
||||
// FM-Index accelerates substring search on a string/binary column.
|
||||
const db = await connect(tmpDir.name);
|
||||
const fmTbl = await db.createTable("fm_table", [
|
||||
{ id: 0, text: "hello world" },
|
||||
{ id: 1, text: "foo bar" },
|
||||
]);
|
||||
await fmTbl.createIndex("text", {
|
||||
config: Index.fm(),
|
||||
});
|
||||
const indexDir = path.join(tmpDir.name, "fm_table.lance", "_indices");
|
||||
expect(fs.readdirSync(indexDir)).toHaveLength(1);
|
||||
});
|
||||
|
||||
test("should be able to get index stats", async () => {
|
||||
await tbl.createIndex("id");
|
||||
|
||||
@@ -1150,7 +1463,6 @@ describe("When creating an index", () => {
|
||||
expect(stats?.distanceType).toBeUndefined();
|
||||
expect(stats?.indexType).toEqual("BTREE");
|
||||
expect(stats?.numIndices).toEqual(1);
|
||||
-    expect(stats?.loss).toBeUndefined();
|
||||
});
|
||||
|
||||
test("when getting stats on non-existent index", async () => {
|
||||
@@ -1300,6 +1612,35 @@ describe("When creating an index", () => {
|
||||
expect(rst64Query.toString()).toEqual(rst64Search.toString());
|
||||
expect(rst64Query.numRows).toBe(2);
|
||||
});
|
||||
|
||||
it("should expose rich metadata fields on IndexConfig", async () => {
|
||||
await tbl.createIndex("id", { config: Index.btree() });
|
||||
await tbl.createIndex("vec");
|
||||
|
||||
const indicesByName = Object.fromEntries(
|
||||
(await tbl.listIndices()).map((idx) => [idx.name, idx]),
|
||||
);
|
||||
|
||||
const scalarIdx = indicesByName["id_idx"];
|
||||
expect(scalarIdx).toBeDefined();
|
||||
expect(typeof scalarIdx.indexUuid).toBe("string");
|
||||
expect(scalarIdx.numIndexedRows).toBe(300);
|
||||
expect(scalarIdx.numUnindexedRows).toBe(0);
|
||||
expect(scalarIdx.numSegments).toBeGreaterThanOrEqual(1);
|
||||
expect(scalarIdx.sizeBytes).toBeGreaterThan(0);
|
||||
// Use toString check to avoid cross-realm instanceof failures with native Date objects
|
||||
expect(Object.prototype.toString.call(scalarIdx.createdAt)).toBe(
|
||||
"[object Date]",
|
||||
);
|
||||
expect((scalarIdx.createdAt as Date).getTime()).toBeGreaterThan(0);
|
||||
expect(typeof scalarIdx.indexDetails).toBe("object");
|
||||
|
||||
const vectorIdx = indicesByName["vec_idx"];
|
||||
expect(vectorIdx).toBeDefined();
|
||||
expect(typeof vectorIdx.indexUuid).toBe("string");
|
||||
expect(vectorIdx.numIndexedRows).toBe(300);
|
||||
expect(typeof vectorIdx.indexDetails).toBe("object");
|
||||
});
|
||||
});
|
||||
|
||||
describe("When querying a table", () => {
|
||||
|
||||
@@ -84,6 +84,20 @@ export interface CreateTableOptions {
|
||||
}
|
||||
|
||||
export interface OpenTableOptions {
|
||||
/**
|
||||
* Open the table scoped to this branch instead of the default branch.
|
||||
*
|
||||
* Reads and writes on the returned table operate in the branch's context.
|
||||
*/
|
||||
branch?: string;
|
||||
/**
|
||||
* Open the table pinned to this version, producing a read-only view.
|
||||
*
|
||||
* Composes with {@link OpenTableOptions.branch}: when both are set, opens
|
||||
* that branch at the version; otherwise opens `main` at the version. Call
|
||||
* `checkoutLatest` to return to a writable state.
|
||||
*/
|
||||
version?: number;
|
||||
/**
|
||||
* Configuration for object storage.
|
||||
*
|
||||
@@ -483,7 +497,20 @@ export class LocalConnection extends Connection {
|
||||
options?.indexCacheSize,
|
||||
);
|
||||
|
||||
-    return new LocalTable(innerTable);
+    let table: Table = new LocalTable(innerTable);
+    // "main" is the default branch, so treat it as no branch. On a real branch,
+    // scope and pin in one step (yielding "version V of branch B"); otherwise
+    // pin the version, if any, against main.
+    const branch =
+      options?.branch != null && options.branch !== "main"
+        ? options.branch
+        : undefined;
+    if (branch != null) {
+      table = await (await table.branches()).checkout(branch, options?.version);
+    } else if (options?.version != null) {
+      await table.checkout(options.version);
+    }
+    return table;
|
||||
}
|
||||
|
||||
async cloneTable(
|
||||
|
||||
@@ -38,6 +38,7 @@ export {
|
||||
FragmentSummaryStats,
|
||||
Tags,
|
||||
TagContents,
|
||||
BranchContents,
|
||||
MergeResult,
|
||||
AddResult,
|
||||
AddColumnsResult,
|
||||
@@ -51,6 +52,7 @@ export {
|
||||
SplitHashOptions,
|
||||
SplitSequentialOptions,
|
||||
ShuffleOptions,
|
||||
OAuthConfig as NativeOAuthConfig,
|
||||
} from "./native.js";
|
||||
|
||||
export {
|
||||
@@ -111,6 +113,7 @@ export {
|
||||
|
||||
export {
|
||||
Table,
|
||||
Branches,
|
||||
AddDataOptions,
|
||||
UpdateOptions,
|
||||
OptimizeOptions,
|
||||
@@ -128,6 +131,8 @@ export {
|
||||
TokenResponse,
|
||||
} from "./header";
|
||||
|
||||
export { OAuthConfig, OAuthFlowType } from "./oauth";
|
||||
|
||||
export { MergeInsertBuilder, WriteExecutionOptions } from "./merge";
|
||||
|
||||
export * as embedding from "./embedding";
|
||||
|
||||
@@ -702,6 +702,17 @@ export class Index {
|
||||
return new Index(LanceDbIndex.labelList());
|
||||
}
|
||||
|
||||
/**
|
||||
* Create an FM-Index.
|
||||
*
|
||||
* An FM-Index is a scalar index on string or binary columns that accelerates
|
||||
* substring search, i.e. `contains(col, 'needle')`. Unlike the tokenized
|
||||
* full-text-search index, it matches arbitrary substrings of the raw bytes.
|
||||
*/
|
||||
static fm() {
|
||||
return new Index(LanceDbIndex.fm());
|
||||
}
|
||||
|
||||
/**
|
||||
* Create a full text search index
|
||||
*
|
||||
|
||||
76
nodejs/lancedb/oauth.ts
Normal file
@@ -0,0 +1,76 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
/**
|
||||
* OAuth authentication flow types.
|
||||
*/
|
||||
export enum OAuthFlowType {
|
||||
/** Client Credentials grant (service-to-service / M2M). */
|
||||
ClientCredentials = "client_credentials",
|
||||
/** Azure Managed Identity via IMDS. */
|
||||
AzureManagedIdentity = "azure_managed_identity",
|
||||
}
|
||||
|
||||
/**
|
||||
* OAuth configuration for LanceDB authentication.
|
||||
*
|
||||
* This is the public TypeScript OAuth configuration type. The generated
|
||||
* `NativeOAuthConfig` type has the same runtime shape but is an implementation
|
||||
* detail of the napi-rs binding.
|
||||
*
|
||||
* All token acquisition and refresh is handled in the Rust layer.
|
||||
* This config is passed through to Rust via napi-rs.
|
||||
*
|
||||
* @example Client Credentials (service-to-service):
|
||||
* ```typescript
|
||||
* const config: OAuthConfig = {
|
||||
* issuerUrl: "https://login.microsoftonline.com/{tenant}/v2.0",
|
||||
* clientId: "app-id",
|
||||
* clientSecret: "secret",
|
||||
* scopes: ["api://lancedb-api/.default"],
|
||||
* };
|
||||
* ```
|
||||
*
|
||||
* @example Azure Managed Identity:
|
||||
* ```typescript
|
||||
* const config: OAuthConfig = {
|
||||
* issuerUrl: "https://login.microsoftonline.com/{tenant}/v2.0",
|
||||
* clientId: "app-id",
|
||||
* scopes: ["api://lancedb-api/.default"],
|
||||
* flow: OAuthFlowType.AzureManagedIdentity,
|
||||
* };
|
||||
* ```
|
||||
*/
|
||||
export interface OAuthConfig {
|
||||
/**
|
||||
* OIDC issuer URL or OAuth authority URL.
|
||||
* For Azure: `https://login.microsoftonline.com/{tenant_id}/v2.0`
|
||||
*/
|
||||
issuerUrl: string;
|
||||
|
||||
/** Application / Client ID. */
|
||||
clientId: string;
|
||||
|
||||
/**
|
||||
* OAuth scopes to request.
|
||||
* For Azure managed identity, exactly one scope or resource is required.
|
||||
* For example: `["api://{app_id}/.default"]`
|
||||
*/
|
||||
scopes: string[];
|
||||
|
||||
/** Authentication flow (default: ClientCredentials). */
|
||||
flow?: OAuthFlowType;
|
||||
|
||||
/** Client secret (required for ClientCredentials). */
|
||||
clientSecret?: string;
|
||||
|
||||
/** Client ID for user-assigned managed identity (AzureManagedIdentity). */
|
||||
managedIdentityClientId?: string;
|
||||
|
||||
/**
|
||||
* Seconds before expiry to trigger proactive refresh (default: 300).
|
||||
* Keep this well below the token TTL; if it is greater than or equal to
|
||||
* the TTL, each request refreshes the token.
|
||||
*/
|
||||
refreshBufferSecs?: number;
|
||||
}
|
||||
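The `refreshBufferSecs` comment above describes a proactive-refresh window. The token logic itself lives in the Rust layer and is not part of this diff, so the following is only a minimal sketch in Python of how such a window behaves. The function name `needs_refresh` and the exact comparison are assumptions made for illustration, not the library's implementation.

```python
def needs_refresh(now: float, expires_at: float, refresh_buffer_secs: int = 300) -> bool:
    """Hypothetical check: True once `now` falls inside the buffer before expiry."""
    return now >= expires_at - refresh_buffer_secs


# Illustrative numbers only: a token issued at t=1000 with a one-hour TTL.
issued_at = 1_000.0
ttl_secs = 3_600
expires_at = issued_at + ttl_secs

# One minute after issue, the default 300 s buffer has not been reached.
fresh = needs_refresh(issued_at + 60, expires_at)

# 200 s before expiry the token is inside the buffer, so it would be refreshed.
near_expiry = needs_refresh(issued_at + 3_400, expires_at)

# With a buffer equal to the TTL, even a brand-new token already counts as
# due for refresh, which is the "each request refreshes" case in the comment.
always_stale = needs_refresh(issued_at, expires_at, refresh_buffer_secs=ttl_secs)
```

Under these assumptions, keeping the buffer well below the TTL leaves most of the token's lifetime usable between refreshes.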
@@ -25,10 +25,12 @@ import {
|
||||
AddColumnsSql,
|
||||
AddResult,
|
||||
AlterColumnsResult,
|
||||
BranchContents,
|
||||
DeleteResult,
|
||||
DropColumnsResult,
|
||||
IndexConfig,
|
||||
IndexStatistics,
|
||||
Branches as NativeBranches,
|
||||
OptimizeStats,
|
||||
TableStatistics,
|
||||
Tags,
|
||||
@@ -653,6 +655,22 @@ export abstract class Table {
|
||||
*/
|
||||
abstract tags(): Promise<Tags>;
|
||||
|
||||
/**
|
||||
* Get the branch manager for this table.
|
||||
*
|
||||
* Branches are isolated, writable lines of history forked from another
|
||||
* branch (or version). Writes on a branch do not affect `main`.
|
||||
*/
|
||||
abstract branches(): Promise<Branches>;
|
||||
|
||||
/**
|
||||
* The branch this table handle is scoped to, or `null` for the main branch.
|
||||
*
|
||||
* A handle returned by {@link Branches.create} or {@link Branches.checkout}
|
||||
* reports the branch it targets; a handle opened normally reports `null`.
|
||||
*/
|
||||
abstract currentBranch(): string | null;
|
||||
|
||||
/**
|
||||
* Restore the table to the currently checked out version
|
||||
*
|
||||
@@ -1108,6 +1126,14 @@ export class LocalTable extends Table {
|
||||
return await this.inner.tags();
|
||||
}
|
||||
|
||||
async branches(): Promise<Branches> {
|
||||
return new Branches(await this.inner.branches());
|
||||
}
|
||||
|
||||
currentBranch(): string | null {
|
||||
return this.inner.currentBranch() ?? null;
|
||||
}
|
||||
|
||||
async optimize(options?: Partial<OptimizeOptions>): Promise<OptimizeStats> {
|
||||
let cleanupOlderThanMs;
|
||||
if (
|
||||
@@ -1238,3 +1264,57 @@ export interface FieldMetadataUpdate {
|
||||
/** If true, replace the field's entire metadata map instead of merging. */
|
||||
replace?: boolean;
|
||||
}
|
||||
|
||||
/**
|
||||
* Branch manager for a {@link Table}.
|
||||
*
|
||||
* Unlike tags, `create` and `checkout` return a new {@link Table} handle scoped
|
||||
* to the branch; writes on it do not affect `main`.
|
||||
*/
|
||||
export class Branches {
|
||||
#inner: NativeBranches;
|
||||
|
||||
/**
|
||||
* Construct a Branches manager. Internal use only.
|
||||
* @hidden
|
||||
*/
|
||||
constructor(inner: NativeBranches) {
|
||||
this.#inner = inner;
|
||||
}
|
||||
|
||||
/** List all branches, mapping name to branch metadata. */
|
||||
async list(): Promise<Record<string, BranchContents>> {
|
||||
return await this.#inner.list();
|
||||
}
|
||||
|
||||
/**
|
||||
* Create a branch and return a handle scoped to it.
|
||||
*
|
||||
* @param name Name of the new branch.
|
||||
* @param fromRef Source branch to fork from. Defaults to `main`.
|
||||
* @param fromVersion A specific version on `fromRef`. Defaults to latest.
|
||||
*/
|
||||
async create(
|
||||
name: string,
|
||||
fromRef?: string,
|
||||
fromVersion?: number,
|
||||
): Promise<Table> {
|
||||
return new LocalTable(await this.#inner.create(name, fromRef, fromVersion));
|
||||
}
|
||||
|
||||
/**
|
||||
* Check out an existing branch and return a handle scoped to it.
|
||||
*
|
||||
* With `version` set, the returned handle is pinned to that version of the
|
||||
* branch (a read-only, detached view); otherwise it tracks the branch's
|
||||
* latest and stays writable.
|
||||
*/
|
||||
async checkout(name: string, version?: number): Promise<Table> {
|
||||
return new LocalTable(await this.#inner.checkout(name, version));
|
||||
}
|
||||
|
||||
/** Delete a branch. */
|
||||
async delete(name: string): Promise<void> {
|
||||
return await this.#inner.delete(name);
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-darwin-arm64",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": ["darwin"],
|
||||
"cpu": ["arm64"],
|
||||
"main": "lancedb.darwin-arm64.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-arm64-gnu",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": ["linux"],
|
||||
"cpu": ["arm64"],
|
||||
"main": "lancedb.linux-arm64-gnu.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-arm64-musl",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": ["linux"],
|
||||
"cpu": ["arm64"],
|
||||
"main": "lancedb.linux-arm64-musl.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-x64-gnu",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": ["linux"],
|
||||
"cpu": ["x64"],
|
||||
"main": "lancedb.linux-x64-gnu.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-linux-x64-musl",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": ["linux"],
|
||||
"cpu": ["x64"],
|
||||
"main": "lancedb.linux-x64-musl.node",
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-win32-arm64-msvc",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": [
|
||||
"win32"
|
||||
],
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb-win32-x64-msvc",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"os": ["win32"],
|
||||
"cpu": ["x64"],
|
||||
"main": "lancedb.win32-x64-msvc.node",
|
||||
|
||||
14
nodejs/package-lock.json
generated
@@ -1,12 +1,12 @@
|
||||
{
|
||||
"name": "@lancedb/lancedb",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"lockfileVersion": 3,
|
||||
"requires": true,
|
||||
"packages": {
|
||||
"": {
|
||||
"name": "@lancedb/lancedb",
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"cpu": [
|
||||
"x64",
|
||||
"arm64"
|
||||
@@ -26,7 +26,7 @@
|
||||
"@aws-sdk/client-s3": "3.1003.0",
|
||||
"@biomejs/biome": "^1.7.3",
|
||||
"@jest/globals": "^29.7.0",
|
||||
"@napi-rs/cli": "3.5.1",
|
||||
"@napi-rs/cli": "3.7.0",
|
||||
"@types/axios": "^0.14.0",
|
||||
"@types/jest": "^29.1.2",
|
||||
"@types/node": "22.7.4",
|
||||
@@ -2942,9 +2942,9 @@
|
||||
}
|
||||
},
|
||||
"node_modules/@napi-rs/cli": {
|
||||
"version": "3.5.1",
|
||||
"resolved": "https://registry.npmjs.org/@napi-rs/cli/-/cli-3.5.1.tgz",
|
||||
"integrity": "sha512-XBfLQRDcB3qhu6bazdMJsecWW55kR85l5/k0af9BIBELXQSsCFU0fzug7PX8eQp6vVdm7W/U3z6uP5WmITB2Gw==",
|
||||
"version": "3.7.0",
|
||||
"resolved": "https://registry.npmjs.org/@napi-rs/cli/-/cli-3.7.0.tgz",
|
||||
"integrity": "sha512-3d3+rmxlOIV/G1zPWeX4PCxuYnhcCQM2BvY9rtimC8RO0dFR9gtYP+Grov+WoduZtfWRj5N1XvytWeRxxCk5zw==",
|
||||
"dev": true,
|
||||
"license": "MIT",
|
||||
"dependencies": {
|
||||
@@ -2954,7 +2954,7 @@
|
||||
"@octokit/rest": "^22.0.1",
|
||||
"clipanion": "^4.0.0-rc.4",
|
||||
"colorette": "^2.0.20",
|
||||
"emnapi": "^1.7.1",
|
||||
"emnapi": "^1.10.0",
|
||||
"es-toolkit": "^1.41.0",
|
||||
"js-yaml": "^4.1.0",
|
||||
"obug": "^2.0.0",
|
||||
|
||||
@@ -11,7 +11,7 @@
|
||||
"ann"
|
||||
],
|
||||
"private": false,
|
||||
"version": "0.30.1-beta.0",
|
||||
"version": "0.31.0-beta.4",
|
||||
"main": "dist/index.js",
|
||||
"exports": {
|
||||
".": "./dist/index.js",
|
||||
@@ -43,7 +43,7 @@
|
||||
"@aws-sdk/client-s3": "3.1003.0",
|
||||
"@biomejs/biome": "^1.7.3",
|
||||
"@jest/globals": "^29.7.0",
|
||||
"@napi-rs/cli": "3.5.1",
|
||||
"@napi-rs/cli": "3.7.0",
|
||||
"@types/axios": "^0.14.0",
|
||||
"@types/jest": "^29.1.2",
|
||||
"@types/node": "22.7.4",
|
||||
|
||||
10
nodejs/pnpm-lock.yaml
generated
@@ -31,8 +31,8 @@ importers:
|
||||
specifier: ^29.7.0
|
||||
version: 29.7.0
|
||||
'@napi-rs/cli':
|
||||
specifier: 3.5.1
|
||||
version: 3.5.1(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)
|
||||
specifier: 3.7.0
|
||||
version: 3.7.0(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)
|
||||
'@types/axios':
|
||||
specifier: ^0.14.0
|
||||
version: 0.14.4
|
||||
@@ -887,8 +887,8 @@ packages:
|
||||
'@jridgewell/trace-mapping@0.3.31':
|
||||
resolution: {integrity: sha512-zzNR+SdQSDJzc8joaeP8QQoCQr8NuYx2dIIytl1QeBEZHJ9uW6hebsrYgbz8hJwUQao3TWCMtmfV8Nu1twOLAw==}
|
||||
|
||||
'@napi-rs/cli@3.5.1':
|
||||
resolution: {integrity: sha512-XBfLQRDcB3qhu6bazdMJsecWW55kR85l5/k0af9BIBELXQSsCFU0fzug7PX8eQp6vVdm7W/U3z6uP5WmITB2Gw==}
|
||||
'@napi-rs/cli@3.7.0':
|
||||
resolution: {integrity: sha512-3d3+rmxlOIV/G1zPWeX4PCxuYnhcCQM2BvY9rtimC8RO0dFR9gtYP+Grov+WoduZtfWRj5N1XvytWeRxxCk5zw==}
|
||||
engines: {node: '>= 16'}
|
||||
hasBin: true
|
||||
peerDependencies:
|
||||
@@ -4582,7 +4582,7 @@ snapshots:
|
||||
'@jridgewell/resolve-uri': 3.1.2
|
||||
'@jridgewell/sourcemap-codec': 1.5.5
|
||||
|
||||
'@napi-rs/cli@3.5.1(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)':
|
||||
'@napi-rs/cli@3.7.0(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)(@types/node@22.7.4)':
|
||||
dependencies:
|
||||
'@inquirer/prompts': 8.4.3(@types/node@22.7.4)
|
||||
'@napi-rs/cross-toolchain': 1.0.3(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)
|
||||
|
||||
@@ -112,6 +112,12 @@ impl Connection {
|
||||
|
||||
builder = builder.client_config(rust_config);
|
||||
|
||||
if let Some(oauth_config) = options.oauth_config {
|
||||
let config: lancedb::remote::oauth::OAuthConfig =
|
||||
oauth_config.try_into().default_error()?;
|
||||
builder = builder.oauth_config(config);
|
||||
}
|
||||
|
||||
if let Some(api_key) = options.api_key {
|
||||
builder = builder.api_key(&api_key);
|
||||
}
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
use std::sync::Mutex;
|
||||
|
||||
use lancedb::index::Index as LanceDbIndex;
|
||||
use lancedb::index::scalar::{BTreeIndexBuilder, FtsIndexBuilder};
|
||||
use lancedb::index::scalar::{BTreeIndexBuilder, FmIndexBuilder, FtsIndexBuilder};
|
||||
use lancedb::index::vector::{
|
||||
IvfFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder, IvfPqIndexBuilder,
|
||||
IvfRqIndexBuilder,
|
||||
@@ -143,6 +143,13 @@ impl Index {
|
||||
}
|
||||
}
|
||||
|
||||
#[napi(factory)]
|
||||
pub fn fm() -> Self {
|
||||
Self {
|
||||
inner: Mutex::new(Some(LanceDbIndex::Fm(FmIndexBuilder::default()))),
|
||||
}
|
||||
}
|
||||
|
||||
#[napi(factory)]
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub fn fts(
|
||||
|
||||
@@ -65,6 +65,11 @@ pub struct ConnectionOptions {
|
||||
/// (For LanceDB cloud only): the host to use for LanceDB cloud. Used
|
||||
/// for testing purposes.
|
||||
pub host_override: Option<String>,
|
||||
/// (For LanceDB cloud only): OAuth configuration for IdP-based
|
||||
/// authentication (e.g., Azure Entra ID). When set, token acquisition
|
||||
/// and refresh are handled entirely in Rust. TypeScript users should pass
|
||||
/// the public `OAuthConfig` type exported from `@lancedb/lancedb`.
|
||||
pub oauth_config: Option<remote::OAuthConfig>,
|
||||
}
|
||||
|
||||
#[napi(object)]
|
||||
|
||||
@@ -3,6 +3,7 @@
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use lancedb::error::Error;
|
||||
use napi_derive::*;
|
||||
|
||||
/// Timeout configuration for remote HTTP client.
|
||||
@@ -140,6 +141,84 @@ impl From<TlsConfig> for lancedb::remote::TlsConfig {
|
||||
}
|
||||
}
|
||||
|
||||
/// OAuth configuration for LanceDB authentication.
|
||||
///
|
||||
/// This is the generated napi-rs binding shape. TypeScript users should prefer
|
||||
/// the public `OAuthConfig` type exported from `@lancedb/lancedb`.
|
||||
///
|
||||
/// All token acquisition and refresh is handled in the Rust layer.
|
||||
#[napi(object)]
|
||||
#[derive(Clone)]
|
||||
pub struct OAuthConfig {
|
||||
/// OIDC issuer URL or OAuth authority URL.
|
||||
/// For Azure: `https://login.microsoftonline.com/{tenant_id}/v2.0`
|
||||
pub issuer_url: String,
|
||||
/// Application / Client ID.
|
||||
pub client_id: String,
|
||||
/// OAuth scopes to request. For Azure managed identity, exactly one scope
|
||||
/// or resource is required. For example: `["api://{app_id}/.default"]`
|
||||
pub scopes: Vec<String>,
|
||||
/// Authentication flow: "client_credentials" or "azure_managed_identity"
|
||||
pub flow: Option<String>,
|
||||
/// Client secret (required for client_credentials).
|
||||
pub client_secret: Option<String>,
|
||||
/// Client ID for user-assigned managed identity (azure_managed_identity).
|
||||
pub managed_identity_client_id: Option<String>,
|
||||
/// Seconds before expiry to trigger proactive refresh (default: 300).
|
||||
/// Keep this well below the token TTL; if it is greater than or equal to
|
||||
/// the TTL, each request refreshes the token.
|
||||
pub refresh_buffer_secs: Option<u32>,
|
||||
}
|
||||
|
||||
impl std::fmt::Debug for OAuthConfig {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
f.debug_struct("OAuthConfig")
|
||||
.field("issuer_url", &self.issuer_url)
|
||||
.field("client_id", &self.client_id)
|
||||
.field("scopes", &self.scopes)
|
||||
.field("flow", &self.flow)
|
||||
.field(
|
||||
"client_secret",
|
||||
&self.client_secret.as_deref().map(|_| "<redacted>"),
|
||||
)
|
||||
.field(
|
||||
"managed_identity_client_id",
|
||||
&self.managed_identity_client_id,
|
||||
)
|
||||
.field("refresh_buffer_secs", &self.refresh_buffer_secs)
|
||||
.finish()
|
||||
}
|
||||
}
|
||||
|
||||
impl TryFrom<OAuthConfig> for lancedb::remote::oauth::OAuthConfig {
|
||||
type Error = Error;
|
||||
|
||||
fn try_from(config: OAuthConfig) -> Result<Self, Self::Error> {
|
||||
use lancedb::remote::oauth::OAuthFlow;
|
||||
|
||||
let flow = match config.flow.as_deref().unwrap_or("client_credentials") {
|
||||
"client_credentials" => OAuthFlow::ClientCredentials,
|
||||
"azure_managed_identity" => OAuthFlow::AzureManagedIdentity {
|
||||
client_id: config.managed_identity_client_id,
|
||||
},
|
||||
other => {
|
||||
return Err(Error::InvalidInput {
|
||||
message: format!("Unknown OAuth flow type: {other}"),
|
||||
});
|
||||
}
|
||||
};
|
||||
|
||||
Ok(Self {
|
||||
issuer_url: config.issuer_url,
|
||||
client_id: config.client_id,
|
||||
client_secret: config.client_secret,
|
||||
scopes: config.scopes,
|
||||
flow,
|
||||
refresh_buffer_secs: config.refresh_buffer_secs.map(|v| v as u64),
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl From<ClientConfig> for lancedb::remote::ClientConfig {
|
||||
fn from(config: ClientConfig) -> Self {
|
||||
Self {
|
||||
@@ -156,3 +235,45 @@ impl From<ClientConfig> for lancedb::remote::ClientConfig {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_unknown_oauth_flow_returns_invalid_input() {
|
||||
let config = OAuthConfig {
|
||||
issuer_url: "https://issuer.example.com".to_string(),
|
||||
client_id: "client-id".to_string(),
|
||||
scopes: vec!["scope".to_string()],
|
||||
flow: Some("typo".to_string()),
|
||||
client_secret: None,
|
||||
managed_identity_client_id: None,
|
||||
refresh_buffer_secs: None,
|
||||
};
|
||||
|
||||
let err = lancedb::remote::oauth::OAuthConfig::try_from(config).unwrap_err();
|
||||
assert!(matches!(
|
||||
err,
|
||||
Error::InvalidInput { message }
|
||||
if message == "Unknown OAuth flow type: typo"
|
||||
));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_oauth_config_debug_redacts_client_secret() {
|
||||
let config = OAuthConfig {
|
||||
issuer_url: "https://issuer.example.com".to_string(),
|
||||
client_id: "client-id".to_string(),
|
||||
scopes: vec!["scope".to_string()],
|
||||
flow: Some("client_credentials".to_string()),
|
||||
client_secret: Some("super-secret".to_string()),
|
||||
managed_identity_client_id: None,
|
||||
refresh_buffer_secs: None,
|
||||
};
|
||||
|
||||
let debug = format!("{config:?}");
|
||||
assert!(!debug.contains("super-secret"));
|
||||
assert!(debug.contains("client_secret: Some(\"<redacted>\")"));
|
||||
}
|
||||
}
|
||||
|
||||
@@ -3,11 +3,13 @@
|
||||
|
||||
use std::collections::HashMap;
|
||||
|
||||
use chrono::{DateTime, Utc};
|
||||
|
||||
use lancedb::ipc::{ipc_file_to_batches, ipc_file_to_schema};
|
||||
use lancedb::table::{
|
||||
AddDataMode, ColumnAlteration as LanceColumnAlteration, Duration,
|
||||
FieldMetadataUpdate as LanceFieldMetadataUpdate, NewColumnTransform, OptimizeAction,
|
||||
OptimizeOptions, Table as LanceDbTable,
|
||||
OptimizeOptions, Ref, Table as LanceDbTable,
|
||||
};
|
||||
use napi::bindgen_prelude::*;
|
||||
use napi::threadsafe_function::{ThreadsafeFunction, ThreadsafeFunctionCallMode};
|
||||
@@ -478,6 +480,19 @@ impl Table {
|
||||
})
|
||||
}
|
||||
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn branches(&self) -> napi::Result<Branches> {
|
||||
Ok(Branches {
|
||||
inner: self.inner_ref()?.clone(),
|
||||
})
|
||||
}
|
||||
|
||||
/// The branch this handle is scoped to, or `null` for the main branch.
|
||||
#[napi]
|
||||
pub fn current_branch(&self) -> napi::Result<Option<String>> {
|
||||
Ok(self.inner_ref()?.current_branch())
|
||||
}
|
||||
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn optimize(
|
||||
&self,
|
||||
@@ -595,6 +610,43 @@ pub struct IndexConfig {
|
||||
/// Currently this is always an array of size 1. In the future there may
|
||||
/// be more columns to represent composite indices.
|
||||
pub columns: Vec<String>,
|
||||
/// The UUID of the first segment of the index.
|
||||
///
|
||||
/// `undefined` for remote tables, which do not yet surface this.
|
||||
pub index_uuid: Option<String>,
|
||||
/// The protobuf type URL, a precise type identifier for the index.
|
||||
///
|
||||
/// `undefined` for remote tables.
|
||||
pub type_url: Option<String>,
|
||||
/// When the index was created.
|
||||
///
|
||||
/// `undefined` for remote tables or indices created before timestamps were tracked.
|
||||
pub created_at: Option<DateTime<Utc>>,
|
||||
/// The number of rows indexed, across all segments.
|
||||
///
|
||||
/// `undefined` for remote tables.
|
||||
pub num_indexed_rows: Option<i64>,
|
||||
/// The number of rows not yet covered by this index.
|
||||
///
|
||||
/// `undefined` for remote tables.
|
||||
pub num_unindexed_rows: Option<i64>,
|
||||
/// The total size in bytes of all index files across all segments.
|
||||
///
|
||||
/// `undefined` for remote tables or indices without size tracking.
|
||||
pub size_bytes: Option<i64>,
|
||||
/// The number of segments that make up the index.
|
||||
///
|
||||
/// `undefined` for remote tables.
|
||||
pub num_segments: Option<i32>,
|
||||
/// The on-disk index format version.
|
||||
///
|
||||
/// `undefined` for remote tables.
|
||||
pub index_version: Option<i32>,
|
||||
/// Index-type-specific details parsed as a JavaScript object.
|
||||
///
|
||||
/// Falls back to a raw string if JSON parsing fails. `undefined` for
|
||||
/// remote tables or when details are unavailable.
|
||||
pub index_details: Option<serde_json::Value>,
|
||||
}
|
||||
|
||||
impl From<lancedb::index::IndexConfig> for IndexConfig {
|
||||
@@ -604,6 +656,17 @@ impl From<lancedb::index::IndexConfig> for IndexConfig {
|
||||
index_type,
|
||||
columns: value.columns,
|
||||
name: value.name,
|
||||
index_uuid: value.index_uuid,
|
||||
type_url: value.type_url,
|
||||
created_at: value.created_at,
|
||||
num_indexed_rows: value.num_indexed_rows.map(|n| n as i64),
|
||||
num_unindexed_rows: value.num_unindexed_rows.map(|n| n as i64),
|
||||
size_bytes: value.size_bytes.map(|n| n as i64),
|
||||
num_segments: value.num_segments.map(|n| n as i32),
|
||||
index_version: value.index_version,
|
||||
index_details: value
|
||||
.index_details
|
||||
.and_then(|s| serde_json::from_str(&s).ok()),
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -838,9 +901,6 @@ pub struct IndexStatistics {
|
||||
pub distance_type: Option<String>,
|
||||
/// The number of parts this index is split into.
|
||||
pub num_indices: Option<u32>,
|
||||
/// The KMeans loss value of the index,
|
||||
/// it is only present for vector indices.
|
||||
pub loss: Option<f64>,
|
||||
}
|
||||
impl From<lancedb::index::IndexStatistics> for IndexStatistics {
|
||||
fn from(value: lancedb::index::IndexStatistics) -> Self {
|
||||
@@ -850,7 +910,6 @@ impl From<lancedb::index::IndexStatistics> for IndexStatistics {
|
||||
index_type: value.index_type.to_string(),
|
||||
distance_type: value.distance_type.map(|d| d.to_string()),
|
||||
num_indices: value.num_indices,
|
||||
loss: value.loss,
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -1060,6 +1119,13 @@ pub struct TagContents {
|
||||
pub manifest_size: i64,
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub struct BranchContents {
|
||||
pub parent_branch: Option<String>,
|
||||
pub parent_version: i64,
|
||||
pub manifest_size: i64,
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub struct Tags {
|
||||
inner: LanceDbTable,
|
||||
@@ -1128,3 +1194,75 @@ impl Tags {
|
||||
.default_error()
|
||||
}
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub struct Branches {
|
||||
inner: LanceDbTable,
|
||||
}
|
||||
|
||||
#[napi]
|
||||
impl Branches {
|
||||
#[napi]
|
||||
pub async fn list(&self) -> napi::Result<HashMap<String, BranchContents>> {
|
||||
let branches = self.inner.list_branches().await.default_error()?;
|
||||
let result = branches
|
||||
.into_iter()
|
||||
.map(|(k, v)| {
|
||||
(
|
||||
k,
|
||||
BranchContents {
|
||||
parent_branch: v.parent_branch,
|
||||
parent_version: v.parent_version as i64,
|
||||
manifest_size: v.manifest_size as i64,
|
||||
},
|
||||
)
|
||||
})
|
||||
.collect();
|
||||
Ok(result)
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub async fn create(
|
||||
&self,
|
||||
name: String,
|
||||
from_ref: Option<String>,
|
||||
from_version: Option<i64>,
|
||||
) -> napi::Result<Table> {
|
||||
let from_ref = from_ref.filter(|b| b != "main");
|
||||
let from_version = from_version
|
||||
.map(|v| {
|
||||
u64::try_from(v).map_err(|_| {
|
||||
napi::Error::from_reason("from_version must be a non-negative integer")
|
||||
})
|
||||
})
|
||||
.transpose()?;
|
||||
let from = Ref::Version(from_ref, from_version);
|
||||
let table = self
|
||||
.inner
|
||||
.create_branch(&name, from)
|
||||
.await
|
||||
.default_error()?;
|
||||
Ok(Table::new(table))
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub async fn checkout(&self, name: String, version: Option<i64>) -> napi::Result<Table> {
|
||||
let version = version
|
||||
.map(|v| {
|
||||
u64::try_from(v)
|
||||
.map_err(|_| napi::Error::from_reason("version must be a non-negative integer"))
|
||||
})
|
||||
.transpose()?;
|
||||
let table = self
|
||||
.inner
|
||||
.checkout_branch(&name, version)
|
||||
.await
|
||||
.default_error()?;
|
||||
Ok(Table::new(table))
|
||||
}
|
||||
|
||||
#[napi]
|
||||
pub async fn delete(&self, name: String) -> napi::Result<()> {
|
||||
self.inner.delete_branch(&name).await.default_error()
|
||||
}
|
||||
}
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
[tool.bumpversion]
|
||||
current_version = "0.33.1-beta.1"
|
||||
current_version = "0.34.0-beta.4"
|
||||
parse = """(?x)
|
||||
(?P<major>0|[1-9]\\d*)\\.
|
||||
(?P<minor>0|[1-9]\\d*)\\.
|
||||
@@ -23,6 +23,8 @@ allow_dirty = true
|
||||
commit = true
|
||||
message = "Bump version: {current_version} → {new_version}"
|
||||
commit_args = ""
|
||||
# bump-my-version >=1.4.0 rejects pre_commit_hooks containing shell syntax unless opted in.
|
||||
allow_shell_hooks = true
|
||||
|
||||
# Update Cargo.lock after version bump
|
||||
pre_commit_hooks = [
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "lancedb-python"
|
||||
version = "0.33.1-beta.1"
|
||||
version = "0.34.0-beta.4"
|
||||
publish = false
|
||||
edition.workspace = true
|
||||
description = "Python bindings for LanceDB"
|
||||
@@ -26,7 +26,8 @@ lance-namespace-impls.workspace = true
|
||||
lance-io.workspace = true
|
||||
env_logger.workspace = true
|
||||
log.workspace = true
|
||||
pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39"] }
|
||||
pyo3 = { version = "0.28", features = ["extension-module", "abi3-py39", "chrono"] }
|
||||
chrono = { version = "0.4", default-features = false, features = ["clock"] }
|
||||
pyo3-async-runtimes = { version = "0.28", features = [
|
||||
"attributes",
|
||||
"tokio-runtime",
|
||||
|
||||
@@ -17,6 +17,17 @@ from .db import AsyncConnection, DBConnection, LanceDBConnection
|
||||
from .remote import ClientConfig
|
||||
from .remote.db import RemoteDBConnection
|
||||
from .expr import Expr, col, lit, func
|
||||
from .udf import (
|
||||
udf,
|
||||
table_udf,
|
||||
Udf,
|
||||
JobHandle,
|
||||
JobFailedError,
|
||||
MaterializedView,
|
||||
AsyncJobHandle,
|
||||
AsyncMaterializedView,
|
||||
)
|
||||
from .lineage import Lineage, Node, Edge, FunctionRef
|
||||
from .schema import vector
|
||||
from .table import AsyncTable, Table
|
||||
from ._lancedb import Session
|
||||
@@ -89,6 +100,8 @@ def connect(
|
||||
If presented, connect to LanceDB cloud.
|
||||
Otherwise, connect to a database on file system or cloud storage.
|
||||
Can be set via environment variable `LANCEDB_API_KEY`.
|
||||
OAuth configuration is currently supported only by ``connect_async``;
|
||||
synchronous LanceDB Cloud connections require an API key.
|
||||
region: str, default "us-east-1"
|
||||
The region to use for LanceDB Cloud.
|
||||
host_override: str, optional
|
||||
@@ -340,6 +353,7 @@ async def connect_async(
|
||||
session: Optional[Session] = None,
|
||||
manifest_enabled: bool = False,
|
||||
namespace_client_properties: Optional[Dict[str, str]] = None,
|
||||
oauth_config=None,
|
||||
) -> AsyncConnection:
|
||||
"""Connect to a LanceDB database.
|
||||
|
||||
@@ -389,6 +403,10 @@ async def connect_async(
|
||||
namespace_client_properties : dict, optional
|
||||
Additional directory namespace client properties to use with
|
||||
``manifest_enabled=True``.
|
||||
oauth_config : OAuthConfig, optional
|
||||
OAuth configuration for LanceDB Cloud/Enterprise. This is supported by
|
||||
``connect_async`` only; synchronous ``connect`` uses API key
|
||||
authentication for ``db://`` URIs.
|
||||
|
||||
Examples
|
||||
--------
|
||||
@@ -435,11 +453,24 @@ async def connect_async(
|
||||
session,
|
||||
manifest_enabled,
|
||||
namespace_client_properties,
|
||||
oauth_config,
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
__all__ = [
|
||||
"udf",
|
||||
"table_udf",
|
||||
"Udf",
|
||||
"JobHandle",
|
||||
"JobFailedError",
|
||||
"MaterializedView",
|
||||
"AsyncJobHandle",
|
||||
"AsyncMaterializedView",
|
||||
"Lineage",
|
||||
"Node",
|
||||
"Edge",
|
||||
"FunctionRef",
|
||||
"connect",
|
||||
"connect_async",
|
||||
"connect_namespace",
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
from datetime import timedelta
|
||||
from datetime import datetime, timedelta
|
||||
from typing import Dict, List, Optional, Tuple, Any, TypedDict, Union, Literal
|
||||
|
||||
import pyarrow as pa
|
||||
@@ -10,6 +10,7 @@ from .index import (
|
||||
IvfSq,
|
||||
Bitmap,
|
||||
LabelList,
|
||||
Fm,
|
||||
HnswPq,
|
||||
HnswSq,
|
||||
HnswFlat,
|
||||
@@ -47,6 +48,7 @@ class PyExpr:
|
||||
def lower(self) -> "PyExpr": ...
|
||||
def upper(self) -> "PyExpr": ...
|
||||
def contains(self, substr: "PyExpr") -> "PyExpr": ...
|
||||
def isin(self, values: List["PyExpr"]) -> "PyExpr": ...
|
||||
def cast(self, data_type: pa.DataType) -> "PyExpr": ...
|
||||
def to_sql(self) -> str: ...
|
||||
|
||||
@@ -186,6 +188,7 @@ class Table:
|
||||
BTree,
|
||||
Bitmap,
|
||||
LabelList,
|
||||
Fm,
|
||||
FTS,
|
||||
],
|
||||
replace: Optional[bool],
|
||||
@@ -202,7 +205,7 @@ class Table:
|
||||
async def prewarm_index(self, index_name: str) -> None: ...
|
||||
async def prewarm_data(self, columns: Optional[List[str]] = None) -> None: ...
|
||||
async def list_indices(self) -> list[IndexConfig]: ...
|
||||
async def delete(self, filter: str) -> DeleteResult: ...
|
||||
async def delete(self, filter: Union[str, PyExpr]) -> DeleteResult: ...
|
||||
async def add_columns(self, columns: list[tuple[str, str]]) -> AddColumnsResult: ...
|
||||
async def add_columns_with_schema(self, schema: pa.Schema) -> AddColumnsResult: ...
|
||||
async def alter_columns(
|
||||
@@ -226,6 +229,9 @@ class Table:
|
||||
async def close_lsm_writers(self) -> None: ...
|
||||
@property
|
||||
def tags(self) -> Tags: ...
|
||||
@property
|
||||
def branches(self) -> Branches: ...
|
||||
def current_branch(self) -> Optional[str]: ...
|
||||
def query(self) -> Query: ...
|
||||
def take_offsets(self, offsets: list[int]) -> TakeQuery: ...
|
||||
def take_row_ids(self, row_ids: list[int]) -> TakeQuery: ...
|
||||
@@ -238,10 +244,30 @@ class Tags:
|
||||
async def delete(self, tag: str): ...
|
||||
async def update(self, tag: str, version: int): ...
|
||||
|
||||
class Branches:
|
||||
async def list(self) -> Dict[str, Any]: ...
|
||||
async def create(
|
||||
self,
|
||||
name: str,
|
||||
from_ref: Optional[str] = None,
|
||||
from_version: Optional[int] = None,
|
||||
) -> Table: ...
|
||||
async def checkout(self, name: str, version: Optional[int] = None) -> Table: ...
|
||||
async def delete(self, name: str) -> None: ...
|
||||
|
||||
class IndexConfig:
|
||||
name: str
|
||||
index_type: str
|
||||
columns: List[str]
|
||||
index_uuid: Optional[str]
|
||||
type_url: Optional[str]
|
||||
created_at: Optional[datetime]
|
||||
num_indexed_rows: Optional[int]
|
||||
num_unindexed_rows: Optional[int]
|
||||
size_bytes: Optional[int]
|
||||
num_segments: Optional[int]
|
||||
index_version: Optional[int]
|
||||
index_details: Optional[Any]
|
||||
|
||||
async def connect(
|
||||
uri: str,
|
||||
@@ -254,6 +280,7 @@ async def connect(
|
||||
session: Optional[Session],
|
||||
manifest_enabled: bool = False,
|
||||
namespace_client_properties: Optional[Dict[str, str]] = None,
|
||||
oauth_config: Optional[Any] = None,
|
||||
) -> Connection: ...
|
||||
|
||||
class RecordBatchStream:
|
||||
|
||||
@@ -2,6 +2,7 @@
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
import asyncio
|
||||
import concurrent.futures
|
||||
import os
|
||||
import threading
|
||||
import warnings
|
||||
@@ -37,6 +38,24 @@ class BackgroundEventLoop:
|
||||
|
||||
LOOP = BackgroundEventLoop()
|
||||
|
||||
|
||||
def _new_embedding_executor() -> concurrent.futures.ThreadPoolExecutor:
|
||||
return concurrent.futures.ThreadPoolExecutor(thread_name_prefix="lancedb-embedding")
|
||||
|
||||
|
||||
# Embedding functions can block for a long time -- a heavy local model or an
|
||||
# HTTP request to a remote embeddings API. Running them on asyncio's default
|
||||
# executor lets them starve the unrelated blocking I/O that shares that pool,
|
||||
# so they get a dedicated one. See
|
||||
# https://github.com/lancedb/lancedb/issues/3310.
|
||||
_EMBEDDING_EXECUTOR = _new_embedding_executor()
|
||||
|
||||
|
||||
def embedding_executor() -> concurrent.futures.ThreadPoolExecutor:
|
||||
"""Return the executor dedicated to running blocking embedding calls."""
|
||||
return _EMBEDDING_EXECUTOR
|
||||
|
||||
|
||||
_FORK_WARNED = False
|
||||
|
||||
|
||||
@@ -47,6 +66,12 @@ def _reset_after_fork():
|
||||
# the new state. The Rust-side tokio runtime is reset analogously by a
|
||||
# pthread_atfork hook installed in the _lancedb extension.
|
||||
LOOP._start()
|
||||
# The embedding executor's worker threads are dead in the child as well.
|
||||
# Replace it with a fresh pool (threads are spawned lazily, so this is
|
||||
# cheap); we don't shut down the old one, since joining its dead workers
|
||||
# could hang.
|
||||
global _EMBEDDING_EXECUTOR
|
||||
_EMBEDDING_EXECUTOR = _new_embedding_executor()
|
||||
global _FORK_WARNED
|
||||
if not _FORK_WARNED:
|
||||
_FORK_WARNED = True
|
||||
|
||||
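The comment in the hunk above explains why blocking embedding calls get their own pool instead of asyncio's default executor. A minimal standalone sketch of that pattern follows; the pool name `sketch-embedding`, `slow_embed`, and `embed_async` are illustrative assumptions, not names from the lancedb code base.

```python
import asyncio
import concurrent.futures
import threading
import time

# Hypothetical stand-in for the module-level pool created in the diff.
_EMBEDDING_EXECUTOR = concurrent.futures.ThreadPoolExecutor(
    thread_name_prefix="sketch-embedding"
)


def slow_embed(texts):
    """Pretend embedding call that blocks the calling thread."""
    time.sleep(0.05)
    vectors = [[float(len(t))] for t in texts]
    # Also report which thread did the work, to show the pool in use.
    return vectors, threading.current_thread().name


async def embed_async(texts):
    loop = asyncio.get_running_loop()
    # Passing an explicit executor keeps this work off the loop's default
    # pool (which is what `None` would select).
    return await loop.run_in_executor(_EMBEDDING_EXECUTOR, slow_embed, texts)
```

Calling `asyncio.run(embed_async(["a", "bcd"]))` returns the vectors together with the worker thread's name, which carries the dedicated pool's prefix.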
@@ -65,6 +65,7 @@ if TYPE_CHECKING:
|
||||
from .common import DATA, URI
|
||||
from .embeddings import EmbeddingFunctionConfig
|
||||
from ._lancedb import Session
|
||||
from .udf import MaterializedView, AsyncMaterializedView
|
||||
|
||||
from .namespace_utils import (
|
||||
_normalize_create_namespace_mode,
|
||||
@@ -416,6 +417,8 @@ class DBConnection(EnforceOverrides):
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
index_cache_size: Optional[int] = None,
|
||||
branch: Optional[str] = None,
|
||||
version: Optional[int] = None,
|
||||
) -> Table:
|
||||
"""Open a Lance Table in the database.
|
||||
|
||||
@@ -444,6 +447,14 @@ class DBConnection(EnforceOverrides):
|
||||
connection will be inherited by the table, but can be overridden here.
|
||||
See available options at
|
||||
<https://docs.lancedb.com/storage/>
|
||||
branch: str, optional
|
||||
If provided, open a handle scoped to this branch instead of the
|
||||
default branch. Reads and writes operate in the branch's context.
|
||||
version: int, optional
|
||||
If provided, open the table pinned to this version, producing a
|
||||
read-only handle. Composes with ``branch``: when both are given,
|
||||
opens that branch at the version; otherwise opens ``main`` at the
|
||||
version. Call ``checkout_latest`` to return to a writable state.
|
||||
|
||||
Returns
|
||||
-------
|
||||
@@ -552,6 +563,259 @@ class DBConnection(EnforceOverrides):
|
||||
"""
|
||||
raise NotImplementedError("serialize is not supported for this connection type")
|
||||
|
||||
# -- Derived compute: functions, materialized views, jobs -------------
|
||||
# Server-backed features (LanceDB Enterprise / Cloud); local
|
||||
# connections raise NotImplementedError for now.
|
||||
|
||||
def create_function(
|
||||
self,
|
||||
name,
|
||||
language: str = "python",
|
||||
return_type: Optional[str] = None,
|
||||
body: Optional[str] = None,
|
||||
options: Optional[Dict[str, str]] = None,
|
||||
*,
|
||||
replace: bool = False,
|
||||
):
|
||||
"""Register a UDF (CREATE FUNCTION).
|
||||
|
||||
Pass a ``@udf`` / ``@table_udf``-decorated function (preferred):
|
||||
|
||||
db.create_function(embed)
|
||||
|
||||
or pass the explicit fields described under Parameters.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
name: str or Udf
|
||||
A decorated UDF object, or the function name.
|
||||
language: str
|
||||
Implementation language (currently "python").
|
||||
return_type: str
|
||||
SQL return type, e.g. "FLOAT", "FLOAT[1536]",
|
||||
"STRUCT(a FLOAT, b VARCHAR)", "TABLE(chunk VARCHAR, idx INT)".
|
||||
body: str
|
||||
Function body: source text, or base64 cloudpickle bytes when
|
||||
options["body_format"] == "cloudpickle".
|
||||
options: dict, optional
|
||||
input_columns, pip, num_gpus, batch_size, timeout,
|
||||
error_policy, docker_image, body_format, ...
|
||||
replace: bool
|
||||
Drop an existing function of the same name first.
|
||||
"""
|
||||
from .udf import Udf
|
||||
|
||||
if isinstance(name, Udf):
|
||||
req = name.create_request()
|
||||
name, language, return_type, body, options = (
|
||||
req["name"],
|
||||
req["language"],
|
||||
req["return_type"],
|
||||
req["body"],
|
||||
req["options"],
|
||||
)
|
||||
if replace:
|
||||
try:
|
||||
self.drop_function(name)
|
||||
except Exception:
|
||||
pass
|
||||
LOOP.run(self._conn.create_function(name, language, return_type, body, options))
|
||||
|
||||
def list_functions(self):
|
||||
"""List registered functions (SHOW FUNCTIONS)."""
|
||||
return LOOP.run(self._conn.list_functions())
|
||||
|
||||
def drop_function(self, name: str):
|
||||
"""Drop a registered function (DROP FUNCTION)."""
|
||||
LOOP.run(self._conn.drop_function(name))
|
||||
|
||||
def create_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
source=None,
|
||||
select=None,
|
||||
*,
|
||||
query: Optional[str] = None,
|
||||
where: Optional[str] = None,
|
||||
auto_refresh: bool = False,
|
||||
with_no_data: bool = False,
|
||||
replace: bool = False,
|
||||
partition_by: Optional[str] = None,
|
||||
) -> "MaterializedView":
|
||||
"""Create a materialized view (CREATE MATERIALIZED VIEW); returns a
|
||||
`MaterializedView` handle (``.wait()`` blocks until it is populated).
|
||||
|
||||
Two ways to specify the view body:
|
||||
|
||||
- ergonomic: pass ``source`` (a table name or table) and ``select``
|
||||
items -- column names, expression strings ("embed(body)"),
|
||||
(alias, expression) tuples, or ``@udf`` / ``@table_udf`` objects.
|
||||
The SELECT is assembled and parsed server-side (one parser, shared
|
||||
with SQL).
|
||||
- raw: pass ``query=`` with a full SELECT, e.g.
|
||||
"SELECT id, embed(body) AS vec FROM articles WHERE id > 1".
|
||||
|
||||
`partition_by` partitions the view's (single) table function on a source
|
||||
column. If that column has an IVF vector index the server partitions by
|
||||
its index clusters (image-dedup style); otherwise it groups by distinct
|
||||
value. (Geneva's `partition_by` and `partition_by_indexed_column` unify
|
||||
here -- the engine picks the strategy from the column.)
|
||||
"""
|
||||
from .udf import build_view_query, MaterializedView
|
||||
|
||||
if query is None:
|
||||
if source is None or select is None:
|
||||
raise ValueError(
|
||||
"create_materialized_view needs either query= or both "
|
||||
"source and select"
|
||||
)
|
||||
query = build_view_query(source, select)
|
||||
if where:
|
||||
query += f" WHERE {where}"
|
||||
if replace:
|
||||
self._drop_view_if_exists(name)
|
||||
job_id = LOOP.run(
|
||||
self._conn.create_materialized_view(
|
||||
name,
|
||||
query=query,
|
||||
auto_refresh=auto_refresh,
|
||||
with_no_data=with_no_data,
|
||||
partition_by=partition_by,
|
||||
)
|
||||
)
|
||||
return MaterializedView(self, name, job_id=job_id)
|
||||
|
||||
def _drop_view_if_exists(self, name: str) -> None:
|
||||
# `replace=True` is "drop if present"; only a not-found error is
|
||||
# benign here. Anything else (perms, server fault) must surface rather
|
||||
# than be masked by a later create failure.
|
||||
try:
|
||||
self.drop_materialized_view(name)
|
||||
except Exception as e:
|
||||
msg = str(e).lower()
|
||||
if "not found" not in msg and "does not exist" not in msg:
|
||||
raise
|
||||
|
||||
def job(self, job_id: str):
|
||||
"""A `JobHandle` for reconnecting to an inflight job by id -- e.g. an
|
||||
id you stored, or one returned from the SQL / REST surface. Submit
|
||||
methods (`refresh_column`, `MaterializedView.refresh`) already return a
|
||||
handle directly, so you do not need this to wait on a fresh submission."""
|
||||
from .udf import JobHandle
|
||||
|
||||
return JobHandle(self, job_id)
|
||||
|
||||
def lineage(
|
||||
self,
|
||||
table: str,
|
||||
column: Optional[str] = None,
|
||||
*,
|
||||
direction: Optional[str] = None,
|
||||
depth: Optional[int] = None,
|
||||
):
|
||||
"""Derived-compute lineage of a table/view, or one of its columns:
|
||||
upstream sources, downstream dependents, and the function version +
|
||||
location that produced each derived column (with a drift flag). Returns
|
||||
a `Lineage`. `direction` is "upstream" | "downstream" | "both" (server
|
||||
default both); `depth` limits column-hops (transitive when omitted)."""
|
||||
# `self._conn` is the AsyncConnection; drive its async `lineage`
|
||||
# (which parses the JSON) on the loop, mirroring create_materialized_view.
|
||||
return LOOP.run(
|
||||
self._conn.lineage(table, column, direction=direction, depth=depth)
|
||||
)
|
||||
|
||||
def _refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
) -> str:
|
||||
"""Internal: submit a materialized-view refresh, return the job id.
|
||||
The public surface is ``MaterializedView.refresh()`` (which returns a
|
||||
`JobHandle`); this stays private so refresh is only reached through the
|
||||
handle.
|
||||
|
||||
``full=True`` forces a full rebuild (recompute and replace every row)
|
||||
instead of the default incremental refresh.
|
||||
"""
|
||||
return LOOP.run(
|
||||
self._conn._refresh_materialized_view(
|
||||
name,
|
||||
full=full,
|
||||
src_version=src_version,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
)
|
||||
)
|
||||
|
||||
def explain_refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
):
|
||||
"""Plan a refresh without running it (EXPLAIN REFRESH). Returns a
|
||||
plan with .has_work / .source_version / .last_refreshed_version /
|
||||
.full_refresh / .rebuild / .units_total. `full=True` plans a full
|
||||
rebuild (incremental planning needs stable row IDs on the source)."""
|
||||
return LOOP.run(
|
||||
self._conn.explain_refresh_materialized_view(
|
||||
name, full=full, src_version=src_version
|
||||
)
|
||||
)
|
||||
|
||||
def alter_materialized_view(self, name: str, *, auto_refresh: bool):
|
||||
"""Update a materialized view's options (ALTER MATERIALIZED VIEW)."""
|
||||
LOOP.run(self._conn.alter_materialized_view(name, auto_refresh=auto_refresh))
|
||||
|
||||
def drop_materialized_view(self, name: str):
|
||||
"""Drop a materialized view definition (DROP MATERIALIZED VIEW)."""
|
||||
LOOP.run(self._conn.drop_materialized_view(name))
|
||||
|
||||
def list_materialized_views(self):
|
||||
"""List registered materialized view definitions."""
|
||||
return LOOP.run(self._conn.list_materialized_views())
|
||||
|
||||
def list_jobs(self):
|
||||
"""List inflight server-side jobs across the database's tables."""
|
||||
return LOOP.run(self._conn.list_jobs())
|
||||
|
||||
def get_job(self, job_id: str, table: "str | None" = None):
|
||||
"""Look up one server-side job by id (the wait()/status poll path).
|
||||
|
||||
Passing ``table`` (the job's table) lets the server answer with an O(1)
|
||||
single-node read instead of scanning the database's active jobs.
|
||||
Returns the job's status, or None if it's unknown or no longer active.
|
||||
"""
|
||||
return LOOP.run(self._conn.get_job(job_id, table))
|
||||
|
||||
def cancel_job(self, job_id: str) -> bool:
|
||||
"""Cancel an inflight server-side job by id (CANCEL JOB).
|
||||
|
||||
Returns True if a matching inflight job was found and flagged for
|
||||
cancellation, False if none was inflight (already finished or
|
||||
unknown id) -- cancellation is best-effort.
|
||||
"""
|
||||
return LOOP.run(self._conn.cancel_job(job_id))
|
||||
|
||||
def job_history(self, job_id: "str | None" = None):
|
||||
"""Durable history of completed server-side jobs (SHOW JOB HISTORY).
|
||||
|
||||
Pass ``job_id`` to narrow to a single job. Unlike :meth:`list_jobs`
|
||||
(live, inflight) these are the terminal records.
|
||||
"""
|
||||
return LOOP.run(self._conn.job_history(job_id))
|
||||
|
||||
def errors(self, job_id: "str | None" = None, table: "str | None" = None):
|
||||
"""Per-row UDF errors recorded by ``error_policy=skip`` (SHOW ERRORS),
|
||||
optionally filtered by ``job_id`` and/or ``table``.
|
||||
"""
|
||||
return LOOP.run(self._conn.errors(job_id, table))
|
||||
|
||||
|
||||
class LanceDBConnection(DBConnection):
|
||||
"""
|
||||
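`_drop_view_if_exists` in the hunk above swallows only not-found errors so that permission or server faults still surface. The sketch below isolates that behaviour against an in-memory stand-in; `FakeCatalog` and `ViewNotFound` are hypothetical helpers for illustration, and the message matching mirrors the substring check shown in the diff.

```python
class ViewNotFound(Exception):
    """Hypothetical error type; the diff only inspects the message text."""


class FakeCatalog:
    """In-memory stand-in for a server-side view registry (illustrative only)."""

    def __init__(self, fail_with=None):
        self.views = set()
        self.fail_with = fail_with

    def drop_materialized_view(self, name):
        if self.fail_with is not None:
            raise self.fail_with
        if name not in self.views:
            raise ViewNotFound(f"materialized view '{name}' not found")
        self.views.remove(name)


def drop_view_if_exists(catalog, name):
    """Drop `name`, treating only a not-found error as benign."""
    try:
        catalog.drop_materialized_view(name)
    except Exception as e:
        msg = str(e).lower()
        if "not found" not in msg and "does not exist" not in msg:
            raise
```

Dropping a missing view is a no-op here, while any other failure (for example a `PermissionError`) propagates to the caller. Matching on message text is fragile if the server changes its wording; a typed error would be sturdier where one is available.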
@@ -958,6 +1222,8 @@ class LanceDBConnection(DBConnection):
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
index_cache_size: Optional[int] = None,
|
||||
branch: Optional[str] = None,
|
||||
version: Optional[int] = None,
|
||||
) -> LanceTable:
|
||||
"""Open a table in the database.
|
||||
|
||||
@@ -968,6 +1234,14 @@ class LanceDBConnection(DBConnection):
|
||||
namespace_path: List[str], optional
|
||||
The namespace to open the table from. When non-empty, the
|
||||
table is resolved through the directory namespace client.
|
||||
branch: str, optional
|
||||
If provided, open a handle scoped to this branch instead of the
|
||||
default branch. Reads and writes operate in the branch's context.
|
||||
version: int, optional
|
||||
If provided, open the table pinned to this version, producing a
|
||||
read-only handle. Composes with ``branch``: when both are given,
|
||||
opens that branch at the version; otherwise opens ``main`` at the
|
||||
version. Call ``checkout_latest`` to return to a writable state.
|
||||
|
||||
Returns
|
||||
-------
|
||||
@@ -987,20 +1261,26 @@ class LanceDBConnection(DBConnection):
|
||||
)
|
||||
|
||||
         if namespace_path:
-            return self._namespace_conn().open_table(
+            tbl = self._namespace_conn().open_table(
                 name,
                 namespace_path=namespace_path,
                 storage_options=storage_options,
                 index_cache_size=index_cache_size,
             )
+        else:
+            tbl = LanceTable.open(
+                self,
+                name,
+                namespace_path=namespace_path,
+                storage_options=storage_options,
+                index_cache_size=index_cache_size,
+            )
 
-        return LanceTable.open(
-            self,
-            name,
-            namespace_path=namespace_path,
-            storage_options=storage_options,
-            index_cache_size=index_cache_size,
-        )
+        if branch is not None:
+            tbl = tbl.branches.checkout(branch, version)
+        elif version is not None:
+            tbl.checkout(version)
+        return tbl
|
||||
|
||||
def clone_table(
|
||||
self,
|
||||
@@ -1641,6 +1921,8 @@ class AsyncConnection(object):
|
||||
location: Optional[str] = None,
|
||||
namespace_client: Optional[Any] = None,
|
||||
managed_versioning: Optional[bool] = None,
|
||||
branch: Optional[str] = None,
|
||||
version: Optional[int] = None,
|
||||
) -> AsyncTable:
|
||||
"""Open a Lance Table in the database.
|
||||
|
||||
@@ -1676,6 +1958,14 @@ class AsyncConnection(object):
|
||||
managed_versioning: bool, optional
|
||||
Whether managed versioning is enabled for this table. If provided,
|
||||
avoids a redundant describe_table call when namespace_client is set.
|
||||
branch: str, optional
|
||||
If provided, open a handle scoped to this branch instead of the
|
||||
default branch. Reads and writes operate in the branch's context.
|
||||
version: int, optional
|
||||
If provided, open the table pinned to this version, producing a
|
||||
read-only handle. Composes with ``branch``: when both are given,
|
||||
opens that branch at the version; otherwise opens ``main`` at the
|
||||
version. Call ``checkout_latest`` to return to a writable state.
|
||||
|
||||
Returns
|
||||
-------
|
||||
@@ -1692,7 +1982,14 @@ class AsyncConnection(object):
|
||||
namespace_client=namespace_client,
|
||||
managed_versioning=managed_versioning,
|
||||
)
|
||||
-        return AsyncTable(table)
+        tbl = AsyncTable(table)
+        # "main" is the default branch, so treat it as no branch: remote rejects
+        # every branch checkout (even "main"), and the version still applies.
+        if branch is not None and branch != "main":
+            tbl = await tbl.branches.checkout(branch, version)
+        elif version is not None:
+            await tbl.checkout(version)
+        return tbl
|
||||
|
||||
async def clone_table(
|
||||
self,
|
||||
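The async `open_table` hunk above decides between a branch checkout, a plain version checkout, and the latest state. The pure function below restates that decision so the cases can be read at a glance; `resolve_checkout` and its return labels are hypothetical names for illustration, not part of the lancedb API. The sync `LanceDBConnection.open_table` hunk checks only `branch is not None`, without the `"main"` special case.

```python
from typing import Optional, Tuple


def resolve_checkout(
    branch: Optional[str], version: Optional[int]
) -> Tuple[str, Optional[str], Optional[int]]:
    """Return which checkout the async `open_table` logic would perform."""
    # "main" is treated as "no branch", so only other names trigger a
    # branch checkout; the version is passed along with it.
    if branch is not None and branch != "main":
        return ("branch_checkout", branch, version)
    # No (non-default) branch: pin the default branch to the version.
    if version is not None:
        return ("version_checkout", None, version)
    # Neither given: the handle stays on the latest, writable state.
    return ("latest", None, None)
```

For example, `resolve_checkout("main", 3)` falls through to a version checkout, while `resolve_checkout("exp", 3)` selects a branch checkout pinned to version 3.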
@@ -1744,6 +2041,200 @@ class AsyncConnection(object):
|
||||
)
|
||||
return AsyncTable(table)
|
||||
|
||||
# -- Derived compute: functions, materialized views, jobs -------------
|
||||
# Server-backed features (LanceDB Enterprise / Cloud); local
|
||||
# connections raise NotImplementedError for now.
|
||||
|
||||
async def create_function(
|
||||
self,
|
||||
name,
|
||||
language: str = "python",
|
||||
return_type: Optional[str] = None,
|
||||
body: Optional[str] = None,
|
||||
options: Optional[Dict[str, str]] = None,
|
||||
*,
|
||||
replace: bool = False,
|
||||
):
|
||||
"""Register a UDF (CREATE FUNCTION). Accepts a ``@udf``/``@table_udf``
|
||||
object (preferred) or the explicit (name, language, return_type, body,
|
||||
options)."""
|
||||
from .udf import Udf
|
||||
|
||||
if isinstance(name, Udf):
|
||||
req = name.create_request()
|
||||
name, language, return_type, body, options = (
|
||||
req["name"],
|
||||
req["language"],
|
||||
req["return_type"],
|
||||
req["body"],
|
||||
req["options"],
|
||||
)
|
||||
if replace:
|
||||
try:
|
||||
await self.drop_function(name)
|
||||
except Exception:
|
||||
pass
|
||||
await self._inner.create_function(name, language, return_type, body, options)
|
||||
|
||||
async def list_functions(self):
|
||||
"""List registered functions (SHOW FUNCTIONS)."""
|
||||
return await self._inner.list_functions()
|
||||
|
||||
async def drop_function(self, name: str):
|
||||
"""Drop a registered function (DROP FUNCTION)."""
|
||||
await self._inner.drop_function(name)
|
||||
|
||||
async def create_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
source=None,
|
||||
select=None,
|
||||
*,
|
||||
query: Optional[str] = None,
|
||||
where: Optional[str] = None,
|
||||
auto_refresh: bool = False,
|
||||
with_no_data: bool = False,
|
||||
replace: bool = False,
|
||||
partition_by: Optional[str] = None,
|
||||
) -> "AsyncMaterializedView":
|
||||
"""Create a materialized view; returns an `AsyncMaterializedView`
|
||||
handle (``.wait()`` blocks until populated). Pass either ``query=`` (a
|
||||
full SELECT) or ``source`` + ``select`` items; `partition_by`
|
||||
partitions the view's table function on a source column (index-cluster
|
||||
if the column is IVF-indexed, else distinct-value). See the sync
|
||||
method for the select grammar."""
|
||||
from .udf import build_view_query, AsyncMaterializedView
|
||||
|
||||
if query is None:
|
||||
if source is None or select is None:
|
||||
raise ValueError(
|
||||
"create_materialized_view needs either query= or both "
|
||||
"source and select"
|
||||
)
|
||||
query = build_view_query(source, select)
|
||||
if where:
|
||||
query += f" WHERE {where}"
|
||||
if replace:
|
||||
try:
|
||||
await self.drop_materialized_view(name)
|
||||
except Exception as e:
|
||||
msg = str(e).lower()
|
||||
if "not found" not in msg and "does not exist" not in msg:
|
||||
raise
|
||||
job_id = await self._inner.create_materialized_view(
|
||||
name,
|
||||
query,
|
||||
auto_refresh=auto_refresh,
|
||||
with_no_data=with_no_data,
|
||||
partition_by=partition_by,
|
||||
)
|
||||
return AsyncMaterializedView(self, name, job_id=job_id)
|
||||
|
||||
def job(self, job_id: str):
|
||||
"""An `AsyncJobHandle` for reconnecting to an inflight job by id (a
|
||||
stored id, or one from the SQL / REST surface). Submit methods already
|
||||
return a handle, so this is only needed to re-attach to an existing
|
||||
job."""
|
||||
from .udf import AsyncJobHandle
|
||||
|
||||
return AsyncJobHandle(self, job_id)
|
||||
|
||||
async def lineage(
|
||||
self,
|
||||
table: str,
|
||||
column: Optional[str] = None,
|
||||
*,
|
||||
direction: Optional[str] = None,
|
||||
depth: Optional[int] = None,
|
||||
):
|
||||
"""Derived-compute lineage of a table/view (or column). See the sync
|
||||
`DBConnection.lineage`. Returns a `Lineage`."""
|
||||
from .lineage import Lineage
|
||||
|
||||
raw = await self._inner.table_lineage(table, column, direction, depth)
|
||||
return Lineage.from_json(raw)
|
||||
|
||||
async def _refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
) -> str:
|
||||
"""Internal: submit a refresh, return the job id. The public surface is
|
||||
``AsyncMaterializedView.refresh()`` (returns an `AsyncJobHandle`).
|
||||
|
||||
``full=True`` forces a full rebuild (recompute and replace every row)
|
||||
instead of the default incremental refresh.
|
||||
"""
|
||||
return await self._inner.refresh_materialized_view(
|
||||
name,
|
||||
full=full,
|
||||
src_version=src_version,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
)
|
||||
|
||||
async def explain_refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
):
|
||||
"""Plan a refresh without running it (EXPLAIN REFRESH)."""
|
||||
return await self._inner.explain_refresh_materialized_view(
|
||||
name, full=full, src_version=src_version
|
||||
)
|
||||
|
||||
async def alter_materialized_view(self, name: str, *, auto_refresh: bool):
|
||||
"""Update a materialized view's options."""
|
||||
await self._inner.alter_materialized_view(name, auto_refresh)
|
||||
|
||||
async def drop_materialized_view(self, name: str):
|
||||
"""Drop a materialized view definition."""
|
||||
await self._inner.drop_materialized_view(name)
|
||||
|
||||
async def list_materialized_views(self):
|
||||
"""List registered materialized view definitions."""
|
||||
return await self._inner.list_materialized_views()
|
||||
|
||||
async def list_jobs(self):
|
||||
"""List inflight server-side jobs across the database's tables."""
|
||||
return await self._inner.list_jobs()
|
||||
|
||||
async def get_job(self, job_id: str, table: "str | None" = None):
|
||||
"""Look up one server-side job by id (the wait()/status poll path).
|
||||
``table`` (the job's table) enables an O(1) server-side lookup.
|
||||
Returns the job's status, or None if unknown / no longer active."""
|
||||
return await self._inner.get_job(job_id, table)
|
||||
|
||||
async def cancel_job(self, job_id: str) -> bool:
|
||||
"""Cancel an inflight server-side job by id (CANCEL JOB).
|
||||
|
||||
Returns True if a matching inflight job was found and flagged for
|
||||
cancellation, False otherwise (best-effort).
|
||||
"""
|
||||
return await self._inner.cancel_job(job_id)
|
||||
|
||||
async def job_history(self, job_id: "str | None" = None):
|
||||
"""Durable history of completed server-side jobs (SHOW JOB HISTORY).
|
||||
|
||||
Reads each table's durable job-history store. Pass ``job_id`` to narrow
|
||||
to a single job. Unlike :meth:`list_jobs` (live, inflight) these are the
|
||||
terminal records, with created/updated/completed timestamps.
|
||||
"""
|
||||
return await self._inner.job_history(job_id)
|
||||
|
||||
async def errors(self, job_id: "str | None" = None, table: "str | None" = None):
|
||||
"""Per-row UDF errors recorded by ``error_policy=skip`` (SHOW ERRORS).
|
||||
|
||||
Optionally filtered by ``job_id`` and/or ``table``.
|
||||
"""
|
||||
return await self._inner.errors(job_id, table)
|
||||
|
||||
async def rename_table(
|
||||
self,
|
||||
cur_name: str,
|
||||
|
||||
@@ -81,6 +81,7 @@ class ColPaliEmbeddings(EmbeddingFunction):
|
||||
warnings.warn(
|
||||
"use_token_pooling is deprecated, use pooling_strategy=None instead",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
self.pooling_strategy = None
|
||||
|
||||
|
||||
@@ -19,7 +19,7 @@ operators::
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
-from typing import Union
+from typing import Iterable, Union
|
||||
|
||||
import pyarrow as pa
|
||||
|
||||
@@ -174,6 +174,11 @@ class Expr:
|
||||
"""Return True where the string contains *substr*."""
|
||||
return Expr(self._inner.contains(_coerce(substr)._inner))
|
||||
|
||||
def isin(self, values: "Iterable[ExprLike]") -> "Expr":
|
||||
"""Return True where the value is one of *values* (SQL ``IN``)."""
|
||||
inner = [_coerce(v)._inner for v in values]
|
||||
return Expr(self._inner.isin(inner))
|
||||
|
||||
# ── type cast ────────────────────────────────────────────────────────────
|
||||
|
||||
def cast(self, data_type: Union[str, "pa.DataType"]) -> "Expr":
|
||||
|
||||
@@ -93,6 +93,20 @@ class LabelList:
|
||||
pass
|
||||
|
||||
|
||||
@dataclass
|
||||
class Fm:
|
||||
"""Describe an FM-Index configuration.
|
||||
|
||||
`Fm` is a scalar index on string or binary columns that accelerates
|
||||
substring search, i.e. `contains(col, 'needle')`. Unlike the tokenized
|
||||
`FTS` index, it matches arbitrary substrings of the raw bytes.
|
||||
|
||||
For example, it works with `url`, `path`, `content`, etc.
|
||||
"""
|
||||
|
||||
pass
|
||||
|
||||
|
||||
@dataclass
|
||||
class FTS:
|
||||
"""Describe a FTS index configuration.
|
||||
@@ -828,4 +842,5 @@ __all__ = [
|
||||
"FTS",
|
||||
"Bitmap",
|
||||
"LabelList",
|
||||
"Fm",
|
||||
]
|
||||
|
||||
177
python/python/lancedb/lineage.py
Normal file
@@ -0,0 +1,177 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
"""Client-side model of derived-compute lineage.
|
||||
|
||||
`Connection.lineage()` / `Table.lineage()` / `MaterializedView.lineage()` return
|
||||
a `Lineage`: the graph of what a column or materialized view derives from
|
||||
(upstream), what derives from it (downstream), and -- for each derived column --
|
||||
the function that produced it, the version it was produced with, and whether
|
||||
that is stale relative to the function the registry now holds.
|
||||
|
||||
The server returns this as JSON (the wire contract); these classes deserialize
|
||||
it. Nothing here talks to the server.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from dataclasses import dataclass, field
|
||||
from typing import List, Optional, Union
|
||||
|
||||
|
||||
@dataclass
|
||||
class FunctionRef:
|
||||
"""The function that produced a derived column, with version + location."""
|
||||
|
||||
name: str
|
||||
#: Version that produced the data (stamped at compute time), if known.
|
||||
as_computed_version: Optional[str] = None
|
||||
#: Version the registry currently holds for this function name.
|
||||
current_version: Optional[str] = None
|
||||
#: True when the column was produced by an older function than the registry
|
||||
#: now holds -- i.e. silently stale; re-refresh to catch up.
|
||||
stale_vs_current: bool = False
|
||||
language: Optional[str] = None
|
||||
docker_image: Optional[str] = None
|
||||
env_digest: Optional[str] = None
|
||||
code_uri: Optional[str] = None
|
||||
|
||||
@classmethod
|
||||
def _from(cls, d: dict) -> "FunctionRef":
|
||||
return cls(
|
||||
name=d["name"],
|
||||
as_computed_version=d.get("as_computed_version"),
|
||||
current_version=d.get("current_version"),
|
||||
stale_vs_current=d.get("stale_vs_current", False),
|
||||
language=d.get("language"),
|
||||
docker_image=d.get("docker_image"),
|
||||
env_digest=d.get("env_digest"),
|
||||
code_uri=d.get("code_uri"),
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class Node:
|
||||
"""A lineage node: a table, view, column, or function."""
|
||||
|
||||
kind: str # "table" | "view" | "column" | "function"
|
||||
id: str # "table", "table.column", or "fn:name@version"
|
||||
table: Optional[str] = None
|
||||
function: Optional[FunctionRef] = None
|
||||
|
||||
@classmethod
|
||||
def _from(cls, d: dict) -> "Node":
|
||||
fn = d.get("function")
|
||||
return cls(
|
||||
kind=d["kind"],
|
||||
id=d["id"],
|
||||
table=d.get("table"),
|
||||
function=FunctionRef._from(fn) if fn else None,
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class Edge:
|
||||
"""`downstream` depends on `upstream`, produced by `via` (a function name,
|
||||
or None for a passthrough)."""
|
||||
|
||||
downstream: str
|
||||
upstream: str
|
||||
via: Optional[str] = None
|
||||
|
||||
@classmethod
|
||||
def _from(cls, d: dict) -> "Edge":
|
||||
return cls(downstream=d["downstream"], upstream=d["upstream"], via=d.get("via"))
|
||||
|
||||
|
||||
@dataclass
|
||||
class Lineage:
|
||||
"""A derived-compute lineage graph (nodes + labeled edges)."""
|
||||
|
||||
target: str
|
||||
nodes: List[Node] = field(default_factory=list)
|
||||
edges: List[Edge] = field(default_factory=list)
|
||||
|
||||
@classmethod
|
||||
def from_json(cls, raw: Union[str, bytes, dict]) -> "Lineage":
|
||||
d = json.loads(raw) if isinstance(raw, (str, bytes)) else raw
|
||||
return cls(
|
||||
target=d.get("target", ""),
|
||||
nodes=[Node._from(n) for n in d.get("nodes", [])],
|
||||
edges=[Edge._from(e) for e in d.get("edges", [])],
|
||||
)
|
||||
|
||||
def functions(self) -> List[FunctionRef]:
|
||||
"""The function nodes in the graph."""
|
||||
return [n.function for n in self.nodes if n.function is not None]
|
||||
|
||||
def stale(self) -> List[FunctionRef]:
|
||||
"""Functions whose as-computed version is behind the current registry
|
||||
version -- the columns they produced are silently out of date."""
|
||||
return [f for f in self.functions() if f.stale_vs_current]
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
def prune(d: dict) -> dict:
|
||||
return {k: v for k, v in d.items() if v is not None}
|
||||
|
||||
return {
|
||||
"target": self.target,
|
||||
"nodes": [
|
||||
prune(
|
||||
{
|
||||
"kind": n.kind,
|
||||
"id": n.id,
|
||||
"table": n.table,
|
||||
"function": prune(vars(n.function)) if n.function else None,
|
||||
}
|
||||
)
|
||||
for n in self.nodes
|
||||
],
|
||||
"edges": [prune(vars(e)) for e in self.edges],
|
||||
}
|
||||
|
||||
def to_graphviz(self) -> str:
|
||||
"""Graphviz DOT for the lineage DAG: columns/tables as nodes, function
|
||||
names on edges, drift edges dashed + red."""
|
||||
stale_names = {f.name for f in self.stale()}
|
||||
out = [
|
||||
"digraph lineage {",
|
||||
" rankdir=LR;",
|
||||
' node [fontname="monospace"];',
|
||||
]
|
||||
for n in self.nodes:
|
||||
if n.kind == "function":
|
||||
continue
|
||||
shape = "ellipse" if n.kind in ("table", "view") else "box"
|
||||
out.append(f' "{n.id}" [shape={shape}];')
|
||||
for e in self.edges:
|
||||
attrs = ""
|
||||
if e.via:
|
||||
if e.via in stale_names:
|
||||
attrs = f' [label="{e.via}" color=red style=dashed]'
|
||||
else:
|
||||
attrs = f' [label="{e.via}"]'
|
||||
out.append(f' "{e.upstream}" -> "{e.downstream}"{attrs};')
|
||||
out.append("}")
|
||||
return "\n".join(out)
|
||||
|
||||
def _repr_html_(self) -> str:
|
||||
warn = ""
|
||||
drift = self.stale()
|
||||
if drift:
|
||||
names = ", ".join(sorted({f.name for f in drift}))
|
||||
warn = (
|
||||
f'<p style="color:#b00000"><b>stale vs current:</b> {names} '
|
||||
"(re-refresh to catch up)</p>"
|
||||
)
|
||||
rows = "".join(
|
||||
f"<tr><td><code>{e.downstream}</code></td>"
|
||||
f"<td>← {e.via or ''}</td>"
|
||||
f"<td><code>{e.upstream}</code></td></tr>"
|
||||
for e in self.edges
|
||||
)
|
||||
return (
|
||||
f"<b>lineage: <code>{self.target}</code></b>{warn}"
|
||||
"<table><tr><th>derived</th><th>via</th><th>from</th></tr>"
|
||||
f"{rows}</table>"
|
||||
)
|
||||
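The `Lineage` class above deserializes a JSON payload and flags edges whose producing function is stale. The sketch below walks a small payload in the shape that `Lineage.from_json` reads, without importing lancedb; the table, column, and function names in `RAW` are made up for illustration, and the helper functions are hypothetical simplifications of `stale()` and `to_graphviz()`.

```python
import json
from typing import List

# Hypothetical payload; field names follow the keys read by the classes above.
RAW = json.dumps(
    {
        "target": "articles_mv",
        "nodes": [
            {"kind": "table", "id": "articles"},
            {"kind": "column", "id": "articles.body", "table": "articles"},
            {"kind": "column", "id": "articles_mv.vec", "table": "articles_mv"},
            {
                "kind": "function",
                "id": "fn:embed@1",
                "function": {
                    "name": "embed",
                    "as_computed_version": "1",
                    "current_version": "2",
                    "stale_vs_current": True,
                },
            },
        ],
        "edges": [
            {"downstream": "articles_mv.vec", "upstream": "articles.body", "via": "embed"},
            {"downstream": "articles_mv.id", "upstream": "articles.id"},
        ],
    }
)


def stale_function_names(raw: str) -> List[str]:
    """Names of functions whose nodes carry `stale_vs_current: true`."""
    d = json.loads(raw)
    names = {
        n["function"]["name"]
        for n in d.get("nodes", [])
        if n.get("function") and n["function"].get("stale_vs_current", False)
    }
    return sorted(names)


def dot_edges(raw: str) -> List[str]:
    """DOT edge lines, with edges produced by stale functions dashed and red."""
    d = json.loads(raw)
    stale = set(stale_function_names(raw))
    lines = []
    for e in d.get("edges", []):
        via = e.get("via")
        attrs = ""
        if via:
            style = " color=red style=dashed" if via in stale else ""
            attrs = f' [label="{via}"{style}]'
        lines.append(f'"{e["upstream"]}" -> "{e["downstream"]}"{attrs};')
    return lines
```

With this payload, `stale_function_names(RAW)` reports `embed`, and the `embed` edge is the only one drawn dashed; the passthrough edge carries no label.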
@@ -5,7 +5,9 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from datetime import timedelta
|
||||
-from typing import TYPE_CHECKING, List, Optional
+from typing import TYPE_CHECKING, List, Optional, Union
+
+from .expr import Expr
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .common import DATA
|
||||
@@ -32,6 +34,7 @@ class LanceMergeInsertBuilder(object):
|
||||
self._when_not_matched_insert_all = False
|
||||
self._when_not_matched_by_source_delete = False
|
||||
self._when_not_matched_by_source_condition = None
|
||||
self._when_not_matched_by_source_condition_expr = None
|
||||
self._timeout = None
|
||||
self._use_index = True
|
||||
self._use_lsm_write = None
|
||||
@@ -62,7 +65,7 @@ class LanceMergeInsertBuilder(object):
|
||||
return self
|
||||
|
||||
def when_not_matched_by_source_delete(
|
||||
self, condition: Optional[str] = None
|
||||
self, condition: Union[str, Expr, None] = None
|
||||
) -> LanceMergeInsertBuilder:
|
||||
"""
|
||||
Rows that exist only in the target table (old data) will be
|
||||
@@ -71,13 +74,16 @@ class LanceMergeInsertBuilder(object):
|
||||
|
||||
Parameters
|
||||
----------
|
||||
condition: Optional[str], default None
|
||||
condition: str or :class:`~lancedb.expr.Expr` or None, default None
|
||||
If None then all such rows will be deleted. Otherwise the
|
||||
condition will be used as an SQL filter to limit what rows
|
||||
are deleted.
|
||||
condition will be used as a filter to limit what rows are deleted.
|
||||
Can be a SQL string or a type-safe :class:`~lancedb.expr.Expr`
|
||||
built with :func:`~lancedb.expr.col` and :func:`~lancedb.expr.lit`.
|
||||
"""
|
||||
self._when_not_matched_by_source_delete = True
|
||||
if condition is not None:
|
||||
if isinstance(condition, Expr):
|
||||
self._when_not_matched_by_source_condition_expr = condition._inner
|
||||
elif condition is not None:
|
||||
self._when_not_matched_by_source_condition = condition
|
||||
return self
|
||||
|
||||
|
||||
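The change above lets `when_not_matched_by_source_delete` accept either a SQL string or an `Expr`. The dispatch can be sketched on its own as below. This is an illustration under stated assumptions: the `Expr` class here is a hypothetical stand-in that only carries an `_inner` value, and `DeleteCondition` is an invented holder, not part of LanceDB. Clearing both slots before each assignment is a design choice of this sketch and is not something the diff does.

```python
from typing import Any, Optional, Union


class Expr:
    """Hypothetical stand-in for an expression wrapper exposing ``_inner``."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner


class DeleteCondition:
    """Holds either a SQL filter string or an unwrapped expression, never both."""

    def __init__(self) -> None:
        self.sql: Optional[str] = None
        self.expr: Optional[Any] = None

    def set(self, condition: Union[str, Expr, None]) -> "DeleteCondition":
        # Reset both slots so a later call cannot leave a stale value behind.
        self.sql = None
        self.expr = None
        if isinstance(condition, Expr):
            self.expr = condition._inner
        elif condition is not None:
            self.sql = condition
        return self
```

Checking `isinstance(condition, Expr)` first keeps the string path as the fallback, so existing callers that pass SQL text behave as before.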
@@ -58,6 +58,7 @@ from lance_namespace import (
|
||||
ListTablesRequest,
|
||||
DescribeNamespaceRequest,
|
||||
DropTableRequest,
|
||||
RenameTableRequest,
|
||||
ListNamespacesRequest,
|
||||
CreateNamespaceRequest,
|
||||
DropNamespaceRequest,
|
||||
@@ -70,6 +71,9 @@ from lancedb.embeddings import EmbeddingFunctionConfig
|
||||
from ._lancedb import Session
|
||||
|
||||
|
||||
_MAX_QUERY_K = 2**31 - 1
|
||||
|
||||
|
||||
def _query_to_namespace_request(
|
||||
table_id: List[str],
|
||||
query: "Query",
|
||||
@@ -144,7 +148,13 @@ def _query_to_namespace_request(
|
||||
if query.postfilter is not None:
|
||||
prefilter = not query.postfilter
|
||||
|
||||
k = query.limit if query.limit is not None else 10
|
||||
if query.limit is not None:
|
||||
k = query.limit
|
||||
elif query.vector is None and query.full_text_query is None:
|
||||
# limit k to max i32 value to avoid client overflows
|
||||
k = _MAX_QUERY_K
|
||||
else:
|
||||
k = 10
|
||||
|
||||
# Build request kwargs, only including non-None values for optional fields
|
||||
# that Pydantic doesn't accept as None
|
||||
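The new choice of `k` in this hunk reduces to three cases. The sketch below restates it as a standalone function; `choose_k` and its boolean parameters are hypothetical names introduced for illustration, standing in for the checks on `query.limit`, `query.vector`, and `query.full_text_query`.

```python
from typing import Optional

MAX_QUERY_K = 2**31 - 1  # largest value of a signed 32-bit integer


def choose_k(limit: Optional[int], has_vector: bool, has_full_text: bool) -> int:
    if limit is not None:
        # An explicit limit always wins, including 0.
        return limit
    if not has_vector and not has_full_text:
        # Plain scan without a limit: request as many rows as an i32 can express.
        return MAX_QUERY_K
    # Vector and full-text searches keep the default of 10 results.
    return 10
```

Under this reading, only plain scans change behavior: previously an unlimited scan fell back to 10 rows, and it is now capped at the i32 maximum instead.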
@@ -363,6 +373,19 @@ def _convert_pyarrow_schema_to_json(schema: pa.Schema) -> JsonArrowSchema:
|
||||
return JsonArrowSchema(fields=fields, metadata=meta)
|
||||
|
||||
|
||||
def _builds_namespace_natively(
|
||||
namespace_client_impl: Optional[str],
|
||||
namespace_client_properties: Optional[Dict[str, str]],
|
||||
) -> bool:
|
||||
"""Whether ``connect_namespace_client`` builds the namespace client natively
|
||||
in Rust (installing the read-freshness context provider) rather than wrapping
|
||||
the pre-built Python client.
|
||||
|
||||
Must mirror Rust ``build_namespace_natively`` in ``python/src/connection.rs``.
|
||||
"""
|
||||
return namespace_client_impl == "rest" and bool(namespace_client_properties)
|
||||
|
||||
|
||||
class LanceNamespaceDBConnection(DBConnection):
|
||||
"""
|
||||
A LanceDB connection that uses a namespace for table management.
|
||||
@@ -422,6 +445,13 @@ class LanceNamespaceDBConnection(DBConnection):
|
||||
)
|
||||
self._namespace_client_impl = namespace_client_impl
|
||||
self._namespace_client_properties = namespace_client_properties
|
||||
# When the namespace client is built natively (see Rust
|
||||
# ``build_namespace_natively``), the underlying Rust table performs
|
||||
# QueryTable pushdown through the read-freshness context provider, which
|
||||
# the pure-Python ``query_table`` path bypasses.
|
||||
self._route_pushdown_to_rust = _builds_namespace_natively(
|
||||
namespace_client_impl, namespace_client_properties
|
||||
)
|
||||
self._inner = AsyncConnection(
|
||||
_connect_namespace_client(
|
||||
namespace_client,
|
||||
@@ -533,6 +563,7 @@ class LanceNamespaceDBConnection(DBConnection):
|
||||
namespace_path=namespace_path,
|
||||
namespace_client=self._namespace_client,
|
||||
pushdown_operations=self._namespace_client_pushdown_operations,
|
||||
route_pushdown_to_rust=self._route_pushdown_to_rust,
|
||||
_async=async_table,
|
||||
)
|
||||
|
||||
@@ -544,6 +575,8 @@ class LanceNamespaceDBConnection(DBConnection):
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
index_cache_size: Optional[int] = None,
|
||||
branch: Optional[str] = None,
|
||||
version: Optional[int] = None,
|
||||
) -> Table:
|
||||
if namespace_path is None:
|
||||
namespace_path = []
|
||||
@@ -562,14 +595,20 @@ class LanceNamespaceDBConnection(DBConnection):
|
||||
raise TableNotFoundError(f"Table not found: {'$'.join(table_id)}")
|
||||
raise
|
||||
|
||||
return LanceTable(
|
||||
tbl = LanceTable(
|
||||
self,
|
||||
name,
|
||||
namespace_path=namespace_path,
|
||||
namespace_client=self._namespace_client,
|
||||
pushdown_operations=self._namespace_client_pushdown_operations,
|
||||
route_pushdown_to_rust=self._route_pushdown_to_rust,
|
||||
_async=async_table,
|
||||
)
|
||||
if branch is not None:
|
||||
tbl = tbl.branches.checkout(branch, version)
|
||||
elif version is not None:
|
||||
tbl.checkout(version)
|
||||
return tbl
|
||||
|
||||
@override
|
||||
def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):
|
||||
@@ -592,9 +631,14 @@ class LanceNamespaceDBConnection(DBConnection):
|
||||
cur_namespace_path = []
|
||||
if new_namespace_path is None:
|
||||
new_namespace_path = []
|
||||
raise NotImplementedError(
|
||||
"rename_table is not supported for namespace connections"
|
||||
cur_table_id = cur_namespace_path + [cur_name]
|
||||
new_namespace_id = new_namespace_path if new_namespace_path else None
|
||||
request = RenameTableRequest(
|
||||
id=cur_table_id,
|
||||
new_table_name=new_name,
|
||||
new_namespace_id=new_namespace_id,
|
||||
)
|
||||
self._namespace_client.rename_table(request)
|
||||
|
||||
@override
|
||||
def drop_database(self):
|
||||
@@ -853,6 +897,8 @@ class AsyncLanceNamespaceDBConnection:
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
session: Optional[Session] = None,
|
||||
namespace_client_pushdown_operations: Optional[List[str]] = None,
|
||||
namespace_client_impl: Optional[str] = None,
|
||||
namespace_client_properties: Optional[Dict[str, str]] = None,
|
||||
):
|
||||
"""
|
||||
Initialize an async namespace-based LanceDB connection.
|
||||
@@ -878,6 +924,12 @@ class AsyncLanceNamespaceDBConnection:
|
||||
namespace.create_table() instead of using declare_table + local write.
|
||||
|
||||
Default is None (no pushdown, all operations run locally).
|
||||
namespace_client_impl : Optional[str]
|
||||
The namespace implementation name used to create this connection.
|
||||
Required (with ``namespace_client_properties``) for the Rust client to
|
||||
be built natively and install the read-freshness provider.
|
||||
namespace_client_properties : Optional[Dict[str, str]]
|
||||
The namespace properties used to create this connection.
|
||||
"""
|
||||
self._namespace_client = namespace_client
|
||||
self.read_consistency_interval = read_consistency_interval
|
||||
@@ -886,6 +938,14 @@ class AsyncLanceNamespaceDBConnection:
|
||||
self._namespace_client_pushdown_operations = set(
|
||||
namespace_client_pushdown_operations or []
|
||||
)
|
||||
self._namespace_client_impl = namespace_client_impl
|
||||
self._namespace_client_properties = namespace_client_properties
|
||||
# See LanceNamespaceDBConnection: when built natively the Rust table runs
|
||||
# QueryTable pushdown through the read-freshness provider, so defer to it
|
||||
# rather than the urllib3 client (which omits x-lancedb-min-timestamp).
|
||||
self._route_pushdown_to_rust = _builds_namespace_natively(
|
||||
namespace_client_impl, namespace_client_properties
|
||||
)
|
||||
self._inner = AsyncConnection(
|
||||
_connect_namespace_client(
|
||||
namespace_client,
|
||||
@@ -899,8 +959,8 @@ class AsyncLanceNamespaceDBConnection:
|
||||
namespace_client_pushdown_operations=(
|
||||
list(self._namespace_client_pushdown_operations)
|
||||
),
|
||||
namespace_client_impl=None,
|
||||
namespace_client_properties=None,
|
||||
namespace_client_impl=namespace_client_impl,
|
||||
namespace_client_properties=namespace_client_properties,
|
||||
)
|
||||
)
|
||||
|
||||
@@ -954,7 +1014,7 @@ class AsyncLanceNamespaceDBConnection:
|
||||
if mode.lower() not in ["create", "overwrite"]:
|
||||
raise ValueError("mode must be either 'create' or 'overwrite'")
|
||||
validate_table_name(name)
|
||||
return await self._inner.create_table(
|
||||
table = await self._inner.create_table(
|
||||
name,
|
||||
data,
|
||||
schema=schema,
|
||||
@@ -966,6 +1026,12 @@ class AsyncLanceNamespaceDBConnection:
|
||||
embedding_functions=embedding_functions,
|
||||
storage_options=storage_options,
|
||||
)
|
||||
return table._set_namespace_context(
|
||||
namespace_path=namespace_path,
|
||||
namespace_client=self._namespace_client,
|
||||
pushdown_operations=self._namespace_client_pushdown_operations,
|
||||
route_pushdown_to_rust=self._route_pushdown_to_rust,
|
||||
)
|
||||
|
||||
async def open_table(
|
||||
self,
|
||||
@@ -974,12 +1040,14 @@ class AsyncLanceNamespaceDBConnection:
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
index_cache_size: Optional[int] = None,
|
||||
branch: Optional[str] = None,
|
||||
version: Optional[int] = None,
|
||||
) -> AsyncTable:
|
||||
"""Open an existing table from the namespace."""
|
||||
if namespace_path is None:
|
||||
namespace_path = []
|
||||
try:
|
||||
return await self._inner.open_table(
|
||||
table = await self._inner.open_table(
|
||||
name,
|
||||
namespace_path=namespace_path,
|
||||
storage_options=storage_options,
|
||||
@@ -990,6 +1058,18 @@ class AsyncLanceNamespaceDBConnection:
|
||||
table_id = namespace_path + [name]
|
||||
raise TableNotFoundError(f"Table not found: {'$'.join(table_id)}")
|
||||
raise
|
||||
# "main" is the default branch, so treat it as no branch (mirrors the
|
||||
# sync remote path); the version still applies.
|
||||
if branch is not None and branch != "main":
|
||||
table = await table.branches.checkout(branch, version)
|
||||
elif version is not None:
|
||||
await table.checkout(version)
|
||||
return table._set_namespace_context(
|
||||
namespace_path=namespace_path,
|
||||
namespace_client=self._namespace_client,
|
||||
pushdown_operations=self._namespace_client_pushdown_operations,
|
||||
route_pushdown_to_rust=self._route_pushdown_to_rust,
|
||||
)
|
||||
|
||||
async def drop_table(self, name: str, namespace_path: Optional[List[str]] = None):
|
||||
"""Drop a table from the namespace."""
|
||||
@@ -1006,14 +1086,19 @@ class AsyncLanceNamespaceDBConnection:
|
||||
cur_namespace_path: Optional[List[str]] = None,
|
||||
new_namespace_path: Optional[List[str]] = None,
|
||||
):
|
||||
"""Rename is not supported for namespace connections."""
|
||||
"""Rename a table in the namespace."""
|
||||
if cur_namespace_path is None:
|
||||
cur_namespace_path = []
|
||||
if new_namespace_path is None:
|
||||
new_namespace_path = []
|
||||
raise NotImplementedError(
|
||||
"rename_table is not supported for namespace connections"
|
||||
cur_table_id = cur_namespace_path + [cur_name]
|
||||
new_namespace_id = new_namespace_path if new_namespace_path else None
|
||||
request = RenameTableRequest(
|
||||
id=cur_table_id,
|
||||
new_table_name=new_name,
|
||||
new_namespace_id=new_namespace_id,
|
||||
)
|
||||
self._namespace_client.rename_table(request)
|
||||
|
||||
async def drop_database(self):
|
||||
"""Deprecated method."""
|
||||
@@ -1342,4 +1427,6 @@ def connect_namespace_async(
|
||||
storage_options=storage_options,
|
||||
session=session,
|
||||
namespace_client_pushdown_operations=namespace_client_pushdown_operations,
|
||||
namespace_client_impl=namespace_client_impl,
|
||||
namespace_client_properties=namespace_client_properties,
|
||||
)
|
||||
|
||||
@@ -48,6 +48,14 @@ class PermutationBuilder:
|
||||
By default, the permutation builder will create a single split that contains all
|
||||
rows in the same order as the base table.
|
||||
"""
|
||||
if not hasattr(table, "_inner"):
|
||||
raise TypeError(
|
||||
f"PermutationBuilder requires a local LanceTable, "
|
||||
f"got {type(table).__name__}. "
|
||||
"The permutation API is not supported on remote tables. "
|
||||
"Remote tables connect to LanceDB Cloud or Enterprise and do not have "
|
||||
"direct access to the underlying Lance dataset needed for permutations."
|
||||
)
|
||||
self._async = async_permutation_builder(table)
|
||||
|
||||
def split_random(
|
||||
|
||||
@@ -275,7 +275,18 @@ def _py_type_to_arrow_type(py_type: Type[Any], field: FieldInfo) -> pa.DataType:
|
||||
tz = get_extras(field, "tz")
|
||||
return pa.timestamp("us", tz=tz)
|
||||
elif getattr(py_type, "__origin__", None) in (list, tuple):
|
||||
child = py_type.__args__[0]
|
||||
# A bare, unparameterised ``typing.List`` / ``typing.Tuple`` matches this
|
||||
# branch (its ``__origin__`` is ``list`` / ``tuple``) but has no
|
||||
# ``__args__``, so we cannot infer the element type. Raise a clear
|
||||
# ``TypeError`` instead of crashing with an opaque ``AttributeError``.
|
||||
args = getattr(py_type, "__args__", None)
|
||||
if not args:
|
||||
raise TypeError(
|
||||
"Converting Pydantic type to Arrow Type: unsupported type "
|
||||
f"{py_type}. Specify the element type, e.g. List[int] instead "
|
||||
"of a bare List."
|
||||
)
|
||||
child = args[0]
|
||||
return _pydantic_list_child_to_arrow(child, field)
|
||||
raise TypeError(
|
||||
f"Converting Pydantic type to Arrow Type: unsupported type {py_type}."
|
||||
|
||||
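The bare-`List` case handled above can be reproduced with the standard `typing` helpers alone. The following sketch is an assumption-laden illustration: `list_element_type` is a hypothetical helper, and it uses `typing.get_origin` and `typing.get_args` rather than reading `__origin__` and `__args__` directly as the diff does.

```python
from typing import Any, List, Tuple, get_args, get_origin


def list_element_type(py_type: Any) -> Any:
    """Return the element type of a parameterised list or tuple annotation."""
    if get_origin(py_type) not in (list, tuple):
        raise TypeError(f"not a list or tuple annotation: {py_type}")
    args = get_args(py_type)
    if not args:
        # A bare ``List`` or ``Tuple`` has an origin but no type arguments.
        raise TypeError(
            f"unsupported type {py_type}: specify the element type, "
            "e.g. List[int] instead of a bare List"
        )
    return args[0]
```

Calling `list_element_type(List[int])` yields `int`, while `list_element_type(List)` raises a `TypeError` that names the problem instead of failing on a missing attribute.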
@@ -91,14 +91,14 @@ def _schema_has_blob_field(schema: pa.Schema) -> bool:
|
||||
|
||||
|
||||
def _blob_mode_requires_native_pandas(blob_mode: BlobMode, schema: pa.Schema) -> bool:
|
||||
return blob_mode in ("lazy", "bytes") and _schema_has_blob_field(schema)
|
||||
return blob_mode in _BLOB_MODE_TO_HANDLING and _schema_has_blob_field(schema)
|
||||
|
||||
|
||||
def _unsupported_blob_pandas_error(reason: str) -> RuntimeError:
|
||||
return RuntimeError(
|
||||
"blob_mode='lazy' and blob_mode='bytes' require Lance native pandas "
|
||||
f"conversion for queries that return blob columns, but {reason}. "
|
||||
"Use blob_mode='descriptions' or remove blob columns from the projection."
|
||||
"blob columns require Lance native scanner conversion for query "
|
||||
f"to_pandas(), but {reason}. Use a plain scan query or remove blob "
|
||||
"columns from the projection."
|
||||
)
|
||||
|
||||
|
||||
@@ -149,19 +149,48 @@ def _projection_to_scanner_kwargs(
|
||||
return {"columns": projection}
|
||||
|
||||
|
||||
def _scanner_kwargs_for_query(query: Query, blob_mode: BlobMode) -> Dict[str, Any]:
|
||||
def _scanner_kwargs_for_query(
|
||||
query: Query, blob_mode: BlobMode, dataset: Optional[Any] = None
|
||||
) -> Dict[str, Any]:
|
||||
fragments = _scanner_fragments_for_query(query, dataset)
|
||||
kwargs = {
|
||||
**_projection_to_scanner_kwargs(query.columns),
|
||||
"filter": _filter_to_sql(query.filter),
|
||||
"limit": query.limit,
|
||||
"offset": query.offset,
|
||||
"with_row_id": query.with_row_id,
|
||||
"with_row_address": query.with_row_address,
|
||||
"fast_search": query.fast_search,
|
||||
"blob_handling": _BLOB_MODE_TO_HANDLING[blob_mode],
|
||||
"fragments": fragments,
|
||||
}
|
||||
return {key: value for key, value in kwargs.items() if value is not None}
|
||||
|
||||
|
||||
def _scanner_fragments_for_query(query: Query, dataset: Optional[Any]) -> Optional[Any]:
|
||||
if query.fragments is not None and query.fragment_ids is not None:
|
||||
raise ValueError("fragments and fragment_ids cannot both be set")
|
||||
if query.fragments is not None:
|
||||
return query.fragments
|
||||
if query.fragment_ids is None:
|
||||
return None
|
||||
if dataset is None:
|
||||
raise ValueError("fragment_ids require a Lance dataset")
|
||||
|
||||
requested = set(query.fragment_ids)
|
||||
fragments = [
|
||||
fragment
|
||||
for fragment in dataset.get_fragments()
|
||||
if fragment.fragment_id in requested
|
||||
]
|
||||
found = {fragment.fragment_id for fragment in fragments}
|
||||
missing = requested - found
|
||||
if missing:
|
||||
missing_ids = ", ".join(str(fragment_id) for fragment_id in sorted(missing))
|
||||
raise ValueError(f"fragment_ids not found in dataset: {missing_ids}")
|
||||
return fragments
|
||||
|
||||
|
||||
def _ensure_lazy_blob_frame(
|
||||
df: "pd.DataFrame", schema: pa.Schema, blob_mode: BlobMode
|
||||
) -> "pd.DataFrame":
|
||||
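The id-to-fragment resolution in `_scanner_fragments_for_query` can be exercised without a Lance dataset. In the sketch below, `Fragment` is a hypothetical stand-in exposing only `fragment_id`, and `select_fragments` is an invented name; the filtering and the missing-id check follow the logic shown in the hunk.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Fragment:
    # Hypothetical stand-in for a Lance fragment handle.
    fragment_id: int


def select_fragments(
    available: Iterable[Fragment], fragment_ids: Iterable[int]
) -> List[Fragment]:
    requested = set(fragment_ids)
    # Keep dataset order; duplicates in the request collapse via the set.
    selected = [f for f in available if f.fragment_id in requested]
    missing = requested - {f.fragment_id for f in selected}
    if missing:
        ids = ", ".join(str(i) for i in sorted(missing))
        raise ValueError(f"fragment_ids not found in dataset: {ids}")
    return selected
```

Note that the result follows the order of the dataset's fragments, not the order in which the ids were requested, and that every unknown id is reported in a single error.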
@@ -179,6 +208,16 @@ def _ensure_lazy_blob_frame(
|
||||
return df
|
||||
|
||||
|
||||
def _scanner_to_table(scanner: Any) -> pa.Table:
|
||||
if hasattr(scanner, "to_pyarrow"):
|
||||
reader = scanner.to_pyarrow()
|
||||
return reader.read_all()
|
||||
if hasattr(scanner, "to_table"):
|
||||
return scanner.to_table()
|
||||
reader = scanner.to_reader()
|
||||
return reader.read_all()
|
||||
|
||||
|
||||
def _scanner_to_pandas(scanner: Any, blob_mode: BlobMode, **kwargs) -> "pd.DataFrame":
|
||||
schema = getattr(scanner, "projected_schema", None)
|
||||
if schema is None:
|
||||
@@ -199,14 +238,7 @@ def _scanner_to_pandas(scanner: Any, blob_mode: BlobMode, **kwargs) -> "pd.DataF
|
||||
return _ensure_lazy_blob_frame(df, schema, blob_mode)
|
||||
return df
|
||||
|
||||
if hasattr(scanner, "to_pyarrow"):
|
||||
reader = scanner.to_pyarrow()
|
||||
tbl = reader.read_all()
|
||||
elif hasattr(scanner, "to_table"):
|
||||
tbl = scanner.to_table()
|
||||
else:
|
||||
reader = scanner.to_reader()
|
||||
tbl = reader.read_all()
|
||||
tbl = _scanner_to_table(scanner)
|
||||
if blob_mode == "lazy" and _schema_has_blob_field(tbl.schema):
|
||||
raise _unsupported_blob_pandas_error(
|
||||
"the Lance scanner does not expose to_pandas"
|
||||
@@ -648,6 +680,13 @@ class Query(pydantic.BaseModel):
|
||||
# if true, include the row id in the results
|
||||
with_row_id: Optional[bool] = None
|
||||
|
||||
# if true, include the row address in the results
|
||||
with_row_address: Optional[bool] = None
|
||||
|
||||
# Lance fragments or fragment ids to scan on scanner-backed plain queries
|
||||
fragments: Optional[Any] = None
|
||||
fragment_ids: Optional[List[int]] = None
|
||||
|
||||
# offset to start fetching results from
|
||||
offset: Optional[int] = None
|
||||
|
||||
@@ -840,6 +879,9 @@ class LanceQueryBuilder(ABC):
|
||||
self._where = None
|
||||
self._postfilter = None
|
||||
self._with_row_id = None
|
||||
self._with_row_address = None
|
||||
self._fragments = None
|
||||
self._fragment_ids = None
|
||||
self._vector = None
|
||||
self._text = None
|
||||
self._ef = None
|
||||
@@ -901,9 +943,11 @@ class LanceQueryBuilder(ABC):
|
||||
schema = output_schema()
|
||||
if _blob_mode_requires_native_pandas(blob_mode, schema):
|
||||
native_error = None
|
||||
if flatten is None and timeout is None:
|
||||
if (flatten is None or blob_mode == "descriptions") and timeout is None:
|
||||
try:
|
||||
df = self._plain_scan_to_pandas(blob_mode, **kwargs)
|
||||
df = self._plain_scan_to_pandas(
|
||||
blob_mode, flatten=flatten, **kwargs
|
||||
)
|
||||
if df is not None:
|
||||
return df
|
||||
except Exception as err:
|
||||
@@ -1125,6 +1169,32 @@ class LanceQueryBuilder(ABC):
|
||||
self._with_row_id = with_row_id
|
||||
return self
|
||||
|
||||
def with_row_address(self, with_row_address: bool = True) -> Self:
|
||||
"""Set whether to return row addresses.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
with_row_address: bool, default True
|
||||
If True, return the _rowaddr column in the results.
|
||||
|
||||
Returns
|
||||
-------
|
||||
LanceQueryBuilder
|
||||
The LanceQueryBuilder object.
|
||||
"""
|
||||
self._with_row_address = with_row_address
|
||||
return self
|
||||
|
||||
def with_fragments(self, fragments: Any) -> Self:
|
||||
"""Set the Lance fragments to scan for plain scanner-backed queries."""
|
||||
self._fragments = fragments
|
||||
return self
|
||||
|
||||
def fragment_ids(self, fragment_ids: List[int]) -> Self:
|
||||
"""Set the Lance fragment ids to scan for plain scanner-backed queries."""
|
||||
self._fragment_ids = fragment_ids
|
||||
return self
|
||||
|
||||
def explain_plan(self, verbose: Optional[bool] = False) -> str:
|
||||
"""Return the execution plan for this query.
|
||||
|
||||
@@ -1267,6 +1337,7 @@ class LanceQueryBuilder(ABC):
|
||||
def _plain_scan_to_pandas(
|
||||
self,
|
||||
blob_mode: BlobMode,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
**kwargs,
|
||||
) -> Optional["pd.DataFrame"]:
|
||||
query = self.to_query_object()
|
||||
@@ -1274,7 +1345,12 @@ class LanceQueryBuilder(ABC):
|
||||
return None
|
||||
|
||||
dataset = self._table.to_lance()
|
||||
scanner = dataset.scanner(**_scanner_kwargs_for_query(query, blob_mode))
|
||||
scanner = dataset.scanner(
|
||||
**_scanner_kwargs_for_query(query, blob_mode, dataset)
|
||||
)
|
||||
if flatten is not None:
|
||||
tbl = flatten_columns(_scanner_to_table(scanner), flatten)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
return _scanner_to_pandas(scanner, blob_mode, **kwargs)
|
||||
|
||||
@abstractmethod
|
||||
@@ -1548,6 +1624,9 @@ class LanceVectorQueryBuilder(LanceQueryBuilder):
|
||||
refine_factor=self._refine_factor,
|
||||
vector_column=self._vector_column,
|
||||
with_row_id=self._with_row_id,
|
||||
with_row_address=self._with_row_address,
|
||||
fragments=self._fragments,
|
||||
fragment_ids=self._fragment_ids,
|
||||
offset=self._offset,
|
||||
fast_search=self._fast_search,
|
||||
ef=self._ef,
|
||||
@@ -1750,6 +1829,9 @@ class LanceFtsQueryBuilder(LanceQueryBuilder):
|
||||
limit=self._limit,
|
||||
postfilter=self._postfilter,
|
||||
with_row_id=self._with_row_id,
|
||||
with_row_address=self._with_row_address,
|
||||
fragments=self._fragments,
|
||||
fragment_ids=self._fragment_ids,
|
||||
full_text_query=FullTextSearchQuery(
|
||||
query=self._query, columns=self._fts_columns
|
||||
),
|
||||
@@ -1820,6 +1902,9 @@ class LanceEmptyQueryBuilder(LanceQueryBuilder):
|
||||
filter=self._where,
|
||||
limit=self._limit,
|
||||
with_row_id=self._with_row_id,
|
||||
with_row_address=self._with_row_address,
|
||||
fragments=self._fragments,
|
||||
fragment_ids=self._fragment_ids,
|
||||
offset=self._offset,
|
||||
order_by=self._order_by,
|
||||
)
|
||||
@@ -2411,6 +2496,9 @@ class AsyncQueryBase(object):
|
||||
"""
|
||||
self._inner = inner
|
||||
self._table = table
|
||||
self._with_row_address = None
|
||||
self._fragments = None
|
||||
self._fragment_ids = None
|
||||
|
||||
def to_query_object(self) -> Query:
|
||||
"""
|
||||
@@ -2419,7 +2507,11 @@ class AsyncQueryBase(object):
|
||||
This is currently experimental but can be useful as the query object is pure
|
||||
python and more easily serializable.
|
||||
"""
|
||||
return Query.from_inner(self._inner.to_query_request())
|
||||
query = Query.from_inner(self._inner.to_query_request())
|
||||
query.with_row_address = self._with_row_address
|
||||
query.fragments = self._fragments
|
||||
query.fragment_ids = self._fragment_ids
|
||||
return query
|
||||
|
||||
def select(self, columns: Union[List[str], dict[str, str]]) -> Self:
|
||||
"""
|
||||
@@ -2476,6 +2568,27 @@ class AsyncQueryBase(object):
|
||||
self._inner.with_row_id()
|
||||
return self
|
||||
|
||||
def with_row_address(self, with_row_address: bool = True) -> Self:
|
||||
"""
|
||||
Include the _rowaddr column in scanner-backed plain query results.
|
||||
"""
|
||||
self._with_row_address = with_row_address
|
||||
return self
|
||||
|
||||
def with_fragments(self, fragments: Any) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragments.
|
||||
"""
|
||||
self._fragments = fragments
|
||||
return self
|
||||
|
||||
def fragment_ids(self, fragment_ids: List[int]) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragment ids.
|
||||
"""
|
||||
self._fragment_ids = fragment_ids
|
||||
return self
|
||||
|
||||
async def to_batches(
|
||||
self,
|
||||
*,
|
||||
@@ -2601,9 +2714,11 @@ class AsyncQueryBase(object):
|
||||
schema = await self.output_schema()
|
||||
if _blob_mode_requires_native_pandas(blob_mode, schema):
|
||||
native_error = None
|
||||
if flatten is None and timeout is None:
|
||||
if (flatten is None or blob_mode == "descriptions") and timeout is None:
|
||||
try:
|
||||
df = await self._plain_scan_to_pandas(blob_mode, **kwargs)
|
||||
df = await self._plain_scan_to_pandas(
|
||||
blob_mode, flatten=flatten, **kwargs
|
||||
)
|
||||
if df is not None:
|
||||
return df
|
||||
except Exception as err:
|
||||
@@ -2625,6 +2740,7 @@ class AsyncQueryBase(object):
|
||||
async def _plain_scan_to_pandas(
|
||||
self,
|
||||
blob_mode: BlobMode,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
**kwargs,
|
||||
) -> Optional["pd.DataFrame"]:
|
||||
if self._table is None:
|
||||
@@ -2635,7 +2751,12 @@ class AsyncQueryBase(object):
|
||||
return None
|
||||
|
||||
dataset = await self._table._to_lance()
|
||||
scanner = dataset.scanner(**_scanner_kwargs_for_query(query, blob_mode))
|
||||
scanner = dataset.scanner(
|
||||
**_scanner_kwargs_for_query(query, blob_mode, dataset)
|
||||
)
|
||||
if flatten is not None:
|
||||
tbl = flatten_columns(_scanner_to_table(scanner), flatten)
|
||||
return tbl.to_pandas(**kwargs)
|
||||
return _scanner_to_pandas(scanner, blob_mode, **kwargs)
|
||||
|
||||
async def to_polars(
|
||||
@@ -3522,6 +3643,7 @@ class AsyncTakeQuery(AsyncQueryBase):
|
||||
async def _plain_scan_to_pandas(
|
||||
self,
|
||||
blob_mode: BlobMode,
|
||||
flatten: Optional[Union[int, bool]] = None,
|
||||
**kwargs,
|
||||
) -> Optional["pd.DataFrame"]:
|
||||
return None
|
||||
@@ -3576,6 +3698,27 @@ class BaseQueryBuilder(object):
|
||||
self._inner.with_row_id()
|
||||
return self
|
||||
|
||||
def with_row_address(self, with_row_address: bool = True) -> Self:
|
||||
"""
|
||||
Include the _rowaddr column in scanner-backed plain query results.
|
||||
"""
|
||||
self._inner.with_row_address(with_row_address)
|
||||
return self
|
||||
|
||||
def with_fragments(self, fragments: Any) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragments.
|
||||
"""
|
||||
self._inner.with_fragments(fragments)
|
||||
return self
|
||||
|
||||
def fragment_ids(self, fragment_ids: List[int]) -> Self:
|
||||
"""
|
||||
Restrict scanner-backed plain query results to the given Lance fragment ids.
|
||||
"""
|
||||
self._inner.fragment_ids(fragment_ids)
|
||||
return self
|
||||
|
||||
def output_schema(self) -> pa.Schema:
|
||||
"""
|
||||
Return the output schema for the query
|
||||
|
||||
@@ -9,6 +9,7 @@ from typing import List, Optional
|
||||
from lancedb import __version__
|
||||
|
||||
from .header import HeaderProvider
|
||||
from .oauth import OAuthConfig, OAuthFlowType
|
||||
|
||||
__all__ = [
|
||||
"TimeoutConfig",
|
||||
@@ -16,6 +17,8 @@ __all__ = [
|
||||
"TlsConfig",
|
||||
"ClientConfig",
|
||||
"HeaderProvider",
|
||||
"OAuthConfig",
|
||||
"OAuthFlowType",
|
||||
]
|
||||
|
||||
|
||||
|
||||
@@ -124,6 +124,7 @@ class RemoteDBConnection(DBConnection):
|
||||
"request_thread_pool is no longer used and will be removed in "
|
||||
"a future release.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
if connection_timeout is not None:
|
||||
@@ -132,6 +133,7 @@ class RemoteDBConnection(DBConnection):
|
||||
"release. Please use client_config.timeout_config.connect_timeout "
|
||||
"instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
client_config.timeout_config.connect_timeout = timedelta(
|
||||
seconds=connection_timeout
|
||||
@@ -142,6 +144,7 @@ class RemoteDBConnection(DBConnection):
|
||||
"read_timeout is deprecated and will be removed in a future release. "
|
||||
"Please use client_config.timeout_config.read_timeout instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
client_config.timeout_config.read_timeout = timedelta(seconds=read_timeout)
|
||||
|
||||
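The `stacklevel=2` additions in this hunk change which source line a deprecation warning is attributed to. A minimal sketch of the effect, using hypothetical `connect` and `caller` functions that are not part of LanceDB:

```python
import warnings


def connect(read_timeout=None):
    # Hypothetical function with a deprecated keyword argument.
    if read_timeout is not None:
        warnings.warn(
            "read_timeout is deprecated",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to whoever called connect()
        )


def caller():
    connect(read_timeout=5)
```

With the default `stacklevel=1` the warning would point at the `warnings.warn` call inside `connect`; with `stacklevel=2` it points at the line in `caller` that passed the deprecated argument, which is the line a user needs to change.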
@@ -383,6 +386,8 @@ class RemoteDBConnection(DBConnection):
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
storage_options: Optional[Dict[str, str]] = None,
|
||||
index_cache_size: Optional[int] = None,
|
||||
branch: Optional[str] = None,
|
||||
version: Optional[int] = None,
|
||||
) -> Table:
|
||||
"""Open a Lance Table in the database.
|
||||
|
||||
@@ -393,6 +398,14 @@ class RemoteDBConnection(DBConnection):
|
||||
namespace_path: List[str], optional
|
||||
The namespace to open the table from.
|
||||
None or empty list represents root namespace.
|
||||
branch: str, optional
|
||||
If provided, open a handle scoped to this branch instead of the
|
||||
default branch. Reads and writes operate in the branch's context.
|
||||
version: int, optional
|
||||
If provided, open the table pinned to this version, producing a
|
||||
read-only handle. Composes with ``branch``: when both are given,
|
||||
opens that branch at the version; otherwise opens ``main`` at the
|
||||
version. Call ``checkout_latest`` to return to a writable state.
|
||||
|
||||
Returns
|
||||
-------
|
||||
@@ -409,12 +422,17 @@ class RemoteDBConnection(DBConnection):
|
||||
)
|
||||
|
||||
table = LOOP.run(self._conn.open_table(name, namespace_path=namespace_path))
|
||||
return RemoteTable(
|
||||
tbl = RemoteTable(
|
||||
table,
|
||||
self.db_name,
|
||||
connection_state=self.serialize,
|
||||
namespace_path=namespace_path,
|
||||
)
|
||||
if branch is not None:
|
||||
tbl = tbl.branches.checkout(branch, version)
|
||||
elif version is not None:
|
||||
tbl.checkout(version)
|
||||
return tbl
|
||||
|
||||
def clone_table(
|
||||
self,
|
||||
|
||||
@@ -27,6 +27,9 @@ class LanceDBClientError(RuntimeError):
|
||||
self.request_id = request_id
|
||||
self.status_code = status_code
|
||||
|
||||
def __reduce__(self) -> tuple[type, tuple]:
|
||||
return (self.__class__, (str(self), self.request_id, self.status_code))
|
||||
|
||||
|
||||
class HttpError(LanceDBClientError):
|
||||
"""An error that occurred during an HTTP request.
|
||||
@@ -101,3 +104,19 @@ class RetryError(LanceDBClientError):
|
||||
self.max_request_failures = max_request_failures
|
||||
self.max_connect_failures = max_connect_failures
|
||||
self.max_read_failures = max_read_failures
|
||||
|
||||
def __reduce__(self) -> tuple[type, tuple]:
|
||||
return (
|
||||
self.__class__,
|
||||
(
|
||||
str(self),
|
||||
self.request_id,
|
||||
self.request_failures,
|
||||
self.connect_failures,
|
||||
self.read_failures,
|
||||
self.max_request_failures,
|
||||
self.max_connect_failures,
|
||||
self.max_read_failures,
|
||||
self.status_code,
|
||||
),
|
||||
)
|
||||
|
||||
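The `__reduce__` methods added above matter because an exception whose constructor takes more than the message cannot be rebuilt from `args` alone. The sketch below illustrates this with `copy.copy`, which goes through the same reduce protocol that `pickle` uses. `NaiveError` and `ClientError` are hypothetical classes written for this example; they mirror the shape of the errors in the diff but are not the library's classes.

```python
import copy
from typing import Optional


class NaiveError(RuntimeError):
    # Extra required argument, but no __reduce__: only the message lands in args.
    def __init__(self, message: str, request_id: str):
        super().__init__(message)
        self.request_id = request_id


class ClientError(RuntimeError):
    def __init__(self, message: str, request_id: str, status_code: Optional[int] = None):
        super().__init__(message)
        self.request_id = request_id
        self.status_code = status_code

    def __reduce__(self):
        # Rebuild by calling the constructor with every argument it needs.
        return (self.__class__, (str(self), self.request_id, self.status_code))


clone = copy.copy(ClientError("boom", "req-1", 503))
```

Copying a `NaiveError` raises `TypeError`, because reconstruction calls `NaiveError("boom")` without `request_id`. A `pickle.dumps` / `pickle.loads` round trip takes the same path, which is presumably why the errors above need the override when they cross process boundaries.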
75
python/python/lancedb/remote/oauth.py
Normal file
@@ -0,0 +1,75 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
from typing import List, Optional
|
||||
|
||||
|
||||
class OAuthFlowType(str, Enum):
|
||||
"""OAuth authentication flow types."""
|
||||
|
||||
CLIENT_CREDENTIALS = "client_credentials"
|
||||
"""Client Credentials grant (service-to-service / M2M)."""
|
||||
|
||||
AZURE_MANAGED_IDENTITY = "azure_managed_identity"
|
||||
"""Azure Managed Identity via IMDS."""
|
||||
|
||||
|
||||
@dataclass
|
||||
class OAuthConfig:
|
||||
"""OAuth configuration for LanceDB authentication.
|
||||
|
||||
All token acquisition and refresh is handled in the Rust layer.
|
||||
This config is passed through to Rust via PyO3.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
issuer_url : str
|
||||
OIDC issuer URL or OAuth authority URL.
|
||||
For Azure: ``https://login.microsoftonline.com/{tenant_id}/v2.0``
|
||||
client_id : str
|
||||
Application / Client ID.
|
||||
scopes : List[str]
|
||||
OAuth scopes to request.
|
||||
For Azure managed identity, exactly one scope or resource is required.
|
||||
For example: ``["api://{app_id}/.default"]``
|
||||
flow : OAuthFlowType
|
||||
Authentication flow to use. Default: CLIENT_CREDENTIALS.
|
||||
client_secret : Optional[str]
|
||||
Client secret (required for CLIENT_CREDENTIALS).
|
||||
managed_identity_client_id : Optional[str]
|
||||
Client ID for user-assigned managed identity (AZURE_MANAGED_IDENTITY).
|
||||
refresh_buffer_secs : Optional[int]
|
||||
Seconds before expiry to trigger proactive refresh (default: 300).
|
||||
Keep this well below the token TTL; if it is greater than or equal to
|
||||
the TTL, each request refreshes the token.
|
||||
|
||||
Examples
|
||||
--------
|
||||
Client Credentials (service-to-service):
|
||||
|
||||
>>> config = OAuthConfig(
|
||||
... issuer_url="https://login.microsoftonline.com/{tenant}/v2.0",
|
||||
... client_id="app-id",
|
||||
... client_secret="secret",
|
||||
... scopes=["api://lancedb-api/.default"],
|
||||
... )
|
||||
|
||||
Azure Managed Identity:
|
||||
|
||||
>>> config = OAuthConfig(
|
||||
... issuer_url="https://login.microsoftonline.com/{tenant}/v2.0",
|
||||
... client_id="app-id",
|
||||
... scopes=["api://lancedb-api/.default"],
|
||||
... flow=OAuthFlowType.AZURE_MANAGED_IDENTITY,
|
||||
... )
|
||||
"""
|
||||
|
||||
issuer_url: str
|
||||
client_id: str
|
||||
scopes: List[str]
|
||||
flow: OAuthFlowType = OAuthFlowType.CLIENT_CREDENTIALS
|
||||
client_secret: Optional[str] = field(default=None, repr=False)
|
||||
managed_identity_client_id: Optional[str] = None
|
||||
refresh_buffer_secs: Optional[int] = None
|
||||
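Two details of `OAuthConfig` can be illustrated in isolation: `field(default=None, repr=False)` keeps the client secret out of `repr()` output, and the docstring's warning about `refresh_buffer_secs` follows from simple arithmetic. The sketch below is an assumption-laden illustration. `DemoOAuthConfig` and `should_refresh` are hypothetical, and the real refresh logic lives in the Rust layer, which this diff does not show.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DemoOAuthConfig:
    """Hypothetical, trimmed-down stand-in for the OAuthConfig dataclass."""

    client_id: str
    # repr=False omits the field from the generated __repr__, so the secret
    # does not leak into logs or tracebacks that print the config.
    client_secret: Optional[str] = field(default=None, repr=False)
    refresh_buffer_secs: Optional[int] = None


def should_refresh(now: float, expires_at: float, buffer_secs: int = 300) -> bool:
    # Assumed shape of a proactive-refresh check: refresh once we are within
    # buffer_secs of expiry. Not taken from the Rust implementation.
    return now >= expires_at - buffer_secs


cfg = DemoOAuthConfig(client_id="app-id", client_secret="s3cr3t")
print(cfg)  # the secret value is not part of this output

# A token issued at t=0 with a 120 s TTL and a 300 s buffer is "due" at once,
# which is why the buffer should stay well below the token TTL.
print(should_refresh(now=0, expires_at=120, buffer_secs=300))
```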
@@ -13,10 +13,14 @@ from typing import (
|
||||
Iterable,
|
||||
List,
|
||||
Optional,
|
||||
TYPE_CHECKING,
|
||||
Union,
|
||||
Literal,
|
||||
overload,
|
||||
)
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from ..udf import JobHandle
|
||||
import warnings
|
||||
|
||||
from lancedb import __version__
|
||||
@@ -56,7 +60,7 @@ from lancedb.embeddings import EmbeddingFunctionRegistry
|
||||
from lancedb.table import _normalize_progress
|
||||
|
||||
from ..query import LanceVectorQueryBuilder, LanceQueryBuilder, LanceTakeQueryBuilder
|
||||
from ..table import AsyncTable, BlobMode, IndexStatistics, Query, Table, Tags
|
||||
from ..table import AsyncTable, BlobMode, Branches, IndexStatistics, Query, Table, Tags
|
||||
from ..types import BaseTokenizerType
|
||||
|
||||
|
||||
@@ -75,6 +79,9 @@ class RemoteTable(Table):
|
||||
self._connection_state = connection_state
|
||||
self._namespace_path = list(namespace_path or [])
|
||||
self._checkout_version: Optional[int] = None
|
||||
# The branch this handle is scoped to (None == main). Persisted so a
|
||||
# fork/pickle reopen restores the branch instead of reverting to main.
|
||||
self._branch: Optional[str] = None
|
||||
self._pid = os.getpid()
|
||||
|
||||
def _serialized_connection_state(self) -> str:
|
||||
@@ -109,9 +116,14 @@ class RemoteTable(Table):
|
||||
from lancedb import deserialize_conn
|
||||
|
||||
db = deserialize_conn(self._serialized_connection_state(), for_worker=True)
|
||||
table = db.open_table(self._name, namespace_path=self._namespace_path)
|
||||
if self._checkout_version is not None:
|
||||
table.checkout(self._checkout_version)
|
||||
# Reopen on the same branch and pinned version (branch=None / version=None
|
||||
# reproduce the plain main-latest open).
|
||||
table = db.open_table(
|
||||
self._name,
|
||||
namespace_path=self._namespace_path,
|
||||
branch=self._branch,
|
||||
version=self._checkout_version,
|
||||
)
|
||||
|
||||
self._table_handle = table._table
|
||||
self.db_name = table.db_name
|
||||
@@ -124,6 +136,7 @@ class RemoteTable(Table):
|
||||
"name": self.name,
|
||||
"namespace_path": self._namespace_path,
|
||||
"checkout_version": self._checkout_version,
|
||||
"branch": self._branch,
|
||||
}
|
||||
|
||||
def __setstate__(self, state: dict) -> None:
|
||||
@@ -133,6 +146,7 @@ class RemoteTable(Table):
|
||||
self._connection_state = state["connection_state"]
|
||||
self._namespace_path = state["namespace_path"]
|
||||
self._checkout_version = state["checkout_version"]
|
||||
self._branch = state.get("branch")
|
||||
self._pid = None
|
||||
|
||||
@property
|
||||
@@ -160,6 +174,34 @@ class RemoteTable(Table):
|
||||
def tags(self) -> Tags:
|
||||
return Tags(self._table)
|
||||
|
||||
@property
|
||||
def branches(self) -> Branches:
|
||||
"""Branch management for the table.
|
||||
|
||||
``create``/``checkout`` return a new table handle scoped to the branch;
|
||||
writes on it do not affect ``main``.
|
||||
"""
|
||||
return Branches(self)
|
||||
|
||||
def current_branch(self) -> Optional[str]:
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
return self._table.current_branch()
|
||||
|
||||
def _wrap_branch_handle(
|
||||
self, async_table: AsyncTable, version: Optional[int] = None
|
||||
) -> "RemoteTable":
|
||||
# A branch handle stays a RemoteTable with the same connection context.
|
||||
# Record the branch and version pin so a fork/pickle reopen restores both.
|
||||
handle = RemoteTable(
|
||||
async_table,
|
||||
self.db_name,
|
||||
connection_state=self._connection_state,
|
||||
namespace_path=self._namespace_path,
|
||||
)
|
||||
handle._branch = async_table.current_branch()
|
||||
handle._checkout_version = version
|
||||
return handle
|
||||
|
||||
@cached_property
|
||||
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
|
||||
"""
|
||||
@@ -807,7 +849,8 @@ class RemoteTable(Table):
|
||||
"""
|
||||
warnings.warn(
|
||||
"cleanup_old_versions() is a no-op on LanceDB Cloud. "
|
||||
"Tables are automatically cleaned up and optimized."
|
||||
"Tables are automatically cleaned up and optimized.",
|
||||
stacklevel=2,
|
||||
)
|
||||
pass
|
||||
|
||||
@@ -819,7 +862,8 @@ class RemoteTable(Table):
|
||||
"""
|
||||
warnings.warn(
|
||||
"compact_files() is a no-op on LanceDB Cloud. "
|
||||
"Tables are automatically compacted and optimized."
|
||||
"Tables are automatically compacted and optimized.",
|
||||
stacklevel=2,
|
||||
)
|
||||
pass
|
||||
|
||||
@@ -836,15 +880,150 @@ class RemoteTable(Table):
|
||||
"""
|
||||
warnings.warn(
|
||||
"optimize() is a no-op on LanceDB Cloud. "
|
||||
"Indices are optimized automatically."
|
||||
"Indices are optimized automatically.",
|
||||
stacklevel=2,
|
||||
)
|
||||
pass
|
||||
|
||||
def count_rows(self, filter: Optional[str] = None) -> int:
|
||||
return LOOP.run(self._table.count_rows(filter))
|
||||
|
||||
def add_columns(self, transforms: Dict[str, str]) -> AddColumnsResult:
|
||||
return LOOP.run(self._table.add_columns(transforms))
|
||||
def add_columns(
|
||||
self,
|
||||
transforms: Optional[Dict[str, str]] = None,
|
||||
*,
|
||||
computed: Optional[Dict[str, tuple]] = None,
|
||||
) -> Optional[AddColumnsResult]:
|
||||
result = None
|
||||
if transforms is not None:
|
||||
result = LOOP.run(self._table.add_columns(transforms))
|
||||
if computed:
|
||||
LOOP.run(self._table.add_columns(computed=computed))
|
||||
return result
|
||||
|
||||
def refresh_column(
|
||||
self,
|
||||
columns,
|
||||
*,
|
||||
where: Optional[str] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> "JobHandle":
|
||||
"""Trigger recompute of computed columns (REFRESH COLUMN).
|
||||
|
||||
The expression is resolved server-side from each column's stored
|
||||
binding; columns bound to the same struct-returning function
|
||||
refresh together. Returns a `JobHandle` to wait on, poll, or cancel
|
||||
(``tbl.refresh_column("c").wait()``). Server-backed feature
|
||||
(LanceDB Enterprise / Cloud).
|
||||
|
||||
num_workers / max_workers / batch_size / priority are per-refresh
|
||||
scheduling knobs (how to run THIS refresh) and override any default
|
||||
the function carries. `priority` is a Kueue tier
|
||||
(training | interactive | backfill).
|
||||
"""
|
||||
from ..udf import JobHandle
|
||||
|
||||
if isinstance(columns, str):
|
||||
columns = [columns]
|
||||
job_id = LOOP.run(
|
||||
self._table.refresh_column(
|
||||
list(columns),
|
||||
where=where,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
priority=priority,
|
||||
)
|
||||
)
|
||||
return JobHandle(self._job_conn(), job_id)
|
||||
|
||||
def lineage(self, column=None, *, direction=None, depth=None):
|
||||
"""Derived-compute lineage of this table, or one of its columns:
|
||||
upstream sources, downstream dependents, and the function version +
|
||||
location that produced each derived column (with a drift flag). Returns
|
||||
a `Lineage`. See `Connection.lineage`."""
|
||||
return self._job_conn().lineage(
|
||||
self._name, column, direction=direction, depth=depth
|
||||
)
|
||||
|
||||
def _job_conn(self):
|
||||
"""A client connection for polling jobs this table spawns. Built lazily
|
||||
from the table's serialized connection state and cached (not pickled --
|
||||
a forked/unpickled table rebuilds it on next use)."""
|
||||
from lancedb import deserialize_conn
|
||||
|
||||
conn = getattr(self, "_job_conn_cache", None)
|
||||
if conn is None:
|
||||
conn = deserialize_conn(self._serialized_connection_state())
|
||||
self._job_conn_cache = conn
|
||||
return conn
|
||||
|
||||
def load_columns(
|
||||
self,
|
||||
source: Union[str, Iterable[str]],
|
||||
pk: str,
|
||||
columns: Union[Iterable[str], Dict[str, str]],
|
||||
*,
|
||||
source_format: str = "parquet",
|
||||
source_pk: Optional[str] = None,
|
||||
on_missing: str = "carry",
|
||||
source_storage_options: Optional[Dict[str, str]] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
commit_granularity: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Fill existing columns from an external source by primary-key join.
|
||||
|
||||
The distributed-job equivalent of Geneva's ``Table.load_columns()``:
|
||||
imports precomputed values (e.g. embeddings) from Parquet/Lance/IPC into
|
||||
this table, matching on a primary key. Returns the load job id.
|
||||
Server-backed feature (LanceDB Enterprise / Cloud).
|
||||
|
||||
Parameters
|
||||
----------
|
||||
source: str | list[str]
|
||||
One source URI or a list of URIs.
|
||||
pk: str
|
||||
Destination primary-key column. Also the source key unless
|
||||
``source_pk`` is given.
|
||||
columns: list[str] | dict[str, str]
|
||||
Value columns to load. A list loads same-named columns; a dict maps
|
||||
``{target: source}``.
|
||||
source_format: str
|
||||
``"parquet"`` (default), ``"lance"``, or ``"ipc"``.
|
||||
source_pk: str, optional
|
||||
Source primary-key column when it differs from ``pk``.
|
||||
on_missing: str
|
||||
Behavior for destination rows with no source match:
|
||||
``"carry"`` (default, keep existing), ``"null"``, or ``"error"``.
|
||||
"""
|
||||
if isinstance(source, str):
|
||||
source = [source]
|
||||
if isinstance(columns, dict):
|
||||
mappings = [(target, src) for target, src in columns.items()]
|
||||
else:
|
||||
mappings = [(c, None) for c in columns]
|
||||
return LOOP.run(
|
||||
self._table.load_columns(
|
||||
list(source),
|
||||
source_format,
|
||||
pk,
|
||||
mappings,
|
||||
source_key=source_pk,
|
||||
source_storage_options=source_storage_options,
|
||||
on_missing=on_missing,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
commit_granularity=commit_granularity,
|
||||
priority=priority,
|
||||
)
|
||||
)
|
||||
|
||||
def alter_columns(
|
||||
self, *alterations: Iterable[Dict[str, str]]
|
||||
|
||||
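The argument normalization at the top of `load_columns` above can be read on its own: a single source URI becomes a one-element list, a dict of columns becomes `(target, source)` pairs, and a plain list becomes `(name, None)` pairs, meaning "same name on both sides". The helper below is a standalone sketch of that step only. `normalize_load_args` is a hypothetical name, and the URIs are placeholders.

```python
from typing import Dict, Iterable, List, Optional, Tuple, Union


def normalize_load_args(
    source: Union[str, Iterable[str]],
    columns: Union[Iterable[str], Dict[str, str]],
) -> Tuple[List[str], List[Tuple[str, Optional[str]]]]:
    # A bare string is one URI, not an iterable of characters.
    if isinstance(source, str):
        source = [source]
    if isinstance(columns, dict):
        # {target: source} mapping: keep both names.
        mappings = [(target, src) for target, src in columns.items()]
    else:
        # Same-named columns: no explicit source name.
        mappings = [(c, None) for c in columns]
    return list(source), mappings


print(normalize_load_args("s3://bucket/part-0.parquet", ["vec"]))
print(normalize_load_args(["a.parquet", "b.parquet"], {"vec": "embedding"}))
```

Checking `isinstance(source, str)` first is what prevents a single URI from being split into characters by `list(source)`.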
@@ -125,6 +125,9 @@ class MRRReranker(Reranker):
|
||||
This cannot reuse rerank_hybrid because MRR semantics require treating
|
||||
each vector result as a separate ranking system.
|
||||
"""
|
||||
if not vector_results:
|
||||
raise ValueError("vector_results must not be empty")
|
||||
|
||||
if not all(isinstance(v, type(vector_results[0])) for v in vector_results):
|
||||
raise ValueError(
|
||||
"All elements in vector_results should be of the same type"
|
||||
|
||||
@@ -82,6 +82,9 @@ class RRFReranker(Reranker):
|
||||
results from multiple vector searches as it doesn't support reranking
|
||||
vector results individually.
|
||||
"""
|
||||
if not vector_results:
|
||||
raise ValueError("vector_results must not be empty")
|
||||
|
||||
# Make sure all elements are of the same type
|
||||
if not all(isinstance(v, type(vector_results[0])) for v in vector_results):
|
||||
raise ValueError(
|
||||
|
||||
@@ -86,7 +86,10 @@ def _from_list(data: list) -> Scannable:
|
||||
|
||||
@to_scannable.register(dict)
|
||||
def _from_dict(data: dict) -> Scannable:
|
||||
raise ValueError("Cannot add a single dictionary to a table. Use a list.")
|
||||
raise ValueError(
|
||||
"Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
)
|
||||
|
||||
|
||||
@to_scannable.register(LanceModel)
|
||||
|
||||
@@ -30,7 +30,7 @@ from lancedb.scannable import _register_optional_converters, to_scannable
|
||||
|
||||
from . import __version__
|
||||
from lancedb.arrow import peek_reader
|
||||
from lancedb.background_loop import LOOP
|
||||
from lancedb.background_loop import LOOP, embedding_executor
|
||||
from .dependencies import (
|
||||
_check_for_hugging_face,
|
||||
_check_for_lance,
|
||||
@@ -55,11 +55,13 @@ from .index import (
|
||||
Bitmap,
|
||||
IvfRq,
|
||||
LabelList,
|
||||
Fm,
|
||||
HnswPq,
|
||||
HnswSq,
|
||||
HnswFlat,
|
||||
FTS,
|
||||
)
|
||||
from .expr import Expr
|
||||
from .merge import LanceMergeInsertBuilder
|
||||
from .pydantic import LanceModel, model_to_dict
|
||||
from .query import (
|
||||
@@ -92,6 +94,12 @@ BlobMode = Literal["lazy", "bytes", "descriptions"]
|
||||
_VALID_BLOB_MODES = ("lazy", "bytes", "descriptions")
|
||||
|
||||
|
||||
def _should_push_down_query_table(
|
||||
namespace_client: Optional[Any], pushdown_operations: set
|
||||
) -> bool:
|
||||
return namespace_client is not None and "QueryTable" in pushdown_operations
|
||||
|
||||
|
||||
def _validate_blob_mode(blob_mode: BlobMode) -> None:
|
||||
if blob_mode not in _VALID_BLOB_MODES:
|
||||
modes = ", ".join(repr(mode) for mode in _VALID_BLOB_MODES)
|
||||
@@ -207,6 +215,7 @@ IndexConfigType = Union[
|
||||
BTree,
|
||||
Bitmap,
|
||||
LabelList,
|
||||
Fm,
|
||||
FTS,
|
||||
]
|
||||
|
||||
@@ -234,7 +243,10 @@ def _into_pyarrow_reader(
|
||||
raise ValueError("Cannot add a single LanceModel to a table. Use a list.")
|
||||
|
||||
if isinstance(data, dict):
|
||||
raise ValueError("Cannot add a single dictionary to a table. Use a list.")
|
||||
raise ValueError(
|
||||
"Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
)
|
||||
|
||||
if isinstance(data, list):
|
||||
# Handle empty list case
|
||||
@@ -690,6 +702,24 @@ def _normalize_progress(progress):
|
||||
return progress, False
|
||||
|
||||
|
||||
def _computed_groups(computed):
|
||||
"""Group computed columns by expression, preserving declaration order
|
||||
(struct-returning functions need their columns adjacent so schema order
|
||||
matches field order). Accepts the ergonomic forms -- `fn("col")` values
|
||||
and tuple keys for struct fan-out -- via `_normalize_computed`."""
|
||||
from .udf import _normalize_computed
|
||||
|
||||
groups = []
|
||||
for name, (sql_type, expression) in _normalize_computed(computed).items():
|
||||
for expr, cols in groups:
|
||||
if expr == expression:
|
||||
cols.append((name, sql_type))
|
||||
break
|
||||
else:
|
||||
groups.append((expression, [(name, sql_type)]))
|
||||
return groups
|
||||
|
||||
|
||||
class Table(ABC):
|
||||
"""
|
||||
A Table is a collection of Records in a LanceDB Database.
|
||||
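The grouping done by `_computed_groups` above relies on Python's `for ... else`: the `else` branch runs only when the inner loop finishes without `break`, that is, when no existing group has the same expression. The sketch below reproduces that loop on already-normalized input, so it skips the `_normalize_computed` step. `group_by_expression` and the sample expressions are hypothetical.

```python
def group_by_expression(computed):
    """Group {name: (sql_type, expression)} by expression, in first-seen order."""
    groups = []
    for name, (sql_type, expression) in computed.items():
        for expr, cols in groups:
            if expr == expression:
                cols.append((name, sql_type))
                break
        else:
            # No existing group matched: start a new one.
            groups.append((expression, [(name, sql_type)]))
    return groups


demo = {
    "lang": ("TEXT", "detect(body)"),
    "vec": ("FLOAT[]", "embed(body)"),
    "score": ("DOUBLE", "detect(body)"),
}
print(group_by_expression(demo))
```

A list of pairs is used instead of a dict keyed by expression, which works even if an expression object is not hashable. That costs a linear scan per column, which is negligible for the handful of columns a single call declares.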
@@ -778,10 +808,76 @@ class Table(ABC):
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
@property
|
||||
def branches(self) -> "Branches":
|
||||
"""Branch management for the table.
|
||||
|
||||
Branches are isolated, writable lines of history forked from another
|
||||
branch (or version). Writes on a branch do not affect ``main``.
|
||||
"""
|
||||
raise NotImplementedError
|
||||
|
||||
def current_branch(self) -> Optional[str]:
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
raise NotImplementedError
|
||||
|
||||
def __len__(self) -> int:
|
||||
"""The number of rows in this Table"""
|
||||
return self.count_rows(None)
|
||||
|
||||
def add_computed_column(
|
||||
self,
|
||||
columns,
|
||||
fn,
|
||||
args: Optional[List[str]] = None,
|
||||
types=None,
|
||||
) -> None:
|
||||
"""Declare computed column(s) bound to a UDF -- no compute happens
|
||||
here (the agent fills them lazily, or refresh_column() triggers a run).
|
||||
|
||||
.. deprecated::
|
||||
A computed column is an expression over a registered function, so
|
||||
bind it as one: ``add_columns(computed={"vec": embed("data")})``.
|
||||
``embed("data")`` applies the function to the `data` column and
|
||||
infers the type from the function's return signature -- the
|
||||
function never couples to a particular column. Prefer that form.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
warnings.warn(
|
||||
"add_computed_column is deprecated; use add_columns(computed="
|
||||
'{"vec": embed("data")}).',
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
from .udf import Udf, struct_field_types
|
||||
|
||||
multi = isinstance(columns, (tuple, list))
|
||||
if isinstance(fn, Udf):
|
||||
expr = fn.expression(*(args or []))
|
||||
if types is None:
|
||||
if multi:
|
||||
if not fn.returns.upper().startswith("STRUCT"):
|
||||
raise ValueError(
|
||||
"several columns need a STRUCT-returning function"
|
||||
)
|
||||
types = struct_field_types(fn.returns)
|
||||
else:
|
||||
types = fn.returns
|
||||
else:
|
||||
if types is None:
|
||||
raise ValueError("pass types= when fn is a name string")
|
||||
expr = f"{fn}({', '.join(args or [])})"
|
||||
if multi:
|
||||
if len(types) != len(columns):
|
||||
raise ValueError(
|
||||
f"{len(columns)} columns but {len(types)} output types"
|
||||
)
|
||||
computed = {c: (t, expr) for c, t in zip(columns, types)}
|
||||
else:
|
||||
computed = {columns: (types, expr)}
|
||||
self.add_columns(computed=computed)
|
||||
|
||||
@property
|
||||
@abstractmethod
|
||||
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
|
||||
@@ -923,7 +1019,7 @@ class Table(ABC):
|
||||
config : IndexConfigType, optional
|
||||
The index configuration object. If provided, uses the new unified API.
|
||||
Can be one of: IvfFlat, IvfPq, IvfSq, IvfRq, HnswPq, HnswSq,
|
||||
BTree, Bitmap, LabelList, FTS.
|
||||
BTree, Bitmap, LabelList, Fm, FTS.
|
||||
replace : bool, default True
|
||||
Whether to replace an existing index on this column.
|
||||
wait_timeout : timedelta, optional
|
||||
@@ -1516,7 +1612,7 @@ class Table(ABC):
|
||||
) -> MergeResult: ...
|
||||
|
||||
@abstractmethod
|
||||
def delete(self, where: str) -> DeleteResult:
|
||||
def delete(self, where: Union[str, Expr]) -> DeleteResult:
|
||||
"""Delete rows from the table.
|
||||
|
||||
This can be used to delete a single row, many rows, all rows, or
|
||||
@@ -1524,10 +1620,10 @@ class Table(ABC):
|
||||
|
||||
Parameters
|
||||
----------
|
||||
where: str
|
||||
The SQL where clause to use when deleting rows.
|
||||
|
||||
- For example, 'x = 2' or 'x IN (1, 2, 3)'.
|
||||
where: str or :class:`~lancedb.expr.Expr`
|
||||
The filter condition. Can be a SQL string or a type-safe
|
||||
:class:`~lancedb.expr.Expr` built with :func:`~lancedb.expr.col`
|
||||
and :func:`~lancedb.expr.lit`.
|
||||
|
||||
The filter must not be empty, or it will error.
|
||||
|
||||
@@ -1997,6 +2093,7 @@ class LanceTable(Table):
|
||||
namespace_client: Optional[Any] = None,
|
||||
managed_versioning: Optional[bool] = None,
|
||||
pushdown_operations: Optional[set] = None,
|
||||
route_pushdown_to_rust: bool = False,
|
||||
_async: AsyncTable = None,
|
||||
):
|
||||
if namespace_path is None:
|
||||
@@ -2006,6 +2103,14 @@ class LanceTable(Table):
|
||||
self._location = location # Store location for use in _dataset_path
|
||||
self._namespace_client = namespace_client
|
||||
self._pushdown_operations = pushdown_operations or set()
|
||||
# When the connection built the namespace client natively (e.g. an
|
||||
# enterprise "rest" connection), the underlying Rust table already
|
||||
# executes QueryTable pushdown itself -- and, unlike this Python urllib3
|
||||
# path, it routes through the read-freshness context provider that emits
|
||||
# the ``x-lancedb-min-timestamp`` header. So we must defer pushdown to
|
||||
# Rust instead of calling the Python ``namespace_client.query_table``
|
||||
# directly, or reads silently bypass read-freshness (stale results).
|
||||
self._route_pushdown_to_rust = route_pushdown_to_rust
|
||||
if _async is not None:
|
||||
self._table = _async
|
||||
else:
|
||||
@@ -2106,22 +2211,27 @@ class LanceTable(Table):
|
||||
"Please install with `pip install pylance`."
|
||||
)
|
||||
|
||||
branch = self.current_branch()
|
||||
version = None if branch is not None else self.version
|
||||
if self._namespace_client is not None:
|
||||
table_id = self._namespace_path + [self.name]
|
||||
return lance.dataset(
|
||||
version=self.version,
|
||||
ds = lance.dataset(
|
||||
version=version,
|
||||
storage_options=self._conn.storage_options,
|
||||
namespace_client=self._namespace_client,
|
||||
table_id=table_id,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
return lance.dataset(
|
||||
self._dataset_path,
|
||||
version=self.version,
|
||||
storage_options=self._conn.storage_options,
|
||||
**kwargs,
|
||||
)
|
||||
else:
|
||||
ds = lance.dataset(
|
||||
self._dataset_path,
|
||||
version=version,
|
||||
storage_options=self._conn.storage_options,
|
||||
**kwargs,
|
||||
)
|
||||
if branch is not None:
|
||||
ds = ds.checkout_version((branch, self.version))
|
||||
return ds
|
||||
|
||||
@property
|
||||
def schema(self) -> pa.Schema:
|
||||
@@ -2187,6 +2297,35 @@ class LanceTable(Table):
|
||||
"""
|
||||
return Tags(self._table)
|
||||
|
||||
@property
|
||||
def branches(self) -> "Branches":
|
||||
"""Branch management for the table.
|
||||
|
||||
``create``/``checkout`` return a new table handle scoped to the branch;
|
||||
writes on it do not affect ``main``.
|
||||
"""
|
||||
return Branches(self)
|
||||
|
||||
def current_branch(self) -> Optional[str]:
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
return self._table.current_branch()
|
||||
|
||||
def _wrap_branch_handle(
|
||||
self, async_table: "AsyncTable", version: Optional[int] = None
|
||||
) -> "LanceTable":
|
||||
# version is unused locally: the pin already lives on async_table and a
|
||||
# local handle is not reopened via a serialized connection.
|
||||
return LanceTable(
|
||||
self._conn,
|
||||
async_table.name,
|
||||
namespace_path=self._namespace_path,
|
||||
namespace_client=self._namespace_client,
|
||||
pushdown_operations=self._pushdown_operations,
|
||||
route_pushdown_to_rust=self._route_pushdown_to_rust,
|
||||
location=self._location,
|
||||
_async=async_table,
|
||||
)
|
||||
|
||||
def checkout(self, version: Union[int, str]):
|
||||
"""Checkout a version of the table. This is an in-place operation.
|
||||
|
||||
@@ -2333,6 +2472,14 @@ class LanceTable(Table):
|
||||
Returns
|
||||
-------
|
||||
pa.Table"""
|
||||
if (
|
||||
_should_push_down_query_table(
|
||||
self._namespace_client, self._pushdown_operations
|
||||
)
|
||||
and not self._route_pushdown_to_rust
|
||||
):
|
||||
return self._execute_query(Query()).read_all()
|
||||
|
||||
return LOOP.run(self._table.to_arrow())
|
||||
|
||||
def to_polars(self, batch_size=None) -> "pl.LazyFrame":
|
||||
@@ -2449,7 +2596,7 @@ class LanceTable(Table):
|
||||
config : IndexConfigType, optional
|
||||
The index configuration object. If provided, uses the new unified API.
|
||||
Can be one of: IvfFlat, IvfPq, IvfSq, IvfRq, HnswPq, HnswSq,
|
||||
BTree, Bitmap, LabelList, FTS.
|
||||
BTree, Bitmap, LabelList, Fm, FTS.
|
||||
replace : bool, default True
|
||||
Whether to replace an existing index on this column.
|
||||
wait_timeout : timedelta, optional
|
||||
@@ -3281,6 +3428,7 @@ class LanceTable(Table):
|
||||
location: Optional[str] = None,
|
||||
namespace_client: Optional[Any] = None,
|
||||
pushdown_operations: Optional[set] = None,
|
||||
route_pushdown_to_rust: bool = False,
|
||||
):
|
||||
"""
|
||||
Create a new table.
|
||||
@@ -3343,21 +3491,24 @@ class LanceTable(Table):
|
||||
self._location = location
|
||||
self._namespace_client = namespace_client
|
||||
self._pushdown_operations = pushdown_operations or set()
|
||||
self._route_pushdown_to_rust = route_pushdown_to_rust
|
||||
|
||||
if data_storage_version is not None:
|
||||
warnings.warn(
|
||||
"setting data_storage_version directly on create_table is deprecated. ",
|
||||
"setting data_storage_version directly on create_table is deprecated. "
|
||||
"Use database_options instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
if storage_options is None:
|
||||
storage_options = {}
|
||||
storage_options["new_table_data_storage_version"] = data_storage_version
|
||||
if enable_v2_manifest_paths is not None:
|
||||
warnings.warn(
|
||||
"setting enable_v2_manifest_paths directly on create_table is ",
|
||||
"setting enable_v2_manifest_paths directly on create_table is "
|
||||
"deprecated. Use database_options instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
if storage_options is None:
|
||||
storage_options = {}
|
||||
@@ -3383,8 +3534,9 @@ class LanceTable(Table):
|
||||
)
|
||||
return self
|
||||
|
||||
def delete(self, where: str) -> DeleteResult:
|
||||
return LOOP.run(self._table.delete(where))
|
||||
def delete(self, where: Union[str, Expr]) -> DeleteResult:
|
||||
predicate = where._inner if isinstance(where, Expr) else where
|
||||
return LOOP.run(self._table.delete(predicate))
|
||||
|
||||
def update(
|
||||
self,
|
||||
@@ -3446,9 +3598,15 @@ class LanceTable(Table):
|
||||
batch_size: Optional[int] = None,
|
||||
timeout: Optional[timedelta] = None,
|
||||
) -> pa.RecordBatchReader:
|
||||
# Branch queries run locally: the server-side query protocol can't
|
||||
# carry a branch yet.
|
||||
# TODO: push down server-side once it can (with remote table support).
|
||||
if (
|
||||
"QueryTable" in self._pushdown_operations
|
||||
and self._namespace_client is not None
|
||||
_should_push_down_query_table(
|
||||
self._namespace_client, self._pushdown_operations
|
||||
)
|
||||
and not self._route_pushdown_to_rust
|
||||
and self.current_branch() is None
|
||||
):
|
||||
from lancedb.namespace import _execute_server_side_query
|
||||
|
||||
@@ -3623,9 +3781,68 @@ class LanceTable(Table):
|
||||
return LOOP.run(self._table.index_stats(index_name))
|
||||
|
||||
def add_columns(
|
||||
self, transforms: Dict[str, str] | pa.field | List[pa.field] | pa.Schema
|
||||
) -> AddColumnsResult:
|
||||
return LOOP.run(self._table.add_columns(transforms))
|
||||
self,
|
||||
transforms: Dict[str, str]
|
||||
| pa.field
|
||||
| List[pa.field]
|
||||
| pa.Schema
|
||||
| None = None,
|
||||
*,
|
||||
computed: Optional[Dict] = None,
|
||||
) -> Optional[AddColumnsResult]:
|
||||
result = None
|
||||
if transforms is not None:
|
||||
result = LOOP.run(self._table.add_columns(transforms))
|
||||
if computed:
|
||||
# computed binds an expression over a registered function to a
|
||||
# column: {col: fn("input_col")} -- fn("input_col") yields the
|
||||
# expression and carries the inferred type; a tuple key fans a
|
||||
# STRUCT return out to several columns. Declares the binding only;
|
||||
# the server fills the values (server-backed). The legacy
|
||||
# {col: (sql_type, expression)} tuple form is still accepted.
|
||||
result_unused = LOOP.run(self._table.add_columns(computed=computed))
|
||||
del result_unused
|
||||
return result
|
||||
|
||||
def refresh_column(
|
||||
self,
|
||||
columns,
|
||||
*,
|
||||
where: Optional[str] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> "JobHandle":
|
||||
"""Trigger recompute of computed columns (REFRESH COLUMN).
|
||||
|
||||
The expression is resolved server-side from each column's stored
|
||||
binding; columns bound to the same struct-returning function
|
||||
refresh together. Returns a `JobHandle` to wait on, poll, or cancel
|
||||
(``tbl.refresh_column("col").wait()``) -- mirrors
|
||||
`MaterializedView.refresh()`. Server-backed feature (LanceDB
|
||||
Enterprise / Cloud).
|
||||
|
||||
num_workers / max_workers / batch_size / priority are per-refresh
|
||||
scheduling knobs (how to run THIS refresh) and override any default
|
||||
the function carries. `priority` is a Kueue tier
|
||||
(training | interactive | backfill).
|
||||
"""
|
||||
from .udf import JobHandle
|
||||
|
||||
if isinstance(columns, str):
|
||||
columns = [columns]
|
||||
job_id = LOOP.run(
|
||||
self._table.refresh_column(
|
||||
list(columns),
|
||||
where=where,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
priority=priority,
|
||||
)
|
||||
)
|
||||
return JobHandle(self._conn, job_id, table=self.name)
|
||||
|
||||
def alter_columns(
|
||||
self, *alterations: Iterable[Dict[str, str]]
|
||||
@@ -4182,7 +4399,15 @@ class AsyncTable:
|
||||
[AsyncTable.create_index][lancedb.table.AsyncTable.create_index].
|
||||
"""
|
||||
|
||||
def __init__(self, table: LanceDBTable):
|
||||
def __init__(
|
||||
self,
|
||||
table: LanceDBTable,
|
||||
*,
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
namespace_client: Optional[Any] = None,
|
||||
pushdown_operations: Optional[set] = None,
|
||||
route_pushdown_to_rust: bool = False,
|
||||
):
|
||||
"""Create a new AsyncTable object.
|
||||
|
||||
You should not create AsyncTable objects directly.
|
||||
@@ -4191,6 +4416,26 @@ class AsyncTable:
|
||||
[AsyncConnection.open_table][lancedb.AsyncConnection.open_table] to obtain
|
||||
Table objects."""
|
||||
self._inner = table
|
||||
self._namespace_path = namespace_path or []
|
||||
self._namespace_client = namespace_client
|
||||
self._pushdown_operations = pushdown_operations or set()
|
||||
# See LanceTable.__init__: defer QueryTable pushdown to Rust (which emits
|
||||
# the read-freshness header) for natively-built namespace clients.
|
||||
self._route_pushdown_to_rust = route_pushdown_to_rust
|
||||
|
||||
def _set_namespace_context(
|
||||
self,
|
||||
*,
|
||||
namespace_path: Optional[List[str]] = None,
|
||||
namespace_client: Optional[Any] = None,
|
||||
pushdown_operations: Optional[set] = None,
|
||||
route_pushdown_to_rust: bool = False,
|
||||
) -> "AsyncTable":
|
||||
self._namespace_path = namespace_path or []
|
||||
self._namespace_client = namespace_client
|
||||
self._pushdown_operations = pushdown_operations or set()
|
||||
self._route_pushdown_to_rust = route_pushdown_to_rust
|
||||
return self
|
||||
|
||||
def __repr__(self):
|
||||
return self._inner.__repr__()
|
||||
@@ -4353,12 +4598,20 @@ class AsyncTable:
|
||||
"Please install with `pip install pylance`."
|
||||
)
|
||||
|
||||
return lance.dataset(
|
||||
# lance.dataset() can't open a branch directly, so open the base table
|
||||
# and check out the branch ref (a None branch resolves to main).
|
||||
branch = self.current_branch()
|
||||
table_version = await self.version()
|
||||
version = None if branch is not None else table_version
|
||||
ds = lance.dataset(
|
||||
await self.uri(),
|
||||
version=await self.version(),
|
||||
version=version,
|
||||
storage_options=await self.latest_storage_options(),
|
||||
**kwargs,
|
||||
)
|
||||
if branch is not None:
|
||||
ds = ds.checkout_version((branch, table_version))
|
||||
return ds
|
||||
|
||||
async def to_pandas(self, blob_mode: BlobMode = "lazy", **kwargs) -> "pd.DataFrame":
|
||||
"""Return the table as a pandas DataFrame.
|
||||
@@ -4391,6 +4644,14 @@ class AsyncTable:
|
||||
-------
|
||||
pa.Table
|
||||
"""
|
||||
if (
|
||||
_should_push_down_query_table(
|
||||
self._namespace_client, self._pushdown_operations
|
||||
)
|
||||
and not self._route_pushdown_to_rust
|
||||
):
|
||||
return (await self._execute_query(Query())).read_all()
|
||||
|
||||
return await self.query().to_arrow()
|
||||
|
||||
async def create_index(
|
||||
@@ -4409,6 +4670,7 @@ class AsyncTable:
|
||||
BTree,
|
||||
Bitmap,
|
||||
LabelList,
|
||||
Fm,
|
||||
FTS,
|
||||
]
|
||||
] = None,
|
||||
@@ -4461,12 +4723,14 @@ class AsyncTable:
|
||||
BTree,
|
||||
Bitmap,
|
||||
LabelList,
|
||||
Fm,
|
||||
FTS,
|
||||
),
|
||||
):
|
||||
raise TypeError(
|
||||
"config must be an instance of IvfSq, IvfPq, IvfRq, HnswPq, HnswSq,"
|
||||
" BTree, Bitmap, LabelList, or FTS, but got " + str(type(config))
|
||||
" BTree, Bitmap, LabelList, Fm, or FTS, but got "
|
||||
+ str(type(config))
|
||||
)
|
||||
try:
|
||||
await self._inner.create_index(
|
||||
@@ -4908,10 +5172,13 @@ class AsyncTable:
|
||||
if embedding is not None:
|
||||
loop = asyncio.get_running_loop()
|
||||
# This function is likely to block, since it either calls an expensive
|
||||
# function or makes an HTTP request to an embeddings REST API.
|
||||
# function or makes an HTTP request to an embeddings REST API. Run it
|
||||
# on a dedicated executor so it can't starve the default executor that
|
||||
# other blocking I/O shares. See
|
||||
# https://github.com/lancedb/lancedb/issues/3310.
|
||||
return (
|
||||
await loop.run_in_executor(
|
||||
None,
|
||||
embedding_executor(),
|
||||
embedding.function.compute_query_embeddings_with_retry,
|
||||
query,
|
||||
)
|
||||
@@ -5065,6 +5332,17 @@ class AsyncTable:
|
||||
batch_size: Optional[int] = None,
|
||||
timeout: Optional[timedelta] = None,
|
||||
) -> pa.RecordBatchReader:
|
||||
if (
|
||||
_should_push_down_query_table(
|
||||
self._namespace_client, self._pushdown_operations
|
||||
)
|
||||
and not self._route_pushdown_to_rust
|
||||
):
|
||||
from lancedb.namespace import _execute_server_side_query
|
||||
|
||||
table_id = self._namespace_path + [self.name]
|
||||
return _execute_server_side_query(self._namespace_client, table_id, query)
|
||||
|
||||
# The sync table calls into this method, so we need to map the
|
||||
# query to the async version of the query and run that here. This is only
|
||||
# used for that code path right now.
|
||||
@@ -5120,6 +5398,7 @@ class AsyncTable:
|
||||
when_not_matched_insert_all=merge._when_not_matched_insert_all,
|
||||
when_not_matched_by_source_delete=merge._when_not_matched_by_source_delete,
|
||||
when_not_matched_by_source_condition=merge._when_not_matched_by_source_condition,
|
||||
when_not_matched_by_source_condition_expr=merge._when_not_matched_by_source_condition_expr,
|
||||
timeout=merge._timeout,
|
||||
use_index=merge._use_index,
|
||||
use_lsm_write=merge._use_lsm_write,
|
||||
@@ -5127,7 +5406,7 @@ class AsyncTable:
|
||||
),
|
||||
)
|
||||
|
||||
async def delete(self, where: str) -> DeleteResult:
|
||||
async def delete(self, where: Union[str, Expr]) -> DeleteResult:
|
||||
"""Delete rows from the table.
|
||||
|
||||
This can be used to delete a single row, many rows, all rows, or
|
||||
@@ -5135,10 +5414,10 @@ class AsyncTable:
|
||||
|
||||
Parameters
|
||||
----------
|
||||
where: str
|
||||
The SQL where clause to use when deleting rows.
|
||||
|
||||
- For example, 'x = 2' or 'x IN (1, 2, 3)'.
|
||||
where: str or :class:`~lancedb.expr.Expr`
|
||||
The filter condition. Can be a SQL string or a type-safe
|
||||
:class:`~lancedb.expr.Expr` built with :func:`~lancedb.expr.col`
|
||||
and :func:`~lancedb.expr.lit`.
|
||||
|
||||
The filter must not be empty, or it will error.
|
||||
|
||||
@@ -5177,7 +5456,8 @@ class AsyncTable:
|
||||
x vector
|
||||
0 3 [5.0, 6.0]
|
||||
"""
|
||||
return await self._inner.delete(where)
|
||||
predicate = where._inner if isinstance(where, Expr) else where
|
||||
return await self._inner.delete(predicate)
|
||||
|
||||
async def update(
|
||||
self,
|
||||
@@ -5240,9 +5520,44 @@ class AsyncTable:
|
||||
|
||||
return await self._inner.update(updates_sql, where)
|
||||
|
||||
async def refresh_column(
|
||||
self,
|
||||
columns,
|
||||
*,
|
||||
where: Optional[str] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Trigger recompute of computed columns (REFRESH COLUMN).
|
||||
Returns the refresh job id. Server-backed feature.
|
||||
|
||||
num_workers / max_workers / batch_size / priority are per-refresh
|
||||
scheduling knobs (how to run THIS refresh); they override any default
|
||||
the function carries. `priority` is a Kueue tier
|
||||
(training | interactive | backfill)."""
|
||||
if isinstance(columns, str):
|
||||
columns = [columns]
|
||||
return await self._inner.refresh_column(
|
||||
list(columns),
|
||||
where_clause=where,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
priority=priority,
|
||||
)
|
||||
|
||||
async def add_columns(
|
||||
self, transforms: dict[str, str] | pa.field | List[pa.field] | pa.Schema
|
||||
) -> AddColumnsResult:
|
||||
self,
|
||||
transforms: dict[str, str]
|
||||
| pa.field
|
||||
| List[pa.field]
|
||||
| pa.Schema
|
||||
| None = None,
|
||||
*,
|
||||
computed: Optional[Dict] = None,
|
||||
) -> Optional[AddColumnsResult]:
|
||||
"""
|
||||
Add new columns with defined values.
|
||||
|
||||
@@ -5261,6 +5576,7 @@ class AsyncTable:
|
||||
version: the new version number of the table after adding columns.
|
||||
|
||||
"""
|
||||
result = None
|
||||
if isinstance(transforms, pa.Field):
|
||||
transforms = [transforms]
|
||||
if isinstance(transforms, list) and all(
|
||||
@@ -5268,9 +5584,69 @@ class AsyncTable:
|
||||
):
|
||||
transforms = pa.schema(transforms)
|
||||
if isinstance(transforms, pa.Schema):
|
||||
return await self._inner.add_columns_with_schema(transforms)
|
||||
result = await self._inner.add_columns_with_schema(transforms)
|
||||
elif transforms is not None:
|
||||
result = await self._inner.add_columns(list(transforms.items()))
|
||||
if computed:
|
||||
# computed binds an expression over a registered function to a
|
||||
# column: {col: fn("input_col")} -- fn("input_col") yields the
|
||||
# expression and carries the inferred type; a tuple key fans a
|
||||
# STRUCT return out to several columns. Declares the binding only;
|
||||
# the server fills the values (server-backed). The legacy
|
||||
# {col: (sql_type, expression)} tuple form is still accepted.
|
||||
for expression, cols in _computed_groups(computed):
|
||||
await self._inner.add_computed_columns(cols, expression)
|
||||
return result
|
||||
|
||||
async def add_computed_column(
|
||||
self,
|
||||
columns,
|
||||
fn,
|
||||
args: Optional[List[str]] = None,
|
||||
types=None,
|
||||
) -> None:
|
||||
"""Declare computed column(s) bound to a UDF (async).
|
||||
|
||||
.. deprecated::
|
||||
Use ``add_columns(computed={"col": fn("input_col")})`` -- a computed
|
||||
column is an expression over a registered function, so bind it that
|
||||
way instead of coupling the UDF to the column here.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
warnings.warn(
|
||||
"add_computed_column is deprecated; use add_columns(computed="
|
||||
'{"col": fn("input_col")}).',
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
from .udf import Udf, struct_field_types
|
||||
|
||||
multi = isinstance(columns, (tuple, list))
|
||||
if isinstance(fn, Udf):
|
||||
expr = fn.expression(*(args or []))
|
||||
if types is None:
|
||||
if multi:
|
||||
if not fn.returns.upper().startswith("STRUCT"):
|
||||
raise ValueError(
|
||||
"several columns need a STRUCT-returning function"
|
||||
)
|
||||
types = struct_field_types(fn.returns)
|
||||
else:
|
||||
types = fn.returns
|
||||
else:
|
||||
return await self._inner.add_columns(list(transforms.items()))
|
||||
if types is None:
|
||||
raise ValueError("pass types= when fn is a name string")
|
||||
expr = f"{fn}({', '.join(args or [])})"
|
||||
if multi:
|
||||
if len(types) != len(columns):
|
||||
raise ValueError(
|
||||
f"{len(columns)} columns but {len(types)} output types"
|
||||
)
|
||||
computed = {c: (t, expr) for c, t in zip(columns, types)}
|
||||
else:
|
||||
computed = {columns: (types, expr)}
|
||||
await self.add_columns(computed=computed)
|
||||
|
||||
async def alter_columns(
|
||||
self, *alterations: Iterable[dict[str, Any]]
|
||||
@@ -5473,6 +5849,19 @@ class AsyncTable:
|
||||
"""
|
||||
return AsyncTags(self._inner)
|
||||
|
||||
@property
|
||||
def branches(self) -> AsyncBranches:
|
||||
"""Branch management for the table.
|
||||
|
||||
Branches are isolated, writable lines of history forked from another
|
||||
branch (or version). Writes on a branch do not affect ``main``.
|
||||
"""
|
||||
return AsyncBranches(self._inner)
|
||||
|
||||
def current_branch(self) -> Optional[str]:
|
||||
"""The branch this table handle is scoped to, or ``None`` for ``main``."""
|
||||
return self._inner.current_branch()
|
||||
|
||||
async def optimize(
|
||||
self,
|
||||
*,
|
||||
@@ -5529,6 +5918,7 @@ class AsyncTable:
|
||||
"The 'retrain' parameter is deprecated and will be removed in a "
|
||||
"future version.",
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
|
||||
return await self._inner.optimize(
|
||||
@@ -5634,8 +6024,6 @@ class IndexStatistics:
|
||||
The distance type used by the index.
|
||||
num_indices: Optional[int]
|
||||
The number of parts the index is split into.
|
||||
loss: Optional[float]
|
||||
The KMeans loss for the index, for only vector indices.
|
||||
"""
|
||||
|
||||
num_indexed_rows: int
|
||||
@@ -5655,7 +6043,6 @@ class IndexStatistics:
|
||||
]
|
||||
distance_type: Optional[Literal["l2", "cosine", "dot"]] = None
|
||||
num_indices: Optional[int] = None
|
||||
loss: Optional[float] = None
|
||||
|
||||
# This exists for backwards compatibility with an older API, which returned
|
||||
# a dictionary instead of a class.
|
||||
@@ -5808,6 +6195,69 @@ class Tags:
|
||||
LOOP.run(self._table.tags.update(tag, version))
|
||||
|
||||
|
||||
class Branches:
|
||||
"""
|
||||
Table branch manager.
|
||||
"""
|
||||
|
||||
def __init__(self, parent: "LanceTable"):
|
||||
self._parent = parent
|
||||
self._table = parent._table
|
||||
|
||||
def list(self) -> Dict[str, Any]:
|
||||
"""List all branches, mapping name to branch metadata."""
|
||||
return LOOP.run(self._table.branches.list())
|
||||
|
||||
def create(
|
||||
self,
|
||||
name: str,
|
||||
from_ref: Optional[str] = None,
|
||||
from_version: Optional[int] = None,
|
||||
) -> "Table":
|
||||
"""Create a branch and return a handle scoped to it.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
name: str
|
||||
Name of the new branch.
|
||||
from_ref: str, optional
|
||||
Source branch to fork from. Defaults to ``main``.
|
||||
from_version: int, optional
|
||||
A specific version on ``from_ref`` to fork from. Defaults to latest.
|
||||
"""
|
||||
async_table = LOOP.run(
|
||||
self._table.branches.create(name, from_ref, from_version)
|
||||
)
|
||||
return self._wrap(async_table)
|
||||
|
||||
def checkout(self, name: str, version: Optional[int] = None) -> "Table":
|
||||
"""Check out an existing branch and return a handle scoped to it.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
name: str
|
||||
Name of the branch to check out.
|
||||
version: int, optional
|
||||
A specific version on the branch to pin. When set, the returned
|
||||
handle is a read-only view of that version; when omitted it tracks
|
||||
the branch's latest and stays writable.
|
||||
"""
|
||||
async_table = LOOP.run(self._table.branches.checkout(name, version))
|
||||
return self._wrap(async_table, version)
|
||||
|
||||
def delete(self, name: str) -> None:
|
||||
"""Delete a branch."""
|
||||
LOOP.run(self._table.branches.delete(name))
|
||||
|
||||
def _wrap(
|
||||
self, async_table: "AsyncTable", version: Optional[int] = None
|
||||
) -> "Table":
|
||||
# Delegate to the parent so the branch handle keeps its concrete type
|
||||
# (LanceTable / RemoteTable) and connection context; `version` is the
|
||||
# explicit pin so a remote handle can restore branch+version on reopen.
|
||||
return self._parent._wrap_branch_handle(async_table, version)
|
||||
|
||||
|
||||
class AsyncTags:
|
||||
"""
|
||||
Async table tag manager.
|
||||
@@ -5875,3 +6325,56 @@ class AsyncTags:
|
||||
The new table version to tag.
|
||||
"""
|
||||
await self._table.tags.update(tag, version)
|
||||
|
||||
|
||||
class AsyncBranches:
|
||||
"""Async table branch manager."""
|
||||
|
||||
def __init__(self, table):
|
||||
self._table = table
|
||||
|
||||
async def list(self) -> Dict[str, Any]:
|
||||
"""List all branches, mapping name to branch metadata."""
|
||||
return await self._table.branches.list()
|
||||
|
||||
async def create(
|
||||
self,
|
||||
name: str,
|
||||
from_ref: Optional[str] = None,
|
||||
from_version: Optional[int] = None,
|
||||
) -> "AsyncTable":
|
||||
"""Create a branch and return a handle scoped to it.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
name: str
|
||||
Name of the new branch.
|
||||
from_ref: str, optional
|
||||
Source branch to fork from. Defaults to ``main``.
|
||||
from_version: int, optional
|
||||
A specific version on ``from_ref`` to fork from. Defaults to latest.
|
||||
"""
|
||||
# "main" and None are two spellings of the root branch in lance; normalize
|
||||
# so from_ref="main" behaves identically to the default.
|
||||
if from_ref == "main":
|
||||
from_ref = None
|
||||
inner = await self._table.branches.create(name, from_ref, from_version)
|
||||
return AsyncTable(inner)
|
||||
|
||||
async def checkout(self, name: str, version: Optional[int] = None) -> "AsyncTable":
|
||||
"""Check out an existing branch and return a handle scoped to it.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
name: str
|
||||
Name of the branch to check out.
|
||||
version: int, optional
|
||||
A specific version on the branch to pin. When set, the returned
|
||||
handle is a read-only view of that version; when omitted it tracks
|
||||
the branch's latest and stays writable.
|
||||
"""
|
||||
return AsyncTable(await self._table.branches.checkout(name, version))
|
||||
|
||||
async def delete(self, name: str) -> None:
|
||||
"""Delete a branch."""
|
||||
await self._table.branches.delete(name)
|
||||
|
||||
753
python/python/lancedb/udf.py
Normal file
@@ -0,0 +1,753 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""UDF authoring for LanceDB derived compute (server-backed).

`@udf` / `@table_udf` turn a plain Python function into a registrable
server-side UDF: a cloudpickled (or source) body, a SQL signature inferred
from type hints, and the runtime options (pip deps, GPUs, batching, ...).
Register and use them through the existing connection/table API:

    import lancedb
    from lancedb import udf, table_udf

    db = lancedb.connect("db://my_db", api_key="...", host_override="...")

    @udf(pip=["torch>=2.0"], num_gpus=1)
    def embed(text: str) -> list[float]:
        return model.encode(text).tolist()

    db.create_function(embed)  # CREATE FUNCTION (once)
    tbl = db.open_table("docs")
    tbl.add_columns(computed={"vec": embed("text")})  # bind embed(text) -> vec
    tbl.refresh_column("vec").wait()  # materialize (returns a JobHandle)
    view = db.create_materialized_view("chunks", tbl, ["id", chunk_fn])

`embed("text")` applies the registered function to the `text` column and yields
the expression `embed(text)`; the function itself stays decoupled from any
column, so the same `embed` works on any column or table.

These operations are server-backed (LanceDB Enterprise / Cloud); the
decorator itself works locally (define + call), only registration needs a
remote connection.
"""

from __future__ import annotations

import asyncio
import base64
import dataclasses
import functools
import inspect
import re
import sys
import textwrap
import time
import typing

# -- type hints -> SQL type strings -------------------------------------

_SCALARS = {
    int: "BIGINT",
    # Pragmatic default for ML workloads: python float maps to FLOAT
    # (Float32). Use an explicit `returns=` for DOUBLE.
    float: "FLOAT",
    str: "VARCHAR",
    bool: "BOOLEAN",
    bytes: "BLOB",
}


class TypeInferenceError(TypeError):
    pass


def sql_type(hint) -> str:
    """SQL type string for a python type hint."""
    if hint in _SCALARS:
        return _SCALARS[hint]
    origin = typing.get_origin(hint)
    if origin in (list, typing.List):
        (item,) = typing.get_args(hint) or (None,)
        if item in _SCALARS:
            return f"{_SCALARS[item]}[]"
        raise TypeInferenceError(
            f"unsupported list item type {item!r}; use an explicit returns="
        )
    fields = _struct_fields(hint)
    if fields is not None:
        inner = ", ".join(f"{name} {sql_type(h)}" for name, h in fields)
        return f"STRUCT({inner})"
    raise TypeInferenceError(
        f"cannot infer a SQL type for {hint!r}; pass an explicit type string"
    )


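The hint-to-SQL mapping above can be illustrated with a stand-alone sketch. It is a simplified re-implementation for illustration only, not the module's code: `SCALARS` and `sketch_sql_type` are hypothetical names, the scalar table is copied from `_SCALARS`, and struct-shaped hints are left out.

```python
import typing

# Assumed stand-in for the module's _SCALARS table (same entries).
SCALARS = {
    int: "BIGINT",
    float: "FLOAT",
    str: "VARCHAR",
    bool: "BOOLEAN",
    bytes: "BLOB",
}


def sketch_sql_type(hint) -> str:
    """Map a scalar or list-of-scalar type hint to a SQL type string."""
    if hint in SCALARS:
        return SCALARS[hint]
    # list[float] and typing.List[float] both report `list` as their origin.
    if typing.get_origin(hint) is list:
        (item,) = typing.get_args(hint) or (None,)
        if item in SCALARS:
            return f"{SCALARS[item]}[]"
    raise TypeError(f"cannot infer a SQL type for {hint!r}")


scalar = sketch_sql_type(int)
vector = sketch_sql_type(list[float])
legacy = sketch_sql_type(typing.List[str])
```

Under these assumptions a Python `float` becomes `FLOAT` rather than `DOUBLE`, which matches the comment in `_SCALARS`; a caller who needs `DOUBLE` would pass an explicit `returns=`.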
def _struct_fields(hint):
    """(name, hint) pairs for a TypedDict or dataclass, else None."""
    if dataclasses.is_dataclass(hint):
        return [(f.name, f.type) for f in dataclasses.fields(hint)]
    # TypedDict detection: a dict subclass with __annotations__.
    if (
        isinstance(hint, type)
        and issubclass(hint, dict)
        and typing.get_type_hints(hint)
    ):
        return list(typing.get_type_hints(hint).items())
    return None


def return_type(fn, override: "str | None", table: bool) -> str:
    """SQL return type for a function: explicit override wins, else the
    return annotation. Table functions render as TABLE(...) and accept
    struct-shaped hints (TypedDict/dataclass, optionally list-wrapped)."""
    if override is not None:
        s = override.strip()
        if table and not s.upper().startswith("TABLE"):
            if s.upper().startswith("STRUCT"):
                return "TABLE" + s[len("STRUCT") :]
            raise TypeInferenceError(
                "a table function's returns= must be TABLE(...) or STRUCT(...)"
            )
        return s

    hints = typing.get_type_hints(fn)
    ret = hints.get("return")
    if ret is None:
        raise TypeInferenceError(
            f"function {fn.__name__!r} needs a return annotation or returns="
        )
    if table:
        # Accept list[Row] / Row where Row is a TypedDict or dataclass.
        if typing.get_origin(ret) in (list, typing.List):
            (ret,) = typing.get_args(ret)
        fields = _struct_fields(ret)
        if fields is None:
            raise TypeInferenceError(
                "a table function must return rows shaped as a TypedDict or "
                "dataclass (optionally list-wrapped); or pass returns=..."
            )
        inner = ", ".join(f"{name} {sql_type(h)}" for name, h in fields)
        return f"TABLE({inner})"
    return sql_type(ret)


def param_types(fn) -> "list[tuple[str, str]]":
    """(name, sql type) per parameter, from annotations. Each UDF
    parameter binds to a source column of the same name by default."""
    hints = typing.get_type_hints(fn)
    out = []
    for name, p in inspect.signature(fn).parameters.items():
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            raise TypeInferenceError("*args/**kwargs are not supported in UDFs")
        hint = hints.get(name)
        if hint is None:
            raise TypeInferenceError(
                f"parameter {name!r} of {fn.__name__!r} needs a type annotation"
            )
        out.append((name, sql_type(hint)))
    return out


# -- column expressions -------------------------------------------------


class ColumnExpr(str):
    """A computed-column expression produced by applying a registered
    function to column names, e.g. ``embed("data") -> "embed(data)"``.

    It IS the expression string everywhere a string is expected (views, SQL,
    logging), and additionally carries the function's declared return type so
    ``add_columns(computed=...)`` can declare the column without a hand-written
    type. ``field_types`` holds the per-field SQL types of a STRUCT return, for
    fanning one expression out to several columns.
    """

    data_type: "str | None"
    field_types: "list[str] | None"

    def __new__(cls, expr: str, data_type=None, field_types=None):
        obj = super().__new__(cls, expr)
        obj.data_type = data_type
        obj.field_types = field_types
        return obj


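The "string that also carries a type" idea behind `ColumnExpr` can be sketched in isolation. `TypedExpr` below is a hypothetical stand-in written for illustration; it assumes nothing about LanceDB beyond the pattern shown in the class above.

```python
class TypedExpr(str):
    """Hypothetical stand-in for ColumnExpr: a str with a SQL type attached."""

    def __new__(cls, expr: str, data_type=None):
        # str is immutable, so extra attributes are attached in __new__.
        obj = super().__new__(cls, expr)
        obj.data_type = data_type
        return obj


expr = TypedExpr("embed(text)", data_type="FLOAT[]")

# The object behaves as a plain string wherever one is expected.
query = f"SELECT {expr} AS vec FROM docs"

# Derived strings are plain str instances and no longer carry the type.
upper = expr.upper()
```

One consequence of this design is that the metadata survives only on the original object: any string operation returns a plain `str`, so a consumer that needs the type should read it before transforming the expression.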
def _normalize_computed(computed: dict) -> dict:
    """Normalize the user-facing ``computed=`` mapping to the canonical
    ``{name: (sql_type, expression)}`` form.

    Accepts, per entry:
      - value is a `ColumnExpr` (from ``fn("col")``): the column's SQL type
        comes from the function's return type -- no hand-written type needed. A
        tuple key (``("chunk", "idx")``) fans a STRUCT return out to one
        (type, expression) entry per field, in declared order.
      - value is a legacy ``(sql_type, expression)`` tuple: passed through (the
        escape hatch, e.g. bare-name function strings).
    """
    out: dict = {}
    for key, val in computed.items():
        if isinstance(val, ColumnExpr):
            expr = str(val)
            if isinstance(key, (tuple, list)):
                if not val.field_types:
                    raise ValueError(
                        f"columns {tuple(key)} need a STRUCT-returning function; "
                        f"{expr} returns a single value"
                    )
                if len(val.field_types) != len(key):
                    raise ValueError(
                        f"{len(key)} columns but {len(val.field_types)} struct fields "
                        f"in {expr}"
                    )
                for name, t in zip(key, val.field_types):
                    out[name] = (t, expr)
            else:
                if val.data_type is None:
                    raise ValueError(f"cannot infer a type for {expr}; pass types=")
                out[key] = (val.data_type, expr)
        else:
            out[key] = val
    return out


# -- the @udf / @table_udf decorators -----------------------------------


class Udf:
    def __init__(
        self,
        fn,
        *,
        returns: "str | None" = None,
        table: bool = False,
        name: "str | None" = None,
        pip: "list[str] | None" = None,
        pip_index_url: "str | None" = None,
        pip_extra_index_urls: "list[str] | None" = None,
        find_links: "list[str] | None" = None,
        requirements: "str | list[str] | None" = None,
        conda: "list[str] | None" = None,
        conda_channels: "list[str] | None" = None,
        env: "dict[str, str] | list[str] | None" = None,
        num_cpus: "int | None" = None,
        num_gpus: "int | None" = None,
        batch_size: "int | None" = None,
        timeout: "float | None" = None,
        error_policy: "str | None" = None,
        max_skip_ratio: "float | None" = None,
        retries: "int | None" = None,
        docker_image: "str | None" = None,
        description: "str | None" = None,
        prefer_source: bool = False,
    ):
        functools.update_wrapper(self, fn)
        self.fn = fn
        self.name = name or fn.__name__
        self.table = table
        self.params = param_types(fn)
        self.returns = return_type(fn, returns, table)
        self.prefer_source = prefer_source
        self.options: "dict[str, str]" = {}
        if conda and (pip or requirements):
            raise ValueError("pass conda or pip/requirements, not both")
        if conda_channels and not conda:
            raise ValueError("conda_channels requires conda")
        if pip:
            self.options["pip"] = ",".join(pip)
        if pip_extra_index_urls:
            self.options["pip_extra_index_urls"] = ",".join(pip_extra_index_urls)
        if find_links:
            self.options["find_links"] = ",".join(find_links)
        if requirements:
            self.options["requirements"] = _format_requirements(requirements)
        if conda:
            self.options["conda"] = ",".join(conda)
        if conda_channels:
            self.options["conda_channels"] = ",".join(conda_channels)
        if env:
            self.options["env"] = _format_env(env)
        for key, val in [
            ("pip_index_url", pip_index_url),
            ("num_cpus", num_cpus),
            ("num_gpus", num_gpus),
            ("batch_size", batch_size),
            ("timeout", timeout),
            ("error_policy", error_policy),
            ("max_skip_ratio", max_skip_ratio),
            ("retries", retries),
            ("docker_image", docker_image),
        ]:
            if val is not None:
                self.options[key] = str(val)
        # Keep the source in the description (when available) so the
        # catalog stays inspectable even for pickled bodies.
        if description is not None:
            self.options["description"] = description
        else:
            try:
                self.options["description"] = textwrap.dedent(inspect.getsource(fn))
            except (OSError, TypeError):
                pass

    def __call__(self, *args, **kwargs):
        """Call with real values to run locally; call with column-name
        strings to build an expression for backfills and views, e.g.
        ``embed("data")`` -> the expression ``embed(data)`` (a `ColumnExpr`
        carrying the function's return type for `add_columns(computed=...)`).

        Any call whose positional arguments are all strings (and that passes
        no keyword arguments) builds an expression; to run the function
        locally on string values, pass them as keyword arguments or call
        ``self.fn`` directly."""
        if args and all(isinstance(a, str) for a in args) and not kwargs:
            return self.expression(*args)
        return self.fn(*args, **kwargs)

    def expression(self, *columns: str) -> ColumnExpr:
        """The expression applying this function to `columns` (default: the
        function's own parameter names). Returns a `ColumnExpr` -- a string
        that also carries the declared return type (and struct field types)."""
        cols = columns or [p for p, _ in self.params]
        expr = f"{self.name}({', '.join(cols)})"
        field_types = None
        if self.returns.upper().startswith("STRUCT"):
            field_types = struct_field_types(self.returns)
        return ColumnExpr(expr, data_type=self.returns, field_types=field_types)

    def _body(self) -> "tuple[str, str]":
        """(body literal, body_format). Source when requested and
        retrievable; cloudpickle otherwise (handles closures)."""
        if self.prefer_source:
            try:
                src = textwrap.dedent(inspect.getsource(self.fn))
                # Strip the decorator line(s) so the stored body is a
                # plain function definition.
                lines = src.splitlines(keepends=True)
                while lines and lines[0].lstrip().startswith("@"):
                    lines.pop(0)
                return "".join(lines), "source"
            except (OSError, TypeError):
                pass
        import cloudpickle

        raw = cloudpickle.dumps(self.fn)
        return base64.b64encode(raw).decode("ascii"), "cloudpickle"

    def _body_and_options(self) -> "tuple[str, dict[str, str]]":
        """The body literal plus the finalized options (body_format /
        python_version / cloudpickle-pip bookkeeping for a non-source
        body)."""
        body, body_format = self._body()
        options = dict(self.options)
        if body_format != "source":
            options["body_format"] = body_format
            # Pickled code objects only load under the same interpreter
            # minor version; record ours so the worker can fail with a
            # clear message instead of a bytecode error.
            options["python_version"] = self.pickle_environment()
            # The worker deserializes the body with cloudpickle; make sure
            # the job's pip environment provides it. Conda bakes inject
            # cloudpickle server-side, so do not create an invalid pip+conda
            # declaration here.
            if "conda" not in options:
                pip = [d for d in options.get("pip", "").split(",") if d]
                if not any(d.startswith("cloudpickle") for d in pip):
                    pip.append("cloudpickle")
                    options["pip"] = ",".join(pip)
        return body, options

    def create_request(self) -> dict:
        """Keyword arguments for `connection.create_function`."""
        body, options = self._body_and_options()
        return {
            "name": self.name,
            "language": "python",
            "return_type": self.returns,
            "body": body,
            "options": options,
        }

    def create_statement(self) -> str:
        """The equivalent `CREATE FUNCTION` SQL (for SQL-surface callers)."""
        params = ", ".join(f"{n} {t}" for n, t in self.params)
        body, options = self._body_and_options()
        with_clause = ""
        if options:
            rendered = ", ".join(
                f"{k} = '{_escape(v)}'" for k, v in sorted(options.items())
            )
            with_clause = f" WITH ({rendered})"
        return (
            f"CREATE FUNCTION {self.name}({params}) RETURNS {self.returns} "
            f"LANGUAGE python AS '{_escape_body(body)}'{with_clause}"
        )

    def pickle_environment(self) -> str:
        """Python version the body pickles under -- workers should match
        the minor version for cloudpickle compatibility."""
        return f"{sys.version_info.major}.{sys.version_info.minor}"


def _escape(s: str) -> str:
    return str(s).replace("'", "''")


def _format_requirements(requirements: "str | list[str]") -> str:
    if isinstance(requirements, str):
        return requirements
    return "\n".join(str(req) for req in requirements)


def _format_env(env: "dict[str, str] | list[str]") -> str:
    if isinstance(env, dict):
        return "; ".join(f"{key}={value}" for key, value in env.items())
    return "; ".join(str(entry) for entry in env)


def _escape_body(body: str) -> str:
    # The server unescapes \n / \t in single-quoted bodies; encode real
    # newlines accordingly and escape quotes.
    return (
        body.replace("\\", "\\\\")
        .replace("'", "''")
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


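The ordering of the replacements in `_escape_body` matters, which a small stand-alone copy can show. `escape_body` below mirrors the function above under a different, hypothetical name; how the server decodes the literal is not shown in this diff, so only the encoding side is exercised.

```python
def escape_body(body: str) -> str:
    """Encode a function body for a single-quoted SQL literal."""
    return (
        body.replace("\\", "\\\\")  # backslashes first, so later escapes stay unambiguous
        .replace("'", "''")         # SQL-style quote doubling
        .replace("\n", "\\n")       # real newline -> backslash + n
        .replace("\t", "\\t")       # real tab -> backslash + t
    )


source = "def f(x):\n\treturn x + '!'\n"
escaped = escape_body(source)

# A backslash-n that is already two characters in the source stays distinct
# from an encoded real newline because its backslash is doubled first.
literal = escape_body("a\\nb")
```

If the backslash replacement ran last instead, the backslashes introduced for newlines and tabs would be doubled as well, and a decoder could no longer tell a real newline from the two-character sequence.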
def udf(fn=None, **kwargs):
    """Decorate a function as a scalar (or struct-returning) UDF.

        @udf
        def doubled(val: int) -> float: ...

        @udf(pip=["torch>=2"], num_gpus=1)
        def embed(body: str) -> list[float]: ...
    """
    if fn is not None:
        return Udf(fn, **kwargs)
    return lambda f: Udf(f, **kwargs)


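The `fn=None` signature is what lets `udf` be used both bare and with options. The sketch below isolates that pattern; `tagged` is a hypothetical decorator that only records its options on the function, where the real one returns a `Udf` wrapper.

```python
def tagged(fn=None, **options):
    """Hypothetical decorator usable as @tagged or @tagged(key=value)."""

    def wrap(f):
        f.options = dict(options)
        return f

    if fn is not None:
        # Bare form: Python passes the decorated function directly.
        return wrap(fn)
    # Called form: return the real decorator to be applied next.
    return wrap


@tagged
def plain(x: int) -> int:
    return x + 1


@tagged(num_gpus=1)
def configured(x: int) -> int:
    return x * 2
```

The pattern relies on options being keyword-only in practice: a positional first argument is always interpreted as the function being decorated.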
def table_udf(fn=None, **kwargs):
    """Decorate a table function (UDTF): each input row may emit zero or
    more output rows. Only usable in materialized views.

        class Chunk(TypedDict):
            chunk: str
            chunk_idx: int

        @table_udf
        def chunker(body: str) -> list[Chunk]: ...
    """
    kwargs["table"] = True
    if fn is not None:
        return Udf(fn, **kwargs)
    return lambda f: Udf(f, **kwargs)


# -- view / job handles (thin references over a connection) -------------


def struct_field_types(returns: str) -> "list[str]":
    """Field type strings of a STRUCT(...) SQL type, in declared order."""
    inner = returns.strip()[len("STRUCT(") : -1]
    fields, depth, start = [], 0, 0
    for i, c in enumerate(inner):
        if c in "([":
            depth += 1
        elif c in ")]":
            depth -= 1
        elif c == "," and depth == 0:
            fields.append(inner[start:i].strip())
            start = i + 1
    fields.append(inner[start:].strip())
    # Each field is "name TYPE"; drop the name.
    return [f.split(None, 1)[1] for f in fields]


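Note on `struct_field_types`: the depth counter is what keeps commas inside nested `STRUCT(...)` or bracketed types from splitting a field. The sketch below copies the helper unchanged and runs it on an illustrative type string; the field names and SQL type spellings are made up for the example.

```python
def struct_field_types(returns: str) -> "list[str]":
    # Copy of the helper from the diff, unchanged.
    inner = returns.strip()[len("STRUCT(") : -1]
    fields, depth, start = [], 0, 0
    for i, c in enumerate(inner):
        if c in "([":
            depth += 1
        elif c in ")]":
            depth -= 1
        elif c == "," and depth == 0:
            # Only a top-level comma ends a field.
            fields.append(inner[start:i].strip())
            start = i + 1
    fields.append(inner[start:].strip())
    return [f.split(None, 1)[1] for f in fields]


# Illustrative input: two flat fields, one nested struct, one array type.
example = "STRUCT(chunk VARCHAR, chunk_idx INT, dims STRUCT(w INT, h INT), vec FLOAT[3])"
types = struct_field_types(example)
```

The helper assumes its input already starts with `STRUCT(` and ends with `)`; it does not validate that.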
def build_view_query(source, select) -> str:
    """Assemble a view SELECT from a source (name or table) and select
    items: a column name, an expression string, a (alias, expression)
    tuple, or a @udf/@table_udf object."""
    src = source.name if hasattr(source, "name") else source
    items = []
    for item in select:
        if isinstance(item, Udf):
            items.append(item.expression())
        elif isinstance(item, tuple):
            alias, expr = item
            expr = expr.expression() if isinstance(expr, Udf) else expr
            items.append(f"{expr} AS {alias}")
        else:
            items.append(item)
    return f"SELECT {', '.join(items)} FROM {src}"


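Note on `build_view_query`: each select item is rendered independently and joined with commas. The sketch below reproduces the function against a hypothetical `Udf` stand-in whose `expression()` returns `name(args)`; the real `Udf.expression()` output is not shown in this diff, so that rendering is an assumption.

```python
class Udf:
    """Hypothetical stand-in; only expression() is needed here."""

    def __init__(self, name, args):
        self.name = name
        self.args = args

    def expression(self) -> str:
        # Assumed rendering: a plain SQL call over the argument columns.
        return f"{self.name}({', '.join(self.args)})"


def build_view_query(source, select) -> str:
    # Same logic as the function in the diff.
    src = source.name if hasattr(source, "name") else source
    items = []
    for item in select:
        if isinstance(item, Udf):
            items.append(item.expression())
        elif isinstance(item, tuple):
            alias, expr = item
            expr = expr.expression() if isinstance(expr, Udf) else expr
            items.append(f"{expr} AS {alias}")
        else:
            items.append(item)
    return f"SELECT {', '.join(items)} FROM {src}"


query = build_view_query(
    "docs",
    ["id", ("vec", Udf("embed", ["body"])), ("title_uc", "upper(title)")],
)
```

Neither aliases nor the source name are quoted by the function, so callers are responsible for passing valid identifiers.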
def _job_id_matches(handle_id: str, listed_id: str) -> bool:
    # The refresh/backfill endpoints return the submission id (a uuid), but
    # the agent names the manifest job "<table>-<type>-<first 8 of the
    # submission id>" -- which is what list_jobs and cancel report. Match the
    # canonical id directly, or by that submission prefix.
    if listed_id == handle_id:
        return True
    prefix = handle_id[:8]
    return len(prefix) >= 4 and prefix in listed_id


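Note on `_job_id_matches`: the function accepts either an exact id or a manifest id that contains the first eight characters of the submission id. The ids in the sketch below are invented for illustration and only follow the naming scheme described in the comment.

```python
def job_id_matches(handle_id: str, listed_id: str) -> bool:
    # Same logic as _job_id_matches in the diff.
    if listed_id == handle_id:
        return True
    prefix = handle_id[:8]
    return len(prefix) >= 4 and prefix in listed_id


# Made-up ids following "<table>-<type>-<first 8 of the submission id>".
submission_id = "3f2a9c1e-7b4d-4c1a-9e2f-0123456789ab"
manifest_id = "docs-refresh-3f2a9c1e"

exact = job_id_matches(submission_id, submission_id)
by_prefix = job_id_matches(submission_id, manifest_id)
unrelated = job_id_matches(submission_id, "docs-refresh-deadbeef")
too_short = job_id_matches("abc", "docs-refresh-abc12345")
```

Because the check is a substring test rather than a suffix test, a prefix that happens to appear elsewhere in a listed id (for example inside a table name) would also match.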
class MaterializedView:
    """A reference to a materialized view (name + connection). Operations are
    server-backed connection calls bound to the name.

    ``create_materialized_view`` returns one of these; ``job_id`` is the
    initial-population job (None when the view was created with no data), so
    ``db.create_materialized_view(...).wait()`` blocks until it is populated.
    """

    def __init__(self, conn, name: str, job_id: "str | None" = None):
        self.conn = conn
        self.name = name
        #: initial-population job id from create, or None (with_no_data).
        self.job_id = job_id

    def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
        """Block until the initial-population job (from create) finishes.
        A no-op when the view was created with no data."""
        if self.job_id is None:
            return "finished"
        return JobHandle(self.conn, self.job_id, table=self.name).wait(
            timeout=timeout, poll=poll
        )

    def refresh(self, full: bool = False) -> "JobHandle":
        """Refresh the materialized view; returns a `JobHandle` to wait on,
        poll, or cancel (``view.refresh().wait()``).

        ``full=True`` forces a full rebuild (recompute and replace every row)
        instead of the default incremental refresh. A full rebuild preserves
        the view's indexes -- they are reindexed by the distributed indexer.
        """
        job_id = self.conn._refresh_materialized_view(self.name, full=full)
        return JobHandle(self.conn, job_id, table=self.name)

    def explain_refresh(self, full: bool = False):
        """Plan a refresh without running it (EXPLAIN REFRESH)."""
        return self.conn.explain_refresh_materialized_view(self.name, full=full)

    def alter(self, auto_refresh: bool) -> None:
        """Set the view's ``auto_refresh`` flag."""
        self.conn.alter_materialized_view(self.name, auto_refresh=auto_refresh)

    def drop(self) -> None:
        """Drop the materialized view."""
        self.conn.drop_materialized_view(self.name)

    # A materialized view is a first-class table: it can be indexed and
    # searched like any other. These open the materialized dataset by name and
    # delegate. Indexes declared this way are recorded against the view, so the
    # engine re-applies them after a full refresh rebuilds the dataset (a full
    # refresh overwrites the dataset, which would otherwise drop its indices).
    def _table(self):
        return self.conn.open_table(self.name)

    def create_index(self, *args, **kwargs):
        """Build an index on the materialized view (see Table.create_index)."""
        return self._table().create_index(*args, **kwargs)

    def create_scalar_index(self, *args, **kwargs):
        """Build a scalar index on the materialized view."""
        return self._table().create_scalar_index(*args, **kwargs)

    def create_fts_index(self, *args, **kwargs):
        """Build a full-text-search index on the materialized view."""
        return self._table().create_fts_index(*args, **kwargs)

    def search(self, *args, **kwargs):
        """Search the materialized view (vector / FTS / hybrid)."""
        return self._table().search(*args, **kwargs)

    def lineage(self, column=None, *, direction=None, depth=None):
        """Lineage of the materialized view (or one of its columns). Delegates
        to the backing table; the server already includes the view's sources
        and downstream dependents. Returns a `Lineage`."""
        return self._table().lineage(column, direction=direction, depth=depth)


_PROGRESS = re.compile(r"(\d+)/(\d+)")


class JobFailedError(RuntimeError):
    """Raised by ``JobHandle.wait()`` when the server reports the job ``failed``.

    Carries the server-side error so a doomed backfill (e.g. a multi-column
    ``REFRESH COLUMN`` of a scalar UDF) surfaces its real cause promptly,
    instead of the caller blocking until ``wait()``'s timeout.
    """

    def __init__(self, job_id: str, error: "str | None"):
        self.job_id = job_id
        self.error = error
        super().__init__(f"job {job_id} failed: {error or 'unknown error'}")


class JobHandle:
    """A reference to an in-flight server-side job, with polling helpers."""

    #: How long an unseen job is treated as still materializing (submission
    #: -> agent cycle -> manifest write is async).
    GRACE_SECONDS = 20.0

    def __init__(self, conn, job_id: str, table: "str | None" = None):
        self.conn = conn
        self.id = job_id
        #: The job's table, when known (refresh_column / MV refresh). Lets the
        #: server resolve this job with an O(1) single-node read; without it the
        #: lookup scans the database's active jobs (still correct).
        self.table = table
        self._created = time.monotonic()
        self._seen = False

    def _job(self):
        # Poll by id (one job), not list_jobs (every active job): the server
        # matches the submission/manifest id and reads just this table's node.
        return self.conn.get_job(self.id, self.table)

    def status(self) -> str:
        """The server-reported state (pending / running / cancelling / stale /
        failed), or 'finished' once the job has left the in-flight listing.
        A job not yet seen is reported as 'pending' for GRACE_SECONDS."""
        job = self._job()
        if job is not None:
            self._seen = True
            return job.state
        if not self._seen and time.monotonic() - self._created < self.GRACE_SECONDS:
            return "pending"
        return "finished"

    def progress(self) -> "tuple[int, int] | None":
        """(units_done, units_total) while running, else None."""
        job = self._job()
        if job is not None and job.units_total is not None:
            return job.units_done or 0, job.units_total
        return None

    def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
        """Poll until the job reaches a terminal state and return it
        ('finished' or 'stale'). Raises JobFailedError when the server reports
        'failed', and TimeoutError after ``timeout`` seconds."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            state = self.status()
            if state in ("finished", "stale"):
                return state
            if state == "failed":
                # Terminal failure -- surface the server error now, don't block
                # until `timeout`. `finalize` wrote it to the job's status node.
                job = self._job()
                raise JobFailedError(self.id, job.error if job is not None else None)
            if state == "pending":
                time.sleep(min(poll, 0.5))
                continue
            job = self._job()
            if job is not None and job.committed:
                return "finished"
            time.sleep(poll)
        raise TimeoutError(f"job {self.id} still {self.status()} after {timeout}s")

    def cancel(self) -> None:
        # Cancel by the canonical manifest id (what cancel matches), found
        # via the submission prefix; fall back to the raw id.
        job = self._job()
        self.conn.cancel_job(job.job_id if job is not None else self.id)


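Note on `JobHandle.status()`: a missing job means two different things depending on whether the handle has ever seen it. Before the first sighting, and within `GRACE_SECONDS` of creation, it is still being registered; afterwards it has finished. The sketch below isolates that rule as a pure function so it can be exercised without a connection or a clock. `resolve_state` and its parameters are hypothetical names, not part of the diff.

```python
from typing import Optional


def resolve_state(
    server_state: Optional[str],
    seen_before: bool,
    age_seconds: float,
    grace_seconds: float = 20.0,
) -> str:
    """Mirror of the decision in JobHandle.status(), with time injected."""
    if server_state is not None:
        # The server knows the job: report its state verbatim.
        return server_state
    if not seen_before and age_seconds < grace_seconds:
        # Never observed and still young: assume it has not been listed yet.
        return "pending"
    # Either it was listed earlier and is now gone, or the grace period
    # expired without it ever appearing.
    return "finished"
```

One consequence of this rule is that a job which never appears at all is reported as "finished" once the grace period has elapsed.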
class AsyncMaterializedView:
    """Async reference to a materialized view (name + async connection)."""

    def __init__(self, conn, name: str, job_id: "str | None" = None):
        self.conn = conn
        self.name = name
        #: initial-population job id from create, or None (with_no_data).
        self.job_id = job_id

    async def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
        """Block until the initial-population job (from create) finishes.
        A no-op when the view was created with no data."""
        if self.job_id is None:
            return "finished"
        return await AsyncJobHandle(self.conn, self.job_id, table=self.name).wait(
            timeout=timeout, poll=poll
        )

    async def refresh(self, full: bool = False) -> "AsyncJobHandle":
        """Refresh the materialized view; returns an `AsyncJobHandle` to wait
        on, poll, or cancel.

        ``full=True`` forces a full rebuild instead of an incremental refresh
        (indexes are preserved and reindexed by the distributed indexer).
        """
        job_id = await self.conn._refresh_materialized_view(self.name, full=full)
        return AsyncJobHandle(self.conn, job_id, table=self.name)

    async def explain_refresh(self, full: bool = False):
        """Plan a refresh without running it (EXPLAIN REFRESH)."""
        return await self.conn.explain_refresh_materialized_view(self.name, full=full)

    async def alter(self, auto_refresh: bool) -> None:
        """Set the view's ``auto_refresh`` flag."""
        await self.conn.alter_materialized_view(self.name, auto_refresh=auto_refresh)

    async def drop(self) -> None:
        """Drop the materialized view."""
        await self.conn.drop_materialized_view(self.name)

    async def lineage(self, column=None, *, direction=None, depth=None):
        """Lineage of the materialized view (or column). Returns a `Lineage`."""
        return await self.conn.lineage(
            self.name, column, direction=direction, depth=depth
        )


class AsyncJobHandle:
    """Async reference to an in-flight server-side job, with polling helpers."""

    GRACE_SECONDS = 20.0

    def __init__(self, conn, job_id: str, table: "str | None" = None):
        self.conn = conn
        self.id = job_id
        #: See JobHandle.table -- enables an O(1) by-id lookup when known.
        self.table = table
        self._created = time.monotonic()
        self._seen = False

    async def _job(self):
        # Poll by id, not list_jobs (see JobHandle._job).
        return await self.conn.get_job(self.id, self.table)

    async def status(self) -> str:
        """See JobHandle.status."""
        job = await self._job()
        if job is not None:
            self._seen = True
            return job.state
        if not self._seen and time.monotonic() - self._created < self.GRACE_SECONDS:
            return "pending"
        return "finished"

    async def progress(self) -> "tuple[int, int] | None":
        """(units_done, units_total) while running, else None."""
        job = await self._job()
        if job is not None and job.units_total is not None:
            return job.units_done or 0, job.units_total
        return None

    async def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
        """See JobHandle.wait."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            state = await self.status()
            if state in ("finished", "stale"):
                return state
            if state == "failed":
                # Terminal failure -- surface the server error now, don't block
                # until `timeout`. `finalize` wrote it to the job's status node.
                job = await self._job()
                raise JobFailedError(self.id, job.error if job is not None else None)
            if state == "pending":
                await asyncio.sleep(min(poll, 0.5))
                continue
            job = await self._job()
            if job is not None and job.committed:
                return "finished"
            await asyncio.sleep(poll)
        raise TimeoutError(
            f"job {self.id} still {await self.status()} after {timeout}s"
        )

    async def cancel(self) -> None:
        job = await self._job()
        await self.conn.cancel_job(job.job_id if job is not None else self.id)

@@ -373,9 +373,15 @@ def _(value: list):
@value_to_sql.register(dict)
def _(value: dict):
    # https://datafusion.apache.org/user-guide/sql/scalar_functions.html#named-struct
    # Render the field name through value_to_sql(str(...)) as well so that keys
    # containing characters meaningful in SQL (e.g. a single quote) are escaped
    # the same way string values are. A bare f"'{k}'" would emit invalid SQL for
    # a key like "it's".
    return (
        "named_struct("
-        + ", ".join(f"'{k}', {value_to_sql(v)}" for k, v in value.items())
        + ", ".join(
            f"{value_to_sql(str(k))}, {value_to_sql(v)}" for k, v in value.items()
        )
        + ")"
    )

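Note on the `named_struct` change: routing keys through the same string renderer as values is what makes a key such as `it's` safe. The sketch below rebuilds the idea with `functools.singledispatch`; `to_sql` is a hypothetical stand-in for `value_to_sql`, and the quote-doubling string rule is assumed rather than taken from this diff.

```python
from functools import singledispatch


@singledispatch
def to_sql(value) -> str:
    raise NotImplementedError(f"unsupported type: {type(value).__name__}")


@to_sql.register(str)
def _(value: str) -> str:
    # Assumed string rule: wrap in single quotes, double embedded quotes.
    return "'" + value.replace("'", "''") + "'"


@to_sql.register(int)
def _(value: int) -> str:
    return str(value)


@to_sql.register(dict)
def _(value: dict) -> str:
    # Keys go through to_sql(str(k)), so they get the same escaping as values.
    fields = ", ".join(f"{to_sql(str(k))}, {to_sql(v)}" for k, v in value.items())
    return f"named_struct({fields})"


rendered = to_sql({"it's": 1, "name": "O'Brien"})
```

With the removed line's approach (`f"'{k}'"`), the first key would have produced `'it's'`, which ends the SQL string literal early.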
@@ -385,6 +391,21 @@ def _(value: np.ndarray):
    return value_to_sql(value.tolist())


@value_to_sql.register(np.bool_)
def _(value: np.bool_):
    return value_to_sql(bool(value))


@value_to_sql.register(np.integer)
def _(value: np.integer):
    return value_to_sql(int(value))


@value_to_sql.register(np.floating)
def _(value: np.floating):
    return value_to_sql(float(value))


def deprecated(func):
    """This is a decorator which can be used to mark functions
    as deprecated. It will result in a warning being emitted

56
python/python/tests/test_errors.py
Normal file
@@ -0,0 +1,56 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors

import pickle

from lancedb.remote.errors import HttpError, LanceDBClientError, RetryError


def test_pickle_lancedb_client_error():
    err = LanceDBClientError("something went wrong", "req-123", 400)
    restored = pickle.loads(pickle.dumps(err))
    assert str(restored) == "something went wrong"
    assert restored.request_id == "req-123"
    assert restored.status_code == 400


def test_pickle_lancedb_client_error_no_status_code():
    err = LanceDBClientError("fail", "req-456")
    restored = pickle.loads(pickle.dumps(err))
    assert str(restored) == "fail"
    assert restored.request_id == "req-456"
    assert restored.status_code is None


def test_pickle_http_error():
    err = HttpError("not found", "req-789", 404)
    restored = pickle.loads(pickle.dumps(err))
    assert isinstance(restored, HttpError)
    assert str(restored) == "not found"
    assert restored.request_id == "req-789"
    assert restored.status_code == 404


def test_pickle_retry_error():
    err = RetryError(
        "max retries exceeded",
        "req-abc",
        request_failures=3,
        connect_failures=1,
        read_failures=2,
        max_request_failures=5,
        max_connect_failures=3,
        max_read_failures=3,
        status_code=503,
    )
    restored = pickle.loads(pickle.dumps(err))
    assert isinstance(restored, RetryError)
    assert str(restored) == "max retries exceeded"
    assert restored.request_id == "req-abc"
    assert restored.request_failures == 3
    assert restored.connect_failures == 1
    assert restored.read_failures == 2
    assert restored.max_request_failures == 5
    assert restored.max_connect_failures == 3
    assert restored.max_read_failures == 3
    assert restored.status_code == 503

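Note on why these pickle tests are needed: by default an exception is re-created on unpickling as `cls(*self.args)`, so a subclass whose `__init__` takes more required arguments than it forwards to `super().__init__` fails to round-trip. One common remedy is a `__reduce__` that replays the full constructor call. The sketch below uses hypothetical `ClientError` / `NotFoundError` classes; how `lancedb.remote.errors` actually implements this is not shown in the diff.

```python
import pickle


class ClientError(Exception):
    """Hypothetical error carrying a request id and optional status code."""

    def __init__(self, message, request_id, status_code=None):
        # Only the message goes into args, so str(err) stays clean.
        super().__init__(message)
        self.request_id = request_id
        self.status_code = status_code

    def __reduce__(self):
        # Replay the constructor with every argument it needs. type(self)
        # keeps subclasses intact across the round trip.
        return (type(self), (self.args[0], self.request_id, self.status_code))


class NotFoundError(ClientError):
    pass


restored = pickle.loads(pickle.dumps(NotFoundError("not found", "req-789", 404)))
```

Without the `__reduce__` override, `pickle.loads` would call `NotFoundError("not found")` and raise a `TypeError` for the missing `request_id`.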
@@ -20,6 +20,7 @@ from lancedb.index import (
    IvfRq,
    Bitmap,
    LabelList,
    Fm,
    HnswPq,
    HnswSq,
    HnswFlat,
@@ -90,7 +91,9 @@ async def test_create_scalar_index(some_table: AsyncTable):
    # Can recreate if replace=True
    await some_table.create_index("id", replace=True)
    indices = await some_table.list_indices()
-    assert str(indices) == '[Index(BTree, columns=["id"], name="id_idx")]'
    assert str(indices).startswith(
        '[IndexConfig(name="id_idx", index_type="BTree", columns=["id"]'
    )
    assert len(indices) == 1
    assert indices[0].index_type == "BTree"
    assert indices[0].columns == ["id"]
@@ -105,6 +108,27 @@ async def test_create_scalar_index(some_table: AsyncTable):
    assert len(indices) == 0


@pytest.mark.asyncio
async def test_index_config_repr(db_async):
    # Use >= 1000 rows so the thousands separator in the repr is exercised.
    nrows = 1500
    table = await db_async.create_table(
        "repr_table", pa.Table.from_pydict({"id": list(range(nrows))})
    )
    await table.create_index("id", config=BTree())
    indices = await table.list_indices()
    assert len(indices) == 1

    r = repr(indices[0])
    assert r.startswith('IndexConfig(name="id_idx", index_type="BTree", columns=["id"]')
    # Integer counts use `_` thousands separators (valid Python int syntax).
    assert "num_indexed_rows=1_500" in r
    assert "num_unindexed_rows=0" in r
    # created_at renders as a datetime so the value round-trips.
    assert "created_at=datetime.datetime(" in r
    assert r.endswith(")")


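Note on the `1_500` assertion: underscore-grouped integers are valid Python literals, so a repr that uses them can still be read back. The sketch below shows the Python-side equivalent of that formatting; `format_count` is a hypothetical helper, and the `IndexConfig` repr itself may well be produced elsewhere (for example in the native extension).

```python
def format_count(n: int) -> str:
    # The "_" format option groups digits in threes with underscores.
    return f"{n:_}"


small = format_count(0)
grouped = format_count(1500)
large = format_count(1234567)

# int() accepts the grouped form, so the text round-trips to the same value.
round_trip = int(grouped)
```

This is why the test uses at least 1000 rows: below that, the grouped and ungrouped forms are identical and the separator would go unexercised.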
@pytest.mark.asyncio
async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
    metadata_type = pa.struct(
@@ -113,8 +137,14 @@ async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
            pa.field("user.id", pa.int32()),
        ]
    )
    mixed_case_metadata_type = pa.struct([pa.field("userId", pa.int32())])
    escaped_metadata_type = pa.struct([pa.field("user-id", pa.int32())])
    literal_type = pa.struct([pa.field("a.b", pa.int32())])
    data = pa.Table.from_arrays(
        [
            pa.array([1, 2, 3], type=pa.int32()),
            pa.array([1, 2, 3], type=pa.int32()),
            pa.array([1, 2, 3], type=pa.int32()),
            pa.array([1, 2, 3], type=pa.int32()),
            pa.array(
                [
@@ -124,37 +154,91 @@ async def test_create_nested_scalar_index_lists_canonical_paths(db_async):
                ],
                type=metadata_type,
            ),
            pa.array(
                [{"userId": 10}, {"userId": 20}, {"userId": 30}],
                type=mixed_case_metadata_type,
            ),
            pa.array(
                [{"user-id": 10}, {"user-id": 20}, {"user-id": 30}],
                type=escaped_metadata_type,
            ),
            pa.array(
                [{"a.b": 10}, {"a.b": 20}, {"a.b": 30}],
                type=literal_type,
            ),
        ],
        names=[
            "rowId",
            "row-id",
            "userId",
            "user_id",
            "metadata",
            "MetaData",
            "meta-data",
            "literal",
        ],
-        names=["user_id", "metadata"],
    )
    table = await db_async.create_table("nested_scalar_index", data)

-    await table.create_index("user_id", config=BTree(), name="top_user_id_idx")
    await table.create_index("rowId", config=BTree(), name="row_id_idx")
    await table.create_index("`row-id`", config=BTree(), name="row_dash_id_idx")
    await table.create_index("userId", config=BTree(), name="top_user_id_idx")
    await table.create_index("user_id", config=BTree(), name="top_snake_user_id_idx")
    await table.create_index(
        "metadata.user_id", config=BTree(), name="nested_user_id_idx"
    )
    await table.create_index(
        "metadata.`user.id`", config=BTree(), name="escaped_user_id_idx"
    )
    await table.create_index(
        "MetaData.userId", config=BTree(), name="mixed_case_metadata_user_id_idx"
    )
    await table.create_index(
        "`meta-data`.`user-id`", config=BTree(), name="escaped_names_idx"
    )
    await table.create_index("literal.`a.b`", config=BTree(), name="literal_dot_idx")

    columns_by_name = {
        index.name: index.columns for index in await table.list_indices()
    }
-    assert columns_by_name["top_user_id_idx"] == ["user_id"]
    assert columns_by_name["row_id_idx"] == ["rowId"]
    assert columns_by_name["row_dash_id_idx"] == ["`row-id`"]
    assert columns_by_name["top_user_id_idx"] == ["userId"]
    assert columns_by_name["top_snake_user_id_idx"] == ["user_id"]
    assert columns_by_name["nested_user_id_idx"] == ["metadata.user_id"]
    assert columns_by_name["escaped_user_id_idx"] == ["metadata.`user.id`"]
    assert columns_by_name["mixed_case_metadata_user_id_idx"] == ["MetaData.userId"]
    assert columns_by_name["escaped_names_idx"] == ["`meta-data`.`user-id`"]
    assert columns_by_name["literal_dot_idx"] == ["literal.`a.b`"]

    for index_name in columns_by_name:
        stats = await table.index_stats(index_name)
        assert stats is not None
        assert stats.num_indexed_rows == 3


@pytest.mark.asyncio
async def test_create_fixed_size_binary_index(some_table: AsyncTable):
    await some_table.create_index("fsb", config=BTree())
    indices = await some_table.list_indices()
-    assert str(indices) == '[Index(BTree, columns=["fsb"], name="fsb_idx")]'
    assert str(indices).startswith(
        '[IndexConfig(name="fsb_idx", index_type="BTree", columns=["fsb"]'
    )
    assert len(indices) == 1
    assert indices[0].index_type == "BTree"
    assert indices[0].columns == ["fsb"]


@pytest.mark.asyncio
async def test_create_fm_index(some_table: AsyncTable):
    # FM-Index accelerates substring search on string/binary columns.
    await some_table.create_index("data", config=Fm())
    indices = await some_table.list_indices()
    assert len(indices) == 1
    assert indices[0].index_type == "Fm"
    assert indices[0].columns == ["data"]


@pytest.mark.asyncio
async def test_create_bitmap_index(some_table: AsyncTable):
    await some_table.create_index("id", config=Bitmap())
@@ -188,14 +272,65 @@ async def test_create_bitmap_index(some_table: AsyncTable):
async def test_create_label_list_index(some_table: AsyncTable):
    await some_table.create_index("tags", config=LabelList())
    indices = await some_table.list_indices()
-    assert str(indices) == '[Index(LabelList, columns=["tags"], name="tags_idx")]'
    assert str(indices).startswith(
        '[IndexConfig(name="tags_idx", index_type="LabelList", columns=["tags"]'
    )
    plan = await some_table.query().where("array_has(tags, 'tag0')").explain_plan()
    assert "ScalarIndexQuery" in plan


@pytest.mark.asyncio
async def test_create_large_list_label_list_index(db_async):
    data = pa.Table.from_pydict(
        {"tags": [[f"tag{i % 2}", "shared"] for i in range(16)]},
        schema=pa.schema([pa.field("tags", pa.large_list(pa.string()))]),
    )
    table = await db_async.create_table("large_list_label_list_index", data)

    await table.create_index("tags", config=LabelList())
    indices = await table.list_indices()
    assert str(indices).startswith(
        '[IndexConfig(name="tags_idx", index_type="LabelList", columns=["tags"]'
    )
    plan = await table.query().where("array_has(tags, 'shared')").explain_plan()
    assert "ScalarIndexQuery" in plan


@pytest.mark.asyncio
async def test_create_label_list_index_rejects_list_struct(db_async):
    item_type = pa.struct(
        [
            pa.field("tag", pa.string()),
            pa.field(
                "metadata",
                pa.struct([pa.field("userId", pa.string())]),
            ),
        ]
    )
    data = pa.Table.from_pylist(
        [
            {
                "items": [
                    {"tag": "tag0", "metadata": {"userId": "user0"}},
                    {"tag": "shared", "metadata": {"userId": "user1"}},
                ]
            }
        ],
        schema=pa.schema([pa.field("items", pa.list_(item_type))]),
    )
    table = await db_async.create_table("list_struct_label_list_index", data)

    with pytest.raises(Exception, match="LabelList index cannot be created"):
        await table.create_index("items", config=LabelList())


@pytest.mark.asyncio
async def test_full_text_search_index(some_table: AsyncTable):
    await some_table.create_index("tags", config=FTS(with_position=False))
    indices = await some_table.list_indices()
-    assert str(indices) == '[Index(FTS, columns=["tags"], name="tags_idx")]'
    assert str(indices).startswith(
        '[IndexConfig(name="tags_idx", index_type="FTS", columns=["tags"]'
    )

    await some_table.prewarm_index("tags_idx")

@@ -226,7 +361,6 @@ async def test_create_vector_index(some_table: AsyncTable):
    assert stats.num_indexed_rows == await some_table.count_rows()
    assert stats.num_unindexed_rows == 0
    assert stats.num_indices == 1
-    assert stats.loss >= 0.0


@pytest.mark.asyncio
@@ -250,7 +384,6 @@ async def test_create_4bit_ivfpq_index(some_table: AsyncTable):
    assert stats.num_indexed_rows == await some_table.count_rows()
    assert stats.num_unindexed_rows == 0
    assert stats.num_indices == 1
-    assert stats.loss >= 0.0


@pytest.mark.asyncio

92
python/python/tests/test_job_handle.py
Normal file
@@ -0,0 +1,92 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""JobHandle.wait() terminal-state handling.

Regression coverage for the cluster backfill-failure hang: the server reports a
doomed job as ``state="failed"`` within seconds, but ``wait()`` used to ignore
``failed`` and block until its (default 3600s) timeout. These tests pin that a
``failed`` job raises ``JobFailedError`` promptly, carrying the server error.
"""

import asyncio
import time

import pytest

from lancedb.udf import JobHandle, AsyncJobHandle, JobFailedError


class FakeJobInfo:
    """Mirror of the pyo3 builtins.JobInfo fields wait()/status() read."""

    def __init__(self, state, error=None, committed=False, units_total=None):
        self.state = state
        self.error = error
        self.committed = committed
        self.units_total = units_total
        self.units_done = None
        self.job_id = "job-1"


class FakeConn:
    """get_job() walks a scripted list of JobInfo (or None) snapshots, holding
    the last one once exhausted, so wait() polls a deterministic timeline."""

    def __init__(self, snapshots):
        self._snaps = list(snapshots)
        self.calls = 0

    def get_job(self, job_id, table=None):
        snap = self._snaps[min(self.calls, len(self._snaps) - 1)]
        self.calls += 1
        return snap


class AsyncFakeConn(FakeConn):
    async def get_job(self, job_id, table=None):
        return FakeConn.get_job(self, job_id, table)


def test_wait_raises_on_failed_promptly():
    # pending -> failed: wait() must raise the server error, not TimeoutError.
    conn = FakeConn(
        [None, FakeJobInfo("failed", error="multi-column backfill needs a STRUCT")]
    )
    jh = JobHandle(conn, "job-1", table="t")
    t0 = time.monotonic()
    with pytest.raises(JobFailedError) as exc:
        jh.wait(timeout=30, poll=0.01)
    assert time.monotonic() - t0 < 5  # prompt, nowhere near the 30s timeout
    assert "STRUCT" in str(exc.value)
    assert exc.value.error == "multi-column backfill needs a STRUCT"
    assert exc.value.job_id == "job-1"


def test_wait_returns_finished_on_success():
    # running -> finished (job left the inflight listing) returns normally.
    conn = FakeConn([FakeJobInfo("running", units_total=2), None])
    jh = JobHandle(conn, "job-1", table="t")
    jh._seen = True  # already observed, so a None now means "finished" not grace
    assert jh.wait(timeout=30, poll=0.01) == "finished"


def test_wait_returns_finished_on_committed():
    # A committed job that is still listed resolves to finished.
    conn = FakeConn([FakeJobInfo("running", committed=True, units_total=2)])
    jh = JobHandle(conn, "job-1", table="t")
    jh._seen = True
    assert jh.wait(timeout=30, poll=0.01) == "finished"


def test_async_wait_raises_on_failed_promptly():
    conn = AsyncFakeConn([None, FakeJobInfo("failed", error="boom")])
    jh = AsyncJobHandle(conn, "job-1", table="t")

    async def run():
        t0 = time.monotonic()
        with pytest.raises(JobFailedError) as exc:
            await jh.wait(timeout=30, poll=0.01)
        assert time.monotonic() - t0 < 5
        assert exc.value.error == "boom"

    asyncio.run(run())

@@ -9,6 +9,66 @@ import pytest
import pyarrow as pa
import lancedb
from lance_namespace.errors import NamespaceNotEmptyError, TableNotFoundError
from lancedb.namespace import _MAX_QUERY_K
from lancedb.table import AsyncTable, LanceTable


PUSHDOWN_DATA = pa.table(
    {"id": list(range(12)), "text": [f"row-{idx}" for idx in range(12)]}
)


def _ipc_file(table: pa.Table = PUSHDOWN_DATA) -> bytes:
    sink = pa.BufferOutputStream()
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


class _FailingSyncInner:
    name = "hist"

    def current_branch(self):
        # The pushdown gate only routes server-side when on the default branch.
        return None

    async def schema(self):
        return PUSHDOWN_DATA.schema

    async def to_arrow(self):
        raise RuntimeError("direct table to_arrow should not be used")


class _FailingAsyncInner:
    def name(self):
        return "hist"

    async def schema(self):
        return PUSHDOWN_DATA.schema

    def query(self):
        raise AssertionError("direct async query should not be used")


class _NamespaceClient:
    def __init__(self):
        self.requests = []

    def query_table(self, request):
        self.requests.append(request)
        return _ipc_file()


def _namespace_lance_table(namespace_client: _NamespaceClient) -> LanceTable:
    table = LanceTable.__new__(LanceTable)
    table._table = _FailingSyncInner()
    table._namespace_path = ["geneva"]
    table._namespace_client = namespace_client
    table._pushdown_operations = {"QueryTable"}
    # This test exercises the Python-side pushdown path (non-native client), so
    # pushdown is not routed to Rust.
    table._route_pushdown_to_rust = False
    return table


class TestNamespaceConnection:
@@ -200,8 +260,15 @@ class TestNamespaceConnection:
|
||||
assert table_schema.field("id").type == pa.int64()
|
||||
assert table_schema.field("text").type == pa.string()
|
||||
|
||||
def test_rename_table_not_supported(self):
|
||||
"""Test that rename_table raises NotImplementedError."""
|
||||
def test_rename_table(self):
|
||||
"""Test that rename_table renames a table in the namespace.
|
||||
|
||||
The `dir` namespace implementation in lance-namespace-impls does not
|
||||
implement `rename_table` yet (only the `rest` backend does), so it
|
||||
currently falls back to the default trait method which raises
|
||||
NotSupported. This is expected to start passing once the `dir`
|
||||
backend gains rename_table support upstream.
|
||||
"""
|
||||
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
|
||||
|
||||
# Create a child namespace first
|
||||
@@ -216,9 +283,14 @@ class TestNamespaceConnection:
|
||||
)
|
||||
db.create_table("old_name", schema=schema, namespace_path=["test_ns"])
|
||||
|
||||
# Rename should raise NotImplementedError
|
||||
with pytest.raises(NotImplementedError, match="rename_table is not supported"):
|
||||
db.rename_table("old_name", "new_name")
|
||||
# Rename the table within the same namespace
|
||||
with pytest.raises(NotImplementedError, match="rename_table not implemented"):
|
||||
db.rename_table(
|
||||
"old_name",
|
||||
"new_name",
|
||||
cur_namespace_path=["test_ns"],
|
||||
new_namespace_path=["test_ns"],
|
||||
)
|
||||
|
||||
def test_drop_all_tables(self):
|
||||
"""Test dropping all tables through namespace."""
|
||||
@@ -736,6 +808,56 @@ class TestPushdownOperations:
|
||||
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
|
||||
assert len(db._namespace_client_pushdown_operations) == 0
|
||||
|
||||
def test_route_pushdown_to_rust_for_native_rest(self):
|
||||
"""A natively-built rest connection must defer QueryTable pushdown to
|
||||
Rust so reads carry the x-lancedb-min-timestamp read-freshness header."""
|
||||
db = lancedb.connect_namespace(
|
||||
"rest",
|
||||
{"uri": "http://localhost:12345"},
|
||||
namespace_client_pushdown_operations=["QueryTable"],
|
||||
)
|
||||
assert db._route_pushdown_to_rust is True
|
||||
|
||||
def test_route_pushdown_to_rust_false_for_dir(self):
|
||||
"""A non-native (dir) connection keeps the Python pushdown path."""
|
||||
db = lancedb.connect_namespace("dir", {"root": self.temp_dir})
|
||||
assert db._route_pushdown_to_rust is False
|
||||
|
||||
def test_async_route_pushdown_to_rust_for_native_rest(self):
|
||||
"""The async connection must not silently bypass the read-freshness fix:
|
||||
a natively-built rest connection defers pushdown to Rust (regression test
|
||||
for the async path omitting the freshness header)."""
|
||||
db = lancedb.connect_namespace_async(
|
||||
"rest",
|
||||
{"uri": "http://localhost:12345"},
|
||||
namespace_client_pushdown_operations=["QueryTable"],
|
||||
)
|
||||
assert db._route_pushdown_to_rust is True
|
||||
|
||||
def test_async_route_pushdown_to_rust_false_for_dir(self):
|
||||
"""The async non-native (dir) connection keeps the Python pushdown path."""
|
||||
db = lancedb.connect_namespace_async("dir", {"root": self.temp_dir})
|
||||
assert db._route_pushdown_to_rust is False
|
||||
|
||||
def test_lance_table_to_arrow_uses_query_pushdown(self):
|
||||
namespace_client = _NamespaceClient()
|
||||
table = _namespace_lance_table(namespace_client)
|
||||
|
||||
assert table.to_arrow().equals(PUSHDOWN_DATA)
|
||||
assert table.to_pandas()["id"].tolist() == list(range(12))
|
||||
assert len(namespace_client.requests) == 2
|
||||
assert [request.id for request in namespace_client.requests] == [
|
||||
["geneva", "hist"],
|
||||
["geneva", "hist"],
|
||||
]
|
||||
# Unlimited reads cap k at i32::MAX (the namespace query_table `k`
|
||||
# field is i32); sys.maxsize would overflow the Rust binding.
|
||||
assert [request.k for request in namespace_client.requests] == [
|
||||
_MAX_QUERY_K,
|
||||
_MAX_QUERY_K,
|
||||
]
|
||||
assert all(r.k <= 2**31 - 1 for r in namespace_client.requests)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
class TestAsyncPushdownOperations:
|
||||
@@ -771,3 +893,42 @@ class TestAsyncPushdownOperations:
|
||||
"""Test that pushdown operations default to empty on async connection."""
|
||||
db = lancedb.connect_namespace_async("dir", {"root": self.temp_dir})
|
||||
assert len(db._namespace_client_pushdown_operations) == 0
|
||||
|
||||
async def test_async_table_to_arrow_uses_query_pushdown(self):
|
||||
namespace_client = _NamespaceClient()
|
||||
|
||||
table = AsyncTable(
|
||||
_FailingAsyncInner(),
|
||||
namespace_path=["geneva"],
|
||||
namespace_client=namespace_client,
|
||||
pushdown_operations={"QueryTable"},
|
||||
)
|
||||
|
||||
assert (await table.to_arrow()).equals(PUSHDOWN_DATA)
|
||||
assert (await table.to_pandas())["id"].tolist() == list(range(12))
|
||||
assert len(namespace_client.requests) == 2
|
||||
assert [request.id for request in namespace_client.requests] == [
|
||||
["geneva", "hist"],
|
||||
["geneva", "hist"],
|
||||
]
|
||||
# Unlimited reads cap k at i32::MAX (the namespace query_table `k`
|
||||
# field is i32); sys.maxsize would overflow the Rust binding.
|
||||
assert [request.k for request in namespace_client.requests] == [
|
||||
_MAX_QUERY_K,
|
||||
_MAX_QUERY_K,
|
||||
]
|
||||
assert all(r.k <= 2**31 - 1 for r in namespace_client.requests)
|
||||
|
||||
|
||||
def test_local_table_to_arrow_and_to_pandas_are_unchanged(tmp_path):
|
||||
db = lancedb.connect(str(tmp_path / "db"))
|
||||
table = db.create_table(
|
||||
"local",
|
||||
data=[
|
||||
{"id": 1, "vector": [1.0, 2.0]},
|
||||
{"id": 2, "vector": [3.0, 4.0]},
|
||||
],
|
||||
)
|
||||
|
||||
assert table.to_arrow().column("id").to_pylist() == [1, 2]
|
||||
assert table.to_pandas()["id"].tolist() == [1, 2]
|
||||
|
||||
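The pushdown tests above assert that unlimited reads send a `k` no larger than `2**31 - 1`, because the namespace `query_table` request stores `k` as an i32. A minimal sketch of that clamping, assuming a hypothetical helper `cap_query_k` and a local constant `I32_MAX` (neither is part of the change shown here; the real constant is `_MAX_QUERY_K` in `lancedb.namespace` and may be defined differently):

```python
import sys

# Assumption: the upper bound the tests check for, i.e. i32::MAX.
I32_MAX = 2**31 - 1


def cap_query_k(limit=None):
    """Hypothetical helper: clamp a requested row limit to the i32 range.

    None stands for an unlimited read and maps to the largest value an
    i32 field can carry.
    """
    if limit is None:
        return I32_MAX
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # sys.maxsize (2**63 - 1 on 64-bit builds) would not fit in an i32,
    # so anything above the bound is clamped down to it.
    return min(limit, I32_MAX)


unlimited = cap_query_k()
small = cap_query_k(10)
huge = cap_query_k(sys.maxsize)
```

Under these assumptions, an unlimited read and a `sys.maxsize` read both end up at the same capped value, while small limits pass through unchanged.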
686
python/python/tests/test_nested_fields.py
Normal file
@@ -0,0 +1,686 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
"""Regression matrix for nested field support across LanceDB Python APIs.
|
||||
|
||||
Covers the lifecycle described in lancedb/lancedb#3406:
|
||||
- Nested scalar, vector, and FTS index creation with full dotted paths
|
||||
- list_indices / index_stats return canonical full paths (not leaf names)
|
||||
- search, filter, append, optimize behaviour
|
||||
- Field-name edge cases: mixed case, literal-dot field names, same-name leaves
|
||||
- Both sync and async Python table APIs
|
||||
|
||||
The matrix uses the following field-name variants from the acceptance criteria:
|
||||
- rowId (camelCase top-level)
|
||||
- `row-id` (hyphenated top-level, escaped)
|
||||
- parent.`leaf.name` (struct leaf whose name contains a literal dot)
|
||||
- MetaData.userId (mixed-case nested path)
|
||||
- `meta-data`.`user-id` (hyphenated struct with hyphenated leaf)
|
||||
|
||||
Note: Lance forbids top-level field names that contain a '.', so the literal-dot
|
||||
edge case is exercised via a struct leaf field (parent.`leaf.name`) instead.
|
||||
"""
|
||||
|
||||
from datetime import timedelta
|
||||
|
||||
import pyarrow as pa
|
||||
import pytest
|
||||
import pytest_asyncio
|
||||
|
||||
import lancedb
|
||||
from lancedb.db import AsyncConnection, DBConnection
|
||||
from lancedb.index import BTree, FTS, IvfPq
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Constants
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
DIM = 8
|
||||
# IvfPq requires at least num_partitions * 256 rows by default; with so few
|
||||
# rows, num_partitions and num_sub_vectors must both be kept very low.
|
||||
NROWS = 256
|
||||
|
||||
|
||||
def _vec(row: int) -> list:
|
||||
return [float((row * DIM + i) % 256) for i in range(DIM)]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fixtures
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def sync_db(tmp_path) -> DBConnection:
|
||||
return lancedb.connect(tmp_path)
|
||||
|
||||
|
||||
@pytest_asyncio.fixture
|
||||
async def async_db(tmp_path) -> AsyncConnection:
|
||||
return await lancedb.connect_async(
|
||||
tmp_path, read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Schema / data builders
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _nested_scalar_schema() -> pa.Schema:
|
||||
"""Schema with nested scalar fields covering the acceptance-criteria names.
|
||||
|
||||
Top-level columns:
|
||||
- rowId int32 (camelCase top-level)
|
||||
- row-id int32 (hyphenated top-level name)
|
||||
- MetaData struct{userId int32} (mixed-case nested path)
|
||||
- meta-data struct{user-id int32} (hyphenated struct + hyphenated leaf)
|
||||
|
||||
Lance disallows top-level field names that contain '.' (e.g. a field
|
||||
literally named 'a.b'), so that edge case is tested separately using
|
||||
_literal_dot_schema() below.
|
||||
"""
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("rowId", pa.int32()),
|
||||
pa.field("row-id", pa.int32()),
|
||||
pa.field(
|
||||
"MetaData",
|
||||
pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
pa.field(
|
||||
"meta-data",
|
||||
pa.struct([pa.field("user-id", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _nested_scalar_data(nrows: int = NROWS) -> pa.Table:
|
||||
schema = _nested_scalar_schema()
|
||||
return pa.table(
|
||||
{
|
||||
"rowId": pa.array(list(range(nrows)), pa.int32()),
|
||||
"row-id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"MetaData": pa.array(
|
||||
[{"userId": i} for i in range(nrows)],
|
||||
type=pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
"meta-data": pa.array(
|
||||
[{"user-id": i} for i in range(nrows)],
|
||||
type=pa.struct([pa.field("user-id", pa.int32())]),
|
||||
),
|
||||
},
|
||||
schema=schema,
|
||||
)
|
||||
|
||||
|
||||
def _literal_dot_schema() -> pa.Schema:
|
||||
"""Schema where a struct *leaf* field is named with a literal dot.
|
||||
|
||||
The path used in the index API is ``parent.`leaf.name` ``.
|
||||
"""
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("id", pa.int32()),
|
||||
pa.field(
|
||||
"parent",
|
||||
pa.struct([pa.field("leaf.name", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _literal_dot_data(nrows: int = NROWS) -> pa.Table:
|
||||
parent_type = pa.struct([pa.field("leaf.name", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"parent": pa.array(
|
||||
[{"leaf.name": i} for i in range(nrows)],
|
||||
type=parent_type,
|
||||
),
|
||||
},
|
||||
schema=_literal_dot_schema(),
|
||||
)
|
||||
|
||||
|
||||
def _same_leaf_schema() -> pa.Schema:
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("StructA", pa.struct([pa.field("userId", pa.int32())])),
|
||||
pa.field("StructB", pa.struct([pa.field("userId", pa.int32())])),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _same_leaf_data(nrows: int = NROWS) -> pa.Table:
|
||||
t = pa.struct([pa.field("userId", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"StructA": pa.array([{"userId": i} for i in range(nrows)], type=t),
|
||||
"StructB": pa.array([{"userId": i * 10} for i in range(nrows)], type=t),
|
||||
},
|
||||
schema=_same_leaf_schema(),
|
||||
)
|
||||
|
||||
|
||||
def _nested_vector_schema() -> pa.Schema:
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("id", pa.int32()),
|
||||
pa.field(
|
||||
"image",
|
||||
pa.struct([pa.field("embedding", pa.list_(pa.float32(), DIM))]),
|
||||
),
|
||||
pa.field(
|
||||
"MetaData",
|
||||
pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _nested_vector_data(nrows: int = NROWS) -> pa.Table:
|
||||
embedding_type = pa.list_(pa.float32(), DIM)
|
||||
image_type = pa.struct([pa.field("embedding", embedding_type)])
|
||||
meta_type = pa.struct([pa.field("userId", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"image": pa.array(
|
||||
[{"embedding": _vec(i)} for i in range(nrows)],
|
||||
type=image_type,
|
||||
),
|
||||
"MetaData": pa.array(
|
||||
[{"userId": i} for i in range(nrows)],
|
||||
type=meta_type,
|
||||
),
|
||||
},
|
||||
schema=_nested_vector_schema(),
|
||||
)
|
||||
|
||||
|
||||
def _nested_fts_schema() -> pa.Schema:
|
||||
return pa.schema(
|
||||
[
|
||||
pa.field("id", pa.int32()),
|
||||
pa.field(
|
||||
"payload",
|
||||
pa.struct([pa.field("text", pa.utf8())]),
|
||||
),
|
||||
pa.field(
|
||||
"MetaData",
|
||||
pa.struct([pa.field("userId", pa.int32())]),
|
||||
),
|
||||
]
|
||||
)
|
||||
|
||||
|
||||
def _nested_fts_data(nrows: int = NROWS) -> pa.Table:
|
||||
words = ["alpha", "bravo", "charlie", "delta", "echo"]
|
||||
payload_type = pa.struct([pa.field("text", pa.utf8())])
|
||||
meta_type = pa.struct([pa.field("userId", pa.int32())])
|
||||
return pa.table(
|
||||
{
|
||||
"id": pa.array(list(range(nrows)), pa.int32()),
|
||||
"payload": pa.array(
|
||||
[{"text": words[i % len(words)]} for i in range(nrows)],
|
||||
type=payload_type,
|
||||
),
|
||||
"MetaData": pa.array(
|
||||
[{"userId": i} for i in range(nrows)],
|
||||
type=meta_type,
|
||||
),
|
||||
},
|
||||
schema=_nested_fts_schema(),
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _columns_by_name_sync(tbl) -> dict:
|
||||
return {idx.name: idx.columns for idx in tbl.list_indices()}
|
||||
|
||||
|
||||
async def _columns_by_name_async(tbl) -> dict:
|
||||
return {idx.name: idx.columns for idx in await tbl.list_indices()}
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# SYNC TESTS
|
||||
# ===========================================================================
|
||||
#
|
||||
# The sync LanceTable API uses:
|
||||
# - create_scalar_index(column, ...) for scalar (BTree/Bitmap/LabelList) indices
|
||||
# - create_fts_index(column, ...) for full-text-search indices
|
||||
# - create_index(...) for vector indices (older positional API)
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestNestedScalarIndexSync:
|
||||
"""Sync regression matrix for nested scalar (BTree) indices."""
|
||||
|
||||
def test_top_level_camelcase_field(self, sync_db):
|
||||
"""list_indices must return the full camelCase field name."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index("rowId", index_type="BTREE", name="rowid_idx")
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["rowid_idx"] == ["rowId"], (
|
||||
"list_indices must return 'rowId', not a truncated leaf name"
|
||||
)
|
||||
|
||||
def test_top_level_hyphenated_field_escaped(self, sync_db):
|
||||
"""Top-level field 'row-id' (hyphenated) accessed via escaped path."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index("`row-id`", index_type="BTREE", name="rowid_hyph_idx")
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["rowid_hyph_idx"] == ["`row-id`"], (
|
||||
"list_indices must return escaped path '`row-id`'"
|
||||
)
|
||||
|
||||
def test_struct_leaf_literal_dot_field_escaped(self, sync_db):
|
||||
"""Struct leaf with a literal-dot name: parent.`leaf.name`.
|
||||
|
||||
The index listing must use the full escaped path, not just the leaf.
|
||||
"""
|
||||
tbl = sync_db.create_table("t", _literal_dot_data())
|
||||
tbl.create_scalar_index(
|
||||
"parent.`leaf.name`", index_type="BTREE", name="leaf_dot_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["leaf_dot_idx"] == ["parent.`leaf.name`"], (
|
||||
"list_indices must return 'parent.`leaf.name`', not just '`leaf.name`'"
|
||||
)
|
||||
|
||||
def test_nested_mixed_case_path(self, sync_db):
|
||||
"""Nested path MetaData.userId (mixed case) must appear as full path."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="metadata_userid_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["metadata_userid_idx"] == ["MetaData.userId"], (
|
||||
"list_indices must return 'MetaData.userId', not leaf 'userId'"
|
||||
)
|
||||
|
||||
def test_nested_hyphenated_path_escaped(self, sync_db):
|
||||
"""`meta-data`.`user-id` path with both parts escaped."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"`meta-data`.`user-id`", index_type="BTREE", name="metauid_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["metauid_idx"] == ["`meta-data`.`user-id`"], (
|
||||
"list_indices must return '`meta-data`.`user-id`', not 'user-id'"
|
||||
)
|
||||
|
||||
def test_filter_on_nested_mixed_case(self, sync_db):
|
||||
"""WHERE filter on a nested dotted path works after index creation."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="metadata_userid_idx"
|
||||
)
|
||||
rows = tbl.search().where("MetaData.userId = 5").to_list()
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["MetaData"]["userId"] == 5
|
||||
|
||||
def test_append_and_list_indices_stable(self, sync_db):
|
||||
"""After appending rows the index listing must remain unchanged."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
|
||||
)
|
||||
tbl.add(_nested_scalar_data(nrows=4))
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
def test_optimize_and_list_indices_stable(self, tmp_path):
|
||||
"""After optimize the index listing must still show full paths."""
|
||||
db = lancedb.connect(tmp_path / "opt_db")
|
||||
tbl = db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
|
||||
)
|
||||
tbl.add(_nested_scalar_data(nrows=4))
|
||||
tbl.optimize()
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
def test_same_name_leaves_are_distinct(self, sync_db):
|
||||
"""Two structs sharing a leaf name must produce distinct index paths."""
|
||||
tbl = sync_db.create_table("same_leaf", _same_leaf_data())
|
||||
tbl.create_scalar_index(
|
||||
"StructA.userId", index_type="BTREE", name="a_userid_idx"
|
||||
)
|
||||
tbl.create_scalar_index(
|
||||
"StructB.userId", index_type="BTREE", name="b_userid_idx"
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["a_userid_idx"] == ["StructA.userId"]
|
||||
assert col_map["b_userid_idx"] == ["StructB.userId"]
|
||||
|
||||
def test_index_stats_canonical_path(self, sync_db):
|
||||
"""index_stats round-trip: create on nested field, verify row count."""
|
||||
tbl = sync_db.create_table("t", _nested_scalar_data())
|
||||
tbl.create_scalar_index(
|
||||
"MetaData.userId", index_type="BTREE", name="meta_uid_idx"
|
||||
)
|
||||
stats = tbl.index_stats("meta_uid_idx")
|
||||
assert stats is not None
|
||||
assert stats.index_type == "BTREE"
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
|
||||
class TestNestedVectorIndexSync:
|
||||
"""Sync regression matrix for nested vector (IvfPq) indices."""
|
||||
|
||||
def test_nested_vector_index_full_path(self, sync_db):
|
||||
"""Listing after vector index creation must use the full dotted path."""
|
||||
tbl = sync_db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"], (
|
||||
"list_indices must return 'image.embedding', not leaf 'embedding'"
|
||||
)
|
||||
|
||||
def test_nested_vector_search(self, sync_db):
|
||||
"""Vector search on nested embedding field must return results."""
|
||||
tbl = sync_db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
results = (
|
||||
tbl.search(_vec(0), vector_column_name="image.embedding").limit(5).to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
|
||||
def test_nested_vector_index_stats(self, sync_db):
|
||||
"""index_stats for a nested vector index must reflect correct row count."""
|
||||
tbl = sync_db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
stats = tbl.index_stats("image_emb_idx")
|
||||
assert stats is not None
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
def test_nested_vector_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the vector index listing must be stable."""
|
||||
db = lancedb.connect(tmp_path / "vec_opt_db")
|
||||
tbl = db.create_table("vt", _nested_vector_data())
|
||||
tbl.create_index(
|
||||
num_partitions=2,
|
||||
num_sub_vectors=2,
|
||||
vector_column_name="image.embedding",
|
||||
name="image_emb_idx",
|
||||
)
|
||||
tbl.add(_nested_vector_data(nrows=4))
|
||||
tbl.optimize()
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"]
|
||||
|
||||
|
||||
class TestNestedFTSIndexSync:
|
||||
"""Sync regression matrix for nested FTS indices."""
|
||||
|
||||
def test_nested_fts_index_full_path(self, sync_db):
|
||||
"""FTS index on payload.text must be listed with the full path."""
|
||||
tbl = sync_db.create_table("ft", _nested_fts_data())
|
||||
tbl.create_fts_index("payload.text", name="payload_text_idx")
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"], (
|
||||
"list_indices must return 'payload.text', not leaf 'text'"
|
||||
)
|
||||
|
||||
def test_nested_fts_search(self, sync_db):
|
||||
"""FTS search on a nested text field must return correct results."""
|
||||
tbl = sync_db.create_table("ft", _nested_fts_data())
|
||||
tbl.create_fts_index("payload.text", name="payload_text_idx")
|
||||
results = (
|
||||
tbl.search("alpha", query_type="fts", fts_columns="payload.text")
|
||||
.limit(10)
|
||||
.to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
assert all(row["payload"]["text"] == "alpha" for row in results)
|
||||
|
||||
def test_nested_fts_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the FTS index listing must be stable."""
|
||||
db = lancedb.connect(tmp_path / "fts_opt_db")
|
||||
tbl = db.create_table("ft", _nested_fts_data())
|
||||
tbl.create_fts_index("payload.text", name="payload_text_idx")
|
||||
tbl.add(_nested_fts_data(nrows=4))
|
||||
tbl.optimize()
|
||||
col_map = _columns_by_name_sync(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"]
|
||||
|
||||
|
||||
# ===========================================================================
|
||||
# ASYNC TESTS
|
||||
# ===========================================================================
|
||||
#
|
||||
# The async AsyncTable API uses create_index(column, config=...) uniformly
|
||||
# for scalar, vector, and FTS indices.
|
||||
# ===========================================================================
|
||||
|
||||
|
||||
class TestNestedScalarIndexAsync:
|
||||
"""Async regression matrix for nested scalar (BTree) indices."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_top_level_camelcase_field(self, async_db):
|
||||
"""list_indices must return the full camelCase field name."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("rowId", config=BTree(), name="rowid_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["rowid_idx"] == ["rowId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_top_level_hyphenated_field_escaped(self, async_db):
|
||||
"""Hyphenated top-level field accessed via escaped path."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("`row-id`", config=BTree(), name="rowid_hyph_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["rowid_hyph_idx"] == ["`row-id`"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_struct_leaf_literal_dot_field_escaped(self, async_db):
|
||||
"""Struct leaf with a literal-dot name: parent.`leaf.name`."""
|
||||
tbl = await async_db.create_table("t", _literal_dot_data())
|
||||
await tbl.create_index(
|
||||
"parent.`leaf.name`", config=BTree(), name="leaf_dot_idx"
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["leaf_dot_idx"] == ["parent.`leaf.name`"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_mixed_case_path(self, async_db):
|
||||
"""Mixed-case nested path MetaData.userId must appear as full path."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index(
|
||||
"MetaData.userId", config=BTree(), name="metadata_userid_idx"
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["metadata_userid_idx"] == ["MetaData.userId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_hyphenated_path_escaped(self, async_db):
|
||||
"""`meta-data`.`user-id` path with both parts escaped."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index(
|
||||
"`meta-data`.`user-id`", config=BTree(), name="metauid_idx"
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["metauid_idx"] == ["`meta-data`.`user-id`"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_filter_on_nested_mixed_case(self, async_db):
|
||||
"""WHERE filter on a nested dotted path works after index creation."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index(
|
||||
"MetaData.userId", config=BTree(), name="metadata_userid_idx"
|
||||
)
|
||||
rows = await tbl.query().where("MetaData.userId = 5").to_list()
|
||||
assert len(rows) == 1
|
||||
assert rows[0]["MetaData"]["userId"] == 5
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_index_stats_canonical_path(self, async_db):
|
||||
"""index_stats round-trip: create on nested field, verify stats."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
|
||||
stats = await tbl.index_stats("meta_uid_idx")
|
||||
assert stats is not None
|
||||
assert stats.index_type == "BTREE"
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_append_and_list_indices_stable(self, async_db):
|
||||
"""After appending rows the index listing must remain unchanged."""
|
||||
tbl = await async_db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
|
||||
await tbl.add(_nested_scalar_data(nrows=4))
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_optimize_and_list_indices_stable(self, tmp_path):
|
||||
"""After optimize the index listing must still show full paths."""
|
||||
db = await lancedb.connect_async(
|
||||
tmp_path / "opt_db", read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
tbl = await db.create_table("t", _nested_scalar_data())
|
||||
await tbl.create_index("MetaData.userId", config=BTree(), name="meta_uid_idx")
|
||||
await tbl.add(_nested_scalar_data(nrows=4))
|
||||
await tbl.optimize()
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["meta_uid_idx"] == ["MetaData.userId"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_same_name_leaves_are_distinct(self, async_db):
|
||||
"""Two structs sharing a leaf name must produce distinct index paths."""
|
||||
tbl = await async_db.create_table("same_leaf", _same_leaf_data())
|
||||
await tbl.create_index("StructA.userId", config=BTree(), name="a_userid_idx")
|
||||
await tbl.create_index("StructB.userId", config=BTree(), name="b_userid_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["a_userid_idx"] == ["StructA.userId"]
|
||||
assert col_map["b_userid_idx"] == ["StructB.userId"]
|
||||
|
||||
|
||||
class TestNestedVectorIndexAsync:
|
||||
"""Async regression matrix for nested vector (IvfPq) indices."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_index_full_path(self, async_db):
|
||||
"""Listing after vector index creation must use the full dotted path."""
|
||||
tbl = await async_db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_search(self, async_db):
|
||||
"""Vector search on nested embedding field must return results."""
|
||||
tbl = await async_db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
results = (
|
||||
await tbl.query()
|
||||
.nearest_to(_vec(0))
|
||||
.column("image.embedding")
|
||||
.limit(5)
|
||||
.to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_index_stats(self, async_db):
|
||||
"""index_stats for a nested vector index must reflect correct row count."""
|
||||
tbl = await async_db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
stats = await tbl.index_stats("image_emb_idx")
|
||||
assert stats is not None
|
||||
assert stats.num_indexed_rows == NROWS
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_vector_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the vector index listing must be stable."""
|
||||
db = await lancedb.connect_async(
|
||||
tmp_path / "vec_opt_db", read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
tbl = await db.create_table("vt", _nested_vector_data())
|
||||
await tbl.create_index(
|
||||
"image.embedding",
|
||||
config=IvfPq(num_partitions=2, num_sub_vectors=2),
|
||||
name="image_emb_idx",
|
||||
)
|
||||
await tbl.add(_nested_vector_data(nrows=4))
|
||||
await tbl.optimize()
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["image_emb_idx"] == ["image.embedding"]
|
||||
|
||||
|
||||
class TestNestedFTSIndexAsync:
|
||||
"""Async regression matrix for nested FTS indices."""
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_fts_index_full_path(self, async_db):
|
||||
"""FTS index on payload.text must be listed with the full path."""
|
||||
tbl = await async_db.create_table("ft", _nested_fts_data())
|
||||
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"]
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_fts_search(self, async_db):
|
||||
"""FTS search on a nested text field must return correct results."""
|
||||
tbl = await async_db.create_table("ft", _nested_fts_data())
|
||||
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
|
||||
results = (
|
||||
await tbl.query()
|
||||
.nearest_to_text("alpha", columns="payload.text")
|
||||
.limit(10)
|
||||
.to_list()
|
||||
)
|
||||
assert len(results) > 0
|
||||
assert all(row["payload"]["text"] == "alpha" for row in results)
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_nested_fts_append_optimize(self, tmp_path):
|
||||
"""After append and optimize the FTS index listing must be stable."""
|
||||
db = await lancedb.connect_async(
|
||||
tmp_path / "fts_opt_db", read_consistency_interval=timedelta(seconds=0)
|
||||
)
|
||||
tbl = await db.create_table("ft", _nested_fts_data())
|
||||
await tbl.create_index("payload.text", config=FTS(), name="payload_text_idx")
|
||||
await tbl.add(_nested_fts_data(nrows=4))
|
||||
await tbl.optimize()
|
||||
col_map = await _columns_by_name_async(tbl)
|
||||
assert col_map["payload_text_idx"] == ["payload.text"]
|
||||
@@ -188,6 +188,18 @@ def test_nested_struct_list():
     assert schema == expect_schema


+def test_bare_generic_raises_type_error():
+    # A bare, unparameterised List/Tuple has no element type to map to Arrow.
+    # It should raise a clear TypeError, not crash with AttributeError: __args__.
+    for bare in (List, Tuple):
+
+        class TestModel(pydantic.BaseModel):
+            items: bare
+
+        with pytest.raises(TypeError, match="unsupported type"):
+            pydantic_to_schema(TestModel)
+
+
 def test_nested_struct_list_optional():
     class SplitInfo(pydantic.BaseModel):
         start_frame: int
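The failure mode guarded by `test_bare_generic_raises_type_error` can be sketched with the standard library alone. This is a minimal illustration, not the `pydantic_to_schema` implementation: `element_type` is a hypothetical helper, and its error text only echoes the `match="unsupported type"` pattern used in the test.

```python
from typing import List, Tuple, get_args


def element_type(annotation):
    """Return the first type argument of a parameterised generic.

    Hypothetical helper for illustration only; it is not part of lancedb.
    """
    # get_args() returns an empty tuple for a bare List or Tuple,
    # so there is no element type to map to an Arrow type.
    args = get_args(annotation)
    if not args:
        raise TypeError(f"unsupported type: {annotation!r} has no element type")
    return args[0]


# A parameterised generic exposes its element type.
print(element_type(List[int]))

# A bare generic is rejected with a clear TypeError instead of an AttributeError.
for bare in (List, Tuple):
    try:
        element_type(bare)
    except TypeError as exc:
        print(exc)
```

Checking for empty type arguments up front is what turns an opaque attribute lookup failure into an actionable message.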
@@ -255,8 +255,9 @@ def test_plain_scan_query_to_pandas_blob_projection(tmp_db):
     assert df["double_id"].tolist() == [6, 8]


+@pytest.mark.parametrize("blob_mode", ["bytes", "descriptions"])
 def test_plain_scan_query_to_pandas_blob_mode_does_not_collect_arrow(
-    tmp_db, monkeypatch
+    tmp_db, monkeypatch, blob_mode
 ):
     pytest.importorskip("lance")
     table = tmp_db.create_table(
@@ -269,10 +270,69 @@ def test_plain_scan_query_to_pandas_blob_mode_does_not_collect_arrow(

     monkeypatch.setattr(query, "to_arrow", fail_to_arrow)

-    df = query.to_pandas(blob_mode="bytes")
+    df = query.to_pandas(blob_mode=blob_mode)

     assert df["id"].tolist() == [1]
-    assert df["blob"].tolist() == [b"one"]
+    if blob_mode == "bytes":
+        assert df["blob"].tolist() == [b"one"]
+    else:
+        first = df["blob"].iloc[0]
+        assert first != b"one"
+        assert not hasattr(first, "readall")
+
+
+def test_plain_scan_query_to_pandas_blob_descriptions_flatten_uses_scanner(
+    tmp_db, monkeypatch
+):
+    pytest.importorskip("lance")
+    table = tmp_db.create_table(
+        "test_query_to_pandas_blob_desc_flatten", _blob_query_data()
+    )
+    query = table.search().where("id = 1").select(["id", "blob"])
+
+    def fail_to_arrow(*args, **kwargs):
+        raise AssertionError("to_arrow should not be called before scanner pandas")
+
+    monkeypatch.setattr(query, "to_arrow", fail_to_arrow)
+
+    df = query.to_pandas(blob_mode="descriptions", flatten=True)
+
+    assert df["id"].tolist() == [1]
+    assert any(column == "blob" or column.startswith("blob.") for column in df.columns)
+
+
+def test_plain_scan_query_to_pandas_scanner_state(tmp_db):
+    pytest.importorskip("lance")
+    data = _blob_query_data()
+    table = tmp_db.create_table("test_query_to_pandas_scanner_state", data.slice(0, 2))
+    table.add(data.slice(2, 2))
+
+    fragments = table.to_lance().get_fragments()
+    assert len(fragments) == 2
+
+    query = (
+        table.search()
+        .select(["id", "blob"])
+        .with_row_address()
+        .fragment_ids([fragments[1].fragment_id])
+    )
+    query_obj = query.to_query_object()
+    assert query_obj.with_row_address is True
+    assert query_obj.fragment_ids == [fragments[1].fragment_id]
+
+    df = query.to_pandas(blob_mode="descriptions")
+
+    assert df["id"].tolist() == [3, 4]
+    assert "_rowaddr" in df.columns
+    assert {rowaddr >> 32 for rowaddr in df["_rowaddr"]} == {fragments[1].fragment_id}
+
+    df_by_fragment = (
+        table.search()
+        .select(["id", "blob"])
+        .with_fragments([fragments[0]])
+        .to_pandas(blob_mode="descriptions")
+    )
+    assert df_by_fragment["id"].tolist() == [1, 2]


 @pytest.mark.asyncio
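The `rowaddr >> 32` assertion in `test_plain_scan_query_to_pandas_scanner_state` relies on the fragment id occupying the upper 32 bits of `_rowaddr`. The sketch below shows that packing in plain Python. Treating the lower 32 bits as a row offset within the fragment is an assumption here, since the test only checks the upper half; `make_row_address` and `split_row_address` are hypothetical helper names.

```python
from typing import Tuple

OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_row_address(fragment_id: int, row_offset: int) -> int:
    """Pack a fragment id (upper 32 bits) and a row offset (assumed lower 32 bits)."""
    return (fragment_id << OFFSET_BITS) | (row_offset & OFFSET_MASK)


def split_row_address(rowaddr: int) -> Tuple[int, int]:
    """Recover (fragment_id, row_offset) from a packed 64-bit address."""
    return rowaddr >> OFFSET_BITS, rowaddr & OFFSET_MASK


# Two rows in fragment 1 share the same upper half, as the test asserts.
addresses = [make_row_address(1, 0), make_row_address(1, 1)]
print({addr >> OFFSET_BITS for addr in addresses})
print(split_row_address(make_row_address(3, 7)))
```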
@@ -312,8 +372,9 @@ async def test_async_plain_scan_query_to_pandas_blob_projection(tmp_db_async):


 @pytest.mark.asyncio
+@pytest.mark.parametrize("blob_mode", ["bytes", "descriptions"])
 async def test_async_plain_scan_query_to_pandas_blob_mode_does_not_collect_arrow(
-    tmp_db_async, monkeypatch
+    tmp_db_async, monkeypatch, blob_mode
 ):
     pytest.importorskip("lance")
     table = await tmp_db_async.create_table(
@@ -326,10 +387,15 @@ async def test_async_plain_scan_query_to_pandas_blob_mode_does_not_collect_arrow

     monkeypatch.setattr(query, "to_arrow", fail_to_arrow)

-    df = await query.to_pandas(blob_mode="bytes")
+    df = await query.to_pandas(blob_mode=blob_mode)

     assert df["id"].tolist() == [1]
-    assert df["blob"].tolist() == [b"one"]
+    if blob_mode == "bytes":
+        assert df["blob"].tolist() == [b"one"]
+    else:
+        first = df["blob"].iloc[0]
+        assert first != b"one"
+        assert not hasattr(first, "readall")


 def test_vector_query_to_pandas_blob_mode_requires_native_path(tmp_db):
@@ -342,6 +408,18 @@ def test_vector_query_to_pandas_blob_mode_requires_native_path(tmp_db):
         )


+def test_vector_query_to_pandas_blob_descriptions_requires_plain_scan(tmp_db):
+    pytest.importorskip("lance")
+    table = tmp_db.create_table(
+        "test_vector_query_blob_descriptions", _blob_query_data()
+    )
+
+    with pytest.raises(RuntimeError, match="plain scan query"):
+        table.search([1.0, 0.0]).select(["blob", "vector"]).limit(1).to_pandas(
+            blob_mode="descriptions"
+        )
+
+
 def test_order_by_plain_query(mem_db):
     table = mem_db.create_table(
         "test_order_by",
@@ -154,6 +154,118 @@ async def test_async_checkout():
|
||||
assert await table.count_rows() == 300
|
||||
|
||||
|
||||
def _branch_open_handler(request):
|
||||
if "/branches/list" in request.path:
|
||||
body = json.dumps(
|
||||
{
|
||||
"branches": {
|
||||
"exp": {
|
||||
"parentBranch": None,
|
||||
"parentVersion": 1,
|
||||
"createAt": 1,
|
||||
"manifestSize": 1,
|
||||
}
|
||||
}
|
||||
}
|
||||
).encode()
|
||||
else:
|
||||
# describe (table open + version/branch validation)
|
||||
body = json.dumps({"version": 2, "schema": {"fields": []}}).encode()
|
||||
request.send_response(200)
|
||||
request.send_header("Content-Type", "application/json")
|
||||
request.end_headers()
|
||||
request.wfile.write(body)
|
||||
|
||||
|
||||
def test_remote_open_table_branch_and_version():
|
||||
with mock_lancedb_connection(_branch_open_handler) as db:
|
||||
# version-only (and "main" + version) time-travels the main chain
|
||||
assert db.open_table("test", version=2) is not None
|
||||
assert db.open_table("test", branch="main", version=2).current_branch() is None
|
||||
|
||||
# a non-main branch opens a handle scoped to that branch, with or
|
||||
# without a version
|
||||
assert db.open_table("test", branch="exp").current_branch() == "exp"
|
||||
assert db.open_table("test", branch="exp", version=2).current_branch() == "exp"
|
||||
|
||||
|
||||
def test_remote_table_branches_sync():
|
||||
# Branch CRUD + current_branch on the sync RemoteTable. The handle returned
|
||||
# by create/checkout must stay a RemoteTable scoped to the branch.
|
||||
from lancedb.remote.table import RemoteTable
|
||||
|
||||
def handler(request):
|
||||
if "/branches/list" in request.path:
|
||||
body = json.dumps(
|
||||
{
|
||||
"branches": {
|
||||
"exp": {
|
||||
"parentBranch": None,
|
||||
"parentVersion": 1,
|
||||
"createAt": 1,
|
||||
"manifestSize": 1,
|
||||
}
|
||||
}
|
||||
}
|
||||
).encode()
|
||||
elif "/branches/create" in request.path or "/branches/delete" in request.path:
|
||||
body = b"{}"
|
||||
else:
|
||||
# describe (table open + checkout validation)
|
||||
body = json.dumps({"version": 1, "schema": {"fields": []}}).encode()
|
||||
request.send_response(200)
|
||||
request.send_header("Content-Type", "application/json")
|
||||
request.end_headers()
|
||||
request.wfile.write(body)
|
||||
|
||||
with mock_lancedb_connection(handler) as db:
|
||||
table = db.open_table("test")
|
||||
assert isinstance(table, RemoteTable)
|
||||
assert table.current_branch() is None
|
||||
|
||||
branch = table.branches.create("exp")
|
||||
assert isinstance(branch, RemoteTable)
|
||||
assert branch.current_branch() == "exp"
|
||||
|
||||
# list + checkout round trip; checkout also yields a branch-scoped handle
|
||||
assert "exp" in table.branches.list()
|
||||
checked = table.branches.checkout("exp")
|
||||
assert isinstance(checked, RemoteTable)
|
||||
assert checked.current_branch() == "exp"
|
||||
|
||||
table.branches.delete("exp")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_remote_open_table_branch_and_version():
|
||||
async with mock_lancedb_connection_async(_branch_open_handler) as db:
|
||||
# version-only (and "main" + version) time-travels the main chain
|
||||
assert await db.open_table("test", version=2) is not None
|
||||
main_v2 = await db.open_table("test", branch="main", version=2)
|
||||
assert main_v2.current_branch() is None
|
||||
|
||||
# a non-main branch opens a handle scoped to that branch
|
||||
exp = await db.open_table("test", branch="exp")
|
||||
assert exp.current_branch() == "exp"
|
||||
exp_v2 = await db.open_table("test", branch="exp", version=2)
|
||||
assert exp_v2.current_branch() == "exp"
|
||||
|
||||
|
||||
def test_remote_table_branch_survives_pickle():
|
||||
# Regression: a branch-scoped handle must keep its branch across a
|
||||
# pickle/fork round-trip (it used to reopen on main).
|
||||
with mock_lancedb_connection(_branch_open_handler) as db:
|
||||
branch = db.open_table("test", branch="exp")
|
||||
assert branch.current_branch() == "exp"
|
||||
restored = pickle.loads(pickle.dumps(branch))
|
||||
assert restored.current_branch() == "exp"
|
||||
|
||||
# the pinned version is carried through as well
|
||||
branch_v2 = db.open_table("test", branch="exp", version=2)
|
||||
restored_v2 = pickle.loads(pickle.dumps(branch_v2))
|
||||
assert restored_v2.current_branch() == "exp"
|
||||
|
||||
|
||||
def test_table_len_sync():
|
||||
def handler(request):
|
||||
if request.path == "/v1/table/test/create/?mode=create":
|
||||
|
||||
@@ -344,6 +344,12 @@ def test_mrr_reranker(tmp_path):
     assert len(result_deduped) == len(result)


+def test_mrr_reranker_empty_input():
+    reranker = MRRReranker()
+    with pytest.raises(ValueError, match="must not be empty"):
+        reranker.rerank_multivector([])
+
+
 def test_rrf_reranker_distance():
     data = pa.table(
         {
@@ -4,6 +4,7 @@
|
||||
|
||||
import os
|
||||
import sys
|
||||
import threading
|
||||
import warnings
|
||||
from datetime import date, datetime, timedelta
|
||||
from time import sleep
|
||||
@@ -21,6 +22,7 @@ import pytest
|
||||
from lancedb.conftest import MockTextEmbeddingFunction
|
||||
from lancedb.db import AsyncConnection, DBConnection
|
||||
from lancedb.embeddings import EmbeddingFunctionConfig, EmbeddingFunctionRegistry
|
||||
from lancedb.expr import col, lit
|
||||
from lancedb.pydantic import LanceModel, Vector
|
||||
from lancedb.table import LanceTable
|
||||
from pydantic import BaseModel
|
||||
@@ -299,6 +301,16 @@ def test_create_table(mem_db: DBConnection):
|
||||
assert expected == tbl
|
||||
|
||||
|
||||
def test_create_table_rejects_single_dictionary(mem_db: DBConnection):
|
||||
data = {"vector": [3.1, 4.1], "item": "foo", "price": 10.0}
|
||||
with pytest.raises(ValueError) as excep_info:
|
||||
mem_db.create_table("test", data=data)
|
||||
assert (
|
||||
str(excep_info.value) == "Cannot create or add rows from a single dictionary. "
|
||||
"Use a list of dictionaries instead."
|
||||
)
|
||||
|
||||
|
||||
def test_empty_table(mem_db: DBConnection):
|
||||
schema = pa.schema(
|
||||
[
|
||||
@@ -328,8 +340,8 @@ def test_add_dictionary(mem_db: DBConnection):
     with pytest.raises(ValueError) as excep_info:
         tbl.add(data=data)
     assert (
-        str(excep_info.value)
-        == "Cannot add a single dictionary to a table. Use a list."
+        str(excep_info.value) == "Cannot create or add rows from a single dictionary. "
+        "Use a list of dictionaries instead."
     )
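Both `test_create_table_rejects_single_dictionary` and `test_add_dictionary` now expect the same error text. A guard of the following shape would produce it. This is only a sketch: `ensure_row_list` is a hypothetical name, and the real check inside lancedb may be structured differently.

```python
def ensure_row_list(data):
    """Reject a single dict so callers pass a list of row dicts instead.

    Hypothetical helper for illustration; the message matches the tests above.
    """
    if isinstance(data, dict):
        raise ValueError(
            "Cannot create or add rows from a single dictionary. "
            "Use a list of dictionaries instead."
        )
    return data


row = {"vector": [3.1, 4.1], "item": "foo", "price": 10.0}

# Wrapping the row in a list is accepted.
print(ensure_row_list([row]))

# Passing the bare dict raises the shared error message.
try:
    ensure_row_list(row)
except ValueError as exc:
    print(exc)
```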
@@ -927,6 +939,346 @@ async def test_async_tags(mem_db_async: AsyncConnection):
|
||||
)
|
||||
|
||||
|
||||
def test_branches(tmp_path):
|
||||
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(0))
|
||||
table = db.create_table(
|
||||
"test",
|
||||
data=[
|
||||
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
|
||||
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
|
||||
],
|
||||
)
|
||||
assert table.count_rows() == 2
|
||||
|
||||
# fork an isolated, writable branch from main
|
||||
branch = table.branches.create("exp")
|
||||
assert branch.count_rows() == 2
|
||||
branch.add(data=[{"vector": [10.0, 11.0], "item": "baz", "price": 30.0}])
|
||||
|
||||
# writes on the branch do not touch main
|
||||
assert branch.count_rows() == 3
|
||||
assert table.count_rows() == 2
|
||||
|
||||
# the branch is listed, with main (None) as its parent
|
||||
branches = table.branches.list()
|
||||
assert "exp" in branches
|
||||
assert branches["exp"]["parent_branch"] is None
|
||||
|
||||
# from_ref="main" is equivalent to the default
|
||||
table.branches.create("exp2", from_ref="main")
|
||||
assert table.branches.list()["exp2"]["parent_branch"] is None
|
||||
|
||||
# checkout returns a handle scoped to the branch's latest
|
||||
checked_out = table.branches.checkout("exp")
|
||||
assert checked_out.count_rows() == 3
|
||||
|
||||
# delete removes it
|
||||
table.branches.delete("exp")
|
||||
table.branches.delete("exp2")
|
||||
assert "exp" not in table.branches.list()
|
||||
|
||||
|
||||
def test_branch_handle_tracks_concurrent_writes(tmp_path):
|
||||
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(0))
|
||||
table = db.create_table("t", [{"id": 1}])
|
||||
|
||||
# two independent handles on the same branch
|
||||
writer = table.branches.create("exp")
|
||||
reader = db.open_table("t", branch="exp")
|
||||
assert reader.count_rows() == 1
|
||||
|
||||
# a concurrent write on the branch is visible to the other handle
|
||||
writer.add([{"id": 2}])
|
||||
assert reader.count_rows() == 2
|
||||
# main is unaffected
|
||||
assert table.count_rows() == 1
|
||||
|
||||
|
||||
def test_branch_name_validation(tmp_path):
|
||||
db = lancedb.connect(tmp_path)
|
||||
table = db.create_table("t", [{"id": 1}])
|
||||
|
||||
with pytest.raises(ValueError, match="non-empty"):
|
||||
table.branches.create("")
|
||||
with pytest.raises(ValueError, match="non-empty"):
|
||||
table.branches.checkout("")
|
||||
with pytest.raises(ValueError, match="non-empty"):
|
||||
table.branches.delete("")
|
||||
|
||||
|
||||
def test_branches_preserve_namespace(tmp_path):
|
||||
pytest.importorskip(
|
||||
"lance"
|
||||
) # namespace_path routes through lance's DirectoryNamespace
|
||||
db = lancedb.connect(tmp_path)
|
||||
table = db.create_table("t", [{"id": 1}], namespace_path=["ns1"])
|
||||
assert table.namespace == ["ns1"]
|
||||
|
||||
branch = table.branches.create("exp")
|
||||
assert branch.namespace == ["ns1"]
|
||||
assert branch.id == table.id
|
||||
|
||||
# opening the branch directly also preserves namespace identity
|
||||
opened = db.open_table("t", namespace_path=["ns1"], branch="exp")
|
||||
assert opened.namespace == ["ns1"]
|
||||
|
||||
|
||||
def test_open_table_with_branch(tmp_path):
|
||||
db = lancedb.connect(tmp_path)
|
||||
table = db.create_table("t", [{"i": 1}])
|
||||
table.branches.create("exp").add([{"i": 2}])
|
||||
|
||||
# open_table(branch=...) returns a handle scoped to the branch
|
||||
assert db.open_table("t", branch="exp").count_rows() == 2
|
||||
# opening without branch still tracks main
|
||||
assert db.open_table("t").count_rows() == 1
|
||||
|
||||
|
||||
def test_open_table_with_branch_version(tmp_path):
|
||||
db = lancedb.connect(tmp_path, read_consistency_interval=timedelta(0))
|
||||
|
||||
# main: a single fork-point row
|
||||
t = db.create_table("t", [{"i": 0}])
|
||||
main_v1 = t.version
|
||||
|
||||
# fork "exp", then advance exp AND main independently past the fork so they
|
||||
# diverge while sharing version numbers
|
||||
exp = t.branches.create("exp")
|
||||
exp.add([{"i": 1}]) # exp: {0, 1}
|
||||
exp_v2 = exp.version
|
||||
exp.add([{"i": 2}]) # exp HEAD: {0, 1, 2}
|
||||
t.add([{"i": 100}, {"i": 101}, {"i": 102}]) # main HEAD: {0, 100, 101, 102}
|
||||
assert exp_v2 == t.version, "branch and main must share the version number"
|
||||
|
||||
# open exp at the shared version: the data must be exp's, not main's. count
|
||||
# alone cannot prove this (main@v2 also exists), so assert provenance by
|
||||
# content.
|
||||
pinned = db.open_table("t", branch="exp", version=exp_v2)
|
||||
assert pinned.current_branch() == "exp"
|
||||
assert pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
|
||||
assert pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
|
||||
assert pinned.count_rows("i = 100") == 0 # main's divergent rows are invisible
|
||||
|
||||
# the same coordinate is reachable directly via branches.checkout(name, version)
|
||||
pinned_direct = t.branches.checkout("exp", exp_v2)
|
||||
assert pinned_direct.current_branch() == "exp"
|
||||
assert pinned_direct.count_rows() == 2
|
||||
|
||||
# the HEADs are unaffected
|
||||
assert db.open_table("t", branch="exp").count_rows() == 3
|
||||
assert db.open_table("t").count_rows() == 4
|
||||
|
||||
# version-only (no branch) time-travels main itself: its fork-point version
|
||||
# holds only main's first row, and the shared version number resolves to
|
||||
# main's data, not the branch's ("opens main at the version")
|
||||
old_main = db.open_table("t", version=main_v1)
|
||||
assert old_main.current_branch() is None
|
||||
assert old_main.count_rows() == 1
|
||||
shared_on_main = db.open_table("t", version=exp_v2)
|
||||
assert shared_on_main.current_branch() is None
|
||||
assert shared_on_main.count_rows() == 4
|
||||
|
||||
# detached head: writing to a pinned version is rejected
|
||||
with pytest.raises((ValueError, RuntimeError), match="cannot be modified"):
|
||||
pinned.add([{"i": 9}])
|
||||
|
||||
# a nonexistent version is rejected -- on main, and on a branch (a distinct
|
||||
# resolution path, on the branch's manifests)
|
||||
with pytest.raises((ValueError, RuntimeError)):
|
||||
db.open_table("t", version=9999)
|
||||
with pytest.raises((ValueError, RuntimeError)):
|
||||
db.open_table("t", branch="exp", version=9999)
|
||||
|
||||
# checkout_latest re-attaches the pinned handle to the BRANCH's HEAD
|
||||
# (writable again), not main's HEAD, and not staying pinned
|
||||
pinned.checkout_latest()
|
||||
assert pinned.current_branch() == "exp"
|
||||
assert pinned.count_rows() == 3 # exp HEAD, not main's 4
|
||||
pinned.add([{"i": 3}])
|
||||
assert pinned.count_rows() == 4 # writable again
|
||||
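The scenario in `test_open_table_with_branch_version` is easier to follow with a toy model in which every branch keeps its own chain of numbered snapshots, so a version number alone does not identify the data. The class below is a standard-library sketch under that assumption. `ToyVersionedTable` and its methods are hypothetical and say nothing about how Lance actually stores manifests.

```python
from typing import Dict, List, Optional, Tuple


class ToyVersionedTable:
    """Toy model: snapshots are keyed by (branch, version); None means main."""

    def __init__(self, rows: List[int]) -> None:
        self._snapshots: Dict[Tuple[Optional[str], int], List[int]] = {
            (None, 1): list(rows)
        }
        self._head: Dict[Optional[str], int] = {None: 1}

    def create_branch(self, name: str) -> None:
        # The branch forks from main's HEAD and continues the numbering from there.
        version = self._head[None]
        self._snapshots[(name, version)] = list(self._snapshots[(None, version)])
        self._head[name] = version

    def add(self, rows: List[int], branch: Optional[str] = None) -> int:
        # Appending creates the next version on that branch only.
        head = self._head[branch]
        self._snapshots[(branch, head + 1)] = self._snapshots[(branch, head)] + list(rows)
        self._head[branch] = head + 1
        return head + 1

    def read(self, branch: Optional[str] = None, version: Optional[int] = None) -> List[int]:
        # Without a version, read the branch HEAD; otherwise resolve on that branch.
        if version is None:
            version = self._head[branch]
        if (branch, version) not in self._snapshots:
            raise ValueError(f"no version {version} on branch {branch or 'main'}")
        return self._snapshots[(branch, version)]


table = ToyVersionedTable([0])
table.create_branch("exp")
exp_v2 = table.add([1], branch="exp")   # exp: [0, 1]
table.add([2], branch="exp")            # exp HEAD: [0, 1, 2]
main_v2 = table.add([100, 101, 102])    # main HEAD: [0, 100, 101, 102]

print(exp_v2 == main_v2)                # the version number is shared
print(table.read("exp", exp_v2))        # exp's data at that version
print(table.read(None, main_v2))        # main's data at the same number
```

Keying the lookup on the pair rather than on the version alone is what lets the pinned handle in the test see `i = 1` but not `i = 100`.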
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_namespace_open_table_with_branch(tmp_path):
|
||||
pytest.importorskip("lance") # "dir" impl is lance.namespace.DirectoryNamespace
|
||||
db = lancedb.connect_namespace_async("dir", {"root": str(tmp_path)})
|
||||
await db.create_namespace(["ns1"])
|
||||
table = await db.create_table("t", [{"id": 1}], namespace_path=["ns1"])
|
||||
branch = await table.branches.create("exp")
|
||||
await branch.add([{"id": 2}])
|
||||
|
||||
# open_table(branch=...) on the async namespace connection must work
|
||||
opened = await db.open_table("t", namespace_path=["ns1"], branch="exp")
|
||||
assert await opened.count_rows() == 2
|
||||
|
||||
|
||||
def test_namespace_open_table_with_branch_version(tmp_path):
|
||||
pytest.importorskip("lance") # "dir" impl is lance.namespace.DirectoryNamespace
|
||||
db = lancedb.connect_namespace("dir", {"root": str(tmp_path)})
|
||||
db.create_namespace(["ns1"])
|
||||
t = db.create_table("t", [{"i": 0}], namespace_path=["ns1"])
|
||||
|
||||
# fork "exp", then advance exp AND main past the fork so they diverge while
|
||||
# sharing version numbers
|
||||
exp = t.branches.create("exp")
|
||||
exp.add([{"i": 1}])
|
||||
exp_v2 = exp.version
|
||||
exp.add([{"i": 2}])
|
||||
t.add([{"i": 100}, {"i": 101}, {"i": 102}])
|
||||
assert exp_v2 == t.version, "branch and main must share the version number"
|
||||
|
||||
# open_table(branch=, version=) on the namespace connection reads the
|
||||
# branch's data at that version, not main's
|
||||
pinned = db.open_table("t", namespace_path=["ns1"], branch="exp", version=exp_v2)
|
||||
assert pinned.current_branch() == "exp"
|
||||
assert pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
|
||||
assert pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
|
||||
assert pinned.count_rows("i = 100") == 0 # main's divergent rows are invisible
|
||||
assert db.open_table("t", namespace_path=["ns1"], branch="exp").count_rows() == 3
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_namespace_open_table_with_branch_version(tmp_path):
|
||||
pytest.importorskip("lance") # "dir" impl is lance.namespace.DirectoryNamespace
|
||||
db = lancedb.connect_namespace_async("dir", {"root": str(tmp_path)})
|
||||
await db.create_namespace(["ns1"])
|
||||
t = await db.create_table("t", [{"i": 0}], namespace_path=["ns1"])
|
||||
|
||||
# fork "exp", then advance exp AND main past the fork so they diverge while
|
||||
# sharing version numbers
|
||||
exp = await t.branches.create("exp")
|
||||
await exp.add([{"i": 1}])
|
||||
exp_v2 = await exp.version()
|
||||
await exp.add([{"i": 2}])
|
||||
await t.add([{"i": 100}, {"i": 101}, {"i": 102}])
|
||||
assert exp_v2 == await t.version(), "branch and main must share the version number"
|
||||
|
||||
# open_table(branch=, version=) on the async namespace connection reads the
|
||||
# branch's data at that version, not main's
|
||||
pinned = await db.open_table(
|
||||
"t", namespace_path=["ns1"], branch="exp", version=exp_v2
|
||||
)
|
||||
assert pinned.current_branch() == "exp"
|
||||
assert await pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
|
||||
assert await pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
|
||||
assert await pinned.count_rows("i = 100") == 0 # main's rows are invisible
|
||||
assert (
|
||||
await (
|
||||
await db.open_table("t", namespace_path=["ns1"], branch="exp")
|
||||
).count_rows()
|
||||
== 3
|
||||
)
|
||||
|
||||
|
||||
def test_branch_to_lance_targets_branch(tmp_path):
|
||||
pytest.importorskip("lance")
|
||||
db = lancedb.connect(tmp_path)
|
||||
table = db.create_table("t", [{"i": 1}])
|
||||
branch = table.branches.create("exp")
|
||||
branch.add([{"i": 2}]) # branch: 2 rows, main: 1 row
|
||||
|
||||
assert branch.to_lance().count_rows() == 2
|
||||
assert table.to_lance().count_rows() == 1
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_branches(tmp_path):
|
||||
db = await lancedb.connect_async(tmp_path)
|
||||
table = await db.create_table(
|
||||
"test",
|
||||
data=[
|
||||
{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
|
||||
{"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
|
||||
],
|
||||
)
|
||||
assert await table.count_rows() == 2
|
||||
|
||||
branch = await table.branches.create("exp")
|
||||
assert await branch.count_rows() == 2
|
||||
await branch.add(data=[{"vector": [10.0, 11.0], "item": "baz", "price": 30.0}])
|
||||
|
||||
assert await branch.count_rows() == 3
|
||||
assert await table.count_rows() == 2
|
||||
|
||||
branches = await table.branches.list()
|
||||
assert "exp" in branches
|
||||
assert branches["exp"]["parent_branch"] is None
|
||||
|
||||
await table.branches.create("exp2", from_ref="main")
|
||||
assert (await table.branches.list())["exp2"]["parent_branch"] is None
|
||||
|
||||
checked_out = await table.branches.checkout("exp")
|
||||
assert await checked_out.count_rows() == 3
|
||||
|
||||
await table.branches.delete("exp")
|
||||
await table.branches.delete("exp2")
|
||||
assert "exp" not in await table.branches.list()
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_open_table_with_branch_version(tmp_path):
|
||||
db = await lancedb.connect_async(tmp_path, read_consistency_interval=timedelta(0))
|
||||
|
||||
# main: a single fork-point row
|
||||
t = await db.create_table("t", [{"i": 0}])
|
||||
main_v1 = await t.version()
|
||||
|
||||
# fork "exp", then advance exp AND main independently past the fork so they
|
||||
# diverge while sharing version numbers
|
||||
exp = await t.branches.create("exp")
|
||||
await exp.add([{"i": 1}]) # exp: {0, 1}
|
||||
exp_v2 = await exp.version()
|
||||
await exp.add([{"i": 2}]) # exp HEAD: {0, 1, 2}
|
||||
await t.add([{"i": 100}, {"i": 101}, {"i": 102}]) # main HEAD: {0, 100, 101, 102}
|
||||
assert exp_v2 == await t.version(), "branch and main must share the version number"
|
||||
|
||||
# open exp at the shared version: the data must be exp's, not main's. count
|
||||
# alone cannot prove this (main@v2 also exists), so assert provenance by
|
||||
# content.
|
||||
pinned = await db.open_table("t", branch="exp", version=exp_v2)
|
||||
assert pinned.current_branch() == "exp"
|
||||
assert await pinned.count_rows() == 2 # not exp HEAD (3), not main@v2 (4)
|
||||
assert await pinned.count_rows("i = 1") == 1 # exp's post-fork row is visible
|
||||
assert await pinned.count_rows("i = 100") == 0 # main's rows are invisible
|
||||
|
||||
# the same coordinate is reachable directly via branches.checkout(name, version)
|
||||
pinned_direct = await t.branches.checkout("exp", exp_v2)
|
||||
assert pinned_direct.current_branch() == "exp"
|
||||
assert await pinned_direct.count_rows() == 2
|
||||
|
||||
# the HEADs are unaffected
|
||||
assert await (await db.open_table("t", branch="exp")).count_rows() == 3
|
||||
assert await (await db.open_table("t")).count_rows() == 4
|
||||
|
||||
# version-only (no branch) time-travels main itself: its fork-point version
|
||||
# holds only main's first row, and the shared version number resolves to
|
||||
# main's data, not the branch's ("opens main at the version")
|
||||
old_main = await db.open_table("t", version=main_v1)
|
||||
assert old_main.current_branch() is None
|
||||
assert await old_main.count_rows() == 1
|
||||
shared_on_main = await db.open_table("t", version=exp_v2)
|
||||
assert shared_on_main.current_branch() is None
|
||||
assert await shared_on_main.count_rows() == 4
|
||||
|
||||
# detached head: writing to a pinned version is rejected
|
||||
with pytest.raises((ValueError, RuntimeError), match="cannot be modified"):
|
||||
await pinned.add([{"i": 9}])
|
||||
|
||||
# a nonexistent version is rejected -- on main, and on a branch
|
||||
with pytest.raises((ValueError, RuntimeError)):
|
||||
await db.open_table("t", version=9999)
|
||||
with pytest.raises((ValueError, RuntimeError)):
|
||||
await db.open_table("t", branch="exp", version=9999)
|
||||
|
||||
# checkout_latest re-attaches the pinned handle to the BRANCH's HEAD
|
||||
# (writable again), not main's HEAD, and not staying pinned
|
||||
await pinned.checkout_latest()
|
||||
assert pinned.current_branch() == "exp"
|
||||
assert await pinned.count_rows() == 3 # exp HEAD, not main's 4
|
||||
await pinned.add([{"i": 3}])
|
||||
assert await pinned.count_rows() == 4 # writable again
|
||||
|
||||
|
||||
@patch("lancedb.table.AsyncTable.create_index")
|
||||
def test_create_index_method(mock_create_index, mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
@@ -1288,6 +1640,45 @@ def test_add_with_empty_fixed_size_list_drops_bad_rows(mem_db: DBConnection):
|
||||
assert np.allclose(data["embedding"].to_pylist()[0], np.array([0.1] * 16))
|
||||
|
||||
|
||||
def test_add_nullable_struct_with_none(mem_db: DBConnection):
|
||||
"""Regression test for issue #2654: a nullable struct column whose
|
||||
first batch contains only None values must not crash in
|
||||
_align_field_types with AttributeError: 'pyarrow.lib.DataType'
|
||||
object has no attribute 'fields'.
|
||||
|
||||
PyArrow infers an all-None struct column as `null` (not `struct`),
|
||||
so the type-alignment path needs to handle the case where the
|
||||
source field type is null and use the target type directly.
|
||||
"""
|
||||
# Use the v2.1 file format so that nullable structs are supported.
|
||||
table = mem_db.create_table(
|
||||
"test_nullable_struct",
|
||||
schema=pa.schema(
|
||||
[
|
||||
pa.field("id", pa.string()),
|
||||
pa.field(
|
||||
"data",
|
||||
pa.struct([pa.field("x", pa.float32())]),
|
||||
nullable=True,
|
||||
),
|
||||
]
|
||||
),
|
||||
storage_options=dict(new_table_data_storage_version="2.1"),
|
||||
)
|
||||
|
||||
# Adding a row with a non-null struct should work.
|
||||
table.add([{"id": "1", "data": {"x": 1.0}}])
|
||||
|
||||
# Adding a row with None for the nullable struct field should also
|
||||
# work — this is what used to crash.
|
||||
table.add([{"id": "2", "data": None}])
|
||||
|
||||
result = table.to_arrow()
|
||||
assert result.num_rows == 2
|
||||
assert result.column("id").to_pylist() == ["1", "2"]
|
||||
assert result.column("data").to_pylist() == [{"x": 1.0}, None]
|
||||
|
||||
|
||||
def test_add_with_integer_embeddings_preserves_casting(mem_db: DBConnection):
|
||||
class Schema(LanceModel):
|
||||
text: str
|
||||
@@ -1586,6 +1977,38 @@ def test_delete(mem_db: DBConnection):
|
||||
assert table.to_arrow()["id"].to_pylist() == [1]
|
||||
|
||||
|
||||
def test_delete_expr(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
"my_table",
|
||||
data=[
|
||||
{"vector": [1.1, 0.9], "id": 0},
|
||||
{"vector": [1.2, 1.9], "id": 1},
|
||||
{"vector": [1.3, 2.9], "id": 2},
|
||||
],
|
||||
)
|
||||
assert len(table) == 3
|
||||
delete_res = table.delete(col("id") == lit(0))
|
||||
assert delete_res.version == 2
|
||||
assert len(table) == 2
|
||||
assert sorted(table.to_arrow()["id"].to_pylist()) == [1, 2]
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_delete_expr_async(mem_db_async: AsyncConnection):
|
||||
table = await mem_db_async.create_table(
|
||||
"my_table",
|
||||
data=[
|
||||
{"vector": [1.1, 0.9], "id": 0},
|
||||
{"vector": [1.2, 1.9], "id": 1},
|
||||
{"vector": [1.3, 2.9], "id": 2},
|
||||
],
|
||||
)
|
||||
assert await table.count_rows() == 3
|
||||
await table.delete(col("id") == lit(0))
|
||||
assert await table.count_rows() == 2
|
||||
assert sorted((await table.to_arrow())["id"].to_pylist()) == [1, 2]
|
||||
|
||||
|
||||
def test_update(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
"my_table",
|
||||
@@ -1771,6 +2194,50 @@ def test_merge_insert(mem_db: DBConnection):
|
||||
)
|
||||
|
||||
|
||||
def test_merge_insert_by_source_delete_expr(mem_db: DBConnection):
    table = mem_db.create_table(
        "my_table",
        data=pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]}),
    )
    new_data = pa.table({"a": [2, 4], "b": ["x", "z"]})

    # replace-range, limiting the source-absent delete with an Expr condition
    merge_insert_res = (
        table.merge_insert("a")
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .when_not_matched_by_source_delete(col("a") > lit(2))
        .execute(new_data)
    )
    assert merge_insert_res.num_inserted_rows == 1
    assert merge_insert_res.num_updated_rows == 1
    assert merge_insert_res.num_deleted_rows == 1

    expected = pa.table({"a": [1, 2, 4], "b": ["a", "x", "z"]})
    assert table.to_arrow().sort_by("a") == expected
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_merge_insert_by_source_delete_expr_async(
|
||||
mem_db_async: AsyncConnection,
|
||||
):
|
||||
data = pa.table({"a": [1, 2, 3], "b": ["a", "b", "c"]})
|
||||
table = await mem_db_async.create_table("some_table", data=data)
|
||||
new_data = pa.table({"a": [2, 4], "b": ["x", "z"]})
|
||||
|
||||
# replace-range, limiting the source-absent delete with an Expr condition
|
||||
await (
|
||||
table.merge_insert("a")
|
||||
.when_matched_update_all()
|
||||
.when_not_matched_insert_all()
|
||||
.when_not_matched_by_source_delete(col("a") > lit(2))
|
||||
.execute(new_data)
|
||||
)
|
||||
|
||||
expected = pa.table({"a": [1, 2, 4], "b": ["a", "x", "z"]})
|
||||
assert (await table.to_arrow()).sort_by("a") == expected
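The row counts asserted in the two merge tests above can be reproduced with a small in-memory model. This is only a sketch of the three clauses' semantics as the tests describe them; `merge_insert_sketch` and its `delete_if` callback are hypothetical names, not part of the LanceDB API, and the real implementation operates on Arrow data rather than lists of dicts.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def merge_insert_sketch(
    target: List[Row],
    source: List[Row],
    key: str,
    delete_if: Callable[[Row], bool],
) -> Tuple[List[Row], Dict[str, int]]:
    """Hypothetical in-memory model of the three merge clauses used above."""
    source_by_key = {row[key]: row for row in source}
    target_keys = {row[key] for row in target}
    stats = {"inserted": 0, "updated": 0, "deleted": 0}
    merged: List[Row] = []

    for row in target:
        if row[key] in source_by_key:
            # when_matched_update_all: the source row replaces the target row.
            merged.append(dict(source_by_key[row[key]]))
            stats["updated"] += 1
        elif delete_if(row):
            # when_not_matched_by_source_delete(condition): drop target-only
            # rows, but only those that satisfy the condition.
            stats["deleted"] += 1
        else:
            merged.append(dict(row))

    for row in source:
        if row[key] not in target_keys:
            # when_not_matched_insert_all: source-only rows are appended.
            merged.append(dict(row))
            stats["inserted"] += 1

    return sorted(merged, key=lambda r: r[key]), stats


target = [{"a": 1, "b": "a"}, {"a": 2, "b": "b"}, {"a": 3, "b": "c"}]
source = [{"a": 2, "b": "x"}, {"a": 4, "b": "z"}]
rows, stats = merge_insert_sketch(target, source, "a", lambda r: r["a"] > 2)
```

Row `a=1` survives because it is absent from the source but does not satisfy the delete condition `a > 2`; that is why the tests expect `[1, 2, 4]` rather than `[2, 4]`.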
|
||||
|
||||
|
||||
# We vary the data format because there are slight differences in how
|
||||
# subschemas are handled in different formats
|
||||
@pytest.mark.parametrize(
|
||||
@@ -2019,18 +2486,32 @@ def test_create_scalar_index(mem_db: DBConnection):
|
||||
def test_create_index_nested_field_paths(mem_db: DBConnection):
|
||||
schema = pa.schema(
|
||||
[
|
||||
pa.field("rowId", pa.int32()),
|
||||
pa.field("row-id", pa.int32()),
|
||||
pa.field("userId", pa.int32()),
|
||||
pa.field("metadata", pa.struct([pa.field("user_id", pa.int32())])),
|
||||
pa.field("MetaData", pa.struct([pa.field("userId", pa.int32())])),
|
||||
pa.field(
|
||||
"image",
|
||||
pa.struct([pa.field("embedding", pa.list_(pa.float32(), 2))]),
|
||||
),
|
||||
pa.field("payload", pa.struct([pa.field("text", pa.string())])),
|
||||
pa.field("meta-data", pa.struct([pa.field("user-id", pa.int32())])),
|
||||
pa.field("literal", pa.struct([pa.field("a.b", pa.int32())])),
|
||||
]
|
||||
)
|
||||
data = pa.Table.from_pylist(
|
||||
[
|
||||
{
|
||||
"rowId": i,
|
||||
"row-id": i,
|
||||
"userId": i,
|
||||
"metadata": {"user_id": i},
|
||||
"MetaData": {"userId": i},
|
||||
"image": {"embedding": [float(i), float(i + 1)]},
|
||||
"payload": {"text": f"document {i}"},
|
||||
"meta-data": {"user-id": i},
|
||||
"literal": {"a.b": i},
|
||||
}
|
||||
for i in range(256)
|
||||
],
|
||||
@@ -2038,19 +2519,37 @@ def test_create_index_nested_field_paths(mem_db: DBConnection):
|
||||
)
|
||||
table = mem_db.create_table("nested_index_paths", data=data)
|
||||
|
||||
table.create_scalar_index("rowId", name="row_id_idx")
|
||||
table.create_scalar_index("`row-id`", name="row_dash_id_idx")
|
||||
table.create_scalar_index("userId", name="top_user_id_idx")
|
||||
table.create_scalar_index("metadata.user_id", name="metadata_user_id_idx")
|
||||
table.create_scalar_index("MetaData.userId", name="mixed_case_metadata_user_id_idx")
|
||||
table.create_scalar_index("`meta-data`.`user-id`", name="escaped_names_idx")
|
||||
table.create_scalar_index("literal.`a.b`", name="literal_dot_idx")
|
||||
table.create_index(
|
||||
vector_column_name="image.embedding",
|
||||
num_partitions=1,
|
||||
num_sub_vectors=1,
|
||||
name="image_embedding_idx",
|
||||
)
|
||||
table.create_fts_index("payload.text", with_position=False, name="payload_text_idx")
|
||||
|
||||
indices = sorted(table.list_indices(), key=lambda idx: idx.name)
|
||||
assert [(idx.name, idx.index_type, idx.columns) for idx in indices] == [
|
||||
("escaped_names_idx", "BTree", ["`meta-data`.`user-id`"]),
|
||||
("image_embedding_idx", "IvfPq", ["image.embedding"]),
|
||||
("literal_dot_idx", "BTree", ["literal.`a.b`"]),
|
||||
("metadata_user_id_idx", "BTree", ["metadata.user_id"]),
|
||||
("mixed_case_metadata_user_id_idx", "BTree", ["MetaData.userId"]),
|
||||
("payload_text_idx", "FTS", ["payload.text"]),
|
||||
("row_dash_id_idx", "BTree", ["`row-id`"]),
|
||||
("row_id_idx", "BTree", ["rowId"]),
|
||||
("top_user_id_idx", "BTree", ["userId"]),
|
||||
]
|
||||
for index in indices:
|
||||
stats = table.index_stats(index.name)
|
||||
assert stats is not None
|
||||
assert stats.num_indexed_rows == 256
|
||||
|
||||
vector_results = (
|
||||
table.search([0.0, 1.0], vector_column_name="image.embedding")
|
||||
@@ -2068,6 +2567,63 @@ def test_create_index_nested_field_paths(mem_db: DBConnection):
|
||||
assert len(filtered_results) == 1
|
||||
assert filtered_results[0]["metadata"]["user_id"] == 42
|
||||
|
||||
escaped_results = table.search().where("`row-id` = 43").limit(1).to_list()
|
||||
assert len(escaped_results) == 1
|
||||
assert escaped_results[0]["row-id"] == 43
|
||||
|
||||
fts_results = table.search("document 44", query_type="fts").limit(1).to_list()
|
||||
assert len(fts_results) == 1
|
||||
assert fts_results[0]["payload"]["text"] == "document 44"
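The column strings used in this test (`metadata.user_id`, `` `meta-data`.`user-id` ``, ``literal.`a.b` ``) suggest a convention where dots separate nested fields and backticks quote a segment that itself contains a dot or a dash. A minimal sketch of that convention, assuming no escaped backticks inside a segment; `split_field_path` is a hypothetical helper and not LanceDB's actual parser:

```python
from typing import List


def split_field_path(path: str) -> List[str]:
    """Split a dotted field path into segments, honouring backtick quoting.

    Hypothetical helper for illustration only.
    """
    segments: List[str] = []
    current: List[str] = []
    in_backticks = False
    for ch in path:
        if ch == "`":
            # A backtick toggles quoting and is not part of the segment name.
            in_backticks = not in_backticks
        elif ch == "." and not in_backticks:
            # An unquoted dot ends the current segment.
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    if in_backticks:
        raise ValueError(f"unterminated backtick in {path!r}")
    segments.append("".join(current))
    return segments
```

Under this reading, ``literal.`a.b` `` addresses the single child field named `a.b` of the struct `literal`, not a field `b` nested under `a`.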
|
||||
|
||||
|
||||
def test_index_config_fields(mem_db: DBConnection):
|
||||
"""Test that IndexConfig exposes the new rich metadata fields."""
|
||||
vec_array = pa.array(
|
||||
[[float(i), float(i + 1)] for i in range(300)], pa.list_(pa.float32(), 2)
|
||||
)
|
||||
data = pa.Table.from_pydict({"x": list(range(300)), "vector": vec_array})
|
||||
table = mem_db.create_table("index_config_fields", data=data)
|
||||
table.create_scalar_index("x", index_type="BTREE")
|
||||
table.create_index(
|
||||
vector_column_name="vector",
|
||||
num_partitions=1,
|
||||
num_sub_vectors=1,
|
||||
)
|
||||
|
||||
indices = {idx.name: idx for idx in table.list_indices()}
|
||||
|
||||
scalar_idx = indices["x_idx"]
|
||||
assert scalar_idx.index_uuid is not None
|
||||
assert isinstance(scalar_idx.index_uuid, str)
|
||||
assert scalar_idx.num_indexed_rows is not None
|
||||
assert scalar_idx.num_indexed_rows == 300
|
||||
assert scalar_idx.num_unindexed_rows is not None
|
||||
assert scalar_idx.num_unindexed_rows == 0
|
||||
assert scalar_idx.num_segments is not None
|
||||
assert scalar_idx.num_segments >= 1
|
||||
assert scalar_idx.size_bytes is not None
|
||||
assert scalar_idx.size_bytes > 0
|
||||
assert scalar_idx.created_at is not None
|
||||
from datetime import datetime, timezone
|
||||
|
||||
assert isinstance(scalar_idx.created_at, datetime)
|
||||
assert scalar_idx.created_at.tzinfo == timezone.utc
|
||||
|
||||
# __getitem__ compatibility
|
||||
assert scalar_idx["index_uuid"] == scalar_idx.index_uuid
|
||||
assert scalar_idx["num_indexed_rows"] == scalar_idx.num_indexed_rows
|
||||
assert scalar_idx["created_at"] == scalar_idx.created_at
|
||||
|
||||
# index_details is parsed from JSON into a Python object
|
||||
assert scalar_idx.index_details is not None
|
||||
assert isinstance(scalar_idx.index_details, dict)
|
||||
assert scalar_idx["index_details"] == scalar_idx.index_details
|
||||
|
||||
vector_idx = indices["vector_idx"]
|
||||
assert vector_idx.index_uuid is not None
|
||||
assert vector_idx.num_indexed_rows == 300
|
||||
assert isinstance(vector_idx.index_details, dict)
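The `__getitem__` compatibility checked above means each field can be read either as an attribute or by string key. A rough pure-Python sketch of that pattern follows; `IndexConfigSketch` is a hypothetical stand-in (the real `IndexConfig` is a PyO3 class with more fields), and raising `KeyError` for unknown keys is an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, List, Optional


@dataclass
class IndexConfigSketch:
    """Hypothetical stand-in for an index description object."""

    name: str
    index_type: str
    columns: List[str]
    num_indexed_rows: Optional[int] = None

    def __getitem__(self, key: str) -> Any:
        # Only declared fields are addressable by key; anything else is a
        # KeyError, mirroring dict behaviour.
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        return getattr(self, key)


cfg = IndexConfigSketch("x_idx", "BTree", ["x"], num_indexed_rows=300)
```

Keeping key access alongside attribute access lets callers that previously treated the result as a dict keep working.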
|
||||
|
||||
|
||||
def test_empty_query(mem_db: DBConnection):
|
||||
table = mem_db.create_table(
|
||||
@@ -2798,3 +3354,38 @@ def test_sanitize_data_metadata_not_stripped():
|
||||
assert result_schema.metadata is not None
|
||||
assert result_schema.metadata[b"existing_key"] == b"existing_value"
|
||||
assert result_schema.metadata[b"new_key"] == b"new_value"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_async_search_runs_embedding_on_dedicated_executor(
|
||||
mem_db_async: AsyncConnection,
|
||||
):
|
||||
# Regression test for #3310: AsyncTable.search() must run the (potentially
|
||||
# blocking) query-embedding call on the dedicated embedding executor, not
|
||||
# asyncio's default executor -- which is shared with other blocking I/O and
|
||||
# can be starved by a slow embedding call under concurrent load.
|
||||
func = MockTextEmbeddingFunction.create()
|
||||
|
||||
class Schema(LanceModel):
|
||||
text: str = func.SourceField()
|
||||
vector: Vector(func.ndims()) = func.VectorField()
|
||||
|
||||
table = await mem_db_async.create_table("embed_executor", schema=Schema)
|
||||
await table.add([{"text": "hello world"}])
|
||||
|
||||
captured_threads: List[str] = []
|
||||
original = MockTextEmbeddingFunction.generate_embeddings
|
||||
|
||||
def record_thread(self, texts):
|
||||
captured_threads.append(threading.current_thread().name)
|
||||
return original(self, texts)
|
||||
|
||||
# Patch only around the search so we capture the query-embedding call, not
|
||||
# the add-time source-embedding call.
|
||||
with patch.object(MockTextEmbeddingFunction, "generate_embeddings", record_thread):
|
||||
await (await table.search("a query string")).limit(1).to_list()
|
||||
|
||||
assert captured_threads, "search did not invoke the embedding function"
|
||||
assert all(name.startswith("lancedb-embedding") for name in captured_threads), (
|
||||
f"embedding ran off the dedicated executor: {captured_threads}"
|
||||
)
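The property this regression test checks can be illustrated with the standard library alone: a blocking call is handed to a named thread pool instead of asyncio's default executor, and the worker thread's name reveals which pool ran it. This is a sketch only; `EMBEDDING_EXECUTOR`, `embed` and `embed_query` are hypothetical names, and only the `lancedb-embedding` prefix is taken from the test's assertion.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical dedicated pool; the prefix matches the one the test looks for.
EMBEDDING_EXECUTOR = ThreadPoolExecutor(
    max_workers=2, thread_name_prefix="lancedb-embedding"
)


def embed(texts: List[str]) -> List[str]:
    # Stand-in for a blocking embedding call: it reports the name of the
    # thread it ran on instead of returning vectors.
    return [threading.current_thread().name for _ in texts]


async def embed_query(text: str) -> str:
    loop = asyncio.get_running_loop()
    # Passing an explicit executor keeps this work off the loop's default
    # executor, which other blocking I/O may be sharing.
    names = await loop.run_in_executor(EMBEDDING_EXECUTOR, embed, [text])
    return names[0]


thread_name = asyncio.run(embed_query("a query string"))
```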
|
||||
|
||||
@@ -149,6 +149,36 @@ def test_value_to_sql_dict():
|
||||
assert value_to_sql({}) == "named_struct()"
|
||||
|
||||
|
||||
def test_value_to_sql_dict_key_escaping():
|
||||
# Struct field names that contain a single quote must be escaped (doubled)
|
||||
# the same way string values are, otherwise value_to_sql emits invalid SQL
|
||||
# such as named_struct('it's', 1).
|
||||
assert value_to_sql({"it's": 1}) == "named_struct('it''s', 1)"
|
||||
assert (
|
||||
value_to_sql({"o'brien": "d'angelo"}) == "named_struct('o''brien', 'd''angelo')"
|
||||
)
|
||||
# Escaping also applies to keys of nested structs.
|
||||
assert (
|
||||
value_to_sql({"outer": {"in'r": 1}})
|
||||
== "named_struct('outer', named_struct('in''r', 1))"
|
||||
)
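The escaping rule asserted above (double every single quote, in keys as well as in string values) can be sketched as follows. `to_sql_sketch` is a hypothetical, reduced re-implementation for illustration; it covers only bool, int, float, str and dict, and is not the library's `value_to_sql`.

```python
def to_sql_sketch(value) -> str:
    """Hypothetical, reduced model of SQL literal rendering."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # Double embedded single quotes so the literal stays well-formed.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, dict):
        parts = []
        for key, item in value.items():
            # Keys go through the same string escaping as values.
            parts.append(to_sql_sketch(str(key)))
            parts.append(to_sql_sketch(item))
        return "named_struct(" + ", ".join(parts) + ")"
    raise NotImplementedError(f"unsupported type: {type(value).__name__}")
```

Routing keys through the same string branch as values is what makes nested structs work without any extra handling.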
|
||||
|
||||
|
||||
def test_value_to_sql_numpy_scalars():
    # numpy scalars (e.g. pulled from an ndarray or a pandas column) must
    # convert the same way as their native Python counterparts. np.float64
    # already worked by virtue of subclassing float, but the integer / bool
    # / float32 scalars previously raised NotImplementedError.
    import numpy as np

    assert value_to_sql(np.int32(5)) == "5"
    assert value_to_sql(np.int64(5)) == "5"
    assert value_to_sql(np.float32(1.5)) == "1.5"
    assert value_to_sql(np.float64(1.5)) == "1.5"
    assert value_to_sql(np.bool_(True)) == "TRUE"
    assert value_to_sql(np.bool_(False)) == "FALSE"
|
||||
|
||||
|
||||
def test_append_vector_columns():
|
||||
registry = EmbeddingFunctionRegistry.get_instance()
|
||||
registry.register("test")(MockTextEmbeddingFunction)
|
||||
|
||||
@@ -18,7 +18,10 @@ use lancedb::{
|
||||
connection::Connection as LanceConnection,
|
||||
connection::NamespaceClientPushdownOperation,
|
||||
database::namespace::LanceNamespaceDatabase,
|
||||
database::{CreateTableMode, Database, ReadConsistency},
|
||||
database::{
|
||||
CreateFunctionRequest, CreateMaterializedViewRequest, CreateTableMode, Database,
|
||||
ReadConsistency, RefreshMaterializedViewRequest, TableLineageRequest,
|
||||
},
|
||||
};
|
||||
use pyo3::{
|
||||
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
|
||||
@@ -27,6 +30,92 @@ use pyo3::{
|
||||
types::{PyDict, PyDictMethods},
|
||||
};
|
||||
|
||||
/// A registered function, as returned by `list_functions`.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct FunctionInfo {
|
||||
pub name: String,
|
||||
pub language: String,
|
||||
pub return_type: String,
|
||||
pub description: String,
|
||||
}
|
||||
|
||||
/// A registered materialized view definition.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct MaterializedViewInfo {
|
||||
pub name: String,
|
||||
pub source_table: String,
|
||||
pub projection: Vec<String>,
|
||||
pub udf_columns: Vec<String>,
|
||||
pub filter: Option<String>,
|
||||
pub auto_refresh: bool,
|
||||
}
|
||||
|
||||
/// One inflight server-side job.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct JobInfo {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub age_seconds: Option<i64>,
|
||||
pub command: Option<String>,
|
||||
pub units_done: Option<i64>,
|
||||
pub units_total: Option<i64>,
|
||||
pub committed: bool,
|
||||
pub rows_skipped: u64,
|
||||
pub error: Option<String>,
|
||||
}
|
||||
|
||||
/// One durable, completed/terminal server-side job record (SHOW JOB HISTORY).
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct JobHistoryEntry {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub created_ms: i64,
|
||||
pub updated_ms: i64,
|
||||
pub completed_ms: Option<i64>,
|
||||
pub rows_processed: Option<i64>,
|
||||
pub rows_skipped: Option<i64>,
|
||||
pub error: Option<String>,
|
||||
pub events: Option<String>,
|
||||
}
|
||||
|
||||
/// One per-row UDF error recorded by `error_policy=skip` (SHOW ERRORS).
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct JobErrorEntry {
|
||||
pub job_id: String,
|
||||
pub table: String,
|
||||
pub column: String,
|
||||
pub error_type: String,
|
||||
pub error_message: String,
|
||||
pub fragment_id: Option<i64>,
|
||||
pub source_row_id: Option<i64>,
|
||||
pub table_version: Option<i64>,
|
||||
pub age_seconds: Option<i64>,
|
||||
}
|
||||
|
||||
/// The plan a REFRESH MATERIALIZED VIEW would execute (EXPLAIN REFRESH).
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct MvRefreshPlan {
|
||||
pub table_name: String,
|
||||
pub has_work: bool,
|
||||
pub source_version: u64,
|
||||
pub last_refreshed_version: Option<u64>,
|
||||
pub full_refresh: bool,
|
||||
pub rebuild: bool,
|
||||
pub units_total: u64,
|
||||
}
|
||||
|
||||
#[pyclass]
|
||||
pub struct Connection {
|
||||
inner: Option<LanceConnection>,
|
||||
@@ -310,6 +399,308 @@ impl Connection {
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, language, return_type, body, options=None))]
|
||||
pub fn create_function(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
language: String,
|
||||
return_type: String,
|
||||
body: String,
|
||||
options: Option<HashMap<String, String>>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.create_function(CreateFunctionRequest {
|
||||
name,
|
||||
language,
|
||||
return_type,
|
||||
body,
|
||||
options: options.unwrap_or_default(),
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn list_functions(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let functions = inner.list_functions().await.infer_error()?;
|
||||
Ok(functions
|
||||
.into_iter()
|
||||
.map(|f| FunctionInfo {
|
||||
name: f.name,
|
||||
language: f.language,
|
||||
return_type: f.return_type,
|
||||
description: f.description,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn drop_function(self_: PyRef<'_, Self>, name: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.drop_function(&name).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, query, auto_refresh=false, with_no_data=false, partition_by=None))]
|
||||
pub fn create_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
query: String,
|
||||
auto_refresh: bool,
|
||||
with_no_data: bool,
|
||||
partition_by: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.create_materialized_view(CreateMaterializedViewRequest {
|
||||
name,
|
||||
query,
|
||||
auto_refresh,
|
||||
with_no_data,
|
||||
partition_by,
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, full=false, src_version=None, num_workers=None, max_workers=None))]
|
||||
pub fn refresh_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.refresh_materialized_view(RefreshMaterializedViewRequest {
|
||||
name,
|
||||
full,
|
||||
src_version,
|
||||
num_workers,
|
||||
max_workers,
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
/// Derived-compute lineage of a table/view (or column), returned as the
|
||||
/// server's lineage JSON string (the Python layer parses it).
|
||||
pub fn table_lineage(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
column: Option<String>,
|
||||
direction: Option<String>,
|
||||
depth: Option<u32>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.table_lineage(TableLineageRequest {
|
||||
name,
|
||||
column,
|
||||
direction,
|
||||
depth,
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, full=false, src_version=None))]
|
||||
pub fn explain_refresh_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let p = inner
|
||||
.explain_refresh_materialized_view(&name, full, src_version)
|
||||
.await
|
||||
.infer_error()?;
|
||||
Ok(MvRefreshPlan {
|
||||
table_name: p.table_name,
|
||||
has_work: p.has_work,
|
||||
source_version: p.source_version,
|
||||
last_refreshed_version: p.last_refreshed_version,
|
||||
full_refresh: p.full_refresh,
|
||||
rebuild: p.rebuild,
|
||||
units_total: p.units_total,
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
pub fn alter_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
auto_refresh: bool,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.alter_materialized_view(&name, auto_refresh)
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn drop_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.drop_materialized_view(&name).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn list_materialized_views(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let views = inner.list_materialized_views().await.infer_error()?;
|
||||
Ok(views
|
||||
.into_iter()
|
||||
.map(|v| MaterializedViewInfo {
|
||||
name: v.name,
|
||||
source_table: v.source_table,
|
||||
projection: v.projection,
|
||||
udf_columns: v.udf_columns,
|
||||
filter: v.filter,
|
||||
auto_refresh: v.auto_refresh,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn list_jobs(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let jobs = inner.list_jobs().await.infer_error()?;
|
||||
Ok(jobs
|
||||
.into_iter()
|
||||
.map(|j| JobInfo {
|
||||
table: j.table,
|
||||
job_id: j.job_id,
|
||||
job_type: j.job_type,
|
||||
state: j.state,
|
||||
column: j.column,
|
||||
age_seconds: j.age_seconds,
|
||||
command: j.command,
|
||||
units_done: j.units_done,
|
||||
units_total: j.units_total,
|
||||
committed: j.committed,
|
||||
rows_skipped: j.rows_skipped,
|
||||
error: j.error,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn cancel_job(self_: PyRef<'_, Self>, job_id: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.cancel_job(&job_id).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (job_id, table=None))]
|
||||
pub fn get_job(
|
||||
self_: PyRef<'_, Self>,
|
||||
job_id: String,
|
||||
table: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let job = inner
|
||||
.get_job(&job_id, table.as_deref())
|
||||
.await
|
||||
.infer_error()?;
|
||||
Ok(job.map(|j| JobInfo {
|
||||
table: j.table,
|
||||
job_id: j.job_id,
|
||||
job_type: j.job_type,
|
||||
state: j.state,
|
||||
column: j.column,
|
||||
age_seconds: j.age_seconds,
|
||||
command: j.command,
|
||||
units_done: j.units_done,
|
||||
units_total: j.units_total,
|
||||
committed: j.committed,
|
||||
rows_skipped: j.rows_skipped,
|
||||
error: j.error,
|
||||
}))
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (job_id=None))]
|
||||
pub fn job_history(
|
||||
self_: PyRef<'_, Self>,
|
||||
job_id: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let rows = inner.job_history(job_id.as_deref()).await.infer_error()?;
|
||||
Ok(rows
|
||||
.into_iter()
|
||||
.map(|r| JobHistoryEntry {
|
||||
table: r.table,
|
||||
job_id: r.job_id,
|
||||
job_type: r.job_type,
|
||||
state: r.state,
|
||||
column: r.column,
|
||||
created_ms: r.created_ms,
|
||||
updated_ms: r.updated_ms,
|
||||
completed_ms: r.completed_ms,
|
||||
rows_processed: r.rows_processed,
|
||||
rows_skipped: r.rows_skipped,
|
||||
error: r.error,
|
||||
events: r.events,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (job_id=None, table=None))]
|
||||
pub fn errors(
|
||||
self_: PyRef<'_, Self>,
|
||||
job_id: Option<String>,
|
||||
table: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let rows = inner
|
||||
.errors(job_id.as_deref(), table.as_deref())
|
||||
.await
|
||||
.infer_error()?;
|
||||
Ok(rows
|
||||
.into_iter()
|
||||
.map(|e| JobErrorEntry {
|
||||
job_id: e.job_id,
|
||||
table: e.table,
|
||||
column: e.column,
|
||||
error_type: e.error_type,
|
||||
error_message: e.error_message,
|
||||
fragment_id: e.fragment_id,
|
||||
source_row_id: e.source_row_id,
|
||||
table_version: e.table_version,
|
||||
age_seconds: e.age_seconds,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (cur_name, new_name, cur_namespace_path=None, new_namespace_path=None))]
|
||||
pub fn rename_table(
|
||||
self_: PyRef<'_, Self>,
|
||||
@@ -539,7 +930,7 @@ impl Connection {
|
||||
}
|
||||
|
||||
#[pyfunction]
|
||||
#[pyo3(signature = (uri, api_key=None, region=None, host_override=None, read_consistency_interval=None, client_config=None, storage_options=None, session=None, manifest_enabled=false, namespace_client_properties=None))]
|
||||
#[pyo3(signature = (uri, api_key=None, region=None, host_override=None, read_consistency_interval=None, client_config=None, storage_options=None, session=None, manifest_enabled=false, namespace_client_properties=None, oauth_config=None))]
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
pub fn connect(
|
||||
py: Python<'_>,
|
||||
@@ -553,6 +944,7 @@ pub fn connect(
|
||||
session: Option<crate::session::Session>,
|
||||
manifest_enabled: bool,
|
||||
namespace_client_properties: Option<HashMap<String, String>>,
|
||||
oauth_config: Option<crate::oauth::PyOAuthConfig>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
future_into_py(py, async move {
|
||||
let mut builder = lancedb::connect(&uri);
|
||||
@@ -582,6 +974,11 @@ pub fn connect(
|
||||
if let Some(client_config) = client_config {
|
||||
builder = builder.client_config(client_config.into());
|
||||
}
|
||||
if let Some(oauth_config) = oauth_config {
|
||||
let config: lancedb::remote::oauth::OAuthConfig =
|
||||
oauth_config.try_into().infer_error()?;
|
||||
builder = builder.oauth_config(config);
|
||||
}
|
||||
if let Some(session) = session {
|
||||
builder = builder.session(session.inner.clone());
|
||||
}
|
||||
@@ -610,24 +1007,38 @@ pub fn connect_namespace_client(
|
||||
namespace_client_impl: Option<String>,
|
||||
namespace_client_properties: Option<HashMap<String, String>>,
|
||||
) -> PyResult<Connection> {
|
||||
let namespace_client = extract_namespace_arc(py, namespace_client)?;
|
||||
let read_consistency_interval = read_consistency_interval.map(Duration::from_secs_f64);
|
||||
let namespace_client_pushdown_operations =
|
||||
parse_namespace_client_pushdown_operations(namespace_client_pushdown_operations)?;
|
||||
let ns_impl = namespace_client_impl.unwrap_or_else(|| "python".to_string());
|
||||
let ns_properties = namespace_client_properties.unwrap_or_default();
|
||||
let storage_options = storage_options.unwrap_or_default();
|
||||
let session = session.map(|s| s.inner.clone());
|
||||
|
||||
let database = LanceNamespaceDatabase::from_namespace_client(
|
||||
namespace_client,
|
||||
ns_impl,
|
||||
ns_properties,
|
||||
storage_options,
|
||||
read_consistency_interval,
|
||||
session,
|
||||
namespace_client_pushdown_operations,
|
||||
);
|
||||
// Prefer building the namespace natively from (impl, properties) so the
// read-freshness provider is installed.
|
||||
let database = if build_namespace_natively(namespace_client_impl.as_deref(), &ns_properties) {
|
||||
let ns_impl = namespace_client_impl.expect("impl present per build_namespace_natively");
|
||||
crate::runtime::block_on(LanceNamespaceDatabase::connect(
|
||||
&ns_impl,
|
||||
ns_properties,
|
||||
storage_options,
|
||||
read_consistency_interval,
|
||||
session,
|
||||
namespace_client_pushdown_operations,
|
||||
))
|
||||
.infer_error()?
|
||||
} else {
|
||||
let namespace_client = extract_namespace_arc(py, namespace_client)?;
|
||||
LanceNamespaceDatabase::from_namespace_client(
|
||||
namespace_client,
|
||||
namespace_client_impl.unwrap_or_else(|| "python".to_string()),
|
||||
ns_properties,
|
||||
storage_options,
|
||||
read_consistency_interval,
|
||||
session,
|
||||
namespace_client_pushdown_operations,
|
||||
)
|
||||
};
|
||||
|
||||
Ok(Connection::new(LanceConnection::new(
|
||||
Arc::new(database),
|
||||
@@ -635,6 +1046,16 @@ pub fn connect_namespace_client(
|
||||
)))
|
||||
}
|
||||
|
||||
/// Whether to build the namespace natively (from impl + properties) instead of
/// wrapping a pre-built client. Native construction is required for the
/// read-freshness provider to be installed.
fn build_namespace_natively(
    namespace_client_impl: Option<&str>,
    namespace_client_properties: &HashMap<String, String>,
) -> bool {
    matches!(namespace_client_impl, Some("rest")) && !namespace_client_properties.is_empty()
}
|
||||
|
||||
#[derive(FromPyObject)]
|
||||
pub struct PyClientConfig {
|
||||
user_agent: String,
|
||||
@@ -733,3 +1154,36 @@ impl From<PyClientConfig> for lancedb::remote::ClientConfig {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
fn props(pairs: &[(&str, &str)]) -> HashMap<String, String> {
|
||||
pairs
|
||||
.iter()
|
||||
.map(|(k, v)| (k.to_string(), v.to_string()))
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn native_build_only_for_rest_with_properties() {
|
||||
let rest = props(&[("uri", "http://localhost:10024")]);
|
||||
|
||||
// rest + non-empty properties -> build natively (installs the
|
||||
// read-freshness provider so checkout_latest() busts the server cache).
|
||||
assert!(build_namespace_natively(Some("rest"), &rest));
|
||||
|
||||
// dir is local (no server cache) -> wrap the pre-built client unchanged.
|
||||
assert!(!build_namespace_natively(
|
||||
Some("dir"),
|
||||
&props(&[("root", "/tmp")])
|
||||
));
|
||||
|
||||
// No impl: only a pre-built client was handed in -> wrap it as-is.
|
||||
assert!(!build_namespace_natively(None, &rest));
|
||||
|
||||
// rest but no properties: nothing to build a connection from -> wrap.
|
||||
assert!(!build_namespace_natively(Some("rest"), &HashMap::new()));
|
||||
}
|
||||
}
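For readers following along from the Python side, the decision exercised by the Rust unit test above can be transliterated directly. This is a sketch that mirrors the predicate shown in the diff; the Python function is hypothetical and exists only in the Rust binding layer in the actual change.

```python
from typing import Mapping, Optional


def build_namespace_natively(
    impl: Optional[str], properties: Mapping[str, str]
) -> bool:
    """Python transliteration of the Rust predicate, for illustration only."""
    # Native construction is chosen only for the "rest" implementation, and
    # only when there are properties to build a connection from.
    return impl == "rest" and len(properties) > 0
```

Every other combination falls back to wrapping the pre-built client that was passed in.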
|
||||
|
||||
@@ -9,7 +9,9 @@
|
||||
|
||||
use arrow::{datatypes::DataType, pyarrow::PyArrowType};
|
||||
use datafusion_common::ScalarValue;
|
||||
use lancedb::expr::{DfExpr, col as ldb_col, contains, expr_cast, lit as df_lit, lower, upper};
|
||||
use lancedb::expr::{
|
||||
DfExpr, col as ldb_col, contains, expr_cast, is_in, lit as df_lit, lower, upper,
|
||||
};
|
||||
use pyo3::types::PyBytes;
|
||||
use pyo3::{Bound, PyAny, PyResult, exceptions::PyValueError, prelude::*, pyfunction};
|
||||
|
||||
@@ -105,6 +107,14 @@ impl PyExpr {
|
||||
Self(contains(self.0.clone(), substr.0.clone()))
|
||||
}
|
||||
|
||||
// ── membership ───────────────────────────────────────────────────────────
|
||||
|
||||
/// Return true where the value is one of the given expressions (SQL ``IN``).
|
||||
fn isin(&self, list: Vec<Self>) -> Self {
|
||||
let items: Vec<DfExpr> = list.into_iter().map(|e| e.0).collect();
|
||||
Self(is_in(self.0.clone(), items))
|
||||
}
|
||||
|
||||
// ── type cast ────────────────────────────────────────────────────────────
|
||||
|
||||
/// Cast the expression to `data_type`.
|
||||
|
||||
@@ -1,18 +1,19 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
use chrono::{DateTime, Utc};
|
||||
use lancedb::index::vector::{
|
||||
IvfFlatIndexBuilder, IvfHnswFlatIndexBuilder, IvfHnswPqIndexBuilder, IvfHnswSqIndexBuilder,
|
||||
IvfPqIndexBuilder, IvfRqIndexBuilder, IvfSqIndexBuilder,
|
||||
};
|
||||
use lancedb::index::{
|
||||
Index as LanceDbIndex,
|
||||
scalar::{BTreeIndexBuilder, FtsIndexBuilder},
|
||||
scalar::{BTreeIndexBuilder, FmIndexBuilder, FtsIndexBuilder},
|
||||
};
|
||||
use pyo3::IntoPyObject;
|
||||
use pyo3::types::PyStringMethods;
|
||||
use pyo3::{
|
||||
Bound, FromPyObject, PyAny, PyResult, Python,
|
||||
Bound, FromPyObject, Py, PyAny, PyResult, Python,
|
||||
exceptions::{PyKeyError, PyValueError},
|
||||
intern, pyclass, pymethods,
|
||||
types::{PyAnyMethods, PyString},
|
||||
@@ -38,6 +39,7 @@ pub fn extract_index_params(source: &Option<Bound<'_, PyAny>>) -> PyResult<Lance
|
||||
"BTree" => Ok(LanceDbIndex::BTree(BTreeIndexBuilder::default())),
|
||||
"Bitmap" => Ok(LanceDbIndex::Bitmap(Default::default())),
|
||||
"LabelList" => Ok(LanceDbIndex::LabelList(Default::default())),
|
||||
"Fm" => Ok(LanceDbIndex::Fm(FmIndexBuilder::default())),
|
||||
"FTS" => {
|
||||
let params = source.extract::<FtsParams>()?;
|
||||
let inner_opts = FtsIndexBuilder::default()
|
||||
@@ -183,7 +185,7 @@ pub fn extract_index_params(source: &Option<Bound<'_, PyAny>>) -> PyResult<Lance
|
||||
Ok(LanceDbIndex::IvfHnswFlat(hnsw_flat_builder))
|
||||
}
|
||||
not_supported => Err(PyValueError::new_err(format!(
|
||||
"Invalid index type '{}'. Must be one of BTree, Bitmap, LabelList, FTS, IvfPq, IvfSq, IvfHnswPq, IvfHnswSq, or IvfHnswFlat",
|
||||
"Invalid index type '{}'. Must be one of BTree, Bitmap, LabelList, Fm, FTS, IvfPq, IvfSq, IvfHnswPq, IvfHnswSq, or IvfHnswFlat",
|
||||
not_supported
|
||||
))),
|
||||
}
|
||||
@@ -293,15 +295,77 @@ pub struct IndexConfig {
|
||||
pub columns: Vec<String>,
|
||||
/// Name of the index.
|
||||
pub name: String,
|
||||
/// The UUID of the first segment of the index.
|
||||
pub index_uuid: Option<String>,
|
||||
/// The protobuf type URL, a precise type identifier for the index.
|
||||
pub type_url: Option<String>,
|
||||
/// When the index was created.
|
||||
pub created_at: Option<DateTime<Utc>>,
|
||||
/// The number of rows indexed, across all segments.
|
||||
pub num_indexed_rows: Option<u64>,
|
||||
/// The number of rows not yet covered by this index.
|
||||
pub num_unindexed_rows: Option<u64>,
|
||||
/// The total size in bytes of all index files across all segments.
|
||||
pub size_bytes: Option<u64>,
|
||||
/// The number of segments that make up the index.
|
||||
pub num_segments: Option<u32>,
|
||||
/// The on-disk index format version.
|
||||
pub index_version: Option<i32>,
|
||||
/// Index-type-specific details parsed as a Python object (dict, list, etc.).
|
||||
///
|
||||
/// Falls back to a raw string if JSON parsing fails. `None` when unavailable.
|
||||
pub index_details: Option<Py<PyAny>>,
|
||||
}
|
||||
|
||||
#[pymethods]
|
||||
impl IndexConfig {
|
||||
pub fn __repr__(&self) -> String {
|
||||
format!(
|
||||
"Index({}, columns={:?}, name=\"{}\")",
|
||||
self.index_type, self.columns, self.name
|
||||
)
|
||||
pub fn __repr__(&self, py: Python<'_>) -> String {
|
||||
let mut fields = vec![
|
||||
format!("name={:?}", self.name),
|
||||
format!("index_type={:?}", self.index_type),
|
||||
format!("columns={:?}", self.columns),
|
||||
];
|
||||
if let Some(v) = &self.index_uuid {
|
||||
fields.push(format!("index_uuid={:?}", v));
|
||||
}
|
||||
if let Some(v) = &self.type_url {
|
||||
fields.push(format!("type_url={:?}", v));
|
||||
}
|
||||
if let Some(v) = self.created_at {
|
||||
// Render the datetime's own Python repr so the value round-trips,
|
||||
// falling back to RFC 3339 if the conversion ever fails.
|
||||
let rendered = v
|
||||
.into_pyobject(py)
|
||||
.ok()
|
||||
.and_then(|obj| obj.into_any().repr().ok())
|
||||
.map(|r| r.to_string())
|
||||
.unwrap_or_else(|| v.to_rfc3339());
|
||||
fields.push(format!("created_at={}", rendered));
|
||||
}
|
||||
if let Some(v) = self.num_indexed_rows {
|
||||
fields.push(format!("num_indexed_rows={}", fmt_thousands(v)));
|
||||
}
|
||||
if let Some(v) = self.num_unindexed_rows {
|
||||
fields.push(format!("num_unindexed_rows={}", fmt_thousands(v)));
|
||||
}
|
||||
if let Some(v) = self.size_bytes {
|
||||
fields.push(format!("size_bytes={}", fmt_thousands(v)));
|
||||
}
|
||||
if let Some(v) = self.num_segments {
|
||||
fields.push(format!("num_segments={}", v));
|
||||
}
|
||||
if let Some(v) = self.index_version {
|
||||
fields.push(format!("index_version={}", v));
|
||||
}
|
||||
if let Some(v) = &self.index_details {
|
||||
let details = v
|
||||
.bind(py)
|
||||
.repr()
|
||||
.map(|r| r.to_string())
|
||||
.unwrap_or_else(|_| "<unavailable>".to_string());
|
||||
fields.push(format!("index_details={}", details));
|
||||
}
|
||||
format!("IndexConfig({})", fields.join(", "))
|
||||
}
|
||||
|
||||
// For backwards-compatibility with the old sync SDK, we also support getting
|
||||
@@ -311,18 +375,66 @@ impl IndexConfig {
|
||||
"index_type" => Ok(self.index_type.clone().into_pyobject(py)?.into_any()),
|
||||
"columns" => Ok(self.columns.clone().into_pyobject(py)?.into_any()),
|
||||
"name" | "index_name" => Ok(self.name.clone().into_pyobject(py)?.into_any()),
|
||||
"index_uuid" => Ok(self.index_uuid.clone().into_pyobject(py)?.into_any()),
|
||||
"type_url" => Ok(self.type_url.clone().into_pyobject(py)?.into_any()),
|
||||
"created_at" => Ok(self.created_at.into_pyobject(py)?.into_any()),
|
||||
"num_indexed_rows" => Ok(self.num_indexed_rows.into_pyobject(py)?.into_any()),
|
||||
"num_unindexed_rows" => Ok(self.num_unindexed_rows.into_pyobject(py)?.into_any()),
|
||||
"size_bytes" => Ok(self.size_bytes.into_pyobject(py)?.into_any()),
|
||||
"num_segments" => Ok(self.num_segments.into_pyobject(py)?.into_any()),
|
||||
"index_version" => Ok(self.index_version.into_pyobject(py)?.into_any()),
|
||||
"index_details" => Ok(self
|
||||
.index_details
|
||||
.as_ref()
|
||||
.map(|obj| obj.clone_ref(py))
|
||||
.into_pyobject(py)?
|
||||
.into_any()),
|
||||
_ => Err(PyKeyError::new_err(format!("Invalid key: {}", key))),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl From<lancedb::index::IndexConfig> for IndexConfig {
|
||||
fn from(value: lancedb::index::IndexConfig) -> Self {
|
||||
/// Format an integer with `_` thousands separators, e.g. `24_500_213`.
|
||||
///
|
||||
/// Underscores are valid Python int-literal syntax, so the repr stays
|
||||
/// copy-pasteable and machine-parseable while remaining readable.
|
||||
fn fmt_thousands(n: u64) -> String {
|
||||
let digits = n.to_string();
|
||||
let bytes = digits.as_bytes();
|
||||
let mut out = String::with_capacity(digits.len() + digits.len() / 3);
|
||||
for (i, b) in bytes.iter().enumerate() {
|
||||
if i > 0 && (bytes.len() - i).is_multiple_of(3) {
|
||||
out.push('_');
|
||||
}
|
||||
out.push(*b as char);
|
||||
}
|
||||
out
|
||||
}
|
||||
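The separator placement in `fmt_thousands` above can be checked in isolation. The sketch below is a standalone copy that uses only the standard library; it swaps `is_multiple_of(3)` for the equivalent `% 3 == 0`, which is an assumption made here so the snippet also builds on older toolchains, not something taken from the change itself.

```rust
/// Standalone copy of the thousands formatter, for illustration only.
/// Assumption: `% 3 == 0` replaces `is_multiple_of(3)`; behavior is the same.
fn fmt_thousands(n: u64) -> String {
    let digits = n.to_string();
    let bytes = digits.as_bytes();
    let mut out = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, b) in bytes.iter().enumerate() {
        // Insert `_` before a digit when the count of digits from here to the
        // end is a multiple of three, but never before the first digit.
        if i > 0 && (bytes.len() - i) % 3 == 0 {
            out.push('_');
        }
        out.push(*b as char);
    }
    out
}
```

Because the input is the decimal rendering of a `u64`, every byte is an ASCII digit, so the `*b as char` conversion is lossless.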
|
||||
fn parse_index_details(py: Python<'_>, s: String) -> Py<PyAny> {
|
||||
let json = py.import("json").expect("json module is always available");
|
||||
match json.call_method1("loads", (s.as_str(),)) {
|
||||
Ok(obj) => obj.into_any().unbind(),
|
||||
Err(_) => s.into_pyobject(py).unwrap().into_any().unbind(),
|
||||
}
|
||||
}
|
||||
|
||||
impl IndexConfig {
|
||||
pub fn from_lancedb(py: Python<'_>, value: lancedb::index::IndexConfig) -> Self {
|
||||
let index_type = format!("{:?}", value.index_type);
|
||||
Self {
|
||||
index_type,
|
||||
columns: value.columns,
|
||||
name: value.name,
|
||||
index_uuid: value.index_uuid,
|
||||
type_url: value.type_url,
|
||||
created_at: value.created_at,
|
||||
num_indexed_rows: value.num_indexed_rows,
|
||||
num_unindexed_rows: value.num_unindexed_rows,
|
||||
size_bytes: value.size_bytes,
|
||||
num_segments: value.num_segments,
|
||||
index_version: value.index_version,
|
||||
index_details: value.index_details.map(|s| parse_index_details(py, s)),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -26,6 +26,7 @@ pub mod expr;
|
||||
pub mod header;
|
||||
pub mod index;
|
||||
pub mod namespace;
|
||||
pub mod oauth;
|
||||
pub mod permutation;
|
||||
pub mod query;
|
||||
pub mod runtime;
|
||||
@@ -40,6 +41,11 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
|
||||
.write_style("LANCEDB_LOG_STYLE");
|
||||
env_logger::init_from_env(env);
|
||||
m.add_class::<Connection>()?;
|
||||
m.add_class::<connection::FunctionInfo>()?;
|
||||
m.add_class::<connection::MaterializedViewInfo>()?;
|
||||
m.add_class::<connection::JobInfo>()?;
|
||||
m.add_class::<connection::JobHistoryEntry>()?;
|
||||
m.add_class::<connection::JobErrorEntry>()?;
|
||||
m.add_class::<Session>()?;
|
||||
m.add_class::<Table>()?;
|
||||
m.add_class::<IndexConfig>()?;
|
||||
|
||||
72
python/src/oauth.rs
Normal file
@@ -0,0 +1,72 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
use pyo3::FromPyObject;
|
||||
|
||||
use lancedb::error::Error;
|
||||
use lancedb::remote::oauth::{OAuthConfig, OAuthFlow};
|
||||
|
||||
/// Python-side OAuth configuration, extracted via FromPyObject.
|
||||
/// Maps to `lancedb.remote.oauth.OAuthConfig` Python dataclass.
|
||||
#[derive(FromPyObject)]
|
||||
pub struct PyOAuthConfig {
|
||||
pub issuer_url: String,
|
||||
pub client_id: String,
|
||||
pub scopes: Vec<String>,
|
||||
pub flow: String,
|
||||
pub client_secret: Option<String>,
|
||||
pub managed_identity_client_id: Option<String>,
|
||||
pub refresh_buffer_secs: Option<u64>,
|
||||
}
|
||||
|
||||
impl TryFrom<PyOAuthConfig> for OAuthConfig {
|
||||
type Error = Error;
|
||||
|
||||
fn try_from(py: PyOAuthConfig) -> Result<Self, Self::Error> {
|
||||
let flow = match py.flow.as_str() {
|
||||
"client_credentials" => OAuthFlow::ClientCredentials,
|
||||
"azure_managed_identity" => OAuthFlow::AzureManagedIdentity {
|
||||
client_id: py.managed_identity_client_id,
|
||||
},
|
||||
other => {
|
||||
return Err(Error::InvalidInput {
|
||||
message: format!("Unknown OAuth flow type: {other}"),
|
||||
});
|
||||
}
|
||||
};
|
||||
|
||||
Ok(Self {
|
||||
issuer_url: py.issuer_url,
|
||||
client_id: py.client_id,
|
||||
client_secret: py.client_secret,
|
||||
scopes: py.scopes,
|
||||
flow,
|
||||
refresh_buffer_secs: py.refresh_buffer_secs,
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_unknown_oauth_flow_returns_invalid_input() {
|
||||
let config = PyOAuthConfig {
|
||||
issuer_url: "https://issuer.example.com".to_string(),
|
||||
client_id: "client-id".to_string(),
|
||||
scopes: vec!["scope".to_string()],
|
||||
flow: "typo".to_string(),
|
||||
client_secret: None,
|
||||
managed_identity_client_id: None,
|
||||
refresh_buffer_secs: None,
|
||||
};
|
||||
|
||||
let err = OAuthConfig::try_from(config).unwrap_err();
|
||||
assert!(matches!(
|
||||
err,
|
||||
Error::InvalidInput { message }
|
||||
if message == "Unknown OAuth flow type: typo"
|
||||
));
|
||||
}
|
||||
}
|
||||
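The flow-string mapping in `TryFrom<PyOAuthConfig>` above can be sketched without the `lancedb` or `pyo3` crates. In the sketch below, `Flow` and `parse_flow` are hypothetical stand-ins for `OAuthFlow` and the `match` inside `try_from`, and the error is a plain `String` rather than `Error::InvalidInput`; only the two flow names and the error text come from the diff.

```rust
/// Hypothetical stand-in for `lancedb::remote::oauth::OAuthFlow`.
#[derive(Debug, PartialEq)]
enum Flow {
    ClientCredentials,
    AzureManagedIdentity { client_id: Option<String> },
}

/// Map the Python-side flow string to a typed flow.
/// Unknown names are rejected instead of silently falling back to a default.
fn parse_flow(flow: &str, managed_identity_client_id: Option<String>) -> Result<Flow, String> {
    match flow {
        "client_credentials" => Ok(Flow::ClientCredentials),
        "azure_managed_identity" => Ok(Flow::AzureManagedIdentity {
            client_id: managed_identity_client_id,
        }),
        other => Err(format!("Unknown OAuth flow type: {other}")),
    }
}
```

Rejecting unknown strings at the conversion boundary means a typo in the Python configuration surfaces as an input error before any token request is attempted.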
@@ -56,6 +56,15 @@ fn get_runtime() -> &'static runtime::Runtime {
|
||||
unsafe { &*new_ptr }
|
||||
}
|
||||
|
||||
/// Block the current thread on a future using the shared runtime.
|
||||
///
|
||||
/// For sync `#[pyfunction]`s that need to drive an async operation (e.g.
|
||||
/// building a namespace client). Must not be called from within the runtime's
|
||||
/// own worker threads.
|
||||
pub fn block_on<F: std::future::Future>(fut: F) -> F::Output {
|
||||
get_runtime().block_on(fut)
|
||||
}
|
||||
|
||||
/// Runs in async-signal context after `fork()` in the child. We can only
|
||||
/// touch atomics here; we deliberately leak the previous runtime because
|
||||
/// dropping a tokio `Runtime` would try to join its (now-dead) worker
|
||||
|
||||
@@ -6,6 +6,7 @@ use crate::runtime::future_into_py;
|
||||
use crate::{
|
||||
connection::Connection,
|
||||
error::PythonErrorExt,
|
||||
expr::PyExpr,
|
||||
index::{IndexConfig, extract_index_params},
|
||||
query::{Query, TakeQuery},
|
||||
table::scannable::PyScannable,
|
||||
@@ -16,8 +17,8 @@ use arrow::{
|
||||
pyarrow::{FromPyArrow, PyArrowType, ToPyArrow},
|
||||
};
|
||||
use lancedb::table::{
|
||||
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, NewColumnTransform,
|
||||
OptimizeAction, OptimizeOptions, Table as LanceDbTable,
|
||||
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, LoadColumnsRequest,
|
||||
NewColumnTransform, OptimizeAction, OptimizeOptions, Ref, Table as LanceDbTable,
|
||||
};
|
||||
use pyo3::{
|
||||
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
|
||||
@@ -28,6 +29,12 @@ use pyo3::{
|
||||
|
||||
mod scannable;
|
||||
|
||||
#[derive(FromPyObject)]
|
||||
enum PredicateArg {
|
||||
Expr(PyExpr),
|
||||
Sql(String),
|
||||
}
|
||||
|
||||
/// Statistics about a compaction operation.
|
||||
#[pyclass(get_all, from_py_object)]
|
||||
#[derive(Clone, Debug)]
|
||||
@@ -561,10 +568,15 @@ impl Table {
|
||||
})
|
||||
}
|
||||
|
||||
pub fn delete(self_: PyRef<'_, Self>, condition: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
#[allow(private_interfaces)]
|
||||
pub fn delete(self_: PyRef<'_, Self>, condition: PredicateArg) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let result = inner.delete(&condition).await.infer_error()?;
|
||||
let result = match &condition {
|
||||
PredicateArg::Expr(e) => inner.delete(&e.0).await,
|
||||
PredicateArg::Sql(s) => inner.delete(s.as_str()).await,
|
||||
}
|
||||
.infer_error()?;
|
||||
Ok(DeleteResult::from(result))
|
||||
})
|
||||
}
|
||||
@@ -682,13 +694,13 @@ impl Table {
|
||||
pub fn list_indices(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
Ok(inner
|
||||
.list_indices()
|
||||
.await
|
||||
.infer_error()?
|
||||
.into_iter()
|
||||
.map(IndexConfig::from)
|
||||
.collect::<Vec<_>>())
|
||||
let indices = inner.list_indices().await.infer_error()?;
|
||||
Python::attach(|py| {
|
||||
Ok(indices
|
||||
.into_iter()
|
||||
.map(|idx| IndexConfig::from_lancedb(py, idx))
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
@@ -711,10 +723,6 @@ impl Table {
|
||||
dict.set_item("num_indices", num_indices)?;
|
||||
}
|
||||
|
||||
if let Some(loss) = stats.loss {
|
||||
dict.set_item("loss", loss)?;
|
||||
}
|
||||
|
||||
Ok(Some(dict.unbind()))
|
||||
})
|
||||
} else {
|
||||
@@ -864,6 +872,15 @@ impl Table {
|
||||
Ok(Tags::new(self.inner_ref()?.clone()))
|
||||
}
|
||||
|
||||
pub fn current_branch(&self) -> PyResult<Option<String>> {
|
||||
Ok(self.inner_ref()?.current_branch())
|
||||
}
|
||||
|
||||
#[getter]
|
||||
pub fn branches(&self) -> PyResult<Branches> {
|
||||
Ok(Branches::new(self.inner_ref()?.clone()))
|
||||
}
|
||||
|
||||
#[pyo3(signature = (offsets))]
|
||||
pub fn take_offsets(self_: PyRef<'_, Self>, offsets: Vec<u64>) -> PyResult<TakeQuery> {
|
||||
Ok(TakeQuery::new(
|
||||
@@ -954,8 +971,13 @@ impl Table {
|
||||
builder.when_not_matched_insert_all();
|
||||
}
|
||||
if parameters.when_not_matched_by_source_delete {
|
||||
builder
|
||||
.when_not_matched_by_source_delete(parameters.when_not_matched_by_source_condition);
|
||||
if let Some(e) = parameters.when_not_matched_by_source_condition_expr {
|
||||
builder.when_not_matched_by_source_delete_expr(e.0);
|
||||
} else {
|
||||
builder.when_not_matched_by_source_delete(
|
||||
parameters.when_not_matched_by_source_condition,
|
||||
);
|
||||
}
|
||||
}
|
||||
if let Some(timeout) = parameters.timeout {
|
||||
builder.timeout(timeout);
|
||||
@@ -1038,6 +1060,83 @@ impl Table {
|
||||
})
|
||||
}
|
||||
|
||||
pub fn add_computed_columns(
|
||||
self_: PyRef<'_, Self>,
|
||||
columns: Vec<(String, String)>,
|
||||
expression: String,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.add_computed_columns(&columns, &expression)
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (columns, where_clause=None, num_workers=None, max_workers=None, batch_size=None, priority=None))]
|
||||
pub fn refresh_column(
|
||||
self_: PyRef<'_, Self>,
|
||||
columns: Vec<String>,
|
||||
where_clause: Option<String>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
batch_size: Option<u32>,
|
||||
priority: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.refresh_column(
|
||||
&columns,
|
||||
where_clause,
|
||||
num_workers,
|
||||
max_workers,
|
||||
batch_size,
|
||||
priority,
|
||||
)
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
#[pyo3(signature = (source_uris, source_format, target_key, columns, source_key=None, source_storage_options=None, on_missing=None, num_workers=None, max_workers=None, batch_size=None, commit_granularity=None, priority=None))]
|
||||
pub fn load_columns(
|
||||
self_: PyRef<'_, Self>,
|
||||
source_uris: Vec<String>,
|
||||
source_format: String,
|
||||
target_key: String,
|
||||
columns: Vec<(String, Option<String>)>,
|
||||
source_key: Option<String>,
|
||||
source_storage_options: Option<std::collections::HashMap<String, String>>,
|
||||
on_missing: Option<String>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
batch_size: Option<u32>,
|
||||
commit_granularity: Option<u32>,
|
||||
priority: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
let request = LoadColumnsRequest {
|
||||
source_uris,
|
||||
source_format,
|
||||
source_storage_options,
|
||||
target_key,
|
||||
source_key,
|
||||
columns,
|
||||
on_missing,
|
||||
num_workers,
|
||||
max_workers,
|
||||
batch_size,
|
||||
commit_granularity,
|
||||
priority,
|
||||
};
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.load_columns(request).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn add_columns(
|
||||
self_: PyRef<'_, Self>,
|
||||
definitions: Vec<(String, String)>,
|
||||
@@ -1191,6 +1290,7 @@ pub struct MergeInsertParams {
|
||||
when_not_matched_insert_all: bool,
|
||||
when_not_matched_by_source_delete: bool,
|
||||
when_not_matched_by_source_condition: Option<String>,
|
||||
when_not_matched_by_source_condition_expr: Option<PyExpr>,
|
||||
timeout: Option<std::time::Duration>,
|
||||
use_index: Option<bool>,
|
||||
use_lsm_write: Option<bool>,
|
||||
@@ -1265,3 +1365,71 @@ impl Tags {
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
#[pyclass]
|
||||
pub struct Branches {
|
||||
inner: LanceDbTable,
|
||||
}
|
||||
|
||||
impl Branches {
|
||||
pub fn new(table: LanceDbTable) -> Self {
|
||||
Self { inner: table }
|
||||
}
|
||||
}
|
||||
|
||||
#[pymethods]
|
||||
impl Branches {
|
||||
pub fn list(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let res = inner.list_branches().await.infer_error()?;
|
||||
Python::attach(|py| {
|
||||
let py_dict = PyDict::new(py);
|
||||
for (name, contents) in res {
|
||||
let value = PyDict::new(py);
|
||||
value.set_item("parent_branch", contents.parent_branch)?;
|
||||
value.set_item("parent_version", contents.parent_version)?;
|
||||
value.set_item("manifest_size", contents.manifest_size)?;
|
||||
py_dict.set_item(name, value)?;
|
||||
}
|
||||
Ok(py_dict.unbind())
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, from_ref=None, from_version=None))]
|
||||
pub fn create(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
from_ref: Option<String>,
|
||||
from_version: Option<u64>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let from = Ref::Version(from_ref, from_version);
|
||||
let table = inner.create_branch(&name, from).await.infer_error()?;
|
||||
Ok(Table::new(table))
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, version=None))]
|
||||
pub fn checkout(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
version: Option<u64>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let table = inner.checkout_branch(&name, version).await.infer_error()?;
|
||||
Ok(Table::new(table))
|
||||
})
|
||||
}
|
||||
|
||||
pub fn delete(self_: PyRef<'_, Self>, name: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.delete_branch(&name).await.infer_error()?;
|
||||
Ok(())
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
@@ -450,6 +450,27 @@ def binary_table(tmp_path):
|
||||
return db.create_table("binary_test", data)
|
||||
|
||||
|
||||
class TestExprIsin:
|
||||
def test_isin_ints(self):
|
||||
assert col("id").isin([1, 2, 3]).to_sql() == "id IN (1, 2, 3)"
|
||||
|
||||
def test_isin_strs(self):
|
||||
assert (
|
||||
col("status").isin(["active", "pending"]).to_sql()
|
||||
== "status IN ('active', 'pending')"
|
||||
)
|
||||
|
||||
def test_isin_coerces_and_mixes(self):
|
||||
assert col("id").isin([lit(1), 2]).to_sql() == "id IN (1, 2)"
|
||||
|
||||
def test_isin_empty(self):
|
||||
assert col("id").isin([]).to_sql() == "id IN ()"
|
||||
|
||||
def test_isin_filter(self, simple_table):
|
||||
result = simple_table.search().where(col("id").isin([1, 3, 5])).to_arrow()
|
||||
assert result.num_rows == 3
|
||||
|
||||
|
||||
class TestExprBytesIntegration:
|
||||
def test_binary_equality_filter(self, binary_table):
|
||||
result = (
|
||||
|
||||
33
python/tests/test_oauth.py
Normal file
@@ -0,0 +1,33 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
import importlib.util
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def _load_oauth_module():
|
||||
oauth_path = (
|
||||
Path(__file__).parents[1] / "python" / "lancedb" / "remote" / "oauth.py"
|
||||
)
|
||||
spec = importlib.util.spec_from_file_location("lancedb_remote_oauth", oauth_path)
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
assert spec.loader is not None
|
||||
sys.modules[spec.name] = module
|
||||
spec.loader.exec_module(module)
|
||||
return module
|
||||
|
||||
|
||||
def test_oauth_config_repr_redacts_client_secret():
|
||||
oauth = _load_oauth_module()
|
||||
|
||||
config = oauth.OAuthConfig(
|
||||
issuer_url="https://issuer.example.com",
|
||||
client_id="client-id",
|
||||
scopes=["scope"],
|
||||
client_secret="super-secret",
|
||||
)
|
||||
|
||||
rendered = repr(config)
|
||||
assert "super-secret" not in rendered
|
||||
assert "client_secret" not in rendered
|
||||
4226
python/uv.lock
generated
File diff suppressed because it is too large
@@ -1,6 +1,6 @@
|
||||
[package]
|
||||
name = "lancedb"
|
||||
version = "0.30.1-beta.0"
|
||||
version = "0.31.0-beta.4"
|
||||
edition.workspace = true
|
||||
description = "LanceDB: A serverless, low-latency vector database for AI applications"
|
||||
license.workspace = true
|
||||
@@ -50,7 +50,7 @@ lance-namespace = { workspace = true }
|
||||
lance-namespace-impls = { workspace = true }
|
||||
moka = { workspace = true }
|
||||
pin-project = { workspace = true }
|
||||
tokio = { version = "1.23", features = ["rt-multi-thread"] }
|
||||
tokio = { version = "1.23", features = ["rt-multi-thread", "sync"] }
|
||||
log.workspace = true
|
||||
async-trait = "0"
|
||||
bytes = "1"
|
||||
@@ -75,6 +75,7 @@ reqwest = { version = "0.12.0", default-features = false, features = [
|
||||
"stream",
|
||||
], optional = true }
|
||||
http = { version = "1", optional = true } # Matching what is in reqwest
|
||||
urlencoding = { version = "2", optional = true }
|
||||
uuid = { version = "1.7.0", features = ["v4", "v5"] }
|
||||
polars-arrow = { version = ">=0.37,<0.40.0", optional = true }
|
||||
polars = { version = ">=0.37,<0.40.0", optional = true }
|
||||
@@ -93,6 +94,7 @@ semver = { workspace = true }
|
||||
anyhow = "1"
|
||||
tempfile = "3.5.0"
|
||||
random_word = { version = "0.4.3", features = ["en"] }
|
||||
tokio = { version = "1.23", features = ["io-util", "macros", "net", "rt-multi-thread", "sync"] }
|
||||
uuid = { version = "1.7.0", features = ["v4"] }
|
||||
walkdir = "2"
|
||||
aws-sdk-dynamodb = { version = "1.55.0" }
|
||||
@@ -129,7 +131,13 @@ huggingface = [
|
||||
"lance-namespace-impls/dir-huggingface",
|
||||
]
|
||||
dynamodb = ["lance/dynamodb", "aws"]
|
||||
remote = ["dep:reqwest", "dep:http", "lance-namespace-impls/rest", "lance-namespace-impls/rest-adapter"]
|
||||
remote = [
|
||||
"dep:reqwest",
|
||||
"dep:http",
|
||||
"dep:urlencoding",
|
||||
"lance-namespace-impls/rest",
|
||||
"lance-namespace-impls/rest-adapter",
|
||||
]
|
||||
fp16kernels = ["lance-linalg/fp16kernels"]
|
||||
s3-test = []
|
||||
bedrock = ["dep:aws-sdk-bedrockruntime"]
|
||||
|
||||
435
rust/lancedb/src/blob.rs
Normal file
@@ -0,0 +1,435 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
//! Lance blob v2 columns store large binary payloads out of line.
|
||||
//!
|
||||
//! Declare a column with [`blob`]. On write, [`crate::table::Table::add`] coerces
|
||||
//! raw `Binary` / `LargeBinary` into the blob struct layout. Queries return
|
||||
//! small descriptors, not bytes.
|
||||
//!
|
||||
//! Blob tables require Lance file format >= 2.2 and stable row ids at create.
|
||||
|
||||
use std::sync::Arc;
|
||||
|
||||
use arrow_array::builder::LargeBinaryBuilder;
|
||||
use arrow_array::{Array, LargeBinaryArray, RecordBatch, StructArray, UInt8Array, UInt64Array};
|
||||
use arrow_schema::{DataType, Field, Schema};
|
||||
use lance::dataset::{Dataset, WriteParams};
|
||||
use lance_arrow::FieldExt;
|
||||
use lance_core::datatypes::parse_field_path;
|
||||
use lance_encoding::version::LanceFileVersion;
|
||||
|
||||
use crate::error::{Error, Result};
|
||||
|
||||
pub use lance::dataset::BlobFile;
|
||||
|
||||
/// Creates an Arrow field for a Lance blob v2 column.
|
||||
///
|
||||
/// `Struct<data, uri>` with the `lance.blob.v2` marker. Same layout Lance
|
||||
/// expects on write.
|
||||
///
|
||||
/// A blob column may be top-level or nested inside a struct or list. Nested
|
||||
/// blobs are addressed by a dotted path (e.g. `info.blob`) in the read APIs.
|
||||
///
|
||||
/// ```
|
||||
/// use arrow_schema::{DataType, Field, Schema};
|
||||
///
|
||||
/// let schema = Schema::new(vec![
|
||||
/// Field::new("id", DataType::Int64, false),
|
||||
/// lancedb::blob("image", true),
|
||||
/// ]);
|
||||
/// ```
|
||||
pub fn blob(name: impl AsRef<str>, nullable: bool) -> Field {
|
||||
lance::blob::blob_field(name.as_ref(), nullable)
|
||||
}
|
||||
|
||||
/// Returns true if `field` is a blob v2 column.
|
||||
///
|
||||
/// ```
|
||||
/// let field = lancedb::blob("image", true);
|
||||
/// assert!(lancedb::blob::is_blob(&field));
|
||||
/// ```
|
||||
pub fn is_blob(field: &Field) -> bool {
|
||||
field.is_blob_v2()
|
||||
}
|
||||
|
||||
/// Returns true if `field`, or any field nested under it, is a blob v2 column.
|
||||
fn field_tree_has_blob_v2(field: &Field) -> bool {
|
||||
if field.is_blob_v2() {
|
||||
return true;
|
||||
}
|
||||
match field.data_type() {
|
||||
DataType::Struct(children) => children.iter().any(|c| field_tree_has_blob_v2(c)),
|
||||
DataType::List(child) | DataType::LargeList(child) | DataType::FixedSizeList(child, _) => {
|
||||
field_tree_has_blob_v2(child)
|
||||
}
|
||||
_ => false,
|
||||
}
|
||||
}
|
||||
|
||||
/// Collects the dotted paths of blob v2 columns under `field`, into `paths`.
|
||||
fn collect_blob_paths(field: &Field, prefix: &str, paths: &mut Vec<String>) {
|
||||
let path = if prefix.is_empty() {
|
||||
field.name().clone()
|
||||
} else {
|
||||
format!("{prefix}.{}", field.name())
|
||||
};
|
||||
if field.is_blob_v2() {
|
||||
paths.push(path);
|
||||
return;
|
||||
}
|
||||
match field.data_type() {
|
||||
DataType::Struct(children) => {
|
||||
for child in children {
|
||||
collect_blob_paths(child, &path, paths);
|
||||
}
|
||||
}
|
||||
DataType::List(child) | DataType::LargeList(child) | DataType::FixedSizeList(child, _) => {
|
||||
collect_blob_paths(child, &path, paths)
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns true if `schema` declares any blob v2 column, including nested ones.
|
||||
pub(crate) fn has_blob_columns(schema: &Schema) -> bool {
|
||||
schema.fields().iter().any(|f| field_tree_has_blob_v2(f))
|
||||
}
|
||||
|
||||
/// Blob v2 column paths in `schema`, declaration order preserved. Nested blobs
|
||||
/// are dotted paths (e.g. `info.blob`).
|
||||
pub(crate) fn blob_column_names(schema: &Schema) -> Vec<String> {
|
||||
let mut paths = Vec::new();
|
||||
for field in schema.fields() {
|
||||
collect_blob_paths(field, "", &mut paths);
|
||||
}
|
||||
paths
|
||||
}
|
||||
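The dotted-path collection done by `collect_blob_paths` and `blob_column_names` above can be illustrated without Arrow. The sketch below uses a hypothetical `Node` enum in place of `arrow_schema::Field` and leaves out list types; it only mirrors the recursion shape and the `prefix.name` joining rule from the diff.

```rust
/// Hypothetical, simplified stand-in for an Arrow field tree (lists omitted).
enum Node {
    Blob(&'static str),
    Leaf(&'static str),
    Struct(&'static str, Vec<Node>),
}

/// Join a parent path and a field name with a dot; top-level names stay bare.
fn join(prefix: &str, name: &str) -> String {
    if prefix.is_empty() {
        name.to_string()
    } else {
        format!("{prefix}.{name}")
    }
}

/// Depth-first walk that records the path of every blob node.
fn collect_paths(node: &Node, prefix: &str, paths: &mut Vec<String>) {
    match node {
        // A blob is a terminal match: record it and do not descend further.
        Node::Blob(name) => paths.push(join(prefix, name)),
        Node::Leaf(_) => {}
        Node::Struct(name, children) => {
            let path = join(prefix, name);
            for child in children {
                collect_paths(child, &path, paths);
            }
        }
    }
}

/// Blob paths for a whole schema, in declaration order.
fn blob_paths(schema: &[Node]) -> Vec<String> {
    let mut paths = Vec::new();
    for node in schema {
        collect_paths(node, "", &mut paths);
    }
    paths
}
```

With a schema of `id`, a top-level blob `image`, and a struct `info` that contains a blob named `blob`, the walk yields `image` and `info.blob`, matching the `info.blob` form mentioned in the doc comments.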
|
||||
/// Bumps storage format to at least [`LanceFileVersion::V2_2`] for blob schemas.
|
||||
pub(crate) fn ensure_blob_storage_version(schema: &Schema, params: &mut WriteParams) {
|
||||
if !has_blob_columns(schema) {
|
||||
return;
|
||||
}
|
||||
|
||||
let resolved = params
|
||||
.data_storage_version
|
||||
.unwrap_or(LanceFileVersion::Stable)
|
||||
.resolve();
|
||||
if resolved < LanceFileVersion::V2_2 {
|
||||
params.data_storage_version = Some(LanceFileVersion::V2_2);
|
||||
}
|
||||
}
|
||||
|
||||
/// Validate that `column` exists and is a blob v2 column.
|
||||
///
|
||||
/// Legacy v1 columns (`lance-encoding:blob`) error with a migration hint.
|
||||
pub(crate) fn ensure_blob_v2_column(
|
||||
schema: &lance_core::datatypes::Schema,
|
||||
column: &str,
|
||||
) -> Result<()> {
|
||||
match schema.field(column) {
|
||||
Some(field) if field.is_blob_v2() => Ok(()),
|
||||
Some(field) if field.is_blob() => Err(Error::InvalidInput {
|
||||
message: format!(
|
||||
"column '{column}' is a legacy blob column; blob APIs require blob v2 columns \
|
||||
(ARROW:extension:name = \"lance.blob.v2\")"
|
||||
),
|
||||
}),
|
||||
Some(_) => Err(Error::InvalidInput {
|
||||
message: format!("column '{column}' is not a blob column"),
|
||||
}),
|
||||
None => Err(Error::InvalidInput {
|
||||
message: format!("no column named '{column}' in this table"),
|
||||
}),
|
||||
}
|
||||
}
|
||||
|
||||
/// Returns the leaf descriptor `StructArray` for `column` in a descriptor batch.
|
||||
fn leaf_descriptor_struct<'a>(batch: &'a RecordBatch, column: &str) -> Result<&'a StructArray> {
|
||||
let path = parse_field_path(column).map_err(|e| Error::InvalidInput {
|
||||
message: format!("invalid blob column path '{column}': {e}"),
|
||||
})?;
|
||||
let not_struct = || Error::Runtime {
|
||||
message: format!("blob column '{column}' did not read back as a descriptor struct"),
|
||||
};
|
||||
let mut current = batch
|
||||
.column_by_name(&path[0])
|
||||
.and_then(|c| c.as_any().downcast_ref::<StructArray>())
|
||||
.ok_or_else(not_struct)?;
|
||||
for segment in &path[1..] {
|
||||
current = current
|
||||
.column_by_name(segment)
|
||||
.and_then(|c| c.as_any().downcast_ref::<StructArray>())
|
||||
.ok_or_else(not_struct)?;
|
||||
}
|
||||
Ok(current)
|
||||
}
|
||||
|
||||
/// Null rows in `row_ids`, from a descriptor take.
|
||||
///
|
||||
/// Lance `read_blobs` / `take_blobs` skip null rows (`kind == 0 && position == 0 && size == 0`).
|
||||
/// TODO(lance): aligned read API would drop this pass.
|
||||
async fn blob_null_mask(
|
||||
dataset: &Arc<Dataset>,
|
||||
column: &str,
|
||||
row_ids: &[u64],
|
||||
) -> Result<Vec<bool>> {
|
||||
let projection = dataset.schema().project(&[column])?;
|
||||
let descriptors = dataset.take_builder(row_ids, projection)?.execute().await?;
|
||||
if descriptors.num_rows() != row_ids.len() {
|
||||
return Err(Error::InvalidInput {
|
||||
message: format!(
|
||||
"blob take for column '{column}' requested {} row ids but only {} exist in the \
|
||||
table; pass row ids collected from this table",
|
||||
row_ids.len(),
|
||||
descriptors.num_rows()
|
||||
),
|
||||
});
|
||||
}
|
||||
let descriptor_struct = leaf_descriptor_struct(&descriptors, column)?;
|
||||
let child = |name: &str| {
|
||||
descriptor_struct
|
||||
.column_by_name(name)
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor for '{column}' is missing the '{name}' field"),
|
||||
})
|
||||
};
|
||||
let kinds = child("kind")?
|
||||
.as_any()
|
||||
.downcast_ref::<UInt8Array>()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor 'kind' for '{column}' is not a UInt8 array"),
|
||||
})?;
|
||||
let positions = child("position")?
|
||||
.as_any()
|
||||
.downcast_ref::<UInt64Array>()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor 'position' for '{column}' is not a UInt64 array"),
|
||||
})?;
|
||||
let sizes = child("size")?
|
||||
.as_any()
|
||||
.downcast_ref::<UInt64Array>()
|
||||
.ok_or_else(|| Error::Runtime {
|
||||
message: format!("blob descriptor 'size' for '{column}' is not a UInt64 array"),
|
||||
})?;
|
||||
|
||||
// Match Lance `collect_blob_entries_v2` skip condition (`BlobKind::Inline` == 0).
|
||||
Ok((0..descriptor_struct.len())
|
||||
.map(|i| {
|
||||
descriptor_struct.is_null(i)
|
||||
|| kinds.is_null(i)
|
||||
|| (kinds.value(i) == 0 && positions.value(i) == 0 && sizes.value(i) == 0)
|
||||
})
|
||||
.collect())
|
||||
}
|
||||
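The null test used by `blob_null_mask` can be shown in isolation. This is a minimal std-only sketch: `Descriptor` and `null_mask` are hypothetical names, `Option` models Arrow validity, and only the predicate itself is taken from the code above.

```rust
/// Hypothetical, simplified view of one blob descriptor row.
#[derive(Debug, Clone, Copy)]
struct Descriptor {
    kind: Option<u8>, // `None` models a null `kind` slot
    position: u64,
    size: u64,
}

/// A row counts as null when the struct slot is null, `kind` is null,
/// or `kind`, `position`, and `size` are all zero.
fn null_mask(rows: &[Option<Descriptor>]) -> Vec<bool> {
    rows.iter()
        .map(|row| match row {
            None => true,
            Some(d) => match d.kind {
                None => true,
                Some(kind) => kind == 0 && d.position == 0 && d.size == 0,
            },
        })
        .collect()
}
```

Note that an inline descriptor (`kind == 0`) with a non-zero size is not treated as null; only the all-zero combination is.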
|
||||
fn non_null_row_ids(row_ids: &[u64], null_mask: &[bool]) -> Vec<u64> {
|
||||
row_ids
|
||||
.iter()
|
||||
.zip(null_mask)
|
||||
.filter_map(|(row_id, is_null)| (!is_null).then_some(*row_id))
|
||||
.collect()
|
||||
}
|
||||
|
||||
/// Materialize blob bytes for `row_ids` (same length and order, nulls preserved).
|
||||
pub(crate) async fn take_blobs_aligned(
|
||||
dataset: &Arc<Dataset>,
|
||||
column: &str,
|
||||
row_ids: &[u64],
|
||||
) -> Result<LargeBinaryArray> {
|
||||
ensure_blob_v2_column(dataset.schema(), column)?;
|
||||
if row_ids.is_empty() {
|
||||
return Ok(LargeBinaryBuilder::new().finish());
|
||||
}
|
||||
|
||||
let null_mask = blob_null_mask(dataset, column, row_ids).await?;
|
||||
let non_null_row_ids = non_null_row_ids(row_ids, &null_mask);
|
||||
let non_null_count = non_null_row_ids.len();
|
||||
let payloads = if non_null_count == 0 {
|
||||
Vec::new()
|
||||
} else {
|
||||
dataset
|
||||
.read_blobs(column)?
|
||||
.with_row_ids(non_null_row_ids)
|
||||
.preserve_order(true)
|
||||
.execute()
|
||||
.await?
|
||||
};
|
||||
|
||||
if payloads.len() != non_null_count {
|
||||
return Err(Error::Runtime {
|
||||
message: format!(
|
||||
"blob read for column '{column}' returned {} payloads for {} non-null rows",
|
||||
payloads.len(),
|
||||
non_null_count
|
||||
),
|
||||
});
|
||||
}
|
||||
|
||||
let mut builder = LargeBinaryBuilder::new();
|
||||
let mut payload_idx = 0;
|
||||
for is_null in &null_mask {
|
||||
if *is_null {
|
||||
builder.append_null();
|
||||
} else {
|
||||
builder.append_value(payloads[payload_idx].data.as_ref());
|
||||
payload_idx += 1;
|
||||
}
|
||||
}
|
||||
Ok(builder.finish())
|
||||
}
|
||||
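The filter-then-realign pattern in `take_blobs_aligned` is independent of Lance. The following std-only sketch uses hypothetical helper names (`non_null_ids`, `realign`) and a generic payload type in place of blob bytes; it is an illustration of the technique, not the crate's API.

```rust
/// Keep only the ids whose mask entry is `false` (non-null), preserving order.
fn non_null_ids(row_ids: &[u64], null_mask: &[bool]) -> Vec<u64> {
    row_ids
        .iter()
        .zip(null_mask)
        .filter_map(|(id, is_null)| (!is_null).then_some(*id))
        .collect()
}

/// Re-interleave densely returned payloads with `None` for the null rows.
/// Fails if the reader returned a different number of payloads than expected.
fn realign<T>(null_mask: &[bool], payloads: Vec<T>) -> Result<Vec<Option<T>>, String> {
    let expected = null_mask.iter().filter(|is_null| !**is_null).count();
    if payloads.len() != expected {
        return Err(format!(
            "got {} payloads for {} non-null rows",
            payloads.len(),
            expected
        ));
    }
    let mut payloads = payloads.into_iter();
    Ok(null_mask
        .iter()
        .map(|is_null| if *is_null { None } else { payloads.next() })
        .collect())
}
```

Checking the count before interleaving is what lets the output keep the same length and order as the requested row ids.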
|
||||
/// Open lazy [`BlobFile`] handles for `row_ids` (same length and order, nulls as `None`).
|
||||
pub(crate) async fn take_blob_files_aligned(
|
||||
dataset: &Arc<Dataset>,
|
||||
column: &str,
|
||||
row_ids: &[u64],
|
||||
) -> Result<Vec<Option<BlobFile>>> {
|
||||
ensure_blob_v2_column(dataset.schema(), column)?;
|
||||
if row_ids.is_empty() {
|
||||
return Ok(Vec::new());
|
||||
}
|
||||
|
||||
let null_mask = blob_null_mask(dataset, column, row_ids).await?;
|
||||
let non_null_row_ids = non_null_row_ids(row_ids, &null_mask);
|
||||
let handles = if non_null_row_ids.is_empty() {
|
||||
Vec::new()
|
||||
} else {
|
||||
dataset.take_blobs(&non_null_row_ids, column).await?
|
||||
};
|
||||
if handles.len() != non_null_row_ids.len() {
|
||||
return Err(Error::Runtime {
|
||||
message: format!(
|
||||
"blob take for column '{column}' returned {} handles for {} non-null rows",
|
||||
handles.len(),
|
||||
non_null_row_ids.len()
|
||||
),
|
||||
});
|
||||
}
|
||||
|
||||
let mut handles = handles.into_iter();
|
||||
Ok(null_mask
|
||||
.iter()
|
||||
.map(|is_null| {
|
||||
if *is_null {
|
||||
None
|
||||
} else {
|
||||
Some(handles.next().unwrap())
|
||||
}
|
||||
})
|
||||
.collect())
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use arrow_schema::DataType;
|
||||
use lance_arrow::ARROW_EXT_NAME_KEY;
|
||||
|
||||
fn blob_schema() -> Schema {
|
||||
Schema::new(vec![
|
||||
Field::new("id", DataType::Int64, false),
|
||||
blob("image", true),
|
||||
])
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn blob_field_carries_v2_extension_marker() {
|
||||
let field = blob("image", true);
|
||||
assert_eq!(
|
||||
field.metadata().get(ARROW_EXT_NAME_KEY).map(String::as_str),
|
||||
Some("lance.blob.v2")
|
||||
);
|
||||
assert!(matches!(field.data_type(), DataType::Struct(_)));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn has_blob_columns_detects_blob_fields() {
|
||||
assert!(has_blob_columns(&blob_schema()));
|
||||
let plain = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
|
||||
assert!(!has_blob_columns(&plain));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_bumps_to_v2_2() {
|
||||
let mut params = WriteParams::default();
|
||||
ensure_blob_storage_version(&blob_schema(), &mut params);
|
||||
assert_eq!(
|
||||
params.data_storage_version.unwrap().resolve(),
|
||||
LanceFileVersion::V2_2
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_overrides_lower_explicit_version() {
|
||||
let mut params = WriteParams {
|
||||
data_storage_version: Some(LanceFileVersion::V2_0),
|
||||
..Default::default()
|
||||
};
|
||||
ensure_blob_storage_version(&blob_schema(), &mut params);
|
||||
assert_eq!(
|
||||
params.data_storage_version.unwrap().resolve(),
|
||||
LanceFileVersion::V2_2
|
||||
);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_keeps_higher_explicit_version() {
|
||||
let mut params = WriteParams {
|
||||
data_storage_version: Some(LanceFileVersion::V2_3),
|
||||
..Default::default()
|
||||
};
|
||||
ensure_blob_storage_version(&blob_schema(), &mut params);
|
||||
assert_eq!(params.data_storage_version.unwrap(), LanceFileVersion::V2_3);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn legacy_v1_blob_column_is_rejected_with_migration_hint() {
|
||||
let legacy = Field::new("image", DataType::LargeBinary, true).with_metadata(
|
||||
std::collections::HashMap::from([(
|
||||
"lance-encoding:blob".to_string(),
|
||||
"true".to_string(),
|
||||
)]),
|
||||
);
|
||||
let arrow_schema = Schema::new(vec![legacy]);
|
||||
let lance_schema = lance_core::datatypes::Schema::try_from(&arrow_schema).unwrap();
|
||||
|
||||
let err = ensure_blob_v2_column(&lance_schema, "image").unwrap_err();
|
||||
assert!(matches!(err, Error::InvalidInput { .. }));
|
||||
assert!(err.to_string().contains("legacy blob column"));
|
||||
assert!(err.to_string().contains("lance.blob.v2"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn non_blob_and_unknown_columns_are_rejected_by_name() {
|
||||
let arrow_schema = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
|
||||
let lance_schema = lance_core::datatypes::Schema::try_from(&arrow_schema).unwrap();
|
||||
|
||||
let err = ensure_blob_v2_column(&lance_schema, "id").unwrap_err();
|
||||
assert!(err.to_string().contains("'id' is not a blob column"));
|
||||
|
||||
let err = ensure_blob_v2_column(&lance_schema, "missing").unwrap_err();
|
||||
assert!(err.to_string().contains("no column named 'missing'"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn blob_column_names_includes_nested_path() {
|
||||
let blob_field = blob("blob", true);
|
||||
let info = Field::new(
|
||||
"info",
|
||||
DataType::Struct(vec![Field::new("name", DataType::Utf8, false), blob_field].into()),
|
||||
true,
|
||||
);
|
||||
let schema = Schema::new(vec![Field::new("id", DataType::Int64, false), info]);
|
||||
assert_eq!(blob_column_names(&schema), vec!["info.blob"]);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn storage_version_noop_without_blob_columns() {
|
||||
let schema = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
|
||||
let mut params = WriteParams::default();
|
||||
ensure_blob_storage_version(&schema, &mut params);
|
||||
assert!(params.data_storage_version.is_none());
|
||||
}
|
||||
}
|
||||
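The storage-version tests above imply a simple rule: with blob columns present, the effective version is the larger of the requested version and V2_2; without blob columns, the request is left untouched. A std-only sketch of that rule follows. `FileVersion` and `ensure_blob_version` are hypothetical stand-ins for `LanceFileVersion` and `ensure_blob_storage_version`, whose bodies are not shown in this diff.

```rust
/// Hypothetical stand-in for a file-format version; declaration order defines the ordering.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FileVersion {
    V2_0,
    V2_1,
    V2_2,
    V2_3,
}

/// Minimum version assumed for blob v2 columns, per the tests above.
const MIN_BLOB_VERSION: FileVersion = FileVersion::V2_2;

/// Raise the requested version to the blob minimum, never lowering a higher explicit choice.
fn ensure_blob_version(
    has_blob_columns: bool,
    requested: Option<FileVersion>,
) -> Option<FileVersion> {
    if !has_blob_columns {
        return requested; // no-op for schemas without blob columns
    }
    Some(requested.map_or(MIN_BLOB_VERSION, |v| v.max(MIN_BLOB_VERSION)))
}
```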
@@ -9,6 +9,7 @@ use std::sync::Arc;
|
||||
use arrow_array::RecordBatch;
|
||||
use arrow_schema::SchemaRef;
|
||||
use lance::dataset::ReadParams;
|
||||
use lance::dataset::refs::MAIN_BRANCH;
|
||||
use lance_namespace::models::{
|
||||
CreateNamespaceRequest, CreateNamespaceResponse, DescribeNamespaceRequest,
|
||||
DescribeNamespaceResponse, DropNamespaceRequest, DropNamespaceResponse, ListNamespacesRequest,
|
||||
@@ -22,8 +23,10 @@ use crate::connection::create_table::CreateTableBuilder;
|
||||
use crate::data::scannable::Scannable;
|
||||
use crate::database::listing::ListingDatabase;
|
||||
use crate::database::{
|
||||
-    CloneTableRequest, Database, DatabaseOptions, OpenTableRequest, ReadConsistency,
-    TableNamesRequest,
+    CloneTableRequest, CreateFunctionRequest, CreateMaterializedViewRequest, Database,
+    DatabaseOptions, FunctionInfo, JobErrorInfo, JobHistoryInfo, JobInfo, MaterializedViewInfo,
+    MvRefreshPlan, OpenTableRequest, ReadConsistency, RefreshMaterializedViewRequest,
+    TableLineageRequest, TableNamesRequest,
|
||||
};
|
||||
use crate::embeddings::{EmbeddingRegistry, MemoryRegistry};
|
||||
use crate::error::{Error, Result};
|
||||
@@ -119,6 +122,8 @@ pub struct OpenTableBuilder {
|
||||
parent: Arc<dyn Database>,
|
||||
request: OpenTableRequest,
|
||||
embedding_registry: Arc<dyn EmbeddingRegistry>,
|
||||
branch: Option<String>,
|
||||
version: Option<u64>,
|
||||
}
|
||||
|
||||
impl OpenTableBuilder {
|
||||
@@ -139,6 +144,8 @@ impl OpenTableBuilder {
|
||||
managed_versioning: None,
|
||||
},
|
||||
embedding_registry,
|
||||
branch: None,
|
||||
version: None,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -259,14 +266,48 @@ impl OpenTableBuilder {
|
||||
self
|
||||
}
|
||||
|
||||
/// Open the table scoped to the given branch instead of the default branch.
|
||||
///
|
||||
/// Reads and writes on the returned table operate in the branch's context.
|
||||
pub fn branch(mut self, branch: impl Into<String>) -> Self {
|
||||
self.branch = Some(branch.into());
|
||||
self
|
||||
}
|
||||
|
||||
/// Open the table pinned to a specific version, producing a read-only "view".
|
||||
///
|
||||
/// Composes with [`Self::branch`]: when a branch is also set, this opens that
|
||||
/// branch at the given version; otherwise it opens `main` at that version.
|
||||
/// The returned table is a detached head, so operations that modify the table
|
||||
/// will fail until [`Table::checkout_latest`] is called.
|
||||
///
|
||||
/// ```
|
||||
/// # use lancedb::Connection;
|
||||
/// # async fn f(conn: &Connection) -> Result<(), Box<dyn std::error::Error>> {
|
||||
/// let table = conn.open_table("t").branch("exp").version(3).execute().await?;
|
||||
/// # Ok(())
|
||||
/// # }
|
||||
/// ```
|
||||
pub fn version(mut self, version: u64) -> Self {
|
||||
self.version = Some(version);
|
||||
self
|
||||
}
|
||||
|
||||
/// Open the table
|
||||
pub async fn execute(self) -> Result<Table> {
|
||||
let table = self.parent.open_table(self.request).await?;
|
||||
-        Ok(Table::new_with_embedding_registry(
-            table,
-            self.parent,
-            self.embedding_registry,
-        ))
+        let table = Table::new_with_embedding_registry(table, self.parent, self.embedding_registry);
+        // "main" is the default branch, so treat it as no branch.
+        let branch = self.branch.filter(|b| b.as_str() != MAIN_BRANCH);
+        match branch {
+            Some(branch) => table.checkout_branch(&branch, self.version).await,
+            None => {
+                if let Some(version) = self.version {
+                    table.checkout(version).await?;
+                }
+                Ok(table)
+            }
+        }
|
||||
}
|
||||
}
|
||||
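The way `execute` combines `branch` and `version` can be summarized as a pure decision function. The sketch below is std-only and illustrative: `Checkout` and `resolve_checkout` are hypothetical names, and `MAIN_BRANCH` is assumed to be the string "main", as the code comment suggests.

```rust
/// Assumed value of the default-branch constant.
const MAIN_BRANCH: &str = "main";

/// Hypothetical description of what the opened table should be checked out to.
#[derive(Debug, PartialEq, Eq)]
enum Checkout {
    /// Default branch, latest version.
    Latest,
    /// Default branch pinned to a version (read-only view).
    Version(u64),
    /// Named branch, optionally pinned to a version.
    Branch { name: String, version: Option<u64> },
}

/// "main" is treated the same as no branch; a version alone pins the default branch.
fn resolve_checkout(branch: Option<String>, version: Option<u64>) -> Checkout {
    let branch = branch.filter(|b| b.as_str() != MAIN_BRANCH);
    match (branch, version) {
        (Some(name), version) => Checkout::Branch { name, version },
        (None, Some(v)) => Checkout::Version(v),
        (None, None) => Checkout::Latest,
    }
}
```

Under these assumptions, `.branch("main").version(3)` and `.version(3)` resolve to the same checkout.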
|
||||
@@ -449,6 +490,113 @@ impl Connection {
|
||||
)
|
||||
}
|
||||
|
||||
// -- Derived compute: functions, materialized views, jobs -------------
|
||||
// Server-backed features (LanceDB Enterprise / Cloud); local
|
||||
// databases return NotSupported for now.
|
||||
|
||||
/// Register a UDF (CREATE FUNCTION).
|
||||
pub async fn create_function(&self, request: CreateFunctionRequest) -> Result<()> {
|
||||
self.internal.create_function(request).await
|
||||
}
|
||||
|
||||
/// List registered functions (SHOW FUNCTIONS).
|
||||
pub async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
|
||||
self.internal.list_functions().await
|
||||
}
|
||||
|
||||
/// Drop a registered function (DROP FUNCTION).
|
||||
pub async fn drop_function(&self, name: &str) -> Result<()> {
|
||||
self.internal.drop_function(name).await
|
||||
}
|
||||
|
||||
/// Create a materialized view (CREATE MATERIALIZED VIEW). Returns
|
||||
/// the initial-population job id, absent when `with_no_data`.
|
||||
pub async fn create_materialized_view(
|
||||
&self,
|
||||
request: CreateMaterializedViewRequest,
|
||||
) -> Result<Option<String>> {
|
||||
self.internal.create_materialized_view(request).await
|
||||
}
|
||||
|
||||
/// Refresh a materialized view; returns the refresh job id.
|
||||
pub async fn refresh_materialized_view(
|
||||
&self,
|
||||
request: RefreshMaterializedViewRequest,
|
||||
) -> Result<String> {
|
||||
self.internal.refresh_materialized_view(request).await
|
||||
}
|
||||
|
||||
/// Derived-compute lineage of a table/view (or column), as server-defined
|
||||
/// JSON. Read-only.
|
||||
pub async fn table_lineage(&self, request: TableLineageRequest) -> Result<String> {
|
||||
self.internal.table_lineage(request).await
|
||||
}
|
||||
|
||||
/// Plan a materialized-view refresh without submitting work
|
||||
/// (EXPLAIN REFRESH).
|
||||
pub async fn explain_refresh_materialized_view(
|
||||
&self,
|
||||
name: &str,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
) -> Result<MvRefreshPlan> {
|
||||
self.internal
|
||||
.explain_refresh_materialized_view(name, full, src_version)
|
||||
.await
|
||||
}
|
||||
|
||||
/// Update a materialized view's options (ALTER MATERIALIZED VIEW).
|
||||
pub async fn alter_materialized_view(&self, name: &str, auto_refresh: bool) -> Result<()> {
|
||||
self.internal
|
||||
.alter_materialized_view(name, auto_refresh)
|
||||
.await
|
||||
}
|
||||
|
||||
/// Drop a materialized view definition (DROP MATERIALIZED VIEW).
|
||||
pub async fn drop_materialized_view(&self, name: &str) -> Result<()> {
|
||||
self.internal.drop_materialized_view(name).await
|
||||
}
|
||||
|
||||
/// List registered materialized view definitions.
|
||||
pub async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
|
||||
self.internal.list_materialized_views().await
|
||||
}
|
||||
|
||||
/// List inflight server-side jobs across the database's tables.
|
||||
pub async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
|
||||
self.internal.list_jobs().await
|
||||
}
|
||||
|
||||
/// Cancel an inflight server-side job by id. Returns true if a
|
||||
/// matching inflight job was flagged for cancellation.
|
||||
pub async fn cancel_job(&self, job_id: &str) -> Result<bool> {
|
||||
self.internal.cancel_job(job_id).await
|
||||
}
|
||||
|
||||
/// Look up a single server-side job by id -- the `wait()`/status poll path.
|
||||
/// `table_hint` (the job's table) enables an O(1) server-side lookup; `None`
|
||||
/// scans the database's active jobs. A `None` result means unknown / not
|
||||
/// active.
|
||||
pub async fn get_job(&self, job_id: &str, table_hint: Option<&str>) -> Result<Option<JobInfo>> {
|
||||
self.internal.get_job(job_id, table_hint).await
|
||||
}
|
||||
|
||||
/// Durable job history (SHOW JOB HISTORY) across the database's tables.
|
||||
/// Pass `job_id` to narrow to a single job.
|
||||
pub async fn job_history(&self, job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
|
||||
self.internal.job_history(job_id).await
|
||||
}
|
||||
|
||||
/// Per-row UDF errors (SHOW ERRORS) across the database's tables, optionally
|
||||
/// filtered by `job_id` and/or `table`.
|
||||
pub async fn errors(
|
||||
&self,
|
||||
job_id: Option<&str>,
|
||||
table: Option<&str>,
|
||||
) -> Result<Vec<JobErrorInfo>> {
|
||||
self.internal.errors(job_id, table).await
|
||||
}
|
||||
|
||||
/// Rename a table in the database.
|
||||
///
|
||||
/// This is only supported in LanceDB Cloud.
|
||||
@@ -537,6 +685,9 @@ impl Connection {
|
||||
/// For LanceNamespaceDatabase, it is the underlying LanceNamespace.
|
||||
/// For ListingDatabase, it is the equivalent DirectoryNamespace.
|
||||
/// For RemoteDatabase, it is the equivalent RestNamespace.
|
||||
///
|
||||
/// Remote connections using dynamic headers forward them through the
|
||||
/// namespace client's per-request context provider.
|
||||
pub async fn namespace_client(&self) -> Result<Arc<dyn lance_namespace::LanceNamespace>> {
|
||||
self.internal.namespace_client().await
|
||||
}
|
||||
@@ -545,6 +696,9 @@ impl Connection {
|
||||
/// Returns (impl_type, properties) where:
|
||||
/// - impl_type: "dir" for DirectoryNamespace, "rest" for RestNamespace
|
||||
/// - properties: configuration properties for the namespace
|
||||
///
|
||||
/// Remote connections using dynamic headers cannot be exported because the
|
||||
/// namespace client config only carries static headers.
|
||||
pub async fn namespace_client_config(
|
||||
&self,
|
||||
) -> Result<(String, std::collections::HashMap<String, String>)> {
|
||||
@@ -622,6 +776,8 @@ pub struct ConnectRequest {
|
||||
pub struct ConnectBuilder {
|
||||
request: ConnectRequest,
|
||||
embedding_registry: Option<Arc<dyn EmbeddingRegistry>>,
|
||||
#[cfg(feature = "remote")]
|
||||
oauth_config: Option<crate::remote::OAuthConfig>,
|
||||
}
|
||||
|
||||
#[cfg(feature = "remote")]
|
||||
@@ -643,6 +799,8 @@ impl ConnectBuilder {
|
||||
session: None,
|
||||
},
|
||||
embedding_registry: None,
|
||||
#[cfg(feature = "remote")]
|
||||
oauth_config: None,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -731,6 +889,19 @@ impl ConnectBuilder {
|
||||
self
|
||||
}
|
||||
|
||||
/// Configure OAuth authentication for LanceDB Cloud/Enterprise.
|
||||
///
|
||||
/// This creates an [`OAuthHeaderProvider`](crate::remote::OAuthHeaderProvider)
|
||||
/// from the given config and sets it as the header provider. OAuth cannot
|
||||
/// be combined with an API key or another header provider.
|
||||
///
|
||||
/// Token acquisition and refresh are handled in Rust.
|
||||
#[cfg(feature = "remote")]
|
||||
pub fn oauth_config(mut self, config: crate::remote::OAuthConfig) -> Self {
|
||||
self.oauth_config = Some(config);
|
||||
self
|
||||
}
|
||||
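The exclusivity rules stated in the doc comment (OAuth cannot be combined with an API key or another header provider) can be sketched as a small validation function. This is a std-only illustration with hypothetical names (`Auth`, `resolve_auth`) and shortened error strings; it does not construct a real `OAuthHeaderProvider`.

```rust
/// Hypothetical outcome of credential validation.
#[derive(Debug, PartialEq, Eq)]
enum Auth {
    ApiKey(String),
    OAuth,
}

/// Exactly one of API key or OAuth must be set, and OAuth excludes a custom header provider.
fn resolve_auth(
    api_key: Option<&str>,
    has_oauth: bool,
    has_header_provider: bool,
) -> Result<Auth, String> {
    let auth = match (has_oauth, api_key) {
        (true, Some(_)) => {
            return Err("api_key and oauth_config cannot both be set".to_string());
        }
        (true, None) => Auth::OAuth,
        (false, Some(key)) => Auth::ApiKey(key.to_string()),
        (false, None) => {
            return Err("an api_key or oauth_config is required".to_string());
        }
    };
    if has_oauth && has_header_provider {
        return Err("oauth_config and header_provider cannot both be set".to_string());
    }
    Ok(auth)
}
```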
|
||||
/// Provide a custom [`EmbeddingRegistry`] to use for this connection.
|
||||
pub fn embedding_registry(mut self, registry: Arc<dyn EmbeddingRegistry>) -> Self {
|
||||
self.embedding_registry = Some(registry);
|
||||
@@ -876,9 +1047,40 @@ impl ConnectBuilder {
|
||||
let region = options.region.ok_or_else(|| Error::InvalidInput {
|
||||
message: "A region is required when connecting to LanceDb Cloud".to_string(),
|
||||
})?;
|
||||
-        let api_key = options.api_key.ok_or_else(|| Error::InvalidInput {
-            message: "An api_key is required when connecting to LanceDb Cloud".to_string(),
-        })?;
+        let api_key = match (&self.oauth_config, &options.api_key) {
+            (Some(_), None) => String::new(),
+            (Some(_), Some(_)) => {
+                return Err(Error::InvalidInput {
+                    message:
+                        "api_key and oauth_config cannot both be set when connecting to LanceDb Cloud"
+                            .to_string(),
+                });
+            }
+            (None, Some(key)) => key.clone(),
+            (None, None) => {
+                return Err(Error::InvalidInput {
+                    message:
+                        "An api_key or oauth_config is required when connecting to LanceDb Cloud"
+                            .to_string(),
+                });
+            }
+        };
|
||||
|
||||
if self.oauth_config.is_some() && self.request.client_config.header_provider.is_some() {
|
||||
return Err(Error::InvalidInput {
|
||||
message:
|
||||
"oauth_config and client_config.header_provider cannot both be set when connecting to LanceDb Cloud"
|
||||
.to_string(),
|
||||
});
|
||||
}
|
||||
|
||||
let mut client_config = self.request.client_config;
|
||||
|
||||
if let Some(oauth_config) = self.oauth_config {
|
||||
let provider = crate::remote::OAuthHeaderProvider::new(oauth_config)?;
|
||||
client_config.header_provider =
|
||||
Some(Arc::new(provider) as Arc<dyn crate::remote::HeaderProvider>);
|
||||
}
|
||||
|
||||
let storage_options = StorageOptions(options.storage_options.clone());
|
||||
let internal = Arc::new(crate::remote::db::RemoteDatabase::try_new(
|
||||
@@ -886,7 +1088,7 @@ impl ConnectBuilder {
|
||||
&api_key,
|
||||
&region,
|
||||
options.host_override,
|
||||
-            self.request.client_config,
+            client_config,
|
||||
storage_options.into(),
|
||||
self.request.read_consistency_interval,
|
||||
)?);
|
||||
@@ -1195,6 +1397,83 @@ mod tests {
|
||||
assert_eq!(Some(&"EXPLICIT-VALUE".to_string()), options.get(opts_key));
|
||||
}
|
||||
|
||||
#[cfg(feature = "remote")]
|
||||
#[tokio::test]
|
||||
async fn test_connect_rejects_api_key_with_oauth_config() {
|
||||
let oauth_config = crate::remote::OAuthConfig {
|
||||
issuer_url: "https://issuer.example.com".to_string(),
|
||||
client_id: "client-id".to_string(),
|
||||
client_secret: Some("secret".to_string()),
|
||||
scopes: vec!["scope".to_string()],
|
||||
flow: crate::remote::OAuthFlow::ClientCredentials,
|
||||
refresh_buffer_secs: None,
|
||||
};
|
||||
|
||||
let result = ConnectBuilder::new("db://my-container/my-prefix")
|
||||
.region("us-east-1")
|
||||
.api_key("my-api-key")
|
||||
.oauth_config(oauth_config)
|
||||
.execute()
|
||||
.await;
|
||||
|
||||
match result {
|
||||
Err(Error::InvalidInput { message })
|
||||
if message
|
||||
== "api_key and oauth_config cannot both be set when connecting to LanceDb Cloud" =>
|
||||
{}
|
||||
Err(err) => panic!("expected InvalidInput, got {err:?}"),
|
||||
Ok(_) => panic!("expected api_key and oauth_config to be rejected"),
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(feature = "remote")]
|
||||
#[tokio::test]
|
||||
async fn test_connect_rejects_header_provider_with_oauth_config() {
|
||||
#[derive(Debug)]
|
||||
struct TestHeaderProvider;
|
||||
|
||||
#[async_trait::async_trait]
|
||||
impl crate::remote::HeaderProvider for TestHeaderProvider {
|
||||
async fn get_headers(&self) -> Result<HashMap<String, String>> {
|
||||
Ok(HashMap::from([(
|
||||
"authorization".to_string(),
|
||||
"Bearer token".to_string(),
|
||||
)]))
|
||||
}
|
||||
}
|
||||
|
||||
let oauth_config = crate::remote::OAuthConfig {
|
||||
issuer_url: "https://issuer.example.com".to_string(),
|
||||
client_id: "client-id".to_string(),
|
||||
client_secret: Some("secret".to_string()),
|
||||
scopes: vec!["scope".to_string()],
|
||||
flow: crate::remote::OAuthFlow::ClientCredentials,
|
||||
refresh_buffer_secs: None,
|
||||
};
|
||||
let client_config = crate::remote::ClientConfig {
|
||||
header_provider: Some(
|
||||
Arc::new(TestHeaderProvider) as Arc<dyn crate::remote::HeaderProvider>
|
||||
),
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let result = ConnectBuilder::new("db://my-container/my-prefix")
|
||||
.region("us-east-1")
|
||||
.client_config(client_config)
|
||||
.oauth_config(oauth_config)
|
||||
.execute()
|
||||
.await;
|
||||
|
||||
match result {
|
||||
Err(Error::InvalidInput { message })
|
||||
if message
|
||||
== "oauth_config and client_config.header_provider cannot both be set when connecting to LanceDb Cloud" =>
|
||||
{}
|
||||
Err(err) => panic!("expected InvalidInput, got {err:?}"),
|
||||
Ok(_) => panic!("expected header_provider and oauth_config to be rejected"),
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(not(windows))]
|
||||
#[tokio::test]
|
||||
async fn test_connect_relative() {
|
||||
|
||||
@@ -27,11 +27,12 @@ use lance_namespace::models::{
|
||||
};
|
||||
|
||||
use crate::data::scannable::Scannable;
|
||||
-use crate::error::Result;
+use crate::error::{Error, Result};
|
||||
use crate::table::{BaseTable, WriteOptions};
|
||||
|
||||
pub mod listing;
|
||||
pub mod namespace;
|
||||
pub(crate) mod read_freshness;
|
||||
|
||||
pub trait DatabaseOptions {
|
||||
fn serialize_into_map(&self, map: &mut HashMap<String, String>);
|
||||
@@ -199,6 +200,205 @@ pub enum ReadConsistency {
|
||||
Strong,
|
||||
}
|
||||
|
||||
/// A request to register a UDF (CREATE FUNCTION).
|
||||
///
|
||||
/// Functions are first-class database objects, decoupled from any
|
||||
/// column; computed columns and materialized views reference them by
|
||||
/// name. Server-backed feature (LanceDB Enterprise / Cloud).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct CreateFunctionRequest {
|
||||
/// Function name.
|
||||
pub name: String,
|
||||
/// Implementation language (currently "python").
|
||||
pub language: String,
|
||||
/// SQL return type, e.g. `FLOAT`, `FLOAT[1536]`,
|
||||
/// `STRUCT(a FLOAT, b VARCHAR)`, `TABLE(chunk VARCHAR, idx INT)`.
|
||||
pub return_type: String,
|
||||
/// Function body: source text, or base64 cloudpickle bytes when
|
||||
/// `options["body_format"] = "cloudpickle"`.
|
||||
pub body: String,
|
||||
/// Options: input_columns, pip, num_gpus, batch_size, timeout,
|
||||
/// error_policy, docker_image, body_format, ...
|
||||
pub options: HashMap<String, String>,
|
||||
}
|
||||
|
||||
/// A registered function, as returned by `list_functions`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct FunctionInfo {
|
||||
pub name: String,
|
||||
pub language: String,
|
||||
pub return_type: String,
|
||||
pub description: String,
|
||||
}
|
||||
|
||||
/// A request to create a materialized view (CREATE MATERIALIZED VIEW).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct CreateMaterializedViewRequest {
|
||||
/// View name.
|
||||
pub name: String,
|
||||
/// The view's SELECT statement, e.g.
|
||||
/// `SELECT id, embed(body) AS vec FROM articles WHERE id > 1`.
|
||||
/// Bare columns project through; function-call columns compute via
|
||||
/// registered UDFs (a RETURNS TABLE function makes a row-expanding
|
||||
/// chunker view).
|
||||
pub query: String,
|
||||
/// Refresh automatically when the source table changes.
|
||||
pub auto_refresh: bool,
|
||||
/// Register the definition only; skip the initial population.
|
||||
pub with_no_data: bool,
|
||||
/// Optional source column to partition the view's table function on. If the
|
||||
/// column has an IVF vector index the server partitions by its clusters
|
||||
/// (image-dedup style); otherwise it groups by distinct value.
|
||||
pub partition_by: Option<String>,
|
||||
}
|
||||
|
||||
impl CreateMaterializedViewRequest {
|
||||
pub fn new(name: impl Into<String>, query: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
query: query.into(),
|
||||
auto_refresh: false,
|
||||
with_no_data: false,
|
||||
partition_by: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A request to refresh a materialized view.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct RefreshMaterializedViewRequest {
|
||||
/// View name.
|
||||
pub name: String,
|
||||
/// Force a full rebuild (recompute and replace every row) instead of the
|
||||
/// default incremental refresh.
|
||||
pub full: bool,
|
||||
/// Pin the refresh to a source-table version; latest when absent.
|
||||
pub src_version: Option<u64>,
|
||||
/// Initial worker count.
|
||||
pub num_workers: Option<u32>,
|
||||
/// Elastic worker ceiling.
|
||||
pub max_workers: Option<u32>,
|
||||
}
|
||||
|
||||
/// A request for the derived-compute lineage of a table/view (or one of its
|
||||
/// columns). The response is server-defined lineage JSON, returned opaque so
|
||||
/// this client need not model the server's lineage schema.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub struct TableLineageRequest {
|
||||
/// Table or view name.
|
||||
pub name: String,
|
||||
/// Column for column-level lineage; whole table/view when absent.
|
||||
pub column: Option<String>,
|
||||
/// "upstream" | "downstream" | "both" (server default when absent).
|
||||
pub direction: Option<String>,
|
||||
/// Column-hops to walk; transitive when absent.
|
||||
pub depth: Option<u32>,
|
||||
}
|
||||
|
||||
impl RefreshMaterializedViewRequest {
|
||||
pub fn new(name: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
full: false,
|
||||
src_version: None,
|
||||
num_workers: None,
|
||||
max_workers: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A registered materialized view definition, as returned by
|
||||
/// `list_materialized_views`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct MaterializedViewInfo {
|
||||
pub name: String,
|
||||
pub source_table: String,
|
||||
/// Source columns projected through.
|
||||
pub projection: Vec<String>,
|
||||
/// `alias=expression` per UDF-computed column.
|
||||
pub udf_columns: Vec<String>,
|
||||
pub filter: Option<String>,
|
||||
pub auto_refresh: bool,
|
||||
}
|
||||
|
||||
/// A row from `list_jobs`: one inflight server-side job (index build,
|
||||
/// compaction, column refresh, view refresh, ...).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct JobInfo {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
/// Lifecycle state: "running", "cancelling", or "stale".
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub age_seconds: Option<i64>,
|
||||
pub command: Option<String>,
|
||||
pub units_done: Option<i64>,
|
||||
pub units_total: Option<i64>,
|
||||
/// Whether the job's final commit has completed (output visible).
|
||||
pub committed: bool,
|
||||
pub rows_skipped: u64,
|
||||
pub error: Option<String>,
|
||||
}
|
||||
|
||||
/// A row from `job_history`: one durable, completed/terminal server-side job
/// record (SHOW JOB HISTORY), read from a table's `_job_history` store. Unlike
/// `JobInfo` (live, inflight jobs) this carries created/updated/completed
/// timestamps and the lifecycle event log.
#[derive(Debug, Clone)]
pub struct JobHistoryInfo {
    pub table: String,
    pub job_id: String,
    pub job_type: String,
    pub state: String,
    pub column: Option<String>,
    pub created_ms: i64,
    pub updated_ms: i64,
    pub completed_ms: Option<i64>,
    pub rows_processed: Option<i64>,
    pub rows_skipped: Option<i64>,
    pub error: Option<String>,
    /// Newline-joined lifecycle event log, oldest first.
    pub events: Option<String>,
}

/// A row from `errors`: one per-row UDF failure recorded by `error_policy=skip`
/// (SHOW ERRORS).
#[derive(Debug, Clone)]
pub struct JobErrorInfo {
    pub job_id: String,
    pub table: String,
    pub column: String,
    pub error_type: String,
    pub error_message: String,
    pub fragment_id: Option<i64>,
    pub source_row_id: Option<i64>,
    pub table_version: Option<i64>,
    pub age_seconds: Option<i64>,
}

/// The plan a `REFRESH MATERIALIZED VIEW` would execute, as returned by
/// `explain_refresh_materialized_view` (EXPLAIN REFRESH). No work is run.
#[derive(Debug, Clone)]
pub struct MvRefreshPlan {
    pub table_name: String,
    /// Whether a refresh would do anything (rebuild or non-empty units).
    pub has_work: bool,
    pub source_version: u64,
    pub last_refreshed_version: Option<u64>,
    pub full_refresh: bool,
    /// Source changed non-append-only since the last refresh -> rebuild.
    pub rebuild: bool,
    /// Number of row-range work units the refresh would process.
    pub units_total: u64,
}

fn not_supported<T>(what: &str) -> Result<T> {
    Err(Error::NotSupported {
        message: format!("{} is not supported by this database", what),
    })
}

/// The `Database` trait defines the interface for database implementations.
///
/// A database is responsible for managing tables and their metadata.
@@ -244,6 +444,99 @@ pub trait Database:
    ///
    /// See [`CloneTableRequest`] for detailed documentation and examples.
    async fn clone_table(&self, request: CloneTableRequest) -> Result<Arc<dyn BaseTable>>;

    // -- Derived compute: functions, materialized views, jobs -------------
    //
    // Server-backed features (LanceDB Enterprise / Cloud). The defaults
    // return NotSupported; the remote database overrides them. Local
    // single-node implementations are planned.

    /// Register a UDF (CREATE FUNCTION).
    async fn create_function(&self, _request: CreateFunctionRequest) -> Result<()> {
        not_supported("create_function")
    }
    /// List registered functions (SHOW FUNCTIONS).
    async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
        not_supported("list_functions")
    }
    /// Drop a registered function (DROP FUNCTION).
    async fn drop_function(&self, _name: &str) -> Result<()> {
        not_supported("drop_function")
    }
    /// Create a materialized view (CREATE MATERIALIZED VIEW). Returns
    /// the initial-population job id, absent when `with_no_data`.
    async fn create_materialized_view(
        &self,
        _request: CreateMaterializedViewRequest,
    ) -> Result<Option<String>> {
        not_supported("create_materialized_view")
    }
    /// Refresh a materialized view; returns the refresh job id.
    async fn refresh_materialized_view(
        &self,
        _request: RefreshMaterializedViewRequest,
    ) -> Result<String> {
        not_supported("refresh_materialized_view")
    }
    /// Derived-compute lineage of a table/view (or column), as server-defined
    /// JSON. Read-only.
    async fn table_lineage(&self, _request: TableLineageRequest) -> Result<String> {
        not_supported("table_lineage")
    }
    /// Plan a materialized-view refresh without submitting work
    /// (EXPLAIN REFRESH). `full` plans a full rebuild (incremental
    /// planning requires stable row IDs on the source).
    async fn explain_refresh_materialized_view(
        &self,
        _name: &str,
        _full: bool,
        _src_version: Option<u64>,
    ) -> Result<MvRefreshPlan> {
        not_supported("explain_refresh_materialized_view")
    }
    /// Update a materialized view's options (ALTER MATERIALIZED VIEW).
    async fn alter_materialized_view(&self, _name: &str, _auto_refresh: bool) -> Result<()> {
        not_supported("alter_materialized_view")
    }
    /// Drop a materialized view definition (DROP MATERIALIZED VIEW).
    async fn drop_materialized_view(&self, _name: &str) -> Result<()> {
        not_supported("drop_materialized_view")
    }
    /// List registered materialized view definitions.
    async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
        not_supported("list_materialized_views")
    }
    /// List inflight server-side jobs across the database's tables.
    async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
        not_supported("list_jobs")
    }
    /// Cancel an inflight server-side job by id. Returns true if a
    /// matching inflight job was found and flagged for cancellation,
    /// false if none was inflight (best-effort, like SQL `CANCEL JOB`).
    async fn cancel_job(&self, _job_id: &str) -> Result<bool> {
        not_supported("cancel_job")
    }
    /// Point-access for a single job by id -- the `wait()`/status poll path.
    /// `table_hint` (the job's table, which `wait()` callers know) enables an
    /// O(1) server-side lookup. `None` if the job is unknown or not active.
    async fn get_job(&self, _job_id: &str, _table_hint: Option<&str>) -> Result<Option<JobInfo>> {
        not_supported("get_job")
    }
    /// Durable job history (SHOW JOB HISTORY) across the database's tables,
    /// optionally narrowed to a single `job_id`.
    async fn job_history(&self, _job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
        not_supported("job_history")
    }
    /// Per-row UDF errors (SHOW ERRORS) recorded by `error_policy=skip` across
    /// the database's tables, optionally filtered by `job_id` and/or `table`.
    async fn errors(
        &self,
        _job_id: Option<&str>,
        _table: Option<&str>,
    ) -> Result<Vec<JobErrorInfo>> {
        not_supported("errors")
    }

    /// Open a table in the database
    async fn open_table(&self, request: OpenTableRequest) -> Result<Arc<dyn BaseTable>>;
    /// Rename a table in the database
||||
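The default-method pattern used by the derived-compute additions to `Database` can be tried in isolation. This is a minimal sketch under stated assumptions, not the crate's code: `Error`, `Result`, `LocalDb`, and `RemoteDb` are hypothetical stand-ins, the methods are plain `fn` returning `String` job ids rather than `async fn` returning `JobInfo`, and only the body of `not_supported` is taken from the diff.

```rust
/// Hypothetical stand-in for the crate's error type; only the variant used here.
#[derive(Debug, PartialEq)]
pub enum Error {
    NotSupported { message: String },
}

pub type Result<T> = std::result::Result<T, Error>;

/// Same shape as the helper in the diff: build a NotSupported error for `what`.
fn not_supported<T>(what: &str) -> Result<T> {
    Err(Error::NotSupported {
        message: format!("{} is not supported by this database", what),
    })
}

/// Simplified, synchronous stand-in for the `Database` trait.
/// Every method has a default body, so implementors opt in per method.
pub trait Database {
    fn list_jobs(&self) -> Result<Vec<String>> {
        not_supported("list_jobs")
    }
    fn cancel_job(&self, _job_id: &str) -> Result<bool> {
        not_supported("cancel_job")
    }
}

/// Hypothetical local backend: overrides nothing, inherits NotSupported.
pub struct LocalDb;
impl Database for LocalDb {}

/// Hypothetical server-backed backend: overrides both methods.
pub struct RemoteDb {
    pub inflight: Vec<String>,
}

impl Database for RemoteDb {
    fn list_jobs(&self) -> Result<Vec<String>> {
        Ok(self.inflight.clone())
    }
    fn cancel_job(&self, job_id: &str) -> Result<bool> {
        // true only if a matching inflight job exists
        Ok(self.inflight.iter().any(|id| id == job_id))
    }
}
```

With this shape, adding a new method to the trait does not break existing implementors, and callers can match on `Error::NotSupported` to fall back gracefully.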
@@ -18,6 +18,7 @@ use lance_table::io::commit::commit_handler_from_url;
use object_store::local::LocalFileSystem;
use snafu::ResultExt;

use crate::blob::{ensure_blob_storage_version, has_blob_columns};
use crate::connection::ConnectRequest;
use crate::database::ReadConsistency;
use crate::database::namespace::LanceNamespaceDatabase;
@@ -838,13 +839,16 @@ impl ListingDatabase {
            write_params.enable_v2_manifest_paths = enable_v2_manifest_paths;
        }

        // Apply enable_stable_row_ids: table-level override takes precedence over connection config
        if let Some(enable_stable_row_ids) =
            stable_row_ids_override.or(self.new_table_config.enable_stable_row_ids)
        let data_schema = request.data.arrow_schema();
        if let Some(enable_stable_row_ids) = stable_row_ids_override
            .or(self.new_table_config.enable_stable_row_ids)
            .or(has_blob_columns(&data_schema).then_some(true))
        {
            write_params.enable_stable_row_ids = enable_stable_row_ids;
        }

        ensure_blob_storage_version(&data_schema, &mut write_params);

        if matches!(&request.mode, CreateTableMode::Overwrite) {
            write_params.mode = WriteMode::Overwrite;
        }
||||
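The precedence encoded by the `Option::or` chain in the `ListingDatabase` hunk can be isolated as a small pure function. This is a sketch only: `resolve_stable_row_ids` is a hypothetical helper name, and the `has_blob_columns` argument is a plain `bool` standing in for the result of the crate's `has_blob_columns(&schema)` call.

```rust
/// Resolve the effective `enable_stable_row_ids` setting.
///
/// Precedence, first `Some` wins:
///   1. the table-level override,
///   2. the connection-level default,
///   3. `Some(true)` if the data has blob columns.
/// Returns `None` when no source has an opinion, in which case the caller
/// leaves the write parameter untouched.
pub fn resolve_stable_row_ids(
    table_override: Option<bool>,
    connection_default: Option<bool>,
    has_blob_columns: bool,
) -> Option<bool> {
    table_override
        .or(connection_default)
        // `then_some` yields Some(true) only when the flag is true
        .or(has_blob_columns.then_some(true))
}
```

One consequence of this ordering is that an explicit `Some(false)` at the table or connection level still wins over the blob-column fallback; the fallback only applies when both are `None`.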
@@ -4,7 +4,7 @@
//! Namespace-based database implementation that delegates table management to lance-namespace

use std::collections::{HashMap, HashSet};
use std::sync::Arc;
use std::sync::{Arc, Mutex};

use async_trait::async_trait;
use lance::io::commit::namespace_manifest::LanceNamespaceExternalManifestStore;
@@ -16,19 +16,23 @@ use lance_namespace::{
        CreateNamespaceRequest, CreateNamespaceResponse, DeclareTableRequest,
        DescribeNamespaceRequest, DescribeNamespaceResponse, DescribeTableRequest,
        DropNamespaceRequest, DropNamespaceResponse, DropTableRequest, ListNamespacesRequest,
        ListNamespacesResponse, ListTablesRequest, ListTablesResponse,
        ListNamespacesResponse, ListTablesRequest, ListTablesResponse, RenameTableRequest,
    },
};
use lance_namespace_impls::ConnectBuilder;
use lance_table::io::commit::CommitHandler;
use lance_table::io::commit::external_manifest::ExternalManifestCommitHandler;

use crate::blob::{ensure_blob_storage_version, has_blob_columns};
use crate::connection::NamespaceClientPushdownOperation;
use crate::database::ReadConsistency;
use crate::database::listing::{
    NewTableConfig, OPT_NEW_TABLE_ENABLE_STABLE_ROW_IDS, OPT_NEW_TABLE_STORAGE_VERSION,
    OPT_NEW_TABLE_V2_MANIFEST_PATHS,
};
use crate::database::read_freshness::{
    FreshnessBaselines, ReadFreshnessContextProvider, TableFreshness,
};
use crate::error::{Error, Result};
use crate::table::{NativeTable, map_namespace_lance_error};
use lance::dataset::WriteMode;
@@ -51,6 +55,10 @@ fn is_table_already_exists_namespace_error(err: &lance::Error) -> bool {
    false
}

/// Object-id delimiter default (matches `RestNamespaceBuilder`'s); overridable
/// via the `delimiter` property.
const DEFAULT_NAMESPACE_DELIMITER: &str = "$";

/// A database implementation that uses lance-namespace for table management
pub struct LanceNamespaceDatabase {
    namespace: Arc<dyn LanceNamespace>,
@@ -70,6 +78,17 @@ pub struct LanceNamespaceDatabase {
    ns_properties: HashMap<String, String>,
    // Options for tables created by this connection
    new_table_config: NewTableConfig,
    // Per-table read-freshness baselines, shared with the context provider.
    freshness_baselines: FreshnessBaselines,
    // Delimiter for building freshness keys; see `table_freshness`.
    delimiter: String,
}

fn resolve_delimiter(ns_properties: &HashMap<String, String>) -> String {
    ns_properties
        .get("delimiter")
        .cloned()
        .unwrap_or_else(|| DEFAULT_NAMESPACE_DELIMITER.to_string())
}

impl LanceNamespaceDatabase {
@@ -82,6 +101,9 @@ impl LanceNamespaceDatabase {
        session: Option<Arc<lance::session::Session>>,
        namespace_client_pushdown_operations: HashSet<NamespaceClientPushdownOperation>,
    ) -> Self {
        // Client is pre-built, so we can't install the freshness provider here;
        // baselines are still tracked for a uniform bump path.
        let delimiter = resolve_delimiter(&namespace_client_properties);
        Self {
            namespace: namespace_client,
            storage_options,
@@ -92,6 +114,8 @@ impl LanceNamespaceDatabase {
            ns_impl: namespace_client_impl,
            ns_properties: namespace_client_properties,
            new_table_config: NewTableConfig::default(),
            freshness_baselines: Arc::new(Mutex::new(HashMap::new())),
            delimiter,
        }
    }
@@ -136,10 +160,19 @@ impl LanceNamespaceDatabase {
        if let Some(ref sess) = session {
            builder = builder.session(sess.clone());
        }

        // Install the read-freshness provider before building the client.
        let freshness_baselines: FreshnessBaselines = Arc::new(Mutex::new(HashMap::new()));
        builder = builder.context_provider(Arc::new(ReadFreshnessContextProvider::new(
            freshness_baselines.clone(),
            read_consistency_interval,
        )));

        let namespace = builder.connect().await.map_err(|e| Error::InvalidInput {
            message: format!("Failed to connect to namespace: {:?}", e),
        })?;

        let delimiter = resolve_delimiter(&ns_properties);
        Ok(Self {
            namespace,
            storage_options,
@@ -150,9 +183,20 @@ impl LanceNamespaceDatabase {
            ns_impl: ns_impl.to_string(),
            ns_properties,
            new_table_config,
            freshness_baselines,
            delimiter,
        })
    }

    /// Build a table's freshness handle, keyed to match the `object_id` the
    /// namespace client sends on reads (table-id parts joined by the delimiter).
    fn table_freshness(&self, namespace_path: &[String], name: &str) -> TableFreshness {
        let mut parts = namespace_path.to_vec();
        parts.push(name.to_string());
        let key = parts.join(&self.delimiter);
        TableFreshness::new(self.freshness_baselines.clone(), key)
    }

    fn extract_storage_overrides(
        &self,
        request: &DbCreateTableRequest,
||||
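How `resolve_delimiter` and `table_freshness` combine to produce a freshness key can be shown without the surrounding types. The following is a standalone sketch: `freshness_key` is a hypothetical free function that mirrors only the key-building lines of `table_freshness`, and it returns the key string instead of a `TableFreshness` handle.

```rust
use std::collections::HashMap;

/// Default delimiter, as declared in the diff.
const DEFAULT_NAMESPACE_DELIMITER: &str = "$";

/// Use the `delimiter` property when present, otherwise the default.
pub fn resolve_delimiter(ns_properties: &HashMap<String, String>) -> String {
    ns_properties
        .get("delimiter")
        .cloned()
        .unwrap_or_else(|| DEFAULT_NAMESPACE_DELIMITER.to_string())
}

/// Hypothetical helper: join the namespace path and table name with the
/// delimiter, the same way `table_freshness` builds its key.
pub fn freshness_key(namespace_path: &[String], name: &str, delimiter: &str) -> String {
    let mut parts = namespace_path.to_vec();
    parts.push(name.to_string());
    parts.join(delimiter)
}
```

For a table `ref_test` in namespace `test_ns` with no `delimiter` property set, the sketch yields the key `test_ns$ref_test`; a table in the root namespace is keyed by its bare name.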
@@ -214,12 +258,16 @@ impl LanceNamespaceDatabase {
            params.enable_v2_manifest_paths = enable_v2_manifest_paths;
        }

        if let Some(enable_stable_row_ids) =
            stable_row_ids_override.or(self.new_table_config.enable_stable_row_ids)
        let data_schema = request.data.schema();
        if let Some(enable_stable_row_ids) = stable_row_ids_override
            .or(self.new_table_config.enable_stable_row_ids)
            .or(has_blob_columns(data_schema.as_ref()).then_some(true))
        {
            params.enable_stable_row_ids = enable_stable_row_ids;
        }

        ensure_blob_storage_version(data_schema.as_ref(), params);

        Ok(())
    }
}
@@ -331,7 +379,8 @@ impl Database for LanceNamespaceDatabase {
                self.pushdown_operations.clone(),
                self.session.clone(),
            )
            .await?;
            .await?
            .with_freshness(self.table_freshness(&request.namespace_path, &request.name));

            return Ok(Arc::new(native_table));
        }
@@ -437,8 +486,11 @@ impl Database for LanceNamespaceDatabase {

        // Set up commit handler when managed_versioning is enabled
        if managed_versioning == Some(true) {
            let external_store =
                LanceNamespaceExternalManifestStore::new(self.namespace.clone(), table_id.clone());
            let external_store = LanceNamespaceExternalManifestStore::for_table_uri(
                self.namespace.clone(),
                table_id.clone(),
                &location,
            )?;
            let commit_handler: Arc<dyn CommitHandler> = Arc::new(ExternalManifestCommitHandler {
                external_manifest_store: Arc::new(external_store),
            });
@@ -459,7 +511,8 @@ impl Database for LanceNamespaceDatabase {
            self.pushdown_operations.clone(),
            self.session.clone(),
        )
        .await?;
        .await?
        .with_freshness(self.table_freshness(&request.namespace_path, &request.name));

        Ok(Arc::new(native_table))
    }
@@ -475,7 +528,8 @@ impl Database for LanceNamespaceDatabase {
            self.pushdown_operations.clone(),
            self.session.clone(),
        )
        .await?;
        .await?
        .with_freshness(self.table_freshness(&request.namespace_path, &request.name));

        Ok(Arc::new(native_table))
    }
@@ -488,14 +542,34 @@ impl Database for LanceNamespaceDatabase {

    async fn rename_table(
        &self,
        _cur_name: &str,
        _new_name: &str,
        _cur_namespace_path: &[String],
        _new_namespace_path: &[String],
        cur_name: &str,
        new_name: &str,
        cur_namespace_path: &[String],
        new_namespace_path: &[String],
    ) -> Result<()> {
        Err(Error::NotSupported {
            message: "rename_table is not supported for namespace connections".to_string(),
        })
        let mut cur_table_id = cur_namespace_path.to_vec();
        cur_table_id.push(cur_name.to_string());

        let new_namespace_id = if new_namespace_path.is_empty() {
            None
        } else {
            Some(new_namespace_path.to_vec())
        };

        let rename_request = RenameTableRequest {
            id: Some(cur_table_id),
            new_table_name: new_name.to_string(),
            new_namespace_id,
            ..Default::default()
        };
        self.namespace
            .rename_table(rename_request)
            .await
            .map_err(|e| Error::Runtime {
                message: format!("Failed to rename table: {}", e),
            })?;

        Ok(())
    }

    async fn drop_table(&self, name: &str, namespace_path: &[String]) -> Result<()> {
||||
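The request construction in the new `rename_table` body can be separated from the network call for inspection. This is a sketch under assumptions: the `RenameTableRequest` below is a hypothetical stand-in reduced to the three fields the diff sets (the real `lance_namespace` type has more, hence the `..Default::default()` in the diff), and `build_rename_request` is an illustrative helper name.

```rust
/// Hypothetical stand-in for the namespace crate's `RenameTableRequest`,
/// limited to the fields set in the diff.
#[derive(Debug, Default, PartialEq)]
pub struct RenameTableRequest {
    pub id: Option<Vec<String>>,
    pub new_table_name: String,
    pub new_namespace_id: Option<Vec<String>>,
}

/// Build the rename request the way the diff does:
/// - `id` is the current namespace path with the current table name appended;
/// - `new_namespace_id` is `None` when the target namespace path is empty,
///   otherwise the path itself.
pub fn build_rename_request(
    cur_name: &str,
    new_name: &str,
    cur_namespace_path: &[String],
    new_namespace_path: &[String],
) -> RenameTableRequest {
    let mut cur_table_id = cur_namespace_path.to_vec();
    cur_table_id.push(cur_name.to_string());

    let new_namespace_id = if new_namespace_path.is_empty() {
        None
    } else {
        Some(new_namespace_path.to_vec())
    };

    RenameTableRequest {
        id: Some(cur_table_id),
        new_table_name: new_name.to_string(),
        new_namespace_id,
    }
}
```

How the server interprets a `None` target namespace is not shown in this diff, so the sketch makes no claim about whether it means "root namespace" or "keep the current namespace".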
@@ -740,6 +814,64 @@ mod tests {
        assert!(table_names.contains(&"test_table".to_string()));
    }

    #[tokio::test]
    async fn test_namespace_branch_query_under_pushdown_stays_local() {
        // With QueryTable pushdown enabled, a query on the main branch routes to
        // the namespace server, but a branch handle must run locally: the
        // server-side request carries no branch and would return main's rows.
        let tmp_dir = tempdir().unwrap();
        let root_path = tmp_dir.path().to_str().unwrap().to_string();

        let mut properties = HashMap::new();
        properties.insert("root".to_string(), root_path);

        let conn = connect_namespace("dir", properties)
            .pushdown_operation(NamespaceClientPushdownOperation::QueryTable)
            .execute()
            .await
            .expect("Failed to connect to namespace");

        conn.create_namespace(CreateNamespaceRequest {
            id: Some(vec!["test_ns".into()]),
            ..Default::default()
        })
        .await
        .expect("Failed to create namespace");

        // main has 5 rows
        let table = conn
            .create_table("ref_test", create_test_data())
            .namespace(vec!["test_ns".into()])
            .execute()
            .await
            .expect("Failed to create table");
        let main_version = table.version().await.unwrap();

        // fork a branch off main, then add 5 more rows so it differs from main
        let branch = table
            .create_branch("exp", main_version)
            .await
            .expect("Failed to create branch");
        branch
            .add(create_test_data())
            .execute()
            .await
            .expect("Failed to append to branch");

        // the branch query must run locally and see the branch's 10 rows --
        // not get routed to the server (which carries no branch) and see main's 5
        let results = branch
            .query()
            .execute()
            .await
            .expect("Failed to query branch")
            .try_collect::<Vec<_>>()
            .await
            .expect("Failed to collect results");
        let count: usize = results.iter().map(|b| b.num_rows()).sum();
        assert_eq!(count, 10);
    }

    #[tokio::test]
    async fn test_namespace_describe_table() {
        // Setup: Create a temporary directory for the namespace
Some files were not shown because too many files have changed in this diff.