mirror of
https://github.com/lancedb/lancedb.git
synced 2026-06-30 17:40:40 +00:00
Compare commits
27 Commits
main
...
jack/sopho
| Author | SHA1 | Date | |
|---|---|---|---|
| | bff911a65d | | |
| | 3a4cdb7aff | | |
| | 142ac835d3 | | |
| | 3f44f93e92 | | |
| | 9dfa43a9de | | |
| | 03e895fa5c | | |
| | c31e53088e | | |
| | 434a5be187 | | |
| | 78aa005093 | | |
| | 6191542cfe | | |
| | 6af3088b91 | | |
| | e73d4618d8 | | |
| | 3d92106394 | | |
| | 5810974b37 | | |
| | 8b38500b07 | | |
| | fd0a3b97d0 | | |
| | b9f33ba1c9 | | |
| | d4f4fef3ba | | |
| | fbe6a5a3fd | | |
| | 127054069a | | |
| | b20931b8f7 | | |
| | 396d68e490 | | |
| | ad37f87387 | | |
| | e93476f0e0 | | |
| | 2b41fce033 | | |
| | 04948fc4f6 | | |
| | ff3c7111b9 | | |
@@ -1,137 +0,0 @@
---
name: lancedb-branch-ops
description: Branch management for LanceDB tables via the REST API. Use this skill whenever someone wants to create, delete, list, or switch branches on a LanceDB table — or needs to make sure a write (metadata update, index build, etc.) lands on a specific branch instead of main. Invoke it even without the word "branch" if context makes clear they want an experimental copy of a table, want to isolate changes, or want to confirm a mutation didn't touch main. Covers: branches/list, branches/create, branches/delete, and passing "branch" in describe/update_field_metadata/create_index to target a non-main version.
---

## Goal

Manage branches on a LanceDB table: list what exists, create new ones, delete stale ones, and direct read/write operations at a specific branch without touching main.

## Step 0: Establish the connection

Use the `lancedb-connect` skill to resolve the base URL and auth headers (`x-api-key`, `x-lancedb-database`). Skip this only if the connection is already known from the current conversation.

All examples below use `{base_url}` — substitute the resolved endpoint and include the auth headers on every request.

## The branch model (important)

LanceDB branches are named snapshots that diverge from the table's current state at creation time. There is **no checkout command** — you never switch the whole table to a branch. Instead, you **pass `"branch": "<name>"` in the request body** of any operation to target that branch. Omitting the key (or sending an empty body) always targets main.

`branches/list` returns only non-main branches. Main always exists and is not listed.
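
The "no checkout, pass `branch` in the body" rule can be sketched as a small body builder. This is an illustrative sketch only: `request_body` is a hypothetical helper, not part of any LanceDB client, and it only assembles JSON without sending anything.

```python
import json
from typing import Any, Dict, Optional


def request_body(fields: Optional[Dict[str, Any]] = None,
                 branch: Optional[str] = None) -> str:
    """Build a JSON request body (hypothetical helper for illustration).

    The "branch" key is added only when a branch name is given; leaving it
    out is what targets main, per the branch model described above.
    """
    payload = dict(fields or {})
    if branch:  # None or "" means main, so the key is omitted entirely
        payload["branch"] = branch
    return json.dumps(payload)


# Body that targets main: no "branch" key at all.
print(request_body())
# Body that targets a branch for an index build.
print(request_body({"column": "category", "index_type": "BTREE"}, branch="wip-branch"))
```

Keeping the key out of the payload, rather than sending `"branch": null`, mirrors the text's statement that omitting the key targets main; how the server treats a null value is not stated in the source.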

## List branches

```http
POST {base_url}/v1/table/{table_id}/branches/list
Content-Type: application/json

{}
```

Response:
```json
{
  "branches": {
    "experiment-reindex": {"parentVersion": 1, "createAt": 1782506085, "manifestSize": 1029}
  }
}
```

If `branches` is `{}`, the table has no branches besides main.
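
A minimal sketch of reading that response, assuming it has already been parsed into a Python dict with the shape shown above. `branch_names` is a hypothetical helper name, and treating a missing `branches` key as "no branches" is an assumption.

```python
from typing import Any, Dict, List


def branch_names(response: Dict[str, Any]) -> List[str]:
    """Return the non-main branch names from a branches/list response, sorted.

    Main is never listed by the endpoint, so an empty result means
    "main only", not "no table".
    """
    return sorted(response.get("branches", {}))


sample = {
    "branches": {
        "experiment-reindex": {"parentVersion": 1, "createAt": 1782506085, "manifestSize": 1029}
    }
}

print(branch_names(sample))
print(branch_names({"branches": {}}))
```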

## Create a branch

```http
POST {base_url}/v1/table/{table_id}/branches/create
Content-Type: application/json

{"name": "experiment-reindex"}
```

HTTP 200 with `{}` body = success. The branch is created off the table's current state on main.

Verify by calling `branches/list` and confirming the new name appears.

## Delete a branch

```http
POST {base_url}/v1/table/{table_id}/branches/delete
Content-Type: application/json

{"name": "stale-2024"}
```

HTTP 200 with `{}` body = success. Only the branch pointer is removed — main and all row data remain intact.

Verify by calling `branches/list` (name gone) and `describe` with no branch param (main still responds).

## Operate on a specific branch

Pass `"branch": "<name>"` in the body of any operation to scope it to that branch:

**Read schema on a branch:**
```http
POST {base_url}/v1/table/{table_id}/describe
Content-Type: application/json

{"branch": "wip-branch"}
```

**Write metadata to a branch (not main):**
```http
POST {base_url}/v1/table/{table_id}/update_field_metadata
Content-Type: application/json

{
  "branch": "wip-branch",
  "updates": [
    {
      "path": "category",
      "metadata": {"lancedb:description": "Product category label."},
      "replace": false
    }
  ]
}
```

**Build an index on a branch:**
```http
POST {base_url}/v1/table/{table_id}/create_index
Content-Type: application/json

{
  "branch": "wip-branch",
  "column": "category",
  "index_type": "BTREE"
}
```

## Verifying isolation

After writing to a branch, always confirm the change did NOT land on main:

```bash
# Should show the new metadata
curl -s -X POST {base_url}/v1/table/{table_id}/describe \
  -H "x-api-key: <key>" -H "x-lancedb-database: <db>" \
  -H "content-type: application/json" \
  -d '{"branch": "wip-branch"}'

# Should NOT show the new metadata
curl -s -X POST {base_url}/v1/table/{table_id}/describe \
  -H "x-api-key: <key>" -H "x-lancedb-database: <db>" \
  -H "content-type: application/json" \
  -d '{}'
```
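
The same pair of `describe` calls can be prepared from Python with the standard library. The sketch below only builds the two requests and does not send them; `describe_request`, the `example.invalid` host, and the table name are placeholders introduced for illustration, and the response format is left untouched because the source does not show it.

```python
import json
import urllib.request
from typing import Optional


def describe_request(base_url: str, table_id: str, api_key: str, database: str,
                     branch: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a describe request for main or a branch."""
    # Omitting "branch" targets main; including it targets that branch.
    body = {"branch": branch} if branch else {}
    return urllib.request.Request(
        url=f"{base_url}/v1/table/{table_id}/describe",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "x-lancedb-database": database,
            "content-type": "application/json",
        },
        method="POST",
    )


# Placeholder connection values; substitute the resolved endpoint and credentials.
on_branch = describe_request("https://example.invalid", "my_table", "<key>", "<db>",
                             branch="wip-branch")
on_main = describe_request("https://example.invalid", "my_table", "<key>", "<db>")

print(on_branch.data)  # body scoped to the branch
print(on_main.data)    # empty body, which targets main
```

Passing each request to `urllib.request.urlopen` would perform the call; comparing the two responses is then the isolation check described above.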

## Quick reference

| Goal | Endpoint | Body |
|------|----------|------|
| List all branches | `branches/list` | `{}` |
| Create a branch | `branches/create` | `{"name": "..."}` |
| Delete a branch | `branches/delete` | `{"name": "..."}` |
| Read schema on branch | `describe` | `{"branch": "..."}` |
| Write metadata on branch | `update_field_metadata` | `{"branch": "...", "updates": [...]}` |
| Build index on branch | `create_index` | `{"branch": "...", "column": ..., "index_type": ...}` |
| Target main (default) | any endpoint | omit `"branch"` key |
74
Cargo.lock
generated
@@ -1750,7 +1750,7 @@ version = "3.1.1"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "faf9468729b8cbcea668e36183cb69d317348c2e08e994829fb56ebfdfbaac34"
|
||||
dependencies = [
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -3014,7 +3014,7 @@ dependencies = [
|
||||
"libc",
|
||||
"option-ext",
|
||||
"redox_users",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -3231,7 +3231,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb"
|
||||
dependencies = [
|
||||
"libc",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -3424,7 +3424,7 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
|
||||
[[package]]
|
||||
name = "fsst"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"rand 0.9.4",
|
||||
@@ -4466,7 +4466,7 @@ checksum = "3640c1c38b8e4e43584d8df18be5fc6b0aa314ce6ebf51b53313d4306cca8e46"
|
||||
dependencies = [
|
||||
"hermit-abi",
|
||||
"libc",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -4560,7 +4560,7 @@ dependencies = [
|
||||
"portable-atomic-util",
|
||||
"serde_core",
|
||||
"wasm-bindgen",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -4727,7 +4727,7 @@ checksum = "e037a2e1d8d5fdbd49b16a4ea09d5d6401c1f29eca5ff29d03d3824dba16256a"
|
||||
[[package]]
|
||||
name = "lance"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arc-swap",
|
||||
"arrow",
|
||||
@@ -4802,7 +4802,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-arrow"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -4823,7 +4823,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-arrow-scalar"
|
||||
version = "58.0.0"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -4837,7 +4837,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-arrow-stats"
|
||||
version = "58.0.0"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-schema",
|
||||
@@ -4847,7 +4847,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-bitpacking"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrayref",
|
||||
"crunchy",
|
||||
@@ -4858,7 +4858,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-core"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -4897,7 +4897,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-datafusion"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-array",
|
||||
@@ -4928,7 +4928,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-datagen"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-array",
|
||||
@@ -4946,7 +4946,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-derive"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
@@ -4956,7 +4956,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-encoding"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-arith",
|
||||
"arrow-array",
|
||||
@@ -4992,7 +4992,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-file"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-arith",
|
||||
"arrow-array",
|
||||
@@ -5023,7 +5023,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-index"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arc-swap",
|
||||
"arrow",
|
||||
@@ -5089,7 +5089,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-io"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-arith",
|
||||
@@ -5131,7 +5131,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-linalg"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -5148,7 +5148,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-namespace"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"async-trait",
|
||||
@@ -5161,7 +5161,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-namespace-impls"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-ipc",
|
||||
@@ -5216,7 +5216,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-select"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-buffer",
|
||||
@@ -5232,7 +5232,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-table"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow",
|
||||
"arrow-array",
|
||||
@@ -5272,7 +5272,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-testing"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"arrow-array",
|
||||
"arrow-schema",
|
||||
@@ -5286,7 +5286,7 @@ dependencies = [
|
||||
[[package]]
|
||||
name = "lance-tokenizer"
|
||||
version = "9.0.0-beta.10"
|
||||
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
|
||||
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
|
||||
dependencies = [
|
||||
"icu_segmenter",
|
||||
"jieba-rs",
|
||||
@@ -6085,7 +6085,7 @@ version = "0.50.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "7957b9740744892f114936ab4a57b3f487491bbeafaf8083688b16841a4240e5"
|
||||
dependencies = [
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -7400,8 +7400,8 @@ version = "0.14.3"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "343d3bd7056eda839b03204e68deff7d1b13aba7af2b2fd16890697274262ee7"
|
||||
dependencies = [
|
||||
"heck 0.5.0",
|
||||
"itertools 0.14.0",
|
||||
"heck 0.4.1",
|
||||
"itertools 0.11.0",
|
||||
"log",
|
||||
"multimap",
|
||||
"petgraph",
|
||||
@@ -7420,7 +7420,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "27c6023962132f4b30eb4c172c91ce92d933da334c59c23cddee82358ddafb0b"
|
||||
dependencies = [
|
||||
"anyhow",
|
||||
"itertools 0.14.0",
|
||||
"itertools 0.11.0",
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
"syn 2.0.117",
|
||||
@@ -7654,7 +7654,7 @@ dependencies = [
|
||||
"once_cell",
|
||||
"socket2 0.6.3",
|
||||
"tracing",
|
||||
"windows-sys 0.60.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -8394,7 +8394,7 @@ dependencies = [
|
||||
"errno",
|
||||
"libc",
|
||||
"linux-raw-sys",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -8465,7 +8465,7 @@ dependencies = [
|
||||
"security-framework",
|
||||
"security-framework-sys",
|
||||
"webpki-root-certs",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -9027,7 +9027,7 @@ version = "0.8.9"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "c1c97747dbf44bb1ca44a561ece23508e99cb592e862f22222dcf42f51d1e451"
|
||||
dependencies = [
|
||||
"heck 0.5.0",
|
||||
"heck 0.4.1",
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
"syn 2.0.117",
|
||||
@@ -9039,7 +9039,7 @@ version = "0.9.0"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "54254b8531cafa275c5e096f62d48c81435d1015405a91198ddb11e967301d40"
|
||||
dependencies = [
|
||||
"heck 0.5.0",
|
||||
"heck 0.4.1",
|
||||
"proc-macro2",
|
||||
"quote",
|
||||
"syn 2.0.117",
|
||||
@@ -9472,7 +9472,7 @@ dependencies = [
|
||||
"getrandom 0.4.2",
|
||||
"once_cell",
|
||||
"rustix",
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -10407,7 +10407,7 @@ version = "0.1.11"
|
||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||
checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22"
|
||||
dependencies = [
|
||||
"windows-sys 0.61.2",
|
||||
"windows-sys 0.59.0",
|
||||
]
|
||||
|
||||
[[package]]
|
||||
|
||||
28
Cargo.toml
@@ -13,20 +13,20 @@ categories = ["database-implementations"]
|
||||
rust-version = "1.91.0"
|
||||
|
||||
[workspace.dependencies]
|
||||
lance = { "version" = "=9.0.0-beta.10", default-features = false, "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-core = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-datagen = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-file = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-io = { "version" = "=9.0.0-beta.10", default-features = false, "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-index = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-linalg = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-namespace = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-namespace-impls = { "version" = "=9.0.0-beta.10", default-features = false, "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-table = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-testing = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-datafusion = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-encoding = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance-arrow = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
|
||||
lance = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-core = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-datagen = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-file = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-io = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-index = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-linalg = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-namespace = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-namespace-impls = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-table = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-testing = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-datafusion = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-encoding = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
lance-arrow = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
|
||||
ahash = "0.8"
|
||||
# Note that this one does not include pyarrow
|
||||
arrow = { version = "58.0.0", optional = false }
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
|
||||
use std::time::Duration;
|
||||
|
||||
use lancedb::{ipc::ipc_file_to_batches, table::merge::MergeInsertBuilder};
|
||||
use lancedb::{arrow::IntoArrow, ipc::ipc_file_to_batches, table::merge::MergeInsertBuilder};
|
||||
use napi::bindgen_prelude::*;
|
||||
use napi_derive::napi;
|
||||
|
||||
@@ -66,9 +66,11 @@ impl NativeMergeInsertBuilder {
|
||||
|
||||
#[napi(catch_unwind)]
|
||||
pub async fn execute(&self, buf: Buffer) -> napi::Result<MergeResult> {
|
||||
let data = ipc_file_to_batches(buf.to_vec()).map_err(|e| {
|
||||
napi::Error::from_reason(format!("Failed to read IPC file: {}", convert_error(&e)))
|
||||
})?;
|
||||
let data = ipc_file_to_batches(buf.to_vec())
|
||||
.and_then(IntoArrow::into_arrow)
|
||||
.map_err(|e| {
|
||||
napi::Error::from_reason(format!("Failed to read IPC file: {}", convert_error(&e)))
|
||||
})?;
|
||||
|
||||
let this = self.clone();
|
||||
|
||||
|
||||
@@ -17,6 +17,17 @@ from .db import AsyncConnection, DBConnection, LanceDBConnection
|
||||
from .remote import ClientConfig
|
||||
from .remote.db import RemoteDBConnection
|
||||
from .expr import Expr, col, lit, func
|
||||
from .udf import (
|
||||
udf,
|
||||
table_udf,
|
||||
Udf,
|
||||
JobHandle,
|
||||
JobFailedError,
|
||||
MaterializedView,
|
||||
AsyncJobHandle,
|
||||
AsyncMaterializedView,
|
||||
)
|
||||
from .lineage import Lineage, Node, Edge, FunctionRef
|
||||
from .schema import vector
|
||||
from .table import AsyncTable, Table
|
||||
from ._lancedb import Session
|
||||
@@ -448,6 +459,18 @@ async def connect_async(
|
||||
|
||||
|
||||
__all__ = [
|
||||
"udf",
|
||||
"table_udf",
|
||||
"Udf",
|
||||
"JobHandle",
|
||||
"JobFailedError",
|
||||
"MaterializedView",
|
||||
"AsyncJobHandle",
|
||||
"AsyncMaterializedView",
|
||||
"Lineage",
|
||||
"Node",
|
||||
"Edge",
|
||||
"FunctionRef",
|
||||
"connect",
|
||||
"connect_async",
|
||||
"connect_namespace",
|
||||
|
||||
@@ -65,6 +65,7 @@ if TYPE_CHECKING:
|
||||
from .common import DATA, URI
|
||||
from .embeddings import EmbeddingFunctionConfig
|
||||
from ._lancedb import Session
|
||||
from .udf import MaterializedView, AsyncMaterializedView
|
||||
|
||||
from .namespace_utils import (
|
||||
_normalize_create_namespace_mode,
|
||||
@@ -562,6 +563,259 @@ class DBConnection(EnforceOverrides):
|
||||
"""
|
||||
raise NotImplementedError("serialize is not supported for this connection type")
|
||||
|
||||
# -- Derived compute: functions, materialized views, jobs -------------
|
||||
# Server-backed features (LanceDB Enterprise / Cloud); local
|
||||
# connections raise NotImplementedError for now.
|
||||
|
||||
def create_function(
|
||||
self,
|
||||
name,
|
||||
language: str = "python",
|
||||
return_type: Optional[str] = None,
|
||||
body: Optional[str] = None,
|
||||
options: Optional[Dict[str, str]] = None,
|
||||
*,
|
||||
replace: bool = False,
|
||||
):
|
||||
"""Register a UDF (CREATE FUNCTION).
|
||||
|
||||
Pass a ``@udf`` / ``@table_udf``-decorated function (preferred):
|
||||
|
||||
db.create_function(embed)
|
||||
|
||||
or the explicit fields:
|
||||
|
||||
Parameters
|
||||
----------
|
||||
name: str or Udf
|
||||
A decorated UDF object, or the function name.
|
||||
language: str
|
||||
Implementation language (currently "python").
|
||||
return_type: str
|
||||
SQL return type, e.g. "FLOAT", "FLOAT[1536]",
|
||||
"STRUCT(a FLOAT, b VARCHAR)", "TABLE(chunk VARCHAR, idx INT)".
|
||||
body: str
|
||||
Function body: source text, or base64 cloudpickle bytes when
|
||||
options["body_format"] == "cloudpickle".
|
||||
options: dict, optional
|
||||
input_columns, pip, num_gpus, batch_size, timeout,
|
||||
error_policy, docker_image, body_format, ...
|
||||
replace: bool
|
||||
Drop an existing function of the same name first.
|
||||
"""
|
||||
from .udf import Udf
|
||||
|
||||
if isinstance(name, Udf):
|
||||
req = name.create_request()
|
||||
name, language, return_type, body, options = (
|
||||
req["name"],
|
||||
req["language"],
|
||||
req["return_type"],
|
||||
req["body"],
|
||||
req["options"],
|
||||
)
|
||||
if replace:
|
||||
try:
|
||||
self.drop_function(name)
|
||||
except Exception:
|
||||
pass
|
||||
LOOP.run(self._conn.create_function(name, language, return_type, body, options))
|
||||
|
||||
def list_functions(self):
|
||||
"""List registered functions (SHOW FUNCTIONS)."""
|
||||
return LOOP.run(self._conn.list_functions())
|
||||
|
||||
def drop_function(self, name: str):
|
||||
"""Drop a registered function (DROP FUNCTION)."""
|
||||
LOOP.run(self._conn.drop_function(name))
|
||||
|
||||
def create_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
source=None,
|
||||
select=None,
|
||||
*,
|
||||
query: Optional[str] = None,
|
||||
where: Optional[str] = None,
|
||||
auto_refresh: bool = False,
|
||||
with_no_data: bool = False,
|
||||
replace: bool = False,
|
||||
partition_by: Optional[str] = None,
|
||||
) -> "MaterializedView":
|
||||
"""Create a materialized view (CREATE MATERIALIZED VIEW); returns a
|
||||
`MaterializedView` handle (``.wait()`` blocks until it is populated).
|
||||
|
||||
Two ways to specify the view body:
|
||||
|
||||
- ergonomic: pass ``source`` (a table name or table) and ``select``
|
||||
items -- column names, expression strings ("embed(body)"),
|
||||
(alias, expression) tuples, or ``@udf`` / ``@table_udf`` objects.
|
||||
The SELECT is assembled and parsed server-side (one parser, shared
|
||||
with SQL).
|
||||
- raw: pass ``query=`` with a full SELECT, e.g.
|
||||
"SELECT id, embed(body) AS vec FROM articles WHERE id > 1".
|
||||
|
||||
`partition_by` partitions the view's (single) table function on a source
|
||||
column. If that column has an IVF vector index the server partitions by
|
||||
its index clusters (image-dedup style); otherwise it groups by distinct
|
||||
value. (Geneva's `partition_by` and `partition_by_indexed_column` unify
|
||||
here -- the engine picks the strategy from the column.)
|
||||
"""
|
||||
from .udf import build_view_query, MaterializedView
|
||||
|
||||
if query is None:
|
||||
if source is None or select is None:
|
||||
raise ValueError(
|
||||
"create_materialized_view needs either query= or both "
|
||||
"source and select"
|
||||
)
|
||||
query = build_view_query(source, select)
|
||||
if where:
|
||||
query += f" WHERE {where}"
|
||||
if replace:
|
||||
self._drop_view_if_exists(name)
|
||||
job_id = LOOP.run(
|
||||
self._conn.create_materialized_view(
|
||||
name,
|
||||
query=query,
|
||||
auto_refresh=auto_refresh,
|
||||
with_no_data=with_no_data,
|
||||
partition_by=partition_by,
|
||||
)
|
||||
)
|
||||
return MaterializedView(self, name, job_id=job_id)
|
||||
|
||||
def _drop_view_if_exists(self, name: str) -> None:
|
||||
# `replace=True` is "drop if present"; only a not-found error is
|
||||
# benign here. Anything else (perms, server fault) must surface rather
|
||||
# than be masked by a later create failure.
|
||||
try:
|
||||
self.drop_materialized_view(name)
|
||||
except Exception as e:
|
||||
msg = str(e).lower()
|
||||
if "not found" not in msg and "does not exist" not in msg:
|
||||
raise
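
The "drop if present" rule in `_drop_view_if_exists` can be exercised without a server. The sketch below is a standalone rendition under stated assumptions: `drop_if_exists`, the boolean return value, and the two fake drop functions are hypothetical and exist only to show which errors are swallowed and which propagate.

```python
from typing import Callable


def drop_if_exists(drop: Callable[[str], None], name: str) -> bool:
    """Call drop(name); return False if the target was already absent.

    Only errors whose message says "not found" or "does not exist" are
    treated as benign. Anything else is re-raised unchanged.
    """
    try:
        drop(name)
    except Exception as e:
        msg = str(e).lower()
        if "not found" not in msg and "does not exist" not in msg:
            raise
        return False
    return True


def missing(name: str) -> None:
    # Stand-in for a server reporting that the view is absent.
    raise RuntimeError(f"View {name} does not exist")


def forbidden(name: str) -> None:
    # Stand-in for a failure that must not be masked.
    raise PermissionError("permission denied")


print(drop_if_exists(missing, "my_view"))        # benign: already absent
print(drop_if_exists(lambda n: None, "my_view")) # dropped normally
try:
    drop_if_exists(forbidden, "my_view")
except PermissionError as err:
    print("surfaced:", err)
```

Matching on message text is fragile if the server changes its wording; a typed not-found error would be sturdier, but the source does not show one, so the string check is kept here.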
|
||||
|
||||
def job(self, job_id: str):
|
||||
"""A `JobHandle` for reconnecting to an inflight job by id -- e.g. an
|
||||
id you stored, or one returned from the SQL / REST surface. Submit
|
||||
methods (`refresh_column`, `MaterializedView.refresh`) already return a
|
||||
handle directly, so you do not need this to wait on a fresh submission."""
|
||||
from .udf import JobHandle
|
||||
|
||||
return JobHandle(self, job_id)
|
||||
|
||||
def lineage(
|
||||
self,
|
||||
table: str,
|
||||
column: Optional[str] = None,
|
||||
*,
|
||||
direction: Optional[str] = None,
|
||||
depth: Optional[int] = None,
|
||||
):
|
||||
"""Derived-compute lineage of a table/view, or one of its columns:
|
||||
upstream sources, downstream dependents, and the function version +
|
||||
location that produced each derived column (with a drift flag). Returns
|
||||
a `Lineage`. `direction` is "upstream" | "downstream" | "both" (server
|
||||
default both); `depth` limits column-hops (transitive when omitted)."""
|
||||
# `self._conn` is the AsyncConnection; drive its async `lineage`
|
||||
# (which parses the JSON) on the loop, mirroring create_materialized_view.
|
||||
return LOOP.run(
|
||||
self._conn.lineage(table, column, direction=direction, depth=depth)
|
||||
)
|
||||
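The sync methods in this hunk all follow one pattern: call the matching coroutine on the async connection and block on it through `LOOP.run`. The sketch below shows that pattern under stated assumptions: `BackgroundLoop`, `AsyncConn`, and `SyncConn` are hypothetical names, and the real `LOOP` object may be implemented differently.

```python
import asyncio
import threading


class BackgroundLoop:
    """Runs an asyncio event loop on a daemon thread (hypothetical helper)."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def run(self, coro):
        # Schedule the coroutine on the background loop and block for its result.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()


LOOP = BackgroundLoop()


class AsyncConn:
    """Stand-in for the async connection; returns a dict instead of a Lineage."""

    async def lineage(self, table, column=None, *, direction=None, depth=None):
        await asyncio.sleep(0)  # stands in for the network round trip
        return {"target": table if column is None else f"{table}.{column}"}


class SyncConn:
    """Sync facade: every method forwards to the async one through LOOP.run."""

    def __init__(self):
        self._conn = AsyncConn()

    def lineage(self, table, column=None, *, direction=None, depth=None):
        return LOOP.run(
            self._conn.lineage(table, column, direction=direction, depth=depth)
        )


print(SyncConn().lineage("docs", "vec"))
```

Keeping the loop on its own thread lets the sync facade be called from code that has no running event loop of its own.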
|
||||
def _refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
) -> str:
|
||||
"""Internal: submit a materialized-view refresh, return the job id.
|
||||
The public surface is ``MaterializedView.refresh()`` (which returns a
|
||||
`JobHandle`); this stays private so refresh is only reached through the
|
||||
handle.
|
||||
|
||||
``full=True`` forces a full rebuild (recompute and replace every row)
|
||||
instead of the default incremental refresh.
|
||||
"""
|
||||
return LOOP.run(
|
||||
self._conn._refresh_materialized_view(
|
||||
name,
|
||||
full=full,
|
||||
src_version=src_version,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
)
|
||||
)
|
||||
|
||||
def explain_refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
):
|
||||
"""Plan a refresh without running it (EXPLAIN REFRESH). Returns a
|
||||
plan with .has_work / .source_version / .last_refreshed_version /
|
||||
.full_refresh / .rebuild / .units_total. `full=True` plans a full
|
||||
rebuild (incremental planning needs stable row IDs on the source)."""
|
||||
return LOOP.run(
|
||||
self._conn.explain_refresh_materialized_view(
|
||||
name, full=full, src_version=src_version
|
||||
)
|
||||
)
|
||||
|
||||
def alter_materialized_view(self, name: str, *, auto_refresh: bool):
|
||||
"""Update a materialized view's options (ALTER MATERIALIZED VIEW)."""
|
||||
LOOP.run(self._conn.alter_materialized_view(name, auto_refresh=auto_refresh))
|
||||
|
||||
def drop_materialized_view(self, name: str):
|
||||
"""Drop a materialized view definition (DROP MATERIALIZED VIEW)."""
|
||||
LOOP.run(self._conn.drop_materialized_view(name))
|
||||
|
||||
def list_materialized_views(self):
|
||||
"""List registered materialized view definitions."""
|
||||
return LOOP.run(self._conn.list_materialized_views())
|
||||
|
||||
def list_jobs(self):
|
||||
"""List inflight server-side jobs across the database's tables."""
|
||||
return LOOP.run(self._conn.list_jobs())
|
||||
|
||||
def get_job(self, job_id: str, table: "str | None" = None):
|
||||
"""Look up one server-side job by id (the wait()/status poll path).
|
||||
|
||||
Passing ``table`` (the job's table) lets the server answer with an O(1)
|
||||
single-node read instead of scanning the database's active jobs.
|
||||
Returns the job's status, or None if it's unknown or no longer active.
|
||||
"""
|
||||
return LOOP.run(self._conn.get_job(job_id, table))
|
||||
|
||||
def cancel_job(self, job_id: str) -> bool:
|
||||
"""Cancel an inflight server-side job by id (CANCEL JOB).
|
||||
|
||||
Returns True if a matching inflight job was found and flagged for
|
||||
cancellation, False if none was inflight (already finished or
|
||||
unknown id) -- cancellation is best-effort.
|
||||
"""
|
||||
return LOOP.run(self._conn.cancel_job(job_id))
|
||||
|
||||
def job_history(self, job_id: "str | None" = None):
|
||||
"""Durable history of completed server-side jobs (SHOW JOB HISTORY).
|
||||
|
||||
Pass ``job_id`` to narrow to a single job. Unlike :meth:`list_jobs`
|
||||
(live, inflight) these are the terminal records.
|
||||
"""
|
||||
return LOOP.run(self._conn.job_history(job_id))
|
||||
|
||||
def errors(self, job_id: "str | None" = None, table: "str | None" = None):
|
||||
"""Per-row UDF errors recorded by ``error_policy=skip`` (SHOW ERRORS),
|
||||
optionally filtered by ``job_id`` and/or ``table``.
|
||||
"""
|
||||
return LOOP.run(self._conn.errors(job_id, table))
|
||||
|
||||
|
||||
class LanceDBConnection(DBConnection):
|
||||
"""
|
||||
@@ -1787,6 +2041,200 @@ class AsyncConnection(object):
|
||||
)
|
||||
return AsyncTable(table)
|
||||
|
||||
# -- Derived compute: functions, materialized views, jobs -------------
|
||||
# Server-backed features (LanceDB Enterprise / Cloud); local
|
||||
# connections raise NotImplementedError for now.
|
||||
|
||||
async def create_function(
|
||||
self,
|
||||
name,
|
||||
language: str = "python",
|
||||
return_type: Optional[str] = None,
|
||||
body: Optional[str] = None,
|
||||
options: Optional[Dict[str, str]] = None,
|
||||
*,
|
||||
replace: bool = False,
|
||||
):
|
||||
"""Register a UDF (CREATE FUNCTION). Accepts a ``@udf``/``@table_udf``
|
||||
object (preferred) or the explicit (name, language, return_type, body,
|
||||
options)."""
|
||||
from .udf import Udf
|
||||
|
||||
if isinstance(name, Udf):
|
||||
req = name.create_request()
|
||||
name, language, return_type, body, options = (
|
||||
req["name"],
|
||||
req["language"],
|
||||
req["return_type"],
|
||||
req["body"],
|
||||
req["options"],
|
||||
)
|
||||
if replace:
|
||||
try:
|
||||
await self.drop_function(name)
|
||||
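            except Exception:
                # Best-effort drop: every error is swallowed here, not only
                # not-found (contrast with create_materialized_view).
                pass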
|
||||
await self._inner.create_function(name, language, return_type, body, options)
|
||||
|
||||
async def list_functions(self):
|
||||
"""List registered functions (SHOW FUNCTIONS)."""
|
||||
return await self._inner.list_functions()
|
||||
|
||||
async def drop_function(self, name: str):
|
||||
"""Drop a registered function (DROP FUNCTION)."""
|
||||
await self._inner.drop_function(name)
|
||||
|
||||
async def create_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
source=None,
|
||||
select=None,
|
||||
*,
|
||||
query: Optional[str] = None,
|
||||
where: Optional[str] = None,
|
||||
auto_refresh: bool = False,
|
||||
with_no_data: bool = False,
|
||||
replace: bool = False,
|
||||
partition_by: Optional[str] = None,
|
||||
) -> "AsyncMaterializedView":
|
||||
"""Create a materialized view; returns an `AsyncMaterializedView`
|
||||
handle (``.wait()`` blocks until populated). Pass either ``query=`` (a
|
||||
full SELECT) or ``source`` + ``select`` items; `partition_by`
|
||||
partitions the view's table function on a source column (index-cluster
|
||||
if the column is IVF-indexed, else distinct-value). See the sync
|
||||
method for the select grammar."""
|
||||
from .udf import build_view_query, AsyncMaterializedView
|
||||
|
||||
if query is None:
|
||||
if source is None or select is None:
|
||||
raise ValueError(
|
||||
"create_materialized_view needs either query= or both "
|
||||
"source and select"
|
||||
)
|
||||
query = build_view_query(source, select)
|
||||
if where:
|
||||
query += f" WHERE {where}"
|
||||
if replace:
|
||||
try:
|
||||
await self.drop_materialized_view(name)
|
||||
except Exception as e:
|
||||
msg = str(e).lower()
|
||||
if "not found" not in msg and "does not exist" not in msg:
|
||||
raise
|
||||
job_id = await self._inner.create_materialized_view(
|
||||
name,
|
||||
query,
|
||||
auto_refresh=auto_refresh,
|
||||
with_no_data=with_no_data,
|
||||
partition_by=partition_by,
|
||||
)
|
||||
return AsyncMaterializedView(self, name, job_id=job_id)
|
||||
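The argument check in `create_materialized_view` (either `query=`, or both `source` and `select`) can be sketched as a standalone function. Everything here is illustrative: `resolve_view_query` is a hypothetical name, the `build_view_query` body is a stand-in because the real helper in `udf.py` is not shown in this hunk, and applying `where` only to the assembled form is an assumption, since the flattened diff does not show the nesting.

```python
from typing import Optional, Sequence


def build_view_query(source: str, select: Sequence[str]) -> str:
    # Hypothetical stand-in: the real helper lives in udf.py and may differ.
    return f"SELECT {', '.join(select)} FROM {source}"


def resolve_view_query(
    source: Optional[str] = None,
    select: Optional[Sequence[str]] = None,
    *,
    query: Optional[str] = None,
    where: Optional[str] = None,
) -> str:
    """Return the SELECT text for a view, from query= or source + select."""
    if query is None:
        if source is None or select is None:
            raise ValueError(
                "create_materialized_view needs either query= or both "
                "source and select"
            )
        query = build_view_query(source, select)
        if where:  # assumption: the filter applies to the assembled form only
            query += f" WHERE {where}"
    return query


print(resolve_view_query("docs", ["id", "vec"], where="id > 10"))
print(resolve_view_query(query="SELECT id FROM docs"))
```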
|
||||
def job(self, job_id: str):
|
||||
"""An `AsyncJobHandle` for reconnecting to an inflight job by id (a
|
||||
stored id, or one from the SQL / REST surface). Submit methods already
|
||||
return a handle, so this is only needed to re-attach to an existing
|
||||
job."""
|
||||
from .udf import AsyncJobHandle
|
||||
|
||||
return AsyncJobHandle(self, job_id)
|
||||
|
||||
async def lineage(
|
||||
self,
|
||||
table: str,
|
||||
column: Optional[str] = None,
|
||||
*,
|
||||
direction: Optional[str] = None,
|
||||
depth: Optional[int] = None,
|
||||
):
|
||||
"""Derived-compute lineage of a table/view (or column). See the sync
|
||||
`Connection.lineage`. Returns a `Lineage`."""
|
||||
from .lineage import Lineage
|
||||
|
||||
raw = await self._inner.table_lineage(table, column, direction, depth)
|
||||
return Lineage.from_json(raw)
|
||||
|
||||
async def _refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
) -> str:
|
||||
"""Internal: submit a refresh, return the job id. The public surface is
|
||||
``AsyncMaterializedView.refresh()`` (returns an `AsyncJobHandle`).
|
||||
|
||||
``full=True`` forces a full rebuild (recompute and replace every row)
|
||||
instead of the default incremental refresh.
|
||||
"""
|
||||
return await self._inner.refresh_materialized_view(
|
||||
name,
|
||||
full=full,
|
||||
src_version=src_version,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
)
|
||||
|
||||
async def explain_refresh_materialized_view(
|
||||
self,
|
||||
name: str,
|
||||
*,
|
||||
full: bool = False,
|
||||
src_version: Optional[int] = None,
|
||||
):
|
||||
"""Plan a refresh without running it (EXPLAIN REFRESH)."""
|
||||
return await self._inner.explain_refresh_materialized_view(
|
||||
name, full=full, src_version=src_version
|
||||
)
|
||||
|
||||
async def alter_materialized_view(self, name: str, *, auto_refresh: bool):
|
||||
"""Update a materialized view's options."""
|
||||
await self._inner.alter_materialized_view(name, auto_refresh)
|
||||
|
||||
async def drop_materialized_view(self, name: str):
|
||||
"""Drop a materialized view definition."""
|
||||
await self._inner.drop_materialized_view(name)
|
||||
|
||||
async def list_materialized_views(self):
|
||||
"""List registered materialized view definitions."""
|
||||
return await self._inner.list_materialized_views()
|
||||
|
||||
async def list_jobs(self):
|
||||
"""List inflight server-side jobs across the database's tables."""
|
||||
return await self._inner.list_jobs()
|
||||
|
||||
async def get_job(self, job_id: str, table: "str | None" = None):
|
||||
"""Look up one server-side job by id (the wait()/status poll path).
|
||||
``table`` (the job's table) enables an O(1) server-side lookup.
|
||||
Returns the job's status, or None if unknown / no longer active."""
|
||||
return await self._inner.get_job(job_id, table)
|
||||
|
||||
async def cancel_job(self, job_id: str) -> bool:
|
||||
"""Cancel an inflight server-side job by id (CANCEL JOB).
|
||||
|
||||
Returns True if a matching inflight job was found and flagged for
|
||||
cancellation, False otherwise (best-effort).
|
||||
"""
|
||||
return await self._inner.cancel_job(job_id)
|
||||
|
||||
async def job_history(self, job_id: "str | None" = None):
|
||||
"""Durable history of completed server-side jobs (SHOW JOB HISTORY).
|
||||
|
||||
Reads each table's durable job-history store. Pass ``job_id`` to narrow
|
||||
to a single job. Unlike :meth:`list_jobs` (live, inflight) these are the
|
||||
terminal records, with created/updated/completed timestamps.
|
||||
"""
|
||||
return await self._inner.job_history(job_id)
|
||||
|
||||
async def errors(self, job_id: "str | None" = None, table: "str | None" = None):
|
||||
"""Per-row UDF errors recorded by ``error_policy=skip`` (SHOW ERRORS).
|
||||
|
||||
Optionally filtered by ``job_id`` and/or ``table``.
|
||||
"""
|
||||
return await self._inner.errors(job_id, table)
|
||||
|
||||
async def rename_table(
|
||||
self,
|
||||
cur_name: str,
|
||||
|
||||
177
python/python/lancedb/lineage.py
Normal file
@@ -0,0 +1,177 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Client-side model of derived-compute lineage.

`Connection.lineage()` / `Table.lineage()` / `MaterializedView.lineage()` return
a `Lineage`: the graph of what a column or materialized view derives from
(upstream), what derives from it (downstream), and -- for each derived column --
the function that produced it, the version it was produced with, and whether
that is stale relative to the function the registry now holds.

The server returns this as JSON (the wire contract); these classes deserialize
it. Nothing here talks to the server.
"""

from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class FunctionRef:
    """The function that produced a derived column, with version + location."""

    name: str
    #: Version that produced the data (stamped at compute time), if known.
    as_computed_version: Optional[str] = None
    #: Version the registry currently holds for this function name.
    current_version: Optional[str] = None
    #: True when the column was produced by an older function than the registry
    #: now holds -- i.e. silently stale; re-refresh to catch up.
    stale_vs_current: bool = False
    language: Optional[str] = None
    docker_image: Optional[str] = None
    env_digest: Optional[str] = None
    code_uri: Optional[str] = None

    @classmethod
    def _from(cls, d: dict) -> "FunctionRef":
        return cls(
            name=d["name"],
            as_computed_version=d.get("as_computed_version"),
            current_version=d.get("current_version"),
            stale_vs_current=d.get("stale_vs_current", False),
            language=d.get("language"),
            docker_image=d.get("docker_image"),
            env_digest=d.get("env_digest"),
            code_uri=d.get("code_uri"),
        )


@dataclass
class Node:
    """A lineage node: a table, view, column, or function."""

    kind: str  # "table" | "view" | "column" | "function"
    id: str  # "table", "table.column", or "fn:name@version"
    table: Optional[str] = None
    function: Optional[FunctionRef] = None

    @classmethod
    def _from(cls, d: dict) -> "Node":
        fn = d.get("function")
        return cls(
            kind=d["kind"],
            id=d["id"],
            table=d.get("table"),
            function=FunctionRef._from(fn) if fn else None,
        )


@dataclass
class Edge:
    """`downstream` depends on `upstream`, produced by `via` (a function name,
    or None for a passthrough)."""

    downstream: str
    upstream: str
    via: Optional[str] = None

    @classmethod
    def _from(cls, d: dict) -> "Edge":
        return cls(downstream=d["downstream"], upstream=d["upstream"], via=d.get("via"))


@dataclass
class Lineage:
    """A derived-compute lineage graph (nodes + labeled edges)."""

    target: str
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

    @classmethod
    def from_json(cls, raw: Union[str, bytes, dict]) -> "Lineage":
        d = json.loads(raw) if isinstance(raw, (str, bytes)) else raw
        return cls(
            target=d.get("target", ""),
            nodes=[Node._from(n) for n in d.get("nodes", [])],
            edges=[Edge._from(e) for e in d.get("edges", [])],
        )

    def functions(self) -> List[FunctionRef]:
        """The function nodes in the graph."""
        return [n.function for n in self.nodes if n.function is not None]

    def stale(self) -> List[FunctionRef]:
        """Functions whose as-computed version is behind the current registry
        version -- the columns they produced are silently out of date."""
        return [f for f in self.functions() if f.stale_vs_current]

    def to_dict(self) -> dict:
        def prune(d: dict) -> dict:
            return {k: v for k, v in d.items() if v is not None}

        return {
            "target": self.target,
            "nodes": [
                prune(
                    {
                        "kind": n.kind,
                        "id": n.id,
                        "table": n.table,
                        "function": prune(vars(n.function)) if n.function else None,
                    }
                )
                for n in self.nodes
            ],
            "edges": [prune(vars(e)) for e in self.edges],
        }

    def to_graphviz(self) -> str:
        """Graphviz DOT for the lineage DAG: columns/tables as nodes, function
        names on edges, drift edges dashed + red."""
        stale_names = {f.name for f in self.stale()}
        out = [
            "digraph lineage {",
            " rankdir=LR;",
            ' node [fontname="monospace"];',
        ]
        for n in self.nodes:
            if n.kind == "function":
                continue
            shape = "ellipse" if n.kind in ("table", "view") else "box"
            out.append(f' "{n.id}" [shape={shape}];')
        for e in self.edges:
            attrs = ""
            if e.via:
                if e.via in stale_names:
                    attrs = f' [label="{e.via}" color=red style=dashed]'
                else:
                    attrs = f' [label="{e.via}"]'
            out.append(f' "{e.upstream}" -> "{e.downstream}"{attrs};')
        out.append("}")
        return "\n".join(out)

    def _repr_html_(self) -> str:
        warn = ""
        drift = self.stale()
        if drift:
            names = ", ".join(sorted({f.name for f in drift}))
            warn = (
                f'<p style="color:#b00000"><b>stale vs current:</b> {names} '
                "(re-refresh to catch up)</p>"
            )
        rows = "".join(
            f"<tr><td><code>{e.downstream}</code></td>"
            f"<td>← {e.via or ''}</td>"
            f"<td><code>{e.upstream}</code></td></tr>"
            for e in self.edges
        )
        return (
            f"<b>lineage: <code>{self.target}</code></b>{warn}"
            "<table><tr><th>derived</th><th>via</th><th>from</th></tr>"
            f"{rows}</table>"
        )
|
||||
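To make the wire shape in `lineage.py` concrete, the sketch below walks a small payload with plain dictionaries instead of importing the classes. The payload values (`docs`, `embed`, the version strings) are invented for illustration, and `stale_function_names` and `edge_to_dot` are hypothetical helpers that mirror, but do not replace, `Lineage.stale()` and `Lineage.to_graphviz()`.

```python
import json

# Hypothetical payload using the keys that FunctionRef, Node, and Edge read.
RAW = json.dumps(
    {
        "target": "docs.vec",
        "nodes": [
            {"kind": "table", "id": "docs"},
            {"kind": "column", "id": "docs.text", "table": "docs"},
            {
                "kind": "column",
                "id": "docs.vec",
                "table": "docs",
                "function": {
                    "name": "embed",
                    "as_computed_version": "1",
                    "current_version": "2",
                    "stale_vs_current": True,
                },
            },
        ],
        "edges": [
            {"downstream": "docs.vec", "upstream": "docs.text", "via": "embed"}
        ],
    }
)


def stale_function_names(payload: dict) -> set:
    """Names of functions flagged stale_vs_current on any node."""
    return {
        n["function"]["name"]
        for n in payload.get("nodes", [])
        if n.get("function") and n["function"].get("stale_vs_current", False)
    }


def edge_to_dot(edge: dict, stale: set) -> str:
    """One DOT edge line; edges produced by a stale function are dashed red."""
    via = edge.get("via")
    attrs = ""
    if via:
        style = " color=red style=dashed" if via in stale else ""
        attrs = f' [label="{via}"{style}]'
    return f'"{edge["upstream"]}" -> "{edge["downstream"]}"{attrs};'


payload = json.loads(RAW)
stale = stale_function_names(payload)
for e in payload["edges"]:
    print(edge_to_dot(e, stale))
```

Edges point from upstream to downstream, so the rendered graph reads left to right in the direction data flows.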
@@ -13,10 +13,14 @@ from typing import (
|
||||
Iterable,
|
||||
List,
|
||||
Optional,
|
||||
TYPE_CHECKING,
|
||||
Union,
|
||||
Literal,
|
||||
overload,
|
||||
)
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from ..udf import JobHandle
|
||||
import warnings
|
||||
|
||||
from lancedb import __version__
|
||||
@@ -884,8 +888,142 @@ class RemoteTable(Table):
|
||||
def count_rows(self, filter: Optional[str] = None) -> int:
|
||||
return LOOP.run(self._table.count_rows(filter))
|
||||
|
||||
def add_columns(self, transforms: Dict[str, str]) -> AddColumnsResult:
|
||||
return LOOP.run(self._table.add_columns(transforms))
|
||||
def add_columns(
|
||||
self,
|
||||
transforms: Optional[Dict[str, str]] = None,
|
||||
*,
|
||||
computed: Optional[Dict[str, tuple]] = None,
|
||||
) -> Optional[AddColumnsResult]:
|
||||
result = None
|
||||
if transforms is not None:
|
||||
result = LOOP.run(self._table.add_columns(transforms))
|
||||
if computed:
|
||||
LOOP.run(self._table.add_columns(computed=computed))
|
||||
return result
|
||||
|
||||
def refresh_column(
|
||||
self,
|
||||
columns,
|
||||
*,
|
||||
where: Optional[str] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> "JobHandle":
|
||||
"""Trigger recompute of computed columns (REFRESH COLUMN).
|
||||
|
||||
The expression is resolved server-side from each column's stored
|
||||
binding; columns bound to the same struct-returning function
|
||||
refresh together. Returns a `JobHandle` to wait on, poll, or cancel
|
||||
(``tbl.refresh_column("c").wait()``). Server-backed feature
|
||||
(LanceDB Enterprise / Cloud).
|
||||
|
||||
num_workers / max_workers / batch_size / priority are per-refresh
|
||||
scheduling knobs (how to run THIS refresh) and override any default
|
||||
the function carries. `priority` is a Kueue tier
|
||||
(training | interactive | backfill).
|
||||
"""
|
||||
from ..udf import JobHandle
|
||||
|
||||
if isinstance(columns, str):
|
||||
columns = [columns]
|
||||
job_id = LOOP.run(
|
||||
self._table.refresh_column(
|
||||
list(columns),
|
||||
where=where,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
priority=priority,
|
||||
)
|
||||
)
|
||||
return JobHandle(self._job_conn(), job_id)
|
||||
|
||||
def lineage(self, column=None, *, direction=None, depth=None):
|
||||
"""Derived-compute lineage of this table, or one of its columns:
|
||||
upstream sources, downstream dependents, and the function version +
|
||||
location that produced each derived column (with a drift flag). Returns
|
||||
a `Lineage`. See `Connection.lineage`."""
|
||||
return self._job_conn().lineage(
|
||||
self._name, column, direction=direction, depth=depth
|
||||
)
|
||||
|
||||
def _job_conn(self):
|
||||
"""A client connection for polling jobs this table spawns. Built lazily
|
||||
from the table's serialized connection state and cached (not pickled --
|
||||
a forked/unpickled table rebuilds it on next use)."""
|
||||
from lancedb import deserialize_conn
|
||||
|
||||
conn = getattr(self, "_job_conn_cache", None)
|
||||
if conn is None:
|
||||
conn = deserialize_conn(self._serialized_connection_state())
|
||||
self._job_conn_cache = conn
|
||||
return conn
|
||||
|
||||
def load_columns(
|
||||
self,
|
||||
source: Union[str, Iterable[str]],
|
||||
pk: str,
|
||||
columns: Union[Iterable[str], Dict[str, str]],
|
||||
*,
|
||||
source_format: str = "parquet",
|
||||
source_pk: Optional[str] = None,
|
||||
on_missing: str = "carry",
|
||||
source_storage_options: Optional[Dict[str, str]] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
commit_granularity: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Fill existing columns from an external source by primary-key join.
|
||||
|
||||
The distributed-job equivalent of Geneva's ``Table.load_columns()``:
|
||||
imports precomputed values (e.g. embeddings) from Parquet/Lance/IPC into
|
||||
this table, matching on a primary key. Returns the load job id.
|
||||
Server-backed feature (LanceDB Enterprise / Cloud).
|
||||
|
||||
Parameters
|
||||
----------
|
||||
source: str | list[str]
|
||||
One source URI or a list of URIs.
|
||||
pk: str
|
||||
Destination primary-key column. Also the source key unless
|
||||
``source_pk`` is given.
|
||||
columns: list[str] | dict[str, str]
|
||||
Value columns to load. A list loads same-named columns; a dict maps
|
||||
``{target: source}``.
|
||||
source_format: str
|
||||
``"parquet"`` (default), ``"lance"``, or ``"ipc"``.
|
||||
source_pk: str, optional
|
||||
Source primary-key column when it differs from ``pk``.
|
||||
on_missing: str
|
||||
Behavior for destination rows with no source match:
|
||||
``"carry"`` (default, keep existing), ``"null"``, or ``"error"``.
|
||||
"""
|
||||
if isinstance(source, str):
|
||||
source = [source]
|
||||
if isinstance(columns, dict):
|
||||
mappings = [(target, src) for target, src in columns.items()]
|
||||
else:
|
||||
mappings = [(c, None) for c in columns]
|
||||
return LOOP.run(
|
||||
self._table.load_columns(
|
||||
list(source),
|
||||
source_format,
|
||||
pk,
|
||||
mappings,
|
||||
source_key=source_pk,
|
||||
source_storage_options=source_storage_options,
|
||||
on_missing=on_missing,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
commit_granularity=commit_granularity,
|
||||
priority=priority,
|
||||
)
|
||||
)
|
||||
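The argument normalization at the end of `load_columns` is easy to check on its own. The sketch below isolates it in a hypothetical helper, `normalize_load_args`; the URIs and column names are made up, and nothing here contacts a server.

```python
from typing import Dict, Iterable, List, Optional, Tuple, Union


def normalize_load_args(
    source: Union[str, Iterable[str]],
    columns: Union[Iterable[str], Dict[str, str]],
) -> Tuple[List[str], List[Tuple[str, Optional[str]]]]:
    """Return (source URIs, [(target_column, source_column_or_None), ...])."""
    if isinstance(source, str):
        source = [source]  # a single URI becomes a one-element list
    if isinstance(columns, dict):
        # dict form maps {target: source}
        mappings = [(target, src) for target, src in columns.items()]
    else:
        # list form loads same-named columns; None means "same name as target"
        mappings = [(c, None) for c in columns]
    return list(source), mappings


print(normalize_load_args("s3://bucket/part-0.parquet", ["vec"]))
print(normalize_load_args(["a.parquet", "b.parquet"], {"vec": "embedding"}))
```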
|
||||
def alter_columns(
|
||||
self, *alterations: Iterable[Dict[str, str]]
|
||||
|
||||
@@ -702,6 +702,24 @@ def _normalize_progress(progress):
|
||||
return progress, False
|
||||
|
||||
|
||||
def _computed_groups(computed):
    """Group computed columns by expression, preserving declaration order
    (struct-returning functions need their columns adjacent so schema order
    matches field order). Accepts the ergonomic forms -- `fn("col")` values
    and tuple keys for struct fan-out -- via `_normalize_computed`."""
    from .udf import _normalize_computed

    groups = []
    for name, (sql_type, expression) in _normalize_computed(computed).items():
        for expr, cols in groups:
            if expr == expression:
                cols.append((name, sql_type))
                break
        else:
            groups.append((expression, [(name, sql_type)]))
    return groups
|
||||
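The grouping in `_computed_groups` relies on Python's `for ... else`: the `else` branch runs only when the inner loop finishes without `break`, that is, when no existing group has the same expression. The sketch below repeats the loop on the legacy `{col: (sql_type, expression)}` form only, without `_normalize_computed`; `group_by_expression`, the column names, and the SQL type strings are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Group = Tuple[str, List[Tuple[str, str]]]


def group_by_expression(computed: Dict[str, Tuple[str, str]]) -> List[Group]:
    """Group (name, sql_type) pairs by expression, in first-seen order."""
    groups: List[Group] = []
    for name, (sql_type, expression) in computed.items():
        for expr, cols in groups:
            if expr == expression:
                cols.append((name, sql_type))
                break
        else:
            # No existing group matched: start a new one.
            groups.append((expression, [(name, sql_type)]))
    return groups


computed = {
    "label": ("STRING", "classify(text)"),
    "vec": ("FLOAT[]", "embed(text)"),
    "score": ("DOUBLE", "classify(text)"),
}
for expression, cols in group_by_expression(computed):
    print(expression, cols)
```

A linear scan over `groups` keeps first-seen order without extra bookkeeping, which is reasonable when a call declares only a handful of columns.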
|
||||
|
||||
class Table(ABC):
|
||||
"""
|
||||
A Table is a collection of Records in a LanceDB Database.
|
||||
@@ -807,6 +825,59 @@ class Table(ABC):
|
||||
"""The number of rows in this Table"""
|
||||
return self.count_rows(None)
|
||||
|
||||
def add_computed_column(
|
||||
self,
|
||||
columns,
|
||||
fn,
|
||||
args: Optional[List[str]] = None,
|
||||
types=None,
|
||||
) -> None:
|
||||
"""Declare computed column(s) bound to a UDF -- no compute happens
|
||||
here (the agent fills them lazily, or refresh_column() triggers a run).
|
||||
|
||||
.. deprecated::
|
||||
A computed column is an expression over a registered function, so
|
||||
bind it as one: ``add_columns(computed={"vec": embed("data")})``.
|
||||
``embed("data")`` applies the function to the `data` column and
|
||||
infers the type from the function's return signature -- the
|
||||
function never couples to a particular column. Prefer that form.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
warnings.warn(
|
||||
"add_computed_column is deprecated; use add_columns(computed="
|
||||
'{"vec": embed("data")}).',
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
from .udf import Udf, struct_field_types
|
||||
|
||||
multi = isinstance(columns, (tuple, list))
|
||||
if isinstance(fn, Udf):
|
||||
expr = fn.expression(*(args or []))
|
||||
if types is None:
|
||||
if multi:
|
||||
if not fn.returns.upper().startswith("STRUCT"):
|
||||
raise ValueError(
|
||||
"several columns need a STRUCT-returning function"
|
||||
)
|
||||
types = struct_field_types(fn.returns)
|
||||
else:
|
||||
types = fn.returns
|
||||
else:
|
||||
if types is None:
|
||||
raise ValueError("pass types= when fn is a name string")
|
||||
expr = f"{fn}({', '.join(args or [])})"
|
||||
if multi:
|
||||
if len(types) != len(columns):
|
||||
raise ValueError(
|
||||
f"{len(columns)} columns but {len(types)} output types"
|
||||
)
|
||||
computed = {c: (t, expr) for c, t in zip(columns, types)}
|
||||
else:
|
||||
computed = {columns: (types, expr)}
|
||||
self.add_columns(computed=computed)
|
||||
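When `fn` is a plain name string, `add_computed_column` builds the expression text itself and pairs each output column with a type. The sketch below reproduces only that branch as a hypothetical free function, `bind_by_name`; the function names, column names, and type strings in the usage lines are invented for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union


def bind_by_name(
    columns: Union[str, Sequence[str]],
    fn: str,
    args: Optional[List[str]] = None,
    types: Union[str, Sequence[str], None] = None,
) -> Dict[str, Tuple[str, str]]:
    """Build the {column: (sql_type, expression)} mapping for a named function."""
    if types is None:
        raise ValueError("pass types= when fn is a name string")
    expr = f"{fn}({', '.join(args or [])})"
    if isinstance(columns, (tuple, list)):
        # Several output columns: one type per column, all sharing one expression.
        if len(types) != len(columns):
            raise ValueError(f"{len(columns)} columns but {len(types)} output types")
        return {c: (t, expr) for c, t in zip(columns, types)}
    return {columns: (types, expr)}


print(bind_by_name("vec", "embed", ["text"], "FLOAT[]"))
print(bind_by_name(["width", "height"], "dims", ["image"], ["INT", "INT"]))
```

Because every column of a multi-column binding carries the same expression string, a later grouping step can put them back together by comparing expressions.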
|
||||
@property
|
||||
@abstractmethod
|
||||
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
|
||||
@@ -3710,9 +3781,68 @@ class LanceTable(Table):
|
||||
return LOOP.run(self._table.index_stats(index_name))
|
||||
|
||||
def add_columns(
|
||||
self, transforms: Dict[str, str] | pa.field | List[pa.field] | pa.Schema
|
||||
) -> AddColumnsResult:
|
||||
return LOOP.run(self._table.add_columns(transforms))
|
||||
self,
|
||||
transforms: Dict[str, str]
|
||||
| pa.field
|
||||
| List[pa.field]
|
||||
| pa.Schema
|
||||
| None = None,
|
||||
*,
|
||||
computed: Optional[Dict] = None,
|
||||
) -> Optional[AddColumnsResult]:
|
||||
result = None
|
||||
if transforms is not None:
|
||||
result = LOOP.run(self._table.add_columns(transforms))
|
||||
if computed:
|
||||
# computed binds an expression over a registered function to a
|
||||
# column: {col: fn("input_col")} -- fn("input_col") yields the
|
||||
# expression and carries the inferred type; a tuple key fans a
|
||||
# STRUCT return out to several columns. Declares the binding only;
|
||||
# the server fills the values (server-backed). The legacy
|
||||
# {col: (sql_type, expression)} tuple form is still accepted.
|
||||
            LOOP.run(self._table.add_columns(computed=computed))
|
||||
return result
|
||||
|
||||
def refresh_column(
|
||||
self,
|
||||
columns,
|
||||
*,
|
||||
where: Optional[str] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> "JobHandle":
|
||||
"""Trigger recompute of computed columns (REFRESH COLUMN).
|
||||
|
||||
The expression is resolved server-side from each column's stored
|
||||
binding; columns bound to the same struct-returning function
|
||||
refresh together. Returns a `JobHandle` to wait on, poll, or cancel
|
||||
(``tbl.refresh_column("col").wait()``) -- mirrors
|
||||
`MaterializedView.refresh()`. Server-backed feature (LanceDB
|
||||
Enterprise / Cloud).
|
||||
|
||||
num_workers / max_workers / batch_size / priority are per-refresh
|
||||
scheduling knobs (how to run THIS refresh) and override any default
|
||||
the function carries. `priority` is a Kueue tier
|
||||
(training | interactive | backfill).
|
||||
"""
|
||||
from .udf import JobHandle
|
||||
|
||||
if isinstance(columns, str):
|
||||
columns = [columns]
|
||||
job_id = LOOP.run(
|
||||
self._table.refresh_column(
|
||||
list(columns),
|
||||
where=where,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
priority=priority,
|
||||
)
|
||||
)
|
||||
return JobHandle(self._conn, job_id, table=self.name)
|
||||
|
||||
def alter_columns(
|
||||
self, *alterations: Iterable[Dict[str, str]]
|
||||
@@ -5390,9 +5520,44 @@ class AsyncTable:
|
||||
|
||||
return await self._inner.update(updates_sql, where)
|
||||
|
||||
async def refresh_column(
|
||||
self,
|
||||
columns,
|
||||
*,
|
||||
where: Optional[str] = None,
|
||||
num_workers: Optional[int] = None,
|
||||
max_workers: Optional[int] = None,
|
||||
batch_size: Optional[int] = None,
|
||||
priority: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Trigger recompute of computed columns (REFRESH COLUMN).
|
||||
Returns the refresh job id. Server-backed feature.
|
||||
|
||||
num_workers / max_workers / batch_size / priority are per-refresh
|
||||
scheduling knobs (how to run THIS refresh); they override any default
|
||||
the function carries. `priority` is a Kueue tier
|
||||
(training | interactive | backfill)."""
|
||||
if isinstance(columns, str):
|
||||
columns = [columns]
|
||||
return await self._inner.refresh_column(
|
||||
list(columns),
|
||||
where_clause=where,
|
||||
num_workers=num_workers,
|
||||
max_workers=max_workers,
|
||||
batch_size=batch_size,
|
||||
priority=priority,
|
||||
)
|
||||
|
||||
async def add_columns(
|
||||
self, transforms: dict[str, str] | pa.field | List[pa.field] | pa.Schema
|
||||
) -> AddColumnsResult:
|
||||
self,
|
||||
transforms: dict[str, str]
|
||||
| pa.field
|
||||
| List[pa.field]
|
||||
| pa.Schema
|
||||
| None = None,
|
||||
*,
|
||||
computed: Optional[Dict] = None,
|
||||
) -> Optional[AddColumnsResult]:
|
||||
"""
|
||||
Add new columns with defined values.
|
||||
|
||||
@@ -5411,6 +5576,7 @@ class AsyncTable:
|
||||
version: the new version number of the table after adding columns.
|
||||
|
||||
"""
|
||||
result = None
|
||||
if isinstance(transforms, pa.Field):
|
||||
transforms = [transforms]
|
||||
if isinstance(transforms, list) and all(
|
||||
@@ -5418,9 +5584,69 @@ class AsyncTable:
|
||||
):
|
||||
transforms = pa.schema(transforms)
|
||||
if isinstance(transforms, pa.Schema):
|
||||
return await self._inner.add_columns_with_schema(transforms)
|
||||
result = await self._inner.add_columns_with_schema(transforms)
|
||||
elif transforms is not None:
|
||||
result = await self._inner.add_columns(list(transforms.items()))
|
||||
if computed:
|
||||
# computed binds an expression over a registered function to a
|
||||
# column: {col: fn("input_col")} -- fn("input_col") yields the
|
||||
# expression and carries the inferred type; a tuple key fans a
|
||||
# STRUCT return out to several columns. Declares the binding only;
|
||||
# the server fills the values (server-backed). The legacy
|
||||
# {col: (sql_type, expression)} tuple form is still accepted.
|
||||
for expression, cols in _computed_groups(computed):
|
||||
await self._inner.add_computed_columns(cols, expression)
|
||||
return result
|
||||
|
||||
async def add_computed_column(
|
||||
self,
|
||||
columns,
|
||||
fn,
|
||||
args: Optional[List[str]] = None,
|
||||
types=None,
|
||||
) -> None:
|
||||
"""Declare computed column(s) bound to a UDF (async).
|
||||
|
||||
.. deprecated::
|
||||
Use ``add_columns(computed={"col": fn("input_col")})`` -- a computed
|
||||
column is an expression over a registered function, so bind it that
|
||||
way instead of coupling the UDF to the column here.
|
||||
"""
|
||||
import warnings
|
||||
|
||||
warnings.warn(
|
||||
"add_computed_column is deprecated; use add_columns(computed="
|
||||
'{"col": fn("input_col")}).',
|
||||
DeprecationWarning,
|
||||
stacklevel=2,
|
||||
)
|
||||
from .udf import Udf, struct_field_types
|
||||
|
||||
multi = isinstance(columns, (tuple, list))
|
||||
if isinstance(fn, Udf):
|
||||
expr = fn.expression(*(args or []))
|
||||
if types is None:
|
||||
if multi:
|
||||
if not fn.returns.upper().startswith("STRUCT"):
|
||||
raise ValueError(
|
||||
"several columns need a STRUCT-returning function"
|
||||
)
|
||||
types = struct_field_types(fn.returns)
|
||||
else:
|
||||
types = fn.returns
|
||||
else:
|
||||
return await self._inner.add_columns(list(transforms.items()))
|
||||
if types is None:
|
||||
raise ValueError("pass types= when fn is a name string")
|
||||
expr = f"{fn}({', '.join(args or [])})"
|
||||
if multi:
|
||||
if len(types) != len(columns):
|
||||
raise ValueError(
|
||||
f"{len(columns)} columns but {len(types)} output types"
|
||||
)
|
||||
computed = {c: (t, expr) for c, t in zip(columns, types)}
|
||||
else:
|
||||
computed = {columns: (types, expr)}
|
||||
await self.add_columns(computed=computed)
|
||||
|
||||
async def alter_columns(
|
||||
self, *alterations: Iterable[dict[str, Any]]
|
||||
|
||||
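For readers tracing the deprecated `add_computed_column` path in the hunk above, here is a minimal standalone sketch of how its name-string branch appears to build the legacy `{col: (sql_type, expression)}` mapping before handing it to `add_columns(computed=...)`. The helper name `legacy_computed` is hypothetical and used only for illustration; it is not part of the LanceDB API and it never contacts a server.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union


def legacy_computed(
    columns: Union[str, Sequence[str]],
    fn_name: str,
    args: Optional[List[str]] = None,
    types: Union[str, Sequence[str], None] = None,
) -> Dict[str, Tuple[str, str]]:
    """Hypothetical helper mirroring the name-string branch shown in the diff."""
    if types is None:
        # A bare function name carries no return type, so it must be supplied.
        raise ValueError("pass types= when fn is a name string")
    # The expression is the function name applied to the input column names.
    expr = f"{fn_name}({', '.join(args or [])})"
    if isinstance(columns, (tuple, list)):
        # Several output columns: one declared type per column, same expression.
        if len(types) != len(columns):
            raise ValueError(f"{len(columns)} columns but {len(types)} output types")
        return {c: (t, expr) for c, t in zip(columns, types)}
    # Single output column.
    return {columns: (types, expr)}


single = legacy_computed("vec", "embed", ["text"], "FLOAT[]")
fanned = legacy_computed(["chunk", "idx"], "chunker", ["body"], ["VARCHAR", "BIGINT"])
print(single)
print(fanned)
```

With a `Udf` object the diff instead derives both the expression and the types from the function itself, which is why the non-deprecated form needs no `types=` argument.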
753
python/python/lancedb/udf.py
Normal file
@@ -0,0 +1,753 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
"""UDF authoring for LanceDB derived compute (server-backed).
|
||||
|
||||
`@udf` / `@table_udf` turn a plain Python function into a registrable
|
||||
server-side UDF: a cloudpickled (or source) body, a SQL signature inferred
|
||||
from type hints, and the runtime options (pip deps, GPUs, batching, ...).
|
||||
Register and use them through the existing connection/table API:
|
||||
|
||||
import lancedb
|
||||
from lancedb import udf, table_udf
|
||||
|
||||
db = lancedb.connect("db://my_db", api_key="...", host_override="...")
|
||||
|
||||
@udf(pip=["torch>=2.0"], num_gpus=1)
|
||||
def embed(text: str) -> list[float]:
|
||||
return model.encode(text).tolist()
|
||||
|
||||
db.create_function(embed) # CREATE FUNCTION (once)
|
||||
tbl = db.open_table("docs")
|
||||
tbl.add_columns(computed={"vec": embed("text")}) # bind embed(text) -> vec
|
||||
tbl.refresh_column("vec").wait() # materialize (returns a JobHandle)
|
||||
view = db.create_materialized_view("chunks", tbl, ["id", chunk_fn])
|
||||
|
||||
`embed("text")` applies the registered function to the `text` column and yields
|
||||
the expression `embed(text)`; the function itself stays decoupled from any
|
||||
column, so the same `embed` works on any column or table.
|
||||
|
||||
These operations are server-backed (LanceDB Enterprise / Cloud). The
|
||||
decorator itself works locally (define + call); only registration needs a
|
||||
remote connection.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import base64
|
||||
import dataclasses
|
||||
import functools
|
||||
import inspect
|
||||
import re
|
||||
import sys
|
||||
import textwrap
|
||||
import time
|
||||
import typing
|
||||
|
||||
# -- type hints -> SQL type strings -------------------------------------
|
||||
|
||||
_SCALARS = {
|
||||
int: "BIGINT",
|
||||
# Pragmatic default for ML workloads: python float maps to FLOAT
|
||||
# (Float32). Use an explicit `returns=` for DOUBLE.
|
||||
float: "FLOAT",
|
||||
str: "VARCHAR",
|
||||
bool: "BOOLEAN",
|
||||
bytes: "BLOB",
|
||||
}
|
||||
|
||||
|
||||
class TypeInferenceError(TypeError):
|
||||
pass
|
||||
|
||||
|
||||
def sql_type(hint) -> str:
|
||||
"""SQL type string for a python type hint."""
|
||||
if hint in _SCALARS:
|
||||
return _SCALARS[hint]
|
||||
origin = typing.get_origin(hint)
|
||||
if origin in (list, typing.List):
|
||||
(item,) = typing.get_args(hint) or (None,)
|
||||
if item in _SCALARS:
|
||||
return f"{_SCALARS[item]}[]"
|
||||
raise TypeInferenceError(
|
||||
f"unsupported list item type {item!r}; use an explicit returns="
|
||||
)
|
||||
fields = _struct_fields(hint)
|
||||
if fields is not None:
|
||||
inner = ", ".join(f"{name} {sql_type(h)}" for name, h in fields)
|
||||
return f"STRUCT({inner})"
|
||||
raise TypeInferenceError(
|
||||
f"cannot infer a SQL type for {hint!r}; pass an explicit type string"
|
||||
)
|
||||
|
||||
|
||||
def _struct_fields(hint):
|
||||
"""(name, hint) pairs for a TypedDict or dataclass, else None."""
|
||||
if dataclasses.is_dataclass(hint):
|
||||
return [(f.name, f.type) for f in dataclasses.fields(hint)]
|
||||
# TypedDict detection: a dict subclass with __annotations__.
|
||||
if (
|
||||
isinstance(hint, type)
|
||||
and issubclass(hint, dict)
|
||||
and typing.get_type_hints(hint)
|
||||
):
|
||||
return list(typing.get_type_hints(hint).items())
|
||||
return None
|
||||
|
||||
|
||||
def return_type(fn, override: "str | None", table: bool) -> str:
|
||||
"""SQL return type for a function: explicit override wins, else the
|
||||
return annotation. Table functions render as TABLE(...) and accept
|
||||
struct-shaped hints (TypedDict/dataclass, optionally list-wrapped)."""
|
||||
if override is not None:
|
||||
s = override.strip()
|
||||
if table and not s.upper().startswith("TABLE"):
|
||||
if s.upper().startswith("STRUCT"):
|
||||
return "TABLE" + s[len("STRUCT") :]
|
||||
raise TypeInferenceError(
|
||||
"a table function's returns= must be TABLE(...) or STRUCT(...)"
|
||||
)
|
||||
return s
|
||||
|
||||
hints = typing.get_type_hints(fn)
|
||||
ret = hints.get("return")
|
||||
if ret is None:
|
||||
raise TypeInferenceError(
|
||||
f"function {fn.__name__!r} needs a return annotation or returns="
|
||||
)
|
||||
if table:
|
||||
# Accept list[Row] / Row where Row is a TypedDict or dataclass.
|
||||
if typing.get_origin(ret) in (list, typing.List):
|
||||
(ret,) = typing.get_args(ret)
|
||||
fields = _struct_fields(ret)
|
||||
if fields is None:
|
||||
raise TypeInferenceError(
|
||||
"a table function must return rows shaped as a TypedDict or "
|
||||
"dataclass (optionally list-wrapped); or pass returns=..."
|
||||
)
|
||||
inner = ", ".join(f"{name} {sql_type(h)}" for name, h in fields)
|
||||
return f"TABLE({inner})"
|
||||
return sql_type(ret)
|
||||
|
||||
|
||||
def param_types(fn) -> "list[tuple[str, str]]":
|
||||
"""(name, sql type) per parameter, from annotations. Each UDF
|
||||
parameter binds to a source column of the same name by default."""
|
||||
hints = typing.get_type_hints(fn)
|
||||
out = []
|
||||
for name, p in inspect.signature(fn).parameters.items():
|
||||
if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
|
||||
raise TypeInferenceError("*args/**kwargs are not supported in UDFs")
|
||||
hint = hints.get(name)
|
||||
if hint is None:
|
||||
raise TypeInferenceError(
|
||||
f"parameter {name!r} of {fn.__name__!r} needs a type annotation"
|
||||
)
|
||||
out.append((name, sql_type(hint)))
|
||||
return out
|
||||
|
||||
|
||||
# -- column expressions -------------------------------------------------
|
||||
|
||||
|
||||
class ColumnExpr(str):
|
||||
"""A computed-column expression produced by applying a registered
|
||||
function to column names, e.g. ``embed("data") -> "embed(data)"``.
|
||||
|
||||
It IS the expression string everywhere a string is expected (views, SQL,
|
||||
logging), and additionally carries the function's declared return type so
|
||||
``add_columns(computed=...)`` can declare the column without a hand-written
|
||||
type. ``field_types`` holds the per-field SQL types of a STRUCT return, for
|
||||
fanning one expression out to several columns.
|
||||
"""
|
||||
|
||||
data_type: "str | None"
|
||||
field_types: "list[str] | None"
|
||||
|
||||
def __new__(cls, expr: str, data_type=None, field_types=None):
|
||||
obj = super().__new__(cls, expr)
|
||||
obj.data_type = data_type
|
||||
obj.field_types = field_types
|
||||
return obj
|
||||
|
||||
|
||||
def _normalize_computed(computed: dict) -> dict:
|
||||
"""Normalize the user-facing ``computed=`` mapping to the canonical
|
||||
``{name: (sql_type, expression)}`` form.
|
||||
|
||||
Accepts, per entry:
|
||||
- value is a `ColumnExpr` (from ``fn("col")``): the column's SQL type
|
||||
comes from the function's return type -- no hand-written type needed. A
|
||||
tuple key (``("chunk", "idx")``) fans a STRUCT return out to one
|
||||
(type, expression) entry per field, in declared order.
|
||||
- value is a legacy ``(sql_type, expression)`` tuple: passed through (the
|
||||
escape hatch, e.g. bare-name function strings).
|
||||
"""
|
||||
out: dict = {}
|
||||
for key, val in computed.items():
|
||||
if isinstance(val, ColumnExpr):
|
||||
expr = str(val)
|
||||
if isinstance(key, (tuple, list)):
|
||||
if not val.field_types:
|
||||
raise ValueError(
|
||||
f"columns {tuple(key)} need a STRUCT-returning function; "
|
||||
f"{expr} returns a single value"
|
||||
)
|
||||
if len(val.field_types) != len(key):
|
||||
raise ValueError(
|
||||
f"{len(key)} columns but {len(val.field_types)} struct fields "
|
||||
f"in {expr}"
|
||||
)
|
||||
for name, t in zip(key, val.field_types):
|
||||
out[name] = (t, expr)
|
||||
else:
|
||||
if val.data_type is None:
|
||||
raise ValueError(f"cannot infer a type for {expr}; pass types=")
|
||||
out[key] = (val.data_type, expr)
|
||||
else:
|
||||
out[key] = val
|
||||
return out
|
||||
|
||||
|
||||
# -- the @udf / @table_udf decorators -----------------------------------
|
||||
|
||||
|
||||
class Udf:
|
||||
def __init__(
|
||||
self,
|
||||
fn,
|
||||
*,
|
||||
returns: "str | None" = None,
|
||||
table: bool = False,
|
||||
name: "str | None" = None,
|
||||
pip: "list[str] | None" = None,
|
||||
pip_index_url: "str | None" = None,
|
||||
pip_extra_index_urls: "list[str] | None" = None,
|
||||
find_links: "list[str] | None" = None,
|
||||
requirements: "str | list[str] | None" = None,
|
||||
conda: "list[str] | None" = None,
|
||||
conda_channels: "list[str] | None" = None,
|
||||
env: "dict[str, str] | list[str] | None" = None,
|
||||
num_cpus: "int | None" = None,
|
||||
num_gpus: "int | None" = None,
|
||||
batch_size: "int | None" = None,
|
||||
timeout: "float | None" = None,
|
||||
error_policy: "str | None" = None,
|
||||
max_skip_ratio: "float | None" = None,
|
||||
retries: "int | None" = None,
|
||||
docker_image: "str | None" = None,
|
||||
description: "str | None" = None,
|
||||
prefer_source: bool = False,
|
||||
):
|
||||
functools.update_wrapper(self, fn)
|
||||
self.fn = fn
|
||||
self.name = name or fn.__name__
|
||||
self.table = table
|
||||
self.params = param_types(fn)
|
||||
self.returns = return_type(fn, returns, table)
|
||||
self.prefer_source = prefer_source
|
||||
self.options: "dict[str, str]" = {}
|
||||
if conda and (pip or requirements):
|
||||
raise ValueError("pass conda or pip/requirements, not both")
|
||||
if conda_channels and not conda:
|
||||
raise ValueError("conda_channels requires conda")
|
||||
if pip:
|
||||
self.options["pip"] = ",".join(pip)
|
||||
if pip_extra_index_urls:
|
||||
self.options["pip_extra_index_urls"] = ",".join(pip_extra_index_urls)
|
||||
if find_links:
|
||||
self.options["find_links"] = ",".join(find_links)
|
||||
if requirements:
|
||||
self.options["requirements"] = _format_requirements(requirements)
|
||||
if conda:
|
||||
self.options["conda"] = ",".join(conda)
|
||||
if conda_channels:
|
||||
self.options["conda_channels"] = ",".join(conda_channels)
|
||||
if env:
|
||||
self.options["env"] = _format_env(env)
|
||||
for key, val in [
|
||||
("pip_index_url", pip_index_url),
|
||||
("num_cpus", num_cpus),
|
||||
("num_gpus", num_gpus),
|
||||
("batch_size", batch_size),
|
||||
("timeout", timeout),
|
||||
("error_policy", error_policy),
|
||||
("max_skip_ratio", max_skip_ratio),
|
||||
("retries", retries),
|
||||
("docker_image", docker_image),
|
||||
]:
|
||||
if val is not None:
|
||||
self.options[key] = str(val)
|
||||
# Keep the source in the description (when available) so the
|
||||
# catalog stays inspectable even for pickled bodies.
|
||||
if description is not None:
|
||||
self.options["description"] = description
|
||||
else:
|
||||
try:
|
||||
self.options["description"] = textwrap.dedent(inspect.getsource(fn))
|
||||
except (OSError, TypeError):
|
||||
pass
|
||||
|
||||
def __call__(self, *args, **kwargs):
|
||||
"""Call with real values to run locally; call with column-name
|
||||
strings to build an expression for backfills and views, e.g.
|
||||
``embed("data")`` -> the expression ``embed(data)`` (a `ColumnExpr`
|
||||
carrying the function's return type for `add_columns(computed=...)`)."""
|
||||
if args and all(isinstance(a, str) for a in args) and not kwargs:
|
||||
return self.expression(*args)
|
||||
return self.fn(*args, **kwargs)
|
||||
|
||||
def expression(self, *columns: str) -> ColumnExpr:
|
||||
"""The expression applying this function to `columns` (default: the
|
||||
function's own parameter names). Returns a `ColumnExpr` -- a string
|
||||
that also carries the declared return type (and struct field types)."""
|
||||
cols = columns or [p for p, _ in self.params]
|
||||
expr = f"{self.name}({', '.join(cols)})"
|
||||
field_types = None
|
||||
if self.returns.upper().startswith("STRUCT"):
|
||||
field_types = struct_field_types(self.returns)
|
||||
return ColumnExpr(expr, data_type=self.returns, field_types=field_types)
|
||||
|
||||
def _body(self) -> "tuple[str, str]":
|
||||
"""(body literal, body_format). Source when requested and
|
||||
retrievable; cloudpickle otherwise (handles closures)."""
|
||||
if self.prefer_source:
|
||||
try:
|
||||
src = textwrap.dedent(inspect.getsource(self.fn))
|
||||
# Strip the decorator line(s) so the stored body is a
|
||||
# plain function definition.
|
||||
lines = src.splitlines(keepends=True)
|
||||
while lines and lines[0].lstrip().startswith("@"):
|
||||
lines.pop(0)
|
||||
return "".join(lines), "source"
|
||||
except (OSError, TypeError):
|
||||
pass
|
||||
import cloudpickle
|
||||
|
||||
raw = cloudpickle.dumps(self.fn)
|
||||
return base64.b64encode(raw).decode("ascii"), "cloudpickle"
|
||||
|
||||
def _body_and_options(self) -> "tuple[str, dict[str, str]]":
|
||||
"""The body literal plus the finalized options (body_format /
|
||||
python_version / cloudpickle-pip bookkeeping for a non-source
|
||||
body)."""
|
||||
body, body_format = self._body()
|
||||
options = dict(self.options)
|
||||
if body_format != "source":
|
||||
options["body_format"] = body_format
|
||||
# Pickled code objects only load under the same interpreter
|
||||
# minor version; record ours so the worker can fail with a
|
||||
# clear message instead of a bytecode error.
|
||||
options["python_version"] = self.pickle_environment()
|
||||
# The worker deserializes the body with cloudpickle; make sure
|
||||
# the job's pip environment provides it. Conda bakes inject
|
||||
# cloudpickle server-side, so do not create an invalid pip+conda
|
||||
# declaration here.
|
||||
if "conda" not in options:
|
||||
pip = [d for d in options.get("pip", "").split(",") if d]
|
||||
if not any(d.startswith("cloudpickle") for d in pip):
|
||||
pip.append("cloudpickle")
|
||||
options["pip"] = ",".join(pip)
|
||||
return body, options
|
||||
|
||||
def create_request(self) -> dict:
|
||||
"""Keyword arguments for `connection.create_function`."""
|
||||
body, options = self._body_and_options()
|
||||
return {
|
||||
"name": self.name,
|
||||
"language": "python",
|
||||
"return_type": self.returns,
|
||||
"body": body,
|
||||
"options": options,
|
||||
}
|
||||
|
||||
def create_statement(self) -> str:
|
||||
"""The equivalent `CREATE FUNCTION` SQL (for SQL-surface callers)."""
|
||||
params = ", ".join(f"{n} {t}" for n, t in self.params)
|
||||
body, options = self._body_and_options()
|
||||
with_clause = ""
|
||||
if options:
|
||||
rendered = ", ".join(
|
||||
f"{k} = '{_escape(v)}'" for k, v in sorted(options.items())
|
||||
)
|
||||
with_clause = f" WITH ({rendered})"
|
||||
return (
|
||||
f"CREATE FUNCTION {self.name}({params}) RETURNS {self.returns} "
|
||||
f"LANGUAGE python AS '{_escape_body(body)}'{with_clause}"
|
||||
)
|
||||
|
||||
def pickle_environment(self) -> str:
|
||||
"""Python version the body pickles under -- workers should match
|
||||
the minor version for cloudpickle compatibility."""
|
||||
return f"{sys.version_info.major}.{sys.version_info.minor}"
|
||||
|
||||
|
||||
def _escape(s: str) -> str:
|
||||
return str(s).replace("'", "''")
|
||||
|
||||
|
||||
def _format_requirements(requirements: "str | list[str]") -> str:
|
||||
if isinstance(requirements, str):
|
||||
return requirements
|
||||
return "\n".join(str(req) for req in requirements)
|
||||
|
||||
|
||||
def _format_env(env: "dict[str, str] | list[str]") -> str:
|
||||
if isinstance(env, dict):
|
||||
return "; ".join(f"{key}={value}" for key, value in env.items())
|
||||
return "; ".join(str(entry) for entry in env)
|
||||
|
||||
|
||||
def _escape_body(body: str) -> str:
|
||||
# The server unescapes \n / \t in single-quoted bodies; encode real
|
||||
# newlines accordingly and escape quotes.
|
||||
return (
|
||||
body.replace("\\", "\\\\")
|
||||
.replace("'", "''")
|
||||
.replace("\n", "\\n")
|
||||
.replace("\t", "\\t")
|
||||
)
|
||||
|
||||
|
||||
def udf(fn=None, **kwargs):
|
||||
"""Decorate a function as a scalar (or struct-returning) UDF.
|
||||
|
||||
@udf
|
||||
def doubled(val: int) -> float: ...
|
||||
|
||||
@udf(pip=["torch>=2"], num_gpus=1)
|
||||
def embed(body: str) -> list[float]: ...
|
||||
"""
|
||||
if fn is not None:
|
||||
return Udf(fn, **kwargs)
|
||||
return lambda f: Udf(f, **kwargs)
|
||||
|
||||
|
||||
def table_udf(fn=None, **kwargs):
|
||||
"""Decorate a table function (UDTF): each input row may emit zero or
|
||||
more output rows. Only usable in materialized views.
|
||||
|
||||
class Chunk(TypedDict):
|
||||
chunk: str
|
||||
chunk_idx: int
|
||||
|
||||
@table_udf
|
||||
def chunker(body: str) -> list[Chunk]: ...
|
||||
"""
|
||||
kwargs["table"] = True
|
||||
if fn is not None:
|
||||
return Udf(fn, **kwargs)
|
||||
return lambda f: Udf(f, **kwargs)
|
||||
|
||||
|
||||
# -- view / job handles (thin references over a connection) -------------
|
||||
|
||||
|
||||
def struct_field_types(returns: str) -> "list[str]":
|
||||
"""Field type strings of a STRUCT(...) SQL type, in declared order."""
|
||||
inner = returns.strip()[len("STRUCT(") : -1]
|
||||
fields, depth, start = [], 0, 0
|
||||
for i, c in enumerate(inner):
|
||||
if c in "([":
|
||||
depth += 1
|
||||
elif c in ")]":
|
||||
depth -= 1
|
||||
elif c == "," and depth == 0:
|
||||
fields.append(inner[start:i].strip())
|
||||
start = i + 1
|
||||
fields.append(inner[start:].strip())
|
||||
# Each field is "name TYPE"; drop the name.
|
||||
return [f.split(None, 1)[1] for f in fields]
|
||||
|
||||
|
||||
def build_view_query(source, select) -> str:
|
||||
"""Assemble a view SELECT from a source (name or table) and select
|
||||
items: a column name, an expression string, a (alias, expression)
|
||||
tuple, or a @udf/@table_udf object."""
|
||||
src = source.name if hasattr(source, "name") else source
|
||||
items = []
|
||||
for item in select:
|
||||
if isinstance(item, Udf):
|
||||
items.append(item.expression())
|
||||
elif isinstance(item, tuple):
|
||||
alias, expr = item
|
||||
expr = expr.expression() if isinstance(expr, Udf) else expr
|
||||
items.append(f"{expr} AS {alias}")
|
||||
else:
|
||||
items.append(item)
|
||||
return f"SELECT {', '.join(items)} FROM {src}"
|
||||
|
||||
|
||||
def _job_id_matches(handle_id: str, listed_id: str) -> bool:
|
||||
# The refresh/backfill endpoints return the submission id (a uuid), but
|
||||
# the agent names the manifest job "<table>-<type>-<first 8 of the
|
||||
# submission id>" -- which is what list_jobs and cancel report. Match the
|
||||
# canonical id directly, or by that submission prefix.
|
||||
if listed_id == handle_id:
|
||||
return True
|
||||
prefix = handle_id[:8]
|
||||
return len(prefix) >= 4 and prefix in listed_id
|
||||
|
||||
|
||||
class MaterializedView:
|
||||
"""A reference to a materialized view (name + connection). Operations are
|
||||
server-backed connection calls bound to the name.
|
||||
|
||||
``create_materialized_view`` returns one of these; ``job_id`` is the
|
||||
initial-population job (None when the view was created with no data), so
|
||||
``db.create_materialized_view(...).wait()`` blocks until it is populated.
|
||||
"""
|
||||
|
||||
def __init__(self, conn, name: str, job_id: "str | None" = None):
|
||||
self.conn = conn
|
||||
self.name = name
|
||||
#: initial-population job id from create, or None (with_no_data).
|
||||
self.job_id = job_id
|
||||
|
||||
def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
|
||||
"""Block until the initial-population job (from create) finishes.
|
||||
A no-op when the view was created with no data."""
|
||||
if self.job_id is None:
|
||||
return "finished"
|
||||
return JobHandle(self.conn, self.job_id, table=self.name).wait(
|
||||
timeout=timeout, poll=poll
|
||||
)
|
||||
|
||||
def refresh(self, full: bool = False) -> "JobHandle":
|
||||
"""Refresh the materialized view; returns a `JobHandle` to wait on,
|
||||
poll, or cancel (``view.refresh().wait()``).
|
||||
|
||||
``full=True`` forces a full rebuild (recompute and replace every row)
|
||||
instead of the default incremental refresh. A full rebuild preserves
|
||||
the view's indexes -- they are reindexed by the distributed indexer.
|
||||
"""
|
||||
job_id = self.conn._refresh_materialized_view(self.name, full=full)
|
||||
return JobHandle(self.conn, job_id, table=self.name)
|
||||
|
||||
def explain_refresh(self, full: bool = False):
|
||||
"""Plan a refresh without running it (EXPLAIN REFRESH)."""
|
||||
return self.conn.explain_refresh_materialized_view(self.name, full=full)
|
||||
|
||||
def alter(self, auto_refresh: bool) -> None:
|
||||
self.conn.alter_materialized_view(self.name, auto_refresh=auto_refresh)
|
||||
|
||||
def drop(self) -> None:
|
||||
self.conn.drop_materialized_view(self.name)
|
||||
|
||||
# A materialized view is a first-class table: it can be indexed and
|
||||
# searched like any other. These open the materialized dataset by name and
|
||||
# delegate. Indexes declared this way are recorded against the view, so the
|
||||
# engine re-applies them after a full refresh rebuilds the dataset (a full
|
||||
# refresh overwrites the dataset, which would otherwise drop its indices).
|
||||
def _table(self):
|
||||
return self.conn.open_table(self.name)
|
||||
|
||||
def create_index(self, *args, **kwargs):
|
||||
"""Build an index on the materialized view (see Table.create_index)."""
|
||||
return self._table().create_index(*args, **kwargs)
|
||||
|
||||
def create_scalar_index(self, *args, **kwargs):
|
||||
"""Build a scalar index on the materialized view."""
|
||||
return self._table().create_scalar_index(*args, **kwargs)
|
||||
|
||||
def create_fts_index(self, *args, **kwargs):
|
||||
"""Build a full-text-search index on the materialized view."""
|
||||
return self._table().create_fts_index(*args, **kwargs)
|
||||
|
||||
def search(self, *args, **kwargs):
|
||||
"""Search the materialized view (vector / FTS / hybrid)."""
|
||||
return self._table().search(*args, **kwargs)
|
||||
|
||||
def lineage(self, column=None, *, direction=None, depth=None):
|
||||
"""Lineage of the materialized view (or one of its columns). Delegates
|
||||
to the backing table; the server already includes the view's sources
|
||||
and downstream dependents. Returns a `Lineage`."""
|
||||
return self._table().lineage(column, direction=direction, depth=depth)
|
||||
|
||||
|
||||
_PROGRESS = re.compile(r"(\d+)/(\d+)")
|
||||
|
||||
|
||||
class JobFailedError(RuntimeError):
|
||||
"""Raised by ``JobHandle.wait()`` when the server reports the job ``failed``.
|
||||
|
||||
Carries the server-side error so a doomed backfill (e.g. a multi-column
|
||||
``REFRESH COLUMN`` of a scalar UDF) surfaces its real cause promptly,
|
||||
instead of the caller blocking until ``wait()``'s timeout.
|
||||
"""
|
||||
|
||||
def __init__(self, job_id: str, error: "str | None"):
|
||||
self.job_id = job_id
|
||||
self.error = error
|
||||
super().__init__(f"job {job_id} failed: {error or 'unknown error'}")
|
||||
|
||||
|
||||
class JobHandle:
|
||||
"""A reference to an inflight server-side job, with polling helpers."""
|
||||
|
||||
#: How long an unseen job is treated as still materializing (submission
|
||||
#: -> agent cycle -> manifest write is async).
|
||||
GRACE_SECONDS = 20.0
|
||||
|
||||
def __init__(self, conn, job_id: str, table: "str | None" = None):
|
||||
self.conn = conn
|
||||
self.id = job_id
|
||||
#: The job's table, when known (refresh_column / MV refresh). Lets the
|
||||
#: server resolve this job with an O(1) single-node read; without it the
|
||||
#: lookup scans the database's active jobs (still correct).
|
||||
self.table = table
|
||||
self._created = time.monotonic()
|
||||
self._seen = False
|
||||
|
||||
def _job(self):
|
||||
# Poll by id (one job), not list_jobs (every active job): the server
|
||||
# matches the submission/manifest id and reads just this table's node.
|
||||
return self.conn.get_job(self.id, self.table)
|
||||
|
||||
def status(self) -> str:
|
||||
"""pending / running / cancelling / stale / failed, or 'finished' once the
|
||||
job has left the inflight listing."""
|
||||
job = self._job()
|
||||
if job is not None:
|
||||
self._seen = True
|
||||
return job.state
|
||||
if not self._seen and time.monotonic() - self._created < self.GRACE_SECONDS:
|
||||
return "pending"
|
||||
return "finished"
|
||||
|
||||
def progress(self) -> "tuple[int, int] | None":
|
||||
"""(units_done, units_total) while running, else None."""
|
||||
job = self._job()
|
||||
if job is not None and job.units_total is not None:
|
||||
return job.units_done or 0, job.units_total
|
||||
return None
|
||||
|
||||
def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
|
||||
"""Poll until the job is ``finished`` or ``stale`` and return that state.
|
||||
Raises ``JobFailedError`` on ``failed`` and ``TimeoutError`` after ``timeout`` seconds."""
|
||||
deadline = time.monotonic() + timeout
|
||||
while time.monotonic() < deadline:
|
||||
state = self.status()
|
||||
if state in ("finished", "stale"):
|
||||
return state
|
||||
if state == "failed":
|
||||
# Terminal failure -- surface the server error now, don't block
|
||||
# until `timeout`. `finalize` wrote it to the job's status node.
|
||||
job = self._job()
|
||||
raise JobFailedError(self.id, job.error if job is not None else None)
|
||||
if state == "pending":
|
||||
time.sleep(min(poll, 0.5))
|
||||
continue
|
||||
job = self._job()
|
||||
if job is not None and job.committed:
|
||||
return "finished"
|
||||
time.sleep(poll)
|
||||
raise TimeoutError(f"job {self.id} still {self.status()} after {timeout}s")
|
||||
|
||||
def cancel(self) -> None:
|
||||
# Cancel by the canonical manifest id (what cancel matches), found
|
||||
# via the submission prefix; fall back to the raw id.
|
||||
job = self._job()
|
||||
self.conn.cancel_job(job.job_id if job is not None else self.id)
|
||||
|
||||
|
||||
class AsyncMaterializedView:
|
||||
"""Async reference to a materialized view (name + async connection)."""
|
||||
|
||||
def __init__(self, conn, name: str, job_id: "str | None" = None):
|
||||
self.conn = conn
|
||||
self.name = name
|
||||
#: initial-population job id from create, or None (with_no_data).
|
||||
self.job_id = job_id
|
||||
|
||||
async def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
|
||||
"""Block until the initial-population job (from create) finishes.
|
||||
A no-op when the view was created with no data."""
|
||||
if self.job_id is None:
|
||||
return "finished"
|
||||
return await AsyncJobHandle(self.conn, self.job_id, table=self.name).wait(
|
||||
timeout=timeout, poll=poll
|
||||
)
|
||||
|
||||
async def refresh(self, full: bool = False) -> "AsyncJobHandle":
|
||||
"""Refresh the materialized view; returns an `AsyncJobHandle` to wait
|
||||
on, poll, or cancel.
|
||||
|
||||
``full=True`` forces a full rebuild instead of an incremental refresh
|
||||
(indexes are preserved and reindexed by the distributed indexer).
|
||||
"""
|
||||
job_id = await self.conn._refresh_materialized_view(self.name, full=full)
|
||||
return AsyncJobHandle(self.conn, job_id, table=self.name)
|
||||
|
||||
async def explain_refresh(self, full: bool = False):
|
||||
return await self.conn.explain_refresh_materialized_view(self.name, full=full)
|
||||
|
||||
async def alter(self, auto_refresh: bool) -> None:
|
||||
await self.conn.alter_materialized_view(self.name, auto_refresh=auto_refresh)
|
||||
|
||||
async def drop(self) -> None:
|
||||
await self.conn.drop_materialized_view(self.name)
|
||||
|
||||
async def lineage(self, column=None, *, direction=None, depth=None):
|
||||
"""Lineage of the materialized view (or column). Returns a `Lineage`."""
|
||||
return await self.conn.lineage(
|
||||
self.name, column, direction=direction, depth=depth
|
||||
)
|
||||
|
||||
|
||||
class AsyncJobHandle:
|
||||
"""Async reference to an inflight server-side job, with polling helpers."""
|
||||
|
||||
GRACE_SECONDS = 20.0
|
||||
|
||||
def __init__(self, conn, job_id: str, table: "str | None" = None):
|
||||
self.conn = conn
|
||||
self.id = job_id
|
||||
#: See JobHandle.table -- enables an O(1) by-id lookup when known.
|
||||
self.table = table
|
||||
self._created = time.monotonic()
|
||||
self._seen = False
|
||||
|
||||
async def _job(self):
|
||||
# Poll by id, not list_jobs (see JobHandle._job).
|
||||
return await self.conn.get_job(self.id, self.table)
|
||||
|
||||
async def status(self) -> str:
|
||||
job = await self._job()
|
||||
if job is not None:
|
||||
self._seen = True
|
||||
return job.state
|
||||
if not self._seen and time.monotonic() - self._created < self.GRACE_SECONDS:
|
||||
return "pending"
|
||||
return "finished"
|
||||
|
||||
async def progress(self) -> "tuple[int, int] | None":
|
||||
job = await self._job()
|
||||
if job is not None and job.units_total is not None:
|
||||
return job.units_done or 0, job.units_total
|
||||
return None
|
||||
|
||||
async def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
|
||||
deadline = time.monotonic() + timeout
|
||||
while time.monotonic() < deadline:
|
||||
state = await self.status()
|
||||
if state in ("finished", "stale"):
|
||||
return state
|
||||
if state == "failed":
|
||||
# Terminal failure -- surface the server error now, don't block
|
||||
# until `timeout`. `finalize` wrote it to the job's status node.
|
||||
job = await self._job()
|
||||
raise JobFailedError(self.id, job.error if job is not None else None)
|
||||
if state == "pending":
|
||||
await asyncio.sleep(min(poll, 0.5))
|
||||
continue
|
||||
job = await self._job()
|
||||
if job is not None and job.committed:
|
||||
return "finished"
|
||||
await asyncio.sleep(poll)
|
||||
raise TimeoutError(
|
||||
f"job {self.id} still {await self.status()} after {timeout}s"
|
||||
)
|
||||
|
||||
async def cancel(self) -> None:
|
||||
job = await self._job()
|
||||
await self.conn.cancel_job(job.job_id if job is not None else self.id)
|
||||
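The type-hint inference and STRUCT splitting in `udf.py` above can be exercised without a server. The sketch below is a simplified, standalone re-statement of `sql_type` and `struct_field_types` as they appear in the diff. The names `sketch_sql_type` and `sketch_field_types` are hypothetical, dataclass support is omitted for brevity, and the behavior of the real module may differ in details not visible here.

```python
import typing
from typing import TypedDict

# Same scalar table as in the diff: python float maps to FLOAT (Float32).
SCALARS = {int: "BIGINT", float: "FLOAT", str: "VARCHAR", bool: "BOOLEAN", bytes: "BLOB"}


def sketch_sql_type(hint) -> str:
    """Hypothetical stand-in for sql_type: scalars, list[scalar], TypedDict."""
    if hint in SCALARS:
        return SCALARS[hint]
    if typing.get_origin(hint) is list:
        (item,) = typing.get_args(hint) or (None,)
        if item in SCALARS:
            return f"{SCALARS[item]}[]"
        raise TypeError(f"unsupported list item type {item!r}")
    # TypedDict classes are dict subclasses at runtime and carry type hints.
    if isinstance(hint, type) and issubclass(hint, dict):
        fields = typing.get_type_hints(hint)
        if fields:
            inner = ", ".join(f"{n} {sketch_sql_type(h)}" for n, h in fields.items())
            return f"STRUCT({inner})"
    raise TypeError(f"cannot infer a SQL type for {hint!r}")


def sketch_field_types(returns: str) -> list:
    """Split STRUCT(...) into per-field types, respecting nested () and []."""
    inner = returns.strip()[len("STRUCT("):-1]
    fields, depth, start = [], 0, 0
    for i, c in enumerate(inner):
        if c in "([":
            depth += 1
        elif c in ")]":
            depth -= 1
        elif c == "," and depth == 0:
            # Only a top-level comma separates two fields.
            fields.append(inner[start:i].strip())
            start = i + 1
    fields.append(inner[start:].strip())
    # Each field is "name TYPE"; keep only the type.
    return [f.split(None, 1)[1] for f in fields]


class Chunk(TypedDict):
    chunk: str
    chunk_idx: int


struct_sql = sketch_sql_type(Chunk)
print(sketch_sql_type(list[float]))
print(struct_sql)
print(sketch_field_types(struct_sql))
```

Tracking bracket depth matters because a field type such as a nested `STRUCT(...)` contains commas of its own; a plain `split(",")` would cut it in the wrong place.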
92
python/python/tests/test_job_handle.py
Normal file
@@ -0,0 +1,92 @@
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
"""JobHandle.wait() terminal-state handling.
|
||||
|
||||
Regression coverage for the cluster backfill-failure hang: the server reports a
|
||||
doomed job as ``state="failed"`` within seconds, but ``wait()`` used to ignore
|
||||
``failed`` and block until its (default 3600s) timeout. These tests pin that a
|
||||
``failed`` job raises ``JobFailedError`` promptly, carrying the server error.
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import time
|
||||
|
||||
import pytest
|
||||
|
||||
from lancedb.udf import JobHandle, AsyncJobHandle, JobFailedError
|
||||
|
||||
|
||||
class FakeJobInfo:
|
||||
"""Mirror of the pyo3 builtins.JobInfo fields wait()/status() read."""
|
||||
|
||||
def __init__(self, state, error=None, committed=False, units_total=None):
|
||||
self.state = state
|
||||
self.error = error
|
||||
self.committed = committed
|
||||
self.units_total = units_total
|
||||
self.units_done = None
|
||||
self.job_id = "job-1"
|
||||
|
||||
|
||||
class FakeConn:
|
||||
"""get_job() walks a scripted list of JobInfo (or None) snapshots, holding
|
||||
the last one once exhausted, so wait() polls a deterministic timeline."""
|
||||
|
||||
def __init__(self, snapshots):
|
||||
self._snaps = list(snapshots)
|
||||
self.calls = 0
|
||||
|
||||
def get_job(self, job_id, table=None):
|
||||
snap = self._snaps[min(self.calls, len(self._snaps) - 1)]
|
||||
self.calls += 1
|
||||
return snap
|
||||
|
||||
|
||||
class AsyncFakeConn(FakeConn):
|
||||
async def get_job(self, job_id, table=None):
|
||||
return FakeConn.get_job(self, job_id, table)
|
||||
|
||||
|
||||
def test_wait_raises_on_failed_promptly():
|
||||
# pending -> failed: wait() must raise the server error, not TimeoutError.
|
||||
conn = FakeConn(
|
||||
[None, FakeJobInfo("failed", error="multi-column backfill needs a STRUCT")]
|
||||
)
|
||||
jh = JobHandle(conn, "job-1", table="t")
|
||||
t0 = time.monotonic()
|
||||
with pytest.raises(JobFailedError) as exc:
|
||||
jh.wait(timeout=30, poll=0.01)
|
||||
assert time.monotonic() - t0 < 5 # prompt, nowhere near the 30s timeout
|
||||
assert "STRUCT" in str(exc.value)
|
||||
assert exc.value.error == "multi-column backfill needs a STRUCT"
|
||||
assert exc.value.job_id == "job-1"
|
||||
|
||||
|
||||
def test_wait_returns_finished_on_success():
|
||||
# running -> finished (job left the inflight listing) returns normally.
|
||||
conn = FakeConn([FakeJobInfo("running", units_total=2), None])
|
||||
jh = JobHandle(conn, "job-1", table="t")
|
||||
jh._seen = True # already observed, so a None now means "finished" not grace
|
||||
assert jh.wait(timeout=30, poll=0.01) == "finished"
|
||||
|
||||
|
||||
def test_wait_returns_finished_on_committed():
|
||||
# A committed job that is still listed resolves to finished.
|
||||
conn = FakeConn([FakeJobInfo("running", committed=True, units_total=2)])
|
||||
jh = JobHandle(conn, "job-1", table="t")
|
||||
jh._seen = True
|
||||
assert jh.wait(timeout=30, poll=0.01) == "finished"
|
||||
|
||||
|
||||
def test_async_wait_raises_on_failed_promptly():
|
||||
conn = AsyncFakeConn([None, FakeJobInfo("failed", error="boom")])
|
||||
jh = AsyncJobHandle(conn, "job-1", table="t")
|
||||
|
||||
async def run():
|
||||
t0 = time.monotonic()
|
||||
with pytest.raises(JobFailedError) as exc:
|
||||
await jh.wait(timeout=30, poll=0.01)
|
||||
assert time.monotonic() - t0 < 5
|
||||
assert exc.value.error == "boom"
|
||||
|
||||
asyncio.run(run())
|
||||
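Two of the tests above set `jh._seen = True` so that a `None` snapshot resolves to `finished` rather than falling into the grace window. That decision can be sketched as a pure function. This is an illustrative sketch only: `resolve_status` is a hypothetical helper, and the default `grace_seconds` value is an assumption, since the value of `GRACE_SECONDS` is not shown in this diff.

```python
from typing import Optional


def resolve_status(
    job_state: Optional[str],
    seen: bool,
    age_seconds: float,
    grace_seconds: float = 10.0,  # assumed value, not taken from the source
) -> str:
    """Map one poll result to a status string.

    job_state is the state of the looked-up job, or None when the lookup
    returned nothing.
    """
    if job_state is not None:
        # The server knows the job: report its state verbatim.
        return job_state
    if not seen and age_seconds < grace_seconds:
        # Never observed and still young: the job may not be registered yet.
        return "pending"
    # Observed before (or grace expired) and now gone: treat as finished.
    return "finished"
```

For example, `resolve_status(None, seen=False, age_seconds=1.0)` yields `"pending"`, while the same call with `seen=True` yields `"finished"`.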
@@ -18,7 +18,10 @@ use lancedb::{
|
||||
connection::Connection as LanceConnection,
|
||||
connection::NamespaceClientPushdownOperation,
|
||||
database::namespace::LanceNamespaceDatabase,
|
||||
database::{CreateTableMode, Database, ReadConsistency},
|
||||
database::{
|
||||
CreateFunctionRequest, CreateMaterializedViewRequest, CreateTableMode, Database,
|
||||
ReadConsistency, RefreshMaterializedViewRequest, TableLineageRequest,
|
||||
},
|
||||
};
|
||||
use pyo3::{
|
||||
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
|
||||
@@ -27,6 +30,92 @@ use pyo3::{
|
||||
types::{PyDict, PyDictMethods},
|
||||
};
|
||||
|
||||
/// A registered function, as returned by `list_functions`.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct FunctionInfo {
|
||||
pub name: String,
|
||||
pub language: String,
|
||||
pub return_type: String,
|
||||
pub description: String,
|
||||
}
|
||||
|
||||
/// A registered materialized view definition.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct MaterializedViewInfo {
|
||||
pub name: String,
|
||||
pub source_table: String,
|
||||
pub projection: Vec<String>,
|
||||
pub udf_columns: Vec<String>,
|
||||
pub filter: Option<String>,
|
||||
pub auto_refresh: bool,
|
||||
}
|
||||
|
||||
/// One inflight server-side job.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct JobInfo {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub age_seconds: Option<i64>,
|
||||
pub command: Option<String>,
|
||||
pub units_done: Option<i64>,
|
||||
pub units_total: Option<i64>,
|
||||
pub committed: bool,
|
||||
pub rows_skipped: u64,
|
||||
pub error: Option<String>,
|
||||
}
|
||||
|
||||
/// One durable, completed/terminal server-side job record (SHOW JOB HISTORY).
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct JobHistoryEntry {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub created_ms: i64,
|
||||
pub updated_ms: i64,
|
||||
pub completed_ms: Option<i64>,
|
||||
pub rows_processed: Option<i64>,
|
||||
pub rows_skipped: Option<i64>,
|
||||
pub error: Option<String>,
|
||||
pub events: Option<String>,
|
||||
}
|
||||
|
||||
/// One per-row UDF error recorded by `error_policy=skip` (SHOW ERRORS).
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct JobErrorEntry {
|
||||
pub job_id: String,
|
||||
pub table: String,
|
||||
pub column: String,
|
||||
pub error_type: String,
|
||||
pub error_message: String,
|
||||
pub fragment_id: Option<i64>,
|
||||
pub source_row_id: Option<i64>,
|
||||
pub table_version: Option<i64>,
|
||||
pub age_seconds: Option<i64>,
|
||||
}
|
||||
|
||||
/// The plan a REFRESH MATERIALIZED VIEW would execute (EXPLAIN REFRESH).
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone)]
|
||||
pub struct MvRefreshPlan {
|
||||
pub table_name: String,
|
||||
pub has_work: bool,
|
||||
pub source_version: u64,
|
||||
pub last_refreshed_version: Option<u64>,
|
||||
pub full_refresh: bool,
|
||||
pub rebuild: bool,
|
||||
pub units_total: u64,
|
||||
}
|
||||
|
||||
#[pyclass]
|
||||
pub struct Connection {
|
||||
inner: Option<LanceConnection>,
|
||||
@@ -310,6 +399,308 @@ impl Connection {
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, language, return_type, body, options=None))]
|
||||
pub fn create_function(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
language: String,
|
||||
return_type: String,
|
||||
body: String,
|
||||
options: Option<HashMap<String, String>>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.create_function(CreateFunctionRequest {
|
||||
name,
|
||||
language,
|
||||
return_type,
|
||||
body,
|
||||
options: options.unwrap_or_default(),
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn list_functions(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let functions = inner.list_functions().await.infer_error()?;
|
||||
Ok(functions
|
||||
.into_iter()
|
||||
.map(|f| FunctionInfo {
|
||||
name: f.name,
|
||||
language: f.language,
|
||||
return_type: f.return_type,
|
||||
description: f.description,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn drop_function(self_: PyRef<'_, Self>, name: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.drop_function(&name).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, query, auto_refresh=false, with_no_data=false, partition_by=None))]
|
||||
pub fn create_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
query: String,
|
||||
auto_refresh: bool,
|
||||
with_no_data: bool,
|
||||
partition_by: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.create_materialized_view(CreateMaterializedViewRequest {
|
||||
name,
|
||||
query,
|
||||
auto_refresh,
|
||||
with_no_data,
|
||||
partition_by,
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, full=false, src_version=None, num_workers=None, max_workers=None))]
|
||||
pub fn refresh_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.refresh_materialized_view(RefreshMaterializedViewRequest {
|
||||
name,
|
||||
full,
|
||||
src_version,
|
||||
num_workers,
|
||||
max_workers,
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
/// Derived-compute lineage of a table/view (or column), returned as the
|
||||
/// server's lineage JSON string (the Python layer parses it).
|
||||
pub fn table_lineage(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
column: Option<String>,
|
||||
direction: Option<String>,
|
||||
depth: Option<u32>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.table_lineage(TableLineageRequest {
|
||||
name,
|
||||
column,
|
||||
direction,
|
||||
depth,
|
||||
})
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (name, full=false, src_version=None))]
|
||||
pub fn explain_refresh_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let p = inner
|
||||
.explain_refresh_materialized_view(&name, full, src_version)
|
||||
.await
|
||||
.infer_error()?;
|
||||
Ok(MvRefreshPlan {
|
||||
table_name: p.table_name,
|
||||
has_work: p.has_work,
|
||||
source_version: p.source_version,
|
||||
last_refreshed_version: p.last_refreshed_version,
|
||||
full_refresh: p.full_refresh,
|
||||
rebuild: p.rebuild,
|
||||
units_total: p.units_total,
|
||||
})
|
||||
})
|
||||
}
|
||||
|
||||
pub fn alter_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
auto_refresh: bool,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.alter_materialized_view(&name, auto_refresh)
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn drop_materialized_view(
|
||||
self_: PyRef<'_, Self>,
|
||||
name: String,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.drop_materialized_view(&name).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn list_materialized_views(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let views = inner.list_materialized_views().await.infer_error()?;
|
||||
Ok(views
|
||||
.into_iter()
|
||||
.map(|v| MaterializedViewInfo {
|
||||
name: v.name,
|
||||
source_table: v.source_table,
|
||||
projection: v.projection,
|
||||
udf_columns: v.udf_columns,
|
||||
filter: v.filter,
|
||||
auto_refresh: v.auto_refresh,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn list_jobs(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let jobs = inner.list_jobs().await.infer_error()?;
|
||||
Ok(jobs
|
||||
.into_iter()
|
||||
.map(|j| JobInfo {
|
||||
table: j.table,
|
||||
job_id: j.job_id,
|
||||
job_type: j.job_type,
|
||||
state: j.state,
|
||||
column: j.column,
|
||||
age_seconds: j.age_seconds,
|
||||
command: j.command,
|
||||
units_done: j.units_done,
|
||||
units_total: j.units_total,
|
||||
committed: j.committed,
|
||||
rows_skipped: j.rows_skipped,
|
||||
error: j.error,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
pub fn cancel_job(self_: PyRef<'_, Self>, job_id: String) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.cancel_job(&job_id).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (job_id, table=None))]
|
||||
pub fn get_job(
|
||||
self_: PyRef<'_, Self>,
|
||||
job_id: String,
|
||||
table: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let job = inner
|
||||
.get_job(&job_id, table.as_deref())
|
||||
.await
|
||||
.infer_error()?;
|
||||
Ok(job.map(|j| JobInfo {
|
||||
table: j.table,
|
||||
job_id: j.job_id,
|
||||
job_type: j.job_type,
|
||||
state: j.state,
|
||||
column: j.column,
|
||||
age_seconds: j.age_seconds,
|
||||
command: j.command,
|
||||
units_done: j.units_done,
|
||||
units_total: j.units_total,
|
||||
committed: j.committed,
|
||||
rows_skipped: j.rows_skipped,
|
||||
error: j.error,
|
||||
}))
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (job_id=None))]
|
||||
pub fn job_history(
|
||||
self_: PyRef<'_, Self>,
|
||||
job_id: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let rows = inner.job_history(job_id.as_deref()).await.infer_error()?;
|
||||
Ok(rows
|
||||
.into_iter()
|
||||
.map(|r| JobHistoryEntry {
|
||||
table: r.table,
|
||||
job_id: r.job_id,
|
||||
job_type: r.job_type,
|
||||
state: r.state,
|
||||
column: r.column,
|
||||
created_ms: r.created_ms,
|
||||
updated_ms: r.updated_ms,
|
||||
completed_ms: r.completed_ms,
|
||||
rows_processed: r.rows_processed,
|
||||
rows_skipped: r.rows_skipped,
|
||||
error: r.error,
|
||||
events: r.events,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (job_id=None, table=None))]
|
||||
pub fn errors(
|
||||
self_: PyRef<'_, Self>,
|
||||
job_id: Option<String>,
|
||||
table: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.get_inner()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
let rows = inner
|
||||
.errors(job_id.as_deref(), table.as_deref())
|
||||
.await
|
||||
.infer_error()?;
|
||||
Ok(rows
|
||||
.into_iter()
|
||||
.map(|e| JobErrorEntry {
|
||||
job_id: e.job_id,
|
||||
table: e.table,
|
||||
column: e.column,
|
||||
error_type: e.error_type,
|
||||
error_message: e.error_message,
|
||||
fragment_id: e.fragment_id,
|
||||
source_row_id: e.source_row_id,
|
||||
table_version: e.table_version,
|
||||
age_seconds: e.age_seconds,
|
||||
})
|
||||
.collect::<Vec<_>>())
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (cur_name, new_name, cur_namespace_path=None, new_namespace_path=None))]
|
||||
pub fn rename_table(
|
||||
self_: PyRef<'_, Self>,
|
||||
|
||||
@@ -41,6 +41,11 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
|
||||
.write_style("LANCEDB_LOG_STYLE");
|
||||
env_logger::init_from_env(env);
|
||||
m.add_class::<Connection>()?;
|
||||
m.add_class::<connection::FunctionInfo>()?;
|
||||
m.add_class::<connection::MaterializedViewInfo>()?;
|
||||
m.add_class::<connection::JobInfo>()?;
|
||||
m.add_class::<connection::JobHistoryEntry>()?;
|
||||
m.add_class::<connection::JobErrorEntry>()?;
|
||||
m.add_class::<Session>()?;
|
||||
m.add_class::<Table>()?;
|
||||
m.add_class::<IndexConfig>()?;
|
||||
|
||||
@@ -17,8 +17,8 @@ use arrow::{
|
||||
pyarrow::{FromPyArrow, PyArrowType, ToPyArrow},
|
||||
};
|
||||
use lancedb::table::{
|
||||
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, NewColumnTransform,
|
||||
OptimizeAction, OptimizeOptions, Ref, Table as LanceDbTable,
|
||||
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, LoadColumnsRequest,
|
||||
NewColumnTransform, OptimizeAction, OptimizeOptions, Ref, Table as LanceDbTable,
|
||||
};
|
||||
use pyo3::{
|
||||
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
|
||||
@@ -1060,6 +1060,83 @@ impl Table {
|
||||
})
|
||||
}
|
||||
|
||||
pub fn add_computed_columns(
|
||||
self_: PyRef<'_, Self>,
|
||||
columns: Vec<(String, String)>,
|
||||
expression: String,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.add_computed_columns(&columns, &expression)
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[pyo3(signature = (columns, where_clause=None, num_workers=None, max_workers=None, batch_size=None, priority=None))]
|
||||
pub fn refresh_column(
|
||||
self_: PyRef<'_, Self>,
|
||||
columns: Vec<String>,
|
||||
where_clause: Option<String>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
batch_size: Option<u32>,
|
||||
priority: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
future_into_py(self_.py(), async move {
|
||||
inner
|
||||
.refresh_column(
|
||||
&columns,
|
||||
where_clause,
|
||||
num_workers,
|
||||
max_workers,
|
||||
batch_size,
|
||||
priority,
|
||||
)
|
||||
.await
|
||||
.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
#[allow(clippy::too_many_arguments)]
|
||||
#[pyo3(signature = (source_uris, source_format, target_key, columns, source_key=None, source_storage_options=None, on_missing=None, num_workers=None, max_workers=None, batch_size=None, commit_granularity=None, priority=None))]
|
||||
pub fn load_columns(
|
||||
self_: PyRef<'_, Self>,
|
||||
source_uris: Vec<String>,
|
||||
source_format: String,
|
||||
target_key: String,
|
||||
columns: Vec<(String, Option<String>)>,
|
||||
source_key: Option<String>,
|
||||
source_storage_options: Option<std::collections::HashMap<String, String>>,
|
||||
on_missing: Option<String>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
batch_size: Option<u32>,
|
||||
commit_granularity: Option<u32>,
|
||||
priority: Option<String>,
|
||||
) -> PyResult<Bound<'_, PyAny>> {
|
||||
let inner = self_.inner_ref()?.clone();
|
||||
let request = LoadColumnsRequest {
|
||||
source_uris,
|
||||
source_format,
|
||||
source_storage_options,
|
||||
target_key,
|
||||
source_key,
|
||||
columns,
|
||||
on_missing,
|
||||
num_workers,
|
||||
max_workers,
|
||||
batch_size,
|
||||
commit_granularity,
|
||||
priority,
|
||||
};
|
||||
future_into_py(self_.py(), async move {
|
||||
inner.load_columns(request).await.infer_error()
|
||||
})
|
||||
}
|
||||
|
||||
pub fn add_columns(
|
||||
self_: PyRef<'_, Self>,
|
||||
definitions: Vec<(String, String)>,
|
||||
|
||||
@@ -166,10 +166,6 @@ required-features = ["bedrock"]
|
||||
[[example]]
|
||||
name = "simple"
|
||||
|
||||
[[example]]
|
||||
name = "polars"
|
||||
required-features = ["polars"]
|
||||
|
||||
[[example]]
|
||||
name = "full_text_search"
|
||||
|
||||
|
||||
@@ -1,47 +0,0 @@
|
||||
// SPDX-License-Identifier: Apache-2.0
|
||||
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
|
||||
|
||||
//! This example demonstrates ingesting a Polars DataFrame into LanceDB and
|
||||
//! reading it back out as a Polars DataFrame.
|
||||
|
||||
use lancedb::arrow::IntoPolars;
|
||||
use lancedb::query::ExecutableQuery;
|
||||
use lancedb::{Result, connect};
|
||||
use polars::prelude::{DataFrame, NamedFrom, Series};
|
||||
|
||||
fn make_dataframe() -> DataFrame {
|
||||
let ids = Series::new("id", &[1i32, 2, 3, 4, 5]);
|
||||
let names = Series::new("name", &["Alice", "Bob", "Carol", "Dave", "Eve"]);
|
||||
let scores = Series::new("score", &[9.5f64, 8.1, 7.3, 9.0, 6.5]);
|
||||
DataFrame::new(vec![ids, names, scores]).unwrap()
|
||||
}
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> Result<()> {
|
||||
let tmp = tempfile::tempdir().unwrap();
|
||||
let db = connect(tmp.path().to_str().unwrap()).execute().await?;
|
||||
|
||||
// Ingest a Polars DataFrame directly — DataFrame now implements Scannable.
|
||||
let df = make_dataframe();
|
||||
println!("Input DataFrame:\n{df}");
|
||||
|
||||
let table = db.create_table("people", df).execute().await?;
|
||||
|
||||
// Append more rows.
|
||||
let more = DataFrame::new(vec![
|
||||
Series::new("id", &[6i32, 7]),
|
||||
Series::new("name", &["Frank", "Grace"]),
|
||||
Series::new("score", &[7.8f64, 8.9]),
|
||||
])
|
||||
.unwrap();
|
||||
table.add(more).execute().await?;
|
||||
|
||||
// Read back as a Polars DataFrame.
|
||||
let result_df = table.query().execute().await?.into_polars().await?;
|
||||
|
||||
println!(
|
||||
"\nRound-tripped DataFrame ({} rows):\n{result_df}",
|
||||
result_df.height()
|
||||
);
|
||||
Ok(())
|
||||
}
|
||||
@@ -112,14 +112,54 @@ impl<S: Stream<Item = Result<arrow_array::RecordBatch>>> RecordBatchStream
|
||||
|
||||
/// A trait for converting incoming data to Arrow
|
||||
///
|
||||
/// Integrations should implement this trait to allow data to be
|
||||
/// imported directly from the integration. For example, implementing
|
||||
/// this trait for `Vec<Vec<...>>` would allow the `Vec` to be directly
|
||||
/// used in methods like [`crate::connection::Connection::create_table`]
|
||||
/// or [`crate::table::Table::add`]
|
||||
pub trait IntoArrow {
|
||||
/// Convert the data into an iterator of Arrow batches
|
||||
fn into_arrow(self) -> Result<Box<dyn arrow_array::RecordBatchReader + Send>>;
|
||||
}
|
||||
|
||||
pub type BoxedRecordBatchReader = Box<dyn arrow_array::RecordBatchReader + Send>;
|
||||
|
||||
impl<T: arrow_array::RecordBatchReader + Send + 'static> IntoArrow for T {
|
||||
fn into_arrow(self) -> Result<Box<dyn arrow_array::RecordBatchReader + Send>> {
|
||||
Ok(Box::new(self))
|
||||
}
|
||||
}
|
||||
|
||||
/// A trait for converting incoming data to Arrow asynchronously
|
||||
///
|
||||
/// Serves the same purpose as [`IntoArrow`], but for asynchronous data.
|
||||
///
|
||||
/// Note: Arrow has no async equivalent to `RecordBatchReader`, so this returns a [`SendableRecordBatchStream`].
|
||||
pub trait IntoArrowStream {
|
||||
/// Convert the data into a stream of Arrow batches
|
||||
fn into_arrow(self) -> Result<SendableRecordBatchStream>;
|
||||
}
|
||||
|
||||
impl<S: Stream<Item = Result<arrow_array::RecordBatch>>> SimpleRecordBatchStream<S> {
|
||||
pub fn new(stream: S, schema: Arc<arrow_schema::Schema>) -> Self {
|
||||
Self { schema, stream }
|
||||
}
|
||||
}
|
||||
|
||||
impl IntoArrowStream for SendableRecordBatchStream {
|
||||
fn into_arrow(self) -> Result<SendableRecordBatchStream> {
|
||||
Ok(self)
|
||||
}
|
||||
}
|
||||
|
||||
impl IntoArrowStream for datafusion_physical_plan::SendableRecordBatchStream {
|
||||
fn into_arrow(self) -> Result<SendableRecordBatchStream> {
|
||||
let schema = self.schema();
|
||||
let stream = self.map_err(|df_err| df_err.into());
|
||||
Ok(Box::pin(SimpleRecordBatchStream::new(stream, schema)))
|
||||
}
|
||||
}
|
||||
|
||||
pub trait LanceDbDatagenExt {
|
||||
fn into_ldb_stream(
|
||||
self,
|
||||
@@ -224,7 +264,9 @@ impl IntoPolars for SendableRecordBatchStream {
|
||||
#[cfg(all(test, feature = "polars"))]
|
||||
mod tests {
|
||||
use super::SendableRecordBatchStream;
|
||||
use crate::arrow::{IntoPolars, PolarsDataFrameRecordBatchReader, SimpleRecordBatchStream};
|
||||
use crate::arrow::{
|
||||
IntoArrow, IntoPolars, PolarsDataFrameRecordBatchReader, SimpleRecordBatchStream,
|
||||
};
|
||||
use polars::prelude::{DataFrame, NamedFrom, Series};
|
||||
|
||||
fn get_record_batch_reader_from_polars() -> Box<dyn arrow_array::RecordBatchReader + Send> {
|
||||
@@ -238,7 +280,10 @@ mod tests {
|
||||
float_series = Series::new("float", &[2.0]);
|
||||
let df2 = DataFrame::new(vec![string_series, int_series, float_series]).unwrap();
|
||||
|
||||
Box::new(PolarsDataFrameRecordBatchReader::new(df1.vstack(&df2).unwrap()).unwrap())
|
||||
PolarsDataFrameRecordBatchReader::new(df1.vstack(&df2).unwrap())
|
||||
.unwrap()
|
||||
.into_arrow()
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
||||
@@ -23,8 +23,10 @@ use crate::connection::create_table::CreateTableBuilder;
|
||||
use crate::data::scannable::Scannable;
|
||||
use crate::database::listing::ListingDatabase;
|
||||
use crate::database::{
|
||||
CloneTableRequest, Database, DatabaseOptions, OpenTableRequest, ReadConsistency,
|
||||
TableNamesRequest,
|
||||
CloneTableRequest, CreateFunctionRequest, CreateMaterializedViewRequest, Database,
|
||||
DatabaseOptions, FunctionInfo, JobErrorInfo, JobHistoryInfo, JobInfo, MaterializedViewInfo,
|
||||
MvRefreshPlan, OpenTableRequest, ReadConsistency, RefreshMaterializedViewRequest,
|
||||
TableLineageRequest, TableNamesRequest,
|
||||
};
|
||||
use crate::embeddings::{EmbeddingRegistry, MemoryRegistry};
|
||||
use crate::error::{Error, Result};
|
||||
@@ -488,6 +490,113 @@ impl Connection {
|
||||
)
|
||||
}
|
||||
|
||||
// -- Derived compute: functions, materialized views, jobs -------------
|
||||
// Server-backed features (LanceDB Enterprise / Cloud); local
|
||||
// databases return NotSupported for now.
|
||||
|
||||
/// Register a UDF (CREATE FUNCTION).
|
||||
pub async fn create_function(&self, request: CreateFunctionRequest) -> Result<()> {
|
||||
self.internal.create_function(request).await
|
||||
}
|
||||
|
||||
/// List registered functions (SHOW FUNCTIONS).
|
||||
pub async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
|
||||
self.internal.list_functions().await
|
||||
}
|
||||
|
||||
/// Drop a registered function (DROP FUNCTION).
|
||||
pub async fn drop_function(&self, name: &str) -> Result<()> {
|
||||
self.internal.drop_function(name).await
|
||||
}
|
||||
|
||||
/// Create a materialized view (CREATE MATERIALIZED VIEW). Returns
|
||||
/// the initial-population job id, absent when `with_no_data`.
|
||||
pub async fn create_materialized_view(
|
||||
&self,
|
||||
request: CreateMaterializedViewRequest,
|
||||
) -> Result<Option<String>> {
|
||||
self.internal.create_materialized_view(request).await
|
||||
}
|
||||
|
||||
/// Refresh a materialized view; returns the refresh job id.
|
||||
pub async fn refresh_materialized_view(
|
||||
&self,
|
||||
request: RefreshMaterializedViewRequest,
|
||||
) -> Result<String> {
|
||||
self.internal.refresh_materialized_view(request).await
|
||||
}
|
||||
|
||||
/// Derived-compute lineage of a table/view (or column), as server-defined
|
||||
/// JSON. Read-only.
|
||||
pub async fn table_lineage(&self, request: TableLineageRequest) -> Result<String> {
|
||||
self.internal.table_lineage(request).await
|
||||
}
|
||||
|
||||
/// Plan a materialized-view refresh without submitting work
|
||||
/// (EXPLAIN REFRESH).
|
||||
pub async fn explain_refresh_materialized_view(
|
||||
&self,
|
||||
name: &str,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
) -> Result<MvRefreshPlan> {
|
||||
self.internal
|
||||
.explain_refresh_materialized_view(name, full, src_version)
|
||||
.await
|
||||
}
|
||||
|
||||
/// Update a materialized view's options (ALTER MATERIALIZED VIEW).
|
||||
pub async fn alter_materialized_view(&self, name: &str, auto_refresh: bool) -> Result<()> {
|
||||
self.internal
|
||||
.alter_materialized_view(name, auto_refresh)
|
||||
.await
|
||||
}
|
||||
|
||||
/// Drop a materialized view definition (DROP MATERIALIZED VIEW).
|
||||
pub async fn drop_materialized_view(&self, name: &str) -> Result<()> {
|
||||
self.internal.drop_materialized_view(name).await
|
||||
}
|
||||
|
||||
/// List registered materialized view definitions.
|
||||
pub async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
|
||||
self.internal.list_materialized_views().await
|
||||
}
|
||||
|
||||
/// List inflight server-side jobs across the database's tables.
|
||||
pub async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
|
||||
self.internal.list_jobs().await
|
||||
}
|
||||
|
||||
/// Cancel an inflight server-side job by id. Returns true if a
|
||||
/// matching inflight job was flagged for cancellation.
|
||||
pub async fn cancel_job(&self, job_id: &str) -> Result<bool> {
|
||||
self.internal.cancel_job(job_id).await
|
||||
}
|
||||
|
||||
/// Look up a single server-side job by id -- the `wait()`/status poll path.
|
||||
/// `table_hint` (the job's table) enables an O(1) server-side lookup; `None`
|
||||
/// scans the database's active jobs. A `None` result means unknown / not
|
||||
/// active.
|
||||
pub async fn get_job(&self, job_id: &str, table_hint: Option<&str>) -> Result<Option<JobInfo>> {
|
||||
self.internal.get_job(job_id, table_hint).await
|
||||
}
|
||||
|
||||
/// Durable job history (SHOW JOB HISTORY) across the database's tables.
|
||||
/// Pass `job_id` to narrow to a single job.
|
||||
pub async fn job_history(&self, job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
|
||||
self.internal.job_history(job_id).await
|
||||
}
|
||||
|
||||
/// Per-row UDF errors (SHOW ERRORS) across the database's tables, optionally
|
||||
/// filtered by `job_id` and/or `table`.
|
||||
pub async fn errors(
|
||||
&self,
|
||||
job_id: Option<&str>,
|
||||
table: Option<&str>,
|
||||
) -> Result<Vec<JobErrorInfo>> {
|
||||
self.internal.errors(job_id, table).await
|
||||
}
|
||||
|
||||
/// Rename a table in the database.
|
||||
///
|
||||
/// This is only supported in LanceDB Cloud.
|
||||
|
||||
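The `get_job` doc comment above distinguishes a hinted O(1) lookup from a scan over active jobs. The difference can be illustrated with a toy in-memory registry. This is only a sketch of the idea: `JobRegistry` and its layout are hypothetical and say nothing about how the server actually stores jobs.

```python
from typing import Dict, Optional


class JobRegistry:
    """Toy registry keyed by table, then by job id (hypothetical layout)."""

    def __init__(self) -> None:
        self._by_table: Dict[str, Dict[str, str]] = {}

    def add(self, table: str, job_id: str, state: str) -> None:
        self._by_table.setdefault(table, {})[job_id] = state

    def get_job(self, job_id: str, table_hint: Optional[str] = None) -> Optional[str]:
        if table_hint is not None:
            # With a hint: two dictionary lookups, independent of job count.
            return self._by_table.get(table_hint, {}).get(job_id)
        # Without a hint: scan every table's active jobs.
        for jobs in self._by_table.values():
            if job_id in jobs:
                return jobs[job_id]
        # Unknown or no longer active.
        return None
```

Both paths return `None` for an unknown job, which matches the documented meaning of a `None` result: unknown or not active.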
@@ -185,43 +185,6 @@ impl Scannable for SendableRecordBatchStream {
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(feature = "polars")]
|
||||
impl Scannable for polars::frame::DataFrame {
|
||||
fn schema(&self) -> SchemaRef {
|
||||
crate::polars_arrow_convertors::convert_polars_df_schema_to_arrow_rb_schema(
|
||||
self.schema().clone(),
|
||||
)
|
||||
.expect("failed to convert Polars DataFrame schema to Arrow schema")
|
||||
}
|
||||
|
||||
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
|
||||
let schema = Scannable::schema(self);
|
||||
let batches: crate::Result<Vec<RecordBatch>> =
|
||||
match crate::arrow::PolarsDataFrameRecordBatchReader::new(self.clone()) {
|
||||
Err(e) => Err(e),
|
||||
Ok(reader) => reader.map(|b| b.map_err(Into::into)).collect(),
|
||||
};
|
||||
match batches {
|
||||
Err(e) => Box::pin(SimpleRecordBatchStream {
|
||||
schema,
|
||||
stream: once(async move { Err(e) }),
|
||||
}),
|
||||
Ok(batches) => {
|
||||
let stream = futures::stream::iter(batches.into_iter().map(Ok));
|
||||
Box::pin(SimpleRecordBatchStream { schema, stream })
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn num_rows(&self) -> Option<usize> {
|
||||
Some(self.height())
|
||||
}
|
||||
|
||||
fn rescannable(&self) -> bool {
|
||||
true
|
||||
}
|
||||
}
|
||||
|
||||
#[async_trait]
|
||||
impl StreamingWriteSource for Box<dyn Scannable> {
|
||||
fn arrow_schema(&self) -> SchemaRef {
|
||||
@@ -1126,60 +1089,4 @@ mod tests {
|
||||
);
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(feature = "polars")]
|
||||
mod polars_tests {
|
||||
use super::*;
|
||||
use crate::arrow::IntoPolars;
|
||||
use crate::query::ExecutableQuery;
|
||||
use polars::prelude::{DataFrame, NamedFrom, Series};
|
||||
|
||||
fn make_df() -> DataFrame {
|
||||
DataFrame::new(vec![
|
||||
Series::new("id", &[1i32, 2, 3]),
|
||||
Series::new("val", &[1.1f64, 2.2, 3.3]),
|
||||
])
|
||||
.unwrap()
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_dataframe_scannable_round_trip() {
|
||||
let tmp = tempfile::tempdir().unwrap();
|
||||
let db = crate::connect(tmp.path().to_str().unwrap())
|
||||
.execute()
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
let df = make_df();
|
||||
let table = db.create_table("t", df.clone()).execute().await.unwrap();
|
||||
|
||||
// Append the same rows again.
|
||||
table.add(df.clone()).execute().await.unwrap();
|
||||
|
||||
let result = table
|
||||
.query()
|
||||
.execute()
|
||||
.await
|
||||
.unwrap()
|
||||
.into_polars()
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
assert_eq!(result.height(), df.height() * 2);
|
||||
assert_eq!(result.schema(), df.schema());
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_dataframe_scannable_rescannable() {
|
||||
let mut df = make_df();
|
||||
assert!(df.rescannable());
|
||||
|
||||
let batches1: Vec<RecordBatch> = df.scan_as_stream().try_collect().await.unwrap();
|
||||
assert_eq!(batches1.iter().map(|b| b.num_rows()).sum::<usize>(), 3);
|
||||
|
||||
// Can be scanned again.
|
||||
let batches2: Vec<RecordBatch> = df.scan_as_stream().try_collect().await.unwrap();
|
||||
assert_eq!(batches2.iter().map(|b| b.num_rows()).sum::<usize>(), 3);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -27,7 +27,7 @@ use lance_namespace::models::{
|
||||
};
|
||||
|
||||
use crate::data::scannable::Scannable;
|
||||
use crate::error::Result;
|
||||
use crate::error::{Error, Result};
|
||||
use crate::table::{BaseTable, WriteOptions};
|
||||
|
||||
pub mod listing;
|
||||
@@ -200,6 +200,205 @@ pub enum ReadConsistency {
|
||||
Strong,
|
||||
}
|
||||
|
||||
/// A request to register a UDF (CREATE FUNCTION).
|
||||
///
|
||||
/// Functions are first-class database objects, decoupled from any
|
||||
/// column; computed columns and materialized views reference them by
|
||||
/// name. Server-backed feature (LanceDB Enterprise / Cloud).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct CreateFunctionRequest {
|
||||
/// Function name.
|
||||
pub name: String,
|
||||
/// Implementation language (currently "python").
|
||||
pub language: String,
|
||||
/// SQL return type, e.g. `FLOAT`, `FLOAT[1536]`,
|
||||
/// `STRUCT(a FLOAT, b VARCHAR)`, `TABLE(chunk VARCHAR, idx INT)`.
|
||||
pub return_type: String,
|
||||
/// Function body: source text, or base64 cloudpickle bytes when
|
||||
/// `options["body_format"] = "cloudpickle"`.
|
||||
pub body: String,
|
||||
/// Options: input_columns, pip, num_gpus, batch_size, timeout,
|
||||
/// error_policy, docker_image, body_format, ...
|
||||
pub options: HashMap<String, String>,
|
||||
}
|
||||
|
||||
/// A registered function, as returned by `list_functions`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct FunctionInfo {
|
||||
pub name: String,
|
||||
pub language: String,
|
||||
pub return_type: String,
|
||||
pub description: String,
|
||||
}
|
||||
|
||||
/// A request to create a materialized view (CREATE MATERIALIZED VIEW).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct CreateMaterializedViewRequest {
|
||||
/// View name.
|
||||
pub name: String,
|
||||
/// The view's SELECT statement, e.g.
|
||||
/// `SELECT id, embed(body) AS vec FROM articles WHERE id > 1`.
|
||||
/// Bare columns project through; function-call columns compute via
|
||||
/// registered UDFs (a RETURNS TABLE function makes a row-expanding
|
||||
/// chunker view).
|
||||
pub query: String,
|
||||
/// Refresh automatically when the source table changes.
|
||||
pub auto_refresh: bool,
|
||||
/// Register the definition only; skip the initial population.
|
||||
pub with_no_data: bool,
|
||||
/// Optional source column to partition the view's table function on. If the
|
||||
/// column has an IVF vector index the server partitions by its clusters
|
||||
/// (image-dedup style); otherwise it groups by distinct value.
|
||||
pub partition_by: Option<String>,
|
||||
}
|
||||
|
||||
impl CreateMaterializedViewRequest {
|
||||
pub fn new(name: impl Into<String>, query: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
query: query.into(),
|
||||
auto_refresh: false,
|
||||
with_no_data: false,
|
||||
partition_by: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A request to refresh a materialized view.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct RefreshMaterializedViewRequest {
|
||||
/// View name.
|
||||
pub name: String,
|
||||
/// Force a full rebuild (recompute and replace every row) instead of the
|
||||
/// default incremental refresh.
|
||||
pub full: bool,
|
||||
/// Pin the refresh to a source-table version; latest when absent.
|
||||
pub src_version: Option<u64>,
|
||||
/// Initial worker count.
|
||||
pub num_workers: Option<u32>,
|
||||
/// Elastic worker ceiling.
|
||||
pub max_workers: Option<u32>,
|
||||
}
|
||||
|
||||
/// A request for the derived-compute lineage of a table/view (or one of its
|
||||
/// columns). The response is server-defined lineage JSON, returned opaque so
|
||||
/// this client need not model the server's lineage schema.
|
||||
#[derive(Debug, Clone, Default)]
|
||||
pub struct TableLineageRequest {
|
||||
/// Table or view name.
|
||||
pub name: String,
|
||||
/// Column for column-level lineage; whole table/view when absent.
|
||||
pub column: Option<String>,
|
||||
/// "upstream" | "downstream" | "both" (server default when absent).
|
||||
pub direction: Option<String>,
|
||||
/// Column-hops to walk; transitive when absent.
|
||||
pub depth: Option<u32>,
|
||||
}
|
||||
|
||||
impl RefreshMaterializedViewRequest {
|
||||
pub fn new(name: impl Into<String>) -> Self {
|
||||
Self {
|
||||
name: name.into(),
|
||||
full: false,
|
||||
src_version: None,
|
||||
num_workers: None,
|
||||
max_workers: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// A registered materialized view definition, as returned by
|
||||
/// `list_materialized_views`.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct MaterializedViewInfo {
|
||||
pub name: String,
|
||||
pub source_table: String,
|
||||
/// Source columns projected through.
|
||||
pub projection: Vec<String>,
|
||||
/// `alias=expression` per UDF-computed column.
|
||||
pub udf_columns: Vec<String>,
|
||||
pub filter: Option<String>,
|
||||
pub auto_refresh: bool,
|
||||
}
|
||||
|
||||
/// A row from `list_jobs`: one inflight server-side job (index build,
|
||||
/// compaction, column refresh, view refresh, ...).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct JobInfo {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
/// Lifecycle state: "running", "cancelling", or "stale".
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub age_seconds: Option<i64>,
|
||||
pub command: Option<String>,
|
||||
pub units_done: Option<i64>,
|
||||
pub units_total: Option<i64>,
|
||||
/// Whether the job's final commit has completed (output visible).
|
||||
pub committed: bool,
|
||||
pub rows_skipped: u64,
|
||||
pub error: Option<String>,
|
||||
}
|
||||
|
||||
/// A row from `job_history`: one durable, completed/terminal server-side job
|
||||
/// record (SHOW JOB HISTORY), read from a table's `_job_history` store. Unlike
|
||||
/// `JobInfo` (live, inflight jobs) this carries created/updated/completed
|
||||
/// timestamps and the lifecycle event log.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct JobHistoryInfo {
|
||||
pub table: String,
|
||||
pub job_id: String,
|
||||
pub job_type: String,
|
||||
pub state: String,
|
||||
pub column: Option<String>,
|
||||
pub created_ms: i64,
|
||||
pub updated_ms: i64,
|
||||
pub completed_ms: Option<i64>,
|
||||
pub rows_processed: Option<i64>,
|
||||
pub rows_skipped: Option<i64>,
|
||||
pub error: Option<String>,
|
||||
/// Newline-joined lifecycle event log, oldest first.
|
||||
pub events: Option<String>,
|
||||
}
|
||||
|
||||
/// A row from `errors`: one per-row UDF failure recorded by `error_policy=skip`
|
||||
/// (SHOW ERRORS).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct JobErrorInfo {
|
||||
pub job_id: String,
|
||||
pub table: String,
|
||||
pub column: String,
|
||||
pub error_type: String,
|
||||
pub error_message: String,
|
||||
pub fragment_id: Option<i64>,
|
||||
pub source_row_id: Option<i64>,
|
||||
pub table_version: Option<i64>,
|
||||
pub age_seconds: Option<i64>,
|
||||
}
|
||||
|
||||
/// The plan a `REFRESH MATERIALIZED VIEW` would execute, as returned by
|
||||
/// `explain_refresh_materialized_view` (EXPLAIN REFRESH). No work is run.
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct MvRefreshPlan {
|
||||
pub table_name: String,
|
||||
/// Whether a refresh would do anything (rebuild or non-empty units).
|
||||
pub has_work: bool,
|
||||
pub source_version: u64,
|
||||
pub last_refreshed_version: Option<u64>,
|
||||
pub full_refresh: bool,
|
||||
/// True when the source changed in a non-append-only way since the last refresh, which forces a rebuild.
|
||||
pub rebuild: bool,
|
||||
/// Number of row-range work units the refresh would process.
|
||||
pub units_total: u64,
|
||||
}
|
||||
|
||||
fn not_supported<T>(what: &str) -> Result<T> {
|
||||
Err(Error::NotSupported {
|
||||
message: format!("{} is not supported by this database", what),
|
||||
})
|
||||
}
|
||||
|
||||
/// The `Database` trait defines the interface for database implementations.
|
||||
///
|
||||
/// A database is responsible for managing tables and their metadata.
|
||||
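The request types in this hunk take the required field(s) in `new` and default everything else. A minimal sketch of how a caller might combine that with struct-update syntax is shown here. The struct is a local mirror of `RefreshMaterializedViewRequest` so the snippet compiles on its own, and `pinned_full_rebuild` is a hypothetical helper, not part of the diff.

```rust
// Local, illustrative mirror of the request type added in this diff.
// It is redeclared here only so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub struct RefreshMaterializedViewRequest {
    pub name: String,
    pub full: bool,
    pub src_version: Option<u64>,
    pub num_workers: Option<u32>,
    pub max_workers: Option<u32>,
}

impl RefreshMaterializedViewRequest {
    /// Defaults: incremental refresh, latest source version, no worker hints.
    pub fn new(name: impl Into<String>) -> Self {
        Self {
            name: name.into(),
            full: false,
            src_version: None,
            num_workers: None,
            max_workers: None,
        }
    }
}

/// Hypothetical helper: a full rebuild pinned to one source-table version.
/// Only the fields that differ from the defaults are spelled out.
pub fn pinned_full_rebuild(view: &str, version: u64) -> RefreshMaterializedViewRequest {
    RefreshMaterializedViewRequest {
        full: true,
        src_version: Some(version),
        ..RefreshMaterializedViewRequest::new(view)
    }
}
```

Because the fields are public, callers can also mutate a value returned by `new` directly; the struct-update form just keeps the non-default fields visible in one place.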
@@ -245,6 +444,99 @@ pub trait Database:
|
||||
///
|
||||
/// See [`CloneTableRequest`] for detailed documentation and examples.
|
||||
async fn clone_table(&self, request: CloneTableRequest) -> Result<Arc<dyn BaseTable>>;
|
||||
|
||||
// -- Derived compute: functions, materialized views, jobs -------------
|
||||
//
|
||||
// Server-backed features (LanceDB Enterprise / Cloud). The defaults
|
||||
// return NotSupported; the remote database overrides them. Local
|
||||
// single-node implementations are planned.
|
||||
|
||||
/// Register a UDF (CREATE FUNCTION).
|
||||
async fn create_function(&self, _request: CreateFunctionRequest) -> Result<()> {
|
||||
not_supported("create_function")
|
||||
}
|
||||
/// List registered functions (SHOW FUNCTIONS).
|
||||
async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
|
||||
not_supported("list_functions")
|
||||
}
|
||||
/// Drop a registered function (DROP FUNCTION).
|
||||
async fn drop_function(&self, _name: &str) -> Result<()> {
|
||||
not_supported("drop_function")
|
||||
}
|
||||
/// Create a materialized view (CREATE MATERIALIZED VIEW). Returns
|
||||
/// the initial-population job id, or `None` when `with_no_data` is set.
|
||||
async fn create_materialized_view(
|
||||
&self,
|
||||
_request: CreateMaterializedViewRequest,
|
||||
) -> Result<Option<String>> {
|
||||
not_supported("create_materialized_view")
|
||||
}
|
||||
/// Refresh a materialized view; returns the refresh job id.
|
||||
async fn refresh_materialized_view(
|
||||
&self,
|
||||
_request: RefreshMaterializedViewRequest,
|
||||
) -> Result<String> {
|
||||
not_supported("refresh_materialized_view")
|
||||
}
|
||||
/// Derived-compute lineage of a table/view (or column), as server-defined
|
||||
/// JSON. Read-only.
|
||||
async fn table_lineage(&self, _request: TableLineageRequest) -> Result<String> {
|
||||
not_supported("table_lineage")
|
||||
}
|
||||
/// Plan a materialized-view refresh without submitting work
|
||||
/// (EXPLAIN REFRESH). `full` plans a full rebuild (incremental
|
||||
/// planning requires stable row IDs on the source).
|
||||
async fn explain_refresh_materialized_view(
|
||||
&self,
|
||||
_name: &str,
|
||||
_full: bool,
|
||||
_src_version: Option<u64>,
|
||||
) -> Result<MvRefreshPlan> {
|
||||
not_supported("explain_refresh_materialized_view")
|
||||
}
|
||||
/// Update a materialized view's options (ALTER MATERIALIZED VIEW).
|
||||
async fn alter_materialized_view(&self, _name: &str, _auto_refresh: bool) -> Result<()> {
|
||||
not_supported("alter_materialized_view")
|
||||
}
|
||||
/// Drop a materialized view definition (DROP MATERIALIZED VIEW).
|
||||
async fn drop_materialized_view(&self, _name: &str) -> Result<()> {
|
||||
not_supported("drop_materialized_view")
|
||||
}
|
||||
/// List registered materialized view definitions.
|
||||
async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
|
||||
not_supported("list_materialized_views")
|
||||
}
|
||||
/// List inflight server-side jobs across the database's tables.
|
||||
async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
|
||||
not_supported("list_jobs")
|
||||
}
|
||||
/// Cancel an inflight server-side job by id. Returns true if a
|
||||
/// matching inflight job was found and flagged for cancellation,
|
||||
/// false if none was inflight (best-effort, like SQL `CANCEL JOB`).
|
||||
async fn cancel_job(&self, _job_id: &str) -> Result<bool> {
|
||||
not_supported("cancel_job")
|
||||
}
|
||||
/// Point-access for a single job by id -- the `wait()`/status poll path.
|
||||
/// `table_hint` (the job's table, which `wait()` callers know) enables an
|
||||
/// O(1) server-side lookup. Returns `None` if the job is unknown or not active.
|
||||
async fn get_job(&self, _job_id: &str, _table_hint: Option<&str>) -> Result<Option<JobInfo>> {
|
||||
not_supported("get_job")
|
||||
}
|
||||
/// Durable job history (SHOW JOB HISTORY) across the database's tables,
|
||||
/// optionally narrowed to a single `job_id`.
|
||||
async fn job_history(&self, _job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
|
||||
not_supported("job_history")
|
||||
}
|
||||
/// Per-row UDF errors (SHOW ERRORS) recorded by `error_policy=skip` across
|
||||
/// the database's tables, optionally filtered by `job_id` and/or `table`.
|
||||
async fn errors(
|
||||
&self,
|
||||
_job_id: Option<&str>,
|
||||
_table: Option<&str>,
|
||||
) -> Result<Vec<JobErrorInfo>> {
|
||||
not_supported("errors")
|
||||
}
|
||||
|
||||
/// Open a table in the database
|
||||
async fn open_table(&self, request: OpenTableRequest) -> Result<Arc<dyn BaseTable>>;
|
||||
/// Rename a table in the database
|
||||
|
||||
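The trait hunk above relies on default method bodies: every derived-compute method returns `NotSupported` unless a backend overrides it. The sketch below shows that pattern in isolation. It is synchronous and uses a local `Error` enum so it compiles without `async_trait` or the crate's own error type; `LocalDb` and `ServerDb` are hypothetical stand-ins, not types from the diff.

```rust
// Simplified, synchronous illustration of the "default to NotSupported" pattern.
#[derive(Debug, PartialEq)]
pub enum Error {
    NotSupported { message: String },
}

pub type Result<T> = std::result::Result<T, Error>;

fn not_supported<T>(what: &str) -> Result<T> {
    Err(Error::NotSupported {
        message: format!("{} is not supported by this database", what),
    })
}

pub trait Database {
    // Defaults: backends that do not override these report NotSupported.
    fn cancel_job(&self, _job_id: &str) -> Result<bool> {
        not_supported("cancel_job")
    }
    fn drop_function(&self, _name: &str) -> Result<()> {
        not_supported("drop_function")
    }
}

/// Hypothetical backend that overrides nothing.
pub struct LocalDb;
impl Database for LocalDb {}

/// Hypothetical backend that overrides only `cancel_job`.
pub struct ServerDb {
    pub inflight: Vec<String>,
}

impl Database for ServerDb {
    fn cancel_job(&self, job_id: &str) -> Result<bool> {
        // true if a matching inflight job exists, false otherwise.
        Ok(self.inflight.iter().any(|j| j == job_id))
    }
}
```

One consequence of this design is that adding a method to the trait does not break existing implementors; they inherit the `NotSupported` default until they opt in.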
@@ -19,8 +19,10 @@ use lance_namespace::models::{
|
||||
|
||||
use crate::Error;
|
||||
use crate::database::{
|
||||
CloneTableRequest, CreateTableMode, CreateTableRequest, Database, DatabaseOptions,
|
||||
OpenTableRequest, ReadConsistency, TableNamesRequest,
|
||||
CloneTableRequest, CreateFunctionRequest, CreateMaterializedViewRequest, CreateTableMode,
|
||||
CreateTableRequest, Database, DatabaseOptions, FunctionInfo, JobErrorInfo, JobHistoryInfo,
|
||||
JobInfo, MaterializedViewInfo, MvRefreshPlan, OpenTableRequest, ReadConsistency,
|
||||
RefreshMaterializedViewRequest, TableLineageRequest, TableNamesRequest,
|
||||
};
|
||||
use crate::error::Result;
|
||||
use crate::remote::util::stream_as_body;
|
||||
@@ -33,6 +35,248 @@ use super::client::{
|
||||
use super::table::RemoteTable;
|
||||
use super::util::parse_server_version;
|
||||
|
||||
// Wire types for the derived-compute routes (functions, materialized
|
||||
// views, jobs). Field shapes mirror the server's REST contract.
|
||||
#[derive(serde::Serialize)]
|
||||
struct RemoteCreateFunctionRequest {
|
||||
language: String,
|
||||
return_type: String,
|
||||
body: String,
|
||||
options: std::collections::HashMap<String, String>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteFunctionEntry {
|
||||
name: String,
|
||||
language: String,
|
||||
return_type: String,
|
||||
#[serde(default)]
|
||||
description: String,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteListFunctionsResponse {
|
||||
functions: Vec<RemoteFunctionEntry>,
|
||||
}
|
||||
|
||||
#[derive(serde::Serialize)]
|
||||
struct RemoteCreateMaterializedViewRequest {
|
||||
query: String,
|
||||
auto_refresh: bool,
|
||||
with_no_data: bool,
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
partition_by: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteCreateMaterializedViewResponse {
|
||||
#[serde(default)]
|
||||
job_id: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(serde::Serialize)]
|
||||
struct RemoteRefreshMaterializedViewRequest {
|
||||
#[serde(skip_serializing_if = "std::ops::Not::not")]
|
||||
full: bool,
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
src_version: Option<u64>,
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
num_workers: Option<u32>,
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
max_workers: Option<u32>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteRefreshMaterializedViewResponse {
|
||||
job_id: String,
|
||||
}
|
||||
|
||||
#[derive(serde::Serialize)]
|
||||
struct RemoteExplainRefreshRequest {
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
full: Option<bool>,
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
src_version: Option<u64>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteExplainRefreshResponse {
|
||||
table_name: String,
|
||||
has_work: bool,
|
||||
source_version: u64,
|
||||
last_refreshed_version: Option<u64>,
|
||||
full_refresh: bool,
|
||||
rebuild: bool,
|
||||
units_total: u64,
|
||||
}
|
||||
|
||||
#[derive(serde::Serialize)]
|
||||
struct RemoteAlterMaterializedViewRequest {
|
||||
auto_refresh: bool,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteMaterializedViewEntry {
|
||||
name: String,
|
||||
source_table: String,
|
||||
#[serde(default)]
|
||||
projection: Vec<String>,
|
||||
#[serde(default)]
|
||||
udf_columns: Vec<String>,
|
||||
#[serde(default)]
|
||||
filter: Option<String>,
|
||||
#[serde(default)]
|
||||
auto_refresh: bool,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteListMaterializedViewsResponse {
|
||||
views: Vec<RemoteMaterializedViewEntry>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteJobEntry {
|
||||
table: String,
|
||||
job_id: String,
|
||||
job_type: String,
|
||||
state: String,
|
||||
#[serde(default)]
|
||||
column: Option<String>,
|
||||
#[serde(default)]
|
||||
age_seconds: Option<i64>,
|
||||
#[serde(default)]
|
||||
command: Option<String>,
|
||||
#[serde(default)]
|
||||
units_done: Option<i64>,
|
||||
#[serde(default)]
|
||||
units_total: Option<i64>,
|
||||
#[serde(default)]
|
||||
committed: bool,
|
||||
#[serde(default)]
|
||||
rows_skipped: u64,
|
||||
#[serde(default)]
|
||||
error: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteListJobsResponse {
|
||||
jobs: Vec<RemoteJobEntry>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteGetJobResponse {
|
||||
#[serde(default)]
|
||||
job: Option<RemoteJobEntry>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteCancelJobResponse {
|
||||
cancelled: bool,
|
||||
}
|
||||
|
||||
impl From<RemoteJobEntry> for JobInfo {
|
||||
fn from(j: RemoteJobEntry) -> Self {
|
||||
JobInfo {
|
||||
table: j.table,
|
||||
job_id: j.job_id,
|
||||
job_type: j.job_type,
|
||||
state: j.state,
|
||||
column: j.column,
|
||||
age_seconds: j.age_seconds,
|
||||
command: j.command,
|
||||
units_done: j.units_done,
|
||||
units_total: j.units_total,
|
||||
committed: j.committed,
|
||||
rows_skipped: j.rows_skipped,
|
||||
error: j.error,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteJobHistoryEntry {
|
||||
table: String,
|
||||
job_id: String,
|
||||
job_type: String,
|
||||
state: String,
|
||||
#[serde(default)]
|
||||
column: Option<String>,
|
||||
created_ms: i64,
|
||||
updated_ms: i64,
|
||||
#[serde(default)]
|
||||
completed_ms: Option<i64>,
|
||||
#[serde(default)]
|
||||
rows_processed: Option<i64>,
|
||||
#[serde(default)]
|
||||
rows_skipped: Option<i64>,
|
||||
#[serde(default)]
|
||||
error: Option<String>,
|
||||
#[serde(default)]
|
||||
events: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteJobHistoryResponse {
|
||||
jobs: Vec<RemoteJobHistoryEntry>,
|
||||
}
|
||||
|
||||
impl From<RemoteJobHistoryEntry> for JobHistoryInfo {
|
||||
fn from(j: RemoteJobHistoryEntry) -> Self {
|
||||
JobHistoryInfo {
|
||||
table: j.table,
|
||||
job_id: j.job_id,
|
||||
job_type: j.job_type,
|
||||
state: j.state,
|
||||
column: j.column,
|
||||
created_ms: j.created_ms,
|
||||
updated_ms: j.updated_ms,
|
||||
completed_ms: j.completed_ms,
|
||||
rows_processed: j.rows_processed,
|
||||
rows_skipped: j.rows_skipped,
|
||||
error: j.error,
|
||||
events: j.events,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteErrorEntry {
|
||||
job_id: String,
|
||||
table: String,
|
||||
column: String,
|
||||
error_type: String,
|
||||
error_message: String,
|
||||
#[serde(default)]
|
||||
fragment_id: Option<i64>,
|
||||
#[serde(default)]
|
||||
source_row_id: Option<i64>,
|
||||
#[serde(default)]
|
||||
table_version: Option<i64>,
|
||||
#[serde(default)]
|
||||
age_seconds: Option<i64>,
|
||||
}
|
||||
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RemoteErrorsResponse {
|
||||
errors: Vec<RemoteErrorEntry>,
|
||||
}
|
||||
|
||||
impl From<RemoteErrorEntry> for JobErrorInfo {
|
||||
fn from(e: RemoteErrorEntry) -> Self {
|
||||
JobErrorInfo {
|
||||
job_id: e.job_id,
|
||||
table: e.table,
|
||||
column: e.column,
|
||||
error_type: e.error_type,
|
||||
error_message: e.error_message,
|
||||
fragment_id: e.fragment_id,
|
||||
source_row_id: e.source_row_id,
|
||||
table_version: e.table_version,
|
||||
age_seconds: e.age_seconds,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Request structure for the remote clone table API
|
||||
#[derive(serde::Serialize)]
|
||||
struct RemoteCloneTableRequest {
|
||||
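`JobInfo` carries `units_done` and `units_total` as optional `i64` values, so a caller polling a job has to handle missing or zero totals before reporting progress. The following is a minimal sketch of one way to do that; `progress` is a hypothetical helper and is not defined anywhere in this diff.

```rust
/// Hypothetical helper: fraction of work completed, clamped to at most 1.0.
/// Returns None when either counter is missing, the total is not positive,
/// or the done count is negative.
pub fn progress(units_done: Option<i64>, units_total: Option<i64>) -> Option<f64> {
    match (units_done, units_total) {
        (Some(done), Some(total)) if total > 0 && done >= 0 => {
            Some((done as f64 / total as f64).min(1.0))
        }
        _ => None,
    }
}
```

Returning `Option<f64>` rather than defaulting to `0.0` keeps "no progress information" distinct from "no progress yet".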
@@ -641,6 +885,228 @@ impl<S: HttpSend> Database for RemoteDatabase<S> {
|
||||
Ok(table)
|
||||
}
|
||||
|
||||
async fn create_function(&self, request: CreateFunctionRequest) -> Result<()> {
|
||||
let body = RemoteCreateFunctionRequest {
|
||||
language: request.language,
|
||||
return_type: request.return_type,
|
||||
body: request.body,
|
||||
options: request.options,
|
||||
};
|
||||
let req = self
|
||||
.client
|
||||
.post(&format!("/v1/function/{}/create", request.name))
|
||||
.json(&body);
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
self.client.check_response(&request_id, rsp).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
|
||||
let req = self.client.get("/v1/function/list");
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteListFunctionsResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body
|
||||
.functions
|
||||
.into_iter()
|
||||
.map(|f| FunctionInfo {
|
||||
name: f.name,
|
||||
language: f.language,
|
||||
return_type: f.return_type,
|
||||
description: f.description,
|
||||
})
|
||||
.collect())
|
||||
}
|
||||
|
||||
async fn drop_function(&self, name: &str) -> Result<()> {
|
||||
let req = self.client.post(&format!("/v1/function/{}/drop", name));
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
self.client.check_response(&request_id, rsp).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
async fn create_materialized_view(
|
||||
&self,
|
||||
request: CreateMaterializedViewRequest,
|
||||
) -> Result<Option<String>> {
|
||||
let body = RemoteCreateMaterializedViewRequest {
|
||||
query: request.query,
|
||||
auto_refresh: request.auto_refresh,
|
||||
with_no_data: request.with_no_data,
|
||||
partition_by: request.partition_by,
|
||||
};
|
||||
let req = self
|
||||
.client
|
||||
.post(&format!("/v1/materialized_view/{}/create", request.name))
|
||||
.json(&body);
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteCreateMaterializedViewResponse =
|
||||
rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.job_id)
|
||||
}
|
||||
|
||||
async fn refresh_materialized_view(
|
||||
&self,
|
||||
request: RefreshMaterializedViewRequest,
|
||||
) -> Result<String> {
|
||||
let body = RemoteRefreshMaterializedViewRequest {
|
||||
full: request.full,
|
||||
src_version: request.src_version,
|
||||
num_workers: request.num_workers,
|
||||
max_workers: request.max_workers,
|
||||
};
|
||||
let req = self
|
||||
.client
|
||||
.post(&format!("/v1/materialized_view/{}/refresh", request.name))
|
||||
.json(&body);
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteRefreshMaterializedViewResponse =
|
||||
rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.job_id)
|
||||
}
|
||||
|
||||
async fn table_lineage(&self, request: TableLineageRequest) -> Result<String> {
|
||||
let mut req = self
|
||||
.client
|
||||
.get(&format!("/v1/table/{}/lineage", request.name));
|
||||
if let Some(column) = &request.column {
|
||||
req = req.query(&[("column", column)]);
|
||||
}
|
||||
if let Some(direction) = &request.direction {
|
||||
req = req.query(&[("direction", direction)]);
|
||||
}
|
||||
if let Some(depth) = request.depth {
|
||||
req = req.query(&[("depth", depth.to_string())]);
|
||||
}
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
// Server-defined lineage JSON, returned opaque (the client does not
|
||||
// model the lineage schema; the Python layer deserializes it).
|
||||
rsp.text().await.err_to_http(request_id)
|
||||
}
|
||||
|
||||
async fn explain_refresh_materialized_view(
|
||||
&self,
|
||||
name: &str,
|
||||
full: bool,
|
||||
src_version: Option<u64>,
|
||||
) -> Result<MvRefreshPlan> {
|
||||
let body = RemoteExplainRefreshRequest {
|
||||
full: Some(full),
|
||||
src_version,
|
||||
};
|
||||
let req = self
|
||||
.client
|
||||
.post(&format!("/v1/materialized_view/{}/explain_refresh", name))
|
||||
.json(&body);
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteExplainRefreshResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(MvRefreshPlan {
|
||||
table_name: body.table_name,
|
||||
has_work: body.has_work,
|
||||
source_version: body.source_version,
|
||||
last_refreshed_version: body.last_refreshed_version,
|
||||
full_refresh: body.full_refresh,
|
||||
rebuild: body.rebuild,
|
||||
units_total: body.units_total,
|
||||
})
|
||||
}
|
||||
|
||||
async fn alter_materialized_view(&self, name: &str, auto_refresh: bool) -> Result<()> {
|
||||
let req = self
|
||||
.client
|
||||
.post(&format!("/v1/materialized_view/{}/alter", name))
|
||||
.json(&RemoteAlterMaterializedViewRequest { auto_refresh });
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
self.client.check_response(&request_id, rsp).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
async fn drop_materialized_view(&self, name: &str) -> Result<()> {
|
||||
let req = self
|
||||
.client
|
||||
.post(&format!("/v1/materialized_view/{}/drop", name));
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
self.client.check_response(&request_id, rsp).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
|
||||
let req = self.client.get("/v1/materialized_view/list");
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteListMaterializedViewsResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body
|
||||
.views
|
||||
.into_iter()
|
||||
.map(|v| MaterializedViewInfo {
|
||||
name: v.name,
|
||||
source_table: v.source_table,
|
||||
projection: v.projection,
|
||||
udf_columns: v.udf_columns,
|
||||
filter: v.filter,
|
||||
auto_refresh: v.auto_refresh,
|
||||
})
|
||||
.collect())
|
||||
}
|
||||
|
||||
async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
|
||||
let req = self.client.get("/v1/job/list");
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteListJobsResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.jobs.into_iter().map(JobInfo::from).collect())
|
||||
}
|
||||
|
||||
async fn get_job(&self, job_id: &str, table: Option<&str>) -> Result<Option<JobInfo>> {
|
||||
// Point-access poll path: GET /v1/job/{id}, with the table as the O(1)
|
||||
// hint when known. `query` handles URL-encoding the table name.
|
||||
let mut req = self.client.get(&format!("/v1/job/{job_id}"));
|
||||
if let Some(t) = table {
|
||||
req = req.query(&[("table", t)]);
|
||||
}
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteGetJobResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.job.map(JobInfo::from))
|
||||
}
|
||||
|
||||
async fn cancel_job(&self, job_id: &str) -> Result<bool> {
|
||||
let req = self.client.post(&format!("/v1/job/{}/cancel", job_id));
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteCancelJobResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.cancelled)
|
||||
}
|
||||
|
||||
async fn job_history(&self, job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
|
||||
let mut req = self.client.get("/v1/job/history");
|
||||
if let Some(j) = job_id {
|
||||
req = req.query(&[("job", j)]);
|
||||
}
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteJobHistoryResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.jobs.into_iter().map(JobHistoryInfo::from).collect())
|
||||
}
|
||||
|
||||
async fn errors(&self, job_id: Option<&str>, table: Option<&str>) -> Result<Vec<JobErrorInfo>> {
|
||||
let mut req = self.client.get("/v1/job/errors");
|
||||
if let Some(j) = job_id {
|
||||
req = req.query(&[("job", j)]);
|
||||
}
|
||||
if let Some(t) = table {
|
||||
req = req.query(&[("table", t)]);
|
||||
}
|
||||
let (request_id, rsp) = self.client.send(req).await?;
|
||||
let rsp = self.client.check_response(&request_id, rsp).await?;
|
||||
let body: RemoteErrorsResponse = rsp.json().await.err_to_http(request_id)?;
|
||||
Ok(body.errors.into_iter().map(JobErrorInfo::from).collect())
|
||||
}
|
||||
|
||||
async fn open_table(&self, request: OpenTableRequest) -> Result<Arc<dyn BaseTable>> {
|
||||
let identifier = build_table_identifier(
|
||||
&request.name,
|
||||
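Several of the remote routes above (`job_history`, `errors`, `get_job`, `table_lineage`) attach a query parameter only when the corresponding `Option` is present. The sketch below illustrates that shape for the `/v1/job/errors` route using plain strings. `errors_path` is a hypothetical helper, and unlike the client's `query` builder it performs no URL-encoding, so it is for illustration only.

```rust
/// Hypothetical helper: build the request path for the errors route,
/// appending only the filters that were supplied.
/// Assumption: values need no percent-encoding (a real client must encode them).
pub fn errors_path(job_id: Option<&str>, table: Option<&str>) -> String {
    let mut params: Vec<String> = Vec::new();
    if let Some(j) = job_id {
        params.push(format!("job={}", j));
    }
    if let Some(t) = table {
        params.push(format!("table={}", t));
    }
    if params.is_empty() {
        "/v1/job/errors".to_string()
    } else {
        format!("/v1/job/errors?{}", params.join("&"))
    }
}
```

With both filters absent the path carries no query string, which matches the "across the database's tables" behavior described in the trait docs.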
@@ -1580,6 +2046,223 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_derived_compute_routes() {
|
||||
// create_function
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::POST);
|
||||
assert_eq!(request.url().path(), "/v1/function/embed/create");
|
||||
let body: serde_json::Value =
|
||||
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
|
||||
assert_eq!(body["language"], "python");
|
||||
assert_eq!(body["return_type"], "FLOAT[4]");
|
||||
assert_eq!(body["body"], "def embed(x): ...");
|
||||
assert_eq!(body["options"]["pip"], "torch");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"name":"embed","status":"OK"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
conn.create_function(crate::database::CreateFunctionRequest {
|
||||
name: "embed".into(),
|
||||
language: "python".into(),
|
||||
return_type: "FLOAT[4]".into(),
|
||||
body: "def embed(x): ...".into(),
|
||||
options: [("pip".to_string(), "torch".to_string())].into(),
|
||||
})
|
||||
.await
|
||||
.unwrap();
|
||||
|
||||
// list_functions
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::GET);
|
||||
assert_eq!(request.url().path(), "/v1/function/list");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(
|
||||
r#"{"functions":[{"name":"embed","language":"python","return_type":"Float32","description":""}]}"#,
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
let functions = conn.list_functions().await.unwrap();
|
||||
assert_eq!(functions.len(), 1);
|
||||
assert_eq!(functions[0].name, "embed");
|
||||
|
||||
// drop_function
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::POST);
|
||||
assert_eq!(request.url().path(), "/v1/function/embed/drop");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"name":"embed","status":"OK"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
conn.drop_function("embed").await.unwrap();
|
||||
|
||||
// create_materialized_view
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::POST);
|
||||
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/create");
|
||||
let body: serde_json::Value =
|
||||
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
|
||||
assert_eq!(body["query"], "SELECT id, embed(body) AS vec FROM docs");
|
||||
assert_eq!(body["auto_refresh"], true);
|
||||
assert_eq!(body["with_no_data"], false);
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"name":"mv1","job_id":"j-1"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
let mut request = crate::database::CreateMaterializedViewRequest::new(
|
||||
"mv1",
|
||||
"SELECT id, embed(body) AS vec FROM docs",
|
||||
);
|
||||
request.auto_refresh = true;
|
||||
let job_id = conn.create_materialized_view(request).await.unwrap();
|
||||
assert_eq!(job_id.as_deref(), Some("j-1"));
|
||||
|
||||
// refresh_materialized_view
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::POST);
|
||||
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/refresh");
|
||||
let body: serde_json::Value =
|
||||
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
|
||||
assert_eq!(body["num_workers"], 2);
|
||||
assert!(body.get("src_version").is_none());
|
||||
http::Response::builder()
|
||||
.status(202)
|
||||
.body(r#"{"job_id":"j-2"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
let mut request = crate::database::RefreshMaterializedViewRequest::new("mv1");
|
||||
request.num_workers = Some(2);
|
||||
let job_id = conn.refresh_materialized_view(request).await.unwrap();
|
||||
assert_eq!(job_id, "j-2");
|
||||
|
||||
// alter_materialized_view
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::POST);
|
||||
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/alter");
|
||||
let body: serde_json::Value =
|
||||
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
|
||||
assert_eq!(body["auto_refresh"], false);
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"name":"mv1","status":"OK"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
conn.alter_materialized_view("mv1", false).await.unwrap();
|
||||
|
||||
// drop_materialized_view
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/drop");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"name":"mv1","status":"OK"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
conn.drop_materialized_view("mv1").await.unwrap();
|
||||
|
||||
// list_materialized_views
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::GET);
|
||||
assert_eq!(request.url().path(), "/v1/materialized_view/list");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(
|
||||
r#"{"views":[{"name":"mv1","source_table":"docs","projection":["id"],"udf_columns":["vec=embed(body)"],"filter":null,"auto_refresh":true}]}"#,
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
let views = conn.list_materialized_views().await.unwrap();
|
||||
assert_eq!(views.len(), 1);
|
||||
assert_eq!(views[0].source_table, "docs");
|
||||
assert!(views[0].auto_refresh);
|
||||
|
||||
// list_jobs
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::GET);
|
||||
assert_eq!(request.url().path(), "/v1/job/list");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(
|
||||
r#"{"jobs":[{"table":"docs","job_id":"j-3","job_type":"udf_virtual_column_backfill","state":"running","column":"vec","age_seconds":4,"command":null,"units_done":1,"units_total":2,"committed":false,"rows_skipped":0,"error":null}]}"#,
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
let jobs = conn.list_jobs().await.unwrap();
|
||||
assert_eq!(jobs.len(), 1);
|
||||
assert_eq!(jobs[0].state, "running");
|
||||
assert_eq!(jobs[0].units_total, Some(2));
|
||||
|
||||
// cancel_job
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::POST);
|
||||
assert_eq!(request.url().path(), "/v1/job/j-3/cancel");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"cancelled":true}"#)
|
||||
.unwrap()
|
||||
});
|
||||
assert!(conn.cancel_job("j-3").await.unwrap());
|
||||
|
||||
// cancel_job: no such inflight job -> false, not an error
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.url().path(), "/v1/job/gone/cancel");
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"cancelled":false}"#)
|
||||
.unwrap()
|
||||
});
|
||||
assert!(!conn.cancel_job("gone").await.unwrap());
|
||||
|
||||
// job_history: GET /v1/job/history, no filter
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::GET);
|
||||
assert_eq!(request.url().path(), "/v1/job/history");
|
||||
assert!(request.url().query().is_none());
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(
|
||||
r#"{"jobs":[{"table":"docs","job_id":"j-1","job_type":"udf_virtual_column_backfill","state":"done","column":"vec","created_ms":1000,"updated_ms":2000,"completed_ms":2000,"rows_processed":42,"rows_skipped":3,"error":null,"events":"created\ndone"}]}"#,
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
let hist = conn.job_history(None).await.unwrap();
|
||||
assert_eq!(hist.len(), 1);
|
||||
assert_eq!(hist[0].state, "done");
|
||||
assert_eq!(hist[0].rows_processed, Some(42));
|
||||
assert_eq!(hist[0].events.as_deref(), Some("created\ndone"));
|
||||
|
||||
// job_history: ?job= narrows to one job
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.url().path(), "/v1/job/history");
|
||||
assert_eq!(request.url().query(), Some("job=j-1"));
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(r#"{"jobs":[]}"#)
|
||||
.unwrap()
|
||||
});
|
||||
assert!(conn.job_history(Some("j-1")).await.unwrap().is_empty());
|
||||
|
||||
// errors: GET /v1/job/errors with job + table filters
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
assert_eq!(request.method(), &reqwest::Method::GET);
|
||||
assert_eq!(request.url().path(), "/v1/job/errors");
|
||||
assert_eq!(request.url().query(), Some("job=j-1&table=docs"));
|
||||
http::Response::builder()
|
||||
.status(200)
|
||||
.body(
|
||||
r#"{"errors":[{"job_id":"j-1","table":"docs","column":"vec","error_type":"ValueError","error_message":"boom","fragment_id":0,"source_row_id":42,"table_version":7,"age_seconds":5}]}"#,
|
||||
)
|
||||
.unwrap()
|
||||
});
|
||||
let errs = conn.errors(Some("j-1"), Some("docs")).await.unwrap();
|
||||
assert_eq!(errs.len(), 1);
|
||||
assert_eq!(errs[0].error_type, "ValueError");
|
||||
assert_eq!(errs[0].source_row_id, Some(42));
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_clone_table() {
|
||||
let conn = Connection::new_with_handler(|request| {
|
||||
|
||||
@@ -2309,6 +2309,126 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
|
||||
message: "optimize is not supported on LanceDB cloud.".into(),
|
||||
})
|
||||
}
|
||||
async fn add_computed_columns(
|
||||
&self,
|
||||
columns: &[(String, String)],
|
||||
expression: &str,
|
||||
) -> Result<()> {
|
||||
let new_columns: Vec<serde_json::Value> = columns
|
||||
.iter()
|
||||
.map(|(name, data_type)| {
|
||||
serde_json::json!({
|
||||
"name": name,
|
||||
"computed": { "data_type": data_type, "expression": expression },
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
let request = self
|
||||
.client
|
||||
.post(&format!("/v1/table/{}/add_columns/", self.identifier))
|
||||
.json(&serde_json::json!({ "new_columns": new_columns }));
|
||||
let (request_id, response) = self.send(request, true).await?;
|
||||
self.check_table_response(&request_id, response).await?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
async fn refresh_column(
|
||||
&self,
|
||||
columns: &[String],
|
||||
where_clause: Option<String>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
batch_size: Option<u32>,
|
||||
priority: Option<String>,
|
||||
) -> Result<String> {
|
||||
let mut body = serde_json::json!({ "columns": columns });
|
||||
if let Some(w) = where_clause {
|
||||
body["where_clause"] = serde_json::Value::String(w);
|
||||
}
|
||||
if let Some(n) = num_workers {
|
||||
body["num_workers"] = n.into();
|
||||
}
|
||||
if let Some(n) = max_workers {
|
||||
body["max_workers"] = n.into();
|
||||
}
|
||||
if let Some(n) = batch_size {
|
||||
body["batch_size"] = n.into();
|
||||
}
|
||||
if let Some(p) = priority {
|
||||
body["priority"] = serde_json::Value::String(p);
|
||||
}
|
||||
let request = self
|
||||
.client
|
||||
.post(&format!("/v1/table/{}/refresh_column", self.identifier))
|
||||
.json(&body);
|
||||
let (request_id, response) = self.send(request, true).await?;
|
||||
let response = self.check_table_response(&request_id, response).await?;
|
||||
#[derive(serde::Deserialize)]
|
||||
struct RefreshColumnResponse {
|
||||
job_id: String,
|
||||
}
|
||||
let body: RefreshColumnResponse = response.json().await.err_to_http(request_id)?;
|
||||
Ok(body.job_id)
|
||||
}
|
||||
|
||||
async fn load_columns(&self, request: crate::table::LoadColumnsRequest) -> Result<String> {
|
||||
let columns: Vec<serde_json::Value> = request
|
||||
.columns
|
||||
.iter()
|
||||
.map(|(target, source)| {
|
||||
serde_json::json!({
|
||||
"target": target,
|
||||
"source": source.clone().unwrap_or_else(|| target.clone()),
|
||||
})
|
||||
})
|
||||
.collect();
|
||||
let mut source = serde_json::json!({
|
||||
"uris": request.source_uris,
|
||||
"format": request.source_format,
|
||||
});
|
||||
if let Some(opts) = request.source_storage_options {
|
||||
source["storage_options"] = serde_json::to_value(opts).unwrap_or_default();
|
||||
}
|
||||
let mut body = serde_json::json!({
|
||||
"columns": columns,
|
||||
"source": source,
|
||||
"target_key": request.target_key,
|
||||
});
|
||||
if let Some(k) = request.source_key {
|
||||
body["source_key"] = serde_json::Value::String(k);
|
||||
}
|
||||
if let Some(m) = request.on_missing {
|
||||
body["on_missing"] = serde_json::Value::String(m);
|
||||
}
|
||||
if let Some(n) = request.num_workers {
|
||||
body["num_workers"] = n.into();
|
||||
}
|
||||
if let Some(n) = request.max_workers {
|
||||
body["max_workers"] = n.into();
|
||||
}
|
||||
if let Some(n) = request.batch_size {
|
||||
body["batch_size"] = n.into();
|
||||
}
|
||||
if let Some(n) = request.commit_granularity {
|
||||
body["commit_granularity"] = n.into();
|
||||
}
|
||||
if let Some(p) = request.priority {
|
||||
body["priority"] = serde_json::Value::String(p);
|
||||
}
|
||||
let http_request = self
|
||||
.client
|
||||
.post(&format!("/v1/table/{}/load_columns", self.identifier))
|
||||
.json(&body);
|
||||
let (request_id, response) = self.send(http_request, true).await?;
|
||||
let response = self.check_table_response(&request_id, response).await?;
|
||||
#[derive(serde::Deserialize)]
|
||||
struct LoadColumnsResponse {
|
||||
job_id: String,
|
||||
}
|
||||
let body: LoadColumnsResponse = response.json().await.err_to_http(request_id)?;
|
||||
Ok(body.job_id)
|
||||
}
|
||||
|
||||
async fn add_columns(
|
||||
&self,
|
||||
transforms: NewColumnTransform,
|
||||
@@ -2801,6 +2921,75 @@ mod tests {
|
||||
}
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_refresh_column() {
|
||||
let table = Table::new_with_handler("my_table", |request| {
|
||||
assert_eq!(request.method(), "POST");
|
||||
assert_eq!(request.url().path(), "/v1/table/my_table/refresh_column");
|
||||
let body: serde_json::Value =
|
||||
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
|
||||
assert_eq!(body["columns"], serde_json::json!(["vec"]));
|
||||
assert_eq!(body["num_workers"], 2);
|
||||
assert!(body.get("where_clause").is_none());
|
||||
|
||||
http::Response::builder()
|
||||
.status(202)
|
||||
.body(r#"{"job_id":"j-9"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
let job_id = table
|
||||
.refresh_column(&["vec".to_string()], None, Some(2), None, None, None)
|
||||
.await
|
||||
.unwrap();
|
||||
assert_eq!(job_id, "j-9");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_load_columns() {
|
||||
let table = Table::new_with_handler("my_table", |request| {
|
||||
assert_eq!(request.method(), "POST");
|
||||
assert_eq!(request.url().path(), "/v1/table/my_table/load_columns");
|
||||
let body: serde_json::Value =
|
||||
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
|
||||
assert_eq!(
|
||||
body["columns"],
|
||||
serde_json::json!([{"target": "embedding", "source": "emb"}])
|
||||
);
|
||||
assert_eq!(body["source"]["format"], "parquet");
|
||||
assert_eq!(
|
||||
body["source"]["uris"],
|
||||
serde_json::json!(["s3://b/x.parquet"])
|
||||
);
|
||||
assert_eq!(body["target_key"], "document_id");
|
||||
assert_eq!(body["source_key"], "doc_id");
|
||||
assert_eq!(body["on_missing"], "null");
|
||||
assert_eq!(body["num_workers"], 4);
|
||||
|
||||
http::Response::builder()
|
||||
.status(202)
|
||||
.body(r#"{"job_id":"lc-7"}"#)
|
||||
.unwrap()
|
||||
});
|
||||
|
||||
let request = crate::table::LoadColumnsRequest {
|
||||
source_uris: vec!["s3://b/x.parquet".to_string()],
|
||||
source_format: "parquet".to_string(),
|
||||
source_storage_options: None,
|
||||
target_key: "document_id".to_string(),
|
||||
source_key: Some("doc_id".to_string()),
|
||||
columns: vec![("embedding".to_string(), Some("emb".to_string()))],
|
||||
on_missing: Some("null".to_string()),
|
||||
num_workers: Some(4),
|
||||
max_workers: None,
|
||||
batch_size: None,
|
||||
commit_granularity: None,
|
||||
priority: None,
|
||||
};
|
||||
let job_id = table.load_columns(request).await.unwrap();
|
||||
assert_eq!(job_id, "lc-7");
|
||||
}
|
||||
|
||||
#[tokio::test]
|
||||
async fn test_version() {
|
||||
let table = Table::new_with_handler("my_table", |request| {
|
||||
|
||||
@@ -471,6 +471,33 @@ impl LsmWriteSpec {
|
||||
}
|
||||
}
|
||||
|
||||
/// Request to fill existing table columns from an external source by
|
||||
/// primary-key join (Geneva `Table.load_columns()` parity). Server-backed
|
||||
/// feature (LanceDB Enterprise / Cloud).
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct LoadColumnsRequest {
|
||||
/// External source URIs.
|
||||
pub source_uris: Vec<String>,
|
||||
/// Source format: "parquet" | "lance" | "ipc".
|
||||
pub source_format: String,
|
||||
/// Source-only storage options (e.g. cloud credentials).
|
||||
pub source_storage_options: Option<HashMap<String, String>>,
|
||||
/// Destination primary-key column.
|
||||
pub target_key: String,
|
||||
/// Source primary-key column. Defaults to `target_key` when None.
|
||||
pub source_key: Option<String>,
|
||||
/// Value column mappings as `(target, source)`; a None source defaults to
|
||||
/// the target name.
|
||||
pub columns: Vec<(String, Option<String>)>,
|
||||
/// Missing-row policy: "carry" (default) | "null" | "error".
|
||||
pub on_missing: Option<String>,
|
||||
pub num_workers: Option<u32>,
|
||||
pub max_workers: Option<u32>,
|
||||
pub batch_size: Option<u32>,
|
||||
pub commit_granularity: Option<u32>,
|
||||
pub priority: Option<String>,
|
||||
}
|
||||
|
||||
/// A trait for anything "table-like". This is used for both native tables (which target
|
||||
/// Lance datasets) and remote tables (which target LanceDB cloud)
|
||||
///
|
||||
@@ -620,6 +647,47 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
|
||||
transforms: NewColumnTransform,
|
||||
read_columns: Option<Vec<String>>,
|
||||
) -> Result<AddColumnsResult>;
|
||||
/// Declare computed columns bound to a registered function: each
|
||||
/// `(name, sql_type)` is added all-null with the expression stored
|
||||
/// as its binding; no compute happens here (the server's lazy
|
||||
/// detector or refresh_column fills them). Several columns map a
|
||||
/// struct-returning function's fields positionally. Server-backed
|
||||
/// feature; the default returns NotSupported.
|
||||
async fn add_computed_columns(
|
||||
&self,
|
||||
_columns: &[(String, String)],
|
||||
_expression: &str,
|
||||
) -> Result<()> {
|
||||
Err(Error::NotSupported {
|
||||
message: "computed columns are not supported by this table".into(),
|
||||
})
|
||||
}
|
||||
/// Trigger recompute of computed columns. The expression is
|
||||
/// resolved server-side from each column's stored binding; columns
|
||||
/// bound to the same struct-returning function refresh together.
|
||||
/// Returns the refresh job id. Server-backed feature (LanceDB
|
||||
/// Enterprise / Cloud); the default returns NotSupported.
|
||||
async fn refresh_column(
|
||||
&self,
|
||||
_columns: &[String],
|
||||
_where_clause: Option<String>,
|
||||
_num_workers: Option<u32>,
|
||||
_max_workers: Option<u32>,
|
||||
_batch_size: Option<u32>,
|
||||
_priority: Option<String>,
|
||||
) -> Result<String> {
|
||||
Err(Error::NotSupported {
|
||||
message: "refresh_column is not supported by this table".into(),
|
||||
})
|
||||
}
|
||||
/// Fill existing columns from an external source by primary-key join
|
||||
/// (Geneva `load_columns`). Returns the load job id. Server-backed feature;
|
||||
/// the default returns NotSupported.
|
||||
async fn load_columns(&self, _request: LoadColumnsRequest) -> Result<String> {
|
||||
Err(Error::NotSupported {
|
||||
message: "load_columns is not supported by this table".into(),
|
||||
})
|
||||
}
|
||||
/// Alter columns in the table.
|
||||
async fn alter_columns(&self, alterations: &[ColumnAlteration]) -> Result<AlterColumnsResult>;
|
||||
/// Drop columns from the table.
|
||||
@@ -1461,6 +1529,48 @@ impl Table {
|
||||
self.inner.add_columns(transforms, read_columns).await
|
||||
}
|
||||
|
||||
/// Declare computed columns bound to a registered function
|
||||
/// (`(name, sql_type)` pairs + a `f(args)` expression). No compute
|
||||
/// happens here. Server-backed feature.
|
||||
pub async fn add_computed_columns(
|
||||
&self,
|
||||
columns: &[(String, String)],
|
||||
expression: &str,
|
||||
) -> Result<()> {
|
||||
self.inner.add_computed_columns(columns, expression).await
|
||||
}
|
||||
|
||||
/// Trigger recompute of computed columns (REFRESH COLUMN). The
|
||||
/// expression comes from each column's stored binding; columns
|
||||
/// bound to the same struct-returning function refresh together.
|
||||
/// Returns the refresh job id. Server-backed feature.
|
||||
pub async fn refresh_column(
|
||||
&self,
|
||||
columns: &[String],
|
||||
where_clause: Option<String>,
|
||||
num_workers: Option<u32>,
|
||||
max_workers: Option<u32>,
|
||||
batch_size: Option<u32>,
|
||||
priority: Option<String>,
|
||||
) -> Result<String> {
|
||||
self.inner
|
||||
.refresh_column(
|
||||
columns,
|
||||
where_clause,
|
||||
num_workers,
|
||||
max_workers,
|
||||
batch_size,
|
||||
priority,
|
||||
)
|
||||
.await
|
||||
}
|
||||
|
||||
/// Fill existing columns from an external Parquet/Lance/IPC source by
|
||||
/// primary-key join (Geneva `Table.load_columns()`). Returns the job id.
|
||||
pub async fn load_columns(&self, request: LoadColumnsRequest) -> Result<String> {
|
||||
self.inner.load_columns(request).await
|
||||
}
|
||||
|
||||
/// Change a column's name or nullability.
|
||||
pub async fn alter_columns(
|
||||
&self,
|
||||
|
||||