Compare commits

27 Commits

Author SHA1 Message Date
Jack Ye
bff911a65d chore: point lance dependency at jack branch 2026-06-29 22:46:07 -07:00
Wyatt Alt
3a4cdb7aff fix(udf): JobHandle.wait() terminates on failed jobs
wait() (sync + async) only stopped on finished/stale/committed, so a job the
server already reported as state=failed was polled until the (default 3600s)
timeout, then raised a misleading TimeoutError instead of the real cause. A
doomed backfill -- e.g. a multi-column REFRESH COLUMN of a scalar UDF -- hung
the client even though get_job surfaced the failure within ~3s.

Add a terminal failed branch that raises JobFailedError carrying the server
error, exported from the package. Verified end-to-end against the cluster:
raises in 3.6s instead of hanging. Unit-tested with a mock conn (sync+async,
failure + success + committed paths).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:35 -07:00
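The terminal-state handling described in 3a4cdb7aff can be sketched as the loop below. This is a minimal illustration, not the client's actual code: only the `JobFailedError` name, the finished/stale/committed/failed states, and the 3600s default come from the commit message. The `get_job` callable, the dict shape with `"state"` and `"error"` keys, and the `poll_interval` and `sleep` parameters are assumptions made for the example.

```python
import time


class JobFailedError(RuntimeError):
    """Raised when the server reports state=failed (name from the commit message)."""

    def __init__(self, job_id, error):
        super().__init__(f"job {job_id} failed: {error}")
        self.job_id = job_id
        self.error = error


# States the commit message lists as already stopping the poll loop.
DONE_STATES = {"finished", "stale", "committed"}


def wait(get_job, job_id, timeout=3600.0, poll_interval=1.0, sleep=time.sleep):
    """Poll get_job(job_id) until the job reaches a terminal state.

    get_job is a hypothetical callable returning a dict such as
    {"state": "running"} or {"state": "failed", "error": "..."}.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        state = job["state"]
        if state in DONE_STATES:
            return job
        if state == "failed":
            # Surface the real cause instead of polling until the timeout.
            raise JobFailedError(job_id, job.get("error", "unknown error"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state!r} after {timeout}s")
        sleep(poll_interval)
```

Checking the failed state before the deadline is what lets a doomed job raise within one poll interval rather than after the full timeout.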
Wyatt Alt
142ac835d3 client: job_history() and errors() over REST (SHOW JOB HISTORY / SHOW ERRORS)
The client exposed list_jobs/get_job/cancel_job but not the durable job
history or the per-row UDF errors, so those SQL/REST surfaces had no SDK
equivalent. Add job_history(job_id=None) and errors(job_id=None, table=None)
through every layer:

- Database trait + Connection API (JobHistoryInfo, JobErrorInfo types).
- Remote REST impl: GET /v1/job/history (?job=) and GET /v1/job/errors
  (?job=&table=), with serde response types + From mappings.
- pyo3 bindings + pyclasses JobHistoryEntry / JobErrorEntry, registered.
- Python sync + async db.py wrappers.

Mirrors the existing list_jobs plumbing exactly. Remote-handler test asserts
the GET paths, query filters, and response parsing for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:35 -07:00
Wyatt Alt
3f44f93e92 job wait(): poll by id via get_job (point access) instead of list_jobs
JobHandle/AsyncJobHandle now poll conn.get_job(id, table) -- one job -- instead
of list_jobs() + client-side filter over every active job. The job's table is
threaded in from refresh_column / MV refresh as an O(1) lookup hint. Plumbs
get_job through the Database trait (default not_supported), RemoteDatabase
(GET /v1/job/{id}?table=...), the Connection wrapper, and the pyo3 binding +
db.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:35 -07:00
Wyatt Alt
9dfa43a9de fix: sync Connection.lineage delegates to AsyncConnection.lineage
self._conn on a remote sync connection is an AsyncConnection (python), which
exposes `lineage` (parses the JSON), not the pyo3 `table_lineage`. The sync
wrapper was calling self._conn.table_lineage -> AttributeError. Drive
self._conn.lineage on the loop instead, mirroring create_materialized_view.
(table_lineage stays the pyo3 method the async path calls via self._inner.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
03e895fa5c Merge integration submodule (refresh_column JobHandle / create_mv fix) into lineage 2026-06-29 22:35:34 -07:00
Wyatt Alt
c31e53088e client: slice 4 -- Python lineage surface
- new lineage.py: Lineage / Node / Edge / FunctionRef dataclasses that parse the
  server's lineage JSON, with to_dict(), to_graphviz() (drift edges dashed+red),
  and _repr_html_(); plus .functions() / .stale() helpers.
- Connection.lineage(table, column=, direction=, depth=) (sync + async) calls the
  pyo3 table_lineage binding and deserializes into Lineage.
- Table.lineage(column=, ...) via the table's job connection; MaterializedView /
  AsyncMaterializedView .lineage() delegate to the backing table (the server
  already includes the view's sources + downstream dependents).
- export the new types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
434a5be187 feat(client): Table.refresh_column returns a JobHandle (like MaterializedView.refresh)
refresh_column returned the bare job-id str, so callers had to wrap it:
db.job(tbl.refresh_column("c")).wait(). Mirror MaterializedView.refresh() and
return a JobHandle directly, so tbl.refresh_column("c").wait() / .status() / .id
work without the wrapper. (db.job(job_id) stays for reconnecting by a stored id.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
78aa005093 client: slice 3 -- thread table_lineage through the remote client + pyo3
A new Database::table_lineage(TableLineageRequest) -> Result<String> threaded
end to end: default not_supported in the trait; the remote impl issues
GET /v1/table/{name}/lineage with column/direction/depth query params and
returns the body verbatim; connection.rs exposes a pub wrapper; the pyo3
binding hands the JSON string to Python.

The lineage payload is carried as opaque JSON on purpose: the open-source
lancedb client must not depend on the sophon-internal derived_jobs crate that
defines the lineage schema, so the wire format is the contract and the Python
layer deserializes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
6191542cfe fix(mv): MaterializedView.refresh calls the async _refresh (underscore)
The sync _refresh_materialized_view called self._conn.refresh_materialized_view
(no underscore); the async method is _refresh_materialized_view, so
MaterializedView.refresh() raised AttributeError. Add the underscore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
6af3088b91 client: make refresh_materialized_view private (reach it via the handle)
Refresh is a submit-a-job verb, so its only public surface should be
MaterializedView.refresh() / AsyncMaterializedView.refresh() (which return a
job handle). Rename the connection methods to _refresh_materialized_view and
have the handles call that, so the raw by-name refresh is no longer advertised
on the connection. The pyo3 native binding is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
e73d4618d8 fix(mv): create_materialized_view passes query as keyword, not positional
The sync RemoteDBConnection.create_materialized_view assembled the SELECT but
called the async create_materialized_view with the query as the 2nd positional
arg, which binds to `source=` (query= is keyword-only). Every call then failed
the "needs either query= or both source and select" validation. Pass query=query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
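The binding mistake fixed in e73d4618d8 can be reproduced with a small stand-in. The signature below is an assumption reconstructed from the commit message (a second positional parameter named `source` and a keyword-only `query`); the real method takes more arguments and talks to the server.

```python
def create_materialized_view(name, source=None, select=None, *, query=None):
    """Hypothetical stand-in that mirrors only the argument validation."""
    if query is None and (source is None or select is None):
        raise ValueError("needs either query= or both source and select")
    return {"name": name, "source": source, "select": select, "query": query}


sql = "SELECT id, embed(data) AS vec FROM docs"  # illustrative query text

# Buggy call shape: the SQL string binds to `source`, so validation fails.
try:
    create_materialized_view("mv", sql)
    buggy_call_failed = False
except ValueError:
    buggy_call_failed = True

# Fixed call shape: pass the SQL string by keyword.
view = create_materialized_view("mv", query=sql)
```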
Wyatt Alt
3d92106394 client: split create_view into create_materialized_view; return job handles
- create_materialized_view now takes either query= or source+select (folds in
  the old create_view builder) and returns a MaterializedView handle whose
  .wait() blocks on initial population. create_view is removed -- it was
  misnamed (it built a *materialized* view, while CREATE VIEW means the plain
  non-materialized view the engine also supports).
- MaterializedView.refresh() and the remote Table.refresh_column() now return a
  JobHandle directly, so tbl.refresh_column("c").wait() needs no db.job(...)
  wrapper. db.job(id) is narrowed to reconnect-by-id (stored id / SQL / REST).
- rename View/AsyncView -> MaterializedView/AsyncMaterializedView (+ exports).
- tighten the replace path: only a not-found error on the pre-drop is benign;
  real failures (perms/server) now surface instead of being swallowed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
5810974b37 feat(client): Table.load_columns() REST client for LOAD COLUMNS
Geneva Table.load_columns() parity on the REST-only client. Fills existing
columns from an external Parquet/Lance/IPC source by primary-key join.

- BaseTable::load_columns default (NotSupported) + public Table::load_columns,
  taking a LoadColumnsRequest (source uris/format/storage_options, target/source
  key, (target, source?) column mappings, on_missing, worker/batch/commit knobs).
- Remote impl POSTs to /v1/table/{id}/load_columns with the matching body;
  mock test asserts the request shape.
- PyO3 binding + Python remote Table.load_columns(source, pk, columns, *,
  source_format, source_pk, on_missing, ...) accepting a column list or
  {target: source} dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
8b38500b07 feat(view): full=True force-rebuild on refresh_materialized_view
View.refresh(full=True) (sync + async) now works -- it previously raised
NotImplementedError. Thread the flag through the client:
RefreshMaterializedViewRequest.full -> the REST body
(RemoteRefreshMaterializedViewRequest.full); pyo3
refresh_materialized_view(full=...); Connection.refresh_materialized_view(name,
full=) sync + async. A full refresh forces a recompute-and-replace and
preserves the view's indexes (reindexed by the distributed indexer).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
fd0a3b97d0 feat(view): materialized views are first-class indexable + searchable
Add View.create_index / create_scalar_index / create_fts_index / search
as pass-throughs to open_table(name). A materialized view is a real Lance
dataset; these let it be indexed and searched like any other table,
closing the parity gap with Geneva (whose create_materialized_view returns
a first-class Table).

The server-side create_index handler records indexes declared on a view so
they survive a full refresh (which overwrites the dataset, dropping its
indices); that re-apply is wired in the sophon engine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
b9f33ba1c9 feat(refresh): priority as a per-refresh knob; fix batch_size on RemoteTable
Thread priority (Kueue tier) through refresh_column at every layer (Python sync+async
+ RemoteTable -> pyo3 -> Rust client trait/public/remote -> REST body), mirroring
num_workers/batch_size. The function keeps its priority as a default; the per-refresh
value overrides. Also adds the previously-missed batch_size to RemoteTable.refresh_column
(the REST sync path). cargo check (lancedb --features remote --tests, lancedb-python) +
ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
d4f4fef3ba feat(refresh): batch_size is a per-refresh knob (refresh_column), not a function-only option
batch_size / num_workers / max_workers are invocation concerns (how to schedule THIS
refresh), so expose batch_size on refresh_column through every layer (Python sync+async
-> pyo3 -> Rust client -> the REST RefreshColumnRequest.batch_size, which the handler
already forwards into the backfill). num_workers/max_workers were already
invocation-placed; batch_size was the gap. The function may still carry a default; the refresh
override wins (extends the batch_size_override model). Both crates cargo-check clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
fbe6a5a3fd feat(udf): computed columns as expressions -- add_columns(computed={col: fn("input")})
A computed column is an expression over a registered function applied to input
columns, not a UDF coupled to a column. fn("data") already returned the expression
string "fn(data)"; make it a ColumnExpr (a str subclass) that also carries the
function's return type, so add_columns(computed={"vec": embed("data")}) declares the
column with no hand-written type. _normalize_computed handles the new form (and tuple
keys for STRUCT fan-out) and keeps the legacy {col: (sql_type, expression)} tuple.
add_computed_column is deprecated (delegates, with a DeprecationWarning). The function
stays decoupled from columns -- register once, apply anywhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
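The "str subclass that also carries the return type" idea from fbe6a5a3fd can be sketched as follows. Only the `ColumnExpr` name and the `fn("input")` -> `"fn(input)"` behaviour come from the commit message. The `return_type` attribute name and the `make_function` helper are hypothetical, and the `double_it` / `FLOAT` pair is borrowed from the 04948fc4f6 example further down the log.

```python
class ColumnExpr(str):
    """An expression string that also remembers the function's return type."""

    def __new__(cls, expression, return_type):
        obj = super().__new__(cls, expression)
        obj.return_type = return_type  # attribute name is an assumption
        return obj


def make_function(name, return_type):
    """Hypothetical stand-in for a registered function handle."""

    def call(*columns):
        return ColumnExpr(f"{name}({', '.join(columns)})", return_type)

    return call


double_it = make_function("double_it", "FLOAT")
expr = double_it("val")
# expr still compares equal to the plain string "double_it(val)",
# so code expecting the old string form keeps working.
```

Because `ColumnExpr` is still a `str`, the legacy `{col: (sql_type, expression)}` form and the new `{col: expression}` form can share the same expression handling, with the type read from the expression when no explicit type is given.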
Wyatt Alt
127054069a feat(mv): partition_by option on create_materialized_view / create_view
Thread an optional partition_by through the client: CreateMaterializedViewRequest
-> REST body -> pyo3 binding -> Python create_materialized_view/create_view
kwarg (sync + async). The server partitions the view's table function by the
named source column -- by IVF index clusters if the column is indexed
(image-dedup), else by distinct value. Unifies Geneva's partition_by +
partition_by_indexed_column into one knob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
b20931b8f7 feat: async UDF client ergonomics (AsyncConnection/AsyncTable + AsyncView/AsyncJobHandle)
Mirrors the sync ergonomics on the async surface: AsyncConnection
create_function(udf, replace=)/create_view/job; AsyncTable.add_computed_column;
AsyncView + AsyncJobHandle (await + asyncio.sleep; shared submission-prefix
matcher with the sync JobHandle). Decorator + REST routes are shared/already
validated; this is the async wrapper layer. Exported from the package root.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
396d68e490 fix: JobHandle resolves the manifest job id from the submission id
db.job(id) gets the submission id the refresh/backfill endpoints return,
but list_jobs / cancel report the agent's manifest id
(<table>-<type>-<first 8 of submission id>). JobHandle now matches that
(exact id or submission prefix) so wait()/progress() truly track, and
cancel() cancels by the resolved canonical id instead of the unusable
submission uuid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
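The id matching described in 396d68e490 (exact id, or a manifest id ending in the first 8 characters of the submission id) might look like the sketch below. The function name and the sample ids are made up for illustration; only the `<table>-<type>-<first 8 of submission id>` layout comes from the commit message.

```python
def matches_job(reported_id: str, wanted: str) -> bool:
    """Return True if a job id reported by the server refers to `wanted`.

    `wanted` may be the manifest id itself or the submission id that the
    refresh/backfill endpoints returned.
    """
    if reported_id == wanted:
        return True
    # Manifest ids end with "-<first 8 chars of the submission id>".
    return len(wanted) >= 8 and reported_id.endswith("-" + wanted[:8])


submission_id = "3f2a9c1e-7b1d-4c2e-9a55-0123456789ab"  # illustrative uuid
manifest_id = "images-backfill-3f2a9c1e"  # illustrative manifest id

by_submission = matches_job(manifest_id, submission_id)
by_exact = matches_job(manifest_id, manifest_id)
other_job = matches_job("images-backfill-deadbeef", submission_id)
```

Once a match is found, the resolved manifest id is the one to pass to cancel, since the commit notes that the submission uuid is not usable there.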
Wyatt Alt
ad37f87387 feat: fold UDF authoring into lancedb (udf module + connection/table ergonomics)
Brings the @udf/@table_udf decorator + type inference into lancedb as
lancedb.udf (Apache-2.0), and adds the ergonomic glue to the existing
connection/table so there's no separate object model:
- create_function() accepts a Udf (and a replace= flag)
- Table.add_computed_column(column, udf)
- create_view(name, source, select, ...) -> View (assembles the SELECT)
- Connection.job(job_id) -> JobHandle
- View / JobHandle are thin references over a connection
Exports udf/table_udf/Udf/JobHandle/View from the package root. The
operations stay the existing remote-only methods (enterprise/cloud); the
decorator works locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
e93476f0e0 feat: explain_refresh_materialized_view over REST (EXPLAIN REFRESH SDK)
Database trait gains explain_refresh_materialized_view (default NotSupported)
returning an MvRefreshPlan; RemoteDatabase POSTs
/v1/materialized_view/{name}/explain_refresh; Connection method; pyo3
MvRefreshPlan pyclass + binding; sync+async python wrappers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
2b41fce033 feat: cancel_job over REST (Database::cancel_job + remote impl + pyo3 + python)
Exposes the existing server-side CANCEL JOB (CoordinatorCatalog::cancel_job)
as a REST-backed SDK method: Database trait default NotSupported,
RemoteDatabase POSTs /v1/job/{id}/cancel, pyo3 binding, sync+async python
wrappers. Best-effort: a missing job returns false, not an error. Mock-HTTP
unit test in test_derived_compute_routes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
04948fc4f6 feat: computed columns as a param on add_columns
Per the interface design: computed columns are parameters on the
existing add_columns operation, not a separate method.

- BaseTable::add_computed_columns((name, sql_type) pairs + a f(args)
  expression) -- default NotSupported; RemoteTable posts 'computed'
  entries to the existing /v1/table/{id}/add_columns route.
- python add_columns gains computed= on LanceTable, RemoteTable, and
  AsyncTable: tbl.add_columns(computed={'doubled': ('FLOAT',
  'double_it(val)')}); grouped by expression so struct-returning
  functions' columns land adjacently.
2026-06-29 22:35:34 -07:00
Wyatt Alt
ff3c7111b9 feat: SDK surface for functions, materialized views, jobs, refresh_column
Adds the derived-compute interface to the SDK:

- Database trait: create/list/drop_function, create/refresh/alter/
  drop/list_materialized_view, list_jobs -- default implementations
  return Error::NotSupported (NotImplementedError in python), so
  existing Database impls are unaffected; local single-node
  implementations are planned. BaseTable gains refresh_column with
  the same default.
- RemoteDatabase/RemoteTable implement them against the server REST
  routes (/v1/function/*, /v1/materialized_view/*, /v1/job/list,
  /v1/table/{id}/refresh_column), with mock-HTTP unit tests.
- Connection/Table public methods, pyo3 bindings (FunctionInfo,
  MaterializedViewInfo, JobInfo pyclasses), and python wrappers:
  sync on the DBConnection base (shared by local and remote
  connections), async on AsyncConnection; refresh_column on
  LanceTable, RemoteTable, and AsyncTable.
2026-06-29 22:35:34 -07:00
23 changed files with 3834 additions and 355 deletions

@@ -1,137 +0,0 @@
---
name: lancedb-branch-ops
description: Branch management for LanceDB tables via the REST API. Use this skill whenever someone wants to create, delete, list, or switch branches on a LanceDB table — or needs to make sure a write (metadata update, index build, etc.) lands on a specific branch instead of main. Invoke it even without the word "branch" if context makes clear they want an experimental copy of a table, want to isolate changes, or want to confirm a mutation didn't touch main. Covers: branches/list, branches/create, branches/delete, and passing "branch" in describe/update_field_metadata/create_index to target a non-main version.
---
## Goal
Manage branches on a LanceDB table: list what exists, create new ones, delete stale ones, and direct read/write operations at a specific branch without touching main.
## Step 0: Establish the connection
Use the `lancedb-connect` skill to resolve the base URL and auth headers (`x-api-key`, `x-lancedb-database`). Skip this only if the connection is already known from the current conversation.
All examples below use `{base_url}` — substitute the resolved endpoint and include the auth headers on every request.
## The branch model (important)
LanceDB branches are named snapshots that diverge from the table's current state at creation time. There is **no checkout command** — you never switch the whole table to a branch. Instead, you **pass `"branch": "<name>"` in the request body** of any operation to target that branch. Omitting the key (or sending an empty body) always targets main.
`branches/list` returns only non-main branches. Main always exists and is not listed.
## List branches
```http
POST {base_url}/v1/table/{table_id}/branches/list
Content-Type: application/json
{}
```
Response:
```json
{
"branches": {
"experiment-reindex": {"parentVersion": 1, "createAt": 1782506085, "manifestSize": 1029}
}
}
```
If `branches` is `{}`, the table has no branches besides main.
## Create a branch
```http
POST {base_url}/v1/table/{table_id}/branches/create
Content-Type: application/json
{"name": "experiment-reindex"}
```
HTTP 200 with `{}` body = success. The branch is created off the table's current state on main.
Verify by calling `branches/list` and confirming the new name appears.
## Delete a branch
```http
POST {base_url}/v1/table/{table_id}/branches/delete
Content-Type: application/json
{"name": "stale-2024"}
```
HTTP 200 with `{}` body = success. Only the branch pointer is removed — main and all row data remain intact.
Verify by calling `branches/list` (name gone) and `describe` with no branch param (main still responds).
## Operate on a specific branch
Pass `"branch": "<name>"` in the body of any operation to scope it to that branch:
**Read schema on a branch:**
```http
POST {base_url}/v1/table/{table_id}/describe
Content-Type: application/json
{"branch": "wip-branch"}
```
**Write metadata to a branch (not main):**
```http
POST {base_url}/v1/table/{table_id}/update_field_metadata
Content-Type: application/json
{
"branch": "wip-branch",
"updates": [
{
"path": "category",
"metadata": {"lancedb:description": "Product category label."},
"replace": false
}
]
}
```
**Build an index on a branch:**
```http
POST {base_url}/v1/table/{table_id}/create_index
Content-Type: application/json
{
"branch": "wip-branch",
"column": "category",
"index_type": "BTREE"
}
```
## Verifying isolation
After writing to a branch, always confirm the change did NOT land on main:
```bash
# Should show the new metadata
curl -s -X POST {base_url}/v1/table/{table_id}/describe \
-H "x-api-key: <key>" -H "x-lancedb-database: <db>" \
-H "content-type: application/json" \
-d '{"branch": "wip-branch"}'
# Should NOT show the new metadata
curl -s -X POST {base_url}/v1/table/{table_id}/describe \
-H "x-api-key: <key>" -H "x-lancedb-database: <db>" \
-H "content-type: application/json" \
-d '{}'
```
## Quick reference
| Goal | Endpoint | Body |
|------|----------|------|
| List all branches | `branches/list` | `{}` |
| Create a branch | `branches/create` | `{"name": "..."}` |
| Delete a branch | `branches/delete` | `{"name": "..."}` |
| Read schema on branch | `describe` | `{"branch": "..."}` |
| Write metadata on branch | `update_field_metadata` | `{"branch": "...", "updates": [...]}` |
| Build index on branch | `create_index` | `{"branch": "...", "column": ..., "index_type": ...}` |
| Target main (default) | any endpoint | omit `"branch"` key |

74
Cargo.lock generated

@@ -1750,7 +1750,7 @@ version = "3.1.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "faf9468729b8cbcea668e36183cb69d317348c2e08e994829fb56ebfdfbaac34"
dependencies = [
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -3014,7 +3014,7 @@ dependencies = [
"libc",
"option-ext",
"redox_users",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -3231,7 +3231,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb"
dependencies = [
"libc",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -3424,7 +3424,7 @@ checksum = "42703706b716c37f96a77aea830392ad231f44c9e9a67872fa5548707e11b11c"
[[package]]
name = "fsst"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"rand 0.9.4",
@@ -4466,7 +4466,7 @@ checksum = "3640c1c38b8e4e43584d8df18be5fc6b0aa314ce6ebf51b53313d4306cca8e46"
dependencies = [
"hermit-abi",
"libc",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -4560,7 +4560,7 @@ dependencies = [
"portable-atomic-util",
"serde_core",
"wasm-bindgen",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -4727,7 +4727,7 @@ checksum = "e037a2e1d8d5fdbd49b16a4ea09d5d6401c1f29eca5ff29d03d3824dba16256a"
[[package]]
name = "lance"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arc-swap",
"arrow",
@@ -4802,7 +4802,7 @@ dependencies = [
[[package]]
name = "lance-arrow"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4823,7 +4823,7 @@ dependencies = [
[[package]]
name = "lance-arrow-scalar"
version = "58.0.0"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4837,7 +4837,7 @@ dependencies = [
[[package]]
name = "lance-arrow-stats"
version = "58.0.0"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-schema",
@@ -4847,7 +4847,7 @@ dependencies = [
[[package]]
name = "lance-bitpacking"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrayref",
"crunchy",
@@ -4858,7 +4858,7 @@ dependencies = [
[[package]]
name = "lance-core"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -4897,7 +4897,7 @@ dependencies = [
[[package]]
name = "lance-datafusion"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow",
"arrow-array",
@@ -4928,7 +4928,7 @@ dependencies = [
[[package]]
name = "lance-datagen"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow",
"arrow-array",
@@ -4946,7 +4946,7 @@ dependencies = [
[[package]]
name = "lance-derive"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"proc-macro2",
"quote",
@@ -4956,7 +4956,7 @@ dependencies = [
[[package]]
name = "lance-encoding"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -4992,7 +4992,7 @@ dependencies = [
[[package]]
name = "lance-file"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-arith",
"arrow-array",
@@ -5023,7 +5023,7 @@ dependencies = [
[[package]]
name = "lance-index"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arc-swap",
"arrow",
@@ -5089,7 +5089,7 @@ dependencies = [
[[package]]
name = "lance-io"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow",
"arrow-arith",
@@ -5131,7 +5131,7 @@ dependencies = [
[[package]]
name = "lance-linalg"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -5148,7 +5148,7 @@ dependencies = [
[[package]]
name = "lance-namespace"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow",
"async-trait",
@@ -5161,7 +5161,7 @@ dependencies = [
[[package]]
name = "lance-namespace-impls"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow",
"arrow-ipc",
@@ -5216,7 +5216,7 @@ dependencies = [
[[package]]
name = "lance-select"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-buffer",
@@ -5232,7 +5232,7 @@ dependencies = [
[[package]]
name = "lance-table"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow",
"arrow-array",
@@ -5272,7 +5272,7 @@ dependencies = [
[[package]]
name = "lance-testing"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"arrow-array",
"arrow-schema",
@@ -5286,7 +5286,7 @@ dependencies = [
[[package]]
name = "lance-tokenizer"
version = "9.0.0-beta.10"
source = "git+https://github.com/lance-format/lance.git?tag=v9.0.0-beta.10#e25b71e74b89d10c57b412d111bde087117383f3"
source = "git+https://github.com/jackye1995/lance.git?branch=jack%2Fsophon-pr-6325#1c5b5061c60934b4c18dbe86c5e91b4961105989"
dependencies = [
"icu_segmenter",
"jieba-rs",
@@ -6085,7 +6085,7 @@ version = "0.50.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7957b9740744892f114936ab4a57b3f487491bbeafaf8083688b16841a4240e5"
dependencies = [
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -7400,8 +7400,8 @@ version = "0.14.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "343d3bd7056eda839b03204e68deff7d1b13aba7af2b2fd16890697274262ee7"
dependencies = [
"heck 0.5.0",
"itertools 0.14.0",
"heck 0.4.1",
"itertools 0.11.0",
"log",
"multimap",
"petgraph",
@@ -7420,7 +7420,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "27c6023962132f4b30eb4c172c91ce92d933da334c59c23cddee82358ddafb0b"
dependencies = [
"anyhow",
"itertools 0.14.0",
"itertools 0.11.0",
"proc-macro2",
"quote",
"syn 2.0.117",
@@ -7654,7 +7654,7 @@ dependencies = [
"once_cell",
"socket2 0.6.3",
"tracing",
"windows-sys 0.60.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -8394,7 +8394,7 @@ dependencies = [
"errno",
"libc",
"linux-raw-sys",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -8465,7 +8465,7 @@ dependencies = [
"security-framework",
"security-framework-sys",
"webpki-root-certs",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -9027,7 +9027,7 @@ version = "0.8.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c1c97747dbf44bb1ca44a561ece23508e99cb592e862f22222dcf42f51d1e451"
dependencies = [
"heck 0.5.0",
"heck 0.4.1",
"proc-macro2",
"quote",
"syn 2.0.117",
@@ -9039,7 +9039,7 @@ version = "0.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "54254b8531cafa275c5e096f62d48c81435d1015405a91198ddb11e967301d40"
dependencies = [
"heck 0.5.0",
"heck 0.4.1",
"proc-macro2",
"quote",
"syn 2.0.117",
@@ -9472,7 +9472,7 @@ dependencies = [
"getrandom 0.4.2",
"once_cell",
"rustix",
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]
@@ -10407,7 +10407,7 @@ version = "0.1.11"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22"
dependencies = [
"windows-sys 0.61.2",
"windows-sys 0.59.0",
]
[[package]]


@@ -13,20 +13,20 @@ categories = ["database-implementations"]
rust-version = "1.91.0"
[workspace.dependencies]
lance = { "version" = "=9.0.0-beta.10", default-features = false, "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-core = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-datagen = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-file = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-io = { "version" = "=9.0.0-beta.10", default-features = false, "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-index = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-linalg = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-namespace-impls = { "version" = "=9.0.0-beta.10", default-features = false, "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-table = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-testing = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-datafusion = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-encoding = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance-arrow = { "version" = "=9.0.0-beta.10", "tag" = "v9.0.0-beta.10", "git" = "https://github.com/lance-format/lance.git" }
lance = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-core = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-datagen = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-file = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-io = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-index = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-linalg = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-namespace = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-namespace-impls = { "version" = "=9.0.0-beta.10", default-features = false, "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-table = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-testing = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-datafusion = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-encoding = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
lance-arrow = { "version" = "=9.0.0-beta.10", "branch" = "jack/sophon-pr-6325", "git" = "https://github.com/jackye1995/lance.git" }
ahash = "0.8"
# Note that this one does not include pyarrow
arrow = { version = "58.0.0", optional = false }


@@ -3,7 +3,7 @@
use std::time::Duration;
use lancedb::{ipc::ipc_file_to_batches, table::merge::MergeInsertBuilder};
use lancedb::{arrow::IntoArrow, ipc::ipc_file_to_batches, table::merge::MergeInsertBuilder};
use napi::bindgen_prelude::*;
use napi_derive::napi;
@@ -66,9 +66,11 @@ impl NativeMergeInsertBuilder {
#[napi(catch_unwind)]
pub async fn execute(&self, buf: Buffer) -> napi::Result<MergeResult> {
let data = ipc_file_to_batches(buf.to_vec()).map_err(|e| {
napi::Error::from_reason(format!("Failed to read IPC file: {}", convert_error(&e)))
})?;
let data = ipc_file_to_batches(buf.to_vec())
.and_then(IntoArrow::into_arrow)
.map_err(|e| {
napi::Error::from_reason(format!("Failed to read IPC file: {}", convert_error(&e)))
})?;
let this = self.clone();


@@ -17,6 +17,17 @@ from .db import AsyncConnection, DBConnection, LanceDBConnection
from .remote import ClientConfig
from .remote.db import RemoteDBConnection
from .expr import Expr, col, lit, func
from .udf import (
udf,
table_udf,
Udf,
JobHandle,
JobFailedError,
MaterializedView,
AsyncJobHandle,
AsyncMaterializedView,
)
from .lineage import Lineage, Node, Edge, FunctionRef
from .schema import vector
from .table import AsyncTable, Table
from ._lancedb import Session
@@ -448,6 +459,18 @@ async def connect_async(
__all__ = [
"udf",
"table_udf",
"Udf",
"JobHandle",
"JobFailedError",
"MaterializedView",
"AsyncJobHandle",
"AsyncMaterializedView",
"Lineage",
"Node",
"Edge",
"FunctionRef",
"connect",
"connect_async",
"connect_namespace",


@@ -65,6 +65,7 @@ if TYPE_CHECKING:
from .common import DATA, URI
from .embeddings import EmbeddingFunctionConfig
from ._lancedb import Session
from .udf import MaterializedView, AsyncMaterializedView
from .namespace_utils import (
_normalize_create_namespace_mode,
@@ -562,6 +563,259 @@ class DBConnection(EnforceOverrides):
"""
raise NotImplementedError("serialize is not supported for this connection type")
# -- Derived compute: functions, materialized views, jobs -------------
# Server-backed features (LanceDB Enterprise / Cloud); local
# connections raise NotImplementedError for now.
def create_function(
self,
name,
language: str = "python",
return_type: Optional[str] = None,
body: Optional[str] = None,
options: Optional[Dict[str, str]] = None,
*,
replace: bool = False,
):
"""Register a UDF (CREATE FUNCTION).
Pass a ``@udf`` / ``@table_udf``-decorated function (preferred):
db.create_function(embed)
or the explicit fields:
Parameters
----------
name: str or Udf
A decorated UDF object, or the function name.
language: str
Implementation language (currently "python").
return_type: str
SQL return type, e.g. "FLOAT", "FLOAT[1536]",
"STRUCT(a FLOAT, b VARCHAR)", "TABLE(chunk VARCHAR, idx INT)".
body: str
Function body: source text, or base64 cloudpickle bytes when
options["body_format"] == "cloudpickle".
options: dict, optional
input_columns, pip, num_gpus, batch_size, timeout,
error_policy, docker_image, body_format, ...
replace: bool
Drop an existing function of the same name first.
"""
from .udf import Udf
if isinstance(name, Udf):
req = name.create_request()
name, language, return_type, body, options = (
req["name"],
req["language"],
req["return_type"],
req["body"],
req["options"],
)
if replace:
try:
self.drop_function(name)
except Exception:
pass
LOOP.run(self._conn.create_function(name, language, return_type, body, options))
def list_functions(self):
"""List registered functions (SHOW FUNCTIONS)."""
return LOOP.run(self._conn.list_functions())
def drop_function(self, name: str):
"""Drop a registered function (DROP FUNCTION)."""
LOOP.run(self._conn.drop_function(name))
def create_materialized_view(
self,
name: str,
source=None,
select=None,
*,
query: Optional[str] = None,
where: Optional[str] = None,
auto_refresh: bool = False,
with_no_data: bool = False,
replace: bool = False,
partition_by: Optional[str] = None,
) -> "MaterializedView":
"""Create a materialized view (CREATE MATERIALIZED VIEW); returns a
`MaterializedView` handle (``.wait()`` blocks until it is populated).
Two ways to specify the view body:
- ergonomic: pass ``source`` (a table name or table) and ``select``
items -- column names, expression strings ("embed(body)"),
(alias, expression) tuples, or ``@udf`` / ``@table_udf`` objects.
The SELECT is assembled and parsed server-side (one parser, shared
with SQL).
- raw: pass ``query=`` with a full SELECT, e.g.
"SELECT id, embed(body) AS vec FROM articles WHERE id > 1".
`partition_by` partitions the view's (single) table function on a source
column. If that column has an IVF vector index the server partitions by
its index clusters (image-dedup style); otherwise it groups by distinct
value. (Geneva's `partition_by` and `partition_by_indexed_column` unify
here -- the engine picks the strategy from the column.)
"""
from .udf import build_view_query, MaterializedView
if query is None:
if source is None or select is None:
raise ValueError(
"create_materialized_view needs either query= or both "
"source and select"
)
query = build_view_query(source, select)
if where:
query += f" WHERE {where}"
if replace:
self._drop_view_if_exists(name)
job_id = LOOP.run(
self._conn.create_materialized_view(
name,
query=query,
auto_refresh=auto_refresh,
with_no_data=with_no_data,
partition_by=partition_by,
)
)
return MaterializedView(self, name, job_id=job_id)
def _drop_view_if_exists(self, name: str) -> None:
# `replace=True` is "drop if present"; only a not-found error is
# benign here. Anything else (perms, server fault) must surface rather
# than be masked by a later create failure.
try:
self.drop_materialized_view(name)
except Exception as e:
msg = str(e).lower()
if "not found" not in msg and "does not exist" not in msg:
raise
def job(self, job_id: str):
"""A `JobHandle` for reconnecting to an inflight job by id -- e.g. an
id you stored, or one returned from the SQL / REST surface. Submit
methods (`refresh_column`, `MaterializedView.refresh`) already return a
handle directly, so you do not need this to wait on a fresh submission."""
from .udf import JobHandle
return JobHandle(self, job_id)
def lineage(
self,
table: str,
column: Optional[str] = None,
*,
direction: Optional[str] = None,
depth: Optional[int] = None,
):
"""Derived-compute lineage of a table/view, or one of its columns:
upstream sources, downstream dependents, and the function version +
location that produced each derived column (with a drift flag). Returns
a `Lineage`. `direction` is "upstream" | "downstream" | "both" (server
default both); `depth` limits column-hops (transitive when omitted)."""
# `self._conn` is the AsyncConnection; drive its async `lineage`
# (which parses the JSON) on the loop, mirroring create_materialized_view.
return LOOP.run(
self._conn.lineage(table, column, direction=direction, depth=depth)
)
def _refresh_materialized_view(
self,
name: str,
*,
full: bool = False,
src_version: Optional[int] = None,
num_workers: Optional[int] = None,
max_workers: Optional[int] = None,
) -> str:
"""Internal: submit a materialized-view refresh, return the job id.
The public surface is ``MaterializedView.refresh()`` (which returns a
`JobHandle`); this stays private so refresh is only reached through the
handle.
``full=True`` forces a full rebuild (recompute and replace every row)
instead of the default incremental refresh.
"""
return LOOP.run(
self._conn._refresh_materialized_view(
name,
full=full,
src_version=src_version,
num_workers=num_workers,
max_workers=max_workers,
)
)
def explain_refresh_materialized_view(
self,
name: str,
*,
full: bool = False,
src_version: Optional[int] = None,
):
"""Plan a refresh without running it (EXPLAIN REFRESH). Returns a
plan with .has_work / .source_version / .last_refreshed_version /
.full_refresh / .rebuild / .units_total. `full=True` plans a full
rebuild (incremental planning needs stable row IDs on the source)."""
return LOOP.run(
self._conn.explain_refresh_materialized_view(
name, full=full, src_version=src_version
)
)
def alter_materialized_view(self, name: str, *, auto_refresh: bool):
"""Update a materialized view's options (ALTER MATERIALIZED VIEW)."""
LOOP.run(self._conn.alter_materialized_view(name, auto_refresh=auto_refresh))
def drop_materialized_view(self, name: str):
"""Drop a materialized view definition (DROP MATERIALIZED VIEW)."""
LOOP.run(self._conn.drop_materialized_view(name))
def list_materialized_views(self):
"""List registered materialized view definitions."""
return LOOP.run(self._conn.list_materialized_views())
def list_jobs(self):
"""List inflight server-side jobs across the database's tables."""
return LOOP.run(self._conn.list_jobs())
def get_job(self, job_id: str, table: "str | None" = None):
"""Look up one server-side job by id (the wait()/status poll path).
Passing ``table`` (the job's table) lets the server answer with an O(1)
single-node read instead of scanning the database's active jobs.
Returns the job's status, or None if it's unknown or no longer active.
"""
return LOOP.run(self._conn.get_job(job_id, table))
def cancel_job(self, job_id: str) -> bool:
"""Cancel an inflight server-side job by id (CANCEL JOB).
Returns True if a matching inflight job was found and flagged for
cancellation, False if none was inflight (already finished or
unknown id) -- cancellation is best-effort.
"""
return LOOP.run(self._conn.cancel_job(job_id))
def job_history(self, job_id: "str | None" = None):
"""Durable history of completed server-side jobs (SHOW JOB HISTORY).
Pass ``job_id`` to narrow to a single job. Unlike :meth:`list_jobs`
(live, inflight) these are the terminal records.
"""
return LOOP.run(self._conn.job_history(job_id))
def errors(self, job_id: "str | None" = None, table: "str | None" = None):
"""Per-row UDF errors recorded by ``error_policy=skip`` (SHOW ERRORS),
optionally filtered by ``job_id`` and/or ``table``.
"""
return LOOP.run(self._conn.errors(job_id, table))
class LanceDBConnection(DBConnection):
"""
@@ -1787,6 +2041,200 @@ class AsyncConnection(object):
)
return AsyncTable(table)
# -- Derived compute: functions, materialized views, jobs -------------
# Server-backed features (LanceDB Enterprise / Cloud); local
# connections raise NotImplementedError for now.
async def create_function(
self,
name,
language: str = "python",
return_type: Optional[str] = None,
body: Optional[str] = None,
options: Optional[Dict[str, str]] = None,
*,
replace: bool = False,
):
"""Register a UDF (CREATE FUNCTION). Accepts a ``@udf``/``@table_udf``
object (preferred) or the explicit (name, language, return_type, body,
options)."""
from .udf import Udf
if isinstance(name, Udf):
req = name.create_request()
name, language, return_type, body, options = (
req["name"],
req["language"],
req["return_type"],
req["body"],
req["options"],
)
if replace:
try:
await self.drop_function(name)
except Exception:
pass
await self._inner.create_function(name, language, return_type, body, options)
async def list_functions(self):
"""List registered functions (SHOW FUNCTIONS)."""
return await self._inner.list_functions()
async def drop_function(self, name: str):
"""Drop a registered function (DROP FUNCTION)."""
await self._inner.drop_function(name)
async def create_materialized_view(
self,
name: str,
source=None,
select=None,
*,
query: Optional[str] = None,
where: Optional[str] = None,
auto_refresh: bool = False,
with_no_data: bool = False,
replace: bool = False,
partition_by: Optional[str] = None,
) -> "AsyncMaterializedView":
"""Create a materialized view; returns an `AsyncMaterializedView`
handle (``.wait()`` blocks until populated). Pass either ``query=`` (a
full SELECT) or ``source`` + ``select`` items; `partition_by`
partitions the view's table function on a source column (index-cluster
if the column is IVF-indexed, else distinct-value). See the sync
method for the select grammar."""
from .udf import build_view_query, AsyncMaterializedView
if query is None:
if source is None or select is None:
raise ValueError(
"create_materialized_view needs either query= or both "
"source and select"
)
query = build_view_query(source, select)
if where:
query += f" WHERE {where}"
if replace:
try:
await self.drop_materialized_view(name)
except Exception as e:
msg = str(e).lower()
if "not found" not in msg and "does not exist" not in msg:
raise
job_id = await self._inner.create_materialized_view(
name,
query,
auto_refresh=auto_refresh,
with_no_data=with_no_data,
partition_by=partition_by,
)
return AsyncMaterializedView(self, name, job_id=job_id)
def job(self, job_id: str):
"""An `AsyncJobHandle` for reconnecting to an inflight job by id (a
stored id, or one from the SQL / REST surface). Submit methods already
return a handle, so this is only needed to re-attach to an existing
job."""
from .udf import AsyncJobHandle
return AsyncJobHandle(self, job_id)
async def lineage(
self,
table: str,
column: Optional[str] = None,
*,
direction: Optional[str] = None,
depth: Optional[int] = None,
):
"""Derived-compute lineage of a table/view (or column). See the sync
`Connection.lineage`. Returns a `Lineage`."""
from .lineage import Lineage
raw = await self._inner.table_lineage(table, column, direction, depth)
return Lineage.from_json(raw)
async def _refresh_materialized_view(
self,
name: str,
*,
full: bool = False,
src_version: Optional[int] = None,
num_workers: Optional[int] = None,
max_workers: Optional[int] = None,
) -> str:
"""Internal: submit a refresh, return the job id. The public surface is
``AsyncMaterializedView.refresh()`` (returns an `AsyncJobHandle`).
``full=True`` forces a full rebuild (recompute and replace every row)
instead of the default incremental refresh.
"""
return await self._inner.refresh_materialized_view(
name,
full=full,
src_version=src_version,
num_workers=num_workers,
max_workers=max_workers,
)
async def explain_refresh_materialized_view(
self,
name: str,
*,
full: bool = False,
src_version: Optional[int] = None,
):
"""Plan a refresh without running it (EXPLAIN REFRESH)."""
return await self._inner.explain_refresh_materialized_view(
name, full=full, src_version=src_version
)
async def alter_materialized_view(self, name: str, *, auto_refresh: bool):
"""Update a materialized view's options."""
await self._inner.alter_materialized_view(name, auto_refresh)
async def drop_materialized_view(self, name: str):
"""Drop a materialized view definition."""
await self._inner.drop_materialized_view(name)
async def list_materialized_views(self):
"""List registered materialized view definitions."""
return await self._inner.list_materialized_views()
async def list_jobs(self):
"""List inflight server-side jobs across the database's tables."""
return await self._inner.list_jobs()
async def get_job(self, job_id: str, table: "str | None" = None):
"""Look up one server-side job by id (the wait()/status poll path).
``table`` (the job's table) enables an O(1) server-side lookup.
Returns the job's status, or None if unknown / no longer active."""
return await self._inner.get_job(job_id, table)
async def cancel_job(self, job_id: str) -> bool:
"""Cancel an inflight server-side job by id (CANCEL JOB).
Returns True if a matching inflight job was found and flagged for
cancellation, False otherwise (best-effort).
"""
return await self._inner.cancel_job(job_id)
async def job_history(self, job_id: "str | None" = None):
"""Durable history of completed server-side jobs (SHOW JOB HISTORY).
Reads each table's durable job-history store. Pass ``job_id`` to narrow
to a single job. Unlike :meth:`list_jobs` (live, inflight) these are the
terminal records, with created/updated/completed timestamps.
"""
return await self._inner.job_history(job_id)
async def errors(self, job_id: "str | None" = None, table: "str | None" = None):
"""Per-row UDF errors recorded by ``error_policy=skip`` (SHOW ERRORS).
Optionally filtered by ``job_id`` and/or ``table``.
"""
return await self._inner.errors(job_id, table)
async def rename_table(
self,
cur_name: str,
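Review note: `get_job` is described above as the wait()/status poll path, and the commit log says `wait()` now stops on a failed job instead of polling until the timeout. A rough sketch of such a loop follows, not the actual `JobHandle` code. The dict-shaped status with `"state"` and `"error"` keys, the `wait_for_job` name, and the local `JobFailedError` class are assumptions. The state names are taken from the commit message.

```python
import asyncio


class JobFailedError(RuntimeError):
    """Local stand-in for the exported error type (illustration only)."""


# States the commit message lists as ending a wait successfully.
TERMINAL_OK = {"finished", "stale", "committed"}


async def wait_for_job(get_job, job_id, *, poll_interval=0.01, timeout=5.0):
    """Poll `get_job(job_id)` until a terminal state or the timeout.

    `get_job` is any async callable returning a dict with a "state" key
    (and optionally "error"); that shape is an assumption of this sketch.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await get_job(job_id)
        state = (status or {}).get("state")
        if state == "failed":
            # Terminal failure: raise the server error instead of polling on.
            raise JobFailedError(status.get("error", "job failed"))
        if state in TERMINAL_OK:
            return status
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        await asyncio.sleep(poll_interval)
```

The failed branch is checked before the timeout so that a job the server already reports as failed raises its real cause rather than a `TimeoutError`.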


@@ -0,0 +1,177 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""Client-side model of derived-compute lineage.
`Connection.lineage()` / `Table.lineage()` / `MaterializedView.lineage()` return
a `Lineage`: the graph of what a column or materialized view derives from
(upstream), what derives from it (downstream), and -- for each derived column --
the function that produced it, the version it was produced with, and whether
that is stale relative to the function the registry now holds.
The server returns this as JSON (the wire contract); these classes deserialize
it. Nothing here talks to the server.
"""
from __future__ import annotations
import json
from dataclasses import dataclass, field
from typing import List, Optional, Union
@dataclass
class FunctionRef:
"""The function that produced a derived column, with version + location."""
name: str
#: Version that produced the data (stamped at compute time), if known.
as_computed_version: Optional[str] = None
#: Version the registry currently holds for this function name.
current_version: Optional[str] = None
#: True when the column was produced by an older function than the registry
#: now holds -- i.e. silently stale; re-refresh to catch up.
stale_vs_current: bool = False
language: Optional[str] = None
docker_image: Optional[str] = None
env_digest: Optional[str] = None
code_uri: Optional[str] = None
@classmethod
def _from(cls, d: dict) -> "FunctionRef":
return cls(
name=d["name"],
as_computed_version=d.get("as_computed_version"),
current_version=d.get("current_version"),
stale_vs_current=d.get("stale_vs_current", False),
language=d.get("language"),
docker_image=d.get("docker_image"),
env_digest=d.get("env_digest"),
code_uri=d.get("code_uri"),
)
@dataclass
class Node:
"""A lineage node: a table, view, column, or function."""
kind: str # "table" | "view" | "column" | "function"
id: str # "table", "table.column", or "fn:name@version"
table: Optional[str] = None
function: Optional[FunctionRef] = None
@classmethod
def _from(cls, d: dict) -> "Node":
fn = d.get("function")
return cls(
kind=d["kind"],
id=d["id"],
table=d.get("table"),
function=FunctionRef._from(fn) if fn else None,
)
@dataclass
class Edge:
"""`downstream` depends on `upstream`, produced by `via` (a function name,
or None for a passthrough)."""
downstream: str
upstream: str
via: Optional[str] = None
@classmethod
def _from(cls, d: dict) -> "Edge":
return cls(downstream=d["downstream"], upstream=d["upstream"], via=d.get("via"))
@dataclass
class Lineage:
"""A derived-compute lineage graph (nodes + labeled edges)."""
target: str
nodes: List[Node] = field(default_factory=list)
edges: List[Edge] = field(default_factory=list)
@classmethod
def from_json(cls, raw: Union[str, bytes, dict]) -> "Lineage":
d = json.loads(raw) if isinstance(raw, (str, bytes)) else raw
return cls(
target=d.get("target", ""),
nodes=[Node._from(n) for n in d.get("nodes", [])],
edges=[Edge._from(e) for e in d.get("edges", [])],
)
def functions(self) -> List[FunctionRef]:
"""The function nodes in the graph."""
return [n.function for n in self.nodes if n.function is not None]
def stale(self) -> List[FunctionRef]:
"""Functions whose as-computed version is behind the current registry
version -- the columns they produced are silently out of date."""
return [f for f in self.functions() if f.stale_vs_current]
def to_dict(self) -> dict:
def prune(d: dict) -> dict:
return {k: v for k, v in d.items() if v is not None}
return {
"target": self.target,
"nodes": [
prune(
{
"kind": n.kind,
"id": n.id,
"table": n.table,
"function": prune(vars(n.function)) if n.function else None,
}
)
for n in self.nodes
],
"edges": [prune(vars(e)) for e in self.edges],
}
def to_graphviz(self) -> str:
"""Graphviz DOT for the lineage DAG: columns/tables as nodes, function
names on edges, drift edges dashed + red."""
stale_names = {f.name for f in self.stale()}
out = [
"digraph lineage {",
" rankdir=LR;",
' node [fontname="monospace"];',
]
for n in self.nodes:
if n.kind == "function":
continue
shape = "ellipse" if n.kind in ("table", "view") else "box"
out.append(f' "{n.id}" [shape={shape}];')
for e in self.edges:
attrs = ""
if e.via:
if e.via in stale_names:
attrs = f' [label="{e.via}" color=red style=dashed]'
else:
attrs = f' [label="{e.via}"]'
out.append(f' "{e.upstream}" -> "{e.downstream}"{attrs};')
out.append("}")
return "\n".join(out)
def _repr_html_(self) -> str:
warn = ""
drift = self.stale()
if drift:
names = ", ".join(sorted({f.name for f in drift}))
warn = (
f'<p style="color:#b00000"><b>stale vs current:</b> {names} '
"(re-refresh to catch up)</p>"
)
rows = "".join(
f"<tr><td><code>{e.downstream}</code></td>"
f"<td>&larr; {e.via or ''}</td>"
f"<td><code>{e.upstream}</code></td></tr>"
for e in self.edges
)
return (
f"<b>lineage: <code>{self.target}</code></b>{warn}"
"<table><tr><th>derived</th><th>via</th><th>from</th></tr>"
f"{rows}</table>"
)
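Review note: the classes above only deserialize the server's JSON, so the stale check and the DOT edge rendering can be illustrated on a plain payload. The sketch below uses only the standard library and mirrors `Lineage.stale()` and the edge loop of `to_graphviz()`. The payload contents (table, column, and function names, version strings) and the two helper names are made up for illustration.

```python
import json

# Hypothetical payload in the shape the lineage classes deserialize.
raw = json.dumps({
    "target": "articles.vec",
    "nodes": [
        {"kind": "column", "id": "articles.body", "table": "articles"},
        {"kind": "column", "id": "articles.vec", "table": "articles",
         "function": {"name": "embed", "as_computed_version": "1",
                      "current_version": "2", "stale_vs_current": True}},
    ],
    "edges": [
        {"downstream": "articles.vec", "upstream": "articles.body", "via": "embed"},
    ],
})


def stale_function_names(doc):
    """Names of functions flagged stale_vs_current, mirroring Lineage.stale()."""
    return sorted(
        {
            n["function"]["name"]
            for n in doc.get("nodes", [])
            if n.get("function") and n["function"].get("stale_vs_current", False)
        }
    )


def dot_edges(doc):
    """One DOT edge line per lineage edge; stale functions are dashed and red."""
    stale = set(stale_function_names(doc))
    lines = []
    for e in doc.get("edges", []):
        via = e.get("via")
        attrs = ""
        if via:
            style = " color=red style=dashed" if via in stale else ""
            attrs = f' [label="{via}"{style}]'
        lines.append(f'"{e["upstream"]}" -> "{e["downstream"]}"{attrs};')
    return lines


doc = json.loads(raw)
```

Note that staleness is keyed by function name, so every edge produced by a stale function is drawn dashed, as in `to_graphviz()`.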


@@ -13,10 +13,14 @@ from typing import (
Iterable,
List,
Optional,
TYPE_CHECKING,
Union,
Literal,
overload,
)
if TYPE_CHECKING:
from ..udf import JobHandle
import warnings
from lancedb import __version__
@@ -884,8 +888,142 @@ class RemoteTable(Table):
def count_rows(self, filter: Optional[str] = None) -> int:
return LOOP.run(self._table.count_rows(filter))
def add_columns(self, transforms: Dict[str, str]) -> AddColumnsResult:
return LOOP.run(self._table.add_columns(transforms))
def add_columns(
self,
transforms: Optional[Dict[str, str]] = None,
*,
computed: Optional[Dict[str, tuple]] = None,
) -> Optional[AddColumnsResult]:
result = None
if transforms is not None:
result = LOOP.run(self._table.add_columns(transforms))
if computed:
LOOP.run(self._table.add_columns(computed=computed))
return result
def refresh_column(
self,
columns,
*,
where: Optional[str] = None,
num_workers: Optional[int] = None,
max_workers: Optional[int] = None,
batch_size: Optional[int] = None,
priority: Optional[str] = None,
) -> "JobHandle":
"""Trigger recompute of computed columns (REFRESH COLUMN).
The expression is resolved server-side from each column's stored
binding; columns bound to the same struct-returning function
refresh together. Returns a `JobHandle` to wait on, poll, or cancel
(``tbl.refresh_column("c").wait()``). Server-backed feature
(LanceDB Enterprise / Cloud).
num_workers / max_workers / batch_size / priority are per-refresh
scheduling knobs (how to run THIS refresh) and override any default
the function carries. `priority` is a Kueue tier
(training | interactive | backfill).
"""
from ..udf import JobHandle
if isinstance(columns, str):
columns = [columns]
job_id = LOOP.run(
self._table.refresh_column(
list(columns),
where=where,
num_workers=num_workers,
max_workers=max_workers,
batch_size=batch_size,
priority=priority,
)
)
return JobHandle(self._job_conn(), job_id)
def lineage(self, column=None, *, direction=None, depth=None):
"""Derived-compute lineage of this table, or one of its columns:
upstream sources, downstream dependents, and the function version +
location that produced each derived column (with a drift flag). Returns
a `Lineage`. See `Connection.lineage`."""
return self._job_conn().lineage(
self._name, column, direction=direction, depth=depth
)
def _job_conn(self):
"""A client connection for polling jobs this table spawns. Built lazily
from the table's serialized connection state and cached (not pickled --
a forked/unpickled table rebuilds it on next use)."""
from lancedb import deserialize_conn
conn = getattr(self, "_job_conn_cache", None)
if conn is None:
conn = deserialize_conn(self._serialized_connection_state())
self._job_conn_cache = conn
return conn
def load_columns(
self,
source: Union[str, Iterable[str]],
pk: str,
columns: Union[Iterable[str], Dict[str, str]],
*,
source_format: str = "parquet",
source_pk: Optional[str] = None,
on_missing: str = "carry",
source_storage_options: Optional[Dict[str, str]] = None,
num_workers: Optional[int] = None,
max_workers: Optional[int] = None,
batch_size: Optional[int] = None,
commit_granularity: Optional[int] = None,
priority: Optional[str] = None,
) -> str:
"""Fill existing columns from an external source by primary-key join.
The distributed-job equivalent of Geneva's ``Table.load_columns()``:
imports precomputed values (e.g. embeddings) from Parquet/Lance/IPC into
this table, matching on a primary key. Returns the load job id.
Server-backed feature (LanceDB Enterprise / Cloud).
Parameters
----------
source: str | list[str]
One source URI or a list of URIs.
pk: str
Destination primary-key column. Also the source key unless
``source_pk`` is given.
columns: list[str] | dict[str, str]
Value columns to load. A list loads same-named columns; a dict maps
``{target: source}``.
source_format: str
``"parquet"`` (default), ``"lance"``, or ``"ipc"``.
source_pk: str, optional
Source primary-key column when it differs from ``pk``.
on_missing: str
Behavior for destination rows with no source match:
``"carry"`` (default, keep existing), ``"null"``, or ``"error"``.
"""
if isinstance(source, str):
source = [source]
if isinstance(columns, dict):
mappings = [(target, src) for target, src in columns.items()]
else:
mappings = [(c, None) for c in columns]
return LOOP.run(
self._table.load_columns(
list(source),
source_format,
pk,
mappings,
source_key=source_pk,
source_storage_options=source_storage_options,
on_missing=on_missing,
num_workers=num_workers,
max_workers=max_workers,
batch_size=batch_size,
commit_granularity=commit_granularity,
priority=priority,
)
)
def alter_columns(
self, *alterations: Iterable[Dict[str, str]]


@@ -702,6 +702,24 @@ def _normalize_progress(progress):
return progress, False
def _computed_groups(computed):
"""Group computed columns by expression, preserving declaration order
(struct-returning functions need their columns adjacent so schema order
matches field order). Accepts the ergonomic forms -- `fn("col")` values
and tuple keys for struct fan-out -- via `_normalize_computed`."""
from .udf import _normalize_computed
groups = []
for name, (sql_type, expression) in _normalize_computed(computed).items():
for expr, cols in groups:
if expr == expression:
cols.append((name, sql_type))
break
else:
groups.append((expression, [(name, sql_type)]))
return groups
class Table(ABC):
"""
A Table is a collection of Records in a LanceDB Database.
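Review note: `_computed_groups` groups columns by expression while keeping declaration order. The sketch below reproduces that loop on the legacy `{col: (sql_type, expression)}` form only; it skips `_normalize_computed`, which is not shown in this diff. The function name, column names, and expressions are illustrative.

```python
def group_by_expression(computed):
    """Group {col: (sql_type, expression)} by expression, in first-seen order."""
    groups = []
    for name, (sql_type, expression) in computed.items():
        for expr, cols in groups:
            if expr == expression:
                cols.append((name, sql_type))
                break
        else:
            # for/else: no existing group matched, so start a new one.
            groups.append((expression, [(name, sql_type)]))
    return groups


computed = {
    "title": ("VARCHAR", "parse(raw)"),
    "vec": ("FLOAT[4]", "embed(body)"),
    "lang": ("VARCHAR", "parse(raw)"),
}
groups = group_by_expression(computed)
```

The linear scan over `groups` is quadratic in the number of distinct expressions, which is unlikely to matter for the handful of columns a single call declares.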
@@ -807,6 +825,59 @@ class Table(ABC):
"""The number of rows in this Table"""
return self.count_rows(None)
def add_computed_column(
self,
columns,
fn,
args: Optional[List[str]] = None,
types=None,
) -> None:
"""Declare computed column(s) bound to a UDF -- no compute happens
here (the agent fills them lazily, or refresh_column() triggers a run).
.. deprecated::
A computed column is an expression over a registered function, so
bind it as one: ``add_columns(computed={"vec": embed("data")})``.
``embed("data")`` applies the function to the `data` column and
infers the type from the function's return signature -- the
function never couples to a particular column. Prefer that form.
"""
import warnings
warnings.warn(
"add_computed_column is deprecated; use add_columns(computed="
'{"vec": embed("data")}).',
DeprecationWarning,
stacklevel=2,
)
from .udf import Udf, struct_field_types
multi = isinstance(columns, (tuple, list))
if isinstance(fn, Udf):
expr = fn.expression(*(args or []))
if types is None:
if multi:
if not fn.returns.upper().startswith("STRUCT"):
raise ValueError(
"several columns need a STRUCT-returning function"
)
types = struct_field_types(fn.returns)
else:
types = fn.returns
else:
if types is None:
raise ValueError("pass types= when fn is a name string")
expr = f"{fn}({', '.join(args or [])})"
if multi:
if len(types) != len(columns):
raise ValueError(
f"{len(columns)} columns but {len(types)} output types"
)
computed = {c: (t, expr) for c, t in zip(columns, types)}
else:
computed = {columns: (types, expr)}
self.add_columns(computed=computed)
@property
@abstractmethod
def embedding_functions(self) -> Dict[str, EmbeddingFunctionConfig]:
@@ -3710,9 +3781,68 @@ class LanceTable(Table):
return LOOP.run(self._table.index_stats(index_name))
def add_columns(
self, transforms: Dict[str, str] | pa.field | List[pa.field] | pa.Schema
) -> AddColumnsResult:
return LOOP.run(self._table.add_columns(transforms))
self,
transforms: Dict[str, str]
| pa.field
| List[pa.field]
| pa.Schema
| None = None,
*,
computed: Optional[Dict] = None,
) -> Optional[AddColumnsResult]:
result = None
if transforms is not None:
result = LOOP.run(self._table.add_columns(transforms))
if computed:
# computed binds an expression over a registered function to a
# column: {col: fn("input_col")} -- fn("input_col") yields the
# expression and carries the inferred type; a tuple key fans a
# STRUCT return out to several columns. Declares the binding only;
# the server fills the values (server-backed). The legacy
# {col: (sql_type, expression)} tuple form is still accepted.
LOOP.run(self._table.add_columns(computed=computed))
return result
def refresh_column(
self,
columns,
*,
where: Optional[str] = None,
num_workers: Optional[int] = None,
max_workers: Optional[int] = None,
batch_size: Optional[int] = None,
priority: Optional[str] = None,
) -> "JobHandle":
"""Trigger recompute of computed columns (REFRESH COLUMN).
The expression is resolved server-side from each column's stored
binding; columns bound to the same struct-returning function
refresh together. Returns a `JobHandle` to wait on, poll, or cancel
(``tbl.refresh_column("col").wait()``) -- mirrors
`MaterializedView.refresh()`. Server-backed feature (LanceDB
Enterprise / Cloud).
num_workers / max_workers / batch_size / priority are per-refresh
scheduling knobs (how to run THIS refresh) and override any default
the function carries. `priority` is a Kueue tier
(training | interactive | backfill).
"""
from .udf import JobHandle
if isinstance(columns, str):
columns = [columns]
job_id = LOOP.run(
self._table.refresh_column(
list(columns),
where=where,
num_workers=num_workers,
max_workers=max_workers,
batch_size=batch_size,
priority=priority,
)
)
return JobHandle(self._conn, job_id, table=self.name)
def alter_columns(
self, *alterations: Iterable[Dict[str, str]]
@@ -5390,9 +5520,44 @@ class AsyncTable:
return await self._inner.update(updates_sql, where)
async def refresh_column(
self,
columns,
*,
where: Optional[str] = None,
num_workers: Optional[int] = None,
max_workers: Optional[int] = None,
batch_size: Optional[int] = None,
priority: Optional[str] = None,
) -> str:
"""Trigger recompute of computed columns (REFRESH COLUMN).
Returns the refresh job id. Server-backed feature.
num_workers / max_workers / batch_size / priority are per-refresh
scheduling knobs (how to run THIS refresh); they override any default
the function carries. `priority` is a Kueue tier
(training | interactive | backfill)."""
if isinstance(columns, str):
columns = [columns]
return await self._inner.refresh_column(
list(columns),
where_clause=where,
num_workers=num_workers,
max_workers=max_workers,
batch_size=batch_size,
priority=priority,
)
async def add_columns(
self, transforms: dict[str, str] | pa.field | List[pa.field] | pa.Schema
) -> AddColumnsResult:
self,
transforms: dict[str, str]
| pa.field
| List[pa.field]
| pa.Schema
| None = None,
*,
computed: Optional[Dict] = None,
) -> Optional[AddColumnsResult]:
"""
Add new columns with defined values.
@@ -5411,6 +5576,7 @@ class AsyncTable:
version: the new version number of the table after adding columns.
"""
result = None
if isinstance(transforms, pa.Field):
transforms = [transforms]
if isinstance(transforms, list) and all(
@@ -5418,9 +5584,69 @@ class AsyncTable:
):
transforms = pa.schema(transforms)
if isinstance(transforms, pa.Schema):
return await self._inner.add_columns_with_schema(transforms)
result = await self._inner.add_columns_with_schema(transforms)
elif transforms is not None:
result = await self._inner.add_columns(list(transforms.items()))
if computed:
# computed binds an expression over a registered function to a
# column: {col: fn("input_col")} -- fn("input_col") yields the
# expression and carries the inferred type; a tuple key fans a
# STRUCT return out to several columns. Declares the binding only;
# the server fills the values (server-backed). The legacy
# {col: (sql_type, expression)} tuple form is still accepted.
for expression, cols in _computed_groups(computed):
await self._inner.add_computed_columns(cols, expression)
return result
async def add_computed_column(
self,
columns,
fn,
args: Optional[List[str]] = None,
types=None,
) -> None:
"""Declare computed column(s) bound to a UDF (async).
.. deprecated::
Use ``add_columns(computed={"col": fn("input_col")})`` -- a computed
column is an expression over a registered function, so bind it that
way instead of coupling the UDF to the column here.
"""
import warnings
warnings.warn(
"add_computed_column is deprecated; use add_columns(computed="
'{"col": fn("input_col")}).',
DeprecationWarning,
stacklevel=2,
)
from .udf import Udf, struct_field_types
multi = isinstance(columns, (tuple, list))
if isinstance(fn, Udf):
expr = fn.expression(*(args or []))
if types is None:
if multi:
if not fn.returns.upper().startswith("STRUCT"):
raise ValueError(
"several columns need a STRUCT-returning function"
)
types = struct_field_types(fn.returns)
else:
types = fn.returns
else:
return await self._inner.add_columns(list(transforms.items()))
if types is None:
raise ValueError("pass types= when fn is a name string")
expr = f"{fn}({', '.join(args or [])})"
if multi:
if len(types) != len(columns):
raise ValueError(
f"{len(columns)} columns but {len(types)} output types"
)
computed = {c: (t, expr) for c, t in zip(columns, types)}
else:
computed = {columns: (types, expr)}
await self.add_columns(computed=computed)
async def alter_columns(
self, *alterations: Iterable[dict[str, Any]]


@@ -0,0 +1,753 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""UDF authoring for LanceDB derived compute (server-backed).
`@udf` / `@table_udf` turn a plain Python function into a registrable
server-side UDF: a cloudpickled (or source) body, a SQL signature inferred
from type hints, and the runtime options (pip deps, GPUs, batching, ...).
Register and use them through the existing connection/table API:
import lancedb
from lancedb import udf, table_udf
db = lancedb.connect("db://my_db", api_key="...", host_override="...")
@udf(pip=["torch>=2.0"], num_gpus=1)
def embed(text: str) -> list[float]:
return model.encode(text).tolist()
db.create_function(embed) # CREATE FUNCTION (once)
tbl = db.open_table("docs")
tbl.add_columns(computed={"vec": embed("text")}) # bind embed(text) -> vec
tbl.refresh_column("vec").wait() # materialize (returns a JobHandle)
view = db.create_materialized_view("chunks", tbl, ["id", chunk_fn])
`embed("text")` applies the registered function to the `text` column and yields
the expression `embed(text)`; the function itself stays decoupled from any
column, so the same `embed` works on any column or table.
These operations are server-backed (LanceDB Enterprise / Cloud); the
decorator itself works locally (define + call), only registration needs a
remote connection.
"""
from __future__ import annotations
import asyncio
import base64
import dataclasses
import functools
import inspect
import re
import sys
import textwrap
import time
import typing
# -- type hints -> SQL type strings -------------------------------------
_SCALARS = {
int: "BIGINT",
# Pragmatic default for ML workloads: python float maps to FLOAT
# (Float32). Use an explicit `returns=` for DOUBLE.
float: "FLOAT",
str: "VARCHAR",
bool: "BOOLEAN",
bytes: "BLOB",
}
class TypeInferenceError(TypeError):
pass
def sql_type(hint) -> str:
"""SQL type string for a python type hint."""
if hint in _SCALARS:
return _SCALARS[hint]
origin = typing.get_origin(hint)
if origin in (list, typing.List):
(item,) = typing.get_args(hint) or (None,)
if item in _SCALARS:
return f"{_SCALARS[item]}[]"
raise TypeInferenceError(
f"unsupported list item type {item!r}; use an explicit returns="
)
fields = _struct_fields(hint)
if fields is not None:
inner = ", ".join(f"{name} {sql_type(h)}" for name, h in fields)
return f"STRUCT({inner})"
raise TypeInferenceError(
f"cannot infer a SQL type for {hint!r}; pass an explicit type string"
)
def _struct_fields(hint):
"""(name, hint) pairs for a TypedDict or dataclass, else None."""
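if dataclasses.is_dataclass(hint):
# Resolve annotations through get_type_hints so string annotations
# (e.g. under `from __future__ import annotations`) become real types.
hints = typing.get_type_hints(hint)
return [(f.name, hints[f.name]) for f in dataclasses.fields(hint)]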
# TypedDict detection: a dict subclass with __annotations__.
if (
isinstance(hint, type)
and issubclass(hint, dict)
and typing.get_type_hints(hint)
):
return list(typing.get_type_hints(hint).items())
return None
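How a hint is decomposed into a SQL type string can be seen in a small standalone sketch. It assumes the same scalar mapping as `_SCALARS` and handles only scalars, `list[scalar]`, and TypedDict rows. The names `SCALARS`, `to_sql`, and `Chunk` are illustrative and not the module's API.

```python
import typing
from typing import TypedDict

# Same scalar mapping as the module's _SCALARS (copied for illustration).
SCALARS = {int: "BIGINT", float: "FLOAT", str: "VARCHAR",
           bool: "BOOLEAN", bytes: "BLOB"}


class Chunk(TypedDict):
    chunk: str
    chunk_idx: int


def to_sql(hint):
    """Map a scalar, list[scalar], or TypedDict hint to a SQL type string."""
    if hint in SCALARS:
        return SCALARS[hint]
    if typing.get_origin(hint) is list:
        (item,) = typing.get_args(hint)
        return f"{SCALARS[item]}[]"
    # At runtime a TypedDict class is a plain dict subclass with annotations.
    if isinstance(hint, type) and issubclass(hint, dict):
        fields = typing.get_type_hints(hint)
        inner = ", ".join(f"{n} {to_sql(h)}" for n, h in fields.items())
        return f"STRUCT({inner})"
    raise TypeError(f"cannot infer a SQL type for {hint!r}")


vector_type = to_sql(list[float])
struct_type = to_sql(Chunk)
```

Field order in the rendered `STRUCT(...)` follows the declaration order of the TypedDict.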
def return_type(fn, override: "str | None", table: bool) -> str:
"""SQL return type for a function: explicit override wins, else the
return annotation. Table functions render as TABLE(...) and accept
struct-shaped hints (TypedDict/dataclass, optionally list-wrapped)."""
if override is not None:
s = override.strip()
if table and not s.upper().startswith("TABLE"):
if s.upper().startswith("STRUCT"):
return "TABLE" + s[len("STRUCT") :]
raise TypeInferenceError(
"a table function's returns= must be TABLE(...) or STRUCT(...)"
)
return s
hints = typing.get_type_hints(fn)
ret = hints.get("return")
if ret is None:
raise TypeInferenceError(
f"function {fn.__name__!r} needs a return annotation or returns="
)
if table:
# Accept list[Row] / Row where Row is a TypedDict or dataclass.
if typing.get_origin(ret) in (list, typing.List):
(ret,) = typing.get_args(ret)
fields = _struct_fields(ret)
if fields is None:
raise TypeInferenceError(
"a table function must return rows shaped as a TypedDict or "
"dataclass (optionally list-wrapped); or pass returns=..."
)
inner = ", ".join(f"{name} {sql_type(h)}" for name, h in fields)
return f"TABLE({inner})"
return sql_type(ret)
def param_types(fn) -> "list[tuple[str, str]]":
"""(name, sql type) per parameter, from annotations. Each UDF
parameter binds to a source column of the same name by default."""
hints = typing.get_type_hints(fn)
out = []
for name, p in inspect.signature(fn).parameters.items():
if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
raise TypeInferenceError("*args/**kwargs are not supported in UDFs")
hint = hints.get(name)
if hint is None:
raise TypeInferenceError(
f"parameter {name!r} of {fn.__name__!r} needs a type annotation"
)
out.append((name, sql_type(hint)))
return out
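The parameter walk can be tried on its own with only the standard library. The sketch below keeps the raw Python types instead of converting them to SQL strings. `annotated_params` and the sample function `embed` are hypothetical names used for illustration.

```python
import inspect
import typing


def annotated_params(fn):
    """(name, python type) per parameter; rejects variadics and
    parameters without an annotation."""
    hints = typing.get_type_hints(fn)
    out = []
    for name, p in inspect.signature(fn).parameters.items():
        if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
            raise TypeError("*args/**kwargs are not supported")
        if name not in hints:
            raise TypeError(f"parameter {name!r} needs a type annotation")
        out.append((name, hints[name]))
    return out


def embed(text: str, dim: int) -> list:
    return [0.0] * dim


def variadic(*parts: str) -> str:
    return "".join(parts)


params = annotated_params(embed)
```

`typing.get_type_hints` also returns a `"return"` entry. Iterating over the signature's parameters keeps it out of the result.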
# -- column expressions -------------------------------------------------
class ColumnExpr(str):
"""A computed-column expression produced by applying a registered
function to column names, e.g. ``embed("data") -> "embed(data)"``.
It IS the expression string everywhere a string is expected (views, SQL,
logging), and additionally carries the function's declared return type so
``add_columns(computed=...)`` can declare the column without a hand-written
type. ``field_types`` holds the per-field SQL types of a STRUCT return, for
fanning one expression out to several columns.
"""
data_type: "str | None"
field_types: "list[str] | None"
def __new__(cls, expr: str, data_type=None, field_types=None):
obj = super().__new__(cls, expr)
obj.data_type = data_type
obj.field_types = field_types
return obj
def _normalize_computed(computed: dict) -> dict:
"""Normalize the user-facing ``computed=`` mapping to the canonical
``{name: (sql_type, expression)}`` form.
Accepts, per entry:
- value is a `ColumnExpr` (from ``fn("col")``): the column's SQL type
comes from the function's return type -- no hand-written type needed. A
tuple key (``("chunk", "idx")``) fans a STRUCT return out to one
(type, expression) entry per field, in declared order.
- value is a legacy ``(sql_type, expression)`` tuple: passed through (the
escape hatch, e.g. bare-name function strings).
"""
out: dict = {}
for key, val in computed.items():
if isinstance(val, ColumnExpr):
expr = str(val)
if isinstance(key, (tuple, list)):
if not val.field_types:
raise ValueError(
f"columns {tuple(key)} need a STRUCT-returning function; "
f"{expr} returns a single value"
)
if len(val.field_types) != len(key):
raise ValueError(
f"{len(key)} columns but {len(val.field_types)} struct fields "
f"in {expr}"
)
for name, t in zip(key, val.field_types):
out[name] = (t, expr)
else:
if val.data_type is None:
raise ValueError(f"cannot infer a type for {expr}; use the (sql_type, expression) tuple form")
out[key] = (val.data_type, expr)
else:
out[key] = val
return out
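The normalization rules can be exercised without a server. The sketch below is a simplified stand-in that omits the validation errors: a `str` subclass carrying type metadata, plus the fan-out of a tuple key. `Expr`, `normalize`, and the sample function names are hypothetical.

```python
class Expr(str):
    """A string that also carries a return type (stand-in for ColumnExpr)."""

    def __new__(cls, text, data_type=None, field_types=None):
        obj = super().__new__(cls, text)
        obj.data_type = data_type
        obj.field_types = field_types
        return obj


def normalize(computed):
    """Reduce the accepted forms to {name: (sql_type, expression)}."""
    out = {}
    for key, val in computed.items():
        if isinstance(val, Expr):
            if isinstance(key, tuple):
                # One entry per struct field, in declared order.
                for name, t in zip(key, val.field_types):
                    out[name] = (t, str(val))
            else:
                out[key] = (val.data_type, str(val))
        else:
            out[key] = val  # legacy (sql_type, expression) tuple
    return out


normalized = normalize({
    "vec": Expr("embed(text)", data_type="FLOAT[]"),
    ("chunk", "idx"): Expr(
        "split(text)",
        data_type="STRUCT(chunk VARCHAR, idx BIGINT)",
        field_types=["VARCHAR", "BIGINT"],
    ),
    "n": ("BIGINT", "length(text)"),
})
```

Because `Expr` subclasses `str`, the same object still works anywhere a plain expression string is expected.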
# -- the @udf / @table_udf decorators -----------------------------------
class Udf:
def __init__(
self,
fn,
*,
returns: "str | None" = None,
table: bool = False,
name: "str | None" = None,
pip: "list[str] | None" = None,
pip_index_url: "str | None" = None,
pip_extra_index_urls: "list[str] | None" = None,
find_links: "list[str] | None" = None,
requirements: "str | list[str] | None" = None,
conda: "list[str] | None" = None,
conda_channels: "list[str] | None" = None,
env: "dict[str, str] | list[str] | None" = None,
num_cpus: "int | None" = None,
num_gpus: "int | None" = None,
batch_size: "int | None" = None,
timeout: "float | None" = None,
error_policy: "str | None" = None,
max_skip_ratio: "float | None" = None,
retries: "int | None" = None,
docker_image: "str | None" = None,
description: "str | None" = None,
prefer_source: bool = False,
):
functools.update_wrapper(self, fn)
self.fn = fn
self.name = name or fn.__name__
self.table = table
self.params = param_types(fn)
self.returns = return_type(fn, returns, table)
self.prefer_source = prefer_source
self.options: "dict[str, str]" = {}
if conda and (pip or requirements):
raise ValueError("pass conda or pip/requirements, not both")
if conda_channels and not conda:
raise ValueError("conda_channels requires conda")
if pip:
self.options["pip"] = ",".join(pip)
if pip_extra_index_urls:
self.options["pip_extra_index_urls"] = ",".join(pip_extra_index_urls)
if find_links:
self.options["find_links"] = ",".join(find_links)
if requirements:
self.options["requirements"] = _format_requirements(requirements)
if conda:
self.options["conda"] = ",".join(conda)
if conda_channels:
self.options["conda_channels"] = ",".join(conda_channels)
if env:
self.options["env"] = _format_env(env)
for key, val in [
("pip_index_url", pip_index_url),
("num_cpus", num_cpus),
("num_gpus", num_gpus),
("batch_size", batch_size),
("timeout", timeout),
("error_policy", error_policy),
("max_skip_ratio", max_skip_ratio),
("retries", retries),
("docker_image", docker_image),
]:
if val is not None:
self.options[key] = str(val)
# Keep the source in the description (when available) so the
# catalog stays inspectable even for pickled bodies.
if description is not None:
self.options["description"] = description
else:
try:
self.options["description"] = textwrap.dedent(inspect.getsource(fn))
except (OSError, TypeError):
pass
def __call__(self, *args, **kwargs):
"""Call with real values to run locally; call with column-name
strings to build an expression for backfills and views, e.g.
``embed("data")`` -> the expression ``embed(data)`` (a `ColumnExpr`
carrying the function's return type for `add_columns(computed=...)`)."""
if args and all(isinstance(a, str) for a in args) and not kwargs:
return self.expression(*args)
return self.fn(*args, **kwargs)
def expression(self, *columns: str) -> ColumnExpr:
"""The expression applying this function to `columns` (default: the
function's own parameter names). Returns a `ColumnExpr` -- a string
that also carries the declared return type (and struct field types)."""
cols = columns or [p for p, _ in self.params]
expr = f"{self.name}({', '.join(cols)})"
field_types = None
if self.returns.upper().startswith("STRUCT"):
field_types = struct_field_types(self.returns)
return ColumnExpr(expr, data_type=self.returns, field_types=field_types)
def _body(self) -> "tuple[str, str]":
"""(body literal, body_format). Source when requested and
retrievable; cloudpickle otherwise (handles closures)."""
if self.prefer_source:
try:
src = textwrap.dedent(inspect.getsource(self.fn))
# Strip the decorator line(s) so the stored body is a
# plain function definition.
lines = src.splitlines(keepends=True)
while lines and lines[0].lstrip().startswith("@"):
lines.pop(0)
return "".join(lines), "source"
except (OSError, TypeError):
pass
import cloudpickle
raw = cloudpickle.dumps(self.fn)
return base64.b64encode(raw).decode("ascii"), "cloudpickle"
def _body_and_options(self) -> "tuple[str, dict[str, str]]":
"""The body literal plus the finalized options (body_format /
python_version / cloudpickle-pip bookkeeping for a non-source
body)."""
body, body_format = self._body()
options = dict(self.options)
if body_format != "source":
options["body_format"] = body_format
# Pickled code objects only load under the same interpreter
# minor version; record ours so the worker can fail with a
# clear message instead of a bytecode error.
options["python_version"] = self.pickle_environment()
# The worker deserializes the body with cloudpickle; make sure
# the job's pip environment provides it. Conda bakes inject
# cloudpickle server-side, so do not create an invalid pip+conda
# declaration here.
if "conda" not in options:
pip = [d for d in options.get("pip", "").split(",") if d]
if not any(d.startswith("cloudpickle") for d in pip):
pip.append("cloudpickle")
options["pip"] = ",".join(pip)
return body, options
def create_request(self) -> dict:
"""Keyword arguments for `connection.create_function`."""
body, options = self._body_and_options()
return {
"name": self.name,
"language": "python",
"return_type": self.returns,
"body": body,
"options": options,
}
def create_statement(self) -> str:
"""The equivalent `CREATE FUNCTION` SQL (for SQL-surface callers)."""
params = ", ".join(f"{n} {t}" for n, t in self.params)
body, options = self._body_and_options()
with_clause = ""
if options:
rendered = ", ".join(
f"{k} = '{_escape(v)}'" for k, v in sorted(options.items())
)
with_clause = f" WITH ({rendered})"
return (
f"CREATE FUNCTION {self.name}({params}) RETURNS {self.returns} "
f"LANGUAGE python AS '{_escape_body(body)}'{with_clause}"
)
def pickle_environment(self) -> str:
"""Python version the body pickles under -- workers should match
the minor version for cloudpickle compatibility."""
return f"{sys.version_info.major}.{sys.version_info.minor}"
def _escape(s: str) -> str:
return str(s).replace("'", "''")
def _format_requirements(requirements: "str | list[str]") -> str:
if isinstance(requirements, str):
return requirements
return "\n".join(str(req) for req in requirements)
def _format_env(env: "dict[str, str] | list[str]") -> str:
if isinstance(env, dict):
return "; ".join(f"{key}={value}" for key, value in env.items())
return "; ".join(str(entry) for entry in env)
def _escape_body(body: str) -> str:
# The server unescapes \n / \t in single-quoted bodies; encode real
# newlines accordingly and escape quotes.
return (
body.replace("\\", "\\\\")
.replace("'", "''")
.replace("\n", "\\n")
.replace("\t", "\\t")
)
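The order of the replacements in `_escape_body` matters. Backslashes are doubled first, so the backslashes introduced for newlines and tabs are not doubled again. A small standalone sketch with the same replacement chain follows; `escape_body` is a local copy for illustration, and how the server decodes the result is not shown here.

```python
def escape_body(body):
    """Same replacement chain as _escape_body, copied for illustration."""
    return (
        body.replace("\\", "\\\\")
        .replace("'", "''")
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )


source = "def f(s):\n\treturn s + 'x'\n"
escaped = escape_body(source)

# A real newline becomes backslash + n (2 characters), while a literal
# backslash + n in the source becomes backslash + backslash + n
# (3 characters), so the two stay distinguishable.
real_newline = escape_body("\n")
literal_backslash_n = escape_body("\\n")
```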
def udf(fn=None, **kwargs):
"""Decorate a function as a scalar (or struct-returning) UDF.
@udf
def doubled(val: int) -> float: ...
@udf(pip=["torch>=2"], num_gpus=1)
def embed(body: str) -> list[float]: ...
"""
if fn is not None:
return Udf(fn, **kwargs)
return lambda f: Udf(f, **kwargs)
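Both `@udf` and `@udf(...)` work because the decorator checks whether it received the function directly. A generic sketch of that optional-argument decorator pattern follows. `tagged` is a hypothetical decorator that only records its options and stands in for constructing a `Udf`.

```python
def tagged(fn=None, **options):
    """Usable bare (@tagged) or with keyword options (@tagged(k=v))."""

    def wrap(f):
        f.options = dict(options)
        return f

    if fn is not None:
        # Bare form: Python passed the decorated function directly.
        return wrap(fn)
    # Called form: return the real decorator to apply next.
    return wrap


@tagged
def doubled(val):
    return val * 2


@tagged(num_gpus=1)
def embed(body):
    return [float(len(body))]
```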
def table_udf(fn=None, **kwargs):
"""Decorate a table function (UDTF): each input row may emit zero or
more output rows. Only usable in materialized views.
class Chunk(TypedDict):
chunk: str
chunk_idx: int
@table_udf
def chunker(body: str) -> list[Chunk]: ...
"""
kwargs["table"] = True
if fn is not None:
return Udf(fn, **kwargs)
return lambda f: Udf(f, **kwargs)
# -- view / job handles (thin references over a connection) -------------
def struct_field_types(returns: str) -> "list[str]":
"""Field type strings of a STRUCT(...) SQL type, in declared order."""
inner = returns.strip()[len("STRUCT(") : -1]
fields, depth, start = [], 0, 0
for i, c in enumerate(inner):
if c in "([":
depth += 1
elif c in ")]":
depth -= 1
elif c == "," and depth == 0:
fields.append(inner[start:i].strip())
start = i + 1
fields.append(inner[start:].strip())
# Each field is "name TYPE"; drop the name.
return [f.split(None, 1)[1] for f in fields]
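A worked example shows why the splitter tracks nesting depth: commas inside a nested `STRUCT(...)` must not split the outer field list. The sketch restates the same algorithm under the hypothetical name `split_struct_fields` and runs it on a made-up type string.

```python
def split_struct_fields(returns):
    """Field types of a STRUCT(...) string, splitting only on
    top-level commas (depth 0)."""
    inner = returns.strip()[len("STRUCT("):-1]
    fields, depth, start = [], 0, 0
    for i, c in enumerate(inner):
        if c in "([":
            depth += 1
        elif c in ")]":
            depth -= 1
        elif c == "," and depth == 0:
            fields.append(inner[start:i].strip())
            start = i + 1
    fields.append(inner[start:].strip())
    # Each field is "name TYPE"; keep only the type.
    return [f.split(None, 1)[1] for f in fields]


types = split_struct_fields(
    "STRUCT(id BIGINT, box STRUCT(x FLOAT, y FLOAT), tags VARCHAR[])"
)
```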
def build_view_query(source, select) -> str:
"""Assemble a view SELECT from a source (name or table) and select
items: a column name, an expression string, a (alias, expression)
tuple, or a @udf/@table_udf object."""
src = source.name if hasattr(source, "name") else source
items = []
for item in select:
if isinstance(item, Udf):
items.append(item.expression())
elif isinstance(item, tuple):
alias, expr = item
expr = expr.expression() if isinstance(expr, Udf) else expr
items.append(f"{expr} AS {alias}")
else:
items.append(item)
return f"SELECT {', '.join(items)} FROM {src}"
def _job_id_matches(handle_id: str, listed_id: str) -> bool:
# The refresh/backfill endpoints return the submission id (a uuid), but
# the agent names the manifest job "<table>-<type>-<first 8 of the
# submission id>" -- which is what list_jobs and cancel report. Match the
# canonical id directly, or by that submission prefix.
if listed_id == handle_id:
return True
prefix = handle_id[:8]
return len(prefix) >= 4 and prefix in listed_id
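The matching rule can be checked with a few ids. The sketch copies the logic under the name `job_id_matches`. The uuid and the manifest names are invented, and assume the `<table>-<type>-<first 8 of the submission id>` naming described in the comment.

```python
def job_id_matches(handle_id, listed_id):
    """True on an exact match, or when the first 8 characters of the
    submission id appear in the listed (manifest) id."""
    if listed_id == handle_id:
        return True
    prefix = handle_id[:8]
    return len(prefix) >= 4 and prefix in listed_id


submission = "3f2a9c1e-7b44-4d0a-9e1f-0c5d6a7b8c9d"  # hypothetical uuid
manifest = "docs-refresh-3f2a9c1e"                   # hypothetical manifest id

exact = job_id_matches(submission, submission)
by_prefix = job_id_matches(submission, manifest)
other_job = job_id_matches(submission, "docs-refresh-deadbeef")
too_short = job_id_matches("ab", "docs-refresh-ab")
```

The length guard keeps very short handle ids from matching unrelated jobs by accident.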
class MaterializedView:
"""A reference to a materialized view (name + connection). Operations are
server-backed connection calls bound to the name.
``create_materialized_view`` returns one of these; ``job_id`` is the
initial-population job (None when the view was created with no data), so
``db.create_materialized_view(...).wait()`` blocks until it is populated.
"""
def __init__(self, conn, name: str, job_id: "str | None" = None):
self.conn = conn
self.name = name
#: initial-population job id from create, or None (with_no_data).
self.job_id = job_id
def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
"""Block until the initial-population job (from create) finishes.
A no-op when the view was created with no data."""
if self.job_id is None:
return "finished"
return JobHandle(self.conn, self.job_id, table=self.name).wait(
timeout=timeout, poll=poll
)
def refresh(self, full: bool = False) -> "JobHandle":
"""Refresh the materialized view; returns a `JobHandle` to wait on,
poll, or cancel (``view.refresh().wait()``).
``full=True`` forces a full rebuild (recompute and replace every row)
instead of the default incremental refresh. A full rebuild preserves
the view's indexes -- they are reindexed by the distributed indexer.
"""
job_id = self.conn._refresh_materialized_view(self.name, full=full)
return JobHandle(self.conn, job_id, table=self.name)
def explain_refresh(self, full: bool = False):
"""Plan a refresh without running it (EXPLAIN REFRESH)."""
return self.conn.explain_refresh_materialized_view(self.name, full=full)
def alter(self, auto_refresh: bool) -> None:
self.conn.alter_materialized_view(self.name, auto_refresh=auto_refresh)
def drop(self) -> None:
self.conn.drop_materialized_view(self.name)
# A materialized view is a first-class table: it can be indexed and
# searched like any other. These open the materialized dataset by name and
# delegate. Indexes declared this way are recorded against the view, so the
# engine re-applies them after a full refresh rebuilds the dataset (a full
# refresh overwrites the dataset, which would otherwise drop its indices).
def _table(self):
return self.conn.open_table(self.name)
def create_index(self, *args, **kwargs):
"""Build an index on the materialized view (see Table.create_index)."""
return self._table().create_index(*args, **kwargs)
def create_scalar_index(self, *args, **kwargs):
"""Build a scalar index on the materialized view."""
return self._table().create_scalar_index(*args, **kwargs)
def create_fts_index(self, *args, **kwargs):
"""Build a full-text-search index on the materialized view."""
return self._table().create_fts_index(*args, **kwargs)
def search(self, *args, **kwargs):
"""Search the materialized view (vector / FTS / hybrid)."""
return self._table().search(*args, **kwargs)
def lineage(self, column=None, *, direction=None, depth=None):
"""Lineage of the materialized view (or one of its columns). Delegates
to the backing table; the server already includes the view's sources
and downstream dependents. Returns a `Lineage`."""
return self._table().lineage(column, direction=direction, depth=depth)
_PROGRESS = re.compile(r"(\d+)/(\d+)")
class JobFailedError(RuntimeError):
"""Raised by ``JobHandle.wait()`` when the server reports the job ``failed``.
Carries the server-side error so a doomed backfill (e.g. a multi-column
``REFRESH COLUMN`` of a scalar UDF) surfaces its real cause promptly,
instead of the caller blocking until ``wait()``'s timeout.
"""
def __init__(self, job_id: str, error: "str | None"):
self.job_id = job_id
self.error = error
super().__init__(f"job {job_id} failed: {error or 'unknown error'}")
class JobHandle:
"""A reference to an inflight server-side job, with polling helpers."""
#: How long an unseen job is treated as still materializing (submission
#: -> agent cycle -> manifest write is async).
GRACE_SECONDS = 20.0
def __init__(self, conn, job_id: str, table: "str | None" = None):
self.conn = conn
self.id = job_id
#: The job's table, when known (refresh_column / MV refresh). Lets the
#: server resolve this job with an O(1) single-node read; without it the
#: lookup scans the database's active jobs (still correct).
self.table = table
self._created = time.monotonic()
self._seen = False
def _job(self):
# Poll by id (one job), not list_jobs (every active job): the server
# matches the submission/manifest id and reads just this table's node.
return self.conn.get_job(self.id, self.table)
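def status(self) -> str:
"""The server-reported state (pending / running / cancelling / stale /
failed), or 'finished' once the job has left the inflight listing. A job
not yet seen is reported as 'pending' for the first GRACE_SECONDS."""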
job = self._job()
if job is not None:
self._seen = True
return job.state
if not self._seen and time.monotonic() - self._created < self.GRACE_SECONDS:
return "pending"
return "finished"
def progress(self) -> "tuple[int, int] | None":
"""(units_done, units_total) while running, else None."""
job = self._job()
if job is not None and job.units_total is not None:
return job.units_done or 0, job.units_total
return None
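def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
"""Block until the job reaches a terminal state and return it
("finished" or "stale"). Raises `JobFailedError` if the server reports
the job failed, or `TimeoutError` after `timeout` seconds; `poll` is the
delay in seconds between status checks."""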
deadline = time.monotonic() + timeout
while time.monotonic() < deadline:
state = self.status()
if state in ("finished", "stale"):
return state
if state == "failed":
# Terminal failure -- surface the server error now, don't block
# until `timeout`. `finalize` wrote it to the job's status node.
job = self._job()
raise JobFailedError(self.id, job.error if job is not None else None)
if state == "pending":
time.sleep(min(poll, 0.5))
continue
job = self._job()
if job is not None and job.committed:
return "finished"
time.sleep(poll)
raise TimeoutError(f"job {self.id} still {self.status()} after {timeout}s")
def cancel(self) -> None:
# Cancel by the canonical manifest id (what cancel matches), found
# via the submission prefix; fall back to the raw id.
job = self._job()
self.conn.cancel_job(job.job_id if job is not None else self.id)
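The terminal-state handling in `wait()` reduces to a small polling loop. The sketch below is a simplified, standalone version driven by a scripted state source instead of a connection. It leaves out the grace period and the `committed` check. `JobFailed`, `wait_for`, and `scripted` are hypothetical names.

```python
import time


class JobFailed(RuntimeError):
    """Stand-in for JobFailedError."""


def wait_for(get_state, timeout=5.0, poll=0.01):
    """Poll get_state() -> (state, error) until a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state, error = get_state()
        if state in ("finished", "stale"):
            return state
        if state == "failed":
            # Terminal failure: raise now rather than waiting for timeout.
            raise JobFailed(error or "unknown error")
        time.sleep(poll)
    raise TimeoutError(f"job not terminal after {timeout}s")


def scripted(states):
    """Return a get_state() that replays `states`, holding the last one."""
    it = iter(states)
    last = [None]

    def get_state():
        last[0] = next(it, last[0])
        return last[0]

    return get_state


ok_result = wait_for(scripted([("running", None), ("finished", None)]))
try:
    wait_for(scripted([("pending", None), ("failed", "boom")]))
    failure = None
except JobFailed as exc:
    failure = str(exc)
```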
class AsyncMaterializedView:
"""Async reference to a materialized view (name + async connection)."""
def __init__(self, conn, name: str, job_id: "str | None" = None):
self.conn = conn
self.name = name
#: initial-population job id from create, or None (with_no_data).
self.job_id = job_id
async def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
"""Block until the initial-population job (from create) finishes.
A no-op when the view was created with no data."""
if self.job_id is None:
return "finished"
return await AsyncJobHandle(self.conn, self.job_id, table=self.name).wait(
timeout=timeout, poll=poll
)
async def refresh(self, full: bool = False) -> "AsyncJobHandle":
"""Refresh the materialized view; returns an `AsyncJobHandle` to wait
on, poll, or cancel.
``full=True`` forces a full rebuild instead of an incremental refresh
(indexes are preserved and reindexed by the distributed indexer).
"""
job_id = await self.conn._refresh_materialized_view(self.name, full=full)
return AsyncJobHandle(self.conn, job_id, table=self.name)
async def explain_refresh(self, full: bool = False):
return await self.conn.explain_refresh_materialized_view(self.name, full=full)
async def alter(self, auto_refresh: bool) -> None:
await self.conn.alter_materialized_view(self.name, auto_refresh=auto_refresh)
async def drop(self) -> None:
await self.conn.drop_materialized_view(self.name)
async def lineage(self, column=None, *, direction=None, depth=None):
"""Lineage of the materialized view (or column). Returns a `Lineage`."""
return await self.conn.lineage(
self.name, column, direction=direction, depth=depth
)
class AsyncJobHandle:
"""Async reference to an inflight server-side job, with polling helpers."""
GRACE_SECONDS = 20.0
def __init__(self, conn, job_id: str, table: "str | None" = None):
self.conn = conn
self.id = job_id
#: See JobHandle.table -- enables an O(1) by-id lookup when known.
self.table = table
self._created = time.monotonic()
self._seen = False
async def _job(self):
# Poll by id, not list_jobs (see JobHandle._job).
return await self.conn.get_job(self.id, self.table)
async def status(self) -> str:
job = await self._job()
if job is not None:
self._seen = True
return job.state
if not self._seen and time.monotonic() - self._created < self.GRACE_SECONDS:
return "pending"
return "finished"
async def progress(self) -> "tuple[int, int] | None":
job = await self._job()
if job is not None and job.units_total is not None:
return job.units_done or 0, job.units_total
return None
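async def wait(self, timeout: float = 3600.0, poll: float = 2.0) -> str:
"""Async counterpart of `JobHandle.wait`: return "finished" or "stale",
raise `JobFailedError` if the server reports the job failed, or
`TimeoutError` after `timeout` seconds."""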
deadline = time.monotonic() + timeout
while time.monotonic() < deadline:
state = await self.status()
if state in ("finished", "stale"):
return state
if state == "failed":
# Terminal failure -- surface the server error now, don't block
# until `timeout`. `finalize` wrote it to the job's status node.
job = await self._job()
raise JobFailedError(self.id, job.error if job is not None else None)
if state == "pending":
await asyncio.sleep(min(poll, 0.5))
continue
job = await self._job()
if job is not None and job.committed:
return "finished"
await asyncio.sleep(poll)
raise TimeoutError(
f"job {self.id} still {await self.status()} after {timeout}s"
)
async def cancel(self) -> None:
job = await self._job()
await self.conn.cancel_job(job.job_id if job is not None else self.id)


@@ -0,0 +1,92 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright The LanceDB Authors
"""JobHandle.wait() terminal-state handling.
Regression coverage for the cluster backfill-failure hang: the server reports a
doomed job as ``state="failed"`` within seconds, but ``wait()`` used to ignore
``failed`` and block until its (default 3600s) timeout. These tests pin that a
``failed`` job raises ``JobFailedError`` promptly, carrying the server error.
"""
import asyncio
import time
import pytest
from lancedb.udf import JobHandle, AsyncJobHandle, JobFailedError
class FakeJobInfo:
"""Mirror of the pyo3 builtins.JobInfo fields wait()/status() read."""
def __init__(self, state, error=None, committed=False, units_total=None):
self.state = state
self.error = error
self.committed = committed
self.units_total = units_total
self.units_done = None
self.job_id = "job-1"
class FakeConn:
"""get_job() walks a scripted list of JobInfo (or None) snapshots, holding
the last one once exhausted, so wait() polls a deterministic timeline."""
def __init__(self, snapshots):
self._snaps = list(snapshots)
self.calls = 0
def get_job(self, job_id, table=None):
snap = self._snaps[min(self.calls, len(self._snaps) - 1)]
self.calls += 1
return snap
class AsyncFakeConn(FakeConn):
async def get_job(self, job_id, table=None):
return FakeConn.get_job(self, job_id, table)
def test_wait_raises_on_failed_promptly():
# pending -> failed: wait() must raise the server error, not TimeoutError.
conn = FakeConn(
[None, FakeJobInfo("failed", error="multi-column backfill needs a STRUCT")]
)
jh = JobHandle(conn, "job-1", table="t")
t0 = time.monotonic()
with pytest.raises(JobFailedError) as exc:
jh.wait(timeout=30, poll=0.01)
assert time.monotonic() - t0 < 5 # prompt, nowhere near the 30s timeout
assert "STRUCT" in str(exc.value)
assert exc.value.error == "multi-column backfill needs a STRUCT"
assert exc.value.job_id == "job-1"
def test_wait_returns_finished_on_success():
# running -> finished (job left the inflight listing) returns normally.
conn = FakeConn([FakeJobInfo("running", units_total=2), None])
jh = JobHandle(conn, "job-1", table="t")
jh._seen = True # already observed, so a None now means "finished" not grace
assert jh.wait(timeout=30, poll=0.01) == "finished"
def test_wait_returns_finished_on_committed():
# A committed job that is still listed resolves to finished.
conn = FakeConn([FakeJobInfo("running", committed=True, units_total=2)])
jh = JobHandle(conn, "job-1", table="t")
jh._seen = True
assert jh.wait(timeout=30, poll=0.01) == "finished"
def test_async_wait_raises_on_failed_promptly():
conn = AsyncFakeConn([None, FakeJobInfo("failed", error="boom")])
jh = AsyncJobHandle(conn, "job-1", table="t")
async def run():
t0 = time.monotonic()
with pytest.raises(JobFailedError) as exc:
await jh.wait(timeout=30, poll=0.01)
assert time.monotonic() - t0 < 5
assert exc.value.error == "boom"
asyncio.run(run())

@@ -18,7 +18,10 @@ use lancedb::{
connection::Connection as LanceConnection,
connection::NamespaceClientPushdownOperation,
database::namespace::LanceNamespaceDatabase,
database::{CreateTableMode, Database, ReadConsistency},
database::{
CreateFunctionRequest, CreateMaterializedViewRequest, CreateTableMode, Database,
ReadConsistency, RefreshMaterializedViewRequest, TableLineageRequest,
},
};
use pyo3::{
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
@@ -27,6 +30,92 @@ use pyo3::{
types::{PyDict, PyDictMethods},
};
/// A registered function, as returned by `list_functions`.
#[pyclass(get_all)]
#[derive(Clone)]
pub struct FunctionInfo {
pub name: String,
pub language: String,
pub return_type: String,
pub description: String,
}
/// A registered materialized view definition.
#[pyclass(get_all)]
#[derive(Clone)]
pub struct MaterializedViewInfo {
pub name: String,
pub source_table: String,
pub projection: Vec<String>,
pub udf_columns: Vec<String>,
pub filter: Option<String>,
pub auto_refresh: bool,
}
/// One inflight server-side job.
#[pyclass(get_all)]
#[derive(Clone)]
pub struct JobInfo {
pub table: String,
pub job_id: String,
pub job_type: String,
pub state: String,
pub column: Option<String>,
pub age_seconds: Option<i64>,
pub command: Option<String>,
pub units_done: Option<i64>,
pub units_total: Option<i64>,
pub committed: bool,
pub rows_skipped: u64,
pub error: Option<String>,
}
/// One durable, completed/terminal server-side job record (SHOW JOB HISTORY).
#[pyclass(get_all)]
#[derive(Clone)]
pub struct JobHistoryEntry {
pub table: String,
pub job_id: String,
pub job_type: String,
pub state: String,
pub column: Option<String>,
pub created_ms: i64,
pub updated_ms: i64,
pub completed_ms: Option<i64>,
pub rows_processed: Option<i64>,
pub rows_skipped: Option<i64>,
pub error: Option<String>,
pub events: Option<String>,
}
/// One per-row UDF error recorded by `error_policy=skip` (SHOW ERRORS).
#[pyclass(get_all)]
#[derive(Clone)]
pub struct JobErrorEntry {
pub job_id: String,
pub table: String,
pub column: String,
pub error_type: String,
pub error_message: String,
pub fragment_id: Option<i64>,
pub source_row_id: Option<i64>,
pub table_version: Option<i64>,
pub age_seconds: Option<i64>,
}
/// The plan a REFRESH MATERIALIZED VIEW would execute (EXPLAIN REFRESH).
#[pyclass(get_all)]
#[derive(Clone)]
pub struct MvRefreshPlan {
pub table_name: String,
pub has_work: bool,
pub source_version: u64,
pub last_refreshed_version: Option<u64>,
pub full_refresh: bool,
pub rebuild: bool,
pub units_total: u64,
}
#[pyclass]
pub struct Connection {
inner: Option<LanceConnection>,
@@ -310,6 +399,308 @@ impl Connection {
})
}
#[pyo3(signature = (name, language, return_type, body, options=None))]
pub fn create_function(
self_: PyRef<'_, Self>,
name: String,
language: String,
return_type: String,
body: String,
options: Option<HashMap<String, String>>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner
.create_function(CreateFunctionRequest {
name,
language,
return_type,
body,
options: options.unwrap_or_default(),
})
.await
.infer_error()
})
}
pub fn list_functions(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let functions = inner.list_functions().await.infer_error()?;
Ok(functions
.into_iter()
.map(|f| FunctionInfo {
name: f.name,
language: f.language,
return_type: f.return_type,
description: f.description,
})
.collect::<Vec<_>>())
})
}
pub fn drop_function(self_: PyRef<'_, Self>, name: String) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner.drop_function(&name).await.infer_error()
})
}
#[pyo3(signature = (name, query, auto_refresh=false, with_no_data=false, partition_by=None))]
pub fn create_materialized_view(
self_: PyRef<'_, Self>,
name: String,
query: String,
auto_refresh: bool,
with_no_data: bool,
partition_by: Option<String>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner
.create_materialized_view(CreateMaterializedViewRequest {
name,
query,
auto_refresh,
with_no_data,
partition_by,
})
.await
.infer_error()
})
}
#[pyo3(signature = (name, full=false, src_version=None, num_workers=None, max_workers=None))]
pub fn refresh_materialized_view(
self_: PyRef<'_, Self>,
name: String,
full: bool,
src_version: Option<u64>,
num_workers: Option<u32>,
max_workers: Option<u32>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner
.refresh_materialized_view(RefreshMaterializedViewRequest {
name,
full,
src_version,
num_workers,
max_workers,
})
.await
.infer_error()
})
}
/// Derived-compute lineage of a table/view (or column), returned as the
/// server's lineage JSON string (the Python layer parses it).
pub fn table_lineage(
self_: PyRef<'_, Self>,
name: String,
column: Option<String>,
direction: Option<String>,
depth: Option<u32>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner
.table_lineage(TableLineageRequest {
name,
column,
direction,
depth,
})
.await
.infer_error()
})
}
#[pyo3(signature = (name, full=false, src_version=None))]
pub fn explain_refresh_materialized_view(
self_: PyRef<'_, Self>,
name: String,
full: bool,
src_version: Option<u64>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let p = inner
.explain_refresh_materialized_view(&name, full, src_version)
.await
.infer_error()?;
Ok(MvRefreshPlan {
table_name: p.table_name,
has_work: p.has_work,
source_version: p.source_version,
last_refreshed_version: p.last_refreshed_version,
full_refresh: p.full_refresh,
rebuild: p.rebuild,
units_total: p.units_total,
})
})
}
pub fn alter_materialized_view(
self_: PyRef<'_, Self>,
name: String,
auto_refresh: bool,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner
.alter_materialized_view(&name, auto_refresh)
.await
.infer_error()
})
}
pub fn drop_materialized_view(
self_: PyRef<'_, Self>,
name: String,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner.drop_materialized_view(&name).await.infer_error()
})
}
pub fn list_materialized_views(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let views = inner.list_materialized_views().await.infer_error()?;
Ok(views
.into_iter()
.map(|v| MaterializedViewInfo {
name: v.name,
source_table: v.source_table,
projection: v.projection,
udf_columns: v.udf_columns,
filter: v.filter,
auto_refresh: v.auto_refresh,
})
.collect::<Vec<_>>())
})
}
pub fn list_jobs(self_: PyRef<'_, Self>) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let jobs = inner.list_jobs().await.infer_error()?;
Ok(jobs
.into_iter()
.map(|j| JobInfo {
table: j.table,
job_id: j.job_id,
job_type: j.job_type,
state: j.state,
column: j.column,
age_seconds: j.age_seconds,
command: j.command,
units_done: j.units_done,
units_total: j.units_total,
committed: j.committed,
rows_skipped: j.rows_skipped,
error: j.error,
})
.collect::<Vec<_>>())
})
}
pub fn cancel_job(self_: PyRef<'_, Self>, job_id: String) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
inner.cancel_job(&job_id).await.infer_error()
})
}
#[pyo3(signature = (job_id, table=None))]
pub fn get_job(
self_: PyRef<'_, Self>,
job_id: String,
table: Option<String>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let job = inner
.get_job(&job_id, table.as_deref())
.await
.infer_error()?;
Ok(job.map(|j| JobInfo {
table: j.table,
job_id: j.job_id,
job_type: j.job_type,
state: j.state,
column: j.column,
age_seconds: j.age_seconds,
command: j.command,
units_done: j.units_done,
units_total: j.units_total,
committed: j.committed,
rows_skipped: j.rows_skipped,
error: j.error,
}))
})
}
#[pyo3(signature = (job_id=None))]
pub fn job_history(
self_: PyRef<'_, Self>,
job_id: Option<String>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let rows = inner.job_history(job_id.as_deref()).await.infer_error()?;
Ok(rows
.into_iter()
.map(|r| JobHistoryEntry {
table: r.table,
job_id: r.job_id,
job_type: r.job_type,
state: r.state,
column: r.column,
created_ms: r.created_ms,
updated_ms: r.updated_ms,
completed_ms: r.completed_ms,
rows_processed: r.rows_processed,
rows_skipped: r.rows_skipped,
error: r.error,
events: r.events,
})
.collect::<Vec<_>>())
})
}
#[pyo3(signature = (job_id=None, table=None))]
pub fn errors(
self_: PyRef<'_, Self>,
job_id: Option<String>,
table: Option<String>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.get_inner()?.clone();
future_into_py(self_.py(), async move {
let rows = inner
.errors(job_id.as_deref(), table.as_deref())
.await
.infer_error()?;
Ok(rows
.into_iter()
.map(|e| JobErrorEntry {
job_id: e.job_id,
table: e.table,
column: e.column,
error_type: e.error_type,
error_message: e.error_message,
fragment_id: e.fragment_id,
source_row_id: e.source_row_id,
table_version: e.table_version,
age_seconds: e.age_seconds,
})
.collect::<Vec<_>>())
})
}
#[pyo3(signature = (cur_name, new_name, cur_namespace_path=None, new_namespace_path=None))]
pub fn rename_table(
self_: PyRef<'_, Self>,
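Once `errors()` rows reach Python, a common first step is to group them to see which column and error type dominate. The sketch below assumes only the field names of `JobErrorEntry` shown above; `ErrorRow` and `summarize_errors` are hypothetical helpers, and `ErrorRow` is a local stand-in so the example runs without a server.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ErrorRow:
    """Hypothetical stand-in with the same field names as JobErrorEntry."""

    job_id: str
    table: str
    column: str
    error_type: str
    error_message: str
    fragment_id: Optional[int] = None
    source_row_id: Optional[int] = None
    table_version: Optional[int] = None
    age_seconds: Optional[int] = None


def summarize_errors(rows: Iterable[ErrorRow]) -> Counter:
    """Count recorded per-row failures per (table, column, error_type)."""
    return Counter((r.table, r.column, r.error_type) for r in rows)


if __name__ == "__main__":
    rows = [
        ErrorRow("job-1", "articles", "vec", "ValueError", "empty body"),
        ErrorRow("job-1", "articles", "vec", "ValueError", "body too long"),
        ErrorRow("job-1", "articles", "vec", "TypeError", "expected str"),
    ]
    for key, count in summarize_errors(rows).most_common():
        print(key, count)
```

Because the helper reads attributes only, it should work unchanged on any objects that expose `table`, `column` and `error_type`.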

@@ -41,6 +41,11 @@ pub fn _lancedb(_py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> {
.write_style("LANCEDB_LOG_STYLE");
env_logger::init_from_env(env);
m.add_class::<Connection>()?;
m.add_class::<connection::FunctionInfo>()?;
m.add_class::<connection::MaterializedViewInfo>()?;
m.add_class::<connection::JobInfo>()?;
m.add_class::<connection::JobHistoryEntry>()?;
m.add_class::<connection::JobErrorEntry>()?;
m.add_class::<Session>()?;
m.add_class::<Table>()?;
m.add_class::<IndexConfig>()?;

@@ -17,8 +17,8 @@ use arrow::{
pyarrow::{FromPyArrow, PyArrowType, ToPyArrow},
};
use lancedb::table::{
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, NewColumnTransform,
OptimizeAction, OptimizeOptions, Ref, Table as LanceDbTable,
AddDataMode, ColumnAlteration, Duration, FieldMetadataUpdate, LoadColumnsRequest,
NewColumnTransform, OptimizeAction, OptimizeOptions, Ref, Table as LanceDbTable,
};
use pyo3::{
Bound, FromPyObject, Py, PyAny, PyRef, PyResult, Python,
@@ -1060,6 +1060,83 @@ impl Table {
})
}
pub fn add_computed_columns(
self_: PyRef<'_, Self>,
columns: Vec<(String, String)>,
expression: String,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
inner
.add_computed_columns(&columns, &expression)
.await
.infer_error()
})
}
#[pyo3(signature = (columns, where_clause=None, num_workers=None, max_workers=None, batch_size=None, priority=None))]
pub fn refresh_column(
self_: PyRef<'_, Self>,
columns: Vec<String>,
where_clause: Option<String>,
num_workers: Option<u32>,
max_workers: Option<u32>,
batch_size: Option<u32>,
priority: Option<String>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
future_into_py(self_.py(), async move {
inner
.refresh_column(
&columns,
where_clause,
num_workers,
max_workers,
batch_size,
priority,
)
.await
.infer_error()
})
}
#[allow(clippy::too_many_arguments)]
#[pyo3(signature = (source_uris, source_format, target_key, columns, source_key=None, source_storage_options=None, on_missing=None, num_workers=None, max_workers=None, batch_size=None, commit_granularity=None, priority=None))]
pub fn load_columns(
self_: PyRef<'_, Self>,
source_uris: Vec<String>,
source_format: String,
target_key: String,
columns: Vec<(String, Option<String>)>,
source_key: Option<String>,
source_storage_options: Option<std::collections::HashMap<String, String>>,
on_missing: Option<String>,
num_workers: Option<u32>,
max_workers: Option<u32>,
batch_size: Option<u32>,
commit_granularity: Option<u32>,
priority: Option<String>,
) -> PyResult<Bound<'_, PyAny>> {
let inner = self_.inner_ref()?.clone();
let request = LoadColumnsRequest {
source_uris,
source_format,
source_storage_options,
target_key,
source_key,
columns,
on_missing,
num_workers,
max_workers,
batch_size,
commit_granularity,
priority,
};
future_into_py(self_.py(), async move {
inner.load_columns(request).await.infer_error()
})
}
pub fn add_columns(
self_: PyRef<'_, Self>,
definitions: Vec<(String, String)>,

@@ -166,10 +166,6 @@ required-features = ["bedrock"]
[[example]]
name = "simple"
[[example]]
name = "polars"
required-features = ["polars"]
[[example]]
name = "full_text_search"

@@ -1,47 +0,0 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright The LanceDB Authors
//! This example demonstrates ingesting a Polars DataFrame into LanceDB and
//! reading it back out as a Polars DataFrame.
use lancedb::arrow::IntoPolars;
use lancedb::query::ExecutableQuery;
use lancedb::{Result, connect};
use polars::prelude::{DataFrame, NamedFrom, Series};
fn make_dataframe() -> DataFrame {
let ids = Series::new("id", &[1i32, 2, 3, 4, 5]);
let names = Series::new("name", &["Alice", "Bob", "Carol", "Dave", "Eve"]);
let scores = Series::new("score", &[9.5f64, 8.1, 7.3, 9.0, 6.5]);
DataFrame::new(vec![ids, names, scores]).unwrap()
}
#[tokio::main]
async fn main() -> Result<()> {
let tmp = tempfile::tempdir().unwrap();
let db = connect(tmp.path().to_str().unwrap()).execute().await?;
// Ingest a Polars DataFrame directly — DataFrame now implements Scannable.
let df = make_dataframe();
println!("Input DataFrame:\n{df}");
let table = db.create_table("people", df).execute().await?;
// Append more rows.
let more = DataFrame::new(vec![
Series::new("id", &[6i32, 7]),
Series::new("name", &["Frank", "Grace"]),
Series::new("score", &[7.8f64, 8.9]),
])
.unwrap();
table.add(more).execute().await?;
// Read back as a Polars DataFrame.
let result_df = table.query().execute().await?.into_polars().await?;
println!(
"\nRound-tripped DataFrame ({} rows):\n{result_df}",
result_df.height()
);
Ok(())
}

@@ -112,14 +112,54 @@ impl<S: Stream<Item = Result<arrow_array::RecordBatch>>> RecordBatchStream
/// A trait for converting incoming data to Arrow
///
/// Integrations should implement this trait to allow data to be
/// imported directly from the integration. For example, implementing
/// this trait for `Vec<Vec<...>>` would allow the `Vec` to be directly
/// used in methods like [`crate::connection::Connection::create_table`]
/// or [`crate::table::Table::add`]
pub trait IntoArrow {
/// Convert the data into an iterator of Arrow batches
fn into_arrow(self) -> Result<Box<dyn arrow_array::RecordBatchReader + Send>>;
}
pub type BoxedRecordBatchReader = Box<dyn arrow_array::RecordBatchReader + Send>;
impl<T: arrow_array::RecordBatchReader + Send + 'static> IntoArrow for T {
fn into_arrow(self) -> Result<Box<dyn arrow_array::RecordBatchReader + Send>> {
Ok(Box::new(self))
}
}
/// A trait for converting incoming data to Arrow asynchronously
///
/// Serves the same purpose as [`IntoArrow`], but for asynchronous data.
///
/// Note: Arrow has no async equivalent to `RecordBatchReader`, so this trait
/// returns a [`SendableRecordBatchStream`] instead.
pub trait IntoArrowStream {
/// Convert the data into a stream of Arrow batches
fn into_arrow(self) -> Result<SendableRecordBatchStream>;
}
impl<S: Stream<Item = Result<arrow_array::RecordBatch>>> SimpleRecordBatchStream<S> {
pub fn new(stream: S, schema: Arc<arrow_schema::Schema>) -> Self {
Self { schema, stream }
}
}
impl IntoArrowStream for SendableRecordBatchStream {
fn into_arrow(self) -> Result<SendableRecordBatchStream> {
Ok(self)
}
}
impl IntoArrowStream for datafusion_physical_plan::SendableRecordBatchStream {
fn into_arrow(self) -> Result<SendableRecordBatchStream> {
let schema = self.schema();
let stream = self.map_err(|df_err| df_err.into());
Ok(Box::pin(SimpleRecordBatchStream::new(stream, schema)))
}
}
pub trait LanceDbDatagenExt {
fn into_ldb_stream(
self,
@@ -224,7 +264,9 @@ impl IntoPolars for SendableRecordBatchStream {
#[cfg(all(test, feature = "polars"))]
mod tests {
use super::SendableRecordBatchStream;
use crate::arrow::{IntoPolars, PolarsDataFrameRecordBatchReader, SimpleRecordBatchStream};
use crate::arrow::{
IntoArrow, IntoPolars, PolarsDataFrameRecordBatchReader, SimpleRecordBatchStream,
};
use polars::prelude::{DataFrame, NamedFrom, Series};
fn get_record_batch_reader_from_polars() -> Box<dyn arrow_array::RecordBatchReader + Send> {
@@ -238,7 +280,10 @@ mod tests {
float_series = Series::new("float", &[2.0]);
let df2 = DataFrame::new(vec![string_series, int_series, float_series]).unwrap();
Box::new(PolarsDataFrameRecordBatchReader::new(df1.vstack(&df2).unwrap()).unwrap())
PolarsDataFrameRecordBatchReader::new(df1.vstack(&df2).unwrap())
.unwrap()
.into_arrow()
.unwrap()
}
#[test]

@@ -23,8 +23,10 @@ use crate::connection::create_table::CreateTableBuilder;
use crate::data::scannable::Scannable;
use crate::database::listing::ListingDatabase;
use crate::database::{
CloneTableRequest, Database, DatabaseOptions, OpenTableRequest, ReadConsistency,
TableNamesRequest,
CloneTableRequest, CreateFunctionRequest, CreateMaterializedViewRequest, Database,
DatabaseOptions, FunctionInfo, JobErrorInfo, JobHistoryInfo, JobInfo, MaterializedViewInfo,
MvRefreshPlan, OpenTableRequest, ReadConsistency, RefreshMaterializedViewRequest,
TableLineageRequest, TableNamesRequest,
};
use crate::embeddings::{EmbeddingRegistry, MemoryRegistry};
use crate::error::{Error, Result};
@@ -488,6 +490,113 @@ impl Connection {
)
}
// -- Derived compute: functions, materialized views, jobs -------------
// Server-backed features (LanceDB Enterprise / Cloud); local
// databases return NotSupported for now.
/// Register a UDF (CREATE FUNCTION).
pub async fn create_function(&self, request: CreateFunctionRequest) -> Result<()> {
self.internal.create_function(request).await
}
/// List registered functions (SHOW FUNCTIONS).
pub async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
self.internal.list_functions().await
}
/// Drop a registered function (DROP FUNCTION).
pub async fn drop_function(&self, name: &str) -> Result<()> {
self.internal.drop_function(name).await
}
/// Create a materialized view (CREATE MATERIALIZED VIEW). Returns
/// the initial-population job id, absent when `with_no_data`.
pub async fn create_materialized_view(
&self,
request: CreateMaterializedViewRequest,
) -> Result<Option<String>> {
self.internal.create_materialized_view(request).await
}
/// Refresh a materialized view; returns the refresh job id.
pub async fn refresh_materialized_view(
&self,
request: RefreshMaterializedViewRequest,
) -> Result<String> {
self.internal.refresh_materialized_view(request).await
}
/// Derived-compute lineage of a table/view (or column), as server-defined
/// JSON. Read-only.
pub async fn table_lineage(&self, request: TableLineageRequest) -> Result<String> {
self.internal.table_lineage(request).await
}
/// Plan a materialized-view refresh without submitting work
/// (EXPLAIN REFRESH).
pub async fn explain_refresh_materialized_view(
&self,
name: &str,
full: bool,
src_version: Option<u64>,
) -> Result<MvRefreshPlan> {
self.internal
.explain_refresh_materialized_view(name, full, src_version)
.await
}
/// Update a materialized view's options (ALTER MATERIALIZED VIEW).
pub async fn alter_materialized_view(&self, name: &str, auto_refresh: bool) -> Result<()> {
self.internal
.alter_materialized_view(name, auto_refresh)
.await
}
/// Drop a materialized view definition (DROP MATERIALIZED VIEW).
pub async fn drop_materialized_view(&self, name: &str) -> Result<()> {
self.internal.drop_materialized_view(name).await
}
/// List registered materialized view definitions.
pub async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
self.internal.list_materialized_views().await
}
/// List inflight server-side jobs across the database's tables.
pub async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
self.internal.list_jobs().await
}
/// Cancel an inflight server-side job by id. Returns true if a
/// matching inflight job was flagged for cancellation.
pub async fn cancel_job(&self, job_id: &str) -> Result<bool> {
self.internal.cancel_job(job_id).await
}
/// Look up a single server-side job by id -- the `wait()`/status poll path.
/// `table_hint` (the job's table) enables an O(1) server-side lookup; `None`
/// scans the database's active jobs. A `None` result means unknown / not
/// active.
pub async fn get_job(&self, job_id: &str, table_hint: Option<&str>) -> Result<Option<JobInfo>> {
self.internal.get_job(job_id, table_hint).await
}
/// Durable job history (SHOW JOB HISTORY) across the database's tables.
/// Pass `job_id` to narrow to a single job.
pub async fn job_history(&self, job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
self.internal.job_history(job_id).await
}
/// Per-row UDF errors (SHOW ERRORS) across the database's tables, optionally
/// filtered by `job_id` and/or `table`.
pub async fn errors(
&self,
job_id: Option<&str>,
table: Option<&str>,
) -> Result<Vec<JobErrorInfo>> {
self.internal.errors(job_id, table).await
}
/// Rename a table in the database.
///
/// This is only supported in LanceDB Cloud.
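For post-mortems, `job_history` rows carry millisecond timestamps and a newline-joined event log. The sketch below shows one way to turn those into a duration and a list of events. `HistoryRow`, `duration_seconds` and `event_lines` are hypothetical helpers that mirror only a subset of the `JobHistoryInfo` field names; the sample event strings are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HistoryRow:
    """Hypothetical stand-in for a subset of the job history fields."""

    job_id: str
    state: str
    created_ms: int
    completed_ms: Optional[int] = None
    events: Optional[str] = None


def duration_seconds(row: HistoryRow) -> Optional[float]:
    """Wall-clock duration in seconds, or None if the job never completed."""
    if row.completed_ms is None:
        return None
    return (row.completed_ms - row.created_ms) / 1000.0


def event_lines(row: HistoryRow) -> List[str]:
    """Split the newline-joined event log into entries, oldest first."""
    return row.events.splitlines() if row.events else []


if __name__ == "__main__":
    row = HistoryRow(
        job_id="job-1",
        state="failed",
        created_ms=1_000,
        completed_ms=4_500,
        events="submitted\nrunning\nfailed",  # invented sample events
    )
    print(duration_seconds(row), event_lines(row))
```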

@@ -185,43 +185,6 @@ impl Scannable for SendableRecordBatchStream {
}
}
#[cfg(feature = "polars")]
impl Scannable for polars::frame::DataFrame {
fn schema(&self) -> SchemaRef {
crate::polars_arrow_convertors::convert_polars_df_schema_to_arrow_rb_schema(
self.schema().clone(),
)
.expect("failed to convert Polars DataFrame schema to Arrow schema")
}
fn scan_as_stream(&mut self) -> SendableRecordBatchStream {
let schema = Scannable::schema(self);
let batches: crate::Result<Vec<RecordBatch>> =
match crate::arrow::PolarsDataFrameRecordBatchReader::new(self.clone()) {
Err(e) => Err(e),
Ok(reader) => reader.map(|b| b.map_err(Into::into)).collect(),
};
match batches {
Err(e) => Box::pin(SimpleRecordBatchStream {
schema,
stream: once(async move { Err(e) }),
}),
Ok(batches) => {
let stream = futures::stream::iter(batches.into_iter().map(Ok));
Box::pin(SimpleRecordBatchStream { schema, stream })
}
}
}
fn num_rows(&self) -> Option<usize> {
Some(self.height())
}
fn rescannable(&self) -> bool {
true
}
}
#[async_trait]
impl StreamingWriteSource for Box<dyn Scannable> {
fn arrow_schema(&self) -> SchemaRef {
@@ -1126,60 +1089,4 @@ mod tests {
);
}
}
#[cfg(feature = "polars")]
mod polars_tests {
use super::*;
use crate::arrow::IntoPolars;
use crate::query::ExecutableQuery;
use polars::prelude::{DataFrame, NamedFrom, Series};
fn make_df() -> DataFrame {
DataFrame::new(vec![
Series::new("id", &[1i32, 2, 3]),
Series::new("val", &[1.1f64, 2.2, 3.3]),
])
.unwrap()
}
#[tokio::test]
async fn test_dataframe_scannable_round_trip() {
let tmp = tempfile::tempdir().unwrap();
let db = crate::connect(tmp.path().to_str().unwrap())
.execute()
.await
.unwrap();
let df = make_df();
let table = db.create_table("t", df.clone()).execute().await.unwrap();
// Append the same rows again.
table.add(df.clone()).execute().await.unwrap();
let result = table
.query()
.execute()
.await
.unwrap()
.into_polars()
.await
.unwrap();
assert_eq!(result.height(), df.height() * 2);
assert_eq!(result.schema(), df.schema());
}
#[tokio::test]
async fn test_dataframe_scannable_rescannable() {
let mut df = make_df();
assert!(df.rescannable());
let batches1: Vec<RecordBatch> = df.scan_as_stream().try_collect().await.unwrap();
assert_eq!(batches1.iter().map(|b| b.num_rows()).sum::<usize>(), 3);
// Can be scanned again.
let batches2: Vec<RecordBatch> = df.scan_as_stream().try_collect().await.unwrap();
assert_eq!(batches2.iter().map(|b| b.num_rows()).sum::<usize>(), 3);
}
}
}

@@ -27,7 +27,7 @@ use lance_namespace::models::{
};
use crate::data::scannable::Scannable;
use crate::error::Result;
use crate::error::{Error, Result};
use crate::table::{BaseTable, WriteOptions};
pub mod listing;
@@ -200,6 +200,205 @@ pub enum ReadConsistency {
Strong,
}
/// A request to register a UDF (CREATE FUNCTION).
///
/// Functions are first-class database objects, decoupled from any
/// column; computed columns and materialized views reference them by
/// name. Server-backed feature (LanceDB Enterprise / Cloud).
#[derive(Debug, Clone)]
pub struct CreateFunctionRequest {
/// Function name.
pub name: String,
/// Implementation language (currently "python").
pub language: String,
/// SQL return type, e.g. `FLOAT`, `FLOAT[1536]`,
/// `STRUCT(a FLOAT, b VARCHAR)`, `TABLE(chunk VARCHAR, idx INT)`.
pub return_type: String,
/// Function body: source text, or base64 cloudpickle bytes when
/// `options["body_format"] = "cloudpickle"`.
pub body: String,
/// Options: input_columns, pip, num_gpus, batch_size, timeout,
/// error_policy, docker_image, body_format, ...
pub options: HashMap<String, String>,
}
/// A registered function, as returned by `list_functions`.
#[derive(Debug, Clone)]
pub struct FunctionInfo {
pub name: String,
pub language: String,
pub return_type: String,
pub description: String,
}
/// A request to create a materialized view (CREATE MATERIALIZED VIEW).
#[derive(Debug, Clone)]
pub struct CreateMaterializedViewRequest {
/// View name.
pub name: String,
/// The view's SELECT statement, e.g.
/// `SELECT id, embed(body) AS vec FROM articles WHERE id > 1`.
/// Bare columns project through; function-call columns compute via
/// registered UDFs (a RETURNS TABLE function makes a row-expanding
/// chunker view).
pub query: String,
/// Refresh automatically when the source table changes.
pub auto_refresh: bool,
/// Register the definition only; skip the initial population.
pub with_no_data: bool,
/// Optional source column to partition the view's table function on. If the
/// column has an IVF vector index the server partitions by its clusters
/// (image-dedup style); otherwise it groups by distinct value.
pub partition_by: Option<String>,
}
impl CreateMaterializedViewRequest {
pub fn new(name: impl Into<String>, query: impl Into<String>) -> Self {
Self {
name: name.into(),
query: query.into(),
auto_refresh: false,
with_no_data: false,
partition_by: None,
}
}
}
/// A request to refresh a materialized view.
#[derive(Debug, Clone)]
pub struct RefreshMaterializedViewRequest {
/// View name.
pub name: String,
/// Force a full rebuild (recompute and replace every row) instead of the
/// default incremental refresh.
pub full: bool,
/// Pin the refresh to a source-table version; latest when absent.
pub src_version: Option<u64>,
/// Initial worker count.
pub num_workers: Option<u32>,
/// Elastic worker ceiling.
pub max_workers: Option<u32>,
}
/// A request for the derived-compute lineage of a table/view (or one of its
/// columns). The response is server-defined lineage JSON, returned opaque so
/// this client need not model the server's lineage schema.
#[derive(Debug, Clone, Default)]
pub struct TableLineageRequest {
/// Table or view name.
pub name: String,
/// Column for column-level lineage; whole table/view when absent.
pub column: Option<String>,
/// "upstream" | "downstream" | "both" (server default when absent).
pub direction: Option<String>,
/// Column-hops to walk; transitive when absent.
pub depth: Option<u32>,
}
impl RefreshMaterializedViewRequest {
pub fn new(name: impl Into<String>) -> Self {
Self {
name: name.into(),
full: false,
src_version: None,
num_workers: None,
max_workers: None,
}
}
}
/// A registered materialized view definition, as returned by
/// `list_materialized_views`.
#[derive(Debug, Clone)]
pub struct MaterializedViewInfo {
pub name: String,
pub source_table: String,
/// Source columns projected through.
pub projection: Vec<String>,
/// `alias=expression` per UDF-computed column.
pub udf_columns: Vec<String>,
pub filter: Option<String>,
pub auto_refresh: bool,
}
/// A row from `list_jobs`: one inflight server-side job (index build,
/// compaction, column refresh, view refresh, ...).
#[derive(Debug, Clone)]
pub struct JobInfo {
pub table: String,
pub job_id: String,
pub job_type: String,
/// Lifecycle state as reported by the server, e.g. "running",
/// "cancelling", "stale", or "failed".
pub state: String,
pub column: Option<String>,
pub age_seconds: Option<i64>,
pub command: Option<String>,
pub units_done: Option<i64>,
pub units_total: Option<i64>,
/// Whether the job's final commit has completed (output visible).
pub committed: bool,
pub rows_skipped: u64,
pub error: Option<String>,
}
/// A row from `job_history`: one durable, completed/terminal server-side job
/// record (SHOW JOB HISTORY), read from a table's `_job_history` store. Unlike
/// `JobInfo` (live, inflight jobs) this carries created/updated/completed
/// timestamps and the lifecycle event log.
#[derive(Debug, Clone)]
pub struct JobHistoryInfo {
pub table: String,
pub job_id: String,
pub job_type: String,
pub state: String,
pub column: Option<String>,
pub created_ms: i64,
pub updated_ms: i64,
pub completed_ms: Option<i64>,
pub rows_processed: Option<i64>,
pub rows_skipped: Option<i64>,
pub error: Option<String>,
/// Newline-joined lifecycle event log, oldest first.
pub events: Option<String>,
}
/// A row from `errors`: one per-row UDF failure recorded by `error_policy=skip`
/// (SHOW ERRORS).
#[derive(Debug, Clone)]
pub struct JobErrorInfo {
pub job_id: String,
pub table: String,
pub column: String,
pub error_type: String,
pub error_message: String,
pub fragment_id: Option<i64>,
pub source_row_id: Option<i64>,
pub table_version: Option<i64>,
pub age_seconds: Option<i64>,
}
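A list of per-row failures is usually easier to act on once it is grouped. The sketch below counts failures per `error_type`; `count_by_error_type` is a hypothetical helper, and it takes plain string slices so that it stays independent of the real `JobErrorInfo` type.

```rust
use std::collections::BTreeMap;

/// Count how many recorded failures share each error type.
/// A `BTreeMap` keeps the result sorted by error type name.
pub fn count_by_error_type<'a>(
    error_types: impl IntoIterator<Item = &'a str>,
) -> BTreeMap<&'a str, usize> {
    let mut counts = BTreeMap::new();
    for error_type in error_types {
        *counts.entry(error_type).or_insert(0) += 1;
    }
    counts
}
```

With the real type, the input could be produced by something like `errs.iter().map(|e| e.error_type.as_str())`.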
/// The plan a `REFRESH MATERIALIZED VIEW` would execute, as returned by
/// `explain_refresh_materialized_view` (EXPLAIN REFRESH). No work is run.
#[derive(Debug, Clone)]
pub struct MvRefreshPlan {
pub table_name: String,
/// Whether a refresh would do anything (rebuild or non-empty units).
pub has_work: bool,
pub source_version: u64,
pub last_refreshed_version: Option<u64>,
pub full_refresh: bool,
/// Source changed non-append-only since the last refresh -> rebuild.
pub rebuild: bool,
/// Number of row-range work units the refresh would process.
pub units_total: u64,
}
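One plausible way to read an `MvRefreshPlan` is to collapse its flags into a single decision. This is a sketch of an interpretation, not documented server behavior: `PlanSummary` is a hypothetical subset of the struct, and the rule "a full refresh or a rebuild means everything is recomputed, otherwise `units_total` incremental units run" is an assumption drawn from the field comments.

```rust
/// Hypothetical subset of `MvRefreshPlan` used for the decision below.
#[derive(Debug, Clone)]
pub struct PlanSummary {
    pub has_work: bool,
    pub full_refresh: bool,
    pub rebuild: bool,
    pub units_total: u64,
}

/// What a refresh would do, according to the plan flags.
#[derive(Debug, PartialEq, Eq)]
pub enum PlannedAction {
    /// Nothing to do: the view is already current.
    UpToDate,
    /// The whole view would be recomputed.
    Rebuild,
    /// Only `units` row-range work units would be processed.
    Incremental { units: u64 },
}

/// Collapse the plan flags into one action (assumed precedence:
/// no work, then rebuild, then incremental).
pub fn planned_action(plan: &PlanSummary) -> PlannedAction {
    if !plan.has_work {
        PlannedAction::UpToDate
    } else if plan.full_refresh || plan.rebuild {
        PlannedAction::Rebuild
    } else {
        PlannedAction::Incremental { units: plan.units_total }
    }
}
```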
fn not_supported<T>(what: &str) -> Result<T> {
Err(Error::NotSupported {
message: format!("{} is not supported by this database", what),
})
}
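The `not_supported` helper above backs a common trait pattern: every optional capability gets a default method that fails, and only backends that implement it override the default. The following is a simplified, synchronous sketch of that pattern. `DbError`, `DbResult`, `LocalDb`, and `ServerDb` are hypothetical stand-ins, and the real trait methods are `async`.

```rust
/// Hypothetical stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
pub enum DbError {
    NotSupported { message: String },
}

pub type DbResult<T> = std::result::Result<T, DbError>;

fn not_supported<T>(what: &str) -> DbResult<T> {
    Err(DbError::NotSupported {
        message: format!("{what} is not supported by this database"),
    })
}

/// Simplified, synchronous version of the trait: optional features
/// default to `NotSupported`.
pub trait Database {
    fn list_functions(&self) -> DbResult<Vec<String>> {
        not_supported("list_functions")
    }
    fn drop_function(&self, _name: &str) -> DbResult<()> {
        not_supported("drop_function")
    }
}

/// A backend that implements none of the optional features.
pub struct LocalDb;
impl Database for LocalDb {}

/// A backend that overrides only what it supports.
pub struct ServerDb {
    pub functions: Vec<String>,
}
impl Database for ServerDb {
    fn list_functions(&self) -> DbResult<Vec<String>> {
        Ok(self.functions.clone())
    }
}
```

The benefit of this design is that adding a new optional method to the trait does not break existing implementations; they simply report `NotSupported` until they opt in.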
/// The `Database` trait defines the interface for database implementations.
///
/// A database is responsible for managing tables and their metadata.
@@ -245,6 +444,99 @@ pub trait Database:
///
/// See [`CloneTableRequest`] for detailed documentation and examples.
async fn clone_table(&self, request: CloneTableRequest) -> Result<Arc<dyn BaseTable>>;
// -- Derived compute: functions, materialized views, jobs -------------
//
// Server-backed features (LanceDB Enterprise / Cloud). The defaults
// return NotSupported; the remote database overrides them. Local
// single-node implementations are planned.
/// Register a UDF (CREATE FUNCTION).
async fn create_function(&self, _request: CreateFunctionRequest) -> Result<()> {
not_supported("create_function")
}
/// List registered functions (SHOW FUNCTIONS).
async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
not_supported("list_functions")
}
/// Drop a registered function (DROP FUNCTION).
async fn drop_function(&self, _name: &str) -> Result<()> {
not_supported("drop_function")
}
/// Create a materialized view (CREATE MATERIALIZED VIEW). Returns
/// the initial-population job id, or `None` when `with_no_data` is set.
async fn create_materialized_view(
&self,
_request: CreateMaterializedViewRequest,
) -> Result<Option<String>> {
not_supported("create_materialized_view")
}
/// Refresh a materialized view; returns the refresh job id.
async fn refresh_materialized_view(
&self,
_request: RefreshMaterializedViewRequest,
) -> Result<String> {
not_supported("refresh_materialized_view")
}
/// Derived-compute lineage of a table/view (or column), as server-defined
/// JSON. Read-only.
async fn table_lineage(&self, _request: TableLineageRequest) -> Result<String> {
not_supported("table_lineage")
}
/// Plan a materialized-view refresh without submitting work
/// (EXPLAIN REFRESH). `full` plans a full rebuild (incremental
/// planning requires stable row IDs on the source).
async fn explain_refresh_materialized_view(
&self,
_name: &str,
_full: bool,
_src_version: Option<u64>,
) -> Result<MvRefreshPlan> {
not_supported("explain_refresh_materialized_view")
}
/// Update a materialized view's options (ALTER MATERIALIZED VIEW).
async fn alter_materialized_view(&self, _name: &str, _auto_refresh: bool) -> Result<()> {
not_supported("alter_materialized_view")
}
/// Drop a materialized view definition (DROP MATERIALIZED VIEW).
async fn drop_materialized_view(&self, _name: &str) -> Result<()> {
not_supported("drop_materialized_view")
}
/// List registered materialized view definitions.
async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
not_supported("list_materialized_views")
}
/// List inflight server-side jobs across the database's tables.
async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
not_supported("list_jobs")
}
/// Cancel an inflight server-side job by id. Returns true if a
/// matching inflight job was found and flagged for cancellation,
/// false if none was inflight (best-effort, like SQL `CANCEL JOB`).
async fn cancel_job(&self, _job_id: &str) -> Result<bool> {
not_supported("cancel_job")
}
/// Look up a single job by id; this is the point-access path that `wait()`
/// and status polls use. `table_hint` (the job's table, which `wait()`
/// callers know) enables an O(1) server-side lookup. Returns `None` if the
/// job is unknown or not active.
async fn get_job(&self, _job_id: &str, _table_hint: Option<&str>) -> Result<Option<JobInfo>> {
not_supported("get_job")
}
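The commit notes describe `wait()` as a poll over `get_job` that must also stop when the server reports `state=failed`. A synchronous sketch of such a loop is shown below. It is not the client's actual implementation: `JobStatus`, `WaitError`, and `wait_for_job` are hypothetical names, the job lookup is passed in as a closure, and treating a `None` lookup as "no longer active" is an assumption.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical subset of `JobInfo` that the poll loop inspects.
#[derive(Debug, Clone, PartialEq)]
pub struct JobStatus {
    pub state: String,
    pub committed: bool,
    pub error: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum WaitError {
    /// The server reported the job as failed; carries the server error.
    Failed(String),
    /// The deadline passed while the job was still running.
    Timeout,
}

/// Poll `get_job` until the job reaches a terminal state or `timeout` elapses.
pub fn wait_for_job<F>(
    mut get_job: F,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<Option<JobStatus>, WaitError>
where
    F: FnMut() -> Option<JobStatus>,
{
    let deadline = Instant::now() + timeout;
    loop {
        match get_job() {
            // Assumption: an unknown job is no longer active, so stop polling.
            None => return Ok(None),
            // Terminal failure: surface the real cause instead of timing out.
            Some(job) if job.state == "failed" => {
                let cause = job.error.unwrap_or_else(|| "unknown error".to_string());
                return Err(WaitError::Failed(cause));
            }
            // Terminal states named in the commit notes.
            Some(job) if job.committed || job.state == "finished" || job.state == "stale" => {
                return Ok(Some(job));
            }
            // Still running: fall through and poll again.
            Some(_) => {}
        }
        if Instant::now() >= deadline {
            return Err(WaitError::Timeout);
        }
        sleep(poll_interval);
    }
}
```

Checking for failure before the deadline is what lets a doomed job surface its error within a poll interval or two rather than after the full timeout.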
/// Durable job history (SHOW JOB HISTORY) across the database's tables,
/// optionally narrowed to a single `job_id`.
async fn job_history(&self, _job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
not_supported("job_history")
}
/// Per-row UDF errors (SHOW ERRORS) recorded by `error_policy=skip` across
/// the database's tables, optionally filtered by `job_id` and/or `table`.
async fn errors(
&self,
_job_id: Option<&str>,
_table: Option<&str>,
) -> Result<Vec<JobErrorInfo>> {
not_supported("errors")
}
/// Open a table in the database
async fn open_table(&self, request: OpenTableRequest) -> Result<Arc<dyn BaseTable>>;
/// Rename a table in the database


@@ -19,8 +19,10 @@ use lance_namespace::models::{
use crate::Error;
use crate::database::{
CloneTableRequest, CreateTableMode, CreateTableRequest, Database, DatabaseOptions,
OpenTableRequest, ReadConsistency, TableNamesRequest,
CloneTableRequest, CreateFunctionRequest, CreateMaterializedViewRequest, CreateTableMode,
CreateTableRequest, Database, DatabaseOptions, FunctionInfo, JobErrorInfo, JobHistoryInfo,
JobInfo, MaterializedViewInfo, MvRefreshPlan, OpenTableRequest, ReadConsistency,
RefreshMaterializedViewRequest, TableLineageRequest, TableNamesRequest,
};
use crate::error::Result;
use crate::remote::util::stream_as_body;
@@ -33,6 +35,248 @@ use super::client::{
use super::table::RemoteTable;
use super::util::parse_server_version;
// Wire types for the derived-compute routes (functions, materialized
// views, jobs). Field shapes mirror the server's REST contract.
#[derive(serde::Serialize)]
struct RemoteCreateFunctionRequest {
language: String,
return_type: String,
body: String,
options: std::collections::HashMap<String, String>,
}
#[derive(serde::Deserialize)]
struct RemoteFunctionEntry {
name: String,
language: String,
return_type: String,
#[serde(default)]
description: String,
}
#[derive(serde::Deserialize)]
struct RemoteListFunctionsResponse {
functions: Vec<RemoteFunctionEntry>,
}
#[derive(serde::Serialize)]
struct RemoteCreateMaterializedViewRequest {
query: String,
auto_refresh: bool,
with_no_data: bool,
#[serde(skip_serializing_if = "Option::is_none")]
partition_by: Option<String>,
}
#[derive(serde::Deserialize)]
struct RemoteCreateMaterializedViewResponse {
#[serde(default)]
job_id: Option<String>,
}
#[derive(serde::Serialize)]
struct RemoteRefreshMaterializedViewRequest {
#[serde(skip_serializing_if = "std::ops::Not::not")]
full: bool,
#[serde(skip_serializing_if = "Option::is_none")]
src_version: Option<u64>,
#[serde(skip_serializing_if = "Option::is_none")]
num_workers: Option<u32>,
#[serde(skip_serializing_if = "Option::is_none")]
max_workers: Option<u32>,
}
#[derive(serde::Deserialize)]
struct RemoteRefreshMaterializedViewResponse {
job_id: String,
}
#[derive(serde::Serialize)]
struct RemoteExplainRefreshRequest {
#[serde(skip_serializing_if = "Option::is_none")]
full: Option<bool>,
#[serde(skip_serializing_if = "Option::is_none")]
src_version: Option<u64>,
}
#[derive(serde::Deserialize)]
struct RemoteExplainRefreshResponse {
table_name: String,
has_work: bool,
source_version: u64,
last_refreshed_version: Option<u64>,
full_refresh: bool,
rebuild: bool,
units_total: u64,
}
#[derive(serde::Serialize)]
struct RemoteAlterMaterializedViewRequest {
auto_refresh: bool,
}
#[derive(serde::Deserialize)]
struct RemoteMaterializedViewEntry {
name: String,
source_table: String,
#[serde(default)]
projection: Vec<String>,
#[serde(default)]
udf_columns: Vec<String>,
#[serde(default)]
filter: Option<String>,
#[serde(default)]
auto_refresh: bool,
}
#[derive(serde::Deserialize)]
struct RemoteListMaterializedViewsResponse {
views: Vec<RemoteMaterializedViewEntry>,
}
#[derive(serde::Deserialize)]
struct RemoteJobEntry {
table: String,
job_id: String,
job_type: String,
state: String,
#[serde(default)]
column: Option<String>,
#[serde(default)]
age_seconds: Option<i64>,
#[serde(default)]
command: Option<String>,
#[serde(default)]
units_done: Option<i64>,
#[serde(default)]
units_total: Option<i64>,
#[serde(default)]
committed: bool,
#[serde(default)]
rows_skipped: u64,
#[serde(default)]
error: Option<String>,
}
#[derive(serde::Deserialize)]
struct RemoteListJobsResponse {
jobs: Vec<RemoteJobEntry>,
}
#[derive(serde::Deserialize)]
struct RemoteGetJobResponse {
#[serde(default)]
job: Option<RemoteJobEntry>,
}
#[derive(serde::Deserialize)]
struct RemoteCancelJobResponse {
cancelled: bool,
}
impl From<RemoteJobEntry> for JobInfo {
fn from(j: RemoteJobEntry) -> Self {
JobInfo {
table: j.table,
job_id: j.job_id,
job_type: j.job_type,
state: j.state,
column: j.column,
age_seconds: j.age_seconds,
command: j.command,
units_done: j.units_done,
units_total: j.units_total,
committed: j.committed,
rows_skipped: j.rows_skipped,
error: j.error,
}
}
}
#[derive(serde::Deserialize)]
struct RemoteJobHistoryEntry {
table: String,
job_id: String,
job_type: String,
state: String,
#[serde(default)]
column: Option<String>,
created_ms: i64,
updated_ms: i64,
#[serde(default)]
completed_ms: Option<i64>,
#[serde(default)]
rows_processed: Option<i64>,
#[serde(default)]
rows_skipped: Option<i64>,
#[serde(default)]
error: Option<String>,
#[serde(default)]
events: Option<String>,
}
#[derive(serde::Deserialize)]
struct RemoteJobHistoryResponse {
jobs: Vec<RemoteJobHistoryEntry>,
}
impl From<RemoteJobHistoryEntry> for JobHistoryInfo {
fn from(j: RemoteJobHistoryEntry) -> Self {
JobHistoryInfo {
table: j.table,
job_id: j.job_id,
job_type: j.job_type,
state: j.state,
column: j.column,
created_ms: j.created_ms,
updated_ms: j.updated_ms,
completed_ms: j.completed_ms,
rows_processed: j.rows_processed,
rows_skipped: j.rows_skipped,
error: j.error,
events: j.events,
}
}
}
#[derive(serde::Deserialize)]
struct RemoteErrorEntry {
job_id: String,
table: String,
column: String,
error_type: String,
error_message: String,
#[serde(default)]
fragment_id: Option<i64>,
#[serde(default)]
source_row_id: Option<i64>,
#[serde(default)]
table_version: Option<i64>,
#[serde(default)]
age_seconds: Option<i64>,
}
#[derive(serde::Deserialize)]
struct RemoteErrorsResponse {
errors: Vec<RemoteErrorEntry>,
}
impl From<RemoteErrorEntry> for JobErrorInfo {
fn from(e: RemoteErrorEntry) -> Self {
JobErrorInfo {
job_id: e.job_id,
table: e.table,
column: e.column,
error_type: e.error_type,
error_message: e.error_message,
fragment_id: e.fragment_id,
source_row_id: e.source_row_id,
table_version: e.table_version,
age_seconds: e.age_seconds,
}
}
}
// Request structure for the remote clone table API
#[derive(serde::Serialize)]
struct RemoteCloneTableRequest {
@@ -641,6 +885,228 @@ impl<S: HttpSend> Database for RemoteDatabase<S> {
Ok(table)
}
async fn create_function(&self, request: CreateFunctionRequest) -> Result<()> {
let body = RemoteCreateFunctionRequest {
language: request.language,
return_type: request.return_type,
body: request.body,
options: request.options,
};
let req = self
.client
.post(&format!("/v1/function/{}/create", request.name))
.json(&body);
let (request_id, rsp) = self.client.send(req).await?;
self.client.check_response(&request_id, rsp).await?;
Ok(())
}
async fn list_functions(&self) -> Result<Vec<FunctionInfo>> {
let req = self.client.get("/v1/function/list");
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteListFunctionsResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body
.functions
.into_iter()
.map(|f| FunctionInfo {
name: f.name,
language: f.language,
return_type: f.return_type,
description: f.description,
})
.collect())
}
async fn drop_function(&self, name: &str) -> Result<()> {
let req = self.client.post(&format!("/v1/function/{}/drop", name));
let (request_id, rsp) = self.client.send(req).await?;
self.client.check_response(&request_id, rsp).await?;
Ok(())
}
async fn create_materialized_view(
&self,
request: CreateMaterializedViewRequest,
) -> Result<Option<String>> {
let body = RemoteCreateMaterializedViewRequest {
query: request.query,
auto_refresh: request.auto_refresh,
with_no_data: request.with_no_data,
partition_by: request.partition_by,
};
let req = self
.client
.post(&format!("/v1/materialized_view/{}/create", request.name))
.json(&body);
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteCreateMaterializedViewResponse =
rsp.json().await.err_to_http(request_id)?;
Ok(body.job_id)
}
async fn refresh_materialized_view(
&self,
request: RefreshMaterializedViewRequest,
) -> Result<String> {
let body = RemoteRefreshMaterializedViewRequest {
full: request.full,
src_version: request.src_version,
num_workers: request.num_workers,
max_workers: request.max_workers,
};
let req = self
.client
.post(&format!("/v1/materialized_view/{}/refresh", request.name))
.json(&body);
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteRefreshMaterializedViewResponse =
rsp.json().await.err_to_http(request_id)?;
Ok(body.job_id)
}
async fn table_lineage(&self, request: TableLineageRequest) -> Result<String> {
let mut req = self
.client
.get(&format!("/v1/table/{}/lineage", request.name));
if let Some(column) = &request.column {
req = req.query(&[("column", column)]);
}
if let Some(direction) = &request.direction {
req = req.query(&[("direction", direction)]);
}
if let Some(depth) = request.depth {
req = req.query(&[("depth", depth.to_string())]);
}
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
// Server-defined lineage JSON, returned opaque (the client does not
// model the lineage schema; the Python layer deserializes it).
rsp.text().await.err_to_http(request_id)
}
async fn explain_refresh_materialized_view(
&self,
name: &str,
full: bool,
src_version: Option<u64>,
) -> Result<MvRefreshPlan> {
let body = RemoteExplainRefreshRequest {
full: Some(full),
src_version,
};
let req = self
.client
.post(&format!("/v1/materialized_view/{}/explain_refresh", name))
.json(&body);
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteExplainRefreshResponse = rsp.json().await.err_to_http(request_id)?;
Ok(MvRefreshPlan {
table_name: body.table_name,
has_work: body.has_work,
source_version: body.source_version,
last_refreshed_version: body.last_refreshed_version,
full_refresh: body.full_refresh,
rebuild: body.rebuild,
units_total: body.units_total,
})
}
async fn alter_materialized_view(&self, name: &str, auto_refresh: bool) -> Result<()> {
let req = self
.client
.post(&format!("/v1/materialized_view/{}/alter", name))
.json(&RemoteAlterMaterializedViewRequest { auto_refresh });
let (request_id, rsp) = self.client.send(req).await?;
self.client.check_response(&request_id, rsp).await?;
Ok(())
}
async fn drop_materialized_view(&self, name: &str) -> Result<()> {
let req = self
.client
.post(&format!("/v1/materialized_view/{}/drop", name));
let (request_id, rsp) = self.client.send(req).await?;
self.client.check_response(&request_id, rsp).await?;
Ok(())
}
async fn list_materialized_views(&self) -> Result<Vec<MaterializedViewInfo>> {
let req = self.client.get("/v1/materialized_view/list");
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteListMaterializedViewsResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body
.views
.into_iter()
.map(|v| MaterializedViewInfo {
name: v.name,
source_table: v.source_table,
projection: v.projection,
udf_columns: v.udf_columns,
filter: v.filter,
auto_refresh: v.auto_refresh,
})
.collect())
}
async fn list_jobs(&self) -> Result<Vec<JobInfo>> {
let req = self.client.get("/v1/job/list");
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteListJobsResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body.jobs.into_iter().map(JobInfo::from).collect())
}
async fn get_job(&self, job_id: &str, table: Option<&str>) -> Result<Option<JobInfo>> {
// Point-access poll path: GET /v1/job/{id}, with the table as the O(1)
// hint when known. `query` handles URL-encoding the table name.
let mut req = self.client.get(&format!("/v1/job/{job_id}"));
if let Some(t) = table {
req = req.query(&[("table", t)]);
}
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteGetJobResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body.job.map(JobInfo::from))
}
async fn cancel_job(&self, job_id: &str) -> Result<bool> {
let req = self.client.post(&format!("/v1/job/{}/cancel", job_id));
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteCancelJobResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body.cancelled)
}
async fn job_history(&self, job_id: Option<&str>) -> Result<Vec<JobHistoryInfo>> {
let mut req = self.client.get("/v1/job/history");
if let Some(j) = job_id {
req = req.query(&[("job", j)]);
}
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteJobHistoryResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body.jobs.into_iter().map(JobHistoryInfo::from).collect())
}
async fn errors(&self, job_id: Option<&str>, table: Option<&str>) -> Result<Vec<JobErrorInfo>> {
let mut req = self.client.get("/v1/job/errors");
if let Some(j) = job_id {
req = req.query(&[("job", j)]);
}
if let Some(t) = table {
req = req.query(&[("table", t)]);
}
let (request_id, rsp) = self.client.send(req).await?;
let rsp = self.client.check_response(&request_id, rsp).await?;
let body: RemoteErrorsResponse = rsp.json().await.err_to_http(request_id)?;
Ok(body.errors.into_iter().map(JobErrorInfo::from).collect())
}
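The `job_history` and `errors` handlers attach each filter only when it is present, so an unfiltered call sends no query string at all. The sketch below reproduces that behavior with the standard library only. `errors_query` and `encode` are hypothetical helpers; the real client delegates escaping to the HTTP builder's `query`, whose exact encoding (for example of spaces) may differ.

```rust
/// Percent-encode every byte except RFC 3986 unreserved characters.
fn encode(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            other => out.push_str(&format!("%{other:02X}")),
        }
    }
    out
}

/// Build the query string for `GET /v1/job/errors`. Absent filters are
/// omitted; with no filters at all there is no query string.
pub fn errors_query(job_id: Option<&str>, table: Option<&str>) -> Option<String> {
    let mut pairs = Vec::new();
    if let Some(j) = job_id {
        pairs.push(format!("job={}", encode(j)));
    }
    if let Some(t) = table {
        pairs.push(format!("table={}", encode(t)));
    }
    if pairs.is_empty() {
        None
    } else {
        Some(pairs.join("&"))
    }
}
```

Note that the filter order is fixed (`job` before `table`), which is what makes an exact string comparison on the query possible in a handler test.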
async fn open_table(&self, request: OpenTableRequest) -> Result<Arc<dyn BaseTable>> {
let identifier = build_table_identifier(
&request.name,
@@ -1580,6 +2046,223 @@ mod tests {
}
}
#[tokio::test]
async fn test_derived_compute_routes() {
// create_function
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::POST);
assert_eq!(request.url().path(), "/v1/function/embed/create");
let body: serde_json::Value =
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
assert_eq!(body["language"], "python");
assert_eq!(body["return_type"], "FLOAT[4]");
assert_eq!(body["body"], "def embed(x): ...");
assert_eq!(body["options"]["pip"], "torch");
http::Response::builder()
.status(200)
.body(r#"{"name":"embed","status":"OK"}"#)
.unwrap()
});
conn.create_function(crate::database::CreateFunctionRequest {
name: "embed".into(),
language: "python".into(),
return_type: "FLOAT[4]".into(),
body: "def embed(x): ...".into(),
options: [("pip".to_string(), "torch".to_string())].into(),
})
.await
.unwrap();
// list_functions
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::GET);
assert_eq!(request.url().path(), "/v1/function/list");
http::Response::builder()
.status(200)
.body(
r#"{"functions":[{"name":"embed","language":"python","return_type":"Float32","description":""}]}"#,
)
.unwrap()
});
let functions = conn.list_functions().await.unwrap();
assert_eq!(functions.len(), 1);
assert_eq!(functions[0].name, "embed");
// drop_function
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::POST);
assert_eq!(request.url().path(), "/v1/function/embed/drop");
http::Response::builder()
.status(200)
.body(r#"{"name":"embed","status":"OK"}"#)
.unwrap()
});
conn.drop_function("embed").await.unwrap();
// create_materialized_view
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::POST);
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/create");
let body: serde_json::Value =
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
assert_eq!(body["query"], "SELECT id, embed(body) AS vec FROM docs");
assert_eq!(body["auto_refresh"], true);
assert_eq!(body["with_no_data"], false);
http::Response::builder()
.status(200)
.body(r#"{"name":"mv1","job_id":"j-1"}"#)
.unwrap()
});
let mut request = crate::database::CreateMaterializedViewRequest::new(
"mv1",
"SELECT id, embed(body) AS vec FROM docs",
);
request.auto_refresh = true;
let job_id = conn.create_materialized_view(request).await.unwrap();
assert_eq!(job_id.as_deref(), Some("j-1"));
// refresh_materialized_view
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::POST);
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/refresh");
let body: serde_json::Value =
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
assert_eq!(body["num_workers"], 2);
assert!(body.get("src_version").is_none());
http::Response::builder()
.status(202)
.body(r#"{"job_id":"j-2"}"#)
.unwrap()
});
let mut request = crate::database::RefreshMaterializedViewRequest::new("mv1");
request.num_workers = Some(2);
let job_id = conn.refresh_materialized_view(request).await.unwrap();
assert_eq!(job_id, "j-2");
// alter_materialized_view
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::POST);
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/alter");
let body: serde_json::Value =
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
assert_eq!(body["auto_refresh"], false);
http::Response::builder()
.status(200)
.body(r#"{"name":"mv1","status":"OK"}"#)
.unwrap()
});
conn.alter_materialized_view("mv1", false).await.unwrap();
// drop_materialized_view
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.url().path(), "/v1/materialized_view/mv1/drop");
http::Response::builder()
.status(200)
.body(r#"{"name":"mv1","status":"OK"}"#)
.unwrap()
});
conn.drop_materialized_view("mv1").await.unwrap();
// list_materialized_views
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::GET);
assert_eq!(request.url().path(), "/v1/materialized_view/list");
http::Response::builder()
.status(200)
.body(
r#"{"views":[{"name":"mv1","source_table":"docs","projection":["id"],"udf_columns":["vec=embed(body)"],"filter":null,"auto_refresh":true}]}"#,
)
.unwrap()
});
let views = conn.list_materialized_views().await.unwrap();
assert_eq!(views.len(), 1);
assert_eq!(views[0].source_table, "docs");
assert!(views[0].auto_refresh);
// list_jobs
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::GET);
assert_eq!(request.url().path(), "/v1/job/list");
http::Response::builder()
.status(200)
.body(
r#"{"jobs":[{"table":"docs","job_id":"j-3","job_type":"udf_virtual_column_backfill","state":"running","column":"vec","age_seconds":4,"command":null,"units_done":1,"units_total":2,"committed":false,"rows_skipped":0,"error":null}]}"#,
)
.unwrap()
});
let jobs = conn.list_jobs().await.unwrap();
assert_eq!(jobs.len(), 1);
assert_eq!(jobs[0].state, "running");
assert_eq!(jobs[0].units_total, Some(2));
// cancel_job
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::POST);
assert_eq!(request.url().path(), "/v1/job/j-3/cancel");
http::Response::builder()
.status(200)
.body(r#"{"cancelled":true}"#)
.unwrap()
});
assert!(conn.cancel_job("j-3").await.unwrap());
// cancel_job: no such inflight job -> false, not an error
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.url().path(), "/v1/job/gone/cancel");
http::Response::builder()
.status(200)
.body(r#"{"cancelled":false}"#)
.unwrap()
});
assert!(!conn.cancel_job("gone").await.unwrap());
// job_history: GET /v1/job/history, no filter
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::GET);
assert_eq!(request.url().path(), "/v1/job/history");
assert!(request.url().query().is_none());
http::Response::builder()
.status(200)
.body(
r#"{"jobs":[{"table":"docs","job_id":"j-1","job_type":"udf_virtual_column_backfill","state":"done","column":"vec","created_ms":1000,"updated_ms":2000,"completed_ms":2000,"rows_processed":42,"rows_skipped":3,"error":null,"events":"created\ndone"}]}"#,
)
.unwrap()
});
let hist = conn.job_history(None).await.unwrap();
assert_eq!(hist.len(), 1);
assert_eq!(hist[0].state, "done");
assert_eq!(hist[0].rows_processed, Some(42));
assert_eq!(hist[0].events.as_deref(), Some("created\ndone"));
// job_history: ?job= narrows to one job
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.url().path(), "/v1/job/history");
assert_eq!(request.url().query(), Some("job=j-1"));
http::Response::builder()
.status(200)
.body(r#"{"jobs":[]}"#)
.unwrap()
});
assert!(conn.job_history(Some("j-1")).await.unwrap().is_empty());
// errors: GET /v1/job/errors with job + table filters
let conn = Connection::new_with_handler(|request| {
assert_eq!(request.method(), &reqwest::Method::GET);
assert_eq!(request.url().path(), "/v1/job/errors");
assert_eq!(request.url().query(), Some("job=j-1&table=docs"));
http::Response::builder()
.status(200)
.body(
r#"{"errors":[{"job_id":"j-1","table":"docs","column":"vec","error_type":"ValueError","error_message":"boom","fragment_id":0,"source_row_id":42,"table_version":7,"age_seconds":5}]}"#,
)
.unwrap()
});
let errs = conn.errors(Some("j-1"), Some("docs")).await.unwrap();
assert_eq!(errs.len(), 1);
assert_eq!(errs[0].error_type, "ValueError");
assert_eq!(errs[0].source_row_id, Some(42));
}
#[tokio::test]
async fn test_clone_table() {
let conn = Connection::new_with_handler(|request| {


@@ -2309,6 +2309,126 @@ impl<S: HttpSend> BaseTable for RemoteTable<S> {
message: "optimize is not supported on LanceDB cloud.".into(),
})
}
async fn add_computed_columns(
&self,
columns: &[(String, String)],
expression: &str,
) -> Result<()> {
let new_columns: Vec<serde_json::Value> = columns
.iter()
.map(|(name, data_type)| {
serde_json::json!({
"name": name,
"computed": { "data_type": data_type, "expression": expression },
})
})
.collect();
let request = self
.client
.post(&format!("/v1/table/{}/add_columns/", self.identifier))
.json(&serde_json::json!({ "new_columns": new_columns }));
let (request_id, response) = self.send(request, true).await?;
self.check_table_response(&request_id, response).await?;
Ok(())
}
async fn refresh_column(
&self,
columns: &[String],
where_clause: Option<String>,
num_workers: Option<u32>,
max_workers: Option<u32>,
batch_size: Option<u32>,
priority: Option<String>,
) -> Result<String> {
let mut body = serde_json::json!({ "columns": columns });
if let Some(w) = where_clause {
body["where_clause"] = serde_json::Value::String(w);
}
if let Some(n) = num_workers {
body["num_workers"] = n.into();
}
if let Some(n) = max_workers {
body["max_workers"] = n.into();
}
if let Some(n) = batch_size {
body["batch_size"] = n.into();
}
if let Some(p) = priority {
body["priority"] = serde_json::Value::String(p);
}
let request = self
.client
.post(&format!("/v1/table/{}/refresh_column", self.identifier))
.json(&body);
let (request_id, response) = self.send(request, true).await?;
let response = self.check_table_response(&request_id, response).await?;
#[derive(serde::Deserialize)]
struct RefreshColumnResponse {
job_id: String,
}
let body: RefreshColumnResponse = response.json().await.err_to_http(request_id)?;
Ok(body.job_id)
}
async fn load_columns(&self, request: crate::table::LoadColumnsRequest) -> Result<String> {
let columns: Vec<serde_json::Value> = request
.columns
.iter()
.map(|(target, source)| {
serde_json::json!({
"target": target,
"source": source.clone().unwrap_or_else(|| target.clone()),
})
})
.collect();
let mut source = serde_json::json!({
"uris": request.source_uris,
"format": request.source_format,
});
if let Some(opts) = request.source_storage_options {
source["storage_options"] = serde_json::to_value(opts).unwrap_or_default();
}
let mut body = serde_json::json!({
"columns": columns,
"source": source,
"target_key": request.target_key,
});
if let Some(k) = request.source_key {
body["source_key"] = serde_json::Value::String(k);
}
if let Some(m) = request.on_missing {
body["on_missing"] = serde_json::Value::String(m);
}
if let Some(n) = request.num_workers {
body["num_workers"] = n.into();
}
if let Some(n) = request.max_workers {
body["max_workers"] = n.into();
}
if let Some(n) = request.batch_size {
body["batch_size"] = n.into();
}
if let Some(n) = request.commit_granularity {
body["commit_granularity"] = n.into();
}
if let Some(p) = request.priority {
body["priority"] = serde_json::Value::String(p);
}
let http_request = self
.client
.post(&format!("/v1/table/{}/load_columns", self.identifier))
.json(&body);
let (request_id, response) = self.send(http_request, true).await?;
let response = self.check_table_response(&request_id, response).await?;
#[derive(serde::Deserialize)]
struct LoadColumnsResponse {
job_id: String,
}
let body: LoadColumnsResponse = response.json().await.err_to_http(request_id)?;
Ok(body.job_id)
}
async fn add_columns(
&self,
transforms: NewColumnTransform,
@@ -2801,6 +2921,75 @@ mod tests {
}
}
#[tokio::test]
async fn test_refresh_column() {
let table = Table::new_with_handler("my_table", |request| {
assert_eq!(request.method(), "POST");
assert_eq!(request.url().path(), "/v1/table/my_table/refresh_column");
let body: serde_json::Value =
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
assert_eq!(body["columns"], serde_json::json!(["vec"]));
assert_eq!(body["num_workers"], 2);
assert!(body.get("where_clause").is_none());
http::Response::builder()
.status(202)
.body(r#"{"job_id":"j-9"}"#)
.unwrap()
});
let job_id = table
.refresh_column(&["vec".to_string()], None, Some(2), None, None, None)
.await
.unwrap();
assert_eq!(job_id, "j-9");
}
#[tokio::test]
async fn test_load_columns() {
let table = Table::new_with_handler("my_table", |request| {
assert_eq!(request.method(), "POST");
assert_eq!(request.url().path(), "/v1/table/my_table/load_columns");
let body: serde_json::Value =
serde_json::from_slice(request.body().unwrap().as_bytes().unwrap()).unwrap();
assert_eq!(
body["columns"],
serde_json::json!([{"target": "embedding", "source": "emb"}])
);
assert_eq!(body["source"]["format"], "parquet");
assert_eq!(
body["source"]["uris"],
serde_json::json!(["s3://b/x.parquet"])
);
assert_eq!(body["target_key"], "document_id");
assert_eq!(body["source_key"], "doc_id");
assert_eq!(body["on_missing"], "null");
assert_eq!(body["num_workers"], 4);
http::Response::builder()
.status(202)
.body(r#"{"job_id":"lc-7"}"#)
.unwrap()
});
let request = crate::table::LoadColumnsRequest {
source_uris: vec!["s3://b/x.parquet".to_string()],
source_format: "parquet".to_string(),
source_storage_options: None,
target_key: "document_id".to_string(),
source_key: Some("doc_id".to_string()),
columns: vec![("embedding".to_string(), Some("emb".to_string()))],
on_missing: Some("null".to_string()),
num_workers: Some(4),
max_workers: None,
batch_size: None,
commit_granularity: None,
priority: None,
};
let job_id = table.load_columns(request).await.unwrap();
assert_eq!(job_id, "lc-7");
}
#[tokio::test]
async fn test_version() {
let table = Table::new_with_handler("my_table", |request| {


@@ -471,6 +471,33 @@ impl LsmWriteSpec {
}
}
/// Request to fill existing table columns from an external source by
/// primary-key join (Geneva `Table.load_columns()` parity). Server-backed
/// feature (LanceDB Enterprise / Cloud).
#[derive(Debug, Clone)]
pub struct LoadColumnsRequest {
/// External source URIs.
pub source_uris: Vec<String>,
/// Source format: "parquet" | "lance" | "ipc".
pub source_format: String,
/// Source-only storage options (e.g. cloud credentials).
pub source_storage_options: Option<HashMap<String, String>>,
/// Destination primary-key column.
pub target_key: String,
/// Source primary-key column. Defaults to `target_key` when None.
pub source_key: Option<String>,
/// Value column mappings as `(target, source)`; a None source defaults to
/// the target name.
pub columns: Vec<(String, Option<String>)>,
/// Missing-row policy: "carry" (default) | "null" | "error".
pub on_missing: Option<String>,
pub num_workers: Option<u32>,
pub max_workers: Option<u32>,
pub batch_size: Option<u32>,
pub commit_granularity: Option<u32>,
pub priority: Option<String>,
}
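The doc comments on `LoadColumnsRequest` describe three fallbacks: `source_key` defaults to `target_key`, a `None` source column defaults to the target name, and `on_missing` defaults to "carry". A minimal sketch of that resolution follows. It assumes a local mirror struct (`LoadColumnsSketch`) and hypothetical helper names rather than the crate's own types, and the diff does not show whether the client or the server applies these defaults.

```rust
/// Local mirror of the fields involved in default resolution.
/// This is an illustrative type, not the crate's `LoadColumnsRequest`.
#[derive(Debug, Clone)]
struct LoadColumnsSketch {
    target_key: String,
    source_key: Option<String>,
    columns: Vec<(String, Option<String>)>,
    on_missing: Option<String>,
}

/// Hypothetical helper: the source key falls back to the target key.
fn effective_source_key(req: &LoadColumnsSketch) -> &str {
    req.source_key
        .as_deref()
        .unwrap_or(req.target_key.as_str())
}

/// Hypothetical helper: a `None` source column falls back to the target name.
fn effective_columns(req: &LoadColumnsSketch) -> Vec<(String, String)> {
    req.columns
        .iter()
        .map(|(target, source)| {
            let source = source.clone().unwrap_or_else(|| target.clone());
            (target.clone(), source)
        })
        .collect()
}

/// Hypothetical helper: the missing-row policy falls back to "carry".
fn effective_on_missing(req: &LoadColumnsSketch) -> &str {
    req.on_missing.as_deref().unwrap_or("carry")
}

fn main() {
    let req = LoadColumnsSketch {
        target_key: "document_id".to_string(),
        source_key: None,
        columns: vec![
            ("embedding".to_string(), Some("emb".to_string())),
            ("title".to_string(), None),
        ],
        on_missing: None,
    };
    println!("source key: {}", effective_source_key(&req));
    println!("columns:    {:?}", effective_columns(&req));
    println!("on_missing: {}", effective_on_missing(&req));
}
```

Keeping the optional fields as `Option` in the request and resolving them in one place leaves "not set" distinguishable from "explicitly set to the default value".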
/// A trait for anything "table-like". This is used for both native tables (which target
/// Lance datasets) and remote tables (which target LanceDB cloud)
///
@@ -620,6 +647,47 @@ pub trait BaseTable: std::fmt::Display + std::fmt::Debug + Send + Sync {
transforms: NewColumnTransform,
read_columns: Option<Vec<String>>,
) -> Result<AddColumnsResult>;
/// Declare computed columns bound to a registered function: each
/// `(name, sql_type)` is added all-null with the expression stored
/// as its binding; no compute happens here (the server's lazy
/// detector or refresh_column fills them). Several columns map a
/// struct-returning function's fields positionally. Server-backed
/// feature; the default returns NotSupported.
async fn add_computed_columns(
&self,
_columns: &[(String, String)],
_expression: &str,
) -> Result<()> {
Err(Error::NotSupported {
message: "computed columns are not supported by this table".into(),
})
}
/// Trigger recompute of computed columns. The expression is
/// resolved server-side from each column's stored binding; columns
/// bound to the same struct-returning function refresh together.
/// Returns the refresh job id. Server-backed feature (LanceDB
/// Enterprise / Cloud); the default returns NotSupported.
async fn refresh_column(
&self,
_columns: &[String],
_where_clause: Option<String>,
_num_workers: Option<u32>,
_max_workers: Option<u32>,
_batch_size: Option<u32>,
_priority: Option<String>,
) -> Result<String> {
Err(Error::NotSupported {
message: "refresh_column is not supported by this table".into(),
})
}
/// Fill existing columns from an external source by primary-key join
/// (Geneva `load_columns`). Returns the load job id. Server-backed feature;
/// the default returns NotSupported.
async fn load_columns(&self, _request: LoadColumnsRequest) -> Result<String> {
Err(Error::NotSupported {
message: "load_columns is not supported by this table".into(),
})
}
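The three methods above share one pattern: the trait supplies a default body that returns `NotSupported`, so only server-backed tables need to override it. A minimal, synchronous sketch of that pattern is below. The names `SketchError`, `TableLike`, `LocalTable`, and `ServerTable` are hypothetical stand-ins, and the real methods are `async` and take a full `LoadColumnsRequest`.

```rust
/// Illustrative error type; the crate's `Error` has more variants.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotSupported { message: String },
}

/// Stand-in for `BaseTable`, reduced to a single synchronous method.
trait TableLike {
    /// Default body: tables that do not override this report NotSupported.
    fn load_columns(&self, _source_uri: &str) -> Result<String, SketchError> {
        Err(SketchError::NotSupported {
            message: "load_columns is not supported by this table".into(),
        })
    }
}

/// A table with no server behind it: inherits the default.
struct LocalTable;
impl TableLike for LocalTable {}

/// A server-backed table: overrides the default and returns a job id.
struct ServerTable;
impl TableLike for ServerTable {
    fn load_columns(&self, _source_uri: &str) -> Result<String, SketchError> {
        // A real implementation would POST the request and parse the job id.
        Ok("lc-7".to_string())
    }
}

fn main() {
    let tables: Vec<Box<dyn TableLike>> = vec![Box::new(LocalTable), Box::new(ServerTable)];
    for table in &tables {
        match table.load_columns("s3://b/x.parquet") {
            Ok(job_id) => println!("started job {job_id}"),
            Err(SketchError::NotSupported { message }) => println!("skipped: {message}"),
        }
    }
}
```

With this shape, adding a server-only method to the trait does not force every existing implementor to change.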
/// Alter columns in the table.
async fn alter_columns(&self, alterations: &[ColumnAlteration]) -> Result<AlterColumnsResult>;
/// Drop columns from the table.
@@ -1461,6 +1529,48 @@ impl Table {
self.inner.add_columns(transforms, read_columns).await
}
/// Declare computed columns bound to a registered function
/// (`(name, sql_type)` pairs + a `f(args)` expression). No compute
/// happens here. Server-backed feature.
pub async fn add_computed_columns(
&self,
columns: &[(String, String)],
expression: &str,
) -> Result<()> {
self.inner.add_computed_columns(columns, expression).await
}
/// Trigger recompute of computed columns (REFRESH COLUMN). The
/// expression comes from each column's stored binding; columns
/// bound to the same struct-returning function refresh together.
/// Returns the refresh job id. Server-backed feature.
pub async fn refresh_column(
&self,
columns: &[String],
where_clause: Option<String>,
num_workers: Option<u32>,
max_workers: Option<u32>,
batch_size: Option<u32>,
priority: Option<String>,
) -> Result<String> {
self.inner
.refresh_column(
columns,
where_clause,
num_workers,
max_workers,
batch_size,
priority,
)
.await
}
/// Fill existing columns from an external Parquet/Lance/IPC source by
/// primary-key join (Geneva `Table.load_columns()`). Returns the job id.
pub async fn load_columns(&self, request: LoadColumnsRequest) -> Result<String> {
self.inner.load_columns(request).await
}
/// Change a column's name or nullability.
pub async fn alter_columns(
&self,