Commit Graph

1075 Commits

Author SHA1 Message Date
Wyatt Alt
03e895fa5c Merge integration submodule (refresh_column JobHandle / create_mv fix) into lineage 2026-06-29 22:35:34 -07:00
Wyatt Alt
c31e53088e client: slice 4 -- Python lineage surface
- new lineage.py: Lineage / Node / Edge / FunctionRef dataclasses that parse the
  server's lineage JSON, with to_dict(), to_graphviz() (drift edges dashed+red),
  and _repr_html_(); plus .functions() / .stale() helpers.
- Connection.lineage(table, column=, direction=, depth=) (sync + async) calls the
  pyo3 table_lineage binding and deserializes into Lineage.
- Table.lineage(column=, ...) via the table's job connection; MaterializedView /
  AsyncMaterializedView .lineage() delegate to the backing table (the server
  already includes the view's sources + downstream dependents).
- export the new types.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
434a5be187 feat(client): Table.refresh_column returns a JobHandle (like MaterializedView.refresh)
refresh_column returned the bare job-id str, so callers had to wrap it:
db.job(tbl.refresh_column("c")).wait(). Mirror MaterializedView.refresh() and
return a JobHandle directly, so tbl.refresh_column("c").wait() / .status() / .id
work without the wrapper. (db.job(job_id) stays for reconnecting by a stored id.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
78aa005093 client: slice 3 -- thread table_lineage through the remote client + pyo3
A new Database::table_lineage(TableLineageRequest) -> Result<String> threaded
end to end: default not_supported in the trait; the remote impl issues
GET /v1/table/{name}/lineage with column/direction/depth query params and
returns the body verbatim; connection.rs exposes a pub wrapper; the pyo3
binding hands the JSON string to Python.

The lineage payload is carried as opaque JSON on purpose: the open-source
lancedb client must not depend on the sophon-internal derived_jobs crate that
defines the lineage schema, so the wire format is the contract and the Python
layer deserializes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
6191542cfe fix(mv): MaterializedView.refresh calls the async _refresh (underscore)
The sync _refresh_materialized_view called self._conn.refresh_materialized_view
(no underscore); the async method is _refresh_materialized_view, so
MaterializedView.refresh() raised AttributeError. Add the underscore.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
6af3088b91 client: make refresh_materialized_view private (reach it via the handle)
Refresh is a submit-a-job verb, so its only public surface should be
MaterializedView.refresh() / AsyncMaterializedView.refresh() (which return a
job handle). Rename the connection methods to _refresh_materialized_view and
have the handles call that, so the raw by-name refresh is no longer advertised
on the connection. The pyo3 native binding is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
e73d4618d8 fix(mv): create_materialized_view passes query as keyword, not positional
The sync RemoteDBConnection.create_materialized_view assembled the SELECT but
called the async create_materialized_view with the query as the 2nd positional
arg, which binds to `source=` (query= is keyword-only). Every call then failed
the "needs either query= or both source and select" validation. Pass query=query.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
3d92106394 client: split create_view into create_materialized_view; return job handles
- create_materialized_view now takes either query= or source+select (folds in
  the old create_view builder) and returns a MaterializedView handle whose
  .wait() blocks on initial population. create_view is removed -- it was
  misnamed (it built a *materialized* view, while CREATE VIEW means the plain
  non-materialized view the engine also supports).
- MaterializedView.refresh() and the remote Table.refresh_column() now return a
  JobHandle directly, so tbl.refresh_column("c").wait() needs no db.job(...)
  wrapper. db.job(id) is narrowed to reconnect-by-id (stored id / SQL / REST).
- rename View/AsyncView -> MaterializedView/AsyncMaterializedView (+ exports).
- tighten the replace path: only a not-found error on the pre-drop is benign;
  real failures (perms/server) now surface instead of being swallowed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
5810974b37 feat(client): Table.load_columns() REST client for LOAD COLUMNS
Geneva Table.load_columns() parity on the REST-only client. Fills existing
columns from an external Parquet/Lance/IPC source by primary-key join.

- BaseTable::load_columns default (NotSupported) + public Table::load_columns,
  taking a LoadColumnsRequest (source uris/format/storage_options, target/source
  key, (target, source?) column mappings, on_missing, worker/batch/commit knobs).
- Remote impl POSTs to /v1/table/{id}/load_columns with the matching body;
  mock test asserts the request shape.
- PyO3 binding + Python remote Table.load_columns(source, pk, columns, *,
  source_format, source_pk, on_missing, ...) accepting a column list or
  {target: source} dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
8b38500b07 feat(view): full=True force-rebuild on refresh_materialized_view
View.refresh(full=True) (sync + async) now works -- it previously raised
NotImplementedError. Thread the flag through the client: RefreshMaterialized-
ViewRequest.full -> the REST body (RemoteRefreshMaterializedViewRequest.full);
pyo3 refresh_materialized_view(full=...); Connection.refresh_materialized_view(
name, full=) sync + async. A full refresh forces a recompute-and-replace and
preserves the view's indexes (reindexed by the distributed indexer).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
fd0a3b97d0 feat(view): materialized views are first-class indexable + searchable
Add View.create_index / create_scalar_index / create_fts_index / search
as pass-throughs to open_table(name). A materialized view is a real Lance
dataset; these let it be indexed and searched like any other table,
closing the parity gap with Geneva (whose create_materialized_view returns
a first-class Table).

The server-side create_index handler records indexes declared on a view so
they survive a full refresh (which overwrites the dataset, dropping its
indices); that re-apply is wired in the sophon engine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
b9f33ba1c9 feat(refresh): priority as a per-refresh knob; fix batch_size on RemoteTable
Thread priority (Kueue tier) through refresh_column at every layer (Python sync+async
+ RemoteTable -> pyo3 -> Rust client trait/public/remote -> REST body), mirroring
num_workers/batch_size. The function keeps its priority as a default; the per-refresh
value overrides. Also adds the previously-missed batch_size to RemoteTable.refresh_column
(the REST sync path). cargo check (lancedb --features remote --tests, lancedb-python) +
ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
d4f4fef3ba feat(refresh): batch_size is a per-refresh knob (refresh_column), not a function-only option
batch_size / num_workers / max_workers are invocation concerns (how to schedule THIS
refresh), so expose batch_size on refresh_column through every layer (Python sync+async
-> pyo3 -> Rust client -> the REST RefreshColumnRequest.batch_size, which the handler
already forwards into the backfill). num_workers/max_workers were already invocation-
placed; batch_size was the gap. The function may still carry a default; the refresh
override wins (extends the batch_size_override model). Both crates cargo-check clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
fbe6a5a3fd feat(udf): computed columns as expressions -- add_columns(computed={col: fn("input")})
A computed column is an expression over a registered function applied to input
columns, not a UDF coupled to a column. fn("data") already returned the expression
string "fn(data)"; make it a ColumnExpr (a str subclass) that also carries the
function's return type, so add_columns(computed={"vec": embed("data")}) declares the
column with no hand-written type. _normalize_computed handles the new form (and tuple
keys for STRUCT fan-out) and keeps the legacy {col: (sql_type, expression)} tuple.
add_computed_column is deprecated (delegates, with a DeprecationWarning). The function
stays decoupled from columns -- register once, apply anywhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
127054069a feat(mv): partition_by option on create_materialized_view / create_view
Thread an optional partition_by through the client: CreateMaterializedViewRequest
-> REST body -> pyo3 binding -> Python create_materialized_view/create_view
kwarg (sync + async). The server partitions the view's table function by the
named source column -- by IVF index clusters if the column is indexed
(image-dedup), else by distinct value. Unifies Geneva's partition_by +
partition_by_indexed_column into one knob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
b20931b8f7 feat: async UDF client ergonomics (AsyncConnection/AsyncTable + AsyncView/AsyncJobHandle)
Mirrors the sync ergonomics on the async surface: AsyncConnection
create_function(udf, replace=)/create_view/job; AsyncTable.add_computed_column;
AsyncView + AsyncJobHandle (await + asyncio.sleep; shared submission-prefix
matcher with the sync JobHandle). Decorator + REST routes are shared/already
validated; this is the async wrapper layer. Exported from the package root.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
396d68e490 fix: JobHandle resolves the manifest job id from the submission id
db.job(id) gets the submission id the refresh/backfill endpoints return,
but list_jobs / cancel report the agent's manifest id
(<table>-<type>-<first 8 of submission id>). JobHandle now matches that
(exact id or submission prefix) so wait()/progress() truly track, and
cancel() cancels by the resolved canonical id instead of the unusable
submission uuid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
ad37f87387 feat: fold UDF authoring into lancedb (udf module + connection/table ergonomics)
Brings the @udf/@table_udf decorator + type inference into lancedb as
lancedb.udf (Apache-2.0), and adds the ergonomic glue to the existing
connection/table so there's no separate object model:
- create_function() accepts a Udf (and a replace= flag)
- Table.add_computed_column(column, udf)
- create_view(name, source, select, ...) -> View (assembles the SELECT)
- Connection.job(job_id) -> JobHandle
- View / JobHandle are thin references over a connection
Exports udf/table_udf/Udf/JobHandle/View from the package root. The
operations stay the existing remote-only methods (enterprise/cloud); the
decorator works locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
e93476f0e0 feat: explain_refresh_materialized_view over REST (EXPLAIN REFRESH SDK)
Database trait gains explain_refresh_materialized_view (default NotSupported)
returning an MvRefreshPlan; RemoteDatabase POSTs
/v1/materialized_view/{name}/explain_refresh; Connection method; pyo3
MvRefreshPlan pyclass + binding; sync+async python wrappers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
2b41fce033 feat: cancel_job over REST (Database::cancel_job + remote impl + pyo3 + python)
Exposes the existing server-side CANCEL JOB (CoordinatorCatalog::cancel_job)
as a REST-backed SDK method: Database trait default NotSupported,
RemoteDatabase POSTs /v1/job/{id}/cancel, pyo3 binding, sync+async python
wrappers. Best-effort: a missing job returns false, not an error. Mock-HTTP
unit test in test_derived_compute_routes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-29 22:35:34 -07:00
Wyatt Alt
04948fc4f6 feat: computed columns as a param on add_columns
Per the interface design: computed columns are parameters on the
existing add_columns operation, not a separate method.

- BaseTable::add_computed_columns((name, sql_type) pairs + a f(args)
  expression) -- default NotSupported; RemoteTable posts 'computed'
  entries to the existing /v1/table/{id}/add_columns route.
- python add_columns gains computed= on LanceTable, RemoteTable, and
  AsyncTable: tbl.add_columns(computed={'doubled': ('FLOAT',
  'double_it(val)')}); grouped by expression so struct-returning
  functions' columns land adjacently.
2026-06-29 22:35:34 -07:00
Wyatt Alt
ff3c7111b9 feat: SDK surface for functions, materialized views, jobs, refresh_column
Adds the derived-compute interface to the SDK:

- Database trait: create/list/drop_function, create/refresh/alter/
  drop/list_materialized_view, list_jobs -- default implementations
  return Error::NotSupported (NotImplementedError in python), so
  existing Database impls are unaffected; local single-node
  implementations are planned. BaseTable gains refresh_column with
  the same default.
- RemoteDatabase/RemoteTable implement them against the server REST
  routes (/v1/function/*, /v1/materialized_view/*, /v1/job/list,
  /v1/table/{id}/refresh_column), with mock-HTTP unit tests.
- Connection/Table public methods, pyo3 bindings (FunctionInfo,
  MaterializedViewInfo, JobInfo pyclasses), and python wrappers:
  sync on the DBConnection base (shared by local and remote
  connections), async on AsyncConnection; refresh_column on
  LanceTable, RemoteTable, and AsyncTable.
2026-06-29 22:35:34 -07:00
Raphael Malikian
05756f0bbf fix(python): raise clear error when permutation API is used on remote tables (Fixes #2934) (#3591)
Fixes #2934

## Problem
Passing a `RemoteTable` to `permutation_builder()` raises a cryptic
`AttributeError`:
```
AttributeError: 'RemoteTable' object has no attribute '_inner'
```
This leaves users confused about what went wrong and why.

## Root Cause
`PermutationBuilder.__init__()` calls `async_permutation_builder(table)`
which accesses `table._inner` — the underlying Rust Lance table object.
`RemoteTable` connects to LanceDB Cloud/Enterprise and does not have a
local `_inner` attribute, making permutations fundamentally unsupported
on remote tables.

## Solution
Added an early check in `PermutationBuilder.__init__()` that verifies
the table has `_inner` before calling the Rust function, raising a clear
`TypeError` with an explanation of why permutations don't work on remote
tables.

## Verification
- Syntax validated with `ast.parse()`
- Structural verification: single call site (`permutation_builder()`),
guard placed before Rust FFI call
- Error message tested with mock: `MockRemoteTable()` correctly triggers
`TypeError`

## Changelog

| Date | Change | Author |
|------|--------|--------|
| 2026-06-28 | Added remote table guard in PermutationBuilder.__init__ |
rtmalikian |

### Files Changed
- python/python/lancedb/permutation.py — Added `hasattr(table,
"_inner")` check with clear error

---

**About the Author:** Raphael Malikian — Clinical AI Solutions
Architect. I specialise in building and fixing AI/ML systems for
healthcare, including vector databases, RAG pipelines, and clinical NLP.
If you need help with your project or think I can add value to your
organisation, feel free to reach out — I'd love to connect.

📧 rtmalikian@gmail.com
🔗 GitHub: https://github.com/rtmalikian
🔗 LinkedIn:
http://www.linkedin.com/in/raphael-t-malikian-mbbs-bsc-hons-71075436a

---

**Disclosure:** This code was developed with assistance from
deepseek-v4-pro (DeepSeek) via Hermes Agent (Nous Research). All changes
were reviewed, tested against the actual codebase, and verified for
correctness.

Signed-off-by: rtmalikian <rtmalikian@gmail.com>
2026-06-29 16:36:01 -07:00
Jack Ye
39e819b6a7 feat(python): expose OAuth connection config (#3586)
Expose the merged Rust OAuth header provider through the Python async
connection path.

Includes:
- Python OAuthConfig and OAuthFlowType public config objects
- PyO3 conversion into the Rust OAuthConfig
- connect_async(oauth_config=...) plumbing
- repr redaction coverage for client_secret

Local validation: cargo fmt --all; ruff format/check on touched Python
files.
2026-06-29 12:36:35 -07:00
Lance Release
3878adc6dc Bump version: 0.34.0-beta.3 → 0.34.0-beta.4 2026-06-29 11:11:05 +00:00
Ryan Green
8a5cd74e48 fix: ensure read freshness provider is built into namespace client (#3571)
By default the read freshness provider was not included in the namespace
client, preventing the read freshness headers from being included in the
request. This prevents checkout_latest() from working as expected when
using the namespace client.

This fix ensures the provided is built into the client when the
namespace impl and properties are provided.
2026-06-25 21:47:55 -07:00
Lance Release
8718345229 Bump version: 0.34.0-beta.2 → 0.34.0-beta.3 2026-06-25 01:53:51 +00:00
Raphael Malikian
0ba70d96c3 fix: add missing stacklevel=2 to warnings.warn() and fix broken message concatenation (Fixes #3563) (#3564)
Fixes #3563

## Summary

- Add `stacklevel=2` to 10 `warnings.warn()` calls across 4 files
- Fix broken message concatenation in `table.py` where the second string
was incorrectly passed as the `category` parameter

## Problem

Multiple `warnings.warn()` calls in the `python/lancedb/` codebase were
missing the `stacklevel` parameter. Without `stacklevel=2`, warnings
point to library internals instead of the caller's code, making it
impossible for users to identify which of their function calls triggered
the warning.

Additionally, two calls in `table.py` (lines 3411 and 3420) had a more
serious bug: the deprecation message was split across two separate
string arguments, causing the second string to be passed as the
`category` parameter instead of being concatenated with the first
string. This would cause `TypeError` when the warning was triggered.

## Changes

| File | Fixes | Description |
|------|-------|-------------|
| `embeddings/colpali.py` | 1 | Add `stacklevel=2` to
`use_token_pooling` deprecation warning |
| `remote/db.py` | 3 | Add `stacklevel=2` to `request_thread_pool`,
`connection_timeout`, `read_timeout` deprecation warnings |
| `remote/table.py` | 3 | Add `stacklevel=2` to `cleanup_old_versions`,
`compact_files`, `optimize` no-op warnings |
| `table.py` | 3 | Fix broken message concatenation for
`data_storage_version` and `enable_v2_manifest_paths` deprecation
warnings + add `stacklevel=2` to `retrain` deprecation warning |

## Verification

```python
# All warnings.warn() calls now have stacklevel
python3 -c "import ast, os; ..."
# Result: All warnings.warn() calls now have stacklevel!
```

## Changelog

| Date | Change | Author |
|------|--------|--------|
| 2026-06-20 | Fix missing stacklevel=2 in 10 warnings.warn() calls +
fix broken message concatenation | rtmalikian |

### Files Changed
- `python/python/lancedb/embeddings/colpali.py` — Add stacklevel=2
- `python/python/lancedb/remote/db.py` — Add stacklevel=2 to 3
deprecation warnings
- `python/python/lancedb/remote/table.py` — Add stacklevel=2 to 3 no-op
warnings
- `python/python/lancedb/table.py` — Fix broken message concatenation +
add stacklevel=2

### Verification
- AST-based audit confirms all `warnings.warn()` calls now include
`stacklevel=2`
- Syntax check passes for all 4 modified files

---

**About the Author:** Raphael Malikian — Clinical AI Solutions
Architect. I specialise in building and fixing AI/ML systems for
healthcare, including vector databases, RAG pipelines, and clinical NLP.
If you need help with your project or think I can add value to your
organisation, feel free to reach out — I'd love to connect.

📧 rtmalikian@gmail.com
🔗 GitHub: https://github.com/rtmalikian
🔗 LinkedIn:
http://www.linkedin.com/in/raphael-t-malikian-mbbs-bsc-hons-71075436a

---

**Disclosure:** This code was developed with assistance from **Hermes
Agent** (Nous Research). All changes were reviewed, tested against the
actual codebase, and verified for correctness.

Signed-off-by: rtmalikian <rtmalikian@gmail.com>
2026-06-23 13:42:59 -07:00
Lance Release
26481a4b74 Bump version: 0.34.0-beta.1 → 0.34.0-beta.2 2026-06-23 16:21:52 +00:00
Will Jones
85d870b397 fix: parse RFC 3339 created_at and improve IndexConfig repr (#3558)
The server now serializes an index's `created_at` as an RFC 3339 string
(e.g. `"2026-06-18T21:37:36.637Z"`), but the client deserializer only
accepted a unix timestamp in milliseconds. This caused `list_indices` to
fail with:

```
Failed to parse list_indices response: invalid type: string "2026-06-18T21:37:36.637Z", expected a unix timestamp in milliseconds
```

This PR replaces the fixed millisecond deserializer with a custom one
that accepts both an RFC 3339 string (current server) and a
unix-millisecond integer (legacy deployments), so the client works
against any server version.

It also improves the `IndexConfig` repr in the Python bindings.
Previously it printed only three fields (`Index(FTS, columns=["text"],
name="text_idx")`), hiding the metadata that `list_indices` returns. It
now renders every populated field, omitting any that are `None`. Each
value is valid Python — integer counts use `_` thousands separators and
`created_at` uses the `datetime` repr — so values round-trip. The real
repr is a single line; it's wrapped here for readability:

```python
>>> table.list_indices()
[IndexConfig(
    name="text_idx",
    index_type="FTS",
    columns=["text"],
    index_uuid="aefd3e00-2f95-4bdc-92ac-06de84442bf1",
    type_url="/lance.table.InvertedIndexDetails",
    created_at=datetime.datetime(2026, 6, 18, 21, 37, 36, 637000, tzinfo=datetime.timezone.utc),
    num_indexed_rows=2,
    size_bytes=3_669,
    num_segments=1,
    index_version=1,
    index_details={
        'lance_tokenizer': None,
        'base_tokenizer': 'simple',
        'language': 'English',
        'with_position': False,
        'max_token_length': 40,
        'lower_case': True,
        'stem': True,
        'remove_stop_words': True,
        'custom_stop_words': None,
        'ascii_folding': True,
        'min_ngram_length': 3,
        'max_ngram_length': 3,
        'prefix_only': False,
    },
)]
```

Fixes #3556

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 10:40:56 -07:00
Lance Release
3b279f5705 Bump version: 0.34.0-beta.0 → 0.34.0-beta.1 2026-06-19 15:59:43 +00:00
Ryan Green
e1334954d7 fix: overflow using sys.maxsize for k in query with namespace connection (#3561) 2026-06-19 12:57:10 -02:30
Lance Release
4f4cce3f64 Bump version: 0.33.1-beta.2 → 0.34.0-beta.0 2026-06-18 18:42:07 +00:00
Will Jones
ce5dadd386 fix(ci): allow shell pre-commit hooks in bumpversion configs (#3554)
The "Create release commit" workflow (`make-release-commit.yml`) has
failed on its last two runs; no release tags have been created since
June 4. Since this workflow creates the tag that the cargo/npm/pypi/java
publish workflows trigger off of, all recent releases are effectively
blocked.

The workflow installs `bump-my-version` unpinned. Version `1.4.0` added
a check that refuses to run `pre_commit_hooks` containing shell syntax
(pipes, `&&`, `if`, variable expansion) unless `allow_shell_hooks =
true` is set. Both bumpversion configs use such hooks:

- `python/.bumpversion.toml` — updates `Cargo.lock` after the bump
(fails first)
- `.bumpversion.toml` — runs `mvn versions:set` for the Java packages

The job dies at the version-bump step with:

> Hook '…' contains shell syntax (pipes, redirects, or variable
expansion). Set `allow_shell_hooks = true` in your configuration to
enable shell execution…

This sets `allow_shell_hooks = true` in both configs to restore the
previous behavior.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-17 15:22:05 -07:00
whitewooood
217fd8491d fix(python): clarify single dictionary input error (#3537)
## Summary
- clarify the Python error for passing a single dictionary to table
creation/add paths
- add a regression test for `create_table(..., data=dict)` so it points
users to a list of dictionaries

Fixes #409

## Testing
- `python -m pytest python/tests/test_table.py -q`
- `python -m ruff format python/lancedb/table.py
python/lancedb/scannable.py python/tests/test_table.py`
- `python -m ruff check python/lancedb/table.py
python/lancedb/scannable.py python/tests/test_table.py`
2026-06-17 12:55:55 -07:00
JSap0914
9128dbcd7a fix(util): escape single quotes in struct field names in value_to_sql (#3548)
### Bug

`value_to_sql({...})` builds a DataFusion `named_struct(...)` literal
but interpolates the struct field names directly as `f"'{k}'"`. A field
name that contains a single quote therefore produces invalid SQL:

```python
>>> from lancedb.util import value_to_sql
>>> value_to_sql({"it's": 1})
"named_struct('it's', 1)"        # invalid SQL — the quote terminates the literal
```

String *values* are already escaped (single quotes doubled) by the `str`
branch of `value_to_sql`, so keys and values were handled
inconsistently. This affects `Table.update(values={...})` /
`merge_insert` when a struct column has a field name containing `'`.

### Fix

Render the key through `value_to_sql(str(k))` so field names are escaped
exactly like string values:

```python
>>> value_to_sql({"it's": 1})
"named_struct('it''s', 1)"
```

Keys without special characters are unchanged (`'a'` stays `'a'`), so
existing behavior is preserved.

### Verification

```
$ pytest python/tests/test_util.py -k value_to_sql_dict
```

The new `test_value_to_sql_dict_key_escaping` covers quoted keys (incl.
nested structs) and fails on `main` (`named_struct('it's', 1)`), passes
with this change; the existing `test_value_to_sql_dict` still passes.

Co-authored-by: JSap0914 <JSap0914@users.noreply.github.com>
2026-06-17 12:55:43 -07:00
Armaan Sandhu
b2ae763254 fix(python): raise clear TypeError for bare List/Tuple in pydantic schema conversion (#3511)
Closes #3502

## Problem

A bare, unparameterised `typing.List` / `typing.Tuple` field crashes
`to_arrow_schema` with an opaque `AttributeError: __args__`:

```python
from typing import Tuple
from lancedb.pydantic import LanceModel

class Doc(LanceModel):
    items: Tuple

Doc.to_arrow_schema()  # AttributeError: __args__
```

In `_py_type_to_arrow_type`, the branch `elif getattr(py_type,
"__origin__", None) in (list, tuple)` is taken for a bare generic (its
`__origin__` is `list / tuple`), but the next line reads
`py_type.__args__[0]`, and a bare generic has no `__args__`. Other
unsupported types (e.g. `Dict[str, int]`) correctly raise a clear
`TypeError`, so this case is inconsistent.

Fix

Guard the element-type lookup with `getattr(py_type, "__args__", None)`
and raise a clear `TypeError` when it is missing, matching the existing
behavior for other unsupported types. Bare builtin list / tuple are
unaffected (their `__origin__` is `None`, so they already fall through
to the existing `TypeError`).

Testing

- Added `test_bare_generic_raises_type_error` covering both `List` and
`Tuple`.
- ruff format and ruff check clean.
2026-06-17 11:58:48 -07:00
Brendan Clement
f76b075d13 feat: add table branch support to remote tables and Python/TS bindings (#3540)
### Description
Adding branch support for RemoteTable by threading a branch selector
onto every operation the data plane accepts it on. Exposes the
currentBranch to nodejs and python through the bindings.

Matching the server handlers, the branch rides as:
- a `?branch=` query parameter for Arrow-body and query-only ops
(insert, merge_insert, multipart_*, version/list, drop_index)
- a `branch` field in the JSON body for everything else (count_rows,
query, update, delete, create_index, column ops, index list/stats,
stats, restore, describe, tags create/update)

A main-branch handle (`branch == None`) produces byte-identical requests
to before: no `branch` field and no `?branch=`

- Handle-per-branch: `create_branch` / `checkout_branch` return a new
handle with fresh caches and reset version/freshness state, mirroring
`NativeTable`.
- `create_branch` maps 409 to already-exists, 400 to invalid, and 404 to
not-found with source context, and sends without retry so the 409 stays
observable.
- `Ref` translation covers version, version-number (relative to the
handle's branch), and tag (resolved via the tags endpoint); `"main"` and
empty normalize to the main branch.
- Python branch handles persist their branch (and pinned version) across
pickle/fork, so a forked or pickled handle reopens on its branch rather
than silently reverting to main.

### Tests

- Rust mock tests per op category (query-param and body mechanisms,
branch CRUD, error paths, backward-compat).
- Python sync branch CRUD, `open_table(branch=)`, and a pickle
round-trip regression test.
2026-06-15 18:07:40 -04:00
Will Jones
f8caef3aca feat(bindings): expose new IndexConfig fields in Python and Node.js (#3534)
## Summary

Surfaces the rich per-index metadata added in #3497 to the Python and
Node.js language bindings. Closes #3495.

New optional fields exposed on `IndexConfig` in both bindings:

- `index_uuid` / `indexUuid` — UUID of the first index segment
- `type_url` / `typeUrl` — protobuf type URL for the index
- `created_at` / `createdAt` — creation timestamp (milliseconds since
Unix epoch)
- `num_indexed_rows` / `numIndexedRows` — rows covered by the index
- `num_unindexed_rows` / `numUnindexedRows` — rows not yet indexed
- `size_bytes` / `sizeBytes` — total index file size in bytes
- `num_segments` / `numSegments` — number of index segments
- `index_version` / `indexVersion` — on-disk format version
- `index_details` / `indexDetails` — type-specific JSON details string

All fields are `None`/`undefined` for remote tables (which don't yet
surface this metadata through the server response).

## Changes

- `python/src/index.rs`: extend `IndexConfig` pyclass; update `From`
impl; update `__getitem__`
- `python/python/lancedb/_lancedb.pyi`: add type hints for new fields
- `python/python/tests/test_table.py`: new `test_index_config_fields`
test
- `nodejs/src/table.rs`: extend `IndexConfig` napi struct; update `From`
impl
- `nodejs/__test__/table.test.ts`: new test; update existing `toEqual`
assertions to `expect.objectContaining` to accommodate new fields

## Test plan

- [x] Python: `uv run --extra tests pytest
python/tests/test_table.py::test_index_config_fields`
- [x] Node.js: `pnpm test __test__/table.test.ts`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-11 13:37:39 -07:00
nuthalapativarun
40f3e22600 feat: support rename_table on LanceNamespaceDatabase (#3520)
## Summary

Closes #3412

Implements `rename_table` for `LanceNamespaceDatabase` (sync and async
Python) and the Rust `NamespaceDatabase` backend. Previously these
raised `NotImplementedError`; this PR delegates to the
`LanceNamespace.rename_table` method which is part of the
lance-namespace spec.

### Changes

- **`rust/lancedb/src/database/namespace.rs`**: Remove the
`NotImplementedError` stub for `rename_table`. Build a
`RenameTableRequest` (with `id`, `new_table_name`, and optionally
`new_namespace_id`) and call `self.namespace.rename_table(...)`,
mirroring the existing `drop_table` pattern.

- **`python/python/lancedb/namespace.py`**: Import `RenameTableRequest`
from `lance_namespace`. Replace the `raise NotImplementedError` in both
`LanceNamespaceDatabase.rename_table` (sync) and
`AsyncLanceNamespaceDatabase.rename_table` (async) with a call to
`self._namespace_client.rename_table(request)`.

- **`python/python/tests/test_namespace.py`**: Replace the
`test_rename_table_not_supported` test (which checked for
`NotImplementedError`) with `test_rename_table`, which:
  1. Creates a table in a namespace
2. Calls `rename_table` with `cur_namespace_path` and
`new_namespace_path`
  3. Asserts the old name is gone from `table_names()`
  4. Asserts the new name appears in `table_names()`
  5. Verifies the renamed table can be opened

## Test plan

- [ ] Existing namespace tests pass in CI (all rely on
`lance.namespace.DirectoryNamespace` which requires the full lance
package)
- [ ] `test_rename_table` exercises the full rename path: create →
rename → verify old gone → verify new present → open
- [ ] Rust build passes with the updated `namespace.rs` (requires Rust
toolchain in CI)
2026-06-11 11:41:07 -07:00
nuthalapativarun
04480c274a test(python): add nested field regression matrix tests (#3518)
## Summary

Closes #3406

Add a regression matrix in `python/python/tests/test_nested_fields.py`
that exercises the full nested field index lifecycle for both the sync
and async Python table APIs. The tests will fail if any implementation
regresses to leaf-only field names in `list_indices`, `index_stats`,
search, or filter results.

## Test scenarios covered

**Index types:** BTree scalar, IvfPq vector, FTS

**Field-name edge cases (per acceptance criteria):**
- `rowId` — camelCase top-level field
- `` `row-id` `` — hyphenated top-level field (escaped)
- `parent.`\``leaf.name`\`` ` — struct leaf whose name contains a
literal dot
- `MetaData.userId` — mixed-case nested path
- `` `meta-data`.`user-id` `` — hyphenated struct with hyphenated leaf

**Lifecycle operations per index type:**
- `create_index` / `create_scalar_index` / `create_fts_index`
- `list_indices` → verify canonical full dotted path (not leaf name)
- `index_stats` → verify row count and index type
- Filtered scan (`WHERE nested.field = value`)
- Vector search via nested embedding column
- FTS search via nested text column
- `add` (append) then re-check index listing
- `optimize` then re-check index listing

**Both sync and async APIs** are covered in parallel test classes.

## Notes

Lance forbids top-level field names that contain a literal `.`, so the
`` `a.b` `` acceptance-criterion variant is exercised as a *struct leaf*
field (`parent.`\``leaf.name`\``) rather than a top-level column.
2026-06-11 08:06:04 -07:00
Trenton H
ae7f2cbfe8 feat(python): accept Expr in Table.delete and merge when_not_matched_by_source_delete (#3524)
Another little pain point as I was working to integrate with
paperless-ngx. The read path of table.search() or table.query() already
accepted an Expr, but write paths Table.delete and
merge_insert(...).when_not_matched_by_source_delete did not. This PR
attempts to close that gap, so writes and reads can both use Expr,
instead of one side needing to build a string.
2026-06-11 07:59:49 -07:00
Trenton H
85d9c1ce63 feat: adds isin support to the 'Expr' builder (#3523)
The `Expr` build already includes a lot of useful filtering options,
`eq, ne, gt/gte, lt/lte, and_, or_, contains, cast`, but is was missing
a membership like `isin`. This PR adds that support, as minimally as
possible, allowing easy filtering for membership in a list, without
needing to be a series of `where` expressions.

I didn't see anything in CONTRIBUTING.md about needing a feature request
or issue first, so I just made the change. My apologies if I missed that
somewhere.

Thanks for the vector store, we're using it now in paperless-ngx.
2026-06-10 15:28:19 -07:00
Jack Ye
8373318e89 feat: support FM-Index scalar index for substring search (#3532)
Adds an FM-Index — a scalar index over string and binary columns that
accelerates substring search (`contains(col, 'needle')`), distinct from
the tokenized `FTS` index — across the Rust core and the Python and
TypeScript bindings.

## Rust

- `Index::Fm(FmIndexBuilder)` and `IndexType::Fm`.
- `make_index_params` maps `Index::Fm` to Lance's
`ScalarIndexParams::for_builtin(BuiltinIndexType::Fm)`.
- `supported_fm_data_type` validates
`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` columns.
- `list_indices` round-trips the type (`"Fm"` → `IndexType::Fm`); the
remote wire type is `"FM"`.

## Python

Adds `lancedb.index.Fm`, accepted by `create_index`:

```python
from lancedb.index import Fm

await tbl.create_index("text", config=Fm())
```

## TypeScript

Adds the `Index.fm()` factory:

```ts
await tbl.createIndex("text", { config: Index.fm() });
```
2026-06-10 12:28:20 -07:00
Xuanwo
566b67a634 fix: support LargeList label list indexes (#3529)
## Summary

This PR extends nested-field regression coverage across Rust
local/remote, Python sync/async, and Node so canonical escaped paths
stay consistent across scalar, vector, and FTS index lifecycle behavior.

It also aligns LanceDB's LabelList type gate with Lance by accepting
`LargeList<primitive>` columns while keeping `List<Struct<...>>`
unsupported until Lance defines stable membership semantics for struct
labels.

Part of #3406.
2026-06-10 23:53:56 +08:00
devteamaegis
f260d3bf12 fix(util): convert numpy scalars in value_to_sql (#3522)
## What's broken

`Table.update(values={...})` raises `NotImplementedError: SQL conversion
is not implemented for this type` when a value is a numpy scalar such as
`np.int64`, `np.int32`, `np.float32`, or `np.bool_`. These arise
naturally from indexing an ndarray or a pandas int/bool column.
`np.float64` happens to work (it subclasses `float`), which makes the
failure inconsistent and surprising.

```python
df = pd.DataFrame({"id": np.array([10, 20], dtype="int32")})
t.update(where="id = 1", values={"id": df["id"].iloc[0]})   # np.int32
# -> NotImplementedError: SQL conversion is not implemented for this type
```

## Why it happens

`value_to_sql` is a `singledispatch` with handlers only for native
Python types and `np.ndarray`; numpy `integer`/`floating`/`bool_`
scalars aren't Python subclasses, so they fall through to the
`NotImplementedError` base.

## Fix

Register handlers for `np.bool_`, `np.integer`, and `np.floating` that
delegate to the existing native handlers.

## Test

`value_to_sql` on `np.int32/int64/float32/float64/bool_` all convert;
`np.int32` raised before.

Co-authored-by: Ishaan Samantray <ishaansamantray@Ishaans-MacBook-Pro.local>
2026-06-09 15:57:02 -07:00
Brendan Clement
d9018067b3 feat: support checking out a version on a branch (#3504)
### Description

Stacked on #3490. Adds an optional version to branch checkout across the
Rust core and the Python and TypeScript SDKs, so you can open a specific
version on a branch ("version V of branch B"), not just the branch's
latest version

Rust

```rust
// Open version 3 of branch "exp" (a read-only view): check out from an
// existing table, or open it directly from the connection.
let exp_v3 = table.checkout_branch("exp", Some(3)).await?;
let exp_v3 = db.open_table("items").branch("exp").version(3).execute().await?;
// checkout_latest re-attaches to the branch's writable HEAD.
exp_v3.checkout_latest().await?;

// With no branch, a version opens main at that version.
let main_v3 = db.open_table("items").version(3).execute().await?;
```

Python

```python
# Open version 3 of branch "exp" (a read-only view): check out from an
# existing table, or open it directly from the connection.
branch_v3 = await table.branches.checkout("exp", version=3)
branch_v3 = await db.open_table("items", branch="exp", version=3)
# checkout_latest re-attaches to the branch's writable HEAD.
await branch_v3.checkout_latest()

# With no branch, a version opens main at that version.
main_v3 = await db.open_table("items", version=3)
```

TypeScript

```typescript
// Open version 3 of branch "exp" (a read-only view): check out from an
// existing table, or open it directly from the connection.
const branchV3 = await (await table.branches()).checkout("exp", 3);
const opened = await db.openTable("items", undefined, { branch: "exp", version: 3 });
// checkoutLatest re-attaches to the branch's writable HEAD.
await branchV3.checkoutLatest();

// With no branch, a version opens main at that version.
const mainV3 = await db.openTable("items", undefined, { version: 3 });
```

### Testing
- Added unit tests (Rust, Python sync + async, TypeScript):
branch-scoped resolution at a version number shared with `main` and with
another branch, read-only enforcement on a pinned handle,
`checkout_latest` recovery to the branch's HEAD, fork-point reads, and
the nonexistent-version/branch error paths.
- Ran smoke tests against the Python and TypeScript SDKs on local
machine.
2026-06-08 17:36:38 -07:00
Brendan Clement
53517b3aaa feat: add table branch support (#3490)
### Description

Adds first-class support for table branches across the Rust core and the
Python and TypeScript SDKs.

Rust

```rust
use lance::dataset::refs::Ref;

// Create a branch from main and write to it — main is untouched.
let exp = table.create_branch("exp", Ref::Version(None, None)).await?;
exp.add(batches).await?;

// Reopen the branch later: check out from a table, or open it directly.
let exp = table.checkout_branch("exp").await?;
let exp = db.open_table("items").branch("exp").execute().await?;

let branches = table.list_branches().await?;
table.delete_branch("exp").await?;
```

Python

```python
# Create a branch from main and write to it
branch = await table.branches.create("exp", from_ref="main")
await branch.add(data)

# Reopen the branch later: check out from a table, or open it directly.
branch = await table.branches.checkout("exp")
branch = await db.open_table("items", branch="exp")

await table.branches.list()
await table.branches.delete("exp")
```

TypeScript

```typescript
const branches = await table.branches();

// Create a branch from main and write to it
const branch = await branches.create("exp");
await branch.add(data);

// Reopen the branch later: check out from a table, or open it directly.
const checkedOut = await branches.checkout("exp");
const opened = await db.openTable("items", undefined, { branch: "exp" });

await branches.list();
await branches.delete("exp");
```

### Testing
- Added unit tests
- ran smoke tests against python and typescript sdks on local machine


### Next steps
- Add RemoteTable support
- Add Branch Comparison support
- Merge Branching support
2026-06-08 16:26:46 -07:00
Yang Cen
3e25f584eb fix(python): push down namespace full reads (#3516)
## Bug Fix

### What is the bug?

Namespace-backed `LanceTable.to_arrow()` full-table reads bypassed the
existing `QueryTable` server-side query path and called the lower-level
table `to_arrow()` implementation directly. In Geneva/Sophon this could
fail while parsing the Arrow IPC response for
`hist.get_table().to_arrow()` / `to_pandas()`, even though
`hist.get_table().search().to_arrow()` worked.

### What issues or incorrect behavior does the bug cause?

Full-table reads on namespace-backed tables with `QueryTable` pushdown
could fail with Arrow IPC parse errors, while query/search reads on the
same table succeeded. Since `to_pandas()` delegates through `to_arrow()`
for non-blob/native cases, pandas export was affected too.

### How does this PR fix the problem?

When `QueryTable` pushdown is enabled, sync and async table `to_arrow()`
now construct a plain no-filter, no-limit, all-columns query and execute
it through the table-level `_execute_query()` path. `AsyncTable` now
preserves namespace context from async namespace connections so async
full reads can make the same pushdown decision. Non-namespace tables and
namespace tables without `QueryTable` pushdown keep their existing
behavior.

### Tests

- `uv run --extra tests --extra dev --no-sync ruff check
python/lancedb/table.py python/lancedb/namespace.py
python/tests/test_namespace.py`
- `uv run --extra tests --extra dev --no-sync ruff format
python/lancedb/table.py python/lancedb/namespace.py
python/tests/test_namespace.py`
- `uv run --extra tests --extra dev --no-sync pytest
python/tests/test_namespace.py::TestPushdownOperations::test_lance_table_to_arrow_uses_query_pushdown
python/tests/test_namespace.py::TestAsyncPushdownOperations::test_async_table_to_arrow_uses_query_pushdown
python/tests/test_namespace.py::test_local_table_to_arrow_and_to_pandas_are_unchanged
-q`
- `uv run --extra tests --extra dev --no-sync pytest
python/tests/test_namespace.py -q`
2026-06-08 19:48:40 +08:00
Will Jones
09b1bbc12a refactor!: drop unused loss field from IndexStatistics (#3496)
BREAKING CHANGE: direct Rust users lose the `IndexStatistics::loss`
field. Python and Node.js consumers are unaffected in practice for
remote tables (the value was always `None`/absent), but the attribute is
gone for local tables too.

`IndexStatistics::loss` was local-only — LanceDB Cloud never returned
it, so
`RemoteTable::index_stats` always set `loss: None`. It's vestigial; this
removes it.

- Remove `loss` from `IndexStatistics` and the internal `IndexMetadata`
in `rust/lancedb/src/index.rs`, plus the summing logic in
`NativeTable::index_stats`.
- Drop `loss` from the Python and Node.js bindings (and their
tests/docs).

Fixes #3493

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:52:40 -07:00