lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-07-01 01:50:39 +00:00

Author	SHA1	Message	Date
Wyatt Alt	03e895fa5c	Merge integration submodule (refresh_column JobHandle / create_mv fix) into lineage	2026-06-29 22:35:34 -07:00
Wyatt Alt	c31e53088e	client: slice 4 -- Python lineage surface - new lineage.py: Lineage / Node / Edge / FunctionRef dataclasses that parse the server's lineage JSON, with to_dict(), to_graphviz() (drift edges dashed+red), and _repr_html_(); plus .functions() / .stale() helpers. - Connection.lineage(table, column=, direction=, depth=) (sync + async) calls the pyo3 table_lineage binding and deserializes into Lineage. - Table.lineage(column=, ...) via the table's job connection; MaterializedView / AsyncMaterializedView .lineage() delegate to the backing table (the server already includes the view's sources + downstream dependents). - export the new types. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	434a5be187	feat(client): Table.refresh_column returns a JobHandle (like MaterializedView.refresh) refresh_column returned the bare job-id str, so callers had to wrap it: db.job(tbl.refresh_column("c")).wait(). Mirror MaterializedView.refresh() and return a JobHandle directly, so tbl.refresh_column("c").wait() / .status() / .id work without the wrapper. (db.job(job_id) stays for reconnecting by a stored id.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	78aa005093	client: slice 3 -- thread table_lineage through the remote client + pyo3 A new Database::table_lineage(TableLineageRequest) -> Result<String> threaded end to end: default not_supported in the trait; the remote impl issues GET /v1/table/{name}/lineage with column/direction/depth query params and returns the body verbatim; connection.rs exposes a pub wrapper; the pyo3 binding hands the JSON string to Python. The lineage payload is carried as opaque JSON on purpose: the open-source lancedb client must not depend on the sophon-internal derived_jobs crate that defines the lineage schema, so the wire format is the contract and the Python layer deserializes it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	6191542cfe	fix(mv): MaterializedView.refresh calls the async _refresh (underscore) The sync _refresh_materialized_view called self._conn.refresh_materialized_view (no underscore); the async method is _refresh_materialized_view, so MaterializedView.refresh() raised AttributeError. Add the underscore. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	6af3088b91	client: make refresh_materialized_view private (reach it via the handle) Refresh is a submit-a-job verb, so its only public surface should be MaterializedView.refresh() / AsyncMaterializedView.refresh() (which return a job handle). Rename the connection methods to _refresh_materialized_view and have the handles call that, so the raw by-name refresh is no longer advertised on the connection. The pyo3 native binding is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	e73d4618d8	fix(mv): create_materialized_view passes query as keyword, not positional The sync RemoteDBConnection.create_materialized_view assembled the SELECT but called the async create_materialized_view with the query as the 2nd positional arg, which binds to `source=` (query= is keyword-only). Every call then failed the "needs either query= or both source and select" validation. Pass query=query. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	3d92106394	client: split create_view into create_materialized_view; return job handles - create_materialized_view now takes either query= or source+select (folds in the old create_view builder) and returns a MaterializedView handle whose .wait() blocks on initial population. create_view is removed -- it was misnamed (it built a materialized view, while CREATE VIEW means the plain non-materialized view the engine also supports). - MaterializedView.refresh() and the remote Table.refresh_column() now return a JobHandle directly, so tbl.refresh_column("c").wait() needs no db.job(...) wrapper. db.job(id) is narrowed to reconnect-by-id (stored id / SQL / REST). - rename View/AsyncView -> MaterializedView/AsyncMaterializedView (+ exports). - tighten the replace path: only a not-found error on the pre-drop is benign; real failures (perms/server) now surface instead of being swallowed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	5810974b37	feat(client): Table.load_columns() REST client for LOAD COLUMNS Geneva Table.load_columns() parity on the REST-only client. Fills existing columns from an external Parquet/Lance/IPC source by primary-key join. - BaseTable::load_columns default (NotSupported) + public Table::load_columns, taking a LoadColumnsRequest (source uris/format/storage_options, target/source key, (target, source?) column mappings, on_missing, worker/batch/commit knobs). - Remote impl POSTs to /v1/table/{id}/load_columns with the matching body; mock test asserts the request shape. - PyO3 binding + Python remote Table.load_columns(source, pk, columns, *, source_format, source_pk, on_missing, ...) accepting a column list or {target: source} dict. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	8b38500b07	feat(view): full=True force-rebuild on refresh_materialized_view View.refresh(full=True) (sync + async) now works -- it previously raised NotImplementedError. Thread the flag through the client: RefreshMaterializedViewRequest.full -> the REST body (RemoteRefreshMaterializedViewRequest.full); pyo3 refresh_materialized_view(full=...); Connection.refresh_materialized_view(name, full=) sync + async. A full refresh forces a recompute-and-replace and preserves the view's indexes (reindexed by the distributed indexer). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	fd0a3b97d0	feat(view): materialized views are first-class indexable + searchable Add View.create_index / create_scalar_index / create_fts_index / search as pass-throughs to open_table(name). A materialized view is a real Lance dataset; these let it be indexed and searched like any other table, closing the parity gap with Geneva (whose create_materialized_view returns a first-class Table). The server-side create_index handler records indexes declared on a view so they survive a full refresh (which overwrites the dataset, dropping its indices); that re-apply is wired in the sophon engine. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	b9f33ba1c9	feat(refresh): priority as a per-refresh knob; fix batch_size on RemoteTable Thread priority (Kueue tier) through refresh_column at every layer (Python sync+async + RemoteTable -> pyo3 -> Rust client trait/public/remote -> REST body), mirroring num_workers/batch_size. The function keeps its priority as a default; the per-refresh value overrides. Also adds the previously-missed batch_size to RemoteTable.refresh_column (the REST sync path). cargo check (lancedb --features remote --tests, lancedb-python) + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	d4f4fef3ba	feat(refresh): batch_size is a per-refresh knob (refresh_column), not a function-only option batch_size / num_workers / max_workers are invocation concerns (how to schedule THIS refresh), so expose batch_size on refresh_column through every layer (Python sync+async -> pyo3 -> Rust client -> the REST RefreshColumnRequest.batch_size, which the handler already forwards into the backfill). num_workers/max_workers were already invocation- placed; batch_size was the gap. The function may still carry a default; the refresh override wins (extends the batch_size_override model). Both crates cargo-check clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	fbe6a5a3fd	feat(udf): computed columns as expressions -- add_columns(computed={col: fn("input")}) A computed column is an expression over a registered function applied to input columns, not a UDF coupled to a column. fn("data") already returned the expression string "fn(data)"; make it a ColumnExpr (a str subclass) that also carries the function's return type, so add_columns(computed={"vec": embed("data")}) declares the column with no hand-written type. _normalize_computed handles the new form (and tuple keys for STRUCT fan-out) and keeps the legacy {col: (sql_type, expression)} tuple. add_computed_column is deprecated (delegates, with a DeprecationWarning). The function stays decoupled from columns -- register once, apply anywhere. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
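The "expression that carries its return type" idea in the commit above can be sketched in plain Python. This is a minimal illustration, not the lancedb implementation: `ColumnExpr` here is a stand-in class, and `make_function` and `normalize_computed` are hypothetical helpers written for this example. Tuple keys for STRUCT fan-out are not covered.

```python
class ColumnExpr(str):
    """A SQL expression string that also remembers the function's return type."""

    def __new__(cls, expression: str, return_type: str) -> "ColumnExpr":
        obj = super().__new__(cls, expression)
        obj.return_type = return_type
        return obj


def make_function(name: str, return_type: str):
    """Hypothetical helper: build a callable that renders `name(col, ...)`."""

    def call(*columns: str) -> ColumnExpr:
        return ColumnExpr(f"{name}({', '.join(columns)})", return_type)

    return call


def normalize_computed(computed: dict) -> dict:
    """Map each column to (sql_type, expression), accepting both spec forms."""
    normalized = {}
    for column, spec in computed.items():
        if isinstance(spec, ColumnExpr):
            # New form: the type travels with the expression.
            normalized[column] = (spec.return_type, str(spec))
        else:
            # Legacy form: an explicit (sql_type, expression) tuple.
            sql_type, expression = spec
            normalized[column] = (sql_type, expression)
    return normalized


embed = make_function("embed", "FLOAT[]")  # the type string is illustrative
expr = embed("data")
```

Because `ColumnExpr` subclasses `str`, code that only expects the expression string keeps working, while code that knows about the subclass can read `return_type` and skip the hand-written type.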
Wyatt Alt	127054069a	feat(mv): partition_by option on create_materialized_view / create_view Thread an optional partition_by through the client: CreateMaterializedViewRequest -> REST body -> pyo3 binding -> Python create_materialized_view/create_view kwarg (sync + async). The server partitions the view's table function by the named source column -- by IVF index clusters if the column is indexed (image-dedup), else by distinct value. Unifies Geneva's partition_by + partition_by_indexed_column into one knob. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	b20931b8f7	feat: async UDF client ergonomics (AsyncConnection/AsyncTable + AsyncView/AsyncJobHandle) Mirrors the sync ergonomics on the async surface: AsyncConnection create_function(udf, replace=)/create_view/job; AsyncTable.add_computed_column; AsyncView + AsyncJobHandle (await + asyncio.sleep; shared submission-prefix matcher with the sync JobHandle). Decorator + REST routes are shared/already validated; this is the async wrapper layer. Exported from the package root. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	396d68e490	fix: JobHandle resolves the manifest job id from the submission id db.job(id) gets the submission id the refresh/backfill endpoints return, but list_jobs / cancel report the agent's manifest id (<table>-<type>-<first 8 of submission id>). JobHandle now matches that (exact id or submission prefix) so wait()/progress() truly track, and cancel() cancels by the resolved canonical id instead of the unusable submission uuid. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
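The matching rule described in the commit above (exact id, or a manifest id of the form `<table>-<type>-<first 8 of submission id>`) might look roughly like the sketch below. `resolve_job_id` is a hypothetical name, and the ids used with it are made up for illustration; the real JobHandle logic may differ.

```python
from typing import Iterable, Optional


def resolve_job_id(submission_id: str, listed_ids: Iterable[str]) -> Optional[str]:
    """Return the listed id that corresponds to a submission id, if any."""
    prefix = submission_id[:8]
    for job_id in listed_ids:
        # Accept an exact match, or a manifest id that ends with the
        # first 8 characters of the submission id.
        if job_id == submission_id or job_id.endswith("-" + prefix):
            return job_id
    return None
```

Resolving once and then cancelling by the returned id mirrors the commit's point that the raw submission uuid is not usable for `cancel()`.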
Wyatt Alt	ad37f87387	feat: fold UDF authoring into lancedb (udf module + connection/table ergonomics) Brings the @udf/@table_udf decorator + type inference into lancedb as lancedb.udf (Apache-2.0), and adds the ergonomic glue to the existing connection/table so there's no separate object model: - create_function() accepts a Udf (and a replace= flag) - Table.add_computed_column(column, udf) - create_view(name, source, select, ...) -> View (assembles the SELECT) - Connection.job(job_id) -> JobHandle - View / JobHandle are thin references over a connection Exports udf/table_udf/Udf/JobHandle/View from the package root. The operations stay the existing remote-only methods (enterprise/cloud); the decorator works locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	e93476f0e0	feat: explain_refresh_materialized_view over REST (EXPLAIN REFRESH SDK) Database trait gains explain_refresh_materialized_view (default NotSupported) returning an MvRefreshPlan; RemoteDatabase POSTs /v1/materialized_view/{name}/explain_refresh; Connection method; pyo3 MvRefreshPlan pyclass + binding; sync+async python wrappers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	2b41fce033	feat: cancel_job over REST (Database::cancel_job + remote impl + pyo3 + python) Exposes the existing server-side CANCEL JOB (CoordinatorCatalog::cancel_job) as a REST-backed SDK method: Database trait default NotSupported, RemoteDatabase POSTs /v1/job/{id}/cancel, pyo3 binding, sync+async python wrappers. Best-effort: a missing job returns false, not an error. Mock-HTTP unit test in test_derived_compute_routes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-29 22:35:34 -07:00
Wyatt Alt	04948fc4f6	feat: computed columns as a param on add_columns Per the interface design: computed columns are parameters on the existing add_columns operation, not a separate method. - BaseTable::add_computed_columns((name, sql_type) pairs + a f(args) expression) -- default NotSupported; RemoteTable posts 'computed' entries to the existing /v1/table/{id}/add_columns route. - python add_columns gains computed= on LanceTable, RemoteTable, and AsyncTable: tbl.add_columns(computed={'doubled': ('FLOAT', 'double_it(val)')}); grouped by expression so struct-returning functions' columns land adjacently.	2026-06-29 22:35:34 -07:00
Wyatt Alt	ff3c7111b9	feat: SDK surface for functions, materialized views, jobs, refresh_column Adds the derived-compute interface to the SDK: - Database trait: create/list/drop_function, create/refresh/alter/ drop/list_materialized_view, list_jobs -- default implementations return Error::NotSupported (NotImplementedError in python), so existing Database impls are unaffected; local single-node implementations are planned. BaseTable gains refresh_column with the same default. - RemoteDatabase/RemoteTable implement them against the server REST routes (/v1/function/, /v1/materialized_view/, /v1/job/list, /v1/table/{id}/refresh_column), with mock-HTTP unit tests. - Connection/Table public methods, pyo3 bindings (FunctionInfo, MaterializedViewInfo, JobInfo pyclasses), and python wrappers: sync on the DBConnection base (shared by local and remote connections), async on AsyncConnection; refresh_column on LanceTable, RemoteTable, and AsyncTable.	2026-06-29 22:35:34 -07:00
Raphael Malikian	05756f0bbf	fix(python): raise clear error when permutation API is used on remote tables (Fixes #2934 ) (#3591 ) Fixes #2934 ## Problem Passing a `RemoteTable` to `permutation_builder()` raises a cryptic `AttributeError`: ``` AttributeError: 'RemoteTable' object has no attribute '_inner' ``` This leaves users confused about what went wrong and why. ## Root Cause `PermutationBuilder.__init__()` calls `async_permutation_builder(table)` which accesses `table._inner` — the underlying Rust Lance table object. `RemoteTable` connects to LanceDB Cloud/Enterprise and does not have a local `_inner` attribute, making permutations fundamentally unsupported on remote tables. ## Solution Added an early check in `PermutationBuilder.__init__()` that verifies the table has `_inner` before calling the Rust function, raising a clear `TypeError` with an explanation of why permutations don't work on remote tables. ## Verification - Syntax validated with `ast.parse()` - Structural verification: single call site (`permutation_builder()`), guard placed before Rust FFI call - Error message tested with mock: `MockRemoteTable()` correctly triggers `TypeError` ## Changelog \| Date \| Change \| Author \| \|------\|--------\|--------\| \| 2026-06-28 \| Added remote table guard in PermutationBuilder.__init__ \| rtmalikian \| ### Files Changed - python/python/lancedb/permutation.py — Added `hasattr(table, "_inner")` check with clear error --- About the Author: Raphael Malikian — Clinical AI Solutions Architect. I specialise in building and fixing AI/ML systems for healthcare, including vector databases, RAG pipelines, and clinical NLP. If you need help with your project or think I can add value to your organisation, feel free to reach out — I'd love to connect. 
📧 rtmalikian@gmail.com 🔗 GitHub: https://github.com/rtmalikian 🔗 LinkedIn: http://www.linkedin.com/in/raphael-t-malikian-mbbs-bsc-hons-71075436a --- Disclosure: This code was developed with assistance from deepseek-v4-pro (DeepSeek) via Hermes Agent (Nous Research). All changes were reviewed, tested against the actual codebase, and verified for correctness. Signed-off-by: rtmalikian <rtmalikian@gmail.com>	2026-06-29 16:36:01 -07:00
Jack Ye	39e819b6a7	feat(python): expose OAuth connection config (#3586 ) Expose the merged Rust OAuth header provider through the Python async connection path. Includes: - Python OAuthConfig and OAuthFlowType public config objects - PyO3 conversion into the Rust OAuthConfig - connect_async(oauth_config=...) plumbing - repr redaction coverage for client_secret Local validation: cargo fmt --all; ruff format/check on touched Python files.	2026-06-29 12:36:35 -07:00
Lance Release	3878adc6dc	Bump version: 0.34.0-beta.3 → 0.34.0-beta.4	2026-06-29 11:11:05 +00:00
Ryan Green	8a5cd74e48	fix: ensure read freshness provider is built into namespace client (#3571) By default the read freshness provider was not included in the namespace client, preventing the read freshness headers from being included in the request. This prevents checkout_latest() from working as expected when using the namespace client. This fix ensures the provider is built into the client when the namespace impl and properties are provided.	2026-06-25 21:47:55 -07:00
Lance Release	8718345229	Bump version: 0.34.0-beta.2 → 0.34.0-beta.3	2026-06-25 01:53:51 +00:00
Raphael Malikian	0ba70d96c3	fix: add missing stacklevel=2 to warnings.warn() and fix broken message concatenation (Fixes #3563 ) (#3564 ) Fixes #3563 ## Summary - Add `stacklevel=2` to 10 `warnings.warn()` calls across 4 files - Fix broken message concatenation in `table.py` where the second string was incorrectly passed as the `category` parameter ## Problem Multiple `warnings.warn()` calls in the `python/lancedb/` codebase were missing the `stacklevel` parameter. Without `stacklevel=2`, warnings point to library internals instead of the caller's code, making it impossible for users to identify which of their function calls triggered the warning. Additionally, two calls in `table.py` (lines 3411 and 3420) had a more serious bug: the deprecation message was split across two separate string arguments, causing the second string to be passed as the `category` parameter instead of being concatenated with the first string. This would cause `TypeError` when the warning was triggered. ## Changes \| File \| Fixes \| Description \| \|------\|-------\|-------------\| \| `embeddings/colpali.py` \| 1 \| Add `stacklevel=2` to `use_token_pooling` deprecation warning \| \| `remote/db.py` \| 3 \| Add `stacklevel=2` to `request_thread_pool`, `connection_timeout`, `read_timeout` deprecation warnings \| \| `remote/table.py` \| 3 \| Add `stacklevel=2` to `cleanup_old_versions`, `compact_files`, `optimize` no-op warnings \| \| `table.py` \| 3 \| Fix broken message concatenation for `data_storage_version` and `enable_v2_manifest_paths` deprecation warnings + add `stacklevel=2` to `retrain` deprecation warning \| ## Verification ```python # All warnings.warn() calls now have stacklevel python3 -c "import ast, os; ..." # Result: All warnings.warn() calls now have stacklevel! 
``` ## Changelog \| Date \| Change \| Author \| \|------\|--------\|--------\| \| 2026-06-20 \| Fix missing stacklevel=2 in 10 warnings.warn() calls + fix broken message concatenation \| rtmalikian \| ### Files Changed - `python/python/lancedb/embeddings/colpali.py` — Add stacklevel=2 - `python/python/lancedb/remote/db.py` — Add stacklevel=2 to 3 deprecation warnings - `python/python/lancedb/remote/table.py` — Add stacklevel=2 to 3 no-op warnings - `python/python/lancedb/table.py` — Fix broken message concatenation + add stacklevel=2 ### Verification - AST-based audit confirms all `warnings.warn()` calls now include `stacklevel=2` - Syntax check passes for all 4 modified files --- About the Author: Raphael Malikian — Clinical AI Solutions Architect. I specialise in building and fixing AI/ML systems for healthcare, including vector databases, RAG pipelines, and clinical NLP. If you need help with your project or think I can add value to your organisation, feel free to reach out — I'd love to connect. 📧 rtmalikian@gmail.com 🔗 GitHub: https://github.com/rtmalikian 🔗 LinkedIn: http://www.linkedin.com/in/raphael-t-malikian-mbbs-bsc-hons-71075436a --- Disclosure: This code was developed with assistance from Hermes Agent (Nous Research). All changes were reviewed, tested against the actual codebase, and verified for correctness. Signed-off-by: rtmalikian <rtmalikian@gmail.com>	2026-06-23 13:42:59 -07:00
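Both problems described in the commit above can be reproduced with the standard `warnings` module alone. The sketch below uses made-up function names (`old_api`, `broken_api`, `collect_warnings`) and does not touch lancedb itself.

```python
import warnings


def old_api():
    # Adjacent string literals are concatenated into one message, and
    # stacklevel=2 attributes the warning to the caller of old_api().
    warnings.warn(
        "old_api is deprecated; " "use new_api instead",
        DeprecationWarning,
        stacklevel=2,
    )


def broken_api():
    # Bug pattern: the stray comma makes the second string the `category`
    # argument, which must be a Warning subclass.
    warnings.warn("old_api is deprecated; ", "use new_api instead")


def collect_warnings(func):
    """Call func() and return the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return list(caught)
```

Calling `broken_api()` raises `TypeError` because a string is not a valid warning category, which matches the failure mode the commit describes for `table.py`.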
Lance Release	26481a4b74	Bump version: 0.34.0-beta.1 → 0.34.0-beta.2	2026-06-23 16:21:52 +00:00
Will Jones	85d870b397	fix: parse RFC 3339 created_at and improve IndexConfig repr (#3558 ) The server now serializes an index's `created_at` as an RFC 3339 string (e.g. `"2026-06-18T21:37:36.637Z"`), but the client deserializer only accepted a unix timestamp in milliseconds. This caused `list_indices` to fail with: ``` Failed to parse list_indices response: invalid type: string "2026-06-18T21:37:36.637Z", expected a unix timestamp in milliseconds ``` This PR replaces the fixed millisecond deserializer with a custom one that accepts both an RFC 3339 string (current server) and a unix-millisecond integer (legacy deployments), so the client works against any server version. It also improves the `IndexConfig` repr in the Python bindings. Previously it printed only three fields (`Index(FTS, columns=["text"], name="text_idx")`), hiding the metadata that `list_indices` returns. It now renders every populated field, omitting any that are `None`. Each value is valid Python — integer counts use `_` thousands separators and `created_at` uses the `datetime` repr — so values round-trip. 
The real repr is a single line; it's wrapped here for readability: ```python >>> table.list_indices() [IndexConfig( name="text_idx", index_type="FTS", columns=["text"], index_uuid="aefd3e00-2f95-4bdc-92ac-06de84442bf1", type_url="/lance.table.InvertedIndexDetails", created_at=datetime.datetime(2026, 6, 18, 21, 37, 36, 637000, tzinfo=datetime.timezone.utc), num_indexed_rows=2, size_bytes=3_669, num_segments=1, index_version=1, index_details={ 'lance_tokenizer': None, 'base_tokenizer': 'simple', 'language': 'English', 'with_position': False, 'max_token_length': 40, 'lower_case': True, 'stem': True, 'remove_stop_words': True, 'custom_stop_words': None, 'ascii_folding': True, 'min_ngram_length': 3, 'max_ngram_length': 3, 'prefix_only': False, }, )] ``` Fixes #3556 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 10:40:56 -07:00
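The fix above lives in the Rust client's deserializer. As a rough Python analogue of the same "accept both shapes" idea, the sketch below parses either an RFC 3339 string or a unix-millisecond integer. `parse_created_at` is a hypothetical name, and only the trailing `Z` form shown in the commit is normalized.

```python
from datetime import datetime, timedelta, timezone
from typing import Union

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_created_at(value: Union[str, int]) -> datetime:
    """Parse an RFC 3339 string or a unix timestamp in milliseconds."""
    if isinstance(value, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("created_at must be a string or an integer")
    if isinstance(value, int):
        # Legacy servers: milliseconds since the Unix epoch.
        return _EPOCH + timedelta(milliseconds=value)
    if isinstance(value, str):
        # Older Python versions do not accept a trailing "Z" in
        # fromisoformat, so rewrite it as an explicit UTC offset.
        text = value[:-1] + "+00:00" if value.endswith("Z") else value
        return datetime.fromisoformat(text)
    raise TypeError("created_at must be a string or an integer")
```

Using a `timedelta` for the integer branch avoids floating-point rounding on the millisecond component.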
Lance Release	3b279f5705	Bump version: 0.34.0-beta.0 → 0.34.0-beta.1	2026-06-19 15:59:43 +00:00
Ryan Green	e1334954d7	fix: overflow using sys.maxsize for k in query with namespace connection (#3561 )	2026-06-19 12:57:10 -02:30
Lance Release	4f4cce3f64	Bump version: 0.33.1-beta.2 → 0.34.0-beta.0	2026-06-18 18:42:07 +00:00
Will Jones	ce5dadd386	fix(ci): allow shell pre-commit hooks in bumpversion configs (#3554 ) The "Create release commit" workflow (`make-release-commit.yml`) has failed on its last two runs; no release tags have been created since June 4. Since this workflow creates the tag that the cargo/npm/pypi/java publish workflows trigger off of, all recent releases are effectively blocked. The workflow installs `bump-my-version` unpinned. Version `1.4.0` added a check that refuses to run `pre_commit_hooks` containing shell syntax (pipes, `&&`, `if`, variable expansion) unless `allow_shell_hooks = true` is set. Both bumpversion configs use such hooks: - `python/.bumpversion.toml` — updates `Cargo.lock` after the bump (fails first) - `.bumpversion.toml` — runs `mvn versions:set` for the Java packages The job dies at the version-bump step with: > Hook '…' contains shell syntax (pipes, redirects, or variable expansion). Set `allow_shell_hooks = true` in your configuration to enable shell execution… This sets `allow_shell_hooks = true` in both configs to restore the previous behavior. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:22:05 -07:00
whitewooood	217fd8491d	fix(python): clarify single dictionary input error (#3537 ) ## Summary - clarify the Python error for passing a single dictionary to table creation/add paths - add a regression test for `create_table(..., data=dict)` so it points users to a list of dictionaries Fixes #409 ## Testing - `python -m pytest python/tests/test_table.py -q` - `python -m ruff format python/lancedb/table.py python/lancedb/scannable.py python/tests/test_table.py` - `python -m ruff check python/lancedb/table.py python/lancedb/scannable.py python/tests/test_table.py`	2026-06-17 12:55:55 -07:00
JSap0914	9128dbcd7a	fix(util): escape single quotes in struct field names in value_to_sql (#3548 ) ### Bug `value_to_sql({...})` builds a DataFusion `named_struct(...)` literal but interpolates the struct field names directly as `f"'{k}'"`. A field name that contains a single quote therefore produces invalid SQL: ```python >>> from lancedb.util import value_to_sql >>> value_to_sql({"it's": 1}) "named_struct('it's', 1)" # invalid SQL — the quote terminates the literal ``` String values are already escaped (single quotes doubled) by the `str` branch of `value_to_sql`, so keys and values were handled inconsistently. This affects `Table.update(values={...})` / `merge_insert` when a struct column has a field name containing `'`. ### Fix Render the key through `value_to_sql(str(k))` so field names are escaped exactly like string values: ```python >>> value_to_sql({"it's": 1}) "named_struct('it''s', 1)" ``` Keys without special characters are unchanged (`'a'` stays `'a'`), so existing behavior is preserved. ### Verification ``` $ pytest python/tests/test_util.py -k value_to_sql_dict ``` The new `test_value_to_sql_dict_key_escaping` covers quoted keys (incl. nested structs) and fails on `main` (`named_struct('it's', 1)`), passes with this change; the existing `test_value_to_sql_dict` still passes. Co-authored-by: JSap0914 <JSap0914@users.noreply.github.com>	2026-06-17 12:55:43 -07:00
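The escaping rule in the commit above can be shown with a small standalone function. This is a simplified sketch, not lancedb's `value_to_sql`: `to_sql_literal` is a hypothetical name and handles only strings, booleans, numbers, and dicts.

```python
def to_sql_literal(value) -> str:
    """Render a Python value as a SQL literal (simplified sketch)."""
    if isinstance(value, str):
        # Double any single quote so it cannot terminate the literal.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, dict):
        parts = []
        for key, item in value.items():
            # Route the key through the same string branch as values,
            # so field names and string values are escaped identically.
            parts.append(to_sql_literal(str(key)))
            parts.append(to_sql_literal(item))
        return "named_struct(" + ", ".join(parts) + ")"
    raise TypeError(f"unsupported value type: {type(value).__name__}")
```

Keys without quotes come out unchanged, so the only observable difference is for field names that contain `'`.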
Armaan Sandhu	b2ae763254	fix(python): raise clear TypeError for bare List/Tuple in pydantic schema conversion (#3511) Closes #3502 ## Problem A bare, unparameterised `typing.List` / `typing.Tuple` field crashes `to_arrow_schema` with an opaque `AttributeError: __args__`: ```python from typing import Tuple from lancedb.pydantic import LanceModel class Doc(LanceModel): items: Tuple Doc.to_arrow_schema() # AttributeError: __args__ ``` In `_py_type_to_arrow_type`, the branch `elif getattr(py_type, "__origin__", None) in (list, tuple)` is taken for a bare generic (its `__origin__` is `list` / `tuple`), but the next line reads `py_type.__args__[0]`, and a bare generic has no `__args__`. Other unsupported types (e.g. `Dict[str, int]`) correctly raise a clear `TypeError`, so this case is inconsistent. ## Fix Guard the element-type lookup with `getattr(py_type, "__args__", None)` and raise a clear `TypeError` when it is missing, matching the existing behavior for other unsupported types. Bare builtin `list` / `tuple` are unaffected (their `__origin__` is `None`, so they already fall through to the existing `TypeError`). ## Testing - Added `test_bare_generic_raises_type_error` covering both `List` and `Tuple`. - ruff format and ruff check clean.	2026-06-17 11:58:48 -07:00
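The guard described in the commit above can be sketched on its own, outside the pydantic conversion code. `element_type` is a hypothetical helper, and the behaviour of bare `typing.List` noted here assumes Python 3.9 or newer.

```python
from typing import List, Tuple


def element_type(py_type):
    """Return the element type of a parameterised List/Tuple annotation."""
    origin = getattr(py_type, "__origin__", None)
    if origin not in (list, tuple):
        # Builtin list/tuple and unrelated types land here.
        raise TypeError(f"Unsupported type: {py_type!r}")
    # A bare typing.List / typing.Tuple has no __args__; look it up
    # defensively instead of letting AttributeError escape.
    args = getattr(py_type, "__args__", None)
    if not args:
        raise TypeError(f"{py_type!r} must be parameterised, e.g. List[int]")
    return args[0]
```

With the guard in place, bare generics fail with the same exception type as other unsupported annotations instead of an opaque `AttributeError`.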
Brendan Clement	f76b075d13	feat: add table branch support to remote tables and Python/TS bindings (#3540 ) ### Description Adding branch support for RemoteTable by threading a branch selector onto every operation the data plane accepts it on. Exposes the currentBranch to nodejs and python through the bindings. Matching the server handlers, the branch rides as: - a `?branch=` query parameter for Arrow-body and query-only ops (insert, merge_insert, multipart_*, version/list, drop_index) - a `branch` field in the JSON body for everything else (count_rows, query, update, delete, create_index, column ops, index list/stats, stats, restore, describe, tags create/update) A main-branch handle (`branch == None`) produces byte-identical requests to before: no `branch` field and no `?branch=` - Handle-per-branch: `create_branch` / `checkout_branch` return a new handle with fresh caches and reset version/freshness state, mirroring `NativeTable`. - `create_branch` maps 409 to already-exists, 400 to invalid, and 404 to not-found with source context, and sends without retry so the 409 stays observable. - `Ref` translation covers version, version-number (relative to the handle's branch), and tag (resolved via the tags endpoint); `"main"` and empty normalize to the main branch. - Python branch handles persist their branch (and pinned version) across pickle/fork, so a forked or pickled handle reopens on its branch rather than silently reverting to main. ### Tests - Rust mock tests per op category (query-param and body mechanisms, branch CRUD, error paths, backward-compat). - Python sync branch CRUD, `open_table(branch=)`, and a pickle round-trip regression test.	2026-06-15 18:07:40 -04:00
Will Jones	f8caef3aca	feat(bindings): expose new IndexConfig fields in Python and Node.js (#3534 ) ## Summary Surfaces the rich per-index metadata added in #3497 to the Python and Node.js language bindings. Closes #3495. New optional fields exposed on `IndexConfig` in both bindings: - `index_uuid` / `indexUuid` — UUID of the first index segment - `type_url` / `typeUrl` — protobuf type URL for the index - `created_at` / `createdAt` — creation timestamp (milliseconds since Unix epoch) - `num_indexed_rows` / `numIndexedRows` — rows covered by the index - `num_unindexed_rows` / `numUnindexedRows` — rows not yet indexed - `size_bytes` / `sizeBytes` — total index file size in bytes - `num_segments` / `numSegments` — number of index segments - `index_version` / `indexVersion` — on-disk format version - `index_details` / `indexDetails` — type-specific JSON details string All fields are `None`/`undefined` for remote tables (which don't yet surface this metadata through the server response). ## Changes - `python/src/index.rs`: extend `IndexConfig` pyclass; update `From` impl; update `__getitem__` - `python/python/lancedb/_lancedb.pyi`: add type hints for new fields - `python/python/tests/test_table.py`: new `test_index_config_fields` test - `nodejs/src/table.rs`: extend `IndexConfig` napi struct; update `From` impl - `nodejs/__test__/table.test.ts`: new test; update existing `toEqual` assertions to `expect.objectContaining` to accommodate new fields ## Test plan - [x] Python: `uv run --extra tests pytest python/tests/test_table.py::test_index_config_fields` - [x] Node.js: `pnpm test __test__/table.test.ts` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 13:37:39 -07:00
nuthalapativarun	40f3e22600	feat: support rename_table on LanceNamespaceDatabase (#3520 ) ## Summary Closes #3412 Implements `rename_table` for `LanceNamespaceDatabase` (sync and async Python) and the Rust `NamespaceDatabase` backend. Previously these raised `NotImplementedError`; this PR delegates to the `LanceNamespace.rename_table` method which is part of the lance-namespace spec. ### Changes - `rust/lancedb/src/database/namespace.rs`: Remove the `NotImplementedError` stub for `rename_table`. Build a `RenameTableRequest` (with `id`, `new_table_name`, and optionally `new_namespace_id`) and call `self.namespace.rename_table(...)`, mirroring the existing `drop_table` pattern. - `python/python/lancedb/namespace.py`: Import `RenameTableRequest` from `lance_namespace`. Replace the `raise NotImplementedError` in both `LanceNamespaceDatabase.rename_table` (sync) and `AsyncLanceNamespaceDatabase.rename_table` (async) with a call to `self._namespace_client.rename_table(request)`. - `python/python/tests/test_namespace.py`: Replace the `test_rename_table_not_supported` test (which checked for `NotImplementedError`) with `test_rename_table`, which: 1. Creates a table in a namespace 2. Calls `rename_table` with `cur_namespace_path` and `new_namespace_path` 3. Asserts the old name is gone from `table_names()` 4. Asserts the new name appears in `table_names()` 5. Verifies the renamed table can be opened ## Test plan - [ ] Existing namespace tests pass in CI (all rely on `lance.namespace.DirectoryNamespace` which requires the full lance package) - [ ] `test_rename_table` exercises the full rename path: create → rename → verify old gone → verify new present → open - [ ] Rust build passes with the updated `namespace.rs` (requires Rust toolchain in CI)	2026-06-11 11:41:07 -07:00
nuthalapativarun	04480c274a	test(python): add nested field regression matrix tests (#3518) ## Summary Closes #3406 Add a regression matrix in `python/python/tests/test_nested_fields.py` that exercises the full nested field index lifecycle for both the sync and async Python table APIs. The tests will fail if any implementation regresses to leaf-only field names in `list_indices`, `index_stats`, search, or filter results. ## Test scenarios covered Index types: BTree scalar, IvfPq vector, FTS Field-name edge cases (per acceptance criteria): - `rowId` — camelCase top-level field - `` `row-id` `` — hyphenated top-level field (escaped) - `` parent.`leaf.name` `` — struct leaf whose name contains a literal dot - `MetaData.userId` — mixed-case nested path - `` `meta-data`.`user-id` `` — hyphenated struct with hyphenated leaf Lifecycle operations per index type: - `create_index` / `create_scalar_index` / `create_fts_index` - `list_indices` → verify canonical full dotted path (not leaf name) - `index_stats` → verify row count and index type - Filtered scan (`WHERE nested.field = value`) - Vector search via nested embedding column - FTS search via nested text column - `add` (append) then re-check index listing - `optimize` then re-check index listing Both sync and async APIs are covered in parallel test classes. ## Notes Lance forbids top-level field names that contain a literal `.`, so the `` `a.b` `` acceptance-criterion variant is exercised as a struct leaf field (`` parent.`leaf.name` ``) rather than a top-level column.	2026-06-11 08:06:04 -07:00
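The escaping convention in the field-name cases above can be sketched as follows. This is a minimal illustration, not LanceDB's implementation: `quote_path` is a hypothetical helper, and the rule it applies (backtick-quote any path segment that contains a character other than letters, digits, or underscores) is an assumption inferred from the examples listed in the commit message.

```python
import re

# Assumption: a segment is "plain" when it holds only letters, digits, and
# underscores; anything else (hyphens, dots, spaces) needs backtick quoting.
_PLAIN_SEGMENT = re.compile(r"[A-Za-z0-9_]+")


def quote_path(segments):
    """Join struct path segments with dots, quoting non-plain segments.

    Hypothetical helper for illustration only.
    """
    parts = []
    for segment in segments:
        if _PLAIN_SEGMENT.fullmatch(segment):
            parts.append(segment)
        else:
            parts.append(f"`{segment}`")
    return ".".join(parts)


# The five field-name cases from the commit message, as lists of segments.
cases = [
    ["rowId"],
    ["row-id"],
    ["parent", "leaf.name"],
    ["MetaData", "userId"],
    ["meta-data", "user-id"],
]
for segments in cases:
    print(quote_path(segments))
```

Passing segments as a list rather than a dotted string avoids the ambiguity the tests guard against: a leaf named `leaf.name` stays one segment instead of being split at its literal dot.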
Trenton H	ae7f2cbfe8	feat(python): accept Expr in Table.delete and merge when_not_matched_by_source_delete (#3524) Another little pain point as I was working to integrate with paperless-ngx. The read path of table.search() or table.query() already accepted an Expr, but the write paths Table.delete and merge_insert(...).when_not_matched_by_source_delete did not. This PR attempts to close that gap, so writes and reads can both use Expr, instead of one side needing to build a string.	2026-06-11 07:59:49 -07:00
Trenton H	85d9c1ce63	feat: adds isin support to the 'Expr' builder (#3523) The `Expr` builder already includes a lot of useful filtering options, `eq, ne, gt/gte, lt/lte, and_, or_, contains, cast`, but it was missing a membership test like `isin`. This PR adds that support, as minimally as possible, allowing easy filtering for membership in a list, without needing a series of `where` expressions. I didn't see anything in CONTRIBUTING.md about needing a feature request or issue first, so I just made the change. My apologies if I missed that somewhere. Thanks for the vector store, we're using it now in paperless-ngx.	2026-06-10 15:28:19 -07:00
Jack Ye	8373318e89	feat: support FM-Index scalar index for substring search (#3532 ) Adds an FM-Index — a scalar index over string and binary columns that accelerates substring search (`contains(col, 'needle')`), distinct from the tokenized `FTS` index — across the Rust core and the Python and TypeScript bindings. ## Rust - `Index::Fm(FmIndexBuilder)` and `IndexType::Fm`. - `make_index_params` maps `Index::Fm` to Lance's `ScalarIndexParams::for_builtin(BuiltinIndexType::Fm)`. - `supported_fm_data_type` validates `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` columns. - `list_indices` round-trips the type (`"Fm"` → `IndexType::Fm`); the remote wire type is `"FM"`. ## Python Adds `lancedb.index.Fm`, accepted by `create_index`: ```python from lancedb.index import Fm await tbl.create_index("text", config=Fm()) ``` ## TypeScript Adds the `Index.fm()` factory: ```ts await tbl.createIndex("text", { config: Index.fm() }); ```	2026-06-10 12:28:20 -07:00
Xuanwo	566b67a634	fix: support LargeList label list indexes (#3529 ) ## Summary This PR extends nested-field regression coverage across Rust local/remote, Python sync/async, and Node so canonical escaped paths stay consistent across scalar, vector, and FTS index lifecycle behavior. It also aligns LanceDB's LabelList type gate with Lance by accepting `LargeList<primitive>` columns while keeping `List<Struct<...>>` unsupported until Lance defines stable membership semantics for struct labels. Part of #3406.	2026-06-10 23:53:56 +08:00
devteamaegis	f260d3bf12	fix(util): convert numpy scalars in value_to_sql (#3522 ) ## What's broken `Table.update(values={...})` raises `NotImplementedError: SQL conversion is not implemented for this type` when a value is a numpy scalar such as `np.int64`, `np.int32`, `np.float32`, or `np.bool_`. These arise naturally from indexing an ndarray or a pandas int/bool column. `np.float64` happens to work (it subclasses `float`), which makes the failure inconsistent and surprising. ```python df = pd.DataFrame({"id": np.array([10, 20], dtype="int32")}) t.update(where="id = 1", values={"id": df["id"].iloc[0]}) # np.int32 # -> NotImplementedError: SQL conversion is not implemented for this type ``` ## Why it happens `value_to_sql` is a `singledispatch` with handlers only for native Python types and `np.ndarray`; numpy `integer`/`floating`/`bool_` scalars aren't Python subclasses, so they fall through to the `NotImplementedError` base. ## Fix Register handlers for `np.bool_`, `np.integer`, and `np.floating` that delegate to the existing native handlers. ## Test `value_to_sql` on `np.int32/int64/float32/float64/bool_` all convert; `np.int32` raised before. Co-authored-by: Ishaan Samantray <ishaansamantray@Ishaans-MacBook-Pro.local>	2026-06-09 15:57:02 -07:00
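The dispatch fall-through described above, and the fix, can be reproduced in a small sketch. Only the `singledispatch` registration pattern comes from the commit message. The handler bodies are simplified stand-ins rather than LanceDB's actual `value_to_sql` code, and the example assumes NumPy is installed.

```python
from functools import singledispatch

import numpy as np


@singledispatch
def value_to_sql(value):
    # Base case: reached by any type without a registered handler.
    raise NotImplementedError("SQL conversion is not implemented for this type")


# Simplified native handlers (stand-ins for the real ones).
@value_to_sql.register(bool)
def _(value):
    return str(value).upper()


@value_to_sql.register(int)
def _(value):
    return str(value)


@value_to_sql.register(float)
def _(value):
    return str(value)


# The fix: NumPy scalars such as np.int32 do not subclass the Python types
# above, so register their abstract base classes and delegate.
@value_to_sql.register(np.bool_)
def _(value):
    return value_to_sql(bool(value))


@value_to_sql.register(np.integer)
def _(value):
    return value_to_sql(int(value))


@value_to_sql.register(np.floating)
def _(value):
    return value_to_sql(float(value))


print(value_to_sql(np.int32(10)))
print(value_to_sql(np.float32(1.5)))
print(value_to_sql(np.bool_(True)))
```

Registering the abstract classes `np.integer` and `np.floating` covers every concrete width (`int8` through `int64`, `float16` through `float64`) with one handler each, which is why the fix needs only three registrations.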
Brendan Clement	d9018067b3	feat: support checking out a version on a branch (#3504 ) ### Description Stacked on #3490. Adds an optional version to branch checkout across the Rust core and the Python and TypeScript SDKs, so you can open a specific version on a branch ("version V of branch B"), not just the branch's latest version Rust ```rust // Open version 3 of branch "exp" (a read-only view): check out from an // existing table, or open it directly from the connection. let exp_v3 = table.checkout_branch("exp", Some(3)).await?; let exp_v3 = db.open_table("items").branch("exp").version(3).execute().await?; // checkout_latest re-attaches to the branch's writable HEAD. exp_v3.checkout_latest().await?; // With no branch, a version opens main at that version. let main_v3 = db.open_table("items").version(3).execute().await?; ``` Python ```python # Open version 3 of branch "exp" (a read-only view): check out from an # existing table, or open it directly from the connection. branch_v3 = await table.branches.checkout("exp", version=3) branch_v3 = await db.open_table("items", branch="exp", version=3) # checkout_latest re-attaches to the branch's writable HEAD. await branch_v3.checkout_latest() # With no branch, a version opens main at that version. main_v3 = await db.open_table("items", version=3) ``` TypeScript ```typescript // Open version 3 of branch "exp" (a read-only view): check out from an // existing table, or open it directly from the connection. const branchV3 = await (await table.branches()).checkout("exp", 3); const opened = await db.openTable("items", undefined, { branch: "exp", version: 3 }); // checkoutLatest re-attaches to the branch's writable HEAD. await branchV3.checkoutLatest(); // With no branch, a version opens main at that version. 
const mainV3 = await db.openTable("items", undefined, { version: 3 }); ``` ### Testing - Added unit tests (Rust, Python sync + async, TypeScript): branch-scoped resolution at a version number shared with `main` and with another branch, read-only enforcement on a pinned handle, `checkout_latest` recovery to the branch's HEAD, fork-point reads, and the nonexistent-version/branch error paths. - Ran smoke tests against the Python and TypeScript SDKs on local machine.	2026-06-08 17:36:38 -07:00
Brendan Clement	53517b3aaa	feat: add table branch support (#3490 ) ### Description Adds first-class support for table branches across the Rust core and the Python and TypeScript SDKs. Rust ```rust use lance::dataset::refs::Ref; // Create a branch from main and write to it — main is untouched. let exp = table.create_branch("exp", Ref::Version(None, None)).await?; exp.add(batches).await?; // Reopen the branch later: check out from a table, or open it directly. let exp = table.checkout_branch("exp").await?; let exp = db.open_table("items").branch("exp").execute().await?; let branches = table.list_branches().await?; table.delete_branch("exp").await?; ``` Python ```python # Create a branch from main and write to it branch = await table.branches.create("exp", from_ref="main") await branch.add(data) # Reopen the branch later: check out from a table, or open it directly. branch = await table.branches.checkout("exp") branch = await db.open_table("items", branch="exp") await table.branches.list() await table.branches.delete("exp") ``` TypeScript ```typescript const branches = await table.branches(); // Create a branch from main and write to it const branch = await branches.create("exp"); await branch.add(data); // Reopen the branch later: check out from a table, or open it directly. const checkedOut = await branches.checkout("exp"); const opened = await db.openTable("items", undefined, { branch: "exp" }); await branches.list(); await branches.delete("exp"); ``` ### Testing - Added unit tests - ran smoke tests against python and typescript sdks on local machine ### Next steps - Add RemoteTable support - Add Branch Comparison support - Merge Branching support	2026-06-08 16:26:46 -07:00
Yang Cen	3e25f584eb	fix(python): push down namespace full reads (#3516 ) ## Bug Fix ### What is the bug? Namespace-backed `LanceTable.to_arrow()` full-table reads bypassed the existing `QueryTable` server-side query path and called the lower-level table `to_arrow()` implementation directly. In Geneva/Sophon this could fail while parsing the Arrow IPC response for `hist.get_table().to_arrow()` / `to_pandas()`, even though `hist.get_table().search().to_arrow()` worked. ### What issues or incorrect behavior does the bug cause? Full-table reads on namespace-backed tables with `QueryTable` pushdown could fail with Arrow IPC parse errors, while query/search reads on the same table succeeded. Since `to_pandas()` delegates through `to_arrow()` for non-blob/native cases, pandas export was affected too. ### How does this PR fix the problem? When `QueryTable` pushdown is enabled, sync and async table `to_arrow()` now construct a plain no-filter, no-limit, all-columns query and execute it through the table-level `_execute_query()` path. `AsyncTable` now preserves namespace context from async namespace connections so async full reads can make the same pushdown decision. Non-namespace tables and namespace tables without `QueryTable` pushdown keep their existing behavior. 
### Tests - `uv run --extra tests --extra dev --no-sync ruff check python/lancedb/table.py python/lancedb/namespace.py python/tests/test_namespace.py` - `uv run --extra tests --extra dev --no-sync ruff format python/lancedb/table.py python/lancedb/namespace.py python/tests/test_namespace.py` - `uv run --extra tests --extra dev --no-sync pytest python/tests/test_namespace.py::TestPushdownOperations::test_lance_table_to_arrow_uses_query_pushdown python/tests/test_namespace.py::TestAsyncPushdownOperations::test_async_table_to_arrow_uses_query_pushdown python/tests/test_namespace.py::test_local_table_to_arrow_and_to_pandas_are_unchanged -q` - `uv run --extra tests --extra dev --no-sync pytest python/tests/test_namespace.py -q`	2026-06-08 19:48:40 +08:00
Will Jones	09b1bbc12a	refactor!: drop unused loss field from IndexStatistics (#3496 ) BREAKING CHANGE: direct Rust users lose the `IndexStatistics::loss` field. Python and Node.js consumers are unaffected in practice for remote tables (the value was always `None`/absent), but the attribute is gone for local tables too. `IndexStatistics::loss` was local-only — LanceDB Cloud never returned it, so `RemoteTable::index_stats` always set `loss: None`. It's vestigial; this removes it. - Remove `loss` from `IndexStatistics` and the internal `IndexMetadata` in `rust/lancedb/src/index.rs`, plus the summing logic in `NativeTable::index_stats`. - Drop `loss` from the Python and Node.js bindings (and their tests/docs). Fixes #3493 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 07:52:40 -07:00

1075 Commits