lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-06-29 00:50:38 +00:00

Author	SHA1	Message	Date
Jack Ye	956a8ee714	feat(python): expose OAuth connection config	2026-06-27 00:01:11 -07:00
Ryan Green	8a5cd74e48	fix: ensure read freshness provider is built into namespace client (#3571 ) By default the read freshness provider was not included in the namespace client, preventing the read freshness headers from being included in the request. This prevents checkout_latest() from working as expected when using the namespace client. This fix ensures the provider is built into the client when the namespace impl and properties are provided.	2026-06-25 21:47:55 -07:00
Lance Release	8718345229	Bump version: 0.34.0-beta.2 → 0.34.0-beta.3	2026-06-25 01:53:51 +00:00
Raphael Malikian	0ba70d96c3	fix: add missing stacklevel=2 to warnings.warn() and fix broken message concatenation (Fixes #3563 ) (#3564 ) Fixes #3563 ## Summary - Add `stacklevel=2` to 10 `warnings.warn()` calls across 4 files - Fix broken message concatenation in `table.py` where the second string was incorrectly passed as the `category` parameter ## Problem Multiple `warnings.warn()` calls in the `python/lancedb/` codebase were missing the `stacklevel` parameter. Without `stacklevel=2`, warnings point to library internals instead of the caller's code, making it impossible for users to identify which of their function calls triggered the warning. Additionally, two calls in `table.py` (lines 3411 and 3420) had a more serious bug: the deprecation message was split across two separate string arguments, causing the second string to be passed as the `category` parameter instead of being concatenated with the first string. This would cause `TypeError` when the warning was triggered. ## Changes \| File \| Fixes \| Description \| \|------\|-------\|-------------\| \| `embeddings/colpali.py` \| 1 \| Add `stacklevel=2` to `use_token_pooling` deprecation warning \| \| `remote/db.py` \| 3 \| Add `stacklevel=2` to `request_thread_pool`, `connection_timeout`, `read_timeout` deprecation warnings \| \| `remote/table.py` \| 3 \| Add `stacklevel=2` to `cleanup_old_versions`, `compact_files`, `optimize` no-op warnings \| \| `table.py` \| 3 \| Fix broken message concatenation for `data_storage_version` and `enable_v2_manifest_paths` deprecation warnings + add `stacklevel=2` to `retrain` deprecation warning \| ## Verification ```python # All warnings.warn() calls now have stacklevel python3 -c "import ast, os; ..." # Result: All warnings.warn() calls now have stacklevel! 
``` ## Changelog \| Date \| Change \| Author \| \|------\|--------\|--------\| \| 2026-06-20 \| Fix missing stacklevel=2 in 10 warnings.warn() calls + fix broken message concatenation \| rtmalikian \| ### Files Changed - `python/python/lancedb/embeddings/colpali.py` — Add stacklevel=2 - `python/python/lancedb/remote/db.py` — Add stacklevel=2 to 3 deprecation warnings - `python/python/lancedb/remote/table.py` — Add stacklevel=2 to 3 no-op warnings - `python/python/lancedb/table.py` — Fix broken message concatenation + add stacklevel=2 ### Verification - AST-based audit confirms all `warnings.warn()` calls now include `stacklevel=2` - Syntax check passes for all 4 modified files --- Disclosure: This code was developed with assistance from Hermes Agent (Nous Research). All changes were reviewed, tested against the actual codebase, and verified for correctness. Signed-off-by: rtmalikian <rtmalikian@gmail.com>	2026-06-23 13:42:59 -07:00
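The two failure modes described in this commit can be reproduced with the standard library alone. The following is a minimal sketch, not code from lancedb: the `old_api_*` functions and the `warning_lineno` helper are hypothetical names introduced only for illustration.

```python
import warnings


def old_api_without_stacklevel():
    # Default stacklevel=1: the warning is attributed to this line,
    # inside the "library", not to the code that called the function.
    warnings.warn("old_api is deprecated", DeprecationWarning)


def old_api_with_stacklevel():
    # stacklevel=2 attributes the warning to the caller's line instead.
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)


def old_api_broken_concatenation():
    # Bug pattern: a stray comma turns the second half of the message into
    # the `category` argument, which must be a Warning subclass.
    warnings.warn("old_api is deprecated, ", "use new_api instead")


def warning_lineno(func):
    """Call func and return the line number its warning is attributed to."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()  # with stacklevel=2, the warning points at this line
    return caught[0].lineno
```

Calling `warning_lineno` on the first two functions yields different line numbers: one inside the deprecated function, one at the call site. Calling `old_api_broken_concatenation()` raises `TypeError` before any warning is emitted.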
Lance Release	26481a4b74	Bump version: 0.34.0-beta.1 → 0.34.0-beta.2	2026-06-23 16:21:52 +00:00
Will Jones	85d870b397	fix: parse RFC 3339 created_at and improve IndexConfig repr (#3558 ) The server now serializes an index's `created_at` as an RFC 3339 string (e.g. `"2026-06-18T21:37:36.637Z"`), but the client deserializer only accepted a unix timestamp in milliseconds. This caused `list_indices` to fail with: ``` Failed to parse list_indices response: invalid type: string "2026-06-18T21:37:36.637Z", expected a unix timestamp in milliseconds ``` This PR replaces the fixed millisecond deserializer with a custom one that accepts both an RFC 3339 string (current server) and a unix-millisecond integer (legacy deployments), so the client works against any server version. It also improves the `IndexConfig` repr in the Python bindings. Previously it printed only three fields (`Index(FTS, columns=["text"], name="text_idx")`), hiding the metadata that `list_indices` returns. It now renders every populated field, omitting any that are `None`. Each value is valid Python — integer counts use `_` thousands separators and `created_at` uses the `datetime` repr — so values round-trip. 
The real repr is a single line; it's wrapped here for readability: ```python >>> table.list_indices() [IndexConfig( name="text_idx", index_type="FTS", columns=["text"], index_uuid="aefd3e00-2f95-4bdc-92ac-06de84442bf1", type_url="/lance.table.InvertedIndexDetails", created_at=datetime.datetime(2026, 6, 18, 21, 37, 36, 637000, tzinfo=datetime.timezone.utc), num_indexed_rows=2, size_bytes=3_669, num_segments=1, index_version=1, index_details={ 'lance_tokenizer': None, 'base_tokenizer': 'simple', 'language': 'English', 'with_position': False, 'max_token_length': 40, 'lower_case': True, 'stem': True, 'remove_stop_words': True, 'custom_stop_words': None, 'ascii_folding': True, 'min_ngram_length': 3, 'max_ngram_length': 3, 'prefix_only': False, }, )] ``` Fixes #3556 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 10:40:56 -07:00
Lance Release	3b279f5705	Bump version: 0.34.0-beta.0 → 0.34.0-beta.1	2026-06-19 15:59:43 +00:00
Ryan Green	e1334954d7	fix: overflow using sys.maxsize for k in query with namespace connection (#3561 )	2026-06-19 12:57:10 -02:30
Lance Release	4f4cce3f64	Bump version: 0.33.1-beta.2 → 0.34.0-beta.0	2026-06-18 18:42:07 +00:00
Will Jones	ce5dadd386	fix(ci): allow shell pre-commit hooks in bumpversion configs (#3554 ) The "Create release commit" workflow (`make-release-commit.yml`) has failed on its last two runs; no release tags have been created since June 4. Since this workflow creates the tag that the cargo/npm/pypi/java publish workflows trigger off of, all recent releases are effectively blocked. The workflow installs `bump-my-version` unpinned. Version `1.4.0` added a check that refuses to run `pre_commit_hooks` containing shell syntax (pipes, `&&`, `if`, variable expansion) unless `allow_shell_hooks = true` is set. Both bumpversion configs use such hooks: - `python/.bumpversion.toml` — updates `Cargo.lock` after the bump (fails first) - `.bumpversion.toml` — runs `mvn versions:set` for the Java packages The job dies at the version-bump step with: > Hook '…' contains shell syntax (pipes, redirects, or variable expansion). Set `allow_shell_hooks = true` in your configuration to enable shell execution… This sets `allow_shell_hooks = true` in both configs to restore the previous behavior. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:22:05 -07:00
whitewooood	217fd8491d	fix(python): clarify single dictionary input error (#3537 ) ## Summary - clarify the Python error for passing a single dictionary to table creation/add paths - add a regression test for `create_table(..., data=dict)` so it points users to a list of dictionaries Fixes #409 ## Testing - `python -m pytest python/tests/test_table.py -q` - `python -m ruff format python/lancedb/table.py python/lancedb/scannable.py python/tests/test_table.py` - `python -m ruff check python/lancedb/table.py python/lancedb/scannable.py python/tests/test_table.py`	2026-06-17 12:55:55 -07:00
JSap0914	9128dbcd7a	fix(util): escape single quotes in struct field names in value_to_sql (#3548 ) ### Bug `value_to_sql({...})` builds a DataFusion `named_struct(...)` literal but interpolates the struct field names directly as `f"'{k}'"`. A field name that contains a single quote therefore produces invalid SQL: ```python >>> from lancedb.util import value_to_sql >>> value_to_sql({"it's": 1}) "named_struct('it's', 1)" # invalid SQL — the quote terminates the literal ``` String values are already escaped (single quotes doubled) by the `str` branch of `value_to_sql`, so keys and values were handled inconsistently. This affects `Table.update(values={...})` / `merge_insert` when a struct column has a field name containing `'`. ### Fix Render the key through `value_to_sql(str(k))` so field names are escaped exactly like string values: ```python >>> value_to_sql({"it's": 1}) "named_struct('it''s', 1)" ``` Keys without special characters are unchanged (`'a'` stays `'a'`), so existing behavior is preserved. ### Verification ``` $ pytest python/tests/test_util.py -k value_to_sql_dict ``` The new `test_value_to_sql_dict_key_escaping` covers quoted keys (incl. nested structs) and fails on `main` (`named_struct('it's', 1)`), passes with this change; the existing `test_value_to_sql_dict` still passes. Co-authored-by: JSap0914 <JSap0914@users.noreply.github.com>	2026-06-17 12:55:43 -07:00
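The escaping rule from this commit can be shown in isolation. This is a simplified sketch that handles only flat dictionaries with string or integer values; `quote_sql_string` and `struct_to_sql` are hypothetical helpers, not the actual `value_to_sql` implementation in `lancedb.util`.

```python
def quote_sql_string(text: str) -> str:
    """Render a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def struct_to_sql(fields: dict) -> str:
    """Build a named_struct(...) literal from a flat dict of str/int values."""
    parts = []
    for key, value in fields.items():
        rendered = quote_sql_string(value) if isinstance(value, str) else str(value)
        # Keys go through the same escaping as string values.
        parts.append(f"{quote_sql_string(str(key))}, {rendered}")
    return "named_struct(" + ", ".join(parts) + ")"
```

Routing keys and values through one quoting helper is what removes the inconsistency the commit describes: there is no longer a second code path that can forget to escape.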
Armaan Sandhu	b2ae763254	fix(python): raise clear TypeError for bare List/Tuple in pydantic schema conversion (#3511 ) Closes #3502 ## Problem A bare, unparameterised `typing.List` / `typing.Tuple` field crashes `to_arrow_schema` with an opaque `AttributeError: __args__`: ```python from typing import Tuple from lancedb.pydantic import LanceModel class Doc(LanceModel): items: Tuple Doc.to_arrow_schema() # AttributeError: __args__ ``` In `_py_type_to_arrow_type`, the branch `elif getattr(py_type, "__origin__", None) in (list, tuple)` is taken for a bare generic (its `__origin__` is `list / tuple`), but the next line reads `py_type.__args__[0]`, and a bare generic has no `__args__`. Other unsupported types (e.g. `Dict[str, int]`) correctly raise a clear `TypeError`, so this case is inconsistent. Fix Guard the element-type lookup with `getattr(py_type, "__args__", None)` and raise a clear `TypeError` when it is missing, matching the existing behavior for other unsupported types. Bare builtin list / tuple are unaffected (their `__origin__` is `None`, so they already fall through to the existing `TypeError`). Testing - Added `test_bare_generic_raises_type_error` covering both `List` and `Tuple`. - ruff format and ruff check clean.	2026-06-17 11:58:48 -07:00
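The guard described in this commit can be sketched outside of lancedb. The helper name `element_type` is an assumption for this example and the real `_py_type_to_arrow_type` does considerably more; the extra `TypeVar` check is also an addition of this sketch, covering older Python versions where a bare generic carries a placeholder argument.

```python
from typing import List, Tuple, TypeVar


def element_type(py_type):
    """Return the element type of a parameterised List/Tuple annotation."""
    if getattr(py_type, "__origin__", None) not in (list, tuple):
        raise TypeError(f"not a List/Tuple annotation: {py_type!r}")
    args = getattr(py_type, "__args__", None)
    # A bare generic has no usable __args__: the attribute is missing,
    # empty, or (on some older versions) only an unbound TypeVar.
    if not args or isinstance(args[0], TypeVar):
        raise TypeError(f"{py_type!r} needs an element type, e.g. List[int]")
    return args[0]
```

With this guard, `element_type(List[int])` returns `int`, while a bare `List` or `Tuple` raises `TypeError` with an actionable message rather than `AttributeError: __args__`.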
Brendan Clement	f76b075d13	feat: add table branch support to remote tables and Python/TS bindings (#3540 ) ### Description Adding branch support for RemoteTable by threading a branch selector onto every operation the data plane accepts it on. Exposes the currentBranch to nodejs and python through the bindings. Matching the server handlers, the branch rides as: - a `?branch=` query parameter for Arrow-body and query-only ops (insert, merge_insert, multipart_*, version/list, drop_index) - a `branch` field in the JSON body for everything else (count_rows, query, update, delete, create_index, column ops, index list/stats, stats, restore, describe, tags create/update) A main-branch handle (`branch == None`) produces byte-identical requests to before: no `branch` field and no `?branch=` - Handle-per-branch: `create_branch` / `checkout_branch` return a new handle with fresh caches and reset version/freshness state, mirroring `NativeTable`. - `create_branch` maps 409 to already-exists, 400 to invalid, and 404 to not-found with source context, and sends without retry so the 409 stays observable. - `Ref` translation covers version, version-number (relative to the handle's branch), and tag (resolved via the tags endpoint); `"main"` and empty normalize to the main branch. - Python branch handles persist their branch (and pinned version) across pickle/fork, so a forked or pickled handle reopens on its branch rather than silently reverting to main. ### Tests - Rust mock tests per op category (query-param and body mechanisms, branch CRUD, error paths, backward-compat). - Python sync branch CRUD, `open_table(branch=)`, and a pickle round-trip regression test.	2026-06-15 18:07:40 -04:00
Will Jones	f8caef3aca	feat(bindings): expose new IndexConfig fields in Python and Node.js (#3534 ) ## Summary Surfaces the rich per-index metadata added in #3497 to the Python and Node.js language bindings. Closes #3495. New optional fields exposed on `IndexConfig` in both bindings: - `index_uuid` / `indexUuid` — UUID of the first index segment - `type_url` / `typeUrl` — protobuf type URL for the index - `created_at` / `createdAt` — creation timestamp (milliseconds since Unix epoch) - `num_indexed_rows` / `numIndexedRows` — rows covered by the index - `num_unindexed_rows` / `numUnindexedRows` — rows not yet indexed - `size_bytes` / `sizeBytes` — total index file size in bytes - `num_segments` / `numSegments` — number of index segments - `index_version` / `indexVersion` — on-disk format version - `index_details` / `indexDetails` — type-specific JSON details string All fields are `None`/`undefined` for remote tables (which don't yet surface this metadata through the server response). ## Changes - `python/src/index.rs`: extend `IndexConfig` pyclass; update `From` impl; update `__getitem__` - `python/python/lancedb/_lancedb.pyi`: add type hints for new fields - `python/python/tests/test_table.py`: new `test_index_config_fields` test - `nodejs/src/table.rs`: extend `IndexConfig` napi struct; update `From` impl - `nodejs/__test__/table.test.ts`: new test; update existing `toEqual` assertions to `expect.objectContaining` to accommodate new fields ## Test plan - [x] Python: `uv run --extra tests pytest python/tests/test_table.py::test_index_config_fields` - [x] Node.js: `pnpm test __test__/table.test.ts` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 13:37:39 -07:00
nuthalapativarun	40f3e22600	feat: support rename_table on LanceNamespaceDatabase (#3520 ) ## Summary Closes #3412 Implements `rename_table` for `LanceNamespaceDatabase` (sync and async Python) and the Rust `NamespaceDatabase` backend. Previously these raised `NotImplementedError`; this PR delegates to the `LanceNamespace.rename_table` method which is part of the lance-namespace spec. ### Changes - `rust/lancedb/src/database/namespace.rs`: Remove the `NotImplementedError` stub for `rename_table`. Build a `RenameTableRequest` (with `id`, `new_table_name`, and optionally `new_namespace_id`) and call `self.namespace.rename_table(...)`, mirroring the existing `drop_table` pattern. - `python/python/lancedb/namespace.py`: Import `RenameTableRequest` from `lance_namespace`. Replace the `raise NotImplementedError` in both `LanceNamespaceDatabase.rename_table` (sync) and `AsyncLanceNamespaceDatabase.rename_table` (async) with a call to `self._namespace_client.rename_table(request)`. - `python/python/tests/test_namespace.py`: Replace the `test_rename_table_not_supported` test (which checked for `NotImplementedError`) with `test_rename_table`, which: 1. Creates a table in a namespace 2. Calls `rename_table` with `cur_namespace_path` and `new_namespace_path` 3. Asserts the old name is gone from `table_names()` 4. Asserts the new name appears in `table_names()` 5. Verifies the renamed table can be opened ## Test plan - [ ] Existing namespace tests pass in CI (all rely on `lance.namespace.DirectoryNamespace` which requires the full lance package) - [ ] `test_rename_table` exercises the full rename path: create → rename → verify old gone → verify new present → open - [ ] Rust build passes with the updated `namespace.rs` (requires Rust toolchain in CI)	2026-06-11 11:41:07 -07:00
nuthalapativarun	04480c274a	test(python): add nested field regression matrix tests (#3518 ) ## Summary Closes #3406 Add a regression matrix in `python/python/tests/test_nested_fields.py` that exercises the full nested field index lifecycle for both the sync and async Python table APIs. The tests will fail if any implementation regresses to leaf-only field names in `list_indices`, `index_stats`, search, or filter results. ## Test scenarios covered Index types: BTree scalar, IvfPq vector, FTS Field-name edge cases (per acceptance criteria): - `rowId` — camelCase top-level field - `` `row-id` `` — hyphenated top-level field (escaped) - `parent.`\``leaf.name`\`` ` — struct leaf whose name contains a literal dot - `MetaData.userId` — mixed-case nested path - `` `meta-data`.`user-id` `` — hyphenated struct with hyphenated leaf Lifecycle operations per index type: - `create_index` / `create_scalar_index` / `create_fts_index` - `list_indices` → verify canonical full dotted path (not leaf name) - `index_stats` → verify row count and index type - Filtered scan (`WHERE nested.field = value`) - Vector search via nested embedding column - FTS search via nested text column - `add` (append) then re-check index listing - `optimize` then re-check index listing Both sync and async APIs are covered in parallel test classes. ## Notes Lance forbids top-level field names that contain a literal `.`, so the `` `a.b` `` acceptance-criterion variant is exercised as a struct leaf field (`parent.`\``leaf.name`\``) rather than a top-level column.	2026-06-11 08:06:04 -07:00
Trenton H	ae7f2cbfe8	feat(python): accept Expr in Table.delete and merge when_not_matched_by_source_delete (#3524 ) Another little pain point as I was working to integrate with paperless-ngx. The read path of table.search() or table.query() already accepted an Expr, but write paths Table.delete and merge_insert(...).when_not_matched_by_source_delete did not. This PR attempts to close that gap, so writes and reads can both use Expr, instead of one side needing to build a string.	2026-06-11 07:59:49 -07:00
Trenton H	85d9c1ce63	feat: adds isin support to the 'Expr' builder (#3523 ) The `Expr` builder already includes a lot of useful filtering options, `eq, ne, gt/gte, lt/lte, and_, or_, contains, cast`, but it was missing a membership test like `isin`. This PR adds that support, as minimally as possible, allowing easy filtering for membership in a list without needing a series of `where` expressions. I didn't see anything in CONTRIBUTING.md about needing a feature request or issue first, so I just made the change. My apologies if I missed that somewhere. Thanks for the vector store, we're using it now in paperless-ngx.	2026-06-10 15:28:19 -07:00
Jack Ye	8373318e89	feat: support FM-Index scalar index for substring search (#3532 ) Adds an FM-Index — a scalar index over string and binary columns that accelerates substring search (`contains(col, 'needle')`), distinct from the tokenized `FTS` index — across the Rust core and the Python and TypeScript bindings. ## Rust - `Index::Fm(FmIndexBuilder)` and `IndexType::Fm`. - `make_index_params` maps `Index::Fm` to Lance's `ScalarIndexParams::for_builtin(BuiltinIndexType::Fm)`. - `supported_fm_data_type` validates `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` columns. - `list_indices` round-trips the type (`"Fm"` → `IndexType::Fm`); the remote wire type is `"FM"`. ## Python Adds `lancedb.index.Fm`, accepted by `create_index`: ```python from lancedb.index import Fm await tbl.create_index("text", config=Fm()) ``` ## TypeScript Adds the `Index.fm()` factory: ```ts await tbl.createIndex("text", { config: Index.fm() }); ```	2026-06-10 12:28:20 -07:00
Xuanwo	566b67a634	fix: support LargeList label list indexes (#3529 ) ## Summary This PR extends nested-field regression coverage across Rust local/remote, Python sync/async, and Node so canonical escaped paths stay consistent across scalar, vector, and FTS index lifecycle behavior. It also aligns LanceDB's LabelList type gate with Lance by accepting `LargeList<primitive>` columns while keeping `List<Struct<...>>` unsupported until Lance defines stable membership semantics for struct labels. Part of #3406.	2026-06-10 23:53:56 +08:00
devteamaegis	f260d3bf12	fix(util): convert numpy scalars in value_to_sql (#3522 ) ## What's broken `Table.update(values={...})` raises `NotImplementedError: SQL conversion is not implemented for this type` when a value is a numpy scalar such as `np.int64`, `np.int32`, `np.float32`, or `np.bool_`. These arise naturally from indexing an ndarray or a pandas int/bool column. `np.float64` happens to work (it subclasses `float`), which makes the failure inconsistent and surprising. ```python df = pd.DataFrame({"id": np.array([10, 20], dtype="int32")}) t.update(where="id = 1", values={"id": df["id"].iloc[0]}) # np.int32 # -> NotImplementedError: SQL conversion is not implemented for this type ``` ## Why it happens `value_to_sql` is a `singledispatch` with handlers only for native Python types and `np.ndarray`; numpy `integer`/`floating`/`bool_` scalars aren't Python subclasses, so they fall through to the `NotImplementedError` base. ## Fix Register handlers for `np.bool_`, `np.integer`, and `np.floating` that delegate to the existing native handlers. ## Test `value_to_sql` on `np.int32/int64/float32/float64/bool_` all convert; `np.int32` raised before. Co-authored-by: Ishaan Samantray <ishaansamantray@Ishaans-MacBook-Pro.local>	2026-06-09 15:57:02 -07:00
Brendan Clement	d9018067b3	feat: support checking out a version on a branch (#3504 ) ### Description Stacked on #3490. Adds an optional version to branch checkout across the Rust core and the Python and TypeScript SDKs, so you can open a specific version on a branch ("version V of branch B"), not just the branch's latest version Rust ```rust // Open version 3 of branch "exp" (a read-only view): check out from an // existing table, or open it directly from the connection. let exp_v3 = table.checkout_branch("exp", Some(3)).await?; let exp_v3 = db.open_table("items").branch("exp").version(3).execute().await?; // checkout_latest re-attaches to the branch's writable HEAD. exp_v3.checkout_latest().await?; // With no branch, a version opens main at that version. let main_v3 = db.open_table("items").version(3).execute().await?; ``` Python ```python # Open version 3 of branch "exp" (a read-only view): check out from an # existing table, or open it directly from the connection. branch_v3 = await table.branches.checkout("exp", version=3) branch_v3 = await db.open_table("items", branch="exp", version=3) # checkout_latest re-attaches to the branch's writable HEAD. await branch_v3.checkout_latest() # With no branch, a version opens main at that version. main_v3 = await db.open_table("items", version=3) ``` TypeScript ```typescript // Open version 3 of branch "exp" (a read-only view): check out from an // existing table, or open it directly from the connection. const branchV3 = await (await table.branches()).checkout("exp", 3); const opened = await db.openTable("items", undefined, { branch: "exp", version: 3 }); // checkoutLatest re-attaches to the branch's writable HEAD. await branchV3.checkoutLatest(); // With no branch, a version opens main at that version. 
const mainV3 = await db.openTable("items", undefined, { version: 3 }); ``` ### Testing - Added unit tests (Rust, Python sync + async, TypeScript): branch-scoped resolution at a version number shared with `main` and with another branch, read-only enforcement on a pinned handle, `checkout_latest` recovery to the branch's HEAD, fork-point reads, and the nonexistent-version/branch error paths. - Ran smoke tests against the Python and TypeScript SDKs on local machine.	2026-06-08 17:36:38 -07:00
Brendan Clement	53517b3aaa	feat: add table branch support (#3490 ) ### Description Adds first-class support for table branches across the Rust core and the Python and TypeScript SDKs. Rust ```rust use lance::dataset::refs::Ref; // Create a branch from main and write to it — main is untouched. let exp = table.create_branch("exp", Ref::Version(None, None)).await?; exp.add(batches).await?; // Reopen the branch later: check out from a table, or open it directly. let exp = table.checkout_branch("exp").await?; let exp = db.open_table("items").branch("exp").execute().await?; let branches = table.list_branches().await?; table.delete_branch("exp").await?; ``` Python ```python # Create a branch from main and write to it branch = await table.branches.create("exp", from_ref="main") await branch.add(data) # Reopen the branch later: check out from a table, or open it directly. branch = await table.branches.checkout("exp") branch = await db.open_table("items", branch="exp") await table.branches.list() await table.branches.delete("exp") ``` TypeScript ```typescript const branches = await table.branches(); // Create a branch from main and write to it const branch = await branches.create("exp"); await branch.add(data); // Reopen the branch later: check out from a table, or open it directly. const checkedOut = await branches.checkout("exp"); const opened = await db.openTable("items", undefined, { branch: "exp" }); await branches.list(); await branches.delete("exp"); ``` ### Testing - Added unit tests - ran smoke tests against python and typescript sdks on local machine ### Next steps - Add RemoteTable support - Add Branch Comparison support - Merge Branching support	2026-06-08 16:26:46 -07:00
Yang Cen	3e25f584eb	fix(python): push down namespace full reads (#3516 ) ## Bug Fix ### What is the bug? Namespace-backed `LanceTable.to_arrow()` full-table reads bypassed the existing `QueryTable` server-side query path and called the lower-level table `to_arrow()` implementation directly. In Geneva/Sophon this could fail while parsing the Arrow IPC response for `hist.get_table().to_arrow()` / `to_pandas()`, even though `hist.get_table().search().to_arrow()` worked. ### What issues or incorrect behavior does the bug cause? Full-table reads on namespace-backed tables with `QueryTable` pushdown could fail with Arrow IPC parse errors, while query/search reads on the same table succeeded. Since `to_pandas()` delegates through `to_arrow()` for non-blob/native cases, pandas export was affected too. ### How does this PR fix the problem? When `QueryTable` pushdown is enabled, sync and async table `to_arrow()` now construct a plain no-filter, no-limit, all-columns query and execute it through the table-level `_execute_query()` path. `AsyncTable` now preserves namespace context from async namespace connections so async full reads can make the same pushdown decision. Non-namespace tables and namespace tables without `QueryTable` pushdown keep their existing behavior. 
### Tests - `uv run --extra tests --extra dev --no-sync ruff check python/lancedb/table.py python/lancedb/namespace.py python/tests/test_namespace.py` - `uv run --extra tests --extra dev --no-sync ruff format python/lancedb/table.py python/lancedb/namespace.py python/tests/test_namespace.py` - `uv run --extra tests --extra dev --no-sync pytest python/tests/test_namespace.py::TestPushdownOperations::test_lance_table_to_arrow_uses_query_pushdown python/tests/test_namespace.py::TestAsyncPushdownOperations::test_async_table_to_arrow_uses_query_pushdown python/tests/test_namespace.py::test_local_table_to_arrow_and_to_pandas_are_unchanged -q` - `uv run --extra tests --extra dev --no-sync pytest python/tests/test_namespace.py -q`	2026-06-08 19:48:40 +08:00
Will Jones	09b1bbc12a	refactor!: drop unused loss field from IndexStatistics (#3496 ) BREAKING CHANGE: direct Rust users lose the `IndexStatistics::loss` field. Python and Node.js consumers are unaffected in practice for remote tables (the value was always `None`/absent), but the attribute is gone for local tables too. `IndexStatistics::loss` was local-only — LanceDB Cloud never returned it, so `RemoteTable::index_stats` always set `loss: None`. It's vestigial; this removes it. - Remove `loss` from `IndexStatistics` and the internal `IndexMetadata` in `rust/lancedb/src/index.rs`, plus the summing logic in `NativeTable::index_stats`. - Drop `loss` from the Python and Node.js bindings (and their tests/docs). Fixes #3493 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 07:52:40 -07:00
Armaan Sandhu	3868965413	fix(python): run AsyncTable.search embeddings on a dedicated executor (#3459 ) ## Summary `AsyncTable.search()` computes the query embedding with `loop.run_in_executor(None, ...)`, which uses asyncio's default `ThreadPoolExecutor`. That pool is shared with all other `run_in_executor(None, ...)` work, so a slow embedding call — a heavy local model or an HTTP request to an embeddings API — ties up those threads and starves unrelated async I/O under concurrent load. This moves the (potentially blocking) embedding call onto a dedicated executor, isolating it from the default pool. Closes #3310. ## Problem `python/lancedb/table.py`, `AsyncTable.search()`: ```python return ( await loop.run_in_executor( None, # asyncio's default executor, shared with other blocking I/O embedding.function.compute_query_embeddings_with_retry, query, ) )[0] ``` Under load, concurrent searches whose embeddings block (or any other code using the default executor) contend for the same small thread pool. ## Change - Add a dedicated `ThreadPoolExecutor(thread_name_prefix="lancedb-embedding")` in `background_loop.py`, exposed via `embedding_executor()`. - Use it in `AsyncTable.search()`'s `make_embedding` instead of the default executor. - Reset the executor in the existing `_reset_after_fork` hook — its worker threads don't survive `fork()`, same as the background event loop. It's recreated lazily, so this is cheap. ## Design notes The issue asked whether maintainers preferred a configurable executor, a dedicated internal one, or another approach (no response in the thread). I went with a dedicated internal executor: it fixes the starvation with no public API change and stays consistent with the existing `LOOP` singleton. Making the pool size configurable would be an easy follow-up if preferred. Scope is limited to `search()`. The broader "embedding functions need real async support" (including `add()`) is tracked separately in #3268. 
## Testing - Added `test_async_search_runs_embedding_on_dedicated_executor`: patches the embedding function to record the executing thread during an async search and asserts it runs on a `lancedb-embedding` thread. Verified it fails against the previous `run_in_executor(None, ...)` and passes with the fix. - `ruff format`, `ruff check`, and `pyright` pass on the changed files.	2026-06-04 21:57:16 -07:00
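The executor isolation described in this commit can be illustrated with the standard library. This is a sketch under stated assumptions: `compute_embedding` is a hypothetical stand-in for a blocking embedding call, the pool size of 4 is arbitrary, and the module-level executor here does not reproduce the lazy creation or fork reset that the commit mentions.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool; the thread-name prefix matches the one named in the commit.
_EMBEDDING_EXECUTOR = ThreadPoolExecutor(
    max_workers=4, thread_name_prefix="lancedb-embedding"
)


def compute_embedding(query: str):
    """Hypothetical stand-in for a slow, blocking embedding call."""
    time.sleep(0.01)
    return [float(len(query))], threading.current_thread().name


async def embed(query: str):
    loop = asyncio.get_running_loop()
    # Passing an explicit executor keeps this work off asyncio's default pool,
    # so other run_in_executor(None, ...) callers are not starved.
    return await loop.run_in_executor(_EMBEDDING_EXECUTOR, compute_embedding, query)
```

Running `asyncio.run(embed("hello"))` returns the vector together with the name of the worker thread, which starts with `lancedb-embedding`. This is the same property the commit's regression test asserts.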
hashwnath	64194ea8ad	fix(python): make LanceDBClientError pickleable (#3470 ) ## Summary - Add `__reduce__` methods to `LanceDBClientError` and `RetryError` so that instances can be pickled and unpickled correctly - `HttpError` inherits the fix from `LanceDBClientError` since it has no additional `__init__` parameters - Add tests verifying pickle roundtrip for all three exception classes Fixes #3447 ## Test plan - [x] Verified pickle roundtrip for `LanceDBClientError` with and without `status_code` - [x] Verified pickle roundtrip for `HttpError` (subclass, no extra init params) - [x] Verified pickle roundtrip for `RetryError` (subclass with many extra params) - [ ] CI tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Will Jones <willjones127@gmail.com>	2026-06-04 09:29:15 -07:00
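Why `__reduce__` is needed here can be seen with a small stand-alone example. The classes below are hypothetical and their constructor signature is an assumption; they are not the real `LanceDBClientError` or `RetryError`.

```python
import pickle


class ClientError(Exception):
    """Hypothetical error whose __init__ takes more than the message."""

    def __init__(self, message, request_id, status_code=None):
        super().__init__(message)  # only `message` ends up in self.args
        self.message = message
        self.request_id = request_id
        self.status_code = status_code


class PicklableClientError(ClientError):
    def __reduce__(self):
        # Tell pickle to rebuild the instance with every __init__ argument,
        # instead of the default cls(*self.args).
        return (type(self), (self.message, self.request_id, self.status_code))
```

By default an exception is rebuilt as `cls(*self.args)`, so unpickling `ClientError` fails with `TypeError` because `request_id` is missing. The subclass round-trips through `pickle.dumps` and `pickle.loads` with all three attributes intact.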
Lance Release	952055d428	Bump version: 0.33.1-beta.1 → 0.33.1-beta.2	2026-06-04 06:04:37 +00:00
Yang Cen	927ba2c948	fix(python): route blob query pandas through scanner (#3491 ) ## Bug Fix ### What is the bug? `QueryBuilder.to_pandas(blob_mode="descriptions")` could still fall back to `self.to_arrow()` for query outputs with blob columns. Custom query subclasses or wrappers can have `to_arrow()` behavior that is not compatible with pandas blob-description conversion, which can surface as low-level Arrow/list-batch conversion failures. ### What issues or incorrect behavior does the bug cause? Callers need to carry local `to_pandas` or plain-scan adapter special casing for blob descriptions, and scanner-only kwargs such as row addresses and fragment selection are not represented in LanceDB query state. ### How does this PR fix the problem? This PR routes blob-output query `to_pandas()` through the Lance scanner path for `lazy`, `bytes`, and `descriptions` modes when the query is a scanner-backed plain scan. For `blob_mode="descriptions"` with `flatten`, it collects scanner Arrow/table output, applies LanceDB `flatten_columns`, and converts to pandas from there. Non-plain blob query shapes now fail with a clear unsupported error instead of falling into subclass `to_arrow()` behavior. It also adds Python query state and builder methods for scanner-only plain-scan parameters: - `with_row_address()` for `_rowaddr` - `with_fragments(...)` for Lance fragment objects - `fragment_ids([...])` as a convenience wrapper that resolves IDs to Lance fragments ## Validation - `cd python && uv run --no-sync ruff format --check python/lancedb/query.py python/tests/test_query.py` - `cd python && uv run --no-sync ruff check python/lancedb/query.py python/tests/test_query.py` Targeted pytest was intentionally not run locally per maintainer request.	2026-06-04 14:03:33 +08:00
Will Jones	a16676e05f	ci: update python lockfile weekly (#3498 ) Make sure we are getting security fixes in there regularly, and other useful bumps.	2026-06-03 15:24:32 -07:00
Harikrishna KP	4e44262499	test(python): add regression test for nullable struct with None (#2654 ) (#3483 ) ## Summary Regression test for [issue #2654](https://github.com/lancedb/lancedb/issues/2654) — a nullable struct column whose first batch contains only `None` values crashed in `_align_field_types` with `AttributeError: 'pyarrow.lib.DataType' object has no attribute 'fields'`. The actual fix landed in #3394, but no test was added. This PR adds the reproducer from the issue as a test. ## Test plan - `test_add_nullable_struct_with_none`: creates a table with a nullable struct column, adds a row with a non-null struct value, then a row with `None` for the struct field. Verifies both rows land correctly. - Uses Lance file format v2.1 (`new_table_data_storage_version="2.1"`) because nullable structs aren't supported on v2.0. ## Related - #3028 (the original fix attempt, now superseded)	2026-06-03 14:13:09 -07:00
devteamaegis	9969191d0d	fix(rerankers): guard against empty vector_results in RRFReranker.rerank_multivector (#3467 ) ## What's broken Calling `RRFReranker().rerank_multivector([])` crashes with `IndexError: list index out of range` because the method accesses `vector_results[0]` for the type-homogeneity check before verifying the list is non-empty. The `all()` call passes vacuously on an empty iterable so the crash hits the next lines. ```python from lancedb.rerankers import RRFReranker RRFReranker().rerank_multivector([]) # IndexError: list index out of range ``` ## Why it happens The type check uses `vector_results[0]` as the reference type but never guards against an empty list. `all(...)` short-circuits to `True` when the iterable is empty, so the bad index access on the lines that follow is never reached by the existing guard logic. ## Fix Add an explicit empty-list check before any indexing.	2026-06-03 14:06:33 -07:00
devteamaegis	1e7326cd8c	fix(rerankers/mrr): raise ValueError on empty vector_results list (#3469 ) ## What's broken `MRRReranker.rerank_multivector([])` raises `IndexError: list index out of range`. The crash happens on line 128 (the `all()` type-homogeneity check passes vacuously on an empty iterable) and on line 134 which accesses `vector_results[0]` unconditionally, with no prior guard for an empty list. ## Why it happens `all()` over an empty iterable returns `True`, so the type check silently passes and execution falls through to `vector_results[0]` which crashes. ## Fix Added a two-line guard at the top of `rerank_multivector` that raises a clear `ValueError("vector_results must not be empty")` before any indexing occurs. ## Test Added `test_mrr_reranker_empty_input` in `test_rerankers.py` which calls `rerank_multivector([])` and asserts that a `ValueError` with the message "must not be empty" is raised. Fixes #3468 Co-authored-by: Aegis Dev <aegis@devteamaegis.com>	2026-06-03 14:05:43 -07:00
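Both reranker fixes above (#3467 and #3469) rest on the same Python behaviour: `all()` over an empty iterable returns `True`, so a type-homogeneity check cannot protect a later `[0]` access. The following is a minimal sketch of the guard pattern; `check_homogeneous` is a hypothetical helper, not the actual `rerank_multivector` code.

```python
def check_homogeneous(vector_results):
    """Return the common type of the results, rejecting empty input first."""
    # Guard before any indexing: all() passes vacuously on an empty list,
    # so without this check vector_results[0] would raise IndexError.
    if not vector_results:
        raise ValueError("vector_results must not be empty")

    first_type = type(vector_results[0])
    if not all(isinstance(result, first_type) for result in vector_results):
        raise ValueError("all vector_results must have the same type")
    return first_type


# Demonstrates the vacuous truth that hid the bug.
vacuous = all(isinstance(r, int) for r in [])
```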
Lance Release	ac3411e81e	Bump version: 0.33.1-beta.0 → 0.33.1-beta.1	2026-06-03 11:16:51 +00:00
Yang Cen	6f18eb4cce	feat(python): support blob modes in query to_pandas (#3487 ) ## Feature - What is the new feature? - Adds `blob_mode` support to sync and async Python query `to_pandas()` APIs. - Enables plain scan queries to return blob columns as lazy `BlobFile` objects, raw bytes, or blob descriptions. - Lets namespace-backed local tables use Lance native blob-aware pandas conversion for lazy blobs. - Why do we need this feature? - Table and Lance dataset/scanner APIs already support blob-aware pandas conversion, but LanceDB query builders did not expose that capability. - Geneva and other callers should be able to use query-level `to_pandas(blob_mode=...)` without manually constructing Lance scanners. - How does it work? - Plain scan queries route through Lance scanner native `to_pandas(blob_mode=...)`, preserving filter, projection, limit, offset, row id, and alias/expression projection behavior. - Non-native query shapes keep existing Arrow fallback semantics and raise a clear error when they return blob columns with `blob_mode="lazy"` or `blob_mode="bytes"`. - Focused tests cover table/query blob modes, filter/select/limit/offset/alias query cases, async query behavior, vector-query error boundaries, and namespace-backed lazy blobs. 
## Validation - `cd python && .venv/bin/maturin develop --uv --extras tests,dev --profile dev` - `cd python && uv run --frozen --no-sync pytest python/tests/test_table.py::test_table_to_pandas_blob_modes python/tests/test_table.py::test_async_table_to_pandas_blob_bytes python/tests/test_query.py::test_plain_scan_query_to_pandas_blob_modes python/tests/test_query.py::test_plain_scan_query_to_pandas_blob_projection python/tests/test_query.py::test_async_plain_scan_query_to_pandas_blob_projection python/tests/test_query.py::test_vector_query_to_pandas_blob_mode_requires_native_path python/tests/test_namespace.py::TestNamespaceConnection::test_table_to_pandas_blob_lazy_through_namespace -q` - `cd python && uv run --frozen --no-sync ruff format --check .` - `cd python && uv run --frozen --no-sync ruff check .` - `git diff --check`	2026-06-03 19:15:44 +08:00
Brendan Clement	379684391e	feat: deprecate replace_field_metadata for update_field_metadata (#3484 ) ### Summary Deprecates the Python replace_field_metadata (on Table and AsyncTable) in favor of update_field_metadata. Mirrors Lance, which already deprecated Dataset.replace_field_metadata for update_field_metadata. Stacked on top of #3482 as this was a follow-up task after adding update_field_metadata	2026-06-02 14:02:22 -07:00
Brendan Clement	d065be0474	feat: add update_field_metadata to edit per-field metadata (#3482 ) ### Summary Adds update_field_metadata to the client SDK (Rust core, Python, and TypeScript) so clients can edit per-field (column) Arrow metadata (schema.fields[].metadata) ### Testing - added unit tests - ran E2E against a local server on both local and remote tables (set → merge → delete), across Python sync/async and TypeScript ### Next steps - deprecate replace_field_metadata in the Python lancedb in favor of this (TypeScript didn't have a replace_field_metadata method). This matches Lance's API direction (Lance already deprecated replace_field_metadata for update_field_metadata)	2026-06-02 07:00:00 -07:00
Xuanwo	a327044e2f	feat(python): support remote tables in PyTorch dataloaders (#3432 ) This PR makes remote LanceDB tables usable from PyTorch multiprocessing workers. Remote tables now carry enough safe JSON connection state to reopen themselves after pickle/spawn or fork, and permutations lazily rebuild their reader from restored tables instead of trying to reuse process-local handles. This addresses the remote-table gap in the PyTorch dataset path while preserving the explicit connection factory escape hatch for custom worker-side credential loading or non-serializable header providers. Validated with targeted remote table, permutation, and PyTorch DataLoader tests.	2026-06-02 15:38:28 +08:00
Lance Release	60f961584c	Bump version: 0.33.0-beta.1 → 0.33.1-beta.0	2026-06-01 12:41:02 +00:00
Heng Ge	048f52c2aa	feat(table): route merge_insert through the MemWAL LSM write path (#3354 ) ## Summary When an `LsmWriteSpec` is installed on a table (#3396), `merge_insert` upsert calls are dispatched through Lance's MemWAL `ShardWriter` (LSM-style append) instead of the standard merge path. - `use_lsm_write` — a `merge_insert` builder option, default `true`; set it `false` to use the standard path for a call even when a spec is set. - `assume_pre_sharded` — a `merge_insert` builder option, default `false`; skips the per-row shard check and routes by the first row only. - `close_lsm_writers` — drains and closes the table's cached MemWAL shard writers. - The `merge_insert` `on` columns default to, and are validated against, the table's unenforced primary key. - Shard writers are cached alongside the dataset (in `DatasetConsistencyWrapper`) and reused for the session. - `MergeResult` gains `num_rows` — on the LSM path the insert/update breakdown is unknown until compaction, so only the total is reported. Routing covers all three sharding strategies — bucket (murmur3, Iceberg-compatible), identity, and unsharded. Each `merge_insert` call targets a single shard; the whole input is collected and validated before a single atomic `ShardWriter::put`, so a validation failure leaves the MemWAL untouched. Bindings: Python (`merge_insert(...).use_lsm_write(...)` / `.assume_pre_sharded(...)`, `Table.close_lsm_writers`) and TypeScript (`mergeInsert(...).useLsmWrite(...)` / `.assumePreSharded(...)`, `Table.closeLsmWriters`). ## Context Reconstructed from the original #3354 branch onto current `main`: the branch predated the #3394 (unenforced primary key) / #3396 (`LsmWriteSpec`) split and has been rebuilt on that merged foundation. Depends on Lance `v7.0.0-beta.13`. The MemWAL read path (reading un-flushed shard data back into queries) and remote (LanceDB Cloud) LSM support are follow-ups. --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>	2026-05-29 08:48:11 -07:00
Xuanwo	60ac5c9a7c	test(python): fix remote create_index schema fixture (#3462 ) The latest main Python workflow fails across multiple matrix jobs because `test_remote_create_index_new_api` opens a remote table whose mocked schema only exposes `id`, while the new `create_index(..., config=...)` path validates the requested indexed columns. This updates the remote-table fixture to include the indexed columns used by the smoke test and checks the emitted column payloads, keeping the test aligned with the schema-aware API path.	2026-05-29 23:04:42 +08:00
Will Jones	d05fe8ec44	feat(python): unify sync create_index API to match async API (#2882 ) ## Summary - Transitions `LanceTable` and `RemoteTable` to use the unified `create_index()` API matching `AsyncTable` - Deprecates `create_scalar_index()` and `create_fts_index()` with deprecation warnings - Adds detection logic to distinguish legacy vs new API calls - Adds `@overload` decorators for type checker compatibility - Adds `accelerator` parameter to IVF config classes for GPU support New API: ```python table.create_index("vec", config=IvfPq(distance_type="l2")) table.create_index("col", config=BTree()) table.create_index("text_col", config=FTS(with_position=True)) ``` Legacy API (deprecated): ```python table.create_index("l2", vector_column_name="vec") # emits DeprecationWarning table.create_scalar_index("col", index_type="BTREE") # deprecated table.create_fts_index("text_col") # deprecated ``` Fixes #2879 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-05-28 16:41:47 -07:00
Will Jones	ab982d7f65	perf: migrate list_indices to use Lance's describe_indices (#3108 ) This needs https://github.com/lance-format/lance/pull/6099 to work. Closes #3140 --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-28 16:41:05 -07:00
Jack Ye	a7d9f2e99d	fix: remove primary key constraint from MemWAL bucket sharding (#3435 ) ## Summary - Bump lance dependency from `v7.0.0-beta.13` to `v7.0.0-rc.1` - Remove PK constraint from `LsmWriteSpec::Bucket` docs and `Table::set_lsm_write_spec` docs - Remove test assertions that expected rejection when no PK is set or when bucket column != PK Closes https://github.com/lance-format/lance/issues/6917	2026-05-26 17:35:28 -07:00
devteamaegis	7dba793629	fix(rerankers): inverted scores and incorrect missing-FTS penalty in LinearCombinationReranker (#3437 ) ## Problem `LinearCombinationReranker.merge_results` has two related bugs that make it return inverted relevance rankings — the least relevant document ranks first (closes #3154). ### Bug 1 — `_combine_score` subtracts from 1, inverting the final ranking ```python def _combine_score(self, vector_score, fts_score): return 1 - (self.weight * vector_score + (1 - self.weight) * fts_score) ``` Both `vector_score` (already converted via `_invert_score`) and `fts_score` (BM25 relevance) are in higher-is-better space. Wrapping the weighted average in `1 - (...)` flips the direction: a perfectly matching document (`vector_score=1, fts_score=1`) gets `_relevance_score = 0.0`, while a non-matching document gets a high score. ### Bug 2 — Documents missing an FTS score are rewarded, not penalised ```python fts_score = result.get("_score", fill) # fill=1.0 by default ``` When a document has no FTS match, `fts_score = fill = 1.0`. In `_combine_score` (with the bug-1 formula), this large value becomes a negative penalty via `1 - (... + 0.3 * 1.0)`, counterintuitively boosting the document's score. By contrast, missing vector results correctly receive `_invert_score(fill) = 0.0` (penalised). ## Fix Bug 1 — remove the `1 -` inversion from `_combine_score`: ```python def _combine_score(self, vector_score, fts_score): return self.weight * vector_score + (1 - self.weight) * fts_score ``` Bug 2 — use `1 - fill` for missing FTS scores so both penalties are symmetric (mirror of what `_invert_score(fill)` already does for missing vector scores): ```python fts_score = result.get("_score", 1 - fill) # was: fill ``` With `fill=1.0` (default): `1 - 1.0 = 0.0` — missing-FTS entries contribute `0` to the FTS term, identical to how missing-vector entries contribute `0` to the vector term. ## Verification Concrete example from the issue. 
With `weight=0.7`, `fill=1.0`:

| Document | `_distance` | `_score` | Old `_relevance_score` | New `_relevance_score` |
|----------|-------------|----------|------------------------|------------------------|
| `apple orange` | 0.0 (best) | 2.41 (only FTS) | 0.30 (wrong: ranked 2nd) | 1.42 (correct: ranked 1st) |
| `banana grape` | 0.9999 (worst) | — | 0.70 (wrong: ranked 1st) | 0.00 (correct: ranked last) |

## Tests Two regression tests added to `python/python/tests/test_rerankers.py`: - `test_linear_combination_best_match_ranks_first` — the document with the smallest distance and an FTS match must have the highest `_relevance_score`. - `test_linear_combination_missing_fts_is_penalised` — a document with any FTS score must beat an otherwise-equal document with no FTS match. --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2026-05-26 15:26:34 -07:00
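The corrected scoring from #3437 can be sketched as standalone functions. This is an illustration under stated assumptions, not the reranker's code: `invert_score` is assumed to be `1 - distance` (consistent with the message's `_invert_score(fill) = 0.0` for `fill=1.0`), and `relevance` is a hypothetical wrapper that applies the two fill rules.

```python
def invert_score(distance):
    # Assumed form of the inversion: smaller distance -> higher score.
    return 1 - distance


def combine_score(vector_score, fts_score, weight=0.7):
    # Fixed formula: plain weighted average, no outer "1 - (...)".
    return weight * vector_score + (1 - weight) * fts_score


def relevance(distance=None, fts_score=None, weight=0.7, fill=1.0):
    """Hypothetical helper combining one document's vector and FTS results."""
    # A missing vector result is penalised via invert_score(fill).
    vector_score = invert_score(fill if distance is None else distance)
    # A missing FTS result uses 1 - fill, so both penalties are symmetric.
    fts = (1 - fill) if fts_score is None else fts_score
    return combine_score(vector_score, fts, weight)


best = relevance(distance=0.0, fts_score=2.41)   # matches both searches
worst = relevance(distance=0.9999)               # no FTS match at all
```

With the values from the commit's example, `best` comes out near 1.42 and `worst` near 0.00, so the closest match ranks first.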
Brendan Clement	15e75804c4	feat(remote): send read freshness headers for remote table consistency (#3439 ) Closes client side work of #3370 ### Summary - Plumbs `read_consistency_interval` from `ConnectBuilder` through `RestfulLanceDbClient` so remote reads attach an `x-lancedb-min-timestamp` freshness header. None = no header (default), zero = "now", positive = `now - interval`. - Adds per-table `FreshnessState` on `RemoteTable`: write responses (`update`, `delete`, `merge_insert`, `add_columns`, `alter_columns`, `drop_columns`) track the committed version, and the next read sends `x-lancedb-min-version` so the server's cache honors read-your-write. - `checkout(v)` / `checkout_tag(t)` / `checkout_latest()` / `restore()` reset the freshness state appropriately; the validating `/describe/` and tag-resolve requests are sent without freshness headers so they don't carry stale state. - Updates Rust, Python, and Node docstrings and calls out that stronger consistency raises per-read latency and cost. ### Testing - Unit tests cover default behavior, interval=0, positive interval, checkout_latest baseline, min_version-after-write, checkout clears state, and the two no-stale-header invariants on `checkout(v)` and `checkout_tag(t)`. - Ran smoke tests against local remote table to verify functionality	2026-05-26 13:38:07 -07:00
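The interval-to-header mapping described in #3439 (None = no header, zero = "now", positive = `now - interval`) can be sketched as follows. Only the header name comes from the commit message; the function name `freshness_headers` and the ISO 8601 encoding of the timestamp are assumptions made for illustration, and the real client may encode the value differently.

```python
from datetime import datetime, timedelta, timezone


def freshness_headers(read_consistency_interval, now=None):
    """Hypothetical helper: build the freshness header for one remote read."""
    # None means eventual consistency: send no freshness header at all.
    if read_consistency_interval is None:
        return {}

    current = now if now is not None else datetime.now(timezone.utc)
    # A zero interval yields "now"; a positive one yields now - interval.
    min_timestamp = current - read_consistency_interval
    # Assumption: the timestamp is serialised as an ISO 8601 string.
    return {"x-lancedb-min-timestamp": min_timestamp.isoformat()}


fixed_now = datetime(2026, 5, 26, 12, 0, 0, tzinfo=timezone.utc)
no_header = freshness_headers(None, now=fixed_now)
strong = freshness_headers(timedelta(0), now=fixed_now)
relaxed = freshness_headers(timedelta(seconds=5), now=fixed_now)
```

A smaller interval demands fresher data from the server, which matches the commit's caveat that stronger consistency raises per-read latency and cost.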
Zhaocun Sun	ec82e36317	docs(python): document in-memory connections (#3434 ) ## Problem Issue #2247 notes that the Python docs do not show how to use LanceDB's in-memory backend via `connect("memory://")`. ## Solution Add `memory://` examples to the sync and async `connect` docstrings, and call out that in-memory databases are intended for tests/temporary data and are not persisted. ## Validation - `python3 -m py_compile python/python/lancedb/__init__.py` - `git diff --check` ## Confidence 82/100 — docs-only update, directly tied to the documented missing `memory://` usage. It changes API documentation only and was syntax/diff validated. Closes #2247.	2026-05-22 10:51:09 -07:00
Lance Release	403c33dff0	Bump version: 0.33.0-beta.0 → 0.33.0-beta.1	2026-05-22 10:08:07 +00:00
Xuanwo	a0001043b6	fix: canonicalize remote nested field paths (#3430 ) Fixes #3407. Remote tables now resolve create-index field paths against the table schema before sending requests, so nested, escaped, and case-insensitive inputs use the same canonical path contract as local tables. Remote `list_indices()` also canonicalizes returned columns against the current schema, and the remote query tests lock explicit nested vector and FTS request payloads.	2026-05-22 15:23:00 +08:00


1051 Commits