lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-07-03 11:00:40 +00:00

Author	SHA1	Message	Date
prrao87	3446d02f4e	fix(python): fill bad vector values element-wise	2026-07-02 15:05:33 -04:00
Lance Release	bfce8a510d	Bump version: 0.34.0-beta.5 → 0.34.0-beta.6	2026-07-02 11:32:45 +00:00
Armaan Sandhu	a1261e6299	fix(python): average MRR reciprocal ranks over all rankings (#3599 ) ## What `MRRReranker.rerank_multivector` averages each document's reciprocal ranks over the wrong denominator. It divides by the number of rankings the document happens to appear in, instead of the total number of rankings being fused. ```python # python/python/lancedb/rerankers/mrr.py for result_id, reciprocal_ranks in mrr_score_map.items(): mean_rr = np.mean(reciprocal_ranks) # divides by len(present systems) ``` `mrr_score_map[doc]` only accumulates a reciprocal rank for the systems in which the document was returned, so `np.mean` never accounts for the systems that missed it. ## Why it's wrong Mean Reciprocal Rank fusion treats a system that didn't return a document as a reciprocal rank of `0` and averages across all systems. That's the exact mechanism by which it rewards cross-system consensus. Dividing by the appearance count removes that, so a document liked by a single ranking can beat one ranked highly by every ranking. Concretely, fusing 3 vector rankings: \| Doc \| Ranks \| Current score \| Correct score \| \|-----\|-------\|---------------\|---------------\| \| A \| #1 in 1 system only \| `mean([1.0]) = 1.000` \| `1.0 / 3 = 0.333` \| \| B \| #1, #1, #2 across all 3 \| `mean([1, 1, .5]) = 0.833` \| `2.5 / 3 = 0.833` \| The current code ranks A above B - a document two of three rankings ignored outranks one all three ranked at or near the top. This also makes `rerank_multivector` inconsistent with `rerank_hybrid` in the same file, which already treats a missing system as `0` (`vector_rr = 0.0` / `fts_rr = 0.0`), and with the class docstring ("average of reciprocal ranks across different search results"). ## Fix Divide the summed reciprocal ranks by the total number of rankings: ```python num_systems = len(vector_results) ... 
mean_rr = float(np.sum(reciprocal_ranks)) / num_systems ``` ## Tests Adds `test_mrr_multivector_rewards_consensus`, which asserts the exact MRR scores and that the consensus document ranks first. It fails on `main` and passes with this change. Existing reranker tests are unaffected.	2026-07-01 15:36:56 -07:00
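The denominator change described in this commit can be illustrated with a minimal standalone sketch. It does not reproduce `MRRReranker.rerank_multivector`: the function name `fuse_mrr` and the list-of-ID-lists input are assumptions made for illustration, and plain Python stands in for NumPy.

```python
from collections import defaultdict


def fuse_mrr(rankings):
    """Fuse ranked ID lists by mean reciprocal rank over ALL rankings.

    `rankings` is a list of lists of document IDs, best first. This input
    shape is hypothetical and is not LanceDB's actual result format.
    """
    num_systems = len(rankings)
    summed = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            summed[doc_id] += 1.0 / position
    # A ranking that omits a document contributes 0, so divide by the
    # total number of rankings, not by the number of appearances.
    scores = {doc_id: total / num_systems for doc_id, total in summed.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# The example from the commit message: A is #1 in one system only,
# while B is ranked #2, #1, #1 across all three systems.
fused = fuse_mrr([["A", "B"], ["B"], ["B"]])
```

With the total number of rankings as the denominator, B (the consensus document) sorts ahead of A, matching the "Correct score" column in the table above.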
Neo-X7	17c499177f	docs(python): add missing parameter documentation for when_matched_update_all (#3536 ) Fixes #2493 Added the `target.` prefix requirement to the `where` parameter docstring.	2026-07-01 10:28:58 -07:00
Will Jones	d889321b5e	fix!: combine repeated where filters with AND instead of replacing (#3585 ) BREAKING CHANGE: When passing multiple where clauses to a query, they now stack instead of replacing the previous filter. Previously, calling `where`/`only_if` more than once on a query silently replaced the previous filter, so only the last filter was applied. This was surprising and could return rows that an earlier filter should have excluded. This implements the alternative suggested in https://github.com/lancedb/lancedb/pull/3514#issuecomment-4664901580: instead of rejecting a second filter, repeated filters are combined with a logical AND (`(previous) AND (new)`). The combination happens in the Rust core (`QueryBase::only_if` and `only_if_expr`), so it applies to all SDKs at once (Rust, Python async, and TypeScript). The Python sync query builder keeps its own filter state, so it combines filters in the binding layer as well. SQL string and expression filters are combined within their own representation. When the two representations are mixed, the expression is lowered to SQL (via `expr_to_sql_string`) and the filters are combined as SQL strings, so chaining `where` works regardless of which form each filter takes. Fixes #2649 ## Tests - Rust: `cargo test --features remote -p lancedb --lib query` - Python: `uv run --extra tests pytest python/tests/test_query.py` - TypeScript: `pnpm test __test__/query.test.ts` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-07-01 10:11:58 -07:00
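The `(previous) AND (new)` combination rule for SQL string filters can be sketched in a few lines. This is an illustration only: the helper name `combine_filters` is hypothetical, and the real logic lives in the Rust core (`QueryBase::only_if`), which is not reproduced here.

```python
from typing import Optional


def combine_filters(previous: Optional[str], new: str) -> str:
    """Stack a new SQL filter on top of an existing one with AND."""
    if previous is None:
        return new
    # Parenthesize both sides so an OR inside either filter keeps its meaning.
    return f"({previous}) AND ({new})"


first = combine_filters(None, "price > 10")
combined = combine_filters(first, "category = 'a' OR category = 'b'")
```

Wrapping both operands in parentheses is what keeps an `OR` in the second filter from escaping the first filter's constraint.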
Jack Ye	3b70fc4c9d	fix(python): route async namespace connections through rust (#3603 ) Summary: - Route built-in async namespace-backed connections through the Rust namespace connector. - Delegate async namespace/table management methods to the inner AsyncConnection while keeping the custom implementation Python-client fallback. - Add regressions for the native async dir path and lazy namespace_client() construction. Validated locally with targeted namespace/db/table pytest, full test_namespace.py, ruff, cargo fmt/check/clippy, and cargo test -p lancedb-python.	2026-06-30 17:03:23 -07:00
Lance Release	bcbc0da090	Bump version: 0.34.0-beta.4 → 0.34.0-beta.5	2026-06-30 22:23:43 +00:00
Jack Ye	9bead9f53d	fix(python): route sync namespace connections through rust (#3598 ) Summary: - Route built-in sync namespace connections through the Rust namespace connector. - Keep custom namespace clients on the existing Python fallback. - Preserve namespace-backed to_lance compatibility with lazy Python client construction and add regressions.	2026-06-30 14:46:23 -07:00
Raphael Malikian	05756f0bbf	fix(python): raise clear error when permutation API is used on remote tables (Fixes #2934 ) (#3591 ) Fixes #2934 ## Problem Passing a `RemoteTable` to `permutation_builder()` raises a cryptic `AttributeError`: ``` AttributeError: 'RemoteTable' object has no attribute '_inner' ``` This leaves users confused about what went wrong and why. ## Root Cause `PermutationBuilder.__init__()` calls `async_permutation_builder(table)` which accesses `table._inner` — the underlying Rust Lance table object. `RemoteTable` connects to LanceDB Cloud/Enterprise and does not have a local `_inner` attribute, making permutations fundamentally unsupported on remote tables. ## Solution Added an early check in `PermutationBuilder.__init__()` that verifies the table has `_inner` before calling the Rust function, raising a clear `TypeError` with an explanation of why permutations don't work on remote tables. ## Verification - Syntax validated with `ast.parse()` - Structural verification: single call site (`permutation_builder()`), guard placed before Rust FFI call - Error message tested with mock: `MockRemoteTable()` correctly triggers `TypeError` ## Changelog \| Date \| Change \| Author \| \|------\|--------\|--------\| \| 2026-06-28 \| Added remote table guard in PermutationBuilder.__init__ \| rtmalikian \| ### Files Changed - python/python/lancedb/permutation.py — Added `hasattr(table, "_inner")` check with clear error --- About the Author: Raphael Malikian — Clinical AI Solutions Architect. I specialise in building and fixing AI/ML systems for healthcare, including vector databases, RAG pipelines, and clinical NLP. If you need help with your project or think I can add value to your organisation, feel free to reach out — I'd love to connect. 
📧 rtmalikian@gmail.com 🔗 GitHub: https://github.com/rtmalikian 🔗 LinkedIn: http://www.linkedin.com/in/raphael-t-malikian-mbbs-bsc-hons-71075436a --- Disclosure: This code was developed with assistance from deepseek-v4-pro (DeepSeek) via Hermes Agent (Nous Research). All changes were reviewed, tested against the actual codebase, and verified for correctness. Signed-off-by: rtmalikian <rtmalikian@gmail.com>	2026-06-29 16:36:01 -07:00
Jack Ye	39e819b6a7	feat(python): expose OAuth connection config (#3586 ) Expose the merged Rust OAuth header provider through the Python async connection path. Includes: - Python OAuthConfig and OAuthFlowType public config objects - PyO3 conversion into the Rust OAuthConfig - connect_async(oauth_config=...) plumbing - repr redaction coverage for client_secret Local validation: cargo fmt --all; ruff format/check on touched Python files.	2026-06-29 12:36:35 -07:00
Lance Release	3878adc6dc	Bump version: 0.34.0-beta.3 → 0.34.0-beta.4	2026-06-29 11:11:05 +00:00
Ryan Green	8a5cd74e48	fix: ensure read freshness provider is built into namespace client (#3571 ) By default the read freshness provider was not included in the namespace client, preventing the read freshness headers from being included in the request. This prevents checkout_latest() from working as expected when using the namespace client. This fix ensures the provider is built into the client when the namespace impl and properties are provided.	2026-06-25 21:47:55 -07:00
Lance Release	8718345229	Bump version: 0.34.0-beta.2 → 0.34.0-beta.3	2026-06-25 01:53:51 +00:00
Raphael Malikian	0ba70d96c3	fix: add missing stacklevel=2 to warnings.warn() and fix broken message concatenation (Fixes #3563 ) (#3564 ) Fixes #3563 ## Summary - Add `stacklevel=2` to 10 `warnings.warn()` calls across 4 files - Fix broken message concatenation in `table.py` where the second string was incorrectly passed as the `category` parameter ## Problem Multiple `warnings.warn()` calls in the `python/lancedb/` codebase were missing the `stacklevel` parameter. Without `stacklevel=2`, warnings point to library internals instead of the caller's code, making it impossible for users to identify which of their function calls triggered the warning. Additionally, two calls in `table.py` (lines 3411 and 3420) had a more serious bug: the deprecation message was split across two separate string arguments, causing the second string to be passed as the `category` parameter instead of being concatenated with the first string. This would cause `TypeError` when the warning was triggered. ## Changes \| File \| Fixes \| Description \| \|------\|-------\|-------------\| \| `embeddings/colpali.py` \| 1 \| Add `stacklevel=2` to `use_token_pooling` deprecation warning \| \| `remote/db.py` \| 3 \| Add `stacklevel=2` to `request_thread_pool`, `connection_timeout`, `read_timeout` deprecation warnings \| \| `remote/table.py` \| 3 \| Add `stacklevel=2` to `cleanup_old_versions`, `compact_files`, `optimize` no-op warnings \| \| `table.py` \| 3 \| Fix broken message concatenation for `data_storage_version` and `enable_v2_manifest_paths` deprecation warnings + add `stacklevel=2` to `retrain` deprecation warning \| ## Verification ```python # All warnings.warn() calls now have stacklevel python3 -c "import ast, os; ..." # Result: All warnings.warn() calls now have stacklevel! 
``` ## Changelog \| Date \| Change \| Author \| \|------\|--------\|--------\| \| 2026-06-20 \| Fix missing stacklevel=2 in 10 warnings.warn() calls + fix broken message concatenation \| rtmalikian \| ### Files Changed - `python/python/lancedb/embeddings/colpali.py` — Add stacklevel=2 - `python/python/lancedb/remote/db.py` — Add stacklevel=2 to 3 deprecation warnings - `python/python/lancedb/remote/table.py` — Add stacklevel=2 to 3 no-op warnings - `python/python/lancedb/table.py` — Fix broken message concatenation + add stacklevel=2 ### Verification - AST-based audit confirms all `warnings.warn()` calls now include `stacklevel=2` - Syntax check passes for all 4 modified files --- About the Author: Raphael Malikian — Clinical AI Solutions Architect. I specialise in building and fixing AI/ML systems for healthcare, including vector databases, RAG pipelines, and clinical NLP. If you need help with your project or think I can add value to your organisation, feel free to reach out — I'd love to connect. 📧 rtmalikian@gmail.com 🔗 GitHub: https://github.com/rtmalikian 🔗 LinkedIn: http://www.linkedin.com/in/raphael-t-malikian-mbbs-bsc-hons-71075436a --- Disclosure: This code was developed with assistance from Hermes Agent (Nous Research). All changes were reviewed, tested against the actual codebase, and verified for correctness. Signed-off-by: rtmalikian <rtmalikian@gmail.com>	2026-06-23 13:42:59 -07:00
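Both problems described in this commit can be reproduced with the standard library alone. The sketch below uses hypothetical function names (`old_api`, `broken_call`) and a made-up message; it is not the code from `table.py`.

```python
import warnings


def old_api():
    # Adjacent string literals concatenate into one message. A comma between
    # them would instead pass the second string as `category`.
    warnings.warn(
        "old_api() is deprecated. "
        "Use new_api() instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller of old_api()
    )


def broken_call():
    # Reproduces the bug class fixed here: the second string lands in
    # the `category` parameter, which must be a Warning subclass.
    warnings.warn("old_api() is deprecated.", "Use new_api() instead.")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_api()

try:
    broken_call()
    raised_type_error = False
except TypeError:
    raised_type_error = True
```

`caught` holds a single `DeprecationWarning` carrying the full concatenated message, while `broken_call()` raises `TypeError` before any warning is issued.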
Lance Release	26481a4b74	Bump version: 0.34.0-beta.1 → 0.34.0-beta.2	2026-06-23 16:21:52 +00:00
Will Jones	85d870b397	fix: parse RFC 3339 created_at and improve IndexConfig repr (#3558 ) The server now serializes an index's `created_at` as an RFC 3339 string (e.g. `"2026-06-18T21:37:36.637Z"`), but the client deserializer only accepted a unix timestamp in milliseconds. This caused `list_indices` to fail with: ``` Failed to parse list_indices response: invalid type: string "2026-06-18T21:37:36.637Z", expected a unix timestamp in milliseconds ``` This PR replaces the fixed millisecond deserializer with a custom one that accepts both an RFC 3339 string (current server) and a unix-millisecond integer (legacy deployments), so the client works against any server version. It also improves the `IndexConfig` repr in the Python bindings. Previously it printed only three fields (`Index(FTS, columns=["text"], name="text_idx")`), hiding the metadata that `list_indices` returns. It now renders every populated field, omitting any that are `None`. Each value is valid Python — integer counts use `_` thousands separators and `created_at` uses the `datetime` repr — so values round-trip. 
The real repr is a single line; it's wrapped here for readability: ```python >>> table.list_indices() [IndexConfig( name="text_idx", index_type="FTS", columns=["text"], index_uuid="aefd3e00-2f95-4bdc-92ac-06de84442bf1", type_url="/lance.table.InvertedIndexDetails", created_at=datetime.datetime(2026, 6, 18, 21, 37, 36, 637000, tzinfo=datetime.timezone.utc), num_indexed_rows=2, size_bytes=3_669, num_segments=1, index_version=1, index_details={ 'lance_tokenizer': None, 'base_tokenizer': 'simple', 'language': 'English', 'with_position': False, 'max_token_length': 40, 'lower_case': True, 'stem': True, 'remove_stop_words': True, 'custom_stop_words': None, 'ascii_folding': True, 'min_ngram_length': 3, 'max_ngram_length': 3, 'prefix_only': False, }, )] ``` Fixes #3556 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 10:40:56 -07:00
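The dual-format parsing idea (RFC 3339 string or unix milliseconds) is implemented in the Rust client deserializer, which is not shown in this commit message. As a rough Python analogue, assuming a hypothetical helper named `parse_created_at`:

```python
from datetime import datetime, timedelta, timezone
from typing import Union

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_created_at(value: Union[str, int]) -> datetime:
    """Accept an RFC 3339 string or a unix timestamp in milliseconds."""
    if isinstance(value, bool):
        # bool is a subclass of int; reject it explicitly.
        raise TypeError("created_at must be a string or an integer")
    if isinstance(value, int):
        # Integer arithmetic avoids float rounding of the millisecond part.
        return EPOCH + timedelta(milliseconds=value)
    if isinstance(value, str):
        # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+,
        # so normalize it to an explicit UTC offset first.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    raise TypeError("created_at must be a string or an integer")


from_string = parse_created_at("2026-06-18T21:37:36.637Z")
from_millis = parse_created_at(1500)
```

Both branches return timezone-aware UTC datetimes, so values from current and legacy servers compare directly.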
Lance Release	3b279f5705	Bump version: 0.34.0-beta.0 → 0.34.0-beta.1	2026-06-19 15:59:43 +00:00
Ryan Green	e1334954d7	fix: overflow using sys.maxsize for k in query with namespace connection (#3561 )	2026-06-19 12:57:10 -02:30
Lance Release	4f4cce3f64	Bump version: 0.33.1-beta.2 → 0.34.0-beta.0	2026-06-18 18:42:07 +00:00
Will Jones	ce5dadd386	fix(ci): allow shell pre-commit hooks in bumpversion configs (#3554 ) The "Create release commit" workflow (`make-release-commit.yml`) has failed on its last two runs; no release tags have been created since June 4. Since this workflow creates the tag that the cargo/npm/pypi/java publish workflows trigger off of, all recent releases are effectively blocked. The workflow installs `bump-my-version` unpinned. Version `1.4.0` added a check that refuses to run `pre_commit_hooks` containing shell syntax (pipes, `&&`, `if`, variable expansion) unless `allow_shell_hooks = true` is set. Both bumpversion configs use such hooks: - `python/.bumpversion.toml` — updates `Cargo.lock` after the bump (fails first) - `.bumpversion.toml` — runs `mvn versions:set` for the Java packages The job dies at the version-bump step with: > Hook '…' contains shell syntax (pipes, redirects, or variable expansion). Set `allow_shell_hooks = true` in your configuration to enable shell execution… This sets `allow_shell_hooks = true` in both configs to restore the previous behavior. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-17 15:22:05 -07:00
whitewooood	217fd8491d	fix(python): clarify single dictionary input error (#3537 ) ## Summary - clarify the Python error for passing a single dictionary to table creation/add paths - add a regression test for `create_table(..., data=dict)` so it points users to a list of dictionaries Fixes #409 ## Testing - `python -m pytest python/tests/test_table.py -q` - `python -m ruff format python/lancedb/table.py python/lancedb/scannable.py python/tests/test_table.py` - `python -m ruff check python/lancedb/table.py python/lancedb/scannable.py python/tests/test_table.py`	2026-06-17 12:55:55 -07:00
JSap0914	9128dbcd7a	fix(util): escape single quotes in struct field names in value_to_sql (#3548 ) ### Bug `value_to_sql({...})` builds a DataFusion `named_struct(...)` literal but interpolates the struct field names directly as `f"'{k}'"`. A field name that contains a single quote therefore produces invalid SQL: ```python >>> from lancedb.util import value_to_sql >>> value_to_sql({"it's": 1}) "named_struct('it's', 1)" # invalid SQL — the quote terminates the literal ``` String values are already escaped (single quotes doubled) by the `str` branch of `value_to_sql`, so keys and values were handled inconsistently. This affects `Table.update(values={...})` / `merge_insert` when a struct column has a field name containing `'`. ### Fix Render the key through `value_to_sql(str(k))` so field names are escaped exactly like string values: ```python >>> value_to_sql({"it's": 1}) "named_struct('it''s', 1)" ``` Keys without special characters are unchanged (`'a'` stays `'a'`), so existing behavior is preserved. ### Verification ``` $ pytest python/tests/test_util.py -k value_to_sql_dict ``` The new `test_value_to_sql_dict_key_escaping` covers quoted keys (incl. nested structs) and fails on `main` (`named_struct('it's', 1)`), passes with this change; the existing `test_value_to_sql_dict` still passes. Co-authored-by: JSap0914 <JSap0914@users.noreply.github.com>	2026-06-17 12:55:43 -07:00
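The escaping rule can be shown without LanceDB. The sketch below is a simplified stand-in for `value_to_sql` that handles only string and integer values; the helper names `sql_string` and `struct_to_sql` are assumptions for illustration.

```python
def sql_string(value: str) -> str:
    """Render a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def struct_to_sql(struct: dict) -> str:
    """Render a dict of str/int values as a named_struct(...) literal."""
    parts = []
    for key, value in struct.items():
        rendered = sql_string(value) if isinstance(value, str) else str(value)
        # Field names go through the same escaping as string values.
        parts.append(f"{sql_string(str(key))}, {rendered}")
    return "named_struct(" + ", ".join(parts) + ")"


escaped = struct_to_sql({"it's": 1})
plain = struct_to_sql({"a": 1, "b": "x"})
```

Routing keys and values through one escaping function is the design point: the two can no longer drift apart.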
Armaan Sandhu	b2ae763254	fix(python): raise clear TypeError for bare List/Tuple in pydantic schema conversion (#3511 ) Closes #3502 ## Problem A bare, unparameterised `typing.List` / `typing.Tuple` field crashes `to_arrow_schema` with an opaque `AttributeError: __args__`: ```python from typing import Tuple from lancedb.pydantic import LanceModel class Doc(LanceModel): items: Tuple Doc.to_arrow_schema() # AttributeError: __args__ ``` In `_py_type_to_arrow_type`, the branch `elif getattr(py_type, "__origin__", None) in (list, tuple)` is taken for a bare generic (its `__origin__` is `list` / `tuple`), but the next line reads `py_type.__args__[0]`, and a bare generic has no `__args__`. Other unsupported types (e.g. `Dict[str, int]`) correctly raise a clear `TypeError`, so this case is inconsistent. ## Fix Guard the element-type lookup with `getattr(py_type, "__args__", None)` and raise a clear `TypeError` when it is missing, matching the existing behavior for other unsupported types. Bare builtin list / tuple are unaffected (their `__origin__` is `None`, so they already fall through to the existing `TypeError`). ## Testing - Added `test_bare_generic_raises_type_error` covering both `List` and `Tuple`. - ruff format and ruff check clean.	2026-06-17 11:58:48 -07:00
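The guard can be sketched independently of `_py_type_to_arrow_type`. The helper name `element_type` is hypothetical, and the behavior shown for bare `List` / `Tuple` assumes Python 3.9 or later, where a bare generic alias exposes `__origin__` but no `__args__`.

```python
from typing import List, Tuple


def element_type(py_type):
    """Return the element type of a parameterised list/tuple annotation."""
    origin = getattr(py_type, "__origin__", None)
    if origin not in (list, tuple):
        raise TypeError(f"Not a list-like annotation: {py_type!r}")
    # Bare typing.List / typing.Tuple have no __args__; look it up safely
    # instead of indexing py_type.__args__[0] directly.
    args = getattr(py_type, "__args__", None)
    if not args:
        raise TypeError(f"{py_type!r} must be parameterised, e.g. List[int]")
    return args[0]


inner = element_type(List[int])

# Collect the error for each bare generic instead of crashing.
errors = []
for bare in (List, Tuple):
    try:
        element_type(bare)
    except TypeError as exc:
        errors.append(str(exc))
```

The bare generics now fail with a `TypeError` that names the annotation, instead of an opaque `AttributeError`.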
Brendan Clement	f76b075d13	feat: add table branch support to remote tables and Python/TS bindings (#3540 ) ### Description Adding branch support for RemoteTable by threading a branch selector onto every operation the data plane accepts it on. Exposes the currentBranch to nodejs and python through the bindings. Matching the server handlers, the branch rides as: - a `?branch=` query parameter for Arrow-body and query-only ops (insert, merge_insert, multipart_*, version/list, drop_index) - a `branch` field in the JSON body for everything else (count_rows, query, update, delete, create_index, column ops, index list/stats, stats, restore, describe, tags create/update) A main-branch handle (`branch == None`) produces byte-identical requests to before: no `branch` field and no `?branch=` - Handle-per-branch: `create_branch` / `checkout_branch` return a new handle with fresh caches and reset version/freshness state, mirroring `NativeTable`. - `create_branch` maps 409 to already-exists, 400 to invalid, and 404 to not-found with source context, and sends without retry so the 409 stays observable. - `Ref` translation covers version, version-number (relative to the handle's branch), and tag (resolved via the tags endpoint); `"main"` and empty normalize to the main branch. - Python branch handles persist their branch (and pinned version) across pickle/fork, so a forked or pickled handle reopens on its branch rather than silently reverting to main. ### Tests - Rust mock tests per op category (query-param and body mechanisms, branch CRUD, error paths, backward-compat). - Python sync branch CRUD, `open_table(branch=)`, and a pickle round-trip regression test.	2026-06-15 18:07:40 -04:00
Will Jones	f8caef3aca	feat(bindings): expose new IndexConfig fields in Python and Node.js (#3534 ) ## Summary Surfaces the rich per-index metadata added in #3497 to the Python and Node.js language bindings. Closes #3495. New optional fields exposed on `IndexConfig` in both bindings: - `index_uuid` / `indexUuid` — UUID of the first index segment - `type_url` / `typeUrl` — protobuf type URL for the index - `created_at` / `createdAt` — creation timestamp (milliseconds since Unix epoch) - `num_indexed_rows` / `numIndexedRows` — rows covered by the index - `num_unindexed_rows` / `numUnindexedRows` — rows not yet indexed - `size_bytes` / `sizeBytes` — total index file size in bytes - `num_segments` / `numSegments` — number of index segments - `index_version` / `indexVersion` — on-disk format version - `index_details` / `indexDetails` — type-specific JSON details string All fields are `None`/`undefined` for remote tables (which don't yet surface this metadata through the server response). ## Changes - `python/src/index.rs`: extend `IndexConfig` pyclass; update `From` impl; update `__getitem__` - `python/python/lancedb/_lancedb.pyi`: add type hints for new fields - `python/python/tests/test_table.py`: new `test_index_config_fields` test - `nodejs/src/table.rs`: extend `IndexConfig` napi struct; update `From` impl - `nodejs/__test__/table.test.ts`: new test; update existing `toEqual` assertions to `expect.objectContaining` to accommodate new fields ## Test plan - [x] Python: `uv run --extra tests pytest python/tests/test_table.py::test_index_config_fields` - [x] Node.js: `pnpm test __test__/table.test.ts` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-11 13:37:39 -07:00
nuthalapativarun	40f3e22600	feat: support rename_table on LanceNamespaceDatabase (#3520 ) ## Summary Closes #3412 Implements `rename_table` for `LanceNamespaceDatabase` (sync and async Python) and the Rust `NamespaceDatabase` backend. Previously these raised `NotImplementedError`; this PR delegates to the `LanceNamespace.rename_table` method which is part of the lance-namespace spec. ### Changes - `rust/lancedb/src/database/namespace.rs`: Remove the `NotImplementedError` stub for `rename_table`. Build a `RenameTableRequest` (with `id`, `new_table_name`, and optionally `new_namespace_id`) and call `self.namespace.rename_table(...)`, mirroring the existing `drop_table` pattern. - `python/python/lancedb/namespace.py`: Import `RenameTableRequest` from `lance_namespace`. Replace the `raise NotImplementedError` in both `LanceNamespaceDatabase.rename_table` (sync) and `AsyncLanceNamespaceDatabase.rename_table` (async) with a call to `self._namespace_client.rename_table(request)`. - `python/python/tests/test_namespace.py`: Replace the `test_rename_table_not_supported` test (which checked for `NotImplementedError`) with `test_rename_table`, which: 1. Creates a table in a namespace 2. Calls `rename_table` with `cur_namespace_path` and `new_namespace_path` 3. Asserts the old name is gone from `table_names()` 4. Asserts the new name appears in `table_names()` 5. Verifies the renamed table can be opened ## Test plan - [ ] Existing namespace tests pass in CI (all rely on `lance.namespace.DirectoryNamespace` which requires the full lance package) - [ ] `test_rename_table` exercises the full rename path: create → rename → verify old gone → verify new present → open - [ ] Rust build passes with the updated `namespace.rs` (requires Rust toolchain in CI)	2026-06-11 11:41:07 -07:00
nuthalapativarun	04480c274a	test(python): add nested field regression matrix tests (#3518 ) ## Summary Closes #3406 Add a regression matrix in `python/python/tests/test_nested_fields.py` that exercises the full nested field index lifecycle for both the sync and async Python table APIs. The tests will fail if any implementation regresses to leaf-only field names in `list_indices`, `index_stats`, search, or filter results. ## Test scenarios covered Index types: BTree scalar, IvfPq vector, FTS Field-name edge cases (per acceptance criteria): - `rowId` — camelCase top-level field - `` `row-id` `` — hyphenated top-level field (escaped) - `parent.`\``leaf.name`\`` ` — struct leaf whose name contains a literal dot - `MetaData.userId` — mixed-case nested path - `` `meta-data`.`user-id` `` — hyphenated struct with hyphenated leaf Lifecycle operations per index type: - `create_index` / `create_scalar_index` / `create_fts_index` - `list_indices` → verify canonical full dotted path (not leaf name) - `index_stats` → verify row count and index type - Filtered scan (`WHERE nested.field = value`) - Vector search via nested embedding column - FTS search via nested text column - `add` (append) then re-check index listing - `optimize` then re-check index listing Both sync and async APIs are covered in parallel test classes. ## Notes Lance forbids top-level field names that contain a literal `.`, so the `` `a.b` `` acceptance-criterion variant is exercised as a struct leaf field (`parent.`\``leaf.name`\``) rather than a top-level column.	2026-06-11 08:06:04 -07:00
Trenton H	ae7f2cbfe8	feat(python): accept Expr in Table.delete and merge when_not_matched_by_source_delete (#3524 ) Another little pain point as I was working to integrate with paperless-ngx. The read path of table.search() or table.query() already accepted an Expr, but write paths Table.delete and merge_insert(...).when_not_matched_by_source_delete did not. This PR attempts to close that gap, so writes and reads can both use Expr, instead of one side needing to build a string.	2026-06-11 07:59:49 -07:00
Trenton H	85d9c1ce63	feat: adds isin support to the 'Expr' builder (#3523 ) The `Expr` builder already includes a lot of useful filtering options, `eq, ne, gt/gte, lt/lte, and_, or_, contains, cast`, but it was missing a membership test like `isin`. This PR adds that support, as minimally as possible, allowing easy filtering for membership in a list without needing a series of `where` expressions. I didn't see anything in CONTRIBUTING.md about needing a feature request or issue first, so I just made the change. My apologies if I missed that somewhere. Thanks for the vector store, we're using it now in paperless-ngx.	2026-06-10 15:28:19 -07:00
Jack Ye	8373318e89	feat: support FM-Index scalar index for substring search (#3532 ) Adds an FM-Index — a scalar index over string and binary columns that accelerates substring search (`contains(col, 'needle')`), distinct from the tokenized `FTS` index — across the Rust core and the Python and TypeScript bindings. ## Rust - `Index::Fm(FmIndexBuilder)` and `IndexType::Fm`. - `make_index_params` maps `Index::Fm` to Lance's `ScalarIndexParams::for_builtin(BuiltinIndexType::Fm)`. - `supported_fm_data_type` validates `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary` columns. - `list_indices` round-trips the type (`"Fm"` → `IndexType::Fm`); the remote wire type is `"FM"`. ## Python Adds `lancedb.index.Fm`, accepted by `create_index`: ```python from lancedb.index import Fm await tbl.create_index("text", config=Fm()) ``` ## TypeScript Adds the `Index.fm()` factory: ```ts await tbl.createIndex("text", { config: Index.fm() }); ```	2026-06-10 12:28:20 -07:00
Xuanwo	566b67a634	fix: support LargeList label list indexes (#3529 ) ## Summary This PR extends nested-field regression coverage across Rust local/remote, Python sync/async, and Node so canonical escaped paths stay consistent across scalar, vector, and FTS index lifecycle behavior. It also aligns LanceDB's LabelList type gate with Lance by accepting `LargeList<primitive>` columns while keeping `List<Struct<...>>` unsupported until Lance defines stable membership semantics for struct labels. Part of #3406.	2026-06-10 23:53:56 +08:00
devteamaegis	f260d3bf12	fix(util): convert numpy scalars in value_to_sql (#3522 ) ## What's broken `Table.update(values={...})` raises `NotImplementedError: SQL conversion is not implemented for this type` when a value is a numpy scalar such as `np.int64`, `np.int32`, `np.float32`, or `np.bool_`. These arise naturally from indexing an ndarray or a pandas int/bool column. `np.float64` happens to work (it subclasses `float`), which makes the failure inconsistent and surprising. ```python df = pd.DataFrame({"id": np.array([10, 20], dtype="int32")}) t.update(where="id = 1", values={"id": df["id"].iloc[0]}) # np.int32 # -> NotImplementedError: SQL conversion is not implemented for this type ``` ## Why it happens `value_to_sql` is a `singledispatch` with handlers only for native Python types and `np.ndarray`; numpy `integer`/`floating`/`bool_` scalars aren't Python subclasses, so they fall through to the `NotImplementedError` base. ## Fix Register handlers for `np.bool_`, `np.integer`, and `np.floating` that delegate to the existing native handlers. ## Test `value_to_sql` on `np.int32/int64/float32/float64/bool_` all convert; `np.int32` raised before. Co-authored-by: Ishaan Samantray <ishaansamantray@Ishaans-MacBook-Pro.local>	2026-06-09 15:57:02 -07:00
Brendan Clement	d9018067b3	feat: support checking out a version on a branch (#3504 ) ### Description Stacked on #3490. Adds an optional version to branch checkout across the Rust core and the Python and TypeScript SDKs, so you can open a specific version on a branch ("version V of branch B"), not just the branch's latest version Rust ```rust // Open version 3 of branch "exp" (a read-only view): check out from an // existing table, or open it directly from the connection. let exp_v3 = table.checkout_branch("exp", Some(3)).await?; let exp_v3 = db.open_table("items").branch("exp").version(3).execute().await?; // checkout_latest re-attaches to the branch's writable HEAD. exp_v3.checkout_latest().await?; // With no branch, a version opens main at that version. let main_v3 = db.open_table("items").version(3).execute().await?; ``` Python ```python # Open version 3 of branch "exp" (a read-only view): check out from an # existing table, or open it directly from the connection. branch_v3 = await table.branches.checkout("exp", version=3) branch_v3 = await db.open_table("items", branch="exp", version=3) # checkout_latest re-attaches to the branch's writable HEAD. await branch_v3.checkout_latest() # With no branch, a version opens main at that version. main_v3 = await db.open_table("items", version=3) ``` TypeScript ```typescript // Open version 3 of branch "exp" (a read-only view): check out from an // existing table, or open it directly from the connection. const branchV3 = await (await table.branches()).checkout("exp", 3); const opened = await db.openTable("items", undefined, { branch: "exp", version: 3 }); // checkoutLatest re-attaches to the branch's writable HEAD. await branchV3.checkoutLatest(); // With no branch, a version opens main at that version. 
const mainV3 = await db.openTable("items", undefined, { version: 3 }); ``` ### Testing - Added unit tests (Rust, Python sync + async, TypeScript): branch-scoped resolution at a version number shared with `main` and with another branch, read-only enforcement on a pinned handle, `checkout_latest` recovery to the branch's HEAD, fork-point reads, and the nonexistent-version/branch error paths. - Ran smoke tests against the Python and TypeScript SDKs on local machine.	2026-06-08 17:36:38 -07:00
Brendan Clement	53517b3aaa	feat: add table branch support (#3490 ) ### Description Adds first-class support for table branches across the Rust core and the Python and TypeScript SDKs. Rust ```rust use lance::dataset::refs::Ref; // Create a branch from main and write to it — main is untouched. let exp = table.create_branch("exp", Ref::Version(None, None)).await?; exp.add(batches).await?; // Reopen the branch later: check out from a table, or open it directly. let exp = table.checkout_branch("exp").await?; let exp = db.open_table("items").branch("exp").execute().await?; let branches = table.list_branches().await?; table.delete_branch("exp").await?; ``` Python ```python # Create a branch from main and write to it branch = await table.branches.create("exp", from_ref="main") await branch.add(data) # Reopen the branch later: check out from a table, or open it directly. branch = await table.branches.checkout("exp") branch = await db.open_table("items", branch="exp") await table.branches.list() await table.branches.delete("exp") ``` TypeScript ```typescript const branches = await table.branches(); // Create a branch from main and write to it const branch = await branches.create("exp"); await branch.add(data); // Reopen the branch later: check out from a table, or open it directly. const checkedOut = await branches.checkout("exp"); const opened = await db.openTable("items", undefined, { branch: "exp" }); await branches.list(); await branches.delete("exp"); ``` ### Testing - Added unit tests - ran smoke tests against python and typescript sdks on local machine ### Next steps - Add RemoteTable support - Add Branch Comparison support - Merge Branching support	2026-06-08 16:26:46 -07:00
Yang Cen	3e25f584eb	fix(python): push down namespace full reads (#3516 ) ## Bug Fix ### What is the bug? Namespace-backed `LanceTable.to_arrow()` full-table reads bypassed the existing `QueryTable` server-side query path and called the lower-level table `to_arrow()` implementation directly. In Geneva/Sophon this could fail while parsing the Arrow IPC response for `hist.get_table().to_arrow()` / `to_pandas()`, even though `hist.get_table().search().to_arrow()` worked. ### What issues or incorrect behavior does the bug cause? Full-table reads on namespace-backed tables with `QueryTable` pushdown could fail with Arrow IPC parse errors, while query/search reads on the same table succeeded. Since `to_pandas()` delegates through `to_arrow()` for non-blob/native cases, pandas export was affected too. ### How does this PR fix the problem? When `QueryTable` pushdown is enabled, sync and async table `to_arrow()` now construct a plain no-filter, no-limit, all-columns query and execute it through the table-level `_execute_query()` path. `AsyncTable` now preserves namespace context from async namespace connections so async full reads can make the same pushdown decision. Non-namespace tables and namespace tables without `QueryTable` pushdown keep their existing behavior. 
### Tests - `uv run --extra tests --extra dev --no-sync ruff check python/lancedb/table.py python/lancedb/namespace.py python/tests/test_namespace.py` - `uv run --extra tests --extra dev --no-sync ruff format python/lancedb/table.py python/lancedb/namespace.py python/tests/test_namespace.py` - `uv run --extra tests --extra dev --no-sync pytest python/tests/test_namespace.py::TestPushdownOperations::test_lance_table_to_arrow_uses_query_pushdown python/tests/test_namespace.py::TestAsyncPushdownOperations::test_async_table_to_arrow_uses_query_pushdown python/tests/test_namespace.py::test_local_table_to_arrow_and_to_pandas_are_unchanged -q` - `uv run --extra tests --extra dev --no-sync pytest python/tests/test_namespace.py -q`	2026-06-08 19:48:40 +08:00
Will Jones	09b1bbc12a	refactor!: drop unused loss field from IndexStatistics (#3496 ) BREAKING CHANGE: direct Rust users lose the `IndexStatistics::loss` field. Python and Node.js consumers are unaffected in practice for remote tables (the value was always `None`/absent), but the attribute is gone for local tables too. `IndexStatistics::loss` was local-only — LanceDB Cloud never returned it, so `RemoteTable::index_stats` always set `loss: None`. It's vestigial; this removes it. - Remove `loss` from `IndexStatistics` and the internal `IndexMetadata` in `rust/lancedb/src/index.rs`, plus the summing logic in `NativeTable::index_stats`. - Drop `loss` from the Python and Node.js bindings (and their tests/docs). Fixes #3493 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 07:52:40 -07:00
Armaan Sandhu	3868965413	fix(python): run AsyncTable.search embeddings on a dedicated executor (#3459 ) ## Summary `AsyncTable.search()` computes the query embedding with `loop.run_in_executor(None, ...)`, which uses asyncio's default `ThreadPoolExecutor`. That pool is shared with all other `run_in_executor(None, ...)` work, so a slow embedding call — a heavy local model or an HTTP request to an embeddings API — ties up those threads and starves unrelated async I/O under concurrent load. This moves the (potentially blocking) embedding call onto a dedicated executor, isolating it from the default pool. Closes #3310. ## Problem `python/lancedb/table.py`, `AsyncTable.search()`: ```python return ( await loop.run_in_executor( None, # asyncio's default executor, shared with other blocking I/O embedding.function.compute_query_embeddings_with_retry, query, ) )[0] ``` Under load, concurrent searches whose embeddings block (or any other code using the default executor) contend for the same small thread pool. ## Change - Add a dedicated `ThreadPoolExecutor(thread_name_prefix="lancedb-embedding")` in `background_loop.py`, exposed via `embedding_executor()`. - Use it in `AsyncTable.search()`'s `make_embedding` instead of the default executor. - Reset the executor in the existing `_reset_after_fork` hook — its worker threads don't survive `fork()`, same as the background event loop. It's recreated lazily, so this is cheap. ## Design notes The issue asked whether maintainers preferred a configurable executor, a dedicated internal one, or another approach (no response in the thread). I went with a dedicated internal executor: it fixes the starvation with no public API change and stays consistent with the existing `LOOP` singleton. Making the pool size configurable would be an easy follow-up if preferred. Scope is limited to `search()`. The broader "embedding functions need real async support" (including `add()`) is tracked separately in #3268. 
## Testing - Added `test_async_search_runs_embedding_on_dedicated_executor`: patches the embedding function to record the executing thread during an async search and asserts it runs on a `lancedb-embedding` thread. Verified it fails against the previous `run_in_executor(None, ...)` and passes with the fix. - `ruff format`, `ruff check`, and `pyright` pass on the changed files.	2026-06-04 21:57:16 -07:00
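The dedicated-executor change described above can be sketched in isolation. This is a minimal illustration, not the code from `background_loop.py`: only the `lancedb-embedding` thread-name prefix and the `embedding_executor()` name come from the commit message; `reset_embedding_executor`, `fake_embed`, and `embed_query` are hypothetical stand-ins.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

# Hypothetical module-level singleton; illustrative only, not LanceDB internals.
_EMBEDDING_EXECUTOR: Optional[ThreadPoolExecutor] = None


def embedding_executor() -> ThreadPoolExecutor:
    """Lazily create the dedicated pool so it can be rebuilt after a reset."""
    global _EMBEDDING_EXECUTOR
    if _EMBEDDING_EXECUTOR is None:
        _EMBEDDING_EXECUTOR = ThreadPoolExecutor(
            thread_name_prefix="lancedb-embedding"
        )
    return _EMBEDDING_EXECUTOR


def reset_embedding_executor() -> None:
    """Drop the pool (e.g. from a fork hook); the next call recreates it."""
    global _EMBEDDING_EXECUTOR
    _EMBEDDING_EXECUTOR = None


def fake_embed(query: str) -> List[str]:
    """Stand-in for a blocking embedding call; reports the worker thread name."""
    return [threading.current_thread().name]


async def embed_query(query: str) -> str:
    loop = asyncio.get_running_loop()
    # Pass the dedicated executor instead of None (asyncio's shared default pool).
    result = await loop.run_in_executor(embedding_executor(), fake_embed, query)
    return result[0]


thread_name = asyncio.run(embed_query("hello"))
```

Because the blocking call runs on its own pool, a slow embedding no longer occupies threads that other `run_in_executor(None, ...)` callers depend on.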
hashwnath	64194ea8ad	fix(python): make LanceDBClientError pickleable (#3470 ) ## Summary - Add `__reduce__` methods to `LanceDBClientError` and `RetryError` so that instances can be pickled and unpickled correctly - `HttpError` inherits the fix from `LanceDBClientError` since it has no additional `__init__` parameters - Add tests verifying pickle roundtrip for all three exception classes Fixes #3447 ## Test plan - [x] Verified pickle roundtrip for `LanceDBClientError` with and without `status_code` - [x] Verified pickle roundtrip for `HttpError` (subclass, no extra init params) - [x] Verified pickle roundtrip for `RetryError` (subclass with many extra params) - [ ] CI tests pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Will Jones <willjones127@gmail.com>	2026-06-04 09:29:15 -07:00
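For context on why `__reduce__` is needed: by default an exception is rebuilt from `self.args`, so a subclass whose `__init__` takes more arguments than it forwards to `super().__init__()` fails to unpickle. The sketch below uses hypothetical classes (`SketchClientError`, `SketchHttpError`) with an assumed constructor signature; it is not the actual `LanceDBClientError` definition.

```python
import pickle
from typing import Optional


class SketchClientError(RuntimeError):
    """Hypothetical error type with constructor arguments beyond the message."""

    def __init__(self, message: str, request_id: str,
                 status_code: Optional[int] = None):
        super().__init__(message)  # only the message ends up in self.args
        self.request_id = request_id
        self.status_code = status_code

    def __reduce__(self):
        # Tell pickle to rebuild with the full constructor argument list,
        # not just self.args (which would omit request_id).
        return (type(self), (self.args[0], self.request_id, self.status_code))


class SketchHttpError(SketchClientError):
    """No extra __init__ parameters, so the inherited __reduce__ is enough."""


def roundtrip(err: BaseException) -> BaseException:
    """Pickle and unpickle an exception instance."""
    return pickle.loads(pickle.dumps(err))


original = SketchHttpError("boom", request_id="req-1", status_code=503)
rebuild_cls, rebuild_args = original.__reduce__()
rebuilt = rebuild_cls(*rebuild_args)
```

`type(self)` in `__reduce__` is what lets the subclass round-trip as itself without overriding anything.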
Lance Release	952055d428	Bump version: 0.33.1-beta.1 → 0.33.1-beta.2	2026-06-04 06:04:37 +00:00
Yang Cen	927ba2c948	fix(python): route blob query pandas through scanner (#3491 ) ## Bug Fix ### What is the bug? `QueryBuilder.to_pandas(blob_mode="descriptions")` could still fall back to `self.to_arrow()` for query outputs with blob columns. Custom query subclasses or wrappers can have `to_arrow()` behavior that is not compatible with pandas blob-description conversion, which can surface as low-level Arrow/list-batch conversion failures. ### What issues or incorrect behavior does the bug cause? Callers need to carry local `to_pandas` or plain-scan adapter special casing for blob descriptions, and scanner-only kwargs such as row addresses and fragment selection are not represented in LanceDB query state. ### How does this PR fix the problem? This PR routes blob-output query `to_pandas()` through the Lance scanner path for `lazy`, `bytes`, and `descriptions` modes when the query is a scanner-backed plain scan. For `blob_mode="descriptions"` with `flatten`, it collects scanner Arrow/table output, applies LanceDB `flatten_columns`, and converts to pandas from there. Non-plain blob query shapes now fail with a clear unsupported error instead of falling into subclass `to_arrow()` behavior. It also adds Python query state and builder methods for scanner-only plain-scan parameters: - `with_row_address()` for `_rowaddr` - `with_fragments(...)` for Lance fragment objects - `fragment_ids([...])` as a convenience wrapper that resolves IDs to Lance fragments ## Validation - `cd python && uv run --no-sync ruff format --check python/lancedb/query.py python/tests/test_query.py` - `cd python && uv run --no-sync ruff check python/lancedb/query.py python/tests/test_query.py` Targeted pytest was intentionally not run locally per maintainer request.	2026-06-04 14:03:33 +08:00
Will Jones	a16676e05f	ci: update python lockfile weekly (#3498 ) Make sure we are getting security fixes in there regularly, and other useful bumps.	2026-06-03 15:24:32 -07:00
Harikrishna KP	4e44262499	test(python): add regression test for nullable struct with None (#2654 ) (#3483 ) ## Summary Regression test for [issue #2654](https://github.com/lancedb/lancedb/issues/2654) — a nullable struct column whose first batch contains only `None` values crashed in `_align_field_types` with `AttributeError: 'pyarrow.lib.DataType' object has no attribute 'fields'`. The actual fix landed in #3394, but no test was added. This PR adds the reproducer from the issue as a test. ## Test plan - `test_add_nullable_struct_with_none`: creates a table with a nullable struct column, adds a row with a non-null struct value, then a row with `None` for the struct field. Verifies both rows land correctly. - Uses Lance file format v2.1 (`new_table_data_storage_version="2.1"`) because nullable structs aren't supported on v2.0. ## Related - #3028 (the original fix attempt, now superseded)	2026-06-03 14:13:09 -07:00
devteamaegis	9969191d0d	fix(rerankers): guard against empty vector_results in RRFReranker.rerank_multivector (#3467) ## What's broken Calling `RRFReranker().rerank_multivector([])` crashes with `IndexError: list index out of range` because the method accesses `vector_results[0]` for the type-homogeneity check before verifying that the list is non-empty. The `all()` call passes vacuously on an empty iterable, so the crash occurs on the lines that follow. ```python from lancedb.rerankers import RRFReranker RRFReranker().rerank_multivector([]) # IndexError: list index out of range ``` ## Why it happens The type check uses `vector_results[0]` as the reference type but never guards against an empty list. `all(...)` returns `True` when the iterable is empty, so the existing check does not stop execution before the unguarded `vector_results[0]` access on the lines that follow. ## Fix Add an explicit empty-list check before any indexing.	2026-06-03 14:06:33 -07:00
devteamaegis	1e7326cd8c	fix(rerankers/mrr): raise ValueError on empty vector_results list (#3469 ) ## What's broken `MRRReranker.rerank_multivector([])` raises `IndexError: list index out of range`. The crash happens on line 128 (the `all()` type-homogeneity check passes vacuously on an empty iterable) and on line 134 which accesses `vector_results[0]` unconditionally, with no prior guard for an empty list. ## Why it happens `all()` over an empty iterable returns `True`, so the type check silently passes and execution falls through to `vector_results[0]` which crashes. ## Fix Added a two-line guard at the top of `rerank_multivector` that raises a clear `ValueError("vector_results must not be empty")` before any indexing occurs. ## Test Added `test_mrr_reranker_empty_input` in `test_rerankers.py` which calls `rerank_multivector([])` and asserts that a `ValueError` with the message "must not be empty" is raised. Fixes #3468 Co-authored-by: Aegis Dev <aegis@devteamaegis.com>	2026-06-03 14:05:43 -07:00
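The failure mode in the two reranker fixes above comes down to `all()` returning `True` for an empty iterable. A minimal sketch of the guard ordering follows, using a hypothetical helper `check_rankings` rather than the real `rerank_multivector` body; only the error message text is taken from the commit.

```python
from typing import Sequence


def check_rankings(vector_results: Sequence[object]) -> type:
    """Validate a list of rankings and return their shared type."""
    # Guard first: all() over an empty iterable is True, so the homogeneity
    # check below cannot catch an empty list on its own.
    if len(vector_results) == 0:
        raise ValueError("vector_results must not be empty")

    first_type = type(vector_results[0])  # safe: the list is non-empty here
    if not all(isinstance(r, first_type) for r in vector_results):
        raise ValueError("all rankings must share one type")
    return first_type


vacuous = all(isinstance(r, int) for r in [])  # passes with nothing to check

try:
    check_rankings([])
    empty_error = None
except ValueError as exc:
    empty_error = str(exc)
```

Raising `ValueError` before any indexing gives callers a message that names the actual problem instead of a bare `IndexError`.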
Lance Release	ac3411e81e	Bump version: 0.33.1-beta.0 → 0.33.1-beta.1	2026-06-03 11:16:51 +00:00
Yang Cen	6f18eb4cce	feat(python): support blob modes in query to_pandas (#3487 ) ## Feature - What is the new feature? - Adds `blob_mode` support to sync and async Python query `to_pandas()` APIs. - Enables plain scan queries to return blob columns as lazy `BlobFile` objects, raw bytes, or blob descriptions. - Lets namespace-backed local tables use Lance native blob-aware pandas conversion for lazy blobs. - Why do we need this feature? - Table and Lance dataset/scanner APIs already support blob-aware pandas conversion, but LanceDB query builders did not expose that capability. - Geneva and other callers should be able to use query-level `to_pandas(blob_mode=...)` without manually constructing Lance scanners. - How does it work? - Plain scan queries route through Lance scanner native `to_pandas(blob_mode=...)`, preserving filter, projection, limit, offset, row id, and alias/expression projection behavior. - Non-native query shapes keep existing Arrow fallback semantics and raise a clear error when they return blob columns with `blob_mode="lazy"` or `blob_mode="bytes"`. - Focused tests cover table/query blob modes, filter/select/limit/offset/alias query cases, async query behavior, vector-query error boundaries, and namespace-backed lazy blobs. 
## Validation - `cd python && .venv/bin/maturin develop --uv --extras tests,dev --profile dev` - `cd python && uv run --frozen --no-sync pytest python/tests/test_table.py::test_table_to_pandas_blob_modes python/tests/test_table.py::test_async_table_to_pandas_blob_bytes python/tests/test_query.py::test_plain_scan_query_to_pandas_blob_modes python/tests/test_query.py::test_plain_scan_query_to_pandas_blob_projection python/tests/test_query.py::test_async_plain_scan_query_to_pandas_blob_projection python/tests/test_query.py::test_vector_query_to_pandas_blob_mode_requires_native_path python/tests/test_namespace.py::TestNamespaceConnection::test_table_to_pandas_blob_lazy_through_namespace -q` - `cd python && uv run --frozen --no-sync ruff format --check .` - `cd python && uv run --frozen --no-sync ruff check .` - `git diff --check`	2026-06-03 19:15:44 +08:00
Brendan Clement	379684391e	feat: deprecate replace_field_metadata for update_field_metadata (#3484 ) ### Summary Deprecates the Python replace_field_metadata (on Table and AsyncTable) in favor of update_field_metadata. Mirrors Lance, which already deprecated Dataset.replace_field_metadata for update_field_metadata. Stacked on top of #3482 as this was a follow-up task after adding update_field_metadata	2026-06-02 14:02:22 -07:00
Brendan Clement	d065be0474	feat: add update_field_metadata to edit per-field metadata (#3482) ### Summary Adds update_field_metadata to the client SDK (Rust core, Python, and TypeScript) so clients can edit per-field (column) Arrow metadata (schema.fields[].metadata) ### Testing - added unit tests - ran E2E against a local server on both local and remote tables (set → merge → delete), across Python sync/async and TypeScript ### Next steps - deprecate replace_field_metadata in the Python lancedb package in favor of this (TypeScript never had a replace_field_metadata method). This matches Lance's API direction (Lance already deprecated replace_field_metadata for update_field_metadata)	2026-06-02 07:00:00 -07:00
Xuanwo	a327044e2f	feat(python): support remote tables in PyTorch dataloaders (#3432 ) This PR makes remote LanceDB tables usable from PyTorch multiprocessing workers. Remote tables now carry enough safe JSON connection state to reopen themselves after pickle/spawn or fork, and permutations lazily rebuild their reader from restored tables instead of trying to reuse process-local handles. This addresses the remote-table gap in the PyTorch dataset path while preserving the explicit connection factory escape hatch for custom worker-side credential loading or non-serializable header providers. Validated with targeted remote table, permutation, and PyTorch DataLoader tests.	2026-06-02 15:38:28 +08:00
Lance Release	60f961584c	Bump version: 0.33.0-beta.1 → 0.33.1-beta.0	2026-06-01 12:41:02 +00:00


1061 Commits