lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-05-24 23:40:39 +00:00

Author	SHA1	Message	Date
Will Jones	a69dd0f62b	fix: remove primary key constraint from MemWAL bucket sharding Lance v7.0.0-rc.1 intentionally removed the requirement for bucket_sharding to match (or even require) the unenforced primary key column. Update LanceDB to match: drop the PK-related doc comments and the test assertions that expected rejection when no PK is set or when the bucket column differs from the PK. The Rust changes are taken from #3435; this commit additionally applies the equivalent updates to the Python and TypeScript bindings. See https://github.com/lance-format/lance/issues/6917 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 14:18:58 -07:00
Will Jones	02b112e931	test: ignore latest_version_hint.json in manifest dir assertions Lance v7.0.0-rc.1 writes _versions/latest_version_hint.json on non-lexically-ordered stores (e.g. the local filesystem) to speed up latest-version lookup. The V2-manifest-path tests iterate the whole _versions directory and asserted every entry matches the manifest filename pattern; filter to *.manifest files so the hint file is ignored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 12:54:31 -07:00
Justin Miller	4bccb43e56	fix(python): route sync BaseQueryBuilder.to_batches through async path (#3425 ) ## Summary Fixes #3424. `LanceTakeQueryBuilder.to_batches()` raised `AttributeError: 'AsyncTakeQuery' object has no attribute 'execute'`. The inherited `BaseQueryBuilder.to_batches` called `self._inner.execute(...)`, but `self._inner` is an `AsyncQueryBase` (Python wrapper) — only its native inner exposes `execute`. Every other sync builder overrides `to_batches`, so the bug only surfaced on take-query builders, which inherit the base unchanged. `take_offsets(...).to_batches()` is broken for the same reason. Route the sync wrapper through the async `to_batches` on the background event loop, so the native `execute` is invoked from inside an awaiting context (matching how the async path works correctly). ## Repro ```python import lancedb, pyarrow as pa, tempfile db = lancedb.connect(tempfile.mkdtemp()) tbl = db.create_table("t", data=pa.table({"a": list(range(100))})) tbl.take_row_ids([0, 1, 2]).to_arrow() # works tbl.search().to_batches() # works list(tbl.take_row_ids([0, 1, 2]).to_batches()) # AttributeError (before) ``` ## Test plan - [x] New regression test `test_take_queries_to_batches` covers `take_offsets(...).to_batches()`, `take_row_ids(...).to_batches()`, and the `select(...)` projection — all fail on `main` with the patch reverted, all pass with the fix. - [x] `test_take_queries`, `test_query_builder_batches`, and `test_query_schema` still pass. - [x] `ruff format --check` and `ruff check` clean on changed files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 12:11:13 -07:00
Xuanwo	d5dc4c0f06	fix: discover nested vector columns by default (#3423) LanceDB default vector column discovery only considered top-level fields, so tables with a single nested vector leaf still required users to pass an explicit field path. This updates Rust and Python discovery to recurse into struct fields, return canonical field paths, and preserve actionable errors when no default or multiple defaults exist. The explicit nested path flow for index creation and search remains supported across Rust, Python, and Node, with regression coverage for single nested vector leaves, multiple candidate leaves, and schemas without vector leaves. Closes #3405.	2026-05-21 19:02:41 +08:00
Sean Mackrory	55ae6197c1	fix(python): drop version from Table __repr__ (#3411) There have been a couple of reports of this function freezing debuggers because it triggers a network round-trip but is assumed to be extremely lightweight: https://github.com/lancedb/lancedb/discussions/2853. We'll just cache the last version we see. I considered digging in to see if we could assume or get the version at create time or after other operations, but that could be a bit of a rabbit hole as I'm a bit unfamiliar with this. Claude was having a hard time of it too 😅 I propose we see how the current implementation goes and improve it further only if people find "unknown" or stale values coming up disruptively often. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 12:20:46 -07:00
Pragnyan Ramtha	15bd821825	fix(python): check all table pages for db membership (#3395 ) ## Summary - Fix `name in db` and `len(db)` for local Python connections with more than one page of tables. - Use `list_tables()` pagination instead of deprecated `table_names()` with its default 10-item page. - Add regression coverage with 20 tables so later pages are included. Fixes #2727. ## Validation - `python3 -m py_compile python/python/lancedb/db.py python/python/tests/test_db.py` - No-build Python harness that extracts and executes the edited `LanceDBConnection` pagination methods: passed - `uvx ruff check python/python/lancedb/db.py python/python/tests/test_db.py` - `uvx ruff format --check python/python/lancedb/db.py python/python/tests/test_db.py` Note: `uv run pytest python/tests/test_db.py::test_db_contains_and_len_include_all_table_name_pages -q` was attempted first, but it stayed in the broad Rust/PyO3 native extension build and was stopped before pytest started.	2026-05-20 10:31:10 -07:00
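The fix above depends on walking every page of a listing instead of trusting one default-sized page. A minimal sketch of that pattern follows; `list_page` and `make_fake_listing` are hypothetical stand-ins written for illustration and are not LanceDB's actual `list_tables()` signature.

```python
from typing import Callable, Iterator, Optional, Tuple

# Hypothetical page fetcher: takes a page token, returns (names, next_token).
PageFn = Callable[[Optional[str]], Tuple[list, Optional[str]]]


def iter_all_names(list_page: PageFn) -> Iterator[str]:
    """Yield names from every page until the listing reports no next token."""
    token = None
    while True:
        names, token = list_page(token)
        yield from names
        if token is None:
            return


def make_fake_listing(all_names, page_size=10) -> PageFn:
    """In-memory stand-in for a paginated listing (illustration only)."""
    def list_page(token):
        start = 0 if token is None else int(token)
        end = start + page_size
        next_token = str(end) if end < len(all_names) else None
        return all_names[start:end], next_token
    return list_page


tables = [f"table_{i:02d}" for i in range(20)]
listing = make_fake_listing(tables)
found = list(iter_all_names(listing))
```

With 20 names and a page size of 10, membership and length checks built on `iter_all_names` see the second page as well as the first.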
Xuanwo	cf162c8a10	test(python): cover nested FTS field paths (#3418) Adds regression coverage for Python FTS APIs targeting nested text leaves, including sync and async match, phrase, and hybrid query paths. This also locks in the intended error boundary: nested text leaf paths are valid, while struct containers, non-text leaves, and missing paths remain rejected. Fixes #3404.	2026-05-21 00:49:00 +08:00
Xuanwo	2eba7ebd02	fix: return canonical nested index paths (#3413) Index metadata APIs now resolve stored field ids back to Lance canonical field paths instead of leaf names, so nested indexes such as `metadata.user_id` and escaped literal-dot fields round-trip through `list_indices()`. Native index creation also canonicalizes the input path before handing it to Lance, keeping local metadata consistent with the field-path contract while remote responses continue to expose server-provided canonical columns. Fixes #3403.	2026-05-21 00:20:47 +08:00
Xuanwo	5bfde47a8e	fix: support nested field paths in native index creation (#3408) Native index creation was resolving requested columns through top-level Arrow schema lookup before handing the request to Lance, which rejected nested paths and could collapse a nested field to its leaf name. This PR resolves index targets with Lance field-path semantics, passes the canonical path through to Lance, and reports indexed columns from field ids as canonical full paths. This also removes the Python native FTS guard that rejected dotted paths so scalar, vector, and FTS index creation share the same nested-field contract. Related to #3402.	2026-05-20 11:15:15 +08:00
Yang Cen	5d1c28922a	feat(python): align to_pandas pandas kwargs (#3397 ) ## Feature This PR aligns LanceDB Python `to_pandas()` APIs with Lance pandas conversion capabilities while keeping LanceDB query-specific semantics intact. - Adds `blob_mode` and pandas `kwargs` support to local table `to_pandas()`. - Delegates local `LanceTable.to_pandas()` to Lance dataset `to_pandas(blob_mode=..., kwargs)`. - Keeps remote table `to_pandas()` unsupported with `NotImplementedError`. - Allows sync and async query `to_pandas()` to forward pandas kwargs after LanceDB `flatten` and `timeout` handling. Why we need this feature: Users can access Lance blob-aware pandas conversion from LanceDB local tables and can pass PyArrow pandas conversion options through table/query APIs without losing existing `flatten` or `timeout` behavior. How it works: The table API exposes a `BlobMode` literal type for `lazy`, `bytes`, and `descriptions`. Local tables call through to the backing Lance dataset. Query APIs do not add `blob_mode`; they materialize Arrow results, apply LanceDB flattening when requested, and then call `to_pandas(**kwargs)`. ## Validation - `uv run --frozen --extra tests pytest python/tests/test_table.py::test_table_to_pandas_default_matches_arrow python/tests/test_table.py::test_table_to_pandas_blob_bytes python/tests/test_table.py::test_table_to_pandas_kwargs python/tests/test_query.py::test_query_to_pandas_kwargs python/tests/test_query.py::test_query_timeout python/tests/test_remote_db.py::test_table_to_pandas_not_supported` - `uv run --frozen --extra dev ruff check python/lancedb/table.py python/lancedb/query.py python/lancedb/remote/table.py python/tests/test_table.py python/tests/test_query.py python/tests/test_remote_db.py` - `uv run --frozen --extra tests pytest python/tests/test_table.py python/tests/test_query.py python/tests/test_remote_db.py` Note: `python/uv.lock` was intentionally not committed in this branch.	2026-05-19 20:05:51 +08:00
Drew Gallardo	aac6c62459	feat(python): add public take_offsets method on Permutation (#3375) Closes #3243. This PR exposes a new public API, `Permutation.take_offsets(offsets: list[int])`, since users previously had to call `__getitems__` directly to batch-fetch rows by position. The name matches the existing `Table.take_offsets` pattern, and the dunder methods `__getitem__` and `__getitems__` now delegate to it. Also fixes a parse error when `PermutationReader::take_offsets` gets an empty list; it now returns an empty `RecordBatch` with the correct schema instead. Bundled this because without the fix the new public API blows up on a perfectly reasonable input. `__getitems__` is preserved since PyTorch's batched DataLoader requires it. ### Testing - Added 3 new Rust tests for empty offsets including permutation table with Select::All, Select::Columns, and identity path - Added 3 new Python tests for the public API including a happy case, and empty input on both identity and permutation clippy, format, check all clean! cc: @westonpace	2026-05-18 09:35:56 -07:00
Heng Ge	0d30b31998	feat: support setting LSM write spec for a table (#3396 ) ## Summary Split out from #3354 Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` / `unset_lsm_write_spec` to install and clear the spec that selects Lance's MemWAL LSM-style write path for `merge_insert`. `LsmWriteSpec` offers three sharding strategies, all built on Lance's `InitializeMemWalBuilder`: - `LsmWriteSpec::bucket(column, num_buckets)` — hash-bucket sharding by the single-column unenforced primary key. - `LsmWriteSpec::identity(column)` — identity sharding by the raw value of a scalar column. - `LsmWriteSpec::unsharded()` — a single MemWAL shard. Each can be refined with `with_maintained_indexes(...)` (indexes the MemWAL keeps up to date as rows are appended) and `with_writer_config_defaults(...)` (default `ShardWriter` configuration recorded in the MemWAL index, so every writer starts from the same defaults). All variants require the table to have an unenforced primary key. - `set_lsm_write_spec` installs the spec by initializing the MemWAL index; `unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting to the standard `merge_insert` path. `unset` is idempotent. - Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`, `set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript (`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` / `"unsharded"`). `RemoteTable` returns `NotSupported`. The actual `merge_insert` LSM dispatch and `ShardWriter` write path are a follow-up — this PR only installs and clears the spec.	2026-05-18 00:11:33 -07:00
Heng Ge	6a431ff0a0	feat: support setting unenforced primary key (#3394 ) ## Summary Adds `Table::set_unenforced_primary_key` — records a single column as the table's unenforced primary key in Lance schema field metadata. "Unenforced" means LanceDB does not check uniqueness on write; the key is metadata that `merge_insert` consumes. - Single-column only; the column must exist and have a supported dtype (Int32, Int64, Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary). The API accepts an iterable for binding ergonomics but requires exactly one column — compound keys are rejected. - The primary key is immutable: calling this on a table that already has an unenforced primary key is rejected. Concurrent writers racing to set the key fail at commit time rather than silently overriding it. - `RemoteTable` returns `NotSupported`. - Bindings: Python (`AsyncTable`, `LanceTable`, `RemoteTable`) and TypeScript (`Table.setUnenforcedPrimaryKey`). ## Context Split out from #3354 per review feedback, so the unenforced primary key and the `merge_insert` sharding spec land as separate reviewable PRs. No Lance dependency bump — `main` is already on v7.0.0-beta.10, which includes the field-metadata round-trip fix the API relies on. Enforcing primary-key immutability at the Lance commit layer (so the cross-column concurrent race is also rejected) is a companion Lance change: lance-format/lance#6810.	2026-05-16 23:12:55 -07:00
Xin Sun	ab2c5adf5e	feat(nodejs): add order_by method to Query (#3123)	2026-05-16 22:49:08 -07:00
LanceDB Robot	7b74c3dd91	chore: update lance dependency to v7.0.0-beta.9 (#3391 ) ## Summary - Update Lance Rust workspace dependencies from v7.0.0-beta.7 to v7.0.0-beta.9 using `ci/set_lance_version.py`. - Update the Java `lance-core.version` property to `7.0.0-beta.9`. - Refresh `Cargo.lock` for the Lance dependency bump. ## Verification - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all` Triggering Lance tag: https://github.com/lance-format/lance/releases/tag/v7.0.0-beta.9 --------- Co-authored-by: Daniel Rammer <hamersaw@protonmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 12:56:29 -05:00
Shengan Zhang	64aeee84a8	feat(python): support `bytes` in `lit()` expressions (#3387 ) Closes #3261. ## Summary Adds `bytes` to the accepted types of `lancedb.expr.lit()` so that binary scalars can be used in filter / projection expressions. The previous attempt in #3235 had to be reverted because DataFusion's SQL unparser does not support `Binary` / `LargeBinary` scalars, so any expression containing such a literal would fail in both `to_sql()` and `__repr__`. ## How `expr_to_sql_string` now has two paths: - Fast path (no binary literals): delegate to DataFusion's unparser unchanged. - Slow path: rewrite each `Binary(Some(bytes))` literal in the tree to a unique string-literal placeholder, run the unparser, then substitute `'<placeholder>'` with `X'<HEX>'` in the resulting SQL. `Binary(None)` / `LargeBinary(None)` are rewritten to `ScalarValue::Null` so the unparser emits plain `NULL`. This keeps DataFusion as the single source of truth for operator and function serialization, so binary literals work in every expression node type the unparser already supports — including nested cases like `contains(col("data"), lit(b"\xff"))`, `NOT (col == lit(b"..."))`, and `col.cast(...) == lit(b"...")`. ## Changes - `rust/lancedb/src/expr/sql.rs`: placeholder-substitution implementation. - `rust/lancedb/src/expr.rs`: 4 new unit tests covering binary literals in equality, compound predicates, scalar function calls, negation, and `NULL` binary literals. - `python/src/expr.rs`: `expr_lit` accepts `PyBytes` and produces `ScalarValue::Binary`. - `python/Cargo.toml` + `Cargo.lock`: pull in `datafusion-common` for `ScalarValue`. - `python/python/lancedb/expr.py`: extend `ExprLike` and `lit()` type annotations / docstrings with `bytes`. - `python/python/lancedb/_lancedb.pyi`: update `expr_lit` stub. - `python/tests/test_expr.py`: unit tests for `to_sql` / `repr` of binary literals and an integration test against a real `pa.binary()` column for equality / inequality / compound filters. 
## Example ```python from lancedb.expr import col, lit, func # Equality against a binary column col("payload") == lit(b"\xca\xfe") # Expr((payload = X'CAFE')) # Nested inside a function call (previously failed) func("contains", col("data"), lit(b"\xff")) # Expr(contains(data, X'FF')) # repr() no longer crashes repr(lit(b"\xde\xad\xbe\xef")) # "Expr(X'DEADBEEF')" ``` ## Verification - [x] `cargo test -p lancedb --lib expr::` — 12/12 pass (was 9; +3 new tests) - [x] `cargo check --features remote --tests --examples` — clean - [x] `cargo clippy --features remote --tests --examples` — no warnings - [x] `cargo fmt --all -- --check` — clean - [x] `pytest python/tests/test_expr.py` — 76/76 pass (was 74; +2 new tests) - [x] `ruff check python` / `ruff format --check python` — clean ## Follow-ups (not in this PR) Issue #3261 also raises the possibility of a truncated `__repr__` for very large binary literals. This PR keeps `__repr__` exact (it forwards to `to_sql()`), since truncating display output would diverge from the SQL that actually gets executed. A display-only truncation could be added in a follow-up by giving `__repr__` its own renderer. Made with [Cursor](https://cursor.com) Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-14 15:24:52 -07:00
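The placeholder-substitution approach described above can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions, not the Rust implementation: `fake_unparse` is a hypothetical stand-in for an unparser that cannot render binary literals, and expressions are modelled as small tuples invented for this example.

```python
import uuid


def fake_unparse(tree):
    """Hypothetical unparser: handles columns, string literals, and equality only."""
    kind = tree[0]
    if kind == "col":
        return tree[1]
    if kind == "lit":
        if not isinstance(tree[1], str):
            raise TypeError("unparser cannot render this literal")
        return "'" + tree[1] + "'"
    if kind == "eq":
        return f"({fake_unparse(tree[1])} = {fake_unparse(tree[2])})"
    raise ValueError(kind)


def to_sql(tree):
    """Swap bytes literals for unique string placeholders, unparse, then patch in X'HEX'."""
    placeholders = {}

    def rewrite(node):
        if node[0] == "lit" and isinstance(node[1], bytes):
            key = "bin_" + uuid.uuid4().hex
            placeholders[key] = node[1]
            return ("lit", key)
        if node[0] == "eq":
            return ("eq", rewrite(node[1]), rewrite(node[2]))
        return node

    sql = fake_unparse(rewrite(tree))
    # When no binary literal was found, this loop is a no-op.
    for key, raw in placeholders.items():
        sql = sql.replace("'" + key + "'", "X'" + raw.hex().upper() + "'")
    return sql


sql = to_sql(("eq", ("col", "payload"), ("lit", b"\xca\xfe")))
# sql == "(payload = X'CAFE')"
```

Because the unparser only ever sees string literals, it stays the single place where operators and functions are serialized; the binary handling is confined to the rewrite and the final substitution.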
Brendan Clement	f893589356	fix(python): invalid namespace mode/behavior was silently ignored, now raises ValueError (#3388) Follow-up to #3371, which added runtime validation for namespace `mode` and `behavior` parameters in the NodeJS SDK. Bringing the same fix to Python for cross-SDK consistency. Before: unrecognized values were silently dropped to `None`, so `db.create_namespace(["x"], mode="foobar")` would quietly fall through to the server's default mode and hide caller typos. After: raises `ValueError` listing the valid values.	2026-05-14 15:17:44 -07:00
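The before/after behavior can be sketched as a small validator. The set of valid values below is hypothetical (the commit message does not list them), and `validate_choice` is an illustrative helper, not a LanceDB function.

```python
# Hypothetical valid values, chosen only for illustration.
VALID_MODES = ("create", "exist_ok", "overwrite")


def validate_choice(name, value, valid):
    """Return value unchanged, or raise ValueError naming the accepted values."""
    if value is None:
        return None  # caller did not set it; the server default applies
    if value not in valid:
        raise ValueError(
            f"Invalid {name} {value!r}; expected one of: {', '.join(valid)}"
        )
    return value
```

Raising at the call site surfaces a typo such as `mode="foobar"` immediately instead of letting it fall through to a default.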
Shengan Zhang	650f173236	feat(python): add IVF_HNSW_FLAT vector index support (#3366) ## Summary Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was documented at https://docs.lancedb.com/indexing/vector-index but `lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised `ValueError: Unknown index type IVF_HNSW_FLAT`; the underlying `pylance` already accepted it, only the LanceDB wrapper was missing the wiring. Rust core (`rust/lancedb`): - Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the `IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`). - Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)` helper, keeping symmetry with the other `IVF_HNSW_` variants. - Forward the variant in `RemoteTable::create_index` and add two parametrised tests (default + customised config) for the JSON serialisation. - New `NativeTable` integration test (`test_create_index_ivf_hnsw_flat`). Python binding (`python/`): - New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias. - PyO3 `extract_index_params` recognises the `HnswFlat` config. - `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync `RemoteTable.create_index` both dispatch to the new config. - `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover the new type so `pyright`/`make check` stays clean. - Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage. - Existing `test_index_statistics_index_type_lists_all_supported_values` updated to include `IVF_HNSW_FLAT`. A matching Node.js / TypeScript binding is in a follow-up PR. 
Closes #3331 ## Test plan - [ ] `cargo check --quiet --features remote --tests --examples` - [ ] `cargo test --quiet --features remote -p lancedb` (covers the new `test_create_index_ivf_hnsw_flat` and the two new parametrised `RemoteTable::create_index` cases) - [ ] `cargo fmt --all` / `cargo clippy --quiet --features remote --tests --examples` - [ ] `cd python && make develop && make check && make test` (covers the two new async tests, the alias test, the dispatcher test, and the updated `test_index_statistics_index_type_lists_all_supported_values` assertion)	2026-05-11 15:08:32 -07:00
Xuanwo	9b21c136c6	feat(python): support model-backed native FTS tokenizers (#3289 ) This wires Lance's existing `jieba/` and `lindera/` native FTS tokenizers through the Python SDK instead of leaving them behind disabled features and narrow public typing. It also documents the `LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for successful CJK indexing plus missing-model error guidance. Closes #2168.	2026-05-08 23:53:14 +08:00
Weston Pace	a17c241e86	feat(python): make Permutation fork-safe for PyTorch DataLoader workers (#3339 ) ## Summary PyTorch's `DataLoader` uses fork-based multiprocessing by default on Linux, but threads do not survive `fork()`. LanceDB's Python bindings drive async work through two threaded layers, both of which become inert in a forked child: - `BackgroundEventLoop` runs an asyncio loop on a Python `threading.Thread`. - `pyo3-async-runtimes::tokio` holds a global multi-threaded tokio runtime whose worker threads also die on fork — and its runtime lives in a `OnceLock` that cannot be replaced after first use. As a result, any `Permutation` (or other async API) used inside a fork-based `DataLoader` worker hangs indefinitely. This PR makes both layers fork-safe so `Permutation` works as a `torch.utils.data.Dataset` with `num_workers > 0`. ## Approach ### Rust — new `python/src/runtime.rs` Mirrors the pattern used in [Lance's Python bindings](`456198cd6f/python/src/lib.rs (L139)`), adapted for the async-bridge use case. - `LanceRuntime` implements `pyo3_async_runtimes::generic::Runtime + ContextExt`, backed by an `AtomicPtr<tokio::runtime::Runtime>` we own (sidestepping `pyo3-async-runtimes`'s frozen `OnceLock` global). - A `pthread_atfork(after_in_child)` handler nulls the pointer; the next `spawn` rebuilds the runtime in the child. The previous runtime is intentionally leaked — calling `Drop` would try to join now-dead worker threads and hang. - `runtime::future_into_py` is a drop-in for `pyo3_async_runtimes::tokio::future_into_py`. All ~80 call sites in `arrow.rs` / `connection.rs` / `permutation.rs` / `query.rs` / `table.rs` are updated to route through it. - `python/Cargo.toml` adds `libc = "0.2"` and the tokio `rt-multi-thread` feature. ### Python — `lancedb/background_loop.py` - Refactors `BackgroundEventLoop.__init__` to a reusable `_start()` method. 
- An `os.register_at_fork(after_in_child=…)` hook calls `LOOP._start()` to give the singleton a fresh asyncio loop and thread in place. This matters because the rest of the codebase imports `LOOP` via `from .background_loop import LOOP` — rebinding the module attribute would leave those references holding the dead loop. ### Python — `lancedb/__init__.py` Removes the `__warn_on_fork` pre-fork warning (and the now-unused `import warnings`). Fork is supported. ## Test plan - [x] New `test_permutation_dataloader_fork_workers` in `python/tests/test_torch.py`: runs a `Permutation` through `torch.utils.data.DataLoader(num_workers=2, multiprocessing_context="fork")` inside a spawn-isolated child with a 30s hang detector. Pre-fix: timed out at 36s. Post-fix: passes in ~3.6s. - [x] New `test_remote_connection_after_fork` in `python/tests/test_remote_db.py`: forks a child that creates a fresh `lancedb.connect(...)` against a mock HTTP server and calls `table_names()`; passes in <1s, validates the runtime reset is sufficient for fresh remote clients. - [x] All 62 tests in `test_torch.py` + `test_permutation.py` pass. - [x] All 35 tests in `test_remote_db.py` pass. - [x] `test_table.py` (87) + `test_db.py` + `test_query.py` (157, minus one unrelated `sentence_transformers` import skip) — 244 passing. - [x] `cargo clippy -p lancedb-python --tests` clean. - [x] `cargo fmt`, `ruff check`, `ruff format` all clean. ## Known limitation (follow-up) This PR makes a freshly-built `lancedb.connect(...)` work in a forked child. An inherited `Connection` from the parent still carries an inherited `reqwest::Client` whose hyper connection pool references socket FDs and TCP/TLS state shared with the parent — using it from the child after fork is unsafe (especially with HTTP/1.1 keep-alive). The recommended pattern for fork-based `DataLoader` workers that hit a remote DB is to construct a new connection inside the worker. 
Auto-clearing inherited HTTP client pools on fork would require tracking live `Connection` instances in `lancedb` core and is left for a follow-up PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 13:44:10 -07:00
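The Python half of the approach (restart the singleton's loop and thread in place after fork, so existing references stay valid) can be sketched with the standard library alone. `BackgroundLoop` here is a simplified illustration, not the actual `BackgroundEventLoop` class, and it does not cover the Rust runtime reset.

```python
import asyncio
import os
import threading


class BackgroundLoop:
    """Runs an asyncio loop on a daemon thread; _start() rebuilds it in place."""

    def __init__(self):
        self._start()

    def _start(self):
        # Replace the loop and thread on the same object so that modules
        # holding a reference to the singleton see the fresh loop.
        self.loop = asyncio.new_event_loop()
        self.thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self.thread.start()

    def run(self, coro):
        """Submit a coroutine to the background loop and block for its result."""
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()


LOOP = BackgroundLoop()

# Threads do not survive fork(), so give the child a fresh loop and thread.
# os.register_at_fork is unavailable on some platforms (e.g. Windows).
if hasattr(os, "register_at_fork"):
    os.register_at_fork(after_in_child=LOOP._start)


async def answer():
    await asyncio.sleep(0)
    return 42


result = LOOP.run(answer())
```

Mutating the existing object rather than rebinding a module-level name is the design point: code that did `from module import LOOP` keeps working in the child.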
Weston Pace	1fc23e5473	fix(python): make Permutation picklable for PyTorch multiprocessing (#3335) ## Summary When PyTorch is used with multiprocessing in `spawn` mode, the Permutation needs to be pickled. It could not be pickled because `Table` and `Connection` are not serializable. This PR adds pickle support to Permutation without adding general pickle support to `Table` or `Connection`. To add general support we probably need to start by adding serialization in the namespace client. In the meantime this PR enables pickling by adding special cases for: * In-memory tables (just serialize as Arrow IPC) * Native tables (serialize the URI) If a user is not using one of the above cases (e.g. using a remote connection) then they will need to provide a connection factory that can be pickled. ## Breaking change `PermutationBuilder.persist(...)` is removed from the Python bindings; the permutation table is now always in-memory. The underlying Rust `PermutationBuilder::persist` API is untouched and can be re-exposed later if needed. It probably won't make sense to do that until we have a way to serialize `Table` and `Connection`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 21:37:58 -07:00
Nitesh Yadav	59db036118	fix(python): add missing space in hybrid query error message (#3340 ) Hi, the hybrid query error message looks like it can use a space, just added it. ```python def _validate_query(self, query, vector=None, text=None): if query is not None and (vector is not None or text is not None): raise ValueError( "You can either provide a string query in search() method" "or set `vector()` and `text()` explicitly for hybrid search." "But not both." ) ```	2026-05-02 15:51:00 -07:00
Jack Ye	25dfe2cfd4	feat: add manifest-enabled directory namespace mode (#3332) Adds manifest_enabled for local/native connections so directory namespace manifests can be the source of truth, including migration from directory listing and Azure credential vending feature wiring. Also exposes the option through Rust, Python, and Node bindings with focused validation.	2026-04-29 09:22:06 -07:00
Xuanwo	c54888a83a	refactor(python): remove legacy tantivy FTS support (#3282) This follows the Rust-side Tantivy removal by deleting the remaining Python Tantivy runtime, tests, and packaging references. It also turns the legacy Python-only Tantivy parameters into explicit errors and stops reading legacy `_indices/fts` directories so Python FTS is fully native-only.	2026-04-20 09:28:45 +08:00
Jack Ye	f909df3e87	fix(python): use namespace-backed rust connection for namespace tables (#3286) So far, I have been using a hacky approach to create and open namespace-backed tables: getting the table's location and using a temporary lancedb connection to create or open it. This worked for features like credentials vending but no longer fully works for the managed versioning feature. Recently, geneva tests have been failing intermittently, and various patches have not addressed the root cause. This PR fully fixes this and implements a proper Rust binding for it. Specifically: - build a real Rust namespace-backed connection from the Python namespace client - route namespace table create/open through that connection instead of resolved-location temp connections - keep namespace client naming consistent in the Rust bridge and preserve federated namespace + DuckDB behavior	2026-04-18 21:17:52 -07:00
Jack Ye	5eaac178b1	fix(python): pass namespace client on schema-only table create (#3283) ## Summary - pass `namespace_client` through the Python create-table path - ensure schema-only namespace table creation uses the namespace-aware empty-table flow - fix reopening namespace tables created without initial data	2026-04-17 01:11:18 -07:00
Jack Ye	97a4b38f19	feat(rust): support nested namespace ops in listing db (#3279) ## Summary - delegate child-namespace `ListingDatabase` operations through an eagerly initialized `LanceNamespaceDatabase` - support nested namespace create/open/list/drop flows without requiring callers to inject explicit locations - add `namespace_client_properties` plumbing for local and namespace connections so directory namespace settings like `table_version_tracking_enabled` can be configured - add regression tests for nested namespace ops and namespace client property propagation	2026-04-16 10:12:28 -07:00
Gezi-lzq	10879d99b8	docs: fix broken documentation links (#3278)	2026-04-15 20:56:59 +08:00
Jack Ye	7f52ec8c36	feat(python): support child namespace operations and json serialization for LanceDBConnection (#3265) ## Summary Add connection serialization and child namespace support to `LanceDBConnection`. - `DBConnection.serialize()` / `lancedb.deserialize()` for connection reconstruction in remote workers - Cache `namespace_client()` in `LanceDBConnection` to avoid repeated DirectoryNamespace builds - `LanceDBConnection` transparently delegates child namespace operations (open_table, create_table, list_tables, drop_table, create_namespace, etc.) to `LanceNamespaceDBConnection` via `_namespace_conn()` - Root namespace operations still go through the original Rust path - Generic worker property override mechanism: any `namespace_client_properties` key prefixed with `_lancedb_worker_` has the prefix stripped and overrides the corresponding property when `deserialize(data, for_worker=True)` - `LanceNamespaceDBConnection` stores `namespace_client_impl`/`namespace_client_properties` for serialization roundtrip --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 16:49:45 -07:00
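The worker property override rule can be sketched as a plain dictionary transform. `resolve_properties` is an illustrative helper, not part of LanceDB; the property names in the example are hypothetical, and leaving prefixed keys untouched when `for_worker` is false is an assumption of this sketch.

```python
WORKER_PREFIX = "_lancedb_worker_"


def resolve_properties(properties: dict, for_worker: bool) -> dict:
    """Apply worker-prefixed keys as overrides when for_worker is set."""
    if not for_worker:
        # Assumption: outside workers the properties pass through unchanged.
        return dict(properties)
    resolved = {
        key: value
        for key, value in properties.items()
        if not key.startswith(WORKER_PREFIX)
    }
    for key, value in properties.items():
        if key.startswith(WORKER_PREFIX):
            # Strip the prefix and override the corresponding property.
            resolved[key[len(WORKER_PREFIX):]] = value
    return resolved


# Hypothetical property names, for illustration only.
props = {
    "uri": "http://driver",
    "_lancedb_worker_uri": "http://worker",
    "region": "us",
}
worker_props = resolve_properties(props, for_worker=True)
```

A single serialized payload can therefore carry both the driver-side value and the value a remote worker should use after deserialization.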
Dhruv Garg	4761fa9bcb	fix(python): migrate gemini-text provider to google-genai sdk (#3250) ## Summary - migrate gemini-text embedding provider from deprecated google.generativeai to google.genai - update Python embedding extra dependency to google-genai - update default model name to gemini-embedding-001 - adapt embed calls to Client().models.embed_content(...) - apply lint fixes from CI ## Related - Closes #3191	2026-04-09 15:28:34 -07:00
lennylxx	4c2939d66e	fix(python): guard against None before .decode() on split_names metadata key (#3229) `.get(b"split_names", None).decode()` was called unconditionally in both Permutations.__init__ and Permutation.from_tables(), crashing with AttributeError when schema metadata existed but lacked the split_names key. Guard the decode behind a None check and add regression tests.	2026-04-08 16:04:13 -07:00
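The guard amounts to checking the lookup result before decoding. A minimal sketch, with `read_split_names` as an illustrative helper name and a made-up metadata value:

```python
def read_split_names(metadata):
    """Return the decoded split_names value, or None when it is absent."""
    if metadata is None:
        return None
    raw = metadata.get(b"split_names")
    # Only decode when the key was actually present.
    return raw.decode() if raw is not None else None


missing = read_split_names({b"other_key": b"x"})
present = read_split_names({b"split_names": b'["train", "test"]'})
```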
yaommen	a813ce2f71	fix(python): sanitize bad vectors before Arrow cast (#3158 ) ## Problem `on_bad_vectors="drop"` is supposed to remove invalid vector rows before write, but for some schema-defined vector columns it can still fail later during Arrow cast instead of dropping the bad row. Repro: ```python class MySchema(LanceModel): text: str embedding: Vector(16) table = db.create_table("test", schema=MySchema) table.add( [ {"text": "hello", "embedding": []}, {"text": "bar", "embedding": [0.1] * 16}, ], on_bad_vectors="drop", ) ``` Before: ``` RuntimeError Arrow error: C Data interface error: Invalid: ListType can only be casted to FixedSizeListType if the lists are all the expected size. ``` After: ``` rows 1 texts ['bar'] ``` ## Solution Make bad-vector sanitization use schema dimensions before cast, while keeping the handling scoped to vector columns identified by schema metadata or existing vector-name heuristics. This also preserves existing integer vector inputs and avoids applying on_bad_vectors to unrelated fixed-size float columns. Fixes #1670 Signed-off-by: yaommen <myanstu@163.com>	2026-04-08 09:09:41 -07:00
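The core of the fix, filtering on the schema's vector dimension before any cast happens, can be sketched without Arrow. `drop_bad_vectors` is an illustrative helper, not LanceDB's implementation, and treating non-finite values as bad is an assumption of this sketch.

```python
import math


def drop_bad_vectors(rows, column, dim):
    """Keep only rows whose vector has exactly `dim` finite numeric entries."""
    def is_valid(vec):
        return (
            isinstance(vec, (list, tuple))
            and len(vec) == dim
            and all(
                isinstance(x, (int, float)) and math.isfinite(x) for x in vec
            )
        )

    return [row for row in rows if is_valid(row.get(column))]


rows = [
    {"text": "hello", "embedding": []},
    {"text": "bar", "embedding": [0.1] * 16},
]
kept = drop_bad_vectors(rows, "embedding", dim=16)
texts = [row["text"] for row in kept]
```

Applied to the repro data from the commit message, only the `bar` row survives, so a later fixed-size-list cast never sees the empty vector.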
Jack Ye	a898dc81c2	feat: add user_id field to ClientConfig for user identification (#3240 ) ## Summary - Add a `user_id` field to `ClientConfig` that allows users to identify themselves to LanceDB Cloud/Enterprise - The user_id is sent as the `x-lancedb-user-id` HTTP header in all requests - Supports three configuration methods: - Direct assignment via `ClientConfig.user_id` - Environment variable `LANCEDB_USER_ID` - Indirect env var lookup via `LANCEDB_USER_ID_ENV_KEY` Closes #3230 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-04-06 11:20:10 -07:00
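A rough sketch of how the three configuration methods in #3240 might be resolved. The precedence shown (explicit value, then `LANCEDB_USER_ID`, then the variable named by `LANCEDB_USER_ID_ENV_KEY`) is an assumption, since the message does not state an order, and `resolve_user_id` and `request_headers` are hypothetical helpers rather than LanceDB functions.

```python
import os


def resolve_user_id(configured=None, env=None):
    """Pick a user id from config or environment (assumed precedence)."""
    env = os.environ if env is None else env
    if configured:
        return configured
    if env.get("LANCEDB_USER_ID"):
        return env["LANCEDB_USER_ID"]
    # Indirect lookup: this variable holds the NAME of another variable.
    key = env.get("LANCEDB_USER_ID_ENV_KEY")
    if key:
        return env.get(key)
    return None


def request_headers(user_id):
    """Attach the header named in the commit message when an id is set."""
    return {"x-lancedb-user-id": user_id} if user_id else {}


headers = request_headers(
    resolve_user_id(env={"LANCEDB_USER_ID_ENV_KEY": "MY_USER", "MY_USER": "alice"})
)
```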
LanceDB Robot	d082c2d2ac	chore: update lance dependency to v5.0.0-beta.5 (#3237 ) ## Summary - update Rust Lance workspace dependencies to `v5.0.0-beta.5` using `ci/set_lance_version.py` - update Java `lance-core` dependency property to `5.0.0-beta.5` - refresh Cargo lockfile to the new Lance tag ## Verification - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all` ## Upstream Tag - https://github.com/lance-format/lance/releases/tag/v5.0.0-beta.5 --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>	2026-04-04 19:49:51 -07:00
Zelys	9d8699f99e	feat(python): support Enum types in Pydantic to Arrow schema conversion (#3232 ) ## Summary Fixes #1846. Python `Enum` fields raised `TypeError: Converting Pydantic type to Arrow Type: unsupported type <enum 'SomethingTypes'>` when converting a Pydantic model to an Arrow schema. The fix adds Enum detection in `_pydantic_type_to_arrow_type`. When an Enum subclass is encountered, the value type of its members is inspected and mapped to the appropriate Arrow type: - `str`-valued enums (e.g. `class Status(str, Enum)`) → `pa.utf8()` - `int`-valued enums (e.g. `class Priority(int, Enum)`) → `pa.int64()` - Other homogeneous value types → the Arrow type for that Python type - Mixed-value or empty enums → `pa.utf8()` (safe fallback) This covers the common `(str, Enum)` and `(int, Enum)` mixin patterns used in practice. ## Changes - `python/python/lancedb/pydantic.py`: add Enum branch in `_pydantic_type_to_arrow_type` - `python/python/tests/test_pydantic.py`: add `test_enum_types` covering `str`, `int`, and `Optional` Enum fields ## Note on #2395 PR #2395 handles `StrEnum` (Python 3.11+) specifically, using a dictionary-encoded type. This PR handles the broader `(str, Enum)` / `(int, Enum)` mixin pattern that works across all Python versions and stores values as their natural Arrow type. AI assistance was used in developing this fix.	2026-04-03 10:40:49 -07:00
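The mapping rules listed in #3232 can be sketched without PyArrow by returning Arrow type names as strings. This is not the code in `_pydantic_type_to_arrow_type`; `enum_arrow_type_name` and the small `ARROW_NAME` table are hypothetical, and falling back to `utf8` for homogeneous value types missing from the table is an assumption of this sketch.

```python
from enum import Enum

# Illustrative subset of Python type -> Arrow type name.
ARROW_NAME = {str: "utf8", int: "int64", float: "float64", bool: "bool"}


def enum_arrow_type_name(enum_cls):
    """Inspect member values and choose an Arrow type name."""
    value_types = {type(member.value) for member in enum_cls}
    if len(value_types) != 1:
        # Mixed-value or empty enums use the safe string fallback.
        return "utf8"
    return ARROW_NAME.get(value_types.pop(), "utf8")


class Status(str, Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"


class Priority(int, Enum):
    LOW = 1
    HIGH = 2


class Mixed(Enum):
    A = 1
    B = "b"
```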
Jack Ye	e26b22bcca	refactor!: consolidate namespace related naming and enterprise integration (#3205 ) 1. Refactored every client (Rust core, Python, Node/TypeScript) so “namespace” usage is explicit: code now keeps namespace paths (namespace_path) separate from namespace clients (namespace_client). Connections propagate the client, table creation routes through it, and managed versioning defaults are resolved from namespace metadata. Python gained LanceNamespaceDBConnection/async counterparts, and the namespace-focused tests were rewritten to match the clarified API surface. 2. Synchronized the workspace with Lance 5.0.0-beta.3 (see https://github.com/lance-format/lance/pull/6186 for the upstream namespace refactor), updating Cargo/uv lockfiles and ensuring all bindings align with the new namespace semantics. 3. Added a namespace-backed code path to lancedb.connect() via new keyword arguments (namespace_client_impl, namespace_client_properties, plus the existing pushdown-ops flag). When those kwargs are supplied, connect() delegates to connect_namespace, so users can opt into namespace clients without changing APIs. (The async helper will gain parity in a later change)	2026-04-03 00:09:03 -07:00
Dan Tasse	97754f5123	fix: change _client reference to _conn (#3188 ) This code previously referenced `self._client`, which does not exist. This change makes it correctly call `self._conn.close()`	2026-03-31 13:29:17 -07:00
Pratik Dey	7b1c063848	feat(python): add type-safe expression builder API (#3150 ) Introduces col(), lit(), func(), and Expr class as alternatives to raw SQL strings in .where() and .select(). Expressions are backed by DataFusion's Expr AST and serialized to SQL for remote table compat. Resolves: - https://github.com/lancedb/lancedb/issues/3044 (python api's) - https://github.com/lancedb/lancedb/issues/3043 (support for filter) - https://github.com/lancedb/lancedb/issues/3045 (support for projection) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-31 11:32:49 -07:00
Will Jones	e3d53dd185	fix(python): skip test_url_retrieve_downloads_image when PIL not installed (#3208 ) The test added in #3190 unconditionally imports `PIL`, which is an optional dependency. This causes CI failures in environments where Pillow isn't installed (`ModuleNotFoundError: No module named 'PIL'`). Use `pytest.importorskip` to skip gracefully when Pillow is unavailable. Fixes CI failure on main. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-30 14:48:49 -07:00
Will Jones	66804e99fc	fix(python): use correct exception types in namespace tests (#3206 ) ## Summary - Namespace tests expected `RuntimeError` for table-not-found and namespace-not-empty cases, but `lance_namespace` raises `TableNotFoundError` and `NamespaceNotEmptyError` which inherit from `Exception`, not `RuntimeError`. - Updated `pytest.raises` to use the correct exception types. ## Test plan - [x] CI passes on `test_namespace.py` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 12:55:54 -07:00
lennylxx	9f85d4c639	fix(embeddings): add missing urllib.request import in url_retrieve (#3190 ) url_retrieve() calls urllib.request.urlopen() but only urllib.error was imported, causing AttributeError for any HTTP URL input. This affects open-clip, siglip, and jinaai embedding functions when processing image URLs. The bug has existed since the embeddings API refactor (#580) but was masked because most users pass local file paths or bytes rather than HTTP URLs.	2026-03-30 12:03:44 -07:00
lif	4c44587af0	fix: table.add(mode='overwrite') infers vector column types (#3184 ) Fixes #3183 ## Summary When `table.add(mode='overwrite')` is called, PyArrow infers input data types (e.g. `list<double>`) which differ from the original table schema (e.g. `fixed_size_list<float32>`). Previously, overwrite mode bypassed `cast_to_table_schema()` entirely, so the inferred types replaced the original schema, breaking vector search. This fix builds a merged target schema for overwrite: columns present in the existing table schema keep their original types, while columns unique to the input pass through as-is. This way `cast_to_table_schema()` is applied unconditionally, preserving vector column types without blocking schema evolution. ## Changes - `rust/lancedb/src/table/add_data.rs`: For overwrite mode, construct a target schema by matching input columns against the existing table schema, then cast. Non-overwrite (append) path is unchanged. - Added `test_add_overwrite_preserves_vector_type` test that creates a table with `fixed_size_list<float32>`, overwrites with `list<double>` input, and asserts the original type is preserved. ## Test Plan - `cargo test --features remote -p lancedb -- test_add_overwrite` — all 4 overwrite tests pass - Full suite: 454 passed, 2 failed (pre-existing `remote::retry` flakes unrelated to this change) --------- Signed-off-by: majiayu000 <1835304752@qq.com>	2026-03-30 10:57:33 -07:00
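The merged target schema described in #3184 can be illustrated with plain dictionaries mapping column names to type names. The actual change lives in Rust (`rust/lancedb/src/table/add_data.rs`) and works on Arrow schemas; `merged_target_schema` below is a hypothetical Python sketch of the same matching rule.

```python
def merged_target_schema(table_schema, input_schema):
    """Keep existing types for known columns; pass new columns through."""
    return {
        name: table_schema.get(name, input_type)
        for name, input_type in input_schema.items()
    }


table_schema = {"id": "int64", "vector": "fixed_size_list<float32>"}
# PyArrow-style inference on the overwrite input, plus one new column.
input_schema = {"id": "int64", "vector": "list<double>", "note": "string"}

target = merged_target_schema(table_schema, input_schema)
```

The `vector` column keeps its original fixed-size type, while `note`, which exists only in the input, passes through unchanged so schema evolution is not blocked.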
lennylxx	1d1cafb59c	fix(python): don't assign dict.update() return value in _sanitize_data (#3198 ) dict.update() mutates in place and returns None. Assigning its result caused with_metadata(None) to strip all schema metadata when embedding metadata was merged during create_table with embedding_functions.	2026-03-30 10:15:45 -07:00
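The pitfall fixed in #3198 is standard Python behaviour and can be shown in isolation. The metadata keys and values below are made up for illustration.

```python
existing = {b"owner": b"alice"}
embedding = {b"embedding_functions": b"[...]"}

# Bug pattern: update() mutates in place and returns None.
broken = dict(existing).update(embedding)

# Fix: update first, then use the dictionary itself.
merged = dict(existing)
merged.update(embedding)
```

Passing `broken` (which is `None`) on as schema metadata is what stripped the metadata, whereas `merged` carries both entries.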
Dan Tasse	cca6a7c989	fix: raise instead of return ValueError (#3189 ) These couple of cases used to return ValueError; should raise it instead.	2026-03-25 18:49:29 -07:00
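The difference fixed in #3189 matters because a returned exception is just a value that the caller can silently ignore. A minimal sketch with hypothetical function names:

```python
def check_returning(k):
    if k <= 0:
        # Bug pattern: the exception object is handed back, not raised.
        return ValueError("k must be positive")
    return k


def check_raising(k):
    if k <= 0:
        raise ValueError("k must be positive")
    return k


silent = check_returning(0)  # no error surfaces; caller gets an exception object

try:
    check_raising(0)
    raised = False
except ValueError:
    raised = True
```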
Will Jones	1d6e00b902	feat: progress bar for `add()` (#3067 ) ## Summary Adds progress reporting for `table.add()` so users can track large write operations. The progress callback is available in Rust, Python (sync and async), and through the PyO3 bindings. ### Usage Pass `progress=True` to get an automatic tqdm bar: ```python table.add(data, progress=True) # 100%|██████████| 1000000/1000000 [00:12<00:00, 82345 rows/s, 45.2 MB/s | 4/4 workers] ``` Or pass a tqdm bar for more control: ```python from tqdm import tqdm with tqdm(unit=" rows") as pbar: table.add(data, progress=pbar) ``` Or use a callback for custom progress handling: ```python def on_progress(p): print(f"{p['output_rows']}/{p['total_rows']} rows, " f"{p['active_tasks']}/{p['total_tasks']} workers, " f"done={p['done']}") table.add(data, progress=on_progress) ``` In Rust: ```rust table.add(data) .progress(|p| println!("{}/{:?} rows", p.output_rows(), p.total_rows())) .execute() .await?; ``` ### Details - `WriteProgress` struct in Rust with getters for `elapsed`, `output_rows`, `output_bytes`, `total_rows`, `active_tasks`, `total_tasks`, and `done`. Fields are private behind getters so new fields can be added without breaking changes. - `WriteProgressTracker` tracks progress across parallel write tasks using a mutex for row/byte counts and atomics for active task counts. - Active task tracking uses an RAII guard pattern (`ActiveTaskGuard`) that increments on creation and decrements on drop. - For remote writes, `output_bytes` reflects IPC wire bytes rather than in-memory Arrow size. For local writes it uses in-memory Arrow size as a proxy (see TODO below). - tqdm postfix displays throughput (MB/s) and worker utilization (active/total). - The `done` callback always fires, even on error (via `FinishOnDrop`), so progress bars are always finalized. ### TODO - Track actual bytes written to disk for local tables. This requires Lance to expose a progress callback from its write path. See lance-format/lance#6247. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 16:14:13 -07:00
Prashanth Rao	ed7e01a58b	docs: fix rendering issues with missing index types in API docs (#3143 ) ## Problem The generated Python API docs for `lancedb.table.IndexStatistics.index_type` were misleading because mkdocstrings renders that field’s type annotation directly, and the existing `Literal[...]` listed only a subset of the actual canonical SDK index type strings. Current (missing index types): <img width="823" height="83" alt="image" src="https://github.com/user-attachments/assets/f6f29fe3-4c16-4d00-a4e9-28a7cd6e19ec" /> ## Fix - Update the `IndexStatistics.index_type` annotation in `python/python/lancedb/table.py` to include the full supported set of canonical values, so the generated docs show all valid index_type strings inline. - Add a small regression test in `python/python/tests/test_index.py` to ensure the docs-facing annotation does not drift silently again in case we add a new index/quantization type in the future. - Bumps mkdocs and material theme versions to mkdocs 1.6 to allow access to more features like hooks After fix (all index types are included and tested for in the annotations): <img width="1017" height="93" alt="image" src="https://github.com/user-attachments/assets/66c74d5c-34b3-4b44-8173-3ee23e3648ac" />	2026-03-20 09:34:42 -07:00
marca116	3a200d77ef	fix: pre-filtering on hybrid search (#3096 ) When using hybrid search with a where filter, the prefilter argument is silently inverted. Passing prefilter=True actually performs post-filtering, and prefilter=False actually performs pre-filtering.	2026-03-16 21:48:42 -07:00
Weston Pace	25eb1fbfa4	fix: restore storage options on copy in localstack tests (#3148 )	2026-03-16 14:02:19 -07:00
Weston Pace	216c1b5f77	docs: remove experimental label from optimize and warn about delete_unverified (#3128 ) ## Summary - Removes the "Experimental API" section from `optimize` method documentation across Rust, Python, and TypeScript - Adds a warning to `delete_unverified` documentation in all bindings: this should only be set to true if you can guarantee no other process is working on the dataset, otherwise it could be corrupted - Fixes a typo ("shoudl" → "should") Closes #3125 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:37:42 +08:00
Esteban Gutierrez	f951da2b00	feat: support prewarm_index and prewarm_data on remote tables (#3110 ) ## Summary - Implement `RemoteTable.prewarm_data(columns)` calling `POST /v1/table/{id}/page_cache/prewarm/` - Implement `RemoteTable.prewarm_index(name)` calling `POST /v1/table/{id}/index/{name}/prewarm/` (previously returned `NotSupported`) - Add `BaseTable::prewarm_data(columns)` trait method and `Table` public API in Rust core - Add PyO3 bindings and Python API (`AsyncTable`, `LanceTable`, `RemoteTable`) for `prewarm_data` - Add type stubs for `prewarm_index` and `prewarm_data` in `_lancedb.pyi` - Upgrade Lance to 3.0.0-rc.3 with breaking change fixes Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-10 15:39:39 -05:00


422 Commits