lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-07-05 12:00:39 +00:00

Author	SHA1	Message	Date
Heng Ge	0d30b31998	feat: support setting LSM write spec for a table (#3396 ) ## Summary Split out from #3354 Adds `LsmWriteSpec` and `Table::set_lsm_write_spec` / `unset_lsm_write_spec` to install and clear the spec that selects Lance's MemWAL LSM-style write path for `merge_insert`. `LsmWriteSpec` offers three sharding strategies, all built on Lance's `InitializeMemWalBuilder`: - `LsmWriteSpec::bucket(column, num_buckets)` — hash-bucket sharding by the single-column unenforced primary key. - `LsmWriteSpec::identity(column)` — identity sharding by the raw value of a scalar column. - `LsmWriteSpec::unsharded()` — a single MemWAL shard. Each can be refined with `with_maintained_indexes(...)` (indexes the MemWAL keeps up to date as rows are appended) and `with_writer_config_defaults(...)` (default `ShardWriter` configuration recorded in the MemWAL index, so every writer starts from the same defaults). All variants require the table to have an unenforced primary key. - `set_lsm_write_spec` installs the spec by initializing the MemWAL index; `unset_lsm_write_spec` removes it (dropping the MemWAL index), reverting to the standard `merge_insert` path. `unset` is idempotent. - Bindings: Python (`LsmWriteSpec.bucket` / `.identity` / `.unsharded`, `set_lsm_write_spec` / `unset_lsm_write_spec`) and TypeScript (`setLsmWriteSpec` with `specType` `"bucket"` / `"identity"` / `"unsharded"`). `RemoteTable` returns `NotSupported`. The actual `merge_insert` LSM dispatch and `ShardWriter` write path are a follow-up — this PR only installs and clears the spec.	2026-05-18 00:11:33 -07:00
Heng Ge	6a431ff0a0	feat: support setting unenforced primary key (#3394 ) ## Summary Adds `Table::set_unenforced_primary_key` — records a single column as the table's unenforced primary key in Lance schema field metadata. "Unenforced" means LanceDB does not check uniqueness on write; the key is metadata that `merge_insert` consumes. - Single-column only; the column must exist and have a supported dtype (Int32, Int64, Utf8, LargeUtf8, Binary, LargeBinary, FixedSizeBinary). The API accepts an iterable for binding ergonomics but requires exactly one column — compound keys are rejected. - The primary key is immutable: calling this on a table that already has an unenforced primary key is rejected. Concurrent writers racing to set the key fail at commit time rather than silently overriding it. - `RemoteTable` returns `NotSupported`. - Bindings: Python (`AsyncTable`, `LanceTable`, `RemoteTable`) and TypeScript (`Table.setUnenforcedPrimaryKey`). ## Context Split out from #3354 per review feedback, so the unenforced primary key and the `merge_insert` sharding spec land as separate reviewable PRs. No Lance dependency bump — `main` is already on v7.0.0-beta.10, which includes the field-metadata round-trip fix the API relies on. Enforcing primary-key immutability at the Lance commit layer (so the cross-column concurrent race is also rejected) is a companion Lance change: lance-format/lance#6810.	2026-05-16 23:12:55 -07:00
Xin Sun	ab2c5adf5e	feat(nodejs): add order_by method to Query (#3123 )	2026-05-16 22:49:08 -07:00
Shengan Zhang	64aeee84a8	feat(python): support `bytes` in `lit()` expressions (#3387 ) Closes #3261. ## Summary Adds `bytes` to the accepted types of `lancedb.expr.lit()` so that binary scalars can be used in filter / projection expressions. The previous attempt in #3235 had to be reverted because DataFusion's SQL unparser does not support `Binary` / `LargeBinary` scalars, so any expression containing such a literal would fail in both `to_sql()` and `__repr__`. ## How `expr_to_sql_string` now has two paths: - Fast path (no binary literals): delegate to DataFusion's unparser unchanged. - Slow path: rewrite each `Binary(Some(bytes))` literal in the tree to a unique string-literal placeholder, run the unparser, then substitute `'<placeholder>'` with `X'<HEX>'` in the resulting SQL. `Binary(None)` / `LargeBinary(None)` are rewritten to `ScalarValue::Null` so the unparser emits plain `NULL`. This keeps DataFusion as the single source of truth for operator and function serialization, so binary literals work in every expression node type the unparser already supports — including nested cases like `contains(col("data"), lit(b"\xff"))`, `NOT (col == lit(b"..."))`, and `col.cast(...) == lit(b"...")`. ## Changes - `rust/lancedb/src/expr/sql.rs`: placeholder-substitution implementation. - `rust/lancedb/src/expr.rs`: 4 new unit tests covering binary literals in equality, compound predicates, scalar function calls, negation, and `NULL` binary literals. - `python/src/expr.rs`: `expr_lit` accepts `PyBytes` and produces `ScalarValue::Binary`. - `python/Cargo.toml` + `Cargo.lock`: pull in `datafusion-common` for `ScalarValue`. - `python/python/lancedb/expr.py`: extend `ExprLike` and `lit()` type annotations / docstrings with `bytes`. - `python/python/lancedb/_lancedb.pyi`: update `expr_lit` stub. - `python/tests/test_expr.py`: unit tests for `to_sql` / `repr` of binary literals and an integration test against a real `pa.binary()` column for equality / inequality / compound filters. ## Example ```python from lancedb.expr import col, lit, func # Equality against a binary column col("payload") == lit(b"\xca\xfe") # Expr((payload = X'CAFE')) # Nested inside a function call (previously failed) func("contains", col("data"), lit(b"\xff")) # Expr(contains(data, X'FF')) # repr() no longer crashes repr(lit(b"\xde\xad\xbe\xef")) # "Expr(X'DEADBEEF')" ``` ## Verification - [x] `cargo test -p lancedb --lib expr::` — 12/12 pass (was 9; +3 new tests) - [x] `cargo check --features remote --tests --examples` — clean - [x] `cargo clippy --features remote --tests --examples` — no warnings - [x] `cargo fmt --all -- --check` — clean - [x] `pytest python/tests/test_expr.py` — 76/76 pass (was 74; +2 new tests) - [x] `ruff check python` / `ruff format --check python` — clean ## Follow-ups (not in this PR) Issue #3261 also raises the possibility of a truncated `__repr__` for very large binary literals. This PR keeps `__repr__` exact (it forwards to `to_sql()`), since truncating display output would diverge from the SQL that actually gets executed. A display-only truncation could be added in a follow-up by giving `__repr__` its own renderer. Made with [Cursor](https://cursor.com) Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-14 15:24:52 -07:00
Justin Miller	5b45e44ce3	fix(rust): map lance-namespace errors to TableNotFound / TableAlreadyExists (#3385 ) ## Summary `LanceNamespaceDatabase::open_table` and `create_table` were squashing `NamespaceError::TableNotFound` and `TableAlreadyExists` into generic `Error::Runtime`, so callers couldn't distinguish a missing-table or duplicate-table error from any other internal failure. Downstream this surfaced to geneva-style code as HTTP 500 / "internal server error" on operations that should have been 400/404 — see [ENT-1235](https://linear.app/lancedb/issue/ENT-1235/fix-ns-errors-for-create-tableopen-table). This PR walks the boxed-error chain from `lance::Error::Namespace` down to the inner `NamespaceError` and maps its `ErrorCode` onto the proper `lancedb::Error` variant: - `NamespaceError::TableNotFound` → `Error::TableNotFound { name, source }` - `NamespaceError::TableAlreadyExists` → `Error::TableAlreadyExists { name }` - everything else → `Error::Runtime` (unchanged behavior for the long tail) It also replaces the existing `e.to_string().contains("already exists")` string match in `LanceNamespaceDatabase::create_table` with a downcast on the `NamespaceError` code. That string-match happened to work for the `dir` backend but isn't guaranteed to match the REST namespace backend's error format; the downcast works for both. The chain-walk is needed because `DatasetBuilder::from_namespace` re-wraps the inner namespace error in a fresh `lance::Error::Namespace`, so a single top-level downcast misses it. ## How this helps geneva Geneva's workaround (linked in the parent issue) currently has to use `except Exception:` with a `# todo: this is too broad` comment, plus `str(e).lower().contains("already exists")` string matching, because the namespace-impl path raised a generic `RuntimeError`. After this PR: - `db.open_table("missing")` raises `ValueError("Table 'missing' was not found")` (via the existing Python binding mapping of `TableNotFound` → `PyValueError`) — geneva can catch `ValueError` cleanly. - `db.create_table("dup")` raises `ValueError("Table 'dup' already exists")` reliably across both `dir` and REST backends, so the existing string match becomes deterministic. In phalanx (the sophon REST server), `LanceDBError::TableNotFound` and `LanceDBError::TableAlreadyExists` already map directly to HTTP 404 and HTTP 400 respectively — see [phalanx/src/error.rs:77-94](https://github.com/lancedb/sophon/blob/main/src/rust/phalanx/src/error.rs#L77). No phalanx code change is needed for the bug fix; the previous 500 came from phalanx's string-match fallback not finding `"namespace"` AND `"not found"` in the `Runtime` error's debug-formatted message. ## Follow-up [ENT-1246](https://linear.app/lancedb/issue/ENT-1246/remove-dead-namespace-error-string-matching-in-phalanx) — after this lands and phalanx picks up the new lancedb, the string-matching fallback for table errors in `src/rust/phalanx/src/error.rs` (lines 99-168, 236-256, 502-514) and `src/rust/phalanx/src/rest/table/create_table.rs` (lines 224-241) becomes dead code and can be removed. The `// TODO: Refactor for better namespace error handling` comment at phalanx/src/error.rs:96-98 is exactly what this PR addresses on the lancedb side; ENT-1246 finishes the cleanup on the sophon side. ## Test plan - [x] `cargo test --quiet --features remote -p lancedb --lib` — all 495 lib tests pass, including 4 new tests in `database::namespace::tests`: - `test_namespace_table_not_found` — extended to assert `Error::TableNotFound` (was just `is_err()`) - `test_namespace_open_table_not_found_at_root` — covers the root-namespace path - `test_namespace_create_table_already_exists` — covers child namespace - `test_namespace_create_table_already_exists_at_root` — covers root namespace - [x] `cargo clippy --quiet --features remote --tests` — clean - [x] `cargo fmt --all` — clean - [x] Manually confirmed (via test failures before the fix) that the two `open_table` tests were returning `Error::Runtime { message: "Failed to get table info from namespace: Namespace { source: TableNotFound { ... } }" }` prior to this change. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-14 15:19:23 -07:00
Shengan Zhang	650f173236	feat(python): add IVF_HNSW_FLAT vector index support (#3366 ) ## Summary Wire up `IVF_HNSW_FLAT` in the Rust core and Python SDK. The index was documented at https://docs.lancedb.com/indexing/vector-index but `lancedb.Table.create_index(index_type="IVF_HNSW_FLAT")` raised `ValueError: Unknown index type IVF_HNSW_FLAT` — the underlying `pylance` already accepted it, only the LanceDB wrapper was missing the wiring. Rust core (`rust/lancedb`): - Add `Index::IvfHnswFlat` / `IndexType::IvfHnswFlat` variants and the `IvfHnswFlatIndexBuilder` (modelled on `IvfHnswSqIndexBuilder`). - Build Lance params via the existing `VectorIndexParams::ivf_hnsw(...)` helper, keeping symmetry with the other `IVF_HNSW_` variants. - Forward the variant in `RemoteTable::create_index` and add two parametrised tests (default + customised config) for the JSON serialisation. - New `NativeTable` integration test (`test_create_index_ivf_hnsw_flat`). Python binding (`python/`):* - New `HnswFlat` dataclass + backwards-compat `IvfHnswFlat` alias. - PyO3 `extract_index_params` recognises the `HnswFlat` config. - `LanceTable.create_index(index_type="IVF_HNSW_FLAT", …)` and the sync `RemoteTable.create_index` both dispatch to the new config. - `IndexStatistics.index_type` `Literal` and `_lancedb.pyi` stubs cover the new type so `pyright`/`make check` stays clean. - Async integration tests (`HnswFlat` + `IvfHnswFlat` alias) and a sync dispatcher test, mirroring the existing `IVF_HNSW_SQ` coverage. - Existing `test_index_statistics_index_type_lists_all_supported_values` updated to include `IVF_HNSW_FLAT`. A matching Node.js / TypeScript binding is in a follow-up PR. Closes #3331 ## Test plan - [ ] \`cargo check --quiet --features remote --tests --examples\` - [ ] \`cargo test --quiet --features remote -p lancedb\` (covers the new \`test_create_index_ivf_hnsw_flat\` and the two new parametrised \`RemoteTable::create_index\` cases) - [ ] \`cargo fmt --all\` / \`cargo clippy --quiet --features remote --tests --examples\` - [ ] \`cd python && make develop && make check && make test\` (covers the two new async tests, the alias test, the dispatcher test, and the updated \`test_index_statistics_index_type_lists_all_supported_values\` assertion)	2026-05-11 15:08:32 -07:00
Xuanwo	9b21c136c6	feat(python): support model-backed native FTS tokenizers (#3289 ) This wires Lance's existing `jieba/` and `lindera/` native FTS tokenizers through the Python SDK instead of leaving them behind disabled features and narrow public typing. It also documents the `LANCE_LANGUAGE_MODEL_HOME` model layout and adds Python coverage for successful CJK indexing plus missing-model error guidance. Closes #2168.	2026-05-08 23:53:14 +08:00
Heng Ge	694aa48e19	fix(database): drop spurious trailing `?` from listing-database URIs (#3357 ) ## Summary `url::Url::query_pairs_mut()` leaves the URL with `query=Some("")` after `.clear()` even when the input had no query string. The listing-database connect path then captured that empty query into `ListingDatabase::query_string`, and `table_uri()` blindly appended `?<query>` to every per-table URI — producing URIs like `s3://bucket/prefix/foo.lance?`. The trailing `?` is benign for normal table operations, but it breaks any caller that constructs a sub-path from the table URI. In particular, MemWAL flushes write to `<table_uri>/_mem_wal/<shard>/<rand>_gen_<n>`, which `url::Url::parse` then re-parses as `path=<base table>` + `query=/_mem_wal/...`. `Dataset::write` resolves the base table dataset, finds it already exists, and fails with `Dataset already exists: …_gen_1` on the very first MemTable flush (observed deterministically against S3 across all merge_insert LSM modes; tracked in [lance-format/lance#6713](https://github.com/lance-format/lance/pull/6715)). ## Fix Treat `Some("")` query the same as no query when capturing `query_string`. A real `?foo=bar` query is still propagated unchanged. Adds a regression test covering both the empty-query and non-empty-query paths. ## Verification - `url::Url::parse("s3://bucket/prefix/").query()` → `None`, but after `query_pairs_mut().clear()` → `Some("")`. Confirmed in a standalone repro. - Without this fix, every `table_uri()` for an `s3://`-style connection ends with `?`, breaking MemWAL and any future sub-path consumer in the same way. - New unit test `test_table_uri_url_path_has_no_trailing_question_mark` exercises both code paths.	2026-05-07 23:29:29 -07:00
LanceDB Robot	47a34f5cca	chore: update lance dependency to v7.0.0-beta.4 (#3348 ) ## Summary - Update Lance Rust dependencies to `v7.0.0-beta.4` using `ci/set_lance_version.py`. - Update the Java `lance-core` dependency property to `7.0.0-beta.4`. - Align LanceDB with dependency updates required by Lance 7, including `object_store` 0.13 API compatibility. Triggering tag: https://github.com/lance-format/lance/releases/tag/v7.0.0-beta.4 ## Verification - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all`	2026-05-05 18:36:39 -07:00
Lance Release	c091243d5b	Bump version: 0.28.0-beta.10 → 0.28.0-beta.11	2026-04-29 17:53:49 +00:00
LanceDB Robot	4a5341edb1	chore: update lance dependency to v6.0.0-beta.7 (#3334 ) ## Summary - Update Lance Rust dependencies to `6.0.0-beta.7` using `ci/set_lance_version.py`. - Update Java `lance-core.version` to `6.0.0-beta.7`. - Align Arrow/DataFusion/PyO3 dependency versions and apply required compatibility fixes for the Lance upgrade. Triggering tag: [v6.0.0-beta.7](https://github.com/lance-format/lance/releases/tag/v6.0.0-beta.7) ## Verification - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all`	2026-04-29 10:52:25 -07:00
Jack Ye	25dfe2cfd4	feat: add manifest-enabled directory namespace mode (#3332 ) Adds manifest_enabled for local/native connections so directory namespace manifests can be the source of truth, including migration from directory listing and Azure credential vending feature wiring. Also exposes the option through Rust, Python, and Node bindings with focused validation.	2026-04-29 09:22:06 -07:00
Lance Release	4dcd7f4314	Bump version: 0.28.0-beta.9 → 0.28.0-beta.10	2026-04-28 13:29:26 +00:00
Jack Ye	a92ae0ded5	fix: enable hostname verification by default (#3304 ) ## Summary - make `TlsConfig::default()` enable hostname verification by default - align the Rust default with the documented Python and Node behavior - update the Rust unit test to lock in the safe default	2026-04-21 08:39:03 -07:00
Lance Release	75b0a8e0a3	Bump version: 0.28.0-beta.8 → 0.28.0-beta.9	2026-04-19 20:39:29 +00:00
Jack Ye	2a1df8edcf	fix(rust): materialize declared namespace tables on create (#3288 ) ## Summary - handle `declare_table` already-exists conflicts in the Rust namespace database create path - reuse declared-but-not-materialized table metadata instead of failing create mode - preserve overwrite behavior while allowing declared Geneva system tables to be materialized	2026-04-19 13:25:53 -07:00
Lance Release	be48ada352	Bump version: 0.28.0-beta.7 → 0.28.0-beta.8	2026-04-19 04:19:10 +00:00
Jack Ye	f909df3e87	fix(python): use namespace-backed rust connection for namespace tables (#3286 ) So far, I have been using a hacky approach that creates and opens namespace-backed table, by getting its location and use a temporary lancedb connection to create or open it. This was working for features like credentials vending but is no longer fully working for the managed versioning feature, recently geneva tests have been failing here and there and various patches are not addressing the root cause. This PR fully fixes this and implements proper rust binding for it. Specifically: - build a real Rust namespace-backed connection from the Python namespace client - route namespace table create/open through that connection instead of resolved-location temp connections - keep namespace client naming consistent in the Rust bridge and preserve federated namespace + DuckDB behavior	2026-04-18 21:17:52 -07:00
Lance Release	d715bbb588	Bump version: 0.28.0-beta.6 → 0.28.0-beta.7	2026-04-17 08:12:27 +00:00
Lance Release	11af763fcd	Bump version: 0.28.0-beta.5 → 0.28.0-beta.6	2026-04-16 18:57:28 +00:00
Xuanwo	b7c0b5987c	chore: upgrade lance to 6.0.0-beta.1 (#3281 )	2026-04-17 02:51:58 +08:00
Jack Ye	97a4b38f19	feat(rust): support nested namespace ops in listing db (#3279 ) ## Summary - delegate child-namespace `ListingDatabase` operations through an eagerly initialized `LanceNamespaceDatabase` - support nested namespace create/open/list/drop flows without requiring callers to inject explicit locations - add `namespace_client_properties` plumbing for local and namespace connections so directory namespace settings like `table_version_tracking_enabled` can be configured - add regression tests for nested namespace ops and namespace client property propagation	2026-04-16 10:12:28 -07:00
Gezi-lzq	10879d99b8	docs: fix broken documentation links (#3278 )	2026-04-15 20:56:59 +08:00
Lance Release	4e6a1d5dce	Bump version: 0.28.0-beta.4 → 0.28.0-beta.5	2026-04-12 23:51:14 +00:00
Lance Release	c6ae0de3ee	Bump version: 0.28.0-beta.3 → 0.28.0-beta.4	2026-04-12 03:57:58 +00:00
Lance Release	359710a0bf	Bump version: 0.28.0-beta.2 → 0.28.0-beta.3	2026-04-11 22:44:52 +00:00
Lance Release	df354abae4	Bump version: 0.28.0-beta.1 → 0.28.0-beta.2	2026-04-11 07:06:00 +00:00
Will Jones	2807ad6854	chore: bump Rust toolchain from 1.91.0 to 1.94.0 (#3257 ) Bumps the Rust toolchain to 1.94.0 (latest installed) to unblock CI failures caused by the AWS SDK's MSRV requirement. No lint fixes were needed. --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-10 07:57:47 -07:00
Jack Ye	a898dc81c2	feat: add user_id field to ClientConfig for user identification (#3240 ) ## Summary - Add a `user_id` field to `ClientConfig` that allows users to identify themselves to LanceDB Cloud/Enterprise - The user_id is sent as the `x-lancedb-user-id` HTTP header in all requests - Supports three configuration methods: - Direct assignment via `ClientConfig.user_id` - Environment variable `LANCEDB_USER_ID` - Indirect env var lookup via `LANCEDB_USER_ID_ENV_KEY` Closes #3230 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-04-06 11:20:10 -07:00
Lance Release	de3f8097e7	Bump version: 0.28.0-beta.0 → 0.28.0-beta.1	2026-04-05 02:51:18 +00:00
LanceDB Robot	d082c2d2ac	chore: update lance dependency to v5.0.0-beta.5 (#3237 ) ## Summary - update Rust Lance workspace dependencies to `v5.0.0-beta.5` using `ci/set_lance_version.py` - update Java `lance-core` dependency property to `5.0.0-beta.5` - refresh Cargo lockfile to the new Lance tag ## Verification - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all` ## Upstream Tag - https://github.com/lance-format/lance/releases/tag/v5.0.0-beta.5 --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>	2026-04-04 19:49:51 -07:00
Lance Release	aa2c7b3591	Bump version: 0.27.2 → 0.28.0-beta.0	2026-04-03 08:45:56 +00:00
Jack Ye	e26b22bcca	refactor!: consolidate namespace related naming and enterprise integration (#3205 ) 1. Refactored every client (Rust core, Python, Node/TypeScript) so “namespace” usage is explicit: code now keeps namespace paths (namespace_path) separate from namespace clients (namespace_client). Connections propagate the client, table creation routes through it, and managed versioning defaults are resolved from namespace metadata. Python gained LanceNamespaceDBConnection/async counterparts, and the namespace-focused tests were rewritten to match the clarified API surface. 2. Synchronized the workspace with Lance 5.0.0-beta.3 (see https://github.com/lance-format/lance/pull/6186 for the upstream namespace refactor), updating Cargo/uv lockfiles and ensuring all bindings align with the new namespace semantics. 3. Added a namespace-backed code path to lancedb.connect() via new keyword arguments (namespace_client_impl, namespace_client_properties, plus the existing pushdown-ops flag). When those kwargs are supplied, connect() delegates to connect_namespace, so users can opt into namespace clients without changing APIs. (The async helper will gain parity in a later change)	2026-04-03 00:09:03 -07:00
Lance Release	3ba46135a5	Bump version: 0.27.2-beta.2 → 0.27.2	2026-03-31 21:26:04 +00:00
Lance Release	f903d07887	Bump version: 0.27.2-beta.1 → 0.27.2-beta.2	2026-03-31 21:25:36 +00:00
Pratik Dey	7b1c063848	feat(python): add type-safe expression builder API (#3150 ) Introduces col(), lit(), func(), and Expr class as alternatives to raw SQL strings in .where() and .select(). Expressions are backed by DataFusion's Expr AST and serialized to SQL for remote table compat. Resolves: - https://github.com/lancedb/lancedb/issues/3044 (python api's) - https://github.com/lancedb/lancedb/issues/3043 (support for filter) - https://github.com/lancedb/lancedb/issues/3045 (support for projection) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-31 11:32:49 -07:00
yaommen	a0a2942ad5	fix: respect max_batch_length for Rust vector and hybrid queries (#3172 ) Fixes #1540 I could not reproduce this on current `main` from Python, but I could still reproduce it from the Rust SDK. Python no longer reproduces because the current Python vector/hybrid query paths re-chunk results into a `pyarrow.Table` before returning batches. Rust still reproduced because `max_batch_length` was passed into planning/scanning, but vector search could still emit larger `RecordBatch`es later in execution (for example after KNN / TopK), so it was not enforced on the final Rust output stream. This PR enforces `max_batch_length` on the final Rust query output stream and adds Rust regression coverage. Before the fix, the Rust repro produced: `num_batches=2, max_batch=8192, min_batch=1808, all_le_100=false` After the fix, the same repro produces batches `<= 100`. ## Runnable Rust repro Before this fix, current `main` could still return batches like `[8192, 1808]` here even with `max_batch_length = 100`: ```rust use std::sync::Arc; use arrow_array::{ types::Float32Type, FixedSizeListArray, RecordBatch, RecordBatchReader, StringArray, }; use arrow_schema::{DataType, Field, Schema}; use futures::TryStreamExt; use lancedb::query::{ExecutableQuery, QueryBase, QueryExecutionOptions}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let tmp = tempfile::tempdir()?; let uri = tmp.path().to_str().unwrap(); let rows = 10_000; let schema = Arc::new(Schema::new(vec![ Field::new("id", DataType::Utf8, false), Field::new( "vector", DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Float32, true)), 4), false, ), ])); let ids = StringArray::from_iter_values((0..rows).map(\|i\| format!("row-{i}"))); let vectors = FixedSizeListArray::from_iter_primitive::<Float32Type, _, _>( (0..rows).map(\|i\| Some(vec![Some(i as f32), Some(1.0), Some(2.0), Some(3.0)])), 4, ); let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids), Arc::new(vectors)])?; let reader: Box<dyn RecordBatchReader + Send> = Box::new( arrow_array::RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema), ); let db = lancedb::connect(uri).execute().await?; let table = db.create_table("test", reader).execute().await?; let mut opts = QueryExecutionOptions::default(); opts.max_batch_length = 100; let mut stream = table .query() .nearest_to(vec![0.0, 1.0, 2.0, 3.0])? .limit(rows) .execute_with_options(opts) .await?; let mut sizes = Vec::new(); while let Some(batch) = stream.try_next().await? { sizes.push(batch.num_rows()); } println!("{sizes:?}"); Ok(()) } ``` Signed-off-by: yaommen <myanstu@163.com>	2026-03-30 15:43:58 -07:00
lennylxx	74f457a0f2	fix(rust): handle Mutex lock poisoning gracefully across codebase (#3196 ) Replace ~30 production `lock().unwrap()` calls that would cascade-panic on a poisoned Mutex. Functions returning `Result` now propagate the poison as an error via `?` (leveraging the existing `From<PoisonError>` impl). Functions without a `Result` return recover via `unwrap_or_else(\|e\| e.into_inner())`, which is safe because the guarded data (counters, caches, RNG state) remains logically valid after a panic.	2026-03-30 09:25:18 -07:00
Lance Release	ad96489114	Bump version: 0.27.2-beta.0 → 0.27.2-beta.1	2026-03-25 16:22:09 +00:00
Lance Release	61de47f3a5	Bump version: 0.27.1 → 0.27.2-beta.0	2026-03-25 03:23:28 +00:00
Wyatt Alt	410ab9b6fe	Revert "feat: allow passing azure client/tenant ID through remote SDK" (#3185 ) Reverts lancedb/lancedb#3102	2026-03-24 20:17:40 -07:00
Will Jones	1d6e00b902	feat: progress bar for `add()` (#3067 ) ## Summary Adds progress reporting for `table.add()` so users can track large write operations. The progress callback is available in Rust, Python (sync and async), and through the PyO3 bindings. ### Usage Pass `progress=True` to get an automatic tqdm bar: ```python table.add(data, progress=True) # 100%\|██████████\| 1000000/1000000 [00:12<00:00, 82345 rows/s, 45.2 MB/s \| 4/4 workers] ``` Or pass a tqdm bar for more control: ```python from tqdm import tqdm with tqdm(unit=" rows") as pbar: table.add(data, progress=pbar) ``` Or use a callback for custom progress handling: ```python def on_progress(p): print(f"{p['output_rows']}/{p['total_rows']} rows, " f"{p['active_tasks']}/{p['total_tasks']} workers, " f"done={p['done']}") table.add(data, progress=on_progress) ``` In Rust: ```rust table.add(data) .progress(\|p\| println!("{}/{:?} rows", p.output_rows(), p.total_rows())) .execute() .await?; ``` ### Details - `WriteProgress` struct in Rust with getters for `elapsed`, `output_rows`, `output_bytes`, `total_rows`, `active_tasks`, `total_tasks`, and `done`. Fields are private behind getters so new fields can be added without breaking changes. - `WriteProgressTracker` tracks progress across parallel write tasks using a mutex for row/byte counts and atomics for active task counts. - Active task tracking uses an RAII guard pattern (`ActiveTaskGuard`) that increments on creation and decrements on drop. - For remote writes, `output_bytes` reflects IPC wire bytes rather than in-memory Arrow size. For local writes it uses in-memory Arrow size as a proxy (see TODO below). - tqdm postfix displays throughput (MB/s) and worker utilization (active/total). - The `done` callback always fires, even on error (via `FinishOnDrop`), so progress bars are always finalized. ### TODO - Track actual bytes written to disk for local tables. This requires Lance to expose a progress callback from its write path. See lance-format/lance#6247. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-23 16:14:13 -07:00
Esteban Gutierrez	a0228036ae	ci: fix unused PreprocessingOutput (#3180 ) Simple fix to for CI due unused import of PreprocessingOutput in table.rs Co-authored-by: Esteban Gutierrez <esteban@lancedb.com>	2026-03-23 13:45:44 -07:00
Will Jones	e6fd8d071e	feat(rust): parallel inserts for remote tables via multipart write (#3071 ) Similar to https://github.com/lancedb/lancedb/pull/3062, we can write in parallel to remote tables if the input data source is large enough. We take advantage of new endpoints coming in server version 0.4.0, which allow writing data in multiple requests, and the committing at the end in a single request. To make testing easier, I also introduce a `write_parallelism` parameter. In the future, we can expose that in Python and NodeJS so users can manually specify the parallelism they get. Closes #2861 --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 13:19:07 -07:00
Lance Release	3450ccaf7f	Bump version: 0.27.1-beta.0 → 0.27.1	2026-03-20 00:35:36 +00:00
Lance Release	9b229f1e7c	Bump version: 0.27.0 → 0.27.1-beta.0	2026-03-20 00:35:19 +00:00
Lance Release	bd09c53938	Bump version: 0.27.0-beta.6 → 0.27.0	2026-03-16 22:47:06 +00:00
Lance Release	0b18e33180	Bump version: 0.27.0-beta.5 → 0.27.0-beta.6	2026-03-16 22:46:48 +00:00
Mesut-Doner	c2e543f1b7	feat(rust): support Expr in projection query (#3069 ) Referred and followed [`Select::Dynamic`] implementation. Closes #3039	2026-03-13 12:54:26 -07:00
Weston Pace	216c1b5f77	docs: remove experimental label from optimize and warn about delete_unverified (#3128 ) ## Summary - Removes the "Experimental API" section from `optimize` method documentation across Rust, Python, and TypeScript - Adds a warning to `delete_unverified` documentation in all bindings: this should only be set to true if you can guarantee no other process is working on the dataset, otherwise it could be corrupted - Fixes a typo ("shoudl" → "should") Closes #3125 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 14:37:42 +08:00

1 2 3 4 5 ...

713 Commits