## Summary
Adds a Rust expression builder API as a type-safe alternative to SQL
strings for query filters.
## Motivation
Filtering with raw SQL strings can be awkward when using variables and
special types:
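For example (a sketch only: it assumes a DataFusion-style `col`/`lit` builder, which may not match the exact API this PR adds):

```rust
// Sketch only: assumes DataFusion's `col`/`lit` expression builders; the
// exact builder API this PR adds may differ.
use datafusion::prelude::{col, lit};

fn main() {
    let min_score = 0.75_f64;

    // Raw SQL string: the value is formatted into the text by hand, with
    // quoting and precision handled manually.
    let sql_filter = format!("score > {min_score} AND category = 'news'");

    // Expression builder: typed values, composed programmatically.
    let expr_filter = col("score")
        .gt(lit(min_score))
        .and(col("category").eq(lit("news")));

    println!("{sql_filter} vs {expr_filter}");
}
```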
Closes #3038
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
When the input data is sufficiently large, we automatically split it into
parallel writes using a round-robin exchange operator. We sample the
first batch to estimate the data width, and target a size of 1 million
rows or 2 GB per write, whichever is smaller.
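As a back-of-the-envelope sketch of that sizing rule (the function name, constants, and plumbing here are illustrative, not the shipped code):

```rust
// Illustrative only: the real pipeline wires this into a round-robin
// exchange operator; names and plumbing here are assumptions.
const TARGET_ROWS: u64 = 1_000_000;
const TARGET_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GB

/// Decide how many parallel writers to use, given a row-count hint and
/// the bytes-per-row estimated from sampling the first batch.
fn write_parallelism(num_rows: u64, bytes_per_row: u64, max_parallelism: u64) -> u64 {
    // Each write targets 1 million rows or 2 GB, whichever is smaller.
    let rows_per_write = TARGET_ROWS.min(TARGET_BYTES / bytes_per_row.max(1));
    num_rows.div_ceil(rows_per_write).clamp(1, max_parallelism)
}
```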
This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is that it allows streaming requests to remote
tables while still allowing retries for most inputs.
In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.
For Python, all inputs are retry-able except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.dataset.Dataset`, are retry-able *and* streaming.
A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:
* casting input data to the target schema (sketched below)
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.
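For the casting step, here is a minimal sketch using arrow's `cast` kernel; `cast_to_target` is a name invented for illustration, and the real pipeline runs this inside a DataFusion operator rather than a helper like this:

```rust
// Sketch of per-column casting to a target schema; assumes columns align
// by position. `cast_to_target` is illustrative, not the shipped function.
use arrow_array::RecordBatch;
use arrow_cast::cast;
use arrow_schema::{ArrowError, SchemaRef};

fn cast_to_target(batch: &RecordBatch, target: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = batch
        .columns()
        .iter()
        .zip(target.fields())
        .map(|(column, field)| cast(column.as_ref(), field.data_type()))
        .collect::<Result<Vec<_>, _>>()?;
    RecordBatch::try_new(target, columns)
}
```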
In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
`DatasetConsistencyWrapper::update()` only stored datasets with a
strictly newer version. This caused `migrate_manifest_paths_v2` to
silently drop its update, since the migration renames files without
bumping the dataset version. The subsequent `uses_v2_manifest_paths()`
call would then return the stale cached dataset.
Changed the version check from `>` to `>=` so same-version updates are
accepted.
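Reduced to its essence (a simplified stand-in for the real guard inside `update()`):

```rust
// Simplified stand-in for the check inside DatasetConsistencyWrapper::update().
fn should_replace(new_version: u64, cached_version: u64) -> bool {
    // Was `>`, which rejected same-version updates such as the
    // manifest-path migration; `>=` now accepts them.
    new_version >= cached_version
}
```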
## Test plan
- [x] Existing `test_create_table_v2_manifest_paths_async` Python test
should pass
- [x] Existing `should be able to migrate tables to the V2 manifest
paths` NodeJS test should pass
- [x] All dataset wrapper unit tests pass locally
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This updates `DatasetConsistencyWrapper` to block less:
1. `DatasetConsistencyWrapper::get()` just returns `Arc<Dataset>` now,
instead of a guard that blocks writes.
`DatasetConsistencyWrapper::get_mut()` is gone; write methods now use
`get()` and then call `update()` with the new version (see the sketch
after this list). This means a given table handle can do concurrent
reads **and** writes.
2. In weak consistency mode, the wrapper checks for dataset updates in
the background instead of blocking calls to `get()`.
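A minimal sketch of the locking shape this enables; `Dataset` and the method bodies are stand-ins, not the real implementation:

```rust
// Stand-in types to illustrate the design: the lock is held only long
// enough to clone or swap the Arc, never across a read or a write.
use std::sync::{Arc, RwLock};

struct Dataset; // placeholder for the real dataset type

struct DatasetConsistencyWrapper(RwLock<Arc<Dataset>>);

impl DatasetConsistencyWrapper {
    /// Cheap Arc clone; readers never block writers and vice versa.
    fn get(&self) -> Arc<Dataset> {
        self.0.read().unwrap().clone()
    }

    /// Write methods call this after completing their work, publishing
    /// the new dataset version for subsequent readers.
    fn update(&self, new: Arc<Dataset>) {
        *self.0.write().unwrap() = new;
    }
}
```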
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Completes the **merge_insert.rs** checklist item from #2949.
## Changes
- Moved `MergeResult` struct from `table.rs` to `table/merge.rs`
- Moved the `NativeTable::merge_insert` implementation into
`merge::execute_merge_insert()`, with the trait impl now delegating to
it (same pattern as `delete.rs`)
- Moved `test_merge_insert` and `test_merge_insert_use_index` tests into
`table/merge.rs`
- Updated the moved tests to use `memory://` URIs instead of temporary
directories
- Cleaned up unused imports from `table.rs` (`FutureExt`,
`TryFutureExt`, `Either`, `WhenMatched`, `WhenNotMatchedBySource`,
`LanceMergeInsertBuilder`)
- `MergeResult` is re-exported from `table.rs` so the public API is
unchanged
## Testing
`cargo build -p lancedb` compiles cleanly with no warnings.
References #2949. Moved query logic and helpers from `table.rs` to
`query.rs`. Refactored tests per the guidelines and added coverage for
the multi-vector plan structure.
When a table has a read consistency interval, queries within the
interval skip the version check. Once the interval expires, a list call
checks for new versions. If the version hasn't changed, the timer should
reset so the next interval begins, but it didn't. The timer stayed
expired, so every query after that triggered a list call, even though
nothing changed.
This affects all read operations (queries, schema lookups, searches) on
tables with read_consistency_interval set. Each operation adds a
list("_versions/") call to object storage, adding latency proportional
to the store's list performance. For high-QPS workloads, this can
saturate object store list throughput and significantly degrade query
latency.
Bug flow (the fix is sketched after this list):
1. Every read operation (query, schema, search) calls
`ensure_up_to_date()`.
2. `ensure_up_to_date()` calls `is_up_to_date()`, which compares
`last_consistency_check.elapsed()` against `read_consistency_interval`.
3. If the interval has expired, it calls `reload()`.
4. `reload()` calls `need_reload()`, which calls `latest_version_id()`
(this is the list IOP: `list("_versions/")`).
5. If there is no new version, `reload()` returns early without
resetting `last_consistency_check`.
6. On the next query, step 2 sees the stale timer again → step 3 → step
4 → another list IOP.
7. This repeats on every query forever.
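A sketch of the fix this implies; the field and method names are illustrative:

```rust
// Illustrative fix: reset the timer on *both* paths of reload(), so a
// no-op check still starts a fresh consistency interval.
use std::time::Instant;

struct TableState {
    last_consistency_check: Option<Instant>,
    current_version: u64,
}

impl TableState {
    fn reload(&mut self, latest_version: u64) {
        if latest_version == self.current_version {
            // Previously this early return skipped the reset below, so the
            // interval stayed expired and every query re-listed `_versions/`.
            self.last_consistency_check = Some(Instant::now());
            return;
        }
        // ... checkout the newer version ...
        self.current_version = latest_version;
        self.last_consistency_check = Some(Instant::now());
    }
}
```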
Caches the schema of remote tables and invalidates the cache:
1. After a 30-second TTL
2. When we perform an operation that changes the schema (e.g.
`add_columns`) or checks out a different version (e.g.
`checkout_version`)
3. When we get a 400, 404, or 500 response
If the schema is retrieved close to the TTL, we optimistically fetch the
schema in the background. This means a continuous stream of queries will
never have the schema fetch on the critical path.
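A minimal sketch of that caching policy, with placeholder types and a stubbed background refresh; this is drawn from the description above, not the shipped code:

```rust
// Placeholder types; the real cache wraps the remote table's schema and
// refreshes via the HTTP client rather than a stub method.
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

struct Schema; // stand-in for the real schema type

struct SchemaCache {
    entry: Mutex<Option<(Arc<Schema>, Instant)>>,
    ttl: Duration,            // 30 seconds per the description above
    refresh_margin: Duration, // how close to expiry we refresh optimistically
}

impl SchemaCache {
    fn get(&self) -> Option<Arc<Schema>> {
        let guard = self.entry.lock().unwrap();
        match guard.as_ref() {
            Some((schema, fetched_at)) if fetched_at.elapsed() < self.ttl => {
                if fetched_at.elapsed() > self.ttl - self.refresh_margin {
                    // Near expiry: refresh in the background so a steady
                    // stream of queries never waits on the fetch.
                    self.spawn_background_refresh();
                }
                Some(schema.clone())
            }
            _ => None, // expired or never fetched: caller fetches synchronously
        }
    }

    /// Called after schema-changing operations, version checkouts,
    /// or 400/404/500 responses.
    fn invalidate(&self) {
        *self.entry.lock().unwrap() = None;
    }

    fn spawn_background_refresh(&self) {
        // e.g. tokio::spawn(async { fetch the schema, then update `entry` })
    }
}
```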
Closes #3014
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
BREAKING CHANGE: Arbitrary `impl RecordBatchReader` is no longer
accepted; it must be converted into `Box<dyn RecordBatchReader>`.
This PR replaces `IntoArrow` with a new trait `Scannable` to define
input row data. This provides the following advantages (a sketch of the
trait's possible shape follows this list):
1. **We can implement `Scannable` for more types than `IntoArrow`, such
as `RecordBatch` and `Vec<RecordBatch>`.** The `IntoArrow` trait was
implemented for arbitrary `T: RecordBatchReader`, and the Rust compiler
would prevent us from implementing it for foreign types like
`RecordBatch` because (theoretically) those types might implement
`RecordBatchReader` in the future. That's why we implement `Scannable`
for `Box<dyn RecordBatchReader>` instead; since it's a concrete type it
doesn't block implementing for other foreign types.
2. **We can potentially replay `Scannable` values.** Previously, we had
to choose between buffering all data in memory and supporting retries of
writes. But because `Scannable` values can optionally support
re-scanning, we now have a way of supporting retries while also
streaming.
3. **`Scannable` can provide hints like `num_rows`, which can be used to
schedule parallel writers.** Without knowing the total number of rows,
it's difficult to know whether it's worth writing multiple files in
parallel.
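A hypothetical shape for the trait, inferred from the three points above (the actual names and signatures in this PR may differ):

```rust
// Hypothetical sketch of `Scannable`; names and signatures are inferred
// from the three advantages above, not copied from the PR.
use arrow_array::RecordBatch;
use arrow_schema::{ArrowError, SchemaRef};

trait Scannable {
    fn schema(&self) -> SchemaRef;

    /// Produce the data as a batch stream. One-shot sources (like an
    /// arbitrary boxed reader) may only be able to do this once.
    fn scan(&mut self) -> Box<dyn Iterator<Item = Result<RecordBatch, ArrowError>> + Send>;

    /// Whether `scan()` can be called again after a failed write,
    /// enabling retries without buffering all data in memory.
    fn supports_rescan(&self) -> bool {
        false
    }

    /// Optional row-count hint, used to decide whether writing multiple
    /// files in parallel is worthwhile.
    fn num_rows(&self) -> Option<usize> {
        None
    }
}
```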
We don't fully take advantage of (2) and (3) yet, but will in future
PRs. For (2), in order to be ready to leverage this, we need to hook the
`Scannable` implementation up to Python and NodeJS bindings. Right now
they always pass down a stream, but we want to make sure they support
retries when possible. And for (3), this will need to be hooked up to
#2939 and to a pipeline for running pre-processing steps (like embedding
generation).
## Other changes
* Moved `create_table` and `add_data` into their own modules. I've
created a follow-up issue to split up `table.rs` further, as it's by far
the largest file: https://github.com/lancedb/lancedb/issues/2949
* Eliminated the `HAS_DATA` generic for `CreateTableBuilder`. I didn't
see any public-facing places where we differentiated methods, which is
why I felt this simplification was okay.
* Added an `Error::External` variant and integrated some conversions to
allow certain errors to pass through transparently. This will fully work
once we upgrade Lance and get to take advantage of changes in
https://github.com/lance-format/lance/pull/5606
* Added LZ4 compression support for write requests to remote endpoints.
I checked and this has been supported on the server for > 1 year.
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Continues the modularization effort of table operations as outlined in
#2949.
- Extracts optimization operations (`OptimizeAction`, `OptimizeStats`,
`execute_optimize`, `compact_files_impl`, `cleanup_old_versions`,
`optimize_indices`) from
`table.rs` into `table/optimize.rs`
- Public API remains unchanged via re-exports
- Adds comprehensive tests including error cases with message assertions
## Test plan
- [x] All new optimization tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Continues the modularization effort of schema evolution operations as
outlined in #2949
## Summary
- Extracts schema evolution operations (add_columns, alter_columns,
drop_columns) from `table.rs` into `table/schema_evolution.rs`
- Public API remains unchanged via re-exports
## Test plan
- [x] All new schema evolution tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes
Expose `initial_storage_options()` and `latest_storage_options()` from
the Lance `Dataset` in the LanceDB Rust, Python, and TypeScript SDKs.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Implements `InsertExec` and `RemoteInsertExec` to support running
inserts in DataFusion.
## Context
In https://github.com/lancedb/lancedb/pull/2929, I've prototyped moving
the insert pipeline into DataFusion. This will enable parallelism at two
levels:
1. Running preprocessing, such as casting the input schema or computing
embeddings
2. Writing out files
This PR is just the first part of running the actual writes. In the end,
the plans might look like:
```
InsertExec
  RepartitionExec num_partitions=<write_parallelism>
    ProjectionExec vector=compute_embedding()
      RepartitionExec num_partitions=<num_cpus>
        DataSourceExec
```
where `num_cpus` is used to take advantage of all cores, while
`write_parallelism` might be less than `num_cpus` if there are too few
rows to want to split writes across `num_cpus` files.
Later PRs will move the preprocessing steps into DataFusion, and then
hook this up to the `Table::add()` implementations.
## Relation to future SQL work
We eventually plan on having the Remote SDK go through a FlightSQL
endpoint. Then for most queries we will send just the SQL string to the
server, and not run any sort of DataFusion plan on the client.
However, I think writes will be a little special, especially bulk writes
where we need to upload large streams of data and likely want
parallelism. So we'll have different code paths for writes, and I think
using DataFusion makes sense, especially as long as we are doing the
pre-processing on the client side still.
References #2949. Part 2 of the `table.rs` refactor. Moved
`UpdateResult`, `UpdateBuilder`, and the execution logic to
`src/table/update.rs`. No functional changes; the API remains identical.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## Summary
- PR #2957 changed the permutation builder to only select `_rowid` from
the base table, but `Splitter::project()` for hash and calculated splits
replaced the selection entirely, dropping `_rowid`.
- Include `_rowid` in the column selections for hash and calculated
split projections.
- Fix a Python test that queried the permutation table for base table
columns that are no longer materialized.
Fixes the `test_split_hash`, `test_split_hash_with_discard`,
`test_split_calculated`, `test_shuffle_combined_with_splits`, and
`test_filter_with_splits` failures in `test_permutation.py`.
## Test plan
- [x] `cargo test -p lancedb -- permutation` (22 passed)
- [x] `pytest python/tests/test_permutation.py` (46 passed)
- [x] `npm test __test__/permutation.test.ts` (20 passed)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
References #2949. Moved `DeleteResult` and the `delete()` implementation
to `src/table/delete.rs`. No functional changes. Added a test for
delete, which passes. Will work on refactoring update next.
Fixes the Rust SDK's `create_empty_table` to properly support embedding
column definitions, bringing it to parity with the Python SDK.
## Problem
The Rust SDK's `Connection::create_empty_table` did not support setting
embedding columns. When using `.add_embedding()` on the builder, the
embedding column definitions were lost because
`TableDefinition::new_from_schema(schema)` marks all columns as physical
only, without embedding metadata.
The Python SDK worked around this by creating an empty record batch with
proper schema metadata rather than using `create_empty_table` directly.
## Solution
Modified `CreateTableBuilder<false>` to handle embeddings.
Closes #2759
The permutation table was always intended to be a small table of row id
pointers (and split id). However, it was accidentally doing a full
materialization of the base table 🤦
This PR changes the permutation builder to only store row id and split
id.
Realized our MSRV check was inert because `rust-toolchain.toml` was
overriding the Rust version. We now set the `RUSTUP_TOOLCHAIN`
environment variable, which takes precedence over that file.
Also needed to update to MSRV 1.88 (due to dependencies like Lance and
DataFusion) and fix some clippy warnings.
Unlike in Amazon S3, bucket names in Azure are not globally unique.
Instead, the combination of `(storage_account_name, bucket_name)` is
unique.
Therefore, when using Azure Blob Storage, we always need a way to
configure the storage account name. One way is to use the
`storage_options` hash map and set `azure_storage_account_name`. Another
way is to set an environment variable, `AZURE_STORAGE_ACCOUNT_NAME`.
Prior to this PR, the second way (environment variable) did not work
with remote connections. This is because the existing code that checks
for these environment variables happens inside the Azure object store
implementation itself, which does not run locally when using remote
connections.
This PR addresses that by adding a client-side check of the environment
variable. It functions as a default when the relevant storage option is
not set in the `storage_options` hash map.
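A minimal sketch of the defaulting logic; `apply_azure_defaults` is an invented name, while the option and variable names come from the description above:

```rust
// Sketch only: `apply_azure_defaults` is an illustrative name, not the
// function added by this PR.
use std::collections::HashMap;

fn apply_azure_defaults(storage_options: &mut HashMap<String, String>) {
    // The explicit storage option always wins; the environment variable
    // is only consulted as a fallback.
    if !storage_options.contains_key("azure_storage_account_name") {
        if let Ok(account) = std::env::var("AZURE_STORAGE_ACCOUNT_NAME") {
            storage_options.insert("azure_storage_account_name".to_string(), account);
        }
    }
}
```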