lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-03-24 01:20:39 +00:00

Author	SHA1	Message	Date
Lance Release	d9e2d51f51	Bump version: 0.29.2 → 0.30.0-beta.0 python-v0.30.0-beta.0	2026-02-17 00:27:45 +00:00
LuQQiu	e081708cce	fix: non-stopping dataset version check after passing the first consistency check interval (#3034 ) When a table has a read consistency interval, queries within the interval skip the version check. Once the interval expires, a list call checks for new versions. If the version hasn't changed, the timer should reset so the next interval begins, but it didn't. The timer stayed expired, so every query after that triggered a list call, even though nothing changed. This affects all read operations (queries, schema lookups, searches) on tables with read_consistency_interval set. Each operation adds a list("_versions/") call to object storage, adding latency proportional to the store's list performance. For high-QPS workloads, this can saturate object store list throughput and significantly degrade query latency. Bug flow: 1. Every read operation (query, schema, search) calls ensure_up_to_date() 2. ensure_up_to_date() calls is_up_to_date(), which compares last_consistency_check.elapsed() against read_consistency_interval 3. If the interval has expired, it calls reload() 4. reload() calls need_reload(), which calls latest_version_id() — this is the list IOP (list("_versions/")) 5. If no new version, reload() returns early without resetting last_consistency_check 6. On the next query, step 2 sees the stale timer again → step 3 → step 4 → another list IOP 7. This repeats on every query forever	2026-02-16 15:49:14 -08:00
Will Jones	2d60ea6938	perf(remote): cache schema of remote tables (#3015 ) Caches the schema of remote tables and invalidates the cache: 1. After a 30-second TTL 2. When we do an operation that changes schema (e.g. add_columns) or checks out a different version (e.g. checkout_version) 3. When we get a 400, 404, or 500 response If the schema is retrieved close to the TTL, we optimistically fetch the schema in the background. This means a continuous stream of queries will never have the schema fetch on the critical path. Closes #3014 --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-13 15:21:04 -08:00
Jack Ye	dcb1443143	ci: add codex fix ci workflow (#3022 ) Similar to the lance one added recently: https://github.com/lance-format/lance/actions/workflows/codex-fix-ci.yml	2026-02-13 14:20:02 -08:00
Will Jones	c0230f91d2	feat(rust)!: accept `RecordBatch`, `Vec<RecordBatch>` in `create_table()` and `Table.add()` (#2948 ) BREAKING CHANGE: Arbitrary `impl RecordBatchReader` is no longer accepted, it must be made into `Box<dyn RecordBatchReader>`. This PR replaces `IntoArrow` with a new trait `Scannable` to define input row data. This provides the following advantages: 1. We can implement `Scannable` for more types than `IntoArrow`, such as `RecordBatch` and `Vec<RecordBatch>`. The `IntoArrow` trait was implemented for arbitrary `T: RecordBatchReader`, and the Rust compiler would prevent us from implementing it for foreign types like `RecordBatch` because (theoretically) those types might implement `RecordBatchReader` in the future. That's why we implement `Scannable` for `Box<dyn RecordBatchReader>` instead; since it's a concrete type it doesn't block implementing for other foreign types. 2. We can potentially replay `Scannable` values. Previously, we had to choose between buffering all data in memory and supporting retries of writes. But because `Scannable` things can optionally support re-scanning, we now have a way of supporting retries while also streaming. 3. `Scannable` can provide hints like `num_rows`, which can be used to schedule parallel writers. Without knowing the total number of rows, it's difficult to know whether it's worth writing multiple files in parallel. We don't yet fully take advantage of (2) and (3), but will in future PRs. For (2), in order to be ready to leverage this, we need to hook the `Scannable` implementation up to Python and NodeJS bindings. Right now they always pass down a stream, but we want to make sure they support retries when possible. And for (3), this will need to be hooked up to #2939 and to a pipeline for running pre-processing steps (like embedding generation). ## Other changes * Moved `create_table` and `add_data` into their own modules. I've created a follow-up issue to split up `table.rs` further, as it's by far the largest file: https://github.com/lancedb/lancedb/issues/2949 * Eliminated the `HAS_DATA` generic for `CreateTableBuilder`. I didn't see any public-facing places where we differentiated methods, which is why I felt this simplification was okay. * Added an `Error::External` variant and integrated some conversions to allow certain errors to pass through transparently. This will fully work once we upgrade Lance and get to take advantage of changes in https://github.com/lance-format/lance/pull/5606 * Added LZ4 compression support for write requests to remote endpoints. I checked and this has been supported on the server for > 1 year. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 14:18:36 -08:00
LanceDB Robot	5d629c9ecb	feat: update lance dependency to v2.0.1 (#3027 ) ## Summary - Bump Lance Rust workspace dependencies to v2.0.1 and update Java `lance-core` version. - Verified `cargo clippy --workspace --tests --all-features -- -D warnings` and `cargo fmt --all`. - Triggering tag: https://github.com/lancedb/lance/releases/tag/v2.0.1	2026-02-13 13:53:02 -08:00
Weston Pace	14973ac9d1	fix: support dynamic projection on remote table (#3023 ) The remote server expects an object (`{"alias": "col"}`) and the client was previously sending a list of tuples `[["alias", "col"]]`	2026-02-13 10:10:56 -08:00
Weston Pace	70cbee6293	feat: improve Permutation pytorch integration (#3016 ) This changes around the output format of `Permutation` in some breaking ways but I think the API is still new enough to be considered experimental. 1. In order to align with both huggingface's dataset and torch's expectations the default output format is now a list of dicts (row-major) instead of a dict of lists (column-major). I've added a python_col option which will return the dict of lists. 2. In order to align with pytorch's expectation the `torch` format is now a list of tensors (row-major) instead of a 2D tensor (column-major). I've added a torch_col option which will return the 2D tensor instead. Added tests for torch integration with Permutation ~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013 merges as this is built on top of that~~	2026-02-12 13:41:14 -08:00
Weston Pace	02783bf440	feat: add a getitems implementation for the permutation (#3013 )	2026-02-12 05:36:11 -08:00
Dhruv	4323ca0147	feat: show reranker info in hybrid search explain plan (#3006 ) Closes #3000 The hybrid search `explain_plan` now shows the reranker as the top-level node with the vector and FTS sub-plans indented underneath, instead of just listing them separately with no reranker context. Before: ``` Vector Search Plan: ProjectionExec: ... FTS Search Plan: ProjectionExec: ... ``` After: ``` RRFReranker(K=60) Vector Search Plan: ProjectionExec: ... FTS Search Plan: ProjectionExec: ... ``` Other rerankers display similarly, e.g. `LinearCombinationReranker(weight=0.7, fill=1.0)`, `MRRReranker(weight_vector=0.5, weight_fts=0.5)`, `CohereReranker(model_name=name)`. --------- Signed-off-by: dask-58 <googldhruv@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com>	2026-02-10 11:45:39 -08:00
Dhruv	bd3dd6a8e5	fix: improve error message for multi-field FTS index creation (#3005 ) Fixes #2999 The error message previously said `"field_names must be a string when use_tantivy=False"`, implying that users should switch to the to-be-deprecated tantivy backend (#2998). Updated the error message and docstring to instead guide users to create a separate FTS index for each field. Signed-off-by: dask-58 <googldhruv@gmail.com>	2026-02-09 16:28:50 -08:00
Abhishek	3c1162612e	refactor: extract optimize logic from table.rs into submodule (#2979 ) ## Summary Continues the modularization effort of table operations as outlined in #2949. - Extracts optimization operations (`OptimizeAction`, `OptimizeStats`, `execute_optimize`, `compact_files_impl`, `cleanup_old_versions`, `optimize_indices`) from `table.rs` into `table/optimize.rs` - Public API remains unchanged via re-exports - Adds comprehensive tests including error cases with message assertions ## Test plan - [x] All new optimization tests pass - [x] All existing tests pass - [x] `cargo clippy` passes with no warnings - [x] `cargo fmt --check` passes --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2026-02-09 16:22:57 -08:00
Jack Ye	53c7c560c9	feat: add third party licenses lists (#3010 ) The files are generated with `make licenses`, currently expected to run manually. In the future, some automations could be built.	2026-02-09 16:16:46 -08:00
Lance Release	de4f77800d	Bump version: 0.26.2-beta.0 → 0.26.2	2026-02-09 06:06:22 +00:00
Lance Release	b6ab721cf7	Bump version: 0.26.1 → 0.26.2-beta.0	2026-02-09 06:06:03 +00:00
Lance Release	027d53500b	Bump version: 0.29.2-beta.0 → 0.29.2 python-v0.29.2	2026-02-09 06:05:42 +00:00
Lance Release	9098f47e73	Bump version: 0.29.1 → 0.29.2-beta.0	2026-02-09 06:05:40 +00:00
Jack Ye	826a3e5ee9	ci(nodejs): add repository field to package.json for npm provenance (#3003 ) ## Summary - Added `repository` field to all nodejs package.json files (main package + 7 platform-specific packages) - This fixes the npm publish E422 error where sigstore provenance verification fails because the repository.url was empty ## Root Cause Failing CI: https://github.com/lancedb/lancedb/actions/runs/21770794768/job/62821570260 npm's sigstore provenance verification requires the `repository.url` field in package.json to match the GitHub repository URL from the provenance bundle. The platform-specific packages (`@lancedb/lancedb-darwin-arm64`, etc.) were missing this field entirely, causing the publish to fail with: ``` npm error 422 Unprocessable Entity - Error verifying sigstore provenance bundle: Failed to validate repository information: package.json: "repository.url" is "", expected to match "https://github.com/lancedb/lancedb" from provenance ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-08 22:04:32 -08:00
Lance Release	9fac56252e	Bump version: 0.26.1-beta.0 → 0.26.1	2026-02-07 00:33:18 +00:00
Lance Release	c55ca20c1b	Bump version: 0.26.0 → 0.26.1-beta.0	2026-02-07 00:33:02 +00:00
Lance Release	5cdb15feef	Bump version: 0.29.1-beta.0 → 0.29.1 python-v0.29.1	2026-02-07 00:32:44 +00:00
Lance Release	7a3eea927f	Bump version: 0.29.0 → 0.29.1-beta.0	2026-02-07 00:32:42 +00:00
Jack Ye	5dd9b072d8	ci: upgrade node version for publishing (#2993 ) Trusted publishing requires npm >=11.5.1, which means node>=24. Also need `npm config set provenance true` to fully enable it	2026-02-06 16:30:46 -08:00
Abhishek	6dde379d44	refactor: extract schema evolution logic from table.rs into submodule (#2973 ) Continues the modularization effort of schema evolution operations as outlined in #2949 ## Summary - Extracts schema evolution operations (add_columns, alter_columns, drop_columns) from `table.rs` into `table/schema_evolution.rs` - Public API remains unchanged via re-exports ## Test plan - [x] All new schema evolution tests pass - [x] All existing tests pass - [x] `cargo clippy` passes with no warnings - [x] `cargo fmt --check` passes	2026-02-06 11:33:18 -08:00
Lance Release	55f09ef1cd	Bump version: 0.26.0-beta.0 → 0.26.0	2026-02-06 18:08:30 +00:00
Lance Release	e9d8651d18	Bump version: 0.25.0-beta.0 → 0.26.0-beta.0	2026-02-06 18:08:08 +00:00
Lance Release	071f467571	Bump version: 0.29.0-beta.0 → 0.29.0 python-v0.29.0	2026-02-06 18:07:49 +00:00
Lance Release	f83aa25119	Bump version: 0.28.0-beta.0 → 0.29.0-beta.0	2026-02-06 18:07:48 +00:00
Jack Ye	0a8fe4d026	ci: fix python version for latest release (#2989 ) It was accidentally corrupted in https://github.com/lancedb/lancedb/pull/2972	2026-02-06 10:07:03 -08:00
Jack Ye	3ad7be9825	fix: remove x86_64-apple-darwin from list of npm triples (#2987 ) Missed during https://github.com/lancedb/lancedb/pull/2987	2026-02-06 09:43:44 -08:00
LanceDB Robot	589041d842	feat: update lance dependency to v2.0.0 (#2985 ) ## Summary - Bump Lance Rust crates to v2.0.0 (from v2.0.0-rc.4) and update Java `lance-core` to 2.0.0. - Verified `cargo clippy --workspace --tests --all-features -- -D warnings` and `cargo fmt --all`. - Triggering tag: v2.0.0.	2026-02-05 17:39:32 -08:00
Jack Ye	2e4cd56ab1	ci: auto-publish lancedb java sdk (#2986 ) Avoid the need to manually approve an artifact release in Maven Central	2026-02-05 16:30:32 -08:00
Jack Ye	6fd8586fa7	fix: avoid force push in codex workflows to work with v0.95.0 git safety (#2981 ) ## Summary - Codex CLI v0.95.0 ([PR #10258](https://github.com/openai/codex/pull/10258)) hardened git command safety so force push (`git push -f`, `--force`, `--force-with-lease`, `+refspec`) now requires approval, which blocks it in non-interactive `exec` mode. - This broke the [codex-update-lance-dependency](https://github.com/lancedb/lancedb/actions/runs/21727536000/job/62673436482) workflow — the job succeeded but failed to push the branch or create the PR. - Replace force push with `gh api` branch deletion followed by regular `git push`. - Also update the script to bump Java lance-core version which was missing previously ## Test plan - [x] Re-run the `Codex Update Lance Dependency` workflow with a test tag to verify the push and PR creation succeed: https://github.com/lancedb/lancedb/pull/2983 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 15:57:45 -08:00
Jack Ye	6329b57604	docs: update nodejs docs for storage options APIs (#2978 ) Regenerate TypeScript docs to include the new initialStorageOptions() and latestStorageOptions() methods added in #2966. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 16:07:58 -08:00
Will Jones	c51b13e70f	ci: fix publish failure notifications being skipped (#2976 ) ## Summary The `report-failure` jobs in npm, cargo, and pypi publish workflows checked for `release` or `workflow_dispatch` events, but these workflows are triggered by tag pushes where `github.event_name` is `push`. The condition was never true, so failure notifications were silently skipped. - Use `startsWith(github.ref, 'refs/tags/...')` to match actual tag triggers - Add `failure()` to only notify on actual failures This matches the pattern already used by `java-publish.yml`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 11:22:27 -08:00
Jack Ye	0859312b83	feat: add initial and latest storage options apis (#2966 ) Expose `initial_storage_options()` and `latest_storage_options()` in lance Dataset, in lancedb rust, python and typescript SDKs. --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 10:31:39 -08:00
Weston Pace	a6e8ec8d48	ci: remove npm auth token to allow trusted publisher (#2975 )	2026-02-04 07:28:42 -08:00
Jack Ye	bd2c6d0763	chore: update lance dependency to v2.0.0-rc.4 (#2972 )	2026-02-03 14:38:39 -08:00
Will Jones	fbf4a53475	feat(rust): implement `TableProvider::insert_into()` for LanceDB tables (#2939 ) Implements `InsertExec` and `RemoteInsertExec` to support running inserts in DataFusion. ## Context In https://github.com/lancedb/lancedb/pull/2929, I've prototyped moving the insert pipeline into DataFusion. This will enable parallelism at two levels: 1. Running preprocessing, such as casting the input schema or computing embeddings 2. Writing out files This PR is just the first part of running the actual writes. In the end, the plans might look like: ``` InsertExec RepartitionExec num_partitions=<write_parallelism> ProjectionExec vector=compute_embedding() RepartitionExec num_partitions=<num_cpus> DataSourceExec ``` where `num_cpus` is used to take advantage of all cores, while `write_parallelism` might be less than `num_cpus` if there are too few rows to want to split writes across `num_cpus` files. Later PRs will move the preprocessing steps into DataFusion, and then hook this up to the `Table::add()` implementations. ## Relation to future SQL work We eventually plan on having the Remote SDK go through a FlightSQL endpoint. Then for most queries we will send just the SQL string to the server, and not run any sort of DataFusion plan on the client. However, I think writes will be a little special, especially bulk writes where we need to upload large streams of data and likely want parallelism. So we'll have different code paths for writes, and I think using DataFusion makes sense, especially as long as we are doing the pre-processing on the client side still.	2026-02-03 10:38:02 -08:00
Vedant Madane	d3e15f3e17	fix(node): allow bigint[] for takeRowIds (#2916 ) ## Summary This PR changes takeRowIds to accept bigint[] instead of number[], matching the type of _rowid returned by withRowId(). ## Problem When retrieving row IDs using `withRowId()` and querying them back with takeRowIds(), users get an error because: 1. _rowid values are returned as JavaScript bigint 2. takeRowIds() expected number[] 3. NAPI failed to convert: Error: Failed to convert napi value BigInt into rust type i64 ## Reproduction ```js import lancedb from '@lancedb/lancedb'; const db = await lancedb.connect('memory://'); const table = await db.createTable('test', [{ id: 1, vector: [1.0, 2.0] }]); const results = await table.query().withRowId().toArray(); const rowIds = results.map(row => row._rowid); console.log('types:', rowIds.map(id => typeof id)); // ['bigint'] await table.takeRowIds(rowIds).toArray(); // ❌ Error before fix ``` ## Solution - Updated TypeScript signature from takeRowIds(rowIds: number[]) to takeRowIds(rowIds: bigint[]) - Updated Rust NAPI binding to accept Vec<BigInt> and convert using get_u64() Fixes #2722 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2026-02-03 10:09:51 -08:00
ChinmayGowda71	9c017d8348	refactor: extract update logic to src/table/update.rs (#2964 ) References #2949 Part 2 of table.rs refactor. Moved UpdateResult, UpdateBuilder, and execution logic to src/table/update.rs. No functional changes; the API remains identical. --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2026-02-03 09:54:19 -08:00
Rashid Ul Islam	c3cc2530b7	feat(python): expose fast_search in synchronous API (Fixes #2612 ) (#2962 ) Fixes #2612 This PR exposes the private _fast_search attribute via a public fast_search() method in the synchronous LanceVectorQueryBuilder. Previously, enabling fast search in the sync API required accessing a private member (query._fast_search = True). This change aligns the synchronous API with the Async and Remote APIs, allowing for cleaner, more Pythonic method chaining. Changes: Added fast_search() method to LanceVectorQueryBuilder in python/python/lancedb/query.py. Added a unit test verifying the flag works with high-dimensional data (2560 dims) and chaining. Example Usage: Before: ``` query = table.search(vector) query._fast_search = True # Private attribute usage results = query.limit(10).to_pandas() ``` After: ``` results = ( table.search(vector) .fast_search() .limit(10) .to_pandas() ) ``` Verification: I have added a test case (test_fast_search_high_dimension) that replicates the scenario described in the issue (2560 dimensions, cosine distance) to ensure the pipeline constructs the query correctly without errors. Checklist: - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. - [ ] Documentation has been updated (inline docstrings). Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>	2026-02-03 09:17:27 -08:00
Lance Release	571295b0d9	Bump version: 0.24.1 → 0.25.0-beta.0	2026-02-03 04:48:34 +00:00
Lance Release	972c682857	Bump version: 0.27.1 → 0.28.0-beta.0 python-v0.28.0-beta.0	2026-02-03 04:47:20 +00:00
LuQQiu	4f8ee82730	chore: update lance core java version to 1.0.4 (#2971 )	2026-02-02 20:43:36 -08:00
Will Jones	131024839f	fix: include _rowid in hash and calculated split projections (#2965 ) ## Summary - PR #2957 changed the permutation builder to only select `_rowid` from the base table, but `Splitter::project()` for hash and calculated splits replaced the selection entirely, dropping `_rowid`. - Include `_rowid` in the column selections for hash and calculated split projections. - Fix a Python test that queried the permutation table for base table columns no longer materialized. Fixes the `test_split_hash`, `test_split_hash_with_discard`, `test_split_calculated`, `test_shuffle_combined_with_splits`, and `test_filter_with_splits` failures in `test_permutation.py`. ## Test plan - [x] `cargo test -p lancedb -- permutation` (22 passed) - [x] `pytest python/tests/test_permutation.py` (46 passed) - [x] `npm test __test__/permutation.test.ts` (20 passed) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 16:27:58 -08:00
ChinmayGowda71	3c7ddf4d0c	refactor: modularize table.rs and extract delete logic (#2952 ) References #2949 Moved DeleteResult and delete() implementation to src/table/delete.rs. No functional changes. Added a passing test for delete. Will work on refactoring update next.	2026-02-02 11:54:49 -08:00
Siyuan Huang	461176f9f2	docs: update REST API link in README.md (#2906 ) Fix broken REST API docs link in README.md by replacing https://docs.lancedb.com/api-reference/introduction (404) with https://docs.lancedb.com/api-reference/rest	2026-01-30 15:49:41 -08:00
Aman Harsh	3b8996bb69	fix(python): cancel remote queries on sync API interruption (#2913 ) Fixes #2898 Problem: Sync API cancellations didn’t stop remote query coroutines, so requests could continue after interrupt. Changes: - Cancel run_coroutine_threadsafe futures on any BaseException in the sync background loop - Update cancellation test to avoid starting a real background thread and cover GeneratorExit	2026-01-30 15:47:18 -08:00
Mesut-Doner	3755064e93	fix(rust): support embeddings in create_empty_table (#2961 ) Fixes the Rust SDK's `create_empty_table` to properly support embedding column definitions, bringing it to parity with the Python SDK. ## Problem The Rust SDK's `Connection::create_empty_table` did not support setting embedding columns. When using `.add_embedding()` on the builder, the embedding column definitions were lost because `TableDefinition::new_from_schema(schema)` marks all columns as physical only, without embedding metadata. The Python SDK worked around this by creating an empty record batch with proper schema metadata rather than using `create_empty_table` directly. ## Solution Modified `CreateTableBuilder<false>` to handle embeddings. Closes #2759	2026-01-30 15:44:18 -08:00
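
Note on e081708cce (#3034): the timer behaviour described in that commit message can be modelled with a small sketch. This is a toy Python model under stated assumptions, not the Rust implementation. The class `ConsistencyChecker`, the injected `list_latest_version` callable and `clock`, and the `reset_on_no_change` switch are all hypothetical names introduced only for illustration. With the switch off, the model reproduces the described bug (one list call per read once the interval has expired); with it on, the timer resets even when the version is unchanged.

```python
import time
from typing import Callable


class ConsistencyChecker:
    """Toy model of a read-consistency timer (all names are hypothetical)."""

    def __init__(
        self,
        interval_s: float,
        list_latest_version: Callable[[], int],
        clock: Callable[[], float] = time.monotonic,
        reset_on_no_change: bool = True,
    ) -> None:
        self.interval_s = interval_s
        self.list_latest_version = list_latest_version  # stands in for the list IOP
        self.clock = clock
        self.reset_on_no_change = reset_on_no_change
        self.version = list_latest_version()
        self.last_check = clock()

    def is_up_to_date(self) -> bool:
        # Within the interval, reads skip the version check entirely.
        return self.clock() - self.last_check < self.interval_s

    def ensure_up_to_date(self) -> None:
        if self.is_up_to_date():
            return
        latest = self.list_latest_version()
        if latest == self.version:
            # The fix: restart the interval even though nothing changed.
            # Without this reset, every later read lists versions again.
            if self.reset_on_no_change:
                self.last_check = self.clock()
            return
        self.version = latest
        self.last_check = self.clock()
```

Injecting the clock and the listing function keeps the sketch testable without real object storage; counting calls to the listing function shows the difference between the two behaviours.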

2314 Commits