lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-05-14 02:20:40 +00:00

Author	SHA1	Message	Date
Will Jones	0d3fc7860a	ci: fix python DataFusion test (#3060 )	2026-02-24 07:59:12 -08:00
Will Jones	0e486511fa	feat: hook up new writer for insert (#3029 ) This hooks up a new writer implementation for the `add()` method. The main immediate benefit is it allows streaming requests to remote tables, and at the same time allowing retries for most inputs. In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's always retry-able. For Python, all are retry-able, except `Iterator` and `pa.RecordBatchReader`, which can only be consumed once. Some, like `pa.datasets.Dataset` are retry-able and streaming. A lot of the changes here are to make the new DataFusion write pipeline maintain the same behavior as the existing Python-based preprocessing, such as: * casting input data to target schema * rejecting NaN values if `on_bad_vectors="error"` * applying embedding functions. In future PRs, we'll enhance these by moving the embedding calls into DataFusion and making sure we parallelize them. See: https://github.com/lancedb/lancedb/issues/3048 --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 14:43:31 -08:00
Lance Release	1ea22ee5ef	Bump version: 0.30.0-beta.0 → 0.30.0-beta.1	2026-02-23 18:33:28 +00:00
LanceDB Robot	8cef8806e9	chore: update lance dependency to v3.0.0-beta.5 (#3058 ) ## Summary - Bump Lance Rust dependencies and Java `lance-core` to v3.0.0-beta.5 (refs/tags/v3.0.0-beta.5). - Update workspace toolchain and dependency defaults needed for the new Lance release. - Resolve new clippy lint defaults introduced by the toolchain update. ## Validation - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all` --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com>	2026-02-23 00:39:30 -08:00
Varun Chawla	2802764092	fix(embeddings): stop retrying OpenAI 401 authentication errors (#2995 ) ## Summary Fixes #1679 This PR prevents the OpenAI embedding function from retrying when receiving a 401 Unauthorized error. Authentication errors are permanent failures that won't be fixed by retrying, yet the current implementation retries all exceptions up to 7 times by default. ## Changes - Modified `retry_with_exponential_backoff` in `utils.py` to check for non-retryable errors before retrying - Added `_is_non_retryable_error` helper function that detects: - Exceptions with name `AuthenticationError` (OpenAI's 401 error) - Exceptions with `status_code` attribute of 401 or 403 - Enhanced OpenAI embeddings to explicitly catch and re-raise `AuthenticationError` with better logging - Added unit test `test_openai_no_retry_on_401` to verify authentication errors don't trigger retries ## Test Plan - Added test that verifies: 1. A function raising `AuthenticationError` is only called once 2. No retry delays occur (sleep is never called) - Existing tests continue to pass - Formatting applied via `make format` ## Example Behavior Before: With an invalid API key, users would see 7 retry attempts over ~2 minutes: ``` WARNING:root:Error occurred: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}} Retrying in 3.97 seconds (retry 1 of 7) WARNING:root:Error occurred: Error code: 401... Retrying in 7.94 seconds (retry 2 of 7) ... ``` After: With an invalid API key, the error is raised immediately: ``` ERROR:root:Authentication failed: Invalid API key provided AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}} ``` This provides better UX and prevents unnecessary API calls that would fail anyway. --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2026-02-19 09:20:54 -08:00
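The retry behavior described in #2995 can be sketched roughly as follows. This is a minimal illustration, not the code in LanceDB's `utils.py`: the names `is_non_retryable_error` and `retry_with_backoff`, the stand-in `AuthenticationError` class, and the delay values are all assumptions. Only the two detection rules (exception name, or a `status_code` of 401 or 403) come from the commit message.

```python
import time


class AuthenticationError(Exception):
    """Hypothetical stand-in for a provider's 401 error."""

    status_code = 401


def is_non_retryable_error(exc):
    # Rule 1: the exception class is named AuthenticationError.
    if type(exc).__name__ == "AuthenticationError":
        return True
    # Rule 2: the exception carries a status_code of 401 or 403.
    return getattr(exc, "status_code", None) in (401, 403)


def retry_with_backoff(func, max_retries=7, initial_delay=1.0, sleep=time.sleep):
    """Call func, doubling the delay after each retryable failure."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            # Permanent failures and exhausted retries are re-raised immediately.
            if is_non_retryable_error(exc) or attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter is a design choice for this sketch so that a test can check that no delay occurs on an authentication failure.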
Weston Pace	37bbb0dba1	fix: allow permutation reader to work with remote tables as well (#3047 ) Fixed one more spot that was relying on `_inner`.	2026-02-19 00:41:41 +05:30
Prashanth Rao	155ec16161	fix: deprecate outdated files for embedding registry (#3037 ) There are old and outdated files in our embedding registry that can confuse coding agents. This PR deprecates the following files that have newer, more modern methods to generate such embeddings. - Deprecate `embeddings/siglip.py` - Deprecate `embeddings/gte.py` ## Why this change? Per a discussion with @AyushExel, the [embedding registry directory ](`1840aa7edc/python/python/lancedb/embeddings`) in the LanceDB repo has a number of outdated files that need to be deprecated. See https://github.com/lancedb/docs/issues/85 for the docs gaps that identified this. - Add note in `openclip` docs that it can be used for SigLip embeddings, which it now supports - Add note in the `sentence-transformers` page that ALL text embedding models on Hugging Face can be used	2026-02-18 12:04:39 -05:00
Weston Pace	636b8b5bbd	fix: allow permutation reader to be used with remote tables (#3019 ) There were two issues: 1. The python code needs to get access to the underlying rust table to setup the permutation reader and the attributes involved in this differ between the python local table and remote table objects. ~~2. The remote table was sending projection dictionaries as arrays of tuples and (on LanceDB cloud at least) it does not appear this is how rest servers are setup to receive them.~~ (this is now fixed as #3023) ~~Leaving as draft as this is built on https://github.com/lancedb/lancedb/pull/3016~~	2026-02-18 05:44:08 -08:00
Omair Afzal	715b81c86b	fix(python): graceful handling of empty result sets in hybrid search (#3030 ) ## Problem When applying hard filters that result in zero matches, hybrid search crashes with `IndexError: list index out of range` during reranking. This happens because empty result tables are passed through the full reranker pipeline, which expects at least one result. Traceback from the issue: ``` lancedb/query.py: in _combine_hybrid_results results = reranker.rerank_hybrid(fts_query, vector_results, fts_results) lancedb/rerankers/answerdotai.py: in rerank_hybrid combined_results = self._rerank(combined_results, query) ... IndexError: list index out of range ``` ## Fix Added an early return in `_combine_hybrid_results` when both vector and FTS results are empty. Instead of passing empty tables through normalization, reranking, and score restoration (which can fail in various ways), we now build a properly-typed empty result table with the `_relevance_score` column and return it directly. ## Test Added `test_empty_hybrid_result_reranker` that exercises `_combine_hybrid_results` directly with empty vector and FTS tables, verifying: - Returns empty table with correct schema - Includes `_relevance_score` column - Respects `with_row_ids` flag Closes #2425	2026-02-17 11:37:10 -08:00
Lance Release	d9e2d51f51	Bump version: 0.29.2 → 0.30.0-beta.0	2026-02-17 00:27:45 +00:00
Will Jones	c0230f91d2	feat(rust)!: accept `RecordBatch`, `Vec<RecordBatch>` in `create_table()` and `Table.add()` (#2948 ) BREAKING CHANGE: Arbitrary `impl RecordBatchReader` is no longer accepted, it must be made into `Box<dyn RecordBatchReader>`. This PR replaces `IntoArrow` with a new trait `Scannable` to define input row data. This provides the following advantages: 1. We can implement `Scannable` for more types than `IntoArrow`, such as `RecordBatch` and `Vec<RecordBatch>`. The `IntoArrow` trait was implemented for arbitrary `T: RecordBatchReader`, and the Rust compiler would prevent us from implementing it for foreign types like `RecordBatch` because (theoretically) those types might implement `RecordBatchReader` in the future. That's why we implement `Scannable` for `Box<dyn RecordBatchReader>` instead; since it's a concrete type it doesn't block implementing for other foreign types. 2. We can potentially replay `Scannable` values. Previously, we had to choose between buffering all data in memory and supporting retries of writes. But because `Scannable` things can optionally support re-scanning, we now have a way of supporting retries while also streaming. 3. `Scannable` can provide hints like `num_rows`, which can be used to schedule parallel writers. Without knowing the total number of rows, it's difficult to know whether it's worth writing multiple files in parallel. We don't yet fully take advantage of (2) and (3), but will in future PRs. For (2), in order to be ready to leverage this, we need to hook the `Scannable` implementation up to Python and NodeJS bindings. Right now they always pass down a stream, but we want to make sure they support retries when possible. And for (3), this will need to be hooked up to #2939 and to a pipeline for running pre-processing steps (like embedding generation). ## Other changes * Moved `create_table` and `add_data` into their own modules. I've created a follow-up issue to split up `table.rs` further, as it's by far the largest file: https://github.com/lancedb/lancedb/issues/2949 * Eliminated the `HAS_DATA` generic for `CreateTableBuilder`. I didn't see any public-facing places where we differentiated methods, which is why I felt this simplification was okay. * Added an `Error::External` variant and integrated some conversions to allow certain errors to pass through transparently. This will fully work once we upgrade Lance and get to take advantage of changes in https://github.com/lance-format/lance/pull/5606 * Added LZ4 compression support for write requests to remote endpoints. I checked and this has been supported on the server for > 1 year. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 14:18:36 -08:00
Weston Pace	70cbee6293	feat: improve Permutation pytorch integration (#3016 ) This changes around the output format of `Permutation` in some breaking ways but I think the API is still new enough to be considered experimental. 1. In order to align with both huggingface's dataset and torch's expectations the default output format is now a list of dicts (row-major) instead of a dict of lists (column-major). I've added a python_col option which will return the dict of lists. 2. In order to align with pytorch's expectation the `torch` format is now a list of tensors (row-major) instead of a 2D tensor (column-major). I've added a torch_col option which will return the 2D tensor instead. Added tests for torch integration with Permutation ~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013 merges as this is built on top of that~~	2026-02-12 13:41:14 -08:00
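The row-major versus column-major distinction described in #3016 can be illustrated in plain Python. This is only a sketch of the two layouts, not the `Permutation` implementation; the helper names `to_rows` and `to_columns` and the sample data are hypothetical.

```python
def to_rows(columns):
    """Convert a dict of lists (column-major) into a list of dicts (row-major)."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def to_columns(rows):
    """Convert a list of dicts (row-major) back into a dict of lists (column-major)."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


# Column-major batch: one list per column.
batch = {"id": [1, 2], "label": ["a", "b"]}
# Row-major view: one dict per row, the shape a per-item dataset consumer expects.
rows = to_rows(batch)
```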
Weston Pace	02783bf440	feat: add a getitems implementation for the permutation (#3013 )	2026-02-12 05:36:11 -08:00
Dhruv	4323ca0147	feat: show reranker info in hybrid search explain plan (#3006 ) Closes #3000 The hybrid search `explain_plan` now shows the reranker as the top-level node with the vector and FTS sub-plans indented underneath, instead of just listing them separately with no reranker context. Before: ``` Vector Search Plan: ProjectionExec: ... FTS Search Plan: ProjectionExec: ... ``` After: ``` RRFReranker(K=60) Vector Search Plan: ProjectionExec: ... FTS Search Plan: ProjectionExec: ... ``` Other rerankers display similarly ; e.g. `LinearCombinationReranker(weight=0.7, fill=1.0)`, `MRRReranker(weight_vector=0.5, weight_fts=0.5)`, `CohereReranker(model_name=name)`. --------- Signed-off-by: dask-58 <googldhruv@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com>	2026-02-10 11:45:39 -08:00
Dhruv	bd3dd6a8e5	fix: improve error message for multi-field FTS index creation (#3005 ) Fixes #2999 The error message previously said `"field_names must be a string when use_tantivy=False"` implying they should use the to be deprecated tantivy backend #2998. Updated the error message and docstring to instead guide users to create a separate FTS index for each field Signed-off-by: dask-58 <googldhruv@gmail.com>	2026-02-09 16:28:50 -08:00
Jack Ye	53c7c560c9	feat: add third party licenses lists (#3010 ) The files are generated with `make licenses`, currently expected to run manually. In the future, some automations could be built.	2026-02-09 16:16:46 -08:00
Lance Release	027d53500b	Bump version: 0.29.2-beta.0 → 0.29.2	2026-02-09 06:05:42 +00:00
Lance Release	9098f47e73	Bump version: 0.29.1 → 0.29.2-beta.0	2026-02-09 06:05:40 +00:00
Lance Release	5cdb15feef	Bump version: 0.29.1-beta.0 → 0.29.1	2026-02-07 00:32:44 +00:00
Lance Release	7a3eea927f	Bump version: 0.29.0 → 0.29.1-beta.0	2026-02-07 00:32:42 +00:00
Lance Release	071f467571	Bump version: 0.29.0-beta.0 → 0.29.0	2026-02-06 18:07:49 +00:00
Lance Release	f83aa25119	Bump version: 0.28.0-beta.0 → 0.29.0-beta.0	2026-02-06 18:07:48 +00:00
Jack Ye	0a8fe4d026	ci: fix python version for latest release (#2989 ) It was accidentally corrupted in https://github.com/lancedb/lancedb/pull/2972	2026-02-06 10:07:03 -08:00
Jack Ye	0859312b83	feat: add initial and latest storage options apis (#2966 ) Expose `initial_storage_options()` and `latest_storage_options()` in lance Dataset, in lancedb rust, python and typescript SDKs. --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 10:31:39 -08:00
Jack Ye	bd2c6d0763	chore: update lance dependency to v2.0.0-rc.4 (#2972 )	2026-02-03 14:38:39 -08:00
Rashid Ul Islam	c3cc2530b7	feat(python): expose fast_search in synchronous API (Fixes #2612 ) (#2962 ) Fixes #2612 This PR exposes the private _fast_search attribute via a public fast_search() method in the synchronous LanceVectorQueryBuilder. Previously, enabling fast search in the sync API required accessing a private member (query._fast_search = True). This change aligns the synchronous API with the Async and Remote APIs, allowing for cleaner, more Pythonic method chaining. Changes: Added fast_search() method to LanceVectorQueryBuilder in python/python/lancedb/query.py. Added a unit test verifying the flag works with high-dimensional data (2560 dims) and chaining. Example Usage: Before: ``` query = table.search(vector) query._fast_search = True # Private attribute usage results = query.limit(10).to_pandas() ``` After: ``` results = ( table.search(vector) .fast_search() .limit(10) .to_pandas() ) ``` Verification: I have added a test case (test_fast_search_high_dimension) that replicates the scenario described in the issue (2560 dimensions, cosine distance) to ensure the pipeline constructs the query correctly without errors. Checklist: - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. - [ ] Documentation has been updated (inline docstrings). Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>	2026-02-03 09:17:27 -08:00
Lance Release	972c682857	Bump version: 0.27.1 → 0.28.0-beta.0	2026-02-03 04:47:20 +00:00
Will Jones	131024839f	fix: include _rowid in hash and calculated split projections (#2965 ) ## Summary - PR #2957 changed the permutation builder to only select `_rowid` from the base table, but `Splitter::project()` for hash and calculated splits replaced the selection entirely, dropping `_rowid`. - Include `_rowid` in the column selections for hash and calculated split projections. - Fix a Python test that queried the permutation table for base table columns no longer materialized. Fixes the `test_split_hash`, `test_split_hash_with_discard`, `test_split_calculated`, `test_shuffle_combined_with_splits`, and `test_filter_with_splits` failures in `test_permutation.py`. ## Test plan - [x] `cargo test -p lancedb -- permutation` (22 passed) - [x] `pytest python/tests/test_permutation.py` (46 passed) - [x] `npm test __test__/permutation.test.ts` (20 passed) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 16:27:58 -08:00
Aman Harsh	3b8996bb69	fix(python): cancel remote queries on sync API interruption (#2913 ) Fixes #2898 Problem: Sync API cancellations didn’t stop remote query coroutines, so requests could continue after interrupt. Changes: - Cancel run_coroutine_threadsafe futures on any BaseException in the sync background loop - Update cancellation test to avoid starting a real background thread and cover GeneratorExit	2026-01-30 15:47:18 -08:00
Xin Sun	8773b865a9	fix(python): uses PIL incorrectly and may raise AttributeError (#2954 ) Importing `PIL` alone does not guarantee that the `Image` submodule is loaded. In a clean environment where no other code has imported `PIL.Image` before, `PIL.Image` does not exist on the `PIL` package, which leads to the AttributeError.	2026-01-30 15:33:10 -08:00
fzowl	1ee29675b3	feat(python): adding VoyageAI v4 models (#2959 ) Adding VoyageAI v4 models - with these, I added unit tests - added example code (tested!)	2026-01-30 15:16:03 -08:00
Lei Xu	357197bacc	chore!: change support python version from 3.10 to 3.13 (#2955 ) Python 3.9 has been EOL since Oct 2025, and the last two pyarrow builds were against Python 3.10-3.13. * This PR is contributed by codex-gpt5.2	2026-01-30 01:47:50 +08:00
Lei Xu	ad51e2dd1f	fix: support pydantic list of structs or optional struct (#2953 ) Closes #2950 This code is generated by codex-gpt5.2	2026-01-28 21:08:18 -08:00
Lance Release	cc5f8070d7	Bump version: 0.27.1-beta.0 → 0.27.1	2026-01-26 23:38:24 +00:00
Lance Release	dc0fb01f6b	Bump version: 0.27.0 → 0.27.1-beta.0	2026-01-26 23:38:23 +00:00
Jack Ye	e4552e577a	chore(revert): revert update lance dependency to v2.0.0-rc.1 (#2936 ) (#2941 ) This reverts commit `bd84bba14d`, so that we can bump version to 1.0.4-rc.1	2026-01-26 11:13:59 -08:00
Will Jones	f979a902ad	ci(rust): fix MSRV check (#2940 ) Realized our MSRV check was inert because `rust-toolchain.toml` was overriding the Rust version. We set the `RUSTUP_TOOLCHAIN` environment variable, which overrides that. Also needed to update to MSRV 1.88 (due to dependencies like Lance and DataFusion) and fix some clippy warnings.	2026-01-23 15:57:09 -08:00
LanceDB Robot	bd84bba14d	chore: update lance dependency to v2.0.0-rc.1 (#2936 ) ## Summary - bump Lance dependencies to v2.0.0-rc.1 (git tag) - align Arrow/DataFusion/PyO3 versions for the new Lance release - update Python bindings for PyO3 0.26 (attach API + Py<PyAny>) ## Verification - `cargo clippy --workspace --tests --all-features -- -D warnings` - `cargo fmt --all` ## Reference - https://github.com/lance-format/lance/releases/tag/v2.0.0-rc.1 --------- Co-authored-by: Jack Ye <yezhaoqin@gmail.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: BubbleCal <bubble_cal@outlook.com>	2026-01-22 13:14:38 -08:00
Lance Release	042bc22468	Bump version: 0.27.0-beta.1 → 0.27.0	2026-01-22 01:09:32 +00:00
Lance Release	68569906c6	Bump version: 0.27.0-beta.0 → 0.27.0-beta.1	2026-01-22 01:09:31 +00:00
Jack Ye	f124c9d8d2	test: string type conversion in pandas 3.0+ (#2928 ) Pandas 3.0+ string now converts to Arrow large_utf8. This PR mainly makes sure our test accounts for the difference across the pandas versions when constructing schema.	2026-01-21 13:40:48 -08:00
Jack Ye	4e65748abf	chore: update lance dependency to v1.0.3-rc.1 (#2927 ) Supersedes https://github.com/lancedb/lancedb/pull/2925 We accidentally upgraded lance to 2.0.0-beta.8. This PR reverts that first and then bumps to 1.0.3-rc.1 --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 11:52:07 -08:00
Lance Release	446a69b51b	Bump version: 0.26.1 → 0.27.0-beta.0	2026-01-21 12:21:09 +00:00
Ryan Green	cd5f91bb7d	feat: expose table uri (#2922 ) * Expose `table.uri` property for all tables, including remote tables * Fix bug in path calculation on windows file systems	2026-01-20 19:56:46 -03:30
LanceDB Robot	4da01a0e65	chore: update lance dependency to v2.0.0-beta.8 (#2907 ) ## Summary - bump Lance crates to v2.0.0-beta.8 and align arrow/datafusion/regex/half and PyO3 dependencies - update Rust/Python bindings for upstream API changes (namespace/table requests, query select columns, storage option providers) - verified with cargo clippy --workspace --tests --all-features -D warnings and cargo fmt --all Triggered by refs/tags/v2.0.0-beta.8. --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com> Co-authored-by: BubbleCal <bubble-cal@outlook.com>	2026-01-16 01:46:52 +08:00
Will Jones	1840aa7edc	feat(rust)!: remove default features (#2912 ) BREAKING CHANGE: removes `aws`, `dynamodb`, `azure`, `gcs`, `oss`, `huggingface` from default Rust features. They can be enabled by users as needed. They are still enabled for Python and NodeJS, since those users don't control the compilation of artifacts. Closes #2911	2026-01-13 11:23:14 -08:00
Colin Patrick McCabe	2f6d525802	fix: support `exist_ok` in `RemoteDBConnection.create_table` (#2901 ) RemoteDBConnection should support passing exist_ok to create_table, just like LanceDBConnection (the non-remote form) does. It can support this by passing 'exist_ok' as the mode parameter.	2026-01-07 12:29:45 -08:00
LuQQiu	d67a8743ba	feat: support remote ivf rq (#2863 )	2026-01-02 15:35:33 -08:00
Chenghao Lyu	46fcbbc1e3	fix(python): require explicit region for S3 buckets with dots (#2892 ) When the region is not specified in the S3 path, `resolve_s3_region` from the "lance-format" project (see [here][1]) resolves the region by calling `resolve_bucket_region`, a function from the "arrow-rs-object-store" project that expects [virtual-hosted-style URLs][3]. Dots (".") in a virtual-hosted-style URL break automatic region detection. See more details in the issue description: https://github.com/lancedb/lancedb/issues/1898#issuecomment-3690142427 This PR adds early validation in connect() and connect_async() to raise a clear error with instructions when the region is not specified for such buckets. [1]: https://github.com/lance-format/lance/blob/v2.0.0-beta.4/rust/lance-io/src/object_store/providers/aws.rs#L197 [2]: `eedbf3d7d8/src/aws/resolve.rs (L52C5-L52C65)` [3]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html#virtual-hosted-style-access Fixes #1898	2026-01-02 15:35:22 -08:00
fzowl	2adb10e6a8	feat: voyage-multimodal-3.5 (#2887 ) voyage-multimodal-3.5 support (text, image and video embeddings)	2026-01-02 15:14:52 -08:00


910 Commits