Without this fix, if user directly use the native table to do operations
like `add_columns`, even if it is configured to use namespace db
connection, it is not really propagated through.
The fix is to bring lancedb's python binding up to date and do a similar
implementation as https://github.com/lance-format/lance/pull/5968, and
make sure the namespace is fully propagated through all the related
calls.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
When we write data with `add()`, we can input data to the table's
schema. However, we were using "safe" mode, which propagates errors as
nulls. For example, if you pass `u64::max` into a field that is a `u32`,
it will just write null instead of giving overflow error. Now it
propagates the overflow. This is the same behavior as other systems like
DuckDB.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Add `@value_to_sql.register(dict)` handler that converts Python dicts
to DataFusion's `named_struct()` SQL syntax
- Enables updating struct-typed columns via `table.update(values={"col":
{"field_a": 1, "field_b": "hello"}})`
- Recursively handles nested structs, lists, nulls, and all existing
scalar types
Closes#1363
## Details
The `named_struct` function was introduced in DataFusion 38 and is now
available (LanceDB uses DataFusion 52.1). The implementation follows the
existing `singledispatch` pattern in `util.py`.
**Example conversion:**
```python
value_to_sql({"field_a": 1, "field_b": "hello"})
# => "named_struct('field_a', 1, 'field_b', 'hello')"
```
## Test plan
- [x] Unit tests for flat struct, nested struct, list inside struct,
mixed types, null values, and empty dict
- [ ] CI integration tests with actual table.update() on struct columns
🔗 [DataFusion named_struct
docs](https://datafusion.apache.org/user-guide/sql/scalar_functions.html#named-struct)
We don't necessarily need to do this, but one user was confused having
used `fast_search=True` as a keyword argument for vector searches, but
being unable to do so for FTS, even after the most recent changes. I
think this is the only discrepancy in where that is possible.
This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is it allows streaming requests to remote tables,
and at the same time allowing retries for most inputs.
In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.
For Python, all are retry-able, except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.datasets.Dataset` are retry-able *and* streaming.
A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:
* casting input data to target schema
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.
In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Fixes#1679
This PR prevents the OpenAI embedding function from retrying when
receiving a 401 Unauthorized error. Authentication errors are permanent
failures that won't be fixed by retrying, yet the current implementation
retries all exceptions up to 7 times by default.
## Changes
- Modified `retry_with_exponential_backoff` in `utils.py` to check for
non-retryable errors before retrying
- Added `_is_non_retryable_error` helper function that detects:
- Exceptions with name `AuthenticationError` (OpenAI's 401 error)
- Exceptions with `status_code` attribute of 401 or 403
- Enhanced OpenAI embeddings to explicitly catch and re-raise
`AuthenticationError` with better logging
- Added unit test `test_openai_no_retry_on_401` to verify authentication
errors don't trigger retries
## Test Plan
- Added test that verifies:
1. A function raising `AuthenticationError` is only called once
2. No retry delays occur (sleep is never called)
- Existing tests continue to pass
- Formatting applied via `make format`
## Example Behavior
**Before**: With an invalid API key, users would see 7 retry attempts
over ~2 minutes:
```
WARNING:root:Error occurred: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}}
Retrying in 3.97 seconds (retry 1 of 7)
WARNING:root:Error occurred: Error code: 401...
Retrying in 7.94 seconds (retry 2 of 7)
...
```
**After**: With an invalid API key, the error is raised immediately:
```
ERROR:root:Authentication failed: Invalid API key provided
AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided...'}}
```
This provides better UX and prevents unnecessary API calls that would
fail anyway.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
There are old and outdated files in our embedding registry that can
confuse coding agents. This PR deprecates the following files that have
newer, more modern methods to generate such embeddings.
- Deprecate `embeddings/siglip.py`
- Deprecate `embeddings/gte.py`
## Why this change?
Per a discussion with @AyushExel, the [embedding registry directory
](1840aa7edc/python/python/lancedb/embeddings)
in the LanceDB repo has a number of outdated files that need to be
deprecated.
See https://github.com/lancedb/docs/issues/85 for the docs gaps that
identified this.
- Add note in `openclip` docs that it can be used for SigLip embeddings,
which it now supports
- Add note in the `sentence-transformers` page that ALL text embedding
models on Hugging Face can be used
## Problem
When applying hard filters that result in zero matches, hybrid search
crashes with `IndexError: list index out of range` during reranking.
This happens because empty result tables are passed through the full
reranker pipeline, which expects at least one result.
Traceback from the issue:
```
lancedb/query.py: in _combine_hybrid_results
results = reranker.rerank_hybrid(fts_query, vector_results, fts_results)
lancedb/rerankers/answerdotai.py: in rerank_hybrid
combined_results = self._rerank(combined_results, query)
...
IndexError: list index out of range
```
## Fix
Added an early return in `_combine_hybrid_results` when both vector and
FTS results are empty. Instead of passing empty tables through
normalization, reranking, and score restoration (which can fail in
various ways), we now build a properly-typed empty result table with the
`_relevance_score` column and return it directly.
## Test
Added `test_empty_hybrid_result_reranker` that exercises
`_combine_hybrid_results` directly with empty vector and FTS tables,
verifying:
- Returns empty table with correct schema
- Includes `_relevance_score` column
- Respects `with_row_ids` flag
Closes#2425
This changes around the output format of `Permutation` in some breaking
ways but I think the API is still new enough to be considered
experimental.
1. In order to align with both huggingface's dataset and torch's
expectations the default output format is now a list of dicts
(row-major) instead of a dict of lists (column-major). I've added a
python_col option which will return the dict of lists.
2. In order to align with pytorch's expectation the `torch` format is
now a list of tensors (row-major) instead of a 2D tensor (column-major).
I've added a torch_col option which will return the 2D tensor instead.
Added tests for torch integration with Permutation
~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013
merges as this is built on top of that~~
Closes#3000
The hybrid search `explain_plan` now shows the reranker as the top-level
node with
the vector and FTS sub-plans indented underneath, instead of just
listing them
separately with no reranker context.
**Before:**
```
Vector Search Plan:
ProjectionExec: ...
FTS Search Plan:
ProjectionExec: ...
```
**After:**
```
RRFReranker(K=60)
Vector Search Plan:
ProjectionExec: ...
FTS Search Plan:
ProjectionExec: ...
```
Other rerankers display similarly ; e.g.
`LinearCombinationReranker(weight=0.7, fill=1.0)`,
`MRRReranker(weight_vector=0.5, weight_fts=0.5)`,
`CohereReranker(model_name=name)`.
---------
Signed-off-by: dask-58 <googldhruv@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Fixes#2999
The error message previously said `"field_names must be a string when
use_tantivy=False"` implying they should use the to be deprecated
tantivy backend #2998.
Updated the error message and docstring to instead guide users to create
a separate FTS index for each field
Signed-off-by: dask-58 <googldhruv@gmail.com>
Expose `initial_storage_options()` and `latest_storage_options()` in
lance Dataset, in lancedb rust, python and typescript SDKs.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Fixes#2612
This PR exposes the private _fast_search attribute via a public
fast_search() method in the synchronous LanceVectorQueryBuilder.
Previously, enabling fast search in the sync API required accessing a
private member (query._fast_search = True). This change aligns the
synchronous API with the Async and Remote APIs, allowing for cleaner,
more Pythonic method chaining.
Changes:
Added fast_search() method to LanceVectorQueryBuilder in
python/python/lancedb/query.py.
Added a unit test verifying the flag works with high-dimensional data
(2560 dims) and chaining.
Example Usage:
Before:
```
query = table.search(vector)
query._fast_search = True # Private attribute usage
results = query.limit(10).to_pandas()
```
After:
```
results = (
table.search(vector)
.fast_search()
.limit(10)
.to_pandas()
)
```
Verification:
I have added a test case (test_fast_search_high_dimension) that
replicates the scenario described in the issue (2560 dimensions, cosine
distance) to ensure the pipeline constructs the query correctly without
errors.
Checklist:
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.
- [ ] Documentation has been updated (inline docstrings).
Signed-off-by: Rashidul Islam <rasidulislam71@gmail.com>
## Summary
- PR #2957 changed the permutation builder to only select `_rowid` from
the base table, but `Splitter::project()` for hash and calculated splits
replaced the selection entirely, dropping `_rowid`.
- Include `_rowid` in the column selections for hash and calculated
split projections.
- Fix a Python test that queried the permutation table for base table
columns no longer materialized.
Fixes the `test_split_hash`, `test_split_hash_with_discard`,
`test_split_calculated`, `test_shuffle_combined_with_splits`, and
`test_filter_with_splits` failures in `test_permutation.py`.
## Test plan
- [x] `cargo test -p lancedb -- permutation` (22 passed)
- [x] `pytest python/tests/test_permutation.py` (46 passed)
- [x] `npm test __test__/permutation.test.ts` (20 passed)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Fixes#2898
Problem:
Sync API cancellations didn’t stop remote query coroutines, so requests
could continue after interrupt.
Changes:
- Cancel run_coroutine_threadsafe futures on any BaseException in the
sync background loop
- Update cancellation test to avoid starting a real background thread
and cover GeneratorExit
Importing `PIL` alone does not guarantee that the `Image` submodule is
loaded. In a clean environment where no other code has imported
`PIL.Image` before, `PIL.Image` does not exist on the `PIL` package,
which leads to the AttributeError.
Pandas 3.0+ string now converts to Arrow large_utf8. This PR mainly
makes sure our test accounts for the difference across the pandas
versions when constructing schema.
RemoteDBConnection should support passing exist_ok to create_table, just
like LanceDBConnection (the non-remote form) does. It can support this
by passing 'exist_ok' as the mode parameter.
The page_token and limit parameters for table_names() are supported by
both local storage and LanceDB Cloud, not just Cloud as the docstring
incorrectly stated.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Adds `Table.to_lance()` and `Table.to_polars()` methods (non-abstract
methods, defaulting to `NotImplementedError`) so type checkers like
mypy, pyright and ty don’t flag them as unknown attributes on `Table`.
Not making these abstract methods should keep existing remote/other
`Table` implementations instantiable.
This is non-breaking change to existing functionality and is purely for
the purpose of pleasing static type-checkers like mypy, ty and pyright.
<img width="626" height="134" alt="image"
src="https://github.com/user-attachments/assets/f4619bca-a882-432b-bd23-ae8f189ff9e3"
/>
## Summary
- infer integer vector columns as float32 when any value exceeds uint8
range or is negative
- keep uint8 for integer vectors within range and nulls only
- add sync/async tests covering large integer vector inference
## Testing
- ./.venv/bin/pytest python/python/tests/test_table.py -k
"large_int_vectors"
Adds IVF_SQ index config through Rust core and Python bindings, plus
alias names IvfHnswSq/Pq for backward compatibility. Updates
remote/table helpers and types to accept the new index type. Includes
tests covering IVF SQ creation and alias usage.
1. Use generated models in lance-namespace for request response models
to avoid multiple layers of conversions
2. Make sure the API is consistent with the namespace spec
3. Deprecate the table_names API in favor of the list_tables API in
namespace that allows full pagination support without the need to have
sorted table names
4. Add describe_namespace API which was a miss in the original
implementation
Add support for enabling stable row IDs when creating tables via the
`new_table_enable_stable_row_ids` storage option.
Stable row IDs ensure that row identifiers remain constant after
compaction, update, delete, and merge operations. This is useful for
materialized views and other use cases that need to track source rows
across these operations.
The option can be set at two levels:
- Connection level: applies to all tables created with that connection
- Table level: per-table override via create_table storage_options
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
- bump all Lance crates to v1.0.0-beta.16 via ci/set_lance_version.py
- refresh Cargo.lock (reqwest/opendal/etc.) to satisfy the new release
## Verification
- cargo clippy --workspace --tests --all-features -- -D warnings
- cargo fmt --all
Triggered by
[refs/tags/v1.0.0-beta.16](https://github.com/lance-format/lance/releases/tag/v1.0.0-beta.16)
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
This PR improves the docstring for `IVF_RQ` (RabitQ) in Python. The
earlier version referred to it as "residual quantization", which is
confusing to future readers of the code.
In contrast, the TypeScript and Rust codebases defined `IVF_RQ` as
RabitQ. So now the three languages use comments that are consistent with
one another.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Did a full scan of all URLs that used to point to the old mkdocs pages,
and now links to the appropriate pages on lancedb.com/docs or lance.org
docs.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The `self._query` value was not set when wrapping its copy `query` with
quotation marks.
The test for phrase queries has been updated to test the
`.phrase_query()` method as well, which will catch this bug.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
This request improves support for `pydantic` integration by adding
`to_pydantic` method to asynchronous queries and handling models that
use `alias` in field definitions. Fixes#2436 and closes#2437 .
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added support for converting asynchronous query results to Pydantic
models.
- **Bug Fixes**
- Simplified conversion of query results to Pydantic models for improved
reliability.
- Improved handling of field aliases and computed fields when mapping
query results to Pydantic models.
- **Tests**
- Added tests to verify correct mapping of aliased and computed fields
in both synchronous and asynchronous scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->