## Summary
Fixes IndexError when creating tables with empty list data and a
provided schema. Previously, `_into_pyarrow_reader()` would attempt to
access `data[0]` on empty lists, causing an IndexError. Now properly
handles empty lists by using the provided schema.
Also adds regression tests for GitHub issues #1968 and #303 to prevent
future regressions with empty table scenarios.
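The guard described above can be sketched without PyArrow. This is a simplified illustration, not the actual `_into_pyarrow_reader()` code: `resolve_schema` and the dict-based `Schema` alias are hypothetical stand-ins for the real function and `pa.Schema`.

```python
from typing import Any, Dict, List, Optional

Schema = Dict[str, type]  # hypothetical stand-in for pa.Schema


def resolve_schema(
    data: List[Dict[str, Any]], schema: Optional[Schema] = None
) -> Schema:
    """Pick a schema without touching data[0] when the list is empty."""
    if schema is not None:
        # A provided schema always wins, so empty data is fine here.
        return schema
    if not data:
        # Nothing to infer from and no schema given: fail with a clear
        # message instead of an IndexError from data[0].
        raise ValueError("Cannot infer a schema from empty data; pass a schema")
    # Only now is it safe to look at the first row.
    return {name: type(value) for name, value in data[0].items()}
```

The key point is the order of the checks: the provided schema is consulted before anything indexes into the list.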
## Changes
- Fix IndexError in `_into_pyarrow_reader()` for empty list + schema
case
- Add `Optional[pa.Schema]` parameter to handle empty data gracefully
- Add `test_create_table_empty_list_with_schema` for the IndexError fix
- Add `test_create_empty_then_add_data` for issue #1968
- Add `test_search_empty_table` for issue #303
## Test plan
- [x] All new regression tests pass
- [x] Existing tests continue to pass
- [x] Code formatted with `make format`
## Summary
Fixes #2541
**Problem**: The `register` function was not accessible via `from
lancedb.embeddings import register` as documented, causing ImportError
for users trying to create custom embedding functions.
**Solution**: Added `register` to the exports in
`python/lancedb/embeddings/__init__.py` to match the documented API and
follow the same pattern as other registry functions (`get_registry`,
`EmbeddingFunctionRegistry`).
**Root Cause**: The function existed in `lancedb.embeddings.registry`
but wasn't exposed through the main embeddings module interface.
## Changes
- Add `register` to imports in
`/python/python/lancedb/embeddings/__init__.py`
## Test Plan
- [x] Verified `from lancedb.embeddings import register` works as
documented
- [x] Confirmed existing embedding tests pass
- [x] Checked that the fix follows existing patterns (same as
`get_registry`)
- [x] Validated linting and formatting passes
## References
Fixes #2541
This patch fixes a build failure in the Python 3.9 dev environment.
The cause is that `ibm-watsonx-ai` requires Python 3.10 at minimum.
For details, see its page on `pyoven`: https://pyoven.org/package/ibm-watsonx-ai/
It also fixes a minor Markdown lint issue.
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
- Fix `register()` method's `alias` parameter type from `str = None` to
`Optional[str] = None`
- Add return type annotation `Type[EmbeddingFunction]` to `get()` method
- Import `Type` from the `typing` module for proper type hints
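As a rough illustration of the annotations listed above, here is a minimal registry sketch. `Registry` and this `EmbeddingFunction` base class are hypothetical stand-ins written for the example; only the `Optional[str] = None` and `Type[EmbeddingFunction]` annotations come from the change description.

```python
from typing import Callable, Dict, Optional, Type


class EmbeddingFunction:
    """Hypothetical base class standing in for the real one."""


class Registry:
    """Hypothetical minimal registry, only to show the annotations."""

    def __init__(self) -> None:
        self._functions: Dict[str, Type[EmbeddingFunction]] = {}

    def register(
        self, alias: Optional[str] = None
    ) -> Callable[[Type[EmbeddingFunction]], Type[EmbeddingFunction]]:
        # `Optional[str] = None` states that None is allowed, whereas
        # `str = None` contradicts its own annotation.
        def decorator(cls: Type[EmbeddingFunction]) -> Type[EmbeddingFunction]:
            self._functions[alias or cls.__name__] = cls
            return cls

        return decorator

    def get(self, name: str) -> Type[EmbeddingFunction]:
        # Returns the class itself, not an instance, hence Type[...].
        return self._functions[name]


registry = Registry()


@registry.register("my-embedder")
class MyEmbedder(EmbeddingFunction):
    pass
```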
Currently, `to_pydantic` always returns `LanceModel`. With type checking
enabled in my project, I have to use `cast(List[RealModelType], data)`
to resolve the type error. This PR uses generics to solve this problem.
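The idea can be sketched with a `TypeVar`, so the return type follows the model class passed in. `QueryResult` and `Item` are hypothetical names for this example, and a dataclass is used instead of a pydantic model to keep it dependency-free. The real signature may well bound the type variable to `LanceModel`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Type, TypeVar

T = TypeVar("T")


class QueryResult:
    """Hypothetical stand-in for a query builder holding raw rows."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def to_pydantic(self, model: Type[T]) -> List[T]:
        # The return type is tied to the class passed in, so a type checker
        # infers List[Item] below without a cast.
        return [model(**row) for row in self._rows]


@dataclass
class Item:  # a dataclass keeps the sketch dependency-free
    id: int
    text: str


items = QueryResult([{"id": 1, "text": "puppy"}]).to_pydantic(Item)
```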
## Summary
- Fixed flaky Node.js integration test for mirrored store functionality
- Converted callback-based `fs.readdir()` to `fs.promises.readdir()`
with proper async/await
- Used unique temporary directories to prevent test isolation issues
- Updated test expectations to match current IVF-PQ index file structure
## Problem
The mirrored store integration test was experiencing random failures in
CI with errors like:
- `expected 2 to equal 1` at various assertion points
- `done() called multiple times`
## Root Causes Identified
1. **Race conditions**: Mixing callback-based filesystem operations with
async functions created timing issues where assertions ran before
filesystem operations completed
2. **Test isolation**: Multiple tests shared the same temp directory
(`tmpdir()`), causing one test to see files from another
3. **Outdated expectations**: IVF-PQ indexes now create 2 files
(`auxiliary.idx` + `index.idx`) instead of 1, but the test expected only
1
## Solution
- Replace all `fs.readdir()` callbacks with `fs.promises.readdir()` and
`await`
- Use `fs.promises.mkdtemp()` to create unique temporary directories for
each test run
- Update index file count expectations from 1 to 2 files to match
current Lance behavior
- Add descriptive assertion labels for easier debugging
## Analysis
The mirroring implementation in `MirroringObjectStore::put_opts` is
synchronous - it awaits writes to both secondary (local) and primary
(S3) stores before returning. The test failures were due to
callback/async pattern mismatch and test isolation issues, not actual
async mirroring behavior.
## Test plan
- [x] Local tests are running without timing-based failures
- [x] Integration tests with AWS credentials pass in CI
This resolves the flaky failures including 'expected 2 to equal 1'
assertions and 'done() called multiple times' errors seen in CI runs.
## Summary
- Exposes `Session` in Python and TypeScript so users can set the
`index_cache_size_bytes` and `metadata_cache_size_bytes`
  - The `Session` is attached to the `Connection`, and thus shared across
  all tables in that connection.
- Adds deprecation warnings for table-level cache configuration
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
Fixes intermittent CI failures in `test_search_fts[False]` where boolean
FTS queries were returning fewer results than expected due to
non-deterministic test data generation.
## Problem
The test was using global `random` and `np.random` without seeding,
causing the boolean query `MatchQuery("puppy", "text") &
MatchQuery("runs", "text")` to sometimes return only 3 results instead
of the expected 5, leading to `AssertionError: assert 3 == 5`.
## Solution
- Replace global random calls with local `random.Random(42)` and
`np.random.RandomState(42)` objects in test fixtures
- Ensures deterministic test data while maintaining test isolation
- No impact on other tests since random state is scoped to fixtures only
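
A minimal sketch of the seeding approach, using only the standard library (the NumPy counterpart named above is `np.random.RandomState(42)`). The `make_rows` helper and its vocabulary are hypothetical and do not reproduce the actual fixture.

```python
import random
from typing import List


def make_rows(n: int, seed: int = 42) -> List[str]:
    """Generate deterministic test sentences from a local generator."""
    # A local generator: seeding it does not touch the global `random` state.
    rng = random.Random(seed)
    vocab = ["puppy", "runs", "car", "sleeps"]  # hypothetical vocabulary
    return [" ".join(rng.choice(vocab) for _ in range(5)) for _ in range(n)]
```

Two calls with the same seed return identical rows, and code that relies on the global generator elsewhere in the test suite is unaffected.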
## Test Results
- ✅ `test_search_fts[False]` now passes consistently
- ✅ All other FTS tests continue to pass
- ✅ No regression in other test suites (verified with `test_basic`)
- ✅ Maintains existing test behavior and coverage
## Summary
Fixed a minor grammar error in the error message for missing API key
when connecting to LanceDB cloud.
## Changes
- Changed 'api_key is required to connected LanceDB cloud' to 'api_key
is required to connect to LanceDB cloud'
- Location: `python/python/lancedb/__init__.py:95`
## Test plan
- Error message formatting is correct and grammatical
- No functional changes to existing behavior
## Summary
- Add `create_import_stub()` helper to `embeddings/utils.py` for
handling optional dependencies
- Fix MLX doctest collection failures by using import stubs in
`gte_mlx_model.py`
- Module now imports successfully for doctest collection even when MLX
is not installed
## Changes
- **New utility function**: `create_import_stub()` creates placeholder
objects that allow class inheritance but raise helpful errors when used
- **Updated MLX model**: Uses import stubs instead of direct imports
that fail immediately
- **Graceful degradation**: Clear error messages when MLX functionality
is accessed without MLX installed
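
One way such a stub could work is sketched below. This is not the helper actually added to `embeddings/utils.py`: the signature, the error message, and the internal class names are assumptions made for illustration.

```python
from typing import Optional


def create_import_stub(module_name: str, package_name: Optional[str] = None):
    """Return a placeholder for a module that could not be imported."""
    install_name = package_name or module_name
    message = (
        f"{module_name} is required for this feature but is not installed. "
        f"Try: pip install {install_name}"
    )

    class _Placeholder:
        # Subclassing this works; instantiating it (or a subclass that does
        # not override __init__) raises a helpful ImportError.
        def __init__(self, *args, **kwargs):
            raise ImportError(message)

    class _ImportStub:
        def __getattr__(self, name: str):
            # Leave dunder lookups alone so introspection keeps working.
            if name.startswith("__"):
                raise AttributeError(name)
            # Any attribute, e.g. `nn.Module`, resolves to the placeholder.
            return _Placeholder

    return _ImportStub()


# Assumed usage when `import mlx.nn as nn` fails:
nn = create_import_stub("mlx.nn", "mlx")


class Encoder(nn.Module):  # the class definition succeeds without MLX
    pass
```

Defining `Encoder` succeeds, so the module can be imported and collected for doctests. The error only appears when `Encoder()` is called.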
## Test Results
- ✅ `pytest --doctest-modules python/lancedb` now passes (with and
without MLX installed)
- ✅ All existing tests continue to pass
- ✅ MLX functionality works normally when MLX is installed
- ✅ Helpful error messages when MLX functionality is used without MLX
installed
Fixes #2538
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## Summary
Add support for providing a custom `Session` when connecting to a
`ListingDatabase`. This allows users to configure object store
registries, caching, and other session-related settings while
maintaining full backward compatibility.
## Usage Example
```rust
use std::sync::Arc;
use lancedb::connect;
let custom_session = Arc::new(lance::session::Session::default());
let db = connect("/path/to/database")
.session(custom_session)
.execute()
.await?;
```
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
Fixes #2515 by implementing comprehensive support for missing columns in
Arrow table inputs when using embedding functions.
### Problem
Previously, when an Arrow table was passed to `fromDataToBuffer` with
missing columns and a schema containing embedding functions, the system
would fail because `applyEmbeddingsFromMetadata` expected all columns to
be present in the table.
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
This test adds a new vector and then performs a vector search with a
distance range.
It may fail if the new vector becomes the closest one to the query
vector.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This fixes two bugs with `create_table` storage handle reuse. The first
issue is that the database object did not previously carry a session
that `create_table` operations could reuse.
The second issue is that the inheritance logic for `create_table` and
`open_table` was causing empty storage options (i.e. `Some({})`) to be
sent instead of `None`. Lance handles these differently:
* When `None` is set, the object store held in the session's storage
registry that was created at "connect" is used. This value stays in the
cache long-term (probably as long as the db reference is held).
* When `Some({})` is sent, LanceDB will create a new connection and cache
it for an empty key. However, that cached value will remain valid only
as long as the client holds a reference to the table. After that, the
cache is poisoned and the next `create_table` with the same key will
create a new connection. This confounds reuse if, for example, Python
garbage-collects the table object before another table is created.
My feeling is that the second path, if intentional, is probably meant to
serve cases where tables override settings and the cached connection is
assumed not to be generally applicable. The bug is that we were
engaging that mechanism for all tables.
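The distinction between "no options" and "an empty map of options" can be illustrated in Python, with `None` playing the role of Rust's `None` and `{}` the role of `Some({})`. The helper name and the merge rule are assumptions for this sketch, not the actual inheritance code.

```python
from typing import Dict, Optional


def inherit_storage_options(
    connection_options: Optional[Dict[str, str]],
    table_options: Optional[Dict[str, str]],
) -> Optional[Dict[str, str]]:
    """Merge options, collapsing an empty result to None."""
    # Assumed merge rule: table-level settings override connection-level ones.
    merged = {**(connection_options or {}), **(table_options or {})}
    # Send None rather than {} so the connection-level object store is reused.
    return merged or None
```

With this shape, only tables that actually override settings carry their own options.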
Previously, `return_score="all"` was supported only for the default
reranker (RRF) and not for the model-based rerankers.
This adds support for keeping all scores in the base reranker so that
all model-based rerankers can use it. It's a slower path than keeping
just the relevance score, but it can be useful for debugging.
Thanks for all your work.
The docstring for `OptimizeOptions` seems to reference a non-existent
method on `Table`. I believe this is the correct example for
`cleanupOlderThan`.
This also appears in the generated docs, but I assume they live
downstream from this code?
While trying to get remote functionality working for my project, I
noticed that we're doing a `return` instead of a `raise`. I went ahead
and implemented tests for both of the unimplemented functions
(`to_pandas` and `to_arrow`) while I was in there.
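The difference is easy to miss in review, so here is a minimal sketch. `RemoteTable` and the error messages are hypothetical stand-ins, not the actual class.

```python
class RemoteTable:
    """Hypothetical stand-in for the remote table class."""

    def to_pandas(self):
        # `return NotImplementedError(...)` would hand the caller an exception
        # object as if it were data; `raise` actually signals the failure.
        raise NotImplementedError("to_pandas is not supported on remote tables")

    def to_arrow(self):
        raise NotImplementedError("to_arrow is not supported on remote tables")
```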
---------
Co-authored-by: Cyrus Attoun <jattoun1@gmail.com>
I can't find any reason for pinning this dependency, and the pin can be
annoying for downstream users (e.g. datafusion currently requires
>= 2.6).
This also upgrades:
- datafusion 47.0 -> 48.0
- half 2.5.0 -> 2.6.0
to be consistent with lance
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Make sure we only update the latest version if it's actually newer. This
is important if there are concurrent queries, as they can take different
amounts of time.
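A minimal sketch of the "only move forward" rule, assuming a lock-protected counter. `VersionTracker` is a hypothetical name, and the actual change is not necessarily structured this way.

```python
import threading


class VersionTracker:
    """Hypothetical holder for the latest version seen by any query."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.latest = 0

    def observe(self, version: int) -> None:
        # A slow query may report an older version after a fast one has
        # already reported a newer one, so never move backwards.
        with self._lock:
            if version > self.latest:
                self.latest = version
```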
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Chores**
  * Updated dependencies to newer versions for improved compatibility and
  stability.
* **Refactor**
  * Improved internal handling of data ranges and stream lifetimes for
  enhanced performance and reliability.
  * Simplified code style for Python query object conversions without
  affecting functionality.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Other embedding integrations such as Cohere and OpenAI already send
requests in batches. We should do that for Ollama too to improve
throughput.
The Ollama [`.embed`
API](63ca747622/ollama/_client.py (L359-L378))
was added in version 0.3.0 (almost a year ago) so I updated the version
requirement in pyproject.
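The batching pattern can be sketched independently of the Ollama client. Here `embed_batch` is a caller-supplied callable standing in for whatever sends one request per batch. Both the helper name and the default batch size are assumptions for illustration.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts with one call per batch instead of one call per text."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        # Each slice becomes a single request to the embedding backend.
        batch = list(texts[start : start + batch_size])
        vectors.extend(embed_batch(batch))
    return vectors
```

The output order matches the input order because batches are processed sequentially and appended in place.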
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Bug Fixes**
  - Improved compatibility with newer versions of the "ollama" package by
  requiring version 0.3.0 or higher.
  - Enhanced embedding generation to process batches of texts more
  efficiently and reliably.
- **Refactor**
  - Improved type consistency and clarity for embedding-related methods.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->