## Summary
Fixes IndexError when creating tables with empty list data and a
provided schema. Previously, `_into_pyarrow_reader()` would attempt to
access `data[0]` on empty lists, causing an IndexError. Now properly
handles empty lists by using the provided schema.
Also adds regression tests for GitHub issues #1968 and #303 to prevent
future regressions with empty table scenarios.
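A minimal sketch of the guard, assuming a helper shaped like `_into_pyarrow_reader()` (the real signature may differ):

```python
import pyarrow as pa
from typing import Optional

def _into_pyarrow_reader(data: list, schema: Optional[pa.Schema] = None) -> pa.RecordBatchReader:
    # Sketch: with empty data there is no data[0] to infer types from,
    # so fall back to the provided schema instead of raising IndexError.
    if len(data) == 0:
        if schema is None:
            raise ValueError("Cannot infer schema from empty data; please provide one")
        return pa.Table.from_pylist([], schema=schema).to_reader()
    ...  # non-empty path: infer from data[0] as before
```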
## Changes
- Fix IndexError in `_into_pyarrow_reader()` for empty list + schema
case
- Add Optional[pa.Schema] parameter to handle empty data gracefully
- Add `test_create_table_empty_list_with_schema` for the IndexError fix
- Add `test_create_empty_then_add_data` for issue #1968
- Add `test_search_empty_table` for issue #303
## Test plan
- [x] All new regression tests pass
- [x] Existing tests continue to pass
- [x] Code formatted with `make format`
## Summary
Fixes #2541
**Problem**: The `register` function was not accessible via `from
lancedb.embeddings import register` as documented, causing ImportError
for users trying to create custom embedding functions.
**Solution**: Added `register` to the exports in
`python/lancedb/embeddings/__init__.py` to match the documented API and
follow the same pattern as other registry functions (`get_registry`,
`EmbeddingFunctionRegistry`).
**Root Cause**: The function existed in `lancedb.embeddings.registry`
but wasn't exposed through the main embeddings module interface.
## Changes
- Add `register` to imports in
`/python/python/lancedb/embeddings/__init__.py`
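A sketch of what the export change likely amounts to (the surrounding import list is illustrative):

```python
# python/lancedb/embeddings/__init__.py (sketch)
from .registry import EmbeddingFunctionRegistry, get_registry, register

__all__ = ["EmbeddingFunctionRegistry", "get_registry", "register"]
```

With this in place, `from lancedb.embeddings import register` resolves as documented.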
## Test Plan
- [x] Verified `from lancedb.embeddings import register` works as
documented
- [x] Confirmed existing embedding tests pass
- [x] Checked that the fix follows existing patterns (same as
`get_registry`)
- [x] Validated linting and formatting passes
## References
Fixes #2541
This patch fixes the build on the Python 3.9 dev setup: the minimum
Python version for ibm-watsonx-ai is 3.10 (see `pyoven`:
https://pyoven.org/package/ibm-watsonx-ai/).
It also fixes a tiny markdown lint issue.
---------
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
- Fix register() method's alias parameter type from 'str = None' to
'Optional[str] = None'
- Add return type annotation 'Type[EmbeddingFunction]' to get() method
- Import Type from typing module for proper type hints
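A hedged sketch of the corrected annotations (method bodies elided; the surrounding class shape is assumed):

```python
from typing import Optional, Type

class EmbeddingFunctionRegistry:
    def register(self, alias: Optional[str] = None):
        # Previously annotated 'alias: str = None', which is wrong for a
        # None-defaulted parameter; Optional[str] matches the actual usage.
        ...

    def get(self, name: str) -> Type["EmbeddingFunction"]:
        # Returns the registered EmbeddingFunction subclass itself,
        # hence Type[EmbeddingFunction] rather than an instance.
        ...
```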
Currently, `to_pydantic` always returns `LanceModel`. If type checking
is enabled in my project, I have to use `cast(List[RealModelType],
data)` to work around the type error. This PR uses generics to solve
this problem.
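A minimal sketch of the generic approach, assuming a `to_pydantic` method on a query-builder-like class:

```python
from typing import List, Type, TypeVar

from lancedb.pydantic import LanceModel

T = TypeVar("T", bound=LanceModel)

class QueryBuilder:  # sketch: only the relevant method shown
    def to_pydantic(self, model: Type[T]) -> List[T]:
        # With a bound TypeVar, callers get List[TheirModel] back instead
        # of List[LanceModel], so no cast() is needed under type checking.
        ...
```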
## Summary
- Fixed flaky Node.js integration test for mirrored store functionality
- Converted callback-based `fs.readdir()` to `fs.promises.readdir()`
with proper async/await
- Used unique temporary directories to prevent test isolation issues
- Updated test expectations to match current IVF-PQ index file structure
## Problem
The mirrored store integration test was experiencing random failures in
CI with errors like:
- `expected 2 to equal 1` at various assertion points
- `done() called multiple times`
## Root Causes Identified
1. **Race conditions**: Mixing callback-based filesystem operations with
async functions created timing issues where assertions ran before
filesystem operations completed
2. **Test isolation**: Multiple tests shared the same temp directory
(`tmpdir()`), causing one test to see files from another
3. **Outdated expectations**: IVF-PQ indexes now create 2 files
(`auxiliary.idx` + `index.idx`) instead of 1, but the test expected only
1
## Solution
- Replace all `fs.readdir()` callbacks with `fs.promises.readdir()` and
`await`
- Use `fs.promises.mkdtemp()` to create unique temporary directories for
each test run
- Update index file count expectations from 1 to 2 files to match
current Lance behavior
- Add descriptive assertion labels for easier debugging
## Analysis
The mirroring implementation in `MirroringObjectStore::put_opts` is
synchronous - it awaits writes to both secondary (local) and primary
(S3) stores before returning. The test failures were due to
callback/async pattern mismatch and test isolation issues, not actual
async mirroring behavior.
## Test plan
- [x] Local tests are running without timing-based failures
- [x] Integration tests with AWS credentials pass in CI
This resolves the flaky failures including 'expected 2 to equal 1'
assertions and 'done() called multiple times' errors seen in CI runs.
## Summary
- Exposes `Session` in Python and TypeScript so users can set
`index_cache_size_bytes` and `metadata_cache_size_bytes`
  - The `Session` is attached to the `Connection`, and thus shared
    across all tables in that connection.
- Adds deprecation warnings for table-level cache configuration
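Assuming the Python binding follows the description above, usage might look like this (the `session` keyword and constructor argument names are assumptions based on the cache settings mentioned):

```python
import lancedb
from lancedb import Session  # assumed export location

session = Session(
    index_cache_size_bytes=512 * 1024 * 1024,     # 512 MiB index cache
    metadata_cache_size_bytes=128 * 1024 * 1024,  # 128 MiB metadata cache
)
# The session is attached to the connection and shared by all its tables.
db = lancedb.connect("/tmp/db", session=session)
```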
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
Fixes intermittent CI failures in `test_search_fts[False]` where boolean
FTS queries were returning fewer results than expected due to
non-deterministic test data generation.
## Problem
The test was using global `random` and `np.random` without seeding,
causing the boolean query `MatchQuery("puppy", "text") &
MatchQuery("runs", "text")` to sometimes return only 3 results instead
of the expected 5, leading to `AssertionError: assert 3 == 5`.
## Solution
- Replace global random calls with local `random.Random(42)` and
`np.random.RandomState(42)` objects in test fixtures
- Ensures deterministic test data while maintaining test isolation
- No impact on other tests since random state is scoped to fixtures only
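A sketch of the fixture-scoped seeding (fixture name and data are illustrative):

```python
import random

import numpy as np
import pytest

@pytest.fixture
def fts_table_data():
    # Locally seeded generators give deterministic data without touching
    # the global random state shared with other tests.
    rng = random.Random(42)
    nprng = np.random.RandomState(42)
    words = ["puppy", "runs", "fast", "sleeps", "barks"]
    return [
        {"text": " ".join(rng.choices(words, k=6)), "vector": nprng.rand(16).tolist()}
        for _ in range(100)
    ]
```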
## Test Results
- ✅ `test_search_fts[False]` now passes consistently
- ✅ All other FTS tests continue to pass
- ✅ No regression in other test suites (verified with `test_basic`)
- ✅ Maintains existing test behavior and coverage
## Summary
Fixed a minor grammar error in the error message for missing API key
when connecting to LanceDB cloud.
## Changes
- Changed 'api_key is required to connected LanceDB cloud' to 'api_key
is required to connect to LanceDB cloud'
- Location: `python/python/lancedb/__init__.py:95`
## Test plan
- Error message formatting is correct and grammatical
- No functional changes to existing behavior
## Summary
- Add `create_import_stub()` helper to `embeddings/utils.py` for
handling optional dependencies
- Fix MLX doctest collection failures by using import stubs in
`gte_mlx_model.py`
- Module now imports successfully for doctest collection even when MLX
is not installed
## Changes
- **New utility function**: `create_import_stub()` creates placeholder
objects that allow class inheritance but raise helpful errors when used
- **Updated MLX model**: Uses import stubs instead of direct imports
that fail immediately
- **Graceful degradation**: Clear error messages when MLX functionality
is accessed without MLX installed
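A minimal sketch of how such an import stub could work; the actual helper in `embeddings/utils.py` may differ:

```python
def create_import_stub(module_name: str):
    """Sketch: a placeholder class that can be inherited from at import
    time but raises a helpful ImportError once it is actually used."""

    class _ImportStub:
        def __init__(self, *args, **kwargs):
            raise ImportError(
                f"'{module_name}' is not installed; run "
                f"`pip install {module_name}` to use this feature."
            )

    return _ImportStub

# Usage sketch (as in gte_mlx_model.py): the module imports cleanly for
# doctest collection even when mlx is missing.
try:
    from mlx.nn import Module
except ImportError:
    Module = create_import_stub("mlx")

class GteMlxModel(Module):
    """Importable without MLX; instantiating it raises a clear error."""
```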
## Test Results
- ✅ `pytest --doctest-modules python/lancedb` now passes (with and
without MLX installed)
- ✅ All existing tests continue to pass
- ✅ MLX functionality works normally when MLX is installed
- ✅ Helpful error messages when MLX functionality is used without MLX
installed
Fixes #2538
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## Summary
Add support for providing a custom `Session` when connecting to a
`ListingDatabase`. This allows users to configure object store
registries, caching, and other session-related settings while
maintaining full backward compatibility.
## Usage Example
```rust
use std::sync::Arc;
use lancedb::connect;
let custom_session = Arc::new(lance::session::Session::default());
let db = connect("/path/to/database")
    .session(custom_session)
    .execute()
    .await?;
```
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-authored-by: Claude <noreply@anthropic.com>
## Summary
Fixes#2515 by implementing comprehensive support for missing columns in
Arrow table inputs when using embedding functions.
### Problem
Previously, when an Arrow table was passed to `fromDataToBuffer` with
missing columns and a schema containing embedding functions, the system
would fail because `applyEmbeddingsFromMetadata` expected all columns to
be present in the table.
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
This test adds a new vector and then performs a vector search with a
distance range.
This may fail if the new vector becomes the closest one to the query
vector.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This fixes two bugs with create_table storage handle reuse. The first
issue is that the database object did not previously carry a session
that create_table operations could reuse. The second issue is that the
inheritance logic for create_table and open_table was sending empty
storage options (i.e. `Some({})`) instead of `None`. Lance handles
these differently:
* When `None` is sent, the object store held in the session's storage
registry, created at "connect" time, is used. This value stays in the
cache long-term (probably as long as the db reference is held).
* When `Some({})` is sent, LanceDB creates a new connection and caches
it under an empty key. However, that cached value remains valid only
as long as the client holds a reference to the table. After that, the
cache is poisoned and the next create_table with the same key creates
a new connection. This confounds reuse if, e.g., Python gc's the table
object before another table is created.
My feeling is that the second path, if intentional, is probably meant
to serve cases where tables override settings and the cached connection
is assumed not to be generally applicable. The bug is that we were
engaging that mechanism for all tables.
Previously `return_score="all"` was supported only for the default
reranker (RRF) and not for the model-based rerankers.
This adds support for keeping all scores in the base reranker so that
all model-based rerankers can use it. It's a slower path than keeping
just the relevance score, but it can be useful for debugging.
Thanks for all your work.
The docstring for `OptimizeOptions` seems to reference a non-existent
method on `Table`. I believe this is the correct example for
`cleanupOlderThan`.
This also appears in the generated docs, but I assume they live
downstream from this code?
Just noticed that we're doing a 'return' instead of a 'raise' while
trying to get remote functionality working for my project. I went ahead
and implemented tests for both of the unimplemented functions
(`to_pandas` and `to_arrow`) while I was in there.
---------
Co-authored-by: Cyrus Attoun <jattoun1@gmail.com>
I can't find any reason for pinning this dependency and the fact that it
is pinned can be kind of annoying to use downstream (e.g. datafusion
currently requires >= 2.6).
This also upgrades:
- datafusion 47.0 -> 48.0
- half 2.5.0 -> 2.6.0
to be consistent with lance
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Make sure we only update the latest version if it's actually newer. This
is important if there are concurrent queries, as they can take different
amounts of time.
## Summary by CodeRabbit
* **Chores**
* Updated dependencies to newer versions for improved compatibility and
stability.
* **Refactor**
* Improved internal handling of data ranges and stream lifetimes for
enhanced performance and reliability.
* Simplified code style for Python query object conversions without
affecting functionality.
Other embedding integrations such as Cohere and OpenAI already send
requests in batches. We should do that for Ollama too to improve
throughput.
The Ollama [`.embed`
API](63ca747622/ollama/_client.py (L359-L378))
was added in version 0.3.0 (almost a year ago) so I updated the version
requirement in pyproject.
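A hedged sketch of batched embedding with the Ollama client's `.embed` API (model name and batch size are illustrative):

```python
from typing import List

import ollama

def embed_batches(texts: List[str], model: str = "nomic-embed-text", batch_size: int = 64):
    client = ollama.Client()
    vectors: List[List[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        # .embed (added in ollama 0.3.0) takes a list of inputs and
        # returns one embedding per input, cutting per-request overhead.
        response = client.embed(model=model, input=batch)
        vectors.extend(response["embeddings"])
    return vectors
```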
## Summary by CodeRabbit
- **Bug Fixes**
- Improved compatibility with newer versions of the "ollama" package by
requiring version 0.3.0 or higher.
- Enhanced embedding generation to process batches of texts more
efficiently and reliably.
- **Refactor**
- Improved type consistency and clarity for embedding-related methods.
## Summary by CodeRabbit
- **Documentation**
- Improved formatting and clarity in instructional text within the
Multivector on LanceDB notebook.
## Summary
Fixes issue #2465 where FTS explain plans only showed basic `LanceScan`
instead of detailed execution plans with FTS query details, limits, and
offsets.
## Root Cause
The `FTSQuery::explain_plan()` and `analyze_plan()` methods were missing
the `.full_text_search()` call before calling explain/analyze plan,
causing them to operate on the base query without FTS context.
## Changes
- **Fixed** `explain_plan()` and `analyze_plan()` in `src/query.rs` to
call `.full_text_search()`
- **Added comprehensive test coverage** for FTS explain plans with
limits, offsets, and filters
- **Updated existing tests** to expect correct behavior instead of buggy
behavior
## Before/After
**Before (broken):**
```
LanceScan: uri=..., projection=[...], row_id=false, row_addr=false, ordered=true
```
**After (fixed):**
```
ProjectionExec: expr=[id@2 as id, text@3 as text, _score@1 as _score]
  Take: columns="_rowid, _score, (id), (text)"
    CoalesceBatchesExec: target_batch_size=1024
      GlobalLimitExec: skip=2, fetch=4
        MatchQuery: query=test
```
## Test Plan
- [x] All new FTS explain plan tests pass
- [x] Existing tests continue to pass
- [x] FTS queries now show proper execution plans with MatchQuery,
limits, filters
Closes #2465
🤖 Generated with [Claude Code](https://claude.ai/code)
## Summary by CodeRabbit
* **Tests**
* Added new test cases to verify explain plan output for full-text
search, vector queries with pagination, and queries with filters.
* **Bug Fixes**
* Improved the accuracy of explain plan and analysis output for
full-text search queries, ensuring the correct query details are
reflected.
* **Refactor**
* Enhanced the formatting and hierarchical structure of execution plans
for hybrid queries, providing clearer and more detailed plan
representations.
---------
Co-authored-by: Claude <noreply@anthropic.com>
lance [release
details](https://github.com/lancedb/lance/releases/tag/v0.30.0)
## Summary by CodeRabbit
- **Chores**
- Updated dependency specifications to use exact version numbers instead
of referencing a git repository and tag.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
We are able to push commits over here:
cb7293e073/.github/workflows/make-release-commit.yml (L88-L95)
So I think it's safe to assume this will work.
## Summary by CodeRabbit
- **Chores**
- Updated workflow configuration to improve authentication and branch
targeting for automated release processes.
This switches the default FTS to the native Lance FTS for the Python
sync table API; the other APIs have already switched to the native
implementation.
## Summary by CodeRabbit
- **New Features**
- The default behavior for creating a full-text search index now uses
the new implementation rather than the legacy one.
- **Bug Fixes**
- Improved handling and error messages for phrase queries in full-text
search.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
## Summary by CodeRabbit
- **Chores**
- Updated the release workflow to explicitly check out the main branch
during the publishing process.
## Summary by CodeRabbit
- **Chores**
- Updated internal dependencies for improved stability and
compatibility.
- Enhanced error messages for schema inference failures to suggest
providing an explicit schema.
- Updated embedding application logic to check for existing destination
columns, allowing for filling embeddings in columns that are all null.
- Added comments for clarity on handling existing columns during
embedding application.
Fixes https://github.com/lancedb/lancedb/issues/2183
## Summary by CodeRabbit
- **Bug Fixes**
- Improved error messages for schema inference to enhance readability.
- Prevented redundant embedding application by skipping columns that
already contain data, avoiding unnecessary errors and computations.
## Summary by CodeRabbit
- **Chores**
- Updated release workflow to set a specific Git user name and email for
automated commits during the package publishing process.
## Summary by CodeRabbit
- **Chores**
- Updated internal dependencies to use a newer version of the Lance
library.
- **New Features**
- Added support for a new query occurrence type labeled "MUST NOT" in
search filters.
This exposes the maximum_nprobes and minimum_nprobes feature that was
added in https://github.com/lancedb/lance/pull/3903
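A hedged sketch of the new knobs in the Python query API (method names follow the notes below; exact signatures may differ):

```python
import lancedb

db = lancedb.connect("/tmp/db")
table = db.open_table("my_table")

results = (
    table.search([0.1, 0.2, 0.3, 0.4])
    .minimum_nprobes(20)    # always probe at least 20 partitions
    .maximum_nprobes(100)   # probe more, up to 100, only if needed
    .limit(10)
    .to_list()
)
```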
## Summary by CodeRabbit
- **New Features**
- Added support for specifying minimum and maximum probe counts in
vector search queries, allowing finer control over search behavior.
- Users can now independently set minimum and maximum probes for vector
and hybrid queries via new methods and parameters in Python, Node.js,
and Rust APIs.
- **Bug Fixes**
- Improved parameter validation to ensure correct usage of minimum and
maximum probe values.
- **Tests**
- Expanded test coverage to validate correct handling, serialization,
and error cases for the new probe parameters.
- operator for match query
- slop for phrase query
- boolean query
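A hedged sketch of the new options in the Python FTS query types (constructor shapes assumed from this repo's existing query classes; exact signatures may differ):

```python
from lancedb.query import BooleanQuery, MatchQuery, Occur, Operator, PhraseQuery

# Match query requiring ALL terms (AND) rather than the default OR.
match_all = MatchQuery("puppy runs", "text", operator=Operator.AND)

# Phrase query tolerating up to 2 intervening tokens between terms.
near_phrase = PhraseQuery("quick fox", "text", slop=2)

# Boolean query: must contain "puppy" and must not contain "cat".
boolq = BooleanQuery([
    (Occur.MUST, MatchQuery("puppy", "text")),
    (Occur.MUST_NOT, MatchQuery("cat", "text")),
])
```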
## Summary by CodeRabbit
- **New Features**
- Introduced support for boolean full-text search queries with AND/OR
logic and occurrence conditions.
- Added operator options for match and multi-match queries to control
term combination logic.
- Enabled phrase queries to specify proximity (slop) for flexible phrase
matching.
- Added new enumerations (`Operator`, `Occur`) and the `BooleanQuery`
class for enhanced query expressiveness.
- **Bug Fixes**
- Improved validation and error handling for invalid operator and
occurrence inputs in full-text queries.
- **Tests**
- Expanded test coverage with new cases for boolean queries and
operator-based full-text searches.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
- AND operator
- phrase query slop param
- boolean query
## Summary by CodeRabbit
- **New Features**
- Added support for combining full-text search queries using AND/OR
operators, enabling more flexible query composition.
- Introduced new query types and parameters, including boolean queries,
operator selection, occurrence constraints, and phrase slop for advanced
search scenarios.
- Enhanced asynchronous search to accept rich full-text query objects
directly.
- **Bug Fixes**
- Improved handling and validation of full-text search queries in both
synchronous and asynchronous search operations.
- **Tests**
- Updated and expanded tests to cover new full-text query types and
their usage in search functions.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This updates lance to v0.29.1-beta.1.
## Summary by CodeRabbit
- **Chores**
- Updated workspace dependencies for improved consistency and
reliability. No changes to user-facing functionality.
## Summary by CodeRabbit
- **Chores**
- Updated the release workflow to include an additional step for
improved process reliability. No changes to user-facing functionality.
Adds a script to change the lance dependency easily. To make this
change, I just had to run:
```bash
python ci/set_lance_version.py stable
```
## Summary by CodeRabbit
- **New Features**
- Added a script to automate updating the Lance package version in
project dependencies.
- **Chores**
- Updated workflows to improve lockfile management and automate updates
during releases and publishing.
- Switched Lance dependencies from git-based references to fixed version
numbers for improved stability.
- Enhanced lockfile update script with an option to amend commits and
quieter output.
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Follow up to #2416
Forgot to do `git add`.
Also need to delete old actions updating package lock.
## Summary by CodeRabbit
- **Chores**
- Removed legacy workflows related to updating package lock files.
- Improved the update lockfiles script to ensure updated lockfiles are
always included in amended commits.
## Summary by CodeRabbit
- **Chores**
- Updated various internal dependencies to newer versions for improved
stability and compatibility.
- Increased the version number for the Python package.
Just letting people know where to look starting June 1st.
Both docsites should be pointing to lancedb.github.io/documentation.
## Summary by CodeRabbit
- **Documentation**
- Added a notification banner to the documentation site informing users
about a new URL for accessing the latest documentation starting June
1st, 2025. The message includes a clickable link that opens in a new
tab.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
## Summary by CodeRabbit
- **Chores**
- Updated workflow to ignore changes in the `Cargo.lock` file during
documentation checks, reducing unnecessary workflow failures.
- Enhanced release process by adding automated lockfile updates for
Node.js and Rust components.
- Removed an obsolete package-lock update job from the publishing
workflow to streamline releases.
## Summary by CodeRabbit
- **Chores**
- Downgraded version numbers for the Node.js, Python, and Rust packages.
No other user-facing changes were made.
Closes #2412
## Summary by CodeRabbit
- **Bug Fixes**
- Improved the reliability of listing indices by logging warnings for
errors and skipping problematic entries, ensuring successful results are
returned.
- Internal indices used for optimization are now excluded from the
visible list of indices.
## Summary by CodeRabbit
- **Chores**
- Updated internal dependencies for improved stability and
compatibility. No user-facing changes.
This introduces some breaking changes to the Rust API for creating FTS
indexes, and the default index params have changed.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
## Summary by CodeRabbit
- **New Features**
- Updated default settings for full-text search (FTS) index creation:
stemming, stop word removal, and ASCII folding are now enabled by
default, while token position storage is disabled by default.
- **Refactor**
- Simplified and streamlined the configuration and handling of FTS index
parameters for improved maintainability and consistency across
interfaces.
- Enhanced serialization and request construction for FTS index
parameters to reduce manual handling and improve code clarity.
- Improved test coverage by explicitly enabling positional indexing in
FTS tests to support phrase queries.
- **Chores**
- Upgraded all internal dependencies related to FTS indexing to the
latest version for enhanced compatibility and performance.
- Updated package versions for Node.js, Python, and Rust components to
the latest beta releases.
- Improved CI workflows by adding Rust toolchain setup with formatting
and linting tools.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Adds example for querying a dataset with SQL
## Summary by CodeRabbit
- **Documentation**
- Added new guides on querying LanceDB tables using SQL with DuckDB and
Apache Datafusion.
- Included detailed instructions for integrating LanceDB with Datafusion
in Python.
- Updated navigation to include Datafusion and SQL querying
documentation.
- Improved formatting in TypeScript and vectordb update examples for
consistency.
- **Tests**
- Added a new test demonstrating SQL querying on Lance tables via
DataFusion integration.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Added a new README for better navigation, with updated language and
more detail.
## Summary by CodeRabbit
- **Documentation**
- Updated the README with a modernized header, improved structure, and
clearer descriptions of features and architecture.
- Expanded and reorganized key features and product offerings for better
clarity.
- Simplified installation instructions and added a table of SDK
interfaces with documentation links.
- Enhanced community and contributor sections with new visuals and links
to social and support channels.
## Summary by CodeRabbit
- **Chores**
- Updated internal dependencies for improved compatibility and
stability. No changes to user-facing features.
## Summary by CodeRabbit
- **Chores**
- Updated workspace dependencies to use a stable release version for
improved consistency and reliability. No changes to application features
or functionality.
## Summary by CodeRabbit
- **Documentation**
- Added an integration banner image to the beginning of the
Genkitx-LanceDB documentation.
This is for fe14671f1
## Summary by CodeRabbit
- **Chores**
- Updated internal dependencies to newer versions for improved stability
and performance. No changes to features or functionality.
## Summary by CodeRabbit
- **Documentation**
- Added a comprehensive guide for integrating LanceDB with Genkit,
including installation instructions, setup, indexing, retrieval, and
building a custom RAG pipeline with example code and screenshots.
- Updated the documentation navigation to include the new Genkit
integration, making it accessible from the site menu.
Add defaults for the result structs: when values are not provided, the
default values are used.
## Summary by CodeRabbit
- **Chores**
- Improved internal handling of table operation results to support
default values. No changes to user-facing features or functionality.
Provides the ability to set a timeout for merge insert. The default
underlying timeout is however long the first attempt takes, or if there
are multiple attempts, 30 seconds. This has two use cases:
1. Make the timeout shorter, when you want to fail if it takes too long.
2. Allow taking more time to do retries.
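Assuming the Python binding exposes the timeout on `execute` (per the notes below), usage might look like:

```python
from datetime import timedelta

import lancedb

db = lancedb.connect("/tmp/db")
table = db.create_table("t", data=[{"id": 1, "x": 1.0}], exist_ok=True)
new_data = [{"id": 1, "x": 2.0}, {"id": 2, "x": 3.0}]

# Use case 1: fail fast if the merge insert (including any retries)
# takes longer than 10 seconds.
stats = (
    table.merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_data, timeout=timedelta(seconds=10))
)
```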
## Summary by CodeRabbit
- **New Features**
- Added support for specifying a timeout when performing merge insert
operations in Python, Node.js, and Rust APIs.
- Introduced a new option to control the maximum allowed execution time
for merge inserts, including retry timeout handling.
- **Documentation**
- Updated and added documentation to describe the new timeout option and
its usage in APIs.
- **Tests**
- Added and updated tests to verify correct timeout behavior during
merge insert operations.
Prior to this commit, attempting to drop an index that did not exist
would return a TableNotFound error stating that the target table does
not exist -- even when it did exist. Instead, we now return an
IndexNotFound error.
## Summary by CodeRabbit
- **Bug Fixes**
- Improved error handling when attempting to drop a non-existent index,
providing a more accurate error message.
- **Tests**
- Added a test to verify correct error reporting when dropping an index
that does not exist.
This moves the `__len__` method from `LanceTable` and `RemoteTable` to
`Table` so that child classes don't need to implement their own. In the
process, it fixes the implementation of `RemoteTable`'s length method,
which was previously missing a return statement.
## Summary by CodeRabbit
- **Refactor**
- Centralized the table length functionality in the base table class,
simplifying subclass behavior.
- Removed redundant or non-functional length methods from specific table
classes.
- **Tests**
- Added a new test to verify correct table length reporting for remote
tables.
Add a restore-with-tag API to the Python and Node.js APIs, with tests
to guard it.
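Assuming `restore` now accepts a tag name wherever it accepts a version number (the tag-creation call is borrowed from the tags feature and its name is assumed), usage might look like:

```python
import lancedb

db = lancedb.connect("/tmp/db")
table = db.create_table("t", data=[{"id": 1}], exist_ok=True)

table.tags.create("v1", table.version)  # tag the current version (API name assumed)
table.add([{"id": 2}])

table.restore("v1")  # restore by tag...
table.restore(1)     # ...or by numeric version, as before
```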
## Summary by CodeRabbit
- **New Features**
- The restore functionality now supports using version tags in addition
to numeric version identifiers, allowing you to revert tables to a state
marked by a tag.
- **Bug Fixes**
- Restoring with an unknown tag now properly raises an error.
- **Documentation**
- Updated documentation and examples to clarify that restore accepts
both version numbers and tags.
- **Tests**
- Added new tests to verify restore behavior with version tags and error
handling for unknown tags.
- Added tests for checkout and restore operations involving tags.
The add API originally returned a struct with a request_id; this adds
backward compatibility for that.
## Summary by CodeRabbit
- **Bug Fixes**
- Improved handling of empty server responses for various data
operations to ensure consistent behavior across server versions.
- Added default values to version and numeric fields to prevent errors
when response data is incomplete.
- **Tests**
- Expanded tests to cover multiple server response scenarios, validating
correct version handling in data operations.
Reduce duplicated code in remote write-operation testing.
Avoid a double call to the remote server to get version info: just
return 0 instead of suddenly adding extra API calls for end users who
are running old servers.
## Summary by CodeRabbit
- **New Features**
- Added version tracking to table operation results, allowing users to
see the commit version associated with add, update, delete, merge, and
column modification operations.
- **Bug Fixes**
- Improved compatibility with legacy servers by standardizing version
information as zero when the server does not return a version.
- **Documentation**
- Clarified the meaning of the version field in operation results,
especially for cases involving legacy server responses.
- **Tests**
- Enhanced test coverage to verify correct behavior with both legacy and
modern server responses.
return version info for all write operations (add, update, merge_insert
and column modification operations)
## Summary by CodeRabbit
- **New Features**
- Table modification operations (add, update, delete, merge,
add/alter/drop columns) now return detailed result objects including
version numbers and operation statistics.
- Result objects provide clearer feedback such as rows affected and new
table version after each operation.
- **Documentation**
- Updated documentation to describe new result objects and their fields
for all relevant table operations.
- Added documentation for new result interfaces and updated method
return types in Node.js and Python APIs.
- **Tests**
- Enhanced test coverage to assert correctness of returned versioning
and operation metadata after table modifications.
Closes #2191
## Summary by CodeRabbit
- **Chores**
- Updated the required version of the pyarrow package to version 16 or
higher.
- Adjusted automated testing workflows to install pyarrow version 16 for
compatibility checks.
Based on this comment:
https://github.com/lancedb/lancedb/issues/2228#issuecomment-2730463075
and https://github.com/lancedb/lance/pull/2357
Here is my attempt at implementing bindings for returning merge stats
from a `merge_insert.execute` call for lancedb.
Note: I have almost no idea what I am doing in Rust but tried to follow
existing code patterns and pay attention to compiler hints.
- The change in the nodejs binding appeared to be necessary to get
compilation to work; presumably this could actually work properly by
returning some kind of NAPI JS object of the stats data?
- I am unsure what to do with the remote/table.rs changes (necessary
for compilation to work); I assume this is related to LanceDB Cloud,
but I am unsure of the best way to handle that at this point.
Proof of function:
```python
import pandas as pd
import lancedb

db = lancedb.connect("/tmp/test.db")

test_data = pd.DataFrame(
    {
        "title": ["Hello", "Test Document", "Example", "Data Sample", "Last One"],
        "id": [1, 2, 3, 4, 5],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
        ],
    }
)
table = db.create_table("documents", data=test_data, exist_ok=True, mode="overwrite")

update_data = pd.DataFrame(
    {
        "title": [
            "Hello, World",
            "Test Document, it's good",
            "Example",
            "Data Sample",
            "Last One",
            "New One",
        ],
        "id": [1, 2, 3, 4, 5, 6],
        "content": [
            "World",
            "This is a test",
            "Another example",
            "More test data",
            "Final entry",
            "New content",
        ],
    }
)
stats = (
    table.merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(update_data)
)
print(stats)
```
returns
```
{'num_inserted_rows': 1, 'num_updated_rows': 5, 'num_deleted_rows': 0}
```
## Summary by CodeRabbit
- **New Features**
- Merge-insert operations now return detailed statistics, including
counts of inserted, updated, and deleted rows.
- **Bug Fixes**
- Tests updated to validate returned merge-insert statistics for
accuracy.
- **Documentation**
- Method documentation improved to reflect new return values and clarify
merge operation results.
- Added documentation for the new `MergeStats` interface detailing
operation statistics.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
With a nested field like `metadata{filename=xyz}`, `filename` would be
there structurally, but ALWAYS null.
I didn't include this as a file, but it may be useful for understanding
the problem for people searching on this issue, so I'm including it here
as documentation. Before this patch, any field nested more than one
level deep was accepted but returned null values for its subfields when
queried.
```js
const lancedb = require('@lancedb/lancedb');

// Debug logger
function debug(message, data) {
  console.log(`[TEST] ${message}`, data !== undefined ? data : '');
}

// Log when our unwrapArrowObject is called
const kParent = Symbol.for("parent");
const kRowIndex = Symbol.for("rowIndex");

// Override console.log for our test
const originalConsoleLog = console.log;
console.log = function () {
  // Filter out noisy LanceDB info logs; pass everything else through once
  if (arguments[0] && typeof arguments[0] === 'string' && arguments[0].includes('[INFO] [LanceDB]')) {
    return;
  }
  originalConsoleLog.apply(console, arguments);
};

async function main() {
  debug('Starting test...');

  // Connect to the database
  debug('Connecting to database...');
  const db = await lancedb.connect('./.lancedb');

  // Try to open an existing table, or create a new one if it doesn't exist
  let table;
  try {
    table = await db.openTable('test_nested_fields');
    debug('Opened existing table');
  } catch (e) {
    debug('Creating new table...');
    // Create test data with nested metadata structure
    const data = [
      {
        id: 'test1',
        vector: [1, 2, 3],
        metadata: {
          filePath: "/path/to/file1.ts",
          startLine: 10,
          endLine: 20,
          text: "function test() { return true; }"
        }
      },
      {
        id: 'test2',
        vector: [4, 5, 6],
        metadata: {
          filePath: "/path/to/file2.ts",
          startLine: 30,
          endLine: 40,
          text: "function test2() { return false; }"
        }
      }
    ];
    debug('Data to be inserted:', JSON.stringify(data, null, 2));
    // Create the table
    table = await db.createTable('test_nested_fields', data);
    debug('Table created successfully');
  }

  // Query the table and get results
  debug('Querying table...');
  const results = await table.search([1, 2, 3]).limit(10).toArray();

  // Log the results
  debug('Number of results:', results.length);
  if (results.length > 0) {
    const firstResult = results[0];
    debug('First result properties:', Object.keys(firstResult));
    // Check if metadata is accessible and what properties it has
    if (firstResult.metadata) {
      debug('Metadata properties:', Object.keys(firstResult.metadata));
      debug('Metadata filePath:', firstResult.metadata.filePath);
      debug('Metadata startLine:', firstResult.metadata.startLine);
      // Destructure to see if that helps
      const { filePath, startLine, endLine, text } = firstResult.metadata;
      debug('Destructured values:', { filePath, startLine, endLine, text });
      // Check if it's a proxy object
      debug('Result is proxy?', Object.getPrototypeOf(firstResult) !== Object.prototype);
      debug('Metadata is proxy?', Object.getPrototypeOf(firstResult.metadata) !== Object.prototype);
    } else {
      debug('Metadata is not accessible!');
    }
  }

  // Close the database
  await db.close();
}

main().catch(e => {
  console.error('Error:', e);
});
```
## Summary by CodeRabbit
- **Bug Fixes**
- Improved handling of nested struct fields to ensure accurate
preservation of values during serialization and deserialization.
- Enhanced robustness when accessing nested object properties, reducing
errors with missing or null values.
- **Tests**
- Added tests to verify correct handling of nested struct fields through
serialization and deserialization.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## Summary by CodeRabbit
- **Chores**
- Updated dependencies for related components to use the latest version
from a specific repository source. No changes to features or public
functionality.
* Add a new "table stats" API to expose basic table and fragment
statistics with local and remote table implementations
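A hedged sketch of calling the new API from Python (the method name and result fields are assumptions based on the description):

```python
import lancedb

db = lancedb.connect("/tmp/db")
table = db.create_table("t", data=[{"id": i} for i in range(1000)], exist_ok=True)

stats = table.stats()  # new "table stats" API (method name assumed)
# Illustrative contents, per the description: row and index counts,
# total bytes on disk, and per-fragment size statistics
# (min/max/mean/percentiles).
print(stats)
```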
### Questions
* This is using `calculate_data_stats` to determine total bytes in the
table. This seems like a potentially expensive operation - are there any
concerns about performance for large datasets?
### Notes
* bytes_on_disk seems to be stored at the column level but there does
not seem to be a way to easily calculate total bytes per fragment. This
may need to be added in lance before we can support fragment size
(bytes) statistics.
## Summary by CodeRabbit
- **New Features**
- Added a method to retrieve comprehensive table statistics, including
total rows, index counts, storage size, and detailed fragment size
metrics such as minimum, maximum, mean, and percentiles.
- Enabled fetching of table statistics from remote sources through
asynchronous requests.
- Extended table interfaces across Python, Rust, and Node.js to support
synchronous and asynchronous retrieval of table statistics.
- **Tests**
- Introduced tests to verify the accuracy of the new table statistics
feature for both populated and empty tables.
Add the tag-related APIs: list existing tags, attach a tag to a
version, update a tag's version, delete a tag, get the version of a
tag, and check out the version that a tag is bound to.
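A hedged sketch of the tag workflow in Python (method names follow the description above; the exact API may differ):

```python
import lancedb

db = lancedb.connect("/tmp/db")
table = db.create_table("t", data=[{"id": 1}], exist_ok=True)

table.tags.create("stable", table.version)  # attach a tag to a version
print(table.tags.list())                    # list existing tags
print(table.tags.get_version("stable"))     # version the tag is bound to
table.checkout("stable")                    # check out by tag
table.checkout_latest()
table.tags.update("stable", table.version)  # move the tag to another version
table.tags.delete("stable")                 # delete the tag
```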
## Summary by CodeRabbit
- **New Features**
- Introduced table version tagging, allowing users to create, update,
delete, and list human-readable tags for specific table versions.
- Enabled checking out a table by either version number or tag name.
- Added new interfaces for tag management in both Python and Node.js
APIs, supporting synchronous and asynchronous workflows.
- **Bug Fixes**
- None.
- **Documentation**
- Updated documentation to describe the new tagging features, including
usage examples.
- **Tests**
- Added comprehensive tests for tag creation, updating, deletion,
listing, and version checkout by tag in both Python and Node.js
environments.
Fix the hybrid search explain_plan and analyze_plan APIs.
## Summary by CodeRabbit
- **New Features**
- Added options to view the execution plan and analyze the runtime
performance of hybrid queries.
- **Refactor**
- Improved internal handling of query setup for better modularity and
maintainability.
Upstream changelog:
https://github.com/lancedb/lance/releases/tag/v0.26.0
## Summary by CodeRabbit
- **Chores**
- Updated dependency management to use published crate versions for
improved reliability and maintainability.
- Added a temporary workaround for build issues by pinning a specific
version of a dependency.
- **Refactor**
- Improved resource management and concurrency by updating internal
ownership models for object storage components.
## Summary by CodeRabbit
- **New Features**
- Added support for distance range filtering in hybrid vector queries,
allowing users to specify lower and upper bounds for search results.
- **Tests**
- Introduced new tests to validate distance range filtering and
reranking in both synchronous and asynchronous hybrid query scenarios.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
To work around this issue: https://github.com/lancedb/lancedb/issues/2211
## Summary by CodeRabbit
- **Bug Fixes**
- Improved handling of large query parameters to prevent potential
overflow issues when using the "k" parameter in queries.
Fixes #2315.
## Summary by CodeRabbit
- **Refactor**
- Enhanced query processing to maintain smooth functionality across
different dependency versions, ensuring improved stability and
performance.
## Summary by CodeRabbit
- **Chores**
- Improved CI workflow for documentation builds by optimizing Rust build
settings and updating the runner environment.
- Fixed a typo in a workflow step name.
- Streamlined caching steps to reduce redundancy and improve efficiency.
Closes https://github.com/lancedb/lancedb/issues/2307
* Adds retries to remote operations with stream bodies (add,
merge_insert)
* Change default retryable status codes to 409, 429, 500, 502, 503, 504
* Don't retry add or merge_insert operations on 5xx responses
Notes:
* Supporting retries on stream bodies means we have to buffer the body
into memory so it can be cloned on retry. This will impact memory use
patterns for the remote client. This buffering can be disabled by
disabling retries (i.e. setting retries to 0 in RetryConfig)
* It does not seem that retry config can be specified by env vars as the
documentation suggests. I added a follow-up issue
[here](https://github.com/lancedb/lancedb/issues/2350)
## Summary by CodeRabbit
- **New Features**
- Enhanced retry support for remote requests with configurable limits
and exponential backoff with jitter.
- Added robust retry logic for streaming data uploads, enabling retries
with buffered data to ensure reliability.
- **Bug Fixes**
- Improved error handling and retry behavior for HTTP status codes 409
and 504.
- **Refactor**
- Centralized and modularized HTTP request sending and retry logic
across remote database and table operations.
- Streamlined request ID management for improved traceability.
- Simplified error message construction in index waiting functionality.
- **Tests**
- Added a test verifying merge-insert retries on HTTP 409 responses.
I added a timeout to query execution options in
https://github.com/lancedb/lancedb/pull/2288. However, this was passed
as the request timeout, and the retry implementation is unaware of it,
so once a query timed out, a retry would be triggered.
Instead, this PR changes it so the timeout applies outside the retry
loop.
## Summary by CodeRabbit
- **Bug Fixes**
- Improved query timeout handling to provide clearer error messages and
more reliable cancellation if a query takes too long to complete.
Fixes #2344
## Summary by CodeRabbit
- **Tests**
- Updated tests to use PyArrow Tables instead of pandas DataFrames where
possible, reducing reliance on pandas.
- Tests that require pandas are now automatically skipped if pandas is
not installed.
- **Chores**
- Improved workflow to uninstall both pylance and pandas in a specific
test step.
The docs in the Guide here do not match the [API
reference](https://lancedb.github.io/lancedb/js/classes/Table/#updateopts)
for the nodejs client.
I am writing an Elixir wrapper over the typescript library (Rust
forthcoming!) and confirmed in testing that the API reference is correct
vs the Guide.
Following the Guide docs, the error I got was:
`lance error: Invalid user input: Schema error: No field named bar.
Valid fields are foo.`
for a query of `await table.update({foo: "buzz"}, { where: "foo = 'bar'"});`
over a table with a schema of just `{foo: Utf8}`.
## Summary by CodeRabbit
- **Documentation**
- Reformatted a code snippet in the guide to enhance readability by
splitting it into multiple lines for improved clarity.
* Add new wait_for_index() table operation that polls until indices are
created/fully indexed
* Add an optional wait timeout parameter to all create_index operations
* Python and NodeJS interfaces
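A hedged sketch of the new waiting behavior in Python (parameter and method names follow the bullets above; signatures may differ):

```python
from datetime import timedelta

import lancedb

db = lancedb.connect("/tmp/db")
table = db.open_table("my_table")

# Option 1: block inside create_index until the index is fully built.
table.create_index(vector_column_name="vector", wait_timeout=timedelta(minutes=5))

# Option 2: poll explicitly until the named indices are fully indexed.
table.wait_for_index(["vector_idx"], timeout=timedelta(minutes=5))
```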
## Summary by CodeRabbit
- **New Features**
- Added optional waiting for index creation completion with configurable
timeout.
- Introduced methods to poll and wait for indices to be fully built
across sync and async tables.
- Extended index creation APIs to accept a wait timeout parameter.
- **Bug Fixes**
- Added a new timeout error variant for improved error reporting on
index operations.
- **Tests**
- Added tests covering successful index readiness waiting, timeout
scenarios, and missing index cases.
This PR adds ColPali support with a ColPaliEmbeddings class (tagged
"colpali") using ColQwen2.5 for multi-vector text/image embeddings. It
also adds a MultiVector Pydantic type to handle the vector lists.
I've added some integration tests for the embedding model and some unit
tests for the new Pydantic type. This could be a template for other
ColPali variants as well, or serve until transformers🤗 starts
supporting it.
Still `TODO`:
- [ ] Documentation
- [ ] Add an example
_Could also allow Image as query, but didn't work well when testing it._
[ColPali-Engine](https://github.com/illuin-tech/colpali) version:
0.3.9.dev17+g3faee24
## Summary by CodeRabbit
- **New Features**
- Introduced support for ColPali-based multimodal multi-vector
embeddings for both text and images.
- Added a new embedding class for generating multi-vector embeddings,
configurable for various model and processing options.
- Added a new Pydantic type for multi-vector embeddings, supporting
validation and schema generation for lists of fixed-dimension vectors.
- **Bug Fixes**
- Ensured proper asynchronous index creation in query tests for improved
reliability.
- **Tests**
- Added integration tests for ColPali embeddings, including
text-to-image search and validation of multi-vector fields.
- Added comprehensive tests for the new multi-vector Pydantic type,
covering schema, validation, and default value behavior.
- **Chores**
- Updated optional dependencies to include the ColPali engine.
- Added utility to check for availability of flash attention support.
## Summary by CodeRabbit
- **New Features**
- Added the ability to prewarm (load into memory) table indexes via new
methods in Python, Node.js, and Rust APIs, potentially reducing
cold-start query latency.
- **Bug Fixes**
- Ensured prewarming an index does not interfere with subsequent search
operations.
- **Tests**
- Introduced new test cases to verify full-text search index creation,
prewarming, and search functionalities in both Python and Node.js.
- **Chores**
- Updated dependencies for improved compatibility and performance.
---------
Co-authored-by: Lu Qiu <luqiujob@gmail.com>
## Summary by CodeRabbit
- **Documentation**
- Updated the link in the documentation to correctly reference the
workflow file, ensuring accurate navigation from the current context.
Signed-off-by: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
## Summary by CodeRabbit
- **New Features**
- Enhanced full-text search capabilities with support for phrase
queries, fuzzy matching, boosting, and multi-column matching.
- Search methods now accept full-text query objects directly, improving
query flexibility and precision.
- Python and JavaScript SDKs updated to handle full-text queries
seamlessly, including async search support.
- **Tests**
- Added comprehensive tests covering fuzzy search, phrase search, and
boosted queries to ensure robust full-text search functionality.
- **Documentation**
- Updated query class documentation to reflect new constructor options
and removal of deprecated methods for clarity and simplicity.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This change allows the pylance dependency to be managed centrally,
without everybody needing to monitor for compatibility manually.
## Summary by CodeRabbit
- **New Features**
- Introduced an optional dependency that enhances development support.
Users can now benefit from improved static analysis capabilities when
installing the recommended version (0.23.2 or later).
This reverts commit a547c523c2 (#2281).
The current implementation can cause panics and performance
degradation. I will bring it back with more testing in
https://github.com/lancedb/lancedb/pull/2311
## Summary by CodeRabbit
- **Documentation**
- Enhanced clarity on read consistency settings with updated
descriptions and default behavior.
- Removed outdated warnings about eventual consistency from the
troubleshooting guide.
- **Refactor**
- Streamlined the handling of the read consistency interval across
integrations, now defaulting to "None" for improved performance.
- Simplified internal logic to offer a more consistent experience.
- **Tests**
- Updated test expectations to reflect the new default representation
for the read consistency interval.
- Removed redundant tests related to "no consistency" settings for
streamlined testing.
---------
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
I found some edge cases while running experiments: depending on the base reranking libraries, some of them don't handle empty lists well. This PR manually checks whether the result set to be reranked is empty.
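A minimal sketch of the guard this PR adds (names illustrative, not the exact implementation):
```python
def _maybe_rerank(reranker, query, results):
    # Some reranking libraries choke on empty inputs, so skip the call
    # entirely when there is nothing to reorder.
    if len(results) == 0:
        return results
    return reranker.rerank_vector(query, results)
```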
## Summary by CodeRabbit
- **Bug Fixes**
- Enhanced search result processing by ensuring that reordering only
occurs when valid, non-empty results are available, thereby preventing
unnecessary operations and potential errors.
- **Tests**
- Added automated tests to verify that empty search result sets are
handled correctly, ensuring consistent behavior across various
rerankers.
## Summary by CodeRabbit
- **Chores**
- Updated internal library dependencies to the latest beta version for
improved system stability.
- **Tests**
- Added automated tests to validate full-text search functionality on
list-based text fields.
- **Refactor**
- Enhanced the search processing logic to provide robust support for
list-type text data, ensuring more reliable results.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
This should prevent the failures we are seeing in the Node release.
## Summary by CodeRabbit
- **Chore**
- Enhanced the package deprecation process with improved security
measures, ensuring smoother and more reliable updates during package
deprecation.
This bug was introduced in https://github.com/lancedb/lancedb/pull/2281
Likely introduced during a rebase when fixing merge conflicts.
## Summary by CodeRabbit
- **Refactor**
- Updated the refresh process so that reloading now uses the existing
dataset version instead of automatically updating to the latest version.
This change may affect workflows that rely on immediate data updates
during refresh.
- **New Features**
- Introduced a new module for tracking I/O statistics in object store
operations, enhancing monitoring capabilities.
- Added a new test module to validate the functionality of the dataset
operations.
- **Bug Fixes**
- Reintroduced the `write_options` method in the `CreateTableBuilder`,
ensuring consistent functionality across different builder variants.
Closes #2287
## Summary by CodeRabbit
- **New Features**
- Added configurable timeout support for query executions. Users can now
specify maximum wait times for queries, enhancing control over
long-running operations across various integrations.
- **Tests**
- Expanded test coverage to validate timeout behavior in both
synchronous and asynchronous query flows, ensuring timely error
responses when query execution exceeds the specified limit.
- Introduced a new test suite to verify query operations when a timeout
is reached, checking for appropriate error handling.
The issue we fixed in https://github.com/lancedb/lancedb/pull/2296 was
caused by an upgrade in dependencies. This could have been caught if we
had run these CI jobs when we did the dependency change.
## Summary by CodeRabbit
- **Chores**
- Updated our automated pipeline to trigger additional stability checks
when dependency configurations change, ensuring smoother build and
release processes.
## Summary by CodeRabbit
- **Chores**
- Updated core dependency versions to v0.25.3-beta.2.
- Enabled additional functionality with a new "dynamodb" feature.
Support for this was missed in the `search()` API, and there were some Pydantic errors.
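A hedged sketch of the intended usage, assuming `MatchQuery` is importable from `lancedb.query`:
```python
from lancedb.query import MatchQuery

# Pass a structured full-text query object directly to search()
results = (
    table.search(MatchQuery("vector database", "text"), query_type="fts")
    .limit(10)
    .to_list()
)
```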
## Summary by CodeRabbit
- **New Features**
- Enhanced full-text search capabilities by incorporating additional
parameters, enabling more flexible query definitions.
- Extended table search functionality to support full-text queries
alongside existing search types.
- **Tests**
- Introduced new tests that validate both structured and conditional
full-text search behaviors.
- Expanded test coverage for various query types, including MatchQuery,
BoostQuery, MultiMatchQuery, and PhraseQuery.
- **Bug Fixes**
- Fixed a logic issue in query processing to ensure correct handling of
full-text search queries.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Fixes #2255
## Summary by CodeRabbit
- **Chores**
- Enhanced the build process to improve performance and reliability
across Linux platforms.
- Updated environment settings for more accurate compiler integration.
- Activated previously inactive build configurations to support advanced
feature support.
- Added support for the x86_64 architecture on Linux systems utilizing
the musl C library.
We recently allowed this for Lance, but there was a check in lancedb as well that was preventing it.
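A minimal sketch with the sync Python API (`db`, the table, and the column names are illustrative):
```python
import pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.binary(16)),  # fixed-size binary column
    pa.field("v", pa.float32()),
])
table = db.create_table("t", schema=schema)

# Fixed-size binary columns can now take a scalar (B-tree) index
table.create_scalar_index("id")
```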
## Summary by CodeRabbit
- **New Features**
- Added support for indexing fixed-size binary data using B-tree
structures for efficient data storage and retrieval.
- **Tests**
- Implemented automated tests to ensure the new binary indexing works
correctly and meets the expected configuration.
## Summary by CodeRabbit
- **Chores**
- Updated dependency versions for improved performance and
compatibility.
- **New Features**
- Added support for structured full-text search with expanded query
types (e.g., match, phrase, boost, multi-match) and flexible input
formats.
- Introduced a new method to check server support for structural
full-text search features.
- Enhanced the query system with new classes and interfaces for handling
various full-text queries.
- Expanded the functionality of existing methods to accept more complex
query structures, including updates to method signatures.
- **Bug Fixes**
- Improved error handling and reporting for full-text search queries.
- **Refactor**
- Enhanced query processing with streamlined input handling and improved
error reporting, ensuring more robust and consistent search results
across platforms.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: BubbleCal <bubble-cal@outlook.com>
Fix restore to always check out the latest version, following the local restore API implementation:
a1d1833a40/rust/lancedb/src/table.rs (L1910)
Otherwise:
- `table.create_table` -> version 1
- `table.add` -> version 2
- `table.checkout(1)`, then `table.restore()` -> the version remains at 1 (restore should call `checkout_latest` internally to update to the latest version and allow write operations)
- `table.checkout_latest()` -> version is 3, and write operations can proceed

The intended flow after this fix is sketched below.
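A minimal sketch with the sync Python API (`new_rows` is placeholder data):
```python
table.checkout(1)    # pin the table at version 1 (read-only)
table.restore()      # restore v1 as the new latest version and check out latest
table.add(new_rows)  # write operations now succeed immediately
```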
Add an analyze-plan API that executes queries and reports runtime metrics. This helps identify query I/O overhead and diagnose query slowness.
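A minimal sketch with the async Python API, assuming the new method lands as `analyze_plan()` on the async query builder:
```python
import asyncio
import lancedb

async def main():
    db = await lancedb.connect_async("./my_db")
    table = await db.open_table("my_table")
    # Executes the query and reports runtime metrics (e.g. I/O counts,
    # elapsed times) alongside the plan, instead of just the logical plan.
    plan = await table.query().nearest_to([0.1, 0.2, 0.3]).analyze_plan()
    print(plan)

asyncio.run(main())
```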
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.
Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users' expectations.
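For reference, a minimal sketch of configuring the interval explicitly with the sync Python API:
```python
from datetime import timedelta

import lancedb

# Check for a new table version at most every 5 seconds; the background
# refresh means reads no longer block on a write lock.
db = lancedb.connect("./my_db", read_consistency_interval=timedelta(seconds=5))

# Strong consistency: every read blocks until the latest version is loaded.
db_strong = lancedb.connect("./my_db", read_consistency_interval=timedelta(0))
```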
This fixes an issue for people wishing to use different kinds of rerankers in lancedb via AnswerDotAI rerankers. Currently, the arguments are passed sequentially, but they don't match the [Reranker class implementation](d604a8c47d/rerankers/reranker.py (L179)): the second argument is expected to be an optional "lang" for default models, while model_type should be passed explicitly.
The one-line change in this PR fixes this and enables the use of other methods (e.g. LLMs-as-rerankers).
* Direct users to cloud.lancedb.com since LanceDB Cloud is in public beta
* Remove `cast vector dimension` from alter columns, as we don't support it
This PR adds a `to_query_object` method to the various query builders
(except not hybrid queries yet). This makes it possible to inspect the
query that is built.
In addition this PR does some normalization between the sync and async
query paths. A few custom defaults were removed in favor of None (with
the default getting set once, in rust).
Also, the synchronous `to_batches` method now actually streams results, and the remote API now defaults to prefiltering.
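A hedged sketch of the inspection hook, assuming it is exposed as `to_query_object()` on the sync builders:
```python
# Build a query but inspect it instead of executing it
query = (
    table.search([0.1, 0.2, 0.3])
    .where("price > 10")
    .limit(5)
    .to_query_object()
)
# The returned object exposes the parameters that would be sent to Rust
# (vector, filter, limit, ...) before execution.
print(query)
```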
This PR fixes build issues associated with `aws-lc-rs`, while
simplifying the build process. Previously, we used custom scripts for
the musl and Windows ARM builds. These were complicated and prone to
breaking. This PR switches to a setup that mirrors
https://github.com/napi-rs/package-template/blob/main/.github/workflows/CI.yml.
* linux glibc and musl builds now use the Docker images provided by the
napi project
* Windows ARM build now just cross compiles from Windows x64, which
turns out to work quite well.
- Adds `loss` to the index stats for vector indices
- `optimize` can now retrain the vector index
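A hedged sketch (stat and method names illustrative, not confirmed API):
```python
# Retraining the vector index can now happen as part of optimize
table.optimize()

# Vector index stats now include a `loss` entry
stats = table.index_stats("vector_idx")  # hypothetical index name
print(stats)
```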
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
We soon won't rely on cross-compiling from Linux to Windows, so we can remove this check. Instead, check that we can cross-compile from Windows between architectures.
`object_store` already hard codes `rustls` as the TLS implementation, so
we have been shipping a mix of `rustls` and `openssl`. For simplicity of
builds, we should consolidate to one, and that has to be `rustls`.
vectordb is deprecated, and these platforms are particularly difficult
to maintain. Removing now to prevent further headaches.
We will keep these platforms supported on `@lancedb/lancedb`.
The `ConnectRequest` has a set of properties that only make sense for
listing databases / catalogs and a set of properties that only make
sense for remote databases.
This PR reduces all options to a single `HashMap<String, String>`. This
makes it easier to add new database / catalog implementations and makes
it clearer to users which options are applicable in which situations.
I don't believe there are any breaking changes here. The closest thing
is that I placed the `ConnectBuilder` methods `api_key`, `region`, and
`host_override` behind a `remote` feature gate. This is not strictly
needed and I could remove the feature gate but it seemed appropriate.
Since using these methods without the remote feature would have been
meaningless I don't feel this counts as a breaking change.
We could look at removing these methods entirely from the
`ConnectBuilder` (and encouraging users to use `RemoteDatabaseOptions`
instead) but I'm not sure how I feel about that.
Another approach we could take is to move these methods into a
`RemoteConnectBuilderExt` trait (and there could be a similar
`ListingConnectBuilderExt` trait to add methods for the listing database
/ catalog).
For now though my main goal is to simplify `ConnectRequest` as much as
possible (I see this being part of the key public API for database /
catalog integrations, similar to the `BaseTable`, `Catalog`, and
`Database` traits and I'd like it to be simple).
Closes #2114
Starting in #1965, we no longer pass the table schema into
`pa.Table.from_pylist()`. This means PyArrow is choosing the order of
the struct subfields, and apparently it does them in alphabetical order.
This is fine in theory, since in Lance we support providing fields in
any order. However, before we pass it to Lance, we call
`pa.Table.cast()` to align column types to the table types.
`pa.Table.cast()` is strict about field order, so we need to create a
cast target schema that aligns with the input data. We were doing this
at the top-level fields, but weren't doing this in nested fields. This
PR adds support to do this for nested ones.
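Roughly the failure mode, sketched with PyArrow (field names illustrative):
```python
import pyarrow as pa

# Target table schema orders the struct subfields ["y", "x"]
target = pa.schema([
    pa.field("point", pa.struct([
        pa.field("y", pa.float64()),
        pa.field("x", pa.float64()),
    ])),
])

# Without a schema argument, PyArrow picks the subfield order itself
data = pa.Table.from_pylist([{"point": {"x": 1.0, "y": 2.0}}])

# cast() is strict about field order, so this can raise even though the
# same subfields are present; the fix reorders nested fields in the cast
# target to match the input data first.
data.cast(target)
```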
Hello LanceDB team,
while developing with `lancedb` as a library, I encountered a typing problem affecting IDE hints and completions during development.
---
## Current Situation
Currently, the abstract base class `lancedb.query:LanceQueryBuilder`
uses method chaining to build up the search parameters, where the
methods have `LanceQueryBuilder` as a return type hint.
This leads to two issues:
1. Implementing subclasses of `LanceQueryBuilder` need to override
methods to modify the return type hint, even when they don't need to
change its implementation, just to ensure adequate IDE hints and
completions.
2. When using method chaining, the first method directly inherited from the abstract `LanceQueryBuilder` causes the inferred type to switch back to `LanceQueryBuilder`. So even when the type starts from `lancedb.table:LanceTable.search(query_type="vector", ...)` and is therefore correctly inferred as `LanceVectorQueryBuilder`, after calling e.g. `LanceVectorQueryBuilder.limit(...)` it is seen as the abstract `LanceQueryBuilder` from that point on.
### Example of current situation

## Proposed changes
I propose to change the return type hints of the corresponding methods
(including classmethod `create()`) in the abstract base class
`LanceQueryBuilder` from `LanceQueryBuilder` to `Self`.
`Self` is already imported in the module:
```py
import sys

if sys.version_info >= (3, 11):
    from typing import Self
else:
    from typing_extensions import Self
```
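A condensed sketch of the proposed change (method body illustrative):
```python
class LanceQueryBuilder:
    def limit(self, limit: int) -> Self:  # previously -> "LanceQueryBuilder"
        self._limit = limit
        return self

class LanceVectorQueryBuilder(LanceQueryBuilder):
    ...

# With `Self`, chaining preserves the concrete type: calling .limit(...)
# on a LanceVectorQueryBuilder is still inferred as LanceVectorQueryBuilder.
```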
### Further possible changes
Additionally, the implementing subclasses could also change the return
type hints to `Self` to potentially allow for further inheritance
easily.
> [!NOTE]
> **However this is not part of this pull request as of writing.**
### Example after proposed changes

---
Best regards
Martin
Previously, users could only specify new data types in `alterColumns` as
strings:
```ts
await tbl.alterColumns([
  {
    path: "price",
    dataType: "float",
  },
]);
```
But this has some problems:
1. It wasn't clear what the valid types were.
2. It was impossible to specify nested types, like lists and vector columns.
This PR changes it to take an Arrow data type, similar to how the Python
API works. This allows casting vector types:
```ts
await tbl.alterColumns([
  {
    path: "vector",
    dataType: new arrow.FixedSizeList(
      2,
      new arrow.Field("item", new arrow.Float16(), false),
    ),
  },
]);
```
Closes #2185
Hello LanceDB team,
---
I have fixed a discrepancy in the class docstring of `lancedb.embeddings.base:EmbeddingFunction` and made that docstring internally consistent.
### Changes made
1. The docstring referred to the abstract method
`get_source_embeddings()`.
This method does not exist in the repository at the current state.
I have changed the mention to refer to the actual abstract method
`compute_source_embeddings()`.
2. I also made the ordered list describing the methods to be implemented by concrete embedding functions internally consistent.
---
Thank you for developing this useful library. 👍
Best regards
Martin
We attempted to make pylance optional in
https://github.com/lancedb/lancedb/pull/2156 but it appears this did not
quite work. Users are unable to use lancedb from a fresh install. This
reverts the optional-ness so we can get back in a working state while we
fix the issue.
Prior to this commit, issuing drop_all_tables on a listing database with
an external manifest store would delete physical tables but leave
references behind in the manifest store. The table drop would succeed,
but subsequent creation of a table with the same name would fail with a
conflict.
With this patch, the external manifest store is updated to account for
the dropped tables so that dropped table names can be reused.
@wjones127 is there a standard way you guys set up your virtualenv? I can either relist all the dependencies in the pyright pre-commit section, or specify a venv, or the user has to be in the virtual environment when they run git commit. If the venv location were standardized, or a Python manager like `uv` were used, it would be easier to avoid duplicating the pyright dependency list.
Per your suggestion, I added all the passing files to the `includes` section in `pyproject.toml`.
For ruff I upgraded the version and removed "TCH" which doesn't exist as
an option.
I added a `pyright_report.csv` which contains a list of all files sorted
by pyright errors ascending as a todo list to work on.
I fixed about 30 issues in `table.py` stemming from plain strs being passed into methods that require a value from a set of string `Literal`s; I extracted those literal types into `types.py`.
Can you verify in the rust bridge that the schema should be a property
and not a method here? If it's a method, then there's another place in
the code where `inner.schema` should be `inner.schema()`
``` python
class RecordBatchStream:
    @property
    def schema(self) -> pa.Schema: ...
```
Also unless the `_lancedb.pyi` file is wrong, then there is no
`__anext__` here for `__inner` when it's not an `AsyncGenerator` and
only `next` is defined:
``` python
# current implementation:
async def __anext__(self) -> pa.RecordBatch:
    return await self._inner.__anext__()

# suggested: fall back to next() when _inner is not an AsyncGenerator
async def __anext__(self) -> pa.RecordBatch:
    if isinstance(self._inner, AsyncGenerator):
        batch = await self._inner.__anext__()
    else:
        batch = await self._inner.next()
    if batch is None:
        raise StopAsyncIteration
    return batch
```
In the else branch, `_inner` is a `RecordBatchStream`:
```python
class RecordBatchStream:
    @property
    def schema(self) -> pa.Schema: ...
    async def next(self) -> Optional[pa.RecordBatch]: ...
```
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
DataFusion makes the batch size available as part of the `SessionState`. We should use that to set the `max_batch_length` property in the `QueryExecutionOptions`.
This PR makes it possible to create a table using an asynchronous stream
of input data. Currently only a synchronous iterator is supported. There
are a number of follow-ups not yet tackled:
* Support for embedding functions (the embedding functions wrapper needs
to be re-written to be async, should be an easy lift)
* Support for async input into the remote table (the make_ipc_batch
needs to change to accept async input, leaving undone for now because I
think we want to support actual streaming uploads into the remote table
soon)
* Support for async input into the add function (pretty essential, but
it is a fairly distinct code path, so saving for a different PR)
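A conceptual sketch of what this enables at the SDK level, assuming async input is eventually plumbed through to `create_table` (this PR targets the Rust layer):
```python
import asyncio

import pyarrow as pa
import lancedb

async def batches():
    # An asynchronous stream of Arrow batches as table input
    yield pa.RecordBatch.from_pydict({"id": [1, 2]})
    yield pa.RecordBatch.from_pydict({"id": [3]})

async def main():
    db = await lancedb.connect_async("./my_db")
    table = await db.create_table("items", data=batches())

asyncio.run(main())
```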
It seems that `RecordBatch::with_schema` is unable to remove schema
metadata from a batch. It fails with the error `target schema is not
superset of current schema`.
I'm not sure how the `test_metadata_erased` test is passing. Strangely, the metadata was not present by the time the batch arrived at the metadata eraser. I think the schema metadata may only be present in the batch if there is a filter.
I've created a new unit test that makes sure the metadata is erased when we have a filter as well.
In earlier PRs (#1886, #1191) we made the default limit 10 regardless of
the query type. This was confusing for users and in many cases a
breaking change. Users would have queries that used to return all
results, but instead only returned the first 10, causing silent bugs.
Part of the cause was consistency: the Python sync API seems to have
always had a limit of 10, while newer APIs (Python async and Nodejs)
didn't.
This PR sets the default limit only for searches (vector search, FTS),
while letting scans (even with filters) be unbounded. It does this
consistently for all SDKs.
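Sketched with the sync Python API (the table and filter are illustrative):
```python
# Scans, even with a filter, are now unbounded by default
all_matching = table.search().where("price > 10").to_list()

# Searches (vector, FTS) still default to a limit of 10
top10 = table.search([0.1, 0.2, 0.3]).to_list()
top100 = table.search([0.1, 0.2, 0.3]).limit(100).to_list()
```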
Fixes #1983. Fixes #1852. Fixes #2141.
<img src="https://github.com/user-attachments/assets/92dad0a2-2a37-4ce1-b783-0d1b4f30a00c" alt="LanceDB Cloud Public Beta" width="100%" style="max-width: 100%;">
[**How to Install** ](#how-to-install) ✦ [**Detailed Documentation**](https://lancedb.github.io/lancedb/) ✦ [**Tutorials and Recipes**](https://github.com/lancedb/vectordb-recipes/tree/main) ✦ [**Contributors**](#contributors)
**The ultimate multimodal data platform for AI/ML applications.**
LanceDB is designed for fast, scalable, and production-ready vector search. It is built on top of the Lance columnar format. You can store, index, and search over petabytes of multimodal data and vectors with ease.
LanceDB is a central location where developers can build, train and analyze their AI workloads.
## **Demo: Multimodal Search by Keyword, Vector or with SQL**
## **Key Features**:
- **Fast Vector Search**: Search billions of vectors in milliseconds with state-of-the-art indexing.
- **Comprehensive Search**: Support for vector similarity search, full-text search and SQL.
- **Multimodal Support**: Store, query and filter vectors, metadata and multimodal data (text, images, videos, point clouds, and more).
- **Advanced Features**: Zero-copy, automatic versioning, manage versions of your data without needing extra infrastructure. GPU support in building vector index.
### **Products**:
- **Open Source & Local**: 100% open source, runs locally or in your cloud. No vendor lock-in.
- **Cloud and Enterprise**: Production-scale vector search with no servers to manage. Complete data sovereignty and security.
### **Ecosystem**:
- **Columnar Storage**: Built on the Lance columnar format for efficient storage and analytics.
- **Seamless Integration**: Python, Node.js, Rust, and REST APIs for easy integration. Native Python and Javascript/Typescript support.
- **Rich Ecosystem**: Integrations with [**LangChain** 🦜️🔗](https://python.langchain.com/docs/integrations/vectorstores/lancedb/), [**LlamaIndex** 🦙](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/LanceDBIndexDemo.html), Apache-Arrow, Pandas, Polars, DuckDB and more on the way.
## **How to Install**:
Follow the [Quickstart](https://lancedb.github.io/lancedb/basic/) doc to set up LanceDB locally.
LanceDB's core is written in Rust 🦀 and is built using <a href="https://github.com/lancedb/lance">Lance</a>, an open-source columnar format designed for performant ML workloads.
**API & SDK:** We also support Python, Typescript and Rust SDKs
If you have any suggestions or feature requests, please feel free to open an issue on GitHub or discuss it on our [**Discord**](https://discord.gg/G5DcmnZWKB) server.
[**Check out the GitHub Issues**](https://github.com/lancedb/lancedb/issues) if you would like to work on the features that are planned for the future.
```python
import argparse

parser = argparse.ArgumentParser(description="Set the version of the Lance package.")
parser.add_argument(
    "version",
    type=str,
    help="The version to set for the Lance package. Use 'stable' for the latest "
    "stable version, 'preview' for latest preview version, or a specific version "
    "number (e.g., '0.1.0'). You can also specify 'local' to use a local path.",
)
```
📚 Starting June 1st, 2025, please use <a href="https://lancedb.github.io/documentation" target="_blank" rel="noopener noreferrer">lancedb.github.io/documentation</a> for the latest docs.
Lance supports the `IVF_PQ` index type by default.
The following IVF_PQ parameters can be specified:
- **distance_type**: The distance metric to use. By default it uses euclidean distance "`l2`". We also support "cosine" and "dot" distance as well.
- **num_partitions**: The number of partitions in the index. The default is the square root of the number of rows.
Product quantization can lead to approximately `16 * sizeof(float32) / 1 = 64` times compression.
`num_partitions` is used to decide how many partitions the first-level `IVF` index uses. A higher number of partitions could lead to more efficient I/O during queries and better accuracy, but it takes much more time to train. On the `SIFT-1M` dataset, our benchmark shows that keeping each partition at 4K-8K rows leads to a good latency / recall trade-off.
`num_sub_vectors` specifies how many Product Quantization (PQ) short codes to generate for each vector. The number should be a factor of the vector dimension. Because PQ is a lossy compression of the original vector, a higher `num_sub_vectors` usually results in less space distortion, and thus better accuracy.
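For example, with the sync Python API (parameter values illustrative):
```python
table.create_index(
    metric="l2",         # distance_type; "cosine" and "dot" also work
    num_partitions=256,  # defaults to roughly sqrt(num_rows)
    num_sub_vectors=16,  # must be a factor of the vector dimension
)
```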
LanceDB Cloud is a SaaS (software-as-a-service) solution that runs serverless in the cloud, clearly separating storage from compute. It's designed to be highly scalable without breaking the bank.
[Try out LanceDB Cloud (Public Beta)](https://cloud.lancedb.com){ .md-button .md-button--primary }
As mentioned, after creating embeddings, each data point is represented as a vector. Points that are close to each other in vector space are considered similar (or appear in similar contexts), and points that are far away are considered dissimilar. To quantify this closeness, we use distance as a metric, which can be measured in the following ways:
1. **Euclidean Distance (l2)**: It calculates the straight-line distance between two points (vectors) in a multidimensional space.
2. **Cosine Similarity**: It measures the cosine of the angle between two vectors, providing a normalized measure of similarity based on their direction.
3. **Dot product**: It is calculated as the sum of the products of their corresponding components. To measure relatedness it considers both the magnitude and direction of the vectors.
LanceDB provides language APIs, allowing you to embed a database in your language of choice.
* 👾 [JavaScript](examples_js.md) examples
* 🦀 Rust examples (coming soon)
## Python Applications powered by LanceDB
| Project Name | Description |
| --- | --- |
| **Ultralytics Explorer 🚀**<br>[](https://docs.ultralytics.com/datasets/explorer/)<br>[](https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/docs/en/datasets/explorer/explorer.ipynb) | - 🔍 **Explore CV Datasets**: Semantic search, SQL queries, vector similarity, natural language.<br>- 🖥️ **GUI & Python API**: Seamless dataset interaction.<br>- ⚡ **Efficient & Scalable**: Leverages LanceDB for large datasets.<br>- 📊 **Detailed Analysis**: Easily analyze data patterns.<br>- 🌐 **Browser GUI Demo**: Create embeddings, search images, run queries. |
| **Website Chatbot🤖**<br>[](https://github.com/lancedb/lancedb-vercel-chatbot)<br>[](https://vercel.com/new/clone?repository-url=https%3A%2F%2Fgithub.com%2Flancedb%2Flancedb-vercel-chatbot&env=OPENAI_API_KEY&envDescription=OpenAI%20API%20Key%20for%20chat%20completion.&project-name=lancedb-vercel-chatbot&repository-name=lancedb-vercel-chatbot&demo-title=LanceDB%20Chatbot%20Demo&demo-description=Demo%20website%20chatbot%20with%20LanceDB.&demo-url=https%3A%2F%2Flancedb.vercel.app&demo-image=https%3A%2F%2Fi.imgur.com%2FazVJtvr.png) | - 🌐 **Chatbot from Sitemap/Docs**: Create a chatbot using site or document context.<br>- 🚀 **Embed LanceDB in Next.js**: Lightweight, on-prem storage.<br>- 🧠 **AI-Powered Context Retrieval**: Efficiently access relevant data.<br>- 🔧 **Serverless & Native JS**: Seamless integration with Next.js.<br>- ⚡ **One-Click Deploy on Vercel**: Quick and easy setup. |
## Nodejs Applications powered by LanceDB
| Project Name | Description |
| --- | --- |
| **Langchain Writing Assistant ✍️**<br>[](https://github.com/lancedb/vectordb-recipes/tree/main/applications/node/lanchain_writing_assistant) | - **📂 Data Source Integration**: Use your own data by specifying data source file, and the app instantly processes it to provide insights. <br>- **🧠 Intelligent Suggestions**: Powered by LangChain.js and LanceDB, it improves writing productivity and accuracy. <br>- **💡 Enhanced Writing Experience**: It delivers real-time contextual insights and factual suggestions while the user writes. |
!!! tip "Hosted LanceDB"
If you want S3 cost-efficiency and local performance via a simple serverless API, check out **LanceDB Cloud**. For private deployments, high performance at extreme scale, or if you have strict security requirements, talk to us about **LanceDB Enterprise**. [Learn more](https://docs.lancedb.com/)
Late interaction is a technique used in retrieval that calculates the relevance of a query to a document by comparing their multi-vector representations. The key difference between late interaction and other popular methods:

[ Illustration from https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/]
<b>No interaction:</b> Refers to independently embedding the query and document, which are then compared to calculate similarity without any interaction between them. This is typically used in vector search operations.
<b>Partial interaction:</b> Refers to a specific approach where the similarity computation happens primarily between query vectors and document vectors, without extensive interaction between individual components of each. An example of this is dual-encoder models like BERT.
<b>Early full interaction:</b> Refers to techniques like cross-encoders that process query and docs in pairs with full interaction across various stages of encoding. This is a powerful, but relatively slower technique. Because it requires processing query and docs in pairs, doc embeddings can't be pre-computed for fast retrieval. This is why cross-encoders are typically used as reranking models combined with vector search. Learn more about [LanceDB Reranking support](https://lancedb.github.io/lancedb/reranking/).
<b>Late interaction:</b> A technique that computes the doc and query representations independently; the interaction or evaluation happens during the retrieval process. This is typically used in retrieval models like ColBERT. Unlike early interaction, it allows speeding up the retrieval process without compromising the depth of semantic analysis.
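An illustrative NumPy sketch of ColBERT-style late interaction (MaxSim) scoring, not LanceDB's internal implementation:
```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # For each query token embedding, take its best match among the
    # document token embeddings, then sum over query tokens.
    sims = query_vecs @ doc_vecs.T  # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())

query = np.random.rand(8, 128)   # 8 query token embeddings
doc = np.random.rand(300, 128)   # 300 document token embeddings
print(maxsim_score(query, doc))
```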
## Internals of ColBERT
Let's take a look at the steps involved in performing late interaction based retrieval using ColBERT:
• ColBERT employs BERT-based encoders for both queries `(fQ)` and documents `(fD)`
• A single BERT model is shared between query and document encoders and special tokens distinguish input types: `[Q]` for queries and `[D]` for documents
**Query Encoder (fQ):**
• Query q is tokenized into WordPiece tokens: `q1, q2, ..., ql`. `[Q]` token is prepended right after BERT's `[CLS]` token
• If query length < Nq, it's padded with [MASK] tokens up to Nq.
LanceDB is an open-source vector database for AI that's designed to store, manage, query and retrieve embeddings on large-scale multi-modal data.
Both the database and the underlying data format are designed from the ground up to be **easy-to-use**, **scalable** and **cost-effective**.
!!! tip "Hosted LanceDB"
If you want S3 cost-efficiency and local performance via a simple serverless API, check out **LanceDB Cloud**. For private deployments, high performance at extreme scale, or if you have strict security requirements, talk to us about **LanceDB Enterprise**. [Learn more](https://docs.lancedb.com/)

## Truly multi-modal
LanceDB **Cloud** is a SaaS (software-as-a-service) solution that runs serverless in the cloud, making the storage clearly separated from compute. It's designed to be cost-effective and highly scalable without breaking the bank.
[Try out LanceDB Cloud (Public Beta) Now](https://cloud.lancedb.com){ .md-button .md-button--primary }
```ts
// vertexAI provides the textEmbedding004 embedder
vertexAI(),
// the local vector store requires an embedder to translate from text to vector
lancedb([
  {
    dbUri: '.db', // optional lancedb uri, defaults to .db
    tableName: 'table', // optional table name, defaults to table
    embedder: textEmbedding004,
  },
]),
],
});
```
You can run this app with the following command:
```bash
genkit start -- tsx --watch src/index.ts
```
This will add LanceDB as a retriever and indexer to the genkit instance. You can see it in the GUI view:
<img width="1710" alt="Screenshot 2025-05-11 at 7 21 05 PM" src="https://github.com/user-attachments/assets/e752f7f4-785b-4797-a11e-72ab06a531b7"/>
**Testing retrieval on a sample table**
Let's see the raw retrieval results
<img width="1710" alt="Screenshot 2025-05-11 at 7 21 05 PM" src="https://github.com/user-attachments/assets/b8d356ed-8421-4790-8fc0-d6af563b9657"/>
On running this query, you'll see 5 results fetched from the LanceDB table, where each result looks something like this:
<img width="1417" alt="Screenshot 2025-05-11 at 7 21 18 PM" src="https://github.com/user-attachments/assets/77429525-36e2-4da6-a694-e58c1cf9eb83"/>
## Creating a custom RAG flow
Now that we've seen how you can use LanceDB in a genkit pipeline, let's refine the flow and create a RAG. A RAG flow consists of an indexer and a retriever, with the retriever's outputs post-processed and fed into an LLM for the final response.
### Creating custom indexer flows
You can also create custom indexer flows, utilizing more options and features provided by LanceDB.
```ts
export const menuPdfIndexer = lancedbIndexerRef({
  // Using all defaults, for dbUri, tableName, and embedder, etc.
});
```