Closes#2287
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added configurable timeout support for query executions. Users can now
specify maximum wait times for queries, enhancing control over
long-running operations across various integrations.
- **Tests**
- Expanded test coverage to validate timeout behavior in both
synchronous and asynchronous query flows, ensuring timely error
responses when query execution exceeds the specified limit.
- Introduced a new test suite to verify query operations when a timeout
is reached, checking for appropriate error handling.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
missed to support it in `search()` API and there were some pydantic
errors
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Enhanced full-text search capabilities by incorporating additional
parameters, enabling more flexible query definitions.
- Extended table search functionality to support full-text queries
alongside existing search types.
- **Tests**
- Introduced new tests that validate both structured and conditional
full-text search behaviors.
- Expanded test coverage for various query types, including MatchQuery,
BoostQuery, MultiMatchQuery, and PhraseQuery.
- **Bug Fixes**
- Fixed a logic issue in query processing to ensure correct handling of
full-text search queries.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
We recently allowed this for lance but there was a check in lancedb as
well that was preventing it
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **New Features**
- Added support for indexing fixed-size binary data using B-tree
structures for efficient data storage and retrieval.
- **Tests**
- Implemented automated tests to ensure the new binary indexing works
correctly and meets the expected configuration.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- **Chores**
- Updated dependency versions for improved performance and
compatibility.
- **New Features**
- Added support for structured full-text search with expanded query
types (e.g., match, phrase, boost, multi-match) and flexible input
formats.
- Introduced a new method to check server support for structural
full-text search features.
- Enhanced the query system with new classes and interfaces for handling
various full-text queries.
- Expanded the functionality of existing methods to accept more complex
query structures, including updates to method signatures.
- **Bug Fixes**
- Improved error handling and reporting for full-text search queries.
- **Refactor**
- Enhanced query processing with streamlined input handling and improved
error reporting, ensuring more robust and consistent search results
across platforms.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: BubbleCal <bubble-cal@outlook.com>
add analyze plan api to allow executing the queries and see runtime
metrics.
Which help identify the query IO overhead and help identify query
slowness
Previously, when we loaded the next version of the table, we would block
all reads with a write lock. Now, we only do that if
`read_consistency_interval=0`. Otherwise, we load the next version
asynchronously in the background. This should mean that
`read_consistency_interval > 0` won't have a meaningful impact on
latency.
Along with this change, I felt it was safe to change the default
consistency interval to 5 seconds. The current default is `None`, which
means we will **never** check for a new version by default. I think that
default is contrary to most users expectations.
This fixes an issue for people wishing to use different kinds of
rerankers in lancedb via AnswerDotAI rerankers. Currently, the arguments
are passed sequentially, but they don't match the[Reranker class
implementation](d604a8c47d/rerankers/reranker.py (L179)):
the second argument is expected to be an optional "lang" for default
models, while model_type should be passed explicitly.
The one line changes in this PR fixes it and enables the use of other
methods (eg LLMs-as-rerankers)
This PR adds a `to_query_object` method to the various query builders
(except not hybrid queries yet). This makes it possible to inspect the
query that is built.
In addition this PR does some normalization between the sync and async
query paths. A few custom defaults were removed in favor of None (with
the default getting set once, in rust).
Also, the synchronous to_batches method will now actually stream results
Also, the remote API now defaults to prefiltering
- adds `loss` into the index stats for vector index
- now `optimize` can retrain the vector index
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
`object_store` already hard codes `rustls` as the TLS implementation, so
we have been shipping a mix of `rustls` and `openssl`. For simplicity of
builds, we should consolidate to one, and that has to be `rustls`.
Closes#2114
Starting in #1965, we no longer pass the table schema into
`pa.Table.from_pylist()`. This means PyArrow is choosing the order of
the struct subfields, and apparently it does them in alphabetical order.
This is fine in theory, since in Lance we support providing fields in
any order. However, before we pass it to Lance, we call
`pa.Table.cast()` to align column types to the table types.
`pa.Table.cast()` is strict about field order, so we need to create a
cast target schema that aligns with the input data. We were doing this
at the top-level fields, but weren't doing this in nested fields. This
PR adds support to do this for nested ones.
Hello LanceDB team,
while developing using `lancedb` as a library I encountered a typing
problem affecting IDE hints and completions during development.
---
## Current Situation
Currently, the abstract base class `lancedb.query:LanceQueryBuilder`
uses method chaining to build up the search parameters, where the
methods have `LanceQueryBuilder` as a return type hint.
This leads to two issues:
1. Implementing subclasses of `LanceQueryBuilder` need to override
methods to modify the return type hint, even when they don't need to
change its implementation, just to ensure adequate IDE hints and
completions.
2. When using method chaining the first method directly inherited from
the abstract `LanceQueryBuilder` causes the inferred type to switch back
to `LanceQueryBuilder`. So even when the type starts from
`lancdb.table:LanceTable.search(query_type="vector", ...)` and therefor
correctly is inferred as `LanceVectorQueryBuilder`, after calling e.g.
`LanceVectorQueryBuilder.limit(...)` it is seen as the abstract
`LanceQueryBuilder` from that point on.
### Example of current situation

## Proposed changes
I propose to change the return type hints of the corresponding methods
(including classmethod `create()`) in the abstract base class
`LanceQueryBuilder` from `LanceQueryBuilder` to `Self`.
`Self` is already imported in the module:
```py
if sys.version_info >= (3, 11):
from typing import Self
else:
from typing_extensions import Self
```
### Further possible changes
Additionally, the implementing subclasses could also change the return
type hints to `Self` to potentially allow for further inheritance
easily.
> [!NOTE]
> **However this is not part of this pull request as of writing.**
### Example after proposed changes

---
Best regards
Martin
Previously, users could only specify new data types in `alterColumns` as
strings:
```ts
await tbl.alterColumns([
path: "price",
dataType: "float"
]);
```
But this has some problems:
1. It wasn't clear what were valid types
2. It was impossible to specify nested types, like lists and vector
columns.
This PR changes it to take an Arrow data type, similar to how the Python
API works. This allows casting vector types:
```ts
await tbl.alterColumns([
{
path: "vector",
dataType: new arrow.FixedSizeList(
2,
new arrow.Field("item", new arrow.Float16(), false),
),
},
]);
```
Closes#2185
Hello LanceDB team,
---
I have fixed a discrepancy in the class docstring of
`lancedb.embeddings.base:EmbeddingFunction` and made consistency
alignments to that docstring.
### Changes made
1. The docstring referred to the abstract method
`get_source_embeddings()`.
This method does not exist in the repository at the current state.
I have changed the mention to refer to the actual abstract method
`compute_source_embeddings()`.
2. Also, I aligned the consistency within the ordered list which is
describing the methods to be implemented by concrete embedding
functions.
---
Thank you for developing this useful library. 👍
Best regards
Martin
We attempted to make pylance optional in
https://github.com/lancedb/lancedb/pull/2156 but it appears this did not
quite work. Users are unable to use lancedb from a fresh install. This
reverts the optional-ness so we can get back in a working state while we
fix the issue.
Prior to this commit, issuing drop_all_tables on a listing database with
an external manifest store would delete physical tables but leave
references behind in the manifest store. The table drop would succeed,
but subsequent creation of a table with the same name would fail with a
conflict.
With this patch, the external manifest store is updated to account for
the dropped tables so that dropped table names can be reused.
@wjones127 is there a standard way you guys setup your virtualenv? I can
either relist all the dependencies in the pyright precommit section, or
specify a venv, or the user has to be in the virtual environment when
they run git commit. If the venv location was standardized or a python
manager like `uv` was used it would be easier to avoid duplicating the
pyright dependency list.
Per your suggestion, in `pyproject.toml` I added in all the passing
files to the `includes` section.
For ruff I upgraded the version and removed "TCH" which doesn't exist as
an option.
I added a `pyright_report.csv` which contains a list of all files sorted
by pyright errors ascending as a todo list to work on.
I fixed about 30 issues in `table.py` stemming from str's being passed
into methods that required a string within a set of string Literals by
extracting them into `types.py`
Can you verify in the rust bridge that the schema should be a property
and not a method here? If it's a method, then there's another place in
the code where `inner.schema` should be `inner.schema()`
``` python
class RecordBatchStream:
@property
def schema(self) -> pa.Schema: ...
```
Also unless the `_lancedb.pyi` file is wrong, then there is no
`__anext__` here for `__inner` when it's not an `AsyncGenerator` and
only `next` is defined:
``` python
async def __anext__(self) -> pa.RecordBatch:
return await self._inner.__anext__()
if isinstance(self._inner, AsyncGenerator):
batch = await self._inner.__anext__()
else:
batch = await self._inner.next()
if batch is None:
raise StopAsyncIteration
return batch
```
in the else statement, `_inner` is a `RecordBatchStream`
```python
class RecordBatchStream:
@property
def schema(self) -> pa.Schema: ...
async def next(self) -> Optional[pa.RecordBatch]: ...
```
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
In earlier PRs (#1886, #1191) we made the default limit 10 regardless of
the query type. This was confusing for users and in many cases a
breaking change. Users would have queries that used to return all
results, but instead only returned the first 10, causing silent bugs.
Part of the cause was consistency: the Python sync API seems to have
always had a limit of 10, while newer APIs (Python async and Nodejs)
didn't.
This PR sets the default limit only for searches (vector search, FTS),
while letting scans (even with filters) be unbounded. It does this
consistently for all SDKs.
Fixes#1983Fixes#1852Fixes#2141
This also changes the pylance pin from `==0.23.2` to `~=0.23.2` which
should allow the pylance dependency to float a little. The pylance
dependency is actually not used for much anymore and so it should be
tolerant of patch changes.