* Test that we can insert subschemas (omit nullable columns) in Python.
* More work is needed to support this in Node. See:
https://github.com/lancedb/lancedb/issues/1832
* Test that we can insert data with nullable schema but no nulls in
non-nullable schema.
* Add `"null"` option for `on_bad_vectors` where we fill with null if
the vector is bad.
* Make null values not considered bad if the field itself is nullable.
Allows users to pass multiple query vector as part of a single query
plan. This just runs the queries in parallel without any further
optimization. It's mostly a convenience.
Previously, I think this was only handled by the sync Python remote API.
This makes it common across all SDKs.
Closes https://github.com/lancedb/lancedb/issues/1803
```python
>>> import lancedb
>>> import asyncio
>>>
>>> async def main():
... db = await lancedb.connect_async("./demo")
... table = await db.create_table("demo", [{"id": 1, "vector": [1, 2, 3]}, {"id": 2, "vector": [4, 5, 6]}], mode="overwrite")
... return await table.query().nearest_to([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [4.0, 5.0, 6.0]]).limit(1).to_pandas()
...
>>> asyncio.run(main())
query_index id vector _distance
0 2 2 [4.0, 5.0, 6.0] 0.0
1 1 2 [4.0, 5.0, 6.0] 0.0
2 0 1 [1.0, 2.0, 3.0] 0.0
```
* Replaces Python implementation of Remote SDK with Rust one.
* Drops dependency on `attrs` and `cachetools`. Makes `requests` an
optional dependency used only for embeddings feature.
* Adds dependency on `nest-asyncio`. This was required to get hybrid
search working.
* Deprecate `request_thread_pool` parameter. We now use the tokio
threadpool.
* Stop caching the `schema` on a remote table. Schema is mutable and
there's no mechanism in place to invalidate the cache.
* Removed the client-side resolution of the vector column. We should
already be resolving this server-side.
Sometimes it is acceptable to users to only search indexed data and skip
and new un-indexed data. For example, if un-indexed data will be shortly
indexed and they don't mind the delay. In these cases, we can save a lot
of CPU time in search, and provide better latency. Users can activate
this on queries using `fast_search()`.
BREAKING CHANGE: default tokenizer no longer does stemming or stop-word
removal. Users should explicitly turn that option on in the future.
- upgrade lance to 0.19.1
- update the FTS docs
- update the FTS API
Upstream change notes:
https://github.com/lancedb/lance/releases/tag/v0.19.1
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
## user story
fixes https://github.com/lancedb/lancedb/issues/1480https://github.com/invl/retry has not had an update in 8 years, one if
its sub-dependencies via requirements.txt
(https://github.com/pytest-dev/py) is no longer maintained and has a
high severity vulnerability (CVE-2022-42969).
retry is only used for a single function in the python codebase for a
deprecated helper function `with_embeddings`, which was created for an
older tutorial (https://github.com/lancedb/lancedb/pull/12) [but is now
deprecated](https://lancedb.github.io/lancedb/embeddings/legacy/).
## changes
i backported a limited range of functionality of the `@retry()`
decorator directly into lancedb so that we no longer have a dependency
to the `retry` package.
## tests
```
/Users/james/src/lancedb/python $ ruff check .
All checks passed!
/Users/james/src/lancedb/python $ pytest python/tests/test_embeddings.py
python/tests/test_embeddings.py .......s.... [100%]
================================================================ 11 passed, 1 skipped, 2 warnings in 7.08s ================================================================
```
* Adds nicer errors to remote SDK, that expose useful properties like
`request_id` and `status_code`.
* Makes sure the Python tracebacks print nicely by mapping the `source`
field from a Rust error to the `__cause__` field.
Resovles #1709. Adds `trust_remote_code` as a parameter to the
`TransformersEmbeddingFunction` class with a default of False. Updated
relevant documentation with the same.
BREAKING CHANGE: the return value of `index_stats` method has changed
and all `index_stats` APIs now take index name instead of UUID. Also
several deprecated index statistics methods were removed.
* Removes deprecated methods for individual index statistics
* Aligns public `IndexStatistics` struct with API response from LanceDB
Cloud.
* Implements `index_stats` for remote Rust SDK and Python async API.
- fixes https://github.com/lancedb/lancedb/issues/1697.
- unifies vector column inference logic for remote and local table to
prevent future disparities.
- Updates docstring in RemoteTable to specify empty queries are not
supported
Right now when passing vector and query explicitly for hybrid search ,
vector_column_name is not deduced.
(https://lancedb.github.io/lancedb/hybrid_search/hybrid_search/#hybrid-search-in-lancedb
). Because vector and query can be both none when initialising the
QueryBuilder in this case. This PR forces deduction of query type if it
is set to "hybrid"
- Enforce all rerankers always return _relevance_score. This was already
loosely done in tests before but based on user feedback its better to
always have _relevance_score present in all reranked results
- Deprecate LinearCombinationReranker in docs. And also fix a case where
it would not return _relevance_score if one result set was missing
- Add overloads to Table.search, to preserve the return information
of different types of QueryBuilder objects for LanceTable
- Fix fts_column type annotation by including making it `Optional`
resolves#1550
---------
Co-authored-by: sayandip-dutta <sayandip.dutta@nevaehtech.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
The `ratelimiter` package hasn't been updated in ages and is no longer
maintained. This PR removes the dependency on `ratelimiter` and replaces
it with a custom rate limiter implementation.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
The new V2 manifest path scheme makes discovering the latest version of
a table constant time on object stores, regardless of the number of
versions in the table. See benchmarks in the PR here:
https://github.com/lancedb/lance/pull/2798Closes#1583
first off, apologies for any folly since i'm new to contributing to
lancedb. this PR is the continuation of [a discord
thread](https://discord.com/channels/1030247538198061086/1030247538667827251/1278844345713299599):
## user story
here's the lance db search query i'd like to run:
```
def search(phrase):
logger.info(f'Searching for phrase: {phrase}')
phrase_embedding = get_embedding(phrase)
df = (table.search((phrase_embedding, phrase), query_type='hybrid')
.limit(10).to_list())
logger.info(f'Success search with row count: {len(df)}')
search('howdy (howdy)')
search('howdy(howdy)')
```
the second search fails due to `ValueError: Syntax Error: howdy(howdy)`
i saw on the
[docs](https://lancedb.github.io/lancedb/fts/#phrase-queries-vs-terms-queries)
that i can use `phrase_query()` to [enable a
flag](https://github.com/lancedb/lancedb/blob/main/python/python/lancedb/query.py#L790-L792)
to wrap the query in double quotes (as well as sanitize single quotes)
prior to sending the query to search. this works for [normal
FTS](https://lancedb.github.io/lancedb/fts/), but the command is
unavailable on [hybrid
search](https://lancedb.github.io/lancedb/hybrid_search/hybrid_search/).
## changes
i added `phrase_query()` function to `LanceHybridQueryBuilder` by
propagating the call down to its `self. _fts_query` object. i'm not too
familiar with the codebase and am not sure if this is the best way to
implement the functionality. feel free to riff on this PR or discard
## tests
```
(lancedb) JamesMPB:python james$ pwd
/Users/james/src/lancedb/python
(lancedb) JamesMPB:python james$ pytest python/tests/test_table.py
python/tests/test_table.py ....................................... [100%]
====================================================== 39 passed, 1 warning in 2.23s =======================================================
```