This PR aims to fix #2047 by doing the following things:
- Add a `distance_type` parameter to the sync query builders of the
Python SDK.
- Make `metric` an alias for `distance_type`.
Fixes #2031
When we do hybrid search, we normalize the scores. We do this
calculation in place, because the rerankers expect the `_distance` and
`_score` columns to be the normalized ones. So I've changed the logic to
restore the original distances and scores afterwards by matching on row ids.
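A rough sketch of the approach (column names and the normalization
helper are assumptions, not the actual internals): snapshot the original
values keyed by row id, let the reranker see the normalized columns,
then restore the originals by matching on row ids.

```python
import pyarrow as pa
import pyarrow.compute as pc

def rerank_with_restore(results: pa.Table, rerank) -> pa.Table:
    # Snapshot the original distances before normalizing in place.
    original = dict(zip(results["_rowid"].to_pylist(),
                        results["_distance"].to_pylist()))
    # Min-max normalize `_distance`; the reranker expects normalized columns.
    dist = results["_distance"]
    lo, hi = pc.min(dist).as_py(), pc.max(dist).as_py()
    if hi != lo:
        norm = pc.divide(pc.subtract(dist, lo), hi - lo)
        idx = results.schema.get_field_index("_distance")
        results = results.set_column(idx, "_distance", norm)
    reranked = rerank(results)
    # Restore the original distances by matching on row ids.
    restored = [original[rid] for rid in reranked["_rowid"].to_pylist()]
    idx = reranked.schema.get_field_index("_distance")
    return reranked.set_column(idx, "_distance", pa.array(restored, pa.float64()))
```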
This includes several improvements and fixes to the Python async query
builders:
1. The API reference docs now show all the methods for each builder.
2. The hybrid query builder now has all the same setter methods as the
vector search one, so you can now set things like `.distance_type()` on
a hybrid query.
3. Re-rankers are now properly hooked up and tested for FTS and vector
search. Previously the re-rankers were accidentally bypassed in unit
tests because the builders overrode `.to_arrow()` while the unit tests
called `.to_batches()`, which was only defined in the base class. Now all
builders implement `.to_batches()` and leave `.to_arrow()` to the base
class.
4. The `AsyncQueryBase` and `AsyncVectorQueryBase` setter methods now
return `Self`, so the type checker sees the appropriate subclass as the
return type. Previously, `AsyncQueryBase` had them all hard-coded to
return `AsyncQuery`, which was unfortunate. (This required bringing in
`typing-extensions` for older Python versions, but I think it's worth
it.)
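A minimal sketch of the `Self` pattern (builder bodies heavily
simplified; the real builders carry much more state):

```python
from typing_extensions import Self  # stdlib typing.Self on Python >= 3.11

class AsyncQueryBase:
    def limit(self, n: int) -> Self:
        # Returning Self means subclasses keep their own type when chaining.
        self._limit = n
        return self

class AsyncVectorQuery(AsyncQueryBase):
    def distance_type(self, distance_type: str) -> Self:
        self._distance_type = distance_type
        return self

# Type-checks as AsyncVectorQuery, not AsyncQueryBase:
q = AsyncVectorQuery().limit(10).distance_type("cosine")
```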
Related to #2014
This fixes:
- the linear reranker could lose some results if the merging consumed all
vector results earlier than the FTS results
- the linear reranker inverted the FTS score, but only the vector distance
can be inverted
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
BREAKING CHANGE: For a field "vector", a list of integers will now be
converted to a binary (uint8) vector instead of an f32 vector. Use float
values instead for f32 vectors.
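For example (illustrative values only):

```python
# A list of integers is now stored as a binary (uint8) vector...
row_binary = {"vector": [1, 0, 1, 0]}
# ...while float values still produce an f32 vector.
row_f32 = {"vector": [1.0, 0.0, 1.0, 0.0]}
```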
* Adds proper support for inserting and upserting subsets of the full
schema. I thought I had previously implemented this in #1827, but it
turns out I had not tested carefully enough.
* Refactors `_sanitize_data` and other utility functions to be simpler
and not require `numpy` or `combine_chunks()`.
* Added a new suite of unit tests to validate sanitization utilities.
## Examples
```python
import pandas as pd
import lancedb
db = lancedb.connect("memory://demo")
initial_data = pd.DataFrame({
"a": [1, 2, 3],
"b": [4, 5, 6],
"c": [7, 8, 9]
})
table = db.create_table("demo", initial_data)
# Insert a subschema
new_data = pd.DataFrame({"a": [10, 11]})
table.add(new_data)
table.to_pandas()
```
```
a b c
0 1 4.0 7.0
1 2 5.0 8.0
2 3 6.0 9.0
3 10 NaN NaN
4 11 NaN NaN
```
```python
# Upsert a subschema
upsert_data = pd.DataFrame({
"a": [3, 10, 15],
"b": [6, 7, 8],
})
(
    table.merge_insert(on="a")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(upsert_data)
)
table.to_pandas()
```
```
a b c
0 1 4.0 7.0
1 2 5.0 8.0
2 3 6.0 9.0
3 10 7.0 NaN
4 11 NaN NaN
5 15 8.0 NaN
```
It reports the error `AttributeError: 'builtins.FTSQuery' object has no
attribute 'select_columns'`
because the `select_columns` method was missing on the Rust side.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Binary vectors and Hamming distance only work with IVF_FLAT, so this PR
introduces them all together.
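Roughly how this might look from the sync API (a sketch combining the
pieces above; not checked against the exact signatures):

```python
import lancedb

db = lancedb.connect("memory://hamming-demo")
# Integer lists are stored as binary (uint8) vectors (see the breaking change above).
table = db.create_table(
    "bits", [{"id": i, "vector": [i % 256, (i * 7) % 256]} for i in range(64)]
)
results = (
    table.search([5, 250])
    .distance_type("hamming")  # `metric` is kept as an alias
    .limit(3)
    .to_pandas()
)
```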
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Hi lancedb team,
This PR adds the `bypass_vector_index` logic to the sync API, as
described in [Issue
#535](https://github.com/lancedb/lancedb/issues/535). (Closes #535.)
I've implemented it only for the regular vector search. If you think it
should also be supported for FTS, hybrid, or empty queries, and for the
cloud solution, please let me know, and I'll be happy to extend it.
Since there’s no `CONTRIBUTING.md` or contribution guidelines, I opted
for the simplest implementation to get this started.
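A quick sketch of the intended usage:

```python
import lancedb

db = lancedb.connect("memory://bypass-demo")
table = db.create_table(
    "vecs", [{"id": i, "vector": [float(i), float(i + 1)]} for i in range(10)]
)
results = (
    table.search([0.1, 0.2])
    .bypass_vector_index()  # skip the ANN index and do an exact (flat) scan
    .limit(5)
    .to_pandas()
)
```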
Looking forward to your feedback!
Thanks!
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
* Add a `See Also` section to `cleanup_old_files` and `compact_files` so
readers know they are related to `optimize`.
* Fix the link to the `compact_files` arguments.
* Improve the formatting of the note.
### Changes to sync API
* Updated `LanceTable` and `LanceDBConnection` reprs
* Add `storage_options`, `data_storage_version`, and
`enable_v2_manifest_paths` to the sync create table API (see the sketch
after this list).
* Add `storage_options` to `open_table` in sync API.
* Add `list_indices()` and `index_stats()` to sync API
* `create_table()` will now create only 1 version when data is passed.
Previously it would always create two versions: 1 to create an empty
table and 1 to add data to it.
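A sketch of the expanded sync create-table call (option keys and values
here are illustrative, not exhaustive):

```python
import lancedb

db = lancedb.connect("/tmp/demo-db")
table = db.create_table(
    "demo",
    data=[{"id": 1, "vector": [0.1, 0.2]}],
    storage_options={"timeout": "60s"},  # illustrative option key
    data_storage_version="stable",       # assumed value name
    enable_v2_manifest_paths=True,
)
```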
### Changes to async API
* Add `embedding_functions` to async `create_table()` API.
* Added `head()` to async API
### Refactors
* Refactor index parameters into dataclasses so they are easier to use
from Python (sketched below).
* Moved most tests to use an in-memory DB so we don't need to create so
many temp directories
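For instance, the dataclass-style index config might be used like this
(class and parameter names as I understand the refactor; treat as
illustrative):

```python
from lancedb.index import IvfPq

async def build_index(table):
    # Dataclass config instead of loose keyword arguments.
    await table.create_index(
        "vector",
        config=IvfPq(num_partitions=256, num_sub_vectors=16),
    )
```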
Closes #1792. Closes #1932.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
A user reported on Discord that calling
`table.vector_search([np.float16(1.0), np.float16(2.0), ...])`
yields `TypeError: 'numpy.float16' object is not iterable`.
`db.connect` with an Azure storage account name is supported in the
async connect but not the sync connect.
This adds that functionality.
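A sketch of what the sync call should now accept (option keys assumed to
mirror the async API):

```python
import lancedb

db = lancedb.connect(
    "az://my-container/my-db",
    storage_options={
        "account_name": "myaccount",   # the Azure storage account name
        "account_key": "<secret-key>",
    },
)
```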
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Previously, whenever `Table.add()` was called, we would write and
re-open the underlying dataset. This was bad for performance, as it
reset the table cache and initiated a lot of IO. It also could be the
source of bugs, since we didn't necessarily pass all the necessary
connection options down when re-opening the table.
Closes#1655
Closes #1791. Closes #1764. Closes #1897 (makes it unnecessary).
BREAKING CHANGE: when using an Azure connection string (`az://...`), the
call to connect will fail if the Azure storage credentials are not set.
This breaks from the previous behaviour, where connect would succeed and
the failure would only surface later, when the user invoked methods on
the connection.
This currently only works for local tables (remote tables cannot be
queried).
It is also exclusive to the sync interface. However, since the pyarrow
dataset interface is synchronous, I am not sure there is much value in
making an async-wrapping variant.
In addition, I added a `to_batches` method to the base query in the sync
API; this already exists in the async API. In the sync API this PR only
adds support for vector queries and scalar queries, not hybrid or FTS
queries.
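A sketch of the new sync usage:

```python
import lancedb

db = lancedb.connect("memory://batches-demo")
table = db.create_table(
    "vecs", [{"vector": [float(i), float(i + 1)]} for i in range(100)]
)
# Stream results instead of materializing the whole result set:
for batch in table.search([0.1, 0.2]).limit(50).to_batches():
    print(batch.num_rows)  # each item is a pyarrow.RecordBatch
```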
Users who call the remote SDK from code that uses futures (either
`ThreadPoolExecutor` or `asyncio`) can get odd errors like:
```
Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
RuntimeError: cannot enter context: <_contextvars.Context object at 0x7cfe94cdc900> is already entered
```
This PR fixes that by executing all LanceDB futures in a dedicated
thread pool running on a background thread. That way, it doesn't
interfere with the caller's thread pool.
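A minimal sketch of the pattern (not the SDK's actual internals): one
event loop owned by a background thread, so the SDK's futures never
enter the caller's context.

```python
import asyncio
import threading

# A dedicated event loop, owned by a daemon thread that only runs SDK work.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()

def run_on_sdk_loop(coro):
    # Schedule the coroutine on the dedicated loop and block for the result,
    # without ever entering the caller's event loop or contextvars context.
    return asyncio.run_coroutine_threadsafe(coro, _loop).result()
```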