190 Commits

Author SHA1 Message Date
BubbleCal
648327e90c docs: show how to pack bits for binary vector (#2020)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-14 09:00:57 -08:00
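
As a rough illustration of the bit packing that #2020 documents, the sketch below packs a list of 0/1 values into bytes, most significant bit first. The helper name `pack_bits` is hypothetical and is not part of LanceDB; `numpy.packbits` with its default bit order produces the same layout.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into byte values, most significant bit first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            # Shift the accumulated bits left and append the next bit.
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed


# A 16-dimensional binary vector becomes 2 uint8 values.
bits = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]
packed = pack_bits(bits)
print(packed)  # [129, 15]
```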
BubbleCal
66cbf6b6c5 feat: support multivector type (#2005)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-13 14:10:40 -08:00
Prashant Dixit
b66cd943a7 fix: broken voyageai embedding API (#2013)
This PR fixes the broken Embedding API for Voyageai.
2025-01-13 08:52:38 -08:00
Will Jones
6eacae18c4 test: fix test failure from merge (#2007) 2025-01-09 11:27:24 -08:00
Bert
f4afe456e8 feat!: change default from postfiltering to prefiltering for sync python (#2000)
BREAKING CHANGE: prefiltering is now the default in the synchronous
python SDK

resolves: #1872
2025-01-08 19:13:58 -05:00
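
To see why the default in #2000 matters, here is a small library-independent sketch of the two strategies. The helpers `prefilter_search` and `postfilter_search` and the toy one-dimensional data are made up for illustration and do not reflect how LanceDB implements filtering.

```python
def nearest(rows, query, k):
    """Return the k rows closest to the query on the 1-D 'x' value."""
    return sorted(rows, key=lambda row: abs(row["x"] - query))[:k]


def prefilter_search(rows, query, k, predicate):
    # Prefiltering: apply the filter first, then take the top k.
    return nearest([row for row in rows if predicate(row)], query, k)


def postfilter_search(rows, query, k, predicate):
    # Postfiltering: take the top k first, then apply the filter.
    return [row for row in nearest(rows, query, k) if predicate(row)]


rows = [
    {"id": 1, "x": 0.1, "category": "a"},
    {"id": 2, "x": 0.2, "category": "b"},
    {"id": 3, "x": 0.3, "category": "b"},
    {"id": 4, "x": 0.4, "category": "a"},
    {"id": 5, "x": 0.5, "category": "a"},
]
is_a = lambda row: row["category"] == "a"

pre_ids = [row["id"] for row in prefilter_search(rows, 0.0, 2, is_a)]
post_ids = [row["id"] for row in postfilter_search(rows, 0.0, 2, is_a)]
print(pre_ids)   # [1, 4]
print(post_ids)  # [1]
```

In this toy case postfiltering returns fewer rows than the requested limit, because one of the two nearest rows is discarded after the search.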
Renato Marroquin
ea5c2266b8 feat(python): support .rerank() on non-hybrid queries in Async API (WIP) (#1972)
Fixes https://github.com/lancedb/lancedb/issues/1950

---------

Co-authored-by: Renato Marroquin <renato.marroquin@oracle.com>
2025-01-08 16:42:47 -05:00
Will Jones
c557e77f09 feat(python)!: support inserting and upserting subschemas (#1965)
BREAKING CHANGE: For a field "vector", list of integers will now be
converted to binary (uint8) vectors instead of f32 vectors. Use float
values instead for f32 vectors.

* Adds proper support for inserting and upserting subsets of the full
schema. I thought I had previously implemented this in #1827, but it
turns out I had not tested carefully enough.
* Refactors `_sanitize_data` and other utility functions to be simpler
and not require `numpy` or `combine_chunks()`.
* Added a new suite of unit tests to validate sanitization utilities.

## Examples

```python
import pandas as pd
import lancedb

db = lancedb.connect("memory://demo")
initial_data = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9]
})
table = db.create_table("demo", initial_data)

# Insert a subschema
new_data = pd.DataFrame({"a": [10, 11]})
table.add(new_data)
table.to_pandas()
```
```
    a    b    c
0   1  4.0  7.0
1   2  5.0  8.0
2   3  6.0  9.0
3  10  NaN  NaN
4  11  NaN  NaN
```


```python
# Upsert a subschema
upsert_data = pd.DataFrame({
    "a": [3, 10, 15],
    "b": [6, 7, 8],
})
table.merge_insert(on="a").when_matched_update_all().when_not_matched_insert_all().execute(upsert_data)
table.to_pandas()
```
```
    a    b    c
0   1  4.0  7.0
1   2  5.0  8.0
2   3  6.0  9.0
3  10  7.0  NaN
4  11  NaN  NaN
5  15  8.0  NaN
```
2025-01-08 10:11:10 -08:00
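
The breaking change in #1965 can be pictured with the toy rule below. `infer_vector_type` is a hypothetical helper written only for this illustration; it is an assumption about the observable behavior described in the commit message, not the library's actual conversion code.

```python
def infer_vector_type(values):
    """Toy rule: an all-integer list is treated as a binary (uint8) vector,
    anything else as a float32 vector."""
    all_ints = all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    )
    return "uint8" if all_ints else "float32"


print(infer_vector_type([1, 2, 3]))        # uint8
print(infer_vector_type([1.0, 2.0, 3.0]))  # float32
```

Writing the values as floats, as the commit message suggests, keeps the vector in f32.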
BubbleCal
3c0a64be8f feat: support distance range in queries (#1999)
This also updates the docs.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-08 11:03:27 +08:00
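
A distance range, as introduced in #1999, can be thought of as a filter on the `_distance` column of the results. The sketch below is plain Python with a hypothetical helper `within_distance_range`; treating the lower bound as inclusive and the upper bound as exclusive is an assumption made here, not something the commit message states.

```python
def within_distance_range(results, lower_bound=None, upper_bound=None):
    """Keep rows whose _distance falls in [lower_bound, upper_bound)."""
    kept = []
    for row in results:
        distance = row["_distance"]
        if lower_bound is not None and distance < lower_bound:
            continue
        if upper_bound is not None and distance >= upper_bound:
            continue
        kept.append(row)
    return kept


results = [
    {"id": 1, "_distance": 0.0},
    {"id": 2, "_distance": 0.25},
    {"id": 3, "_distance": 0.5},
    {"id": 4, "_distance": 0.9},
]
in_range = within_distance_range(results, lower_bound=0.25, upper_bound=0.9)
print([row["id"] for row in in_range])  # [2, 3]
```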
QianZhu
17c9e9afea docs: add async examples to doc (#1941)
- added sync and async tabs for python examples
- moved python code to tests/docs

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-01-07 15:10:25 -08:00
Gagan Bhullar
b474f98049 feat(python): flatten in AsyncQuery (#1967)
PR fixes #1949

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2025-01-06 10:52:03 -08:00
Takahiro Ebato
2c05ffed52 feat(python): add to_polars to AsyncQueryBase (#1986)
Fixes https://github.com/lancedb/lancedb/issues/1952

Added `to_polars` method to `AsyncQueryBase`.
2025-01-06 09:35:28 -08:00
BubbleCal
f4dea72cc5 feat: support vector search with distance thresholds (#1993)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-06 13:23:39 +08:00
Lei Xu
f76c4a5ce1 chore: add pyright static type checking and fix some of the table interface (#1996)
* Enable `pyright` in the project
* Fixed some pyright typing errors in `table.py`
2025-01-04 15:24:58 -08:00
BubbleCal
445a312667 fix: selecting columns failed on FTS and hybrid search (#1991)
Selecting columns reported the error
`AttributeError: 'builtins.FTSQuery' object has no attribute 'select_columns'`
because the `select_columns` method was missing in Rust.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2025-01-03 13:08:12 +08:00
Lei Xu
50c30c5d34 chore(python): fix typo of the synchronized checkout API (#1988) 2024-12-30 18:54:31 -08:00
Renato Marroquin
0cb6da6b7e docs: add new indexes to python docs (#1945)
closes issue #1855

Co-authored-by: Renato Marroquin <renato.marroquin@oracle.com>
2024-12-28 15:35:10 -08:00
BubbleCal
16cf2990f3 feat: create IVF_FLAT on remote table (#1978)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-12-25 14:57:07 +08:00
BubbleCal
e70fd4fecc feat: support IVF_FLAT, binary vectors and hamming distance (#1955)
Binary vectors and hamming distance work only with IVF_FLAT, so this PR
introduces all three together.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-12-24 10:36:20 -08:00
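
For readers unfamiliar with the metric added in #1955, hamming distance between two packed binary vectors is the number of bit positions in which they differ. The following is a minimal pure-Python sketch with a hypothetical helper name; it says nothing about how the index computes distances internally.

```python
def hamming_distance(a, b):
    """Count differing bits between two equal-length sequences of uint8 values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # XOR leaves a 1 in every position where the two bytes differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


a = [0b10000001, 0b00001111]
b = [0b10000000, 0b11111111]
distance = hamming_distance(a, b)
print(distance)  # 5
```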
verma nakul
ac0068b80e feat(python): add ignore_missing to the async drop_table() method (#1953)
- feat(db): add `ignore_missing` to async `drop_table` method

Fixes #1951

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-12-24 10:33:47 -08:00
Hezi Zisman
ebac960571 feat(python): add bypass_vector_index to sync api (#1947)
Hi lancedb team,

This PR adds the `bypass_vector_index` logic to the sync API, as
described in [Issue
#535](https://github.com/lancedb/lancedb/issues/535). (Closes #535).

I've implemented it only for the regular vector search. If you think it
should also be supported for FTS, Hybrid, or Empty queries and for the
cloud solution, please let me know, and I’ll be happy to extend it.

Since there’s no `CONTRIBUTING.md` or contribution guidelines, I opted
for the simplest implementation to get this started.

Looking forward to your feedback!

Thanks!

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-12-24 10:33:26 -08:00
Will Jones
61a714a459 docs: improve optimization docs (#1957)
* Add `See Also` section to `cleanup_old_files` and `compact_files` so
readers know they are linked to `optimize`.
* Fixes link to `compact_files` arguments
* Improves formatting of note.
2024-12-19 10:55:11 -08:00
Will Jones
980aa70e2d feat(python): async-sync feature parity on Table (#1914)
### Changes to sync API
* Updated `LanceTable` and `LanceDBConnection` reprs
* Add `storage_options`, `data_storage_version`, and
`enable_v2_manifest_paths` to sync create table API.
* Add `storage_options` to `open_table` in sync API.
* Add `list_indices()` and `index_stats()` to sync API
* `create_table()` will now create only 1 version when data is passed.
Previously it would always create two versions: 1 to create an empty
table and 1 to add data to it.

### Changes to async API
* Add `embedding_functions` to async `create_table()` API.
* Added `head()` to async API

### Refactors
* Refactor index parameters into dataclasses so they are easier to use
from Python
* Moved most tests to use an in-memory DB so we don't need to create so
many temp directories

Closes #1792
Closes #1932

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-12-13 12:56:44 -08:00
QianZhu
c0ee370f83 docs: improve schema evolution api examples (#1929) 2024-12-12 10:52:06 -08:00
Lei Xu
347515aa51 fix: support list of numpy f16 floats as query vector (#1931)
User reported on Discord, when using
`table.vector_search([np.float16(1.0), np.float16(2.0), ...])`, it
yields `TypeError: 'numpy.float16' object is not iterable`
2024-12-10 16:17:28 -08:00
BubbleCal
3324e7d525 feat: support 4bit PQ (#1916) 2024-12-10 10:36:03 +08:00
Will Jones
ab5316b4fa feat: support offset in remote client (#1923)
Closes https://github.com/lancedb/lancedb/issues/1876
2024-12-09 17:04:18 -08:00
LuQQiu
35bacdd57e feat: support azure account name storage options in sync db.connect (#1926)
`db.connect` with an Azure storage account name was supported in async
connect but not in sync connect. This PR adds that functionality to sync
connect.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-12-08 20:00:23 -08:00
Will Jones
a5ebe5a6c4 fix: create_scalar_index in cloud (#1922)
Fixes #1920
2024-12-07 19:48:40 -08:00
Bert
2a9e3e2084 feat(python): support hybrid search in async sdk (#1915)
fixes: https://github.com/lancedb/lancedb/issues/1765

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-12-06 13:53:15 -05:00
BubbleCal
c663085203 feat: support FTS options on RemoteTable (#1807) 2024-12-06 21:49:03 +08:00
Will Jones
3c487e5fc7 perf: re-use table instance during write (#1909)
Previously, whenever `Table.add()` was called, we would write and
re-open the underlying dataset. This was bad for performance, as it
reset the table cache and initiated a lot of IO. It also could be the
source of bugs, since we didn't necessarily pass all the necessary
connection options down when re-opening the table.

Closes #1655
2024-12-05 14:44:50 -08:00
Bert
239f725b32 feat(python)!: async-sync feature parity on Connections (#1905)
Closes #1791
Closes #1764
Closes #1897 (Makes this unnecessary)

BREAKING CHANGE: When using an Azure connection string `az://...`, the call
to connect will fail if the Azure storage credentials are not set. This
breaks from the previous behaviour, where the call would fail after
connect, when the user invoked methods on the connection.
2024-12-05 14:54:39 -05:00
Will Jones
79eaa52184 feat: schema evolution APIs in all SDKs (#1851)
* Support `add_columns`, `alter_columns`, `drop_columns` in Remote SDK
and async Python
* Add `data_type` parameter to node
* Docs updates
2024-12-04 14:47:50 -08:00
Lei Xu
bd82e1f66d feat(python): add support for Azure OpenAI SDK (#1906)
Closes #1699
2024-12-04 13:09:38 -08:00
Weston Pace
c998a47e17 feat: add a pyarrow dataset adapter for LanceDB tables (#1902)
This currently only works for local tables (remote tables cannot be
queried).
This is also exclusive to the sync interface. However, since the pyarrow
dataset interface is synchronous I am not sure if there is much value in
making an async-wrapping variant.

In addition, I added a `to_batches` method to the base query in the sync
API. This already exists in the async API. In the sync API this PR only
adds support for vector queries and scalar queries and not for hybrid or
FTS queries.
2024-12-03 15:42:54 -08:00
Frank Liu
d8c758513c feat: add multimodal capabilities for Voyage embedder (#1878)
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-12-03 10:25:48 -08:00
Will Jones
3795e02ee3 chore: fix ci on main (#1899) 2024-12-02 15:21:18 -08:00
QianZhu
2616a50502 fix: test errors after setting default limit (#1891) 2024-11-26 16:03:16 -08:00
Will Jones
6826039575 fix(python): run remote SDK futures in background thread (#1856)
Users who call the remote SDK from code that uses futures (either
`ThreadPoolExecutor` or `asyncio`) can get odd errors like:

```
Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
RuntimeError: cannot enter context: <_contextvars.Context object at 0x7cfe94cdc900> is already entered
```

This PR fixes that by executing all LanceDB futures in a dedicated
thread pool running on a background thread. That way, it doesn't
interact with their threadpool.
2024-11-25 13:12:47 -08:00
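
The idea behind #1856 can be sketched with the standard library alone. The example below runs coroutines on an event loop that lives in its own background thread, so the caller's loop or thread pool is never re-entered. This is one common way to get that isolation and is an assumption for illustration; the class `BackgroundLoop` and the coroutine `fetch_answer` are hypothetical and are not the PR's actual implementation.

```python
import asyncio
import threading


class BackgroundLoop:
    """Runs coroutines on a private event loop in a daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, daemon=True
        )
        self._thread.start()

    def run(self, coro):
        # Schedule the coroutine on the background loop and block
        # the calling thread until its result is ready.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result()

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_answer():
    await asyncio.sleep(0)  # stand-in for real async I/O
    return 42


background = BackgroundLoop()

# Called from plain synchronous code.
result = background.run(fetch_answer())


async def caller():
    # Called from inside another running event loop: the work still
    # executes on the background loop, not on this one.
    return background.run(fetch_answer())


from_loop = asyncio.run(caller())
background.close()
```

Note that `run` blocks the calling thread while it waits, so in this sketch a caller inside an event loop pauses that loop until the result arrives.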
Lei Xu
2ded17452b fix(python)!: handle bad openai embeddings gracefully (#1873)
BREAKING-CHANGE: change Pydantic Vector field to be nullable by default.
Closes #1577
2024-11-23 13:33:52 -08:00
QianZhu
43a670ed4b fix: limit docstring change (#1860) 2024-11-21 10:50:50 -08:00
Bert
cb9a00a28d feat: add list_versions to typescript, rust and remote python sdks (#1850)
Requires updating the lance dependency to bring in this change, which
makes the version serializable:
https://github.com/lancedb/lance/pull/3143
2024-11-21 13:35:14 -05:00
Max Epstein
72af977a73 fix(CohereReranker): updated default model_name param to newest v3 (#1862) 2024-11-21 09:02:49 -08:00
Bert
7cecb71df0 feat: support for checkout and checkout_latest in remote sdks (#1863) 2024-11-21 11:28:46 -05:00
BubbleCal
b2f88f0b29 feat: support specifying ef search param (#1844)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-11-19 23:12:25 +08:00
Lei Xu
267aa83bf8 feat(python): check vector query is not None (#1847)
Fix the type hints of `nearest_to` method, and raise `ValueError` when
the input is None
2024-11-18 14:15:22 -08:00
Will Jones
72543c8b9d test(python): test with_row_id in sync query (#1835)
Also remove weird `MockTable` fixture.
2024-11-18 11:32:52 -08:00
Will Jones
587c0824af feat: flexible null handling and insert subschemas in Python (#1827)
* Test that we can insert subschemas (omit nullable columns) in Python.
* More work is needed to support this in Node. See:
https://github.com/lancedb/lancedb/issues/1832
* Test that we can insert data with nullable schema but no nulls in
non-nullable schema.
* Add `"null"` option for `on_bad_vectors` where we fill with null if
the vector is bad.
* Make null values not considered bad if the field itself is nullable.
2024-11-15 11:33:00 -08:00
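
The `"null"` option for `on_bad_vectors` described in #1827 can be pictured roughly as follows. The helper `null_bad_vectors` is hypothetical, and defining a "bad" vector as one with the wrong dimension or a NaN value is an assumption made for this sketch.

```python
import math


def null_bad_vectors(vectors, expected_dim):
    """Replace bad vectors with None; existing nulls are left untouched."""
    cleaned = []
    for vector in vectors:
        if vector is None:
            # A null in a nullable field is not considered bad.
            cleaned.append(None)
            continue
        is_bad = len(vector) != expected_dim or any(
            math.isnan(value) for value in vector
        )
        cleaned.append(None if is_bad else list(vector))
    return cleaned


vectors = [[1.0, 2.0], [float("nan"), 1.0], [1.0], None]
cleaned = null_bad_vectors(vectors, expected_dim=2)
print(cleaned)  # [[1.0, 2.0], None, None, None]
```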
Rob Meng
b724b1a01f feat: support remote empty query (#1828)
Support sending empty query types to remote lancedb. Also include offset
and limit, which were previously omitted.
2024-11-13 23:04:52 -05:00
Will Jones
abd75e0ead feat: search multiple query vectors as one query (#1811)
Allows users to pass multiple query vectors as part of a single query
plan. This just runs the queries in parallel without any further
optimization. It's mostly a convenience.

Previously, I think this was only handled by the sync Python remote API.
This makes it common across all SDKs.

Closes https://github.com/lancedb/lancedb/issues/1803

```python
>>> import lancedb
>>> import asyncio
>>> 
>>> async def main():
...     db = await lancedb.connect_async("./demo")
...     table = await db.create_table("demo", [{"id": 1, "vector": [1, 2, 3]}, {"id": 2, "vector": [4, 5, 6]}], mode="overwrite")
...     return await table.query().nearest_to([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [4.0, 5.0, 6.0]]).limit(1).to_pandas()
... 
>>> asyncio.run(main())
   query_index  id           vector  _distance
0            2   2  [4.0, 5.0, 6.0]        0.0
1            1   2  [4.0, 5.0, 6.0]        0.0
2            0   1  [1.0, 2.0, 3.0]        0.0
```
2024-11-13 16:05:16 -08:00