lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-01-12 06:42:56 +00:00

Author	SHA1	Message	Date
Magnus	4f07fea6df	feat: add ColPali embedding support with MultiVector type (#2170 ) This PR adds ColPali support with a ColPaliEmbeddings class (tagged "colpali") using ColQwen2.5 for multi-vector text/image embeddings. Also added a MultiVector Pydantic type to handle the vector lists. I've added some integration tests for the embedding model and some unit tests for the new Pydantic type. Could be a template for other ColPali variants as well, or until transformers 🤗 starts supporting it. Still `TODO`: - [ ] Documentation - [ ] Add an example _Could also allow Image as query, but didn't work well when testing it._ [ColPali-Engine](https://github.com/illuin-tech/colpali) version: 0.3.9.dev17+g3faee24 ## Summary by CodeRabbit - New Features - Introduced support for ColPali-based multimodal multi-vector embeddings for both text and images. - Added a new embedding class for generating multi-vector embeddings, configurable for various model and processing options. - Added a new Pydantic type for multi-vector embeddings, supporting validation and schema generation for lists of fixed-dimension vectors. - Bug Fixes - Ensured proper asynchronous index creation in query tests for improved reliability. - Tests - Added integration tests for ColPali embeddings, including text-to-image search and validation of multi-vector fields. - Added comprehensive tests for the new multi-vector Pydantic type, covering schema, validation, and default value behavior. - Chores - Updated optional dependencies to include the ColPali engine. - Added utility to check for availability of flash attention support.	2025-04-21 11:47:37 +08:00
Will Jones	1cd76b8498	feat: add timeout to query execution options (#2288 ) Closes #2287 ## Summary by CodeRabbit - New Features - Added configurable timeout support for query executions. Users can now specify maximum wait times for queries, enhancing control over long-running operations across various integrations. - Tests - Expanded test coverage to validate timeout behavior in both synchronous and asynchronous query flows, ensuring timely error responses when query execution exceeds the specified limit. - Introduced a new test suite to verify query operations when a timeout is reached, checking for appropriate error handling.	2025-04-04 12:34:41 -07:00
LuQQiu	a1d1833a40	feat: add analyze_plan api (#2280 ) Adds an analyze_plan API that executes a query and reports runtime metrics, which helps identify query IO overhead and the causes of slow queries.	2025-03-28 14:28:52 -07:00
Weston Pace	9403254442	feat: add to_query_object method (#2239 ) This PR adds a `to_query_object` method to the various query builders (hybrid queries are not covered yet). This makes it possible to inspect the query that is built. In addition, this PR does some normalization between the sync and async query paths. A few custom defaults were removed in favor of None (with the default getting set once, in rust). Also, the synchronous to_batches method will now actually stream results. Also, the remote API now defaults to prefiltering.	2025-03-21 13:01:51 -07:00
Gagan Bhullar	14677d7c18	fix: metric type inconsistency (#2122 ) PR fixes #2113 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2025-03-12 10:28:37 -07:00
msu-reevo	cc81f3e1a5	fix(python): typing (#2167 ) @wjones127 is there a standard way you guys set up your virtualenv? I can either relist all the dependencies in the pyright precommit section, or specify a venv, or the user has to be in the virtual environment when they run git commit. If the venv location was standardized or a python manager like `uv` was used it would be easier to avoid duplicating the pyright dependency list. Per your suggestion, in `pyproject.toml` I added in all the passing files to the `includes` section. For ruff I upgraded the version and removed "TCH" which doesn't exist as an option. I added a `pyright_report.csv` which contains a list of all files sorted by pyright errors ascending as a todo list to work on. I fixed about 30 issues in `table.py` stemming from strings being passed into methods that required a string within a set of string Literals by extracting them into `types.py` Can you verify in the rust bridge that the schema should be a property and not a method here? If it's a method, then there's another place in the code where `inner.schema` should be `inner.schema()` ``` python class RecordBatchStream: @property def schema(self) -> pa.Schema: ... ``` Also unless the `_lancedb.pyi` file is wrong, then there is no `__anext__` here for `_inner` when it's not an `AsyncGenerator` and only `next` is defined: ``` python async def __anext__(self) -> pa.RecordBatch: return await self._inner.__anext__() if isinstance(self._inner, AsyncGenerator): batch = await self._inner.__anext__() else: batch = await self._inner.next() if batch is None: raise StopAsyncIteration return batch ``` in the else statement, `_inner` is a `RecordBatchStream` ```python class RecordBatchStream: @property def schema(self) -> pa.Schema: ... async def next(self) -> Optional[pa.RecordBatch]: ... ``` --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2025-03-10 09:01:23 -07:00
Will Jones	ecdee4d2b1	feat(python): add search() method to async API (#2049 ) Reviving #1966. Closes #1938 The `search()` method can apply embeddings for the user. This simplifies hybrid search, so instead of writing: ```python vector_query = embeddings.compute_query_embeddings("flower moon")[0] await ( async_tbl.query() .nearest_to(vector_query) .nearest_to_text("flower moon") .to_pandas() ) ``` You can write: ```python await (await async_tbl.search("flower moon", query_type="hybrid")).to_pandas() ``` Unfortunately, we had to do a double-await here because `search()` needs to be async. This is because it often needs to do IO to retrieve and run an embedding function.	2025-02-24 14:19:25 -08:00
BubbleCal	a608621476	test: query with dist range and new rows (#2126 ) We found a bug where the flat KNN plan node's stats were not in the same order as the fields in the schema, which caused an error when querying with a distance range and new unindexed rows. This is fixed in lance, so this PR adds a test to verify that it works. Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2025-02-17 12:57:45 +08:00
Vaibhav	dac0857745	feat: add `distance_type()` parameter to python sync query builders and `metric()` as an alias (#2073 ) This PR aims to fix #2047 by doing the following things: - Add a distance_type parameter to the sync query builders of Python SDK. - Make metric an alias to distance_type.	2025-01-28 13:59:53 -08:00
Will Jones	bcfc93cc88	fix(python): various fixes for async query builders (#2048 ) This includes several improvements and fixes to the Python Async query builders: 1. The API reference docs show all the methods for each builder 2. The hybrid query builder now has all the same setter methods as the vector search one, so you can now set things like `.distance_type()` on a hybrid query. 3. Re-rankers are now properly hooked up and tested for FTS and vector search. Previously the re-rankers were accidentally bypassed in unit tests, because the builders overrode `.to_arrow()`, but the unit test called `.to_batches()` which was only defined in the base class. Now all builders implement `.to_batches()` and leave `.to_arrow()` to the base class. 4. The `AsyncQueryBase` and `AsyncVectorQueryBase` setter methods now return `Self`, which provides the appropriate subclass as the type hint return value. Previously, `AsyncQueryBase` had them all hard-coded to `AsyncQuery`, which was unfortunate. (This required bringing in `typing-extensions` for older Python versions, but I think it's worth it.)	2025-01-20 16:14:34 -08:00
BubbleCal	214d0debf5	docs: claim LanceDB supports float16/float32/float64 for multivector (#2040 )	2025-01-21 07:04:15 +08:00
BubbleCal	66cbf6b6c5	feat: support multivector type (#2005 ) Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2025-01-13 14:10:40 -08:00
Will Jones	6eacae18c4	test: fix test failure from merge (#2007 )	2025-01-09 11:27:24 -08:00
Bert	f4afe456e8	feat!: change default from postfiltering to prefiltering for sync python (#2000 ) BREAKING CHANGE: prefiltering is now the default in the synchronous python SDK resolves: #1872	2025-01-08 19:13:58 -05:00
Renato Marroquin	ea5c2266b8	feat(python): support .rerank() on non-hybrid queries in Async API (WIP) (#1972 ) Fixes https://github.com/lancedb/lancedb/issues/1950 --------- Co-authored-by: Renato Marroquin <renato.marroquin@oracle.com>	2025-01-08 16:42:47 -05:00
Gagan Bhullar	b474f98049	feat(python): `flatten` in `AsyncQuery` (#1967 ) PR fixes #1949 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2025-01-06 10:52:03 -08:00
Takahiro Ebato	2c05ffed52	feat(python): add `to_polars` to `AsyncQueryBase` (#1986 ) Fixes https://github.com/lancedb/lancedb/issues/1952 Added `to_polars` method to `AsyncQueryBase`.	2025-01-06 09:35:28 -08:00
BubbleCal	f4dea72cc5	feat: support vector search with distance thresholds (#1993 ) Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2025-01-06 13:23:39 +08:00
Lei Xu	347515aa51	fix: support list of numpy f16 floats as query vector (#1931 ) A user reported on Discord that `table.vector_search([np.float16(1.0), np.float16(2.0), ...])` raises `TypeError: 'numpy.float16' object is not iterable`	2024-12-10 16:17:28 -08:00
Lei Xu	267aa83bf8	feat(python): check vector query is not None (#1847 ) Fix the type hints of the `nearest_to` method, and raise `ValueError` when the input is None	2024-11-18 14:15:22 -08:00
Will Jones	72543c8b9d	test(python): test `with_row_id` in sync query (#1835 ) Also remove weird `MockTable` fixture.	2024-11-18 11:32:52 -08:00
Will Jones	3604d20ad3	feat(python,node): support with_row_id in Python and remote (#1784 ) Needed to support hybrid search in Remote SDK.	2024-11-04 11:25:45 -08:00
Will Jones	96181ab421	feat: `fast_search` in Python and Node (#1623 ) Sometimes it is acceptable to users to only search indexed data and skip any new un-indexed data. For example, if un-indexed data will be shortly indexed and they don't mind the delay. In these cases, we can save a lot of CPU time in search, and provide better latency. Users can activate this on queries using `fast_search()`.	2024-11-01 09:29:09 -07:00
Gagan Bhullar	b24810a011	feat(python, rust): expose offset in query (#1556 ) PR is part of #1555	2024-09-05 08:33:07 -07:00
Gagan Bhullar	a85f039352	fix(bug): limit fix (#1548 ) PR fixes #1151	2024-08-26 14:25:14 -07:00
Gagan Bhullar	9c1adff426	feat(python): add to_list to async api (#1520 ) PR fixes #1517	2024-08-08 11:45:20 -07:00
Will Jones	4f601a2d4c	fix: handle camelCase column names in select (#1460 ) Fixes #1385	2024-07-22 12:53:17 -07:00
Nuvic	46c6ff889d	feat: add the explain_plan function (#1328 ) It's useful to see the underlying query plan for debugging purposes. This exposes LanceScanner's `explain_plan` function. Addresses #1288 --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2024-07-02 11:10:01 -07:00
Ishani Ghose	0838e12b30	feat: add to_batches API #805 (#1048 ) SDK: Python. Description: Exposes the pyarrow batch API during query execution. This is relevant when there is no vector search query, the dataset is large, and the filtered result is larger than memory. --------- Co-authored-by: Ishani Ghose <isghose@amazon.com> Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2024-04-05 16:33:37 -07:00
Weston Pace	4180b44472	feat: refactor the query API and add query support to the python async API (#1113 ) In addition, there are also a number of changes in nodejs to the docstrings of existing methods because this PR adds a jsdoc linter.	2024-04-05 16:32:47 -07:00
Rob Meng	b8eb5d4bfe	fix: fix columns type for pydantic 2.x (#1045 )	2024-04-05 16:31:36 -07:00
Rob Meng	f3de3d990d	chore: upgrade to lance 0.10.1 (#1034 ) upgrade to lance 0.10.1 and update doc string to reflect dynamic projection options	2024-04-05 16:31:36 -07:00
Weston Pace	2cec2a8937	feat: add a basic async python client starting point (#1014 ) This changes `lancedb` from a "pure python" setuptools project to a maturin project and adds a rust lancedb dependency. The async python client is extremely minimal (only `connect` and `Connection.table_names` are supported). The purpose of this PR is to get the infrastructure in place for building out the rest of the async client. Although this is not technically a breaking change (no APIs are changing) it is still a considerable change in the way the wheels are built because they now include the native shared library.	2024-04-05 16:31:34 -07:00
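
The `__anext__` discussion in cc81f3e1a5 (#2167) concerns a stream that exposes `next()` returning `None` at exhaustion rather than implementing the async iterator protocol itself. A minimal sketch of that adapter pattern follows. It uses only the standard library; `FakeBatchStream`, `AsyncBatchIterator`, `collect`, and `run_demo` are hypothetical names invented for illustration, and strings stand in for record batches. It is not the code from the PR.

```python
import asyncio
from typing import List, Optional


class FakeBatchStream:
    """Hypothetical stand-in for a stream that only exposes `next()`."""

    def __init__(self, batches: List[str]) -> None:
        self._batches = list(batches)

    async def next(self) -> Optional[str]:
        # Returns None once the stream is exhausted.
        return self._batches.pop(0) if self._batches else None


class AsyncBatchIterator:
    """Adapts a `next()`-style stream to the async iterator protocol."""

    def __init__(self, inner: FakeBatchStream) -> None:
        self._inner = inner

    def __aiter__(self) -> "AsyncBatchIterator":
        return self

    async def __anext__(self) -> str:
        batch = await self._inner.next()
        if batch is None:
            # Translate the None sentinel into the protocol's stop signal.
            raise StopAsyncIteration
        return batch


async def collect(stream: FakeBatchStream) -> List[str]:
    # `async for` relies on __aiter__/__anext__ defined above.
    return [batch async for batch in AsyncBatchIterator(stream)]


def run_demo() -> List[str]:
    return asyncio.run(collect(FakeBatchStream(["batch-0", "batch-1"])))
```

Calling `run_demo()` drains the two fake batches in order; an empty stream yields an empty list because the first `next()` call already returns `None`.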

33 Commits
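
Several commits above (f4afe456e8, 9403254442) switch the default from postfiltering to prefiltering. The toy sketch below shows why that is a behavior change. It is a brute-force search over in-memory tuples, not LanceDB's implementation, and every name in it (`Row`, `top_k`, `prefilter_search`, `postfilter_search`, `ROWS`, `is_a`) is hypothetical.

```python
from typing import Callable, List, Tuple

# (id, distance to the query vector, category); distances are made up.
Row = Tuple[int, float, str]


def top_k(rows: List[Row], k: int) -> List[Row]:
    # Nearest-first by distance, truncated to k rows.
    return sorted(rows, key=lambda r: r[1])[:k]


def prefilter_search(rows: List[Row], pred: Callable[[Row], bool], k: int) -> List[Row]:
    # Filter first, then take the k nearest among the matching rows.
    return top_k([r for r in rows if pred(r)], k)


def postfilter_search(rows: List[Row], pred: Callable[[Row], bool], k: int) -> List[Row]:
    # Take the k nearest first, then drop rows that fail the filter.
    return [r for r in top_k(rows, k) if pred(r)]


ROWS: List[Row] = [
    (1, 0.1, "a"),
    (2, 0.2, "b"),
    (3, 0.3, "b"),
    (4, 0.4, "a"),
    (5, 0.5, "a"),
]


def is_a(row: Row) -> bool:
    return row[2] == "a"


pre = prefilter_search(ROWS, is_a, k=2)
post = postfilter_search(ROWS, is_a, k=2)
```

With `k=2`, prefiltering returns rows 1 and 4, while postfiltering returns only row 1, because row 2 occupied a top-k slot before being filtered out. Code that relied on the old default can therefore see different result counts after the change.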