lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-04-01 13:30:40 +00:00

Author	SHA1	Message	Date
Omair Afzal	715b81c86b	fix(python): graceful handling of empty result sets in hybrid search (#3030 ) ## Problem When applying hard filters that result in zero matches, hybrid search crashes with `IndexError: list index out of range` during reranking. This happens because empty result tables are passed through the full reranker pipeline, which expects at least one result. Traceback from the issue: ``` lancedb/query.py: in _combine_hybrid_results results = reranker.rerank_hybrid(fts_query, vector_results, fts_results) lancedb/rerankers/answerdotai.py: in rerank_hybrid combined_results = self._rerank(combined_results, query) ... IndexError: list index out of range ``` ## Fix Added an early return in `_combine_hybrid_results` when both vector and FTS results are empty. Instead of passing empty tables through normalization, reranking, and score restoration (which can fail in various ways), we now build a properly-typed empty result table with the `_relevance_score` column and return it directly. ## Test Added `test_empty_hybrid_result_reranker` that exercises `_combine_hybrid_results` directly with empty vector and FTS tables, verifying: - Returns empty table with correct schema - Includes `_relevance_score` column - Respects `with_row_ids` flag Closes #2425	2026-02-17 11:37:10 -08:00
fzowl	93c2cf2f59	feat(voyageai): update voyage integration (#2713 ) Adding multimodal usage guide VoyageAI integration changes: - Adding voyage-3.5 and voyage-3.5-lite models - Adding voyage-context-3 model - Adding rerank-2.5 and rerank-2.5-lite models	2025-10-29 16:49:07 +05:30
Ayush Chaurasia	e921c90c1b	feat: support mean reciprocal rank reranker (#2671 ) The basic idea of MRR is this - https://www.evidentlyai.com/ranking-metrics/mean-reciprocal-rank-mrr I've implemented a weighted version for allowing user to set weightage between vector and fts. The gist is something like this ### Scenario A: Document at rank 1 in one set, absent from another ``` # Assuming equal weights: weight_vector = 0.5, weight_fts = 0.5 vector_rr = 1.0 # rank 1 → 1/1 = 1.0 fts_rr = 0.0 # absent → 0.0 weighted_mrr = 0.5 × 1.0 + 0.5 × 0.0 = 0.5 ``` ### Scenario B: Document at rank 1 in one set, rank 2 in another ``` # Same weights: weight_vector = 0.5, weight_fts = 0.5 vector_rr = 1.0 # rank 1 → 1/1 = 1.0 fts_rr = 0.5 # rank 2 → 1/2 = 0.5 weighted_mrr = 0.5 × 1.0 + 0.5 × 0.5 = 0.5 + 0.25 = 0.75 ``` And so with `return_score="all"` the result looks something like this (this is from the reranker tests). Because this is a weighted rank based reranker, some results might have the same score ``` text vector _distance _rowid _score _relevance_score 0 I am your father [-0.010703234, 0.069315575, 0.030076642, 0.002... 8.149148e-13 8589934598 10.978719 1.000000 1 the ground beneath my feet [-0.09500901, 0.00092102867, 0.0755851, 0.0372... 1.376896e+00 8589934604 NaN 0.250000 2 I find your lack of faith disturbing [0.07525753, -0.0100010475, 0.09990541, 0.0209... NaN 8589934595 3.483394 0.250000 3 but I don't wanna die [0.033476487, -0.011235877, -0.057625435, -0.0... 1.538222e+00 8589934610 1.130355 0.238095 4 if you strike me down I shall become more powe... [0.00432201, 0.030120496, 5.3317923e-05, 0.033... 1.381086e+00 8589934594 0.715157 0.216667 5 I see a salty message written in the eves [-0.04213107, 0.0016004723, 0.061052393, -0.02... 1.638301e+00 8589934603 1.043785 0.133333 6 but his son was mortal [0.012462767, 0.049041674, -0.057339743, -0.04... 1.421566e+00 8589934620 NaN 0.125000 7 I've got a bad feeling about this [-0.06973199, -0.029960092, 0.02641632, -0.031... NaN 8589934596 1.043785 0.125000 8 now that's a name I haven't heard in a long time [-0.014374257, -0.013588792, -0.07487557, 0.03... 1.597573e+00 8589934593 0.848772 0.118056 9 he was a god [-0.0258895, 0.11925236, -0.029397793, 0.05888... 1.423147e+00 8589934618 NaN 0.100000 10 I wish they would make another one [-0.14737535, -0.015304729, 0.04318139, -0.061... NaN 8589934622 1.043785 0.100000 11 Kratos had a son [-0.057455737, 0.13734367, -0.03537109, -0.000... 1.488075e+00 8589934617 NaN 0.083333 12 I don't wanna live like this [-0.0028891307, 0.015214227, 0.025183653, 0.08... NaN 8589934609 1.043785 0.071429 13 I see a mansard roof through the trees [0.052383978, 0.087759204, 0.014739997, 0.0239... NaN 8589934602 1.043785 0.062500 14 great kid don't get cocky [-0.047043696, 0.054648954, -0.008509666, -0.0... 1.618125e+00 8589934592 NaN 0.055556 ```	2025-09-23 18:25:18 +05:30
Ayush Chaurasia	f076bb41f4	feat: add support for returning all scores with rerankers (#2509 ) Previously `return_score="all"` was supported only for the default reranker (RRF) and not the model based rerankers. This adds support for keeping all scores in the base reranker so that all model based rerankers can use it. Its a slower path than keeping just the relevance score but can be useful in debugging	2025-07-15 21:03:03 +05:30
Ayush Chaurasia	32fdde23f8	fix: robust handling of empty result when reranking (#2313 ) I found some edge cases while running experiments that - depending on the base reranking libraries, some of them don't handle empty lists well. This PR manually checks if the result set to be reranked is empty <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit - Bug Fixes - Enhanced search result processing by ensuring that reordering only occurs when valid, non-empty results are available, thereby preventing unnecessary operations and potential errors. - Tests - Added automated tests to verify that empty search result sets are handled correctly, ensuring consistent behavior across various rerankers. <!-- end of auto-generated comment: release notes by coderabbit.ai -->	2025-04-09 16:26:05 +05:30
Ayush Chaurasia	ba1ded933a	fix: add better check for empty results in hybrid search (#2252 ) fixes: https://github.com/lancedb/lancedb/issues/2249	2025-03-21 13:05:05 +05:30
msu-reevo	cc81f3e1a5	fix(python): typing (#2167 ) @wjones127 is there a standard way you guys setup your virtualenv? I can either relist all the dependencies in the pyright precommit section, or specify a venv, or the user has to be in the virtual environment when they run git commit. If the venv location was standardized or a python manager like `uv` was used it would be easier to avoid duplicating the pyright dependency list. Per your suggestion, in `pyproject.toml` I added in all the passing files to the `includes` section. For ruff I upgraded the version and removed "TCH" which doesn't exist as an option. I added a `pyright_report.csv` which contains a list of all files sorted by pyright errors ascending as a todo list to work on. I fixed about 30 issues in `table.py` stemming from str's being passed into methods that required a string within a set of string Literals by extracting them into `types.py` Can you verify in the rust bridge that the schema should be a property and not a method here? If it's a method, then there's another place in the code where `inner.schema` should be `inner.schema()` ``` python class RecordBatchStream: @property def schema(self) -> pa.Schema: ... ``` Also unless the `_lancedb.pyi` file is wrong, then there is no `__anext__` here for `__inner` when it's not an `AsyncGenerator` and only `next` is defined: ``` python async def __anext__(self) -> pa.RecordBatch: return await self._inner.__anext__() if isinstance(self._inner, AsyncGenerator): batch = await self._inner.__anext__() else: batch = await self._inner.next() if batch is None: raise StopAsyncIteration return batch ``` in the else statement, `_inner` is a `RecordBatchStream` ```python class RecordBatchStream: @property def schema(self) -> pa.Schema: ... async def next(self) -> Optional[pa.RecordBatch]: ... ``` --------- Co-authored-by: Will Jones <willjones127@gmail.com>	2025-03-10 09:01:23 -07:00
Will Jones	7ac5f74c80	feat!: add variable store to embeddings registry (#2112 ) BREAKING CHANGE: embedding function implementations in Node need to now call `resolveVariables()` in their constructors and should not implement `toJSON()`. This tries to address the handling of secrets. In Node, they are currently lost. In Python, they are currently leaked into the table schema metadata. This PR introduces an in-memory variable store on the function registry. It also allows embedding function definitions to label certain config values as "sensitive", and the preprocessing logic will raise an error if users try to pass in hard-coded values. Closes #2110 Closes #521 --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2025-02-24 15:52:19 -08:00
Will Jones	15f8f4d627	ci: check license headers (#2076 ) Based on the same workflow in Lance.	2025-01-29 08:27:07 -08:00
Will Jones	28e1b70e4b	fix(python): preserve original distance and score in hybrid queries (#2061 ) Fixes #2031 When we do hybrid search, we normalize the scores. We do this calculation in-place, because the Rerankers expect the `_distance` and `_score` columns to be the normalized ones. So I've changed the logic so that we restore the original distance and scores by matching on row ids.	2025-01-23 13:54:26 -08:00
BubbleCal	d0501f65f1	fix: linear reranker applies wrong score to combine (#2035 ) related to #2014 this fixes: - linear reranker may lost some results if the merging consumes all vector results earlier than fts results - linear reranker inverts the fts score but only vector distance can be inverted --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2025-01-17 11:33:48 +08:00
fzowl	cbbc07d0f5	feat: voyageai support (#1799 ) Adding VoyageAI embedding and rerank support	2024-11-09 00:51:20 +05:30
Ayush Chaurasia	86978e7588	feat!: enforce all rerankers always return relevance score & deprecate linear combination fixes (#1687 ) - Enforce all rerankers always return _relevance_score. This was already loosely done in tests before but based on user feedback its better to always have _relevance_score present in all reranked results - Deprecate LinearCombinationReranker in docs. And also fix a case where it would not return _relevance_score if one result set was missing	2024-09-23 12:12:02 +05:30
Ayush Chaurasia	dc72ece847	feat!: better api for manual hybrid queries (#1575 ) Currently, the only documented way of performing hybrid search is by using embedding API and passing string queries that get automatically embedded. There are use cases where users might like to pass vectors and text manually instead. This ticket contains more information and historical context - https://github.com/lancedb/lancedb/issues/937 This breaks a undocumented pathway that allowed passing (vector, text) tuple queries which was intended to be temporary, so this is marked as a breaking change. For all practical purposes, this should not really impact most users ### usage ``` results = table.search(query_type="hybrid") .vector(vector_query) .text(text_query) .limit(5) .to_pandas() ```	2024-08-30 17:37:58 +05:30
Ayush Chaurasia	549ca51a8a	feat: add answerdotai rerankers support and minor improvements (#1560 ) This PR: - Adds missing license headers - Integrates with answerdotai Rerankers package - Updates ColbertReranker to subclass answerdotai package. This is done to keep backwards compatibility as some users might be used to importing ColbertReranker directly - Set `trust_remote_code` to ` True` by default in CrossEncoder and sentence-transformer based rerankers	2024-08-26 13:25:10 +05:30
Ayush Chaurasia	7d65dd97cf	chore(python): update Colbert architecture and minor improvements (#1547 ) - Update ColBertReranker architecture: The current implementation doesn't use the right arch. This PR uses the implementation in Rerankers library. Fixes https://github.com/lancedb/lancedb/issues/1546 Benchmark diff (hit rate): Hybrid - 91 vs 87 reranked vector - 85 vs 80 - Reranking in FTS is basically disabled in main after last week's FTS updates. I think there's no blocker in supporting that? - Allow overriding accelerators: Most transformer based Rerankers and Embedding automatically select device. This PR allows overriding those settings by passing `device`. Fixes: https://github.com/lancedb/lancedb/issues/1487 --------- Co-authored-by: BubbleCal <bubble-cal@outlook.com>	2024-08-21 12:26:52 +05:30
BubbleCal	f9d5fa88a1	feat!: migrate FTS from tantivy to lance-index (#1483 ) Lance now supports FTS, so add it into lancedb Python, TypeScript and Rust SDKs. For Python, we still use tantivy based FTS by default because the lance FTS index now misses some features of tantivy. For Python: - Support to create lance based FTS index - Support to specify columns for full text search (only available for lance based FTS index) For TypeScript: - Change the search method so that it can accept both string and vector - Support full text search For Rust - Support full text search The others: - Update the FTS doc BREAKING CHANGE: - for Python, this renames the attached score column of FTS from "score" to "_score", this could be a breaking change for users that rely the scores --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>	2024-08-08 15:33:15 +08:00
Ayush Chaurasia	4769d8eb76	feat(python): multi-vector reranking support (#1481 ) Currently targeting the following usage: ``` from lancedb.rerankers import CrossEncoderReranker reranker = CrossEncoderReranker() query = "hello" res1 = table.search(query, vector_column_name="vector").limit(3) res2 = table.search(query, vector_column_name="text_vector").limit(3) res3 = table.search(query, vector_column_name="meta_vector").limit(3) reranked = reranker.rerank_multivector( [res1, res2, res3], deduplicate=True, query=query # some reranker models need query ) ``` - This implements rerank_multivector function in the base reranker so that all rerankers that implement rerank_vector will automatically have multivector reranking support - Special case for RRF reranker that just uses its existing rerank_hybrid fcn to multi-vector reranking. --------- Co-authored-by: Weston Pace <weston.pace@gmail.com>	2024-08-07 01:45:46 +05:30
Ayush Chaurasia	0255221086	feat: add reciprocal rank fusion reranker (#1456 ) Implements https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf Refactors the hybrid search only rerrankers test to avoid repetition.	2024-07-23 21:37:17 +05:30
Ayush Chaurasia	ed7bd45c17	chore: choose appropriate args for concat_table based on pyarrow version & refactor reranker tests (#1455 )	2024-07-18 21:04:59 +05:30
Ishani Ghose	0838e12b30	feat: add to_batches API #805 (#1048 ) SDK Python Description Exposes pyarrow batch api during query execution - relevant when there is no vector search query, dataset is large and the filtered result is larger than memory. --------- Co-authored-by: Ishani Ghose <isghose@amazon.com> Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>	2024-04-05 16:33:37 -07:00
Ayush Chaurasia	42fad84ec8	feat(python): Support reranking for vector and fts (#1103 ) solves https://github.com/lancedb/lancedb/issues/1086 Usage Reranking with FTS: ``` retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite") pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274."}, {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."}, {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."}, {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "}, {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."}, {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."}, ] retriever.add(pylist) retriever.create_fts_index("text", replace=True) query = "What is the capital of the United States?" reranker = CohereReranker(return_score="all") print(retriever.search(query, query_type="fts").limit(10).to_pandas()) print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas()) ``` Result ``` text vector score 0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602 1 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046 2 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521 3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898 4 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422 5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346 text vector score _relevance_score 0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422 0.979977 1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521 0.299105 2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602 0.284874 3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898 0.089614 4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346 0.063832 5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046 0.041462 ``` ## Vector Search usage: ``` query = "What is the capital of the United States?" reranker = CohereReranker(return_score="all") print(retriever.search(query).limit(10).to_pandas()) print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here ``` Results ``` text vector _distance 0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973 1 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884 2 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200 3 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654 4 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867 5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544 text vector _distance _relevance_score 0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884 0.979977 1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867 0.299105 2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973 0.284874 3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200 0.089614 4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544 0.063832 5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654 0.041462 ```	2024-04-05 16:33:06 -07:00
Weston Pace	2cec2a8937	feat: add a basic async python client starting point (#1014 ) This changes `lancedb` from a "pure python" setuptools project to a maturin project and adds a rust lancedb dependency. The async python client is extremely minimal (only `connect` and `Connection.table_names` are supported). The purpose of this PR is to get the infrastructure in place for building out the rest of the async client. Although this is not technically a breaking change (no APIs are changing) it is still a considerable change in the way the wheels are built because they now include the native shared library.	2024-04-05 16:31:34 -07:00

23 Commits