mirror of
https://github.com/lancedb/lancedb.git
synced 2026-07-03 02:50:41 +00:00
## What
`MRRReranker.rerank_multivector` averages each document's reciprocal
ranks over the wrong denominator. It divides by the number of rankings
the document *happens to appear in*, instead of the total number of
rankings being fused.
```python
# python/python/lancedb/rerankers/mrr.py
for result_id, reciprocal_ranks in mrr_score_map.items():
    mean_rr = np.mean(reciprocal_ranks)  # divides by len(present systems)
```
`mrr_score_map[doc]` only accumulates a reciprocal rank for the systems
in which the document was returned, so `np.mean` never accounts for the
systems that missed it.
## Why it's wrong
Mean Reciprocal Rank fusion treats a system that didn't return a
document as a reciprocal rank of `0` and averages across **all**
systems. That's the exact mechanism by which it rewards cross-system
consensus. Dividing by the appearance count removes that, so a document
liked by a single ranking can beat one ranked highly by every ranking.
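
Written out, the definition described above looks like this. The notation is introduced here for illustration and is not taken from the code: $N$ is the number of rankings being fused and $\operatorname{rank}_i(d)$ is the 1-based position of document $d$ in ranking $i$.

```math
\mathrm{MRR}(d) = \frac{1}{N} \sum_{i=1}^{N} r_i(d),
\qquad
r_i(d) =
\begin{cases}
  1 / \operatorname{rank}_i(d) & \text{if ranking } i \text{ returns } d \\
  0 & \text{otherwise}
\end{cases}
```

Averaging only over the rankings that return $d$ replaces $N$ with the count of non-zero $r_i(d)$ terms, which is the behaviour described under "What".
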
Concretely, fusing 3 vector rankings:
| Doc | Ranks | Current score | Correct score |
|-----|-------|---------------|---------------|
| A | #1 in 1 system only | `mean([1.0]) = 1.000` | `1.0 / 3 = 0.333` |
| B | #1, #1, #2 across all 3 | `mean([1, 1, .5]) = 0.833` | `2.5 / 3 = 0.833` |
The current code ranks **A above B**: a document that two of three rankings
ignored outranks one that all three ranked at or near the top.
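
The arithmetic in the table can be checked with a few lines of plain Python. This is a minimal sketch, not the library code: the dictionary stands in for the per-document lists that `mrr_score_map` would hold, and `NUM_SYSTEMS`, `mean_over_present`, and `mean_over_all` are illustrative names.

```python
# Reciprocal ranks each document collected from the rankings that returned it.
# Doc A: rank 1 in one ranking only. Doc B: ranks 1, 1, 2 across all three.
reciprocal_ranks = {
    "A": [1.0],
    "B": [1.0, 1.0, 0.5],
}
NUM_SYSTEMS = 3  # total number of rankings being fused (illustrative name)


def mean_over_present(rrs):
    """Current behaviour: divide by the number of rankings that returned the doc."""
    return sum(rrs) / len(rrs)


def mean_over_all(rrs, num_systems):
    """Corrected behaviour: a ranking that missed the doc contributes 0."""
    return sum(rrs) / num_systems


current = {doc: mean_over_present(rrs) for doc, rrs in reciprocal_ranks.items()}
correct = {doc: mean_over_all(rrs, NUM_SYSTEMS) for doc, rrs in reciprocal_ranks.items()}

for doc in reciprocal_ranks:
    print(f"{doc}: current={current[doc]:.3f} correct={correct[doc]:.3f}")
# A: current=1.000 correct=0.333
# B: current=0.833 correct=0.833
```

Only A's score changes, because B already appears in every ranking. That is enough to flip the order.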
This also makes `rerank_multivector` inconsistent with `rerank_hybrid`
in the same file, which already treats a missing system as `0`
(`vector_rr = 0.0` / `fts_rr = 0.0`), and with the class docstring
("average of reciprocal ranks across different search results").
## Fix
Divide the summed reciprocal ranks by the total number of rankings:
```python
num_systems = len(vector_results)
...
mean_rr = float(np.sum(reciprocal_ranks)) / num_systems
```
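
For reference, the corrected averaging can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in `mrr.py`. `fuse_mrr` is a hypothetical helper. Each ranking is assumed to be a plain list of document ids ordered best-first, with no id repeated within a ranking, rather than whatever result objects `rerank_multivector` actually receives.

```python
from collections import defaultdict


def fuse_mrr(rankings):
    """Fuse ranked id lists into (doc_id, score) pairs, best score first.

    Hypothetical helper for illustration; not part of LanceDB.
    """
    num_systems = len(rankings)
    if num_systems == 0:
        return []

    # Sum reciprocal ranks per document; rankings that miss a document add nothing.
    rr_sums = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            rr_sums[doc_id] += 1.0 / position

    # Divide by the total number of rankings, not by the appearance count.
    fused = {doc_id: total / num_systems for doc_id, total in rr_sums.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Same scenario as the table: A is #1 in one ranking, B is #2, #1, #1.
fused = fuse_mrr([["A", "B"], ["B"], ["B"]])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.3f}")
# B: 0.833
# A: 0.333
```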
## Tests
Adds `test_mrr_multivector_rewards_consensus`, which asserts the exact
MRR scores and that the consensus document ranks first. It fails on
`main` and passes with this change. Existing reranker tests are
unaffected.
# LanceDB Python SDK

A Python library for LanceDB.

## Installation

```shell
pip install lancedb
```

### Preview Releases

Stable releases are created about every 2 weeks. For the latest features and bug fixes, you can install the preview release. These releases receive the same level of testing as stable releases, but are not guaranteed to be available for more than 6 months after they are released. Once your application is stable, we recommend switching to stable releases.

```shell
pip install --pre --extra-index-url https://pypi.fury.io/lancedb/ lancedb
```

## Usage

### Basic Example

```python
import lancedb

db = lancedb.connect('<PATH_TO_LANCEDB_DATASET>')
table = db.open_table('my_table')
results = table.search([0.1, 0.3]).limit(20).to_list()
print(results)
```

## Development

See CONTRIBUTING.md for information on how to contribute to LanceDB.