This PR:
- Adds missing license headers
- Integrates with answerdotai Rerankers package
- Updates ColbertReranker to subclass answerdotai package. This is done
to keep backwards compatibility as some users might be used to importing
ColbertReranker directly
- Set `trust_remote_code` to ` True` by default in CrossEncoder and
sentence-transformer based rerankers
Lance now supports FTS, so add it into lancedb Python, TypeScript and
Rust SDKs.
For Python, we still use tantivy based FTS by default because the lance
FTS index now misses some features of tantivy.
For Python:
- Support to create lance based FTS index
- Support to specify columns for full text search (only available for
lance based FTS index)
For TypeScript:
- Change the search method so that it can accept both string and vector
- Support full text search
For Rust
- Support full text search
The others:
- Update the FTS doc
BREAKING CHANGE:
- for Python, this renames the attached score column of FTS from "score"
to "_score", this could be a breaking change for users that rely the
scores
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Currently targeting the following usage:
```
from lancedb.rerankers import CrossEncoderReranker
reranker = CrossEncoderReranker()
query = "hello"
res1 = table.search(query, vector_column_name="vector").limit(3)
res2 = table.search(query, vector_column_name="text_vector").limit(3)
res3 = table.search(query, vector_column_name="meta_vector").limit(3)
reranked = reranker.rerank_multivector(
[res1, res2, res3],
deduplicate=True,
query=query # some reranker models need query
)
```
- This implements rerank_multivector function in the base reranker so
that all rerankers that implement rerank_vector will automatically have
multivector reranking support
- Special case for RRF reranker that just uses its existing
rerank_hybrid fcn to multi-vector reranking.
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
solves https://github.com/lancedb/lancedb/issues/1086
Usage Reranking with FTS:
```
retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")
pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274."},
{"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
{"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
{"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
{"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
{"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```
Result
```
text vector score
0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602
1 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046
2 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521
3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898
4 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422
5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346
text vector score _relevance_score
0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422 0.979977
1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521 0.299105
2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602 0.284874
3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898 0.089614
4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346 0.063832
5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046 0.041462
```
## Vector Search usage:
```
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here
```
Results
```
text vector _distance
0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973
1 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884
2 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200
3 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654
4 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867
5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544
text vector _distance _relevance_score
0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884 0.979977
1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867 0.299105
2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973 0.284874
3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200 0.089614
4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544 0.063832
5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654 0.041462
```
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.
The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.
Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.