BREAKING CHANGE: the default tokenizer no longer performs stemming or
stop-word removal. Users who need either must now enable it explicitly.
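To make the behavioural difference concrete, here is a minimal pure-Python sketch of tokenizing with and without stemming and stop-word removal. It is a toy, not Lance's tokenizer: the stop-word list, the suffix-stripping `naive_stem` helper, and the `stem` / `remove_stop_words` argument names on this function are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; not the list Lance uses.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}


def naive_stem(token: str) -> str:
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text, stem=False, remove_stop_words=False):
    """Lowercase, split on non-alphanumerics, then apply the optional steps."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


# New default behaviour: every word is kept as written.
plain = tokenize("The dogs are barking")
# Opt-in behaviour: stop words dropped, remaining words reduced to a stem.
processed = tokenize("The dogs are barking", stem=True, remove_stop_words=True)
```

With the plain tokens, a query for "dog" would not match a document that only contains "dogs", which is the practical effect of the new default.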
- upgrade lance to 0.19.1
- update the FTS docs
- update the FTS API
Upstream change notes:
https://github.com/lancedb/lance/releases/tag/v0.19.1
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Resolves #1709. Adds `trust_remote_code` as a parameter to the
`TransformersEmbeddingFunction` class, with a default of `False`. Updates
the relevant documentation accordingly.
- Enforce that all rerankers always return `_relevance_score`. Tests
already loosely assumed this, but based on user feedback it's better to
always have `_relevance_score` present in all reranked results.
- Deprecate LinearCombinationReranker in the docs, and fix a case where
it would not return `_relevance_score` if one result set was missing.
The markdown renders well on GitHub (GFM style?), but `mkdocs` seems to
require a blank line between a paragraph and a list block to render it
correctly.
see also the web page:
https://lancedb.github.io/lancedb/concepts/index_hnsw/
Docs used `get_registry.get(...)` whereas what works is
`get_registry().get(...)`. Fixing the two instances I found. I tested
the open clip version by trying it locally in a Jupyter notebook.
- Both LinearCombination (the current default) and RRF are pretty fast
compared to model-based rerankers. RRF is slightly faster.
- In our tests RRF has also been slightly more accurate.
This PR:
- Makes RRF the default reranker
- Removes duplicate docs for rerankers
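For readers unfamiliar with RRF (Reciprocal Rank Fusion), the idea can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not LanceDB's reranker implementation: the function name, the result-dictionary shape, and the constant `k = 60` (a commonly used value) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Only ranks are used, never raw scores,
    so no model inference or score normalisation is needed.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [{"id": doc_id, "_relevance_score": score} for doc_id, score in ranked]


# Hypothetical ids returned by a vector search and a full text search.
vector_hits = ["a", "b", "c"]
fts_hits = ["b", "d"]
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
fused_ids = [row["id"] for row in fused]
```

Document "b" ranks first because it appears in both lists; the remaining documents are ordered by the rank they held in their single list.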
Refine and improve the language clarity and quality across all example
pages in the documentation to ensure better understanding and
readability.
---------
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Currently, the only documented way of performing hybrid search is by
using embedding API and passing string queries that get automatically
embedded. There are use cases where users might like to pass vectors and
text manually instead.
This ticket contains more information and historical context -
https://github.com/lancedb/lancedb/issues/937
This breaks an undocumented pathway that allowed passing (vector, text)
tuple queries, which was intended to be temporary, so this is marked as
a breaking change. In practice, this should not affect most users.
### usage
```
# Parentheses let the method chain span multiple lines.
results = (
    table.search(query_type="hybrid")
    .vector(vector_query)
    .text(text_query)
    .limit(5)
    .to_pandas()
)
```
This PR:
- Adds missing license headers
- Integrates with answerdotai Rerankers package
- Updates ColbertReranker to subclass answerdotai package. This is done
to keep backwards compatibility as some users might be used to importing
ColbertReranker directly
- Sets `trust_remote_code` to `True` by default in CrossEncoder and
sentence-transformer based rerankers
Lance now supports FTS, so this adds it to the lancedb Python, TypeScript
and Rust SDKs.
For Python, we still use tantivy-based FTS by default because the lance
FTS index currently lacks some features of tantivy.
For Python:
- Support creating a lance-based FTS index
- Support specifying columns for full text search (only available for
lance-based FTS index)
For TypeScript:
- Change the search method so that it can accept both string and vector
- Support full text search
For Rust:
- Support full text search
The others:
- Update the FTS doc
BREAKING CHANGE:
- for Python, this renames the attached score column of FTS from "score"
to "_score"; this could be a breaking change for users that rely on the
scores
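Code that reads the FTS score and has to work on both sides of this rename could use a small accessor like the sketch below. The helper name `fts_score` and the plain-dict row shape are assumptions for illustration; they are not part of the lancedb API.

```python
def fts_score(row):
    """Return the FTS score from a result row, whichever column name it uses."""
    if "_score" in row:  # column name after this change
        return row["_score"]
    return row["score"]  # column name before this change


# Hypothetical result rows, e.g. from converting search results to dicts.
old_row = {"text": "hello world", "score": 1.5}
new_row = {"text": "hello world", "_score": 1.5}
scores = [fts_score(old_row), fts_score(new_row)]
```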
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
### Fix markdown table rendering issue
This PR adds a missing blank line before a markdown table in the
documentation. Without it, the table does not render properly in
mkdocs, although it renders properly in GitHub's markdown viewer.
#### Change Details:
- Added a single line of whitespace before the markdown table to ensure
proper rendering in mkdocs.
#### Note:
- I wasn't able to test this fix in the mkdocs environment, but it
should be safe as it only involves adding whitespace which won't break
anything.
---
Cohere supports the following input types:
| Input Type              | Description                                                                 |
|-------------------------|-----------------------------------------------------------------------------|
| "`search_document`"     | Used for embeddings stored in a vector database for search use-cases.       |
| "`search_query`"        | Used for embeddings of search queries run against a vector DB.              |
| "`semantic_similarity`" | Specifies the given text will be used for Semantic Textual Similarity (STS). |
| "`classification`"      | Used for embeddings passed through a text classifier.                       |
| "`clustering`"          | Used for the embeddings run through a clustering algorithm.                 |
Usage Example: