binary vectors and hamming distance can work on only IVF_FLAT, so
introduce them all in this PR.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
- Tried to address some onboarding feedbacks listed in
https://github.com/lancedb/lancedb/issues/1224
- Improve visibility of pydantic integration and embedding API. (Based
on onboarding feedback - Many ways of ingesting data, defining schema
but not sure what to use in a specific use-case)
- Add a guide that takes users through testing and improving retriever
performance using built-in utilities like hybrid-search and reranking
- Add some benchmarks for the above
- Add missing cohere docs
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.
1. Fixes#1168
2. Closes#1165
3. Closes#1082
4. Closes#439
5. Closes#897
6. Closes#642
7. Closes#281
8. Closes#114
9. Closes#990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
- Rename safe_import -> attempt_import_or_raise (closes
https://github.com/lancedb/lancedb/pull/923)
- Update docs
- Add Notebook example (@changhiskhan you can use it for the talk. Comes
with "open in colab" button)
- Latency benchmark & results comparison, sanity check on real-world
data
- Updates the default openai model to gpt-4
based on https://github.com/lancedb/lancedb/pull/713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)
### Default reranker -- `LinearCombinationReranker(weight=0.7,
fill=1.0)`
```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker
# Same as default
table.search("hello", query_type="hybrid").rerank(
normalize="score",
reranker=LinearCombinationReranker()
).to_pandas()
# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
normalize="score",
reranker=reranker
).to_pandas()
```
Cohere Reranker
```
from lancedb.rerankers import CohereReranker
# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
normalize="rank", # score or rank
reranker=CohereReranker()
).to_pandas()
```
CrossEncoderReranker
```
from lancedb.rerankers import CrossEncoderReranker
table.search("hello", query_type="hybrid").rerank(
normalize="rank",
reranker=CrossEncoderReranker()
).to_pandas()
```
## Using custom Reranker
```
from lancedb.reranker import Reranker
class CustomReranker(Reranker):
def rerank_hybrid(self, vector_result, fts_result):
combined_res = self.merge_results(vector_results, fts_results) # or use custom combination logic
# Custom rerank logic here
return combined_res
```
- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion
---------
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
have added testing and an example in the docstring, will be pushing a
separate PR in recipe repo for rag example
---------
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
This PR makes incremental changes to the documentation.
* Closes#697
* Closes#698
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes#716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes#746
- [x] Add simple polars example to the integrations section. Closes#756
and closes#153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes#781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes#782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes#783 and closes#784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes#785
- [ ] Add a benchmark section that also discusses some best practices.
Closes#787
---------
Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
This PR adds an overview of embeddings docs:
- 2 ways to vectorize your data using lancedb - explicit & implicit
- explicit - manually vectorize your data using `wit_embedding` function
- Implicit - automatically vectorize your data as it comes by ingesting
your embedding function details as table metadata
- Multi-modal example w/ disappearing embedding function
* Rename "Reference" -> "Guides" to create distinction b/w api reference
and user facing docs
* Add all the various ways to create, add and delete from table
Related - https://github.com/lancedb/lancedb/pull/391
- Creates testing files `md_testing.py` and `md_testing.js` for testing
python and nodejs code in markdown files in the documentation
This listens for HTML tags as well: `<!--[language] code code
code...-->` will create a set-up file to create some mock tables or to
fulfill some assumptions in the documentation.
- Creates a github action workflow that triggers every push/pr to
`docs/**`
- Modifies documentation so tests run (mostly indentation, some small
syntax errors and some missing imports)
A list of excluded files that we need to take a closer look at later on:
```javascript
const excludedFiles = [
"../src/fts.md",
"../src/embedding.md",
"../src/examples/serverless_lancedb_with_s3_and_lambda.md",
"../src/examples/serverless_qa_bot_with_modal_and_langchain.md",
"../src/examples/youtube_transcript_bot_with_nodejs.md",
];
```
Many of them can't be done because we need the OpenAI API key :(.
`fts.md` has some issues with the library, I believe this is still
experimental?
Closes#170
---------
Co-authored-by: Will Jones <willjones127@gmail.com>