Added the ability to specify `tokenizer_name` when creating a full text
search index using tantivy. This enables the use of language-specific
stemming.
Also updated the [guide on full text
search](https://lancedb.github.io/lancedb/fts/) with a short section on
choosing a tokenizer.
Fixes #1315
- Fix some clippy errors caused by CI running a different toolchain.
- Add some safety notes about some unsafe blocks.
- Lock the toolchain so that it is consistent across dev and CI.
- Tried to address some of the onboarding feedback listed in
https://github.com/lancedb/lancedb/issues/1224
- Improve visibility of pydantic integration and embedding API. (Based
on onboarding feedback: there are many ways of ingesting data and
defining a schema, but it is not clear which to use in a specific use case)
- Add a guide that takes users through testing and improving retriever
performance using built-in utilities like hybrid-search and reranking
- Add some benchmarks for the above
- Add missing cohere docs
---------
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Most of the time we don't need to reload. Holding the write lock while
performing IO is not an ideal pattern.
This PR tries to make the critical section of `.write()` happen less
frequently.
This isn't the ideal solution. The ideal solution would not take the
lock until the new dataset has been loaded, but that would require too
much refactoring.
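The idea of entering the locked section less often can be sketched as a check-then-lock pattern. This is only an illustration, not the code from this PR: `CachedDataset`, `load`, and `latest_version` are hypothetical names, versions are assumed to increase monotonically, and a plain `threading.Lock` stands in for the write lock because Python's standard library has no read-write lock.

```python
import threading


class CachedDataset:
    """Hypothetical wrapper that reloads only when the version has changed."""

    def __init__(self, load, latest_version):
        self._load = load                      # assumed: loads data for a version (does IO)
        self._latest_version = latest_version  # assumed: cheap check of the newest version
        self._lock = threading.Lock()
        version = latest_version()
        # Keep version and data in one tuple so readers see a consistent pair.
        self._state = (version, load(version))
        self.reloads = 0

    def get(self):
        latest = self._latest_version()
        version, data = self._state
        # Fast path: nothing changed, so the lock is never taken.
        if version >= latest:
            return data
        with self._lock:
            # Re-check under the lock: another thread may have reloaded already.
            version, data = self._state
            if version < latest:
                data = self._load(latest)  # IO still happens under the lock, but rarely
                self._state = (latest, data)
                self.reloads += 1
            return data


current = {"v": 1}
ds = CachedDataset(lambda v: f"data@{v}", lambda: current["v"])
first = ds.get()    # served from the cached state, no lock taken
current["v"] = 2
second = ds.get()   # version changed, so this call reloads under the lock
```

As in the description above, the load still runs inside the locked section; the sketch only shows how the common no-reload case can avoid the lock.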
- Changed the error message for `table.search` when the query vector has the wrong dimension
- Added missing fields for `listIndices` and `indexStats` to be consistent
with the Python API; will make changes in the node integration test
Part of https://github.com/lancedb/lancedb/issues/994.
Adds the ability to use the OpenAI embedding functions.
The example can be run with the following:
```sh
> export OPENAI_API_KEY="sk-..."
> cargo run --example openai --features=openai
```
which should output
```
Closest match: Winter Parka
```
This doesn't actually block a Python-only release, since this step runs
after the version bump has been pushed, but it would still be nice for
the git job to finish successfully.
While adding some more docs and examples for the new JS SDK, I ran across
a few compatibility issues when using different Arrow versions. This
should fix those issues.
- add `return` for `__enter__`
The buggy code didn't return the object, so the name bound by `as` was
always `None` inside the context manager:
```python
with await lancedb.connect_async("./.lancedb") as db:
    # db is always None
    ...
```
(BTW, why not design an async context manager?)
- add a unit test for Async connection context manager
- update return type of `AsyncConnection.open_table` to `AsyncTable`
Although the type annotation doesn't affect functionality, it is helpful
for IDEs.
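The `__enter__` bug can be reproduced without LanceDB. The sketch below uses two hypothetical stand-in classes (`BrokenConnection` and `FixedConnection` are illustrative names, not LanceDB types) to show that `with ... as x` binds whatever `__enter__` returns, so a missing `return` yields `None`.

```python
class BrokenConnection:
    """Hypothetical stand-in: __enter__ forgets to return the object."""

    def __enter__(self):
        pass  # implicitly returns None, which is what `as` binds

    def __exit__(self, exc_type, exc, tb):
        return False  # do not suppress exceptions


class FixedConnection:
    """Hypothetical stand-in with the missing `return` added."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


with BrokenConnection() as broken:
    broken_value = broken  # None, because __enter__ returned nothing

fixed_conn = FixedConnection()
with fixed_conn as fixed:
    fixed_value = fixed  # the connection object itself
```

A unit test for the context manager can assert exactly this: the name bound by `as` is the connection object, not `None`.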