Commit Graph

231 Commits

Author SHA1 Message Date
Nitish Sharma
2c3f982f4f Minor updates to FAQ (#935)
Based on discussion over discord, adding minor updates to the FAQ
section about benchmarks, practical data size and concurrency in LanceDB
2024-04-05 16:29:58 -07:00
Ayush Chaurasia
d07817a562 feat(python): Reranker DX improvements (#904)
- Most users might not know how to use `QueryBuilder` object. Instead we
should just pass the string query.
- Add new rerankers: Colbert, openai
2024-04-05 16:29:58 -07:00
QianZhu
b2efd0da53 fix hybrid search example (#922) 2024-04-05 16:29:13 -07:00
QianZhu
2e75b16403 make it explicit about the vector column data type (#916)
<img width="837" alt="Screenshot 2024-02-01 at 4 23 34 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/4f0f5c5a-2a24-4b00-aad1-ef80a593d964">
[
<img width="838" alt="Screenshot 2024-02-01 at 4 26 03 PM"
src="https://github.com/lancedb/lancedb/assets/1305083/ca073bc8-b518-4be3-811d-8a7184416f07">
](url)

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:29:05 -07:00
Weston Pace
4eb819072a feat: upgrade to lance 0.9.11 and expose merge_insert (#906)
This adds the python bindings requested in #870 The javascript/rust
bindings will be added in a future PR.
2024-04-05 16:29:05 -07:00
QianZhu
1f2eafca75 arrow table/f16 example (#907) 2024-04-05 16:29:05 -07:00
Will Jones
d5be6c7a05 docs: provide AWS S3 cleanup and permissions advice (#903)
Adding some more quick advice for how to setup AWS S3 with LanceDB.

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
a41f7be88d feat(python): Hybrid search & Reranker API (#824)
based on https://github.com/lancedb/lancedb/pull/713
- The Reranker api can be plugged into vector only or fts only search
but this PR doesn't do that (see example -
https://txt.cohere.com/rerank/)


### Default reranker -- `LinearCombinationReranker(weight=0.7,
fill=1.0)`

```
table.search("hello", query_type="hybrid").rerank(normalize="score").to_pandas()
```
### Available rerankers
LinearCombinationReranker
```
from lancedb.rerankers import LinearCombinationReranker

# Same as default 
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=LinearCombinationReranker()
                                     ).to_pandas()

# with custom params
reranker = LinearCombinationReranker(weight=0.3, fill=1.0)
table.search("hello", query_type="hybrid").rerank(
                                      normalize="score", 
                                      reranker=reranker
                                     ).to_pandas()
```

Cohere Reranker
```
from lancedb.rerankers import CohereReranker

# default model.. English and multi-lingual supported. See docstring for available custom params
table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank",  # score or rank
                                      reranker=CohereReranker()
                                     ).to_pandas()

```

CrossEncoderReranker

```
from lancedb.rerankers import CrossEncoderReranker

table.search("hello", query_type="hybrid").rerank(
                                      normalize="rank", 
                                      reranker=CrossEncoderReranker()
                                     ).to_pandas()

```

## Using custom Reranker
```
from lancedb.reranker import Reranker

class CustomReranker(Reranker):
    def rerank_hybrid(self, vector_result, fts_result):
           combined_res = self.merge_results(vector_results, fts_results) # or use custom combination logic
           # Custom rerank logic here
           
           return combined_res
```

- [x] Expand testing
- [x] Make sure usage makes sense
- [x] Run simple benchmarks for correctness (Seeing weird result from
cohere reranker in the toy example)
- Support diverse rerankers by default:
- [x] Cross encoding
- [x] Cohere
- [x] Reciprocal Rank Fusion

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Prashanth Rao
ecbbe185c7 Fix image bgcolor (#891)
Minor fix to change the background color for an image in the docs. It's
now readable in both light and dark modes (earlier version made it
impossible to read in dark mode).
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
b326bf2ef6 doc: Add documentation chatbot for LanceDB (#890)
<img width="1258" alt="Screenshot 2024-01-29 at 10 05 52 PM"
src="https://github.com/lancedb/lancedb/assets/15766192/7c108fde-e993-415c-ad01-72010fd5fe31">
2024-04-05 16:28:56 -07:00
Raghav Dixit
472344fcb3 feat(python): Embedding fn support for gte-mlx/gte-large (#873)
have added testing and an example in the docstring, will be pushing a
separate PR in recipe repo for rag example

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
2024-04-05 16:28:56 -07:00
Lei Xu
911d063237 doc: fix js example of create index (#886) 2024-04-05 16:28:56 -07:00
Lei Xu
12e776821a doc: use snippet for rust code example and make sure rust examples run through CI (#885) 2024-04-05 16:28:56 -07:00
Chang She
1d0578ce25 doc(rust): minor fixes for Rust quick start. (#878) 2024-04-05 16:28:56 -07:00
Lei Xu
e7fdb931de chore: convert all js doc test to use snippet. (#881) 2024-04-05 16:28:56 -07:00
Lei Xu
d811b89de2 doc: use code snippet for typescript examples (#880)
The typescript code is in a fully function file, that will be run via the CI.
2024-04-05 16:28:56 -07:00
Ayush Chaurasia
545a03d7f9 feat(python): Aws Bedrock embeddings integration (#822)
Supports amazon titan, cohere english & cohere multi-lingual base
models.
2024-04-05 16:28:56 -07:00
Lei Xu
36dbf47d60 chore: add one rust SDK e2e example (#876)
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:28:56 -07:00
Lei Xu
fd2fd94862 doc: update quick start for full rust example (#872) 2024-04-05 16:28:56 -07:00
Will Jones
7d82e56f76 docs: document basics of configuring object storage (#832)
Created based on upstream PR https://github.com/lancedb/lance/pull/1849

Closes #681

---------

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:27:51 -07:00
Lei Xu
5b2c602fb3 doc: improve docs for nodejs connect functions (#833)
* improve the docstring for NodeJS connect functions and
`ConnectOptions` parameters.
* Simplify `npm run build` steps.
2024-04-05 16:27:32 -07:00
Prashanth Rao
e6bb907d81 Docs updates incl. Polars (#827)
This PR makes the following aesthetic and content updates to the docs.

- [x] Fix max width issue on mobile: Content should now render more
cleanly and be more readable on smaller devices
- [x] Improve image quality of flowchart in data management page
- [x] Fix syntax highlighting in text at the bottom of the IVF-PQ
concepts page
- [x] Add example of Polars LazyFrames to docs (Integrations)
- [x] Add example of adding data to tables using Polars (guides)
2024-04-05 16:27:14 -07:00
Prashanth Rao
4d5d748acd docs: Updates and refactor (#683)
This PR makes incremental changes to the documentation.

* Closes #697
* Closes #698

- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:27:12 -07:00
Chang She
ac3d95ec34 feat(python): allow the entire table to be converted a polars dataframe (#814) 2024-04-05 16:26:36 -07:00
Chang She
72b39432e8 feat(python): add exist_ok option to create table (#813)
This mimics CREATE TABLE IF NOT EXISTS behavior.
We add `db.create_table(..., exist_ok=True)` parameter.
By default it is set to False, so trying to create
a table with the same name will raise an exception.
If set to True, then it only opens the table if it
already exists. If you pass in a schema, it will
be checked against the existing table to make sure
you get what you want. If you pass in data, it will
NOT be added to the existing table.
2024-04-05 16:26:35 -07:00
Ayush Chaurasia
2f72d5138e feat(python): Add gemini text embedding function (#806)
Named it Gemini-text for now. Not sure how complicated it will be to
support both text and multimodal embeddings under the same class
"gemini"..But its not something to worry about for now I guess.
2024-04-05 16:25:52 -07:00
Chang She
1dd663fc8a chore(python): document phrase queries in fts (#788)
closes #769 

Add unit test and documentation on using quotes to perform a phrase
query
2024-04-05 16:25:01 -07:00
Chang She
3100f0d861 feat(python): Set heap size to get faster fts indexing performance (#762)
By default tantivy-py uses 128MB heapsize. We change the default to 1GB
and we allow the user to customize this

locally this makes `test_fts.py` run 10x faster
2024-04-05 16:25:00 -07:00
sudhir
8a48b32689 Make examples work with current version of Openai api's (#779)
These examples don't work because of changes in openai api from version
1+
2024-04-05 16:24:47 -07:00
Chris
6698376f02 Minor Fixes to Ingest Embedding Functions Docs (#777)
Addressed minor typos and grammatical issues to improve readability

---------

Co-authored-by: Christopher Correa <chris.correa@gmail.com>
2024-04-05 16:24:47 -07:00
Vladimir Varankin
2fd829296e Minor corrections for docs of embedding_functions (#780)
In addition to #777, this pull request fixes more typos in the
documentation for "Ingest Embedding Functions".
2024-04-05 16:24:47 -07:00
QianZhu
a25d10279c small bug fix for example code in SaaS JS doc (#770) 2024-04-05 16:24:47 -07:00
Bengsoon Chuah
e3ba5b2402 Add relevant imports for each step (#764)
I found that it was quite incoherent to have to read through the
documentation and having to search which submodule that each class
should be imported from.

For example, it is cumbersome to have to navigate to another
documentation page to find out that `EmbeddingFunctionRegistry` is from
`lancedb.embeddings`
2024-04-05 16:24:47 -07:00
QianZhu
25d1c62c3f SaaS JS API sdk doc (#740)
Co-authored-by: Aidan <64613310+aidangomar@users.noreply.github.com>
2024-04-05 16:24:47 -07:00
Xin Hao
a63262cfda docs: fix link (#752) 2024-04-05 16:24:47 -07:00
Chang She
7bac1131fb feat: add timezone handling for datetime in pydantic (#578)
If you add timezone information in the Field annotation for a datetime
then that will now be passed to the pyarrow data type.

I'm not sure how pyarrow enforces timezones, right now, it silently
coerces to the timezone given in the column regardless of whether the
input had the matching timezone or not. This is probably not the right
behavior. Though we could just make it so the user has to make the
pydantic model do the validation instead of doing that at the pyarrow
conversion layer.
2024-04-05 16:24:47 -07:00
Chang She
a0afa84786 feat(python): add post filtering for full text search (#739)
Closes #721 

fts will return results as a pyarrow table. Pyarrow tables has a
`filter` method but it does not take sql filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, this makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly, one possibility being that we use
Substrait to generate pyarrow compute expressions from sql string. Or if
there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:24:47 -07:00
elliottRobinson
3ab4b335c3 Update default_embedding_functions.md (#744)
Modify some grammar, punctuation, and spelling errors.
2024-04-05 16:24:47 -07:00
Will Jones
c34aa09166 docs: update node API reference (#734)
This command hasn't been run for a while...
2024-04-05 16:24:47 -07:00
Will Jones
43662705ad docs: enhance Update user guide (#735)
Closes #705
2024-04-05 16:24:47 -07:00
Chang She
5376970e87 doc(javascript): minor improvement on docs for working with tables (#736)
Closes #639 
Closes #638
2024-04-05 16:24:47 -07:00
Chang She
cc9d74e7a7 feat(python): add option to flatten output in to_pandas (#722)
Closes https://github.com/lancedb/lance/issues/1738

We add a `flatten` parameter to the signature of `to_pandas`. By default
this is None and does nothing.
If set to True or -1, then LanceDB will flatten structs before
converting to a pandas dataframe. All nested structs are also flattened.
If set to any positive integer, then LanceDB will flatten structs up to
the specified level of nesting.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:24:30 -07:00
Ayush Chaurasia
91ff324c70 docs: Update roboflow tutorial position (#666) 2024-04-05 16:23:49 -07:00
QianZhu
bda0135cfc saas python sdk doc (#692)
<img width="256" alt="Screenshot 2023-12-07 at 11 55 41 AM"
src="https://github.com/lancedb/lancedb/assets/1305083/259bf234-9b3b-4c5d-af45-c7f3fada2cc7">
2024-04-05 16:23:49 -07:00
James
a94a033553 (docs):Add CLIP image embedding example (#660)
In this PR, I add a guide that lets you use Roboflow Inference to
calculate CLIP embeddings for use in LanceDB. This post was reviewed by
@AyushExel.
2024-04-05 16:23:49 -07:00
Ayush Chaurasia
088792c821 [Docs]: Add Instructor embeddings and rate limit handler docs (#651) 2024-04-05 16:23:49 -07:00
Ayush Chaurasia
955c2a751a [Docs][SEO] Add sitemap and robots.txt (#645)
Sitemap improves SEO by ranking pages and tracking updates.
2024-04-05 16:23:49 -07:00
Ayush Chaurasia
07ab4cd14c Disable posthog on docs & reduce sentry trace factor (#607)
- posthog charges per event and docs events are registered very
frequently. We can keep tracking them on GA
- Reduced sentry trace factor
2024-04-05 16:23:13 -07:00
QianZhu
3c139c2ee5 Qian/query option doc (#615)
- API documentation improvement for queries (table.search)
- a small bug fix for the remote API on create_table

![image](https://github.com/lancedb/lancedb/assets/1305083/712e9bd3-deb8-4d81-8cd0-d8e98ef68f4e)

![image](https://github.com/lancedb/lancedb/assets/1305083/ba22125a-8c36-4e34-a07f-e39f0136e62c)
2024-04-05 16:22:59 -07:00
Lei Xu
b5e57ebce3 doc: add doc to use GPU for indexing (#611) 2024-04-05 16:22:59 -07:00