mirror of https://github.com/lancedb/lancedb.git synced 2026-01-09 21:32:58 +00:00

Prashanth Rao 119b928a52 docs: Updates and refactor (#683 )

This PR makes incremental changes to the documentation.

* Closes #697 
* Closes #698

## Chores
- [x] Add dark mode
- [x] Fix headers in navbar
- [x] Add `extra.css` to customize navbar styles
- [x] Customize fonts for prose/code blocks, navbar and admonitions
- [x] Inspect all admonition boxes (remove redundant dropdowns) and
improve clarity and readability
- [x] Ensure that all images in the docs have white background (not
transparent) to be viewable in dark mode
- [x] Improve code formatting in code blocks to make them consistent
with autoformatters (eslint/ruff)
- [x] Add bolder weight to h1 headers
- [x] Add diagram showing the difference between embedded (OSS) and
serverless (Cloud)
- [x] Fix [Creating an empty
table](https://lancedb.github.io/lancedb/guides/tables/#creating-empty-table)
section: right now, the subheaders are not clickable.
- [x] In critical data ingestion methods like `table.add` (among
others), the type signature often does not match the actual code
- [x] Proof-read each documentation section and rewrite as necessary to
provide more context, use cases, and explanations so it reads less like
reference documentation. This is especially important for CRUD and
search sections since those are so central to the user experience.

## Restructure/new content 
- [x] The section for [Adding
data](https://lancedb.github.io/lancedb/guides/tables/#adding-to-a-table)
only shows examples for pandas and iterables. We should include pydantic
models, arrow tables, etc.
- [x] Add conceptual tutorial for IVF-PQ index
- [x] Clearly separate vector search, FTS and filtering sections so that
these are easier to find
- [x] Add docs on refine factor to explain its importance for recall.
Closes #716
- [x] Add an FAQ page showing answers to commonly asked questions about
LanceDB. Closes #746
- [x] Add simple polars example to the integrations section. Closes #756
and closes #153
- [ ] Add basic docs for the Rust API (more detailed API docs can come
later). Closes #781
- [x] Add a section on the various storage options on local vs. cloud
(S3, EBS, EFS, local disk, etc.) and the tradeoffs involved. Closes #782
- [x] Revamp filtering docs: add pre-filtering examples and redo headers
and update content for SQL filters. Closes #783 and closes #784.
- [x] Add docs for data management: compaction, cleaning up old versions
and incremental indexing. Closes #785
- [ ] Add a benchmark section that also discusses some best practices.
Closes #787

---------

Co-authored-by: Ayush Chaurasia <ayush.chaurarsia@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>

2024-01-19 00:18:37 +05:30

3.7 KiB

Full-text search

LanceDB provides support for full-text search via Tantivy (currently Python only), allowing you to incorporate keyword-based search (based on BM25) in your retrieval solutions. Our goal is to push the FTS integration down to the Rust level in the future, so that it's available for JavaScript users as well.

A hybrid search solution combining vector and full-text search is also on the way.

Installation

To use full-text search, install the dependency tantivy-py:

# Say you want to use tantivy==0.20.1
pip install tantivy==0.20.1

Example

Suppose we have a LanceDB table named my_table with a string column named text that we want to index and query via keyword search.

import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table(
    "my_table",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)

Create FTS index on single column

The FTS index must be created before you can search via keywords.

table.create_fts_index("text")

To search an FTS index via keywords, LanceDB's table.search accepts a string as input:

table.search("puppy").limit(10).select(["text"]).to_list()

This returns the result as a list of dictionaries as follows.

[{'text': 'Frodo was a happy puppy', 'score': 0.6931471824645996}]
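The score in the result above can be reproduced by hand. The following is a minimal pure-Python sketch of BM25 scoring, not LanceDB or Tantivy code. It assumes a Lucene-style BM25 variant with `k1 = 1.2` and `b = 0.75`, and plain lowercase whitespace tokenization; these parameters and the helper names `tokenize` and `bm25_scores` are assumptions made for illustration.

```python
import math

# Assumed BM25 parameters (common defaults; not taken from the LanceDB docs).
K1, B = 1.2, 0.75


def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, docs):
    """Return one BM25 score per document for the given query string."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = [0.0] * n_docs
    for term in tokenize(query):
        # Number of documents that contain the term at least once.
        doc_freq = sum(1 for tokens in tokenized if term in tokens)
        if doc_freq == 0:
            continue
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        for i, tokens in enumerate(tokenized):
            tf = tokens.count(term)
            # Length normalization: longer-than-average documents are penalized.
            norm = K1 * (1 - B + B * len(tokens) / avg_len)
            scores[i] += idf * tf * (K1 + 1) / (tf + norm)
    return scores


docs = ["Frodo was a happy puppy", "There are several kittens playing"]
scores = bm25_scores("puppy", docs)
print(round(scores[0], 4))  # 0.6931
```

Both documents have five tokens and only one contains "puppy", so the length normalization cancels out and the score reduces to the inverse document frequency term, ln(2).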

!!! note
    LanceDB automatically searches on the existing FTS index if the input to the search is of type str. If you provide a vector as input, LanceDB will search the ANN index instead.

Index multiple columns

If you have multiple string columns to index, there's no need to combine them manually -- simply pass them all as a list to create_fts_index:

table.create_fts_index(["text1", "text2"])

Note that the search API call does not change - you can search over all indexed columns at once.

Filtering

Currently, the LanceDB full-text search feature supports post-filtering, meaning filters are applied to the results after the full-text search has run. This can be invoked via the familiar where syntax:

table.search("puppy").limit(10).where("meta='foo'").to_list()
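One practical consequence of post-filtering is that a query can return fewer rows than the requested limit. The sketch below illustrates this with plain Python lists rather than LanceDB itself. It assumes the filter runs after the top `limit` hits are taken, and it uses a toy relevance score (term count); the function `search_then_filter` and the sample rows are hypothetical.

```python
def search_then_filter(rows, query, limit, predicate):
    """Post-filtering: rank by keyword match, cut to `limit`, then filter."""
    term = query.lower()
    # Toy relevance score: how many times the query term occurs in the text.
    scored = [(row["text"].lower().split().count(term), row) for row in rows]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    top = [row for _, row in hits[:limit]]
    # The filter only sees rows that survived the limit.
    return [row for row in top if predicate(row)]


rows = [
    {"text": "puppy puppy puppy", "meta": "bar"},
    {"text": "happy puppy puppy", "meta": "bar"},
    {"text": "one puppy", "meta": "foo"},
]


def is_foo(row):
    return row["meta"] == "foo"


narrow = search_then_filter(rows, "puppy", limit=2, predicate=is_foo)
wide = search_then_filter(rows, "puppy", limit=3, predicate=is_foo)
print(narrow)  # []
print(wide)    # [{'text': 'one puppy', 'meta': 'foo'}]
```

If a filtered query returns fewer rows than expected, raising the limit is one way to give the filter more candidates to work with.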

Syntax

For full-text search you can perform either a phrase query like "the old man and the sea", or a structured search query like "(Old AND Man) AND Sea". Double quotes are used to disambiguate.

For example:

If you intend "they could have been dogs OR cats" as a phrase query, passing it unquoted raises a syntax error, because OR is a recognized operator. Lowercasing the operator (or instead of OR) avoids the error, but it is cumbersome to remember which words conflict with the query syntax. Instead, wrap the whole phrase in double quotes, as in table.search('"they could have been dogs OR cats"'); the query parser then does not interpret operators inside the quotes.
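A small helper can apply the quoting consistently. The function `as_phrase_query` below is a hypothetical convenience, not part of LanceDB. It assumes that dropping any double quotes embedded in the input is acceptable, which keeps the wrapping quotes balanced without relying on an escaping rule.

```python
def as_phrase_query(text: str) -> str:
    """Wrap free text in double quotes so it is parsed as a phrase query."""
    # Simplification: remove embedded double quotes instead of escaping them.
    cleaned = text.replace('"', "")
    return f'"{cleaned}"'


query = as_phrase_query("they could have been dogs OR cats")
print(query)  # "they could have been dogs OR cats"

# With a table that has an FTS index, the quoted string is passed as usual:
# results = table.search(query).limit(10).to_list()
```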

Configurations

By default, LanceDB configures a 1GB heap size limit for creating the index. You can reduce this if running on a smaller node, or increase this for faster performance while indexing a larger corpus.

# configure a 512MB heap size
heap = 1024 * 1024 * 512
table.create_fts_index(["text1", "text2"], writer_heap_size=heap, replace=True)

Current limitations

- Incremental writes are not yet supported. If you add data after creating the FTS index, it won't be reflected in search results until you do a full reindex.
- Only local filesystem paths are supported for the FTS index. This is a tantivy limitation: we've implemented an object store plugin, but tantivy-py offers no way to specify that it should be used.
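Until incremental writes are supported, one way to work around the first limitation is to rebuild the index after every batch of new rows. The helper `add_and_reindex` below is a hypothetical wrapper; it relies only on `table.add` and on `create_fts_index` with `replace=True`, as used in the configuration example above. A full rebuild can be slow on a large corpus, so this sketch suits batched rather than row-by-row ingestion.

```python
def add_and_reindex(table, rows, columns):
    """Append rows, then rebuild the FTS index so the new rows are searchable."""
    table.add(rows)
    # replace=True overwrites the existing index instead of raising an error.
    table.create_fts_index(columns, replace=True)


# Usage with the table from the example above:
# add_and_reindex(table, [{"vector": [1.0, 2.0], "text": "A sleepy puppy"}], ["text"])
```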