lancedb/docs/src/fts.md
Chang She bc83bc9838 feat(python): add post filtering for full text search (#739)
Closes #721 

FTS returns its results as a pyarrow Table. A pyarrow Table has a
`filter` method, but it does not take SQL filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which should make the post filter evaluate
much more quickly (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly. One possibility is to use Substrait to
generate pyarrow compute expressions from the SQL string. Alternatively,
if there's enough progress on pyarrow, it could support Substrait
expressions directly (no ETA).

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:25:02 -07:00


[EXPERIMENTAL] Full text search

LanceDB now provides experimental support for full text search. This is currently Python only. We plan to push the integration down to Rust in the future to make this available for JS as well.

Installation

To use full text search, you must install the dependency tantivy-py:

tantivy 0.20.1

pip install tantivy==0.20.1

Quickstart

Assume:

  1. table is a LanceDB Table
  2. text is the name of the Table column that we want to index

For example,

import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table("my_table",
            data=[{"vector": [3.1, 4.1], "text": "Frodo was a happy puppy", "meta": "foo"},
                  {"vector": [5.9, 26.5], "text": "Sam was a loyal puppy", "meta": "bar"},
                  {"vector": [15.9, 6.5], "text": "There are several kittens playing"}])

To create the index:

table.create_fts_index("text")

To search:

table.search("puppy").limit(10).select(["text"]).to_list()

This returns a list of dictionaries, each holding the selected columns plus a relevance score, for example:

[{'text': 'Frodo was a happy puppy', 'score': 0.6931471824645996}]

If the query passed to search is a str, LanceDB automatically looks for an FTS index.

Multiple text columns

If you have multiple columns to index, pass them all as a list to create_fts_index:

table.create_fts_index(["text1", "text2"])

Note that the search API call does not change - you can search over all indexed columns at once.

Filtering

LanceDB full text search currently supports post-filtering, meaning the filter is applied to the full text search results after the search has run. Use the familiar where syntax:

table.search("puppy").limit(10).where("meta='foo'").to_list()
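A practical consequence of post-filtering is that a query can return fewer rows than the requested limit, even when more matching rows exist. The following is a minimal pure-Python sketch of that behavior, not LanceDB's implementation. It assumes the limit is applied to the search results before the filter runs; `post_filter_search` is a hypothetical helper and the scores are made up for illustration.

```python
def post_filter_search(scored_rows, predicate, limit):
    """Keep the `limit` best-scoring rows, then drop rows failing `predicate`."""
    # Step 1: the "search" keeps only the top `limit` rows by score.
    top = sorted(scored_rows, key=lambda r: r["score"], reverse=True)[:limit]
    # Step 2: the filter only ever sees those rows.
    return [r for r in top if predicate(r)]


# Illustrative rows with made-up scores (not real search output).
rows = [
    {"text": "Frodo was a happy puppy", "meta": "foo", "score": 0.9},
    {"text": "Sam was a loyal puppy", "meta": "bar", "score": 0.8},
    {"text": "Another puppy", "meta": "foo", "score": 0.1},
]

hits = post_filter_search(rows, lambda r: r["meta"] == "foo", limit=2)
print(len(hits))  # one row, although two rows have meta == "foo"
```

If a filtered query returns too few rows, raising the limit is one way to give the filter more candidates to work with.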

Current limitations

  1. Incremental writes are not yet supported. If you add data after creating the fts index, it won't be reflected in search results until you do a full reindex.

  2. Only local filesystem paths are currently supported for the fts index. This is a tantivy limitation. We've implemented an object store plugin, but tantivy-py offers no way to select it.
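Given the first limitation, one workaround is to rebuild the index every time rows are added. The helper below is a sketch under assumptions: `add_and_reindex` is a hypothetical name, and it assumes `create_fts_index` accepts a `replace=True` keyword to overwrite an existing index, which you should confirm against your installed lancedb version.

```python
def add_and_reindex(table, rows, columns="text"):
    """Append rows, then rebuild the FTS index so the new rows are searchable."""
    # Append the new rows to the table.
    table.add(rows)
    # Assumption: replace=True overwrites the existing index in place.
    # This is a full reindex, so its cost grows with the size of the table.
    table.create_fts_index(columns, replace=True)
    return table
```

With the table from the Quickstart, this could be called as `add_and_reindex(table, [{"vector": [1.0, 2.0], "text": "A new puppy arrived"}])`.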