Commit Graph

33 Commits

Author SHA1 Message Date
Chang She
bc83bc9838 feat(python): add post filtering for full text search (#739)
Closes #721 

FTS returns results as a pyarrow table. Pyarrow tables have a
`filter` method, but it does not take SQL filter strings (only pyarrow
compute expressions). Instead, we do one of two things to support
`tbl.search("keywords").where("foo=5").limit(10).to_arrow()`:

Default path: If duckdb is available then use duckdb to execute the sql
filter string on the pyarrow table.
Backup path: Otherwise, write the pyarrow table to a lance dataset and
then do `to_table(filter=<filter>)`

Neither is ideal. 
Default path has two issues:
1. requires installing an extra library (duckdb)
2. duckdb mangles some fields (like fixed size list => list)

Backup path incurs a latency penalty (~20ms on ssd) to write the
resultset to disk.

In the short term, once #676 is addressed, we can write the dataset to
"memory://" instead of disk, which makes the post filter evaluate much
quicker (ETA next week).

In the longer term, we'd like to be able to evaluate the filter string
on the pyarrow Table directly. One possibility is to use Substrait to
generate pyarrow compute expressions from a SQL string. Or, if there's
enough progress on pyarrow, it could support Substrait expressions
directly (no ETA).

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:25:02 -07:00
Chang She
cc9d74e7a7 feat(python): add option to flatten output in to_pandas (#722)
Closes https://github.com/lancedb/lance/issues/1738

We add a `flatten` parameter to the signature of `to_pandas`. By default
this is None and does nothing.
If set to True or -1, then LanceDB will flatten structs before
converting to a pandas dataframe. All nested structs are also flattened.
If set to any positive integer, then LanceDB will flatten structs up to
the specified level of nesting.
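
The `flatten` semantics described above can be illustrated on plain nested dicts. This is a minimal sketch, not the code from this PR: `flatten_record` is a hypothetical helper, and joining parent and child names with `.` is an assumption made for illustration.

```python
def flatten_record(record, flatten=None):
    """Flatten nested dicts ("structs") in a record.

    flatten=None       -> do nothing
    flatten=True or -1 -> flatten every level of nesting
    flatten=N (N > 0)  -> flatten at most N levels of nesting
    """
    if flatten is None:
        return dict(record)
    # Check `is True` first because bool is a subclass of int in Python.
    if flatten is True or flatten == -1:
        remaining = -1  # never reaches 0, so we loop until nothing is nested
    elif isinstance(flatten, int) and flatten > 0:
        remaining = flatten
    else:
        raise ValueError("flatten must be None, True, -1, or a positive integer")

    out = dict(record)
    while remaining != 0 and any(isinstance(v, dict) for v in out.values()):
        expanded = {}
        for key, value in out.items():
            if isinstance(value, dict):
                # Promote each child field to a top-level "parent.child" column.
                for child_key, child_value in value.items():
                    expanded[f"{key}.{child_key}"] = child_value
            else:
                expanded[key] = value
        out = expanded
        remaining -= 1
    return out
```

With `flatten=1` only the outermost struct is expanded; with `flatten=True` the nested `inner` struct is expanded as well.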

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:24:30 -07:00
Lei Xu
86efd36689 chore: improve create_table API consistency between local and remote SDK (#627) 2024-04-05 16:23:47 -07:00
Ayush Chaurasia
159ecbac5a Exponential standoff retry support for handling rate limited embedding functions (#614)
Users ingesting data through rate-limited APIs no longer need to
manually make the process sleep to stay within rate limits.
resolves #579
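
For readers unfamiliar with the technique, here is a minimal sketch of retrying with exponentially growing delays. It is not the code from this PR: `retry_with_backoff`, `RateLimitError`, and all parameter names and defaults are hypothetical, and the added jitter is an assumption.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by a rate-limited embedding API."""


def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error to the caller
            # Delay doubles on every attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` argument is injectable so the behaviour can be exercised without actually waiting.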
2024-04-05 16:23:14 -07:00
QianZhu
3c139c2ee5 Qian/query option doc (#615)
- API documentation improvement for queries (table.search)
- a small bug fix for the remote API on create_table

![image](https://github.com/lancedb/lancedb/assets/1305083/712e9bd3-deb8-4d81-8cd0-d8e98ef68f4e)

![image](https://github.com/lancedb/lancedb/assets/1305083/ba22125a-8c36-4e34-a07f-e39f0136e62c)
2024-04-05 16:22:59 -07:00
Bert
c9fee0faed added api docs for prefilter flag (#617)
Added the prefilter flag argument to `LanceQueryBuilder.where`.

This should make it display here:

https://lancedb.github.io/lancedb/python/python/#lancedb.query.LanceQueryBuilder.select

And also in intellisense like this:
<img width="848" alt="image"
src="https://github.com/lancedb/lancedb/assets/5846846/e0c53f4f-96bc-411b-9159-680a6c4d0070">

Also adds some improved documentation about the `where` argument to this
method.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-04-05 16:22:59 -07:00
Chang She
8469d010f8 feat: add to_list and to_pandas api's (#556)
Add `to_list` to return query results as list of python dict (so we're
not too pandas-centric). Closes #555

Add `to_pandas` API and add deprecation warning on `to_df`. Closes #545

Co-authored-by: Chang She <chang@lancedb.com>
2024-04-05 16:22:59 -07:00
Chang She
693bca1eba feat(python): expose prefilter to lancedb (#522)
We have experimental support for prefiltering (without ANN) in pylance.
This means that we can now apply a filter BEFORE vector search is
performed. This can be done via the `.where(filter_string,
prefilter=True)` kwargs of the query.
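
The difference between filtering before and after the vector search can be sketched with a brute-force nearest-neighbour search in plain Python. This is an illustration only, not how pylance implements it: `knn`, `search`, and the `predicate` callable are hypothetical stand-ins for the vector search and the filter string.

```python
import math


def knn(rows, query, k):
    """Brute-force k nearest neighbours by Euclidean distance."""
    return sorted(rows, key=lambda r: math.dist(r["vector"], query))[:k]


def search(rows, query, k, predicate, prefilter=False):
    if prefilter:
        # Filter first, then search: up to k matching rows are returned.
        return knn([r for r in rows if predicate(r)], query, k)
    # Search first, then filter: may return fewer than k rows.
    return [r for r in knn(rows, query, k) if predicate(r)]


rows = [
    {"id": 0, "vector": [0.0, 0.0], "foo": 1},
    {"id": 1, "vector": [1.0, 0.0], "foo": 5},
    {"id": 2, "vector": [2.0, 0.0], "foo": 5},
    {"id": 3, "vector": [3.0, 0.0], "foo": 1},
]
post = search(rows, [0.0, 0.0], 2, lambda r: r["foo"] == 5)
pre = search(rows, [0.0, 0.0], 2, lambda r: r["foo"] == 5, prefilter=True)
```

Here the post-filtered query keeps only one of the two nearest rows, while the prefiltered query returns the two nearest rows that satisfy the filter.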

Limitations:
- When connecting to LanceDB cloud, `prefilter=True` will raise
NotImplemented
- When an ANN index is present, `prefilter=True` will raise
NotImplemented
- This option is not available for full text search query
- This option is not available for empty search query (just
filter/project)

Additional changes in this PR:
- Bump pylance version to v0.8.0 which supports the experimental
prefiltering.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-10-01 10:34:12 -07:00
Chang She
31dad71c94 multi-modal embedding-function (#484) 2023-09-16 21:23:51 -04:00
Chang She
9a9a73a65d [python] Use pydantic for embedding function persistence (#467)
1. Support persistent embedding function so users can just search using
query string
2. Add fixed size list conversion for multiple vector columns
3. Add support for empty query (just apply select/where/limit).
4. Refactor and simplify some of the data prep code

---------

Co-authored-by: Chang She <chang@lancedb.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
2023-09-05 21:30:45 -07:00
Will Jones
722462c38b chore: upgrade Lance and rename score to _distance (#398)
BREAKING CHANGE: The `score` column has been renamed to `_distance` to
more accurately describe the semantics (smaller means closer / better).

---------

Co-authored-by: Lei Xu <lei@lancedb.com>
2023-08-11 21:42:33 -07:00
Chang She
c1f8feb6ed make pandas an optional dependency in lancedb as well (#385) 2023-07-31 14:08:58 -04:00
Chang She
cada35d5b7 Improve pydantic integration (#384) 2023-07-31 12:16:44 -04:00
Chang She
2fdcb307eb [python] Fix a few minor bugs (#304) 2023-07-15 03:47:42 +08:00
Lei Xu
e6c6da6104 [Python] Initial support of cloud API (#260)
Support connect with remote database, and implement Search API
2023-07-07 15:41:15 -07:00
Chang She
e2325c634b Allow creation of an empty table (#254)
It's inconvenient to always require data at table creation time.
Here we enable you to create an empty table and add data and set schema
later.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-06 20:44:58 -07:00
Chang She
507eeae9c8 Set default to error instead of drop (#259)
When encountering bad input data, we default to the principle of least
surprise and raise an exception.

Co-authored-by: Chang She <chang@lancedb.com>
2023-07-05 22:44:18 -07:00
Philip Kung
313e66c4c5 Specify and Index Column for Vector Search (#217) 2023-06-26 16:11:08 -07:00
Rob Meng
d1e8a97a2a isort entire repo (#200) 2023-06-15 20:12:10 -04:00
Rob Meng
cbb56e25ab port remote connection client into lancedb (#194)
* to_df() is now async, added `to_df_blocking` for convenience
* add remote lancedb client to public lancedb
* make lancedb connection class understand url scheme
`lancedb+<connection_type>://<host>:<port>`.
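
A minimal sketch of how a URL in this scheme can be split into its parts using only the standard library. `parse_lancedb_url` is a hypothetical helper and the example host and port are made up; this is not the parsing code from the PR.

```python
from urllib.parse import urlsplit


def parse_lancedb_url(url):
    """Split `lancedb+<connection_type>://<host>:<port>` into its parts."""
    parts = urlsplit(url)
    prefix, sep, connection_type = parts.scheme.partition("+")
    if prefix != "lancedb" or not sep or not connection_type:
        raise ValueError(f"not a lancedb+<connection_type>:// URL: {url!r}")
    return connection_type, parts.hostname, parts.port


# Example values are illustrative only.
connection_type, host, port = parse_lancedb_url("lancedb+http://localhost:10024")
```

Anything that does not carry the `lancedb+` prefix, such as a plain local path, raises `ValueError` in this sketch, so a caller could fall back to treating it as a local database.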
2023-06-15 18:57:52 -04:00
Tevin Wang
9b83ce3d2a add black to python CI (#178)
Closes #48
2023-06-12 11:22:34 -07:00
Will Jones
fed33a51d5 wip: make the python API reference a bit nicer (#162)
Adds:

* Make `mkdocstrings` aware we are using numpy-style docstrings
* Fixes broken link on `index.md` to Python API docs (and added link to
node ones)
* Added examples to various classes.
* Added doctest to verify examples work.
2023-06-08 16:07:06 -07:00
Chang She
50cdb16b45 Better handle empty results from tantivy (#155)
Closes #154

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-06-05 18:18:14 -07:00
Chang She
04d97347d7 move tantivy-py installation to be separate from wheel (#97)
PyPI does not allow uploading packages that have a direct reference.

For now we'll just ask the user to install tantivy separately.

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-05-25 17:57:26 -06:00
Chang She
f485378ea4 Basic full text search capabilities (#62)
This is v1 of integrating full text search index into LanceDB.

# API
The query API is roughly the same as before, except that if the input is
text instead of a vector, we assume it's a full-text search.

## Example
If `table` is a LanceDB LanceTable, then:

Build index: `table.create_fts_index("text")`

Query: `df = table.search("puppy").limit(10).select(["text"]).to_df()`

# Implementation
Here we use the tantivy-py package to build the index. We use the
row ids as the full-text-search index's doc ids, then do a Take
operation to fetch the rows.
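
The "doc id = row id, then Take" pattern can be sketched without tantivy. The toy inverted index below is a stand-in for the real index, and `build_index` and `search` are hypothetical names used only for illustration.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each lowercase token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for token in row[column].lower().split():
            index[token].add(row_id)
    return index


def search(rows, index, term, limit=10):
    # The index only yields row ids (its doc ids) ...
    row_ids = sorted(index.get(term.lower(), set()))[:limit]
    # ... and a "take" then fetches the full rows by id.
    return [rows[i] for i in row_ids]


rows = [
    {"text": "a puppy runs"},
    {"text": "a cat sleeps"},
    {"text": "Puppy naps"},
]
index = build_index(rows, "text")
hits = search(rows, index, "puppy")
```

Because the index stores only row ids, rows appended after the index is built are invisible to it, which matches the first limitation listed for this v1.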

# Limitations

1. incremental row appends are not supported yet; new data won't show up
in search
2. local filesystem only 
3. requires building tantivy explicitly

---------

Co-authored-by: Chang She <chang@lancedb.com>
2023-05-24 22:25:31 -06:00
Chang She
5554fddd54 Merge branch 'main' into changhiskhan/improve-index-docs 2023-04-25 21:04:01 -07:00
Chang She
72a44eb927 specify metric during index creation 2023-04-24 22:45:37 -07:00
Chang She
89e6232aeb Make distance metric configurable during search 2023-04-24 22:40:40 -07:00
Chang She
4f2dae8a0d Add more detailed docs for the ANN index and search features 2023-04-24 19:19:55 -07:00
Chang She
b91139d3c7 Add tutorial notebook
Convert contextualization and embeddings functionality.
And use it with converted notebook for video search
2023-03-23 15:07:58 -07:00
Chang She
5ef5141812 black 2023-03-22 18:29:07 -07:00
Chang She
690141d357 add unit tests 2023-03-21 22:29:19 -07:00
Chang She
fd4870576e Add functionality for opening a table, introspection for db / table 2023-03-21 19:34:48 -07:00