699 Commits

Author SHA1 Message Date
Lei Xu
32af962c0c feat: fix creating empty table and creating table by a list of RecordBatch for remote python sdk (#1650)
Closes #1637
2024-09-14 11:33:34 -07:00
Ayush Chaurasia
18484d0b6c fix: allow pass optional args in colbert reranker (#1649)
Fixes https://github.com/lancedb/lancedb/issues/1641
2024-09-14 11:18:09 -07:00
Lei Xu
c02ee3c80c chore: make remote client a context manager (#1648)
Allow `RemoteLanceDBClient` to be used as context manager
2024-09-13 22:08:48 -07:00
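To illustrate what "used as context manager" means for c02ee3c80c above, here is a minimal sketch of the protocol. `DemoClient` and its `close()` method are hypothetical stand-ins, not the actual `RemoteLanceDBClient` implementation; the only assumption is that the client has some cleanup step to run on exit.

```python
class DemoClient:
    """Hypothetical stand-in for a remote client; not the LanceDB class."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Release whatever the client holds (connections, sessions, ...).
        self.closed = True

    def __enter__(self):
        # The object bound by "with ... as client" is the client itself.
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always clean up; returning False lets exceptions propagate.
        self.close()
        return False


with DemoClient() as client:
    pass  # use the client here; close() runs when the block exits
```

The benefit of this pattern is that cleanup runs even if the body of the `with` block raises.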
Sayandip Dutta
9b8472850e fix: unterminated string literal on table update (#1573)
resolves #1429 
(python)

```python
-    return f"'{value}'"
+    return f'"{value}"'
```

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-09-13 12:32:59 -07:00
Sayandip Dutta
36d05ea641 fix: add appropriate QueryBuilder overloads to LanceTable.search (#1558)
- Add overloads to Table.search to preserve the return type information
of the different QueryBuilder objects for LanceTable
- Fix the fts_column type annotation by making it `Optional`

resolves #1550

---------

Co-authored-by: sayandip-dutta <sayandip.dutta@nevaehtech.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
2024-09-13 12:32:30 -07:00
LuQQiu
c7732585bf fix: support pyarrow input types (#1628)
fixes #1625 
Support pa.RecordBatch, pa.dataset.Dataset, pa.dataset.Scanner, and
pa.RecordBatchReader
2024-09-12 10:59:18 -07:00
BubbleCal
4b79db72bf docs: improve the docs and API param name (#1629)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-09-11 10:18:29 +08:00
Lance Release
64eb43229d Bump version: 0.13.0-beta.2 → 0.13.0 2024-09-10 20:12:35 +00:00
Lance Release
c31c92122f Bump version: 0.13.0-beta.1 → 0.13.0-beta.2 2024-09-10 20:12:35 +00:00
Gagan Bhullar
205fc530cf feat: expose hnsw indices (#1595)
PR closes #1522

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-09-10 11:08:13 -07:00
BubbleCal
2bde5401eb feat: support to build FTS without positions (#1621) 2024-09-10 22:51:32 +08:00
Antonio Molner Domenech
a405847f9b fix(python): remove unmaintained ratelimiter dependency (#1603)
The `ratelimiter` package hasn't been updated in ages and is no longer
maintained. This PR removes the dependency on `ratelimiter` and replaces
it with a custom rate limiter implementation.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-09-09 12:35:53 -07:00
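The commit above (a405847f9b) does not show its replacement rate limiter, so the following is only a sketch of what a small custom limiter could look like, using a sliding window of call timestamps. The class name, parameters, and the injectable `clock`/`sleep` hooks are all assumptions made for illustration, not the code from the PR.

```python
import time
from collections import deque


class SimpleRateLimiter:
    """Hypothetical sliding-window limiter: at most max_calls per period seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._calls = deque()  # timestamps of recent calls, oldest first

    def _drop_expired(self, now):
        # Forget calls that have left the current window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()

    def acquire(self):
        now = self._clock()
        self._drop_expired(now)
        if len(self._calls) >= self.max_calls:
            # Wait until the oldest recorded call falls out of the window.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._drop_expired(now)
        self._calls.append(now)
```

A caller would invoke `limiter.acquire()` before each outgoing request; the call returns immediately while under the limit and blocks otherwise.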
Will Jones
2a6586d6fb feat: add flag to enable faster manifest paths (#1612)
The new V2 manifest path scheme makes discovering the latest version of
a table constant time on object stores, regardless of the number of
versions in the table. See benchmarks in the PR here:
https://github.com/lancedb/lance/pull/2798

Closes #1583
2024-09-09 11:34:36 -07:00
James Wu
029b01bbbf feat: enable phrase_query(bool) for hybrid search queries (#1578)
first off, apologies for any folly since i'm new to contributing to
lancedb. this PR is the continuation of [a discord
thread](https://discord.com/channels/1030247538198061086/1030247538667827251/1278844345713299599):

## user story

here's the lance db search query i'd like to run:

```
def search(phrase):
    logger.info(f'Searching for phrase: {phrase}')
    phrase_embedding = get_embedding(phrase)
    df = (table.search((phrase_embedding, phrase), query_type='hybrid')
        .limit(10).to_list())
    logger.info(f'Success search with row count: {len(df)}')

search('howdy (howdy)')
search('howdy(howdy)')
```

the second search fails due to `ValueError: Syntax Error: howdy(howdy)`

i saw on the
[docs](https://lancedb.github.io/lancedb/fts/#phrase-queries-vs-terms-queries)
that i can use `phrase_query()` to [enable a
flag](https://github.com/lancedb/lancedb/blob/main/python/python/lancedb/query.py#L790-L792)
to wrap the query in double quotes (as well as sanitize single quotes)
prior to sending the query to search. this works for [normal
FTS](https://lancedb.github.io/lancedb/fts/), but the command is
unavailable on [hybrid
search](https://lancedb.github.io/lancedb/hybrid_search/hybrid_search/).

## changes

i added a `phrase_query()` function to `LanceHybridQueryBuilder` by
propagating the call down to its `self._fts_query` object. i'm not too
familiar with the codebase and am not sure if this is the best way to
implement the functionality. feel free to riff on this PR or discard it


## tests

```
(lancedb) JamesMPB:python james$ pwd
/Users/james/src/lancedb/python
(lancedb) JamesMPB:python james$ pytest python/tests/test_table.py 
python/tests/test_table.py .......................................                                                                   [100%]
====================================================== 39 passed, 1 warning in 2.23s =======================================================
```
2024-09-07 08:58:05 +05:30
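The delegation described in 029b01bbbf above can be sketched as follows. Both classes are hypothetical simplifications, not the real `LanceHybridQueryBuilder`; the only behaviour taken from the commit message is that the flag is forwarded to the inner `_fts_query` object and that phrase mode wraps the query in double quotes. Sanitizing of quotes is omitted.

```python
class FtsQuerySketch:
    """Hypothetical stand-in for a full-text query builder."""

    def __init__(self, text):
        self.text = text
        self._phrase_query = False

    def phrase_query(self, phrase_query=True):
        self._phrase_query = phrase_query
        return self

    def query_string(self):
        # Phrase mode wraps the text in double quotes so that characters such
        # as parentheses are not parsed as query syntax.
        return f'"{self.text}"' if self._phrase_query else self.text


class HybridQuerySketch:
    """Hypothetical hybrid builder that owns a vector and an FTS sub-query."""

    def __init__(self, vector, text):
        self.vector = vector
        self._fts_query = FtsQuerySketch(text)

    def phrase_query(self, phrase_query=True):
        # Forward the flag to the FTS sub-query and keep the call chainable.
        self._fts_query.phrase_query(phrase_query)
        return self


query = HybridQuerySketch([0.1, 0.2], "howdy(howdy)").phrase_query()
```

Returning `self` from `phrase_query()` keeps the method usable in a builder-style chain alongside calls such as `.limit(10)`.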
Will Jones
cd32944e54 feat: upgrade lance to v0.17.0 (#1608)
Changelog: https://github.com/lancedb/lance/releases/tag/v0.17.0

Highlights:

* You can do "phrase queries" by adding double quotes around phrases
(multiple tokens) in FTS.

Added follow ups in: https://github.com/lancedb/lancedb/issues/1611
2024-09-06 14:10:02 -07:00
BubbleCal
8dcd328dce feat: support to create table from record batch iterator (#1593) 2024-09-06 10:41:38 +08:00
Gagan Bhullar
b24810a011 feat(python, rust): expose offset in query (#1556)
PR is part of #1555
2024-09-05 08:33:07 -07:00
Ayush Chaurasia
03ef1dc081 feat: update default reranker to RRF (#1580)
- Both LinearCombination (the current default) and RRF are pretty fast
compared to model based rerankers. RRF is slightly faster.
- In our tests RRF has also been slightly more accurate.

This PR:
- Makes RRF the default reranker
- Removed duplicate docs for rerankers
2024-09-03 14:00:13 +05:30
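For readers unfamiliar with RRF (Reciprocal Rank Fusion), the default introduced in 03ef1dc081 above, the general technique can be sketched as below. This is the textbook formulation, where each result list contributes 1 / (k + rank) to a document's score; the function name and the constant `k=60` are assumptions for illustration and are not taken from the LanceDB reranker.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed value; 60 is a common choice).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]  # hypothetical ids from a vector search
fts_hits = ["b", "c", "d"]     # hypothetical ids from a full-text search
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
```

Because only ranks are used, the two result lists do not need comparable score scales, and no model inference is involved, which is consistent with the commit's note that it is fast compared to model-based rerankers.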
Ayush Chaurasia
dc72ece847 feat!: better api for manual hybrid queries (#1575)
Currently, the only documented way of performing hybrid search is by
using the embedding API and passing string queries that get automatically
embedded. There are use cases where users might want to pass vectors and
text manually instead.
This ticket contains more information and historical context -
https://github.com/lancedb/lancedb/issues/937

This breaks an undocumented pathway that allowed passing (vector, text)
tuple queries, which was intended to be temporary, so this is marked as a
breaking change. For all practical purposes, this should not impact most
users.

### usage
```
results = (
    table.search(query_type="hybrid")
    .vector(vector_query)
    .text(text_query)
    .limit(5)
    .to_pandas()
)
```
2024-08-30 17:37:58 +05:30
BubbleCal
1521435193 fix: specify column to search for FTS (#1572)
Previously the `fts_columns` parameter was ignored. For now, searching is
supported on only one column, so ignoring the parameter could lead to an
error when multiple columns are indexed for FTS.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-08-29 23:43:46 +08:00
Gagan Bhullar
a85f039352 fix(bug): limit fix (#1548)
PR fixes #1151
2024-08-26 14:25:14 -07:00
Ayush Chaurasia
549ca51a8a feat: add answerdotai rerankers support and minor improvements (#1560)
This PR:
- Adds missing license headers
- Integrates with answerdotai Rerankers package
- Updates ColbertReranker to subclass answerdotai package. This is done
to keep backwards compatibility as some users might be used to importing
ColbertReranker directly
- Sets `trust_remote_code` to `True` by default in CrossEncoder and
sentence-transformer based rerankers
2024-08-26 13:25:10 +05:30
Lance Release
89bcc1b2e7 Bump version: 0.13.0-beta.0 → 0.13.0-beta.1 2024-08-23 13:56:30 +00:00
Gagan Bhullar
6eb7ccfdee fix: rerank attribute unknown (#1554)
PR fixes #1550
2024-08-22 11:46:36 +05:30
Ayush Chaurasia
7d65dd97cf chore(python): update Colbert architecture and minor improvements (#1547)
- Update ColBertReranker architecture: The current implementation
doesn't use the right arch. This PR uses the implementation in Rerankers
library. Fixes https://github.com/lancedb/lancedb/issues/1546
Benchmark diff (hit rate):
Hybrid - 91 vs 87
reranked vector - 85 vs 80

- Reranking in FTS is basically disabled in main after last week's FTS
updates. I think there's no blocker in supporting that?
- Allow overriding accelerators: Most transformer based Rerankers and
Embedding automatically select device. This PR allows overriding those
settings by passing `device`. Fixes:
https://github.com/lancedb/lancedb/issues/1487

---------

Co-authored-by: BubbleCal <bubble-cal@outlook.com>
2024-08-21 12:26:52 +05:30
Lei Xu
5857cb4c6e docs: add a section to describe scalar index (#1495) 2024-08-16 18:48:29 -07:00
BubbleCal
0fa50775d6 feat: support to query/index FTS on RemoteTable/AsyncTable (#1537)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-08-16 12:01:05 +08:00
Gagan Bhullar
20faa4424b feat(python): add delete unverified parameter (#1542)
PR fixes #1527
2024-08-15 09:01:32 -07:00
BubbleCal
b624fc59eb docs: add create_fts_index doc in Python API Reference (#1533)
resolve #1313

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-08-15 11:35:16 +08:00
BubbleCal
501817cfac chore: bump the required python version to 3.9 (#1541)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-08-14 08:44:31 -07:00
Ryan Green
b3daa25f46 feat: allow new scalar index types to be created in remote table (#1538) 2024-08-13 16:05:42 -02:30
Lance Release
ff5bbfdd4c Bump version: 0.12.0 → 0.13.0-beta.0 2024-08-12 19:47:57 +00:00
Lei Xu
b2317c904d feat: create bitmap and label list scalar index using python async api (#1529)
* Expose `bitmap` and `LabelList` scalar index types via Rust and Async
Python API
* Add documentation
2024-08-11 09:16:11 -07:00
BubbleCal
613f3063b9 chore: upgrade lance to 0.16.1 (#1524)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-08-09 19:18:05 +08:00
Gagan Bhullar
9c1adff426 feat(python): add to_list to async api (#1520)
PR fixes #1517
2024-08-08 11:45:20 -07:00
BubbleCal
f9d5fa88a1 feat!: migrate FTS from tantivy to lance-index (#1483)
Lance now supports FTS, so add it into lancedb Python, TypeScript and
Rust SDKs.

For Python, we still use tantivy based FTS by default because the lance
FTS index currently lacks some features of tantivy.

For Python:
- Support to create lance based FTS index
- Support to specify columns for full text search (only available for
lance based FTS index)

For TypeScript:
- Change the search method so that it can accept both string and vector
- Support full text search

For Rust
- Support full text search

The others:
- Update the FTS doc

BREAKING CHANGE: 
- for Python, this renames the attached score column of FTS from "score"
to "_score"; this could be a breaking change for users that rely on the
scores

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
2024-08-08 15:33:15 +08:00
Lance Release
ec39d98571 Bump version: 0.12.0-beta.0 → 0.12.0 2024-08-07 20:55:40 +00:00
Lance Release
0cb37f0e5e Bump version: 0.11.0 → 0.12.0-beta.0 2024-08-07 20:55:39 +00:00
Lei Xu
2bdf0a02f9 feat!: upgrade lance to 0.16 (#1519) 2024-08-07 13:15:22 -07:00
Gagan Bhullar
32123713fd feat(python): optimize stats repr method (#1510)
PR fixes #1507
2024-08-07 08:47:52 -07:00
Gagan Bhullar
d5a01ffe7b feat(python): index config repr method (#1509)
PR fixes #1506
2024-08-07 08:46:46 -07:00
Ayush Chaurasia
e01045692c feat(python): support embedding functions in remote table (#1405) 2024-08-07 20:22:43 +05:30
Ayush Chaurasia
4769d8eb76 feat(python): multi-vector reranking support (#1481)
Currently targeting the following usage:
```
from lancedb.rerankers import CrossEncoderReranker

reranker = CrossEncoderReranker()

query = "hello"

res1 = table.search(query, vector_column_name="vector").limit(3)
res2 = table.search(query, vector_column_name="text_vector").limit(3)
res3 = table.search(query, vector_column_name="meta_vector").limit(3)

reranked = reranker.rerank_multivector(
    [res1, res2, res3],
    deduplicate=True,
    query=query,  # some reranker models need query
)
```
- This implements rerank_multivector function in the base reranker so
that all rerankers that implement rerank_vector will automatically have
multivector reranking support
- Special case for RRF reranker that just uses its existing
rerank_hybrid function for multi-vector reranking.

---------

Co-authored-by: Weston Pace <weston.pace@gmail.com>
2024-08-07 01:45:46 +05:30
Ayush Chaurasia
d07d7a5980 chore: update polars version range (#1508) 2024-08-06 23:43:15 +05:30
Robby
8d2ff7b210 feat(python): add watsonx embeddings to registry (#1486)
Related issue: https://github.com/lancedb/lancedb/issues/1412

---------

Co-authored-by: Robby <h0rv@users.noreply.github.com>
2024-08-06 10:58:33 +05:30
Ryan Green
6af69b57ad fix: return LanceMergeInsertBuilder in overridden merge_insert method on remote table (#1484) 2024-07-31 12:25:16 -02:30
Lance Release
7b6d3f943b Bump version: 0.11.0-beta.0 → 0.11.0 2024-07-26 20:18:31 +00:00
Lance Release
676876f4d5 Bump version: 0.10.2 → 0.11.0-beta.0 2024-07-26 20:18:30 +00:00
Will Jones
9555efacf9 feat: upgrade lance to 0.15.0 (#1477)
Changelog: https://github.com/lancedb/lance/releases/tag/v0.15.0

* Fixes #1466
* Closes #1475
* Fixes #1446
2024-07-26 09:13:49 -07:00
Chang She
374c1e7aba fix: infer schema from huggingface dataset (#1444)
Closes #1383

When creating a table from a HuggingFace dataset, infer the arrow schema
directly
2024-07-23 13:12:34 -07:00