Commit Graph

322 Commits

Author SHA1 Message Date
ayush chaurasia
53c946917e Merge branch 'main' into docs_march 2024-03-26 20:55:06 +05:30
Ayush Chaurasia
e9e0a37ca8 docs: Add all available HF/sentence transformers embedding models list (#1134)
Solves -  https://github.com/lancedb/lancedb/issues/968
2024-03-26 19:04:09 +05:30
Weston Pace
c37a28abbd docs: add the async python API to the docs (#1156) 2024-03-26 07:54:16 -05:00
ayush chaurasia
db04453520 Merge branch 'main' of https://github.com/lancedb/lancedb into docs_march 2024-03-23 20:35:03 +05:30
ayush chaurasia
9a41894cd2 update 2024-03-22 14:33:39 +05:30
ayush chaurasia
ba1266a6b9 update 2024-03-22 13:58:23 +05:30
Lei Xu
25988d23cd chore: validate table name (#1146)
Closes #1129
2024-03-21 14:46:13 -07:00
Lance Release
c0dd98c798 [python] Bump version: 0.6.4 → 0.6.5 2024-03-21 19:53:38 +00:00
Lei Xu
ee73a3bcb8 chore: bump lance to 0.10.5 (#1145) 2024-03-21 12:53:02 -07:00
Ishani Ghose
e14f079fe2 feat: add to_batches API #805 (#1048)
SDK
Python

Description
Exposes pyarrow batch api during query execution - relevant when there
is no vector search query, dataset is large and the filtered result is
larger than memory.

---------

Co-authored-by: Ishani Ghose <isghose@amazon.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-03-20 13:38:06 -07:00
Weston Pace
7d790bd9e7 feat: introduce ArrowNative wrapper struct for adding data that is already a RecordBatchReader (#1139)
In
2de226220b
I added a new `IntoArrow` trait for adding data into a table.
Unfortunately, it seems my approach for implementing the trait for
"things that are already record batch readers" was flawed. This PR
corrects that flaw and, conveniently, removes the need to box readers at
all (though it is ok if you do).
2024-03-20 13:28:17 -07:00
natcharacter
dbdd0a7b4b Order by field support FTS (#1132)
This PR adds support for passing through a set of ordering fields at
index time (unsigned ints that tantivity can use as fast_fields) that at
query time you can sort your results on. This is useful for cases where
you want to get related hits, i.e by keyword, but order those hits by
some other score, such as popularity.

I.e search for songs descriptions that match on "sad AND jazz AND 1920"
and then order those by number of times played. Example usage can be
seen in the fts tests.

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-03-20 01:27:37 -07:00
Chang She
befb79c5f9 feat(python): support writing huggingface dataset and dataset dict (#1110)
HuggingFace Dataset is written as arrow batches.
For DatasetDict, all splits are written with a "split" column appended.

- [x] what if the dataset schema already has a `split` column
- [x] add unit tests
2024-03-20 00:22:03 -07:00
Ayush Chaurasia
0a387a5429 feat(python): Support reranking for vector and fts (#1103)
solves https://github.com/lancedb/lancedb/issues/1086

Usage Reranking with FTS:
```
retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")
pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274."},
          {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
        {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
        {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
        {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
        {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
        ]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)

query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```
Result
```
                                                text                                             vector     score
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602
1  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046
2  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898
4  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346
                                                text                                             vector     score  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046          0.041462
```

## Vector Search usage:
```
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here
```

Results
```
                                                text                                             vector  _distance
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973
1  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884
2  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200
3  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654
4  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544
                                                text                                             vector  _distance  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654          0.041462
```
2024-03-19 22:20:31 +05:30
Weston Pace
6331807b95 feat: refactor the query API and add query support to the python async API (#1113)
In addition, there are also a number of changes in nodejs to the
docstrings of existing methods because this PR adds a jsdoc linter.
2024-03-18 12:36:49 -07:00
Lance Release
81f2cdf736 [python] Bump version: 0.6.3 → 0.6.4 2024-03-16 18:59:14 +00:00
Weston Pace
3bcd61c8de feat: bump lance to 0.10.4 (#1123) 2024-03-15 22:21:04 -07:00
Christian Di Lorenzo
d974413745 fix(python): Add python azure blob read support (#1102)
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should be now as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
2024-03-15 14:15:41 -07:00
Weston Pace
ec4f2fbd30 feat: update lance to v0.10.3 (#1094) 2024-03-15 08:50:28 -07:00
Ayush Chaurasia
6375ea419a chore(python): Increase event interval for telemetry (#1108)
Increasing event reporting interval from 5mins to 60mins
2024-03-15 17:04:43 +05:30
Raghav Dixit
6689192cee doc updates (#1085)
closes #1084
2024-03-14 14:38:28 +05:30
Chang She
dbec598610 feat(python): support optional vector field in pydantic model (#1097)
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this does not require embeddings to be calculated by
the user explicitly, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, trying to construct pydantic model instances without vector
doesn't because it's a required field.

Instead, you need add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are filled with zeros. Instead in
add_vector_col we have to add an additional check so that the embedding
generation is called.
2024-03-14 03:05:08 +05:30
QianZhu
8f6e7ce4f3 add index_stats python api (#1096)
the integration test will be covered in another PR:
https://github.com/lancedb/sophon/pull/1876
2024-03-13 08:47:54 -07:00
Chang She
b482f41bf4 fix(python): fix typo in passing in the api_key explicitly (#1098)
fix silly typo
2024-03-12 22:01:12 -07:00
Weston Pace
4dc7497547 feat: add list_indices to the async api (#1074) 2024-03-12 14:41:21 -07:00
Weston Pace
d744972f2f feat: add update to the async API (#1093) 2024-03-12 14:11:37 -07:00
Weston Pace
510d449167 feat: add time travel operations to the async API (#1070) 2024-03-12 09:20:23 -07:00
Weston Pace
356e89a800 feat: add create_index to the async python API (#1052)
This also refactors the rust lancedb index builder API (and,
correspondingly, the nodejs API)
2024-03-12 05:17:05 -07:00
Lance Release
1ae08fe31d [python] Bump version: 0.6.2 → 0.6.3 2024-03-11 20:16:36 +00:00
Rob Meng
a517629c65 feat: configurable timeout for LanceDB Cloud queries (#1090) 2024-03-11 16:15:48 -04:00
Lance Release
272cbcad7a [python] Bump version: 0.6.1 → 0.6.2 2024-03-06 16:28:50 +00:00
Lei Xu
d1983602c2 chore: bump lance to 0.10.2 (#1061) 2024-03-05 10:16:07 -08:00
Weston Pace
9148cd6d47 feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059)
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
2024-03-05 08:38:18 -08:00
Will Jones
47dbb988bf feat: more accessible errors (#1025)
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
2024-03-05 07:57:11 -08:00
Chang She
6821536d44 doc(python): document the method in fts (#982)
Co-authored-by: prrao87 <prrao87@gmail.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-03-04 16:42:24 -08:00
Ayush Chaurasia
d6f0663671 fix(python): Few fts patches (#1039)
1. filtering with fts mutated the schema, which caused schema mistmatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in case where fts is
used with a filter AND `with_row_id`, we just force user to using the
duckdb pathway.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-03-04 16:41:59 -08:00
Weston Pace
abaf315baf feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-03-04 09:27:41 -08:00
Chang She
14b9277ac1 chore(rust): update rust version (#810) 2024-03-03 18:51:58 -08:00
Chang She
d621826b79 feat(python): allow user to override api url (#1054) 2024-03-03 18:29:47 -08:00
Chang She
08c0803ae1 chore(python): use pypi tantivy to speed up CI (#987) 2024-03-03 16:57:55 -08:00
Chang She
d14c9b6d9e feat(python): add model_names() method to openai embedding function (#1049)
small QoL improvement
2024-03-03 12:33:00 -08:00
QianZhu
c1af53b787 Add create scalar index to sdk (#1033) 2024-02-29 13:32:01 -08:00
Weston Pace
2a02d1394b feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
2024-02-29 13:29:29 -08:00
Lance Release
085066d2a8 [python] Bump version: 0.6.0 → 0.6.1 2024-02-29 19:48:16 +00:00
Rob Meng
adf1a38f4d fix: fix columns type for pydantic 2.x (#1045) 2024-02-29 14:47:56 -05:00
Weston Pace
294c33a42e feat: Initial remote table implementation for rust (#1024)
This will eventually replace the remote table implementations in python
and node.
2024-02-29 10:55:49 -08:00
Lance Release
245786fed7 [python] Bump version: 0.5.7 → 0.6.0 2024-02-29 16:03:01 +00:00
Rob Meng
ebaa2dede5 chore: upgrade to lance 0.10.1 (#1034)
upgrade to lance 0.10.1 and update doc string to reflect dynamic
projection options
2024-02-28 11:06:46 -05:00
Weston Pace
a6bcbd007b feat: add a basic async python client starting point (#1014)
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.

The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.

Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.
2024-02-27 04:52:02 -08:00
Will Jones
5af74b5aca feat: {add|alter|drop}_columns APIs (#1015)
Initial work for #959. This exposes the basic functionality for each in
all of the APIs. Will add user guide documentation in a later PR.
2024-02-26 11:04:53 -08:00