Commit Graph

28 Commits

Author SHA1 Message Date
Weston Pace
4f512af024 feat: add the optimize function to nodejs and async python (#1257)
The optimize function is pretty crucial for getting good performance
when building a large scale dataset but it was only exposed in rust
(many sync python users are probably doing this via to_lance today)

This PR adds the optimize function to nodejs and to python.

I left the function marked experimental because I think there will
likely be changes to optimization (e.g. if we add features like
"optimize on write"). I also only exposed the `cleanup_older_than`
configuration parameter since this one is very commonly used and the
rest have sensible defaults and we don't really know why we would
recommend different values for these defaults anyways.
2024-05-20 07:09:31 -07:00
asmith26
3850d5fb35 Add ollama embeddings function (#1263)
Following the docs
[here](https://lancedb.github.io/lancedb/python/python/#lancedb.embeddings.openai.OpenAIEmbeddings)
I've been trying to use ollama embedding via the OpenAI API interface,
but unfortunately I couldn't get it to work (possibly related to
https://github.com/ollama/ollama/issues/2416)

Given the popularity of ollama I thought it could be helpful to have a
dedicated Ollama Embedding function in lancedb.

Very much welcome any thought on this or my code etc. Thanks!
2024-05-13 13:09:19 +05:30
Will Jones
12dbca5248 ci: better test for test_syntax (#1278)
The syntax error was fixed in tantivy 0.22.0, so I changed the test case
to something more wrong.
2024-05-07 11:52:39 -07:00
Weston Pace
3d7c48feca feat: allow the index_cache_size to be configured when opening a table (#1245)
This was already configurable in the rust API but it wasn't actually
being passed down to the underlying dataset. I added this option to both
the async python API and the new nodejs API.

I also added this option to the synchronous python API.

I did not add the option to vectordb.
2024-04-26 13:42:02 -07:00
Raghav Dixit
a6aa67baed python: Bug fixes / tests (#1210)
closes #1194 #1172 #1124 #1208 


@wjones127 : `if query_type != "fts":` is needed because both fts and
vector search create `LanceQueryBuilder` which has `vector_column_name`
as a required attribute.
2024-04-10 10:17:14 -07:00
Will Jones
1d23af213b feat: expose storage options in LanceDB (#1204)
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.

1. Fixes #1168
2. Closes #1165
3. Closes #1082
4. Closes #439
5. Closes #897
6. Closes #642
7. Closes #281
8. Closes #114
9. Closes #990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
2024-04-10 10:12:04 -07:00
Raghav Dixit
1c41a00d87 Embeddings: HF model hub support added via transformers (#1154) 2024-04-05 16:34:56 -07:00
Weston Pace
f97c7dad8c docs: add the async python API to the docs (#1156) 2024-04-05 16:34:37 -07:00
Lei Xu
473ef7e426 chore: validate table name (#1146)
Closes #1129
2024-04-05 16:33:37 -07:00
Ishani Ghose
0838e12b30 feat: add to_batches API #805 (#1048)
SDK
Python

Description
Exposes pyarrow batch api during query execution - relevant when there
is no vector search query, dataset is large and the filtered result is
larger than memory.

---------

Co-authored-by: Ishani Ghose <isghose@amazon.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:33:37 -07:00
natcharacter
f6e9f8e3f4 Order by field support FTS (#1132)
This PR adds support for passing through a set of ordering fields at
index time (unsigned ints that tantivity can use as fast_fields) that at
query time you can sort your results on. This is useful for cases where
you want to get related hits, i.e by keyword, but order those hits by
some other score, such as popularity.

I.e search for songs descriptions that match on "sad AND jazz AND 1920"
and then order those by number of times played. Example usage can be
seen in the fts tests.

---------

Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:33:36 -07:00
Chang She
4466cfa958 feat(python): support writing huggingface dataset and dataset dict (#1110)
HuggingFace Dataset is written as arrow batches.
For DatasetDict, all splits are written with a "split" column appended.

- [x] what if the dataset schema already has a `split` column
- [x] add unit tests
2024-04-05 16:33:06 -07:00
Ayush Chaurasia
42fad84ec8 feat(python): Support reranking for vector and fts (#1103)
solves https://github.com/lancedb/lancedb/issues/1086

Usage Reranking with FTS:
```
retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")
pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274."},
          {"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
        {"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
        {"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
        {"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
        {"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
        ]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)

query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```
Result
```
                                                text                                             vector     score
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602
1  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046
2  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898
4  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346
                                                text                                             vector     score  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  0.653422          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  0.671521          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  0.729602          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  0.667898          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  0.639346          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  0.678046          0.041462
```

## Vector Search usage:
```
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here
```

Results
```
                                                text                                             vector  _distance
0  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973
1  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884
2  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200
3  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654
4  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867
5  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544
                                                text                                             vector  _distance  _relevance_score
0  Washington, D.C. (also known as simply Washing...  [-0.0090408325, 0.42578125, 0.3798828, -0.3574...  41.384884          0.979977
1  The Commonwealth of the Northern Mariana Islan...  [0.3684082, 0.30493164, 0.004600525, -0.049407...  60.060867          0.299105
2  Capital punishment (the death penalty) has exi...  [0.099975586, 0.047943115, -0.16723633, -0.183...  39.728973          0.284874
3  Carson City is the capital city of the America...  [0.13989258, 0.14990234, 0.14172363, 0.0546569...  55.220200          0.089614
4  North Dakota is a state in the United States. ...  [0.55859375, -0.2109375, 0.14526367, 0.1634521...  64.260544          0.063832
5  Charlotte Amalie is the capital and largest ci...  [-0.021255493, 0.03363037, -0.027450562, -0.17...  58.345654          0.041462
```
2024-04-05 16:33:06 -07:00
Weston Pace
4180b44472 feat: refactor the query API and add query support to the python async API (#1113)
In addition, there are also a number of changes in nodejs to the
docstrings of existing methods because this PR adds a jsdoc linter.
2024-04-05 16:32:47 -07:00
Christian Di Lorenzo
8bb983bc3d fix(python): Add python azure blob read support (#1102)
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should be now as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
2024-04-05 16:32:31 -07:00
Chang She
377832e532 feat(python): support optional vector field in pydantic model (#1097)
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this does not require embeddings to be calculated by
the user explicitly, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, trying to construct pydantic model instances without vector
doesn't because it's a required field.

Instead, you need add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are filled with zeros. Instead in
add_vector_col we have to add an additional check so that the embedding
generation is called.
2024-04-05 16:32:15 -07:00
Weston Pace
b6a522d483 feat: add list_indices to the async api (#1074) 2024-04-05 16:32:15 -07:00
Weston Pace
9031ec6878 feat: add update to the async API (#1093) 2024-04-05 16:32:15 -07:00
Weston Pace
47daf9b7b0 feat: add time travel operations to the async API (#1070) 2024-04-05 16:32:15 -07:00
Weston Pace
f822255683 feat: add create_index to the async python API (#1052)
This also refactors the rust lancedb index builder API (and,
correspondingly, the nodejs API)
2024-04-05 16:32:14 -07:00
Weston Pace
73c69a6b9a feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059)
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
2024-04-05 16:31:45 -07:00
Chang She
10089481c0 doc(python): document the method in fts (#982)
Co-authored-by: prrao87 <prrao87@gmail.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Ayush Chaurasia
b5326d31e9 fix(python): Few fts patches (#1039)
1. filtering with fts mutated the schema, which caused schema mistmatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in case where fts is
used with a filter AND `with_row_id`, we just force user to using the
duckdb pathway.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Weston Pace
8033a44d68 feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:31:36 -07:00
Weston Pace
4299f719ec feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
2024-04-05 16:31:36 -07:00
Rob Meng
b8eb5d4bfe fix: fix columns type for pydantic 2.x (#1045) 2024-04-05 16:31:36 -07:00
Rob Meng
f3de3d990d chore: upgrade to lance 0.10.1 (#1034)
upgrade to lance 0.10.1 and update doc string to reflect dynamic
projection options
2024-04-05 16:31:36 -07:00
Weston Pace
2cec2a8937 feat: add a basic async python client starting point (#1014)
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.

The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.

Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.
2024-04-05 16:31:34 -07:00