SDK
Python
Description
Exposes pyarrow batch api during query execution - relevant when there
is no vector search query, dataset is large and the filtered result is
larger than memory.
---------
Co-authored-by: Ishani Ghose <isghose@amazon.com>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
In
2de226220b
I added a new `IntoArrow` trait for adding data into a table.
Unfortunately, it seems my approach for implementing the trait for
"things that are already record batch readers" was flawed. This PR
corrects that flaw and, conveniently, removes the need to box readers at
all (though it is ok if you do).
This PR adds support for passing through a set of ordering fields at
index time (unsigned ints that tantivity can use as fast_fields) that at
query time you can sort your results on. This is useful for cases where
you want to get related hits, i.e by keyword, but order those hits by
some other score, such as popularity.
I.e search for songs descriptions that match on "sad AND jazz AND 1920"
and then order those by number of times played. Example usage can be
seen in the fts tests.
---------
Co-authored-by: Nat Roth <natroth@Nats-MacBook-Pro.local>
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
HuggingFace Dataset is written as arrow batches.
For DatasetDict, all splits are written with a "split" column appended.
- [x] what if the dataset schema already has a `split` column
- [x] add unit tests
solves https://github.com/lancedb/lancedb/issues/1086
Usage Reranking with FTS:
```
retriever = db.create_table("fine-tuning", schema=Schema, mode="overwrite")
pylist = [{"text": "Carson City is the capital city of the American state of Nevada. At the 2010 United States Census, Carson City had a population of 55,274."},
{"text": "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan."},
{"text": "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas."},
{"text": "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. "},
{"text": "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states."},
{"text": "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."},
]
retriever.add(pylist)
retriever.create_fts_index("text", replace=True)
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query, query_type="fts").limit(10).to_pandas())
print(retriever.search(query, query_type="fts").rerank(reranker=reranker).limit(10).to_pandas())
```
Result
```
text vector score
0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602
1 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046
2 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521
3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898
4 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422
5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346
text vector score _relevance_score
0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 0.653422 0.979977
1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 0.671521 0.299105
2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 0.729602 0.284874
3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 0.667898 0.089614
4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 0.639346 0.063832
5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 0.678046 0.041462
```
## Vector Search usage:
```
query = "What is the capital of the United States?"
reranker = CohereReranker(return_score="all")
print(retriever.search(query).limit(10).to_pandas())
print(retriever.search(query).rerank(reranker=reranker, query=query).limit(10).to_pandas()) # <-- Note: passing extra string query here
```
Results
```
text vector _distance
0 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973
1 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884
2 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200
3 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654
4 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867
5 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544
text vector _distance _relevance_score
0 Washington, D.C. (also known as simply Washing... [-0.0090408325, 0.42578125, 0.3798828, -0.3574... 41.384884 0.979977
1 The Commonwealth of the Northern Mariana Islan... [0.3684082, 0.30493164, 0.004600525, -0.049407... 60.060867 0.299105
2 Capital punishment (the death penalty) has exi... [0.099975586, 0.047943115, -0.16723633, -0.183... 39.728973 0.284874
3 Carson City is the capital city of the America... [0.13989258, 0.14990234, 0.14172363, 0.0546569... 55.220200 0.089614
4 North Dakota is a state in the United States. ... [0.55859375, -0.2109375, 0.14526367, 0.1634521... 64.260544 0.063832
5 Charlotte Amalie is the capital and largest ci... [-0.021255493, 0.03363037, -0.027450562, -0.17... 58.345654 0.041462
```
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.
It should be now as simple as:
```python
import lancedb
db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```
Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.
Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:
```python
class Schema(LanceModel):
id: str
vector: Vector(openai.ndims()) = openai.VectorField()
text: str = openai.SourceField()
```
Tables created like this does not require embeddings to be calculated by
the user explicitly, e.g. this works:
```python
table.add([{"id": "foo", "text": "rust all the things"}])
```
However, trying to construct pydantic model instances without vector
doesn't because it's a required field.
Instead, you need add a default value:
```python
class Schema(LanceModel):
id: str
vector: Vector(openai.ndims()) = openai.VectorField(default=None)
text: str = openai.SourceField()
```
then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```
However, all of the vectors are filled with zeros. Instead in
add_vector_col we have to add an additional check so that the embedding
generation is called.
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.
However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
1. filtering with fts mutated the schema, which caused schema mistmatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in case where fts is
used with a filter AND `with_row_id`, we just force user to using the
duckdb pathway.
---------
Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).
While doing this we also cleaned up some inconsistencies between the
SDKs:
* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.
Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
This changes `lancedb` from a "pure python" setuptools project to a
maturin project and adds a rust lancedb dependency.
The async python client is extremely minimal (only `connect` and
`Connection.table_names` are supported). The purpose of this PR is to
get the infrastructure in place for building out the rest of the async
client.
Although this is not technically a breaking change (no APIs are
changing) it is still a considerable change in the way the wheels are
built because they now include the native shared library.