Commit Graph

1057 Commits

Author SHA1 Message Date
Weston Pace
b36c750cc7 fix: fix compile error in example caused by merge conflict (#1135) 2024-04-05 16:33:06 -07:00
Weston Pace
a23b856410 feat: change DistanceType to be independent thing instead of resuing lance_linalg (#1133)
This PR originated from a request to add `Serialize` / `Deserialize` to
`lance_linalg::distance::DistanceType`. However, that is a strange
request for `lance_linalg` which shouldn't really have to worry about
`Serialize` / `Deserialize`. The problem is that `lancedb` is re-using
`DistanceType` and things in `lancedb` do need to worry about
`Serialize`/`Deserialize` (because `lancedb` needs to support remote
client).

On the bright side, separating the two types allows us to independently
document distance type and allows `lance_linalg` to make changes to
`DistanceType` in the future without having to worry about backwards
compatibility concerns.
2024-04-05 16:33:06 -07:00
Weston Pace
0fe0976a0e docs: add links to rust SDK docs, remove references to rust SDK being unstable / experimental (#1131) 2024-04-05 16:33:05 -07:00
Weston Pace
abde77eafb feat(rust): add trait for incoming data (#1128)
This will make it easier for 3rd party integrations. They simply need to
implement `IntoArrow` for their types in order for those types to be
used in ingestion.
2024-04-05 16:32:47 -07:00
vincent d warmerdam
85a9ef472f Unhide Pydantic guides in Docs (#1122)
@wjones127 after fixing https://github.com/lancedb/lancedb/issues/1112 I
noticed something else on the docs. There's an odd chunk of the docs
missing
[here](https://lancedb.github.io/lancedb/guides/tables/#from-a-polars-dataframe).
I can see the heading, but after clicking it the contents don't show.

![CleanShot 2024-03-15 at 23 40
17@2x](https://github.com/lancedb/lancedb/assets/1019791/04784b19-0200-4c3f-ae17-7a8f871ef9bd)

Apon inspection it was a markdown issue, one tab too many on a whole
segment.

This PR fixes it. It looks like this now and the sections appear again:

![CleanShot 2024-03-15 at 23 42
32@2x](https://github.com/lancedb/lancedb/assets/1019791/c5aaec4c-1c37-474d-9fb0-641f4cf52626)
2024-04-05 16:32:47 -07:00
Weston Pace
4180b44472 feat: refactor the query API and add query support to the python async API (#1113)
In addition, there are also a number of changes in nodejs to the
docstrings of existing methods because this PR adds a jsdoc linter.
2024-04-05 16:32:47 -07:00
Lance Release
2db257ca29 [python] Bump version: 0.6.3 → 0.6.4 2024-04-05 16:32:41 -07:00
Lance Release
1f816d597a Bump version: 0.4.12 → 0.4.13 2024-04-05 16:32:31 -07:00
Weston Pace
c1e3dc48af feat: bump lance to 0.10.4 (#1123) 2024-04-05 16:32:31 -07:00
vincent d warmerdam
b9afc01cfd Explain vonoroi seed initalisation (#1114)
This PR fixes https://github.com/lancedb/lancedb/issues/1112. It turned
out that K-means is currently used internally, so I figured adding that
context to the docs would be nice.
2024-04-05 16:32:31 -07:00
Christian Di Lorenzo
8bb983bc3d fix(python): Add python azure blob read support (#1102)
I know there's a larger effort to have the python client based on the
core rust implementation, but in the meantime there have been several
issues (#1072 and #485) with some of the azure blob storage calls due to
pyarrow not natively supporting an azure backend. To this end, I've
added an optional import of the fsspec implementation of azure blob
storage [`adlfs`](https://pypi.org/project/adlfs/) and passed it to
`pyarrow.fs`. I've modified the existing test and manually verified it
with some real credentials to make sure it behaves as expected.

It should be now as simple as:

```python
import lancedb

db = lancedb.connect("az://blob_name/path")
table = db.open_table("test")
table.search(...)
```

Thank you for this cool project and we're excited to start using this
for real shortly! 🎉 And thanks to @dwhitena for bringing it to my
attention with his prediction guard posts.

Co-authored-by: christiandilorenzo <christian.dilorenzo@infiniaml.com>
2024-04-05 16:32:31 -07:00
Weston Pace
1ea0c33545 feat: update lance to v0.10.3 (#1094) 2024-04-05 16:32:31 -07:00
Raghav Dixit
765569425c doc updates (#1085)
closes #1084
2024-04-05 16:32:15 -07:00
Chang She
377832e532 feat(python): support optional vector field in pydantic model (#1097)
The LanceDB embeddings registry allows users to annotate the pydantic
model used as table schema with the desired embedding function, e.g.:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField()
    text: str = openai.SourceField()
```

Tables created like this does not require embeddings to be calculated by
the user explicitly, e.g. this works:

```python
table.add([{"id": "foo", "text": "rust all the things"}])
```

However, trying to construct pydantic model instances without vector
doesn't because it's a required field.

Instead, you need add a default value:

```python
class Schema(LanceModel):
    id: str
    vector: Vector(openai.ndims()) = openai.VectorField(default=None)
    text: str = openai.SourceField()
```

then this completes without errors:
```python
table.add([Schema(id="foo", text="rust all the things")])
```

However, all of the vectors are filled with zeros. Instead in
add_vector_col we have to add an additional check so that the embedding
generation is called.
2024-04-05 16:32:15 -07:00
QianZhu
723defbe7e add index_stats python api (#1096)
the integration test will be covered in another PR:
https://github.com/lancedb/sophon/pull/1876
2024-04-05 16:32:15 -07:00
Chang She
c33110397e fix(python): fix typo in passing in the api_key explicitly (#1098)
fix silly typo
2024-04-05 16:32:15 -07:00
Weston Pace
b6a522d483 feat: add list_indices to the async api (#1074) 2024-04-05 16:32:15 -07:00
Weston Pace
9031ec6878 feat: add update to the async API (#1093) 2024-04-05 16:32:15 -07:00
Will Jones
f0c5f5ba62 fix: handle uri in object (#1091)
Fixes #1078
2024-04-05 16:32:15 -07:00
Weston Pace
47daf9b7b0 feat: add time travel operations to the async API (#1070) 2024-04-05 16:32:15 -07:00
Weston Pace
f822255683 feat: add create_index to the async python API (#1052)
This also refactors the rust lancedb index builder API (and,
correspondingly, the nodejs API)
2024-04-05 16:32:14 -07:00
Will Jones
90af5cf028 fix: propagate filter validation errors (#1092)
In Rust and Node, we have been swallowing filter validation errors. If
there was an error in parsing the filter, then the filter was silently
ignored, returning unfiltered results.

Fixes #1081
2024-04-05 16:31:53 -07:00
Lance Release
fec6f92184 [python] Bump version: 0.6.2 → 0.6.3 2024-04-05 16:31:53 -07:00
Rob Meng
35bc4f3078 feat: configurable timeout for LanceDB Cloud queries (#1090) 2024-04-05 16:31:53 -07:00
Ivan Leo
89ce417452 Update default_embedding_functions.md (#1073)
Added a small bit of documentation for the `dim` feature which is
provided by the new `text-embedding-3` model series that allows users to
shorten an embedding.

Happy to discuss a bit on the phrasing but I struggled quite a bit with
getting it to work so wanted to help others who might want to use the
newer model too
2024-04-05 16:31:53 -07:00
Weston Pace
d4502add44 Remove remote integration workflow (#1076) 2024-04-05 16:31:53 -07:00
Will Jones
334857a8cb fix: Allow converting from NativeTable to Table (#1069) 2024-04-05 16:31:53 -07:00
Lance Release
386d5da22f Bump version: 0.4.11 → 0.4.12 2024-04-05 16:31:45 -07:00
Lance Release
77ba97416d [python] Bump version: 0.6.1 → 0.6.2 2024-04-05 16:31:45 -07:00
Will Jones
5120bf262b fix: make checkout_latest force a reload (#1064)
#1002 accidentally changed `checkout_latest` to do nothing if the table
was already in latest mode. This PR makes sure it forces a reload of the
table (if there is a newer version).
2024-04-05 16:31:45 -07:00
Lei Xu
f27167017b chore: bump lance to 0.10.2 (#1061) 2024-04-05 16:31:45 -07:00
Weston Pace
73c69a6b9a feat: page_token / limit to native table_names function. Use async table_names function from sync table_names function (#1059)
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.

However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
2024-04-05 16:31:45 -07:00
Will Jones
05f9a77baf feat: more accessible errors (#1025)
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
2024-04-05 16:31:45 -07:00
Chang She
10089481c0 doc(python): document the method in fts (#982)
Co-authored-by: prrao87 <prrao87@gmail.com>
Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Ayush Chaurasia
b5326d31e9 fix(python): Few fts patches (#1039)
1. filtering with fts mutated the schema, which caused schema mistmatch
problems with hybrid search as it combines fts and vector search tables.
2. fts with filter failed with `with_row_id`. This was because row_id
was calculated before filtering which caused size mismatch on attaching
it after.
3. The fix for 1 meant that now row_id is attached before filtering but
passing a filter to `to_lance` on a dataset that already contains
`_rowid` raises a panic from lance. So temporarily, in case where fts is
used with a filter AND `with_row_id`, we just force user to using the
duckdb pathway.

---------

Co-authored-by: Chang She <759245+changhiskhan@users.noreply.github.com>
2024-04-05 16:31:45 -07:00
Weston Pace
c60a193767 fix: sanitize foreign schemas (#1058)
Arrow-js uses brittle `instanceof` checks throughout the code base.
These fail unless the library instance that produced the object matches
exactly the same instance the vectordb is using. At a minimum, this
means that a user using arrow version 15 (or any version that doesn't
match exactly the version that vectordb is using) will get strange
errors when they try and use vectordb.

However, there are even cases where the versions can be perfectly
identical, and the instanceof check still fails. One such example is
when using `vite` (e.g. https://github.com/vitejs/vite/issues/3910)

This PR solves the problem in a rather brute force, but workable,
fashion. If we encounter a schema that does not pass the `instanceof`
check then we will attempt to sanitize that schema by traversing the
object and, if it has all the correct properties, constructing an
appropriate `Schema` instance via deep cloning.
2024-04-05 16:31:42 -07:00
Weston Pace
785ecfa037 feat: reconfigure typescript linter / formatter for nodejs (#1042)
The eslint rules specify some formatting requirements that are rather
strict and conflict with vscode's default formatter. I was unable to get
auto-formatting to setup correctly. Also, eslint has quite recently
[given up on
formatting](https://eslint.org/blog/2023/10/deprecating-formatting-rules/)
and recommends using a 3rd party formatter.

This PR adds prettier as the formatter. It restores the eslint rules to
their defaults. This does mean we now have the "no explicit any" check
back on. I know that rule is pedantic but it did help me catch a few
corner cases in type testing that weren't covered in the current code.
Leaving in draft as this is dependent on other PRs.
2024-04-05 16:31:36 -07:00
Weston Pace
8033a44d68 feat: add support for add to async python API (#1037)
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).

While doing this we also cleaned up some inconsistencies between the
SDKs:

* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.

Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
2024-04-05 16:31:36 -07:00
Chang She
3bbcaba65b chore(rust): update rust version (#810) 2024-04-05 16:31:36 -07:00
Chang She
e60fde73ba feat(python): allow user to override api url (#1054) 2024-04-05 16:31:36 -07:00
Chang She
a7dbe933dc chore(python): use pypi tantivy to speed up CI (#987) 2024-04-05 16:31:36 -07:00
Chang She
4f34a01020 doc: fix docs deployment GHA (#1055) 2024-04-05 16:31:36 -07:00
Prashanth Rao
f9c244e608 [docs]: Fix issues with Rust code snippets in "quick start" (#1047)
The renaming of `vectordb` to `lancedb` broke the [quick start
docs](https://lancedb.github.io/lancedb/basic/#__tabbed_5_3) (it's
pointing to a non-existent directory). This PR fixes the code snippets
and the paths in the docs page.

Additionally, more fixes related to indexing docs below 👇🏽.
2024-04-05 16:31:36 -07:00
Louis Guitton
7f9ef0d329 Fix default_embedding_functions.md (#1043)
typo and broken table
2024-04-05 16:31:36 -07:00
Chang She
a3761f4209 doc: fix langchain link (#1053) 2024-04-05 16:31:36 -07:00
Chang She
4b40dad963 feat(python): add model_names() method to openai embedding function (#1049)
small QoL improvement
2024-04-05 16:31:36 -07:00
QianZhu
b32b69c993 Add create scalar index to sdk (#1033) 2024-04-05 16:31:36 -07:00
Weston Pace
4299f719ec feat: port create_table to the async python API and the remote rust API (#1031)
I've also started `ASYNC_MIGRATION.MD` to keep track of the breaking
changes from sync to async python.
2024-04-05 16:31:36 -07:00
Lance Release
accf31fa92 [python] Bump version: 0.6.0 → 0.6.1 2024-04-05 16:31:36 -07:00
Rob Meng
b8eb5d4bfe fix: fix columns type for pydantic 2.x (#1045) 2024-04-05 16:31:36 -07:00