The optimize function is pretty crucial for getting good performance
when building a large scale dataset but it was only exposed in rust
(many sync python users are probably doing this via to_lance today)
This PR adds the optimize function to nodejs and to python.
I left the function marked experimental because I think there will
likely be changes to optimization (e.g. if we add features like
"optimize on write"). I also only exposed the `cleanup_older_than`
configuration parameter since this one is very commonly used and the
rest have sensible defaults and we don't really know why we would
recommend different values for these defaults anyways.
Todo:
- [x] add proper documentation
- [x] add unit tests
- [x] better handling of the registry**1
- [x] allow user defined registry**2
**1 The python implementation just uses a global registry so it makes
things a bit easier. I attached it to the db/connection to prevent
future conflicts if running multiple connections/databases. I mostly
modeled the registry & pattern off of datafusion's
[FunctionRegistry](https://docs.rs/datafusion/latest/datafusion/execution/trait.FunctionRegistry.html).
**2 Ideally, the user should be able to provide it's own registry
entirely, but currently it just uses an in memory registry by default
(_which isn't configurable_)
`rust/lancedb/examples/embedding_registry.rs` provides a thorough
example of expected usage.
---
Some additional notes:
This does not provide any of the out of box functionality that the
python registry does.
_i.e there are no built-in embedding functions._
You can think of this as the ground work for adding those built in
functions, So while this is part of
https://github.com/lancedb/lancedb/issues/994, it does not yet offer
feature parity.
This was already configurable in the rust API but it wasn't actually
being passed down to the underlying dataset. I added this option to both
the async python API and the new nodejs API.
I also added this option to the synchronous python API.
I did not add the option to vectordb.
The rust implementation of the remote client is not yet ready. This is
understandably confusing for users since it is enabled by default. This
PR disables it by default. We can re-enable it when we are ready (even
then it is not clear this is something that should be a default
feature).
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
1. added rename_table fn to enable dashboard to rename a table
2. added index_type and distance_type (for vector index) to index_stats
so that more detailed data can be shown on the table page.
Exposes `storage_options` in LanceDB. This is provided for Python async,
Node `lancedb`, and Node `vectordb` (and Rust of course). Python
synchronous is omitted because it's not compatible with the PyArrow
filesystems we use there currently. In the future, we will move the sync
API to wrap the async one, and then it will get support for
`storage_options`.
1. Fixes#1168
2. Closes#1165
3. Closes#1082
4. Closes#439
5. Closes#897
6. Closes#642
7. Closes#281
8. Closes#114
9. Closes#990
10. Deprecating `awsCredentials` and `awsRegion`. Users are encouraged
to use `storageOptions` instead.
In
2de226220b
I added a new `IntoArrow` trait for adding data into a table.
Unfortunately, it seems my approach for implementing the trait for
"things that are already record batch readers" was flawed. This PR
corrects that flaw and, conveniently, removes the need to box readers at
all (though it is ok if you do).
This PR originated from a request to add `Serialize` / `Deserialize` to
`lance_linalg::distance::DistanceType`. However, that is a strange
request for `lance_linalg` which shouldn't really have to worry about
`Serialize` / `Deserialize`. The problem is that `lancedb` is re-using
`DistanceType` and things in `lancedb` do need to worry about
`Serialize`/`Deserialize` (because `lancedb` needs to support remote
client).
On the bright side, separating the two types allows us to independently
document distance type and allows `lance_linalg` to make changes to
`DistanceType` in the future without having to worry about backwards
compatibility concerns.
This will make it easier for 3rd party integrations. They simply need to
implement `IntoArrow` for their types in order for those types to be
used in ingestion.
In Rust and Node, we have been swallowing filter validation errors. If
there was an error in parsing the filter, then the filter was silently
ignored, returning unfiltered results.
Fixes#1081
#1002 accidentally changed `checkout_latest` to do nothing if the table
was already in latest mode. This PR makes sure it forces a reload of the
table (if there is a newer version).
The synchronous table_names function in python lancedb relies on arrow's
filesystem which behaves slightly differently than object_store. As a
result, the function would not work properly in GCS.
However, the async table_names function uses object_store directly and
thus is accurate. In most cases we can fallback to using the async
table_names function and so this PR does so. The one case we cannot is
if the user is already in an async context (we can't start a new async
event loop). Soon, we can just redirect those users to use the async API
instead of the sync API and so that case will eventually go away. For
now, we fallback to the old behavior.
The fact that we convert errors to strings makes them really hard to
work with. For example, in SaaS we want to know whether the underlying
`lance::Error` was the `InvalidInput` variant, so we can return a 400
instead of a 500.
In order to add support for `add` we needed to migrate the rust `Table`
trait to a `Table` struct and `TableInternal` trait (similar to the way
the connection is designed).
While doing this we also cleaned up some inconsistencies between the
SDKs:
* Python and Node are garbage collected languages and it can be
difficult to trigger something to be freed. The convention for these
languages is to have some kind of close method. I added a close method
to both the table and connection which will drop the underlying rust
object.
* We made significant improvements to table creation in
cc5f2136a6
for the `node` SDK. I copied these changes to the `nodejs` SDK.
* The nodejs tables were using fs to create tmp directories and these
were not getting cleaned up. This is mostly harmless but annoying and so
I changed it up a bit to ensure we cleanup tmp directories.
* ~~countRows in the node SDK was returning `bigint`. I changed it to
return `number`~~ (this actually happened in a previous PR)
* Tables and connections now implement `std::fmt::Display` which is
hooked into python's `__repr__`. Node has no concept of a regular "to
string" function and so I added a `display` method.
* Python method signatures are changing so that optional parameters are
always `Optional[foo] = None` instead of something like `foo = False`.
This is because we want those defaults to be in rust whenever possible
(though we still need to mention the default in documentation).
* I changed the python `AsyncConnection/AsyncTable` classes from
abstract classes with a single implementation to just classes because we
no longer have the remote implementation in python.
Note: this does NOT add the `add` function to the remote table. This PR
was already large enough, and the remote implementation is unique
enough, that I am going to do all the remote stuff at a later date (we
should have the structure in place and correct so there shouldn't be any
refactor concerns)
---------
Co-authored-by: Will Jones <willjones127@gmail.com>