Implements `InsertExec` and `RemoteInsertExec` to support running
inserts in DataFusion.
## Context
In https://github.com/lancedb/lancedb/pull/2929, I've prototyped moving
the insert pipeline into DataFusion. This will enable parallelism at two
levels:
1. Running preprocessing, such as casting the input schema or computing
embeddings
2. Writing out files
This PR is just the first part of running the actual writes. In the end,
the plans might look like:
```
InsertExec
RepartitionExec num_partitions=<write_parallelism>
ProjectionExec vector=compute_embedding()
RepartitionExec num_partitions=<num_cpus>
DataSourceExec
```
where `num_cpus` is used to take advantage of all cores, while
`write_parallelism` might be less than `num_cpus` if there are too few
rows to want to split writes across `num_cpus` files.
Later PRs will move the preprocessing steps into DataFusion, and then
hook this up to the `Table::add()` implementations.
## Relation to future SQL work
We eventually plan on having the Remote SDK go through a FlightSQL
endpoint. Then for most queries we will send just the SQL string to the
server, and not run any sort of DataFusion plan on the client.
However, I think writes will be a little special, especially bulk writes
where we need to upload large streams of data and likely want
parallelism. So we'll have different code paths for writes, and I think
using DataFusion makes sense, especially as long as we are doing the
pre-processing on the client side still.
Issues found during integration tests:
1. describe_namespace should use POST
2. service needs to access the underlying namespace to be able to do
operations like create_empty_table directly, or get credentials in
isolated paths like a remote take
## Summary
- update all lance crates to v1.0.0 using the helper script (fallbacks
to the v1.0.0 tag)
- refresh Cargo.lock to pull the new release
- add script fallback to retry with the git tag when a crates.io release
is unavailable
## Testing
- cargo clippy --workspace --tests --all-features -- -D warnings
- cargo fmt --all
Tag: https://github.com/lance-format/lance/releases/tag/v1.0.0
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
1. Use generated models in lance-namespace for request response models
to avoid multiple layers of conversions
2. Make sure the API is consistent with the namespace spec
3. Deprecate the table_names API in favor of the list_tables API in
namespace that allows full pagination support without the need to have
sorted table names
4. Add describe_namespace API which was a miss in the original
implementation
## Summary
- bump all Lance crates to v1.0.0-beta.16 via ci/set_lance_version.py
- refresh Cargo.lock (reqwest/opendal/etc.) to satisfy the new release
## Verification
- cargo clippy --workspace --tests --all-features -- -D warnings
- cargo fmt --all
Triggered by
[refs/tags/v1.0.0-beta.16](https://github.com/lance-format/lance/releases/tag/v1.0.0-beta.16)
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
## Summary
- bump all Lance crates to 1.0.0-beta.14 via ci/set_lance_version.py
- refresh Cargo.lock to capture new transitive requirements
- verified `cargo clippy --workspace --tests --all-features -- -D
warnings` and `cargo fmt --all`
Triggered by refs/tags/v1.0.0-beta.14
---------
Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
Based on https://github.com/lancedb/lance/pull/4984
1. Bump to 1.0.0-beta.2
2. Use DirectoryNamespace in lance to perform all testing in python and
rust for much better coverage
3. Refactor `ListingDatabase` to be able to accept location and
namespace. This is because we have to leverage listing database (local
lancedb connection) for using namespace, namespace only resolves the
location and storage options but we don't want to bind all the way to
rust since user will plug-in namespace from python side. And thus
`ListingDatabase` needs to be able to accept location and namespace that
are created from namespace connection.
4. For credentials vending, we also pass storage options provider all
the way to rust layer, and the rust layer calls back to the python
function to fetch next storage option. This is exactly the same thing we
did in pylance.