## Summary
Continues the modularization effort of table operations as outlined in
#2949.
- Extracts optimization operations (`OptimizeAction`, `OptimizeStats`,
`execute_optimize`, `compact_files_impl`, `cleanup_old_versions`,
`optimize_indices`) from
`table.rs` into `table/optimize.rs`
- Public API remains unchanged via re-exports
- Adds comprehensive tests including error cases with message assertions
## Test plan
- [x] All new optimization tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
Continues the modularization effort of schema evolution operations as
outlined in #2949
## Summary
- Extracts schema evolution operations (add_columns, alter_columns,
drop_columns) from `table.rs` into `table/schema_evolution.rs`
- Public API remains unchanged via re-exports
## Test plan
- [x] All new schema evolution tests pass
- [x] All existing tests pass
- [x] `cargo clippy` passes with no warnings
- [x] `cargo fmt --check` passes
Expose `initial_storage_options()` and `latest_storage_options()` in
lance Dataset, in lancedb rust, python and typescript SDKs.
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Implements `InsertExec` and `RemoteInsertExec` to support running
inserts in DataFusion.
## Context
In https://github.com/lancedb/lancedb/pull/2929, I've prototyped moving
the insert pipeline into DataFusion. This will enable parallelism at two
levels:
1. Running preprocessing, such as casting the input schema or computing
embeddings
2. Writing out files
This PR is just the first part of running the actual writes. In the end,
the plans might look like:
```
InsertExec
RepartitionExec num_partitions=<write_parallelism>
ProjectionExec vector=compute_embedding()
RepartitionExec num_partitions=<num_cpus>
DataSourceExec
```
where `num_cpus` is used to take advantage of all cores, while
`write_parallelism` might be less than `num_cpus` if there are too few
rows to want to split writes across `num_cpus` files.
Later PRs will move the preprocessing steps into DataFusion, and then
hook this up to the `Table::add()` implementations.
## Relation to future SQL work
We eventually plan on having the Remote SDK go through a FlightSQL
endpoint. Then for most queries we will send just the SQL string to the
server, and not run any sort of DataFusion plan on the client.
However, I think writes will be a little special, especially bulk writes
where we need to upload large streams of data and likely want
parallelism. So we'll have different code paths for writes, and I think
using DataFusion makes sense, especially as long as we are doing the
pre-processing on the client side still.
References #2949 Part 2 of table.rs refactor. Moved UpdateResult,
UpdateBuilder, and execution logic to src/table/update.rs. No functional
changes API remains identical.
---------
Co-authored-by: Will Jones <willjones127@gmail.com>
## Summary
- PR #2957 changed the permutation builder to only select `_rowid` from
the base table, but `Splitter::project()` for hash and calculated splits
replaced the selection entirely, dropping `_rowid`.
- Include `_rowid` in the column selections for hash and calculated
split projections.
- Fix a Python test that queried the permutation table for base table
columns no longer materialized.
Fixes the `test_split_hash`, `test_split_hash_with_discard`,
`test_split_calculated`, `test_shuffle_combined_with_splits`, and
`test_filter_with_splits` failures in `test_permutation.py`.
## Test plan
- [x] `cargo test -p lancedb -- permutation` (22 passed)
- [x] `pytest python/tests/test_permutation.py` (46 passed)
- [x] `npm test __test__/permutation.test.ts` (20 passed)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
References #2949 Moved DeleteResult and delete() implementation to
src/table/delete.rs. No functional changes. Added a test delete which
works. Will work on refactoring update next.
Fixes the Rust SDK's `create_empty_table` to properly support embedding
column definitions, bringing it to parity with the Python SDK.
## Problem
The Rust SDK's `Connection::create_empty_table` did not support setting
embedding columns. When using `.add_embedding()` on the builder, the
embedding column definitions were lost because
`TableDefinition::new_from_schema(schema)` marks all columns as physical
only, without embedding metadata.
The Python SDK worked around this by creating an empty record batch with
proper schema metadata rather than using `create_empty_table` directly.
## Solution
Modified `CreateTableBuilder<false>` to handle embeddings
Closes#2759
The permutation table was always intended to be a small table of row id
pointers (and split id). However, it was accidentally doing a full
materialization of the base table 🤦
This PR changes the permutation builder to only store row id and split
id.
Realized our MSRV check was inert because `rust-toolchain.toml` was
overriding the Rust version. We set the `RUSTUP_TOOLCHAIN` environment
variable, which overrides that.
Also needed to update to MSRV 1.88 (due to dependencies like Lance and
DataFusion) and fix some clippy warnings.
Unlike in Amazon S3, in Azure bucket names are not globally unique.
Instead, the combination of (storage_account_name, bucket_name) is
unique.
Therefore, when using Azure blob store, we always need a way to
configure the storage account name. One way is to use the
storage_options hash map and set azure_storage_account_name. Another way
is to set an environment variable, AZURE_STORAGE_ACCOUNT_NAME.
Prior to this PR, the second way (environment variable) did not work
with remote connections. This is because the existing code that checks
for these environment variables happens inside the Azure object store
implementation itself, which does not run locally when using remote
connections.
This PR addresses that situation by adding a check of the environment
variable. This functions as a default if the relevant storage option is
not set in the storage_options hash map.
BREAKING CHANGE: removes `aws`, `dynamodb`, `azure`, `gcs`, `oss`,
`huggingface` from default Rust features. They can be enabled by users
as needed.
They are still enabled for Python and NodeJS, since those users don't
control the compilation of artifacts.
Closes#2911
Implement parallel execution of multiple embedding functions using
std:🧵:scope to improve performance when a table has multiple
embedding columns.
Key changes:
- Add compute_embeddings_parallel() helper method to WithEmbeddings
- Use fast path for single embeddings (no threading overhead)
- Use scoped threads for parallel execution of multiple embeddings
- Add comprehensive tests including parallelization timing verification
- Update WithEmbeddings documentation
Performance improvements:
- I/O-bound embeddings (OpenAI, Bedrock): High benefit from concurrent
API calls
- CPU-bound embeddings (sentence-transformers): Medium benefit from core
utilization
- Single embedding: No overhead (fast path)
Closes TODO on line 266 in rust/lancedb/src/embeddings.rs
Convert test_table_names to test both remote and local connections.
This PR also includes some miscellaneous improvements in
src/test_utils/connection.rs. It starts a thread to drain stdout from
the server process. It adds the
PRINT_LANCEDB_TEST_CONNECTION_SCRIPT_OUTPUT environment variable, which
optionally displays server stdout.
Fix a bash conditional in run_with_test_connection.sh.
Issues found during integration tests:
1. describe_namespace should use POST
2. service needs to access the underlying namespace to be able to do
operations like create_empty_table directly, or get credentials in
isolated paths like a remote take
Adds IVF_SQ index config through Rust core and Python bindings, plus
alias names IvfHnswSq/Pq for backward compatibility. Updates
remote/table helpers and types to accept the new index type. Includes
tests covering IVF SQ creation and alias usage.