feat: hook up new writer for insert (#3029)

This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is it allows streaming requests to remote tables,
and at the same time allowing retries for most inputs.

In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.

For Python, all are retry-able, except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.datasets.Dataset` are retry-able *and* streaming.

A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:

* casting input data to target schema
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.

In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Will Jones
2026-02-23 14:43:31 -08:00
committed by GitHub
parent 367262662d
commit 0e486511fa
20 changed files with 2446 additions and 359 deletions

View File

@@ -71,6 +71,17 @@ impl Table {
pub async fn add(&self, buf: Buffer, mode: String) -> napi::Result<AddResult> {
let batches = ipc_file_to_batches(buf.to_vec())
.map_err(|e| napi::Error::from_reason(format!("Failed to read IPC file: {}", e)))?;
let batches = batches
.into_iter()
.map(|batch| {
batch.map_err(|e| {
napi::Error::from_reason(format!(
"Failed to read record batch from IPC file: {}",
e
))
})
})
.collect::<Result<Vec<_>>>()?;
let mut op = self.inner_ref()?.add(batches);
op = if mode == "append" {