feat: hook up new writer for insert (#3029)

This hooks up a new writer implementation for the `add()` method. The
main immediate benefit is it allows streaming requests to remote tables,
and at the same time allowing retries for most inputs.

In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's
always retry-able.

For Python, all are retry-able, except `Iterator` and
`pa.RecordBatchReader`, which can only be consumed once. Some, like
`pa.datasets.Dataset` are retry-able *and* streaming.

A lot of the changes here are to make the new DataFusion write pipeline
maintain the same behavior as the existing Python-based preprocessing,
such as:

* casting input data to target schema
* rejecting NaN values if `on_bad_vectors="error"`
* applying embedding functions.

In future PRs, we'll enhance these by moving the embedding calls into
DataFusion and making sure we parallelize them. See:
https://github.com/lancedb/lancedb/issues/3048

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Will Jones
2026-02-23 14:43:31 -08:00
committed by GitHub
parent 367262662d
commit 0e486511fa
20 changed files with 2446 additions and 359 deletions

View File

@@ -7,6 +7,7 @@ use crate::{
error::PythonErrorExt,
index::{extract_index_params, IndexConfig},
query::{Query, TakeQuery},
table::scannable::PyScannable,
};
use arrow::{
datatypes::{DataType, Schema},
@@ -25,6 +26,8 @@ use pyo3::{
};
use pyo3_async_runtimes::tokio::future_into_py;
mod scannable;
/// Statistics about a compaction operation.
#[pyclass(get_all)]
#[derive(Clone, Debug)]
@@ -293,12 +296,10 @@ impl Table {
pub fn add<'a>(
self_: PyRef<'a, Self>,
data: Bound<'_, PyAny>,
data: PyScannable,
mode: String,
) -> PyResult<Bound<'a, PyAny>> {
let batches: Box<dyn arrow::array::RecordBatchReader + Send> =
Box::new(ArrowArrayStreamReader::from_pyarrow_bound(&data)?);
let mut op = self_.inner_ref()?.add(batches);
let mut op = self_.inner_ref()?.add(data);
if mode == "append" {
op = op.mode(AddDataMode::Append);
} else if mode == "overwrite" {