mirror of
https://github.com/lancedb/lancedb.git
synced 2026-05-16 19:40:40 +00:00
feat: hook up new writer for insert (#3029)
This hooks up a new writer implementation for the `add()` method. The main immediate benefit is it allows streaming requests to remote tables, and at the same time allowing retries for most inputs. In NodeJS, we always convert the data to `Vec<RecordBatch>`, so it's always retry-able. For Python, all are retry-able, except `Iterator` and `pa.RecordBatchReader`, which can only be consumed once. Some, like `pa.datasets.Dataset` are retry-able *and* streaming. A lot of the changes here are to make the new DataFusion write pipeline maintain the same behavior as the existing Python-based preprocessing, such as: * casting input data to target schema * rejecting NaN values if `on_bad_vectors="error"` * applying embedding functions. In future PRs, we'll enhance these by moving the embedding calls into DataFusion and making sure we parallelize them. See: https://github.com/lancedb/lancedb/issues/3048 --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -7,6 +7,7 @@ use crate::{
|
||||
error::PythonErrorExt,
|
||||
index::{extract_index_params, IndexConfig},
|
||||
query::{Query, TakeQuery},
|
||||
table::scannable::PyScannable,
|
||||
};
|
||||
use arrow::{
|
||||
datatypes::{DataType, Schema},
|
||||
@@ -25,6 +26,8 @@ use pyo3::{
|
||||
};
|
||||
use pyo3_async_runtimes::tokio::future_into_py;
|
||||
|
||||
mod scannable;
|
||||
|
||||
/// Statistics about a compaction operation.
|
||||
#[pyclass(get_all)]
|
||||
#[derive(Clone, Debug)]
|
||||
@@ -293,12 +296,10 @@ impl Table {
|
||||
|
||||
pub fn add<'a>(
|
||||
self_: PyRef<'a, Self>,
|
||||
data: Bound<'_, PyAny>,
|
||||
data: PyScannable,
|
||||
mode: String,
|
||||
) -> PyResult<Bound<'a, PyAny>> {
|
||||
let batches: Box<dyn arrow::array::RecordBatchReader + Send> =
|
||||
Box::new(ArrowArrayStreamReader::from_pyarrow_bound(&data)?);
|
||||
let mut op = self_.inner_ref()?.add(batches);
|
||||
let mut op = self_.inner_ref()?.add(data);
|
||||
if mode == "append" {
|
||||
op = op.mode(AddDataMode::Append);
|
||||
} else if mode == "overwrite" {
|
||||
|
||||
Reference in New Issue
Block a user