perf: re-use table instance during write (#1909)

Previously, every call to `Table.add()` wrote the data and then re-opened
the underlying dataset. This hurt performance, because re-opening reset
the table cache and triggered a lot of extra IO. It could also be a
source of bugs, since we didn't always pass all of the required
connection options down when re-opening the table.
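
As a rough illustration of the difference, here is a minimal sketch that uses purely hypothetical stand-ins (`FakeDataset`, `write_dataset`, `ReopeningTable`, `ReusingTable`, and the `mem://` URIs are invented for this example and are not lance or lancedb APIs). It assumes a handle that carries its own cache and connection options, and it models the bug class by having the old path forget to forward those options.

```python
# Hypothetical stand-ins only: nothing here is a real lance/lancedb API.
STORE = {}  # uri -> list of rows; stands in for on-disk storage


class FakeDataset:
    """A dataset handle with its own cache and connection options."""

    def __init__(self, uri, storage_options=None):
        self.uri = uri
        self.storage_options = dict(storage_options or {})
        self.cache = {}  # stands in for the per-handle table cache
        STORE.setdefault(uri, [])

    def insert(self, rows):
        # Append through the existing handle; cache and options survive.
        STORE[self.uri].extend(rows)

    def count_rows(self):
        return len(STORE[self.uri])


def write_dataset(rows, uri, storage_options=None):
    # Write the rows, then hand back a brand-new handle (a "re-open").
    STORE.setdefault(uri, []).extend(rows)
    return FakeDataset(uri, storage_options)


class ReopeningTable:
    """Old pattern: every add() replaces the handle."""

    def __init__(self, uri, storage_options=None):
        self.dataset = FakeDataset(uri, storage_options)

    def add(self, rows):
        # Assumption for illustration: the options are not forwarded here,
        # which is the kind of omission the commit message warns about.
        self.dataset = write_dataset(rows, self.dataset.uri)


class ReusingTable:
    """New pattern: add() inserts through the handle it already holds."""

    def __init__(self, uri, storage_options=None):
        self.dataset = FakeDataset(uri, storage_options)

    def add(self, rows):
        self.dataset.insert(rows)


def demo():
    STORE.clear()
    opts = {"region": "us-east-1"}
    old = ReopeningTable("mem://old", opts)
    new = ReusingTable("mem://new", opts)
    for table in (old, new):
        table.dataset.cache["k"] = "warm"  # pretend the cache is warmed up
        table.add([1, 2])
    return old, new
```

After `demo()`, both tables hold the same rows, but only the re-using table still has its warm cache and its original options.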

Closes #1655
Will Jones
2024-12-05 14:44:50 -08:00
committed by GitHub
parent d6219d687c
commit 3c487e5fc7
2 changed files with 16 additions and 56 deletions

@@ -1624,15 +1624,7 @@ class LanceTable(Table):
             on_bad_vectors=on_bad_vectors,
             fill_value=fill_value,
         )
-        # Access the dataset_mut property to ensure that the dataset is mutable.
-        self._ref.dataset_mut
-        self._ref.dataset = lance.write_dataset(
-            data,
-            self._dataset_uri,
-            schema=self.schema,
-            mode=mode,
-            storage_options=self._ref.storage_options,
-        )
+        self._ref.dataset_mut.insert(data, mode=mode, schema=self.schema)
 
     def merge(
         self,