fix(python): make Permutation picklable for PyTorch multiprocessing (#3335)

## Summary

When pytorch is used with multiprocessing and the mp mode is spawn then
the Permutation needs to be pickled. It could not be pickled because
`Table` and `Connection` are not serializable. This PR adds pickle
support to Permutation without adding general pickle support to `Table`
or `Connection`. To add general support we probably need to start by
adding serialization in the namespace client.

In the meantime this PR enable pickling by adding special cases for:

 * In-memory tables (just serialize as Arrow IPC)
 * Native tables (serialize the URI)

If a user is not using one of the above cases (e.g. using a remote
connection) then they will need to provide a connection factory that can
be pickled.

## Breaking change

`PermutationBuilder.persist(...)` is removed from the Python bindings;
the permutation table is now always in-memory. The underlying Rust
`PermutationBuilder::persist` API is untouched and can be re-exposed
later if needed. It probably won't make sense to do that until we have a
way to serialize `Table` and `Connection`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Weston Pace
2026-05-04 21:37:58 -07:00
committed by GitHub
parent 87b831bcae
commit 1fc23e5473
5 changed files with 354 additions and 81 deletions

View File

@@ -9,21 +9,6 @@ from lancedb import DBConnection, Table, connect
from lancedb.permutation import Permutation, Permutations, permutation_builder
def test_permutation_persistence(tmp_path):
db = connect(tmp_path)
tbl = db.create_table("test_table", pa.table({"x": range(100), "y": range(100)}))
permutation_tbl = (
permutation_builder(tbl).shuffle().persist(db, "test_permutation").execute()
)
assert permutation_tbl.count_rows() == 100
re_open = db.open_table("test_permutation")
assert re_open.count_rows() == 100
assert permutation_tbl.to_arrow() == re_open.to_arrow()
def test_split_random_ratios(mem_db):
"""Test random splitting with ratios."""
tbl = mem_db.create_table(