lancedb

mirror of https://github.com/lancedb/lancedb.git synced 2026-06-06 05:40:40 +00:00

Author	SHA1	Message	Date
Drew Gallardo	aac6c62459	feat(python): add public take_offsets method on Permutation (#3375) Closes #3243. This PR exposes a new public API, `Permutation.take_offsets(offsets: list[int])`, because users previously had to call `__getitems__` directly to batch-fetch rows by position. The name matches the existing `Table.take_offsets` pattern, and the dunder methods `__getitem__` and `__getitems__` now delegate to it. Also fixes a parse error when `PermutationReader::take_offsets` gets an empty list; it now returns an empty `RecordBatch` with the correct schema instead. Bundled this because without the fix the new public API blows up on a perfectly reasonable input. `__getitems__` is preserved since PyTorch's batched DataLoader requires it. ### Testing - Added 3 new Rust tests for empty offsets including permutation table with Select::All, Select::Columns, and identity path - Added 3 new Python tests for the public API including a happy case, and empty input on both identity and permutation clippy, format, check all clean! cc: @westonpace	2026-05-18 09:35:56 -07:00
Weston Pace	1fc23e5473	fix(python): make Permutation picklable for PyTorch multiprocessing (#3335) ## Summary When PyTorch is used with multiprocessing and the mp mode is `spawn`, the Permutation needs to be pickled. It could not be pickled because `Table` and `Connection` are not serializable. This PR adds pickle support to Permutation without adding general pickle support to `Table` or `Connection`. To add general support we probably need to start by adding serialization in the namespace client. In the meantime this PR enables pickling by adding special cases for: * In-memory tables (just serialize as Arrow IPC) * Native tables (serialize the URI) If a user is not using one of the above cases (e.g. using a remote connection) then they will need to provide a connection factory that can be pickled. ## Breaking change `PermutationBuilder.persist(...)` is removed from the Python bindings; the permutation table is now always in-memory. The underlying Rust `PermutationBuilder::persist` API is untouched and can be re-exposed later if needed. It probably won't make sense to do that until we have a way to serialize `Table` and `Connection`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 21:37:58 -07:00
lennylxx	4c2939d66e	fix(python): guard against None before .decode() on split_names metadata key (#3229) `.get(b"split_names", None).decode()` was called unconditionally in both Permutations.__init__ and Permutation.from_tables(), crashing with AttributeError when schema metadata existed but lacked the split_names key. Guard the decode behind a None check and add regression tests.	2026-04-08 16:04:13 -07:00
Weston Pace	70cbee6293	feat: improve Permutation pytorch integration (#3016) This changes the output format of `Permutation` in some breaking ways, but I think the API is still new enough to be considered experimental. 1. In order to align with the expectations of both Hugging Face's datasets and torch, the default output format is now a list of dicts (row-major) instead of a dict of lists (column-major). I've added a python_col option which will return the dict of lists. 2. In order to align with pytorch's expectations, the `torch` format is now a list of tensors (row-major) instead of a 2D tensor (column-major). I've added a torch_col option which will return the 2D tensor instead. Added tests for torch integration with Permutation ~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013 merges as this is built on top of that~~	2026-02-12 13:41:14 -08:00
Weston Pace	02783bf440	feat: add a getitems implementation for the permutation (#3013)	2026-02-12 05:36:11 -08:00
Will Jones	131024839f	fix: include _rowid in hash and calculated split projections (#2965) ## Summary - PR #2957 changed the permutation builder to only select `_rowid` from the base table, but `Splitter::project()` for hash and calculated splits replaced the selection entirely, dropping `_rowid`. - Include `_rowid` in the column selections for hash and calculated split projections. - Fix a Python test that queried the permutation table for base table columns no longer materialized. Fixes the `test_split_hash`, `test_split_hash_with_discard`, `test_split_calculated`, `test_shuffle_combined_with_splits`, and `test_filter_with_splits` failures in `test_permutation.py`. ## Test plan - [x] `cargo test -p lancedb -- permutation` (22 passed) - [x] `pytest python/tests/test_permutation.py` (46 passed) - [x] `npm test __test__/permutation.test.ts` (20 passed) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 16:27:58 -08:00
Weston Pace	aeac9c7644	feat: add python Permutation class to mimic Hugging Face dataset and provide pytorch dataloader (#2725)	2025-11-06 16:15:33 -08:00
Weston Pace	4cfcd95320	feat: add a permutation reader that can read a permutation view (#2712) This adds a rust permutation builder. In the next PR I will have python bindings and integration with pytorch.	2025-10-17 05:00:23 -07:00
Weston Pace	5a19cf15a6	feat: a utility for creating "permutation views" (#2552) I'm working on a lancedb version of pytorch data loading (and hopefully addressing https://github.com/lancedb/lance/issues/3727). However, rather than rely on pytorch for everything I'm moving some of the things that pytorch does into rust. This gives us more control over data loading (e.g. using shards or a hash-based split) and it allows permutations to be persistent. In particular I hope to be able to: * Create a persistent permutation * This permutation can handle splits, filtering, shuffling, and sharding * Create a rust data loader that can read a permutation (one or more splits), or a subset of a permutation (for DDP) * Create a python data loader that delegates to the rust data loader Eventually create integrations for other data loading libraries, including rust & node	2025-10-09 18:07:31 -07:00
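
Commit 70cbee6293 above switches the default `Permutation` output from a dict of lists (column-major) to a list of dicts (row-major). The difference between the two layouts can be sketched in plain Python. The helper names `columns_to_rows` and `rows_to_columns` are hypothetical, chosen for illustration only; they are not part of the lancedb API, and this is not how the library performs the conversion.

```python
def columns_to_rows(columns):
    """Convert a dict of equal-length lists (column-major) to a list of dicts (row-major).

    Hypothetical helper for illustration; not a lancedb function.
    """
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    return [{name: columns[name][i] for name in names} for i in range(n_rows)]


def rows_to_columns(rows):
    """Convert a list of dicts (row-major) back to a dict of lists (column-major).

    Hypothetical helper for illustration; not a lancedb function.
    """
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


# A small column-major batch and its row-major equivalent.
batch = {"id": [1, 2, 3], "label": ["a", "b", "c"]}
rows = columns_to_rows(batch)
```

The row-major form is the shape that per-item consumers such as a PyTorch dataset typically expect, which matches the motivation given in the commit message.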

9 Commits
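
The failure described in commit 4c2939d66e (calling `.decode()` on the result of `.get(b"split_names", None)` when the key is absent) can be reproduced and guarded against with a plain dict standing in for the schema metadata. This is a minimal sketch: the function name `read_split_names` is hypothetical, and the actual fix in lancedb may be structured differently.

```python
def read_split_names(metadata):
    """Return the decoded split_names value, or None when the key is missing.

    `metadata` stands in for schema metadata: a mapping of bytes to bytes,
    or None. Hypothetical helper for illustration; not a lancedb function.
    """
    raw = (metadata or {}).get(b"split_names")
    if raw is None:
        # Without this guard, None.decode() raises AttributeError.
        return None
    return raw.decode()


# Metadata exists but lacks the split_names key: the unguarded call fails.
metadata_without_key = {b"other": b"value"}
try:
    metadata_without_key.get(b"split_names", None).decode()
    unguarded_failed = False
except AttributeError:
    unguarded_failed = True
```

With the guard in place, callers can treat a `None` result as "no split names recorded" instead of handling an exception.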