Commit Graph

6 Commits

Author SHA1 Message Date
Weston Pace
70cbee6293 feat: improve Permutation pytorch integration (#3016)
This changes around the output format of `Permutation` in some breaking
ways but I think the API is still new enough to be considered
experimental.

1. In order to align with both huggingface's dataset and torch's
expectations the default output format is now a list of dicts
(row-major) instead of a dict of lists (column-major). I've added a
python_col option which will return the dict of lists.
2. In order to align with pytorch's expectation the `torch` format is
now a list of tensors (row-major) instead of a 2D tensor (column-major).
I've added a torch_col option which will return the 2D tensor instead.

Added tests for torch integration with Permutation

~~Leaving draft until https://github.com/lancedb/lancedb/pull/3013
merges as this is built on top of that~~
2026-02-12 13:41:14 -08:00
Weston Pace
02783bf440 feat: add a getitems implementation for the permutation (#3013) 2026-02-12 05:36:11 -08:00
Will Jones
131024839f fix: include _rowid in hash and calculated split projections (#2965)
## Summary

- PR #2957 changed the permutation builder to only select `_rowid` from
the base table, but `Splitter::project()` for hash and calculated splits
replaced the selection entirely, dropping `_rowid`.
- Include `_rowid` in the column selections for hash and calculated
split projections.
- Fix a Python test that queried the permutation table for base table
columns no longer materialized.

Fixes the `test_split_hash`, `test_split_hash_with_discard`,
`test_split_calculated`, `test_shuffle_combined_with_splits`, and
`test_filter_with_splits` failures in `test_permutation.py`.

## Test plan

- [x] `cargo test -p lancedb -- permutation` (22 passed)
- [x] `pytest python/tests/test_permutation.py` (46 passed)
- [x] `npm test __test__/permutation.test.ts` (20 passed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 16:27:58 -08:00
Weston Pace
aeac9c7644 feat: add python Permutation class to mimic hugging face dataset and provide pytorch dataloader (#2725) 2025-11-06 16:15:33 -08:00
Weston Pace
4cfcd95320 feat: add a permutation reader that can read a permutation view (#2712)
This adds a rust permutation builder. In the next PR I will have python
bindings and integration with pytorch.
2025-10-17 05:00:23 -07:00
Weston Pace
5a19cf15a6 feat: a utility for creating "permutation views" (#2552)
I'm working on a lancedb version of pytorch data loading (and hopefully
addressing https://github.com/lancedb/lance/issues/3727).

However, rather than rely on pytorch for everything I'm moving some of
the things that pytorch does into rust. This gives us more control over
data loading (e.g. using shards or a hash-based split) and it allows
permutations to be persistent. In particular I hope to be able to:

* Create a persistent permutation
* This permutation can handle splits, filtering, shuffling, and sharding
* Create a rust data loader that can read a permutation (one or more
splits), or a subset of a permutation (for DDP)
* Create a python data loader that delegates to the rust data loader

Eventually create integrations for other data loading libraries,
including rust & node
2025-10-09 18:07:31 -07:00