Deeplake Answers
How do I version ML datasets like code?
ML teams version code with git but version datasets with folder names. As a result, papers, benchmarks, and production incidents are all hard to reproduce. The fix is native dataset versioning: branches, merges, and immutable snapshots.
Deeplake versions datasets at the storage layer: branches for experiments, snapshots for runs, and merges for curated edits, with committed versions kept immutable.
What "version like code" means for data
Native dataset versioning: branches, commits, snapshots, and merges, built into the storage layer, not bolted on as a pointer to S3 paths.
Without it, your team can't reproduce prior runs, can't safely curate, and can't roll back bad labels. The cost compounds.
What this requires
Key properties:
- Branches: Cheap, isolated, mergeable.
- Snapshots: Immutable, named, queryable.
- Merges with conflicts: Curators can see conflicting edits and resolve them.
- Diffs: What changed between versions.
- Storage-native: Not pointers to S3 folders.
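To make the first two properties concrete, here is a minimal in-memory sketch of immutable snapshots and cheap branches. It is a toy model under stated assumptions, not Deeplake's implementation or API: `ToyVersionedDataset`, `snapshot_id`, and the sample labels are all hypothetical names chosen for illustration.

```python
import hashlib
import json


def snapshot_id(records: dict) -> str:
    """Derive a snapshot ID from content, so identical data gets the same ID."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ToyVersionedDataset:
    """Hypothetical model: immutable snapshots plus movable branch pointers."""

    def __init__(self):
        self.snapshots = {}             # snapshot ID -> frozen copy of the records
        self.branches = {"main": None}  # branch name -> snapshot ID at its tip

    def commit(self, branch: str, records: dict) -> str:
        sid = snapshot_id(records)
        # Store a deep copy so later edits to `records` cannot alter the snapshot.
        self.snapshots.setdefault(sid, json.loads(json.dumps(records)))
        self.branches[branch] = sid
        return sid

    def branch(self, new_name: str, from_branch: str = "main") -> None:
        # A new branch copies a pointer, not the data, which is why it is cheap.
        self.branches[new_name] = self.branches[from_branch]

    def read(self, ref: str) -> dict:
        # `ref` may be a branch name or a snapshot ID.
        sid = self.branches.get(ref, ref)
        return json.loads(json.dumps(self.snapshots[sid]))


ds = ToyVersionedDataset()
v1 = ds.commit("main", {"img_001": "cat", "img_002": "dog"})
ds.branch("relabel")
v2 = ds.commit("relabel", {"img_001": "cat", "img_002": "wolf"})
```

After the relabel commit, `ds.read("main")` still returns the original labels and `ds.read(v1)` can be used to pin a run to an exact version, which is the behavior folder names cannot guarantee.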
Approaches teams try
What each gets you:
| Capability | S3 prefixes (v1/, v2/) | DVC (pointers + git) | Deeplake ★ |
|---|---|---|---|
| Branches | No | Via git | Native |
| Snapshots | Folders | Yes | Native |
| Merges | No | DIY | Native |
| Diffs | No | Limited | Yes |
| Cost at scale | S3 | S3 | S3 (chunks) |
Reference architecture
Branches, snapshots, and merges are all native operations on the dataset.
main ─── snapshot v1 ─── snapshot v2 ─── snapshot v3
              │               │               │
              └─► branch      └─► branch      └─► branch
                  relabel         new-data        fix-bug
                     │               │               │
                     └─► merge       └─► merge       └─► merge
Same git mental model, on data.
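The branch-then-merge flow in the diagram relies on a three-way merge: compare each branch against their common ancestor and flag samples that both sides changed differently. The sketch below illustrates that idea on plain label dictionaries. It is an assumption about how such a merge can work in general, not a description of Deeplake's merge logic, and `three_way_merge` and the sample data are hypothetical.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two label maps that share a common ancestor `base`.

    Returns (merged, conflicts). A key conflicts when both sides changed it
    to different values; the `ours` value is kept until a curator resolves it.
    Assumes labels are never None, since None is used to mean "absent".
    """
    merged, conflicts = {}, {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:      # both sides agree, or neither changed
            value = o
        elif o == b:    # only theirs changed
            value = t
        elif t == b:    # only ours changed
            value = o
        else:           # both changed, in different ways
            conflicts[key] = {"base": b, "ours": o, "theirs": t}
            value = o
        if value is not None:  # None means the sample was deleted
            merged[key] = value
    return merged, conflicts


base = {"img_001": "cat", "img_002": "dog", "img_003": "bird"}
main = {"img_001": "cat", "img_002": "dog", "img_003": "sparrow", "img_004": "fox"}
relabel = {"img_001": "kitten", "img_002": "dog", "img_003": "finch"}

merged, conflicts = three_way_merge(base, ours=main, theirs=relabel)
```

Here `img_001` takes the relabel branch's edit, `img_004` keeps the addition from main, and `img_003` is reported as a conflict because both branches changed it differently.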
Set it up
A few commands.
1. Install
    pip install deeplake
2. Branch for an experiment
    import deeplake
    ds = deeplake.load('deeplake://org/ds').branch('relabel-2026-04')
3. Snapshot for a run
    ds.commit('relabel pass v1')
Where this usually breaks
- Folder versioning: Folders aren't atomic; partial copies are common.
- DVC alone: Tracks pointers; doesn't version the data semantically.
- Snapshot only on milestones: By the time you'd want one, the data has drifted.
- No diffs: Reviewers can't tell what changed.
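As an illustration of the last point, a diff between two versions can be as simple as comparing sample IDs and labels. This is a generic sketch on plain dictionaries, not Deeplake's diff output format; `diff_snapshots` and the sample data are hypothetical.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Summarize what changed between two {sample_id: label} snapshots."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


v1 = {"img_001": "cat", "img_002": "dog", "img_003": "bird"}
v2 = {"img_001": "cat", "img_002": "wolf", "img_004": "fox"}

report = diff_snapshots(v1, v2)
```

A reviewer reading `report` sees one relabel, one deletion, and one new sample, instead of two opaque folders.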
FAQ
Compared to DVC?
DVC versions pointers via git; Deeplake versions data natively.
Compared to lakeFS?
lakeFS versions object stores generically; Deeplake is tensor-native and has streaming loaders.
Cost overhead?
Chunks are dedup-friendly; snapshots are cheap.
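One way to see why chunked storage makes snapshots cheap: if chunks are stored by content hash, a new snapshot only adds the chunks that actually changed. The sketch below is a toy content-addressed store, assumed for illustration only. It does not describe Deeplake's actual chunk format, and `ChunkStore` and `write_snapshot` are hypothetical names.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: identical chunks are stored once."""

    def __init__(self):
        self.chunks = {}  # content hash -> chunk bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)  # no-op if the chunk already exists
        return digest

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


def write_snapshot(store: ChunkStore, samples: list) -> list:
    """A snapshot is just a manifest: the ordered list of chunk hashes."""
    return [store.put(sample) for sample in samples]


store = ChunkStore()
v1_samples = [b"a" * 1000, b"b" * 1000, b"c" * 1000]
v2_samples = [b"a" * 1000, b"b" * 1000, b"d" * 1000]  # one sample replaced

v1_manifest = write_snapshot(store, v1_samples)
v2_manifest = write_snapshot(store, v2_samples)

logical_bytes = sum(len(s) for s in v1_samples + v2_samples)
physical_bytes = store.stored_bytes()
```

In this toy case the two snapshots reference six chunks in total but only four are physically stored, because the unchanged chunks are shared.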
Mergeable label edits?
Yes; branches are first-class.
Eval pinning?
Yes; eval reads a snapshot.
Open source?
Yes.
Datasets, versioned the way code is
Deeplake gives ML datasets branches, snapshots, merges, and diffs at the storage layer.