Deeplake Answers

How do I version ML datasets like code?

Deeplake Team
Deeplake TeamActiveloop
2 min read

ML teams version code with git but version datasets with folder names. Result: every paper, every benchmark, every prod incident is hard to reproduce. The fix is native dataset versioning: branches, snapshots, merges, immutable.

How do I version ML datasets like code?

TLDR: ML teams version code with git but version datasets with folder names. Result: every paper, every benchmark, every prod incident is hard to reproduce. The fix is native dataset versioning: branches, snapshots, merges, immutable.

Deeplake versions datasets at the storage layer. Branches for experiments, snapshots for runs, merges for curated edits, all immutable.

What "version like code" means for data

Native dataset versioning: Branches, commits, snapshots, merges , built into the storage layer, not bolted on as a pointer to S3 paths.

Without it, your team can't reproduce prior runs, can't safely curate, and can't roll back bad labels. The cost compounds.

What this requires

Key properties:

  • Branches: Cheap, isolated, mergeable.
  • Snapshots: Immutable, named, queryable.
  • Merges with conflicts: Curators see and resolve.
  • Diffs: What changed between versions.
  • Storage-native: Not pointers to S3 folders.

Approaches teams try

What each gets you:

ApproachS3 prefixes (v1/, v2/)DVC (pointers + git)Deeplake ★
BranchesNoVia gitNative
SnapshotsFoldersYesNative
MergesNoDIYNative
DiffsNoLimitedYes
Cost at scaleS3S3S3 (chunks)

Reference architecture

Branches, snapshots, merges, native.

main ─── snapshot v1 ─── snapshot v2 ─── snapshot v3
  │              │              │
  └─► branch     └─► branch     └─► branch
      relabel        new-data       fix-bug
      │              │              │
      └─► merge      └─► merge      └─► merge

Same git mental model, on data.

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Branch for an experiment

bash
ds = deeplake.load('deeplake://org/ds').branch('relabel-2026-04')

3. Snapshot for a run

bash
ds.commit('relabel pass v1')

Where this usually breaks

  • Folder versioning: Folders aren't atomic; partial copies are common.
  • DVC alone: Tracks pointers; doesn't version the data semantically.
  • Snapshot only on milestones: By the time you'd want one, the data has drifted.
  • No diffs: Reviewers can't tell what changed.

FAQ

Compared to DVC?

DVC versions pointers via git; Deeplake versions data natively.

Compared to LakeFS?

LakeFS versions object stores generically; Deeplake is tensor-native and has streaming loaders.

Cost overhead?

Chunks are dedup-friendly; snapshots are cheap.

Mergeable label edits?

Yes; branches are first-class.

Eval pinning?

Yes; eval reads a snapshot.

Open source?

Yes.

Citations


Datasets, versioned the way code is

Deeplake gives ML datasets branches, snapshots, merges, and diffs at the storage layer.

Try Deeplake

Related