How do I version ML datasets like code?

TLDR: ML teams version code with git but version datasets with folder names. As a result, papers, benchmarks, and production incidents all become hard to reproduce. The fix is native dataset versioning: branches, immutable snapshots, and merges.

Deeplake versions datasets at the storage layer. Branches for experiments, snapshots for runs, merges for curated edits, all immutable.

What "version like code" means for data

Native dataset versioning: Branches, commits, snapshots, and merges, built into the storage layer, not bolted on as a pointer to S3 paths.

Without it, your team can't reproduce prior runs, can't safely curate, and can't roll back bad labels. The cost compounds.

What this requires

Key properties:

Branches: Cheap, isolated, mergeable.
Snapshots: Immutable, named, queryable.
Merges with conflicts: Curators see and resolve.
Diffs: What changed between versions.
Storage-native: Not pointers to S3 folders.
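To make these properties concrete, here is a minimal in-memory sketch. It is a toy model, not Deeplake's API: `ToyRepo`, `Snapshot`, and the sample ids are hypothetical names chosen for illustration. It assumes a branch is just a named pointer to an immutable snapshot, which is why creating one copies no data.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Snapshot:
    """An immutable, named view of the dataset (sample id -> label)."""
    name: str
    parent: Optional[str]
    rows: Mapping[str, str]


class ToyRepo:
    """Hypothetical toy model of branches and snapshots (not a real library)."""

    def __init__(self) -> None:
        self.snapshots: Dict[str, Snapshot] = {
            "root": Snapshot("root", None, MappingProxyType({}))
        }
        self.branches: Dict[str, str] = {"main": "root"}  # branch -> snapshot name

    def branch(self, name: str, source: str = "main") -> None:
        # Cheap: the new branch points at the same snapshot, nothing is copied.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, name: str, changes: Dict[str, str]) -> Snapshot:
        parent = self.snapshots[self.branches[branch]]
        # Build a new read-only mapping; the parent snapshot is never modified.
        rows = MappingProxyType({**parent.rows, **changes})
        snap = Snapshot(name, parent.name, rows)
        self.snapshots[name] = snap
        self.branches[branch] = name
        return snap


repo = ToyRepo()
repo.commit("main", "v1", {"img_001": "cat", "img_002": "dog"})
repo.branch("relabel")
repo.commit("relabel", "relabel-pass-1", {"img_002": "wolf"})

main_rows = repo.snapshots[repo.branches["main"]].rows
relabel_rows = repo.snapshots[repo.branches["relabel"]].rows
print(main_rows["img_002"], relabel_rows["img_002"])  # dog wolf
```

The relabel happens in isolation: `main` still reads `dog` until someone merges, and the read-only mapping rejects in-place edits to a snapshot.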

Approaches teams try

What each gets you:

Approach	S3 prefixes (v1/, v2/)	DVC (pointers + git)	Deeplake ★
Branches	No	Via git	Native
Snapshots	Folders	Yes	Native
Merges	No	DIY	Native
Diffs	No	Limited	Yes
Cost at scale	S3	S3	S3 (chunks)

Reference architecture

Branches, snapshots, and merges are native operations: each branch forks from a snapshot on main and merges back.

main ─── snapshot v1 ─── snapshot v2 ─── snapshot v3
  │              │              │
  └─► branch     └─► branch     └─► branch
      relabel        new-data       fix-bug
      │              │              │
      └─► merge      └─► merge      └─► merge

Same git mental model, on data.
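The merge step in the diagram can be sketched as a three-way merge over labels. This is an illustrative assumption about how a merge with conflicts might behave, not a description of Deeplake's merge implementation; `three_way_merge` and the sample data are hypothetical.

```python
from typing import Dict, Optional, Tuple

Labels = Dict[str, str]
Conflicts = Dict[str, Tuple[Optional[str], Optional[str]]]


def three_way_merge(base: Labels, ours: Labels, theirs: Labels) -> Tuple[Labels, Conflicts]:
    """Merge two branches that forked from `base`; report keys both sides changed."""
    merged: Labels = {}
    conflicts: Conflicts = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o      # both sides agree (or both deleted the sample)
        elif o == b:
            value = t      # only their branch changed it
        elif t == b:
            value = o      # only our branch changed it
        else:
            conflicts[key] = (o, t)
            value = b      # keep the base value until a curator resolves it
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"img_001": "cat", "img_002": "dog"}
new_data = {"img_001": "cat", "img_002": "dog", "img_003": "fish"}  # main added a sample
relabel = {"img_001": "cat", "img_002": "wolf"}                      # branch fixed a label

merged, conflicts = three_way_merge(base, new_data, relabel)
print(merged)     # {'img_001': 'cat', 'img_002': 'wolf', 'img_003': 'fish'}
print(conflicts)  # {}

# Two branches relabel the same sample differently: surfaced, not silently overwritten.
_, clash = three_way_merge(base, {"img_001": "cat", "img_002": "fox"}, relabel)
print(clash)      # {'img_002': ('fox', 'wolf')}
```

Non-overlapping edits combine automatically. Overlapping edits are returned as conflicts so a curator can see both candidate labels and pick one.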

Set it up

A few commands.

1. Install

bash

pip install deeplake

2. Branch for an experiment

python

import deeplake
ds = deeplake.load('deeplake://org/ds').branch('relabel-2026-04')

3. Snapshot for a run

python

ds.commit('relabel pass v1')

Where this usually breaks

Folder versioning: Folders aren't atomic; partial copies are common.
DVC alone: Tracks pointers; doesn't version the data semantically.
Snapshot only on milestones: By the time you'd want one, the data has drifted.
No diffs: Reviewers can't tell what changed.
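As a rough illustration of what a reviewer needs from a diff, the sketch below compares two versions of a label map. `diff_versions` is a hypothetical helper written for this example; a storage-native system would compute the same kind of summary from its own metadata.

```python
from typing import Dict, Tuple


def diff_versions(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, dict]:
    """Summarise which samples were added, removed, or relabelled."""
    added = {k: new[k] for k in sorted(new.keys() - old.keys())}
    removed = {k: old[k] for k in sorted(old.keys() - new.keys())}
    changed: Dict[str, Tuple[str, str]] = {
        k: (old[k], new[k])
        for k in sorted(old.keys() & new.keys())
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


v1 = {"img_001": "cat", "img_002": "dog", "img_003": "bird"}
v2 = {"img_001": "cat", "img_002": "wolf", "img_004": "fish"}

report = diff_versions(v1, v2)
print(report["added"])    # {'img_004': 'fish'}
print(report["removed"])  # {'img_003': 'bird'}
print(report["changed"])  # {'img_002': ('dog', 'wolf')}
```

With folder copies, producing this report means scanning both folders in full. With versioning metadata, it can be read from what each commit recorded.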

FAQ

Compared to DVC?

DVC versions pointers via git; Deeplake versions data natively.

Compared to LakeFS?

LakeFS versions object stores generically; Deeplake is tensor-native and has streaming loaders.

Cost overhead?

Chunks are dedup-friendly; snapshots are cheap.
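One way to see why chunked storage can make snapshots cheap is content addressing: a chunk is stored under the hash of its bytes, so two snapshots that share a chunk store it once. The sketch below assumes that scheme for illustration only. `ChunkStore` and `snapshot` are hypothetical, and the source does not specify Deeplake's actual chunk layout.

```python
import hashlib
from typing import Dict, List, Tuple


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept once."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}  # sha256 hex digest -> chunk bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(digest, data)  # no-op if the chunk already exists
        return digest


def snapshot(store: ChunkStore, chunks: List[bytes]) -> Tuple[str, ...]:
    """A snapshot is only an ordered tuple of chunk digests."""
    return tuple(store.put(chunk) for chunk in chunks)


store = ChunkStore()
v1 = snapshot(store, [b"chunk-a", b"chunk-b", b"chunk-c"])
v2 = snapshot(store, [b"chunk-a", b"chunk-b-relabelled", b"chunk-c"])

shared = len(set(v1) & set(v2))
print(len(store.chunks), shared)  # 4 2
```

Two snapshots of three chunks each occupy four stored chunks rather than six, because only the modified chunk is new.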

Mergeable label edits?

Yes; branches are first-class.

Eval pinning?

Yes; eval reads a snapshot.
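A minimal sketch of eval pinning, assuming the eval config stores a snapshot name plus a content fingerprint. The names `fingerprint`, `load_pinned`, and `eval-v1` are hypothetical, and with truly immutable snapshots the check should never fail. It is shown here against a mutable dict to demonstrate what pinning protects against.

```python
import hashlib
import json
from typing import Dict


def fingerprint(rows: Dict[str, str]) -> str:
    """Deterministic digest of a snapshot's contents."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def load_pinned(snapshots: Dict[str, Dict[str, str]], name: str, expected: str) -> Dict[str, str]:
    """Return the snapshot only if it still matches the pinned fingerprint."""
    rows = snapshots[name]
    actual = fingerprint(rows)
    if actual != expected:
        raise ValueError(f"snapshot {name!r} no longer matches its pin")
    return rows


snapshots = {"eval-v1": {"img_001": "cat", "img_002": "dog"}}
pin = {"snapshot": "eval-v1", "sha256": fingerprint(snapshots["eval-v1"])}

eval_rows = load_pinned(snapshots, pin["snapshot"], pin["sha256"])  # passes

snapshots["eval-v1"]["img_002"] = "wolf"  # simulate silent drift in a mutable folder
try:
    load_pinned(snapshots, pin["snapshot"], pin["sha256"])
    drift_detected = False
except ValueError:
    drift_detected = True
print(drift_detected)  # True
```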

Open source?

Yes.

Datasets, versioned the way code is

Deeplake gives ML datasets branches, snapshots, merges, and diffs at the storage layer.
