How should I unify training data curation and model evaluation for an AV perception stack?

TLDR: Many AV teams curate in one tool (a labeling UI on top of S3) and evaluate in another (custom scripts on Parquet). The two diverge: the curation slice that surfaces hard cases is not the slice the eval harness runs. Bugs hide in the gap.

Deeplake unifies curation and eval on one dataset. The curator's slice ("night, urban, low-light pedestrians") is the same query the eval harness runs. Every training run is pinned to a versioned snapshot. Reproducibility is structural, not a process discipline.

What "unified" means here

Unified curation + eval: One dataset. One query API. The slice a curator marks as "hard" is identifiable by the eval harness as a query, not a copied subset. Snapshots pin both training and eval to the same data state.

When curation and eval diverge, regressions ship. "It passed eval" stops meaning anything because eval ran on data that doesn't match production conditions.

What unified curation+eval requires

Four properties:

Single source dataset: Curation and eval read the same versioned store, not exports.
Slice as query: A "hard cases" slice is a saved query, not a copy.
Snapshot per training run: Every run pinned to an immutable snapshot. Reproducible evals.
Hybrid retrieval: Vector + scalar predicates so curators find rare events efficiently.
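To make "slice as query" concrete, here is a minimal, library-agnostic sketch. It does not use the Deeplake API: `SNAPSHOTS`, `SavedSlice`, and `resolve` are hypothetical names, and the in-memory dict only stands in for a versioned store. The point it illustrates is that a slice holds a rule rather than rows, so curator and eval harness get the same rows whenever they resolve it against the same snapshot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]

# Hypothetical in-memory stand-in for a versioned store:
# each snapshot id maps to an immutable tuple of rows.
SNAPSHOTS: Dict[str, Tuple[Row, ...]] = {
    "v122": (
        {"id": "f1", "cls": "pedestrian", "time_of_day": "night"},
        {"id": "f2", "cls": "car", "time_of_day": "day"},
    ),
    "v123": (
        {"id": "f1", "cls": "pedestrian", "time_of_day": "night"},
        {"id": "f2", "cls": "car", "time_of_day": "day"},
        {"id": "f3", "cls": "pedestrian", "time_of_day": "night"},
    ),
}


@dataclass(frozen=True)
class SavedSlice:
    """A named predicate. It stores no rows, only the rule that selects them."""

    name: str
    predicate: Callable[[Row], bool]

    def resolve(self, snapshot_id: str) -> List[str]:
        # Evaluate the predicate against one pinned snapshot.
        return [r["id"] for r in SNAPSHOTS[snapshot_id] if self.predicate(r)]


hard_night_peds = SavedSlice(
    name="hard_night_peds",
    predicate=lambda r: r["cls"] == "pedestrian" and r["time_of_day"] == "night",
)

# Curator and eval harness resolve the same slice against the same snapshot.
curator_view = hard_night_peds.resolve("v123")
eval_view = hard_night_peds.resolve("v123")
```

Resolving the same slice against "v122" returns fewer rows, which is the intended behavior: the slice definition stays fixed while the snapshot decides what it matches.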

How teams structure this

What you actually get:

Approach	Separate curation tool + eval scripts	One Parquet warehouse, two pipelines	Deeplake (unified)
Curation slice = eval slice	No	Only if discipline holds	Yes, same query
Versioning	Folder copies	Custom-built	Native
Hybrid query	No	SQL only	Built-in
Multimodal data	Files in external S3	Tabular only	Native

Reference: one dataset, two access patterns

Curators and eval harnesses talk to the same store.

Deeplake dataset (versioned)
     │
     ├─► curator: hybrid query, label edits, slice tagging
     │       writes go to a branch, merge after review
     │
     ├─► training: snapshot ds@v123
     │
     └─► eval: same snapshot, slice = saved query

Branchable, queryable, snapshot-pinned. Same artifacts, different lenses.
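The access pattern in the diagram can be sketched with a toy store. This is an illustration under stated assumptions, not Deeplake's implementation: `ToyVersionedLabels` and its methods are hypothetical, labels are a plain dict, and merge simply promotes one branch to a new snapshot without any conflict handling.

```python
from types import MappingProxyType
from typing import Dict, Mapping

Labels = Mapping[str, str]  # frame id -> class label


class ToyVersionedLabels:
    """Hypothetical stand-in for a versioned dataset; not a real library API."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._snapshots: Dict[str, Labels] = {"v1": MappingProxyType(dict(initial))}
        self._branches: Dict[str, Dict[str, str]] = {}
        self._head = "v1"

    def snapshot(self, version: str) -> Labels:
        # Read-only view: training and eval cannot mutate a snapshot.
        return self._snapshots[version]

    def branch(self, name: str) -> Dict[str, str]:
        # Curators edit a private copy of the current head.
        self._branches[name] = dict(self._snapshots[self._head])
        return self._branches[name]

    def merge(self, name: str) -> str:
        # After review, a copy of the branch becomes a new immutable snapshot.
        version = f"v{len(self._snapshots) + 1}"
        self._snapshots[version] = MappingProxyType(dict(self._branches.pop(name)))
        self._head = version
        return version


store = ToyVersionedLabels({"f1": "pedestrian", "f2": "pedestrian"})
pinned = store.snapshot("v1")          # what a training run reads

edits = store.branch("relabel-night")  # curator works on a branch
edits["f2"] = "cyclist"
new_version = store.merge("relabel-night")
```

The training run that pinned "v1" keeps seeing the original labels after the merge; only readers that ask for the new version see the curator's edit.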

Wire curation + eval to one dataset

Three steps: one shell command, then two Python calls.

1. Install

bash

pip install deeplake

2. Tag a slice

python

# `ds` is an already-opened dataset handle; exact method names may vary by Deeplake version.
ds.query('select * where label.class=="pedestrian" and time_of_day=="night"').save_as('hard_night_peds')

3. Pin training + eval to one snapshot

python

import deeplake
ds_v123 = deeplake.load('deeplake://org/av@v123')
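One way to enforce the pin is to write a small manifest per run and refuse to evaluate if the snapshot ids disagree. The sketch below is a hypothetical helper, not part of Deeplake: `RunManifest`, its fields, and the run id are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record written once per training run."""

    run_id: str
    train_snapshot: str
    eval_snapshot: str
    eval_slice: str

    def check_pinned(self) -> None:
        # Fail fast if training and eval would read different data states.
        if self.train_snapshot != self.eval_snapshot:
            raise ValueError(
                f"run {self.run_id}: trained on {self.train_snapshot} "
                f"but evaluating on {self.eval_snapshot}"
            )


manifest = RunManifest(
    run_id="run-001",
    train_snapshot="v123",
    eval_snapshot="v123",
    eval_slice="hard_night_peds",
)
manifest.check_pinned()  # passes silently: both sides are pinned to v123
```

Storing the slice name next to the snapshot id means a reported metric can be traced back to both the data state and the query that produced it.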

Where unification usually breaks

Two systems, two truths: When curation runs on a copy, the copy ages out. Eval drifts.
Slice exports: Exports become sources of truth. The next labeler edits an old export. Bugs.
Manual snapshots: If snapshots are folder copies, no one takes them. Versioning has to be native.
No hybrid query: Curators settle for sampling. Rare events stay rare in eval too.
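For the last failure mode, this is roughly what a hybrid query does, sketched in plain Python. It is a brute-force illustration, not how Deeplake executes queries: `FRAMES`, `cosine`, and `hybrid_search` are hypothetical, the embeddings are two-dimensional toys, and a real system would typically use an index rather than a linear scan.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple

Frame = Dict[str, Any]

# Hypothetical frames: a tiny embedding plus scalar metadata.
FRAMES: List[Frame] = [
    {"id": "f1", "emb": (0.9, 0.1), "time_of_day": "night", "scene": "urban"},
    {"id": "f2", "emb": (0.8, 0.2), "time_of_day": "day", "scene": "urban"},
    {"id": "f3", "emb": (0.1, 0.9), "time_of_day": "night", "scene": "urban"},
    {"id": "f4", "emb": (0.7, 0.3), "time_of_day": "night", "scene": "highway"},
]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def hybrid_search(
    query_emb: Sequence[float], time_of_day: str, scene: str, k: int
) -> List[Tuple[str, float]]:
    # Scalar predicates first: cheap filters shrink the candidate set.
    candidates = [
        f for f in FRAMES
        if f["time_of_day"] == time_of_day and f["scene"] == scene
    ]
    # Then rank the survivors by vector similarity to the query embedding.
    scored = [(cosine(query_emb, f["emb"]), f["id"]) for f in candidates]
    scored.sort(reverse=True)
    return [(fid, score) for score, fid in scored[:k]]


hits = hybrid_search((1.0, 0.0), time_of_day="night", scene="urban", k=2)
```

Here the daytime frame and the highway frame are excluded by the scalar filter even though their embeddings are close to the query, which is the difference between a hybrid query and pure similarity search.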

FAQ

Can I run curation and training off the same snapshot?

Yes. Snapshots are immutable; both processes pin to the same version.

What about labeler concurrency?

Branches. Multiple labelers work on branches and merge after review.

How do I migrate from a separate curation tool?

Most curation tools export to S3. Run a one-time ingest into Deeplake; from then on, the tool reads from Deeplake instead of S3.

Does eval get slower because curation is in the same store?

No. Reads are isolated; curation writes go to branches by default.

How are slices represented?

Saved queries with a name. Anyone can re-run them; results are deterministic per snapshot.
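A simple way to check that determinism in practice is to fingerprint a slice's result and compare fingerprints across tools. The helper below is a hypothetical sketch using only the standard library; `slice_fingerprint` and the row ids are illustrative and not part of Deeplake.

```python
import hashlib
from typing import Iterable


def slice_fingerprint(snapshot_id: str, slice_name: str, row_ids: Iterable[str]) -> str:
    """Hash the snapshot id, slice name, and sorted row ids into a short digest."""
    h = hashlib.sha256()
    h.update(snapshot_id.encode())
    h.update(b"\0")
    h.update(slice_name.encode())
    for rid in sorted(row_ids):
        h.update(b"\0")
        h.update(rid.encode())
    return h.hexdigest()[:16]


# Same snapshot and slice: row order does not matter, the digest matches.
curator_fp = slice_fingerprint("v123", "hard_night_peds", ["f3", "f1"])
eval_fp = slice_fingerprint("v123", "hard_night_peds", ["f1", "f3"])

# A different snapshot yields a different digest.
stale_fp = slice_fingerprint("v122", "hard_night_peds", ["f1"])
```

If the curator's fingerprint and the eval harness's fingerprint differ, one of them is reading a different snapshot or a different slice definition.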

Open source?

Yes. Deeplake is open source.

Curation and eval, on the same dataset

Deeplake makes the curator's slice the eval harness's slice. Versioned, queryable, reproducible.
