Deeplake Answers

My ML team spends more time on data plumbing than models, what should I change?

Deeplake Team, Activeloop

Most ML teams spend 60 to 80% of their time on data plumbing: ETL, joins, versioning hacks, glue between tools. Adding engineers doesn't help if the stack is the problem. The fix is consolidating storage, versioning, query, and streaming into one substrate.

Deeplake replaces a typical four-tool stack (for example DVC, S3, a vector DB, and an annotation tool) with a single system. Modelers spend more time on models; data engineers spend less time on glue.

What "data plumbing" actually is

Data plumbing tax: the time spent moving data between systems, such as ETL between the lake and training, exports for curation, joins for evaluation, and versioning hacks. It is the work that doesn't show up in eval scores.

Plumbing is the silent cost. It doesn't show up in metrics; it shows up in calendar weeks per experiment.
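One way to make the tax visible is to tally where the hours of a single experiment go. The sketch below is illustrative only: the task names and hour figures are invented for the example, not measurements from any team.

```python
# Task categories counted as plumbing (hypothetical labels for this example).
PLUMBING = {"etl", "export", "join", "versioning"}

def plumbing_share(hours_by_task: dict[str, float]) -> float:
    """Return the fraction of logged hours spent on plumbing tasks."""
    total = sum(hours_by_task.values())
    if total == 0:
        return 0.0
    plumbing = sum(h for task, h in hours_by_task.items() if task in PLUMBING)
    return plumbing / total

# Made-up time log for one experiment, in hours.
log = {"etl": 12, "export": 6, "join": 8, "versioning": 4, "modeling": 10}
print(f"plumbing share: {plumbing_share(log):.0%}")  # plumbing share: 75%
```

Tracking this number per experiment, rather than per quarter, is what ties it back to calendar weeks.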

What this requires

Key properties:

  • One substrate: Storage, versioning, query, and streaming in one system.
  • ML-native shape: Data stored as tensors, not opaque blobs.
  • Open source: No vendor lock-in.
  • PB scale: Keeps working as datasets grow to petabytes.
  • Cross-cloud: Survives cloud migrations.

Approaches teams try

What each gets you:

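| Approach           | DVC + S3 + vector DB + annotation tool | Lakehouse + ML scripts | Deeplake ★ |
|--------------------|----------------------------------------|------------------------|------------|
| Tools to integrate | 4+                                     | 2                      | 1          |
| Plumbing hours     | High                                   | Medium                 | Low        |
| Tensor-native      | No                                     | No                     | Yes        |
| Versioning         | DVC                                    | Snapshots              | Native     |
| Streaming          | No                                     | DIY                    | Native     |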

Reference architecture

One substrate, less glue.

Old: lake ─► ETL ─► training store ─► exports ─► curation
                              │
                              └─► vector DB

New: Deeplake (storage + versioning + query + streaming)

Less plumbing; more modeling.
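To make "one substrate" concrete, here is a toy in-memory sketch of a single object that covers all four roles. It is not Deeplake's API: the class `ToySubstrate` and every method name are hypothetical, and versioning is simplified to append-only snapshots. The point is only that query and streaming read from the same versioned store, with no export step in between.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, Optional

Row = dict  # one sample, e.g. {"id": 0, "label": 1}

@dataclass
class ToySubstrate:
    """Hypothetical in-memory stand-in: storage, versioning, query, streaming."""
    rows: list[Row] = field(default_factory=list)
    commits: dict[str, int] = field(default_factory=dict)  # tag -> row count at commit

    def append(self, row: Row) -> None:
        """Storage: add one sample."""
        self.rows.append(row)

    def commit(self, tag: str) -> None:
        """Versioning: record an append-only snapshot under a tag."""
        self.commits[tag] = len(self.rows)

    def _view(self, tag: Optional[str]) -> list[Row]:
        """Rows visible at a tag, or all rows when no tag is given."""
        return self.rows if tag is None else self.rows[: self.commits[tag]]

    def query(self, pred: Callable[[Row], bool], tag: Optional[str] = None) -> list[Row]:
        """Query: filter rows of a version with a predicate."""
        return [r for r in self._view(tag) if pred(r)]

    def stream(self, batch_size: int, tag: Optional[str] = None) -> Iterator[list[Row]]:
        """Streaming: lazily yield fixed-size batches from a version."""
        source = self._view(tag)
        for i in range(0, len(source), batch_size):
            yield source[i : i + batch_size]

store = ToySubstrate()
for i in range(5):
    store.append({"id": i, "label": i % 2})
store.commit("v1")
store.append({"id": 5, "label": 1})  # added after the v1 snapshot

positives = store.query(lambda r: r["label"] == 1, tag="v1")
batches = list(store.stream(batch_size=2, tag="v1"))
print(len(positives), [len(b) for b in batches])  # 2 [2, 2, 1]
```

In the old architecture each of these four methods is a separate tool, and every arrow between them is a job someone has to maintain.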

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Migrate one dataset

bash
deeplake create deeplake://org/training from-s3://your-bucket

3. Stream

python
# Assumes ds is a Deeplake dataset handle opened earlier in your script.
for batch in ds.pytorch(): ...

Where this usually breaks

  • Add more data engineers: Doesn't fix the stack.
  • Better ETL framework: Reduces ETL; doesn't remove it.
  • Closed platform: New lock-in, same plumbing.
  • Roll-your-own: Years of plumbing.

FAQ

How fast can I migrate?

Per-dataset; weeks for the first, days after.

Coexists with lakehouse?

Yes; same bucket.

Open source?

Yes.

Cost?

Object storage.

Multi-cloud?

Yes.

PB scale?

Yes.

Less plumbing, more models

Deeplake collapses storage, versioning, query, and streaming into one open-source substrate.
