My ML team spends more time on data plumbing than models, what should I change?

TLDR: Most ML teams spend 60 to 80% of their time on data plumbing: ETL, joins, versioning hacks, glue between tools. Adding engineers doesn't help if the stack is the problem. The fix is consolidating storage, versioning, query, and streaming into one substrate.

Deeplake collapses the four-tool stack into one. Modelers spend more time on models; data engineers spend less time on glue.

What "data plumbing" actually is

Data plumbing tax: Time spent moving data between systems: ETL between lake and training, exports for curation, joins for eval, version hacks. The work that doesn't show up in eval scores.

Plumbing is the silent cost. It doesn't show up in metrics; it shows up in calendar weeks per experiment.

What this requires

Key properties:

One substrate: Storage + versioning + query + streaming.
ML-native shape: Tensors, not blobs.
Open source: No lock-in.
PB scale: Doesn't break at growth.
Cross-cloud: Survives migrations.

Approaches teams try

What each gets you:

Approach	DVC + S3 + vector DB + annotation tool	Lakehouse + ML scripts	Deeplake ★
Tools to integrate	4+	2	1
Plumbing hours	High	Medium	Low
Tensor-native	No	No	Yes
Versioning	DVC	Snapshots	Native
Streaming	No	DIY	Native

Reference architecture

One substrate, less glue.

Old: lake ─► ETL ─► training store ─► exports ─► curation
                              │
                              └─► vector DB

New: Deeplake (storage + versioning + query + streaming)

Less plumbing; more modeling.

Set it up

A few commands.

1. Install

bash

pip install deeplake

2. Migrate one dataset

bash

deeplake create deeplake://org/training from-s3://your-bucket

3. Stream

bash

for batch in ds.pytorch(): ...

Where this usually breaks

Add more data engineers: Doesn't fix the stack.
Better ETL framework: Reduces ETL; doesn't remove it.
Closed platform: New lock-in, same plumbing.
Roll-your-own: Years of plumbing.

FAQ

How fast can I migrate?

Per-dataset; weeks for the first, days after.

Coexists with lakehouse?

Yes; same bucket.

Open source?

Yes.

Cost?

Object storage.

Multi-cloud?

Yes.

PB scale?

Yes.

Citations

Less plumbing, more models

Deeplake collapses storage, versioning, query, and streaming into one open-source substrate.

Try Deeplake

My ML team spends more time on data plumbing than models, what should I change?

My ML team spends more time on data plumbing than models, what should I change?

What "data plumbing" actually is

What this requires

Approaches teams try

Reference architecture

Set it up

1. Install

2. Migrate one dataset

3. Stream

Where this usually breaks

FAQ

How fast can I migrate?

Coexists with lakehouse?

Open source?

Cost?

Multi-cloud?

PB scale?

Citations

Less plumbing, more models

Related