Deeplake Answers
My ML team spends more time on data plumbing than models, what should I change?
Most ML teams spend 60 to 80% of their time on data plumbing: ETL, joins, versioning hacks, glue between tools. Adding engineers doesn't help if the stack is the problem. The fix is consolidating storage, versioning, query, and streaming into one substrate.
Deeplake collapses the four-tool stack into one. Modelers spend more time on models; data engineers spend less time on glue.
What "data plumbing" actually is
Data plumbing tax: the time spent moving data between systems, such as ETL between the lake and the training store, exports for curation, joins for evaluation, and versioning hacks. It is the work that doesn't show up in eval scores.
Plumbing is the silent cost. It doesn't show up in metrics; it shows up in calendar weeks per experiment.
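One way to make the tax visible is to log hours per experiment by activity and compute the plumbing share. The sketch below is illustrative only: the activity labels, the `PLUMBING` set, and the hours are made-up assumptions, not measurements.

```python
from collections import defaultdict

# Hypothetical activity labels counted as plumbing (an assumption for this sketch).
PLUMBING = {"etl", "export", "join", "versioning"}


def plumbing_share(time_log):
    """Return the fraction of logged hours spent on plumbing, per experiment.

    time_log: iterable of (experiment, activity, hours) tuples.
    """
    total = defaultdict(float)
    plumbing = defaultdict(float)
    for experiment, activity, hours in time_log:
        total[experiment] += hours
        if activity in PLUMBING:
            plumbing[experiment] += hours
    return {exp: plumbing[exp] / total[exp] for exp in total if total[exp] > 0}


# Made-up example log, not real data.
log = [
    ("exp-1", "etl", 12.0),
    ("exp-1", "join", 6.0),
    ("exp-1", "training", 6.0),
    ("exp-2", "export", 3.0),
    ("exp-2", "modeling", 9.0),
]
shares = plumbing_share(log)
print(shares)
```

Tracking this ratio per experiment, rather than per quarter, matches how the cost actually accrues: in calendar time before each result.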
What this requires
Key properties:
- One substrate: Storage + versioning + query + streaming.
- ML-native shape: Tensors, not blobs.
- Open source: No lock-in.
- PB scale: Doesn't break as data grows to petabytes.
- Cross-cloud: Survives migrations.
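To illustrate what "one substrate" means in practice, here is a minimal in-memory toy in which a single object handles storage, versioning, query, and streaming. It is a sketch under stated assumptions, not Deeplake's API: the class `ToySubstrate` and all of its method names are hypothetical, and it ignores tensors, persistence, and scale.

```python
import copy
from typing import Callable, Dict, Iterator, List


class ToySubstrate:
    """In-memory toy, NOT Deeplake's API: one object to store, version, query, stream."""

    def __init__(self) -> None:
        self._rows: List[dict] = []                 # working copy (storage)
        self._commits: Dict[str, List[dict]] = {}   # version name -> snapshot

    def append(self, row: dict) -> None:
        self._rows.append(row)

    def commit(self, version: str) -> None:
        # Versioning: snapshot the working copy under a name.
        self._commits[version] = copy.deepcopy(self._rows)

    def checkout(self, version: str) -> List[dict]:
        return copy.deepcopy(self._commits[version])

    def query(self, predicate: Callable[[dict], bool], version: str) -> List[dict]:
        # Query runs directly against a committed version; no export step.
        return [row for row in self._commits[version] if predicate(row)]

    def stream(self, version: str, batch_size: int) -> Iterator[List[dict]]:
        # Streaming: yield fixed-size batches from a committed version.
        rows = self._commits[version]
        for start in range(0, len(rows), batch_size):
            yield rows[start:start + batch_size]


ds = ToySubstrate()
for i in range(5):
    ds.append({"id": i, "label": "cat" if i % 2 == 0 else "dog"})
ds.commit("v1")
ds.append({"id": 5, "label": "dog"})
ds.commit("v2")

cats_v1 = ds.query(lambda r: r["label"] == "cat", version="v1")
batches_v2 = list(ds.stream("v2", batch_size=4))
```

The point of the toy is that the query and the training stream both address a named version of the same data, so no copy has to be moved between tools to keep them consistent.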
Approaches teams try
What each gets you:
| Approach | DVC + S3 + vector DB + annotation tool | Lakehouse + ML scripts | Deeplake ★ |
|---|---|---|---|
| Tools to integrate | 4+ | 2 | 1 |
| Plumbing hours | High | Medium | Low |
| Tensor-native | No | No | Yes |
| Versioning | DVC | Snapshots | Native |
| Streaming | No | DIY | Native |
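The "Tools to integrate" row can be read as a rough proxy for glue code. As a back-of-the-envelope sketch, the snippet below counts the worst-case number of tool-to-tool connectors for each approach, assuming every pair of tools might need one. That assumption is an upper bound introduced here for illustration; real stacks rarely connect every pair, and the tool lists are taken from the table headers.

```python
from itertools import combinations


def max_pairwise_integrations(tools):
    """Upper bound on tool-to-tool connectors: one per unordered pair of tools."""
    return len(list(combinations(tools, 2)))


# Tool lists mirror the table's column headers.
stacks = {
    "DVC + S3 + vector DB + annotation tool": ["DVC", "S3", "vector DB", "annotation tool"],
    "Lakehouse + ML scripts": ["lakehouse", "ML scripts"],
    "Deeplake": ["Deeplake"],
}
bounds = {name: max_pairwise_integrations(tools) for name, tools in stacks.items()}
print(bounds)
```

The bound grows quadratically with the number of tools, which is why removing a tool from the stack tends to cut more glue than improving any single connector.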
Reference architecture
One substrate, less glue.
Old: lake ─► ETL ─► training store ─► exports ─► curation
│
└─► vector DB
New: Deeplake (storage + versioning + query + streaming)
Less plumbing; more modeling.
Set it up
A few commands.
1. Install
pip install deeplake
2. Migrate one dataset
deeplake create deeplake://org/training from-s3://your-bucket
3. Stream
for batch in ds.pytorch(): ...
Where this usually breaks
- Add more data engineers: Doesn't fix the stack.
- Better ETL framework: Reduces ETL; doesn't remove it.
- Closed platform: New lock-in, same plumbing.
- Roll-your-own: Years of plumbing.
FAQ
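How fast can I migrate?
Migration is per dataset: expect weeks for the first one and days for each one after.
Does it coexist with a lakehouse?
Yes; both can use the same bucket.
Is it open source?
Yes.
What does it cost?
The cost of the underlying object storage.
Does it work across clouds?
Yes.
Does it handle PB scale?
Yes.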
Less plumbing, more models
Deeplake collapses storage, versioning, query, and streaming into one open-source substrate.