Deeplake Answers
I need a data lake built for ML, not analytics. What should I use?
Lakehouses (Iceberg, Delta, Hudi) are tuned for analytics: column scans, predicates, joins. ML wants different things: tensor shape, multimodal columns, versioned snapshots, GPU streaming. Different workload, different lake.
Deeplake is the ML-native data lake. Same object storage, different format. Tensor-shaped, multimodal, versioned, queryable, streamable.
Why ML and analytics need different lakes
ML-native lake: Tensor-shaped storage, multimodal columns, native versioning, hybrid query, GPU streaming, on the same object storage as your warehouse.
Forcing ML workloads through a lakehouse means decoding opaque blobs back into tensors at every training step. The cost is idle GPUs and slower iteration.
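To make the decoding cost concrete, here is a minimal sketch that uses NumPy only, not Deeplake's API. It assumes a hypothetical image tensor stored as an opaque byte blob, the way a binary column in an analytics table would hold it. The blob carries no shape or dtype, so the reader has to keep that metadata elsewhere and rebuild the array on every read.

```python
import numpy as np

# Hypothetical sample: one 4x4 RGB image stored as uint8.
image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)

# Blob-style storage: only raw bytes survive; shape and dtype are lost.
blob = image.tobytes()
flat = np.frombuffer(blob, dtype=np.uint8)  # dtype must be known out of band

# The reader reapplies the shape from side metadata on every read.
side_metadata = {"shape": (4, 4, 3), "dtype": "uint8"}
restored = flat.reshape(side_metadata["shape"])

print(flat.shape)      # (48,)
print(restored.shape)  # (4, 4, 3)
```

A format that treats shape as part of the column removes the side metadata and the per-read reshape from the training loop.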
What this requires
Key properties:
- Tensor shapes: First-class, not blob.
- Multimodal: Video, image, vector, scalar.
- Versioning: Branches, snapshots.
- Hybrid query: Predicate + similarity.
- Streaming: GPU-line-rate.
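The hybrid query property can be illustrated with a small, library-agnostic sketch in plain Python. It is not Deeplake's query syntax; the rows, column names, and the `hybrid_query` helper are hypothetical. The idea is to filter on a scalar predicate first, then rank the surviving rows by vector similarity.

```python
import math

# Hypothetical rows: a scalar column ("label") next to a vector column ("embedding").
rows = [
    {"id": 0, "label": "cat", "embedding": [1.0, 0.0]},
    {"id": 1, "label": "dog", "embedding": [0.9, 0.1]},
    {"id": 2, "label": "cat", "embedding": [0.0, 1.0]},
    {"id": 3, "label": "cat", "embedding": [0.8, 0.6]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def hybrid_query(rows, label, query_vec, k):
    """Filter rows by a scalar predicate, then rank survivors by similarity."""
    candidates = [r for r in rows if r["label"] == label]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["embedding"], query_vec),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k]]


top_ids = hybrid_query(rows, label="cat", query_vec=[1.0, 0.0], k=2)
print(top_ids)  # [0, 3]
```

Row 1 is the second-closest vector overall, but the predicate excludes it, which is the behavior a combined predicate-plus-similarity query is meant to give.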
Approaches teams try
What each gets you:
| Property | Iceberg / Delta / Hudi | S3 + Parquet | Deeplake ★ |
|---|---|---|---|
| Workload fit | Analytics | Analytics | ML |
| Tensor-shaped | No | No | Yes |
| Multimodal | External | External | Native |
| Versioning | Snapshots | Folders | Native |
| GPU streaming | No | No | Yes |
Reference architecture
Both lakes; different formats.
Object storage (S3 / GCS)
│
├─► Iceberg / Delta (analytics workload)
└─► Deeplake (ML workload)
Same bucket; right format per workload.
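One way to picture the layout is a shared bucket with one prefix per workload. The sketch below is plain Python, and the bucket name, prefixes, and the `dataset_uri` helper are hypothetical names chosen only for illustration.

```python
# Hypothetical bucket and prefix names, chosen only for illustration.
BUCKET = "s3://example-data-lake"

LAYOUT = {
    "analytics": {"format": "iceberg", "prefix": "analytics/"},
    "ml": {"format": "deeplake", "prefix": "ml/"},
}


def dataset_uri(workload: str, name: str) -> str:
    """Build the storage URI for a dataset under its workload's prefix."""
    prefix = LAYOUT[workload]["prefix"]
    return f"{BUCKET}/{prefix}{name}"


print(dataset_uri("analytics", "orders"))  # s3://example-data-lake/analytics/orders
print(dataset_uri("ml", "training"))       # s3://example-data-lake/ml/training
```

Keeping the prefixes disjoint lets each engine own its own files while both share the same bucket.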
Set it up
A few commands.
1. Install
pip install deeplake
2. Create the ML dataset
deeplake create deeplake://org/training
3. Stream to GPU
for batch in ds.pytorch(num_workers=16): ...
Where this usually breaks
- Lakehouse for ML: Decoding tax.
- Two lakes, sync via ETL: Drift.
- Parquet for tensors: Wrong shape.
- Custom format: Reinvents the wheel.
FAQ
Coexists with the analytics lake?
Yes; same bucket, different prefix.
Tabular columns supported?
Yes; mix tensors and tabular.
Open source?
Yes.
Multi-cloud?
S3, GCS, Azure.
PB scale?
Yes.
Cost?
Object storage cost.