Deeplake Answers
What's the best storage format for deep learning training datasets?
Three contenders: Parquet (analytics-first, with a decode tax), tar shards / WebDataset (no query, no versioning), and tensor-native chunked formats. The third wins on performance, versioning, and query.
Deeplake is tensor-native, chunked, versioned, queryable, and open source, which makes it a reasonable default for teams that care about both training throughput and data ops.
What "best DL storage" optimizes for
Storage for DL training should be tensor-shaped, chunked for sequential reads, streamable from object storage, versioned, multimodal, and queryable.
Storage choice sets the ceiling on iteration speed. Bad storage shows up as idle GPUs and stale snapshots.
What this requires
Key properties:
- Tensor-shaped: No per-step decode.
- Chunked: Sequential reads dominate.
- Object-storage native: S3 / GCS / Azure.
- Versioned: Reproducible runs.
- Queryable: Curation and eval don't need exports.
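The first two properties above can be illustrated with a minimal sketch. It assumes every sample is a flat float32 vector of fixed length and uses only the Python standard library; `SAMPLE_LEN`, `SAMPLES_PER_CHUNK`, `write_chunks`, and `read_samples` are hypothetical names for illustration, not part of Deeplake or any other library.

```python
from array import array

SAMPLE_LEN = 4          # hypothetical: each sample is a flat float32 vector of 4 values
SAMPLES_PER_CHUNK = 3   # hypothetical chunk size, chosen small for readability


def write_chunks(samples):
    """Pack fixed-size samples into chunks of raw float32 bytes."""
    chunks = []
    for start in range(0, len(samples), SAMPLES_PER_CHUNK):
        flat = array("f")
        for sample in samples[start:start + SAMPLES_PER_CHUNK]:
            if len(sample) != SAMPLE_LEN:
                raise ValueError("all samples must share one shape")
            flat.extend(sample)
        chunks.append(flat.tobytes())  # one blob per chunk
    return chunks


def read_samples(chunks):
    """Yield samples in storage order: one read per chunk, no codec step."""
    for blob in chunks:
        flat = array("f")
        flat.frombytes(blob)  # reinterpret raw bytes as float32 values
        for offset in range(0, len(flat), SAMPLE_LEN):
            yield list(flat[offset:offset + SAMPLE_LEN])


samples = [[float(i)] * SAMPLE_LEN for i in range(7)]
chunks = write_chunks(samples)
restored = list(read_samples(chunks))
print(len(chunks), "chunks for", len(restored), "samples")
```

In a real pipeline each blob would be one object (or one byte range) in S3, GCS, or Azure, and the values would typically be loaded into NumPy or framework tensors rather than Python lists.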
Approaches teams try
What each gets you:
| Property | Parquet | WebDataset (tar) | Deeplake ★ |
|---|---|---|---|
| Tensor-shaped | No | Encoded | Native |
| Chunked sequential | Row groups | Tar | Chunks |
| Versioning | No | No | Native |
| Hybrid query | SQL | No | Yes |
| Multimodal | External | Per-tar | Native |
Reference architecture
Stored as tensors; streamed sequentially.
Raw data ─► ingest (decode, shape, chunk)
                │
                ▼
       Deeplake on S3 / GCS
                │
                ▼
PyTorch / JAX / TF (streaming, prefetch, shuffle)
Decode once; stream forever.
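The streaming stage of this architecture can be sketched with a shuffle buffer, a common way to approximate random order while still reading chunks sequentially. This is a generic illustration, not Deeplake's loader: `stream_chunks` and `shuffle_buffer` are hypothetical helpers, the chunks are plain Python lists, and prefetching is left out to keep the example short.

```python
import random


def stream_chunks(chunks):
    """Read chunks in storage order and yield their samples one by one."""
    for chunk in chunks:
        yield from chunk


def shuffle_buffer(stream, buffer_size, seed=0):
    """Approximate a global shuffle using a bounded in-memory buffer."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buf)
    yield from buf


# Three hypothetical chunks of four samples each.
chunks = [list(range(i, i + 4)) for i in range(0, 12, 4)]
shuffled = list(shuffle_buffer(stream_chunks(chunks), buffer_size=5))
print(shuffled)
```

A larger buffer gives a better shuffle at the cost of memory; reads against storage stay sequential either way.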
Set it up
A few commands.
1. Install
pip install deeplake
2. Create dataset
deeplake create deeplake://org/training
3. Stream
for batch in ds.pytorch(num_workers=16): ...
Where this usually breaks
- Parquet for tensors: Decoding tax; analytics layout.
- Pickles: Not portable; not safe.
- Tar-only (WebDataset): No query, no versioning.
- Per-file S3 GETs: Latency-bound.
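The last failure mode can be made concrete with a first-order estimate. The sketch below assumes requests pay a fixed latency in waves of `concurrency` parallel GETs, plus a transfer time set by total bandwidth. Every number (sample size, 20 ms latency, 1 GB/s bandwidth, 16 parallel requests, 100 samples per chunk) is an illustrative assumption, not a measurement, and `fetch_time_seconds` is a hypothetical helper.

```python
def fetch_time_seconds(num_requests, latency_s, total_bytes,
                       bandwidth_bytes_per_s, concurrency=1):
    """Estimate fetch time: latency paid per wave of requests, plus transfer time."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    waves = -(-num_requests // concurrency)  # ceiling division
    return waves * latency_s + total_bytes / bandwidth_bytes_per_s


# Illustrative assumptions only.
NUM_SAMPLES = 1_000_000
SAMPLE_BYTES = 100_000        # 100 KB per sample
LATENCY_S = 0.02              # 20 ms per GET
BANDWIDTH = 1_000_000_000     # 1 GB/s aggregate
CONCURRENCY = 16
SAMPLES_PER_CHUNK = 100       # 10 MB chunks

total_bytes = NUM_SAMPLES * SAMPLE_BYTES

# One GET per sample.
per_file = fetch_time_seconds(NUM_SAMPLES, LATENCY_S, total_bytes,
                              BANDWIDTH, CONCURRENCY)
# One GET per chunk.
chunked = fetch_time_seconds(NUM_SAMPLES // SAMPLES_PER_CHUNK, LATENCY_S,
                             total_bytes, BANDWIDTH, CONCURRENCY)

print(f"per-file: {per_file:.1f} s, chunked: {chunked:.1f} s")
```

Under these assumed numbers the per-file layout spends most of its time waiting on request latency (roughly 1,350 s against roughly 112 s for chunks). Real systems overlap latency with transfer, so treat the ratio as directional.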
FAQ
Compared to MosaicML / StreamingDataset?
Similar streaming model; Deeplake adds versioning, hybrid query, and multimodal support.
Compared to TFRecord?
Similar idea; Deeplake adds versioning and Python-native ops.
Open source?
Yes.
Multi-cloud?
S3, GCS, Azure.
Compression?
Per column.
Tabular still works?
Yes; mix tensors and tabular columns.