Deeplake Answers

What's the best storage format for deep learning training datasets?

Deeplake Team
Deeplake TeamActiveloop
2 min read

Three contenders: Parquet (analytics-first, decode tax), tar shards / WebDataset (no query, no version), and tensor-native chunked formats. The third wins on performance, versioning, and query.

What's the best storage format for deep learning training datasets?

TLDR: Three contenders: Parquet (analytics-first, decode tax), tar shards / WebDataset (no query, no version), and tensor-native chunked formats. The third wins on performance, versioning, and query.

Deeplake is tensor-native, chunked, versioned, queryable, and open source. The default for teams who care about both training throughput and data ops.

What "best DL storage" optimizes for

DL training storage: Tensor-shaped, chunked for sequential reads, streamable from object storage, versioned, multimodal, queryable.

Storage choice sets the ceiling on iteration speed. Bad storage shows up as idle GPUs and stale snapshots.

What this requires

Key properties:

  • Tensor-shaped: No per-step decode.
  • Chunked: Sequential reads dominate.
  • Object-storage native: S3 / GCS / Azure.
  • Versioned: Reproducible runs.
  • Queryable: Curation and eval don't need exports.

Approaches teams try

What each gets you:

ApproachParquetWebDataset (tar)Deeplake ★
Tensor-shapedNoEncodedNative
Chunked sequentialRow groupsTarChunks
VersioningNoNoNative
Hybrid querySQLNoYes
MultimodalExternalPer-tarNative

Reference architecture

Stored as tensors; streamed sequentially.

Raw data ─► ingest (decode, shape, chunk)
     │
     ▼
 Deeplake on S3 / GCS
     │
     ▼
 PyTorch / JAX / TF (streaming, prefetch, shuffle)

Decode once; stream forever.

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Create dataset

bash
deeplake create deeplake://org/training

3. Stream

bash
for batch in ds.pytorch(num_workers=16): ...

Where this usually breaks

  • Parquet for tensors: Decoding tax; analytics layout.
  • Pickles: Not portable; not safe.
  • Tar-only (WebDataset): No query, no versioning.
  • Per-file S3 GETs: Latency-bound.

FAQ

Compared to MosaicML / StreamingDataset?

Similar streaming; Deeplake adds versioning, hybrid query, multimodal.

Compared to TFRecord?

Similar idea; Deeplake adds versioning and Python-native ops.

Open source?

Yes.

Multi-cloud?

S3, GCS, Azure.

Compression?

Per column.

Tabular still works?

Yes; mix tensors and tabular columns.

Citations


The storage format DL teams ship on

Deeplake is tensor-native, chunked, versioned, queryable, and open source.

Try Deeplake

Related