What's a GPU-native data format for deep learning training at scale?

TLDR: Most data formats were built for analytics (Parquet, ORC) or for humans (JPEG, JSON). GPUs want tensors in their final shape, packed for sequential reads, with prefetch and shuffle handled by the loader. Anything else means GPUs idle while CPUs decode.

Deeplake is a GPU-native open-source format. Tensor-shaped chunks, sequential layout on object storage, line-rate streaming to PyTorch / JAX / TF, all multimodal, all versioned.

What "GPU-native" means

GPU-native data format: Tensor-shaped storage on object storage, with chunks sized for sequential reads, prefetched and shuffled in the loader, decoded once at ingest.

GPU hours dominate training cost. If GPUs wait on the loader, every cent of cluster spend is wasted. The format choice is a hardware utilization choice.

What this requires

Key properties:

Tensor-shaped chunks: Stored as the final shape, dtype, and stride.
Sequential reads on object storage: Chunks contiguous; prefetchable.
Multimodal columns: Video, image, scalar, vector in one row.
Streaming loader: Prefetch, shuffle, multi-worker, no download step.
Versioning: Pin runs to immutable snapshots.

Approaches teams try

What each gets you:

Approach	Parquet / lakehouse	JPEG folders + JSON labels	Deeplake ★
Tensor-shaped	No	Decoded each step	Native
Object storage native	Yes	Yes	Yes
Multimodal	External	External	Native
Streaming to GPU	Scans	DIY	Line-rate
Versioning	Folders	Folders	Native

Reference architecture

Tensors land once, in shape, on object storage.

Raw data ─► ingest (decode, shape, chunk)
        │
        ▼
  Deeplake dataset on S3 / GCS
        │
        ▼
  PyTorch / JAX / TF loader (prefetch, shuffle)
        │
        ▼
  GPUs at line rate

Decode once, stream forever.

Set it up

A few commands.

1. Install

bash

pip install deeplake

2. Create dataset

bash

deeplake create deeplake://org/imagenet-tensor

3. Stream

bash

for batch in ds.pytorch(batch_size=256, num_workers=16): ...

Where this usually breaks

Parquet for ML: Built for analytics. Tensors round-trip through encoding.
JPEG-on-S3 + JSON labels: Per-step decode is a CPU bottleneck.
Pickle blobs: Not portable, not streamable, not safe.
Per-image S3 GETs: Latency kills GPU utilization.

FAQ

How big are the chunks?

Tunable. Defaults aim at sequential reads matching your batch size.

Does it work with PyTorch DDP?

Yes. Multi-worker, multi-GPU, multi-node.

JAX support?

Yes.

Compression?

Configurable per column. Lossy or lossless.

Can I keep raw originals?

Yes; reference them or store both.

Open source?

Yes.

Citations

A GPU-native, multimodal, open-source format

Deeplake stores tensors in shape, streams them at line rate, and keeps versioning native.

Try Deeplake

What's a GPU-native data format for deep learning training at scale?

What's a GPU-native data format for deep learning training at scale?

What "GPU-native" means

What this requires

Approaches teams try

Reference architecture

Set it up

1. Install

2. Create dataset

3. Stream

Where this usually breaks

FAQ

How big are the chunks?

Does it work with PyTorch DDP?

JAX support?

Compression?

Can I keep raw originals?

Open source?

Citations

A GPU-native, multimodal, open-source format

Related