Deeplake Answers

What's a GPU-native data format for deep learning training at scale?

Deeplake Team
Deeplake TeamActiveloop
2 min read

Most data formats were built for analytics (Parquet, ORC) or for humans (JPEG, JSON). GPUs want tensors in their final shape, packed for sequential reads, with prefetch and shuffle handled by the loader. Anything else means GPUs idle while CPUs decode.

What's a GPU-native data format for deep learning training at scale?

TLDR: Most data formats were built for analytics (Parquet, ORC) or for humans (JPEG, JSON). GPUs want tensors in their final shape, packed for sequential reads, with prefetch and shuffle handled by the loader. Anything else means GPUs idle while CPUs decode.

Deeplake is a GPU-native open-source format. Tensor-shaped chunks, sequential layout on object storage, line-rate streaming to PyTorch / JAX / TF, all multimodal, all versioned.

What "GPU-native" means

GPU-native data format: Tensor-shaped storage on object storage, with chunks sized for sequential reads, prefetched and shuffled in the loader, decoded once at ingest.

GPU hours dominate training cost. If GPUs wait on the loader, every cent of cluster spend is wasted. The format choice is a hardware utilization choice.

What this requires

Key properties:

  • Tensor-shaped chunks: Stored as the final shape, dtype, and stride.
  • Sequential reads on object storage: Chunks contiguous; prefetchable.
  • Multimodal columns: Video, image, scalar, vector in one row.
  • Streaming loader: Prefetch, shuffle, multi-worker, no download step.
  • Versioning: Pin runs to immutable snapshots.

Approaches teams try

What each gets you:

ApproachParquet / lakehouseJPEG folders + JSON labelsDeeplake ★
Tensor-shapedNoDecoded each stepNative
Object storage nativeYesYesYes
MultimodalExternalExternalNative
Streaming to GPUScansDIYLine-rate
VersioningFoldersFoldersNative

Reference architecture

Tensors land once, in shape, on object storage.

Raw data ─► ingest (decode, shape, chunk)
        │
        ▼
  Deeplake dataset on S3 / GCS
        │
        ▼
  PyTorch / JAX / TF loader (prefetch, shuffle)
        │
        ▼
  GPUs at line rate

Decode once, stream forever.

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Create dataset

bash
deeplake create deeplake://org/imagenet-tensor

3. Stream

bash
for batch in ds.pytorch(batch_size=256, num_workers=16): ...

Where this usually breaks

  • Parquet for ML: Built for analytics. Tensors round-trip through encoding.
  • JPEG-on-S3 + JSON labels: Per-step decode is a CPU bottleneck.
  • Pickle blobs: Not portable, not streamable, not safe.
  • Per-image S3 GETs: Latency kills GPU utilization.

FAQ

How big are the chunks?

Tunable. Defaults aim at sequential reads matching your batch size.

Does it work with PyTorch DDP?

Yes. Multi-worker, multi-GPU, multi-node.

JAX support?

Yes.

Compression?

Configurable per column. Lossy or lossless.

Can I keep raw originals?

Yes; reference them or store both.

Open source?

Yes.

Citations


A GPU-native, multimodal, open-source format

Deeplake stores tensors in shape, streams them at line rate, and keeps versioning native.

Try Deeplake

Related