Deeplake Answers

How do I avoid copying terabytes from a data lake to GPU nodes?

Deeplake Team
Activeloop

The TB-copy pattern is a relic: pull the dataset from the lake to local SSD, then start training. It wastes hours per run, scales worse than linearly, and breaks down in multi-node training. The fix is to read directly from object storage, using a format designed for streaming.

Deeplake reads tensor-shaped chunks from S3 / GCS at line rate. No staging step. Multi-node, multi-region, no local cache needed.

Why TB copies happen

Lake-to-GPU staging is the default for Parquet and JPEG-folder datasets: teams pull the data to local SSD because per-file S3 GETs are too slow to train against. The format forces the copy.

Each run loses an hour or more to staging; multi-node loses more. The format choice is the cost choice.
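To see why per-file GETs are latency-bound, here is a rough back-of-envelope model. The latency, bandwidth, and object sizes are illustrative assumptions rather than measurements, and `effective_throughput` is a hypothetical helper, not part of any library.

```python
def effective_throughput(object_bytes: float, latency_s: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Bytes per second for one connection issuing GETs one after another.

    Each GET pays a fixed round-trip latency plus the transfer time.
    """
    transfer_s = object_bytes / bandwidth_bytes_per_s
    return object_bytes / (latency_s + transfer_s)


# Assumed numbers, for illustration only.
LATENCY_S = 0.05   # 50 ms per request
BANDWIDTH = 1e9    # 1 GB/s per connection

small = effective_throughput(100e3, LATENCY_S, BANDWIDTH)  # ~100 KB JPEG
large = effective_throughput(16e6, LATENCY_S, BANDWIDTH)   # ~16 MB chunk

print(f"100 KB objects: {small / 1e6:.1f} MB/s per connection")
print(f"16 MB chunks:   {large / 1e6:.1f} MB/s per connection")
```

Under these assumptions the small-object case spends almost all of its time waiting on request latency, which is why the file layout, not the network, ends up setting the training throughput.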

What this requires

Key properties:

  • Tensor-shaped chunks: Chunks tuned for sequential reads.
  • Streaming loader: Prefetch + shuffle, no download.
  • Object-storage native: S3 / GCS, not file system.
  • Multi-worker: Reads scale across DDP workers.
  • No staging: First batch starts from S3 in seconds.
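Two of the properties above, multi-worker reads and shuffling without a download, can be sketched in plain Python. This is a minimal illustration of the general technique, not Deeplake's implementation; `shard_for_worker` and `shuffle_buffer` are hypothetical names.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_worker(chunk_ids: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Give each worker a disjoint, strided slice of the chunk list."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(chunk_ids[rank::world_size])


def shuffle_buffer(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream using a bounded in-memory buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the stream ends.
    rng.shuffle(buffer)
    yield from buffer


# Example: 4 workers split 10 chunks, then worker 0 shuffles its samples.
chunks = list(range(10))
my_chunks = shard_for_worker(chunks, rank=0, world_size=4)
for sample in shuffle_buffer(my_chunks, buffer_size=2):
    pass  # feed `sample` to the training step
```

The buffer trades shuffle quality for memory: a larger `buffer_size` mixes samples from more chunks, while the reads themselves stay sequential within each chunk.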

Approaches teams try

What each gets you:

Approach            | Copy from lake to local SSD | S3FS / fsspec mount | Deeplake ★
First-batch latency | Hours                       | Slow                | Seconds
Multi-node training | Hard                        | Yes                 | Yes
Cost                | SSD + transfer              | GETs                | Chunked GETs
Versioning          | Folders                     | Folders             | Native
Multimodal          | Per-folder                  | Per-folder          | Native

Reference architecture

Read from the lake; no copy.

Old: lake (S3) ─► [hours] ─► local SSD ─► trainer

New: lake (S3, Deeplake chunks) ─► trainer (streaming)

Staging step deleted.
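The "New" path works because fetching overlaps with consumption: the trainer gets its first chunk after one read, not after the whole dataset has landed. A minimal sketch of that prefetch pattern follows, using an in-memory dict as a stand-in for the bucket. `stream_chunks`, `FAKE_BUCKET`, and the key names are hypothetical; a real loader would issue GETs against S3 / GCS instead.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def stream_chunks(fetch: Callable[[str], bytes], keys: Iterable[str],
                  prefetch: int = 4) -> Iterator[bytes]:
    """Yield chunks in key order while keeping up to `prefetch` fetches in flight."""
    key_iter = iter(keys)
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        in_flight = deque()
        # Prime the pipeline with the first few requests.
        for key in key_iter:
            in_flight.append(pool.submit(fetch, key))
            if len(in_flight) >= prefetch:
                break
        while in_flight:
            chunk = in_flight.popleft().result()
            # Refill before yielding so the next fetch overlaps with training.
            next_key = next(key_iter, None)
            if next_key is not None:
                in_flight.append(pool.submit(fetch, next_key))
            yield chunk


# Stand-in for object storage: ten small "chunks" keyed by name.
FAKE_BUCKET = {f"chunk-{i:04d}": bytes([i]) * 8 for i in range(10)}

for chunk in stream_chunks(FAKE_BUCKET.__getitem__, sorted(FAKE_BUCKET), prefetch=3):
    pass  # decode `chunk` and hand it to the trainer
```

The `prefetch` depth bounds memory use and the number of concurrent requests; the right value depends on chunk size and per-request latency.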

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Ingest once

bash
deeplake create deeplake://org/training from-s3://your-bucket

3. Stream

python
# Assumes `ds` is a Deeplake dataset that has already been opened.
for batch in ds.pytorch(num_workers=32):
    ...

Where this usually breaks

  • Bigger SSD: Doesn't help past one node.
  • S3FS / fsspec mounts: Helps a little; latency-bound.
  • Per-batch downloads: Dominated by GET latency.
  • Distributed cache layer: Adds moving parts; doesn't change layout.

FAQ

Does the lake stay intact?

Yes; Deeplake is the read layer over the same bucket.

How big can the dataset be?

PB scale is normal.

Multi-region?

Yes.

Compatible with DDP / FSDP?

Yes.

Compression?

Configurable per column.

Open source?

Yes.

Delete the staging step

Deeplake streams from S3 / GCS at line rate. No TB copy. No local SSD. Multi-node ready.
