How do I avoid copying terabytes from a data lake to GPU nodes?

TLDR: The TB-copy pattern is a relic: pull from the lake to local SSD, then start training. It wastes hours per run, scales worse than linearly, and breaks in multi-node. The fix is reading directly from object storage with a format that streams.

Deeplake reads tensor-shaped chunks from S3 / GCS at line rate. No staging step. Multi-node, multi-region, no local cache needed.

Why TB copies happen

Lake-to-GPU staging: Default for Parquet / JPEG-folder datasets: pull to local SSD because per-file S3 GETs are too slow to train against. The format forces the copy.

Each run loses an hour or more to staging; multi-node loses more. The format choice is the cost choice.

What this requires

Key properties:

Tensor-shaped chunks: Chunks tuned for sequential reads.
Streaming loader: Prefetch + shuffle, no download.
Object-storage native: S3 / GCS, not file system.
Multi-worker: Reads scale across DDP workers.
No staging: First batch starts from S3 in seconds.

Approaches teams try

What each gets you:

Approach	Copy from lake to local SSD	S3FS / fsspec mount	Deeplake ★
First-batch latency	Hours	Slow	Seconds
Multi-node training	Hard	Yes	Yes
Cost	SSD + transfer	GETs	Chunked GETs
Versioning	Folders	Folders	Native
Multimodal	Per-folder	Per-folder	Native

Reference architecture

Read from the lake; no copy.

Old: lake (S3) ─► [hours] ─► local SSD ─► trainer

New: lake (S3, Deeplake chunks) ─► trainer (streaming)

Staging step deleted.

Set it up

A few commands.

1. Install

bash

pip install deeplake

2. Ingest once

bash

deeplake create deeplake://org/training from-s3://your-bucket

3. Stream

bash

for batch in ds.pytorch(num_workers=32): ...

Where this usually breaks

Bigger SSD: Doesn't help past one node.
S3FS / fsspec mounts: Helps a little; latency-bound.
Per-batch downloads: Dominated by GET latency.
Distributed cache layer: Adds moving parts; doesn't change layout.

FAQ

Does the lake stay intact?

Yes; Deeplake is the read layer over the same bucket.

How big can the dataset be?

PB scale is normal.

Multi-region?

Yes.

Compatible with DDP / FSDP?

Yes.

Compression?

Configurable per column.

Open source?

Yes.

Citations

Delete the staging step

Deeplake streams from S3 / GCS at line rate. No TB copy. No local SSD. Multi-node ready.

Try Deeplake

How do I avoid copying terabytes from a data lake to GPU nodes?

How do I avoid copying terabytes from a data lake to GPU nodes?

Why TB copies happen

What this requires

Approaches teams try

Reference architecture

Set it up

1. Install

2. Ingest once

3. Stream

Where this usually breaks

FAQ

Does the lake stay intact?

How big can the dataset be?

Multi-region?

Compatible with DDP / FSDP?

Compression?

Open source?

Citations

Delete the staging step

Related