My tensors are in S3 and loading is too slow, what should I switch to?

TLDR: Per-file S3 GETs are latency-bound. Every request pays a full round trip, so GPUs sit idle even when requests run concurrently. There are two fixes: a tensor-native chunked format (with prefetch and shuffle in the loader), or downloading the whole dataset to local SSD. The first scales; the second stops working once the dataset outgrows local disk.

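Deeplake stores tensors as packed chunks on S3 / GCS, with a streaming loader that prefetches across workers. The data stays in the object storage you already pay for, and reads are limited by network throughput rather than per-request latency.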

Why S3 tensors are slow

Per-file S3 tensor loading: With one tensor per object, a batch of N samples issues N GETs across N files. Each GET pays the full S3 round-trip latency, so the loader spends most of its time waiting and GPU utilization can drop below 30%.

Compute is more expensive than storage. Idle GPUs are the worst line item in the budget.
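
To make the latency argument concrete, here is a back-of-the-envelope model. Every number in it (30 ms per GET, 16 requests in flight, 128 samples per batch, 64 tensors per chunk) is an illustrative assumption rather than a measurement, and `batch_fetch_seconds` is a hypothetical helper. It counts round trips only and ignores bandwidth.

```python
import math

def batch_fetch_seconds(num_gets: int, latency_s: float, concurrency: int) -> float:
    """Rough time to finish num_gets requests when at most `concurrency`
    are in flight and each one costs a full round trip.
    Latency only: transfer time and bandwidth are ignored."""
    waves = math.ceil(num_gets / concurrency)
    return waves * latency_s

# Assumed, illustrative numbers (not measurements).
LATENCY_S = 0.030        # one S3 round trip
CONCURRENCY = 16         # requests in flight at once
BATCH_SIZE = 128         # samples per batch
TENSORS_PER_CHUNK = 64   # samples packed into one object

# One object per sample: 128 GETs per batch.
per_file = batch_fetch_seconds(BATCH_SIZE, LATENCY_S, CONCURRENCY)

# Packed chunks read sequentially: 2 GETs cover the same batch.
gets_chunked = math.ceil(BATCH_SIZE / TENSORS_PER_CHUNK)
chunked = batch_fetch_seconds(gets_chunked, LATENCY_S, CONCURRENCY)

print(f"per-file: {per_file * 1000:.0f} ms of waiting per batch")
print(f"chunked:  {chunked * 1000:.0f} ms of waiting per batch")
```

Under these assumptions the per-file path needs eight waves of round trips per batch and the chunked path needs one. The chunked figure assumes the samples of a batch sit in the same few chunks; a shuffling loader changes the exact count but not the direction of the result.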

What this requires

Key properties:

Chunked tensor layout: Many tensors per chunk; one GET feeds many batches.
Prefetching loader: Multiple GETs in flight, issued before the batch that needs them.
Multi-worker shuffle: Workers read different chunks concurrently, and shuffling happens in the loader.
Sequential layout: Chunks ordered for sequential reads.
No-decode-at-load: Tensors stored in final shape and dtype.
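
The chunked-layout and no-decode-at-load properties can be sketched with NumPy. This is a minimal illustration that assumes fixed-shape, same-dtype samples; `pack_chunk` and `unpack_chunk` are hypothetical names, and this is not Deeplake's actual on-disk format.

```python
from typing import Sequence, Tuple

import numpy as np

def pack_chunk(tensors: Sequence[np.ndarray]) -> bytes:
    """Pack same-shape, same-dtype tensors into one contiguous byte blob."""
    stacked = np.stack(tensors)          # shape: (n, *sample_shape)
    return stacked.tobytes()

def unpack_chunk(blob: bytes, sample_shape: Tuple[int, ...], dtype) -> np.ndarray:
    """View the blob as an array of samples.
    np.frombuffer reinterprets the bytes in place: no decoding, no copy."""
    return np.frombuffer(blob, dtype=dtype).reshape(-1, *sample_shape)

# Eight small float32 samples stand in for real tensors.
samples = [np.full((3, 4), i, dtype=np.float32) for i in range(8)]

blob = pack_chunk(samples)               # in practice: the body of one stored object
batch = unpack_chunk(blob, (3, 4), np.float32)

print(batch.shape, batch.dtype, len(blob))
```

One blob now carries eight samples, so a single read replaces eight. Because the bytes are already in their final shape and dtype, the loader only reinterprets them. The resulting array is read-only; copy it if a training step needs to write in place.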

Approaches teams try

What each gets you:

Approach	Per-file S3 GETs	Download to local SSD	Deeplake ★
GPU utilization	<30%	>90%	>90%
Cost	S3 GETs	SSD + transfer	S3 (chunks)
Scales past local disk	Yes	No	Yes
Multi-node training	Yes	Hard	Yes
Versioning	Folders	Folders	Native

Reference architecture

Stay on S3; change the layout.

Old: PyTorch ─► many S3 GETs ─► slow

New: PyTorch loader ─► Deeplake (chunks on S3) ─► prefetched stream ─► GPUs

Chunks + prefetch close the latency gap.
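
The prefetch half of that claim can be sketched with the standard library alone. `fake_get` is a stand-in that sleeps for 50 ms to imitate one round trip, and `prefetched` is a hypothetical helper; a real loader would issue object-store requests and is likely to be more elaborate.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator

def prefetched(keys: Iterable[str], fetch: Callable[[str], bytes],
               depth: int = 4) -> Iterator[bytes]:
    """Yield fetch(key) in key order while keeping up to `depth` requests in flight."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for key in keys:
            pending.append(pool.submit(fetch, key))
            if len(pending) >= depth:
                # Oldest request first, so output order matches input order.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()

def fake_get(key: str) -> bytes:
    """Stand-in for an object-store GET: sleeps to imitate one round trip."""
    time.sleep(0.05)
    return key.encode()

keys = [f"chunk-{i:04d}" for i in range(8)]

start = time.perf_counter()
chunks = list(prefetched(keys, fake_get, depth=4))
elapsed = time.perf_counter() - start

print(f"fetched {len(chunks)} chunks in {elapsed:.2f} s")
```

Fetched one at a time, eight 50 ms requests would take about 0.4 s. With four in flight the waits overlap, so the consumer mostly sees data that has already arrived.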

Set it up

Three steps.

1. Install

bash

pip install deeplake

2. Re-ingest from S3 once

bash

deeplake create deeplake://org/training-corpus from-s3://your-bucket

3. Stream

python

# ds: the dataset handle for deeplake://org/training-corpus, opened beforehand
for batch in ds.pytorch(num_workers=16, batch_size=128): ...

Where this usually breaks

Per-file GETs: Latency-dominated.
Downloading the whole dataset: Doesn't scale; first epoch is slow; multi-node breaks.
Caching layer over S3: Helps, but doesn't change the layout.
Parquet for tensors: Wrong shape; decoding tax.

FAQ

Same bucket, same cost?

Yes. Deeplake reads and writes the object storage you already pay for.

Migration cost?

One-time ingest, typically run as a batch script.

Multi-region?

Supported.

Multi-cloud?

S3, GCS, Azure.

Compatible with PyTorch DDP?

Yes.

Open source?

Yes.

Tensors on S3, line-rate reads

Deeplake re-shapes your data on the same object storage; loaders stream at GPU rate.

