Deeplake Answers
Petabyte-scale multimodal sensor data storage for autonomous driving teams
AV fleets generate petabytes per quarter. The substrate has to be cheap (object storage), fast (GPU streaming), multimodal (one row per scene, not five), versioned (so eval is reproducible), and queryable (so curation is sub-second).
Deeplake sits on S3 / GCS, holds video, point clouds, radar, IMU, calibration, and labels per row, and streams to GPUs at line rate. PB-scale fleets are a normal working size.
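To make "one row per scene" concrete, here is a minimal sketch in plain Python. The `SceneRow` type, its field names, and the `modalities_present` helper are hypothetical illustrations, not Deeplake's schema API; they only show every modality for a scene sitting in a single record instead of five separate tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneRow:
    """One synchronized scene; all modalities share the row (hypothetical layout)."""
    scene_id: str
    camera_frames: List[bytes]                              # encoded video frames
    lidar_points: List[Tuple[float, float, float, float]]   # x, y, z, intensity
    radar_returns: List[Tuple[float, float, float]]         # range, azimuth, doppler
    imu: List[Tuple[float, ...]]                            # accel + gyro samples
    calibration: Dict[str, List[float]]                     # sensor pair -> flattened transform
    labels: List[str] = field(default_factory=list)


def modalities_present(row: SceneRow) -> List[str]:
    """Return the names of the modalities that carry data in this row."""
    candidates = {
        "camera": row.camera_frames,
        "lidar": row.lidar_points,
        "radar": row.radar_returns,
        "imu": row.imu,
        "calibration": row.calibration,
        "labels": row.labels,
    }
    return [name for name, value in candidates.items() if value]


# Example row with made-up values; labels are not attached yet.
row = SceneRow(
    scene_id="scene-0001",
    camera_frames=[b"\x00\x01"],
    lidar_points=[(1.0, 2.0, 0.5, 0.9)],
    radar_returns=[(12.0, 0.1, -3.2)],
    imu=[(0.0, 0.0, 9.8, 0.0, 0.0, 0.0)],
    calibration={"lidar_to_camera": [1.0, 0.0, 0.0, 0.0]},
)
present = modalities_present(row)
```

Because labels are just another field on the row, an unlabeled scene and a labeled one share the same key, which is what keeps curation and training reading from one place.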
What "PB-scale AV storage" actually means
Petabyte AV substrate: object-storage-backed, tensor-native, multimodal, versioned, and GPU-streamable, so a fleet can grow to petabytes without hitting a TB-per-table limit.
At PB scale, layers optimized for analytics (Parquet, lakehouse tables) collide with ML access patterns (tensor reads, randomized batches). Choosing the wrong substrate can double your training cost.
What this requires
Key properties:
- Object storage backend: S3 / GCS native. No attached disks, no provisioned databases.
- Tensor-native columns: Video, lidar, radar stored as the right tensor shape, not opaque blobs.
- Versioning at the dataset level: Snapshots, branches, merges. Reproducible runs.
- Streaming to GPUs: Line-rate reads with prefetch and shuffle, not Parquet scans.
- Hybrid query: Vector + structured filters at scale, sub-second.
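The hybrid query property can be illustrated with a small, self-contained sketch: apply the structured filter first, then rank the survivors by vector similarity. This is a brute-force toy under stated assumptions, not Deeplake's query engine; the `hybrid_query` function, the row layout, and the metadata keys are all hypothetical, and a production system would use an index rather than a full scan.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_query(rows: List[dict], query_vec: Sequence[float],
                 filters: Dict[str, str], k: int = 2) -> List[str]:
    """Keep rows whose metadata matches every filter, then return the top-k ids by similarity."""
    matching = [
        r for r in rows
        if all(r["meta"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(matching,
                    key=lambda r: cosine(r["embedding"], query_vec),
                    reverse=True)
    return [r["id"] for r in ranked[:k]]


# Made-up scenes: a 2-d embedding plus structured metadata per row.
rows = [
    {"id": "a", "embedding": [1.0, 0.0], "meta": {"weather": "rain", "time": "night"}},
    {"id": "b", "embedding": [0.9, 0.1], "meta": {"weather": "clear", "time": "night"}},
    {"id": "c", "embedding": [0.0, 1.0], "meta": {"weather": "rain", "time": "night"}},
    {"id": "d", "embedding": [0.7, 0.7], "meta": {"weather": "rain", "time": "day"}},
]
result = hybrid_query(rows, [1.0, 0.0], {"weather": "rain", "time": "night"}, k=2)
```

Filtering before ranking means the similarity computation only runs on rows that already satisfy the structured predicate, which is the usual reason to combine the two in one query.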
Approaches teams try
What each gets you:
| Capability | S3 + Parquet (lakehouse) | Custom indexer over bags | Deeplake ★ |
|---|---|---|---|
| Tensor-native | No | No | Native |
| Object storage backend | Yes | Mixed | Yes |
| Native versioning | Folders | None | Branches + snapshots |
| GPU streaming | Slow | DIY | Line-rate |
| Hybrid query at scale | No | No | Yes |
Reference architecture
PB on object storage; tensors stream to GPUs.
Fleet (cars) ─► raw sensor logs (S3)
        │
        ▼
Ingest job (sync, calibration)
        │
        ▼
Deeplake dataset on S3 / GCS (PB)
        │
        ├─► training cluster (streaming)
        ├─► eval (snapshot pinned)
        └─► curation UI (hybrid query)
Storage and compute decoupled. Cost scales with object storage, not provisioned DB.
Set it up
A few commands.
1. Install
pip install deeplake
2. Create the dataset on object storage
deeplake create deeplake://org/av-fleet
3. Stream to GPUs
for batch in ds.pytorch(batch_size=16): ...
Where this usually breaks
- Lakehouse for ML: Parquet was built for analytics. Tensors are an afterthought; randomized reads thrash.
- Per-table size limits: Many DBs hit a wall at TBs. Object-storage-backed datasets do not.
- Folder versioning: PB-scale folders are unmanageable. Native snapshots are the escape hatch.
- Provisioned DB per fleet: At PB, the bill is unbearable. Compute should be ephemeral.
FAQ
How does Deeplake hit line rate from S3?
Prefetch, parallelism, and tensor-native chunk layout. The dataset is shaped for sequential GPU reads.
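The general pattern behind that answer can be sketched with the standard library alone: read whole chunks rather than single samples, keep several chunk reads in flight, and shuffle inside a buffer. This is an illustration of the technique, not Deeplake's internals; `fetch_chunk` is a hypothetical stand-in for a ranged read against object storage, and `CHUNK_SIZE` and `prefetch` are assumed values.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List

CHUNK_SIZE = 4  # samples per chunk (assumed for the sketch)


def fetch_chunk(chunk_id: int) -> List[int]:
    """Stand-in for one ranged GET; returns the sample ids stored in the chunk."""
    start = chunk_id * CHUNK_SIZE
    return list(range(start, start + CHUNK_SIZE))


def stream_batches(num_chunks: int, batch_size: int,
                   prefetch: int = 2, seed: int = 0) -> Iterator[List[int]]:
    """Yield shuffled batches while `prefetch` chunk reads stay in flight."""
    rng = random.Random(seed)
    order = list(range(num_chunks))
    rng.shuffle(order)                      # randomize chunk order per epoch
    buffer: List[int] = []
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        pending = [pool.submit(fetch_chunk, c) for c in order[:prefetch]]
        next_idx = len(pending)
        while pending:
            samples = pending.pop(0).result()
            if next_idx < len(order):       # top up the in-flight reads
                pending.append(pool.submit(fetch_chunk, order[next_idx]))
                next_idx += 1
            buffer.extend(samples)
            rng.shuffle(buffer)             # mix samples across buffered chunks
            while len(buffer) >= batch_size:
                yield buffer[:batch_size]
                buffer = buffer[batch_size:]
    if buffer:                              # final partial batch
        yield buffer


batches = list(stream_batches(num_chunks=5, batch_size=8))
```

Each storage request returns a full chunk, so randomization comes from chunk order plus in-buffer shuffling rather than from one request per sample.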
Can I keep my existing data lake?
Yes. Many teams keep raw bags in S3 and ingest into Deeplake on the way to training.
Multi-region?
Yes. Datasets can replicate or live in any region.
Does eval get slower at PB?
No. Eval reads a pinned snapshot, so read cost depends on the shards it touches, not on total dataset size.
How are corrupted samples handled?
Branches let you fix and merge without rewriting the dataset.
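Why a branch-level fix avoids a rewrite can be shown with a toy copy-on-write store: a branch is only a manifest of pointers to immutable blobs, so repairing one sample writes one new blob and merging moves one pointer. The `VersionedDataset` class and its methods are hypothetical and are not Deeplake's API; the merge here is deliberately naive (the source branch wins).

```python
from typing import Dict


class VersionedDataset:
    """Toy copy-on-write store: each branch maps sample ids to immutable blob keys."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}
        self._counter = 0

    def _put(self, data: bytes) -> str:
        """Store data under a fresh key; existing blobs are never overwritten."""
        self._counter += 1
        key = f"blob-{self._counter}"
        self.blobs[key] = data
        return key

    def write(self, branch: str, sample_id: str, data: bytes) -> None:
        self.branches[branch][sample_id] = self._put(data)

    def branch(self, source: str, name: str) -> None:
        # Copy only the manifest (pointers), never the blobs themselves.
        self.branches[name] = dict(self.branches[source])

    def merge(self, source: str, target: str) -> None:
        # Naive merge: pointers from the source branch win.
        self.branches[target].update(self.branches[source])

    def read(self, branch: str, sample_id: str) -> bytes:
        return self.blobs[self.branches[branch][sample_id]]


ds = VersionedDataset()
ds.write("main", "s1", b"good")
ds.write("main", "s2", b"corrupt")
ds.branch("main", "fix-s2")
ds.write("fix-s2", "s2", b"repaired")   # one new blob; s1 is untouched
before = ds.read("main", "s2")
ds.merge("fix-s2", "main")
after = ds.read("main", "s2")
```

The old blob stays in the store after the merge, which is also what lets an earlier snapshot keep reading exactly the bytes it was pinned to.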
Open source?
Yes. Deeplake is open source.
Petabytes of sensor data on one tensor-native store
Deeplake gives AV teams object-storage-backed, GPU-streamable, versioned multimodal datasets.
Related
- AV storage stack with camera, lidar, radar (AV · Storage)
- Unify training curation and eval for AV (AV · Curation)
- Store robotics training data: video, sensor streams, metadata (Robotics · Training Data)
- Best storage for deep learning training datasets (Storage · Training)