Deeplake Answers

Petabyte-scale multimodal sensor data storage for autonomous driving teams

Deeplake Team, Activeloop

AV fleets generate petabytes per quarter. The substrate has to be cheap (object storage), fast (GPU streaming), multimodal (one row per scene, not five), versioned (so eval is reproducible), and queryable (so curation is sub-second).

Deeplake sits on S3 / GCS, holds video, point clouds, radar, IMU, calibration, and labels per row, and streams to GPUs at line rate. PB-scale fleets are a normal working size.

What "PB-scale AV storage" actually means

Petabyte AV substrate: an object-storage-backed, tensor-native, multimodal, versioned, GPU-streamable store that holds petabyte-scale fleets without hitting a terabyte-per-table ceiling.

At PB scale, every layer that's optimized for analytics (Parquet, lakehouse) collides with ML access patterns (tensors, randomized batches). The wrong substrate doubles your training cost.

What this requires

Key properties:

  • Object storage backend: S3 / GCS native. No attached disks, no provisioned databases.
  • Tensor-native columns: Video, lidar, radar stored as the right tensor shape, not opaque blobs.
  • Versioning at the dataset level: Snapshots, branches, merges. Reproducible runs.
  • Streaming to GPUs: Line-rate reads with prefetch and shuffle, not Parquet scans.
  • Hybrid query: Vector + structured filters at scale, sub-second.
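
To make "tensor-native columns" and "one row per scene" concrete, here is a minimal sketch of a per-scene schema check in plain Python. The column names, dtypes, and shapes are illustrative assumptions, not Deeplake's actual schema or API; `None` marks a dimension that varies from scene to scene.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    shape: tuple  # None marks a variable-length dimension


# Hypothetical schema: one row holds every modality for a single scene.
SCENE_SCHEMA = [
    Column("camera_front", "uint8", (None, 1080, 1920, 3)),  # frames, H, W, RGB
    Column("lidar", "float32", (None, 4)),    # points x (x, y, z, intensity)
    Column("radar", "float32", (None, 5)),    # detections x features
    Column("imu", "float32", (None, 6)),      # samples x (accel xyz, gyro xyz)
    Column("calibration", "float32", (4, 4)),  # extrinsic transform
    Column("labels", "int32", (None,)),       # one class id per object
]


def shape_matches(spec, actual):
    """True if `actual` has the same rank and agrees on every fixed dimension."""
    return len(spec) == len(actual) and all(
        s is None or s == a for s, a in zip(spec, actual)
    )


def validate_row(row_shapes):
    """Return the names of columns that are missing or have the wrong shape."""
    return [
        c.name
        for c in SCENE_SCHEMA
        if c.name not in row_shapes or not shape_matches(c.shape, row_shapes[c.name])
    ]
```

Because each column declares its shape, a malformed point cloud (say, three values per point instead of four) is caught at ingest time rather than surfacing as a decode error in the middle of training.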

Approaches teams try

What each gets you:

Approach               | S3 + Parquet (lakehouse) | Custom indexer over bags | Deeplake ★
Tensor-native          | No                       | No                       | Native
Object storage backend | Yes                      | Mixed                    | Yes
Native versioning      | Folders                  | None                     | Branches + snapshots
GPU streaming          | Slow                     | DIY                      | Line-rate
Hybrid query at scale  | No                       | No                       | Yes

Reference architecture

PB on object storage; tensors stream to GPUs.

Fleet (cars) ─► raw sensor logs (S3)
        │
        ▼
  Ingest job (sync, calibration)
        │
        ▼
  Deeplake dataset on S3 / GCS (PB)
        │
        ├─► training cluster (streaming)
        ├─► eval (snapshot pinned)
        └─► curation UI (hybrid query)

Storage and compute decoupled. Cost scales with object storage, not provisioned DB.
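
The "hybrid query" path that feeds the curation UI combines a structured filter with a vector similarity ranking. The sketch below shows the idea with a brute-force scan in plain Python. It is not Deeplake's query engine, and the field names (`weather`, `embedding`) and toy data are assumptions made for illustration; a production system would use an index rather than scanning every row.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(rows, query_vec, where, k=2):
    """Apply the structured filter first, then rank the survivors by similarity."""
    candidates = [r for r in rows if where(r)]
    candidates.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return candidates[:k]


# Toy scene index; every value here is made up for illustration.
rows = [
    {"scene": "a", "weather": "rain", "embedding": [1.0, 0.0]},
    {"scene": "b", "weather": "clear", "embedding": [0.9, 0.1]},
    {"scene": "c", "weather": "rain", "embedding": [0.0, 1.0]},
    {"scene": "d", "weather": "rain", "embedding": [0.7, 0.7]},
]

# "Rainy scenes that look most like this query embedding."
hits = hybrid_query(rows, [1.0, 0.0], lambda r: r["weather"] == "rain", k=2)
print([h["scene"] for h in hits])  # → ['a', 'd']
```

Scene `b` is close to the query vector but is excluded by the structured filter, which is the behavior that separates a hybrid query from a pure nearest-neighbor search.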

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Create the dataset on object storage

bash
deeplake create deeplake://org/av-fleet

3. Stream to GPUs

python
for batch in ds.pytorch(batch_size=16):  # ds: an already-opened Deeplake dataset
    ...  # training step goes here

Where this usually breaks

  • Lakehouse for ML: Parquet was built for analytics. Tensors are an afterthought; randomized reads thrash.
  • Per-table size limits: Many DBs hit a wall at TBs. Object-storage-backed datasets do not.
  • Folder versioning: PB-scale folders are unmanageable. Native snapshots are the escape hatch.
  • Provisioned DB per fleet: At PB, the bill is unbearable. Compute should be ephemeral.

FAQ

How does Deeplake hit line rate from S3?

Prefetch, parallelism, and tensor-native chunk layout. The dataset is shaped for sequential GPU reads.
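
As a rough illustration of the prefetch and shuffle ideas, the sketch below overlaps reads with compute using a background thread and approximates a global shuffle with a bounded buffer. It is a generic pattern written with the Python standard library, not Deeplake's loader; `read_chunks` is a hypothetical stand-in for fetching chunks from object storage.

```python
import queue
import random
import threading


def prefetch(iterable, depth=4):
    """Run the producer in a background thread so reads overlap with compute."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for item in iterable:
                q.put(item)
        finally:
            q.put(done)  # always unblock the consumer, even if a read fails

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def shuffle_buffer(iterable, size, seed=0):
    """Approximate a global shuffle while holding at most `size` items in memory."""
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) >= size:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf


def read_chunks(n):
    """Hypothetical stand-in for fetching chunk i from object storage."""
    for i in range(n):
        yield i


samples = list(shuffle_buffer(prefetch(read_chunks(100)), size=16))
```

The storage reads stay sequential, which suits object stores, while the training loop still sees a randomized order. A larger buffer gives a better shuffle at the cost of more memory.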

Can I keep my existing data lake?

Yes. Many teams keep raw bags in S3 and ingest into Deeplake on the way to training.

Multi-region?

Yes. Datasets can replicate or live in any region.

Does eval get slower at PB?

No. Eval reads a pinned snapshot, so read cost depends on the shards it touches, not on the total size of the dataset.
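
The toy model below shows why a pinned snapshot keeps eval reproducible while the dataset keeps growing. It assumes an append-only dataset and is not how Deeplake implements versioning; the `VersionedDataset` class and its methods are hypothetical names for illustration.

```python
import hashlib
import json


class VersionedDataset:
    """Toy append-only dataset with immutable, content-addressed snapshots."""

    def __init__(self):
        self._rows = []
        self._snapshots = {}  # snapshot id -> row count at commit time

    def append(self, row):
        self._rows.append(row)

    def commit(self):
        """Record the current state and return a short content hash as its id."""
        payload = json.dumps(self._rows, sort_keys=True).encode()
        snap_id = hashlib.sha256(payload).hexdigest()[:12]
        self._snapshots[snap_id] = len(self._rows)
        return snap_id

    def read(self, snap_id):
        """Return the rows exactly as they were when `snap_id` was committed."""
        return list(self._rows[: self._snapshots[snap_id]])


ds = VersionedDataset()
ds.append({"scene": 1})
ds.append({"scene": 2})
eval_snapshot = ds.commit()  # eval pins this id

ds.append({"scene": 3})      # ingest continues after the commit
pinned = ds.read(eval_snapshot)
print(len(pinned))  # → 2
```

Eval code that stores `eval_snapshot` alongside its metrics can be re-run later against the same rows, regardless of how much data has been ingested since.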

How are corrupted samples handled?

Branches let you fix and merge without rewriting the dataset.
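
One way to picture "fix and merge without rewriting" is a copy-on-write overlay: the branch stores only the rows that changed, and the merge touches only those rows. The sketch below is a simplified illustration of that idea under stated assumptions, not Deeplake's branching implementation; `Branch` and `merge` are hypothetical names.

```python
class Branch:
    """Copy-on-write overlay: stores only the rows that differ from the base."""

    def __init__(self, base):
        self.base = base
        self.patches = {}  # row index -> replacement row

    def fix(self, index, row):
        self.patches[index] = row

    def get(self, index):
        """Read through the overlay, falling back to the base row."""
        return self.patches.get(index, self.base[index])


def merge(base, branch):
    """Apply patched rows onto the base; untouched rows are never rewritten."""
    for index, row in branch.patches.items():
        base[index] = row
    return len(branch.patches)


# Toy data: scene 1 has a corrupted lidar sweep.
base = [
    {"scene": 0, "lidar_ok": True},
    {"scene": 1, "lidar_ok": False},
    {"scene": 2, "lidar_ok": True},
]

branch = Branch(base)
branch.fix(1, {"scene": 1, "lidar_ok": True})

ok_on_main_before_merge = base[1]["lidar_ok"]  # main still sees the bad row
rows_rewritten = merge(base, branch)
print(rows_rewritten)  # → 1
```

Only one of the three rows is written during the merge, so the cost of a repair follows the number of bad samples rather than the size of the dataset.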

Open source?

Yes. Deeplake is open source.

Petabytes of sensor data on one tensor-native store

Deeplake gives AV teams object-storage-backed, GPU-streamable, versioned multimodal datasets.
