Petabyte-scale multimodal sensor data storage for autonomous driving teams

TLDR: AV fleets generate petabytes per quarter. The substrate has to be cheap (object storage), fast (GPU streaming), multimodal (one row per scene, not five), versioned (so eval is reproducible), and queryable (so curation is sub-second).

Deeplake sits on S3 / GCS, stores video, point clouds, radar, IMU, calibration, and labels together in each row, and streams them to GPUs at line rate. Petabyte-scale fleets are a normal working size, not an edge case.

What "PB-scale AV storage" actually means

Petabyte AV substrate: A store that is object-storage-backed, tensor-native, multimodal, versioned, and GPU-streamable, so a petabyte fleet never hits a TB-per-table cliff.

At PB scale, layers optimized for analytics (Parquet, lakehouse) collide with ML access patterns (tensor reads, randomized batches). The wrong substrate can double your training cost.

What this requires

Key properties:

Object storage backend: S3 / GCS native. No attached disks, no provisioned databases.
Tensor-native columns: Video, lidar, radar stored as the right tensor shape, not opaque blobs.
Versioning at the dataset level: Snapshots, branches, merges. Reproducible runs.
Streaming to GPUs: Line-rate reads with prefetch and shuffle, not Parquet scans.
Hybrid query: Vector + structured filters at scale, sub-second.
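To make "one row per scene" and "the right tensor shape" concrete, here is a minimal sketch of a per-scene schema with a raw-size estimate. The column names, shapes, and value widths are illustrative assumptions, not Deeplake's actual schema or API.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Hypothetical column description: a fixed shape plus bytes per value."""
    shape: Tuple[int, ...]
    bytes_per_value: int

    def raw_bytes(self) -> int:
        # Uncompressed size of one sample of this column.
        return prod(self.shape) * self.bytes_per_value


# Assumed shapes for one scene; real sensors and clip lengths will differ.
SCENE_SCHEMA: Dict[str, TensorSpec] = {
    "camera": TensorSpec((20, 1080, 1920, 3), 1),  # frames, H, W, RGB as uint8
    "lidar": TensorSpec((120_000, 4), 4),          # points, xyz + intensity as float32
    "radar": TensorSpec((256, 5), 4),              # detections, features as float32
    "imu": TensorSpec((200, 6), 4),                # samples, accel xyz + gyro xyz
    "calibration": TensorSpec((4, 4), 8),          # one extrinsic matrix as float64
}


def row_raw_bytes(schema: Dict[str, TensorSpec]) -> int:
    """Total uncompressed bytes for one scene row across all modalities."""
    return sum(spec.raw_bytes() for spec in schema.values())


print(row_raw_bytes(SCENE_SCHEMA))
```

Under these assumptions camera frames account for almost all of each row, which is why video needs a shape-aware layout rather than an opaque blob.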

Approaches teams try

What each gets you:

Approach	S3 + Parquet (lakehouse)	Custom indexer over bags	Deeplake ★
Tensor-native	No	No	Native
Object storage backend	Yes	Mixed	Yes
Native versioning	Folders	None	Branches + snapshots
GPU streaming	Slow	DIY	Line-rate
Hybrid query at scale	No	No	Yes
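The "hybrid query" row combines a structured filter with a vector ranking. The sketch below shows that pattern in plain Python over an in-memory list. It illustrates the idea only: the function name `hybrid_query`, the row fields, and the brute-force scan are assumptions, and a real engine would use indexes rather than a full pass.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_query(
    rows: List[Dict[str, Any]],
    query_vec: Sequence[float],
    where: Dict[str, Any],
    top_k: int,
) -> List[Dict[str, Any]]:
    """Apply exact-match filters first, then rank survivors by similarity."""
    candidates = [
        row for row in rows
        if all(row.get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda row: cosine(row["embedding"], query_vec), reverse=True)
    return candidates[:top_k]


# Hypothetical scene metadata with tiny 2-d embeddings for readability.
scenes = [
    {"scene_id": "a", "weather": "rain", "embedding": [1.0, 0.0]},
    {"scene_id": "b", "weather": "clear", "embedding": [1.0, 0.0]},
    {"scene_id": "c", "weather": "rain", "embedding": [0.0, 1.0]},
    {"scene_id": "d", "weather": "rain", "embedding": [0.9, 0.1]},
]
hits = hybrid_query(scenes, query_vec=[1.0, 0.0], where={"weather": "rain"}, top_k=2)
print([row["scene_id"] for row in hits])
```

Filtering before ranking keeps the similarity computation limited to rows that already satisfy the structured predicate.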

Reference architecture

PB on object storage; tensors stream to GPUs.

Fleet (cars) ─► raw sensor logs (S3)
        │
        ▼
  Ingest job (sync, calibration)
        │
        ▼
  Deeplake dataset on S3 / GCS (PB)
        │
        ├─► training cluster (streaming)
        ├─► eval (snapshot pinned)
        └─► curation UI (hybrid query)

Storage and compute are decoupled, so cost scales with object storage rather than with a provisioned database.
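The "sync" step in the ingest job usually means aligning sensors that tick at different rates. A minimal sketch follows, assuming sorted integer millisecond timestamps and a nearest-neighbor match with a skew tolerance. The function names and the 50 ms tolerance are illustrative choices, not part of any Deeplake ingest API.

```python
from bisect import bisect_left
from typing import List, Tuple


def nearest_index(timestamps: List[int], t: int) -> int:
    """Index of the value in a sorted, non-empty list that is closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    # Ties go to the earlier sample.
    return i if (after - t) < (t - before) else i - 1


def sync_to_camera(
    camera_ms: List[int], lidar_ms: List[int], max_skew_ms: int = 50
) -> List[Tuple[int, int]]:
    """Pair each camera frame with its nearest lidar sweep, dropping loose matches."""
    if not lidar_ms:
        return []
    pairs = []
    for cam_idx, t in enumerate(camera_ms):
        lidar_idx = nearest_index(lidar_ms, t)
        if abs(lidar_ms[lidar_idx] - t) <= max_skew_ms:
            pairs.append((cam_idx, lidar_idx))
    return pairs


print(sync_to_camera([0, 100, 200, 500], [10, 120, 190]))
```

Frames with no sweep inside the tolerance are dropped instead of being paired with stale data, so each stored row holds sensors that actually describe the same moment.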

Set it up

Three steps: install, create, stream.

1. Install

bash

pip install deeplake

2. Create the dataset on object storage

bash

deeplake create deeplake://org/av-fleet

3. Stream to GPUs

python

# ds is the dataset handle opened from the path created in step 2
for batch in ds.pytorch(batch_size=16): ...

Where this usually breaks

Lakehouse for ML: Parquet was built for analytics. Tensors are an afterthought; randomized reads thrash.
Per-table size limits: Many DBs hit a wall at TBs. Object-storage-backed datasets do not.
Folder versioning: PB-scale folders are unmanageable. Native snapshots are the escape hatch.
Provisioned DB per fleet: At PB, the bill is unbearable. Compute should be ephemeral.
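A rough way to see why randomized reads thrash on analytics layouts is to compare bytes fetched with bytes needed. The sketch below assumes the reader must pull a whole storage unit (a row group or a chunk) to get one random sample and gets no cache reuse. Both are simplifying assumptions, and the sizes are hypothetical.

```python
MIB = 1024 * 1024


def read_amplification(sample_bytes: int, fetch_unit_bytes: int) -> float:
    """Bytes fetched divided by bytes needed for one random sample."""
    # Ceiling division: a sample larger than one unit spans several units.
    units = -(-sample_bytes // fetch_unit_bytes)
    return units * fetch_unit_bytes / sample_bytes


# Hypothetical sizes: a 2 MiB sample inside a 128 MiB row group vs an 8 MiB chunk.
print(read_amplification(2 * MIB, 128 * MIB))
print(read_amplification(2 * MIB, 8 * MIB))
```

With these numbers the large unit fetches 64 times the needed bytes and the smaller chunk 4 times. Readers that can fetch at finer granularity will do better than this worst case.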

FAQ

How does Deeplake hit line rate from S3?

Prefetch, parallelism, and tensor-native chunk layout. The dataset is shaped for sequential GPU reads.
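As an illustration of the prefetch and shuffle ideas, here is a minimal standard-library sketch: a background thread keeps a bounded queue full, and a shuffle buffer randomizes order without loading the whole dataset. This is a generic pattern under stated assumptions, not Deeplake's loader. The names `prefetch`, `shuffle_buffer`, and `batches` are hypothetical.

```python
import queue
import random
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterable[T], depth: int = 4) -> Iterator[T]:
    """Read ahead on a background thread so the consumer rarely waits on I/O."""
    buf: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for item in source:
                buf.put(item)
        finally:
            buf.put(_DONE)  # always signal completion, even if the source fails

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def shuffle_buffer(source: Iterable[T], size: int, seed: int = 0) -> Iterator[T]:
    """Approximate shuffle: once the buffer is full, emit a random element."""
    rng = random.Random(seed)
    buf: List[T] = []
    for item in source:
        buf.append(item)
        if len(buf) >= size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    rng.shuffle(buf)  # drain what is left in random order
    yield from buf


def batches(source: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group items into lists of batch_size; the last batch may be shorter."""
    batch: List[T] = []
    for item in source:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


for batch in batches(shuffle_buffer(prefetch(range(10)), size=4), batch_size=3):
    print(batch)
```

A larger shuffle buffer gives better randomization at the cost of memory, and a deeper prefetch queue hides more storage latency.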

Can I keep my existing data lake?

Yes. Many teams keep raw bags in S3 and ingest into Deeplake on the way to training.

Multi-region?

Yes. Datasets can be replicated across regions or live in any single region.

Does eval get slower at PB?

No. Eval reads a pinned snapshot, so read cost depends on the shards it touches, not on the total size of the dataset.

How are corrupted samples handled?

Branches let you fix and merge without rewriting the dataset.
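The branch, fix, and merge flow can be sketched with a toy in-memory model. It is not Deeplake's versioning API. The class and method names are hypothetical, and a real store would share unchanged chunks between versions instead of copying mappings.

```python
from typing import Dict


class ToyVersionedDataset:
    """Toy model: each branch maps row id to sample; snapshots are frozen copies."""

    def __init__(self) -> None:
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}
        self.snapshots: Dict[str, Dict[str, str]] = {}

    def write(self, branch: str, row_id: str, sample: str) -> None:
        self.branches[branch][row_id] = sample

    def snapshot(self, branch: str, name: str) -> None:
        # Copy the mapping so later writes to the branch cannot change it.
        self.snapshots[name] = dict(self.branches[branch])

    def branch(self, source: str, name: str) -> None:
        self.branches[name] = dict(self.branches[source])

    def merge(self, source: str, target: str) -> None:
        # Simplest policy: rows from the source branch win on conflict.
        self.branches[target].update(self.branches[source])

    def read(self, ref: str, row_id: str) -> str:
        table = self.snapshots[ref] if ref in self.snapshots else self.branches[ref]
        return table[row_id]


ds = ToyVersionedDataset()
ds.write("main", "scene-1", "ok")
ds.write("main", "scene-2", "corrupt")
ds.snapshot("main", "eval-v1")           # eval pins this view
ds.branch("main", "fix-scene-2")
ds.write("fix-scene-2", "scene-2", "repaired")
ds.merge("fix-scene-2", "main")
print(ds.read("main", "scene-2"), ds.read("eval-v1", "scene-2"))
```

The pinned snapshot still returns the original sample after the merge, which is what keeps earlier eval runs reproducible while main picks up the repair.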

Open source?

Yes. Deeplake is open source.

Petabytes of sensor data on one tensor-native store

Deeplake gives AV teams object-storage-backed, GPU-streamable, versioned multimodal datasets.
