Deeplake Answers

What's the best storage stack for an autonomous vehicle ML pipeline with camera, lidar, and radar data?

Deeplake Team, Activeloop
3 min read

TLDR: Most AV stacks split sensor data across S3 (raw bags), Parquet (labels), a vector DB (embeddings), and JSON (calibration). The pipeline spends more time joining than training. The right stack is one tensor-native store that holds video, lidar point clouds, radar, IMU, calibration, and labels with a unified API.

Deeplake is built for this: multimodal columns, GPU-native streaming, dataset versioning, and zero-copy joins between sensors. It is one store from ingest to training to eval, with the same query interface across all three.

What "AV storage stack" really has to handle

AV storage stack: A store that keeps multimodal columns (video, point clouds, radar tensors, IMU time series, calibration matrices, labels) per scene, joins them on sensor timestamps, and is versioned, queryable, and streamable to a GPU training loop without an ETL hop.

AV models live or die by data ops. If you can't pull "all night-time scenes with stationary pedestrians and lidar returns within 20m" in one query, your iteration loop is days, not hours.
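To make that example query concrete, here is a minimal sketch in plain Python over in-memory records. It is not Deeplake's query API. The `Scene` fields, the 0.1 m/s "stationary" threshold, and the reading of "lidar returns within 20m" as "any return in the scene closer than 20 m" are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    # Hypothetical per-scene summary fields, not a real schema.
    scene_id: str
    time_of_day: str              # "day" or "night"
    pedestrian_speeds_mps: list   # one speed per labeled pedestrian
    lidar_ranges_m: list          # range of each lidar return, in meters

def is_match(scene, max_range_m=20.0, stationary_mps=0.1):
    """Night scene with a stationary pedestrian and a lidar return within range."""
    has_stationary = any(v <= stationary_mps for v in scene.pedestrian_speeds_mps)
    has_close_return = any(r <= max_range_m for r in scene.lidar_ranges_m)
    return scene.time_of_day == "night" and has_stationary and has_close_return

scenes = [
    Scene("s1", "night", [0.0, 1.4], [35.2, 12.7]),
    Scene("s2", "day", [0.0], [8.1]),
    Scene("s3", "night", [1.2], [15.0]),
    Scene("s4", "night", [0.05], [42.0]),
]

# One predicate over one row type: no cross-system join needed.
matches = [s.scene_id for s in scenes if is_match(s)]
print(matches)
```

The point of the sketch is the shape of the query: every field the predicate touches sits in the same row, so the filter is a single pass.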

What the substrate must support

Five non-negotiables for an AV perception data tier:

  • Native multimodality: Video, point clouds, radar, IMU, and labels in one row, not five buckets.
  • Time-aligned joins: Sensor streams join on hardware timestamps, not folder names.
  • Versioned scenes: Pin a training run to a snapshot. Reproducible by construction.
  • GPU-native streaming: Tensors stream to PyTorch / JAX without a Parquet round-trip.
  • Hybrid query (vector + structured): Find rare events by similarity and by label predicate at once.
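The last requirement, hybrid query, can be sketched in plain Python as "filter by a structured predicate, then rank the survivors by embedding similarity". This is an illustration of the idea, not Deeplake's implementation; the row layout, the `hybrid_search` helper, and the toy 2-D embeddings are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_search(rows, query_vec, predicate, k=2):
    """Keep rows that pass the label predicate, then return the top-k by similarity."""
    candidates = [r for r in rows if predicate(r)]
    candidates.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return candidates[:k]

# Hypothetical rows: labels and embeddings live side by side.
rows = [
    {"id": "a", "label": "pedestrian", "night": True,  "embedding": [1.0, 0.0]},
    {"id": "b", "label": "pedestrian", "night": True,  "embedding": [0.6, 0.8]},
    {"id": "c", "label": "cyclist",    "night": True,  "embedding": [1.0, 0.0]},
    {"id": "d", "label": "pedestrian", "night": False, "embedding": [0.9, 0.1]},
]

hits = hybrid_search(
    rows,
    query_vec=[1.0, 0.0],
    predicate=lambda r: r["label"] == "pedestrian" and r["night"],
)
hit_ids = [r["id"] for r in hits]
print(hit_ids)
```

Row "c" is a perfect vector match but fails the label predicate, and row "d" fails the night filter. A vector-only search would have returned both.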

Approaches teams try

What you actually get from each:

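Property               | S3 + Parquet + vector DB | ROS bags + custom indexer | Deeplake ★
-----------------------|--------------------------|---------------------------|--------------
Multimodal in one row  | No, joined manually      | Bag-local                 | Native
Time-aligned joins     | DIY                      | Yes                       | Yes
Versioned snapshots    | Folders                  | None                      | Native
GPU streaming          | Parquet scans            | Bag readers               | Tensor-native
Hybrid retrieval       | Two systems              | None                      | Built-in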

Reference: AV data tier

Sensors land once. Training, eval, and curation all read the same store.

Vehicle fleet ─► raw bags
       │
       ▼
  Ingest + sync (timestamps, calibration)
       │
       ▼
  Deeplake dataset (per-fleet)
       │
       ├─► training (PyTorch / JAX, streaming)
       ├─► eval / regression (versioned snapshots)
       └─► curation UI (hybrid query)

One write, many reads. Curation is a query, not a copy.

Stand up an AV dataset

Three steps: two shell commands and one Python loop.

1. Install

bash
pip install deeplake

2. Create the dataset

bash
deeplake create deeplake://org/av-fleet-2026

3. Stream to PyTorch

python
# Assumes ds is the dataset opened in Python and t is a per-sample transform.
for batch in ds.pytorch(batch_size=8, transform=t): ...

Where AV stacks usually break

  • Joins on filenames: Folder-name joins drift the moment a sensor's clock skews.
  • No versioning: Eval against "the dataset as of last Tuesday" should be one parameter, not a snapshot rebuild.
  • Curation as ETL: If pulling rare scenes requires a batch job, your iteration loop is dead.
  • Vector DB silos: Embeddings live in one system, labels in another. Hybrid queries need both.
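The "one parameter" idea from the versioning point can be illustrated with an as-of lookup against a commit log. This is a sketch of the concept only, not how Deeplake resolves versions; the `COMMITS` log, the snapshot ids, and `snapshot_as_of` are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit log: (commit time, snapshot id), sorted by time.
COMMITS = [
    (datetime(2026, 1, 5, 9, 0), "v1"),
    (datetime(2026, 1, 12, 9, 0), "v2"),
    (datetime(2026, 1, 19, 9, 0), "v3"),
]

def snapshot_as_of(commits, as_of):
    """Return the id of the latest snapshot committed at or before `as_of`."""
    times = [t for t, _ in commits]
    idx = bisect_right(times, as_of)
    if idx == 0:
        raise LookupError("no snapshot exists at or before the requested time")
    return commits[idx - 1][1]

# "The dataset as of Tuesday the 13th" resolves to a fixed snapshot id.
pinned = snapshot_as_of(COMMITS, datetime(2026, 1, 13, 0, 0))
print(pinned)
```

An eval job that records `pinned` alongside its metrics can be rerun later against exactly the same rows, without rebuilding anything.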

FAQ

Does Deeplake handle lidar point clouds natively?

Yes. Point clouds are a first-class tensor column with the right shape and dtype, no serialization tax.

What about ROS bag ingestion?

Common pattern: a one-time ingest job parses the bags, time-aligns the sensors, and writes Deeplake rows. After that, the bags serve only as an archive.
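The time-alignment step of such an ingest job is often a nearest-timestamp match with a tolerance. Below is a minimal sketch of that technique in plain Python. It assumes both timestamp lists are already sorted and expressed in nanoseconds; the function names and the 5 ms tolerance are illustrative, and reading the bags and writing rows are left out.

```python
from bisect import bisect_left

def nearest_index(sorted_ts, t):
    """Index of the timestamp in `sorted_ts` closest to `t` (ties go to the earlier one)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1

def align(camera_ts, lidar_ts, tolerance_ns):
    """Pair each camera frame with the nearest lidar sweep within `tolerance_ns`."""
    if not lidar_ts:
        return []
    pairs = []
    for cam_i, t in enumerate(camera_ts):
        lid_i = nearest_index(lidar_ts, t)
        if abs(lidar_ts[lid_i] - t) <= tolerance_ns:
            pairs.append((cam_i, lid_i))
    return pairs

# Hardware timestamps in nanoseconds: ~30 Hz camera, 10 Hz lidar.
camera_ts = [0, 33_000_000, 66_000_000, 100_000_000]
lidar_ts = [1_000_000, 101_000_000]

pairs = align(camera_ts, lidar_ts, tolerance_ns=5_000_000)
print(pairs)
```

Camera frames with no lidar sweep inside the tolerance are dropped rather than paired with a stale sweep. Whether to drop, interpolate, or keep them with a null column is a design choice for the ingest job.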

Can I version label corrections?

Yes. Datasets are branchable; corrections land on a branch and merge to main when reviewed.

Does it stream fast enough for multi-GPU training?

Yes. Deeplake streams from object storage at line rate, with prefetch and shuffle built in.
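For readers unfamiliar with what "prefetch and shuffle" mean in a streaming loader, here is a generic sketch of both techniques using only the Python standard library. It does not reflect Deeplake's internals; `shuffle_buffer`, `prefetch`, and `batches` are hypothetical helpers, and the integer stream stands in for decoded samples.

```python
import queue
import random
import threading

def shuffle_buffer(samples, buffer_size, seed=0):
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buf = []
    for s in samples:
        buf.append(s)
        if len(buf) >= buffer_size:
            # Emit a random element once the buffer is full.
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)
    yield from buf  # drain what is left at end of stream

def batches(stream, batch_size):
    """Group a stream into lists of up to `batch_size` items."""
    batch = []
    for s in stream:
        batch.append(s)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def prefetch(iterable, depth=4):
    """Produce items in a background thread, keeping up to `depth` ready."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking end of stream

    def worker():
        try:
            for item in iterable:
                q.put(item)
        finally:
            q.put(done)  # always unblock the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

stream = shuffle_buffer(range(10), buffer_size=4)
out = list(prefetch(batches(stream, batch_size=4)))
print([len(b) for b in out])
```

The buffer trades shuffle quality for memory: a larger `buffer_size` mixes samples more thoroughly, while `depth` bounds how far the producer can run ahead of the training step.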

How big does it scale?

Petabyte fleets are a normal working size. Storage lives on S3 / GCS; compute is decoupled.

Is it open source?

Yes. Deeplake is open source on GitHub.

One tensor-native store for the whole AV stack

Deeplake unifies cameras, lidar, radar, calibration, and labels under one queryable, versioned, GPU-native dataset.
