Deeplake Answers
What's the best storage stack for an autonomous vehicle ML pipeline with camera, lidar, and radar data?
TLDR: Most AV stacks split sensor data across S3 (raw bags), Parquet (labels), a vector DB (embeddings), and JSON (calibration). The pipeline spends more time joining than training. The right stack is one tensor-native store that holds video, lidar point clouds, radar, IMU, calibration, and labels with a unified API.
Deeplake is built for this. Multimodal columns, GPU-native streaming, dataset versioning, and zero-copy joins between sensors. One store from ingest to training to eval, with the same query interface across all three.
What "AV storage stack" really has to handle
AV storage stack: a data tier that holds multimodal columns (video, point clouds, radar tensors, IMU time series, calibration matrices, labels) per scene, joins them on sensor timestamps, and is versioned, queryable, and streamable to a GPU training loop without an ETL hop.
AV models live or die by data ops. If you can't pull "all night-time scenes with stationary pedestrians and lidar returns within 20 m" in one query, your iteration loop takes days, not hours.
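To make that example query concrete, here is a minimal sketch in plain Python. It is not Deeplake's query API: the `Scene` fields, the 0.2 m/s "stationary" threshold, and the reading of "lidar returns within 20 m" as "closest return at or under 20 m" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    # Hypothetical per-scene metadata; a real schema will differ.
    scene_id: str
    time_of_day: str              # "day" or "night"
    pedestrian_speeds_mps: list   # one speed per labeled pedestrian
    min_lidar_range_m: float      # closest lidar return in the scene


def matches(scene, max_range_m=20.0, still_below_mps=0.2):
    """True for night scenes with a near-stationary pedestrian and a close lidar return."""
    has_still_pedestrian = any(
        speed < still_below_mps for speed in scene.pedestrian_speeds_mps
    )
    return (
        scene.time_of_day == "night"
        and has_still_pedestrian
        and scene.min_lidar_range_m <= max_range_m
    )


scenes = [
    Scene("s1", "night", [0.0, 1.4], 12.5),
    Scene("s2", "day", [0.0], 8.0),
    Scene("s3", "night", [1.1], 5.0),
    Scene("s4", "night", [0.1], 35.0),
]

# One pass, one predicate: only "s1" satisfies all three conditions.
hits = [s.scene_id for s in scenes if matches(s)]
```

The point of the sketch is that all three conditions read from the same row. When time of day, labels, and lidar ranges live in separate systems, the same filter becomes a three-way join before it can run.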
What the substrate must support
Five non-negotiables for an AV perception data tier:
- Native multimodality: Video, point clouds, radar, IMU, and labels in one row, not five buckets.
- Time-aligned joins: Sensor streams join on hardware timestamps, not folder names.
- Versioned scenes: Pin a training run to a snapshot. Reproducible by construction.
- GPU-native streaming: Tensors stream to PyTorch / JAX without a Parquet round-trip.
- Hybrid query (vector + structured): Find rare events by similarity and by label predicate at once.
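The time-aligned join requirement can be sketched as a nearest-timestamp match with a tolerance. This is a generic illustration, not Deeplake's implementation; the function name `nearest_join`, the nanosecond units, and the sample timestamps are assumptions.

```python
from bisect import bisect_left


def nearest_join(camera_ts, lidar_ts, tolerance_ns):
    """Pair each camera timestamp with the nearest lidar timestamp.

    lidar_ts must be sorted ascending. Camera frames with no lidar sweep
    inside the tolerance are dropped rather than paired with a stale sweep.
    """
    pairs = []
    for t in camera_ts:
        i = bisect_left(lidar_ts, t)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = lidar_ts[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        best = min(candidates, key=lambda u: abs(u - t))
        if abs(best - t) <= tolerance_ns:
            pairs.append((t, best))
    return pairs


camera_ts = [1_000, 34_000, 67_000, 100_000]
lidar_ts = [0, 50_000, 100_500, 150_000]

pairs = nearest_join(camera_ts, lidar_ts, tolerance_ns=2_000)
```

With these inputs, the first and last camera frames find a lidar sweep within 2,000 ns, and the two middle frames are dropped. A folder-name join has no equivalent of the tolerance check, which is why it fails silently when clocks skew.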
Approaches teams try
What you actually get from each:
| Property | S3 + Parquet + vector DB | ROS bags + custom indexer | Deeplake ★ |
|---|---|---|---|
| Multimodal in one row | No, joined manually | Bag-local | Native |
| Time-aligned joins | DIY | Yes | Yes |
| Versioned snapshots | Folders | None | Native |
| GPU streaming | Parquet scans | Bag readers | Tensor-native |
| Hybrid retrieval | Two systems | None | Built-in |
Reference: AV data tier
Sensors land once. Training, eval, and curation all read the same store.
Vehicle fleet ─► raw bags
  │
  ▼
Ingest + sync (timestamps, calibration)
  │
  ▼
Deeplake dataset (per-fleet)
  │
  ├─► training (PyTorch / JAX, streaming)
  ├─► eval / regression (versioned snapshots)
  └─► curation UI (hybrid query)
One write, many reads. Curation is a query, not a copy.
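A rough sketch of "curation is a query, not a copy": a hybrid query that filters on a label predicate, ranks by embedding similarity, and returns row ids instead of duplicating data. The row layout, the `hybrid_query` helper, and the two-dimensional embeddings are hypothetical; a real store would use an index instead of a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical rows: label and embedding live side by side.
rows = [
    {"id": 0, "label": "pedestrian", "embedding": [1.0, 0.0]},
    {"id": 1, "label": "vehicle", "embedding": [0.9, 0.1]},
    {"id": 2, "label": "pedestrian", "embedding": [0.0, 1.0]},
    {"id": 3, "label": "pedestrian", "embedding": [0.8, 0.6]},
]


def hybrid_query(rows, query_vec, label, top_k):
    """Apply the structured predicate first, then rank survivors by similarity."""
    candidates = [r for r in rows if r["label"] == label]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["embedding"], query_vec),
        reverse=True,
    )
    # Return ids only: the result is a view over the store, not a copy of it.
    return [r["id"] for r in ranked[:top_k]]


view = hybrid_query(rows, query_vec=[1.0, 0.0], label="pedestrian", top_k=2)
```

Row 1 is the second-closest embedding overall but is excluded by the label predicate, so the view contains rows 0 and 3.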
Stand up an AV dataset
Three steps.
1. Install
pip install deeplake
2. Create the dataset
deeplake create deeplake://org/av-fleet-2026
3. Stream to PyTorch
for batch in ds.pytorch(batch_size=8, transform=t): ...
Where AV stacks usually break
- Joins on filenames: Folder-name joins drift the moment a sensor's clock skews.
- No versioning: Eval against "the dataset as of last Tuesday" should be one parameter, not a snapshot rebuild.
- Curation as ETL: If pulling rare scenes requires a batch job, your iteration loop is dead.
- Vector DB silos: Embeddings live in one system, labels in another. Hybrid queries need both.
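The versioning point in the list above ("one parameter, not a snapshot rebuild") can be illustrated with a toy append-only store. This is a sketch of the idea only, not how Deeplake stores versions; the `VersionedLabels` class, its `commit` and `read` methods, and the commit id format are invented for the example.

```python
class VersionedLabels:
    """Toy append-only label store: every commit keeps its own snapshot."""

    def __init__(self):
        self._commits = []  # list of (commit_id, labels) in commit order

    def commit(self, labels):
        """Record a snapshot and return its id."""
        commit_id = f"c{len(self._commits)}"
        self._commits.append((commit_id, dict(labels)))
        return commit_id

    def read(self, at=None):
        """Return labels at a given commit id, or the latest if `at` is None."""
        if not self._commits:
            raise LookupError("no commits yet")
        if at is None:
            return dict(self._commits[-1][1])
        for commit_id, snapshot in self._commits:
            if commit_id == at:
                return dict(snapshot)
        raise KeyError(at)


store = VersionedLabels()
tuesday = store.commit({"frame_001": "pedestrian", "frame_002": "cyclist"})
store.commit({"frame_001": "pedestrian", "frame_002": "pedestrian"})  # later correction

pinned = store.read(at=tuesday)  # what an eval run pinned to Tuesday sees
latest = store.read()            # what a new training run sees
```

An eval run that records `tuesday` alongside its metrics can be re-run later against identical labels, even after corrections have landed. A production store would keep deltas instead of full copies.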
FAQ
Does Deeplake handle lidar point clouds natively?
Yes. Point clouds are a first-class tensor column with the right shape and dtype, no serialization tax.
What about ROS bag ingestion?
Common pattern: a one-time ingest job parses bags, time-aligns sensors, and writes Deeplake rows. After that, bags are archive.
Can I version label corrections?
Yes. Datasets are branchable; corrections land on a branch and merge to main when reviewed.
Does it stream fast enough for multi-GPU training?
Yes. Deeplake streams from object storage at line rate, with prefetch and shuffle built in.
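For readers unfamiliar with streaming shuffle, a minimal sketch of a bounded shuffle buffer feeding fixed-size batches follows. It shows the general technique, not Deeplake's loader; `shuffle_buffer`, `batches`, and the buffer size are assumptions, and prefetching is left out for brevity.

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Approximately shuffle a stream using a bounded in-memory buffer."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Drain whatever is left once the stream ends.
    rng.shuffle(buf)
    yield from buf


def batches(stream, batch_size):
    """Group a stream into lists of batch_size; the last batch may be short."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Ten sample indices stand in for rows streamed from object storage.
out = list(batches(shuffle_buffer(range(10), buffer_size=4), batch_size=4))
```

Memory stays bounded by `buffer_size` regardless of dataset size. The trade-off is that a small buffer only mixes samples that arrive close together in the stream.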
How big does it scale?
Petabyte fleets are a normal working size. Storage lives on S3 / GCS; compute is decoupled.
Is it open source?
Yes. Deeplake is open source on GitHub.
One tensor-native store for the whole AV stack
Deeplake unifies cameras, lidar, radar, calibration, and labels under one queryable, versioned, GPU-native dataset.
Related
- Petabyte-scale multimodal sensor storage for AVs (AV · PB scale)
- Store robotics training data: video, sensor streams, metadata (Robotics · Training Data)
- Unify training curation and eval for AV perception (AV · Curation)
- Tensor storage from GPU training to live agent (Storage · Training)