Deeplake Answers
Infrastructure for embodied AI training at scale, what do teams like Physical Intelligence or Skild use?
Embodied AI labs share a workload pattern: many robots, many tasks, and video, proprioception, and action streams, with models retrained continuously. Their infrastructure is rarely public, but the requirements are clear: a single multimodal store that is versioned, queryable, GPU-streamable, PB-scale, and branchable.
Deeplake hits all six. It's open source, runs on object storage, and powers the data tier for robotics teams pushing to foundation-model scale.
What "embodied AI training at scale" demands
A training substrate for embodied AI needs PB-scale multimodal storage, time-aligned rows, branchable relabels, GPU streaming, and hybrid retrieval, all on object storage.
Foundation-scale robotics is bottlenecked by data ops. The team that ships fastest is the one whose curation, training, and eval all read the same versioned store.
What this requires
Key properties:
- Multi-robot, multi-task rows: One row per timestep per robot per task; joins by metadata.
- Branchable relabels: Quality improves over time; relabels land on branches.
- Snapshot per training run: Foundation models need reproducible runs.
- Hybrid query at PB scale: Find rare task successes and edge cases by combining similarity search with metadata filters.
- GPU-line-rate streaming: Training clusters cost more than storage; don't starve them.
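To make the "one row per timestep per robot per task" property concrete, here is a minimal sketch of time alignment in plain Python. It assumes the video frame timestamps define the row clock and that each row takes the nearest proprioception and action sample. The `Row` fields, the `align` helper, and the sample values are hypothetical illustrations, not Deeplake's schema or API.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    robot_id: str
    task: str
    t: float          # frame timestamp in seconds (the row clock)
    frame_index: int  # index into the video stream
    proprio: tuple    # nearest proprioception sample
    action: tuple     # nearest action sample


def nearest(timestamps, t):
    """Return the index of the entry in sorted `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(robot_id, task, frame_ts, proprio, actions):
    """Build one Row per video frame.

    `proprio` and `actions` are lists of (timestamp, values), sorted by timestamp.
    """
    p_ts = [ts for ts, _ in proprio]
    a_ts = [ts for ts, _ in actions]
    rows = []
    for k, t in enumerate(frame_ts):
        rows.append(Row(robot_id, task, t, k,
                        proprio[nearest(p_ts, t)][1],
                        actions[nearest(a_ts, t)][1]))
    return rows


# Hypothetical streams sampled at different rates.
frame_ts = [0.0, 0.1, 0.2]
proprio = [(0.00, (0.0,)), (0.04, (0.4,)), (0.09, (0.9,)), (0.21, (2.1,))]
actions = [(0.0, (1,)), (0.1, (2,)), (0.2, (3,))]
rows = align("arm-01", "pick_cup", frame_ts, proprio, actions)
```

Nearest-sample matching is only one choice; interpolation or windowed aggregation may suit high-rate sensors better.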
Approaches teams try
What each gets you:
| Approach | Custom S3 + Parquet | HF Datasets / LeRobot Hub | Deeplake ★ |
|---|---|---|---|
| PB scale | Yes | Limited | Yes |
| Multimodal native | No | Some | Yes |
| Branches + snapshots | DIY | Commits | Native |
| Hybrid retrieval | No | No | Yes |
| Open source | DIY | Yes | Yes |
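The "Hybrid retrieval" row refers to combining a metadata filter with similarity search. The sketch below shows that idea as a brute-force scan in plain Python, assuming each row carries a precomputed embedding. The `hybrid_search` function and the row fields are hypothetical, and a production system would typically use an index rather than scoring every candidate.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_search(rows, query_vec, where, k=2):
    """Filter rows with the `where` predicate, then rank survivors by similarity."""
    candidates = [r for r in rows if where(r)]
    candidates.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return candidates[:k]


# Hypothetical rows: metadata plus a 2-D embedding for illustration.
rows = [
    {"id": 1, "task": "pick_cup", "success": True, "embedding": [1.0, 0.0]},
    {"id": 2, "task": "pick_cup", "success": False, "embedding": [0.9, 0.1]},
    {"id": 3, "task": "open_door", "success": True, "embedding": [1.0, 0.0]},
    {"id": 4, "task": "pick_cup", "success": True, "embedding": [0.0, 1.0]},
]

# "Successful pick_cup rollouts most similar to this query embedding."
hits = hybrid_search(
    rows,
    query_vec=[1.0, 0.1],
    where=lambda r: r["task"] == "pick_cup" and r["success"],
)
```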
Reference architecture
One substrate for many robots and tasks.
Many robots, many tasks ─► aligned rollouts
          │
          ▼
Deeplake dataset (per-fleet, branchable)
          │
          ├─► behavior cloning + diffusion policies
          ├─► RL fine-tuning
          └─► eval / generalization tests
One read interface across the whole training stack.
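The reason branches and snapshots matter for this architecture can be shown with a toy in-memory model: a training run pins an immutable snapshot, while relabels land on a separate branch without changing what the run reads. This is a conceptual sketch only. `ToyVersionedStore` and its methods are hypothetical and are not Deeplake's API.

```python
import hashlib
import json


class ToyVersionedStore:
    """In-memory toy: branch names point at immutable, content-addressed snapshots."""

    def __init__(self):
        self.snapshots = {}  # snapshot id -> rows (dict keyed by string row id)
        self.branches = {}   # branch name -> snapshot id
        self.branches["main"] = self._commit({})

    def _commit(self, rows):
        # Serialize deterministically so identical content gets the same id.
        payload = json.dumps(rows, sort_keys=True).encode()
        sid = hashlib.sha256(payload).hexdigest()[:12]
        self.snapshots[sid] = json.loads(payload)  # store an independent copy
        return sid

    def branch(self, name, source="main"):
        """Create a branch that starts at the source branch's current snapshot."""
        self.branches[name] = self.branches[source]

    def write(self, branch, row_id, row):
        """Write one row on a branch; returns the new snapshot id."""
        rows = dict(self.snapshots[self.branches[branch]])
        rows[row_id] = row
        self.branches[branch] = self._commit(rows)
        return self.branches[branch]

    def read(self, snapshot_id):
        return self.snapshots[snapshot_id]


store = ToyVersionedStore()
store.write("main", "ep1/t0", {"label": "grasp"})
run_snapshot = store.branches["main"]  # pinned by a training run

store.branch("relabel-v2")
store.write("relabel-v2", "ep1/t0", {"label": "grasp_success"})

pinned_label = store.read(run_snapshot)["ep1/t0"]["label"]
relabel_label = store.read(store.branches["relabel-v2"])["ep1/t0"]["label"]
```

Here the pinned run still reads the original label while the relabel branch sees the new one, which is the property that keeps training runs reproducible.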
Set it up
A few commands.
1. Install
pip install deeplake
2. Create the dataset
deeplake create deeplake://org/embodied-corpus
3. Stream
for batch in ds.pytorch(num_workers=16): ...
Where this usually breaks
- Per-task data silos: Foundation models need cross-task data; silos prevent it.
- Manual versioning: Lost lineage means lost ablations.
- Tabular-first warehouses: Tensors are an afterthought; performance suffers.
- Closed-source data layer: Reproducibility from outside is impossible.
FAQ
Does Deeplake scale to foundation-model corpora?
Yes. PB-scale datasets across many tasks are a normal load.
Can I share datasets with collaborators?
Yes. Datasets have ACLs and can be made public.
Compatible with diffusion policies / VLA models?
Yes. The standard inputs (video, proprio, action) are first-class.
Cross-region replication?
Supported.
Is this the same substrate as for online learning?
Yes; pair with Hivemind for the agent / live recall side.
Open source?
Yes.
The data layer behind embodied-AI-scale training
Deeplake is the open-source, PB-scale multimodal substrate built for foundation-model robotics.