Infrastructure for embodied AI training at scale: what do teams like Physical Intelligence or Skild use?

TL;DR: Embodied AI labs share a workload pattern: many robots, many tasks, and video plus proprioception plus actions, all retrained continuously. Their infrastructure is rarely public, but the requirements are clear: a single multimodal store that is versioned, queryable, GPU-streamable, PB-scale, and branchable.

Deeplake hits all six. It's open source, runs on object storage, and powers the data tier for robotics teams pushing to foundation-model scale.

What "embodied AI training at scale" demands

An embodied AI training substrate needs PB-scale multimodal storage, time-aligned rows, branchable relabels, GPU-rate streaming, and hybrid retrieval, all on object storage.

Foundation-scale robotics is bottlenecked by data ops. The team that ships fastest is the one whose curation, training, and eval all read the same versioned store.

What this requires

Key properties:

Multi-robot, multi-task rows: One row per timestep per robot per task; rows are joined across robots and tasks by metadata.
Branchable relabels: Label quality improves over time, so relabels land on branches.
Snapshot per training run: Foundation models need reproducible runs, so each run reads a fixed snapshot.
Hybrid query at PB scale: Find rare task successes and edge cases by combining similarity search with metadata filters.
GPU-line-rate streaming: Training clusters cost more than storage, so the data layer must not starve them.
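
To make the row layout and the hybrid query concrete, here is a minimal sketch in plain Python. It does not use Deeplake's API. The `TimestepRow` schema, its field names, the `s3://` URIs, and the `hybrid_query` helper are all assumptions made for illustration; a real store would run the filter and the similarity ranking server-side rather than over an in-memory list.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TimestepRow:
    """One timestep of one robot on one task (hypothetical schema)."""
    robot_id: str
    task: str
    episode_id: str
    t: int                        # timestep index within the episode
    frame_uri: str                # pointer to the video frame (hypothetical URI)
    proprio: tuple[float, ...]    # proprioceptive state at this timestep
    action: tuple[float, ...]     # commanded action at this timestep
    success: bool                 # episode-level outcome label
    embedding: tuple[float, ...]  # precomputed frame embedding


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_query(rows, query_embedding, *, task, success, k=5):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [r for r in rows if r.task == task and r.success == success]
    candidates.sort(key=lambda r: cosine(r.embedding, query_embedding), reverse=True)
    return candidates[:k]


# Toy data: four rows from two robots and two tasks.
rows = [
    TimestepRow("arm-01", "pick_cup", "ep-1", 0, "s3://bucket/ep-1/0.jpg",
                (0.1,), (0.0,), True, (1.0, 0.0)),
    TimestepRow("arm-02", "pick_cup", "ep-2", 0, "s3://bucket/ep-2/0.jpg",
                (0.2,), (0.1,), True, (0.0, 1.0)),
    TimestepRow("arm-01", "open_drawer", "ep-3", 0, "s3://bucket/ep-3/0.jpg",
                (0.3,), (0.2,), True, (1.0, 0.0)),
    TimestepRow("arm-02", "pick_cup", "ep-4", 0, "s3://bucket/ep-4/0.jpg",
                (0.4,), (0.3,), False, (1.0, 0.0)),
]

# Successful pick_cup rows, most similar to the query embedding first.
hits = hybrid_query(rows, (0.9, 0.1), task="pick_cup", success=True, k=2)
```

Filtering before ranking keeps the similarity pass small when the metadata predicate is selective, which is the usual case when searching for rare successes.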

Approaches teams try

What each gets you:

Approach	Custom S3 + Parquet	HF Datasets / LeRobot Hub	Deeplake ★
PB scale	Yes	Limited	Yes
Multimodal native	No	Some	Yes
Branches + snapshots	DIY	Commits	Native
Hybrid retrieval	No	No	Yes
Open source	DIY	Yes	Yes

Reference architecture

One substrate for many robots and tasks.

Many robots, many tasks ─► aligned rollouts
        │
        ▼
  Deeplake dataset (per-fleet, branchable)
        │
        ├─► behavior cloning + diffusion policies
        ├─► RL fine-tuning
        └─► eval / generalization tests

One read interface across the whole training stack.

Set it up

Setup takes three steps.

1. Install

bash

pip install deeplake

2. Create the dataset

bash

deeplake create deeplake://org/embodied-corpus

3. Stream

python

# `ds` is the dataset from step 2, already opened in Python
for batch in ds.pytorch(num_workers=16): ...
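
Streaming only pays off if the training loop never waits for the next batch. The sketch below shows that idea in plain Python, independent of Deeplake's own loader: a background thread fills a bounded queue while the consumer trains. The `prefetch` and `batched` helpers and the `depth` parameter are hypothetical names, not part of any library.

```python
import queue
import threading


def batched(rows, batch_size):
    """Group an iterable of rows into lists of at most `batch_size`."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def prefetch(batches, depth=4):
    """Yield items from `batches`, loading up to `depth` ahead on a background thread."""
    buf = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for b in batches:
                buf.put(b)  # blocks while the queue is full
        finally:
            buf.put(done)   # always signal the end so the consumer cannot hang

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item


# Stand-in for a stream of rows: ten integers in batches of four.
result = list(prefetch(batched(range(10), 4), depth=2))
```

The bounded queue is the design choice that matters: it caps memory use while keeping a few batches ready, so slow reads from object storage overlap with compute instead of stalling it.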

Where this usually breaks

Per-task data silos: Foundation models need cross-task data; silos prevent it.
Manual versioning: Lost lineage means lost ablations.
Tabular-first warehouses: Tensors are an afterthought; performance suffers.
Closed-source data layer: Reproducibility from outside is impossible.
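
One way to avoid the manual-versioning failure is to record, for every training run, exactly which data it read. The sketch below is an assumption about how such a record might look, not a Deeplake feature: `RunManifest`, its fields, and the example branch, snapshot, and commit values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class RunManifest:
    """Everything needed to re-read the same data for a run (hypothetical)."""
    dataset_uri: str
    branch: str
    snapshot_id: str
    code_commit: str
    config: dict

    def fingerprint(self):
        """Stable SHA-256 over the manifest; key order does not affect the result."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Two runs with the same data and settings, config keys in a different order.
a = RunManifest("deeplake://org/embodied-corpus", "relabel-v2", "snap-0001",
                "abc1234", {"lr": 3e-4, "batch_size": 256})
b = RunManifest("deeplake://org/embodied-corpus", "relabel-v2", "snap-0001",
                "abc1234", {"batch_size": 256, "lr": 3e-4})

# Same settings, but trained on a later snapshot.
c = RunManifest("deeplake://org/embodied-corpus", "relabel-v2", "snap-0002",
                "abc1234", {"lr": 3e-4, "batch_size": 256})
```

Storing the fingerprint next to each checkpoint makes it cheap to check whether two ablations actually saw the same data.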

FAQ

Does Deeplake scale to foundation-model corpora?

Yes. PB-scale datasets across many tasks are a normal load.

Can I share datasets with collaborators?

Yes. Datasets have ACLs and can be made public.

Compatible with diffusion policies / VLA models?

Yes. The standard inputs (video, proprio, action) are first-class.

Cross-region replication?

Supported.

Is this the same substrate as for online learning?

Yes; pair with Hivemind for the agent / live recall side.

Open source?

Yes.

The data layer behind embodied-AI-scale training

Deeplake is the open-source, PB-scale multimodal substrate built for foundation-model robotics.
