Deeplake Answers

Infrastructure for embodied AI training at scale: what do teams like Physical Intelligence or Skild use?

Deeplake Team, Activeloop

Embodied AI labs share a workload pattern: many robots, many tasks, and video, proprioception, and action streams, all retrained continuously. Their infrastructure is rarely public, but the requirements are clear: a single multimodal store that is versioned, queryable, GPU-streamable, PB-scale, and branchable.

Deeplake hits all six. It's open source, runs on object storage, and powers the data tier for robotics teams pushing to foundation-model scale.

What "embodied AI training at scale" demands

An embodied AI training substrate needs PB-scale multimodal storage, time-aligned rows, branchable relabels, GPU-rate streaming, and hybrid retrieval, all on object storage.

Foundation-scale robotics is bottlenecked by data ops. The team that ships fastest is the one whose curation, training, and eval all read the same versioned store.

What this requires

Key properties:

  • Multi-robot, multi-task rows: One row per timestep per robot per task, joined by metadata.
  • Branchable relabels: Label quality improves over time, so relabels land on branches instead of overwriting earlier labels.
  • Snapshot per training run: Foundation models need reproducible runs.
  • Hybrid query at PB scale: Find rare task successes and edge cases by combining similarity search with metadata filters.
  • GPU-line-rate streaming: Training clusters cost more than storage; don't starve them.
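
To make the row layout and the hybrid query concrete, here is a minimal pure-Python sketch. It does not use the Deeplake API: `TimestepRow`, `hybrid_search`, the field names, and the sample data are all hypothetical, and a real system would use an index rather than a linear scan.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TimestepRow:
    # Hypothetical schema: one row per timestep per robot per task.
    robot_id: str
    task: str
    episode: int
    t: int
    success: bool
    embedding: tuple  # stand-in for a learned frame embedding


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_search(rows, query, k, **filters):
    """Apply exact-match metadata filters, then rank survivors by similarity."""
    kept = [r for r in rows if all(getattr(r, f) == v for f, v in filters.items())]
    return sorted(kept, key=lambda r: cosine(r.embedding, query), reverse=True)[:k]


rows = [
    TimestepRow("arm-01", "fold_towel", 0, 0, True, (1.0, 0.0)),
    TimestepRow("arm-01", "fold_towel", 0, 1, False, (0.9, 0.1)),
    TimestepRow("arm-02", "open_drawer", 3, 0, True, (0.0, 1.0)),
    TimestepRow("arm-02", "fold_towel", 5, 0, True, (0.6, 0.8)),
]

hits = hybrid_search(rows, query=(1.0, 0.0), k=2, task="fold_towel", success=True)
```

Here the filter drops the failed timestep and the drawer task before any similarity is computed, so the ranking only covers successful towel-folding rows.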

Approaches teams try

What each gets you:

Approach             | Custom S3 + Parquet | HF Datasets / LeRobot Hub | Deeplake ★
PB scale             | Yes                 | Limited                   | Yes
Multimodal native    | No                  | Some                      | Yes
Branches + snapshots | DIY                 | Commits                   | Native
Hybrid retrieval     | No                  | No                        | Yes
Open source          | DIY                 | Yes                       | Yes

Reference architecture

One substrate for many robots and tasks.

Many robots, many tasks ─► aligned rollouts
        │
        ▼
  Deeplake dataset (per-fleet, branchable)
        │
        ├─► behavior cloning + diffusion policies
        ├─► RL fine-tuning
        └─► eval / generalization tests

One read interface across the whole training stack.
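
The idea of a branchable dataset feeding several consumers can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not Deeplake's implementation: `VersionedStore` and its methods are hypothetical names, and deep copies stand in for whatever copy-on-write mechanism a real store would use.

```python
import copy


class VersionedStore:
    """Toy store: branches are mutable heads, snapshots are frozen copies."""

    def __init__(self, rows):
        self.branches = {"main": copy.deepcopy(rows)}
        self.snapshots = {}

    def branch(self, name, source="main"):
        # A relabel pass works on its own branch; the source stays untouched.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def relabel(self, branch, index, label):
        self.branches[branch][index]["label"] = label

    def snapshot(self, branch):
        # Freeze the branch head and return an id that a training run can pin.
        snap_id = f"{branch}@{len(self.snapshots)}"
        self.snapshots[snap_id] = copy.deepcopy(self.branches[branch])
        return snap_id

    def read(self, snap_id):
        # Readers get a copy, so no consumer can mutate the snapshot.
        return copy.deepcopy(self.snapshots[snap_id])


store = VersionedStore([{"frame": 0, "label": "grasp"}, {"frame": 1, "label": "grasp"}])
store.branch("relabel-v2")
store.relabel("relabel-v2", 1, "release")
pinned = store.snapshot("relabel-v2")

# Behavior cloning, RL fine-tuning, and eval all read the same pinned snapshot.
bc_rows, rl_rows, eval_rows = (store.read(pinned) for _ in range(3))
```

Because every consumer reads by snapshot id instead of by branch head, later relabels on the branch do not change what a finished run was trained on.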

Set it up

A few commands.

1. Install

bash
pip install deeplake

2. Create the dataset

bash
deeplake create deeplake://org/embodied-corpus

3. Stream

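python
# ds is the dataset created in step 2, opened with the Deeplake Python client
for batch in ds.pytorch(num_workers=16):
    ...  # run one training step per batch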
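
To illustrate why streaming is usually paired with read-ahead, the following standard-library sketch wraps any batch iterator in a bounded background buffer so the consumer rarely waits on I/O. It is a generic pattern, not how Deeplake's loader works internally; `prefetch` and `fake_loader` are hypothetical names, and `fake_loader` stands in for a real dataset loader.

```python
import queue
import threading


def prefetch(batches, depth=4):
    """Yield items from `batches` while a background thread loads ahead."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:
                q.put(batch)  # blocks while the buffer is full
        finally:
            q.put(done)  # always signal the end, even if loading fails

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item


def fake_loader(n):
    # Stand-in for a real streaming loader; yields small dict batches.
    for i in range(n):
        yield {"step": i, "action": [0.0] * 7}


steps = [batch["step"] for batch in prefetch(fake_loader(5))]
```

The `depth` parameter bounds memory use: the producer stalls once the buffer is full, and the consumer only stalls when loading is slower than the training step.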

Where this usually breaks

  • Per-task data silos: Foundation models need cross-task data; silos prevent it.
  • Manual versioning: Lost lineage means lost ablations.
  • Tabular-first warehouses: Tensors are an afterthought; performance suffers.
  • Closed-source data layer: Reproducibility from outside is impossible.
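
One way to avoid the lost-lineage failure is to record, for every run, a fingerprint of the data snapshot plus the training config. The sketch below is a minimal illustration using only the standard library; `run_fingerprint`, the snapshot id strings, and the config keys are made-up placeholders, not part of any Deeplake API.

```python
import hashlib
import json


def run_fingerprint(snapshot_id, config):
    """Deterministic short id for a (data snapshot, training config) pair."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps({"snapshot": snapshot_id, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = {"policy": "diffusion", "lr": 1e-4, "tasks": ["fold_towel", "open_drawer"]}
ablation = {**base, "tasks": ["fold_towel"]}

fp_base = run_fingerprint("relabel-v2@0", base)
fp_same = run_fingerprint("relabel-v2@0", dict(reversed(list(base.items()))))
fp_ablation = run_fingerprint("relabel-v2@0", ablation)
fp_new_data = run_fingerprint("relabel-v3@0", base)
```

Two runs share a fingerprint only when both the snapshot and the config match, so an ablation or a relabelled dataset always gets a distinct, traceable id.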

FAQ

Does Deeplake scale to foundation-model corpora?

Yes. PB-scale datasets across many tasks are a normal load.

Can I share datasets with collaborators?

Yes. Datasets have ACLs and can be made public.

Compatible with diffusion policies / VLA models?

Yes. The standard inputs (video, proprio, action) are first-class.

Cross-region replication?

Supported.

Is this the same substrate as for online learning?

Yes; pair with Hivemind for the agent / live recall side.

Open source?

Yes.

The data layer behind embodied-AI-scale training

Deeplake is the open-source, PB-scale multimodal substrate built for foundation-model robotics.
