A Deployable Annotation Service for Robotics Datasets

Robotics teams are collecting more multimodal demonstrations than ever, but turning those demonstrations into train-ready datasets still requires a large amount of manual interpretation. A raw episode may contain images, states, actions, timestamps, and a short task instruction, yet the fields that make the data useful for training and debugging often live outside the dataset: goals, phases, segment boundaries, quality signals, modality disagreements, and review status.

Roboscribe-AF shows how Deeplake and AgentField can be combined into a practical annotation layer for robotics data. Deeplake provides the versioned multimodal dataset layer, while AgentField runs the reasoning workflows that inspect episodes, create structured annotations, flag uncertainty, and write results back into dataset branches. The result is a deployable pattern for transforming robot demonstrations into queryable, reviewable, and train-ready data.

Robot datasets increasingly need more than a task string.

Open X-Embodiment showed the value of pooling robot demonstrations across embodiments and tasks. OpenVLA trains a vision-language-action (VLA) policy on robot demonstration data. Embodied Chain-of-Thought (ECoT) goes further and trains VLAs to reason over intermediate plans, sub-tasks, motions, object boxes, and end-effector positions before predicting actions.

That literature points to a practical gap. Many demonstrations arrive as video, state, action, timestamps, and a sparse instruction. A useful training record often needs richer fields: episode goal, segment boundaries, phase labels, modality agreement, anomaly flags, and review status.

Roboscribe-AF is our open-source example of that middle layer: a deployable annotation service built from two infrastructure pieces.

Layer	Role
Deeplake	Multimodal dataset, branches, tensors, embeddings, queries
AgentField	Reasoners, deterministic skills, async execution, workflow trace

The separation matters. Deeplake stores the corpus and annotation versions. AgentField runs the reasoning graph that produces new annotation rows.

The annotation

For each episode, Roboscribe-AF loads keyframes and action/state trajectories, runs a visual thread and an action thread, segments the episode, checks whether the modality stories agree, embeds the scene summary, and writes the result to a Deeplake branch.

The output is deliberately plain:

json

{
  "episode_id": 0,
  "episode_goal": "Push the gray T-shaped block into the green target outline",
  "segments": [
    {"start_frame": 0, "end_frame": 21, "phase": "approach"},
    {"start_frame": 21, "end_frame": 40, "phase": "manipulate"}
  ],
  "visual_phase": "manipulate",
  "action_phase": "approach",
  "consistency_score": 0.2,
  "human_review_recommended": true
}

The mismatch is treated as data. The visual reasoner and trajectory reasoner stay separate until a verifier compares them. If they disagree, the disagreement becomes a dataset field that can be queried, reviewed, or filtered.
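One way to picture the verifier step is as a small function that takes the two independent phase labels plus a score and emits the row fields shown above. This is a minimal sketch, not the repository's verifier: `verify_modalities`, `ModalityVerdict`, and the `REVIEW_THRESHOLD` value of 0.5 are hypothetical, and the `score` argument stands in for whatever the model-backed verifier returns.

```python
from dataclasses import dataclass

# Hypothetical threshold: scores below this are routed to a human reviewer.
REVIEW_THRESHOLD = 0.5


@dataclass
class ModalityVerdict:
    visual_phase: str
    action_phase: str
    consistency_score: float
    human_review_recommended: bool


def verify_modalities(visual_phase: str, action_phase: str,
                      score: float) -> ModalityVerdict:
    """Combine two independent phase labels and a verifier score into row fields.

    The score is clamped to [0, 1] so downstream queries can rely on the range.
    """
    clamped = max(0.0, min(1.0, score))
    # A label mismatch or a low score is enough to request review.
    needs_review = visual_phase != action_phase or clamped < REVIEW_THRESHOLD
    return ModalityVerdict(visual_phase, action_phase, clamped, needs_review)


verdict = verify_modalities("manipulate", "approach", score=0.2)
print(verdict)
```

Because the verdict is just a record, the disagreement travels with the episode instead of being resolved silently inside the annotator.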

The data layer

Deeplake's documented surface covers both the artifacts we need to store and the operations we need on them: images, embeddings, tensors, and text, plus vector search, versioning, and PyTorch/TensorFlow streaming (core docs). Its LeRobot guide shows robot telemetry, frames, state/action arrays, episode indices, and task descriptions as queryable, streamable data (LeRobot integration). Its VLA guide uses data stored in Deeplake for fine-tuning (VLA fine-tuning).

Roboscribe-AF keeps raw and derived fields in the same schema:

python

{
    "episode_id": Int32,
    "keyframes_png": Sequence(Bytes),
    "actions": Array(Float32, 2),
    "states": Array(Float32, 2),
    "lang_episode_goal": Text,
    "visual_phase": Text,
    "action_phase": Text,
    "consistency_score": Float32,
    "human_review_recommended": Bool,
    "scene_embedding": Embedding(size=1024),
    "annotation_version": Text,
}

That makes branch-level annotation practical. Raw data can remain on main; a first annotation pass can write to roboscribe-v1; a stricter verifier can write to roboscribe-v2; a reviewed subset can become a train-ready branch.
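To make the value of per-branch annotation concrete, the sketch below compares two annotation passes over the same episodes and reports which fields changed. It deliberately uses plain Python dictionaries instead of Deeplake calls: `diff_branches` is a hypothetical helper, and the rows are assumed to have been read from the two branches beforehand by whatever means the project uses.

```python
from typing import Dict, List

Row = Dict[str, object]


def diff_branches(old_rows: List[Row], new_rows: List[Row],
                  fields: List[str]) -> Dict[int, Dict[str, tuple]]:
    """Return {episode_id: {field: (old, new)}} for fields that changed."""
    old_by_id = {row["episode_id"]: row for row in old_rows}
    changes = {}
    for new in new_rows:
        old = old_by_id.get(new["episode_id"])
        if old is None:
            continue  # episode only exists on the newer branch
        delta = {f: (old.get(f), new.get(f)) for f in fields
                 if old.get(f) != new.get(f)}
        if delta:
            changes[new["episode_id"]] = delta
    return changes


# Illustrative rows standing in for a first pass and a stricter second pass.
v1 = [
    {"episode_id": 0, "visual_phase": "manipulate", "consistency_score": 0.7},
    {"episode_id": 1, "visual_phase": "approach", "consistency_score": 0.9},
]
v2 = [
    {"episode_id": 0, "visual_phase": "manipulate", "consistency_score": 0.2},
    {"episode_id": 1, "visual_phase": "approach", "consistency_score": 0.9},
]
changed = diff_branches(v1, v2, ["visual_phase", "consistency_score"])
print(changed)
```

A diff like this is one way to decide whether a stricter verifier pass changed enough rows to justify promoting it.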

It also keeps queries close to training decisions:

sql

SELECT episode_id, lang_episode_goal, consistency_score
WHERE visual_phase = 'manipulate' AND consistency_score > 0.6

The execution layer

The service has a wide runtime graph but a small code footprint. Roboscribe-AF registers 16 reasoners and 8 skills. A corpus run fans out into loaders, per-keyframe object detectors, scene reasoners, action reasoners, boundary judges, segment narrators, verifiers, embedding calls, Deeplake writes, and branch comparisons.

The developer surface is just named units:

python

@router.skill()
async def commit_annotation_to_branch(...):
    ...
 
@router.reasoner()
async def visual_thread(...):
    ...

Skills do deterministic work: load frames, compute velocity summaries, query Deeplake, write branches. Reasoners do model-backed judgment: detect objects, classify phases, judge boundaries, reconcile modalities. AgentField exposes both as callable targets and tracks parent-child executions through its control plane, as described in its how-it-works docs.
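As an illustration of the deterministic side, here is what a velocity-summary skill body might look like once the AgentField decorator is stripped away. It is a sketch under stated assumptions: the function name, the returned keys, and the fixed time step `dt` are hypothetical, and the repository's own skill may compute different statistics.

```python
import math
from typing import Dict, List, Sequence


def velocity_summary(states: Sequence[Sequence[float]],
                     dt: float) -> Dict[str, float]:
    """Finite-difference speed statistics over a (T, D) state trajectory."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(states) < 2:
        return {"mean_speed": 0.0, "max_speed": 0.0, "path_length": 0.0}
    # Euclidean distance between each pair of consecutive states.
    step_lengths: List[float] = [
        math.dist(prev, curr) for prev, curr in zip(states, states[1:])
    ]
    speeds = [length / dt for length in step_lengths]
    return {
        "mean_speed": sum(speeds) / len(speeds),
        "max_speed": max(speeds),
        "path_length": sum(step_lengths),
    }


summary = velocity_summary([(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)], dt=0.5)
print(summary)
```

The same inputs always give the same summary, which is what separates a skill from a reasoner that calls a model.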

The fan-out remains ordinary Python:

python

visual, action = await asyncio.gather(
    composer_router.call("roboscribe-af.visual_thread", keyframes_b64=frames),
    composer_router.call("roboscribe-af.action_thread", states=states),
)

That is the useful property: the deployed system has a real workflow graph, async API, and UI trace, but the implementation is still a set of small domain functions.
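The two-stage fan-out can be tried without AgentField at all. The sketch below replaces the router calls with local stub coroutines (all hypothetical, including their return values) to show the shape: two modality threads gathered concurrently, then one narrator task per detected segment.

```python
import asyncio
from typing import Dict, List


async def visual_thread(keyframes: List[str]) -> Dict[str, str]:
    await asyncio.sleep(0)  # stand-in for a model call
    return {"visual_phase": "manipulate"}


async def action_thread(states: List[List[float]]) -> Dict[str, str]:
    await asyncio.sleep(0)  # stand-in for a model call
    return {"action_phase": "approach"}


async def narrate_segment(segment: Dict[str, object]) -> str:
    await asyncio.sleep(0)  # stand-in for a model call
    return (f"{segment['phase']}: frames "
            f"{segment['start_frame']}-{segment['end_frame']}")


async def annotate_episode(keyframes, states, segments):
    # Stage 1: the two modality threads run concurrently and stay independent.
    visual, action = await asyncio.gather(
        visual_thread(keyframes), action_thread(states)
    )
    # Stage 2: one narrator task per segment, so the width follows the data.
    narrations = await asyncio.gather(*(narrate_segment(s) for s in segments))
    return {**visual, **action, "narrations": list(narrations)}


segments = [
    {"start_frame": 0, "end_frame": 21, "phase": "approach"},
    {"start_frame": 21, "end_frame": 40, "phase": "manipulate"},
]
result = asyncio.run(annotate_episode(["frame0"], [[0.0, 0.0]], segments))
print(result)
```

`asyncio.gather` returns results in the order the awaitables were passed, so narrations line up with their segments even when the calls finish out of order.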

What exists now

The repository currently includes:

PushT and Aloha-style task adapters.
Parallel visual and action modality threads.
Segment narrator fan-out based on detected segment count.
Cross-modal consistency scoring.
Deeplake ingestion and annotation branch writes.
TQL (Tensor Query Language) examples, semantic search over scene embeddings, and branch comparison.
Docker Compose deployment for the AgentField control plane and Roboscribe-AF agent service.

Run it locally:

bash

cd code/examples/roboscribe-af
cp .env.example .env
# Add OPENROUTER_API_KEY.
docker compose up --build
./scripts/run_demo.sh

During the run, AgentField shows the reasoning DAG. Deeplake holds the resulting branch.

Where this pattern goes

In a robotics lab, the same architecture becomes a data engine rather than a one-off annotator.

First, reactive annotation. AgentField documents webhook triggers, schedules, memory triggers, async execution, and workflow DAGs in its production capabilities. A lab can ingest new teleoperation episodes into Deeplake, trigger an annotation worker, write low-confidence rows to a review queue, and promote approved rows into a train-ready branch.
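The routing decision in that loop is small enough to sketch directly. The function below splits annotated rows into a promote list and a review list. `route_rows` and the 0.6 cutoff are assumptions chosen to mirror the query threshold used earlier, not values taken from the repository.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def route_rows(rows: Iterable[Row],
               min_score: float = 0.6) -> Tuple[List[Row], List[Row]]:
    """Split annotated rows into (promote, review) lists."""
    promote: List[Row] = []
    review: List[Row] = []
    for row in rows:
        flagged = bool(row.get("human_review_recommended", False))
        # Rows with no score are treated as low confidence.
        low_score = float(row.get("consistency_score", 0.0)) < min_score
        (review if flagged or low_score else promote).append(row)
    return promote, review


rows = [
    {"episode_id": 0, "consistency_score": 0.2, "human_review_recommended": True},
    {"episode_id": 1, "consistency_score": 0.9, "human_review_recommended": False},
    {"episode_id": 2, "consistency_score": 0.4, "human_review_recommended": False},
]
promote, review = route_rows(rows)
print([r["episode_id"] for r in promote], [r["episode_id"] for r in review])
```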

Second, curation. BridgeData V2 and Open X-Embodiment are reminders that scale and diversity matter, while Re-Mix and recent work on demonstration curation point toward selecting better training mixtures rather than treating every trajectory equally. An AgentField reasoner can query Deeplake for phase disagreements, rare tasks, unusual action embeddings, or weakly represented environments and create a curation queue.
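A curation queue of this kind can start as a simple pass over annotated rows. The sketch below flags phase disagreements and rare tasks only. `build_curation_queue`, the reason labels, and the `rare_task_max` cutoff are hypothetical, and the rows are assumed to have already been fetched from the dataset.

```python
from collections import Counter
from typing import Dict, List

Row = Dict[str, object]


def build_curation_queue(rows: List[Row],
                         rare_task_max: int = 1) -> List[Dict[str, object]]:
    """List episodes worth a curator's attention, with the reasons why."""
    task_counts = Counter(row["lang_episode_goal"] for row in rows)
    queue = []
    for row in rows:
        reasons = []
        if row["visual_phase"] != row["action_phase"]:
            reasons.append("phase_disagreement")
        if task_counts[row["lang_episode_goal"]] <= rare_task_max:
            reasons.append("rare_task")
        if reasons:
            queue.append({"episode_id": row["episode_id"], "reasons": reasons})
    return queue


rows = [
    {"episode_id": 0, "lang_episode_goal": "push T",
     "visual_phase": "manipulate", "action_phase": "approach"},
    {"episode_id": 1, "lang_episode_goal": "push T",
     "visual_phase": "approach", "action_phase": "approach"},
    {"episode_id": 2, "lang_episode_goal": "insert peg",
     "visual_phase": "manipulate", "action_phase": "manipulate"},
]
queue = build_curation_queue(rows)
print(queue)
```

Keeping the reasons alongside each episode lets a reviewer filter the queue by the kind of problem rather than re-deriving it.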

Third, lab automation. A robotics lab already has events: a teleop session finished, a nightly training run failed, an eval policy regressed on contact-rich tasks, a reviewer approved a batch. Those events can become backend triggers. Deeplake holds the versioned data state; AgentField runs the small pieces of reasoning and bookkeeping around it.

The broader pattern is not a shared chatbot sitting beside the dataset. It is a set of background agents attached to the pipeline itself. They are guided by schemas, branches, triggers, and review policies, and they are autonomous enough to inspect new data, enrich it, flag uncertainty, and prepare train-ready branches without waiting for a human to query every corpus change by hand. Deeplake provides governed data access and versioned state. AgentField turns that access into autonomous background work.

Roboscribe-AF is a small example, but the pattern is broader: robotics datasets should not be passive storage buckets. As robot learning pipelines scale, the dataset layer needs to support continuous enrichment, review, curation, and promotion of higher-quality training subsets.

Deeplake provides the versioned multimodal foundation for that workflow. AgentField adds the execution layer for background reasoning, deterministic data operations, workflow tracing, and review-aware automation. Together, they make it possible to build robotics data pipelines where new episodes can be ingested, annotated, checked, queried, and promoted without turning every dataset update into a manual labeling project.

For labs building vision-language-action systems, this turns annotation from a one-off preprocessing step into an operational loop: collect data, enrich it, inspect disagreements, curate useful subsets, and keep the training corpus aligned with the realities of the robot pipeline.
