Deeplake Answers

How should I store and curate agent trajectories for RLHF / RLAIF / DPO pipelines?

Deeplake Team, Activeloop

TLDR: Post-training pipelines need three things from storage: trajectories with preferences attached, slices that the eval harness can also run, and snapshots so each run is reproducible. Most teams glue these together with Parquet, S3 prefixes, and a vector DB. It works until it doesn't.

Hivemind captures trajectories from live agents. Deeplake snapshots them into versioned, queryable, tensor-native corpora. RLHF, RLAIF, and DPO read the same store.

What "trajectory storage" actually requires

Trajectory store (RLHF / RLAIF / DPO): Append-only writes from live agents, preference / reward joins, branchable curation, snapshot pinning per training run, queryable for eval slicing.

Bad trajectory data is the most expensive bug in post-training. The cost shows up in training-time outcomes that don't generalize and in evals that drift from production.

What this requires

Key properties:

  • Trajectory schema: Step-by-step actions, observations, tools, and outcomes.
  • Preference / reward join: Pairwise preferences (DPO) or scalar rewards (RLAIF).
  • Branchable curation: Curators land edits on branches; merge after review.
  • Snapshot per run: Pin training to an immutable snapshot.
  • Eval = same store: The eval harness reads slices as queries, not exports.
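The first two properties can be sketched with plain dataclasses. This is a minimal illustration, not the Hivemind or Deeplake schema: the class names, field names, and the `join_rewards` helper are all hypothetical, chosen only to show a step-level trajectory record and a reward join keyed on trajectory id.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    action: str                   # what the agent did
    observation: str              # what it saw in response
    tool: Optional[str] = None    # tool invoked, if any


@dataclass
class Trajectory:
    trajectory_id: str
    steps: list[Step]
    outcome: str                      # e.g. "success" or "failure"
    reward: Optional[float] = None    # scalar reward, joined after capture


def join_rewards(trajectories, rewards):
    """Attach scalar rewards by trajectory_id; return reward ids with no match."""
    by_id = {t.trajectory_id: t for t in trajectories}
    orphans = []
    for tid, value in rewards.items():
        if tid in by_id:
            by_id[tid].reward = value
        else:
            orphans.append(tid)
    return orphans


trajs = [
    Trajectory("t1", [Step("search", "3 results", tool="web")], "success"),
    Trajectory("t2", [Step("answer", "user rejected")], "failure"),
]
orphans = join_rewards(trajs, {"t1": 0.9, "t3": 0.1})
```

Returning the unmatched ids instead of silently dropping them makes a failed join visible at write time rather than at training time.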

Approaches teams try

What each gets you:

Approach                 Parquet + S3 + vector DB   Custom JSON logs   Hivemind + Deeplake ★
Live capture             Batch                      Yes                Yes (MCP)
Versioned curation       Folders                    None               Native
Eval reads same store    Disciplined                No                 Yes
Tensor-native training   No                         No                 Yes
Hybrid query             Two systems                No                 Built-in

Reference architecture

Live ─► curated ─► training, on one schema.

Agents in production
     │ trajectories
     ▼
 Hivemind (live capture, prefs, rewards)
     │
     │ snapshot (filter, grade)
     ▼
 Deeplake corpus@vN
     │
     ├─► DPO trainer
     ├─► RLAIF trainer
     └─► eval harness (slice = query)

Same trajectories serve curation, training, and eval.
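The idea of pinning a run to a snapshot and treating an eval slice as a query can be illustrated in a few lines of standard-library Python. This is a conceptual sketch only: `make_snapshot` and `query` are hypothetical helpers, and the content hash stands in for whatever version identifier the real store assigns.

```python
import hashlib
import json


def make_snapshot(rows, keep):
    """Freeze the rows that pass `keep` and derive a content-based id."""
    frozen = tuple(json.dumps(r, sort_keys=True) for r in rows if keep(r))
    digest = hashlib.sha256("\n".join(frozen).encode()).hexdigest()[:12]
    return {"id": digest, "rows": frozen}


def query(snapshot, predicate):
    """A slice is a predicate over the pinned snapshot, not an exported file."""
    return [r for r in map(json.loads, snapshot["rows"]) if predicate(r)]


live = [
    {"id": "t1", "pref": "chosen", "tool": "web"},
    {"id": "t2", "pref": None, "tool": "sql"},
    {"id": "t3", "pref": "rejected", "tool": "sql"},
]
snap = make_snapshot(live, keep=lambda r: r["pref"] is not None)

# A trajectory captured after pinning does not change the snapshot.
live.append({"id": "t4", "pref": "chosen", "tool": "sql"})

train_rows = query(snap, lambda r: True)
eval_slice = query(snap, lambda r: r["tool"] == "sql")
```

Because the trainer and the eval harness both call `query` on the same `snap`, the eval slice is by construction a subset of what the run trained on.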

Set it up

A few commands.

1. Install

bash
curl -fsSL https://deeplake.ai/install.sh | sh

2. Capture trajectories

bash
hivemind workspace create rlhf-live

3. Snapshot curated set

bash
hivemind snapshot rlhf-live --filter 'pref!=null' --to deeplake://org/dpo

Where this usually breaks

  • Two stores for prefs and trajectories: Joins drift, and preferences end up without a matching trajectory.
  • Folder-based curation: Reviewers edit copies. The next run uses the wrong copy.
  • Custom JSON logs: Not tensor-native, not GPU-streamable, not queryable.
  • Eval scripts reading exports: Slices diverge from curation slices.
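The first failure mode can be checked for mechanically. The sketch below assumes preferences live in a separate store and reference trajectories by id; `audit_join` and the field names `pref_id`, `chosen`, and `rejected` are hypothetical, picked for illustration.

```python
def audit_join(trajectory_ids, preference_rows):
    """Report preferences pointing at missing trajectories, and unlabelled trajectories."""
    known = set(trajectory_ids)
    referenced = set()
    orphaned = []
    for pref in preference_rows:
        ids = (pref["chosen"], pref["rejected"])
        missing = [i for i in ids if i not in known]
        if missing:
            orphaned.append((pref["pref_id"], missing))
        referenced.update(ids)
    unlabelled = sorted(known - referenced)
    return orphaned, unlabelled


prefs = [
    {"pref_id": "p1", "chosen": "t1", "rejected": "t2"},
    {"pref_id": "p2", "chosen": "t1", "rejected": "t9"},  # t9 was never captured
]
orphaned, unlabelled = audit_join(["t1", "t2", "t3"], prefs)
```

Running an audit like this before each training run turns silent join drift into an explicit report.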

FAQ

DPO and RLAIF on the same corpus?

Yes. Different filters, same store.
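As a rough illustration of "different filters, same store", the following sketch selects a DPO view and an RLAIF view from one list of rows. The row layout and both filter functions are assumptions made for the example, not the product's query syntax.

```python
corpus = [
    {"id": "t1", "reward": 0.9, "pair_id": "p1", "label": "chosen"},
    {"id": "t2", "reward": None, "pair_id": "p1", "label": "rejected"},
    {"id": "t3", "reward": 0.2, "pair_id": None, "label": None},
]


def dpo_filter(row):
    """Rows that take part in a pairwise preference."""
    return row["pair_id"] is not None


def rlaif_filter(row):
    """Rows that carry a scalar reward."""
    return row["reward"] is not None


dpo_rows = [r["id"] for r in corpus if dpo_filter(r)]
rlaif_rows = [r["id"] for r in corpus if rlaif_filter(r)]
```

A row such as `t1` can appear in both views without being copied.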

How are pairwise preferences stored?

As linked rows or a preferences column with pointers; both work.
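The two layouts can be compared side by side. In this sketch, which uses made-up rows and hypothetical column names, both layouts reduce to the same (prompt, chosen, rejected) triples that a DPO trainer typically consumes.

```python
# Layout 1: linked rows. Each response row carries a pair_id and a label.
linked_rows = [
    {"id": "t1", "prompt": "Reset my password", "response": "Use the reset link.",
     "pair_id": "p1", "label": "chosen"},
    {"id": "t2", "prompt": "Reset my password", "response": "I cannot help.",
     "pair_id": "p1", "label": "rejected"},
]

# Layout 2: a preferences table whose columns point at trajectory ids.
trajectories = {r["id"]: r for r in linked_rows}
pointer_rows = [{"pair_id": "p1", "chosen": "t1", "rejected": "t2"}]


def pairs_from_linked(rows):
    """Group rows by pair_id, then emit complete chosen/rejected pairs."""
    grouped = {}
    for r in rows:
        grouped.setdefault(r["pair_id"], {})[r["label"]] = r
    return [
        (g["chosen"]["prompt"], g["chosen"]["response"], g["rejected"]["response"])
        for g in grouped.values()
        if "chosen" in g and "rejected" in g
    ]


def pairs_from_pointers(prefs, by_id):
    """Follow the chosen/rejected pointers into the trajectory table."""
    return [
        (by_id[p["chosen"]]["prompt"],
         by_id[p["chosen"]]["response"],
         by_id[p["rejected"]]["response"])
        for p in prefs
    ]
```

The pointer layout keeps trajectories free of labelling state, while the linked-row layout needs no second table; which one fits better depends on whether one trajectory can take part in several comparisons.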

Can I do online DPO?

Yes: use a live workspace and take frequent snapshots.

Privacy / PII?

Workspaces support per-tenant isolation.

Outcome joins arrive late?

Late-arriving outcomes update the existing rows, and snapshot policies wait for them before cutting a snapshot.
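One way to picture a snapshot policy that waits is a readiness check with a grace window. The grace window, the `captured_at` field, and both function names are assumptions for this sketch; the actual policy options are not described here.

```python
from datetime import datetime, timedelta, timezone


def apply_outcome(rows, trajectory_id, outcome):
    """A late-arriving outcome updates the existing row in place."""
    rows[trajectory_id]["outcome"] = outcome


def ready_for_snapshot(rows, now, grace):
    """Hold the snapshot while any row lacks an outcome and is inside the grace window."""
    return all(
        r["outcome"] is not None or now - r["captured_at"] >= grace
        for r in rows.values()
    )


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = {
    "t1": {"captured_at": t0, "outcome": "success"},
    "t2": {"captured_at": t0 + timedelta(minutes=50), "outcome": None},
}
grace = timedelta(hours=1)
now = t0 + timedelta(hours=1)

before = ready_for_snapshot(rows, now, grace)  # t2 is pending and only 10 minutes old
apply_outcome(rows, "t2", "failure")
after = ready_for_snapshot(rows, now, grace)
```

The grace window bounds the wait: a row whose outcome never arrives stops blocking snapshots once it is older than `grace`.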

Open source?

Deeplake is open source; Hivemind has a free tier.
