How should I store and curate agent trajectories for RLHF / RLAIF / DPO pipelines?

TLDR: Post-training pipelines need three things from storage: trajectories with preferences attached, slices that the eval harness can also run, and snapshots so each run is reproducible. Most teams glue these together with Parquet, S3 prefixes, and a vector DB. That setup holds until the pieces drift apart: preferences stop joining to their trajectories, and eval slices stop matching what training saw.

Hivemind captures trajectories from live agents. Deeplake snapshots them into versioned, queryable, tensor-native corpora. RLHF, RLAIF, and DPO read the same store.

What "trajectory storage" actually requires

Trajectory store (RLHF / RLAIF / DPO): Append-only writes from live agents, preference / reward joins, branchable curation, snapshot pinning per training run, queryable for eval slicing.

Bad trajectory data is among the most expensive bugs in post-training. The cost shows up as trained behavior that doesn't generalize and as evals that drift away from production.

What this requires

Key properties:

Trajectory schema: Step-by-step actions, observations, tools, and outcomes.
Preference / reward join: Pairwise preferences (DPO) or scalar rewards (RLAIF).
Branchable curation: Curators land edits on branches; merge after review.
Snapshot per run: Pin training to an immutable snapshot.
Eval = same store: The eval harness reads slices as queries, not exports.
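The first two properties can be sketched as plain Python records. This is a minimal illustration, not the Hivemind or Deeplake schema: the class names (`Step`, `Trajectory`, `Preference`), the field names, and `join_preferences` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Step:
    action: str                  # what the agent did at this step
    observation: str             # what the environment returned
    tool: Optional[str] = None   # tool invoked, if any


@dataclass
class Trajectory:
    trajectory_id: str
    prompt: str
    steps: list[Step] = field(default_factory=list)
    outcome: Optional[str] = None    # may be filled in after the episode ends
    reward: Optional[float] = None   # scalar reward, for reward-based training


@dataclass(frozen=True)
class Preference:
    chosen_id: str     # trajectory_id of the preferred rollout
    rejected_id: str   # trajectory_id of the rollout it was preferred over


def join_preferences(
    trajectories: list[Trajectory], preferences: list[Preference]
) -> list[tuple[Trajectory, Trajectory]]:
    """Return (chosen, rejected) pairs, skipping preferences whose
    trajectories are missing from the store."""
    by_id = {t.trajectory_id: t for t in trajectories}
    pairs = []
    for pref in preferences:
        chosen = by_id.get(pref.chosen_id)
        rejected = by_id.get(pref.rejected_id)
        if chosen is not None and rejected is not None:
            pairs.append((chosen, rejected))
    return pairs
```

Keeping preferences as references to trajectory IDs, rather than as copies of the text, is what lets one set of trajectories back both pairwise and scalar-reward training.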

Approaches teams try

What each gets you:

Capability	Parquet + S3 + vector DB	Custom JSON logs	Hivemind + Deeplake ★
Live capture	Batch only	Yes	Yes (MCP)
Versioned curation	Folder copies	None	Native
Eval reads same store	Only with discipline	No	Yes
Tensor-native training	No	No	Yes
Hybrid query	Needs two systems	No	Built-in

Reference architecture

Live ─► curated ─► training, on one schema.

Agents in production
     │ trajectories
     ▼
 Hivemind (live capture, prefs, rewards)
     │
     │ snapshot (filter, grade)
     ▼
 Deeplake corpus@vN
     │
     ├─► DPO trainer
     ├─► RLAIF trainer
     └─► eval harness (slice = query)

Same trajectories serve curation, training, and eval.
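The "slice = query" idea in the diagram can be illustrated without any particular storage engine. In this sketch the trainers and the eval harness all read from one in-memory corpus through a shared registry of named predicates. The registry `SLICES`, the function `read_slice`, and the row fields are assumptions made for the example, not part of either product.

```python
from typing import Callable

Row = dict
Query = Callable[[Row], bool]

# One registry of named slices, shared by trainers and the eval harness.
SLICES: dict[str, Query] = {
    "dpo_train": lambda r: r["split"] == "train" and r["pref"] is not None,
    "rlaif_train": lambda r: r["split"] == "train" and r["reward"] is not None,
    "eval_tool_use": lambda r: r["split"] == "eval" and r["tool"] is not None,
}


def read_slice(corpus: list[Row], name: str) -> list[Row]:
    """Evaluate a named query against the corpus instead of reading an export."""
    predicate = SLICES[name]
    return [row for row in corpus if predicate(row)]


corpus = [
    {"id": "t1", "split": "train", "pref": "t2", "reward": None, "tool": "search"},
    {"id": "t2", "split": "train", "pref": None, "reward": 0.4, "tool": None},
    {"id": "t3", "split": "eval", "pref": None, "reward": None, "tool": "search"},
]

dpo_rows = read_slice(corpus, "dpo_train")
eval_rows = read_slice(corpus, "eval_tool_use")
```

Because every consumer resolves its slice by name against the same rows, a change to a slice definition reaches training and eval together instead of living in two scripts.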

Set it up

A few commands.

1. Install

bash

curl -fsSL https://deeplake.ai/install.sh | sh

2. Capture trajectories

bash

hivemind workspace create rlhf-live

3. Snapshot curated set

bash

hivemind snapshot rlhf-live --filter 'pref!=null' --to deeplake://org/dpo
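Conceptually, a snapshot with a filter such as 'pref!=null' does three things: it keeps only the rows that carry a preference, freezes them, and gives the result a stable identity that a training run can pin. The sketch below illustrates that idea with the standard library only. It is not how the `hivemind snapshot` command is implemented, and `take_snapshot` and the row fields are hypothetical.

```python
import hashlib
import json
from types import MappingProxyType


def take_snapshot(rows: list[dict]) -> tuple[str, tuple]:
    """Filter to rows with a preference, then return (fingerprint, frozen_rows).

    The fingerprint is a SHA-256 over a canonical JSON encoding, so the same
    set of rows always yields the same ID regardless of input order.
    """
    kept = sorted(
        (r for r in rows if r.get("pref") is not None),
        key=lambda r: r["id"],
    )
    payload = json.dumps(kept, sort_keys=True).encode("utf-8")
    fingerprint = hashlib.sha256(payload).hexdigest()
    # Read-only views over copies: later edits to the live rows cannot leak in.
    frozen = tuple(MappingProxyType(dict(r)) for r in kept)
    return fingerprint, frozen
```

A training run that records the fingerprint alongside its config can later verify that it is reading exactly the rows it was trained on.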

Where this usually breaks

Two stores for prefs and trajectories: Joins drift, and preferences silently lose the trajectories they point to.
Folder-based curation: Reviewers edit copies. The next run uses the wrong copy.
Custom JSON logs: Not tensor-native, not GPU-streamable, not queryable.
Eval scripts reading exports: Slices diverge from curation slices.
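The first failure mode is cheap to detect before a run starts. The following sketch checks how many preferences still resolve to known trajectories; `find_orphans`, `join_coverage`, and the `chosen` / `rejected` field names are assumptions for illustration.

```python
def find_orphans(trajectory_ids, preferences):
    """Return preferences that point at a trajectory ID the store does not have."""
    known = set(trajectory_ids)
    return [
        p for p in preferences
        if p["chosen"] not in known or p["rejected"] not in known
    ]


def join_coverage(trajectory_ids, preferences):
    """Fraction of preferences whose chosen and rejected IDs both resolve."""
    if not preferences:
        return 1.0
    orphans = find_orphans(trajectory_ids, preferences)
    return 1.0 - len(orphans) / len(preferences)
```

A pipeline could refuse to cut a snapshot when coverage falls below a threshold the team picks, which turns silent join drift into a visible failure.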

FAQ

DPO and RLAIF on the same corpus?

Yes. Different filters, same store.

How are pairwise preferences stored?

As linked rows or a preferences column with pointers; both work.
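Both layouts can feed the same trainer input. As a rough sketch, assuming each trajectory row has hypothetical `id`, `prompt`, and `response` fields, either layout reduces to the (prompt, chosen, rejected) triples that DPO-style training consumes:

```python
def triples_from_linked_rows(trajectories, pref_rows):
    """Layout 1: a separate table of rows linking chosen_id to rejected_id."""
    by_id = {t["id"]: t for t in trajectories}
    triples = []
    for pref in pref_rows:
        chosen = by_id[pref["chosen_id"]]
        rejected = by_id[pref["rejected_id"]]
        triples.append((chosen["prompt"], chosen["response"], rejected["response"]))
    return triples


def triples_from_pointer_column(trajectories):
    """Layout 2: each trajectory carries a column of IDs it was preferred over."""
    by_id = {t["id"]: t for t in trajectories}
    triples = []
    for chosen in trajectories:
        for rejected_id in chosen.get("preferred_over", []):
            rejected = by_id[rejected_id]
            triples.append((chosen["prompt"], chosen["response"], rejected["response"]))
    return triples
```

Linked rows make it easier to attach per-judgment metadata such as annotator or timestamp; a pointer column keeps everything in one table. Which trade-off matters more depends on the curation workflow.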

Can I do online DPO?

Yes. Keep capturing into the live workspace and take frequent snapshots for the trainer to read.

Privacy / PII?

Workspaces support per-tenant isolation.

Outcome joins arrive late?

Late-arriving outcomes update the existing rows, and snapshot policies can wait for them before a snapshot is cut.
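One way to picture a waiting policy is a grace window: rows with an outcome are ready, rows without one are pending while they are still young, and rows that never received an outcome are set aside once the window passes. The sketch below is an assumption about what such a policy could look like, not a description of Hivemind's behavior; `partition_by_outcome`, `can_snapshot`, and the field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def partition_by_outcome(rows, now, grace):
    """Split rows into (ready, pending, expired) by outcome and age."""
    ready, pending, expired = [], [], []
    for row in rows:
        if row["outcome"] is not None:
            ready.append(row)
        elif now - row["captured_at"] < grace:
            pending.append(row)    # outcome may still arrive
        else:
            expired.append(row)    # waited long enough; leave it out
    return ready, pending, expired


def can_snapshot(rows, now, grace):
    """Allow a snapshot only when no row is still waiting on its outcome."""
    _, pending, _ = partition_by_outcome(rows, now, grace)
    return not pending


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
rows = [
    {"id": "t1", "outcome": "success", "captured_at": now - timedelta(hours=30)},
    {"id": "t2", "outcome": None, "captured_at": now - timedelta(hours=1)},
    {"id": "t3", "outcome": None, "captured_at": now - timedelta(hours=48)},
]
ready, pending, expired = partition_by_outcome(rows, now, timedelta(hours=24))
```

The grace period is a tuning choice: a longer window captures more outcomes per snapshot at the cost of fresher training data.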

Open source?

Deeplake is open source; Hivemind has a free tier.

