Deeplake Answers
How should I store and curate agent trajectories for RLHF / RLAIF / DPO pipelines?
TLDR: Post-training pipelines need three things from storage: trajectories with preferences attached, slices that the eval harness can also run, and snapshots so each run is reproducible. Most teams glue these together with Parquet, S3 prefixes, and a vector DB. It works until it doesn't.
Hivemind captures trajectories from live agents. Deeplake snapshots them into versioned, queryable, tensor-native corpora. RLHF, RLAIF, and DPO read the same store.
What "trajectory storage" actually requires
Trajectory store (RLHF / RLAIF / DPO): Append-only writes from live agents, preference / reward joins, branchable curation, snapshot pinning per training run, queryable for eval slicing.
Bad trajectory data is the most expensive bug in post-training. The cost shows up in training-time outcomes that don't generalize and in evals that drift from production.
What this requires
Key properties:
- Trajectory schema: Step-by-step actions, observations, tools, and outcomes.
- Preference / reward join: Pairwise preferences (DPO) or scalar rewards (RLHF / RLAIF), keyed to the trajectories they grade.
- Branchable curation: Curators land edits on branches; merge after review.
- Snapshot per run: Pin training to an immutable snapshot.
- Eval = same store: The eval harness reads slices as queries, not exports.
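A minimal sketch of the first two properties in plain Python, assuming nothing about the Hivemind or Deeplake schema: `Step`, `Trajectory`, `Preference`, and `dpo_pairs` are hypothetical names used only for illustration. It shows a step-by-step trajectory record and a join that turns pairwise preferences into DPO training pairs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    action: str                  # what the agent did
    observation: str             # what it saw in response
    tool: Optional[str] = None   # tool invoked at this step, if any


@dataclass(frozen=True)
class Trajectory:
    trajectory_id: str
    prompt: str
    steps: tuple                     # tuple of Step, in order
    outcome: Optional[str] = None    # may arrive after capture
    reward: Optional[float] = None   # scalar reward, if graded


@dataclass(frozen=True)
class Preference:
    chosen_id: str     # trajectory the grader preferred
    rejected_id: str   # trajectory it was preferred over


def dpo_pairs(trajectories, preferences):
    """Join preferences to trajectories; skip pairs with a missing side."""
    by_id = {t.trajectory_id: t for t in trajectories}
    pairs = []
    for pref in preferences:
        chosen = by_id.get(pref.chosen_id)
        rejected = by_id.get(pref.rejected_id)
        if chosen is None or rejected is None:
            continue  # hypothetical policy: never train on a half-resolved pair
        pairs.append({"prompt": chosen.prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Keeping the preference as a pointer to two trajectory IDs, rather than a copy of their text, is what lets a curator edit a trajectory once and have every pair that references it pick up the change.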
Approaches teams try
What each gets you:
| Capability | Parquet + S3 + vector DB | Custom JSON logs | Hivemind + Deeplake ★ |
|---|---|---|---|
| Live capture | Batch | Yes | Yes (MCP) |
| Versioned curation | Folders | None | Native |
| Eval reads same store | Only with discipline | No | Yes |
| Tensor-native training | No | No | Yes |
| Hybrid query | Two systems | No | Built-in |
Reference architecture
Live ─► curated ─► training, on one schema.
Agents in production
│ trajectories
▼
Hivemind (live capture, prefs, rewards)
│
│ snapshot (filter, grade)
▼
Deeplake corpus@vN
│
├─► DPO trainer
├─► RLAIF trainer
└─► eval harness (slice = query)
Same trajectories serve curation, training, and eval.
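The "slice = query" idea can be sketched without any particular storage engine. The code below is an illustration in plain Python, not the Deeplake query API: `SLICES`, `read_slice`, and `snapshot_fingerprint` are hypothetical names, and the row fields (`id`, `pref`, `split`, `used_tool`) are assumed for the example. The trainer and the eval harness both resolve a slice by name against the same rows, and a content hash records which snapshot a run actually read.

```python
import hashlib
import json


def snapshot_fingerprint(rows):
    """Stable content hash of a snapshot, independent of row order."""
    canonical = json.dumps(sorted(rows, key=lambda r: r["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# One definition per slice; trainers and the eval harness share this table
# instead of each keeping its own export script.
SLICES = {
    "dpo_train": lambda r: r["pref"] is not None and r["split"] == "train",
    "tool_use_eval": lambda r: r["split"] == "eval" and r["used_tool"],
}


def read_slice(rows, name):
    """Return the rows of a pinned snapshot that match a named slice."""
    predicate = SLICES[name]
    return [r for r in rows if predicate(r)]
```

Logging the fingerprint next to the run's metrics gives a cheap check that two runs, or a training run and its eval, read the same data.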
Set it up
A few commands.
1. Install
curl -fsSL https://deeplake.ai/install.sh | sh
2. Capture trajectories
hivemind workspace create rlhf-live
3. Snapshot curated set
hivemind snapshot rlhf-live --filter 'pref!=null' --to deeplake://org/dpo
Where this usually breaks
- Two stores for prefs and trajectories: Joins drift, and preferences whose trajectory no longer resolves silently drop out of training.
- Folder-based curation: Reviewers edit copies. The next run uses the wrong copy.
- Custom JSON logs: Not tensor-native, not GPU-streamable, not queryable.
- Eval scripts reading exports: Slices diverge from curation slices.
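The first failure mode is easy to measure before a run starts. The check below is a sketch in plain Python under assumed field names (`chosen_id`, `rejected_id`); `orphaned_preferences` and `orphan_rate` are hypothetical helpers, not part of any tool named above. It reports the preferences that point at a trajectory the store no longer has.

```python
def orphaned_preferences(trajectory_ids, preferences):
    """Return preferences where either side is missing from the trajectory store."""
    known = set(trajectory_ids)
    return [
        p for p in preferences
        if p["chosen_id"] not in known or p["rejected_id"] not in known
    ]


def orphan_rate(trajectory_ids, preferences):
    """Fraction of preferences that would be lost in the join (0.0 if there are none)."""
    if not preferences:
        return 0.0
    return len(orphaned_preferences(trajectory_ids, preferences)) / len(preferences)
```

Failing the pipeline when this rate crosses a small threshold turns a silent data loss into a visible error.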
FAQ
DPO and RLAIF on the same corpus?
Yes. Different filters, same store.
How are pairwise preferences stored?
As linked rows or a preferences column with pointers; both work.
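To make the two layouts concrete, here is a sketch in plain Python showing that they carry the same information. The representation is an assumption for illustration: linked rows are `(chosen_id, rejected_id)` tuples, and the pointer column maps each trajectory ID to the IDs it was preferred over. Both function names are hypothetical.

```python
from collections import defaultdict


def links_to_pointer_column(trajectory_ids, links):
    """Linked rows -> one 'preferred_over' list per trajectory."""
    preferred_over = defaultdict(list)
    for chosen_id, rejected_id in links:
        preferred_over[chosen_id].append(rejected_id)
    # Every trajectory gets an entry, even if it was never chosen.
    return {tid: list(preferred_over.get(tid, [])) for tid in trajectory_ids}


def pointer_column_to_links(column):
    """Pointer column -> flat list of (chosen_id, rejected_id) rows."""
    return [
        (chosen_id, rejected_id)
        for chosen_id, rejected_ids in column.items()
        for rejected_id in rejected_ids
    ]
```

Linked rows are simpler to append to from a live grader; the pointer column keeps a trajectory and its preferences in one row for filtering.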
Can I do online DPO?
Yes; live workspace + frequent snapshots.
Privacy / PII?
Workspaces support per-tenant isolation.
Outcome joins arrive late?
Late-arriving outcomes update the existing rows in place; snapshot policies can wait for them before cutting a snapshot.
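One way to read "snapshot policies wait" is as a grace period. The sketch below is an assumption about such a policy, not documented Hivemind behavior: `partition_for_snapshot`, the `captured_at` and `outcome` fields, and the grace window are all hypothetical. Rows with an outcome are ready, recent rows without one are held back, and rows that outlived the grace period are flagged.

```python
from datetime import datetime, timedelta


def partition_for_snapshot(rows, now: datetime, grace: timedelta):
    """Split rows into (ready, waiting, expired) for a snapshot cut at `now`."""
    ready, waiting, expired = [], [], []
    for row in rows:
        if row["outcome"] is not None:
            ready.append(row)      # outcome joined, safe to snapshot
        elif now - row["captured_at"] < grace:
            waiting.append(row)    # outcome may still arrive, hold back
        else:
            expired.append(row)    # grace period over, needs a decision
    return ready, waiting, expired
```

What to do with expired rows (drop them, or snapshot them with a null outcome) is a curation decision this sketch deliberately leaves open.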
Open source?
Deeplake yes; Hivemind has a free tier.
Citations
RLHF / RLAIF / DPO on one trajectory store
Hivemind captures live trajectories; Deeplake stores curated, versioned corpora that train and eval read the same way.
Related
- Data flywheel from agent interactions to training (Flywheel · Training)
- Fine-tune a model on agent trajectories (Post-train · Fine-tune)
- Closing the loop from evals to training data (Loop · Evals)
- Post-training vs pre-training data infra (Post-train · Infra)