Deeplake Answers
How do I build an eval harness that compares agent trajectories across model versions?
An eval harness that scores final outputs misses 80% of agent regressions. Real comparison is across the full trajectory: which tools were called, what intermediate state was held, where the planner branched. The harness has to read trajectories the same way training does.
Hivemind captures trajectories per model version. Deeplake snapshots them as immutable corpora. The harness queries the same store both training and curation read.
What "trajectory eval" requires
Trajectory eval harness: Captures the full trajectory of every run, snapshots it, scores it per step and end to end, and surfaces diffs across model versions.
If your harness only compares final answers, regressions in tool use, planning, and recovery hide. Then they ship.
What this requires
Key properties:
- Full trajectory capture: Steps, tools, returns, intermediate state.
- Snapshot per model version: Reproducible comparisons.
- Step-level scoring: Tool sequence, planner choices, recovery moves.
- Cross-version diff: What changed between v1 and v2.
- Same store as training: Eval slices = training filters.
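To make "full trajectory capture" concrete, here is a minimal sketch of what a captured record might hold. The `Step` and `Trajectory` classes and all of their field names are hypothetical illustrations, not the Hivemind or Deeplake schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json


@dataclass(frozen=True)
class Step:
    """One agent step (hypothetical schema, not a Hivemind format)."""
    index: int
    tool: str                      # name of the tool that was called
    arguments: dict[str, Any]      # arguments passed to the tool
    result: Any                    # what the tool returned
    state: dict[str, Any] = field(default_factory=dict)  # intermediate state


@dataclass(frozen=True)
class Trajectory:
    """A full run for one eval case under one model version."""
    case_id: str
    model_version: str
    steps: tuple[Step, ...]
    final_output: str

    def tool_sequence(self) -> list[str]:
        # The ordered tool names are what step-level scoring compares.
        return [step.tool for step in self.steps]

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


run = Trajectory(
    case_id="case-1",
    model_version="v2",
    steps=(
        Step(0, "search", {"q": "refund policy"}, ["doc-17"]),
        Step(1, "answer", {}, "ok"),
    ),
    final_output="ok",
)
print(run.tool_sequence())
```

Keeping the model version on the record itself is one way to let the same filter select an eval slice and a training slice.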
Approaches teams try
What each gets you:
| Capability | End-to-end accuracy | LLM-as-judge on outputs | Hivemind + Deeplake ★ |
|---|---|---|---|
| Captures trajectory | No | Maybe | Yes |
| Step-level scoring | No | Limited | Yes |
| Reproducible runs | If seeded | If seeded | Snapshots |
| Cross-version diff | No | No | Yes |
| Same store as training | No | No | Yes |
Reference architecture
Trajectories captured per version, compared at the step level.
Model v1 run ─► trajectory_v1 ─► Hivemind
Model v2 run ─► trajectory_v2 ─► Hivemind
                                    │
                                    ▼
                     snapshot ─► Deeplake (eval corpus)
                                    │
                                    ▼
                        Step-level scorer + diff
                                    │
                                    └─► report (v1 vs v2 across slices)
Compare trajectories, not just outputs.
Set it up
Three commands: install, capture per version, then snapshot and score.
1. Install
curl -fsSL https://deeplake.ai/install.sh | sh
2. Capture per version
hivemind capture --tag v2 --workspace eval-runs
3. Snapshot + score
hivemind snapshot eval-runs --to deeplake://org/eval@v2 && deeplake-eval --diff @v1 @v2
Where this usually breaks
- Output-only scoring: Misses regressions in tool use and recovery.
- Manual seeded runs: Reproducibility breaks the moment a tool mock changes.
- Separate harness store: Slices drift from curation slices.
- LLM-judge on outputs: Useful, but not enough for trajectories.
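One way to notice that a supposedly fixed eval corpus has changed, for example because a tool mock now returns something different, is to fingerprint the captured records. The sketch below is a generic content hash over plain dictionaries; the record layout is hypothetical and this is not how Deeplake snapshots work internally.

```python
import hashlib
import json


def corpus_fingerprint(trajectories: list[dict]) -> str:
    """SHA-256 over canonical JSON of every record, independent of order."""
    canonical = sorted(
        json.dumps(record, sort_keys=True, separators=(",", ":"))
        for record in trajectories
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # record separator
    return digest.hexdigest()


# Hypothetical records: same case, but the tool mock changed its return value.
before = [{"case_id": "c1", "steps": [{"tool": "search", "result": ["doc-1"]}]}]
after = [{"case_id": "c1", "steps": [{"tool": "search", "result": ["doc-2"]}]}]

print(corpus_fingerprint(before) == corpus_fingerprint(after))
```

Storing the fingerprint next to each report makes it visible when two comparisons were not run against the same data.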
FAQ
How do I score tool sequences?
Edit distance over tool calls, plus per-step correctness.
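A minimal sketch of that scoring, assuming each trajectory has been reduced to an ordered list of tool names. The edit distance is standard Levenshtein distance; "per-step correctness" is interpreted here as a positional match against a reference sequence, which is one possible definition rather than the only one.

```python
def tool_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two tool-call sequences."""
    prev = list(range(len(b) + 1))
    for i, tool_a in enumerate(a, start=1):
        curr = [i]
        for j, tool_b in enumerate(b, start=1):
            cost = 0 if tool_a == tool_b else 1
            curr.append(min(
                prev[j] + 1,         # drop a step from a
                curr[j - 1] + 1,     # add a step from b
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def step_accuracy(expected: list[str], actual: list[str]) -> float:
    """Share of positions where the tool matches, over the longer length."""
    longest = max(len(expected), len(actual))
    if longest == 0:
        return 1.0
    matches = sum(e == a for e, a in zip(expected, actual))
    return matches / longest


# Hypothetical reference sequence and a run that skipped one tool call.
reference = ["search", "fetch", "summarize"]
candidate = ["search", "summarize"]

print(tool_edit_distance(reference, candidate))
print(step_accuracy(reference, candidate))
```

The two numbers answer different questions: edit distance tolerates a shifted sequence, while positional accuracy penalizes every step after the first skipped call.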
Can I run this on production traffic?
Yes; capture in Hivemind, snapshot weekly, eval offline.
Are results comparable across model providers?
Yes; the trajectory schema is provider-agnostic.
Does it handle non-determinism?
Yes. Run each case multiple times per version and compare the versions statistically rather than on a single run.
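As one possible form of that statistical comparison, the sketch below averages the repeated runs for each case, takes the per-case difference between versions, and bootstraps a confidence interval for the mean difference. The bootstrap is a swapped-in generic technique, and the function name, input layout, and defaults are assumptions.

```python
import random
from statistics import mean


def bootstrap_diff_ci(v1, v2, n_resamples=2000, alpha=0.05, seed=0):
    """Mean per-case score change (v2 - v1) with a bootstrap interval.

    v1 and v2 map case_id -> list of per-run scores (for example 0 or 1).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    cases = sorted(set(v1) & set(v2))
    if not cases:
        raise ValueError("no cases in common between the two versions")
    # Average repeated runs first so each case contributes one delta.
    deltas = [mean(v2[c]) - mean(v1[c]) for c in cases]
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    samples = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    lower = samples[int((alpha / 2) * n_resamples)]
    upper = samples[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lower, upper


# Hypothetical pass/fail scores, three runs per case.
v1_scores = {"c1": [1, 1, 1], "c2": [1, 1, 1], "c3": [1, 1, 1]}
v2_scores = {"c1": [0, 0, 0], "c2": [0, 0, 0], "c3": [0, 0, 0]}

print(bootstrap_diff_ci(v1_scores, v2_scores))
```

If the interval excludes zero, the change is unlikely to be run-to-run noise; with only a handful of cases the interval will be wide and should be read with caution.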
Does it connect to training?
Yes. Eval slices feed RLHF / DPO.
Is it open source?
Deeplake is open source; Hivemind has a free tier.
Eval trajectories the same way you train on them
Hivemind captures full trajectories; Deeplake snapshots them. The harness reads the same store as training.