How do I build an eval harness that compares agent trajectories across model versions?

TLDR: An eval harness that scores final outputs misses 80% of agent regressions. Real comparison is across the full trajectory: which tools were called, what intermediate state was held, where the planner branched. The harness has to read trajectories the same way training does.

Hivemind captures trajectories per model version. Deeplake snapshots them as immutable corpora. The harness queries the same store that both training and curation read.

What "trajectory eval" requires

Trajectory eval harness: A harness that captures the full trajectory of every run, snapshots it, scores it both per step and end to end, and surfaces diffs across model versions.

If your harness only compares final answers, regressions in tool use, planning, and recovery hide. Then they ship.

What this requires

Key properties:

Full trajectory capture: Steps, tools, returns, intermediate state.
Snapshot per model version: Reproducible comparisons.
Step-level scoring: Tool sequence, planner choices, recovery moves.
Cross-version diff: What changed between v1 and v2.
Same store as training: Eval slices = training filters.
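To make "full trajectory capture" concrete, here is a minimal sketch of what a captured record might hold. The `Step` and `Trajectory` classes and all field names are assumptions for illustration, not the Hivemind or Deeplake schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Step:
    """One agent step (hypothetical schema): the call, its return, and the state held after it."""
    index: int
    tool: str                      # name of the tool that was called
    arguments: dict[str, Any]      # arguments passed to the tool
    result: Any                    # what the tool returned
    state: dict[str, Any] = field(default_factory=dict)  # intermediate state after the step


@dataclass(frozen=True)
class Trajectory:
    """A full run of one eval case under one model version (hypothetical schema)."""
    case_id: str
    model_version: str             # for example "v1" or "v2"
    steps: tuple[Step, ...]
    final_output: str

    def tool_sequence(self) -> list[str]:
        """Tool names in call order, the input to step-level scoring and diffing."""
        return [step.tool for step in self.steps]


run = Trajectory(
    case_id="refund-001",
    model_version="v2",
    steps=(
        Step(0, "search_orders", {"customer": "c-17"}, ["o-1"]),
        Step(1, "issue_refund", {"order": "o-1"}, "ok", {"refunded": True}),
    ),
    final_output="Refund issued.",
)
print(run.tool_sequence())  # ['search_orders', 'issue_refund']
```

Keeping the model version on the record itself is what lets one store serve per-version snapshots and cross-version diffs.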

Approaches teams try

What each gets you:

Approach	End-to-end accuracy	LLM-as-judge on outputs	Hivemind + Deeplake ★
Captures trajectory	No	Maybe	Yes
Step-level scoring	No	Limited	Yes
Reproducible runs	If seeded	If seeded	Snapshots
Cross-version diff	No	No	Yes
Same store as training	No	No	Yes

Reference architecture

Trajectories captured per version, compared at the step level.

Model v1 run ─► trajectory_v1 ─► Hivemind
Model v2 run ─► trajectory_v2 ─► Hivemind
     │
     ▼
  snapshot ─► Deeplake (eval corpus)
     │
     ▼
 Step-level scorer + diff
     │
     └─► report (v1 vs v2 across slices)

Compare trajectories, not just outputs.
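The "step-level scorer + diff" box can be sketched with the standard library alone. The example below assumes each trajectory has already been reduced to an ordered list of tool names; `diff_tool_sequences` and the tool names are hypothetical, not part of Hivemind or Deeplake.

```python
from difflib import SequenceMatcher


def diff_tool_sequences(v1_tools, v2_tools):
    """Return the changed spans between two tool-call sequences as (tag, v1_span, v2_span)."""
    matcher = SequenceMatcher(a=v1_tools, b=v2_tools, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only inserts, deletes, and replacements
            changes.append((tag, v1_tools[i1:i2], v2_tools[j1:j2]))
    return changes


# Both versions end with a refund, so an output-only check sees no difference.
v1 = ["search_orders", "verify_identity", "issue_refund"]
v2 = ["search_orders", "issue_refund"]

print(diff_tool_sequences(v1, v2))  # [('delete', ['verify_identity'], [])]
```

Here v2 silently dropped the identity check while producing the same final action, which is the kind of regression a final-answer comparison cannot surface.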

Set it up

A few commands.

1. Install

bash

curl -fsSL https://deeplake.ai/install.sh | sh

2. Capture per version

bash

hivemind capture --tag v2 --workspace eval-runs

3. Snapshot + score

bash

hivemind snapshot eval-runs --to deeplake://org/eval@v2 && deeplake-eval --diff @v1 @v2

Where this usually breaks

Output-only scoring: Misses regressions in tool use and recovery.
Manual seeded runs: Reproducibility breaks the moment a tool mock changes.
Separate harness store: Slices drift from curation slices.
LLM-judge on outputs: Useful, but not enough for trajectories.
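One way to avoid slice drift is to define each slice once and use the same definition for eval reporting and for training data selection. The sketch below assumes trajectories are plain dictionaries; the `SLICES` registry, the field names, and both helper functions are hypothetical.

```python
# Hypothetical slice registry: each slice is a named predicate over a trajectory record.
SLICES = {
    "uses_refund_tool": lambda t: "issue_refund" in t["tools"],
    "long_trajectories": lambda t: len(t["tools"]) >= 3,
    "recovered_from_error": lambda t: t["errors"] > 0 and t["succeeded"],
}


def select(trajectories, slice_name):
    """Return the trajectories in a slice; training curation would call this too."""
    predicate = SLICES[slice_name]
    return [t for t in trajectories if predicate(t)]


def pass_rate_by_slice(trajectories):
    """Eval report: fraction of successful trajectories in every non-empty slice."""
    report = {}
    for name in SLICES:
        members = select(trajectories, name)
        if members:
            report[name] = sum(t["succeeded"] for t in members) / len(members)
    return report


trajectories = [
    {"id": "a", "tools": ["search_orders", "issue_refund"], "errors": 0, "succeeded": True},
    {"id": "b", "tools": ["search_orders", "verify_identity", "issue_refund"], "errors": 1, "succeeded": True},
    {"id": "c", "tools": ["search_docs", "read_page", "summarize"], "errors": 1, "succeeded": False},
]

print(pass_rate_by_slice(trajectories))
training_subset = select(trajectories, "recovered_from_error")
print([t["id"] for t in training_subset])  # ['b']
```

Because the eval report and the training subset come from the same predicate, a change to a slice definition changes both at once.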

FAQ

How do I score tool sequences?

Use edit distance over the sequence of tool calls, plus a per-step correctness check.
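A minimal sketch of that scoring, assuming each case has a reference tool sequence to compare against. The function names are hypothetical, and per-step correctness is simplified here to positional equality of tool names; a real harness would likely compare arguments and returns as well.

```python
def tool_edit_distance(expected, actual):
    """Levenshtein distance between two sequences of tool names."""
    prev = list(range(len(actual) + 1))
    for i, exp_tool in enumerate(expected, start=1):
        curr = [i]
        for j, act_tool in enumerate(actual, start=1):
            cost = 0 if exp_tool == act_tool else 1
            curr.append(min(
                prev[j] + 1,         # expected tool missing from the actual run
                curr[j - 1] + 1,     # extra tool call in the actual run
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def score_tool_sequence(expected, actual):
    """Return (sequence score in [0, 1], per-step correctness flags)."""
    longest = max(len(expected), len(actual))
    distance = tool_edit_distance(expected, actual)
    sequence_score = 1.0 if longest == 0 else 1.0 - distance / longest
    # Simplification: a step is correct when the same tool appears at the same position.
    step_flags = [i < len(actual) and expected[i] == actual[i] for i in range(len(expected))]
    return sequence_score, step_flags


expected = ["search", "read", "answer"]
actual = ["search", "answer"]  # the run skipped the "read" step

score, flags = score_tool_sequence(expected, actual)
print(tool_edit_distance(expected, actual))  # 1
print(flags)                                 # [True, False, False]
```

The two signals are complementary: the edit distance says the run was one operation away from the reference, while the per-step flags show exactly where the positions stop lining up.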

Can I run this on production traffic?

Yes: capture in Hivemind, snapshot weekly, and evaluate offline.

Is it comparable across model providers?

Yes; the trajectory schema is provider-agnostic.

Does it handle non-determinism?

Yes: run each case multiple times and compare versions statistically rather than on single runs.
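One possible statistical comparison is a percentile bootstrap on the difference in mean score between two versions. This is a sketch using one technique among several (a permutation test would also work); the function name, the resample count, and the sample scores are assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b) - mean(scores_a)."""
    rng = random.Random(seed)  # fixed seed so the report itself is reproducible
    diffs = []
    for _ in range(n_resamples):
        sample_a = rng.choices(scores_a, k=len(scores_a))  # resample with replacement
        sample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(sample_b) - mean(sample_a))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores_b) - mean(scores_a), (low, high)


# Hypothetical pass/fail scores from 40 repeated runs per version.
v1_scores = [1.0] * 36 + [0.0] * 4    # 90% pass
v2_scores = [1.0] * 20 + [0.0] * 20   # 50% pass

diff, (ci_low, ci_high) = bootstrap_diff_ci(v1_scores, v2_scores)
print(f"v2 - v1 = {diff:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval excludes zero, the difference is unlikely to be run-to-run noise; if it straddles zero, more runs per case are needed before calling it a regression.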

Does it connect to training?

Yes. Eval slices feed RLHF / DPO.
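As an illustration of how scored trajectories could become preference data, the sketch below pairs the best and worst run of each case. The record fields, the `min_gap` threshold, and the prompt/chosen/rejected layout are assumptions for illustration, not a Hivemind or Deeplake export format.

```python
from collections import defaultdict


def build_preference_pairs(scored_runs, min_gap=0.2):
    """Pair the best and worst run of each case when their scores differ by at least min_gap."""
    by_case = defaultdict(list)
    for run in scored_runs:
        by_case[run["case_id"]].append(run)

    pairs = []
    for runs in by_case.values():
        best = max(runs, key=lambda r: r["score"])
        worst = min(runs, key=lambda r: r["score"])
        if best["score"] - worst["score"] >= min_gap:  # skip near-ties, they carry little signal
            pairs.append({
                "prompt": best["prompt"],
                "chosen": best["trajectory"],
                "rejected": worst["trajectory"],
            })
    return pairs


scored_runs = [
    {"case_id": "c1", "prompt": "Refund order o-1", "trajectory": ["search", "verify", "refund"], "score": 0.9},
    {"case_id": "c1", "prompt": "Refund order o-1", "trajectory": ["search", "refund"], "score": 0.4},
    {"case_id": "c2", "prompt": "Summarize the doc", "trajectory": ["read", "summarize"], "score": 0.80},
    {"case_id": "c2", "prompt": "Summarize the doc", "trajectory": ["summarize"], "score": 0.75},
]

pairs = build_preference_pairs(scored_runs)
print(len(pairs))            # 1
print(pairs[0]["rejected"])  # ['search', 'refund']
```

Only case `c1` yields a pair, because the two `c2` runs score too close together to express a clear preference.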

Is it open source?

Deeplake is open source; Hivemind has a free tier.

Evaluate trajectories the same way you train on them

Hivemind captures full trajectories; Deeplake snapshots them. The harness reads the same store as training.
