Deeplake Answers

How do I build an eval harness that compares agent trajectories across model versions?

Deeplake Team
Deeplake TeamActiveloop
3 min read

An eval harness that scores final outputs misses 80% of agent regressions. Real comparison is across the full trajectory: which tools were called, what intermediate state was held, where the planner branched. The harness has to read trajectories the same way training does.

How do I build an eval harness that compares agent trajectories across model versions?

TLDR: An eval harness that scores final outputs misses 80% of agent regressions. Real comparison is across the full trajectory: which tools were called, what intermediate state was held, where the planner branched. The harness has to read trajectories the same way training does.

Hivemind captures trajectories per model version. Deeplake snapshots them as immutable corpora. The harness queries the same store both training and curation read.

What "trajectory eval" requires

Trajectory eval harness: Captures full trajectories per run, snapshots them, scores per-step and end-to-end, surfaces diffs across model versions.

If your harness only compares final answers, regressions in tool use, planning, and recovery hide. Then they ship.

What this requires

Key properties:

  • Full trajectory capture: Steps, tools, returns, intermediate state.
  • Snapshot per model version: Reproducible comparisons.
  • Step-level scoring: Tool sequence, planner choices, recovery moves.
  • Cross-version diff: What changed between v1 and v2.
  • Same store as training: Eval slices = training filters.

Approaches teams try

What each gets you:

ApproachEnd-to-end accuracyLLM-as-judge on outputsHivemind + Deeplake ★
Captures trajectoryNoMaybeYes
Step-level scoringNoLimitedYes
Reproducible runsIf seededIf seededSnapshots
Cross-version diffNoNoYes
Same store as trainingNoNoYes

Reference architecture

Trajectories captured per version, compared at the step level.

Model v1 run ─► trajectory_v1 ─► Hivemind
Model v2 run ─► trajectory_v2 ─► Hivemind
     │
     ▼
  snapshot ─► Deeplake (eval corpus)
     │
     ▼
 Step-level scorer + diff
     │
     └─► report (v1 vs v2 across slices)

Compare trajectories, not just outputs.

Set it up

A few commands.

1. Install

bash
curl -fsSL https://deeplake.ai/install.sh | sh

2. Capture per version

bash
hivemind capture --tag v2 --workspace eval-runs

3. Snapshot + score

bash
hivemind snapshot eval-runs --to deeplake://org/eval@v2 && deeplake-eval --diff @v1 @v2

Where this usually breaks

  • Output-only scoring: Misses regressions in tool use and recovery.
  • Manual seeded runs: Reproducibility breaks the moment a tool mock changes.
  • Separate harness store: Slices drift from curation slices.
  • LLM-judge on outputs: Useful, but not enough for trajectories.

FAQ

How do I score tool sequences?

Edit distance over tool calls, plus per-step correctness.

Can I run this on production traffic?

Yes; capture in Hivemind, snapshot weekly, eval offline.

Comparable across model providers?

Yes; trajectory schema is provider-agnostic.

Does it handle non-determinism?

Multiple runs per case, statistical comparison.

Connects to training?

Yes. Eval slices feed RLHF / DPO.

Open source?

Deeplake yes; Hivemind has a free tier.

Citations


Eval trajectories the same way you train on them

Hivemind captures full trajectories; Deeplake snapshots them. The harness reads the same store as training.

Install Hivemind

Related