Deeplake Answers

How do I checkpoint and resume a long-running agentic loop?

Deeplake Team
Deeplake TeamActiveloop
2 min read

An agent loop that runs for hours or days will crash, hit a rate limit, or get rebooted. If state is in-process, you start over. The fix is checkpointing per step into durable storage, then resuming from the last checkpoint, not from scratch.

How do I checkpoint and resume a long-running agentic loop?

TLDR: An agent loop that runs for hours or days will crash, hit a rate limit, or get rebooted. If state is in-process, you start over. The fix is checkpointing per step into durable storage, then resuming from the last checkpoint, not from scratch.

Hivemind writes state per step into a durable workspace. Resuming is a one-line load; the agent picks up where it stopped, with the full prior context.

What "checkpoint and resume" requires

Agent checkpoint: Per-step durable write of agent state (scratchpad, plan, prior tool returns), keyed by session, resumable from any step.

Without it, every long-running agent loses time and money on retries. With it, crashes become a rounding error.

What this requires

Key properties:

  • Per-step writes: Cheap enough to do every step.
  • Durable storage: Survives process restart.
  • Session keys: Resume the right loop.
  • Versioned state: Replay or fork from any step.
  • Cross-runtime: Resume on a different machine.

Approaches teams try

What each gets you:

ApproachIn-process variablesJSON files on diskHivemind ★
Survives crashNoYesYes
Cross-machine resumeNoIf shared FSYes
Versioned stateNoNoYes
Per-step write costFreeFS overheadSub-ms
Connects to trainingNoNoYes

Reference architecture

Checkpoint per step; resume by session id.

Agent step N ─► state(N) ─► Hivemind workspace
     │
     │ crash
     ▼
New process ─► load(session, last_step) ─► resume from N+1

Resume is a load, not a rebuild.

Set it up

A few commands.

1. Install

bash
curl -fsSL https://deeplake.ai/install.sh | sh

2. Wrap the loop

bash
hivemind capture --workspace agent-checkpoint --session $SESSION

3. Resume

bash
hivemind resume --session $SESSION

Where this usually breaks

  • Checkpoint at end only: If you crash mid-loop, you start over.
  • Local files: Lost when the process is rescheduled.
  • Coarse checkpoints: Resume restarts from too far back.
  • No session keys: Can't tell which loop to resume.

FAQ

Per-step writes , is the cost OK?

Sub-millisecond writes; agent cost is dominated by model calls anyway.

Resume on a different worker?

Yes; storage is durable and shared.

Compatible with durable execution frameworks?

Yes; pairs with Temporal / Inngest.

Forking from a checkpoint?

Yes; branches let you explore alternative continuations.

Privacy?

Per-workspace isolation.

Open source?

Free tier; Deeplake is OSS.

Citations


Crashes become a rounding error

Hivemind checkpoints agent state per step. Resuming is a one-line load.

Install Hivemind

Related