How do I checkpoint and resume a long-running agentic loop?

TLDR: An agent loop that runs for hours or days will eventually crash, hit a rate limit, or get rebooted. If its state lives only in process memory, you start over. The fix is to checkpoint every step into durable storage, then resume from the last checkpoint instead of from scratch.

Hivemind writes state per step into a durable workspace. Resuming is a one-line load; the agent picks up where it stopped, with the full prior context.

What "checkpoint and resume" requires

Agent checkpoint: Per-step durable write of agent state (scratchpad, plan, prior tool returns), keyed by session, resumable from any step.
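As a rough illustration of what such a per-step record might contain, here is a minimal Python sketch. The `Checkpoint` class and its field names are hypothetical, chosen to mirror the definition above; this is not Hivemind's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Checkpoint:
    """One record per agent step (hypothetical schema, for illustration only)."""

    session_id: str                      # which loop this step belongs to
    step: int                            # step number within the session
    scratchpad: str = ""                 # free-form working notes
    plan: List[str] = field(default_factory=list)                   # remaining plan items
    tool_returns: List[Dict[str, Any]] = field(default_factory=list)  # prior tool outputs

    def to_json(self) -> str:
        # Serialize to a string that any durable store can hold.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        # Rebuild the record from its serialized form.
        return cls(**json.loads(raw))
```

Keeping the record plain JSON means the store behind it can be swapped without touching the agent code.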

Without it, every crash forces a long-running agent to redo steps it already completed, paying again for the same model and tool calls. With it, crashes become a rounding error.

What this requires

Key properties:

Per-step writes: Cheap enough to do every step.
Durable storage: Survives process restart.
Session keys: Resume the right loop.
Versioned state: Replay or fork from any step.
Cross-runtime: Resume on a different machine.
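To make the first four properties concrete, here is a minimal sketch of a per-step store built on SQLite from the Python standard library. `CheckpointStore` and its methods are hypothetical names, not Hivemind's API, and a local SQLite file does not by itself give cross-machine resume; that would need shared or networked storage.

```python
import json
import sqlite3
from typing import Any, Dict, Optional, Tuple


class CheckpointStore:
    """Per-step checkpoint store on SQLite (illustrative, not Hivemind's API)."""

    def __init__(self, path: str) -> None:
        # A file path survives process restarts; ":memory:" does not.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " session_id TEXT NOT NULL,"
            " step INTEGER NOT NULL,"
            " state TEXT NOT NULL,"
            " PRIMARY KEY (session_id, step))"
        )
        self.conn.commit()

    def save(self, session_id: str, step: int, state: Dict[str, Any]) -> None:
        # One row per (session, step): earlier steps are never overwritten,
        # so every step stays available for replay.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO checkpoints (session_id, step, state) VALUES (?, ?, ?)",
                (session_id, step, json.dumps(state)),
            )

    def load(
        self, session_id: str, step: Optional[int] = None
    ) -> Optional[Tuple[int, Dict[str, Any]]]:
        # With step=None, return the latest checkpoint for the session.
        if step is None:
            row = self.conn.execute(
                "SELECT step, state FROM checkpoints WHERE session_id = ?"
                " ORDER BY step DESC LIMIT 1",
                (session_id,),
            ).fetchone()
        else:
            row = self.conn.execute(
                "SELECT step, state FROM checkpoints"
                " WHERE session_id = ? AND step = ?",
                (session_id, step),
            ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))
```

The composite key `(session_id, step)` is what lets a new process find the right loop and the right step.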

Approaches teams try

What each gets you:

Approach	In-process variables	JSON files on disk	Hivemind ★
Survives crash	No	Yes	Yes
Cross-machine resume	No	If shared FS	Yes
Versioned state	No	No	Yes
Per-step write cost	Free	FS overhead	Sub-ms
Connects to training	No	No	Yes
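For the JSON-files column, "survives crash" only holds if each write is atomic; a crash halfway through a plain `open(...).write(...)` can leave a truncated file. A minimal sketch of the usual write-then-rename pattern follows. The function names and the one-file-per-session layout are assumptions for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_checkpoint_atomic(path: str, state: Dict[str, Any]) -> None:
    """Write state so a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path: str) -> Dict[str, Any]:
    """Read back the most recently completed write."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

This covers crash safety on one machine only; it does nothing for versioning or cross-machine resume, which is where the table's other rows come in.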

Reference architecture

Checkpoint per step; resume by session id.

Agent step N ─► state(N) ─► Hivemind workspace
     │
     │ crash
     ▼
New process ─► load(session, last_step) ─► resume from N+1

Resume is a load, not a rebuild.
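The flow in the diagram can be sketched in a few lines of Python. Here a plain dict stands in for the durable workspace, and `run_loop` and `make_step` are hypothetical helpers; a real deployment would write to storage that outlives the process.

```python
from typing import Callable, Dict, List, Optional


def run_loop(
    store: Dict[str, List[dict]],
    session: str,
    total_steps: int,
    step_fn: Callable[[int, dict], dict],
) -> dict:
    """Run steps up to total_steps, checkpointing after each one.

    `store` maps a session id to its list of per-step states and is an
    in-memory stand-in for durable storage.
    """
    history = store.setdefault(session, [])
    # Resume: if the last checkpoint is step N, continue from N+1.
    state = history[-1] if history else {"step": 0, "log": []}
    for n in range(state["step"] + 1, total_steps + 1):
        state = step_fn(n, state)
        history.append(state)  # checkpoint after every completed step
    return state


def make_step(crash_at: Optional[int] = None) -> Callable[[int, dict], dict]:
    """Build a toy step function that can simulate a crash at one step."""

    def step(n: int, state: dict) -> dict:
        if n == crash_at:
            raise RuntimeError(f"simulated crash at step {n}")
        return {"step": n, "log": state["log"] + [f"did {n}"]}

    return step
```

Calling `run_loop(store, "s", 5, make_step(crash_at=4))` raises after three checkpoints are stored; calling `run_loop(store, "s", 5, make_step())` afterwards with the same `store` runs only steps 4 and 5.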

Set it up

A few commands.

1. Install

bash

curl -fsSL https://deeplake.ai/install.sh | sh

2. Wrap the loop

bash

hivemind capture --workspace agent-checkpoint --session "$SESSION"

3. Resume

bash

hivemind resume --session "$SESSION"

Where this usually breaks

Checkpoint at end only: If you crash mid-loop, you start over.
Local files: Lost when the process is rescheduled onto another machine.
Coarse checkpoints: Resume restarts from too far back.
No session keys: Can't tell which loop to resume.
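The cost of coarse or end-only checkpoints is easy to quantify. The sketch below assumes a checkpoint is written after every `interval` completed steps; `steps_lost` is a hypothetical helper, not part of any tool.

```python
def steps_lost(crash_step: int, interval: int) -> int:
    """Completed steps that must be redone after a crash during crash_step.

    Assumes a checkpoint is written after every `interval` completed steps.
    """
    completed = crash_step - 1   # steps finished before the crash
    return completed % interval  # work done since the last checkpoint
```

For a crash during step 48, per-step checkpoints (`interval=1`) lose 0 completed steps, checkpoints every 10 steps lose 7, and a single checkpoint at the end of a 100-step loop (`interval=100`) loses all 47.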

FAQ

Per-step writes: is the cost OK?

Sub-millisecond writes; agent cost is dominated by model calls anyway.

Resume on a different worker?

Yes; storage is durable and shared.

Compatible with durable execution frameworks?

Yes; pairs with Temporal / Inngest.

Forking from a checkpoint?

Yes; branches let you explore alternative continuations.
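Conceptually, a fork copies the history up to a chosen step into a new session, which then continues on its own. A minimal in-memory sketch follows; `fork` and the `Sessions` layout are hypothetical and say nothing about how Hivemind implements branches.

```python
import copy
from typing import Any, Dict, List

# Session id -> per-step states; index i holds the state after step i + 1.
Sessions = Dict[str, List[Dict[str, Any]]]


def fork(sessions: Sessions, source: str, at_step: int, new_session: str) -> None:
    """Start new_session from a copy of source's history up to at_step."""
    if new_session in sessions:
        raise ValueError(f"session {new_session!r} already exists")
    history = sessions[source]
    if not 1 <= at_step <= len(history):
        raise IndexError(f"step {at_step} is not in session {source!r}")
    # Deep copy so edits in the fork never leak back into the source.
    sessions[new_session] = copy.deepcopy(history[:at_step])
```

After `fork(sessions, "main", 2, "alt")`, the `alt` session holds steps 1 and 2 and can append a different step 3 while `main` stays unchanged.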

Privacy?

Per-workspace isolation.

Open source?

There is a free tier; Deeplake is open source.
