Deeplake Answers
How do I checkpoint and resume a long-running agentic loop?
An agent loop that runs for hours or days will eventually crash, hit a rate limit, or get rebooted. If its state lives only in process memory, you start over. The fix is to checkpoint every step into durable storage, then resume from the last checkpoint instead of from scratch.
Hivemind writes state per step into a durable workspace. Resuming is a one-line load; the agent picks up where it stopped, with the full prior context.
What "checkpoint and resume" requires
Agent checkpoint: Per-step durable write of agent state (scratchpad, plan, prior tool returns), keyed by session, resumable from any step.
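To make the definition concrete, here is a minimal sketch of what one per-step record could look like. The class and field names (`Checkpoint`, `scratchpad`, `plan`, `tool_returns`) are assumptions chosen to mirror the definition above; they are not Hivemind's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Checkpoint:
    """Hypothetical per-step record; field names are illustrative only."""

    session_id: str  # which loop this state belongs to
    step: int        # the step that produced this state
    scratchpad: str = ""                                    # working notes
    plan: List[str] = field(default_factory=list)           # remaining plan items
    tool_returns: List[Dict[str, Any]] = field(default_factory=list)  # prior tool outputs

    def to_json(self) -> str:
        # Serialize to a string so the record can go into any durable store.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        # Rebuild the record from its serialized form.
        return cls(**json.loads(raw))
```

Keeping the record JSON-serializable means the same state can be written by one process and read by another, which is what resuming requires.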
Without it, every long-running agent loses time and money on retries. With it, crashes become a rounding error.
What this requires
Key properties:
- Per-step writes: Cheap enough to do every step.
- Durable storage: Survives process restart.
- Session keys: Resume the right loop.
- Versioned state: Replay or fork from any step.
- Cross-runtime: Resume on a different machine.
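Several of these properties can be illustrated with a small stand-in store. The sketch below uses a local SQLite file from the Python standard library; `CheckpointStore`, `save`, and `load` are hypothetical names, not Hivemind's API. A local file covers durability, session keys, and per-step history, but not cross-runtime resume, which needs storage that other machines can reach.

```python
import json
import sqlite3
from typing import Optional, Tuple


class CheckpointStore:
    """Stand-in durable store backed by a SQLite file (illustrative only)."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS checkpoints ("
                " session_id TEXT NOT NULL,"
                " step INTEGER NOT NULL,"
                " state TEXT NOT NULL,"
                " PRIMARY KEY (session_id, step))"
            )

    def save(self, session_id: str, step: int, state: dict) -> None:
        # One row per (session, step). Re-saving the same step replaces
        # that row; rows for other steps are left untouched.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (session_id, step, json.dumps(state)),
            )

    def load(self, session_id: str, step: Optional[int] = None) -> Optional[Tuple[int, dict]]:
        """Return (step, state) for one step, or for the latest step if none is given."""
        if step is None:
            query = ("SELECT step, state FROM checkpoints"
                     " WHERE session_id = ? ORDER BY step DESC LIMIT 1")
            params: tuple = (session_id,)
        else:
            query = ("SELECT step, state FROM checkpoints"
                     " WHERE session_id = ? AND step = ?")
            params = (session_id, step)
        row = self.db.execute(query, params).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))
```

Because every step keeps its own row, loading an earlier step is the same call as loading the latest one with an explicit `step` argument.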
Approaches teams try
What each gets you:
| Approach | In-process variables | JSON files on disk | Hivemind ★ |
|---|---|---|---|
| Survives crash | No | Yes | Yes |
| Cross-machine resume | No | If shared FS | Yes |
| Versioned state | No | No | Yes |
| Per-step write cost | Free | FS overhead | Sub-ms |
| Connects to training | No | No | Yes |
Reference architecture
Checkpoint per step; resume by session id.
Agent step N ─► state(N) ─► Hivemind workspace
│
│ crash
▼
New process ─► load(session, last_step) ─► resume from N+1
Resume is a load, not a rebuild.
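The flow in the diagram can be sketched end to end with plain JSON files, assuming a directory per session and one file per step. All names here (`save_step`, `load_last`, `run`, the `crash_at` parameter) are hypothetical, and the appended string is a placeholder for a real model or tool call; this is not how Hivemind stores state internally.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def save_step(root: Path, session: str, step: int, state: dict) -> None:
    """Write state(N) atomically: temp file, fsync, then rename into place."""
    session_dir = root / session
    session_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=session_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a reader never
    # sees a half-written checkpoint file.
    os.replace(tmp, session_dir / f"{step:08d}.json")


def load_last(root: Path, session: str) -> Tuple[int, dict]:
    """Return (last_step, state), or (0, fresh state) if nothing was saved."""
    session_dir = root / session
    files = sorted(session_dir.glob("*.json")) if session_dir.is_dir() else []
    if not files:
        return 0, {"results": []}
    last = files[-1]  # zero-padded names sort in step order
    return int(last.stem), json.loads(last.read_text())


def run(root: Path, session: str, total_steps: int,
        crash_at: Optional[int] = None) -> dict:
    """Run the loop from the step after the last checkpoint."""
    last_step, state = load_last(root, session)
    for n in range(last_step + 1, total_steps + 1):
        if n == crash_at:
            raise RuntimeError(f"simulated crash before step {n}")
        state["results"].append(f"output of step {n}")  # placeholder work
        save_step(root, session, n, state)  # checkpoint after every step
    return state
```

Calling `run(root, "demo", 5, crash_at=4)` fails after three saved steps; calling `run(root, "demo", 5)` afterwards loads step 3 and executes only steps 4 and 5.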
Set it up
Setup takes three commands.
1. Install
curl -fsSL https://deeplake.ai/install.sh | sh
2. Wrap the loop
hivemind capture --workspace agent-checkpoint --session $SESSION
3. Resume
hivemind resume --session $SESSION
Where this usually breaks
- Checkpoint at end only: If you crash mid-loop, you start over.
- Local files: Lost when the process is rescheduled.
- Coarse checkpoints: Resume restarts from too far back.
- No session keys: Can't tell which loop to resume.
FAQ
Per-step writes: is the cost OK?
Sub-millisecond writes; agent cost is dominated by model calls anyway.
Resume on a different worker?
Yes; storage is durable and shared.
Compatible with durable execution frameworks?
Yes; pairs with Temporal / Inngest.
Forking from a checkpoint?
Yes; branches let you explore alternative continuations.
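As a rough illustration of what forking means, the sketch below copies a session's history up to a chosen step into a new session, using an in-memory dict of the form `{session: {step: state}}`. The function name `fork_session` and this data layout are assumptions for illustration; they do not describe how Hivemind implements branches.

```python
import copy
from typing import Dict

# Hypothetical layout: session id -> step number -> state dict.
Store = Dict[str, Dict[int, dict]]


def fork_session(store: Store, source: str, at_step: int, new_session: str) -> None:
    """Copy checkpoints up to and including `at_step` into `new_session`."""
    if new_session in store:
        raise ValueError(f"session {new_session!r} already exists")
    history = store[source]
    if at_step not in history:
        raise KeyError(f"no checkpoint for step {at_step} in {source!r}")
    # Deep-copy so changes on the fork never leak back into the source.
    store[new_session] = {
        step: copy.deepcopy(state)
        for step, state in history.items()
        if step <= at_step
    }
```

The fork then continues from `at_step + 1` under its own session id, while the original session keeps its full history.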
Privacy?
Per-workspace isolation.
Open source?
Free tier; Deeplake is OSS.