Agents that post to Slack, send email, charge customers, or call paid external APIs need at-most-once tool execution. If the process dies between “handler ran” and “tool result persisted,” a naive resume re-runs the handler on the next invocation: your bot posts twice, your customer gets billed twice, your external API fires twice.
@vertz/agents handles this automatically. Pair a durable store with a
sessionId on run() and you get:
- Each tool-call step commits atomically (one write pre-dispatch, one post-dispatch).
- On resume, the framework detects orphaned tool calls and either re-invokes safe handlers or surfaces a typed ToolDurabilityError tool_result for the LLM to reason about.
- No separate resume() API and no durable-mode flag: durability is a consequence of using a durable store with session identity.
Activation
Durable resume turns on when all three are true:
- You pass a store to run().
- You pass a sessionId.
- The store is durable: sqliteStore or d1Store. The in-memory memoryStore cannot guarantee durable writes and throws MemoryStoreNotDurableError at entry if combined with sessionId.
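The three conditions can be modelled as a small predicate. This is a minimal sketch, not the @vertz/agents source: `RunOptions`, `StoreKind`, and `durableResumeActive` are hypothetical names, and the error class is a local stand-in that only mirrors the name used above.

```typescript
// Hypothetical model of the activation rule; not the library's implementation.
type StoreKind = "sqlite" | "d1" | "memory";

interface RunOptions {
  store?: { kind: StoreKind };
  sessionId?: string;
}

// Local stand-in mirroring the error name mentioned in the text.
class MemoryStoreNotDurableError extends Error {
  constructor() {
    super("memoryStore cannot be combined with sessionId");
    this.name = "MemoryStoreNotDurableError";
  }
}

// Returns true when durable resume would be active for these options.
function durableResumeActive(opts: RunOptions): boolean {
  const { store, sessionId } = opts;
  // Missing store or missing session identity: nothing durable to resume.
  if (store === undefined || sessionId === undefined) return false;
  // An in-memory store combined with a sessionId is rejected at entry.
  if (store.kind === "memory") throw new MemoryStoreNotDurableError();
  return true; // sqlite or d1, with a session identity
}
```

In this model, `durableResumeActive({ store: { kind: "d1" } })` returns false because no sessionId was passed, while adding a sessionId flips it to true.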
The safeToRetry flag
A tool declaration controls how the framework behaves if its handler
was requested but its result was lost:
safeToRetry is NOT network retry
This is a common confusion worth calling out:
- safeToRetry only controls resume replay: whether the framework re-invokes a handler when a previous run crashed.
- It does nothing for transient network errors during normal execution. If your handler calls fetch() and the fetch fails, the error is persisted as the tool_result either way.
safeToRetry: true is a declaration about the operation, not about
retry policy. Think “this call is safe to run twice” not “retry this
call on failure.”
What resume looks like
Consider a triage bot whose handler posts to Slack. The normal flow:
1. LLM: “I’ll post to Slack” → requests postSlack.
2. Framework writes the assistant message with the tool_call id.
3. postSlack handler runs. Slack gets the message.
4. Framework writes the tool_result.

If the process crashes after step 2 but before step 4 completes, the next run() with the
same sessionId loads the session and sees: assistant asked for
postSlack, no tool_result exists. For a non-safeToRetry tool like
postSlack:
5. Framework writes a synthetic tool_result carrying a ToolDurabilityError.
6. The LLM’s next turn sees the error in the message history and decides what to do: check Slack for a duplicate post, ask the user, abort the thread, etc.

For a safeToRetry: true tool like getIssue, step 5 instead
re-invokes the handler and persists the real result. Step 6 never
happens, so the LLM never sees the crash.
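The resume logic described above can be sketched as orphan detection followed by per-call resolution. This is an illustrative model under stated assumptions: the `Message` and `Tool` shapes are hypothetical, handlers are synchronous for brevity, and the synthetic error string is a placeholder rather than the framework's actual content.

```typescript
// Hypothetical message and tool shapes; the real store schema is not shown on this page.
interface ToolCall { id: string; name: string; input: unknown }

type Message =
  | { role: "assistant"; toolCalls: ToolCall[] }
  | { role: "tool"; toolCallId: string; content: string };

interface Tool {
  safeToRetry?: boolean;
  handler: (input: unknown) => string;
}

// Find tool calls that were requested but never received a tool_result.
function findOrphans(history: Message[]): ToolCall[] {
  const answered: Record<string, boolean> = {};
  for (const m of history) {
    if (m.role === "tool") answered[m.toolCallId] = true;
  }
  const orphans: ToolCall[] = [];
  for (const m of history) {
    if (m.role !== "assistant") continue;
    for (const call of m.toolCalls) {
      if (answered[call.id] !== true) orphans.push(call);
    }
  }
  return orphans;
}

// Resolve each orphan: replay safe tools, otherwise record a synthetic error result.
function resume(history: Message[], tools: Record<string, Tool>): Message[] {
  const results: Message[] = [];
  for (const call of findOrphans(history)) {
    const tool = tools[call.name];
    const content =
      tool !== undefined && tool.safeToRetry === true
        ? tool.handler(call.input) // safe: re-invoke and persist the real result
        : `ToolDurabilityError: result of ${call.name} (${call.id}) was lost`; // placeholder text
    results.push({ role: "tool", toolCallId: call.id, content });
  }
  return history.concat(results);
}
```

After `resume` runs, every tool call in the history has a matching result, so a second resume over the same history finds no orphans and changes nothing.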
Crash windows in detail
The framework performs two atomic writes per tool-call step, plus one at end-of-turn for trailing text. Crash outcomes:

| Crash window | Store state | Resume behavior |
|---|---|---|
| Before the first persisted write | no new messages for this turn | No orphan. Next LLM call starts fresh from prior state. |
| Between write #1 (assistant + user + toolCalls) and handler dispatch | assistant-with-toolCalls persisted, no tool_results | Orphan. For each call: if safeToRetry, re-invoke; else surface ToolDurabilityError. |
| During handler dispatch | same as above | Same behavior. The framework cannot distinguish “handler never started” from “handler ran + result lost” without safeToRetry. |
| Between handlers and write #2 | same as above | Same behavior. |
| Mid-write #2 | atomic — either all tool_results present or none | Either case is well-defined. |
| After write #2 | full step committed | No orphan. Loop resumes normally. |
Non-safeToRetry tools that crash in the middle windows are
handled pessimistically by design: the framework cannot know if your side
effect landed. The LLM decides, with the error visible in history.
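To see why the middle windows look identical from the store's point of view, here is a small simulation of one tool-call step with an injected crash. Everything in it is hypothetical scaffolding (the `Row` shape, the `CrashPoint` names, the array standing in for a store); it only models the two-write sequence described in the table.

```typescript
// Hypothetical stand-in for a durable store: each push models one atomic write.
interface Row { kind: "assistant_tool_calls" | "tool_results"; callIds: string[] }

type CrashPoint = "before-write-1" | "before-dispatch" | "before-write-2" | "none";

// Runs one step, throwing at the requested crash point.
function runStep(store: Row[], callIds: string[], crashAt: CrashPoint, sideEffects: string[]): void {
  if (crashAt === "before-write-1") throw new Error("crash");
  store.push({ kind: "assistant_tool_calls", callIds }); // write #1
  if (crashAt === "before-dispatch") throw new Error("crash");
  for (const id of callIds) sideEffects.push(id); // handlers run; the outside world changes
  if (crashAt === "before-write-2") throw new Error("crash");
  store.push({ kind: "tool_results", callIds }); // write #2
}

// A step is orphaned when write #1 landed but write #2 did not.
function hasOrphan(store: Row[]): boolean {
  const calls = store.filter((r) => r.kind === "assistant_tool_calls").length;
  const results = store.filter((r) => r.kind === "tool_results").length;
  return calls > results;
}
```

Crashing at `"before-dispatch"` and at `"before-write-2"` leaves the same rows in the store, even though only the second case produced a side effect. That ambiguity is what the safeToRetry declaration resolves.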
Performance
Under durable execution, each tool-call step commits two atomic writes to the store instead of one end-of-run batch. For a 10-step loop with 2 tools per step, that’s roughly 20 writes instead of 1. On Cloudflare D1 same-region, expect ~100–200ms overhead per 10-step session. For high-volume read-heavy agents that don’t need crash recovery, omit sessionId to run statelessly; no durable writes happen.
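The write count follows directly from the two-writes-per-step model. As a rough sketch (the function name is hypothetical), note that the number of tools per step does not appear, because all tool_results for a step commit in a single atomic write:

```typescript
// Counts store writes: two per tool-call step, plus one if the turn ends with trailing text.
function durableWrites(steps: number, hasTrailingText: boolean): number {
  return steps * 2 + (hasTrailingText ? 1 : 0);
}
```

For the 10-step loop above this gives 20 writes, or 21 when the final turn also persists trailing text.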
The ToolDurabilityError class
Exported from @vertz/agents so callers inspecting resumed session
history can pattern-match:
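As a rough illustration of that pattern-matching, the sketch below uses a local stand-in class rather than the real export. Its constructor arguments, its fields, and the `HistoryEntry` shape are all assumptions, and matching on the `name` property is one possible approach, not necessarily the intended one.

```typescript
// Local stand-in for the exported class; the real fields and constructor may differ.
class ToolDurabilityErrorSketch extends Error {
  constructor(public toolName: string, public toolCallId: string) {
    super(`result of ${toolName} (${toolCallId}) was lost`);
    this.name = "ToolDurabilityError";
  }
}

// Hypothetical shape of one resolved tool call in a loaded session history.
interface HistoryEntry { toolCallId: string; result: unknown }

// Matches on the error name so ordinary handler failures are not confused with lost results.
function isDurabilityError(value: unknown): boolean {
  return value instanceof Error && value.name === "ToolDurabilityError";
}

// Lists the tool calls whose results were lost in a crash.
function lostToolCalls(history: HistoryEntry[]): string[] {
  return history.filter((e) => isDurabilityError(e.result)).map((e) => e.toolCallId);
}
```

A caller could use `lostToolCalls` after loading a session to decide which side effects need manual verification.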
Testing
The package exposes a crash harness at @vertz/agents/testing for
writing resume tests:
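The harness API itself is not reproduced here. To show the general shape of a resume test, the sketch below hand-rolls a fault-injecting store; `Store`, `crashingStore`, and `step` are hypothetical helpers and not part of @vertz/agents/testing.

```typescript
// Hand-rolled stand-in, not the real harness: a store that fails on its Nth write.
interface Store {
  write(row: string): void;
  rows(): string[];
}

function crashingStore(failOnWrite: number): Store {
  const rows: string[] = [];
  let writes = 0;
  return {
    write(row) {
      writes += 1;
      if (writes === failOnWrite) throw new Error("simulated crash");
      rows.push(row);
    },
    rows: () => rows.slice(), // copy so callers cannot mutate stored rows
  };
}

// One tool-call step: write #1, run the handler, write #2.
function step(store: Store, handler: () => string): void {
  store.write("assistant:tool_call");
  store.write("tool_result:" + handler());
}
```

Failing the second write reproduces the “handler ran, result lost” window: the handler executes once, yet only the assistant row is persisted, which is the state a resume test should then assert against.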