A founder I know left a Claude Code agent running on a refactoring task at 11 PM. The task was a 22-file cross-cutting rename. Nothing exotic. He went to sleep.
At 7 AM he woke up to a $3,412 token bill. The agent had hit an integration test that was flaky against a third-party stub, looped on the test fix for 6 hours, and never asked for help.
The agent did not malfunction. It did what agents do: it tried to make the test green. The fix that should have taken 40 minutes ran for 360 minutes because nobody told the agent when to stop.
The math that would have predicted this is in the Agent Loop Cost Predictor. The hook that would have prevented it is in the Hooks Builder.
The math
A Claude Code session has four cost knobs, each multiplying the others:
- Iterations. How many times the agent submits a fresh request. Each one is a separate API call.
- Tool calls per iteration. Reads, edits, bash runs. Each one returns content that lands in the next iteration's context.
- Tokens per tool result. A file read returns the whole file. A grep returns matching lines. A bash run returns stdout.
- Output tokens. What the agent generates each turn: plan, code, response.
The cost per iteration is roughly:
cost_per_iter = (cumulative_context * input_rate) + (output_tokens * output_rate)
Where cumulative_context grows every iteration because each tool result and each output is appended to history.
Caching helps. Anthropic's prompt cache reduces the cost of already-seen context by up to 90 percent. So the real formula is:
effective_input = context * (1 - 0.9 * cache_hit_rate)
If you have caching on and a 70 percent hit rate, your effective input cost is 37 percent of nominal. If caching is off (and in my audits it is off in roughly half of production setups), you pay full freight.
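The two formulas above combine into a few lines. A minimal sketch; the rates and token counts here are illustrative, not taken from the incident:

```python
# Per-iteration cost model from the two formulas above.
# Rates and token counts are illustrative assumptions.

def cost_per_iter(context_tokens, output_tokens, input_rate, output_rate,
                  cache_hit_rate=0.0):
    """Dollar cost of one iteration: cached-adjusted input plus output."""
    # effective_input = context * (1 - 0.9 * cache_hit_rate)
    effective_input = context_tokens * (1 - 0.9 * cache_hit_rate)
    return (effective_input / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# 100K context, 600 output tokens, Opus-class rates ($15/M in, $75/M out):
no_cache = cost_per_iter(100_000, 600, 15, 75)        # caching off: full freight
cached   = cost_per_iter(100_000, 600, 15, 75, 0.7)   # 70 percent hit rate
print(round(no_cache, 3), round(cached, 3))
```

Note the input portion drops to 37 percent of nominal, but the total drops a little less, because output tokens are never cached.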
What happened in the $3,400 incident
I rebuilt the numbers from the chat history afterwards:
- Model: Claude Opus 4.7 (frontier, $15/M input, $75/M output)
- Iterations: 287
- Tool calls per iteration: 4.2 average
- Tokens per tool result: 1,800 average (mostly file reads)
- Output tokens per iteration: 600 average
- Caching: not configured
- Initial context: 14,000 tokens (system + CLAUDE.md + skills)
The cumulative context grew by roughly 8,200 tokens per iteration. By iteration 287, each call was carrying roughly 2.4 million tokens of context, and with caching off, every token of it was billed at the full input rate.
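The growth figures follow directly from the averages above; a quick check:

```python
# Context growth per iteration, from the incident's averages.
tool_calls = 4.2           # tool calls per iteration
tokens_per_result = 1_800  # average tokens per tool result
output_tokens = 600        # output tokens per iteration
initial_context = 14_000   # system + CLAUDE.md + skills

growth = tool_calls * tokens_per_result + output_tokens  # tokens added per iteration
final_context = initial_context + 287 * growth           # context at iteration 287

print(int(growth), int(final_context))  # ~8,200 per iteration, ~2.4M final
```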
Total cost: ~$3,400. The predictor agrees within a few percent.
The three things that stop this
Thing 1: a token budget hook
The simplest fix. Cap cumulative task tokens at 200,000. The hook tracks usage and refuses new tool calls past the cap. The agent stops. The human is told. The bill stops at roughly $20, not $3,400.
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": ".*",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/budget-guard.sh 200000" }
        ]
      }
    ]
  }
}
```
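The guard script itself is not shown here. As one sketch of the idea (the real `budget-guard.sh` is bash; this is a hypothetical Python equivalent), a PreToolUse hook receives JSON on stdin including a `transcript_path`, and exiting with code 2 blocks the tool call and surfaces stderr to the agent. Token usage is estimated crudely from the transcript size at roughly 4 characters per token:

```python
#!/usr/bin/env python3
# Hypothetical Python sketch of a budget guard (the article's version is bash).
# Assumes Claude Code's hook contract: JSON hook input on stdin (including
# transcript_path), and exit code 2 blocks the tool call, feeding stderr
# back to the agent.
import json
import os
import sys

CHARS_PER_TOKEN = 4  # rough heuristic for estimating tokens from text size

def estimate_tokens(transcript_bytes: int) -> int:
    """Crude token estimate from the transcript file size."""
    return transcript_bytes // CHARS_PER_TOKEN

def main(budget: int) -> int:
    event = json.load(sys.stdin)
    path = event.get("transcript_path", "")
    size = os.path.getsize(path) if path and os.path.exists(path) else 0
    used = estimate_tokens(size)
    if used > budget:
        print(f"Token budget exceeded: ~{used} used, cap {budget}. "
              "Stop and ask the human to slice the task smaller.",
              file=sys.stderr)
        return 2  # blocking exit: the tool call is refused
    return 0

if __name__ == "__main__":
    cap = int(sys.argv[1]) if len(sys.argv) > 1 else 200_000
    sys.exit(main(cap))
```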
200K is generous for most legitimate work. Sessions that approach it should stop and ask the human to slice the task smaller.
Thing 2: prompt caching on
Anthropic prompt caching is one config flag. It reduces context costs by up to 90 percent. If you are running agents and you are not caching, you are paying 3 to 10x what you need to.
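For direct API use (Claude Code manages its own caching), the opt-in is per content block via `cache_control`. A sketch of what the request payload looks like; the model id and prompt text are placeholders:

```python
# Illustrative Messages API payload with prompt caching opted in.
# cache_control marks the stable prefix (system prompt, skills, docs)
# so subsequent calls reuse it at the discounted cached-input rate.
payload = {
    "model": "claude-sonnet-4-5",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a refactoring agent. <large stable preamble>",
            "cache_control": {"type": "ephemeral"},  # cache up to this block
        }
    ],
    "messages": [
        {"role": "user", "content": "Rename FooService across the repo."}
    ],
}
```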
Thing 3: model routing
The $3,400 incident ran on Opus. Most of those iterations were the agent trying small edits and running tests. That is Sonnet work. Or Haiku work.
The rule I run: tasks under 5 files run on Sonnet. Tasks over 12 files escalate to Opus. Tasks involving infra or migrations confirm with the human first. This single change cuts agent spend by 35 to 50 percent on the median team.
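The rule fits in a few lines. One detail the text leaves open is the 5-to-12-file middle band; this sketch assumes Sonnet there too, and the names are hypothetical:

```python
# Hypothetical router for the rule above. The 5-to-12-file band is not
# specified in the text; this sketch assumes Sonnet for it as well.
def route_model(files_touched: int, touches_infra: bool = False) -> str:
    if touches_infra:
        return "ask-human"  # infra or migrations: confirm with the human first
    if files_touched > 12:
        return "opus"       # large cross-cutting tasks escalate
    return "sonnet"         # small and mid-size tasks

print(route_model(3), route_model(22), route_model(8, touches_infra=True))
```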
How to predict your own worst case
Open the predictor. Plug in your worst-case loop assumptions:
- Iterations: 300 (an agent that is genuinely stuck and not asking for help)
- Tool calls per iteration: your typical count from real sessions
- Tokens per tool result: depends on file sizes; 1,500 is a reasonable default
- Caching hit rate: be honest. If caching is off, use 0.
The number that comes out is your overnight exposure. If you would not be comfortable paying it, write the budget hook this week.
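If you want the same number without the predictor, the checklist above maps onto a short loop. A sketch under stated assumptions: initial context (14K) and Opus-class rates are carried over from the incident section, and context-window compaction is ignored for simplicity:

```python
# Worst-case exposure from the checklist above. Initial context and
# Opus-class rates are assumptions; compaction is ignored for simplicity.
def worst_case(iterations=300, tool_calls=4, tokens_per_result=1_500,
               output_tokens=600, cache_hit_rate=0.0,
               initial_context=14_000, input_rate=15, output_rate=75):
    total = 0.0
    context = initial_context
    for _ in range(iterations):
        effective = context * (1 - 0.9 * cache_hit_rate)  # cache discount
        total += (effective / 1e6) * input_rate + (output_tokens / 1e6) * output_rate
        context += tool_calls * tokens_per_result + output_tokens  # history grows
    return total

print(f"no cache:  ${worst_case():,.0f}")
print(f"70% cache: ${worst_case(cache_hit_rate=0.7):,.0f}")
```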
The honest part
Every team I have audited had a story like this. Most had not hit a $3,400 incident yet, but nearly all had something in the $200 to $400 range that nobody investigated because "we should probably look into that someday." Someday turns into the post-mortem.
The math is not hard. The hook is 70 lines of bash. The savings are real.
Receipts
- Confirmed runaway-loop incidents across audits in the last quarter: 4.
- Median wasted spend per incident: $830.
- Largest single incident: $3,412.
- Teams with token budget hooks before the audit: 0 of 12.
- Teams with hooks 30 days after the audit: 12 of 12.