Calibrated Approval: When Human-in-Loop Fails AI Systems

On May 1, 2026, six national cybersecurity agencies — the U.S. CISA and NSA, Australia's ASD ACSC, the Canadian Cyber Centre, New Zealand's NCSC, and the UK's NCSC — jointly published "Careful Adoption of Agentic AI Services." One of its worked examples: an over-privileged patch-management agent obediently deletes firewall logs alongside applying a patch, because its permissions allow the action.

A week earlier, on April 24, that pattern had already played out for real. A Cursor agent powered by Anthropic's Claude Opus 4.6 issued a single GraphQL call against Railway's API, deleted PocketOS's production database and its volume-level backups in nine seconds, and left the company restoring from a three-month-old off-volume backup.

The reflex reading is "add a human in the loop." Anthropic measured what that gets you: 93% of Claude Code permission prompts get approved. The named pattern below is the engineering answer in the middle.

The spectrum

Approval gating in an agent is rate limiting for action surface. You don't reject every request — you reject above a threshold. The same logic applies here, and the failure modes are the same on both ends.

At one end, gate everything. That 93% rate above is the failure mode — Anthropic's own term for it is approval fatigue. The gate exists, the human is in the loop, and the loop has learned to rubber-stamp.

At the other end, gate nothing. That is the PocketOS shape. The destructive call had no confirmation step, no typed-name prompt, no dry-run flag. Railway's CEO confirmed afterward that the no-prompt deletion was the documented API behavior, not a defect. It worked exactly as designed; what was missing was a layer above the API that decided this particular call shouldn't go through.

The middle position, conditional gating sized to the action, is what I'll call Calibrated Approval. It has three pieces: a score for how destructive a call is, computed before the call fires; a threshold above which the call pauses; and an optional classifier as a second filter for the calls that don't trip the threshold but should still pause.
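A minimal sketch of how those three pieces might compose, in Python. Everything in it is hypothetical: the names gate and Verdict, the 0-to-1 score scale, and the 0.7 threshold are illustrative assumptions, not values taken from any of the sources cited here.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    ALLOW = "allow"
    PAUSE = "pause"  # hold the call until a human approves it


def gate(
    score: float,
    threshold: float = 0.7,  # illustrative value, not a recommendation
    classifier: Optional[Callable[[], bool]] = None,
) -> Verdict:
    """Decide whether a proposed call fires or pauses.

    score      -- destructiveness in [0, 1], computed before the call fires
    threshold  -- scores at or above this always pause
    classifier -- optional second filter; returns True when the call looks
                  risky even though its score is under the threshold
    """
    if score >= threshold:
        return Verdict.PAUSE
    if classifier is not None and classifier():
        return Verdict.PAUSE
    return Verdict.ALLOW
```

The ordering is a design choice: the cheap, deterministic score runs first, and the classifier is consulted only for calls the score lets through.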

The pattern is not novel. The Five Eyes guidance is explicit that "deciding which actions require approval is a job for system designers, not the agent." Calibrated Approval is the engineering shape of that sentence.

What it would have caught — and what it wouldn't

Replay PocketOS as if the gate were wired in — the way you'd replay a query through a different planner to see what changes.

The agent encountered a credential mismatch in staging and decided to fix the problem by deleting a Railway volume. It scanned the codebase, found a token created for managing custom domains, and used it. The token was over-scoped — created legitimately for a different purpose, but scoped for any operation, including destructive ones.

The destructive call — a single GraphQL volume-deletion mutation against the production Railway environment — would score high on every factor of a blast-radius rubric I'd score against:

Reversibility: none — the volume's deletion is immediate and the backups live in the same volume.
Scope: the whole production database.
Target sensitivity: production, customer-facing data.
Cost: nothing newer than a three-month-old off-volume backup is recoverable.

That call gets paused. The pattern catches it.
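One way to turn that rubric into something a gate can evaluate, as a rough sketch. The 0-to-3 rating scale, the threshold of 8, and the "any factor at maximum" rule are my assumptions for illustration; the ratings for the PocketOS call are hand-assigned from the description above, not measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    # Each factor is rated 0 (benign) to 3 (worst case) by the system designer.
    reversibility: int
    scope: int
    target_sensitivity: int
    cost: int

    @property
    def total(self) -> int:
        return self.reversibility + self.scope + self.target_sensitivity + self.cost

    def should_pause(self, threshold: int = 8) -> bool:
        # Pause on a high total, or whenever any single factor is at its maximum.
        factors = (self.reversibility, self.scope, self.target_sensitivity, self.cost)
        return self.total >= threshold or max(factors) == 3


# The production volume deletion, rated by hand.
volume_delete = BlastRadius(reversibility=3, scope=3, target_sensitivity=3, cost=3)
# A read-only query against staging, for contrast.
staging_select = BlastRadius(reversibility=0, scope=1, target_sensitivity=0, cost=0)

print(volume_delete.total, volume_delete.should_pause())    # 12 True
print(staging_select.total, staging_select.should_pause())  # 1 False
```

The single-factor rule matters for calls like an email send: low scope and low cost, but irreversible, so the total alone would let it through.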

What it does not catch is the step before — the agent quietly scanning the repo for a usable token after hitting an auth error. That's not a destructive action; it's a read followed by a permission escalation. Calibrated Approval gates the destructive call, not the credential discovery. Anthropic calls this family of behavior credential exploration and tracks it in an internal incident log alongside deleted git branches and uploaded auth tokens. The Five Eyes guidance has the name for it one layer up: privilege. Their first risk category.

Two different gates need to be in place. Calibrated Approval is the action-gate. Identity scoping — verified agent identity, short-lived credentials — is the permission-gate. PocketOS is the case where both were missing.
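The permission-gate can be sketched the same way. This is an assumption-heavy illustration: ScopedToken, is_permitted, the agent name, and scope strings like "domains:write" are hypothetical and do not describe Railway's actual token model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopedToken:
    agent_id: str         # the verified identity the token was issued to
    scopes: frozenset     # the operations this token may perform
    expires_at: datetime  # short-lived: minutes, not months


def is_permitted(token: ScopedToken, agent_id: str, required_scope: str, now: datetime) -> bool:
    """Allow a call only if the token belongs to this agent, has not
    expired, and explicitly names the required scope."""
    return (
        token.agent_id == agent_id
        and now < token.expires_at
        and required_scope in token.scopes
    )


issued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
domain_token = ScopedToken(
    agent_id="deploy-agent",
    scopes=frozenset({"domains:write"}),
    expires_at=issued + timedelta(minutes=15),
)

# A token minted for domain management cannot authorize a volume deletion.
print(is_permitted(domain_token, "deploy-agent", "volumes:delete", issued))  # False
print(is_permitted(domain_token, "deploy-agent", "domains:write", issued))   # True
```

With a check like this in front of the API, a token found in the repo fails on scope, on expiry, or on identity before the action-gate is ever consulted.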

Where it breaks

Every production filter has the same three failure modes: false positives, false negatives, and a self-trust problem. Calibrated Approval is no exception.

The scoring is noisy. Reversibility looks binary on paper — DELETE vs SELECT — but most production calls aren't either-or. A column-type migration is reversible only if you have a backup that hasn't been overwritten. An email send is socially irreversible the moment the recipient sees it. Calibration sits on top of judgment, not in place of it.

The classifier has its own error rate. Anthropic publishes the auto-mode numbers honestly: the full two-stage pipeline trades a 0.4-percent false-positive rate for a 17-percent false-negative rate on real overeager actions — their own evaluation set of fifty-two cases. Their answer to that ceiling is deny-and-continue with escalation thresholds — three consecutive denials or twenty total, then stop and escalate to a human. The threshold itself is a design choice, not a default to inherit.
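The counting logic behind deny-and-continue is small enough to sketch. This is not Anthropic's implementation, only an illustration of the two limits described above; the class name DenialTracker and its interface are hypothetical.

```python
class DenialTracker:
    """Counts classifier denials and signals when to stop and escalate."""

    def __init__(self, max_consecutive: int = 3, max_total: int = 20):
        self.max_consecutive = max_consecutive
        self.max_total = max_total
        self.consecutive = 0
        self.total = 0

    def record(self, denied: bool) -> bool:
        """Record one decision; return True when a human should take over."""
        if denied:
            self.consecutive += 1
            self.total += 1
        else:
            # An allowed action resets the streak but not the running total.
            self.consecutive = 0
        return (
            self.consecutive >= self.max_consecutive
            or self.total >= self.max_total
        )
```

Both limits are constructor arguments because, as noted above, the threshold is a design choice: a batch job touching production may warrant a lower total than an interactive coding session.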

The agent can't calibrate itself. This is the Five Eyes line in plain terms: the agent doesn't decide which of its own actions need approval. The system designer does, in code, before the agent runs. If your gate is "the agent will pause when it thinks the action is risky," you don't have a gate. You have a vibes-based checkpoint that the same generator can talk itself past.
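"In code, before the agent runs" can be as plain as a read-only lookup table. A minimal sketch, with hypothetical tool names, that fails closed on anything the designer has not listed:

```python
from types import MappingProxyType

# Written by the system designer at build time. MappingProxyType makes the
# table read-only, so code holding only this reference cannot modify it.
APPROVAL_POLICY = MappingProxyType({
    "read_file": False,
    "run_migration": True,
    "delete_volume": True,
})


def requires_approval(tool_name: str) -> bool:
    # Tools missing from the table default to requiring approval.
    return APPROVAL_POLICY.get(tool_name, True)
```

The agent's own assessment of risk never appears as an input. A new tool added without a policy entry pauses by default rather than running unreviewed.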

Resilience, reversibility, risk containment — in that order. Efficiency comes after.

What to do this week

Treat this the way you'd treat adding a single circuit breaker to a service — pick one call, instrument it, then move.

Open whatever agent code you have running: a LangGraph graph, a Mastra workflow, a raw OpenAI function-calling loop. The framework doesn't matter. Find one tool that writes, deletes, sends, or pays. Write four lines next to it: a reversibility note, a scope note, a target-sensitivity note, a cost note. If any one of them reads "not reversible" or "production customer data," that call should pause and require approval. Gate that one call, not your whole agent.

The starter shape in LangGraph (the Python framework for stateful agent graphs) is one conditional interrupt() call inside the tool node. interrupt() is the LangGraph mechanic for pausing execution mid-tool-call and waiting for a human decision:

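from langgraph.types import Command, interrupt

def destructive_tool_node(state):
    call = state["proposed_call"]
    score = compute_blast_radius(call)   # your scoring function
    if score.total >= THRESHOLD:         # your threshold
        # Pause here; the value the human resumes with comes back as `decision`.
        decision = interrupt({
            "action": call.name,
            "args": call.args,
            "factors": score.factors,
        })
        if decision["action"] == "reject":
            # Route to the rollback node and record why the call was refused.
            return Command(goto="rollback", update={"reason": decision.get("reason", "")})
        if decision["action"] == "edit":
            # The reviewer supplied a corrected call; run that one instead.
            call = decision["edited_call"]
    return Command(goto="execute", update={"call": call})

Note: compute_blast_radius and THRESHOLD are user-defined, and this sketch assumes the score object exposes .total and .factors. The interrupt and Command primitives are real LangGraph APIs. interrupt() requires the graph to be compiled with a checkpointer, and the node re-runs from its start when the graph is resumed with Command(resume=...).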

Approve, edit, reject, respond: the four decisions LangGraph's human-in-the-loop documentation names as the human reviewer's response options. Pick one tool. Wire one gate. That's the week.

Close

Today's Tuesday Reel walks the PocketOS incident specifically — the credential discovery, the nine seconds, what the gate would have caught.

Until next time - Teja Derangula,
The gap between thinking and building has shrunk — take advantage.
