Model risk management for AI: an SR 11-7 deep dive

Bank model-risk management already assumes models are imperfect and demands independent challenge across the lifecycle. That makes it the right lens for AI — and it exposes where annual review breaks down once a model starts to act.

The discipline was built for imperfect models

In 2011, U.S. banking supervisors issued guidance on model risk management — Federal Reserve SR 11-7, mirrored in OCC Bulletin 2011-12. It has aged well. The reason is its starting assumption: every model is wrong in some way, and the question is not whether but how badly, and whether anyone would notice in time.

SR 11-7 defines model risk as the potential for adverse consequences from decisions based on incorrect or misused model outputs and reports. It names two sources. A model can have fundamental errors: bad design, bad data, bad math. Or a sound model can be used incorrectly: applied outside its intended scope, fed inputs it was never built for, trusted past its limits. That second source is the one most organizations underweight, and it is the one that matters most for modern AI.

The guidance rests on three pillars: robust model development, implementation, and use; sound model validation, with "effective challenge" at its center; and governance, policies, and controls — a model inventory, defined roles, documentation, and independent audit. None of this is exotic. It is the discipline of treating a model as a thing that must earn trust and keep earning it.

What makes it the right lens for AI — including outside banking — is precisely that it never assumed the model was right. It assumed the opposite and built controls around the assumption. Most AI governance frameworks are still arguing about principles. SR 11-7 already shipped a working theory of accountability fifteen years ago.

Pillar one: development and use — where general-purpose AI breaks the frame

Traditional model development produces a model with a stated intended use and documented limitations. A probability-of-default (PD) model estimates default probability for a defined portfolio. Use it for that, and you are inside the lines. Use it to price a product it was never calibrated for, and you have created model risk without changing a line of code.

Now apply that to a large language model. The "intended use" of a general-purpose LLM is, by construction, almost anything. It will answer a question about loan policy, draft a customer letter, summarize a contract, and propose a course of action — all with the same fluent confidence, whether or not it has any business doing so. The "used incorrectly" failure mode is no longer an edge case. It is the default risk surface.

This is the part SR 11-7 anticipated without naming it. The limitation that matters for an LLM is rarely a statistic on a validation report. It is a boundary on use: which decisions it may touch, which it may only inform, and which it must never make alone. Documenting intended use for AI is less about describing what the model can do and more about declaring what it is permitted to do.
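
One way to make that declaration concrete is a small register that maps each decision type to a permission tier. The sketch below is illustrative only: the tier names, the example decision types, and the `permission_for` helper are assumptions introduced here, not terms from SR 11-7 or from any product. It treats anything undeclared as falling in the most restrictive tier.

```python
from enum import Enum


class UsePermission(Enum):
    MAY_DECIDE = "may_decide"      # the model's output can drive the decision
    INFORM_ONLY = "inform_only"    # a human decides; the model only advises
    NEVER_ALONE = "never_alone"    # the model must never make this decision alone


# Hypothetical register of declared uses; the decision types are examples only.
DECLARED_USE = {
    "summarize_contract": UsePermission.MAY_DECIDE,
    "draft_customer_letter": UsePermission.INFORM_ONLY,
    "approve_loan": UsePermission.NEVER_ALONE,
}


def permission_for(decision_type: str) -> UsePermission:
    """Return the declared tier; undeclared uses get the most restrictive tier."""
    return DECLARED_USE.get(decision_type, UsePermission.NEVER_ALONE)
```

The default matters more than the entries: a general-purpose model will attempt uses nobody listed, so the register should answer for those too.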

Pillar two: validation and effective challenge — challenge the use, not just the math

Effective challenge is the heart of SR 11-7: critical review by objective, informed parties who have the authority and the standing to challenge a model and to compel change. It is not a second opinion. It is a check with teeth.

Classical validation assumes you can open the model. You inspect the math, replicate the outputs, test sensitivity to inputs. With a vendor or black-box model — and most LLMs in production are exactly that — you cannot fully inspect anything. The weights are not yours. The training data is opaque. The next version may behave differently for reasons you will never be told.

So the object of challenge has to move. If you cannot validate the internals, you validate the use and the outputs. Is the model being applied within its declared boundaries? Are its outputs accurate, consistent, and defensible on the cases that actually matter? What happens at the edges, on adversarial inputs, on the populations where error is most costly? SR 11-7 already permits this posture — it treats validation as commensurate with risk and use, not as a single technique. For AI, the burden shifts from "prove the model is correct" to "prove the model is being used correctly, and show the evidence."

Pillar three: governance — inventory the AI you bought

Governance is the unglamorous pillar that holds the other two up: a complete model inventory, clear ownership, documentation, ongoing monitoring, and internal audit independent of the developers.

The first failure here is quiet and common. Organizations inventory the models they built and miss the AI they bought — the copilot embedded in a SaaS tool, the LLM behind a vendor feature, the assistant a team wired in last quarter. If it influences a decision, it is a model under SR 11-7's logic, regardless of who trained it or where it runs. An inventory that excludes purchased AI is not an inventory. It is a blind spot with a spreadsheet.
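
A minimal sketch of that inventory check, assuming a hypothetical `AISystem` record and a list of systems surfaced through procurement or SaaS reviews. The field names and the example systems are invented for illustration. Any system that influences a decision and is absent from the inventory counts as a gap, whether it was built or bought.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AISystem:
    name: str
    origin: str                 # "built" or "bought" (illustrative labels)
    influences_decision: bool


def inventory_gaps(discovered: list[AISystem], inventory: set[str]) -> list[str]:
    """Names of decision-influencing systems that the inventory does not list."""
    return sorted(
        s.name for s in discovered
        if s.influences_decision and s.name not in inventory
    )


discovered = [
    AISystem("pd_scorecard", "built", True),
    AISystem("saas_copilot", "bought", True),
    AISystem("spellchecker", "bought", False),
]
print(inventory_gaps(discovered, {"pd_scorecard"}))  # ['saas_copilot']
```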

The second is monitoring. Traditional models drift slowly as the world moves away from their training data. AI drifts in additional ways. A prompt change shifts behavior. A vendor silently updates a model. The same input yields a different output tomorrow because the system is non-deterministic by design. Ongoing monitoring for AI is not an annual recalibration. It is a standing question: is this thing still doing what we signed off on?
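
That standing question can be approximated with two simple checks, sketched below under stated assumptions. `model` is any callable that maps a prompt to a text output, `baseline` is a hypothetical set of signed-off prompt and answer pairs, and exact string matching stands in for whatever comparison a real program would use. The first function measures how consistent the model is on one prompt. The second lists prompts whose answer has moved since sign-off.

```python
from collections import Counter
from typing import Callable


def agreement_rate(model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Share of repeated runs that return the most common output for one prompt."""
    outputs = [model(prompt) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


def drifted_cases(model: Callable[[str], str], baseline: dict[str, str]) -> list[str]:
    """Prompts whose current output no longer matches the signed-off answer."""
    return [prompt for prompt, expected in baseline.items() if model(prompt) != expected]
```

Run on a schedule and after every prompt or vendor change, checks like these turn "still doing what we signed off on" into something that produces evidence rather than an opinion.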

What is genuinely harder with AI

Four things break the old cadence. Non-determinism: the same input need not produce the same output, so "tested once, approved" means less. Behavior drift: prompts and vendor updates move the model under your feet. Black-box vendor models: you govern what you cannot inspect. And the largest shift — models now act. An agentic system does not just score an application; it sends the email, updates the record, moves the money. The gap between a wrong output and a wrong consequence collapses to nothing.

SR 11-7 imagined a model that produces a number a human then uses. When the model takes the action, the human checkpoint that effective challenge assumed is simply gone — unless you rebuild it into the moment of action.

The KAiM point of view: effective challenge has to be continuous

Here is the honest problem. Effective challenge as an annual or quarterly validation cycle was sufficient when a human stood between the model and the consequence. When the model acts in real time, a review that happens a few times a year cannot challenge a decision that happens in milliseconds. The control has to move to the point of action, and it has to run every time.

That is the gap KAiM Helm is built to close. KAiM Helm is a deterministic control engine. Before an AI-driven action proceeds, it is evaluated against six axes — Authority, Policy, Evidence, Harm, Regulation, and Escalation — and resolved to a clear verdict: ALLOW, BLOCK, or ESCALATE. The evaluation is deterministic: the same facts produce the same verdict, which is the property the underlying model lacks. In SR 11-7 terms, this is effective challenge made continuous and automated — an objective check, with the standing to block, applied at the point of action rather than once a year.
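
To illustrate what "same facts, same verdict" means in code, here is a minimal sketch. It is not KAiM Helm's implementation. The article names the six axes and the three verdicts; the combination rule (the strictest per-axis verdict wins) and the fail-closed handling of a missing axis are assumptions made for this example. The function is pure: no clock, no randomness, no model call.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"
    ESCALATE = "ESCALATE"
    BLOCK = "BLOCK"


AXES = ("authority", "policy", "evidence", "harm", "regulation", "escalation")


def resolve(axis_verdicts: dict[str, Verdict]) -> Verdict:
    """Combine per-axis verdicts into one; the strictest wins (assumed rule)."""
    # Assumption: an axis that was never evaluated fails closed.
    if any(axis not in axis_verdicts for axis in AXES):
        return Verdict.BLOCK
    verdicts = [axis_verdicts[axis] for axis in AXES]
    if Verdict.BLOCK in verdicts:
        return Verdict.BLOCK
    if Verdict.ESCALATE in verdicts:
        return Verdict.ESCALATE
    return Verdict.ALLOW
```

Because the output depends only on the inputs, the same set of per-axis findings can be replayed later and must yield the same verdict, which is what makes the result auditable.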

The mapping is deliberate. The deterministic evaluator is a continuous effective-challenge and use-control layer. Bounded authority — the model may act only within an explicitly granted scope, and escalates rather than acts when it reaches the edge — is the use-limit control SR 11-7 asks for, enforced rather than documented. And every verdict produces a signed, reproducible record: what was proposed, which rules applied, what was decided, and why. That record is validation and audit evidence generated as a byproduct of operation, not reconstructed after the fact.
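
A signed, reproducible record can be sketched with the Python standard library. The example below uses an HMAC over canonical JSON as a stand-in for whatever signing scheme a production system would use; the article does not specify one, and the record fields and function names here are hypothetical. Sorting keys and fixing separators ensures that two records with the same content always produce the same signature.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_record(record, key), signature)


# Illustrative record: what was proposed, which rules applied, the verdict, and why.
record = {
    "proposed": "send_refund",
    "rules": ["refund_limit"],
    "verdict": "ESCALATE",
    "reason": "amount exceeds granted authority",
}
key = b"demo-key-not-for-production"
signature = sign_record(record, key)
```

An HMAC proves integrity only to parties that hold the shared key. An asymmetric signature would let an auditor verify records without being able to forge them.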

We will be candid about the line between built and sequenced. The deterministic evaluation, the six-axis verdict, bounded authority, and the signed record are the working core, exercised today in design-partner pilots and controlled demonstrations. A consolidated model inventory spanning every AI an institution has bought, and quantitative monitoring dashboards for drift, are only partially built and remain on the roadmap. That is worth stating plainly, because the discipline SR 11-7 describes is a program, not a product, and no single layer completes it.

The principle holds regardless. A control that can say no — every time, at the moment it matters, and leave a record of why — is what effective challenge has to become when the model stops advising and starts acting.

Key takeaways


Most institutions already know how to govern a model that produces a number. The open question is what happens when the model acts. A Control Gap Assessment is a structured way to find where AI in your environment is making or taking decisions without a control that can challenge it at the point of action — and to see what closing that gap would take. If that is the question on your desk, it is worth a conversation.