Why Stress-Testing AI Models Matters — Now More Than Ever
Artificial intelligence (AI) and machine learning (ML) are now embedded in the core of banking — powering decisions in credit, fraud, anti-money laundering (AML), and more. These systems bring scale and speed — but also a new type of risk.
AI models are often black boxes: They learn from data, but under stress — like a downturn, market shock, or regulatory change — they can behave unpredictably. Unlike traditional models, they do not degrade gradually. They can fail fast — misclassifying risk, introducing bias, or generating false positives — which creates reputational, financial, and compliance risk. For example, during COVID-19, many credit underwriting models that had been trained on years of stable economic data suddenly rejected large numbers of qualified borrowers, or conversely underestimated default risk, because they had never encountered pandemic-like conditions in their training history. This illustrated how quickly AI models can break down when exposed to novel stress scenarios.
Regulators Are Responding
- U.S. guidance like SR 11-7, the Federal Reserve's supervisory guidance on model risk management, is being reinterpreted for AI.
- The Office of the Comptroller of the Currency's (OCC) Principles for Responsible AI call out fairness, explainability, and model fragility.
- The EU AI Act categorizes many financial AI models as "high-risk," requiring formal oversight.
In this new landscape, the old approach to model validation is not enough. Stress-testing AI models must go beyond accuracy — it must test for fairness, transparency, and resilience under pressure.
At Ankura, we believe AI is not too big to fail — it is too complex to ignore.
This paper introduces our practical framework for stress-testing AI in risk-sensitive applications, helping institutions future-proof their models and stay ahead of regulation.
Why Traditional Stress-Testing Falls Short for AI
Traditional stress-testing was designed for simpler, transparent models. But AI systems behave differently: They are nonlinear, sensitive to data shifts, and harder to interpret. That is why Ankura's stress-testing framework is built specifically for AI — addressing the unique ways these models can fail under pressure.
Aspect | Traditional Models | AI/ML Models |
---|---|---|
Structure | Linear, transparent, easy to interpret | Nonlinear, complex, "black box" |
Data Sensitivity | More stable under moderate shifts | Highly sensitive to distributional changes |
Response to Stress | Gradual degradation | Sudden, unpredictable breakdowns |
Governance and Review | Easier to document and validate | Requires advanced tools for explainability |
Ankura's 5-Step AI Stress-Testing Framework
Ankura's five-step framework is designed specifically for AI/ML systems, combining technical rigor with governance best practices.
Case Study: How a Loan Default Model Broke Under COVID Stress
To demonstrate how AI models can falter under real-world stress, we stress-tested a loan default model using Lending Club data (2007–2018). Our aim was to see how the model, trained in normal times, would behave under crisis-like conditions — specifically, the COVID-19 economic shock.
Step 1: Applying Macro and Micro-Stressors
We trained the model on pre-COVID-19 data, then tested it on a synthetic dataset mimicking the COVID-19 era¹ — incorporating higher unemployment, income drops, and elevated default rates. We also added micro-stressors like distorted borrower features (e.g., inflated debt-to-income ratios, altered credit scores).
The test set was designed to reflect both real economic turmoil and data instability — the kind of conditions a production model would face in crisis.
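A minimal sketch of how such a stressed test set might be constructed is shown below. The column names (`annual_inc`, `dti`, `fico_score`) follow common Lending Club field conventions, and the shock magnitudes are illustrative assumptions rather than the exact parameters used in our study.

```python
# Illustrative stress transformation: macro-stressors shift whole
# distributions; micro-stressors distort individual borrower features.
# Labels are left untouched -- only model inputs are stressed.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def apply_covid_stress(df: pd.DataFrame) -> pd.DataFrame:
    """Return a stressed copy of a borrower-level test set."""
    stressed = df.copy()

    # Macro-stressor: cut incomes by 10-30% to mimic pandemic-era shocks.
    income_shock = rng.uniform(0.70, 0.90, size=len(stressed))
    stressed["annual_inc"] = stressed["annual_inc"] * income_shock

    # Micro-stressor: debt-to-income inflates as obligations stay fixed
    # while income falls.
    stressed["dti"] = stressed["dti"] / income_shock

    # Micro-stressor: push credit scores down with non-negative noise.
    stressed["fico_score"] = stressed["fico_score"] - rng.normal(
        20, 10, size=len(stressed)
    ).clip(min=0)

    return stressed
```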
Step 2: Measuring Model Behavior
We used an XGBoost-based classification model to compare predictions under two scenarios:
- Pre-Stress: Normal economic inputs
- Post-Stress: COVID-19-like stressed inputs
We evaluated both performance metrics and behavioral shifts using Receiver Operating Characteristic (ROC) curves, precision-recall, calibration plots, and feature importance charts.
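The snippet below sketches this comparison. It assumes `X_train`/`y_train` (pre-COVID-19 training data), `X_pre`/`X_post` (normal and stressed test features, e.g., from `apply_covid_stress` above), and shared labels `y_test`; the hyperparameters are illustrative, not the exact configuration behind the results that follow.

```python
# Score one model on normal vs. stressed inputs and compare key metrics.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
model.fit(X_train, y_train)  # trained on pre-COVID-19 data only

for label, X in [("Pre-Stress", X_pre), ("Post-Stress", X_post)]:
    proba = model.predict_proba(X)[:, 1]
    preds = (proba >= 0.5).astype(int)  # default decision threshold
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"{label}: TP={tp} FN={fn} FP={fp} TN={tn} "
          f"precision={precision_score(y_test, preds):.2f} "
          f"recall={recall_score(y_test, preds):.2f} "
          f"accuracy={accuracy_score(y_test, preds):.3f} "
          f"AUC={roc_auc_score(y_test, proba):.3f}")
```

The table below summarizes the results of this comparison in our case study.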
Metric | Pre-Stress | Post-Stress | Interpretation |
---|---|---|---|
True Positives (TP) | 4440 | 9577 | Better detection of defaults |
False Negatives (FN) | 34143 | 29006 | Fewer missed defaults |
False Positives (FP) | 3929 | 14292 | Spike in false alarms |
True Negatives (TN) | 140959 | 130596 | Decrease in correct non-default classification |
Precision | 53% | 40% | Declines due to higher FP |
Recall (Sensitivity) | 12% | 25% | Improves due to lower FN |
Accuracy | 0.792 | 0.764 | Modest drop in overall correctness |
AUC (ROC) | 0.718 | 0.677 | Shows degraded discrimination power |
Calibration | Near-perfect | Overconfident | Model overstates certainty, risking misinformed decisions |
What We Found
- Performance Drift: While the model captured more defaults (higher recall), it also flagged many good borrowers as risky — precision dropped significantly.
- Confidence Collapse: Calibration plots showed the model became overconfident under stress — inflating default probabilities.
- Feature Fragility: SHapley Additive exPlanations (SHAP) analysis revealed an increased reliance on unstable features, such as income and employment, leading to erratic behavior.
- Output Distribution Shift: The Kernel Density Estimation (KDE) plot showed a visible rightward shift in predicted risk scores — signaling overreaction to stress inputs. A sketch of both diagnostics follows this list.
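For readers who want to reproduce these diagnostics, the sketch below compares mean absolute SHAP attributions and predicted-score distributions pre- and post-stress. It reuses the `model`, `X_pre`, and `X_post` objects assumed earlier and is a simplified illustration, not our full diagnostic pipeline.

```python
# Compare feature attributions and predicted-score distributions under stress.
import matplotlib.pyplot as plt
import seaborn as sns
import shap

# SHAP: did the model lean harder on volatile features post-stress?
explainer = shap.TreeExplainer(model)
shap_pre = explainer.shap_values(X_pre)    # shape: (n_samples, n_features)
shap_post = explainer.shap_values(X_post)
for name, pre, post in zip(X_pre.columns,
                           abs(shap_pre).mean(axis=0),
                           abs(shap_post).mean(axis=0)):
    print(f"{name}: mean |SHAP| {pre:.4f} -> {post:.4f}")

# KDE: a rightward shift in predicted scores signals overreaction to stress.
sns.kdeplot(model.predict_proba(X_pre)[:, 1], label="Pre-Stress")
sns.kdeplot(model.predict_proba(X_post)[:, 1], label="Post-Stress")
plt.xlabel("Predicted default probability")
plt.legend()
plt.show()
```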
Key Takeaways for Business Leaders
- More defaults flagged — but less precision: Precision dropped, making the model less useful for credit decisions.
- The model became fragile: Over-reliance on volatile inputs made outputs unstable and less trustworthy.
- Real risk of operational failure: In live systems, this could mean bad loans get approved, or good customers get unfairly rejected.
What We Recommend
Make AI models more resilient.
- Retrain on stress-informed data: Include past crisis periods in training data.
- Rebalance model thresholds: Optimize tradeoff between false positives and negatives.
- Add monitoring alerts: Track shifts in key features like income or Debt-to-Income (DTI); a drift-alert sketch follows this list.
- Refresh calibration regularly: Ensure risk scores remain meaningful.
- Use hybrid guardrails: Add rule-based overrides during high-risk periods.
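As one concrete way to implement the monitoring recommendation, the sketch below raises a feature-drift alert using the Population Stability Index (PSI), a common drift metric rather than one specific to our framework. The 0.10/0.25 thresholds are a widely used rule of thumb, and the feature names are the same hypothetical ones used earlier.

```python
# Feature-drift alert using the Population Stability Index (PSI).
import numpy as np
import pandas as pd

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a baseline sample and a live sample of one feature."""
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf  # catch out-of-range live values
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual, cuts)[0] / len(actual)
    e = np.clip(e, 1e-6, None)  # avoid log(0) for empty bins
    a = np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

def drift_alert(baseline: pd.DataFrame, live: pd.DataFrame,
                features=("annual_inc", "dti")) -> None:
    """Print a drift status per monitored feature."""
    for f in features:
        score = psi(baseline[f].to_numpy(), live[f].to_numpy())
        status = "OK" if score < 0.10 else "WATCH" if score < 0.25 else "ALERT"
        print(f"{f}: PSI={score:.3f} [{status}]")
```

A check like this can run on every scoring batch, with WATCH results queued for review and ALERT results triggering the rule-based overrides described above.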
Ankura's AI Model Risk Services
Ankura helps institutions deploy AI responsibly by combining technical expertise with regulatory insight. Our services ensure models are robust, fair, explainable, and audit-ready.
- Model Validation: End-to-end validation covering data, design, and outcomes — aligned with SR 11-7 and global best practices.
- Stress-Testing: Simulation of shocks and distribution shifts to test model resilience under adverse conditions.
- Fairness Audits: Evaluation of bias and disparate impact across sensitive attributes.
- Explainability: Use of SHAP, Local Interpretable Model-Agnostic Explanations (LIME), and other tools to improve transparency and support governance.
- Governance Support: Setup of model documentation, monitoring plans, and audit trails for compliance.
Whether you're building your first ML model or scaling AI across the enterprise, we bring rigor and oversight to your model risk management.
Footnote
1. To build the synthetic COVID-19 scenario, we anchored macroeconomic assumptions (unemployment, income shocks, default rates) to ranges observed during 2020–2021 using public sources such as Federal Reserve Economic Data (FRED). These were then translated into borrower-level features by proportionally adjusting key inputs (e.g., increasing debt-to-income ratios, reducing disposable income, shifting credit score distributions). While simplified, this approach allowed us to capture realistic stress dynamics at the borrower level.
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.