Explainable AI: Opening the Black Box of Machine Learning

A deep learning model classifies a loan application as high-risk. The applicant is denied credit. The applicant asks why — and the bank cannot tell them, because the model's internal workings are opaque. This scenario, increasingly common as AI is deployed in consequential decisions, has prompted a growing field of research and regulation: Explainable AI (XAI).

Why Explainability Matters

There are three compelling reasons to demand explanations from AI systems. The first is trust and acceptance: decision-makers are unlikely to rely on a system they do not understand, particularly in high-stakes contexts. A clinical decision support tool that simply outputs a diagnosis without reasoning will struggle to gain adoption among physicians trained to evaluate evidence.

The second is debugging and improvement: understanding why a model makes specific predictions allows developers to identify errors, biases, and spurious correlations — and to improve the model accordingly. A credit model that turns out to rely heavily on postcode might be discriminating by proxy against protected groups, a pattern only discoverable through careful interpretation.

The third is regulatory compliance. The EU's General Data Protection Regulation (GDPR) entitles individuals to "meaningful information about the logic involved" in automated decisions that significantly affect them. The EU AI Act, which entered into force in 2024, imposes transparency and documentation requirements on high-risk AI systems in areas including credit scoring, employment, education, and law enforcement.

Intrinsically Interpretable Models

The simplest solution to explainability is to use models that are interpretable by design. Linear regression produces coefficients for each feature — directly interpretable as the expected change in output for a unit change in the feature, holding others constant. Decision trees produce a rule structure that can be read as a series of if-then conditions. Generalised Additive Models (GAMs) model the output as a sum of smooth functions of individual features, producing visually interpretable contribution plots.
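To make the coefficient reading concrete, here is a minimal sketch of a linear score. The feature names, coefficients, and applicant values are all hypothetical and chosen only for illustration; they do not come from any real credit model.

```python
# Hypothetical coefficients for a toy credit-risk score (illustrative values only).
COEFFICIENTS = {"debt_to_income": 2.0, "late_payments": 0.5, "years_employed": -0.1}
INTERCEPT = 0.2


def predict(features):
    """Linear model: intercept plus the sum of coefficient * feature value."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())


def contributions(features):
    """Per-feature contribution to the prediction (coefficient * value)."""
    return {name: COEFFICIENTS[name] * value for name, value in features.items()}


applicant = {"debt_to_income": 0.4, "late_payments": 2, "years_employed": 5}
score = predict(applicant)

# Raising one feature by one unit, holding the others constant,
# changes the output by exactly that feature's coefficient.
bumped = dict(applicant, late_payments=applicant["late_payments"] + 1)
delta = predict(bumped) - score

print(contributions(applicant))
print(f"score={score:.2f}, effect of one more late payment={delta:.2f}")
```

The explanation here is the model itself: no separate approximation step sits between the prediction and the reasons given for it.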

These models are often sufficient for structured tabular data with well-engineered features. The trade-off is accuracy: for complex, high-dimensional problems — image classification, fraud detection across thousands of features, NLP — intrinsically interpretable models often underperform deep learning and ensemble methods by a significant margin.

Post-hoc Explanation Methods

For complex models, post-hoc explanation methods provide approximate interpretations without modifying the model itself. LIME (Local Interpretable Model-agnostic Explanations) explains an individual prediction by fitting a simple surrogate, typically a sparse linear model, to the complex model's behaviour in the neighbourhood of the input of interest. It perturbs the input, records how the model's output changes, weights each perturbed sample by its proximity to the original input, and reads the surrogate's coefficients as a measure of which features most influenced the prediction for that specific instance.
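The perturb-and-fit idea can be sketched in a few lines of plain Python. This is a simplified illustration in the spirit of LIME, not the LIME library's implementation: it assumes continuous features, Gaussian perturbations, an exponential proximity kernel, and an ordinary weighted least-squares fit with no sparsity. The `black_box` function and all parameter values are hypothetical.

```python
import math
import random


def black_box(x):
    """Stand-in for an opaque model: nonlinear in x[0], linear in x[1]."""
    return x[0] ** 2 + 3.0 * x[1]


def solve(a, b):
    """Solve the linear system a @ z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def explain_locally(model, x, n_samples=500, noise=0.1, kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear surrogate around x.

    Returns [intercept, slope_0, slope_1, ...] for features centred on x.
    """
    rng = random.Random(seed)
    d = len(x)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, noise) for xi in x]            # perturb the input
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + [zi - xi for zi, xi in zip(z, x)])  # centred features
        targets.append(model(z))                                # query the black box
        weights.append(math.exp(-dist2 / kernel_width ** 2))    # nearer samples count more
    # Weighted least squares via the normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(d + 1)]
            for i in range(d + 1)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, targets, weights))
            for i in range(d + 1)]
    return solve(xtwx, xtwy)


intercept, slope_0, slope_1 = explain_locally(black_box, [1.0, 2.0])
print(f"local slopes: x0={slope_0:.2f}, x1={slope_1:.2f}")
```

Around the point (1, 2) the surrogate's slopes should land close to the black box's local gradient, roughly 2 for the first feature and 3 for the second. The same surrogate fitted around a different input would give different slopes, which is what makes the explanation local.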

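SHAP (SHapley Additive exPlanations) has become one of the most widely used post-hoc explanation frameworks. Drawing on Shapley values from cooperative game theory, SHAP attributes to each feature its marginal contribution to a prediction, averaged over all possible orderings in which features could be included. SHAP values satisfy local accuracy (the contributions for a prediction sum to the difference between the model's output and a baseline expected output) and consistency (a feature's attribution does not decrease when the model changes so that the feature's marginal contribution grows or stays the same), so they provide a theoretically grounded decomposition of any model's output into additive feature contributions. Computing exact Shapley values is exponential in the number of features in general, so practical implementations rely on approximation or on model-specific algorithms. TreeSHAP, an efficient algorithm for tree-based models, runs in polynomial time, making it practical for production use with XGBoost and random forest models.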
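For a model with only a handful of features, Shapley values can be computed exactly by brute force, which makes the definition easy to inspect. The sketch below is not how the SHAP library works internally: it assumes a hypothetical toy model and represents "missing" features with a single fixed baseline vector rather than an expectation over background data. Its cost grows factorially with the number of features.

```python
from itertools import permutations


def model(x):
    """Toy model with an interaction term, standing in for any black box."""
    return 2.0 * x[0] + x[1] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features switch from baseline to x."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = f(current)
        for i in order:
            current[i] = x[i]                      # reveal feature i
            output = f(current)
            totals[i] += output - previous_output  # marginal contribution of i
            previous_output = output
    return [t / len(orderings) for t in totals]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)  # [2.0, 3.0, 3.0]

# Local accuracy: the attributions sum to f(x) - f(baseline).
assert abs(sum(phi) - (model(x) - model(baseline))) < 1e-9
```

The interaction term `x[1] * x[2]` contributes 6 to the output, and the averaging over orderings splits that credit equally between the two features involved.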

Attention as Explanation

For Transformer-based NLP models, attention weights are sometimes offered as explanations: a heat map showing which words the model "attended to" when making a prediction. This approach is visually intuitive but theoretically problematic. Research has shown that attention weights often correlate only weakly with gradient-based feature importance measures, and that models can produce effectively the same outputs with very different attention patterns. Attention is not, in general, explanation.
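A deliberately contrived example shows how different attention patterns can lead to the same output. It is a toy construction, not a reproduction of the published experiments: a single attention step over three tokens, two of which are assumed to carry identical value vectors. All scores and vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    """Attention output: the weighted sum of the value vectors."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]


# Tokens 0 and 1 carry identical value vectors (the contrived assumption).
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

weights_a = softmax([3.0, 0.0, 1.0])  # most attention mass on token 0
weights_b = softmax([0.0, 3.0, 1.0])  # most attention mass on token 1

output_a = attend(weights_a, values)
output_b = attend(weights_b, values)

print("attention A:", [round(w, 3) for w in weights_a])
print("attention B:", [round(w, 3) for w in weights_b])
print("outputs match:", all(abs(a - b) < 1e-12 for a, b in zip(output_a, output_b)))
```

A heat map drawn from `weights_a` would highlight the first token and one drawn from `weights_b` the second, yet everything downstream of this layer receives the same vector in both cases.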

Global vs. Local Explanations

Explanations can describe model behaviour globally, summarising how the model behaves across a whole dataset, or locally, explaining a specific prediction for a specific input. SHAP summary plots, partial dependence plots, and feature importance rankings provide global insight; SHAP waterfall charts and LIME outputs are local. Both are valuable: global explanations reveal systematic patterns and potential biases; local explanations support individual decision review and user trust.
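The difference between the two views can be sketched with a partial dependence curve (global) next to per-instance effects (local). The model and the four-row dataset below are hypothetical and kept small enough to check by hand; the helper names are illustrative and not taken from any library.

```python
def model(x):
    """Toy model with an interaction: the effect of x[0] depends on x[1]."""
    return x[0] * x[1] + x[1]


def partial_dependence(f, data, feature, grid):
    """Global view: average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        outputs = []
        for row in data:
            modified = list(row)
            modified[feature] = value
            outputs.append(f(modified))
        curve.append(sum(outputs) / len(outputs))
    return curve


def local_effect(f, row, feature, step=1.0):
    """Local view: change in one prediction when `feature` moves by `step`."""
    shifted = list(row)
    shifted[feature] += step
    return f(shifted) - f(row)


data = [[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [4.0, 2.0]]
curve = partial_dependence(model, data, feature=0, grid=[0.0, 1.0, 2.0])
effects = [local_effect(model, row, feature=0) for row in data]

print(curve)    # [1.5, 3.0, 4.5]
print(effects)  # [1.0, 1.0, 2.0, 2.0]
```

The global curve rises by 1.5 per unit of the first feature, but no individual row actually behaves that way: half respond with a slope of 1 and half with a slope of 2. An averaged view and a per-decision view answer different questions.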

The Fidelity-Interpretability Trade-off

No post-hoc explanation method perfectly captures a complex model's behaviour. By definition, simpler explanations are approximations. The key question is whether the approximation is accurate enough to be useful: does the explanation faithfully reflect what the model actually does? Rigorous evaluation of explanation fidelity, through techniques such as perturbation testing (removing or altering the features an explanation ranks as important and checking that the output changes accordingly) and simulation on synthetic data where the true feature effects are known, is a critical step in responsible XAI deployment.
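One simple form of perturbation testing is a deletion check: replace features with baseline values in the order the explanation ranks them, and compare how quickly the output moves against a control ordering. The sketch below assumes a hypothetical toy model, a fixed all-zeros baseline, and made-up attribution scores; real evaluations typically average over many inputs and compare against random orderings as well.

```python
def model(x):
    """Toy model: feature 0 matters most, feature 2 barely matters."""
    return 5.0 * x[0] + 2.0 * x[1] + 0.1 * x[2]


def deletion_curve(f, x, baseline, ranking):
    """Replace features with baseline values in the given order and record
    how far the output has moved from f(x) after each replacement."""
    current = list(x)
    original = f(x)
    drops = []
    for i in ranking:
        current[i] = baseline[i]
        drops.append(abs(original - f(current)))
    return drops


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

# Hypothetical attribution scores produced by some explanation method.
attributions = [5.0, 2.0, 0.1]
claimed_order = sorted(range(len(x)), key=lambda i: -abs(attributions[i]))
reversed_order = claimed_order[::-1]  # control: least important first

faithful = deletion_curve(model, x, baseline, claimed_order)
control = deletion_curve(model, x, baseline, reversed_order)

print("claimed order :", [round(d, 2) for d in faithful])
print("reversed order:", [round(d, 2) for d in control])
```

If the explanation is faithful, deleting its top-ranked features should move the output much faster than the control ordering does. If the two curves look alike, the ranking carries little information about what the model actually relies on.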