Explainability and Interpretable AI (XAI)

In Short

Explainability is the ability to describe why a model produced a specific output in terms a human can act on. The field ranges from post-hoc approximation methods like SHAP and LIME that work on any model, to mechanistic interpretability that attempts to reverse-engineer a model's internal computation, to documentation practices like model cards that communicate model behavior to downstream users and regulators.

01. What It Is

The terms interpretability and explainability are often used interchangeably, but the distinction is useful. Interpretability is a property of the model itself: a linear regression is interpretable because its decision process is directly readable from its coefficients. Explainability is a property of a method applied to a model: an explanation tells you why a particular prediction was made, even if the underlying model is a black box.

Black-box models, including deep neural networks, large gradient-boosted tree ensembles, and LLMs, are not directly interpretable. Their parameters number from millions to hundreds of billions, and the computation they perform involves high-dimensional non-linear transformations that resist direct human reading. Explainability methods are the engineering response to this gap.

02. Why It Matters

The practical stakes are high in four areas.

Trust and adoption:
A model whose outputs cannot be explained is hard to debug and hard to trust. Engineers cannot tell whether a correct prediction was correct for the right reasons or by coincidence. Users cannot verify that the system is operating as intended.

Debugging and bias detection:
Explanations reveal when a model has latched onto spurious correlations. A classic failure: a chest X-ray classifier that learned to predict pneumonia from hospital- and scanner-specific markings in the image rather than from the lung tissue, a shortcut that saliency maps helped expose.

Regulatory compliance:
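The EU AI Act entered into force in August 2024, with obligations phasing in over the following years. It imposes transparency obligations on high-risk AI systems and gives affected persons a right to an explanation of individual decisions made with them. GDPR Article 22 restricts solely automated decisions with legal or similarly significant effects. Whether the GDPR creates a full "right to explanation" is debated, but its transparency provisions require meaningful information about the logic involved. Financial regulators, including the US OCC and the EU's EBA, expect credit-scoring models to be explainable.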

Model editing and safety:
Understanding where knowledge is stored in a model's weights makes it possible to correct factual errors without full retraining. Meng et al. (arXiv:2202.05262, NeurIPS 2022) demonstrated that factual associations in GPT-style models are localized to specific mid-layer feed-forward modules and can be edited with Rank-One Model Editing (ROME).

03. How It Works

SHAP (SHapley Additive exPlanations)

SHAP, introduced by Lundberg and Lee (2017), assigns each input feature a Shapley value: a weighted average of its marginal contribution to the prediction over all possible subsets of the other features. Shapley values come from cooperative game theory, where the "game" is producing the prediction and the "players" are the features.

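SHAP has desirable mathematical properties: local accuracy (individual contributions sum to the difference between the prediction and the baseline), dummy (a feature that never changes the prediction gets zero), and consistency (if a model changes so that a feature's marginal contribution increases or stays the same, its attribution does not decrease). Within a single model and baseline, these properties make attributions directly comparable across features and predictions. Comparisons across models or datasets need more care, because the values are expressed in units of the model's output and depend on the chosen baseline.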
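The definition and the local accuracy and dummy properties can be checked by brute force on a tiny example. The following is a minimal sketch, not the implementation used by any SHAP library: the toy model, the input x, and the all-zeros baseline are assumptions made for illustration, and an "absent" feature is simulated by substituting its baseline value, which is one common convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for value_fn(frozenset of present features) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for a coalition of this size that excludes feature i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value_fn(coalition | {i}) - value_fn(coalition)
                phi[i] += weight * marginal
    return phi


def model(x):
    # Hypothetical toy model: an additive term, an interaction, and an unused feature (index 3).
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0, 0.0]


def value_fn(present):
    # Assumption: an absent feature takes its baseline value.
    mixed = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value_fn, len(x))
print(phi)                                    # approximately [2.0, 3.0, 3.0, 0.0]
print(sum(phi), model(x) - model(baseline))   # both approximately 8.0
```

The interaction term x[1] * x[2] is split evenly between the two features involved, the unused feature receives zero, and the attributions sum to the gap between the prediction and the baseline output. The nested loops visit every subset of the other features, which is why this approach stops being practical beyond a few dozen features.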

The key drawback is computational cost. Exact Shapley computation is exponential in the number of features. Specialized algorithms make it tractable: TreeSHAP computes exact values for tree ensembles in polynomial time, while KernelSHAP (for arbitrary models) and DeepSHAP (for neural networks) rely on sampling or approximation and so introduce approximation error.
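One generic way to trade exactness for speed is to average marginal contributions over randomly sampled feature orderings instead of enumerating every subset. The sketch below shows only that idea. It is not KernelSHAP, TreeSHAP, or DeepSHAP, and the toy model, baseline, and number of sampled orderings are illustrative assumptions.

```python
import random


def sampled_shapley(model, x, baseline, n_permutations=2000, seed=0):
    """Monte Carlo estimate of Shapley values from random feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            # Switch feature i from its baseline value to its actual value
            # and credit it with the resulting change in output.
            current[i] = x[i]
            output = model(current)
            phi[i] += output - previous_output
            previous_output = output
    return [total / n_permutations for total in phi]


def model(x):
    # Hypothetical toy model: an additive term, an interaction, and an unused feature.
    return 2.0 * x[0] + x[1] * x[2]


x = [1.0, 2.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0, 0.0]
estimate = sampled_shapley(model, x, baseline)
print(estimate)
```

Each ordering's contributions telescope to the difference between the prediction and the baseline output, so the estimate keeps local accuracy. The error shows up in how that total is divided among interacting features, and it shrinks as more orderings are sampled.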

LIME (Local Interpretable Model-Agnostic Explanations)

LIME, introduced by Ribeiro, Singh, and Guestrin (2016), explains a single prediction by fitting an interpretable surrogate model, typically a linear regression or decision tree, around that prediction in a locally sampled neighborhood. The surrogate is only required to be locally faithful, not globally accurate.

The process: perturb the input, observe how the model's output changes, weight the perturbed samples by their proximity to the original input, and train a simple model on this weighted dataset. The simple model's coefficients are the explanation.
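The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lime package's implementation: the black-box function, the Gaussian perturbation scale, the exponential proximity kernel, and its width are all hypothetical choices, and the surrogate is an ordinary weighted linear regression solved through its normal equations.

```python
import math
import random


def black_box(x):
    # Hypothetical non-linear model standing in for the black box.
    return math.tanh(2.0 * x[0]) + 0.5 * x[1] ** 2


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def lime_explain(model, x, n_samples=2000, scale=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around x; return (intercept, slopes)."""
    rng = random.Random(seed)
    d = len(x)
    # Accumulate the weighted normal equations (Z^T W Z) w = Z^T W y.
    ztz = [[0.0] * (d + 1) for _ in range(d + 1)]
    zty = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]        # 1. perturb the input
        y = model(z)                                        # 2. observe the output
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weight = math.exp(-dist2 / kernel_width ** 2)       # 3. weight by proximity
        row = [1.0] + [zi - xi for zi, xi in zip(z, x)]     # intercept + offsets from x
        for i in range(d + 1):
            zty[i] += weight * row[i] * y
            for j in range(d + 1):
                ztz[i][j] += weight * row[i] * row[j]
    coeffs = solve_linear_system(ztz, zty)                  # 4. fit the simple model
    return coeffs[0], coeffs[1:]


intercept, slopes = lime_explain(black_box, [0.0, 1.0])
print(slopes)
```

The slopes are the explanation: they describe how the black box responds to each feature near the chosen input, and they should land near the local gradient, here roughly 2 and 1, with some flattening from the curvature inside the sampled neighborhood. Changing seed, scale, or kernel_width changes the result, which is the source of the instability discussed next.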

LIME is flexible (model-agnostic, works for text, image, and tabular data) but has a known instability problem: small changes to the perturbation sampling can produce substantially different explanations for the same prediction.

Feature importance

For tree-based models, feature importance can be computed directly from the tree structure (how often a feature is used for splits, weighted by the information gain at each split). For neural networks, gradient-based methods including vanilla gradients, integrated gradients, and SmoothGrad attribute importance to input dimensions by examining how the output changes with respect to small input perturbations.
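Integrated gradients, one of the gradient-based methods named above, averages the gradient along a straight path from a baseline input to the actual input and scales it by the distance travelled in each dimension. The sketch below is a minimal illustration: the logistic toy model, the zero baseline, the step count, and the use of finite-difference gradients (instead of automatic differentiation) are assumptions made to keep the example dependency-free.

```python
import math


def model(x):
    # Hypothetical differentiable model: a single logistic unit that ignores x[2].
    z = 1.5 * x[0] - 2.0 * x[1] + 0.0 * x[2]
    return 1.0 / (1.0 + math.exp(-z))


def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Approximate the path integral of the gradient with a midpoint Riemann sum."""
    average_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            average_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, average_grad)]


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
print(attributions)
print(sum(attributions), model(x) - model(baseline))
```

The two numbers on the last line should agree closely: integrated gradients attributions sum to the change in output between the baseline and the input. The ignored third feature receives zero attribution even though its input value is the largest.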

Attention visualization and its limits

Transformer models produce attention weights that indicate which tokens a given head attended to when producing each output token. Attention maps are visually intuitive and widely cited as evidence of model reasoning. The problem is that attention weights do not reliably indicate causal importance. Jain and Wallace (2019) showed that attention weights often correlate poorly with gradient-based importance measures, and that very different attention distributions, including permuted ones, can yield nearly the same output. A high attention weight therefore does not by itself mean the token was causally important to the prediction.

Gradient-based attribution methods and SHAP are more causally grounded. Attention maps remain useful for debugging and intuition but should not be treated as explanations in high-stakes contexts.
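A deliberately contrived toy makes the point about attention weights concrete. It is not a reproduction of the Jain and Wallace experiments: the value vectors and both attention distributions are made up, and the "head" is reduced to a weighted sum of values.

```python
def attend(weights, values):
    """Context vector of a single attention head: the weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]


# Hypothetical value vectors; tokens 0 and 1 happen to carry the same information.
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

attention_a = [0.7, 0.1, 0.2]   # "focuses" on token 0
attention_b = [0.1, 0.7, 0.2]   # "focuses" on token 1

output_a = attend(attention_a, values)
output_b = attend(attention_b, values)
print(output_a, output_b)
```

The two attention maps would look very different in a visualization, yet everything downstream of this head receives the same vector. Reading the map alone cannot tell which of the two tokens mattered.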

Mechanistic interpretability

Mechanistic interpretability, an active research program led in part by Anthropic's team, attempts to reverse-engineer the internal computation of neural networks, not just attribute importance to inputs. The analogy is to reverse engineering a compiled binary: the goal is to understand what algorithms the weights implement.

Chris Olah's essay from the Transformer Circuits thread (published June 2022) describes the central challenge as decomposing neural network activations into independently understandable features, analogous to identifying variables in a binary program. The key barrier is superposition: networks appear to represent more features than they have neurons by spreading them across overlapping directions in activation space. As a result, a single neuron often responds to multiple unrelated features (polysemanticity), which makes per-neuron analysis unreliable.
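Superposition and polysemanticity can be illustrated with a hand-built toy: three features packed into a two-neuron activation space along directions 120 degrees apart. Everything here is an assumption chosen for illustration, including the directions and the ReLU readout. It is not taken from any trained model.

```python
import math

# Three hypothetical feature directions in a 2-neuron space, 120 degrees apart.
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def encode(features):
    """Write feature intensities into the 2-neuron activation vector."""
    return [sum(f * d[axis] for f, d in zip(features, directions)) for axis in range(2)]


def decode(activation):
    """Read each feature back with a dot product followed by ReLU."""
    return [max(0.0, activation[0] * d[0] + activation[1] * d[1]) for d in directions]


# Neuron 0 responds to all three features, so it is polysemantic.
neuron_0_response = [d[0] for d in directions]   # approximately [1.0, -0.5, -0.5]

one_active = decode(encode([1.0, 0.0, 0.0]))     # approximately [1.0, 0.0, 0.0]
two_active = decode(encode([1.0, 1.0, 0.0]))     # approximately [0.5, 0.5, 0.0]
print(neuron_0_response, one_active, two_active)
```

When only one feature is active, the readout recovers it cleanly even though there are more features than neurons. When two are active at once, they interfere and each is read back at half strength. Sparse feature activity is what makes this packing workable, and no single neuron corresponds to a single feature.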

Anthropic's Scaling Monosemanticity work (2024) applied sparse autoencoders to residual stream activations from a middle layer of Claude 3 Sonnet, extracting millions of interpretable features from activations that are not interpretable one dimension at a time. This is among the most technically advanced XAI work on frontier LLMs. The goal is not just to explain individual predictions but to provide a complete enough internal map that safety-relevant behaviors can be located and, eventually, verified.
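The basic shape of a sparse autoencoder is an encoder into a wider, non-negative feature layer, a linear decoder back to the activation, and a loss that balances reconstruction error against a sparsity penalty. The sketch below shows a forward pass and a generic reconstruction-plus-L1 loss with hand-set weights. The weights, the L1 coefficient, and the tiny dimensions are assumptions for illustration, no training is performed, and the exact objective and architecture in the cited work may differ.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into non-negative features, then reconstruct it."""
    pre_activation = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, a) for a in pre_activation]                      # ReLU
    reconstruction = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity


# Hand-set weights (hypothetical): 3 dictionary features for a 2-dimensional activation.
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]
w_enc = [list(d) for d in directions]                          # shape 3 x 2
w_dec = [[d[axis] for d in directions] for axis in range(2)]   # shape 2 x 3
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [1.0, 0.0]
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.1)
print(features, reconstruction, loss)
```

Here the activation is reconstructed from a single active feature, so the loss consists only of the sparsity term. In practice the encoder and decoder weights are learned by minimizing this kind of loss over a large sample of activations, and the learned decoder directions are the candidate features.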

04. Key Terms and Methods

Shapley value:
A game-theoretic measure of each feature's average marginal contribution to a prediction. The foundation of SHAP.

Local surrogate:
An interpretable model (linear regression, decision tree) trained to approximate a black-box model's behavior in the neighborhood of a specific data point. The foundation of LIME.

Saliency map:
A visualization of which input regions (pixels, tokens) had the most influence on a model's output, derived from gradient-based attribution.

Polysemanticity:
A neuron representing multiple unrelated concepts simultaneously, a major obstacle to mechanistic interpretability.

Superposition:
The hypothesis that neural networks store more features than they have neurons by using overlapping, near-orthogonal directions in activation space.

Sparse autoencoder:
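An autoencoder trained to reconstruct a model's activations through a wider hidden layer under a sparsity penalty, so that each hidden unit tends to correspond to a single feature. Used in mechanistic interpretability to disentangle superposed features into monosemantic directions.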

Model card:
A short document accompanying a trained model that describes its intended uses, performance across demographic groups, evaluation procedures, and known limitations. Proposed by Mitchell et al. (2018, FAT* 2019).

Datasheet for datasets:
A document describing the motivation, composition, collection process, and recommended uses of a dataset. Proposed by Gebru et al. (2018, CACM 2021).

ROME (Rank-One Model Editing):
A method for directly editing factual associations in transformer models by modifying specific feed-forward weight matrices, based on the localization of factual storage to mid-layer modules.
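The core linear-algebra idea behind a rank-one edit can be sketched as follows: treat a weight matrix as a key-to-value map and add a rank-one term so that one chosen key maps to a new value. This is a simplified illustration, not the published ROME procedure. It omits the key-covariance term ROME uses to limit side effects and the steps for choosing the key and value vectors, and the matrix and vectors below are made up.

```python
def matvec(w, k):
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in w]


def rank_one_edit(w, key, new_value):
    """Return w + (new_value - w @ key) key^T / (key . key), so the result maps key to new_value."""
    current = matvec(w, key)
    residual = [v - c for v, c in zip(new_value, current)]
    norm2 = sum(k_j * k_j for k_j in key)
    return [
        [w_ij + r_i * k_j / norm2 for w_ij, k_j in zip(row, key)]
        for row, r_i in zip(w, residual)
    ]


# Hypothetical 2 x 3 weight matrix, key vector, and target value.
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
key = [1.0, 0.0, 0.0]
new_value = [5.0, -1.0]

edited = rank_one_edit(w, key, new_value)
print(matvec(edited, key))               # [5.0, -1.0]

other_key = [0.0, 1.0, 0.0]              # orthogonal to the edited key
print(matvec(w, other_key), matvec(edited, other_key))   # both [0.0, 1.0]
```

The edited matrix returns the new value for the chosen key and leaves outputs for orthogonal keys unchanged. Keys that overlap with the edited one are affected in proportion to that overlap, which is one reason real editing methods need more than this bare update.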

05. Examples

A bank uses a gradient-boosted model to approve or deny mortgage applications. Regulators require that each denied applicant receive an explanation. The team deploys SHAP to generate per-application feature attributions, which their front-end system translates into natural-language explanations: "Your application was declined primarily because your debt-to-income ratio (contribution: -0.18) and employment tenure below 2 years (contribution: -0.12) did not meet our criteria."
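The translation step in this example might look like the sketch below. It assumes the attributions have already been computed and arrive as a plain dictionary. The feature names, the two extra features, the description strings, and the message wording are all hypothetical.

```python
def adverse_reasons(contributions, descriptions, max_reasons=2):
    """Return text for the most negative attributions, most negative first."""
    negative = sorted(
        (item for item in contributions.items() if item[1] < 0),
        key=lambda item: item[1],
    )
    return [
        f"{descriptions.get(name, name)} (contribution: {value:+.2f})"
        for name, value in negative[:max_reasons]
    ]


def denial_message(contributions, descriptions):
    reasons = adverse_reasons(contributions, descriptions)
    return "Your application was declined primarily because of: " + "; ".join(reasons) + "."


# Hypothetical per-application attributions (negative values push toward denial).
contributions = {
    "debt_to_income": -0.18,
    "employment_tenure": -0.12,
    "credit_history_length": 0.05,
    "savings_balance": -0.03,
}
descriptions = {
    "debt_to_income": "your debt-to-income ratio",
    "employment_tenure": "employment tenure below 2 years",
}

reasons = adverse_reasons(contributions, descriptions)
message = denial_message(contributions, descriptions)
print(message)
```

Only the factors that pushed the decision toward denial are reported, ordered by magnitude and capped at a small number so the message stays actionable. Which factors may be cited, and in what wording, is a compliance decision that sits outside the attribution method.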

A medical imaging team trains a neural network to detect diabetic retinopathy. They use integrated gradients to generate saliency maps over retinal scans. During a debugging session, they discover the model attributes most importance to the optic disc in images from one scanner manufacturer and to lesion areas in images from another, revealing a hardware-specific bias that their accuracy metric had masked.

Anthropic researchers apply sparse autoencoders to Claude 3 Sonnet's activations and find features that correspond to entities, countries, emotional valence, and concepts like "deception," with the last firing on inputs about fraud, manipulation, and dishonesty across many languages. This gives them a handle for locating where the concept is represented and for testing, by amplifying or suppressing the feature, how it influences the model's output.

06. Common Pitfalls and Misconceptions

"Attention equals explanation."
Attention weights are not reliable indicators of causal importance. A high-attention token may not be the reason for the prediction. Gradient-based attribution and Shapley values are generally more reliable.

"LIME and SHAP give the same answer."
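They use different mathematical foundations and can give substantially different attributions for the same prediction. Neither is definitively correct. Both are local methods, but they answer slightly different questions: LIME describes how a surrogate fitted near the input responds to each feature, while SHAP divides the gap between the prediction and a baseline among the features.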

"Interpretability and accuracy trade off."
Partly true when interpretability means using an inherently interpretable model (linear regression is readable but limited), but less true for modern ML. Post-hoc explanation methods can be applied to any model without changing its accuracy, and mechanistic interpretability studies a trained model without constraining its capability.

"Model cards are just paperwork."
Model cards encode engineering decisions about intended use cases, evaluation conditions, and known failure modes. Deploying a model outside its documented use case without checking it against that documentation is an engineering failure.

"XAI solves the alignment problem."
Interpretability informs alignment work but does not replace it. Knowing that a feature labeled "deception" exists in a model's activations does not guarantee the model will not deceive. It is a diagnostic tool, not a safety guarantee.