03. How It Works
SHAP (SHapley Additive exPlanations)
SHAP, introduced by Lundberg and Lee (2017), assigns each input feature a Shapley value: its average marginal contribution to the prediction across all possible subsets of features. Shapley values come from cooperative game theory, where the "game" is producing the prediction and the "players" are the features.
SHAP has desirable mathematical properties: consistency (if a model changes so that a feature's marginal contribution increases or stays the same, its attribution does not decrease), dummy (a feature that contributes nothing gets zero), and additivity (individual contributions sum to the difference between the prediction and the baseline). These properties give SHAP values a consistent meaning across models, although the values still depend on the chosen baseline, so comparisons across datasets need care.
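The definition and the dummy and additivity properties can be checked on a small example. The sketch below computes exact Shapley values by enumerating every feature subset. The toy model, the all-zero baseline, and the choice to represent a "missing" feature by its baseline value are assumptions made for illustration, not part of any SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by brute-force enumeration of feature subsets.

    Assumption: a feature outside the coalition takes its baseline value.
    """
    n = len(x)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    # Hypothetical model: features 0 and 1 interact, feature 2 is linear,
    # and feature 3 is ignored entirely.
    a, b, c, _unused = features
    return 2.0 * a * b + 3.0 * c


x = [1.0, 2.0, 1.0, 5.0]
baseline = [0.0, 0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [2.0, 2.0, 3.0, 0.0]
```

The ignored feature receives zero (dummy), the interaction term is split evenly between features 0 and 1, and the four values sum to the prediction minus the baseline prediction (additivity).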
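The key drawback is computational cost. Exact Shapley computation is exponential in the number of features, because it evaluates the model on every possible feature subset. Specialized algorithms make it tractable: TreeSHAP exploits tree structure to compute exact values for tree models in polynomial time, while KernelSHAP (for arbitrary models) and DeepSHAP (for neural networks) approximate the values and therefore introduce approximation error.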
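To illustrate the trade-off between cost and approximation error, the sketch below estimates Shapley values by averaging marginal contributions over randomly sampled feature orderings. This is plain Monte Carlo permutation sampling, not KernelSHAP, TreeSHAP, or DeepSHAP, and the toy model, baseline, and sample count are assumptions chosen for illustration.

```python
import random


def sampled_shapley(model, x, baseline, n_permutations=2000, seed=0):
    """Monte Carlo Shapley estimate from random feature orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        previous_output = model(current)
        # Switch features from baseline to actual value one at a time and
        # credit each feature with the change in output it causes.
        for i in order:
            current[i] = x[i]
            new_output = model(current)
            phi[i] += new_output - previous_output
            previous_output = new_output
    return [total / n_permutations for total in phi]


def toy_model(features):
    # Hypothetical model: features 0 and 1 interact, feature 2 is linear,
    # and feature 3 is ignored entirely.
    a, b, c, _unused = features
    return 2.0 * a * b + 3.0 * c


x = [1.0, 2.0, 1.0, 5.0]
baseline = [0.0, 0.0, 0.0, 0.0]
estimate = sampled_shapley(toy_model, x, baseline)
print(estimate)
```

In this sketch the number of model evaluations is n_permutations * (n + 1), which grows linearly with the number of features instead of exponentially. The price is sampling noise: the estimates for the two interacting features fluctuate around their exact value of 2.0 from run to run, while the estimates still sum to the prediction minus the baseline prediction because each ordering's contributions telescope.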
LIME (Local Interpretable Model-Agnostic Explanations)
LIME, introduced by Ribeiro, Singh, and Guestrin (2016), explains a single prediction by fitting an interpretable surrogate model, typically a linear regression or decision tree, around that prediction in a locally sampled neighborhood. The surrogate is only required to be locally faithful, not globally accurate.
The process: perturb the input, observe how the model's output changes, weight the perturbed samples by their proximity to the original input, and train a simple model on this weighted dataset. The simple model's coefficients are the explanation.
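The four steps above can be sketched for tabular input using only the standard library. This is a simplified illustration, not the `lime` package's implementation: the Gaussian perturbations, the exponential proximity kernel, the parameter defaults, and the `black_box` model are all assumptions.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def lime_explain(model, x, n_samples=500, noise=0.3, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around x."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # 1. Perturb the input.
        z = [xi + rng.gauss(0.0, noise) for xi in x]
        # 2. Observe the model's output on the perturbed sample.
        targets.append(model(z))
        # 3. Weight the sample by its proximity to the original input.
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
        rows.append([1.0] + z)  # leading 1.0 is the intercept column
    # 4. Train a simple model: weighted least squares via normal equations.
    k = len(x) + 1
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
            for i in range(k)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[1:]  # drop the intercept; the coefficients are the explanation


def black_box(z):
    # Hypothetical nonlinear model to be explained.
    return z[0] ** 2 + 3.0 * z[1]


coefficients = lime_explain(black_box, [1.0, 2.0])
print(coefficients)
```

Near the point (1.0, 2.0) the black-box function has slope 2 in the first feature and 3 in the second, so the surrogate's coefficients should land close to those values. Changing `seed`, `noise`, or `kernel_width` shifts the coefficients somewhat.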
LIME is flexible (model-agnostic, works for text, image, and tabular data) but has a known instability problem: because the neighborhood is sampled at random, rerunning LIME or slightly changing the sampling settings can produce substantially different explanations for the same prediction.
Feature importance
For tree-based models, feature importance can be computed directly from the tree structure: how often a feature is used for splits, weighted by the impurity reduction (for example, information gain) at each split. For neural networks, gradient-based methods, including vanilla gradients, integrated gradients, and SmoothGrad, attribute importance to input dimensions using the gradient of the output with respect to the input, that is, how the output responds to small input perturbations.
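As one concrete gradient-based method, the sketch below implements integrated gradients for a small function. It averages gradients along a straight path from a baseline to the input and scales them by the input difference. The toy function, the all-zero baseline, and the use of finite-difference gradients in place of automatic differentiation are assumptions made to keep the example dependency-free.

```python
import math


def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Approximate the path integral of gradients with a midpoint Riemann sum."""
    n = len(x)
    average_gradient = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        gradient = numerical_gradient(f, point)
        for i in range(n):
            average_gradient[i] += gradient[i] / steps
    # Scale the averaged gradient by how far each input moved from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, average_gradient)]


def toy_network(v):
    # Hypothetical differentiable model with a saturating nonlinearity.
    return math.tanh(2.0 * v[0] - 1.0 * v[1]) + 0.5 * v[2]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_network, x, baseline)
vanilla = numerical_gradient(toy_network, x)
print(attributions, vanilla)
```

Integrated gradients attributions sum (up to discretization error) to the difference between the output at the input and at the baseline. The vanilla gradient at `x` is a purely local slope, so it shrinks for features 0 and 1 where the tanh term is saturating.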
Attention visualization and its limits
Transformer models produce attention weights that indicate which tokens a given head attended to when producing each output token. Attention maps are visually intuitive and widely cited as evidence of model reasoning. The problem is that attention weights do not reliably indicate causal importance. Jain and Wallace (2019) found that attention weights often correlate only weakly with gradient-based importance measures, and that attention can frequently be permuted or replaced with a very different distribution without materially changing the output. A high attention weight therefore does not mean the token was causally important to the prediction.
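A deliberately constructed toy case shows how two very different attention distributions can yield the same output. It is not a reproduction of Jain and Wallace's experiments: the value vectors and weights below are hypothetical and chosen so that two tokens carry identical information.

```python
def attend(weights, values):
    """Attention output: the weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]


# Tokens 0 and 1 carry identical value vectors; token 2 differs.
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

original = [0.7, 0.1, 0.2]     # an attention map would highlight token 0
alternative = [0.1, 0.7, 0.2]  # an attention map would highlight token 1

out_original = attend(original, values)
out_alternative = attend(alternative, values)
print(out_original, out_alternative)
```

Both distributions produce the same output vector, so the attention map alone cannot show which of the two tokens mattered to the prediction.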
Gradient-based attribution methods and SHAP are better grounded, because they measure how the prediction responds to changes in the input. Attention maps remain useful for debugging and intuition but should not be treated as explanations in high-stakes contexts.
Mechanistic interpretability
Mechanistic interpretability, an active research program led in part by Anthropic's team, attempts to reverse-engineer the internal computation of neural networks, not just attribute importance to inputs. The analogy is to reverse engineering a compiled binary: the goal is to understand what algorithms the weights implement.
Chris Olah's essay from the Transformer Circuits thread (published June 2022) describes the central challenge as decomposing neural network activations into independently understandable features, analogous to identifying variables in a binary program. The key barrier is superposition: a network appears to represent more features than it has neurons, so a single neuron often responds to multiple unrelated features (polysemanticity), which makes per-neuron analysis unreliable.
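A minimal toy picture of superposition, assuming three sparse features stored in a two-neuron activation space along directions 120 degrees apart. The geometry and feature count are illustrative choices, not taken from the essay.

```python
import math

# Three hypothetical feature directions in a 2-neuron activation space,
# spaced 120 degrees apart (more features than neurons).
angles = [math.radians(a) for a in (90, 210, 330)]
directions = [(math.cos(a), math.sin(a)) for a in angles]


def activations(feature_values):
    """Neuron activations: the sum of active features along their directions."""
    return [sum(f * d[k] for f, d in zip(feature_values, directions))
            for k in range(2)]


def recover(neuron_activations):
    """Project onto each feature direction, then apply ReLU to drop interference."""
    projections = [neuron_activations[0] * d[0] + neuron_activations[1] * d[1]
                   for d in directions]
    return [max(0.0, p) for p in projections]


for i in range(3):
    one_hot = [1.0 if j == i else 0.0 for j in range(3)]
    acts = activations(one_hot)
    print(i, [round(a, 3) for a in acts], [round(r, 3) for r in recover(acts)])
```

The second neuron responds to all three features, so reading it in isolation is ambiguous. Projecting onto the feature directions recovers each feature cleanly as long as only one is active at a time, which is why analysis in terms of feature directions is more informative than analysis in terms of individual neurons.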
Anthropic's Scaling Monosemanticity work (2024) applied sparse autoencoders to Claude 3 Sonnet's residual stream, extracting millions of features, many of them interpretable, from activations whose individual dimensions are polysemantic. This is among the most technically advanced XAI work on frontier LLMs. The goal is not just to explain individual predictions but to provide a complete enough internal map that safety-relevant behaviors can be located and, eventually, verified.
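The general shape of a sparse autoencoder can be sketched as a forward pass plus a loss that combines reconstruction error with an L1 sparsity penalty. This is a generic illustration and does not reproduce the published architecture, loss weighting, or training procedure. The weights below are set by hand for a three-feature, two-dimensional toy case, whereas in practice they are learned by gradient descent on real activations.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    pre_activation = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre_activation]  # ReLU
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    sparsity = sum(features)  # features are non-negative, so this is the L1 norm
    return error + l1_coeff * sparsity


# Hand-set dictionary: 3 features for a 2-dimensional activation (overcomplete).
angles = [math.radians(a) for a in (90, 210, 330)]
directions = [(math.cos(a), math.sin(a)) for a in angles]
w_enc = [list(d) for d in directions]                        # 3 x 2
w_dec = [[d[k] for d in directions] for k in range(2)]       # 2 x 3
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = list(directions[0])  # an activation in which only feature 0 is present
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, reconstruction, l1_coeff=0.1)
print(features, reconstruction, loss)
```

With this dictionary the activation is explained by a single active feature and reconstructed almost exactly, so the loss is dominated by the sparsity term. Training pushes a real sparse autoencoder toward the same kind of solution: few active features per input, each aligned with a reusable direction.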