04. Key Terms / Benchmarks
MMLU (Massive Multitask Language Understanding): 57 academic subjects from elementary math to professional law and medicine, ~14,000 multiple-choice test questions. Introduced by Hendrycks et al. in 2020. The dominant knowledge-breadth benchmark through 2023. Saturated by 2026: frontier models cluster above 88%, so the remaining score differences are too small to be meaningful for model selection.
MMLU-Pro: Harder variant with 10-choice questions and more graduate-level difficulty. Frontier models now score around 90% (Gemini 3 Pro: 90.1%, Claude Opus 4.5: 89.5%), indicating that it too is nearing saturation.
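One way to gauge how much of a gap between two scores could be sampling noise is a binomial standard error on the accuracy. The sketch below is illustrative only: the function names are made up for this example, it assumes questions are independent, and it treats the two models' runs as independent samples (a simplification, since both models answer the same questions). It ignores label errors and contamination, which add further uncertainty.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def confidence_interval_95(accuracy: float, n_questions: int) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for the accuracy."""
    half_width = 1.96 * accuracy_standard_error(accuracy, n_questions)
    return accuracy - half_width, accuracy + half_width


def difference_is_significant(acc_a: float, acc_b: float, n_questions: int) -> bool:
    """Two-sided z-test at the 5% level, treating the two runs as independent.

    Assumption: independence between runs. A paired test on per-question
    results would be more precise when those results are available.
    """
    se_a = accuracy_standard_error(acc_a, n_questions)
    se_b = accuracy_standard_error(acc_b, n_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) > 1.96 * se_diff


if __name__ == "__main__":
    n = 14_000  # approximate MMLU question count quoted above
    low, high = confidence_interval_95(0.88, n)
    print(f"95% CI for 88.0%: {low:.4f} to {high:.4f}")
    print("88.0% vs 88.5% significant:", difference_is_significant(0.88, 0.885, n))
```

Under these assumptions the interval around an 88% score is roughly half a percentage point on either side, so gaps of a few tenths of a point fall within sampling noise.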
GPQA Diamond: 198 PhD-level questions in biology, chemistry, and physics. Non-expert humans score 34%; domain experts average 65%. Still differentiates models in the 60-91% range, making it one of the more useful current science benchmarks. o3/GPT-5.1 reached 91.9%.
HumanEval: 164 Python programming problems that test function synthesis. Introduced by OpenAI in 2021. Saturated: frontier models exceed 95%. Not predictive of real-world coding assistance quality.
SWE-bench: Real GitHub issues from open-source Python repositories. The model must understand the codebase, identify the bug, and write a fix that makes the issue's failing tests pass without breaking tests that already passed. Far more predictive of coding assistant quality than HumanEval. OpenAI found 59.4% of hard SWE-bench tasks have flawed tests. SWE-bench Pro (Scale AI) shows the same model scoring 80.9% on the original vs. 45.9% on the improved version, illustrating contamination effects.
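The pass/fail decision for a single SWE-bench task can be sketched as a check over two groups of tests: tests that failed before the fix and should now pass, and tests that passed before and should keep passing. This is a simplified illustration, not the official harness. The function name, argument names, and test identifiers are hypothetical, and the per-test results are assumed to have been collected already by running the repository's tests with the model's patch applied.

```python
def is_resolved(
    status_after_patch: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True if the patched repository satisfies both test groups.

    status_after_patch maps a test identifier to True (passed) or False (failed).
    A test missing from the results is treated as a failure.
    """
    # Tests tied to the issue: they failed before the fix and must pass now.
    fixed = all(status_after_patch.get(test, False) for test in fail_to_pass)
    # Regression guard: tests that passed before the fix must still pass.
    unbroken = all(status_after_patch.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


if __name__ == "__main__":
    results = {"test_issue_case": True, "test_existing_a": True, "test_existing_b": False}
    print(is_resolved(results, ["test_issue_case"], ["test_existing_a"]))
    print(is_resolved(results, ["test_issue_case"], ["test_existing_a", "test_existing_b"]))
```

A benchmark score is then the fraction of tasks for which this check succeeds, which is also why flawed tests distort the result: a correct patch can be marked as failing, or an incorrect one as passing.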
MATH and MATH-500: Competition mathematics problems at AMC/AIME difficulty. MATH-500 is a curated 500-question subset used for faster evaluation. DeepSeek R1 reached 97.3% on MATH-500.
AIME (American Invitational Mathematics Examination): High-school competition math, used as a qualifying round in the selection of U.S. math olympiad teams. AIME 2024 and AIME 2025 are now standard reasoning model benchmarks. o3 scored 88.9% on AIME 2025. DeepSeek-R1-Zero scored 71% pass@1 on AIME 2024, rising to 86.7% with majority voting.
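The gap between pass@1 and majority voting comes from how multiple sampled answers per problem are scored. The sketch below is a minimal illustration with hypothetical function names and toy data. It assumes each problem has several sampled final answers stored as strings, computes pass@1 as the average correctness of individual samples, and scores majority voting by checking only the most frequent answer.

```python
from collections import Counter


def pass_at_1(samples: list[str], reference: str) -> float:
    """Fraction of sampled answers that match the reference answer."""
    return sum(answer == reference for answer in samples) / len(samples)


def majority_vote(samples: list[str]) -> str:
    """Most frequent sampled answer (ties go to the answer seen first)."""
    return Counter(samples).most_common(1)[0][0]


def evaluate(problems: list[tuple[list[str], str]]) -> tuple[float, float]:
    """Return (mean pass@1, majority-vote accuracy) over all problems."""
    mean_pass_at_1 = sum(pass_at_1(s, ref) for s, ref in problems) / len(problems)
    vote_accuracy = sum(majority_vote(s) == ref for s, ref in problems) / len(problems)
    return mean_pass_at_1, vote_accuracy


if __name__ == "__main__":
    # Toy data: (sampled answers, reference answer) per problem.
    toy_problems = [
        (["204", "204", "113", "204", "072"], "204"),
        (["33", "17", "17", "90"], "33"),
    ]
    print(evaluate(toy_problems))
```

On the first toy problem only 3 of 5 samples are correct, yet the vote is right. Aggregating samples can therefore lift accuracy whenever the correct answer is the most common one, at the cost of several generations per problem.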
ARC-AGI (Abstraction and Reasoning Corpus): Created by François Chollet in 2019 to test novel generalization from minimal examples. ARC-AGI-1: reasoning models now score 85%+ with scaffolding. ARC-AGI-2: o3 at 45.1%, standard frontier models near 0%. ARC-AGI-3 (2026): every frontier model scored below 1%, while untrained humans scored 100%. ARC-AGI-3 introduces interactive, multi-step tasks that cannot be gamed by static pattern matching.
Humanity's Last Exam (HLE): 2,500 expert-designed questions across domains. Top models reach only 37.5% while human domain experts average approximately 90%. Currently the hardest static benchmark.
Chatbot Arena / Elo: Human preference-based ranking via blind A/B battles. Less gameable than task benchmarks because it measures whether real users prefer one model's responses over another. Considered the most reliable conversational quality signal.
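How blind A/B battles turn into a ranking can be illustrated with the classic Elo update. This is a sketch of textbook Elo only. The function names, the K-factor of 32, and the starting rating of 1000 are assumptions for the example, and the leaderboard's actual computation may differ (for instance, by fitting all battles at once rather than updating after each one).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Update both ratings after one battle.

    score_a is 1.0 if users preferred A, 0.0 if they preferred B, 0.5 for a tie.
    k (assumed 32 here) controls how far a single battle moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    model_a, model_b = 1000.0, 1000.0  # assumed starting ratings
    # Three battles: A preferred, A preferred, B preferred.
    for outcome in (1.0, 1.0, 0.0):
        model_a, model_b = update_elo(model_a, model_b, outcome)
    print(round(model_a, 1), round(model_b, 1))
```

The update is zero-sum: whatever one model gains the other loses, and an upset against a much higher-rated model moves the ratings more than an expected win.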
LiveBench: A contamination-limited benchmark that refreshes questions from recent events, competition problems, and newly published papers, preventing memorization. Available on OpenReview.