
AI Benchmarks

Standardized tests used to measure and compare AI model capabilities.

Key Facts
MMLU Top Score: 89.2% (LLaMA 4-400B)
GPQA Diamond SOTA: 92% (GPT-5, leaked)
ARC-AGI SOTA: 87.5% (OpenAI o3)
HumanEval Leader: 87.3% (LLaMA 4)
Key Warning: benchmark overfitting risk
New Benchmark: GPQA Diamond gaining traction

AI benchmarks are standardized evaluation datasets used to measure specific model capabilities and compare models objectively. Key benchmarks include MMLU (multitask language understanding across 57 subjects), HumanEval (Python coding), MATH (mathematical reasoning), GPQA Diamond (graduate-level STEM questions), AIME (competition mathematics), and ARC-AGI (abstract general reasoning). Benchmark performance drives product decisions, pricing, and public perception, yet leading researchers increasingly warn of "benchmark gaming," where models are optimized for specific tests rather than genuine capability.
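Of these, HumanEval is scored with the pass@k metric: generate n samples per problem, count how many pass the unit tests, and estimate the probability that at least one of k samples succeeds. Below is a minimal sketch of the unbiased estimator from the original HumanEval paper (Chen et al., 2021); the n, c, and k values in the usage lines are illustrative, not figures from this page.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: samples that passed the unit tests
    k: budget being scored (e.g. pass@1, pass@10)
    Returns the probability that at least one of k randomly
    drawn samples is correct.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), expanded into a numerically
    # stable running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Illustrative numbers only: 200 samples per problem, 11 passing
print(round(pass_at_k(200, 11, 1), 3))   # 0.055 -> pass@1
print(round(pass_at_k(200, 11, 10), 3))  # ~0.44 -> pass@10
```

The other benchmarks listed above are typically reported as plain accuracy over their question or task sets, so no sampling correction of this kind is needed for them.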