The Half-life of Benchmarks

Every AI benchmark expires, but measuring reliability across longer time horizons reveals what sets the absolute limits.

Mar 9, 2026 · 5 min read

Every model release needs to show progress on meaningful dimensions, to the point where evals have themselves become marketing.

But benchmarks have an expiry date. MMLU lasted four years, HumanEval lasted two, and the newer ones might not make it to 18 months. Their lifespans are shrinking exponentially.

Why, then, do some expire faster than others? And what do durable evals actually require?

The Graveyard

[Chart: benchmark timeline, 2019–2026, marking each eval as saturated, fading, or active: ARC-AGI-1, MMLU, GSM8K, HumanEval, MATH, GPQA, SWE-bench, FrontierMath, ARC-AGI-2, HLE, METR time horizon.]

Two things stand out:

  1. Their useful lifespan is compressing, and
  2. The benchmarks that last longest test capabilities that aren't on the internet

FrontierMath uses problems with no published solutions, Humanity's Last Exam (HLE) solicits questions from domain experts, and ARC-AGI-2 tests novel abstract reasoning patterns.

The ones that die fast, like MMLU, GSM8K, and HumanEval, are static datasets that inevitably leak into training data. A contamination-free rewrite of MMLU revealed accuracy drops of up to 16% 1. OpenAI recently retired SWE-bench Verified, replacing it with the harder SWE-bench Pro, where top scores dropped from >70% to under 25% 2.

And then there's search-time contamination, where agents find benchmark answers via retrieval during inference 3.

All that to say, static benchmarks are terminal.

The Ruler

This is why METR's time horizon benchmark sticks out.

The 50% time horizon measures the length of tasks (in human-expert time) that a model can complete with 50% reliability 4. The metric doesn't cap out at 100% because there's always a longer task. Think of it as a ruler, not a test, as it measures where you are on a continuous scale, not whether you pass a discrete one.

And it has been doubling roughly every 7 months since 2019 (a Moore's law for unsupervised autonomy of sorts).
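The doubling law is simple enough to state as code. A minimal sketch: the 7-month doubling constant comes from METR's data, while the starting horizon below is purely illustrative.

```python
def projected_horizon(h0_minutes: float, months_elapsed: float,
                      doubling_months: float = 7.0) -> float:
    """Project the 50% time horizon forward under a fixed doubling rate."""
    return h0_minutes * 2 ** (months_elapsed / doubling_months)

# Illustrative: a 60-minute horizon doubles twice over 14 months.
projected_horizon(60, 14)  # 240.0 minutes
```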

[Chart: METR 50% time horizon for Anthropic, OpenAI, and Google models against a 7-month doubling trendline.]

Opus 4.6 pushed the 50% time horizon to ~12 hours, but the 95% confidence interval stretches from 5.3 to 66 hours, which METR acknowledges is noisy given that their task suite is "nearly saturated."

Other domains METR measured also show exponential progress: software, math, and scientific reasoning all double at ~2–6 month rates. Agentic computer use (OSWorld, WebArena) lags 40–100x behind but rides the same slope. Even autonomous driving shows exponential gains, just slower (~0.6 doublings per year) 5.

There's a catch, though: the ruler is infinite, but the marks on it are hand-carved. METR expanded the suite by 34% in January 2026, adding longer tasks. Yet with model capability now doubling every 4 months post-2023, the task suite would need to grow quarterly just to keep up 6. And building a single 20-hour task with human baselines, automated scoring, and QA consumes weeks of expert time.
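The mismatch can be quantified: under an exponential with doubling time d, a one-off suite expansion by factor f buys only log2(f) · d months of headroom. A back-of-envelope sketch using the figures above (not METR's own accounting):

```python
import math

def headroom_months(expansion_factor: float, doubling_months: float = 4.0) -> float:
    """Months of capability growth absorbed by growing the task suite once,
    assuming the frontier's time horizon doubles every `doubling_months`."""
    return math.log2(expansion_factor) * doubling_months

# The January 2026 expansion grew the suite by 34%:
headroom_months(1.34)  # ≈ 1.7 months of runway
```

On these assumptions, a one-third expansion of the suite covers less than two months of frontier progress.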

METR seemingly avoids every other failure mode that kills static benchmarks: tasks are bespoke and success is continuous. You simply can't google your way through a never-before-seen 5-hour software task 3. The design is genuinely durable but the supply feeding it is not.

No Free Lunch

Static benchmarks saturate and METR's ruler runs out of marks. So what doesn't expire?

Elo-based systems like Chatbot Arena dodge saturation entirely but they measure preference, not capability. A verbose, confident-sounding response wins votes over a terse, correct one.

So we're left with three buckets:

  1. Static benchmarks: precise and verifiable but expire fast
  2. Time horizon benchmarks: scale continuously but throttle on expert task creation
  3. Elo/preference systems: never expire but measure vibes not capability

There is no silver bullet.

The Expertise Ceiling

We're burning through eval data faster than training data. The benchmarks that still have runway all require domain experts to hand-craft problems. Case in point: FrontierMath commissioned 60+ mathematicians, HLE got ~1,000 specialists across 500 institutions, and METR compensates senior engineers at generous hourly rates for thousands of baseline hours 7 8 9.

Along with raw data and compute, the ceiling now includes the supply of expert human judgment.

[Interactive chart: the expertise pyramid, from the general population (~100M) through practitioners (~27M), PhD-level (~10M), senior researchers (~100K), and world-class specialists (~10K) to frontier experts (~1K); raising the capability level slices off the lower tiers.]

As capabilities improve, the minimum expertise needed to write meaningful evals rises, shrinking the available evaluator pool.

That pool is shrinking from above. Every time a frontier model crosses a capability threshold, it disqualifies a tier of evaluators. When GPT-4 saturated MMLU, undergrads stopped being useful baselines. When models cracked competition math, the pool contracted to research mathematicians, and FrontierMath needed Fields Medal-caliber contributors.

Even if models can generate candidate eval tasks, someone still has to verify the answer. You can't check a proof you don't understand. That leaves two paths: model-graded evals (which are inherently circular) or eval costs that go superlinear with capability.

The 5x Gap and Beyond

METR's data shows the ratio between a model's 50% and 80% time horizons is approximately 5x 4. A model with a 10-hour time horizon at 50% reliability only handles ~2-hour tasks reliably. Anthropic's own deployment data confirms this gap: METR estimates Claude Opus 4.5 at ~4.9 hours, but the 99.9th-percentile turn duration in Claude Code is ~42 minutes, a figure that factors in user interruptions, clarifications, and approvals 10.

And that's a constant multiplier that doesn't shrink as models improve.
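On an exponential curve, a constant ratio translates into a constant time lag, so the 80%-reliable horizon trails the 50% curve by a fixed number of months. A sketch using the source's 5x gap and 4-month doubling rate:

```python
import math

GAP = 5.0              # ratio of 50%-horizon to 80%-horizon (METR data)
DOUBLING_MONTHS = 4.0  # post-2023 doubling rate

# Closing a constant 5x gap takes log2(5) doublings, i.e. a fixed lag:
lag = math.log2(GAP) * DOUBLING_MONTHS  # ≈ 9.3 months
# Whatever a model does at 50% reliability today, expect the frontier
# to do it at 80% reliability roughly nine months later.
```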

[Chart: time horizon projection at a 4-month doubling rate, plotting 50%-reliable and 80%-reliable horizons.]

Projection starts from Opus 4.6's ~12h time horizon (Feb 2026). Post-2023 doubling rate is ~4 months.

If you ship agents to production, plan around the 80% number, not the 50%. Today that's roughly an hour for frontier models to run unsupervised.

Your internal evals will remain your moat. Every public benchmark will eventually leak, saturate, or both. The ones that hold up are built from your failure modes, your users, and your own domain experts.

But when you zoom out, the picture grows stranger. We're in a race where the thing being measured is catching up to the people doing the measuring. The benchmarks that still work all depend on experts who are vanishing into the frontier.

At some point, the bottleneck becomes whether there's anyone left who can tell you if the answer is right.

Footnotes

  1. MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

  2. Why SWE-bench Verified No Longer Measures Frontier Coding Capabilities

  3. Search-Time Data Contamination

  4. Measuring AI Ability to Complete Long Tasks

  5. How Does Time Horizon Vary Across Domains?

  6. Time Horizon 1.1

  7. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

  8. Humanity's Last Exam

  9. HCAST: Human-Calibrated Autonomy Software Tasks

  10. Measuring AI Agent Autonomy in Practice