Every model release needs to make progress across meaningful dimensions, to the point where evals have themselves become marketing.
But benchmarks have an expiry date. MMLU lasted four years, HumanEval lasted two, and the newer ones might not make it to 18 months. Benchmark lifespans are shrinking exponentially.
Why, then, do some expire faster than others? And what do durable evals actually require?
The Graveyard
Two things stand out:
- Their useful lifespan is compressing, and
- The benchmarks that last longest test capabilities that aren't on the internet
FrontierMath uses problems with no published solutions, Humanity's Last Exam (HLE) solicits questions from domain experts, and ARC-AGI-2 tests novel abstract reasoning patterns.
The ones that die fast (MMLU, GSM8K, HumanEval) are static datasets that inevitably leak into training data. A contamination-free rewrite of MMLU revealed accuracy drops of up to 16% [1]. OpenAI recently retired SWE-bench Verified, replacing it with the harder SWE-bench Pro, where top scores dropped from over 70% to under 25% [2].
And then there's search-time contamination, where agents find benchmark answers via retrieval during inference [3].
All that to say, static benchmarks are terminal.
The Ruler
This is why METR's time horizon benchmark sticks out.
The 50% time horizon measures the length of tasks (in human-expert time) that a model can complete with 50% reliability [4]. The metric doesn't cap out at 100% because there's always a longer task. Think of it as a ruler, not a test: it measures where you are on a continuous scale, not whether you pass a discrete one.
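Concretely, METR's methodology fits a logistic curve of success probability against log task length, and the p% horizon is where that curve crosses p. A minimal sketch of that calculation (the coefficients below are invented for illustration, not METR's actual estimates):

```python
import math

def horizon(p, a, b):
    """Task length (in minutes) at which a logistic success model
    P(success | t) = sigmoid(a - b * log2(t)) crosses probability p.

    Solving p = 1 / (1 + exp(-(a - b * log2(t)))) for t gives
    t = 2 ** ((a - logit(p)) / b)."""
    logit_p = math.log(p / (1 - p))
    return 2 ** ((a - logit_p) / b)

# Made-up coefficients: a=6, b=1 puts the 50% horizon at 2**6 minutes,
# since logit(0.5) = 0.
print(horizon(0.5, a=6.0, b=1.0))  # 64.0
```

A better model raises `a`, sliding the whole curve toward longer tasks; the ruler itself never runs out.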
And it has been doubling roughly every 7 months since 2019 (a Moore's law for unsupervised autonomy of sorts).
Opus 4.6 pushed the 50% time horizon to ~12 hours, but the 95% confidence interval stretches from 5.3 to 66 hours, which METR acknowledges is noisy given their task suite is "nearly saturated."
Other domains METR measured also show exponential progress: software, math, and scientific reasoning all double at ~2-6 month rates. Agentic computer use (OSWorld, WebArena) lags 40-100x behind but rides the same slope. Even autonomous driving shows exponential gains, just slower (~0.6 doublings per year) [5].
But there's a catch: the ruler is infinite, but the marks on it are hand-carved. METR expanded the suite by 34% in January 2026, adding longer tasks. But with model capability now doubling every ~4 months post-2023, the task suite would need to grow quarterly just to keep up [6]. And building a single 20-hour task, complete with human baselines, automated scoring, and QA, consumes weeks of expert time.
METR seemingly avoids every other failure mode that kills static benchmarks: tasks are bespoke and success is continuous. You simply can't google your way through a never-before-seen 5-hour software task [3]. The design is genuinely durable, but the supply feeding it is not.
No Free Lunch
Static benchmarks saturate and METR's ruler runs out of marks. So what doesn't expire?
Elo-based systems like Chatbot Arena dodge saturation entirely, but they measure preference, not capability. A verbose, confident-sounding response wins votes over a terse, correct one.
So we're left with three buckets:
- Static benchmarks: precise and verifiable but expire fast
- Time horizon benchmarks: scale continuously but throttle on expert task creation
- Elo/preference systems: never expire but measure vibes, not capability
There is no silver bullet.
The Expertise Ceiling
We're burning through evals faster than training data. The benchmarks that still have runway all require domain experts to hand-craft problems. Case in point: FrontierMath commissioned 60+ mathematicians, HLE drew on ~1,000 specialists across 500 institutions, and METR pays senior engineers generous hourly rates for thousands of baseline hours [7][8][9].
Along with raw data and compute, the ceiling now includes the supply of expert human judgment.
As capabilities improve, the minimum expertise needed to write meaningful evals rises, shrinking the available evaluator pool.
That pool is shrinking from above. Every time a frontier model crosses a capability threshold, it disqualifies a tier of evaluators. When GPT-4 saturated MMLU, undergrads stopped being useful baselines. When models cracked competition math, the pool contracted to research mathematicians, and FrontierMath needed Fields Medal-caliber contributors.
Even if models can generate candidate eval tasks, someone still has to verify the answers. You can't check a proof you don't understand. That leaves two paths: model-graded evals (which are inherently circular) or eval costs that grow superlinearly with capability.
The 5x Gap and Beyond
METR's data shows the ratio between the 50% and 80% time horizons is approximately 5x [4]. A model with a 10-hour time horizon at 50% only reliably handles ~2-hour tasks. Anthropic's own deployment data is consistent with this gap: METR estimates Claude Opus 4.5 at ~4.9 hours, but the 99.9th-percentile turn duration in Claude Code is ~42 minutes, a figure that factors in user interruptions, clarifications, and approvals [10].
And that multiplier is roughly constant: it doesn't shrink as models improve.
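Under a logistic success model, that constancy has a clean explanation: the ratio between the 50% and 80% horizons depends only on the curve's slope, not its intercept, so a more capable model shifts both horizons toward longer tasks without closing the gap between them. A minimal sketch (the slope value is chosen to reproduce the observed ~5x, not taken from any actual fit):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def horizon_ratio(b, p_lo=0.5, p_hi=0.8):
    # Under P(success | t) = sigmoid(a - b * log2(t)), the horizon at
    # probability p is 2 ** ((a - logit(p)) / b). Taking the ratio of
    # the p_lo and p_hi horizons cancels the intercept a entirely,
    # leaving only the slope b:
    return 2 ** ((logit(p_hi) - logit(p_lo)) / b)

# A slope of b ~= 0.6 yields roughly the 5x gap, no matter how
# large a is -- i.e. no matter how capable the model.
print(horizon_ratio(0.6))  # ~4.96
```

In other words, improving a model slides the curve right; only a steeper curve (more uniform reliability across task lengths) would shrink the 50%-to-80% gap.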
Projection starts from Opus 4.6's ~12h time horizon (Feb 2026). Post-2023 doubling rate is ~4 months.
If you ship agents to production, plan around the 80% number, not the 50%. Today that's roughly an hour for frontier models to run unsupervised.
Your internal evals will remain your moat. Every public benchmark will eventually leak, saturate, or both. The ones that hold up are built from your failure modes, your users, and your own domain experts.
But when you zoom out, the picture grows stranger. We're in a race where the thing being measured is catching up to the people doing the measuring. The benchmarks that still work all depend on experts who are vanishing into the frontier.
At some point, the bottleneck becomes whether there's anyone left who can tell you if the answer is right.