Hipparchus observing from Alexandria. Engraving, Louis Figuier (1877).
Why?
As more of the economy is delegated to and run by agents, we need layers of verification and governance over what they claim to do versus what they are actually doing. This is easier in domains with verifiable rewards (coding, math) and much harder everywhere else.
As next-token predictors, their reliability, especially on long-horizon tasks, will always be an open question. The abilities we prize in them are partly trained but also emergent. Those higher-order cognitive leaps carry many implicit assumptions, and making every one of them explicit would be exhausting and keep us stuck in review hell.
We need verifiers with more determinism, beyond LLM judges and specs for goals and planning, that let us formally verify agents' work as they CRUD over the real world.
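To make "what they claim to do versus what they are actually doing" concrete, here is a minimal sketch of one deterministic check: diff a snapshot of state before and after the agent runs, and compare the observed changes with the operations the agent reported. Everything here is an illustrative assumption (the `Op` type, the dict-based snapshots, the function names); real systems would need a much richer notion of state, and this is nowhere near formal verification.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Op:
    """A single change to a keyed record (hypothetical type for this sketch)."""
    kind: str  # "create", "update", or "delete"
    key: str


def observed_ops(before: Dict[str, object], after: Dict[str, object]) -> Set[Op]:
    """Derive the changes that actually happened by diffing two snapshots."""
    ops = set()
    for key in after.keys() - before.keys():
        ops.add(Op("create", key))
    for key in before.keys() - after.keys():
        ops.add(Op("delete", key))
    for key in before.keys() & after.keys():
        if before[key] != after[key]:
            ops.add(Op("update", key))
    return ops


def verify(claimed: Set[Op], before: Dict[str, object], after: Dict[str, object]) -> dict:
    """Compare the agent's claimed operations with the observed ones."""
    actual = observed_ops(before, after)
    return {
        "unreported": actual - claimed,   # happened, but the agent never said so
        "unfulfilled": claimed - actual,  # claimed, but did not happen
        "ok": actual == claimed,
    }


# Hypothetical example: the agent reports two changes but also deleted a record.
before = {"invoice-1": 100, "invoice-2": 250}
after = {"invoice-1": 120, "invoice-3": 75}
claimed = {Op("update", "invoice-1"), Op("create", "invoice-3")}
report = verify(claimed, before, after)
```

The check involves no model judgment: given the same snapshots and claims it always returns the same report, which is the property an LLM judge cannot offer.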
When, not If
In the coming decade, most information exchange on the internet will be agent-to-agent. Along with the verification problem, solving for when humans should intervene across the entire trajectory will be critical given rate limits on review time and a constrained expert pool.
An analogy would be a student who precisely knows when to raise their hand to get the teacher to review their work.
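The hand-raising idea can be sketched as a selection problem: given a risk estimate per step and a fixed number of reviews a human can afford, escalate only the riskiest steps above some threshold. This is a toy sketch under loud assumptions: the risk scores, the threshold of 0.5, and the function name are all hypothetical, and how an agent would produce a trustworthy risk estimate is exactly the open problem.

```python
from typing import List, Tuple


def select_for_review(
    steps: List[Tuple[str, float]],
    budget: int,
    threshold: float = 0.5,
) -> List[str]:
    """Pick which steps to escalate to a human reviewer.

    steps: (step_id, risk) pairs, where risk is an assumed score in [0, 1].
    budget: maximum number of steps the reviewer can look at.
    threshold: steps below this risk are never escalated.
    """
    candidates = [(step_id, risk) for step_id, risk in steps if risk >= threshold]
    # Highest risk first; sorted() is stable, so ties keep trajectory order.
    candidates = sorted(candidates, key=lambda item: -item[1])
    return [step_id for step_id, _ in candidates[:budget]]


# Hypothetical trajectory with made-up risk scores.
trajectory = [
    ("read config", 0.05),
    ("draft email", 0.30),
    ("send payment", 0.90),
    ("delete records", 0.75),
    ("update CRM", 0.55),
]
escalated = select_for_review(trajectory, budget=2)
```

With a budget of two reviews, the steps that touch money and delete data are escalated, and the lower-risk CRM update proceeds without review even though it clears the threshold.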
Operate Upstream
There will be consolidation at the model layer; perhaps not one "super agent" but a handful (we already see this with Claude Code, Codex, OpenClaw and Hermes). The same models with different harnesses can show a wide performance gap: on TerminalBench-2, an optimized harness for Opus 4.6 achieved 76.4%, outperforming even Claude Code at 58% [1]. Everything around the model will need to continuously evolve to scaffold the next leap in performance. Put simply:
This recursiveness will not stop. It's better to (re)build scaffolding that augments base capability than applications that the next model release absorbs. This echoes recent thinking from the Claude Code team (Boris in particular) on latent demand and building for what the models need in ~6 months.
The terminology will evolve - evals today, something else tomorrow - but the principle holds. As human needs proxy through agents, verification itself becomes a non-negotiable piece of infra. And it's not about babysitting agents; in fact it'll define the very trust boundaries the entire ecosystem will come to rely on.
That's the bet.
Footnotes
1. Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. Preprint.