Why I'm bullish on evals
As more of the economy is delegated to and run by agents, we'll need layers of verification and governance over what they claim to do versus what they are actually doing. This is easier in domains with verifiable rewards (coding, math) and much harder everywhere else.
Because agents are built on next-token predictors, their reliability, especially on long-horizon tasks, will always be an open question. The abilities we prize in them are partly trained but also emergent. Those higher-order cognitive leaps rest on many implicit assumptions, and making every one of them explicit would be exhausting and keep us stuck in review hell.
We need verifiers with more determinism, beyond specs for goals and planning, that let us formally verify agents' work as they CRUD over the real world.
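To make "what they claim to do and what they are actually doing" concrete, here is a minimal sketch of a deterministic check, assuming we can snapshot the state an agent touches before and after it runs. The `Op` type, the `verify` function, and the invoice data are all hypothetical names for illustration, not an existing tool.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str  # "create", "update", or "delete"
    key: str   # identifier of the record the operation touched


def observed_ops(before: dict, after: dict) -> set[Op]:
    """Derive what actually happened by diffing two state snapshots."""
    ops = set()
    for key in before.keys() | after.keys():
        if key not in before:
            ops.add(Op("create", key))
        elif key not in after:
            ops.add(Op("delete", key))
        elif before[key] != after[key]:
            ops.add(Op("update", key))
    return ops


def verify(claimed: set[Op], before: dict, after: dict) -> dict:
    """Compare the agent's self-report against the observed diff."""
    actual = observed_ops(before, after)
    return {
        "unclaimed": actual - claimed,    # done but never reported
        "unfulfilled": claimed - actual,  # reported but never done
        "ok": actual == claimed,
    }


# Hypothetical run: the agent reports two operations but performed three.
before = {"invoice:1": "draft", "invoice:2": "draft"}
after = {"invoice:1": "sent", "invoice:3": "draft"}
claimed = {Op("update", "invoice:1"), Op("create", "invoice:3")}

report = verify(claimed, before, after)
print(report["ok"])         # False
print(report["unclaimed"])  # {Op(kind='delete', key='invoice:2')}
```

The check involves no model judgment: the same snapshots and the same claims always produce the same report. It only covers state we can snapshot, which is exactly why the non-verifiable domains stay hard.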
When, not If
Most information exchange on the internet will be agent-to-agent. Alongside the verification problem, deciding when humans should intervene over the course of a trajectory will be critical, given the rate limits on human review time.
An analogy would be a student who precisely knows when to raise their hand to get the teacher to review their work.
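One way to picture the hand-raising problem is as a budgeted ranking: spend scarce review slots on the steps where the expected cost of an error is highest. The sketch below assumes the agent can report a calibrated confidence per step and that we can attach a rough impact score to each step; both are strong assumptions, and `Step` and `pick_reviews` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    confidence: float  # assumed: agent's self-reported probability the step is correct
    impact: float      # assumed: cost if the step is wrong, in arbitrary units


def pick_reviews(steps, budget):
    """Return the names of the `budget` steps with the highest expected loss."""
    ranked = sorted(
        steps,
        key=lambda s: (1 - s.confidence) * s.impact,  # expected loss of skipping review
        reverse=True,
    )
    return [s.name for s in ranked[:budget]]


# Hypothetical trajectory with made-up scores.
steps = [
    Step("read config", 0.99, 1),
    Step("migrate database", 0.90, 100),
    Step("send customer email", 0.70, 20),
    Step("rename variable", 0.60, 1),
]

print(pick_reviews(steps, budget=2))
# ['migrate database', 'send customer email']
```

Note that the least confident step ("rename variable") is not escalated, because being wrong there is cheap. A real agent would have to make this call online, step by step, rather than ranking a finished trajectory.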
Operate Upstream
There will be a consolidation of agents at the model layer; perhaps not one "super agent" but a handful (we already see this with Claude Code and Codex). The same model in different harnesses can show a wide performance gap: on TerminalBench-2, an optimized harness for Opus 4.6 achieved 76.4%, outperforming even Claude Code at 58% [1]. Everything around the model will need to evolve continuously to scaffold the next leap in performance.
This recursion will not stop. It's better to (re)build scaffolding that augments base capability than apps that the next model release will absorb. This also echoes Boris Cherny's thinking on latent demand and building for what the models will need in ~6 months. Human needs are now proxied through agents carrying out their intent; therein lies the demand.
Footnotes
1. Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. Preprint.