Devloop: Closing the loop

Coding for me now looks something like this: Codex implements a spec, then I eyeball the diff and bring in Claude Code for review.

Claude then runs an adversarial review and points out potential bugs, after which I pick the structural or blocking issues for Codex to iterate on.

This game of ping pong typically goes on for five or six iterations.

It doesn't know what it doesn't know

You're likely asking: why go through all that trouble when you can one-shot Claude in a ralph loop? There are at least two problems with that.

First, running coding loops over partially specified intent, i.e. without specs, is not an efficient use of human time or machine tokens. I like one-shotting only when it makes sense, which is frankly not often when you're developing software for others.

And second, even if I were to feed it a spec, there's a fundamental gap: a model family has blind spots it cannot critique in itself. LLMs tend to favor outputs that look like their own generations.¹

Self-review isn't uniformly reliable. For example, when modernizing a legacy codebase and measuring drift on observable behaviors, 31.7% of all drift cases are silently endorsed by the same model that produced them.² Switch to rubric-based evals and the same self-preference bias surfaces when a model grades its own work.³

When A/B testing code review, I've found that a same-model reviewer often omits or downgrades findings (e.g. blockers quietly demoted to considerations or nits). Bringing in the other model, whether Claude or Codex, is not unlike getting a second opinion from a doctor: it is ultimately cheaper than the cost of false negatives, where the reviewer misses a real bug.

This is where an actor-critic setup covers ground that a single model typically misses. Putting Codex and Claude Code head-to-head, à la the classic coder-and-reviewer duo, has proven effective for me in practice.

Leaning into it, however, I became the bottleneck. Again.

Jumping between terminal windows was inefficient, and a task would just sit there waiting for the next signal while I babysat other sessions. Codex shipped a Claude Code plugin that lets you invoke Codex from inside Claude, but not the reverse.

Human out of the loop

So, I wanted to automate myself out of the loop. The rule would be to summon me only once the issues are mostly resolved and the PR is in good shape to review.

Think of it as reserving my time for the last mile that'd really benefit from my judgment.

This is how Devloop was born. It's a simple bash harness that completes the loop without me: one skill pins down the spec, Codex implements, Claude Code reviews adversarially using another skill (or the two swap roles), and they iterate until the work is ready for human review.
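
To make the shape of that loop concrete, here is a minimal sketch in bash. It is not Devloop's actual source. The function names `implement`, `review`, and `run_loop`, the `MAX_ROUNDS` cap, and the idea that the reviewer reports a count of blocking findings are all assumptions for illustration, and both agent calls are stubbed out so the script runs on its own.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an implement/review loop; not Devloop's real code.
set -euo pipefail

# Assumption: cap the loop near the five or six rounds seen when doing this by hand.
MAX_ROUNDS=6

# Stand-in for the implementing agent. A real harness would invoke a coding
# agent here with the spec and the reviewer's latest findings.
implement() {
  local round=$1
  echo "round $round: implementing against spec" >&2
}

# Stand-in for the reviewing agent. Prints the number of blocking findings.
# This stub pretends there are 3 blockers and one gets fixed each round.
review() {
  local round=$1
  local remaining=$(( 3 - round ))
  if (( remaining < 0 )); then remaining=0; fi
  echo "$remaining"
}

# Alternate implement and review until no blockers remain or the cap is hit.
run_loop() {
  local round blockers
  for (( round = 1; round <= MAX_ROUNDS; round++ )); do
    implement "$round"
    blockers=$(review "$round")
    if (( blockers == 0 )); then
      echo "ready for human review after $round rounds"
      return 0
    fi
  done
  echo "gave up after $MAX_ROUNDS rounds; escalating to human"
  return 1
}

run_loop
```

With these stubs the script prints `ready for human review after 3 rounds` on stdout. Setting `MAX_ROUNDS` below 3 takes the escalation path instead, which is the case where a human gets pulled in early rather than at the last mile.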

The review doesn't stop at the spec, since acceptance criteria and 100% test coverage are easy to game. It also checks security, maintainability, and completeness: engineering qualities that are hard to catch without deliberate attention.

What not to build matters as much, so I define those boundaries too: no backwards compat for greenfield, no smoke tests where they don't belong.

Having run numerous sessions with it, I can say this works surprisingly well. If I jam on the spec thoroughly, spend time ideating, probing failure modes, and grokking the critical details, then I can let the agent cook for a good hour without supervision.

When I'm back, the work is mostly there. And that's where judgment becomes important again.

Loops all the way

I don't fully buy the idea that software is solved. Some parts are, yes. But judgment and taste just moved upstream, to what to build in the first place.

Every time you remove yourself from one loop, you move up the value chain to join another, incomplete one. Your leverage compounds across loops instead of keeping you occupied in a mostly automatable one.

The IDE/terminal will keep morphing into a place for specifying intent and verifying outputs while agents fill the middle.

This is my attempt to see how far I can remove myself from the loop without trading off quality.

Check it out here. Or simply install via:

curl -fsSL https://devloop.sh/install | bash

Footnotes

1. Arjun Panickssery, Samuel R. Bowman, and Shi Feng, "LLM Evaluators Recognize and Favor Their Own Generations" (NeurIPS 2024) show that LLM judges score their own outputs higher than human raters do, and this self-preference rises with the model's ability to recognize its own text.
2. Gokul Chandra Purnachandra Reddy, Aditya Lolla, and Harsha Sanku, "Articulate but Wrong: Self-Review Failures in LLM-Based Code Modernization" (arXiv 2026) find that when a model reviews its own modernization changes, much of the behavior-altering drift is silently endorsed.
3. José Pombal, Ricardo Rei, and André F. T. Martins, "Self-Preference Bias in Rubric-Based Evaluation of Large Language Models" (arXiv 2026) show self-preference bias persists under rubric-based scoring, with models rating their own outputs more favorably.