Kensa: An Open Source Agent Eval Harness

Kensa lets your coding agent eval any agent.

Apr 8, 2026 · 2 min read

Tell Claude Code to "evaluate this agent", get a working eval suite in minutes.

A recent State of Agent Engineering report 1 found that 89% of engineering teams have adopted observability but only 52% run evals. That's a 37-point gap.

Nearly half of all 1,300 respondents reported not running evals at all. The same teams cite quality as their top "production killer" (32%). You cannot improve something you do not measure, and yet most teams ship without it.

Orgs are painfully aware of the gap, but a) don't know where to start, and b) are under too much pressure to outship the competition.

Incentives for shipping fast and testing for quality have always been at odds in software. Throw LLMs and tools into the mix, and the surface area for bugs explodes.

Kensa 2 is an attempt at closing that gap: an opinionated CLI with bundled skills that lets your coding agent write eval suites for the agents you ship.

The loop:

  1. Tell Claude Code / Codex / Cursor to evaluate the repo
  2. It reads the agent codebase, sets up telemetry, and identifies failure modes from traces
  3. It writes tests (scenarios) and judges
  4. It runs the evals via the kensa CLI
  5. You review, approve, and repeat

It's built on a simple principle: your coding agent reasons, the CLI computes, and the skills orchestrate the workflow between them.

Install via:

npx skills add satyaborg/kensa     # installs 5 eval skills
uv add kensa                       # or: pip install kensa

Open source, MIT licensed. 3

Building agents has never been easier. But let's hold them to the same standard we hold the rest of our software to.

Footnotes

  1. https://www.langchain.com/state-of-agent-engineering

  2. 検査 /ken·sa/ — Japanese for inspection: to check that something meets the standard before release.

  3. See the docs at kensa.sh/docs.