Kensa: An Open Source Agent Eval Harness

Kensa lets your coding agent eval any agent.

Apr 8, 2026 · 2 min read

Tell Claude Code to "evaluate this agent", get a working eval suite in minutes.

A recent State of Agent Engineering report 1 found that 89% of engineering teams have adopted observability but only 52% run evals. That's a 37-point gap.

Nearly half of all 1,300 respondents reported not running evals at all. The same teams cite quality as their top "production killer" (32%). You cannot improve something you do not measure, and yet most teams ship without it.

Orgs are painfully aware of the gap, but a) don't know where to start, and b) are under too much pressure to outship the competition.

Incentives for shipping fast and testing for quality have always been at odds in software. Throw LLMs and tools into the mix, and the surface area for bugs explodes.

Kensa 2 is an attempt at closing that gap: an opinionated CLI with bundled skills that lets your coding agent write eval suites for the agents you ship.

The loop:

  1. Tell Claude Code / Codex / Cursor to evaluate the repo
  2. It reads the agent codebase, sets up telemetry, and identifies failure modes from traces
  3. It writes tests (scenarios) and judges
  4. It runs the evals via the kensa CLI
  5. You review, approve, and repeat

It's built on a simple principle: your coding agent reasons, the CLI computes, and the skills orchestrate the workflow between them.

Install via:

npx skills add satyaborg/kensa     # installs 5 eval skills
uv add kensa                       # or: pip install kensa

Open source, MIT licensed. 3

Building agents has never been easier. But let's hold them to the same standard we hold the rest of our software to.

Footnotes

  1. https://www.langchain.com/state-of-agent-engineering

  2. 検査 /ken·sa/ — Japanese for inspection: to check that something meets the standard before release.

  3. See the docs at kensa.sh/docs.