The Eval Maturity Model for AI teams.
An 8-stage diagnostic for teams that need to know whether their AI evaluation layer is real, runnable, and ready to govern production behavior.
The Eval Maturity Model is an 8-stage framework that places an AI team by the artifact it can actually show: a spreadsheet, a batch runner, a golden set, checkpoint comparators, CI gates, feedback capture, prompt optimization, or tenant-specific launch gates.
Most teams overestimate their eval maturity.
The problem is usually not dishonesty. Teams build the pieces they can see: a runner, a Slack notification, a few fixtures, a dashboard. The maturity model forces the harder question: which artifact would still exist if the person who remembers the failure left the room?
The stage is determined by the artifact, not the aspiration.
The tell for each stage points at a file, directory, report, or workflow you can inspect.
The next move is deliberately one stage away, because eval infrastructure breaks when teams jump too far.
The road runs from memory to launch gates.
Each stage answers a different failure mode. The early stages preserve failures. The middle stages compare behavior against expectation. The later stages turn live feedback and tenant coverage into release discipline.
Stage 0 - No eval
Manual spot checks. The tell is that nobody can explain how a prompt edit would be caught if it broke an unrelated query.
Stage 1 - Issue spreadsheet
A durable list of failures exists, but it is not connected to a runner, gate, or regression process.
Stage 2 - Batch query runner
The team can run many queries at once, but cannot say which run is the truth or which output should be treated as correct.
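A minimal sketch of what a Stage 2 runner often amounts to, in Python. The run_query call, the queries.txt file, and the runs/ layout are illustrative placeholders, not part of the model; the point is that outputs get collected, but nothing marks any of them as correct.

```python
# Minimal Stage 2 sketch: run a batch of queries and keep the outputs, nothing more.
# run_query(), queries.txt, and the runs/ layout are illustrative placeholders.
import json
import time
from pathlib import Path

def run_query(query: str) -> dict:
    return {"answer": f"stub answer for: {query}"}  # replace with the real system call

def run_batch(query_file: str = "queries.txt", out_dir: str = "runs") -> Path:
    run_path = Path(out_dir) / time.strftime("%Y%m%d-%H%M%S")
    run_path.mkdir(parents=True, exist_ok=True)
    queries = [q for q in Path(query_file).read_text().splitlines() if q.strip()]
    results = [{"query": q, "output": run_query(q)} for q in queries]
    (run_path / "results.json").write_text(json.dumps(results, indent=2))
    # The Stage 2 gap: nothing here records which run, or which output, is correct.
    return run_path
```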
Stage 3 - Partial golden set
Expected answers exist for a narrow slice of behavior. Coverage in the twenties feels better than zero but is still too thin to govern.
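A Stage 3 golden set entry can be as small as a record that pairs a query with the behavior someone has signed off on. The sketch below is one hypothetical shape; every field name is illustrative, not prescribed by the model.

```python
# Hypothetical shape of a Stage 3 golden set entry; all field names are illustrative.
GOLDEN_SET = [
    {
        "id": "refunds-001",
        "query": "How many refunds were issued last month?",
        "expected": {
            "route": "analytics",          # which handler should take the query
            "sql_contains": "refunds",     # stable fragment of the generated SQL
            "answer_contains": "refunds",  # stable fragment of the final answer
        },
    },
    # A few dozen entries like this cover one slice of behavior, not the whole system.
]
```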
Stage 4 - Checkpoint comparators
The comparator tests structural seams: route, retrieval, payload shape, SQL, execution, and presentation. Volatile fields are deliberately ignored.
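In practice this is often one comparison per seam plus an explicit list of fields the comparator refuses to judge. The sketch below assumes a record-per-seam layout and a VOLATILE_FIELDS set; both are assumptions for illustration.

```python
# Sketch of a Stage 4 comparator: one comparison per structural seam, with volatile
# fields skipped on purpose. The field layout and VOLATILE_FIELDS set are assumptions.
VOLATILE_FIELDS = {"latency_ms", "trace_id", "timestamp", "token_count"}

def compare_checkpoint(expected: dict, actual: dict) -> list[str]:
    failures = []
    for field, want in expected.items():
        if field in VOLATILE_FIELDS:
            continue  # deliberately ignored: these change from run to run
        got = actual.get(field)
        if got != want:
            failures.append(f"{field}: expected {want!r}, got {got!r}")
    return failures

def compare_case(expected: dict, actual: dict) -> dict:
    # One result per seam, so a failure names the seam that broke.
    seams = ("route", "retrieval", "payload", "sql", "execution", "presentation")
    return {s: compare_checkpoint(expected.get(s, {}), actual.get(s, {})) for s in seams}
```

Keeping the volatile list explicit is the design choice that separates Stage 4 from string-diffing whole responses.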
Stage 5 - CI-runnable framework
Eval results have a shareable report and a merge can be blocked when the system regresses below a threshold.
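The gate itself can be a short script that CI runs after the eval suite: it reads the report and exits non-zero when the pass rate drops below a floor. The report path and the 90% threshold below are assumptions, not recommendations.

```python
# Sketch of a Stage 5 gate: CI runs the eval suite, writes a report, then runs this
# script; a non-zero exit blocks the merge. Report path and threshold are assumptions.
import json
import sys
from pathlib import Path

THRESHOLD = 0.90  # illustrative floor; each team sets its own

def main(report_path: str = "eval-report.json") -> int:
    cases = json.loads(Path(report_path).read_text())["cases"]
    passed = sum(1 for case in cases if not case["failures"])
    rate = passed / len(cases) if cases else 0.0
    print(f"eval pass rate: {rate:.1%} (threshold {THRESHOLD:.0%})")
    return 0 if rate >= THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```

Exiting non-zero is the whole mechanism: most CI systems block the merge on any failing step, so no extra integration is needed.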
Stage 6 - User feedback capture
Downvotes and user-reported failures are added back into the golden set so the corpus is authored by reality, not only by engineers.
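One hedged sketch of the capture step: a downvote becomes a draft golden case, and a human later fills in the expected behavior. The feedback record shape and the candidates directory are assumptions.

```python
# Sketch of Stage 6 capture: a downvote becomes a draft golden case awaiting review.
# The feedback record shape and the candidates directory are assumptions.
import json
from pathlib import Path

def downvote_to_candidate(feedback: dict, golden_dir: str = "golden/candidates") -> Path:
    case = {
        "id": f"feedback-{feedback['feedback_id']}",
        "query": feedback["query"],
        "expected": None,  # intentionally empty: a human still authors the expectation
        "source": "user_downvote",
    }
    path = Path(golden_dir) / f"{case['id']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(case, indent=2))
    return path
```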
Stage 7/8 - Optimization and launch gates
Prompt optimization and tenant-specific launch gates only work after the corpus, comparator, and per-tenant coverage rules are stable.
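A launch gate at this stage typically reads the same eval report but slices it per tenant: a tenant ships only when its own coverage rule is met. The tenant names, thresholds, and rule shape below are illustrative only.

```python
# Sketch of a tenant launch gate: a tenant ships only when its own slice of the golden
# set is both large enough and passing. Tenant names and thresholds are illustrative.
TENANT_RULES = {
    "acme":   {"min_cases": 40, "min_pass_rate": 0.95},
    "globex": {"min_cases": 25, "min_pass_rate": 0.90},
}

def tenant_ready(tenant: str, report: dict) -> bool:
    rule = TENANT_RULES[tenant]
    cases = [c for c in report["cases"] if c.get("tenant") == tenant]
    if len(cases) < rule["min_cases"]:
        return False  # not enough tenant-specific coverage to judge readiness
    passed = sum(1 for c in cases if not c["failures"])
    return passed / len(cases) >= rule["min_pass_rate"]
```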
Let the tell beat the gut.
Start by picking the stage you think you are at. Then read the tell for that stage and open the artifact it references: the golden set directory, comparator file, CI config, prompt file, or latest eval report.
If the tell and your instinct disagree, trust the tell.
Write one next action, one owner, and one revisit date.
Repeat the exercise in 30 days. Stages are walked, not jumped.
The expensive mistake is the skip-and-revert.
The common failure is wiring Stage 7 ideas onto a Stage 3 or Stage 4 foundation: feedback gates before comparators are stable, optimizer loops before the golden set is representative, launch gates before tenant coverage exists. The result is noisy CI, fast reverts, and a team that stops trusting the eval layer.
Do not add optimization before the corpus is trustworthy.
Do not block merges before the comparator knows which fields are volatile.
Do not declare multi-tenant readiness until each tenant has its own coverage rule.
Practical questions, answered plainly.
Who is it for? AI product, platform, and engineering teams that already have AI behavior in or close to production and need a concrete way to sequence eval work.
How do you find your stage? Pick your likely stage, read the tell, and open the artifact it names. If you cannot show the artifact, you are probably one or two stages lower.
Why does this matter for governance? Because governance depends on a measurement layer. A policy cannot govern behavior that the team cannot compare against a known baseline.