Marketing-copy agent
- Metric: bias tolerance
- Threshold: 0.25
- Why: loose. Creative copy, low regulatory impact. Risk appetite is "don't embarrass us." QA already catches most issues downstream.
Frameworks tell you what to track. They don’t tell you what “good enough” looks like.
Closing that gap, turning a metric into a threshold that makes it actionable, is the practical heart of AI assurance.
A baseline in AI evaluation is the threshold that makes a metric actionable for a specific use case. It has four components: metric (what you measure), threshold (the number that triggers action), context (use case, population, risk appetite), owner (who authorizes threshold changes). Without all four, you have a number, not a baseline.
Four realistic agents. The same metric category. Four different numbers, because context (not the framework) sets the threshold.
- Marketing-copy agent: loose. Creative copy, low regulatory impact. Risk appetite is "don't embarrass us." QA already catches most issues downstream.
- Loan-underwriting agent: tight. Fair-lending regulatory exposure, protected classes in the population, consumer-protection laws, the reputational cost of a lawsuit. Same metric, different number.
- Safety-critical assistant: hallucination near-zero for safety-critical outputs, looser for wayfinding. The groundedness SLO is the primary control, not the generic "hallucination rate." Context outweighs the metric name.
- Multi-tenant B2B agent: depends on tenant-isolation guarantees, what the agent can see, and contractual commitments. A B2B tenant handling PII needs different settings than a B2C tenant on published docs.
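The "same metric, four numbers" point can be sketched as a threshold table. Only the marketing figure (0.25) comes from the text above; the other three numbers and the use-case keys are hypothetical illustrations of how context tightens or loosens the same metric:

```python
# Bias-score thresholds for the SAME metric across four contexts.
# Only 0.25 (marketing) appears in the article; the rest are assumed values.
BIAS_THRESHOLDS = {
    "marketing_copy":    0.25,  # loose: creative copy, low regulatory impact
    "loan_underwriting": 0.02,  # tight: fair-lending exposure, protected classes
    "safety_critical":   0.05,  # near-zero; groundedness SLO is the primary control
    "b2b_multi_tenant":  0.10,  # depends on tenant isolation and contracts
}

def requires_action(use_case: str, observed_bias: float) -> bool:
    """Same number, different verdict, because context sets the threshold."""
    return observed_bias > BIAS_THRESHOLDS[use_case]

# An observed bias of 0.12 is fine for marketing copy
# but far past the lending threshold.
print(requires_action("marketing_copy", 0.12))     # False
print(requires_action("loan_underwriting", 0.12))  # True
```

A uniform threshold across these four rows would be, as the next section argues, either dangerous or useless depending on which row you read it from.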
NIST AI RMF, ISO 42001, EU AI Act, Singapore Agentic AI Governance, AIUC-1. Each names categories of risk. Each names what to measure. Each refuses, correctly, to name the threshold.
A framework that legislated “bias < 0.10 for all agents” would be too loose for loan underwriting and too strict for marketing copy. The same metric, applied uniformly, is either dangerous or useless depending on context.
The gap is intentional. What frameworks can't do, the organization must: define what "good enough" looks like for its specific deployments, populations, and risk appetite.
Strip any one of these and you don't have a baseline. You have a number.
- Metric: what you're measuring.
- Threshold: the number that triggers action.
- Context: use case, population, risk appetite.
- Owner: the named human authorized to change the threshold.
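The four components can be made concrete as a small record type. This is a minimal sketch, not a standard schema: the field names, the owner, and the `breached` convention (lower is better for a bias score) are all illustrative. The 0.25 threshold mirrors the marketing-copy example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Baseline:
    """A baseline = metric + threshold + context + owner.
    Remove any field and what remains is a number, not a baseline."""
    metric: str        # what you measure, e.g. "bias_score"
    threshold: float   # the number that triggers action
    context: str       # use case, population, risk appetite
    owner: str         # named human authorized to change the threshold

    def breached(self, observed: float) -> bool:
        # For a lower-is-better metric: act when observed exceeds threshold.
        return observed > self.threshold

marketing = Baseline(
    metric="bias_score",
    threshold=0.25,
    context="Marketing copy; low regulatory impact; 'don't embarrass us'",
    owner="Head of Brand",  # hypothetical owner for illustration
)
print(marketing.breached(0.31))  # True: triggers action
```

Note that the threshold alone carries no decision; it is the combination with context and owner that turns the number into a control.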
Most enterprises skip the owner. That is why most “baselines” in market today are aspirational PDFs instead of operational controls.
1. Name the use case tightly. "Customer-support agent for tier-1 troubleshooting on product X." Not "AI chatbot."
2. Inventory the risks. Per AIUC-1's six categories, NIST AI RMF MEASURE, or your internal framework. The risk categories drive the metric set.
3. Pick metrics narrowly. Groundedness, bias against protected classes, tool-call authorization, data-exposure incidents. Fewer is better.
4. Set the threshold with context. Threshold, use-case description, population, risk appetite, business-impact model. Document the reasoning.
5. Assign the owner. A named human, authorized to change the threshold based on evidence, in a versioned document with an accessible change log.
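Steps 4 and 5 produce an artifact like the following. The schema, values, and email are illustrative assumptions, not a standard format; the point is that the document is versioned, owned, and carries its own change log:

```python
# Hypothetical output of the five-step method for one deployment.
baseline_doc = {
    "version": 3,
    "use_case": "Customer-support agent for tier-1 troubleshooting on product X",
    "metrics": {
        "groundedness": {"threshold": 0.95, "direction": "min"},
        "bias_score":   {"threshold": 0.05, "direction": "max"},
    },
    "context": {
        "population": "tier-1 customers",
        "risk_appetite": "moderate",
        "business_impact": "ticket deflection vs. escalation cost",
    },
    # Step 5: a named human, not a team alias.
    "owner": "jane.doe@example.com",
    "change_log": [
        {"version": 2,
         "change": "tightened bias threshold 0.10 -> 0.05",
         "evidence": "red-team findings",
         "approved_by": "jane.doe@example.com"},
    ],
}

def is_baseline(doc: dict) -> bool:
    """Without metrics, context, AND an owner, it's a number, not a baseline."""
    return all(doc.get(k) for k in ("metrics", "context", "owner"))

print(is_baseline(baseline_doc))  # True
```

Dropping the `owner` field makes `is_baseline` return False, which is exactly the failure mode described next: an unowned threshold degrades into an aspirational PDF.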
Step 5 is where most baseline programs die. Without an owner, step 4 turns into a PDF; the PDF turns into a reference nobody updates; the baseline becomes an artifact, not a control.
The baseline is the threshold. Continuous measurement is what tells you whether the production system is on the right side of it today, this hour, this interaction. Without continuous measurement, a baseline is intent without enforcement.
AI systems are non-deterministic. A system that met its baseline on Monday can drift past it by Thursday. A prompt update, a model refresh, a new corpus, a change in user behavior. Quarterly attestation against a baseline means up to 90 days of undetected drift.
Continuous measurement without a baseline is just noise. “Groundedness 87.4%” with no way to decide whether that’s good or bad. Continuous + baseline together are the full control. Either alone falls down.
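The "continuous + baseline" pairing can be sketched as a rolling check: score every production trace, compare the window average to the threshold, and emit a decision rather than a bare number. The scoring function here is a stand-in; a real deployment would call a groundedness evaluator, and all names and numbers are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Verdict:
    observed: float
    threshold: float
    breached: bool

def evaluate_stream(traces: Iterable[str],
                    score: Callable[[str], float],
                    threshold: float,
                    window: int = 100) -> Verdict:
    """Rolling check of the last `window` traces against a groundedness floor."""
    recent = [score(t) for t in list(traces)[-window:]]
    observed = sum(recent) / len(recent)
    # "Groundedness 87.4%" alone is noise; the baseline turns it into a decision.
    return Verdict(observed, threshold, breached=observed < threshold)

# Stand-in scorer returning canned values for four hypothetical traces.
fake_scores = iter([0.91, 0.88, 0.85, 0.84])
verdict = evaluate_stream(
    ["t1", "t2", "t3", "t4"],
    score=lambda t: next(fake_scores),
    threshold=0.90,
)
print(verdict.breached)  # True: the 0.87 average is below the 0.90 floor
```

Run per interaction or per window, this is what catches the Monday-to-Thursday drift before a quarterly attestation would.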
A baseline is the threshold that makes the metric actionable. Continuous measurement is what keeps the threshold honest. Both, or neither.
What is a baseline in AI evaluation?
A baseline is the threshold that makes a metric actionable for a specific use case. It has four components: metric, threshold, context, owner. Without all four, you have a number, not a baseline. NIST AI RMF and ISO 42001 name the metric categories; baselines name the thresholds.

Who owns a baseline?
A named human in product, security, or compliance, depending on the use case. The owner authorizes threshold changes based on evidence and is referenced by the policy-as-code layer that enforces the baseline. Step 5 of the five-step method is where most baseline programs die.

How often should a baseline be reviewed?
Continuously evaluated, periodically reviewed. Production traces flow against the threshold on every interaction. Formal review of the threshold itself runs quarterly, or whenever the use case, population, or risk appetite changes: a model upgrade, a prompt revision, a new corpus, a regulatory shift.

What is the difference between a threshold and a baseline?
A threshold is a number ("bias < 0.05"). A baseline is the threshold plus context plus owner: the metric, the threshold, the use case, population, and risk appetite the threshold is grounded in, and the named human authorized to change it. Baselines are operational; thresholds in isolation are aspirational.

Do NIST AI RMF and ISO 42001 set thresholds?
No. Both specify what to measure (MAP and MEASURE in NIST; clauses 6 to 9 in ISO 42001). Neither specifies thresholds, correctly, because thresholds are use-case-specific. Baselines fill the gap between framework category and operational control.

Where should you start?
Run the five-step method against your highest-risk AI deployment first: name the use case tightly, inventory the risks, pick metrics narrowly, set the threshold with context, assign the owner. One baseline per quarter is faster than waiting for a framework certification.