Marketing-copy agent
- Metric: bias tolerance
- Threshold: 0.25
- Why: loose. Creative copy, low regulatory impact. Risk appetite is "don't embarrass us." QA already catches most issues downstream.
Frameworks tell you what to track. They don’t tell you what “good enough” looks like.
Closing that gap, turning a metric into a threshold that makes it actionable, is the practical heart of AI assurance.
A baseline in AI evaluation is the threshold that makes a metric actionable for a specific use case. It has four components: metric (what you measure), threshold (the number that triggers action), context (use case, population, risk appetite), owner (who authorizes threshold changes). Without all four, you have a number, not a baseline.
Four realistic agents. The same metric category. Four different numbers, because context (not the framework) sets the threshold.
- Marketing-copy agent: loose. Creative copy, low regulatory impact. Risk appetite is "don't embarrass us." QA already catches most issues downstream.
- Loan-underwriting agent: tight. Fair-lending regulatory exposure, protected classes in the population, consumer-protection laws, the reputational cost of a lawsuit. Same metric, different number.
- Safety-critical assistant: hallucination near-zero for safety-critical outputs, looser for wayfinding. The groundedness SLO is the primary control, not the generic "hallucination rate." Context outweighs the metric name.
- Multi-tenant B2B agent: depends on tenant-isolation guarantees, what the agent can see, and contractual commitments. A B2B tenant handling PII needs different settings than a B2C tenant on published docs.
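The "same metric, four numbers" point can be sketched as a threshold table. Only the marketing figure (0.25) comes from the text above; the other three numbers and the use-case keys are hypothetical illustrations of how context tightens or loosens the same metric:

```python
# Bias-score thresholds for the SAME metric across four contexts.
# Only 0.25 (marketing) appears in the article; the rest are assumed values.
BIAS_THRESHOLDS = {
    "marketing_copy":    0.25,  # loose: creative copy, low regulatory impact
    "loan_underwriting": 0.02,  # tight: fair-lending exposure, protected classes
    "safety_critical":   0.05,  # near-zero; groundedness SLO is the primary control
    "b2b_multi_tenant":  0.10,  # depends on tenant isolation and contracts
}

def requires_action(use_case: str, observed_bias: float) -> bool:
    """Same number, different verdict, because context sets the threshold."""
    return observed_bias > BIAS_THRESHOLDS[use_case]

# An observed bias of 0.12 is fine for marketing copy
# but far past the lending threshold.
print(requires_action("marketing_copy", 0.12))     # False
print(requires_action("loan_underwriting", 0.12))  # True
```

A uniform threshold across these four rows would be, as the next section argues, either dangerous or useless depending on which row you read it from.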
NIST AI RMF, ISO 42001, EU AI Act, Singapore Agentic AI Governance, AIUC-1. Each names categories of risk. Each names what to measure. Each refuses, correctly, to name the threshold.
A framework that legislated “bias < 0.10 for all agents” would be too loose for loan underwriting and too strict for marketing copy. The same metric, applied uniformly, is either dangerous or useless depending on context.
The gap is intentional. What frameworks can't do, the organization must: define what "good enough" looks like for its specific deployments, populations, and risk appetite.
Strip any one of these and you don't have a baseline. You have a number.
- Metric: what you're measuring.
- Threshold: the number that triggers action.
- Context: use case, population, risk appetite.
- Owner: the named human authorized to change the threshold.
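The four components can be made concrete as a small record type. This is a minimal sketch, not a standard schema: the field names, the owner, and the `breached` convention (lower is better for a bias score) are all illustrative. The 0.25 threshold mirrors the marketing-copy example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Baseline:
    """A baseline = metric + threshold + context + owner.
    Remove any field and what remains is a number, not a baseline."""
    metric: str        # what you measure, e.g. "bias_score"
    threshold: float   # the number that triggers action
    context: str       # use case, population, risk appetite
    owner: str         # named human authorized to change the threshold

    def breached(self, observed: float) -> bool:
        # For a lower-is-better metric: act when observed exceeds threshold.
        return observed > self.threshold

marketing = Baseline(
    metric="bias_score",
    threshold=0.25,
    context="Marketing copy; low regulatory impact; 'don't embarrass us'",
    owner="Head of Brand",  # hypothetical owner for illustration
)
print(marketing.breached(0.31))  # True: triggers action
```

Note that the threshold alone carries no decision; it is the combination with context and owner that turns the number into a control.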
Most enterprises skip the owner. That is why most “baselines” in market today are aspirational PDFs instead of operational controls.
1. Name the use case tightly. "Customer-support agent for tier-1 troubleshooting on product X." Not "AI chatbot."
2. Inventory the risks. Per AIUC-1's six categories, NIST AI RMF MEASURE, or your internal framework. The risk categories drive the metric set.
3. Pick metrics narrowly. Groundedness, bias against protected classes, tool-call authorization, data-exposure incidents. Fewer is better.
4. Set the threshold with context. Threshold, use-case description, population, risk appetite, business-impact model. Document the reasoning.
5. Assign the owner. A named human, authorized to change the threshold based on evidence, in a versioned document with an accessible change log.
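Steps 4 and 5 produce an artifact like the following. The schema, values, and email are illustrative assumptions, not a standard format; the point is that the document is versioned, owned, and carries its own change log:

```python
# Hypothetical output of the five-step method for one deployment.
baseline_doc = {
    "version": 3,
    "use_case": "Customer-support agent for tier-1 troubleshooting on product X",
    "metrics": {
        "groundedness": {"threshold": 0.95, "direction": "min"},
        "bias_score":   {"threshold": 0.05, "direction": "max"},
    },
    "context": {
        "population": "tier-1 customers",
        "risk_appetite": "moderate",
        "business_impact": "ticket deflection vs. escalation cost",
    },
    # Step 5: a named human, not a team alias.
    "owner": "jane.doe@example.com",
    "change_log": [
        {"version": 2,
         "change": "tightened bias threshold 0.10 -> 0.05",
         "evidence": "red-team findings",
         "approved_by": "jane.doe@example.com"},
    ],
}

def is_baseline(doc: dict) -> bool:
    """Without metrics, context, AND an owner, it's a number, not a baseline."""
    return all(doc.get(k) for k in ("metrics", "context", "owner"))

print(is_baseline(baseline_doc))  # True
```

Dropping the `owner` field makes `is_baseline` return False, which is exactly the failure mode described next: an unowned threshold degrades into an aspirational PDF.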
Step 5 is where most baseline programs die. Without an owner, step 4 turns into a PDF; the PDF turns into a reference nobody updates; the baseline becomes an artifact, not a control.
The baseline is the threshold. Continuous measurement is what tells you whether the production system is on the right side of it today, this hour, this interaction. Without continuous measurement, a baseline is intent without enforcement.
AI systems are non-deterministic. A system that met its baseline on Monday can drift past it by Thursday. A prompt update, a model refresh, a new corpus, a change in user behavior. Quarterly attestation against a baseline means up to 90 days of undetected drift.
Continuous measurement without a baseline is just noise. “Groundedness 87.4%” with no way to decide whether that’s good or bad. Continuous + baseline together are the full control. Either alone falls down.
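The "continuous + baseline" pairing can be sketched as a rolling check: score every production trace, compare the window average to the threshold, and emit a decision rather than a bare number. The scoring function here is a stand-in; a real deployment would call a groundedness evaluator, and all names and numbers are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Verdict:
    observed: float
    threshold: float
    breached: bool

def evaluate_stream(traces: Iterable[str],
                    score: Callable[[str], float],
                    threshold: float,
                    window: int = 100) -> Verdict:
    """Rolling check of the last `window` traces against a groundedness floor."""
    recent = [score(t) for t in list(traces)[-window:]]
    observed = sum(recent) / len(recent)
    # "Groundedness 87.4%" alone is noise; the baseline turns it into a decision.
    return Verdict(observed, threshold, breached=observed < threshold)

# Stand-in scorer returning canned values for four hypothetical traces.
fake_scores = iter([0.91, 0.88, 0.85, 0.84])
verdict = evaluate_stream(
    ["t1", "t2", "t3", "t4"],
    score=lambda t: next(fake_scores),
    threshold=0.90,
)
print(verdict.breached)  # True: the 0.87 average is below the 0.90 floor
```

Run per interaction or per window, this is what catches the Monday-to-Thursday drift before a quarterly attestation would.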
A baseline is the threshold that makes the metric actionable. Continuous measurement is what keeps the threshold honest. Both, or neither.
What is a baseline in AI evaluation?
A baseline is the threshold that makes a metric actionable for a specific use case. It has four components: metric, threshold, context, owner. Without all four, you have a number, not a baseline. NIST AI RMF and ISO 42001 name the metric categories; baselines name the thresholds.

Who owns a baseline?
A named human in product, security, or compliance, depending on the use case. The owner authorizes threshold changes based on evidence and is referenced by the policy-as-code layer that enforces the baseline. Step 5 of the five-step method is where most baseline programs die.

How often should a baseline be reviewed?
Continuously evaluated, periodically reviewed. Production traces flow against the threshold on every interaction. Formal review of the threshold itself runs quarterly, or whenever the use case, population, or risk appetite changes: a model upgrade, a prompt revision, a new corpus, a regulatory shift.

What is the difference between a threshold and a baseline?
A threshold is a number ("bias < 0.05"). A baseline is the threshold plus context plus owner: the metric, the threshold, the use case, population, and risk appetite the threshold is grounded in, and the named human authorized to change it. Baselines are operational; thresholds in isolation are aspirational.

Do NIST AI RMF and ISO 42001 set thresholds?
No. Both specify what to measure (MAP and MEASURE in NIST; clauses 6 to 9 in ISO 42001). Neither specifies thresholds, correctly, because thresholds are use-case-specific. Baselines fill the gap between framework category and operational control.

Where should you start?
Run the five-step method against your highest-risk AI deployment first: name the use case tightly, inventory the risks, pick metrics narrowly, set the threshold with context, assign the owner. One baseline per quarter is faster than waiting for a framework certification.