
Continuous evaluation beats periodic attestation.

Point-in-time attestation was built for deterministic systems. Production AI isn’t deterministic.

A chatbot handling 200,000+ interactions a week cannot be assured through quarterly reviews. The conversation has to move from audit-as-an-event to audit-as-a-stream.


Continuous AI evaluation measures every production interaction of an AI system against a defined baseline, producing timestamped, sourced, versioned evidence as a byproduct of operation. Periodic attestation samples the system on a fixed cadence (quarterly, annually) and verifies controls at each sample. Continuous catches drift between samples; periodic does not.
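The per-interaction evidence record described above can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the field names, the `evaluate` helper, the baseline version string, and the check names (`groundedness`, `policy_adherence`) are all hypothetical.

```python
from dataclasses import dataclass
import datetime as dt

@dataclass(frozen=True)
class EvidenceRecord:
    trace_id: str            # sourced: points back to the raw production trace
    captured_at: str         # timestamped: source time (ISO 8601, UTC)
    baseline_version: str    # versioned: the baseline this result was scored against
    scores: dict             # evaluation results for this interaction
    policy_outcome: str      # "pass" or "flag"

# Hypothetical baseline: named checks with pass thresholds.
BASELINE_VERSION = "2025-06-01"
THRESHOLDS = {"groundedness": 0.8, "policy_adherence": 0.9}

def evaluate(trace_id: str, scores: dict) -> EvidenceRecord:
    """Turn one production interaction's scores into an audit-stream record."""
    outcome = "pass" if all(scores[k] >= v for k, v in THRESHOLDS.items()) else "flag"
    return EvidenceRecord(
        trace_id=trace_id,
        captured_at=dt.datetime.now(dt.timezone.utc).isoformat(),
        baseline_version=BASELINE_VERSION,
        scores=scores,
        policy_outcome=outcome,
    )

record = evaluate("trace-8841", {"groundedness": 0.92, "policy_adherence": 0.95})
print(record.policy_outcome)  # pass
```

The point of the sketch: the record is produced as a byproduct of serving the interaction, and it carries its own timestamp, source pointer, and baseline version, so no evidence has to be assembled after the fact.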

ISO 42001 · NIST RMF · SR 11-7 · EU AI Act
The contrast

Two shapes of audit.

One was built for deterministic systems and 30 years of enterprise IT. The other is what production AI actually requires.

Periodic attestation

Audit as an event.

  • Cadence. Quarterly or annual sample.
  • Evidence. PDFs, screenshots, control tests.
  • Drift. Up to 90 days of lag before detection.
  • Cost. Pre-audit scramble plus remediation.
  • Best fit. Management-system review (ISO 42001 clause 9, SOC 2).
Continuous evaluation

Audit as a stream.

  • Cadence. Every production interaction.
  • Evidence. Timestamped, sourced, versioned trace.
  • Drift. Detected in real time.
  • Cost. Background process. No quarterly project.
  • Best fit. Production AI behavior, where every interaction matters.
Why periodic worked

The model that worked for 30 years of enterprise IT.

Enterprise audit evolved around deterministic systems. Define the control. Sample periodically. Verify it works. Document the verification. Carry on. SOC 2, ISO 27001, every regulatory examination of the last 30 years was designed around this shape.

It worked for a reason. In deterministic systems, if a control passed on Tuesday, the same control passes on Wednesday. Unless somebody changed the code. And if somebody changed the code, the change is in a git log, on a change-management ticket, reviewed by a second engineer.

Periodic attestation works when between-audit drift is rare, small, and visible.
Why AI breaks it

Three ways AI drifts between audits.

None of these ticks a change-management checkbox. Each can shift behavior by the Tuesday following the change.

01

The model changes underneath you.

A foundation model vendor releases a new version. Your AI system upgrades. Nothing in your change-management system moved. Behavior is different. Quarterly attestation was designed for code changes, not weight changes.

02

The prompt changes in ways that feel minor.

A product manager improves a system prompt. The new prompt scores better on the narrow benchmark they tested. In production, under a slightly different distribution of user inputs, it fails in a new way. The change was small. The effect is not.

03

The corpus refreshes.

A RAG system’s knowledge base gets a monthly update. A new document contradicts an old one. The agent starts confidently citing outdated information. No engineer touched the model or the prompt; only the retrieval surface changed.

The Tuesday test

The question a quarterly audit cannot answer.

What was this AI system doing at 3:47pm on Tuesday?

If the answer is “we sampled it last quarter and it looked fine,” the answer fails. If the answer is the trace of every interaction at that moment, evaluated against the baseline, with the policy outcome captured, that is a real answer.

There is no “kind of continuous” that works. Either every interaction is measured, or there is a gap. The gap is where the question lives.
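Answering the Tuesday question is then a lookup over the evidence stream, not a sampling exercise. A minimal sketch, assuming a stream of (timestamp, trace id, policy outcome) records and an illustrative one-minute window around the asked moment:

```python
import datetime as dt

# Hypothetical audit stream: one record per production interaction.
stream = [
    (dt.datetime(2025, 6, 3, 15, 46, 12), "trace-101", "pass"),
    (dt.datetime(2025, 6, 3, 15, 47, 3),  "trace-102", "flag"),
    (dt.datetime(2025, 6, 3, 15, 47, 41), "trace-103", "pass"),
    (dt.datetime(2025, 6, 3, 15, 52, 9),  "trace-104", "pass"),
]

def at_moment(stream, moment, window=dt.timedelta(minutes=1)):
    """Every evaluated interaction within `window` of the asked moment."""
    return [r for r in stream if abs(r[0] - moment) <= window]

# "What was this AI system doing at 3:47pm on Tuesday?"
answer = at_moment(stream, dt.datetime(2025, 6, 3, 15, 47))
for captured_at, trace_id, outcome in answer:
    print(captured_at.time(), trace_id, outcome)
```

If every interaction is in the stream, the query returns the full picture at that moment, including the one interaction that was flagged. If there is a sampling gap, the query returns the gap.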

Six properties

Evidence that survives the Tuesday test.

01

Timestamped.

Every artifact carries a source time, not “current as of.”

02

Sourced.

Every claim points back to the underlying production trace.

03

Versioned.

The baseline used for evaluation is captured with the result.

04

Fresh.

A freshness rule per artifact. Stale evidence self-flags.

05

Cross-referenced.

The same trace feeds the ISO, NIST, EU, AIUC views.

06

Exportable.

A one-click framework-specific audit pack, not a PDF.
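The freshness property (04) is the one that most often needs mechanism rather than policy. A minimal sketch of a per-artifact freshness rule under which stale evidence self-flags; the artifact types and the maximum ages here are illustrative, not prescribed:

```python
import datetime as dt

# Hypothetical freshness rules: maximum acceptable age per artifact type.
FRESHNESS = {
    "model_eval": dt.timedelta(days=1),
    "bias_report": dt.timedelta(days=30),
}

def is_stale(artifact_type: str, captured_at: dt.datetime, now: dt.datetime) -> bool:
    """Property 04: evidence self-flags once it exceeds its own freshness rule."""
    return now - captured_at > FRESHNESS[artifact_type]

now = dt.datetime(2025, 6, 10, tzinfo=dt.timezone.utc)
captured = dt.datetime(2025, 6, 1, tzinfo=dt.timezone.utc)

print(is_stale("model_eval", captured, now))   # True: nine days old against a one-day rule
print(is_stale("bias_report", captured, now))  # False: nine days old against a thirty-day rule
```

Because the rule lives with the artifact type rather than with the audit calendar, the same model evaluation that is acceptable in a monthly report is flagged as stale in a daily behavioral view, with no human judgment call at export time.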

Where periodic stays

What periodic audit still does well.

Periodic audit is not obsolete. It is the right tool for the management-system question: does the organization run a sensible process for identifying, classifying, and mitigating AI risk? ISO 42001 is a management-system standard. Its periodic certification cadence is appropriate.

What is obsolete is periodic behavioral audit. An auditor who only looks at an AI system four times a year cannot claim to know how it behaves.
The commercial case

Audit as an event vs. audit as a stream.

Audit as an event

Compliance is expensive. Teams scramble. Evidence is assembled after the fact. Gaps discovered during the audit become remediation projects that consume the next quarter.

Today, most organizations spend 40 to 60% of their AI governance budget on evidence assembly.

Audit as a stream

Compliance is a background process. Evidence accumulates as a byproduct of production. Auditors look at the stream; the organization spends time on AI, not on audit prep.

Continuous evaluation drops the assembly cost close to zero. Budget moves to AI work that produces value.


FAQ

Common questions on continuous evaluation.

What is continuous AI evaluation?

Continuous AI evaluation measures every production interaction of an AI system against a defined baseline, producing timestamped, sourced, versioned evidence as a byproduct of operation. The output is an audit stream rather than an audit event. Auditors look at the stream; the organization spends its time on AI, not on audit prep.

How does an AI system drift between audits?

Three mechanisms drift AI between audits: the foundation model upgrades underneath the system, a system prompt changes in ways that feel minor, or a RAG corpus refreshes with contradictory documents. None of these moves a change-management ticket. Each can shift behavior by the Tuesday following the change.

Is periodic audit obsolete?

No. Periodic audit remains the right tool for management-system questions (“does your organization run a sensible AI risk process?”). ISO 42001 is correctly periodic. What is obsolete is periodic behavioral audit. Sampling AI behavior four times a year cannot describe how it behaves.

Does ISO 42001 require continuous monitoring?

ISO 42001 requires risk monitoring; it does not specify cadence. The interpretation is converging on continuous evaluation for high-risk, high-volume AI systems and periodic management-system review for the surrounding governance program. Continuous evidence satisfies both. Periodic alone does not.

What is the Tuesday test?

The question a quarterly audit cannot answer: “What was this AI system doing at 3:47pm on Tuesday?” Continuous evaluation produces the trace of every interaction at that moment, evaluated against the baseline, with the policy outcome captured. Quarterly attestation produces “we sampled it last quarter and it looked fine.”

What makes evidence audit-grade?

Six properties: timestamped (source time, not “current as of”), sourced (every claim points to the underlying trace), versioned (the baseline is captured with the result), fresh (a per-artifact freshness rule; stale evidence self-flags), cross-referenced (the same trace feeds the ISO, NIST, EU, and AIUC views), and exportable (a one-click framework-specific audit pack).