

What broke when the AI met the workflow.

Why AI workflows fail when buyer risk tolerance, downstream authority, and measurement are misaligned, plus a two-question diagnostic for finance leaders.

TrustEvals · May 20, 2026 · 7 min read



A two-question diagnostic for finance leaders deciding whether an AI workflow is ready to deploy, defend, and measure.

AI workflows break in finance when the downstream consumer grants more authority to the output than the system can prove. The issue is rarely generic AI risk. It is a specific mismatch between the workflow, the buyer's risk tolerance, and the evidence available to defend the output.

An operator I respect pushed back on a recent note about AI adoption inside finance firms with a fair ask: give me a tangible example. Specifically, show the moment where a workflow's use case runs into the buyer's risk tolerance and breaks.

This is the long version, with a two-question diagnostic a CIO, CDO, or Head of AI can put on the wall before approving an AI workflow that touches a regulated process.

Two workflows. One pattern.

The first workflow lived inside an FP&A copilot.

The product was a copilot for finance teams. Sit a CFO down with their finance data and let them ask questions in natural language: "What is our burn if we add twelve heads next quarter, and where does the spend come from?" Today that question takes a fourteen-tab workbook and a half-day. The promise was a sentence.

The use case was free-text question and answer across the full finance data layer. Scenario modeling included. The buyer was the CFO at customer accounts.

That buyer's risk tolerance was 95% numeric accuracy. That number is not an aspiration. It is a floor. The answer to a CFO's question lands in a board deck or a banker call within hours. It has to reconcile to the GL. If it does not, the next conversation is about ripping the tool out.

The misalignment surfaced the first time a customer's vendor table had missing cells. The LLM answered the question anyway. It did not flag the gap. The CFO did not see a low-confidence badge. They saw a number. In that moment, the workflow lost the room.

The fix was structural. Any lookup or aggregate ran through a deterministic SQL path. Any forecast or scenario ran through a probabilistic path, and every probabilistic output carried a calibrated confidence plus an explicit "predicted, not reconciled" tag. Underneath, the team built a held-out evals harness so accuracy on the hard tasks became a number, not a vibe.
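A minimal sketch of that split, not the team's actual implementation. The keyword list, the `Answer` type, and the function names are all hypothetical, and the confidence is simply passed in (calibrating it is out of scope here). What it illustrates is the shape of the fix: lookups refuse to answer over missing cells, and forecasts always carry a tag and a confidence.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword hints; a real router would likely use a classifier.
FORECAST_HINTS = ("if we", "forecast", "scenario", "next quarter")


@dataclass
class Answer:
    value: Optional[float]
    path: str                          # "deterministic" or "probabilistic"
    tag: str                           # label shown next to the number
    confidence: Optional[float] = None


def route(question: str) -> str:
    """Send forecasts and scenarios to the probabilistic path, the rest to SQL."""
    q = question.lower()
    return "probabilistic" if any(h in q for h in FORECAST_HINTS) else "deterministic"


def answer_lookup(rows: list[dict], column: str) -> Answer:
    """Deterministic aggregate that flags missing cells instead of summing past them."""
    missing = [r for r in rows if r.get(column) is None]
    if missing:
        return Answer(None, "deterministic",
                      f"incomplete data: {len(missing)} missing cell(s)")
    return Answer(sum(r[column] for r in rows), "deterministic", "lookup")


def answer_forecast(point_estimate: float, confidence: float) -> Answer:
    """Probabilistic output: always tagged, always carries a confidence."""
    return Answer(point_estimate, "probabilistic",
                  "predicted, not reconciled", confidence)
```

In the missing-cell case from the vendor table, `answer_lookup` returns no number at all, so the CFO sees the gap rather than a figure that cannot reconcile.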

The 95% bar did not change. The system caught up to it.

The second workflow lived inside a compliance chatbot.

The second engagement was a knowledge-base chatbot inside a GRC platform serving regulated firms. The chatbot answered compliance questions across SOC 2, ISO 27001, HIPAA, and TPRM. The outputs would feed into evidence packs that go to auditors.

The team's instinct was to re-architect. Bring in an orchestrator agent. Build a semantic layer. Swap in a deterministic classifier. The plan was reasonable on its own terms. It was also a six-month plan that would not produce a single piece of customer-facing evidence in the meantime.

We made the opposite call. Freeze the architecture for three weeks. Put four things around the existing system before touching any of it:

A 90-fixture evals harness that produced a pass rate per document type, ran nightly, and published the result to a channel the team actually read.
Observability with prompt, retrieval, model output, and latency captured per request, all inside the customer's own cloud account.
An in-product feedback loop that turned every production thumbs-down into a candidate fixture for the next eval run.
PII handling at ingest with type labels preserved instead of values blanked, plus a guardrail at inference.
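The first and third items above can be sketched in a few lines. This is an illustration under assumptions: the `FixtureResult` shape, the field names, and the "candidate" status are hypothetical, and the real harness almost certainly carried more detail per fixture.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FixtureResult:
    fixture_id: str
    doc_type: str   # e.g. "SOC 2", "HIPAA"
    passed: bool


def pass_rate_by_doc_type(results: list[FixtureResult]) -> dict[str, float]:
    """Aggregate one nightly run into a pass rate per document type."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.doc_type] += 1
        passes[r.doc_type] += int(r.passed)
    return {doc_type: passes[doc_type] / totals[doc_type] for doc_type in totals}


def feedback_to_candidate_fixture(question: str, answer: str, doc_type: str) -> dict:
    """Turn a production thumbs-down into a fixture awaiting human review."""
    return {
        "doc_type": doc_type,
        "question": question,
        "rejected_answer": answer,
        "expected_answer": None,   # a reviewer fills this in before it joins the harness
        "status": "candidate",
    }
```

Keeping the thumbs-down as a candidate rather than adding it straight to the harness is a design choice in this sketch: a fixture without a reviewed expected answer cannot pass or fail meaningfully.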

The label-preserve choice is worth a sentence. Blanking values fails compliance evals because the classifier loses the structure it needs to reason. Labelling preserves the privacy posture without breaking the product. A token like <PERSON_NAME PII=true> tells the classifier there is a person. It does not tell the model which person.
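To make the difference concrete, here is a small sketch that contrasts the two approaches. The regex patterns are illustrative assumptions only and cover just emails and US-style SSNs; detecting person names, as in the token above, would usually need an NER model rather than a regex.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def label_pii(text: str) -> str:
    """Replace each PII value with a typed token, keeping sentence structure."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label} PII=true>", text)
    return text


def blank_pii(text: str) -> str:
    """Blank the values outright; the surrounding structure is lost."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub("", text)
    return text


sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(label_pii(sample))  # Contact <EMAIL PII=true>, SSN <SSN PII=true>.
print(blank_pii(sample))  # Contact , SSN .
```

The labelled version still says an email and an SSN were present, which is the structure a downstream classifier needs. The blanked version does not.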

Three weeks later, the artifact a customer's auditor could actually accept was on the page. A pass rate per document type. Per-customer trace logs. A redaction audit. The architecture work still got done in the months after, but every architectural choice now had evals to point at. Chunking strategy, model choice, deterministic versus probabilistic split: each one fell out of where the harness showed failure.

The deployment row every AI workflow needs.

Before approving an AI workflow in a regulated process, fill one row. If the team cannot name the consumer, authority threshold, and proof number, the workflow is not ready to ship.

The consumer sets the threshold. The eval proves whether the workflow clears it.

The CFO, auditor, banker, or regulator grants authority to the output.

The team needs a measurement a third party can audit before deployment.
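One way to make the row hard to skip is to encode it as a record that reports what is still unnamed. This is a sketch under assumptions: the `DeploymentRow` type and its field names are hypothetical, chosen to mirror the three items named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeploymentRow:
    workflow: str
    consumer: Optional[str] = None             # who grants authority to the output
    authority_threshold: Optional[str] = None  # what that consumer assumes about it
    proof_number: Optional[float] = None       # an auditable measurement, e.g. a pass rate

    def missing_fields(self) -> list[str]:
        """Names of the row's cells the team has not filled in yet."""
        required = ("consumer", "authority_threshold", "proof_number")
        return [name for name in required if getattr(self, name) is None]

    def ready_to_ship(self) -> bool:
        return not self.missing_fields()


row = DeploymentRow(
    workflow="FP&A copilot",
    consumer="CFO",
    authority_threshold="number reconciles to the GL",
)
print(row.ready_to_ship())   # False
print(row.missing_fields())  # ['proof_number']
```

The check here is only that each cell is named. Whether the proof number actually clears the threshold is a separate comparison.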

A two-question diagnostic.

Both engagements pivot on the same realization. The misalignment was rarely "AI is risky" in the abstract. It was something more specific, and more answerable.

Before approving any AI workflow that touches a regulated process, the operator should be able to answer two questions cleanly.

The first: who is the downstream consumer of this output, and what authority do they grant it? Name the person and the assumption they hold. The CFO assumes the number reconciles to the GL. The auditor assumes the control reference is real. The banker assumes the forecast is defensible. The regulator assumes the policy response reflects an actual control. Each of these is an authority threshold the workflow has to clear.

The second: can you produce a number, per category, that proves the system clears the threshold? Not "the demo worked." A pass rate. A held-out accuracy figure. A per-tenant trace. Something a third party can audit.

If you cannot answer the first question, the team is not yet ready to deploy. If you cannot answer the second, the team is not yet ready to defend the deployment.
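The two questions can be written down as a small gate. This is a hedged sketch: the `diagnose` function, its arguments, and the idea of expressing the threshold as a single numeric floor are assumptions for illustration, and the two "clears" and "below threshold" outcomes go one step past the two failure cases named above.

```python
from typing import Optional


def diagnose(
    consumer: Optional[str],
    threshold: Optional[float],
    pass_rates: Optional[dict[str, float]],
) -> str:
    """Apply the two questions in order and report the first one that fails."""
    # Question one: is the consumer named, with a threshold attached?
    if not consumer or threshold is None:
        return "not ready to deploy"
    # Question two: is there a per-category number to defend the output?
    if not pass_rates:
        return "not ready to defend"
    below = sorted(c for c, rate in pass_rates.items() if rate < threshold)
    if below:
        return "below threshold: " + ", ".join(below)
    return "clears threshold"
```

For example, `diagnose("CFO", 0.95, None)` returns "not ready to defend": the consumer and floor are named, but there is no number to show them.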

These two questions cost an afternoon and collapse most of the AI risk conversation into something operational. They also expose the projects where the honest answer is "we don't know yet," which is the most useful conversation a portfolio owner can have a quarter before a board review.

Measurement comes before architecture.

The instinct most teams reach for, when an AI workflow misaligns with the buyer's risk appetite, is architectural. Swap the model. Add an orchestrator. Re-platform. Buy a governance layer.

The architecture work eventually gets done. It rarely gets done well without measurement in place first. A semantic layer built before the evals are running is built on assumptions that the evals will later contradict. A governance overlay deployed before the production traces are observable is a checkbox layer with no signal underneath it.

Both stories above pivot on the same move. Freeze the architecture. Build the measurement. Then let the measurement results become the architecture spec. Chunking strategy comes from where the harness shows retrieval failures. Model choice comes from per-tenant pass rates. Deterministic versus probabilistic surface splits come from where the deterministic floor is non-negotiable.
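As a rough illustration of measurement results becoming the spec, the sketch below ranks failed fixtures by the stage where they failed. The `stage` field and its values are hypothetical; it assumes each failing fixture has already been triaged to a stage such as retrieval or generation.

```python
from collections import Counter


def rank_failure_stages(failures: list[dict]) -> list[tuple[str, int]]:
    """Count failed fixtures per stage, most frequent first."""
    return Counter(f["stage"] for f in failures).most_common()


# Hypothetical triage output from one harness run.
failures = [
    {"fixture_id": "f1", "stage": "retrieval"},
    {"fixture_id": "f2", "stage": "retrieval"},
    {"fixture_id": "f3", "stage": "generation"},
]
print(rank_failure_stages(failures))  # [('retrieval', 2), ('generation', 1)]
```

If retrieval dominates the list, chunking is the first architectural question to reopen. If generation does, model choice is.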

That sequence is what gets you to measured, reliable AI in a finance environment without burning a year. It also gives a CIO an answer when the auditor walks in: a number that says where the system clears the threshold and where it does not.

What this looks like in practice.

The teams winning at AI in finance today are not the ones with the cleverest architecture or the strictest governance committee. They are the ones who, the first time a customer or a board member asked "how do you know this works," had a number ready.

Two stories. One pattern. The downstream consumer always tells you what the threshold is. The measurement always tells you whether you clear it. The architecture is downstream of both.

The work is not glamorous. It is unfair how much it changes.

AI Audit: the two-week operating read for AI portfolios.
AI Audit checklist: the inventory, usage, risk, eval, and evidence questions to run before deployment.
Eval Maturity Model: eight stages from manual spot checks to multi-tenant launch gates.
NL-to-SQL evals for finance: the same measurement logic applied to finance data agents.

This is the long version of a comment that landed on LinkedIn this week. For finance leaders who want the same diagnostic applied to their own AI portfolio, the AI Audit is a two-week version of the exercise. Get in touch.

Questions before the workflow ships.

AI workflows break in finance when the downstream consumer grants more authority to an output than the system can prove. The failure is usually a mismatch between the workflow, the buyer's risk tolerance, and the measurement available to defend the output.

A CIO should ask who consumes the AI output, what authority that person grants it, and whether the team can produce a per-category number that proves the workflow clears that threshold.

Measurement should come before architecture because evals and traces show where the system actually fails. Without that evidence, model swaps, semantic layers, and governance overlays are built on assumptions the evidence may later contradict.

If the workflow cannot name the consumer, threshold, and proof, it is not ready for the audit committee.

Bring one workflow, vendor, or AI portfolio. We will map the evidence needed for finance leaders to fund, ship, or stop it.
