Across the AI portfolios we have reviewed inside finance firms over the last twelve months, the pattern is uncomfortably consistent. Large budgets approved. Sophisticated teams hired. Compliance signed off. Vendors selected. Steering committees green. And at the end of the year, an awkward conversation: what changed in the business?
The honest read is harder to accept than a story about execution failure. The system is doing exactly what it was designed to do. Budget approval was the outcome the system was optimised to produce. Everything downstream of approval, including value capture, integration, attribution, and workforce change, was someone else's problem at every step that mattered.
If you are a CAIO, a CIO, or the executive who owns the AI portfolio at a bank, an asset manager, or an insurer, the executive question worth asking is harder than "why did the AI projects fail?" The projects worked exactly as the institution rewarded them to work. The right question is whether you are willing to change what gets rewarded.
This piece works through what is actually broken, why every smart person inside the system keeps producing the same result, and where the real lever sits to change it.
Defensibility is the reward function.
Finance firms run on defensibility. Defensibility is the right answer to a century of operating in regulated, audited, fiduciary environments. It is what keeps the institution alive across cycles, regulators, and lawsuits. The discipline of producing legible, audit-trail-backed decisions is the moat.
The problem is what happens when defensibility becomes the reward function for a class of work, AI, where outcome velocity is the actual job.
Look at what the AI portfolio rewards today, in any large bank.
- Approvals are rewarded. The MD who walks the budget through committee gets credit at the moment of signature. Nothing in the institution's wiring asks them the next quarter whether the spend produced anything.
- Deployment is rewarded. "Tool rolled out to the front office" is a measurable, defensible, easily-reported claim. "Tool moved underwriting cycle time" is a harder claim that nobody will write down without proof, and proof is expensive to produce.
- Vendor revenue recognition is rewarded. Vendors hit their milestones on procurement. Integration is a separate workstream they are not contracted on. Their dashboards stay green either way.
- Comprehensibility in SteerCo is rewarded. Projects that fit on a slide win. Projects that need real explanation, because they actually move a P&L lever, lose. Selection bias against the right work is built into the intake process.
- Regulatory defensibility is rewarded. "We met every NIST control" wins. "We made the agent materially more accurate but cannot map the test to an existing control" loses, regardless of which one moves the business.
Across each of these, the institution is correctly rewarding what it values. The bug is that none of them are correlated with the business outcome the AI budget was supposed to produce. The published evidence on this point is unambiguous.
BCG's 2025 Build for the Future report, based on a global survey of 1,250 senior executives across nine industries and more than twenty-five sectors, finds that 60% of companies report minimal revenue and cost gains from AI despite substantial investment. Only 5% qualify as "future-built" and capture meaningful financial gains, and those 5% achieve roughly five times the revenue increase and three times the cost reduction of the rest [1].
Gartner, polling more than 3,400 organisations actively investing in agentic AI, forecasts that over 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls [2].
McKinsey's State of AI finds that 39% of respondents attribute any level of EBIT impact to AI use, and most of those say less than 5% of organisational EBIT is attributable to it. The strongest correlate of meaningful bottom-line impact is direct CEO oversight of AI governance. Budget size, vendor selection, and team headcount each correlate weakly by comparison [3].
These are reward-function numbers. They describe what the system is rewarding, and what the system is rewarding is something other than business outcomes.
Three consequences nobody wants to own.
Once you accept the reward function as the root cause, three downstream consequences fall out cleanly. Each one is something AI executives in finance know intuitively. None of them are anyone's job to fix at the moment.
1. Vendor velocity is a direct function of ownership.
There is an asymmetry inside every large bank's AI programme that almost nobody names out loud: the vendor's progress against its own milestones is decoupled from the bank's progress toward a business outcome.
A vendor's revenue is recognised on contract milestones. Their dashboard is green when the procurement clauses are met: licences activated, training sessions delivered, integration "available." None of that requires the bank's internal owner to actually ship a workflow change to a real user.
From the vendor's perspective, this is fine: they are doing what their contract rewards. From the bank's perspective, it is a slow leak. The vendor will continue to look like a healthy, on-track partner for as long as it takes your team to discover that no one at SVP level or above owned the integration, the workflow redesign, or the change-management work on the bank's side.
The velocity a bank actually gets from a vendor is a direct function of internal ownership. If the named owner sits below SVP, if ownership is split across three functions, or if the owner rotates inside the contract window, the vendor will continue to ship to their own milestones and the bank will continue to pay for the privilege. The slip is silent until the year-end review.
The diagnostic question to run on Monday: for every vendor line item in your 2026 AI budget, can you name one person at the SVP-or-above level whose compensation is directly tied to the business outcome of that line item? If you cannot, the line item is a procurement event. The integration was never owned.
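As a rough illustration, the ownership test can be run as a simple filter over the budget. The sketch below assumes a hypothetical `LineItem` record and a simplified seniority ladder. The field names, ranks, and sample entries are illustrative assumptions, not a description of any bank's systems.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical seniority ladder; real titles vary by institution.
SENIORITY = {"VP": 1, "Director": 2, "SVP": 3, "EVP": 4, "C-suite": 5}


@dataclass
class LineItem:
    vendor: str
    owner: Optional[str]        # one named individual, or None if unowned or split
    owner_level: Optional[str]  # key into SENIORITY
    comp_tied_to_outcome: bool  # is the owner's compensation tied to the outcome?


def is_owned(item: LineItem) -> bool:
    """True only if one named owner at SVP or above has compensation tied to the outcome."""
    if item.owner is None or item.owner_level is None:
        return False
    at_or_above_svp = SENIORITY.get(item.owner_level, 0) >= SENIORITY["SVP"]
    return at_or_above_svp and item.comp_tied_to_outcome


def procurement_events(budget: list[LineItem]) -> list[str]:
    """Vendors whose line items fail the ownership test."""
    return [item.vendor for item in budget if not is_owned(item)]


# Illustrative data only.
budget = [
    LineItem("Vendor A", "J. Doe", "SVP", True),
    LineItem("Vendor B", "R. Roe", "Director", True),
    LineItem("Vendor C", None, None, False),
    LineItem("Vendor D", "A. Poe", "EVP", False),
]
print(procurement_events(budget))  # ['Vendor B', 'Vendor C', 'Vendor D']
```

Only Vendor A passes. Seniority alone is not enough (Vendor D), and neither is an outcome-linked owner who sits below SVP (Vendor B).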
2. The vendor model you bought is probably already obsolete.
This one is structural, and it is the failure mode finance executives have the least vocabulary for.
Vendor selection in banks runs on multi-quarter procurement cycles. RFP, security review, legal, procurement, pilot, contract. Foundation-model progress runs every quarter, sometimes every month. Anthropic, OpenAI, Google, and the open-weights ecosystem are releasing capability tiers (tool use, extended context, code execution, native search, agentic frameworks) faster than any regulated procurement function can absorb.
The consequence: by the time the vendor's wrapper goes into production, the hard problem it was selected to solve is at least partly absorbed into the platform layer. The Retrieval-Augmented Generation specialist you bought in Q1 of last year now competes against a foundation-model feature shipping by default. The agentic orchestration vendor competes against the foundation provider's own agent toolkit. The integration partner you contracted to glue these together is doing a workflow that the next model release will compress into a single tool call.
The vendor is doing their job. The selection committee did its job. The mismatch is structural: procurement-cycle physics runs annually; platform-progress physics runs quarterly. Deloitte's 2026 enterprise AI work names the same dynamic from a different angle, recommending that banks treat AI agents as active operators inside their systems because the substrate underneath those agents is moving every quarter [4].
The diagnostic question to run on Monday: for every vendor in your 2026 AI portfolio, what is the model and capability assumption their value proposition rests on? When the foundation provider releases the next tier of that capability natively, sooner than your procurement cycle can absorb, what remains of the vendor's value?
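One way to make that question concrete is to write down each vendor's capability assumptions and subtract whatever the platform layer already ships. The sketch below is a deliberately crude set difference. The platform list repeats the capability tiers named earlier in this piece; the vendor entries and the "domain-specific connectors" capability are hypothetical placeholders.

```python
# Capability tiers this piece lists as shipping at the platform layer.
PLATFORM_NATIVE = frozenset({
    "tool use",
    "extended context",
    "code execution",
    "native search",
    "agentic frameworks",
})


def residual_value(vendor_capabilities, platform_native=PLATFORM_NATIVE):
    """Capabilities the vendor offers that the platform does not ship natively."""
    return set(vendor_capabilities) - set(platform_native)


# Hypothetical vendors and capability assumptions, for illustration only.
vendors = {
    "Retrieval vendor": {"native search", "extended context", "domain-specific connectors"},
    "Orchestration vendor": {"agentic frameworks", "tool use"},
}

for name, capabilities in vendors.items():
    print(name, sorted(residual_value(capabilities)))
```

An empty result does not prove the vendor is worthless, since service quality and integration effort are not modelled here, but it does mark the contracts where the value case should be re-argued first.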
3. The data debt has come home, and there is no one to invoice.
Production AI is uniquely good at one diagnostic: it exposes twenty years of data and workflow debt that has been hiding under manual processes.
A model that looks accurate in the pilot regresses materially the moment it touches real production. The pilot data was clean because it was hand-picked. Production data carries every workaround, every undocumented exception, every joined-from-three-systems-and-papered-over-by-a-spreadsheet field that the underlying business has been running on. The AI surfaced the debt that the manual processes had been absorbing for years.
The bank's instinct in this moment is to send the project back to a "data strategy" workstream. Every large bank already has a data strategy, a CDO, and a multi-year transformation programme. Those functions were never funded to fix the debt at the level of granularity AI requires, because the manual processes were quietly absorbing it. The funding case never had to land.
Production AI needs one explicit owner for the surfaced debt. Without that owner, the AI project becomes a referendum on the entire data estate, and the AI project will lose every time, because no model rollout has the political capital to fix twenty years of joined data.
The same point applies to workflow debt. Most AI projects fail at the moment a relationship manager, an underwriter, or a credit analyst is asked to switch their workflow. The muscle memory, the audit-trail expectations, and the trust mechanics of their current process were never inventoried, let alone redesigned. Workflow change is the dominant failure mode, and it is invisible until production.
The diagnostic question to run on Monday: for every "production-ready" AI project in your portfolio, who owns the data debt it has surfaced and the workflow change it requires? The answer the audit committee will accept is a name at the SVP-or-above level. Anything softer is the same procurement event that Question 1 already exposed.
A short note on the regulator.
Any honest piece on AI in banks has to address the regulator directly. Defensibility-over-outcomes is partly rational in finance because the regulator's posture rewards audit trails over P&L lift. An AI project that ships value but cannot be cleanly mapped to an existing control framework creates institutional risk. A project that produces framework-mapped evidence with no measurable lift does not.
The regulatory drag is real and worth respecting. The institutions that treat it as a constraint to navigate will out-ship the ones that treat it as a stop signal.
The regulator is asking for defensibility of outcomes. Read the frameworks themselves: NIST AI RMF, ISO 42001, the EU AI Act, MAS's agentic AI guidelines, the OCC's emerging supervisory letters. Every one of them requires continuous evaluation and monitoring. The institutions that produce continuous, framework-mapped evidence as a byproduct of their AI operating cadence will out-ship peers who keep treating defensibility and outcomes as a trade-off. The procurement-and-deploy posture banks are running today is out of alignment with where the regulators are already moving.
Where the reward function actually changes.
A diagnosis without a prescription is consulting theatre. The prescription is structural: a reward-function change at one of three layers. Pick at least one. Ideally two.
Layer 1. Compensation tied to business outcomes. The compensation cycle for the AI portfolio owner and their direct reports should be tied to a measurable business outcome on the budget they deployed. Deployment metrics stop being the unit of evaluation. This is uncomfortable because outcomes take longer than a compensation cycle, and because measurable outcome attribution requires instrumentation most banks do not currently have in place. The discomfort is the point. Without it, the institution will continue to reward signature.
Layer 2. Board-level accountability at SVP+, with named ownership. Every AI line item over a threshold (we suggest $1M, but pick your own) gets a named SVP-or-above owner whose continuing accountability is published in the audit committee's working papers. Audit committees in finance already run this discipline for credit risk, market risk, and operational risk. The infrastructure exists; AI risk has simply not been pulled into it yet. The committee asks the named owner the same three questions every quarter: what changed in the business, what is the audit trail behind that change, and what is the remediation plan for any exceptions. The owner who cannot answer those three questions loses the line item at the next budget cycle.
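The mechanics of Layer 2 are simple enough to sketch in a few lines. The example below assumes a hypothetical `QuarterlyAnswer` record and treats a non-empty answer as a pass, which is a crude proxy: a real audit committee judges the quality of the evidence, not its presence. The sample review text is invented for illustration.

```python
from dataclasses import dataclass

THRESHOLD_USD = 1_000_000  # the threshold suggested above; pick your own


@dataclass
class QuarterlyAnswer:
    what_changed: str      # what changed in the business
    audit_trail: str       # the audit trail behind that change
    remediation_plan: str  # the remediation plan for any exceptions


def needs_named_owner(budget_usd: float, threshold: float = THRESHOLD_USD) -> bool:
    """Line items over the threshold require a named SVP-or-above owner."""
    return budget_usd > threshold


def keeps_line_item(answer: QuarterlyAnswer) -> bool:
    """The owner keeps the line item only if all three questions have an answer."""
    return all(
        text.strip()
        for text in (answer.what_changed, answer.audit_trail, answer.remediation_plan)
    )


# Illustrative review: two questions answered, no remediation plan.
review = QuarterlyAnswer("Underwriting cycle time fell", "Linked in working papers", "")
print(needs_named_owner(2_500_000), keeps_line_item(review))  # True False
```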
Layer 3. Navigating the regulatory drag explicitly. This is the layer most institutions duck. The conversation with the regulator, and internally with the CRO and the head of compliance, has to be reframed from "show us you have controlled the AI risk" to "show us you have continuous evidence that the AI is producing the outcome it was approved to produce, with the controls running in the background of that operating cadence." The institutions willing to have this conversation directly will set the regulatory precedent. The institutions that wait will inherit whatever precedent the early movers set.
Each of these three layers is hard. Each is also more tractable than the alternative: another year of budget approved, deployment evidence reported, and no answer to "what changed in the business" when the audit committee asks at year-end.
What changes when you can measure outcomes.
The reward function cannot change until the institution can measure what it is rewarding. That is the bottleneck most banks hit when they try to operationalise the prescription above. Deployment evidence is easy: licence counts, training completion rates, model-deployment dashboards. Outcome evidence, the read on whether the AI is compounding on the balance sheet or exposing it, requires instrumentation that most banks do not currently have in place across vendor tools, embedded AI features, and internal agents.
This is the practical work of applied AI inside a finance firm. It is the operating substrate that lets the CAIO, the CIO, and the CFO answer the audit committee's three questions with evidence instead of anecdote: what changed, what is the trail, and what is the remediation plan. Strategy slides and governance checklists describe the work. The substrate is what does it.
Trustable, reliable AI is the actual game inside a finance firm. The bar is AI whose outcome holds up between earnings calls, between audit cycles, between model releases. AI that compounds on the balance sheet instead of exposing it.
We built TrustEvals to be that substrate, anchored on an AI Audit that produces a board-readable operating read in two weeks across the four workstreams every finance firm needs to run in parallel: AI Transformation to capture the upside, AI Governance to contain the risk, AI Fluency to move the workforce, and the AI Audit itself as the continuous read that ties them together. If the diagnosis in this piece resonates and the three diagnostic questions return more "no" than "yes," that is the conversation to have next.
The AI budget will keep getting approved. The harder question is whether the next round of it actually ships.
Sources
[1] BCG, The Widening AI Value Gap: Build for the Future 2025 (September 2025). Global survey of 1,250 senior executives and AI decision-makers across nine industries and more than twenty-five sectors. Headline finding: 60% of companies report minimal revenue and cost gains despite substantial investment; 35% are scaling and beginning to generate value; 5% qualify as "future-built" and achieve roughly five times the revenue increase and three times the cost reduction of the rest. Full report PDF: https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf · Companion essay: https://www.bcg.com/publications/2025/are-you-generating-value-from-ai-the-widening-gap
[2] Gartner press release, Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 (June 25, 2025). Based on a poll of more than 3,400 organisations actively investing in agentic AI. Cited reasons: escalating costs, unclear business value, inadequate risk controls. Gartner also estimates only about 130 of the thousands of self-described agentic AI vendors offer real agentic capability versus rebranded RPA, chatbots, or AI assistants ("agent washing"). https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
[3] McKinsey, The State of AI: How Organizations Are Rewiring to Capture Value (March 2025) and The State of AI in 2025: Agents, Innovation, and Transformation (November 2025). 39% of respondents attribute any level of EBIT impact to AI use; most of those say less than 5% of organisational EBIT is attributable to AI. About 6% are "AI high performers" reporting 5%+ EBIT contribution. The strongest correlate of bottom-line impact is direct CEO oversight of AI governance. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-how-organizations-are-rewiring-to-capture-value · November 2025 update: https://www.mckinsey.com/~/media/mckinsey/business%20functions/quantumblack/our%20insights/the%20state%20of%20ai/november%202025/the-state-of-ai-2025-agents-innovation_cmyk-v1.pdf
[4] Deloitte, State of AI in the Enterprise, 2026 (published April 2026). Survey of 3,235 director-to-C-suite leaders across 24 countries and six industries including financial services, conducted August–September 2025. 25% of respondents report having moved 40%+ of AI experiments into production today; 54% expect to reach that level in the next three to six months. Companion: Deloitte, Managing the New Wave of Risks from AI Agents in Banking (2026), which names the risk-framework adequacy gap for agentic AI in banking and recommends treating AI agents as active operators inside the bank's systems. https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html · Press release: https://www.deloitte.com/us/en/about/press-room/state-of-ai-report-2026.html