NL-to-SQL Evals for Finance.

Natural-language SQL is solved at the demo layer and broken at the production layer. The gap is evals.

NL-to-SQL evals measure whether a finance user's natural language question is routed, interpreted, translated into SQL, executed, and rendered correctly. In production, the eval also has to decide whether the underlying data is trustworthy enough to answer. That is why finance teams need two eval surfaces: answer correctness on top and dataset quality underneath.

Quick answer

The production question is not whether SQL was generated. It is whether the answer can be trusted.

The buyer does not experience a text-to-SQL system as a parsing benchmark. The buyer experiences it as a board number, a budget answer, a chart, a caveat, or an exception.

In finance, a wrong NL-to-SQL answer is not merely a model hallucination. It can become the wrong COGS number, the wrong vendor-spend read, the wrong headcount trend, or the wrong risk signal. The operating bar is therefore audit-grade: the system needs an opinion, materiality thresholds, exceptions, working papers, and remediation.

Academic benchmarks such as Spider and Spider 2.0 are useful signal. Production finance systems need the next layer: tenant-specific business meaning, RBAC slices, data quality checks, user feedback, drift gates, and audit-ready evidence.

Topic cluster

The source guide now branches into an operating cluster.

Use the canonical page for the whole argument, then use these pages when the buyer or answer engine needs a specific definition, checklist, template, or failure-mode explanation.

Guide

What are NL-to-SQL evals?

A practical definition of NL-to-SQL evals for finance teams: what they test, why text-to-SQL benchmarks are not enough, and what evidence production systems need.

Template

Build the golden dataset for NL-to-SQL.

How finance teams should build a persona-first golden dataset for NL-to-SQL systems, with materiality thresholds, tenant slices, role slices, and expected answer metadata.

Guide

Answer correctness and dataset quality are different evals.

Why production NL-to-SQL systems in finance need two eval surfaces: answer correctness on the final response and dataset quality underneath the query.

Guide

NL-to-SQL fails differently in finance.

The main NL-to-SQL failure modes finance teams should evaluate: routing drift, semantic mismatch, data-quality failure, RBAC errors, SQL mistakes, and misleading output rendering.

Guide

How to audit an NL-to-SQL system.

A step-by-step audit workflow for NL-to-SQL systems in finance, including scope, materiality, golden datasets, trace replay, exceptions, drift gates, and audit memoranda.

Guide

Evaluate the semantic layer before SQL.

How to evaluate the semantic layer behind finance AI and NL-to-SQL systems, including metric meaning, tenant overlays, synonyms, sample values, and role-aware retrieval.

Checklist

The NL-to-SQL evaluation checklist for finance.

A practical NL-to-SQL evaluation checklist for finance teams, covering scope, golden datasets, semantic layer checks, answer correctness, dataset quality, drift, CI gates, and audit memoranda.

Template

The audit memorandum template for AI systems.

A practical audit memorandum template for AI systems, including opinion, scope, materiality, exceptions, remediation, working papers, and evidence sections.

The four-layer stack

Every layer needs its own eval treatment.

Most teams evaluate the SQL writer and assume the rest of the stack is fine. That is how good demos become brittle production systems.

| Layer | What it does | What to evaluate |
| --- | --- | --- |
| Router / intent | Classifies whether the question is a metric lookup, comparison, listing, trend, anomaly explanation, or scenario. | Routing accuracy, category drift, and whether the right semantic slice was selected. |
| Semantic layer | Maps schema objects to business meaning, synonyms, sample values, tenant overlays, and finance definitions. | Tenant-specific meaning, data dependencies, and whether COGS, revenue, ARR, or headcount mean the right thing for this firm. |
| SQL synthesis | Plans, writes, reviews, executes, and sometimes bypasses generation through template paths for common questions. | End-to-end answer correctness, not only per-agent scores. The final answer is what the audit opinion uses. |
| Output rendering | Turns result sets into chart, narrative, confidence badge, caveat, and exportable answer. | Whether the visual or narrative changes the business meaning of the result. |
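To make the gap between per-layer and end-to-end scores concrete, here is a minimal sketch. The `LayerResult` schema and the rule that an answer counts as correct only when all four layers pass are simplifying assumptions for illustration, not a description of any particular product.

```python
from dataclasses import dataclass

LAYERS = ("router", "semantic", "sql", "rendering")


@dataclass
class LayerResult:
    """Pass/fail verdicts for one question across the four layers (hypothetical schema)."""
    router: bool
    semantic: bool
    sql: bool
    rendering: bool

    def end_to_end(self) -> bool:
        # Assumption: the final answer is correct only if every layer was correct.
        return all(getattr(self, layer) for layer in LAYERS)


def layer_scores(results: list[LayerResult]) -> dict[str, float]:
    """Return the pass rate per layer plus the end-to-end pass rate."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    scores = {layer: sum(getattr(r, layer) for r in results) / n for layer in LAYERS}
    scores["end_to_end"] = sum(r.end_to_end() for r in results) / n
    return scores


# Illustrative data: three different layers each fail on a different question.
results = [
    LayerResult(router=False, semantic=True, sql=True, rendering=True),
    LayerResult(router=True, semantic=False, sql=True, rendering=True),
    LayerResult(router=True, semantic=True, sql=False, rendering=True),
    LayerResult(router=True, semantic=True, sql=True, rendering=True),
]
scores = layer_scores(results)
```

In this made-up sample the SQL layer passes three questions out of four, yet only one question in four is answered correctly end to end, which is the number a finance user actually experiences.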

Evals as audit

The right mental model is the audit memorandum.

Finance buyers already know how to read opinions, materiality, exceptions, scope limitations, and working papers. NL-to-SQL evals should use that language.

Audit opinion

Clean, qualified, adverse, or scope-limited verdict on the production NL-to-SQL system.

Materiality threshold

The accuracy bar per question tier, persona, tenant, and workflow before a finding becomes an exception.

Working papers

Golden dataset, trace logs, prompt versions, classifier outputs, optimizer history, and remediation trail.

Audit exception

A failure above threshold with root cause, owner, fix status, and re-test evidence.

Scope limitation

The system cannot issue a clean answer because the underlying data slice failed its own quality checks.
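One way to read these five terms together is as a small decision rule. The sketch below is an assumption about how the pieces could fit, not a prescribed method: `TierResult`, `issue_opinion`, and the `adverse_share` cut-off are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Opinion(Enum):
    CLEAN = "clean"
    QUALIFIED = "qualified"
    ADVERSE = "adverse"
    SCOPE_LIMITED = "scope-limited"


@dataclass
class TierResult:
    tier: str              # e.g. "tier-1" for board-grade questions
    accuracy: float        # observed accuracy on the golden set, 0..1
    threshold: float       # materiality threshold (accuracy bar) for this tier
    data_quality_ok: bool  # did the underlying data slice pass its own checks?


def issue_opinion(results: list[TierResult], adverse_share: float = 0.5) -> Opinion:
    """Map tier-level results to an opinion (illustrative rule, not a standard)."""
    if not results:
        raise ValueError("nothing in scope")
    # Scope limitation first: accuracy cannot be judged over untrusted data.
    if any(not r.data_quality_ok for r in results):
        return Opinion.SCOPE_LIMITED
    exceptions = [r for r in results if r.accuracy < r.threshold]
    if not exceptions:
        return Opinion.CLEAN
    # Assumption: exceptions in at least `adverse_share` of tiers means adverse.
    if len(exceptions) / len(results) >= adverse_share:
        return Opinion.ADVERSE
    return Opinion.QUALIFIED
```

A real engagement would set the adverse cut-off, and the order in which the conditions are checked, with the audit owner rather than in code defaults.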

Two eval surfaces

Build answer correctness and dataset quality in parallel.

A correct SQL query over broken finance data is still a wrong answer. The system needs to know when to answer, when to caveat, and when to refuse.

Answer correctness

Given a user question, did the final rendered answer match the expected result? Score the full trace, not only the planner, reviewer, or SQL writer.

  • Question-to-answer correctness
  • Trace-level root cause
  • Chart and narrative accuracy
  • Regression against tier-1 questions
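As a rough illustration of question-to-answer correctness, the sketch below compares a final numeric answer against the expected result, using a relative materiality tolerance. The function name `score_answer`, the period-keyed dictionaries, and the sample figures are all assumptions made for the example.

```python
def score_answer(
    expected: dict[str, float],
    actual: dict[str, float],
    materiality: float,
) -> dict:
    """Compare final answer values to ground truth within a relative tolerance.

    `materiality` is a fraction of the expected value (0.001 means 0.1%).
    """
    missing = sorted(expected.keys() - actual.keys())
    unexpected = sorted(actual.keys() - expected.keys())
    out_of_tolerance = sorted(
        key
        for key in expected.keys() & actual.keys()
        if abs(actual[key] - expected[key]) > materiality * abs(expected[key])
    )
    return {
        "correct": not (missing or unexpected or out_of_tolerance),
        "missing": missing,
        "unexpected": unexpected,
        "out_of_tolerance": out_of_tolerance,
    }


# Hypothetical COGS by quarter: Q1 is within 0.1%, Q2 is off by 5%.
expected = {"2024-Q1": 1_000_000.0, "2024-Q2": 1_200_000.0}
actual = {"2024-Q1": 1_000_400.0, "2024-Q2": 1_260_000.0}
verdict = score_answer(expected, actual, materiality=0.001)
```

The verdict lists which periods failed rather than returning a bare boolean, so the failure can be attached to a trace for root-cause work.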

Dataset quality

Before the question runs, did the data slice pass the quality bar? Finance teams cannot trust a correct query over broken data.

  • Type consistency
  • Completeness by critical table
  • Range and sanity checks
  • Cross-source reconciliation
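The four checks above, and the answer, caveat, or refuse decision they feed, can be sketched as follows. Everything specific here is an assumption for illustration: the `amount` column, the non-negative range rule, the default tolerances, and the choice of which failed checks force a refusal.

```python
from enum import Enum


class Decision(Enum):
    ANSWER = "answer"
    CAVEAT = "caveat"
    REFUSE = "refuse"


def check_slice(rows, ledger_total, max_null_share=0.01, recon_tolerance=0.005):
    """Run four illustrative quality checks over a list of row dicts."""
    amounts = [row.get("amount") for row in rows]
    present = [a for a in amounts if a is not None]
    numeric = [
        a for a in present
        if isinstance(a, (int, float)) and not isinstance(a, bool)
    ]
    null_share = (len(amounts) - len(present)) / len(amounts) if amounts else 1.0
    return {
        # Type consistency: every non-null amount is a number.
        "type_consistency": len(numeric) == len(present),
        # Completeness: share of missing amounts stays under the limit.
        "completeness": null_share <= max_null_share,
        # Range and sanity: assumes amounts in this slice are non-negative.
        "range": all(a >= 0 for a in numeric),
        # Cross-source reconciliation: slice total agrees with the ledger total.
        "reconciliation": abs(sum(numeric) - ledger_total)
        <= recon_tolerance * abs(ledger_total),
    }


def decide(checks, critical=("type_consistency", "reconciliation")):
    """Answer if all checks pass, refuse on a critical failure, else caveat."""
    failed = [name for name, ok in checks.items() if not ok]
    if not failed:
        return Decision.ANSWER
    if any(name in critical for name in failed):
        return Decision.REFUSE
    return Decision.CAVEAT
```

Which checks count as critical is a policy decision per data slice; the default tuple here only shows where that policy would plug in.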

Golden dataset

The golden set is the working-paper substrate.

A flat list of questions is too weak. The corpus needs persona, tier, materiality, tenant, and role metadata so the result can support an audit opinion.

Field 01: User question
Field 02: Expected answer or chart shape
Field 03: Persona
Field 04: Question tier
Field 05: Materiality threshold
Field 06: Business importance
Field 07: Production frequency
Field 08: Data dependencies
Field 09: Failure mode if wrong
Field 10: Tenant and role slice
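The ten fields above map naturally onto a typed record. The sketch below is one possible shape: the field names, the integer tier, the fractional materiality, and the example values are assumptions, and Field 10 is split into separate `tenant` and `role` attributes for easier slicing.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GoldenRecord:
    """One golden-set entry (hypothetical schema mirroring Fields 01 to 10)."""
    question: str                       # Field 01
    expected_answer: str                # Field 02: value or chart shape
    persona: str                        # Field 03
    tier: int                           # Field 04: 1 is the most critical
    materiality: float                  # Field 05: relative tolerance, 0..1
    business_importance: str            # Field 06
    production_frequency: str           # Field 07
    data_dependencies: tuple[str, ...]  # Field 08
    failure_mode: str                   # Field 09
    tenant: str                         # Field 10 (tenant slice)
    role: str                           # Field 10 (role slice)

    def __post_init__(self):
        if not self.question.strip():
            raise ValueError("question must not be empty")
        if self.tier < 1:
            raise ValueError("tier must be 1 or greater")
        if not 0.0 <= self.materiality <= 1.0:
            raise ValueError("materiality must be a fraction between 0 and 1")


# Illustrative record; every value is made up.
example = GoldenRecord(
    question="What was COGS by quarter last year?",
    expected_answer="bar chart, four quarters, one series",
    persona="CFO",
    tier=1,
    materiality=0.001,
    business_importance="board reporting",
    production_frequency="weekly",
    data_dependencies=("gl_entries", "account_mapping"),
    failure_mode="wrong COGS number in the board pack",
    tenant="acme",
    role="finance_admin",
)
row = asdict(example)  # plain dict, ready to store next to the other working papers
```

Validating at construction time keeps malformed entries out of the corpus before they can influence a score.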

The seed set should be written manually with domain experts before it is extended synthetically. Synthetic variants are useful after the ground truth exists; they are dangerous when they become the ground truth.

Closed loop

The production loop keeps the opinion fresh.

Evals run once become stale. Evals tied to production traces become the measurement system behind the product.

1. Trace capture

Capture every production question, intermediate state, model version, token and latency data, tenant, persona, RBAC slice, final answer, and feedback signal.
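A trace record covering those items might look like the sketch below. The `Trace` class, its field names, and the JSON Lines serialization are assumptions for illustration; a real system would follow whatever its tracing stack already emits.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Trace:
    """One production question and everything needed to replay it (hypothetical)."""
    question: str
    tenant: str
    persona: str
    rbac_slice: str
    model_version: str
    final_answer: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    # Intermediate state, e.g. routed intent, semantic slice, generated SQL.
    intermediate: dict = field(default_factory=dict)
    # Feedback signal such as "thumbs_up" or "thumbs_down"; None if not given.
    feedback: Optional[str] = None
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl(trace: Trace) -> str:
    """Serialize a trace as one JSON line for an append-only log."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = Trace(
    question="Top vendors by spend this quarter",
    tenant="acme",
    persona="controller",
    rbac_slice="finance_read",
    model_version="model-v1",  # placeholder identifier
    final_answer="table with 10 rows",
    latency_ms=840.0,
    prompt_tokens=1200,
    completion_tokens=180,
    intermediate={"intent": "listing", "sql": "SELECT ..."},
    feedback="thumbs_down",
)
line = to_jsonl(trace)
```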

2. Classifier

Match production traces to known golden-set patterns and flag new recurring question shapes for review.
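As a deliberately simple stand-in for that classifier, the sketch below matches a production question to golden-set patterns by token overlap (Jaccard similarity). The source does not say how the matching is done; the similarity measure, the `min_similarity` threshold, and the pattern ids are all assumptions, and a production system would more plausibly use embeddings or a trained model.

```python
import re
from typing import Optional


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(
    question: str,
    golden_patterns: dict[str, str],
    min_similarity: float = 0.5,
) -> tuple[Optional[str], float]:
    """Return (pattern_id, similarity), or (None, best_similarity) if unmatched.

    Unmatched questions are candidates for review as new question shapes.
    """
    q = tokens(question)
    best_id, best = None, 0.0
    for pattern_id, text in golden_patterns.items():
        similarity = jaccard(q, tokens(text))
        if similarity > best:
            best_id, best = pattern_id, similarity
    if best >= min_similarity:
        return best_id, best
    return None, best


golden = {
    "cogs_by_quarter": "what was cogs by quarter last year",
    "vendor_spend_top": "top vendors by spend this quarter",
}
known = classify("What was COGS by quarter last year?", golden)
novel = classify("Show headcount trend for engineering", golden)
```

Here `known` resolves to the `cogs_by_quarter` pattern, while `novel` shares no tokens with either pattern and comes back unmatched, which is the signal to queue it for human review.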

3. Feedback review

Treat thumbs-down signals as working-paper candidates. A human decides whether the failure is real, a user-side misunderstanding, or a classifier false negative.

4. Drift detection

Track tier-1 questions on a rolling window so the team sees when a previously working board-grade question starts degrading.
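A rolling-window monitor for this could be sketched as below. The `DriftMonitor` class, the window size, and the minimum pass rate are hypothetical; the sketch also assumes each tier-1 question is re-scored periodically so that pass or fail outcomes arrive as a stream.

```python
from collections import deque
from typing import Optional


class DriftMonitor:
    """Track pass/fail outcomes per question over a fixed rolling window."""

    def __init__(self, window: int = 20, min_pass_rate: float = 0.95):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.min_pass_rate = min_pass_rate
        self._history: dict[str, deque] = {}

    def record(self, question_id: str, passed: bool) -> bool:
        """Store one outcome and return True if the question is now drifting."""
        history = self._history.setdefault(question_id, deque(maxlen=self.window))
        history.append(passed)  # the oldest outcome falls out automatically
        return self.is_drifting(question_id)

    def pass_rate(self, question_id: str) -> Optional[float]:
        history = self._history.get(question_id)
        if not history:
            return None
        return sum(history) / len(history)

    def is_drifting(self, question_id: str) -> bool:
        history = self._history.get(question_id)
        # Only judge full windows so a single early failure does not alarm.
        if history is None or len(history) < self.window:
            return False
        return sum(history) / len(history) < self.min_pass_rate
```

With `window=4` and `min_pass_rate=0.75`, four passes followed by one failure stays at the bar, and a second failure drops the rate to 0.5 and flags the question.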

5. CI gates

Block releases when above-materiality questions regress below threshold. Downgrading a question tier should be an explicit, logged decision.
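A release gate along these lines can be a few lines of code run in the pipeline. In the sketch below, `gate` and `main` are hypothetical names, and the tier labels and thresholds are made up. Treating a missing score as a failure is a design assumption, so that a tier cannot be dropped from the run without an explicit change to the thresholds.

```python
import sys


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return a list of blocking failures; an empty list means the release may ship."""
    failures = []
    for tier, threshold in sorted(thresholds.items()):
        score = scores.get(tier)
        if score is None:
            # A tier with a threshold but no score is treated as a failure.
            failures.append(f"{tier}: no score reported")
        elif score < threshold:
            failures.append(f"{tier}: {score:.3f} below threshold {threshold:.3f}")
    return failures


def main(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Print failures to stderr and return a process exit code for CI."""
    failures = gate(scores, thresholds)
    for failure in failures:
        print(f"BLOCK {failure}", file=sys.stderr)
    return 1 if failures else 0
```

A CI job would call `sys.exit(main(scores, thresholds))` after the eval run, with the thresholds kept in version control so that any tier downgrade shows up as a reviewed, logged change.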

6. Prompt optimization

Run optimizer loops such as GEPA only after the working papers are representative enough to judge whether a prompt change produced a real production improvement or merely moved the error.

Buyer artifact

The dashboard is not the deliverable. The memorandum is.

A CFO, CIO, CISO, audit committee chair, and external auditor need different reads from the same evidence trail. The memorandum gives each one a legible object.

Page 1: Opinion

Clean, qualified, adverse, or scope-limited. One paragraph, period covered, and the production surface in scope.

Page 2: Materiality

Thresholds by tier, persona, tenant, workflow, and any changes made during the period.

Page 3: Exceptions

Material findings with trace evidence, root cause, owner, remediation status, and re-test result.

Appendix: Working papers

Scores by question tier, drift triggers, classifier patterns, optimizer history, CI failures, and red-team traces.

TrustEvals

TrustEvals turns NL-to-SQL evals into audit-grade evidence.

The same substrate supports AI Audit, AI Evals, AI Governance, and the AI Engineering side track for AI-native finance SaaS companies.

TrustEvals operates the eval engine, golden-set management surface, policy-enforcement layer, and red-teaming substrate for finance AI systems. The evidence stream can be mapped to NIST AI RMF, ISO 42001, the EU AI Act, and a firm's internal standards without rebuilding the measurement layer for each framework.

For an AI-native finance SaaS company, this becomes customer trust infrastructure. For a finance enterprise deploying its own NL-to-SQL surface, it becomes the board-readable operating read: where AI is creating value, where it is creating risk, and what needs to move this quarter.

FAQ

NL-to-SQL eval questions, answered plainly.

NL-to-SQL evals are tests and operating controls that measure whether a natural-language question is routed, interpreted, translated into SQL, executed, and rendered correctly. In finance, they should also check whether the underlying dataset is trustworthy enough to answer the question at all.

NL-to-SQL systems fail in production because the demo problem is narrower than the production problem. Production finance questions depend on tenant-specific business meaning, row-level permissions, data quality, chart interpretation, materiality, and audit evidence. A syntactically valid SQL query can still create the wrong board number.

Answer correctness asks whether the final answer matches ground truth. Dataset quality asks whether the table, metric, time window, currency, account mapping, and source-system reconciliation are sound enough to support the answer. Production finance systems need both surfaces.

A finance NL-to-SQL team should usually seed a few hundred manual examples before broad rollout, then expand with validated synthetic variants and production traffic. The important design choice is not the exact count; it is persona, tier, materiality, tenant, and role metadata.

An audit memorandum for an NL-to-SQL system should include an opinion, materiality thresholds, scope, exceptions, remediation status, and working papers. The supporting evidence should show scores by question tier, persona, tenant, trace sample, optimizer history, drift trigger, and policy-enforcement result.

TrustEvals operates the evaluation substrate for finance AI systems: trace ingestion, golden-set management, answer-correctness evals, dataset-quality checks, drift gates, policy evidence, red-team probes, and audit-ready memoranda. The first wedge is usually the two-week AI Audit.

If the NL-to-SQL surface can walk into a finance committee meeting, the eval evidence has to be strong enough to walk in with it.

Bring one workflow, vendor, or AI portfolio. We will map the evidence needed for finance leaders to fund, ship, or stop it.
