What are NL-to-SQL evals?

A practical definition of NL-to-SQL evals for finance teams: what they test, why text-to-SQL benchmarks are not enough, and what evidence production systems need.

At a glance: Guide | Evals | Head of AI | NL-to-SQL, text-to-SQL, evals

A plain-English definition for finance teams evaluating natural-language SQL systems before they become board numbers, budget answers, or audit evidence.

NL-to-SQL evals are tests and operating controls that measure whether a natural-language question is routed, interpreted, translated into SQL, executed, and rendered correctly. In finance, they also have to decide whether the underlying data is trustworthy enough to answer the question at all.
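One way to make that definition concrete is to record a pass/fail result for each stage of a single question and treat the answer as correct only if every stage passed. The sketch below is illustrative: the stage names follow the definition above, while `EvalTrace` and its fields are hypothetical names, not part of any specific product.

```python
from dataclasses import dataclass, field

# Stage names follow the definition above; the record shape is hypothetical.
STAGES = ["routing", "interpretation", "sql_translation", "execution", "rendering"]

@dataclass
class EvalTrace:
    question: str
    stage_passed: dict = field(default_factory=dict)  # stage name -> bool

    def first_failure(self):
        """Return the earliest stage that failed or was never checked, else None."""
        for stage in STAGES:
            if not self.stage_passed.get(stage, False):
                return stage
        return None

    def answer_correct(self) -> bool:
        """The end-to-end answer counts as correct only if every stage passed."""
        return self.first_failure() is None

# Example: the SQL was right, but the rendered answer lost its meaning.
trace = EvalTrace(
    question="What was vendor spend in Q1?",
    stage_passed={
        "routing": True,
        "interpretation": True,
        "sql_translation": True,
        "execution": True,
        "rendering": False,
    },
)
print(trace.first_failure())   # rendering
print(trace.answer_correct())  # False
```

Treating an unchecked stage as a failure is a deliberate choice in this sketch: a stage with no evidence should not count toward a correct answer.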

The eval is the operating proof, not the demo score.

A benchmark can show whether a model can produce plausible SQL. A production eval has to show whether the whole finance answer can be trusted by the person who uses it.

Every layer needs a different question.

The mistake is treating NL-to-SQL as one model task. Finance systems are layered operating surfaces, and each layer needs its own eval treatment.

| Layer | Question | Evidence |
| --- | --- | --- |
| Router | Did the system classify the user's intent correctly? | Intent label, selected path, confidence, and false-route examples. |
| Semantic layer | Did the system retrieve the right business meaning for this tenant? | Metric definition, synonyms, table slice, tenant overlay, and role context. |
| SQL synthesis | Did the generated or templated query produce the expected result? | Query, reviewer result, execution output, and expected-answer comparison. |
| Output rendering | Did the chart, caveat, and narrative preserve the business meaning? | Rendered answer, chart type, caveat state, and reviewer notes. |
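For the SQL synthesis layer, the expected-answer comparison can be sketched as an execution-based check: run the generated query and a reviewer-approved query against the same fixture data and compare the rows returned, not the SQL text. This is a minimal sketch using Python's standard `sqlite3` module; the `invoices` table, its rows, and the function names are assumptions made for illustration.

```python
import sqlite3
from collections import Counter

def run_query(conn, sql):
    """Execute a query and return its rows as a multiset, so row order is ignored."""
    return Counter(conn.execute(sql).fetchall())

def results_match(conn, generated_sql, expected_sql):
    """Two queries agree if they return the same rows on the same data."""
    try:
        return run_query(conn, generated_sql) == run_query(conn, expected_sql)
    except sqlite3.Error:
        # A query that fails to execute is scored as a wrong answer, not a crash.
        return False

# Hypothetical fixture data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (vendor TEXT, amount REAL, quarter TEXT)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("Acme", 100.0, "Q1"), ("Acme", 50.0, "Q2"), ("Globex", 75.0, "Q1")],
)

expected = (
    "SELECT vendor, SUM(amount) FROM invoices "
    "WHERE quarter = 'Q1' GROUP BY vendor"
)
# Different text, same result: aliased column and an extra ORDER BY.
good = (
    "SELECT vendor, SUM(amount) AS total FROM invoices "
    "WHERE quarter = 'Q1' GROUP BY vendor ORDER BY total"
)
# Plausible-looking SQL that drops the quarter filter.
bad = "SELECT vendor, SUM(amount) FROM invoices GROUP BY vendor"

print(results_match(conn, good, expected))  # True
print(results_match(conn, bad, expected))   # False
```

Comparing result sets lets differently written but equivalent queries pass, while a query that silently drops a filter fails. It covers only this one layer; routing, semantic retrieval, and rendering still need their own evidence.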

A correct query can still be an unsafe answer.

Finance buyers do not experience NL-to-SQL as a parsing task. They experience it as a number they may use in a board pack, budget review, vendor conversation, or audit committee readout.

Materiality

Board-grade questions need a higher threshold than exploratory analyst questions.

Permissions

The same question can be right for a CFO and wrong for a budget owner if role filters change the result.

Data quality

The system needs to know when the data slice cannot support a clean answer.
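The three conditions above can be read as a gate that runs before an answer is shown. The sketch below is one hypothetical way to combine them; the tier names, threshold values, and the `AnswerContext` and `release_decision` names are assumptions for illustration, and real thresholds would be a policy decision for the finance team.

```python
from dataclasses import dataclass

# Hypothetical minimum golden-set pass rates per question tier.
TIER_THRESHOLDS = {"board": 0.99, "operational": 0.95, "exploratory": 0.85}

@dataclass
class AnswerContext:
    tier: str                  # materiality tier of the question
    tier_pass_rate: float      # golden-set pass rate measured for that tier
    role_filter_applied: bool  # was the query scoped to the asker's role?
    data_quality_ok: bool      # did the data slice pass its quality checks?

def release_decision(ctx: AnswerContext) -> str:
    """Return 'block', 'abstain', 'hold', or 'answer' for one question."""
    if not ctx.role_filter_applied:
        return "block"    # permissions: never show an unscoped number
    if not ctx.data_quality_ok:
        return "abstain"  # data quality: the slice cannot support a clean answer
    if ctx.tier_pass_rate < TIER_THRESHOLDS[ctx.tier]:
        return "hold"     # materiality: evidence is below the bar for this tier
    return "answer"

# The same measured pass rate clears an exploratory question but not a board one.
print(release_decision(AnswerContext("exploratory", 0.96, True, True)))  # answer
print(release_decision(AnswerContext("board", 0.96, True, True)))        # hold
```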

What are NL-to-SQL evals? Answered plainly.

NL-to-SQL evals are not the same as SQL unit tests. SQL unit tests check specific query behavior. NL-to-SQL evals check the full path from user question to final answer, including routing, semantic meaning, permissions, data quality, SQL execution, and output rendering.

The headline metric should be end-to-end answer correctness by question tier. Intermediate scores are useful for debugging, but the final rendered answer is what the business user experiences.

Finance questions often become operating numbers. A wrong answer can change a board read, budget decision, risk signal, or audit finding, so the eval has to include materiality, permissions, and evidence.
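The headline metric described above, end-to-end answer correctness by question tier, reduces to a simple grouped pass rate once each golden-set question has a tier and a final verdict. A minimal sketch, assuming results arrive as `(tier, is_correct)` pairs; the tier labels and sample data are hypothetical.

```python
from collections import defaultdict

def correctness_by_tier(results):
    """results: iterable of (tier, is_correct) pairs.

    Returns {tier: fraction of questions whose final rendered answer was correct}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in results:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / totals[tier] for tier in totals}

# Hypothetical verdicts from a small golden set.
results = [
    ("board", True),
    ("board", True),
    ("board", False),
    ("exploratory", True),
]
scores = correctness_by_tier(results)
print(scores["exploratory"])  # 1.0
```

Reporting the rate per tier rather than as one blended number keeps a weak board-tier score from being hidden by a large volume of easy exploratory questions.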

Keep the evidence trail connected.

NL-to-SQL evals for finance

The canonical guide that ties answer correctness, dataset quality, golden sets, drift gates, and audit memoranda together.

Answer correctness vs dataset quality

The two eval surfaces production finance systems need.

AI Evals

TrustEvals service work for production eval layers, golden sets, and release gates.

If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.

Bring one workflow, vendor, or AI portfolio. We will map the evidence needed for finance leaders to fund, ship, or stop it.
