What are NL-to-SQL evals?

A plain-English definition for finance teams evaluating natural-language SQL systems before their outputs become board numbers, budget answers, or audit evidence.

NL-to-SQL evals are tests and operating controls that measure whether a natural-language question is routed, interpreted, translated into SQL, executed, and rendered correctly. In finance, they also have to decide whether the underlying data is trustworthy enough to answer the question at all.

Definition

The eval is the operating proof, not the demo score.

A benchmark can show whether a model can produce plausible SQL. A production eval has to show whether the whole finance answer can be trusted by the person who uses it.

The unit of evaluation is the final answer, not only the generated SQL.
The trace matters because routing, semantic lookup, SQL generation, execution, and rendering can each fail differently.
The evidence has to be durable enough for product, risk, and audit teams to inspect after the answer is gone.
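
One way to picture a durable trace is a single record per answered question that keeps each stage's output next to the final answer. The sketch below is illustrative only: the `AnswerTrace` class and every field name in it are assumptions made for this example, not a schema from any particular product, and exact string matching stands in for whatever grading rule a real team would use.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AnswerTrace:
    """Hypothetical per-question record; all field names are illustrative."""
    question: str
    route: str                 # path chosen by the router
    metric_definition: str     # business meaning retrieved from the semantic layer
    sql: str                   # generated or templated query
    result_rows: list          # raw execution output
    rendered_answer: str       # what the user actually saw
    expected_answer: Optional[str] = None  # reviewer-supplied reference, if any

    def is_correct(self) -> Optional[bool]:
        """Grade the final rendered answer, not the SQL text."""
        if self.expected_answer is None:
            return None  # ungraded: no reference answer yet
        return self.rendered_answer.strip() == self.expected_answer.strip()

    def to_json(self) -> str:
        """Serialize so the evidence can be stored and inspected later."""
        record = asdict(self)
        record["correct"] = self.is_correct()
        return json.dumps(record, sort_keys=True)


trace = AnswerTrace(
    question="What was Q3 travel spend?",
    route="metric_lookup",
    metric_definition="travel_spend = sum(amount) where category = 'travel'",
    sql="SELECT SUM(amount) FROM expenses WHERE category = 'travel' AND quarter = 'Q3'",
    result_rows=[[125000]],
    rendered_answer="$125,000",
    expected_answer="$125,000",
)
stored = trace.to_json()  # a string that can be written to any append-only store
```

Because the grade is attached to the rendered answer while the intermediate outputs travel with it, a reviewer can later see both that an answer was wrong and which stage to look at first.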

What to measure

Every layer needs a different question.

The mistake is treating NL-to-SQL as one model task. Finance systems are layered operating surfaces, and each layer needs its own eval treatment.

Layer | Question | Evidence
--- | --- | ---
Router | Did the system classify the user's intent correctly? | Intent label, selected path, confidence, and false-route examples.
Semantic layer | Did the system retrieve the right business meaning for this tenant? | Metric definition, synonyms, table slice, tenant overlay, and role context.
SQL synthesis | Did the generated or templated query produce the expected result? | Query, reviewer result, execution output, and expected-answer comparison.
Output rendering | Did the chart, caveat, and narrative preserve the business meaning? | Rendered answer, chart type, caveat state, and reviewer notes.
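
The four layers can be read as four separate checks on one eval case. The following is a minimal sketch under stated assumptions: the `score_layers` function and the dictionary keys (`expected_intent`, `metric_id`, `caveat_shown`, and so on) are hypothetical, and a real rendering check would likely involve a human reviewer rather than a single boolean.

```python
def score_layers(case: dict, observed: dict) -> dict:
    """Return one pass/fail per layer so a failure can be localized.

    `case` holds reviewer expectations and `observed` holds what the system
    produced; both shapes are assumptions made for this sketch.
    """
    return {
        "router": observed["intent"] == case["expected_intent"],
        "semantic_layer": observed["metric_id"] == case["expected_metric_id"],
        # SQL synthesis is judged on the executed result, not the query text.
        "sql_synthesis": observed["result"] == case["expected_result"],
        "output_rendering": observed["caveat_shown"] == case["expected_caveat"],
    }


case = {
    "expected_intent": "spend_by_vendor",
    "expected_metric_id": "vendor_spend_net",
    "expected_result": [("Acme", 42000)],
    "expected_caveat": True,
}
observed = {
    "intent": "spend_by_vendor",
    "metric_id": "vendor_spend_gross",  # wrong business meaning retrieved
    "result": [("Acme", 45500)],
    "caveat_shown": True,
}
scores = score_layers(case, observed)
failed = [layer for layer, ok in scores.items() if not ok]
```

In this made-up case the router and rendering pass while the semantic layer and SQL result fail, which points the investigation at the metric definition rather than at the model's SQL ability.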

Finance bar

A correct query can still be an unsafe answer.

Finance buyers do not experience NL-to-SQL as a parsing task. They experience it as a number they may use in a board pack, budget review, vendor conversation, or audit committee readout.

Materiality

Board-grade questions need a higher threshold than exploratory analyst questions.
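
A tiered threshold can be expressed as a simple release gate. This is a sketch only: the tier names and the numeric bars in `MIN_ACCURACY` are invented for illustration, and the actual values are a policy decision for the finance and risk owners.

```python
# Hypothetical tiers and minimum accuracy bars; real values are a policy decision.
MIN_ACCURACY = {"board": 0.99, "budget": 0.95, "exploratory": 0.85}


def release_gate(accuracy_by_tier: dict) -> dict:
    """Return, per tier, whether measured accuracy clears that tier's bar."""
    return {
        tier: accuracy_by_tier.get(tier, 0.0) >= bar  # unmeasured tiers fail closed
        for tier, bar in MIN_ACCURACY.items()
    }


gate = release_gate({"board": 0.97, "budget": 0.96, "exploratory": 0.90})
```

Here the same system would be cleared for budget and exploratory questions but held back from board-grade ones.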

Permissions

The same question can be right for a CFO and wrong for a budget owner if role filters change the result.
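
In eval terms, this means the expected answer belongs to the pair of question and role, not to the question alone. The sketch below uses an in-memory SQLite table to show the idea; the `spend` table, the role names, and the `ROLE_COST_CENTERS` filter map are all assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (cost_center TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO spend VALUES (?, ?)",
    [("marketing", 300), ("engineering", 500), ("finance", 200)],
)

# Hypothetical role filters: None means no row restriction.
ROLE_COST_CENTERS = {"cfo": None, "marketing_budget_owner": ["marketing"]}


def total_spend(role: str) -> int:
    """Answer 'What is total spend?' with the role's row filter applied."""
    allowed = ROLE_COST_CENTERS[role]
    if allowed is None:
        row = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM spend").fetchone()
    else:
        placeholders = ",".join("?" for _ in allowed)
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM spend "
            f"WHERE cost_center IN ({placeholders})",
            allowed,
        ).fetchone()
    return row[0]


# One eval case per (question, role) pair, each with its own expected answer.
expected = {"cfo": 1000, "marketing_budget_owner": 300}
results = {role: total_spend(role) == want for role, want in expected.items()}
```

An eval set that stored only one expected total for this question would mark one of the two roles wrong no matter how the system behaved.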

Data quality

The system needs to know when the data slice cannot support a clean answer.
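
One simple form of this check is a pre-answer gate that abstains when required fields are too sparse. This is a minimal sketch: the `can_answer` function, the row shape, and the 2% null-rate default are assumptions, and a production check would likely also cover freshness, duplicates, and reconciliation status.

```python
def can_answer(rows: list, required_fields: list, max_null_rate: float = 0.02) -> tuple:
    """Return (ok, reason). Abstain when a required field is too sparse."""
    if not rows:
        return False, "no rows in the requested slice"
    for name in required_fields:
        nulls = sum(1 for row in rows if row.get(name) is None)
        if nulls / len(rows) > max_null_rate:
            return False, f"{name} is missing in {nulls} of {len(rows)} rows"
    return True, "ok"


rows = [
    {"vendor": "Acme", "amount": 1200},
    {"vendor": "Globex", "amount": None},
    {"vendor": "Initech", "amount": 800},
    {"vendor": "Umbrella", "amount": 450},
]
ok, reason = can_answer(rows, ["vendor", "amount"])
```

An eval can then include cases where the correct behavior is to decline or caveat the answer, and score the system on whether it did so.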

FAQ

NL-to-SQL evals, answered plainly.

NL-to-SQL evals are not the same as SQL unit tests. SQL unit tests check specific query behavior. NL-to-SQL evals check the full path from user question to final answer, including routing, semantic meaning, permissions, data quality, SQL execution, and output rendering.

The headline metric should be end-to-end answer correctness by question tier. Intermediate scores are useful for debugging, but the final rendered answer is what the business user experiences.
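
Computed from graded results, that headline metric is a per-tier ratio of correct final answers. The sketch below assumes each graded case is reduced to a `(tier, final_answer_correct)` pair; the tier names and sample data are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_tier(results: list) -> dict:
    """results: list of (tier, final_answer_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in results:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / totals[tier] for tier in totals}


graded = [
    ("board", True), ("board", True), ("board", False), ("board", True),
    ("exploratory", True), ("exploratory", False),
]
headline = accuracy_by_tier(graded)  # {"board": 0.75, "exploratory": 0.5}
```

Reporting the tiers separately keeps a strong exploratory score from hiding a weak board-grade one.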

Finance questions often become operating numbers. A wrong answer can change a board read, budget decision, risk signal, or audit finding, so the eval has to include materiality, permissions, and evidence.

If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.
