What are NL-to-SQL evals?
A plain-English definition for finance teams evaluating natural-language SQL systems before their answers become board numbers, budget answers, or audit evidence.
NL-to-SQL evals are tests and operating controls that measure whether a natural-language question is routed, interpreted, translated into SQL, executed, and rendered correctly. In finance, they also have to decide whether the underlying data is trustworthy enough to answer the question at all.
The eval is the operating proof, not the demo score.
A benchmark can show whether a model can produce plausible SQL. A production eval has to show whether the whole finance answer can be trusted by the person who uses it.
Every layer needs a different question.
The mistake is treating NL-to-SQL as one model task. Finance systems are layered operating surfaces, and each layer needs its own eval treatment.
| Layer | Question | Evidence |
|---|---|---|
| Router | Did the system classify the user's intent correctly? | Intent label, selected path, confidence, and false-route examples. |
| Semantic layer | Did the system retrieve the right business meaning for this tenant? | Metric definition, synonyms, table slice, tenant overlay, and role context. |
| SQL synthesis | Did the generated or templated query produce the expected result? | Query, reviewer result, execution output, and expected-answer comparison. |
| Output rendering | Did the chart, caveat, and narrative preserve the business meaning? | Rendered answer, chart type, caveat state, and reviewer notes. |
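As a minimal sketch of the table above, each layer can emit its own pass/fail result with attached evidence, so a failed answer can be traced to the earliest layer that broke. The layer names, the `LayerResult` type, and the evidence keys are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

# Layer order follows the table: router first, rendering last.
# These identifiers are hypothetical names chosen for this sketch.
LAYERS = ("router", "semantic_layer", "sql_synthesis", "output_rendering")


@dataclass
class LayerResult:
    layer: str
    passed: bool
    evidence: dict = field(default_factory=dict)


def first_failing_layer(results):
    """Return the earliest layer that failed, or None if every layer passed.

    A layer with no recorded result is treated as a failure, because
    missing evidence cannot support a passing verdict.
    """
    by_layer = {r.layer: r for r in results}
    for layer in LAYERS:
        result = by_layer.get(layer)
        if result is None or not result.passed:
            return layer
    return None


# Illustrative data: the SQL ran, but it was built on the wrong metric definition.
results = [
    LayerResult("router", True, {"intent_label": "metric_lookup", "confidence": 0.93}),
    LayerResult("semantic_layer", False, {"retrieved": "gross_revenue", "expected": "net_revenue"}),
    LayerResult("sql_synthesis", True, {"query": "SELECT SUM(amount) FROM invoices"}),
    LayerResult("output_rendering", True, {"chart_type": "bar"}),
]

print(first_failing_layer(results))  # semantic_layer
```

Reporting the earliest failing layer, rather than a single blended score, keeps the debugging question tied to the evidence column for that layer.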
A correct query can still be an unsafe answer.
Finance buyers do not experience NL-to-SQL as a parsing task. They experience it as a number they may use in a board pack, budget review, vendor conversation, or audit committee readout.
Materiality
Board-grade questions need a higher threshold than exploratory analyst questions.
Permissions
The same question can be right for a CFO and wrong for a budget owner if role filters change the result.
Data quality
The system needs to know when the data slice cannot support a clean answer.
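The three conditions above can be sketched as a single judging function. This is an illustration under stated assumptions: the tier names, the tolerance values, the `EvalCase` type, and the verdict strings are all hypothetical, and a real system would set them with its finance reviewers.

```python
from dataclasses import dataclass

# Hypothetical relative tolerances: board-grade questions must match exactly,
# exploratory questions may differ from the expected value by up to 1%.
TOLERANCE_BY_TIER = {"board": 0.0, "exploratory": 0.01}


@dataclass
class EvalCase:
    question: str
    tier: str
    expected_by_role: dict  # role name -> expected numeric answer


def judge(case, role, answer, data_slice_complete):
    """Return a verdict for one answer given to one role."""
    # Data quality: an incomplete slice cannot support a clean answer.
    if not data_slice_complete:
        return "caveat_required"
    # Permissions: the expected answer depends on the role's filters.
    expected = case.expected_by_role.get(role)
    if expected is None:
        return "no_expected_answer_for_role"
    # Materiality: the allowed deviation depends on the question tier.
    tolerance = TOLERANCE_BY_TIER[case.tier]
    within = abs(answer - expected) <= tolerance * abs(expected)
    return "pass" if within else "fail"


# Illustrative numbers only.
case = EvalCase(
    question="What was Q3 software spend?",
    tier="board",
    expected_by_role={"cfo": 1_200_000.0, "budget_owner": 300_000.0},
)

print(judge(case, "cfo", 1_200_000.0, True))           # pass
print(judge(case, "budget_owner", 1_200_000.0, True))  # fail
print(judge(case, "cfo", 1_200_000.0, False))          # caveat_required
```

The same numeric answer passes for one role and fails for another, which is why the expected answer is keyed by role rather than by question alone.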
NL-to-SQL evals, answered plainly.
NL-to-SQL evals are not the same as SQL unit tests. SQL unit tests check specific query behavior. NL-to-SQL evals check the full path from user question to final answer, including routing, semantic meaning, permissions, data quality, SQL execution, and output rendering.
The headline metric should be end-to-end answer correctness by question tier. Intermediate scores are useful for debugging, but the final rendered answer is what the business user experiences.
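A minimal sketch of that headline metric, assuming each eval record is a pair of question tier and a boolean for whether the final rendered answer was judged correct. The tier names and the sample records are illustrative.

```python
from collections import defaultdict


def correctness_by_tier(records):
    """Compute end-to-end answer correctness per question tier.

    `records` is an iterable of (tier, is_correct) pairs, where is_correct
    refers to the final rendered answer, not to an intermediate step.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in records:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / totals[tier] for tier in totals}


# Illustrative records only.
records = [
    ("board", True),
    ("board", True),
    ("board", True),
    ("board", False),
    ("exploratory", True),
    ("exploratory", False),
]

scores = correctness_by_tier(records)
print(scores)  # {'board': 0.75, 'exploratory': 0.5}
```

Keeping the tiers separate stops a large pool of easy exploratory questions from masking a weak score on board-grade questions.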
Finance questions often become operating numbers. A wrong answer can change a board read, budget decision, risk signal, or audit finding, so the eval has to include materiality, permissions, and evidence.
Keep the evidence trail connected.
NL-to-SQL evals for finance
The canonical guide that ties answer correctness, dataset quality, golden sets, drift gates, and audit memoranda together.
Answer correctness vs dataset quality
The two eval surfaces production finance systems need.
AI Evals
TrustEvals service work for production eval layers, golden sets, and release gates.
If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.