NL-to-SQL fails differently in finance.
The symptom is usually the same: the AI gave the wrong number. The root causes are different, and the eval layer has to separate them.
NL-to-SQL failure modes in finance include routing errors, tenant-specific semantic mismatch, SQL synthesis errors, dataset-quality failures, row-level permission mistakes, and output rendering choices that change the business meaning of the result.
The same wrong answer can have six causes.
A useful eval layer does not stop at pass or fail. It classifies the failure so engineering, product, data, and risk teams know what to fix.
Routing drift
The question is sent to the wrong path or semantic slice.
Semantic mismatch
The system retrieves the wrong business meaning for a metric or tenant.
SQL error
The query plan, join, filter, aggregation, or time window is wrong.
Data-quality failure
The source data is incomplete, stale, mistyped, or unreconciled.
RBAC failure
The answer ignores or misapplies tenant, row, or role permissions.
Rendering failure
The chart, caveat, or narrative changes how the business user reads the result.
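One way to make these six classes usable in an eval harness is to encode them as an enumeration and assign each failed case the first class whose diagnostic check fails. The sketch below is illustrative only: the check names and the order in which they are evaluated are assumptions, not something the taxonomy prescribes.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    ROUTING_DRIFT = "routing drift"
    SEMANTIC_MISMATCH = "semantic mismatch"
    SQL_ERROR = "SQL error"
    DATA_QUALITY = "data-quality failure"
    RBAC = "RBAC failure"
    RENDERING = "rendering failure"


# Hypothetical diagnostic checks, evaluated in this (assumed) order.
# Each pair maps a boolean check name to the class it indicates on failure.
CHECK_ORDER = [
    ("routed_to_expected_path", RootCause.ROUTING_DRIFT),
    ("metric_definition_matches_tenant", RootCause.SEMANTIC_MISMATCH),
    ("permissions_applied", RootCause.RBAC),
    ("sql_matches_reference", RootCause.SQL_ERROR),
    ("source_data_passes_quality", RootCause.DATA_QUALITY),
    ("rendering_preserves_meaning", RootCause.RENDERING),
]


def classify(checks: dict[str, bool]) -> Optional[RootCause]:
    """Return the first failing class in CHECK_ORDER, or None if all pass.

    A check that is absent from `checks` is treated as passing here;
    a stricter harness might treat it as unknown instead.
    """
    for name, cause in CHECK_ORDER:
        if not checks.get(name, True):
            return cause
    return None
```

Returning only the first failing class keeps each case in a single bucket. A harness that needs to report compound failures could return every failing class instead.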
The dangerous failures look plausible.
Finance NL-to-SQL rarely fails by returning nonsense. It fails by returning a number that looks reasonable enough to use.
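A small worked case shows how a plausible wrong number can arise from a SQL error. In this sketch the tables, column names, and amounts are all hypothetical: a join against a payments table repeats one invoice per payment row, so the total is inflated but still looks like a reasonable figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE payments (id INTEGER PRIMARY KEY, invoice_id INTEGER, paid REAL);
INSERT INTO invoices VALUES (1, 1000.0), (2, 500.0);
-- Invoice 1 was paid in two installments, so it has two payment rows.
INSERT INTO payments VALUES (1, 1, 600.0), (2, 1, 400.0), (3, 2, 500.0);
""")

# Wrong: the join repeats invoice 1 once per payment row before summing.
inflated = conn.execute(
    "SELECT SUM(i.amount) FROM invoices i "
    "JOIN payments p ON p.invoice_id = i.id"
).fetchone()[0]

# Reference: sum the invoices without the fan-out join.
expected = conn.execute("SELECT SUM(amount) FROM invoices").fetchone()[0]

print(f"generated query: {inflated}, reference: {expected}")
```

Neither number is obviously absurd on its own, which is why the comparison against a reference answer matters more than a sanity check on magnitude.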
Each failure needs a named owner.
Model failures, data failures, policy failures, and UX failures should not land in one undifferentiated queue. The audit memorandum should show owner, fix, and re-test status.
| Failure | Likely owner | Re-test evidence |
|---|---|---|
| Semantic mismatch | Product and data | Updated metric definition and passing tenant examples. |
| SQL error | Engineering | Passing trace against expected answer and reviewer result. |
| Data-quality failure | Data owner | Quality check passes or caveat/refusal is added. |
| RBAC failure | Security and platform | Role-sliced replay passes for each permission tier. |
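The table can be carried into tooling as a simple lookup so that a classified failure lands with its owner and expected re-test evidence. This is a minimal sketch: the function and type names are hypothetical, and only the four rows shown above are mapped. Routing drift and rendering failure are left unmapped because the table does not assign them.

```python
from typing import NamedTuple


class Assignment(NamedTuple):
    owner: str
    retest_evidence: str


# Rows copied from the table above; keys are lowercased failure names.
OWNERS = {
    "semantic mismatch": Assignment(
        "Product and data",
        "Updated metric definition and passing tenant examples.",
    ),
    "sql error": Assignment(
        "Engineering",
        "Passing trace against expected answer and reviewer result.",
    ),
    "data-quality failure": Assignment(
        "Data owner",
        "Quality check passes or caveat/refusal is added.",
    ),
    "rbac failure": Assignment(
        "Security and platform",
        "Role-sliced replay passes for each permission tier.",
    ),
}


def assign(failure_class: str) -> Assignment:
    """Look up the owner for a failure class; unmapped classes raise KeyError."""
    key = failure_class.strip().lower()
    if key not in OWNERS:
        raise KeyError(f"no owner defined for {failure_class!r}; triage manually")
    return OWNERS[key]
```

Raising on an unmapped class, rather than falling back to a default owner, keeps unassigned failures from quietly landing in one undifferentiated queue.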
NL-to-SQL fails differently in finance, answered plainly.
Which failure mode dominates depends on the deployment. The important point is that finance teams should separate model, semantic, data, permission, and rendering failures instead of treating every wrong answer as a model issue.
A plausible wrong answer is more likely to be used. In finance, that can turn into a wrong board number, budget read, vendor decision, or risk signal.
Log the trace, root-cause class, materiality tier, owner, remediation, and re-test result. That is the minimum evidence needed for an audit memorandum.
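The six logged items can be captured as one record per failure. The sketch below assumes a flat, string-valued record serialized as one JSON line; the field names, the trace identifier, and the tier label are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureRecord:
    trace_id: str
    root_cause: str
    materiality_tier: str
    owner: str
    remediation: str
    retest_result: str

    def missing_fields(self) -> list[str]:
        """Names of fields that are still blank, in declaration order."""
        return [name for name, value in asdict(self).items() if not value.strip()]


record = FailureRecord(
    trace_id="trace-0001",    # hypothetical identifier
    root_cause="SQL error",
    materiality_tier="high",  # tier labels are an assumption
    owner="Engineering",
    remediation="",           # not yet fixed
    retest_result="",         # not yet re-tested
)

# One JSON line per failure, suitable for an append-only log.
line = json.dumps(asdict(record))
print(record.missing_fields())
```

A record with blank fields is still worth logging. The `missing_fields` check simply makes it visible which failures are not yet ready to appear in an audit memorandum.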
If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.