NL-to-SQL fails differently in finance.

The symptom is usually the same: the AI gave the wrong number. The root causes are different, and the eval layer has to separate them.

NL-to-SQL failure modes in finance include routing errors, tenant-specific semantic mismatch, SQL synthesis errors, dataset-quality failures, row-level permission mistakes, and output rendering choices that change the business meaning of the result.

Taxonomy

The same wrong answer can have six causes.

A useful eval layer does not stop at pass or fail. It classifies the failure so engineering, product, data, and risk teams know what to fix.

Routing drift

The question is sent to the wrong path or semantic slice.

Semantic mismatch

The system retrieves the wrong business meaning for a metric or tenant.

SQL error

The query plan, join, filter, aggregation, or time window is wrong.

Data-quality failure

The source data is incomplete, stale, mistyped, or unreconciled.

RBAC failure

The answer ignores or misapplies tenant, row, or role permissions.

Rendering failure

The chart, caveat, or narrative changes how the business user reads the result.
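
One way to make the classification concrete is a small enum plus a function that reports which classes failed and which were never checked. This is a minimal sketch, not a reference implementation: the names `FailureClass` and `classify` and the shape of `check_results` are assumptions made for illustration.

```python
from enum import Enum


class FailureClass(Enum):
    """The six root-cause classes listed above (identifiers are hypothetical)."""
    ROUTING_DRIFT = "routing_drift"
    SEMANTIC_MISMATCH = "semantic_mismatch"
    SQL_ERROR = "sql_error"
    DATA_QUALITY = "data_quality"
    RBAC = "rbac"
    RENDERING = "rendering"


def classify(check_results: dict[FailureClass, bool]):
    """Split eval outcomes into failed classes and classes with no check at all.

    `check_results` maps a class to True (check passed) or False (check failed).
    A class missing from the dict is reported as unchecked rather than passed,
    so gaps in eval coverage stay visible.
    """
    failed = [fc for fc in FailureClass if check_results.get(fc) is False]
    unchecked = [fc for fc in FailureClass if fc not in check_results]
    return failed, unchecked


# Example: one wrong number, with a failed semantic check and no rendering check.
results = {fc: True for fc in FailureClass}
results[FailureClass.SEMANTIC_MISMATCH] = False
del results[FailureClass.RENDERING]
failed, unchecked = classify(results)
```

Keeping "unchecked" separate from "passed" is a design choice in this sketch: a wrong answer with no rendering or RBAC check attached should not read as a clean pass on those dimensions.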

Finance examples

The dangerous failures look plausible.

Finance NL-to-SQL rarely fails by returning nonsense. It fails by returning a number that looks reasonable enough to use.

Vendor spend excludes uncategorized vendors, but the chart does not say so.
Revenue uses bookings for one tenant and recognized revenue for another.
A budget owner sees a CFO-level total, and the eval misses it because row-level filters were not replayed.
A line chart implies a trend over categories that have no time order.
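
The budget-owner example can be sketched as a role-sliced replay: run the same question once per role, with that role's row filter applied, and compare each result to the number that role should see. The table, the data, the role names, and the `ROLE_COST_CENTERS` policy below are all hypothetical. The sketch uses an in-memory SQLite database only so that it is runnable.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (cost_center TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO spend VALUES (?, ?)",
    [("marketing", 100.0), ("marketing", 50.0), ("engineering", 300.0)],
)

# Hypothetical role-to-row policy: None means the role has no row filter.
ROLE_COST_CENTERS = {
    "cfo": None,
    "marketing_budget_owner": ["marketing"],
}


def total_spend_as(role: str) -> float:
    """Replay the total-spend query with the row filter for `role` applied."""
    allowed = ROLE_COST_CENTERS[role]
    if allowed is None:
        sql, params = "SELECT COALESCE(SUM(amount), 0) FROM spend", []
    else:
        placeholders = ",".join("?" for _ in allowed)
        sql = (
            "SELECT COALESCE(SUM(amount), 0) FROM spend "
            f"WHERE cost_center IN ({placeholders})"
        )
        params = list(allowed)
    return conn.execute(sql, params).fetchone()[0]


def replay_by_role(expected: dict[str, float]) -> dict[str, bool]:
    """Return, per role, whether the replayed total matches the expected total."""
    return {role: total_spend_as(role) == want for role, want in expected.items()}


report = replay_by_role({"cfo": 450.0, "marketing_budget_owner": 150.0})
```

An eval that replays only the unfiltered query would pass on 450.0 and never notice a budget owner being shown the same figure. The per-role expected values are what make the leak detectable.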

Remediation

Each failure needs a named owner.

Model failures, data failures, policy failures, and UX failures should not land in one undifferentiated queue. The audit memorandum should show owner, fix, and re-test status.

| Failure | Likely owner | Re-test evidence |
| --- | --- | --- |
| Semantic mismatch | Product and data | Updated metric definition and passing tenant examples. |
| SQL error | Engineering | Passing trace against expected answer and reviewer result. |
| Data-quality failure | Data owner | Quality check passes or caveat/refusal is added. |
| RBAC failure | Security and platform | Role-sliced replay passes for each permission tier. |

FAQ

NL-to-SQL fails differently in finance, answered plainly.

It depends on the deployment. The important point is that finance teams should separate model, semantic, data, permission, and rendering failures instead of treating every wrong answer as a model issue.

A plausible wrong answer is more likely to be used. In finance, that can turn into a wrong board number, budget read, vendor decision, or risk signal.

Log the trace, root-cause class, materiality tier, owner, remediation, and re-test result. That is the minimum evidence needed for an audit memorandum.
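
As a rough illustration, the six logged fields could be captured in one immutable record and written as a JSON line. The class name, field names, and example values here are assumptions for the sketch, not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FailureRecord:
    """One classified failure; field names are hypothetical."""
    trace_id: str           # pointer to the stored question, SQL, and result
    root_cause_class: str   # e.g. "rbac", "semantic_mismatch"
    materiality_tier: str   # how much the wrong number could matter
    owner: str              # team accountable for the fix
    remediation: str        # what was changed
    retest_result: str      # outcome of the re-test


def to_log_line(record: FailureRecord) -> str:
    """Serialize a record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Example values are invented for illustration.
record = FailureRecord(
    trace_id="trace-0001",
    root_cause_class="rbac",
    materiality_tier="high",
    owner="Security and platform",
    remediation="Apply row-level filter before aggregation",
    retest_result="pass",
)
line = to_log_line(record)
```

Making every field required means a record cannot be written without an owner or a re-test result, which is the property the audit memorandum depends on.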

If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.

Bring one workflow, vendor, or AI portfolio. We will map the evidence needed for finance leaders to fund, ship, or stop it.
