Answer correctness and dataset quality are different evals.
A system can generate the right SQL and still produce the wrong finance answer if the underlying dataset cannot support the question.
Answer correctness measures whether the final rendered answer matches the expected result. Dataset quality measures whether the table, metric, time window, currency, role slice, and source-system reconciliation are sound enough for the system to answer at all.
One surface scores the answer. The other scores the ground.
When teams collapse these surfaces into one accuracy number, they lose the ability to tell whether the model failed or the data failed.
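A minimal sketch of keeping the two surfaces apart in an eval record. The names `EvalResult` and `attribute` are hypothetical, not taken from any particular framework; the point is only that two booleans can name the failing surface where one accuracy number cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    answer_correct: bool  # did the rendered answer match the expected result?
    dataset_fit: bool     # did the data slice pass its quality checks?


def attribute(result: EvalResult) -> str:
    """Name the failing surface instead of folding both into one score."""
    if result.answer_correct and result.dataset_fit:
        return "pass"
    if result.dataset_fit:
        return "model_failure"  # the data was sound, the answer was wrong
    if result.answer_correct:
        return "data_failure"   # the answer matched, the ground was unsound
    return "both_failed"


def accuracy(results: list) -> float:
    """The collapsed number: share of cases that pass both surfaces."""
    return sum(attribute(r) == "pass" for r in results) / len(results)


# Two runs with the same collapsed accuracy and different root causes.
run_a = [EvalResult(True, True), EvalResult(False, True)]
run_b = [EvalResult(True, True), EvalResult(True, False)]
```

Both runs score 0.5 on the collapsed metric, yet `run_a` points at the model and `run_b` points at the data.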
| Surface | Question | Failure it catches |
|---|---|---|
| Answer correctness | Did the final response match the expected answer? | Wrong query plan, wrong SQL, wrong aggregation, wrong chart, wrong narrative. |
| Dataset quality | Was the data slice fit to answer the question? | Missing rows, stale close, currency mismatch, type inconsistency, bad category mapping. |
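Some of the dataset-quality failures in the table can be checked before the model is scored at all. The sketch below covers three of them (missing rows, stale close, currency mismatch). `SliceStats`, the field names, and the default thresholds are illustrative assumptions, not values from the source; real thresholds depend on the use case.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SliceStats:
    expected_rows: int     # row count implied by the source system
    actual_rows: int       # row count found in the queried slice
    last_close: date       # most recent closed period in the slice
    currencies: frozenset  # currency codes present in the slice


def dataset_quality_issues(
    stats: SliceStats,
    as_of: date,
    report_currency: str,
    min_completeness: float = 0.99,  # assumed threshold
    max_close_age_days: int = 45,    # assumed threshold
) -> list:
    """Return the named issues found on one data slice (empty if none)."""
    issues = []
    completeness = (
        stats.actual_rows / stats.expected_rows if stats.expected_rows else 0.0
    )
    if completeness < min_completeness:
        issues.append("missing_rows")
    if (as_of - stats.last_close).days > max_close_age_days:
        issues.append("stale_close")
    if stats.currencies != {report_currency}:
        issues.append("currency_mismatch")
    return issues


bad = SliceStats(1000, 950, date(2024, 1, 31), frozenset({"USD", "EUR"}))
good = SliceStats(1000, 1000, date(2024, 3, 31), frozenset({"USD"}))
```

Returning named issues rather than a single pass/fail flag keeps the result usable later as a caveat or an exception-log entry.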
Refusal is sometimes the correct answer.
Finance users should not get a confident answer over a failed data slice. The system needs an explicit path for caveats, confidence downgrade, and refusal.
| Outcome | When it applies |
|---|---|
| Answer | The answer passes both the semantic and dataset-quality checks. |
| Caveat | The answer is useful, but a named data-quality issue limits interpretation. |
| Refuse | The data slice is below threshold, so the system should not answer as if it knows. |
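The three outcomes above can be sketched as one routing function. Everything named here (`Disposition`, `Decision`, `decide`, the `quality_score` input, the `refuse_below` default of 0.8) is a hypothetical illustration. The source does not say what should happen when the semantic check fails on a sound slice; this sketch refuses in that case, which is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    ANSWER = "answer"
    CAVEAT = "caveat"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Decision:
    disposition: Disposition
    reasons: tuple = ()


def decide(
    semantic_ok: bool,
    quality_score: float,
    issues: list,
    refuse_below: float = 0.8,  # assumed threshold, set per use case
) -> Decision:
    """Route one response to answer, caveat, or refuse."""
    if quality_score < refuse_below:
        # The slice is below threshold: do not answer as if the system knows.
        return Decision(
            Disposition.REFUSE, tuple(issues) or ("quality_below_threshold",)
        )
    if not semantic_ok:
        # Assumption: the source does not cover this case; the sketch refuses.
        return Decision(Disposition.REFUSE, ("semantic_check_failed",))
    if issues:
        # Usable answer, but a named issue limits interpretation.
        return Decision(Disposition.CAVEAT, tuple(issues))
    return Decision(Disposition.ANSWER)
```

For example, `decide(True, 0.93, ["stale_close"])` yields a caveat that carries the issue name, so the user-facing text can say what limits the answer.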
Dataset quality creates scope limitations.
In audit language, a data-quality failure is not the same as a model exception. It may mean the system cannot issue a clean opinion on that answer surface.
Answer correctness and dataset quality are different evals, answered plainly.
A query can be structurally correct and still return a misleading result because the source data is incomplete, stale, or misclassified.
For material finance questions, if the slice fails a threshold that matters for the use case, the system should caveat or refuse instead of producing a confident answer.
Dataset-quality evidence should appear in the trace, user caveat, dashboard, exception log, and audit memorandum so the same evidence serves product, risk, and audit review.
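One way to make the same evidence serve every reviewer is to serialize it once and hand the identical payload to each destination. This is a sketch under assumptions: `QualityEvidence`, its fields, and `fan_out` are hypothetical names, and the sinks are plain dictionary keys standing in for real systems.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QualityEvidence:
    question_id: str
    disposition: str      # "answer", "caveat", or "refuse"
    issues: tuple         # named data-quality issues, if any
    quality_score: float


# The five destinations named in the text.
SINKS = ("trace", "user_caveat", "dashboard", "exception_log", "audit_memorandum")


def fan_out(evidence: QualityEvidence) -> dict:
    """Serialize once so every sink receives byte-identical evidence."""
    payload = json.dumps(asdict(evidence), sort_keys=True)
    return {sink: payload for sink in SINKS}


record = QualityEvidence("q-001", "caveat", ("stale_close",), 0.93)
delivered = fan_out(record)
```

Serializing once, rather than letting each destination rebuild its own view, means product, risk, and audit are reading the same facts.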
If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.