Answer correctness and dataset quality are different evals.

A system can generate the right SQL and still produce the wrong finance answer if the underlying dataset cannot support the question.

Answer correctness measures whether the final rendered answer matches the expected result. Dataset quality measures whether the table, metric, time window, currency, role slice, and source-system reconciliation are sound enough for the system to answer at all.
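As a minimal sketch of keeping the two measurements apart, the Python below records each one as its own object. Every name, field, and threshold here (`AnswerCheck`, `DatasetCheck`, the 0.5% tolerance, the 99% completeness floor) is a hypothetical chosen for illustration, not something this article prescribes.

```python
from dataclasses import dataclass


@dataclass
class AnswerCheck:
    """Answer correctness: does the rendered answer match the expected result?"""
    expected: float
    actual: float
    tolerance: float = 0.005  # hypothetical relative tolerance (0.5%)

    def passed(self) -> bool:
        if self.expected == 0:
            return abs(self.actual) <= self.tolerance
        return abs(self.actual - self.expected) / abs(self.expected) <= self.tolerance


@dataclass
class DatasetCheck:
    """Dataset quality: is the data slice sound enough to answer at all?"""
    row_completeness: float          # share of expected rows present, 0.0 to 1.0
    period_closed: bool              # has the accounting period been closed?
    single_currency: bool            # are all amounts in one currency?
    min_completeness: float = 0.99   # hypothetical threshold

    def issues(self) -> list[str]:
        found = []
        if self.row_completeness < self.min_completeness:
            found.append("missing rows")
        if not self.period_closed:
            found.append("stale close")
        if not self.single_currency:
            found.append("currency mismatch")
        return found

    def passed(self) -> bool:
        return not self.issues()


# The answer matches what the query should return, yet the slice is not fit.
answer = AnswerCheck(expected=1_250_000.0, actual=1_250_000.0)
dataset = DatasetCheck(row_completeness=0.92, period_closed=True, single_currency=False)

report = {
    "answer_correct": answer.passed(),
    "dataset_fit": dataset.passed(),
    "dataset_issues": dataset.issues(),
}
print(report)
```

Reporting the two results as separate fields, rather than one blended score, is what lets a reviewer see that the query behaved while the data did not.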

Two surfaces

One surface scores the answer. The other scores the ground.

When teams collapse these surfaces into one accuracy number, they lose the ability to tell whether the model failed or the data failed.

| Surface | Question | Failure it catches |
| --- | --- | --- |
| Answer correctness | Did the final response match the expected answer? | Wrong query plan, wrong SQL, wrong aggregation, wrong chart, wrong narrative. |
| Dataset quality | Was the data slice fit to answer the question? | Missing rows, stale close, currency mismatch, type inconsistency, bad category mapping. |

User experience

Refusal is sometimes the correct answer.

Finance users should not get a confident answer over a failed data slice. The system needs an explicit path for caveats, confidence downgrade, and refusal.

Answer

The answer passes both the semantic and dataset-quality checks.

Caveat

The answer is useful, but a named data-quality issue limits interpretation.

Refuse

The data slice is below threshold, so the system should not answer as if it knows.
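The three paths above can be sketched as a single routing function. This is an illustration under stated assumptions: the `decide` function, the 0.80 refusal threshold, and the idea of a single numeric quality score are all hypothetical, and the handling of a failed semantic check (refuse, with a model-side reason) is an assumption the text does not spell out.

```python
from enum import Enum


class Outcome(Enum):
    ANSWER = "answer"
    CAVEAT = "caveat"
    REFUSE = "refuse"


def decide(
    semantic_ok: bool,
    quality_score: float,
    issues: list[str],
    refuse_below: float = 0.80,  # hypothetical threshold for the data slice
) -> tuple[Outcome, str]:
    """Route a response to answer, caveat, or refuse, with a reason string."""
    # Data slice below threshold: do not answer as if the system knows.
    if quality_score < refuse_below:
        return Outcome.REFUSE, "data slice below threshold"
    # Assumption: a failed semantic check is a model-side refusal.
    if not semantic_ok:
        return Outcome.REFUSE, "semantic check failed"
    # Usable answer, but a named data-quality issue limits interpretation.
    if issues:
        return Outcome.CAVEAT, "; ".join(issues)
    # Both the semantic and dataset-quality checks passed.
    return Outcome.ANSWER, ""


print(decide(True, 0.97, []))
print(decide(True, 0.90, ["stale close"]))
print(decide(True, 0.60, ["missing rows"]))
```

Returning the reason alongside the outcome keeps the caveat text tied to the named issue, so the same string can be shown to the user and written to the trace.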

Audit impact

Dataset quality creates scope limitations.

In audit language, a data-quality failure is not the same as a model exception. It may mean the system cannot issue a clean opinion on that answer surface.

Separate model failures from data failures in the exception log.
Map dataset-quality checks to the exact metric, table, tenant, and time period.
Make caveats visible in the user interface and in the audit memorandum.
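One way to picture an exception-log entry that follows these three rules is sketched below. The `ExceptionEntry` shape, its field names, and the sample values (`net_revenue`, `gl_entries`, `tenant-a`, `2024-Q4`) are hypothetical and stand in for whatever schema a team already uses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExceptionEntry:
    failure_type: str  # "model" or "data", kept as separate classes of failure
    check: str         # which check raised the exception
    metric: str        # exact metric the check is mapped to
    table: str         # exact table
    tenant: str        # exact tenant
    period: str        # exact time period
    user_caveat: str   # text shown in the UI and carried into the audit memo


def log_line(entry: ExceptionEntry) -> str:
    """Serialize one entry as a JSON line, rejecting unclassified failures."""
    if entry.failure_type not in ("model", "data"):
        raise ValueError(f"unknown failure_type: {entry.failure_type!r}")
    return json.dumps(asdict(entry), sort_keys=True)


entry = ExceptionEntry(
    failure_type="data",
    check="currency_consistency",
    metric="net_revenue",
    table="gl_entries",
    tenant="tenant-a",
    period="2024-Q4",
    user_caveat="Amounts mix currencies for this period; totals are indicative only.",
)
print(log_line(entry))
```

Forcing every entry to declare `failure_type` means the log can be filtered into model exceptions and data-driven scope limitations without re-reading each trace.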

FAQ

Answer correctness and dataset quality are different evals, answered plainly.

A query can be structurally correct and still return a misleading result because the source data is incomplete, stale, or misclassified.

For material finance questions, a slice that fails a threshold that matters for the use case should lead the system to caveat or refuse instead of producing a confident answer.

Dataset-quality evidence should appear in the trace, user caveat, dashboard, exception log, and audit memorandum so the same evidence serves product, risk, and audit review.

If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.
