Evaluate the semantic layer before SQL.
Most NL-to-SQL errors are not purely SQL errors. They start when the system retrieves the wrong business meaning for the question.
Semantic layer evaluation measures whether an AI system maps finance schema objects to the right business meaning: metric definitions, synonyms, sample values, tenant overlays, time logic, role context, and permitted data slices.
Schema accuracy is not business accuracy.
A model can select the right table and still use the wrong meaning. Finance metrics vary by tenant, workflow, close state, policy, and role.
The semantic layer needs its own scorecard.
Do not wait for the final SQL answer to discover semantic drift. Evaluate retrieval quality before the SQL writer gets the context.
| Dimension | What to check | Evidence |
|---|---|---|
| Metric meaning | Does the system retrieve the right definition? | Definition ID, version, owner, and expected usage. |
| Synonyms | Does natural language map to the right field? | Accepted terms, rejected terms, and examples. |
| Tenant overlay | Does the tenant-specific logic override the generic definition? | Overlay version and tenant-specific examples. |
| Role context | Does the retrieved meaning match the user's permission slice? | Role replay and filtered result expectation. |
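The four dimensions above can be scored on the retrieved context alone, before any SQL is written. The following is a minimal sketch under stated assumptions, not a reference implementation: `RetrievedContext`, `score_retrieval`, and every sample value are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RetrievedContext:
    """Hypothetical shape of the context handed to the SQL writer."""
    definition_id: str    # metric meaning
    mapped_field: str     # synonym resolution
    overlay_version: str  # tenant overlay ("" when none was applied)
    role_slice: str       # permitted data slice for the user's role


def score_retrieval(expected: RetrievedContext, actual: RetrievedContext) -> dict[str, bool]:
    """Score each dimension separately so a failure is attributable."""
    return {
        f.name: getattr(expected, f.name) == getattr(actual, f.name)
        for f in fields(RetrievedContext)
    }


# Illustrative values only: the retriever found the right metric and field
# but missed the tenant overlay.
expected = RetrievedContext("net_revenue", "invoice.net_amount", "tenant_a/v3", "region=EMEA")
actual = RetrievedContext("net_revenue", "invoice.net_amount", "", "region=EMEA")

scorecard = score_retrieval(expected, actual)
failed = [name for name, ok in scorecard.items() if not ok]
print(scorecard)
print("failed dimensions:", failed)
```

Scoring each dimension on its own, rather than producing a single pass/fail, shows which part of the semantic layer drifted when a case fails.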
Treat definitions as controlled evidence.
The semantic layer should be editable by the right domain owner, versioned like code, and replayed against the golden set after every material change.
Owner
Every material metric needs a named business owner, not only a column description.
Version
Metric definitions should carry version history so old answers can be reconstructed.
Replay
Semantic edits should trigger golden-set replay for affected personas, tenants, and question tiers.
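Owner, version, and replay can be tied together in one small data model. The sketch below assumes a simple in-memory history and a golden set tagged by metric; `MetricDefinition`, `GoldenCase`, `publish`, and `cases_to_replay` are hypothetical names, and the sample definitions are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    metric_id: str
    version: int
    owner: str  # named business owner, not a column description
    text: str


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    metric_id: str
    persona: str
    tenant: str
    tier: str


def publish(history: dict[str, list[MetricDefinition]], new: MetricDefinition) -> None:
    """Append a new version; earlier versions stay so old answers can be reconstructed."""
    if not new.owner:
        raise ValueError("every definition needs a named owner")
    versions = history.setdefault(new.metric_id, [])
    if versions and new.version != versions[-1].version + 1:
        raise ValueError("versions must increase by one")
    versions.append(new)


def cases_to_replay(golden_set: list[GoldenCase], changed_metric_id: str) -> list[GoldenCase]:
    """Select the golden cases affected by an edit to one metric."""
    return [case for case in golden_set if case.metric_id == changed_metric_id]


history: dict[str, list[MetricDefinition]] = {}
publish(history, MetricDefinition("gross_margin", 1, "fpa_owner", "revenue minus COGS"))
publish(history, MetricDefinition("gross_margin", 2, "fpa_owner", "revenue minus COGS and freight"))

golden_set = [
    GoldenCase("g1", "gross_margin", "controller", "tenant_a", "tier1"),
    GoldenCase("g2", "dso", "analyst", "tenant_a", "tier2"),
    GoldenCase("g3", "gross_margin", "analyst", "tenant_b", "tier1"),
]
affected = cases_to_replay(golden_set, "gross_margin")
print([case.case_id for case in affected])
```

Because each golden case carries its persona, tenant, and tier, the same selection can be narrowed further when an edit touches only one tenant's overlay.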
Evaluate the semantic layer before SQL, answered plainly.
A semantic layer is not the same as a schema. The schema names tables and columns. The semantic layer maps those objects to business meaning: metric definitions, synonyms, tenant logic, and role context.
Finance metrics often vary by business model or customer. A generic definition can be directionally right and still wrong for a specific tenant.
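One way to picture a tenant overlay is as a lookup that takes precedence over the generic definition when it exists. This is a sketch under assumptions, not a prescribed design: the `GENERIC` and `OVERLAYS` tables, the `resolve_definition` function, and the metric rules are all hypothetical.

```python
from typing import TypedDict


class Definition(TypedDict):
    version: str
    rule: str


# Hypothetical generic definitions shared by all tenants.
GENERIC: dict[str, Definition] = {
    "arr": {"version": "generic/v1", "rule": "annualized value of active subscriptions"},
}

# Hypothetical tenant-specific overrides, keyed by (tenant, metric).
OVERLAYS: dict[tuple[str, str], Definition] = {
    ("tenant_a", "arr"): {
        "version": "tenant_a/v2",
        "rule": "annualized value of active subscriptions, excluding pilot contracts",
    },
}


def resolve_definition(metric_id: str, tenant_id: str) -> Definition:
    """Return the tenant overlay if one exists, else the generic definition."""
    overlay = OVERLAYS.get((tenant_id, metric_id))
    return overlay if overlay is not None else GENERIC[metric_id]


print(resolve_definition("arr", "tenant_a")["version"])
print(resolve_definition("arr", "tenant_b")["version"])
```

An evaluation case for `tenant_a` would then expect the overlay version in the retrieved context, and a retrieval that returns the generic version counts as a failure even though the metric name matches.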
To test the semantic layer, create examples where the correct answer depends on a specific metric definition, synonym, tenant overlay, or role slice. Then evaluate whether retrieval selects the expected context before SQL generation.
Keep the evidence trail connected.
Golden dataset for NL-to-SQL
The corpus that captures semantic and tenant examples.
NL-to-SQL evals for finance
The canonical guide that ties answer correctness, dataset quality, golden sets, drift gates, and audit memoranda together.
AI Evals
TrustEvals service work for production eval layers, golden sets, and release gates.
If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.
Bring one workflow, vendor, or AI portfolio. We will map the evidence needed for finance leaders to fund, ship, or stop it.