Build the golden dataset for NL-to-SQL.

The golden dataset is the working-paper substrate behind answer correctness, prompt optimization, drift gates, and the audit memorandum.

A golden dataset for NL-to-SQL is a versioned corpus of representative finance questions, each annotated with its expected answer, persona, question tier, materiality threshold, tenant context, role context, data dependencies, and failure-mode metadata.

Schema

A flat list of questions is too weak.

The dataset has to preserve the business context that makes the question material. Without metadata, the score becomes a flat average that hides the failures that matter.
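As a small illustration of that point, the sketch below compares a flat pass rate with a per-tier pass rate. The tier names and result counts are hypothetical and chosen only to show the effect. They are not taken from a real evaluation run.

```python
from collections import defaultdict

# Hypothetical evaluation results as (question_tier, passed) pairs.
# 5 tier-1 examples (2 failures) and 45 tier-3 examples (no failures).
results = (
    [("tier_1", False)] * 2
    + [("tier_1", True)] * 3
    + [("tier_3", True)] * 45
)

def flat_pass_rate(rows):
    """Single average over every example, ignoring tier."""
    return sum(passed for _, passed in rows) / len(rows)

def pass_rate_by_tier(rows):
    """Pass rate computed separately for each tier."""
    totals = defaultdict(lambda: [0, 0])  # tier -> [passed, total]
    for tier, passed in rows:
        totals[tier][0] += int(passed)
        totals[tier][1] += 1
    return {tier: p / n for tier, (p, n) in totals.items()}

flat = flat_pass_rate(results)
by_tier = pass_rate_by_tier(results)

print(f"flat: {flat:.2f}")
for tier, rate in sorted(by_tier.items()):
    print(f"{tier}: {rate:.2f}")
# flat: 0.96
# tier_1: 0.60
# tier_3: 1.00
```

In this made-up run the flat score looks healthy at 0.96 while two of the five board-grade questions fail, which is the kind of result tier metadata is meant to surface.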

01

User question

The natural-language question in the language a finance user would actually ask.

02

Expected answer

The scalar value, table shape, chart shape, caveat, or refusal the system should produce.

03

Persona and role

The persona asking the question (CFO, FP&A analyst, controller, budget owner, or internal auditor) and the permissions slice being evaluated.

04

Tier and materiality

The question's tier and the materiality threshold that determines whether a failure becomes an exception or an appendix note.

05

Data dependencies

The source tables, metric definitions, time window, currency treatment, and reconciliation expectations.

06

Failure mode if wrong

The business consequence: wrong board number, incorrect budget read, bad vendor decision, or audit limitation.
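One way to hold these six fields in code is a typed record. The sketch below is a minimal illustration, not a prescribed schema: the class name `GoldenRecord`, the field names, and the `missing_fields` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, fields
from typing import Any

@dataclass(frozen=True)
class GoldenRecord:
    """Hypothetical container for one golden-set example."""
    query_id: str
    user_question: str                      # 01 User question
    expected_answer: dict[str, Any]         # 02 Expected answer
    persona: str                            # 03 Persona and role
    role_slice: str
    question_tier: str                      # 04 Tier and materiality
    materiality_threshold: float
    data_dependencies: dict[str, list[str]] # 05 Data dependencies
    failure_mode_if_wrong: str              # 06 Failure mode if wrong

    def missing_fields(self) -> list[str]:
        """Names of fields that are still empty and need an owner's input."""
        empty = ("", None, {}, [])
        return [f.name for f in fields(self) if getattr(self, f.name) in empty]

# A draft record whose business consequence has not been written yet.
draft = GoldenRecord(
    query_id="EXAMPLE_001",
    user_question="What was vendor spend by category last quarter versus plan?",
    expected_answer={"type": "chart_plus_narrative"},
    persona="cfo",
    role_slice="executive",
    question_tier="tier_1",
    materiality_threshold=0.95,
    data_dependencies={"tables": ["ap_invoices"]},
    failure_mode_if_wrong="",
)

print(draft.missing_fields())  # ['failure_mode_if_wrong']
```

Making every field required at construction time means a bare question cannot enter the corpus without its context, and the helper flags fields that were filled with placeholders.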

Seed set

Start manual before synthetic.

Synthetic variants are useful after ground truth exists. They are dangerous when they become the ground truth. The first seed set should come from domain experts and production-like questions.

Write tier-1 examples with the people who know the finance workflow.
Use synthetic paraphrases only after a known-good answer exists.
Review tenant-specific business logic before marking an example production-ready.
Promote production questions into the golden set after human review, not automatically.
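
The four rules above can be sketched as a promotion gate that a harness runs before an example is marked production-ready. This is a minimal sketch under stated assumptions: the `Candidate` class, its field names, and the `source` values are hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    """Hypothetical candidate example awaiting promotion."""
    query_id: str
    source: str                          # "expert", "synthetic", or "production"
    tier: str
    parent_id: Optional[str] = None      # seed example a synthetic paraphrase derives from
    domain_expert_involved: bool = False
    tenant_logic_reviewed: bool = False
    human_reviewed: bool = False

def blocking_reasons(c: Candidate, verified_ids: set[str]) -> list[str]:
    """Return the reasons a candidate cannot be promoted; empty means it can."""
    reasons = []
    # Rule 1: tier-1 examples are written with people who know the workflow.
    if c.tier == "tier_1" and not c.domain_expert_involved:
        reasons.append("tier_1 example lacks domain expert involvement")
    # Rule 2: synthetic paraphrases need a seed with a known-good answer.
    if c.source == "synthetic" and c.parent_id not in verified_ids:
        reasons.append("synthetic variant has no verified parent example")
    # Rule 3: tenant-specific business logic is reviewed before promotion.
    if not c.tenant_logic_reviewed:
        reasons.append("tenant-specific business logic not reviewed")
    # Rule 4: production questions are promoted only after human review.
    if c.source == "production" and not c.human_reviewed:
        reasons.append("production question not human reviewed")
    return reasons

verified = {"SEED_001"}  # ids of examples with a known-good answer
paraphrase = Candidate("SYN_007", "synthetic", "tier_2",
                       parent_id="SEED_999", tenant_logic_reviewed=True)
print(blocking_reasons(paraphrase, verified))
# ['synthetic variant has no verified parent example']
```

Returning reasons instead of a bare boolean keeps the gate auditable: a reviewer can see why an example was held back.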

Example

The finance NL-to-SQL record needs audit fields.

The example below extends a generic golden-set pattern with the fields a finance deployment needs for evidence.

query_id: FIN_NL_SQL_001
status: production_ready
persona: cfo
role_slice: executive
question_tier: tier_1
materiality_threshold: 0.95

user_question: "What was vendor spend by category last quarter versus plan?"

expected_answer:
  type: chart_plus_narrative
  chart_shape: grouped_bar
  required_caveats:
    - "Only approved vendor categories included"

data_dependencies:
  tables:
    - ap_invoices
    - vendor_master
    - budget_plan
  checks:
    - currency_normalized
    - category_mapping_complete
    - quarter_closed

failure_mode_if_wrong: "Wrong board-level operating expense read"
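
As a hedged illustration of how a harness might consume the `expected_answer` block, the sketch below checks a system response for the expected type, chart shape, and required caveats. The response structure (`type`, `chart_shape`, `narrative`) and the `grade_shape` function are assumptions for this example, and matching caveats by exact substring is a deliberate simplification.

```python
# The expected_answer block from the record above, as a Python dict.
expected_answer = {
    "type": "chart_plus_narrative",
    "chart_shape": "grouped_bar",
    "required_caveats": ["Only approved vendor categories included"],
}

def grade_shape(expected: dict, response: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the shape checks pass."""
    problems = []
    if response.get("type") != expected["type"]:
        problems.append(f"type: expected {expected['type']}, got {response.get('type')}")
    if response.get("chart_shape") != expected.get("chart_shape"):
        problems.append(
            f"chart_shape: expected {expected.get('chart_shape')}, "
            f"got {response.get('chart_shape')}"
        )
    # Simplification: a caveat counts as present only if its exact text
    # (case-insensitive) appears in the narrative.
    narrative = response.get("narrative", "").lower()
    for caveat in expected.get("required_caveats", []):
        if caveat.lower() not in narrative:
            problems.append(f"missing caveat: {caveat}")
    return problems

# Hypothetical system response with the right chart but no caveat.
response = {
    "type": "chart_plus_narrative",
    "chart_shape": "grouped_bar",
    "narrative": "Vendor spend ran above plan in two categories.",
}
print(grade_shape(expected_answer, response))
# ['missing caveat: Only approved vendor categories included']
```

This covers only the shape of the answer. Checking the numbers themselves would need the reconciliation expectations listed under the record's data dependencies.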

FAQ

Build the golden dataset for NL-to-SQL, answered plainly.

The exact number of examples depends on scope, but the important design choice is tiering. A small number of board-grade questions may need many examples each, while long-tail exploratory questions can start with fewer examples and grow from production traffic.

Engineering should own the harness. Product, customer success, and domain experts should own the working-paper content because they are closest to the buyer's actual questions.

The golden set can start as version-controlled YAML. As soon as the corpus becomes operational, finance teams need an editable surface so domain experts can add, tier, and correct examples without filing pull requests.

If a finance AI answer can move an operating decision, the evidence behind it needs to be readable after the answer is gone.
