The golden set YAML template for multi-tenant AI products.

A practical schema for teams that need real user queries, expected behavior, tenant context, and regression gates to live in one durable corpus.

A golden set is a versioned corpus of real or representative user queries with an explicit statement of expected behavior. For multi-tenant AI products, the schema should separate tenant-agnostic intent from tenant-specific data so the same behavior can be tested across customers without copy-paste rewrites.

Schema

Separate what the user means from what each tenant contains.

The load-bearing design decision is the intent/data split. The intent layer describes what the user is asking the system to do. The data layer describes how that expectation maps onto each tenant's calendar, catalog, result shape, and known edge cases.

query_id: QUERY_001
version: 2
status: production_ready
category: headcount_aggregation
priority: standard

user_query: |
  How many engineers do we have right now?

intent:
  primary_action: count
  entity: employees
  filter:
    department: engineering
  temporal_reference: present

expected_behavior:
  route_to: aggregation_pipeline
  checkpoint_order:
    - route
    - vector_retrieval
    - payload_generation
    - aggregate_validation

data:
  tenant_a:
    fiscal_context:
      period_label: current_quarter
      period_start: "2026-04-01"
      period_end: "2026-06-30"
    expected_result_shape:
      type: scalar_integer
      min_plausible: 20
      max_plausible: 500
    baseline_method: human_verified
    known_failure_modes:
      - name: contractor_inclusion_drift
        description: "Do not alert when documented contractor handling changes the count."
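The `expected_result_shape` block above is designed to be checked mechanically. A minimal sketch of that check, assuming the YAML has been parsed into plain dicts (the `validate_scalar_result` helper and its return convention are hypothetical, not part of the schema):

```python
def validate_scalar_result(value, shape):
    """Check a pipeline answer against a parsed expected_result_shape block.

    `shape` is the YAML mapping, e.g.
    {"type": "scalar_integer", "min_plausible": 20, "max_plausible": 500}.
    Returns (passed, message).
    """
    if shape["type"] == "scalar_integer":
        if not isinstance(value, int):
            return False, f"expected integer, got {type(value).__name__}"
        if not shape["min_plausible"] <= value <= shape["max_plausible"]:
            return False, (
                f"{value} outside "
                f"[{shape['min_plausible']}, {shape['max_plausible']}]"
            )
        return True, "ok"
    raise ValueError(f"unknown shape type: {shape['type']}")

shape = {"type": "scalar_integer", "min_plausible": 20, "max_plausible": 500}
result = validate_scalar_result(142, shape)   # (True, 'ok')
```

Bounds checks like this are deliberately loose: they catch a count of 7 or 70,000 without requiring the baseline to pin an exact headcount that changes every week.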

Design decisions

The boring fields are the ones that keep CI useful.

A usable golden set is not just a list of prompts. It carries workflow state, expected checkpoints, tenant coverage, and known failure modes so the runner can tell the difference between a real regression and a documented source of volatility.

Three-state status

Use pending, baseline_captured, and production_ready so teams can report deltas before they are ready to block a build.
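The gating rule this implies can be sketched in a few lines. This is an assumed policy, not a prescribed one: only `production_ready` queries may fail a build, while earlier states surface as warnings (`ci_outcome` is a hypothetical helper name):

```python
# Only production_ready queries can block a build; pending and
# baseline_captured queries report deltas without failing CI.
BLOCKING_STATUSES = {"production_ready"}
KNOWN_STATUSES = {"pending", "baseline_captured", "production_ready"}

def ci_outcome(status, passed):
    """Return 'pass', 'warn', or 'block' for one golden-set query."""
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {status}")
    if passed:
        return "pass"
    return "block" if status in BLOCKING_STATUSES else "warn"
```

The useful property is that a team can land new queries in `pending`, watch them report for a few runs, and promote them without ever having had a silent, unenforced period in CI.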

Explicit checkpoint order

Route, retrieval, payload, execution, and synthesis should fail independently so the team knows which stage broke.
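Given the `checkpoint_order` list from the schema, localizing the failure is a prefix comparison. A minimal sketch, assuming the runner records the stages a query actually reached as an ordered list (`first_broken_checkpoint` is a hypothetical helper):

```python
def first_broken_checkpoint(expected, observed):
    """Return the first expected stage the run missed or reached out of
    order, or None if the observed trace matches the expected order."""
    for i, stage in enumerate(expected):
        if i >= len(observed) or observed[i] != stage:
            return stage
    return None

EXPECTED = ["route", "vector_retrieval",
            "payload_generation", "aggregate_validation"]
```

A run that dies after retrieval reports `payload_generation` as the broken stage, so the team debugs the payload step instead of rereading the whole trace.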

Tenant-keyed data

A missing tenant block should mean the query is not yet baselined for that tenant, not that someone forgot to copy a file.
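A runner can enforce that semantics directly by partitioning tenants per query instead of erroring on absence. A minimal sketch over the parsed YAML (the `tenant_plan` helper is hypothetical):

```python
def tenant_plan(query, all_tenants):
    """Split tenants into those with a baselined data block for this
    query and those that should be reported as not yet baselined."""
    baselined = set(query.get("data", {}))
    return {
        "run": [t for t in all_tenants if t in baselined],
        "not_baselined": [t for t in all_tenants if t not in baselined],
    }

plan = tenant_plan({"data": {"tenant_a": {}}}, ["tenant_a", "tenant_b"])
```

Here `tenant_b` lands in `not_baselined` rather than raising an error, which is exactly the signal the coverage report needs.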

Known failure modes

Documented edge cases prevent baseline churn and preserve decisions that would otherwise live in one engineer's memory.
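The alerting side of this is a set-membership check against the `known_failure_modes` entries from the schema. A minimal sketch (the `should_alert` helper is hypothetical):

```python
def should_alert(delta_name, known_failure_modes):
    """Escalate only deltas that are not documented sources of
    volatility; documented modes are reported but never page anyone."""
    documented = {mode["name"] for mode in known_failure_modes}
    return delta_name not in documented

modes = [{
    "name": "contractor_inclusion_drift",
    "description": "Documented contractor handling changes the count.",
}]
```

The decision lives in the corpus, so when the engineer who knew about contractor drift leaves, the suppression rule stays.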

Anti-pattern

Flat golden sets become onboarding tax.

The fastest-looking schema is one file per tenant per query. It also becomes the most expensive shape when tenant two arrives. Every intent is copied, every value is rediscovered, and every terminology difference becomes hand work.

Author the intent once.

Generate or review the tenant data layer per customer.

Report coverage by tenant so readiness becomes visible before launch.
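That coverage report falls out of the tenant-keyed layout with one pass over the corpus. A minimal sketch over the parsed YAML (the `coverage_by_tenant` helper is hypothetical):

```python
def coverage_by_tenant(queries, tenants):
    """Fraction of golden-set queries carrying a data block per tenant."""
    total = len(queries) or 1  # avoid dividing by zero on an empty corpus
    return {
        tenant: sum(1 for q in queries if tenant in q.get("data", {})) / total
        for tenant in tenants
    }

queries = [
    {"query_id": "QUERY_001", "data": {"tenant_a": {}, "tenant_b": {}}},
    {"query_id": "QUERY_002", "data": {"tenant_a": {}}},
]
report = coverage_by_tenant(queries, ["tenant_a", "tenant_b"])
```

A tenant at 50% coverage before launch is a visible, scheduleable gap instead of a surprise in the first week of production.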

FAQ

Practical questions, answered plainly.

Is this schema finance-specific?

No. The examples use finance-shaped data because that is where TrustEvals works, but the intent/data split applies to any tenant-partitioned AI product.

Should a golden query pin an exact expected value?

Only when the value is actually stable. For refreshed business data, shape, bounds, and checkpoint-specific expectations are often more useful than exact answers.

When should a query move to production_ready?

After the baseline is captured, reviewed by a human, and accepted as strong enough to block a regression for the tenant where it applies.
