The golden set YAML template for multi-tenant AI products.
A practical schema for teams that need real user queries, expected behavior, tenant context, and regression gates to live in one durable corpus.
A golden set is a versioned corpus of real or representative user queries with an explicit statement of expected behavior. For multi-tenant AI products, the schema should separate tenant-agnostic intent from tenant-specific data so the same behavior can be tested across customers without copy-paste rewrites.
Separate what the user means from what each tenant contains.
The load-bearing design decision is the intent/data split. The intent layer describes what the user is asking the system to do. The data layer describes how that expectation maps onto each tenant's calendar, catalog, result shape, and known edge cases.
query_id: QUERY_001
version: 2
status: production_ready
category: headcount_aggregation
priority: standard
user_query: |
  How many engineers do we have right now?
intent:
  primary_action: count
  entity: employees
  filter:
    department: engineering
  temporal_reference: present
expected_behavior:
  route_to: aggregation_pipeline
  checkpoint_order:
    - route
    - vector_retrieval
    - payload_generation
    - aggregate_validation
data:
  tenant_a:
    fiscal_context:
      period_label: current_quarter
      period_start: "2026-04-01"
      period_end: "2026-06-30"
    expected_result_shape:
      type: scalar_integer
      min_plausible: 20
      max_plausible: 500
    baseline_method: human_verified
    known_failure_modes:
      - name: contractor_inclusion_drift
        description: "Do not alert when documented contractor handling changes the count."

The boring fields are the ones that keep CI useful.
A usable golden set is not just a list of prompts. It carries workflow state, expected checkpoints, tenant coverage, and known failure modes so the runner can tell the difference between a real regression and a documented source of volatility.
Three-state status
Use pending, baseline_captured, and production_ready so teams can report deltas before they are ready to block a build.
Explicit checkpoint order
Route, retrieval, payload, execution, and synthesis should fail independently so the team knows which stage broke.
Tenant-keyed data
A missing tenant block should mean this query is not yet baselined for that tenant, not that someone forgot a duplicate file.
Known failure modes
Documented edge cases prevent baseline churn and preserve decisions that would otherwise live in one engineer's memory.
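As a rough illustration of how these four fields might work together in a runner, here is a minimal Python sketch. The function `gate`, the `CheckpointResult` class, and the `failure_mode` label are hypothetical names introduced for this example. The mapping of status to CI action (pending skips, baseline_captured reports, production_ready blocks) is one plausible reading of the three-state workflow, not a documented rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointResult:
    """Outcome of one checkpoint for one query run (hypothetical structure)."""
    name: str
    passed: bool
    failure_mode: Optional[str] = None  # runner's label for the failure, if it has one


def gate(entry: dict, tenant: str, results: list) -> dict:
    """Decide how CI should treat one golden-set entry for one tenant."""
    tenant_data = entry.get("data", {}).get(tenant)
    if tenant_data is None:
        # A missing tenant block means "not yet baselined", not an error.
        return {"action": "skip", "failed": [], "reason": "not_baselined_for_tenant"}
    if entry["status"] == "pending":
        return {"action": "skip", "failed": [], "reason": "no_baseline_yet"}

    known = {m["name"] for m in tenant_data.get("known_failure_modes", [])}
    by_name = {r.name: r for r in results}

    # Evaluate every checkpoint independently so the report names each broken stage.
    failed = []
    for name in entry["expected_behavior"]["checkpoint_order"]:
        result = by_name.get(name)
        if result is None:
            failed.append(name)  # a checkpoint that never reported counts as failed
        elif not result.passed and result.failure_mode not in known:
            failed.append(name)  # documented failure modes are tolerated

    if not failed:
        return {"action": "pass", "failed": [], "reason": "ok"}
    # Assumption: only production_ready entries are allowed to block a build.
    action = "block" if entry["status"] == "production_ready" else "report"
    return {"action": action, "failed": failed, "reason": "checkpoint_failure"}


entry = {
    "status": "baseline_captured",
    "expected_behavior": {
        "checkpoint_order": [
            "route", "vector_retrieval", "payload_generation", "aggregate_validation",
        ]
    },
    "data": {
        "tenant_a": {"known_failure_modes": [{"name": "contractor_inclusion_drift"}]}
    },
}
results = [
    CheckpointResult("route", True),
    CheckpointResult("vector_retrieval", False),
    CheckpointResult("payload_generation", True),
    CheckpointResult("aggregate_validation", False, "contractor_inclusion_drift"),
]
decision = gate(entry, "tenant_a", results)
```

In this sketch the retrieval failure is reported rather than blocking, because the entry is only baseline_captured, and the aggregate failure is ignored because it matches a documented failure mode for tenant_a.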
Flat golden sets become onboarding tax.
The fastest-looking schema is one file per tenant per query. It also becomes the most expensive shape when tenant two arrives. Every intent is copied, every value is rediscovered, and every terminology difference becomes hand work.
Author the intent once.
Generate or review the tenant data layer per customer.
Report coverage by tenant so readiness becomes visible before launch.
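The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only: `expand_by_tenant` and `coverage_by_tenant` are hypothetical helpers, QUERY_002 and tenant_b are made-up example values, and the sketch assumes status is stored once per entry, as in the template, rather than per tenant.

```python
def expand_by_tenant(entry: dict) -> list:
    """Turn one authored entry into one test case per tenant that has a data block."""
    shared = {key: value for key, value in entry.items() if key != "data"}
    return [
        {**shared, "tenant": tenant, "tenant_data": tenant_data}
        for tenant, tenant_data in entry.get("data", {}).items()
    ]


def coverage_by_tenant(entries: list, tenants: list) -> dict:
    """Count, per tenant, how many entries have data and how many can block a build."""
    report = {}
    for tenant in tenants:
        with_data = [e for e in entries if tenant in e.get("data", {})]
        ready = [e for e in with_data if e["status"] == "production_ready"]
        report[tenant] = {
            "total": len(entries),
            "with_data": len(with_data),
            "production_ready": len(ready),
        }
    return report


# Hypothetical corpus: the intent is authored once, tenant blocks are added per customer.
entries = [
    {"query_id": "QUERY_001", "status": "production_ready",
     "data": {"tenant_a": {}, "tenant_b": {}}},
    {"query_id": "QUERY_002", "status": "baseline_captured",
     "data": {"tenant_a": {}}},
]
report = coverage_by_tenant(entries, ["tenant_a", "tenant_b"])
cases = expand_by_tenant(entries[0])
```

A report shaped like this makes a new tenant's readiness a number to track before launch rather than a count of copied files.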
Practical questions, answered plainly.
The template is not finance-specific. The examples use finance-shaped data because that is where TrustEvals works, but the intent/data split applies to any tenant-partitioned AI product.
Store exact expected answers only when the value is actually stable. For refreshed business data, shape, bounds, and checkpoint-specific expectations are often more useful than exact answers.
Promote a query to production_ready after the baseline is captured, reviewed by a human, and accepted as strong enough to block a regression for the tenant where it applies.
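To make the shape-and-bounds idea concrete, here is a small Python sketch of a check against the `expected_result_shape` block from the template. The function `check_result_shape` is a hypothetical helper, and it handles only the `scalar_integer` type shown in the example. Any other shape type is reported as unsupported rather than guessed at.

```python
def check_result_shape(value, shape: dict) -> list:
    """Return a list of problems; an empty list means the value is plausible."""
    if shape["type"] != "scalar_integer":
        return ["unsupported shape type: " + str(shape["type"])]

    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        return ["expected scalar_integer, got " + type(value).__name__]

    problems = []
    if value < shape["min_plausible"]:
        problems.append("value below min_plausible")
    if value > shape["max_plausible"]:
        problems.append("value above max_plausible")
    return problems


# Bounds copied from the tenant_a block in the template.
shape = {"type": "scalar_integer", "min_plausible": 20, "max_plausible": 500}

ok = check_result_shape(142, shape)        # a headcount inside the plausible range
too_low = check_result_shape(7, shape)     # outside the range
wrong_type = check_result_shape("142", shape)  # a string, not an integer
```

A check like this keeps passing when the underlying headcount changes between refreshes, which is the point of preferring bounds over exact answers for volatile data.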