A baseline makes proof possible.
"The AI is working" is a sentence anyone can say, including the model itself. "The AI is at or above the labeled threshold on the 312-case benchmark, holding for 9 of the last 12 weeks" is a sentence the board can act on. The golden dataset is what turns the first sentence into the second.
