A "golden set" is a small collection of input → expected-output pairs that you can run your prompt against any time you change it. It is the single most useful artifact you will build in this course. Without it, you are making decisions on vibes.
Start with five
Five pairs is plenty for week one. Pick examples that span the edges of your task: one obvious case, one ambiguous case, one out-of-scope case, one adversarial case, one weird case. Write down what the right answer is. That's your set.
- Obvious — a clear-cut example anyone on the team would handle the same way.
- Ambiguous — a case that could reasonably go two ways. Pick the answer your team prefers and write it down.
- Out-of-scope — input the agent shouldn't act on. Expected: refusal or escalation.
- Adversarial — someone trying to trick the system. Prompt injection, jailbreak attempts, abuse.
- Weird — a real case that's structurally normal but feels off. Empty fields, unusual formatting, edge of the allowed range.
What you do with the set
Every time you touch the prompt, run the set. Compare the new output to the expected output. If a previously-passing case now fails, that's a regression — you broke something with your change. If a previously-failing case now passes, that's progress. The set is your scoreboard.
Knowledge check
0/1 answered1. You change your system prompt and three of your five golden cases now fail. Best next move?
Discussion
0 commentsBe the first to start the conversation.