Building the scoreboard that tells you whether your system is actually getting better.
Coverage, edge cases, and how to grow the eval set without blowing up your review time.
What to log, what to skip, and what to do before you adopt a vendor.
An eval suite with regression alerts wired to Slack or PagerDuty.