Evaluation
Prompt Regression Suite
Treat prompts like code: build a small regression set and run it before you tune the wording.
Use when
A prompt, system instruction, tool policy, or model setting change could silently break workflows that used to work.
Difficulty: Intermediate
Category: Evaluation
Source: Mia daily expansion
Cadence
Before changing a production prompt, agent instruction, or model setting
Verification
Must-pass cases stay green, known failure cases do not regress, and any changed behavior is documented with an accept or reject decision.
Structured loop spec
| Field | Value |
|---|---|
| Name | Prompt Regression Suite |
| Category | Evaluation |
| Trigger | Before changing a production prompt, agent instruction, or model setting |
| Objective | Treat prompts like code: build a small regression set and run it before you tune the wording. |
| Allowed inputs | Relevant files, source notes, logs, tests, screenshots, metrics, or task state for this loop |
| Allowed actions | 1) Select 8-20 representative cases: normal requests, edge cases, refusal boundaries, formatting requirements, and past failures. 2) Record the expected behavior, allowed variance, required citations or tool evidence, and disallowed outputs for each case. 3) Run the baseline prompt and the proposed change against the same cases with model, tools, and temperature held constant where possible. 4) Compare outputs using the written rubric, not vibes, and mark each case pass, fail, degraded, improved, or needs human review. 5) Ship only if must-pass cases remain green and degraded cases are accepted, fixed, or rolled back. |
| Verification | Must-pass cases stay green, known failure cases do not regress, and any changed behavior is documented with an accept or reject decision. |
| Stop condition | Stop when the verifier passes, the budget is exhausted, no progress is made, a blocker appears, or approval is required. |
| Budget | Set a time, turn, token, retry, file, or dollar cap before running the loop. |
| Approval boundary | Human approval required before publishing, sending, deleting, spending, changing accounts, touching production, or making reputational/legal/financial commitments. |
| Safe output | Draft, report, checklist, table, or approval-gated recommendation |
| Works with | Claude, ChatGPT, Gemini, any tool-using AI assistant |
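The stop condition and budget rows in the table above can be read as a small control loop. The following is a minimal sketch under stated assumptions, not a reference implementation: `StepResult`, `run_loop`, and the returned status strings are hypothetical names chosen for illustration, the budget is reduced to a turn cap, and "no progress" is approximated as two identical consecutive outputs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    """Outcome of one loop turn (hypothetical structure for illustration)."""
    output: str
    blocked: bool = False          # a blocker appeared
    needs_approval: bool = False   # the next action crosses the approval boundary


def run_loop(
    step: Callable[[int], StepResult],
    verifier: Callable[[str], bool],
    max_turns: int,
) -> str:
    """Run `step` until one of the stop conditions from the spec is met."""
    previous = None
    for turn in range(1, max_turns + 1):
        result = step(turn)
        if result.blocked:
            return "blocker"
        if result.needs_approval:
            return "approval_required"
        if verifier(result.output):
            return "verified"
        # Assumption: identical consecutive outputs count as "no progress".
        if result.output == previous:
            return "no_progress"
        previous = result.output
    # The turn cap stands in for any time, token, retry, or dollar budget.
    return "budget_exhausted"
```

The budget is passed in as an argument so that it has to be set before the loop runs, which is what the Budget row asks for. A real loop would likely track several caps at once.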
Steps
- Select 8-20 representative cases: normal requests, edge cases, refusal boundaries, formatting requirements, and past failures.
- Record the expected behavior, allowed variance, required citations or tool evidence, and disallowed outputs for each case.
- Run the baseline prompt and the proposed change against the same cases with model, tools, and temperature held constant where possible.
- Compare outputs using the written rubric, not vibes, and mark each case pass, fail, degraded, improved, or needs human review.
- Ship only if must-pass cases remain green and degraded cases are accepted, fixed, or rolled back.
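The steps above can be sketched as a small harness. This is one illustration under assumptions, not a prescribed tool: `Case`, `classify`, `run_suite`, and `can_ship` are hypothetical names, the rubric is reduced to a boolean check per case, and the two stub functions stand in for real model calls made with the baseline and candidate prompts. The "needs human review" status is left out because a boolean check cannot express it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One regression case (hypothetical structure for illustration)."""
    name: str
    prompt_input: str
    must_pass: bool
    check: Callable[[str], bool]  # rubric check: True if the output is acceptable


def classify(baseline_ok: bool, candidate_ok: bool) -> str:
    """Label a case by comparing baseline and candidate rubric results."""
    if baseline_ok and candidate_ok:
        return "pass"
    if baseline_ok:
        return "degraded"
    if candidate_ok:
        return "improved"
    return "fail"


def run_suite(cases, baseline, candidate) -> dict:
    """Run both prompt versions against the same cases and label each one."""
    return {
        case.name: classify(
            case.check(baseline(case.prompt_input)),
            case.check(candidate(case.prompt_input)),
        )
        for case in cases
    }


def can_ship(cases, results, accepted=frozenset()) -> bool:
    """Ship gate: must-pass cases stay green, degradations need an accept decision."""
    for case in cases:
        status = results[case.name]
        if case.must_pass and status in ("fail", "degraded"):
            return False
        if status == "degraded" and case.name not in accepted:
            return False
    return True


# Stubs standing in for real model calls (assumption: no API is used here).
def baseline_stub(text: str) -> str:
    return f"Summary: {text}"


def candidate_stub(text: str) -> str:
    return text.upper()


cases = [
    Case("format", "refund policy", True, lambda out: out.startswith("Summary:")),
    Case("shouting", "hello", False, lambda out: out.isupper()),
]
results = run_suite(cases, baseline_stub, candidate_stub)
print(results)                   # {'format': 'degraded', 'shouting': 'improved'}
print(can_ship(cases, results))  # False
```

Here the candidate improves one case but breaks a must-pass formatting case, so the gate blocks the change even though the net count of passing cases is unchanged.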
Prompt
Run the Prompt Regression Suite loop before changing a production prompt, system instruction, model setting, or agent policy. Build or reuse 8-20 representative cases with expected behavior, allowed variance, required evidence, and disallowed outputs. Run baseline and candidate under comparable settings, score against the rubric, and report pass/fail/degraded/improved cases. Do not ship the prompt change unless must-pass cases stay green and any degradation has an explicit accept, fix, or rollback decision.