Evaluation

Prompt Regression Suite

Treat prompts like code: keep a small regression set before tuning the clever wording.

Use when

A prompt, system instruction, tool policy, or model setting change could silently break workflows that used to work.

Difficulty: Intermediate
Category: Evaluation
Source: Mia daily expansion

Cadence

Before changing a production prompt, agent instruction, or model setting

Verification

Must-pass cases stay green, known failure cases do not regress, and any changed behavior is documented with an accept or reject decision.
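
The verification rule above can be written as a small check. This is a minimal sketch under stated assumptions, not a definitive verifier: `Result`, its fields, the `kind` labels, and `verify` are hypothetical names introduced for illustration, and each case is assumed to have already been scored as a simple pass or fail for both the baseline and the candidate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Outcome of one case for baseline and candidate (hypothetical structure)."""
    case: str
    kind: str                       # "must_pass", "known_failure", or "other"
    baseline_ok: bool
    candidate_ok: bool
    decision: Optional[str] = None  # "accept" or "reject", recorded by a human


def verify(results: list[Result]) -> list[str]:
    """Return blocking problems; an empty list means the verifier passes."""
    problems = []
    for r in results:
        if r.kind == "must_pass" and not r.candidate_ok:
            problems.append(f"{r.case}: must-pass case is not green")
        elif r.kind == "known_failure" and r.baseline_ok and not r.candidate_ok:
            problems.append(f"{r.case}: known failure case regressed")
        # Any behavior change needs a documented decision. A recorded "reject"
        # satisfies this documentation check; whether it also blocks shipping
        # is a policy choice this sketch leaves open.
        changed = r.baseline_ok != r.candidate_ok
        if changed and r.decision not in ("accept", "reject"):
            problems.append(f"{r.case}: changed behavior has no accept/reject decision")
    return problems


results = [
    Result("refund_policy", "must_pass", baseline_ok=True, candidate_ok=True),
    Result("old_date_bug", "known_failure", baseline_ok=True, candidate_ok=False,
           decision="accept"),
    Result("tone_shift", "other", baseline_ok=True, candidate_ok=False),
]
for problem in verify(results):
    print(problem)
```

In this sample, `old_date_bug` is flagged as a regression even though its change was accepted, and `tone_shift` is flagged because nobody recorded a decision for it.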

Structured loop spec

Field: Value
Name: Prompt Regression Suite
Category: Evaluation
Trigger: Before changing a production prompt, agent instruction, or model setting
Objective: Treat prompts like code: keep a small regression set before tuning the clever wording.
Allowed inputs: Relevant files, source notes, logs, tests, screenshots, metrics, or task state for this loop
Allowed actions:
  1. Select 8-20 representative cases: normal requests, edge cases, refusal boundaries, formatting requirements, and past failures.
  2. Record the expected behavior, allowed variance, required citations or tool evidence, and disallowed outputs for each case.
  3. Run the baseline prompt and the proposed change against the same cases with model, tools, and temperature held constant where possible.
  4. Compare outputs using the written rubric, not vibes, and mark each case pass, fail, degraded, improved, or needs human review.
  5. Ship only if must-pass cases remain green and degraded cases are accepted, fixed, or rolled back.
Verification: Must-pass cases stay green, known failure cases do not regress, and any changed behavior is documented with an accept or reject decision.
Stop condition: Stop when the verifier passes, the budget is exhausted, no progress is made, a blocker appears, or approval is required.
Budget: Set a time, turn, token, retry, file, or dollar cap before running the loop.
Approval boundary: Human approval required before publishing, sending, deleting, spending, changing accounts, touching production, or making reputational/legal/financial commitments.
Safe output: Draft, report, checklist, table, or approval-gated recommendation
Works with: Claude, ChatGPT, Gemini, any tool-using AI assistant

Steps

  1. Select 8-20 representative cases: normal requests, edge cases, refusal boundaries, formatting requirements, and past failures.
  2. Record the expected behavior, allowed variance, required citations or tool evidence, and disallowed outputs for each case.
  3. Run the baseline prompt and the proposed change against the same cases with model, tools, and temperature held constant where possible.
  4. Compare outputs using the written rubric, not vibes, and mark each case pass, fail, degraded, improved, or needs human review.
  5. Ship only if must-pass cases remain green and degraded cases are accepted, fixed, or rolled back.
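
The steps above can be sketched in Python as follows. This is an illustrative sketch rather than a definitive harness: `Case`, `score_case`, `compare`, and `run_suite` are hypothetical names, the substring-based rubric is a deliberate simplification of a written rubric, and the two stub functions stand in for whatever call actually runs the baseline and candidate prompts with model, tools, and temperature held constant.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    """One regression case (steps 1 and 2). Field names are illustrative."""
    name: str
    user_input: str
    required: list[str] = field(default_factory=list)    # substrings that must appear
    disallowed: list[str] = field(default_factory=list)  # substrings that must not appear
    must_pass: bool = False


def score_case(case: Case, output: str) -> float:
    """Toy rubric: share of required substrings present, 0.0 on any disallowed output."""
    text = output.lower()
    if any(bad.lower() in text for bad in case.disallowed):
        return 0.0
    if not case.required:
        return 1.0
    hits = sum(1 for req in case.required if req.lower() in text)
    return hits / len(case.required)


def compare(case: Case, baseline_out: str, candidate_out: str) -> str:
    """Step 4: label a case by comparing rubric scores, not impressions."""
    if not case.required and not case.disallowed:
        return "needs human review"  # nothing automatic to score against
    base = score_case(case, baseline_out)
    cand = score_case(case, candidate_out)
    if cand < base:
        return "degraded"
    if cand > base:
        return "improved"
    return "pass" if cand == 1.0 else "fail"


def run_suite(cases: list[Case],
              run_baseline: Callable[[str], str],
              run_candidate: Callable[[str], str]) -> dict[str, str]:
    """Step 3: run both prompt versions against the same cases."""
    return {c.name: compare(c, run_baseline(c.user_input), run_candidate(c.user_input))
            for c in cases}


# Stubs standing in for real model calls.
def baseline(text: str) -> str:
    return "Refunds are available within 30 days." if "refund" in text else '{"order": 1}'


def candidate(text: str) -> str:
    return "Refunds are available." if "refund" in text else '{"order": 1}'


cases = [
    Case("refund_policy", "Can I get a refund?", required=["30 days"], must_pass=True),
    Case("json_format", "Return the order as JSON", required=["{"], disallowed=["here is"]),
]
report = run_suite(cases, baseline, candidate)
# Step 5: must-pass cases that failed or degraded block the release.
blockers = [c.name for c in cases if c.must_pass and report[c.name] in ("fail", "degraded")]
print(report)    # {'refund_policy': 'degraded', 'json_format': 'pass'}
print(blockers)  # ['refund_policy']
```

Here the candidate drops the "30 days" detail, so the must-pass `refund_policy` case is marked degraded and blocks the change until it is fixed or rolled back. A real suite would replace the substring checks with the written rubric and route ambiguous cases to human review.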

Prompt

Run the Prompt Regression Suite loop before changing a production prompt, system instruction, model setting, or agent policy. Build or reuse 8-20 representative cases with expected behavior, allowed variance, required evidence, and disallowed outputs. Run baseline and candidate under comparable settings, score against the rubric, and report pass/fail/degraded/improved cases. Do not ship the prompt change unless must-pass cases stay green and any degradation has an explicit accept, fix, or rollback decision.

Tags

prompts, evals, LLM, regression testing
