Evaluation

Prompt Regression Suite

Treat prompts like code: build a small regression set before you start tuning the wording.

Use when

A prompt, system instruction, tool policy, or model setting change could silently break workflows that used to work.

Difficulty: Intermediate

Category: Evaluation

Source: Mia daily expansion

Cadence

Before changing a production prompt, agent instruction, or model setting

Verification

Must-pass cases stay green, known failure cases do not regress, and any changed behavior is documented with an accept or reject decision.
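The verification rule above can be read as a simple ship gate. The following is a minimal sketch, not a prescribed implementation: the `CaseResult` record, its field names, and the `ship_gate` function are hypothetical names introduced for illustration, and it assumes each case has already been scored as passed or failed for both the baseline and the candidate.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CaseResult:
    # All field names are illustrative assumptions, not part of the loop spec.
    case_id: str
    must_pass: bool              # True if a failure on this case blocks shipping
    baseline_passed: bool
    candidate_passed: bool
    decision: Optional[str] = None  # "accept" or "reject" for changed behavior


def ship_gate(results: List[CaseResult]) -> Tuple[bool, List[str]]:
    """Return (ok, reasons); ok is True only if nothing blocks the change."""
    reasons = []
    for r in results:
        if r.must_pass and not r.candidate_passed:
            reasons.append(f"{r.case_id}: must-pass case failed")
            continue
        if r.baseline_passed != r.candidate_passed:
            # Behavior changed, so a documented decision is required.
            if r.decision is None:
                reasons.append(f"{r.case_id}: changed behavior has no decision")
            elif r.decision == "reject":
                reasons.append(f"{r.case_id}: change was rejected")
    return (not reasons, reasons)
```

In this sketch an improvement (baseline failed, candidate passes) also needs a recorded decision, because the rule asks for any changed behavior to be documented, not only regressions.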

Structured loop spec

Field	Value
Name	Prompt Regression Suite
Category	Evaluation
Trigger	Before changing a production prompt, agent instruction, or model setting
Objective	Treat prompts like code: build a small regression set before you start tuning the wording.
Allowed inputs	Relevant files, source notes, logs, tests, screenshots, metrics, or task state for this loop
Allowed actions	Select 8-20 representative cases: normal requests, edge cases, refusal boundaries, formatting requirements, and past failures; record the expected behavior, allowed variance, required citations or tool evidence, and disallowed outputs for each case; run the baseline prompt and the proposed change against the same cases with model, tools, and temperature held constant where possible; compare outputs using the written rubric, not vibes, and mark each case pass, fail, degraded, improved, or needs human review; ship only if must-pass cases remain green and degraded cases are accepted, fixed, or rolled back.
Verification	Must-pass cases stay green, known failure cases do not regress, and any changed behavior is documented with an accept or reject decision.
Stop condition	Stop when the verifier passes, the budget is exhausted, no progress is made, a blocker appears, or approval is required.
Budget	Set a time, turn, token, retry, file, or dollar cap before running the loop.
Approval boundary	Human approval required before publishing, sending, deleting, spending, changing accounts, touching production, or making reputational/legal/financial commitments.
Safe output	Draft, report, checklist, table, or approval-gated recommendation
Works with	Claude, ChatGPT, Gemini, any tool-using AI assistant
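The stop condition and budget rows in the table above can be checked mechanically at the end of each turn. This is a rough sketch under stated assumptions: `LoopState`, its fields, the turn-based budget, and the `stall_limit` threshold are all hypothetical choices made for illustration; the spec itself allows time, token, retry, file, or dollar caps as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopState:
    # Hypothetical bookkeeping for one run of the loop.
    verifier_passed: bool = False
    turns_used: int = 0
    max_turns: int = 5             # budget cap, chosen before the loop starts
    turns_without_progress: int = 0
    blocker: str = ""              # non-empty when something blocks the work
    needs_approval: bool = False


def stop_reason(state: LoopState, stall_limit: int = 2) -> Optional[str]:
    """Return why the loop should stop, or None if it may continue."""
    if state.verifier_passed:
        return "verifier passed"
    if state.needs_approval:
        return "approval required"
    if state.blocker:
        return f"blocked: {state.blocker}"
    if state.turns_used >= state.max_turns:
        return "budget exhausted"
    if state.turns_without_progress >= stall_limit:
        return "no progress"
    return None
```

Checking approval and blockers before the budget means the loop reports the reason a human needs to act on, instead of only noting that it ran out of turns.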

Steps

Select 8-20 representative cases: normal requests, edge cases, refusal boundaries, formatting requirements, and past failures.
Record the expected behavior, allowed variance, required citations or tool evidence, and disallowed outputs for each case.
Run the baseline prompt and the proposed change against the same cases with model, tools, and temperature held constant where possible.
Compare outputs using the written rubric, not vibes, and mark each case pass, fail, degraded, improved, or needs human review.
Ship only if must-pass cases remain green and degraded cases are accepted, fixed, or rolled back.
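The first four steps above can be sketched as a small comparison harness. This is an illustrative sketch, not a definitive implementation: the `Case` record, the substring-based `meets_rubric` check, and the `compare` function are assumptions made for the example, and a real rubric will usually need richer checks than substring matching. The two runner callables stand in for whatever calls the model with the baseline and candidate prompts; no real model API is used here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Case:
    # Hypothetical case record; field names are assumptions for illustration.
    case_id: str
    user_input: str
    required: List[str] = field(default_factory=list)    # must appear in output
    disallowed: List[str] = field(default_factory=list)  # must not appear
    must_pass: bool = False


def meets_rubric(case: Case, output: str) -> bool:
    """Crude stand-in for a written rubric: case-insensitive substring checks."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.required)
    has_disallowed = any(s.lower() in text for s in case.disallowed)
    return has_required and not has_disallowed


def compare(
    cases: List[Case],
    run_baseline: Callable[[str], str],
    run_candidate: Callable[[str], str],
) -> Dict[str, str]:
    """Run both prompts on the same cases and label each case."""
    report = {}
    for case in cases:
        if not case.required and not case.disallowed:
            # Nothing can be checked mechanically, so route to a person.
            report[case.case_id] = "needs human review"
            continue
        baseline_ok = meets_rubric(case, run_baseline(case.user_input))
        candidate_ok = meets_rubric(case, run_candidate(case.user_input))
        if baseline_ok and candidate_ok:
            report[case.case_id] = "pass"
        elif baseline_ok:
            report[case.case_id] = "degraded"
        elif candidate_ok:
            report[case.case_id] = "improved"
        else:
            report[case.case_id] = "fail"
    return report
```

Here "degraded" is taken to mean a case the baseline satisfied and the candidate does not, which is one possible reading of the label. Passing the runners in as callables keeps model, tools, and temperature under the caller's control, so both runs can share the same settings.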

Prompt

Run the Prompt Regression Suite loop before changing a production prompt, system instruction, model setting, or agent policy. Build or reuse 8-20 representative cases with expected behavior, allowed variance, required evidence, and disallowed outputs. Run baseline and candidate under comparable settings, score against the rubric, and report pass/fail/degraded/improved cases. Do not ship the prompt change unless must-pass cases stay green and any degradation has an explicit accept, fix, or rollback decision.