Evaluation-Driven Development

Here's a failure mode I've seen repeatedly. An engineer updates a system prompt (tightens the tone, adds a new instruction, tweaks the output format), tests it by hand against five example inputs, sees nothing wrong, and ships the change. Two weeks later, support tickets start arriving. The model is now ignoring edge cases that it handled correctly before. Nobody caught it because nobody was looking.

This is what Andrej Karpathy calls the "vibe check" problem. You eyeball a handful of outputs, feel good about it, and call it engineering. It's not. It's hope.

The solution isn't complicated, but it takes commitment to set up: you need automated evaluations that run every time your prompts change, the same way unit tests run every time your code changes. This is Evaluation-Driven Development — EDD — and it's worth understanding in some depth.

Why prompt changes are uniquely dangerous

In traditional software, if a change breaks a function, a test fails and names the broken behavior. The failure is loud, immediate, and specific.

Prompt changes don't work like that. LLMs are non-deterministic. The same prompt can produce slightly different outputs on different runs. And failures are silent — the model still returns a coherent-looking response, it just starts hallucinating, or drifting on tone, or breaking a JSON schema you depend on. No exception thrown. No red CI badge. Just subtly wrong outputs flowing into your users' experience.

The numbers back this up. Parsing-related incidents alone are reported to account for 38% of all task failures in production agentic systems. That's not a model capability failure; it's a format compliance failure. A deterministic format assertion in CI, such as a regex or a JSON-parse check, targets exactly this class of failure before it ships.

EDD is not TDD with an LLM skin

The tempting framing is: write your evals first, then write prompts to make them pass. That's TDD translated to AI. And practitioners like Hamel Husain caution against it, for a good reason.

With deterministic code, you can anticipate what will break. With LLMs, you can't. The failure surface is effectively infinite. If you write evals for failures you imagine, you'll build a test suite measuring things that never actually go wrong in production, while missing the things that do.

The better approach is: ship your first prompt, collect real outputs, do error analysis on actual failures, then write evaluators targeting the specific failure modes you find. Your eval suite starts small and grows incrementally with your system's real-world failure history. It stays predictive because it's grounded in what users actually send.

The difference sounds subtle, but it's significant in practice. EDD still puts evals ahead of every subsequent prompt change; you're just drawing them from observed reality rather than speculation.

Three layers of assertions

Not all tests have the same cost or precision. Production eval pipelines use three tiers, applied in order:

Deterministic assertions run first, cost nothing extra, and catch the most obvious failures. Does the response contain a required phrase? Does it match a regex for price format? Does the JSON parse without error? Does it stay under a token budget?

assert:
  - type: contains
    value: "30-day refund"
  - type: regex
    value: "\\$\\d+\\.\\d{2}"
  - type: not-contains
    value: "I cannot help with that"
  - type: is-json

These run in milliseconds and catch format compliance failures — the 38% category mentioned above.
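If you're not using an eval framework yet, the same four checks are easy to sketch in plain Python. This is a minimal illustration, not promptfoo's implementation: check_response is a hypothetical helper, and the strings and regex are copied from the config above.

```python
import json
import re


def check_response(output: str) -> list[str]:
    """Return the names of the checks that failed; an empty list means all passed."""
    failures = []

    # contains: the required phrase must appear verbatim
    if "30-day refund" not in output:
        failures.append("contains")

    # regex: at least one price formatted like $19.99
    if not re.search(r"\$\d+\.\d{2}", output):
        failures.append("regex")

    # not-contains: a canned refusal must not appear
    if "I cannot help with that" in output:
        failures.append("not-contains")

    # is-json: the whole response must parse as JSON
    try:
        json.loads(output)
    except json.JSONDecodeError:
        failures.append("is-json")

    return failures
```

Returning the list of failed check names, rather than a single boolean, keeps the CI output specific about what broke.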

Semantic similarity assertions handle cases where the phrasing can vary but the meaning shouldn't. You embed both the expected output and the actual output, compute cosine similarity, and fail the test if the score drops below a threshold. A value around 0.85 is a common starting point, but the right cutoff depends on the embedding model. This is how you test paraphrase equivalence without being brittle about exact wording.
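The comparison itself is only a few lines once you have the vectors. The sketch below assumes you already have embeddings for the expected and actual outputs from whatever embedding model you use; the function names are hypothetical, and the 0.85 default is just the threshold mentioned above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)


def passes_similarity(
    expected_vec: Sequence[float],
    actual_vec: Sequence[float],
    threshold: float = 0.85,
) -> bool:
    """True when the two embeddings are at least `threshold` similar."""
    return cosine_similarity(expected_vec, actual_vec) >= threshold
```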

LLM-graded assertions handle the subjective stuff: tone, helpfulness, coherence, safety. You send the output to a judge model with a scoring rubric and get back a numeric score. This is the "LLM-as-a-Judge" pattern, and it scales where human review can't.

LLM-as-a-Judge: how to do it right

The mechanics are straightforward. You write a prompt that defines quality criteria, feed it the input and output, and ask the judge to score it. The tricky part is controlling for well-documented biases.

Position bias means that in A/B comparisons, whichever answer appears first gets a roughly 10-15 point bump in win rate on close calls. Fix: run the comparison twice with the responses swapped, then average the two results.
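A rough sketch of the swap-and-average fix follows. It assumes a hypothetical judge callable that takes the question plus two answers and returns a score between 0 and 1 for how strongly the first-listed answer wins; your real judge call will look different.

```python
from typing import Callable

# Hypothetical signature: judge(question, first, second) -> score in [0, 1]
# expressing how strongly the FIRST-listed answer is preferred.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Preference for A over B in [0, 1], averaged over both presentation orders."""
    a_shown_first = judge(question, answer_a, answer_b)  # score is for A
    b_shown_first = judge(question, answer_b, answer_a)  # score is for B
    # Convert the second score into "preference for A" before averaging.
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0
```

A judge that simply favors whatever it sees first comes out at 0.5 here, which is the point: the positional bump cancels and only a real preference survives.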

Verbosity bias is worse. Across GPT-4, Claude, and PaLM-2, studies have shown longer outputs inflate preference scores by 15-30 points, even when the shorter response is more correct. Length reads as effort. Your rubric needs to explicitly call this out:

Score 5: Highly informative and specific. Addresses the user's need directly.
         Conciseness is a virtue — do not reward longer responses unless the
         length is warranted by the complexity of the question.

Self-enhancement bias means a model used as judge will rate outputs from its own family 10-25% higher. Don't use Claude to judge Claude outputs. Use a cross-family judge.

A solid rubric prompt looks like this:

You are an expert evaluator. Score the following AI response on HELPFULNESS
from 1 to 5.

Score 5: Specific, well-structured, directly addresses the user's need.
Score 4: Mostly helpful, minor gaps.
Score 3: Partially answers the question, lacks important detail.
Score 2: Touches the topic but misses the core issue.
Score 1: Incorrect, off-topic, or misleading.

User input: {input}
AI response: {output}

Explain your reasoning, then output your score as a single integer.

Always ask for chain-of-thought before the score. It significantly improves reliability — the judge has to commit to a rationale before it can hedge with a number.
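Because the reasoning comes first, your harness has to pull the score out of free text. One way to do that, assuming the judge follows the instruction and ends its reply with the score, is to take the last integer in the reply and reject anything outside the rubric's range. parse_judge_score is a hypothetical helper, not part of any library.

```python
import re


def parse_judge_score(reply: str, low: int = 1, high: int = 5) -> int:
    """Extract the final integer in a judge reply and validate it against the rubric range."""
    numbers = re.findall(r"\b\d+\b", reply)
    if not numbers:
        raise ValueError("judge reply contains no integer score")
    # The rubric asks for reasoning first, so the score should be the last number.
    score = int(numbers[-1])
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}-{high}")
    return score
```

Raising on a missing or out-of-range score is deliberate: a malformed judge reply should show up as an eval error rather than be silently counted as a low score.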

Golden datasets: the thing you actually need to maintain

Evals without data are nothing. The "golden dataset" is your curated set of input/expected-output pairs — the ground truth for what correct behavior looks like in your system.

A few things that matter:

Size: 100-500 test cases is the sweet spot for most teams. Too few and the signal is noisy. Too many and maintenance becomes a full-time job, which means it won't happen.

Coverage: Include your most common user inputs, your known edge cases, and your previous failures. Adversarial examples belong here too. The dataset should grow every time a production failure surfaces — that's how it stays predictive.

Versioning: Treat the dataset like source code. Every schema change, every label redefinition, every addition gets a commit. When you change what a "Score 5" means in your rubric, you bump the dataset version. This matters when you're auditing why a change regressed two months later.

Ownership: The dataset needs a clear owner. The most common golden dataset failure mode is "multiple people editing the eval criteria over six months with no coordination, resulting in contradictory rubrics and test cases that measure different things."
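To make the versioning and coverage points concrete, here is one possible shape for a golden dataset stored as JSONL in the repo. The field names (case_id, input, expected, tags) are assumptions for illustration, not a standard schema.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str                 # stable ID so a regression can be traced to one case
    input: str                   # what the user sends
    expected: str                # ground-truth output or key phrase
    tags: tuple[str, ...] = ()   # e.g. "common", "edge-case", "regression"


def load_golden(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per line and reject duplicate case IDs."""
    cases: list[GoldenCase] = []
    seen: set[str] = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # allow blank lines between records
        row = json.loads(line)
        case = GoldenCase(
            case_id=row["case_id"],
            input=row["input"],
            expected=row["expected"],
            tags=tuple(row.get("tags", [])),
        )
        if case.case_id in seen:
            raise ValueError(f"line {line_no}: duplicate case_id {case.case_id!r}")
        seen.add(case.case_id)
        cases.append(case)
    return cases


def coverage_by_tag(cases: list[GoldenCase]) -> dict[str, int]:
    """Count cases per tag to spot thin coverage, such as too few regression cases."""
    return dict(Counter(tag for case in cases for tag in case.tags))
```

One record per line keeps diffs readable in code review, which helps with both the versioning and the ownership problems.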

Wiring it into CI

Promptfoo is the most practical open-source option here. It's MIT-licensed, used internally by OpenAI and Anthropic, and it has a native GitHub Action.

Your config lives in a YAML file:

prompts:
  - "file://prompts/system_prompt_v2.txt"

providers:
  - openai:gpt-4o

tests:
  - vars:
      question: "What is your refund policy?"
    assert:
      - type: contains
        value: "30 days"
      - type: llm-rubric
        value: "Response is accurate, polite, and under 150 words"
        weight: 3
      - type: regex
        value: "\\d+ day"
        weight: 1

  - vars:
      question: "How do I cancel?"
    assert:
      - type: javascript
        value: "output.includes('account settings') || output.includes('billing page')"

The GitHub Action posts results as a PR comment — a diff showing which test cases improved, which regressed, and by how much:

- name: Run LLM evals
  uses: promptfoo/promptfoo-action@v1
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    config: evals/promptfooconfig.yaml
    repeat: 3
    repeat-min-pass: 2

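Setting repeat: 3 with repeat-min-pass: 2 absorbs much of the non-determinism: a test case passes only if it passes at least 2 of 3 runs, so a single unlucky sample no longer fails the build. Input names can differ between action versions, so confirm them against the action's documentation before relying on them.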
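The pass rule itself is simple enough to reproduce in any harness. The sketch below shows the logic only, with a hypothetical run_once callable standing in for one model call plus its assertions; it is not how the action implements it.

```python
from typing import Callable


def passes_with_repeats(run_once: Callable[[], bool], repeat: int = 3, min_pass: int = 2) -> bool:
    """Run a flaky check `repeat` times and pass if at least `min_pass` runs succeed."""
    if not 1 <= min_pass <= repeat:
        raise ValueError("min_pass must be between 1 and repeat")
    passes = sum(1 for _ in range(repeat) if run_once())
    return passes >= min_pass
```

The tradeoff is cost: every repeat multiplies the number of model calls, so it may make sense to reserve repeats for the assertions that are actually flaky.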

If you want a hosted platform with human annotation, regression tracking, and dashboard views, Braintrust is worth looking at. Their eval-action works similarly and posts detailed PR comments. For Python-native teams running RAG systems, DeepEval integrates directly with pytest and provides pre-built metrics for hallucination, answer relevancy, and faithfulness that you can drop into a standard pytest run.

Model swaps without the surprises

The hardest eval scenario isn't a prompt change. It's a model version swap. When you move from gpt-4 to gpt-4o, or from Claude 3.5 Sonnet to Claude 3.7, the behavioral surface changes in ways that are hard to anticipate. Format verbosity shifts. Refusal patterns change. JSON adherence varies. Instruction-following nuances differ.

With a solid golden dataset, a model migration becomes a regression run, not a leap of faith. You run both the old and new model against all your test cases in parallel — promptfoo supports this natively — and you get a side-by-side diff. You've defined acceptable thresholds in advance: no more than a 5% drop in helpfulness score, zero increase in refusal rate on valid queries, no format compliance failures.
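Those thresholds can be written down as a small gate that compares the two runs. This is a sketch under assumptions: EvalSummary and its fields are hypothetical names for whatever aggregates your eval tool reports, and the 5% drop is read as relative to the baseline score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    helpfulness: float     # mean judge score across the golden dataset
    refusal_rate: float    # fraction of valid queries the model refused
    format_failures: int   # count of format compliance failures


def migration_blockers(
    baseline: EvalSummary,
    candidate: EvalSummary,
    max_helpfulness_drop: float = 0.05,
) -> list[str]:
    """Return the reasons the candidate model fails the gate; empty means safe to migrate."""
    blockers = []
    floor = baseline.helpfulness * (1.0 - max_helpfulness_drop)
    if candidate.helpfulness < floor:
        blockers.append(f"helpfulness {candidate.helpfulness:.2f} is below the floor {floor:.2f}")
    if candidate.refusal_rate > baseline.refusal_rate:
        blockers.append("refusal rate on valid queries increased")
    if candidate.format_failures > 0:
        blockers.append(f"{candidate.format_failures} format compliance failure(s)")
    return blockers
```

Committing the thresholds before the run is what keeps the decision honest; otherwise it's tempting to adjust them after seeing the numbers.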

Without evals, a model migration is a guess. With them, it's a measurable decision.

The shift that makes this stick

None of the tooling matters if your team treats evals as an afterthought — something you write after the prompt is working, if you get around to it.

The teams that get this right treat their eval suite the same way they treat their test suite. Adding a test case for every production failure is not optional. The CI gate is not bypassed. Prompt changes without an eval delta are flagged in code review.
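The review rule can be automated too. Here is a minimal sketch of the check, assuming prompts live under prompts/ and eval configs under evals/ as in the examples earlier; needs_eval_delta is a hypothetical helper you would feed the list of files changed in a PR.

```python
from pathlib import PurePosixPath
from typing import Iterable


def needs_eval_delta(
    changed_paths: Iterable[str],
    prompt_dir: str = "prompts",
    eval_dir: str = "evals",
) -> bool:
    """True when a change touches prompts but leaves the eval suite untouched."""
    paths = [PurePosixPath(p) for p in changed_paths]

    def under(path: PurePosixPath, root: str) -> bool:
        # Compare the first path component so "prompts_old/x" does not match "prompts".
        return path.parts[:1] == (root,)

    touched_prompts = any(under(p, prompt_dir) for p in paths)
    touched_evals = any(under(p, eval_dir) for p in paths)
    return touched_prompts and not touched_evals
```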

The mindset shift is small, but the operational change is real. You stop asking "does this look right?" and start asking "what score does this get, and is that above our threshold?" That's the difference between vibe-checking and engineering.

Your prompts will change. Your models will change. The only way to know whether those changes made things better or worse is to measure it. Evals are how you measure it.