Evaluating LLM Apps: How to Test Something Non-Deterministic

The first time I shipped an LLM feature, I "tested" it the way I test everything: a handful of assertEquals calls against fixed inputs. It passed. Two weeks later a prompt tweak that improved one customer's output silently broke summarization for a different segment, and the only reason I found out was a support ticket. The test suite was green the entire time.

That's the core problem. Traditional tests assume f(x) returns the same y every time. An LLM returns a sample from a distribution. Even at temperature: 0 you are not guaranteed determinism: floating-point non-associativity across GPU batches, MoE routing, and model-version rollovers all introduce drift. So expect(output).toBe("...") is the wrong tool. You're not asserting equality; you're asserting that a distribution of outputs stays good enough. That requires different machinery: evals.
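
To make "a distribution of outputs stays good enough" concrete, here is a minimal sketch that samples the same input several times and reports a pass rate instead of comparing one string. The names passRate, generate, and check are hypothetical, not part of any library, and generate stands in for whatever model call you already have.

```typescript
// Sketch: estimate how often a property holds across repeated samples.
// `generate` is a stand-in for your model call; `check` is any boolean property.
export async function passRate(
  generate: () => Promise<string>,
  check: (output: string) => boolean,
  samples: number,
): Promise<number> {
  if (samples <= 0) throw new Error("samples must be positive");
  let passed = 0;
  for (let i = 0; i < samples; i++) {
    // Sequential on purpose: keeps the sketch simple and gentle on rate limits.
    const output = await generate();
    if (check(output)) passed++;
  }
  return passed / samples;
}
```

A test then asserts something like "the pass rate over 20 samples is at least 0.9" rather than "the output equals this string". Both the sample count and the threshold are assumptions you would tune per feature.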

Assertions fail, so measure quality instead

The mental shift is from binary to scalar. A unit test answers yes/no. An eval answers "how good, on a scale, across N representative cases." You stop asking "is this output correct" and start asking "did my aggregate quality score drop below threshold."

That reframing buys you three things: you can compare two prompts numerically, you can gate CI on a score instead of a brittle string, and you can track quality as a time series instead of discovering regressions by anecdote.

The unit of work is the eval case: an input, optional context, optional reference output, and one or more scorers that grade the model's response. Run every case, aggregate the scores, compare against a baseline.

Build the golden dataset from real cases, not your imagination

The single biggest mistake I see: engineers hand-write 15 "test cases" that look like demos. Those cases encode what you think users do. Your eval set should encode what they actually do.

My rule: at least 60% of the golden set comes from real production traffic — sampled, anonymized, and labeled. Capture inputs and outputs with a tracing layer (LangSmith, Langfuse, or your own logging), then build a labeling pipeline. Every time a user thumbs-down a response, or a support ticket traces back to a bad generation, that case goes into the eval set. This is the flywheel: production failures become permanent regression tests.

A practical target distribution for a 200-case golden set:

Bucket	Share	Source
Happy-path typical requests	40%	Sampled production traffic
Known failure modes	25%	Tickets, thumbs-down, incident postmortems
Edge cases (long input, empty, multilingual)	15%	Hand-curated + sampled
Adversarial / injection	10%	Red-teaming, OWASP LLM Top 10 patterns
Safety / refusal boundaries	10%	Policy team + curated

Store cases as flat files in the repo so they version with your code and diff in PRs:

{
  "id": "summ-042",
  "input": "Summarize this support thread in 2 sentences:\n\n[thread text...]",
  "context": null,
  "reference": "Customer reported a failed payment; agent issued a refund and confirmed resolution.",
  "tags": ["summarization", "production-sampled"],
  "metadata": { "captured_at": "2026-04-18", "source": "thumbs_down" }
}
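
Flat files are only useful if a malformed one fails loudly. The following is a rough sketch of a loader for case files shaped like the example above. It assumes one case per .json file in a single directory (a hypothetical layout); parseCase and loadCases are illustrative names, and the checks cover only the fields shown here.

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

export interface GoldenCase {
  id: string;
  input: string;
  context: string | null;
  reference: string | null;
  tags: string[];
}

// Parse one case file and throw on missing or mistyped fields.
export function parseCase(raw: string): GoldenCase {
  const { id, input, context, reference, tags } = JSON.parse(raw) as Record<
    string,
    unknown
  >;
  if (typeof id !== "string" || id === "") {
    throw new Error("case is missing a string id");
  }
  if (typeof input !== "string") {
    throw new Error(`case ${id}: input must be a string`);
  }
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) {
    throw new Error(`case ${id}: tags must be an array of strings`);
  }
  return {
    id,
    input,
    context: typeof context === "string" ? context : null,
    reference: typeof reference === "string" ? reference : null,
    tags: tags as string[],
  };
}

// Load every *.json file in a directory, rejecting duplicate ids.
export function loadCases(dir: string): GoldenCase[] {
  const seen = new Set<string>();
  return readdirSync(dir)
    .filter((name) => name.endsWith(".json"))
    .sort() // stable order so runs and diffs are reproducible
    .map((name) => {
      const c = parseCase(readFileSync(join(dir, name), "utf8"));
      if (seen.has(c.id)) throw new Error(`duplicate case id: ${c.id}`);
      seen.add(c.id);
      return c;
    });
}
```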

The scorer spectrum: cheap and dumb to expensive and smart

There is no single scorer. You compose a ladder, from cheapest/most-deterministic to most expensive/most-flexible, and you use the cheapest one that can actually judge the property you care about.

1. Exact and heuristic checks. Free, deterministic, and never flaky. Use them whenever output structure is constrained: valid JSON, contains a required field, matches a regex, length bounds, no banned phrases. If your feature returns structured output, most of your scorers should live here. Don't reach for an LLM to check whether a string parses as JSON.

2. Embedding similarity. When you have a reference answer but exact match is too strict (paraphrase is fine), score cosine similarity between embeddings of the candidate and reference. Useful, but noisy — semantically opposite sentences can score 0.85 cosine because they share vocabulary. Treat it as a weak signal, never a gate on its own.

3. Task-specific scorers. Often the highest ROI and the least used. If you generate SQL, run the SQL against a fixture DB and compare result sets. If you extract a date, parse it and compare timestamps. If you classify, compute F1 against labels. These are real tests with real assertions — they just live inside the eval harness.

4. LLM-as-judge. A separate model call that grades the output against a rubric. The most flexible and the most dangerous. Reserve it for genuinely subjective properties — tone, helpfulness, faithfulness — that the cheaper scorers can't express.
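
The first three rungs are small enough to sketch directly. The helpers below are illustrative only: formatViolations, cosineSimilarity, and sameInstant are hypothetical names, the rule set is an assumption, and the embedding vectors are expected to come from whatever embedding model you already use.

```typescript
// Rung 1: heuristic format check. Returns a list of violations (empty means pass).
// Pass a RegExp without the `g` flag, since `test` is stateful on global regexes.
export function formatViolations(
  output: string,
  rules: { maxLength: number; bannedPhrases: string[]; mustMatch?: RegExp },
): string[] {
  const violations: string[] = [];
  if (output.length > rules.maxLength) {
    violations.push(`length ${output.length} exceeds ${rules.maxLength}`);
  }
  const lower = output.toLowerCase();
  for (const phrase of rules.bannedPhrases) {
    if (lower.includes(phrase.toLowerCase())) {
      violations.push(`contains banned phrase "${phrase}"`);
    }
  }
  if (rules.mustMatch && !rules.mustMatch.test(output)) {
    violations.push(`does not match ${rules.mustMatch}`);
  }
  return violations;
}

// Rung 2: cosine similarity between two embedding vectors of equal length.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rung 3: task-specific check for date extraction.
// Compares parsed instants, so differently formatted strings can still match.
export function sameInstant(extracted: string, expected: string): boolean {
  const a = Date.parse(extracted);
  const b = Date.parse(expected);
  return !Number.isNaN(a) && a === b;
}
```

Each helper is synchronous and pure, so wrapping one in the async scorer shape used by the harness is a few lines.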

The harness

Here's a minimal but real harness in TypeScript. It runs cases through the model and applies an array of scorers, each returning a 0..1 score. Note that scorers are just async functions — heuristic, embedding-based, and LLM-based all implement the same interface.

import { z } from "zod";
 
export interface EvalCase {
  id: string;
  input: string;
  context?: string | null;
  reference?: string | null;
  tags: string[];
}
 
export interface ScoreResult {
  scorer: string;
  score: number;       // 0..1
  passed: boolean;
  rationale?: string;
}
 
export type Scorer = (
  caseData: EvalCase,
  output: string,
) => Promise<ScoreResult>;
 
// Cheap deterministic scorer: must be valid JSON with required keys.
export const validJsonScorer =
  (schema: z.ZodTypeAny): Scorer =>
  async (_caseData, output) => {
    try {
      schema.parse(JSON.parse(output));
      return { scorer: "valid_json", score: 1, passed: true };
    } catch (e) {
      return {
        scorer: "valid_json",
        score: 0,
        passed: false,
        rationale: (e as Error).message,
      };
    }
  };
 
export async function runEval(
  cases: EvalCase[],
  generate: (c: EvalCase) => Promise<string>,
  scorers: Scorer[],
): Promise<{ caseId: string; scores: ScoreResult[] }[]> {
  const results: { caseId: string; scores: ScoreResult[] }[] = [];
  const queue = [...cases];
  // Bound concurrency to 8 workers so you don't hammer rate limits.
  // Results arrive in completion order, not input order.
  const workers = Array.from({ length: 8 }, async () => {
    let c: EvalCase | undefined;
    while ((c = queue.shift())) {
      const output = await generate(c);
      const scores = await Promise.all(scorers.map((s) => s(c!, output)));
      results.push({ caseId: c.id, scores });
    }
  });
  await Promise.all(workers);
  return results;
}
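
The harness returns per-case scores; the "aggregate the scores" step still has to turn them into numbers you can compare. The sketch below is one possible aggregation, not the only one: it assumes each result has been joined with its case's tags (the ScoredCase shape is hypothetical), averages scorers within a case without weighting, and emits a meanScore plus a byTag map.

```typescript
// Hypothetical shape: a harness result joined with the tags of its case.
interface ScoredCase {
  caseId: string;
  tags: string[];
  scores: { scorer: string; score: number }[];
}

export interface Summary {
  meanScore: number;
  byTag: Record<string, number>;
}

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

export function summarize(results: ScoredCase[]): Summary {
  if (results.length === 0) throw new Error("no results to summarize");

  // One number per case: the unweighted mean of its scorer scores.
  // A case with no scores counts as 0 so it cannot silently pass.
  const perCase = results.map((r) => ({
    tags: r.tags,
    score: r.scores.length > 0 ? mean(r.scores.map((s) => s.score)) : 0,
  }));

  // Group case scores by tag; a case with several tags counts in each.
  const buckets = new Map<string, number[]>();
  for (const { tags, score } of perCase) {
    for (const tag of tags) {
      const bucket = buckets.get(tag) ?? [];
      bucket.push(score);
      buckets.set(tag, bucket);
    }
  }

  const byTag: Record<string, number> = {};
  buckets.forEach((scores, tag) => {
    byTag[tag] = mean(scores);
  });

  return { meanScore: mean(perCase.map((c) => c.score)), byTag };
}
```

Averaging every scorer equally is a simplification. If a format check and a judge should not carry the same weight, report them as separate means instead of blending them.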

LLM-as-judge, done with eyes open

LLM-as-judge works, but it has documented failure modes you must design around. From the research and from getting burned in production:

Position bias. Judges prefer whichever answer is shown first (or sometimes last) in pairwise comparisons. Mitigate by running both orderings and averaging, or by scoring absolute rather than pairwise.
Verbosity bias. Judges reward longer, more confident answers even when they're wrong.
Self-preference. A model tends to rate its own family's outputs higher. If GPT generates, consider having Claude judge, or vice versa.
Rubric sensitivity. Vague rubrics produce vague, high-variance scores. "Rate helpfulness 1–5" is nearly useless. Give the judge a concrete, anchored scale.
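
The position-bias mitigation (run both orderings and average) can be sketched as follows. PairwiseJudge is a hypothetical interface, assumed to return a 0..1 preference for whichever answer is shown first; the actual model call is injected, so nothing here depends on a particular provider.

```typescript
// Hypothetical pairwise judge: returns a 0..1 preference that the FIRST
// answer shown is better than the SECOND.
export type PairwiseJudge = (
  question: string,
  first: string,
  second: string,
) => Promise<number>;

// Preference for A over B, averaged across both presentation orders.
export async function debiasedPreference(
  judge: PairwiseJudge,
  question: string,
  answerA: string,
  answerB: string,
): Promise<{ preferA: number; orderGap: number }> {
  const [firstSlotWhenA, firstSlotWhenB] = await Promise.all([
    judge(question, answerA, answerB), // A shown first
    judge(question, answerB, answerA), // B shown first
  ]);
  const aWhenFirst = firstSlotWhenA;
  const aWhenSecond = 1 - firstSlotWhenB; // convert "B preferred" into "A preferred"
  return {
    preferA: (aWhenFirst + aWhenSecond) / 2,
    // A large gap means the verdict tracks presentation order, not content.
    orderGap: Math.abs(aWhenFirst - aWhenSecond),
  };
}
```

A judge that always favors the first slot lands at preferA = 0.5 with a large orderGap, which is a useful signal that its pairwise verdicts should not be trusted on that case.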

The most important discipline: your judge needs its own validation. Before you trust a judge to gate CI, have a human label 50–100 cases, then measure the judge's agreement with the human (Cohen's kappa, or simple accuracy on pass/fail). If the judge agrees with humans only 70% of the time, it cannot reliably catch a 5% regression — the noise floor is higher than the signal you're hunting. I aim for ≥85% agreement before a judge gates anything.
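
Both agreement measures are a few lines of arithmetic. Raw agreement can look inflated when most cases pass, because two raters who almost always say "pass" agree often by chance; Cohen's kappa corrects for that. The sketch below assumes binary pass/fail labels from one human and one judge over the same cases, and the function names are illustrative.

```typescript
// Fraction of cases where the judge and the human gave the same label.
export function rawAgreement(human: boolean[], judge: boolean[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("label arrays must be non-empty and the same length");
  }
  let agree = 0;
  for (let i = 0; i < human.length; i++) {
    if (human[i] === judge[i]) agree++;
  }
  return agree / human.length;
}

// Cohen's kappa for two raters with binary labels:
// (observed agreement - chance agreement) / (1 - chance agreement).
export function cohensKappa(human: boolean[], judge: boolean[]): number {
  const observed = rawAgreement(human, judge);
  const n = human.length;
  const humanPassRate = human.filter(Boolean).length / n;
  const judgePassRate = judge.filter(Boolean).length / n;
  // Probability both say "pass" by chance plus both say "fail" by chance.
  const expected =
    humanPassRate * judgePassRate + (1 - humanPassRate) * (1 - judgePassRate);
  // Kappa is undefined when both raters used one identical label throughout;
  // this sketch returns 1 in that case because agreement is total.
  if (expected === 1) return 1;
  return (observed - expected) / (1 - expected);
}
```

For example, with human labels [pass, pass, fail, fail] and judge labels [pass, pass, fail, pass], raw agreement is 0.75 but kappa is only 0.5.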

Here's a faithfulness judge for RAG: does the answer stay grounded in the retrieved context, or did it hallucinate? Note the explicit rubric, JSON-mode output validated against a zod schema, and temperature: 0.

import OpenAI from "openai";
import { z } from "zod";
 
const client = new OpenAI();
 
const Verdict = z.object({
  grounded: z.boolean(),
  score: z.number().min(0).max(1),
  unsupported_claims: z.array(z.string()),
  rationale: z.string(),
});
 
const RUBRIC = `You are grading whether an ANSWER is faithful to the CONTEXT.
Faithful means every factual claim in the ANSWER is directly supported by the CONTEXT.
 
Score with this anchored scale:
- 1.0: Every claim is supported. No outside facts introduced.
- 0.5: Mostly supported, but contains one minor unsupported detail.
- 0.0: Contains a claim that contradicts or is absent from the CONTEXT.
 
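List every unsupported claim verbatim. Do not reward fluency or length.

Respond with a single JSON object with exactly these keys:
"grounded" (boolean), "score" (number from 0 to 1),
"unsupported_claims" (array of strings), "rationale" (string).`;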
 
export async function faithfulnessJudge(
  context: string,
  answer: string,
): Promise<z.infer<typeof Verdict>> {
  const res = await client.chat.completions.create({
    model: "gpt-4.1",
    temperature: 0,
    messages: [
      { role: "system", content: RUBRIC },
      {
        role: "user",
        content: `CONTEXT:\n${context}\n\nANSWER:\n${answer}`,
      },
    ],
    response_format: { type: "json_object" },
  });
  return Verdict.parse(JSON.parse(res.choices[0].message.content!));
}

For RAG specifically, faithfulness is only half the picture. You also want answer relevance (did it address the question) and context relevance (did retrieval surface the right chunks). Frameworks such as Ragas and TruLens ship metrics along these lines (TruLens calls the three together the RAG triad); I find faithfulness and context relevance catch the most real bugs, since a retrieval regression is invisible to a generation-only eval.

Don't build the plumbing yourself — use promptfoo

You can write all of this from scratch, and the harness above shows it isn't much code. But for the common case, promptfoo gives you a declarative config, parallel execution, caching, a results UI, and built-in scorers (including LLM-rubric and similarity) for free. A config looks like this:

# promptfooconfig.yaml
providers:
  - openai:gpt-4.1
  - anthropic:claude-sonnet-4-5
prompts:
  - file://prompts/summarize_v3.txt
tests:
  - vars:
      input: file://cases/summ-042.txt
    assert:
      - type: is-json
      - type: javascript
        value: output.length < 500
      - type: llm-rubric
        provider: anthropic:claude-sonnet-4-5
        value: "Summary is 2 sentences, factually grounded, neutral tone."
      - type: similar
        value: file://cases/summ-042-ref.txt
        threshold: 0.75

Run it with npx promptfoo@latest eval and open npx promptfoo@latest view for the diff UI. Crucially, it runs the same cases across two providers at once, which is exactly the A/B comparison you need when deciding whether to migrate models.

Offline evals vs. online monitoring

These are two different jobs and you need both.

Offline evals run against your golden set, in CI, on every prompt or model change. They answer "did this change make things worse on cases I've already seen." Fast feedback, controlled, but blind to inputs you haven't captured yet.

Online monitoring runs against live traffic. You can't compute reference-based scores in production (no ground truth), so you lean on reference-free signals: faithfulness judges on a sampled %, output-format validity rate, latency, refusal rate, and user feedback (thumbs, edits, re-asks). When an online signal degrades, you sample those failing inputs back into the golden set — closing the flywheel.
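
"Faithfulness judges on a sampled %" needs a sampling rule. One option, sketched here under the assumption that every request has a stable string id, is to hash the id into a bucket so the same request always gets the same decision. inJudgeSample is a hypothetical helper, not an API from any tracing tool.

```typescript
import { createHash } from "node:crypto";

// Deterministically decide whether a request belongs to the judged sample.
// Hashing the id (instead of calling Math.random) keeps retries and replays
// of the same request consistent.
export function inJudgeSample(requestId: string, sampleRate: number): boolean {
  if (sampleRate < 0 || sampleRate > 1) {
    throw new Error("sampleRate must be between 0 and 1");
  }
  const digest = createHash("sha256").update(requestId).digest();
  // First 4 bytes as an unsigned 32-bit integer, mapped into [0, 1).
  const bucket = digest.readUInt32BE(0) / 2 ** 32;
  return bucket < sampleRate;
}
```

With a sample rate of 0.05, roughly one request in twenty is sent to the judge, and the online judge bill scales with that rate rather than with total traffic.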

Gate CI on score regression, not absolute thresholds

The CI gate is where evals earn their keep. The trap is hard absolute thresholds — they're either so low they catch nothing or so high they flap. Gate on regression against a stored baseline instead, with a tolerance band that accounts for judge variance.

#!/usr/bin/env bash
set -euo pipefail
 
# Run evals, write aggregate scores to results.json
npx tsx scripts/run-evals.ts --out results.json
 
# Compare against committed baseline; fail if the mean score (0..1 scale) drops by more than 0.03.
npx tsx scripts/check-regression.ts \
  --baseline evals/baseline.json \
  --current results.json \
  --tolerance 0.03
 
echo "Eval gate passed."

// check-regression.ts (core comparison)
import { readFileSync } from "node:fs";
 
const args = new Map<string, string>();
for (let i = 2; i < process.argv.length; i += 2) {
  args.set(process.argv[i].replace(/^--/, ""), process.argv[i + 1]);
}
 
const baseline = JSON.parse(readFileSync(args.get("baseline")!, "utf8"));
const current = JSON.parse(readFileSync(args.get("current")!, "utf8"));
const tol = Number(args.get("tolerance") ?? "0.03");
 
const drop = baseline.meanScore - current.meanScore;
if (drop > tol) {
  console.error(
    `Regression: mean ${current.meanScore.toFixed(3)} vs baseline ` +
      `${baseline.meanScore.toFixed(3)} (drop ${drop.toFixed(3)} > ${tol})`,
  );
  process.exit(1);
}
 
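// Also fail on any per-tag collapse, even if the mean holds.
// Tags with no baseline entry (newly added cases) are skipped rather than
// compared against an assumed perfect score.
for (const [tag, score] of Object.entries<number>(current.byTag)) {
  const base = baseline.byTag[tag];
  if (typeof base === "number" && base - score > 0.1) {
    console.error(
      `Tag "${tag}" regressed sharply: ${score.toFixed(3)} vs ${base.toFixed(3)}`,
    );
    process.exit(1);
  }
}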
console.log("No regression.");

That per-tag check matters: a prompt change can lift the mean while quietly destroying one segment, which is exactly the failure that bit me on my first LLM feature. You only see it if you slice.

Be honest about cost and flakiness

Two things nobody tells you up front. First, evals cost real money and time. A 200-case set with two LLM-judge scorers per case is 400 judge calls per eval run, on top of generation. At a few cents each and run on every PR, that adds up — budget for it, cache aggressively (promptfoo caches by default), and only run the full suite on prompt/model changes, not every commit. Second, LLM-judge scores are themselves noisy. Run judges at temperature: 0, and for borderline gates, run each judge case 3× and take the median to smooth variance. If your eval suite is itself flaky, nobody will trust the gate, and an untrusted gate gets disabled.
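
Both points reduce to small helpers. The sketch below shows a median-of-N wrapper for a noisy judge and the call-count arithmetic behind the 400-call figure. judgeOnce is a hypothetical stand-in for a single judge call, and the helper names are illustrative.

```typescript
// Median of a non-empty list of scores.
export function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Run a noisy judge several times on the same case and keep the median score.
export async function medianOfRuns(
  judgeOnce: () => Promise<number>,
  runs = 3,
): Promise<number> {
  const scores: number[] = [];
  for (let i = 0; i < runs; i++) {
    scores.push(await judgeOnce());
  }
  return median(scores);
}

// Judge calls per eval run: cases x judges per case x repeats per judge.
export function judgeCallsPerRun(
  cases: number,
  judgesPerCase: number,
  repeats = 1,
): number {
  return cases * judgesPerCase * repeats;
}
```

Note the trade-off: taking a median of three on every judge triples the 400 calls to 1,200, which is why it makes sense to reserve repeats for the borderline gates.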

Checklist

Golden set is ≥60% real production traffic, versioned in the repo, sliced by tags.
Every production failure (ticket, thumbs-down) gets a permanent eval case.
Use the cheapest scorer that can judge the property — heuristics before embeddings before LLM-judge.
Every LLM-judge has an anchored rubric, temperature: 0, structured output, and ≥85% measured agreement with human labels.
CI gates on regression vs. baseline plus per-tag collapse, not absolute thresholds.
Online monitoring uses reference-free signals and feeds failures back offline.
You've measured what the full suite costs per run and decided when it runs.

You cannot make an LLM deterministic. You can make its quality measurable, and a measurable thing is a thing you can defend in CI. That's the whole game.