BLEU was built for machine translation and ROUGE for summarization in the early 2000s. They measure n-gram overlap between a generated output and a reference string. They work fine when there's exactly one correct phrasing. They fail on any open-ended generation task where multiple phrasings are valid, where quality is a matter of depth or tone, or where no single reference can capture the space of good answers.

LLM-as-a-judge replaced them. The idea: use a strong model to read an output and score it against a rubric, the same way a human evaluator would. This guide is a practical walkthrough — how to set it up, how to fight the biases, and how to run it in production without it becoming a cost or latency problem.


The three modes of LLM-as-a-judge

Mode 1: Single-output scoring without reference

The judge reads only the output and a rubric. No reference answer. Use this when:

  • You're evaluating properties that don't require comparison — tone, format compliance, safety, coherence.
  • You don't have a ground truth answer and don't want to generate one.
  • You're evaluating open-ended creative or conversational outputs where no single reference is correct.
// Single-output without reference
{
  "output": "Here are three Python libraries for data visualization: matplotlib for static charts, plotly for interactive ones, and seaborn for statistical plots.",
  "rubric": "Rate the completeness and accuracy of this answer on a scale of 1-5. Consider whether it names specific tools, gives a brief description of each, and covers the major use cases."
}

Mode 2: Single-output scoring with reference

The judge reads the output, a reference answer, and a rubric. Use this when:

  • You have a golden dataset with expected answers.
  • You need to detect when key information from the reference is missing from the output.
  • You're evaluating factual correctness against a known ground truth.

The reference doesn't have to be the only correct answer — a good rubric will instruct the judge to accept equivalent correct answers that don't match the reference verbatim.
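Mirroring the Mode 1 example, a reference-based request might look like this (the field names are illustrative, not a fixed schema):

```
// Single-output with reference
{
  "output": "You can cancel at any time, and annual plans get prorated refunds.",
  "reference": "Annual plans can be cancelled at any time. Refunds are issued on a prorated basis.",
  "rubric": "Rate factual agreement with the reference on a scale of 1-5. Accept answers that are correct but phrased differently from the reference; penalize answers missing key information from it."
}
```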

Mode 3: Pairwise comparison

The judge reads two outputs (A and B) and decides which is better on a given criterion. Use this when:

  • You're comparing two versions of a prompt, model, or pipeline.
  • You want an A/B evaluation without needing an absolute score.
  • The quality dimension is hard to score absolutely but easy to compare relatively (e.g., "which answer is more concise while being accurate?").

Pairwise is powerful but requires care with position bias (see below). Always evaluate both orderings (A-then-B and B-then-A) and average the results.
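A pairwise request follows the same shape as the earlier examples (field names again illustrative):

```
// Pairwise comparison
{
  "output_a": "Cancel anytime; refunds on annual plans are prorated.",
  "output_b": "You can cancel your subscription whenever you like. If you are on an annual plan, any refund will be calculated on a prorated basis.",
  "rubric": "Which answer is more concise while remaining accurate? Answer 'A' or 'B' and explain your choice."
}
```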


Anatomy of a good judge prompt

The quality of your judge is the quality of your rubric. A vague rubric produces inconsistent scores. Here's a complete annotated judge prompt for evaluating RAG answer quality:

You are evaluating the quality of an AI assistant's answer to a user question.

## Question
{question}

## Retrieved Context
{context}

## Answer to Evaluate
{answer}

## Rubric

Score the answer from 1 to 5 on FAITHFULNESS: whether every claim in the answer
can be traced back to the provided context. Use these criteria:

1 - Answer contains multiple claims not supported by the context
2 - Answer contains at least one significant unsupported claim
3 - Answer is mostly grounded but contains minor speculation or inference
4 - Answer is fully grounded with only explicit information from context
5 - Answer is fully grounded AND correctly acknowledges when context is insufficient

## Instructions

Think step by step:
1. Identify each distinct factual claim in the answer.
2. For each claim, find the sentence in the context that supports it.
3. Flag any claims with no supporting evidence in the context.
4. Assign a score based on the rubric.

Return your response as JSON:
{
  "reasoning": "step-by-step analysis",
  "score": <integer 1-5>,
  "unsupported_claims": ["list of any unsupported claims"]
}

The key elements, annotated:

  • Clear rubric with concrete criteria: Each score level is defined with a specific behavioral description, not just "bad/ok/good".
  • Chain-of-thought before score: Asking the model to reason step-by-step before assigning a score reliably improves consistency. The score appears after the reasoning, not before.
  • Structured output: JSON output makes it easy to parse the score programmatically and log the explanation for debugging.
  • Diagnosis fields: unsupported_claims gives you actionable information, not just a number.
Tip: add 2–3 few-shot examples directly in the prompt for tricky cases — a score-3 example with explanation is worth more than two paragraphs of rubric text. Few-shot examples anchor the judge's internal scale and reduce variance at the margins.
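A hypothetical score-3 few-shot example for the faithfulness rubric above could be as short as this:

```
## Example (score 3)

Context: "The Pro plan costs $20/month."
Answer: "The Pro plan costs $20/month and likely includes priority support."
Reasoning: The price is supported by the context, but "likely includes
priority support" is speculation with no supporting sentence.
Score: 3
```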

The biases you have to fight

LLM judges inherit the same biases as the models they're built on. Left unchecked, these biases will make your eval scores misleading.

Position bias (pairwise evaluations)

Problem: In pairwise comparison, the first output presented tends to score higher, regardless of actual quality. The magnitude varies by model but can be as large as 5–10 percentage points.

Fix: Run each pairwise comparison twice — once with A first, once with B first. Average the results. If both orderings agree, you have a reliable signal. If they disagree, treat the pair as a tie or escalate to human review.
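The swap-and-average rule can be sketched in a few lines. Here `judge` stands in for whatever function calls your judge model; its signature and "first"/"second" return values are assumptions for the sketch:

```python
def pairwise_verdict(judge, prompt: str, out_a: str, out_b: str) -> str:
    """Run a pairwise judge in both orderings and combine the verdicts.

    `judge(prompt, first, second)` is assumed to return "first" or "second"
    for whichever of the two presented outputs it prefers.
    """
    a_wins_when_first = judge(prompt, out_a, out_b) == "first"
    a_wins_when_second = judge(prompt, out_b, out_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"   # both orderings agree: reliable signal
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"     # orderings disagree: position bias suspected
```

A judge that always prefers the first position produces ties here rather than a false winner, which is exactly the failure mode you want surfaced.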

Verbosity bias

Problem: Longer, more elaborate answers score higher than shorter, equally correct answers. This systematically rewards padding and discourages conciseness.

Fix: Include explicit length guidance in your rubric: "A concise, accurate answer should score the same as a longer accurate answer. Do not reward length for its own sake." For high-stakes evals, also add: "Penalize answers that pad with hedging or unnecessary caveats."

Self-preference bias

Problem: A model family systematically scores outputs from the same family higher. GPT-4 evaluating GPT-4o outputs vs Claude outputs shows measurable preference for GPT outputs. The reverse is also true.

Fix: Use a judge from a different provider family than the model being evaluated. If your RAG generator is Claude-based, use GPT-4 as your judge. If it's GPT-based, use Claude. This is easy to implement and eliminates most self-preference effects.

Leniency drift

Problem: Judge scores tend to drift upward over time as models become more lenient with updates. If you're comparing eval runs from three months ago to today, some of the improvement may be the judge, not your pipeline.

Fix: Maintain a calibration set of 50–100 human-labeled examples. When you suspect drift (quarterly, or after a judge model update), re-run the calibration set and compare the judge's scores to the human labels. If the correlation has dropped, recalibrate by adjusting your rubric or switching judge versions.


Choosing a judge model

The right judge model depends on your use case, volume, and compliance constraints:

Frontier (GPT-4o, Claude Sonnet 4)
Best for high-stakes evaluations — final pre-launch checks, compliance audits, calibrating your eval suite. Most consistent, highest cost. Use for CI gates where quality matters most.
Mid-tier (GPT-4o-mini, Claude Haiku)
Good for high-volume production sampling. ~85–90% as accurate as frontier models at 10–20x lower cost. Run these on your 1–5% production sample.
Local (via Bedrock, Ollama, vLLM)
For compliance-bound environments where no data can leave your network. Llama 3.1 70B is the current sweet spot for local judging — comparable to GPT-4o-mini on well-defined rubrics. eval.ninja supports BYOK to any provider including local endpoints.

One practical rule: never use the same model as both generator and judge. The self-preference bias is real and measurable, and it produces artificially inflated scores for your pipeline while masking genuine quality issues.


Calibrating against human labels

You can't trust an uncalibrated judge. Before relying on any LLM judge for decisions (CI gates, production alerts, model selection), validate it against human labels.

The minimum viable calibration process

  1. Collect 50–100 examples that cover your evaluation task. Include examples across the full score range — don't just collect easy positives and obvious negatives.
  2. Have humans label them. Two annotators per example is ideal; one is acceptable if you can't afford two. If possible, have your domain experts label, not crowdworkers.
  3. Run your judge on the same examples.
  4. Compute agreement: Cohen's kappa for categorical labels (pass/fail), weighted kappa for ordinal 1–5 scales, Spearman correlation for continuous scores.
  5. Interpret the result:
    • Kappa > 0.6 (or Spearman r > 0.7): the judge is reliable. Ship it.
    • Kappa 0.4–0.6: marginal. Tighten your rubric and add more few-shot examples before using in CI gates.
    • Kappa < 0.4: the rubric is too vague or the task is too ambiguous for automated judging. Redesign the rubric.
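
Step 4 does not require a stats library. A dependency-free sketch of unweighted Cohen's kappa (for weighted kappa or Spearman, reach for scipy or scikit-learn):

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    assert len(human) == len(judge) and human
    n = len(human)
    # Observed agreement: fraction of examples where the labels match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if each rater labeled independently with their marginals.
    ph, pj = Counter(human), Counter(judge)
    expected = sum((ph[c] / n) * (pj[c] / n) for c in set(ph) | set(pj))
    return (observed - expected) / (1 - expected)
```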

When to recalibrate

  • When you update the judge model version (even minor updates can shift scoring behavior)
  • When you modify the rubric
  • On a quarterly schedule to catch leniency drift
  • When you onboard a new evaluation task

Production patterns

Sampling: judge asynchronously, not on the critical path

Never run a judge synchronously on a user request. The judge adds latency equal to a full LLM call — unacceptable on any user-facing path. Instead, log requests to a queue and run the judge asynchronously. eval.ninja's latency is exactly the judge model's latency, with no application-layer overhead added.

import random
import asyncio
import httpx

EVAL_NINJA_KEY = "..."  # from env

async def post_eval(question: str, answer: str, contexts: list[str]) -> dict:
    async with httpx.AsyncClient() as client:
        r = await client.post(
            "https://api.eval.ninja/v1/evaluate",
            headers={
                "Authorization": f"Bearer {EVAL_NINJA_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "user_input": question,
                "response": answer,
                "retrieved_contexts": contexts,
                "metrics": ["faithfulness"],
            },
        )
        r.raise_for_status()
        return r.json()

async def handle_rag_request(question: str, context: list[str]) -> str:
    answer = await generate_answer(question, context)  # your RAG pipeline

    # Sample 3% of traffic for judging — never blocks the response.
    # In production, keep a reference to the task (e.g. in a module-level set)
    # so it isn't garbage-collected before the judge call completes.
    if random.random() < 0.03:
        asyncio.create_task(post_eval(question, answer, context))

    return answer

Storing judge outputs for debugging

The score alone is useless for debugging. Always store the full judge output — score, chain-of-thought reasoning, and any structured diagnostic fields. When a score drops, the reasoning tells you why. This is especially valuable when investigating sudden drops after a model update or data distribution change.


End-to-end code walkthrough

Here's a complete example: define a rubric, run the evaluation, parse the result, and fail CI if below threshold.

curl -X POST https://api.eval.ninja/v1/evaluate \
  -H "Authorization: Bearer $EVAL_NINJA_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_input": "What is the cancellation policy?",
    "response": "You can cancel anytime. Refunds are prorated for annual plans.",
    "retrieved_contexts": ["Annual plans can be cancelled at any time. Refunds are issued on a prorated basis."],
    "metrics": ["faithfulness"]
  }'

The same request from Python, used as a CI gate:

import os, requests, sys

EVAL_NINJA_KEY = os.environ["EVAL_NINJA_KEY"]
sample = {
    "question": "What is the cancellation policy?",
    "answer": "You can cancel anytime. Refunds are prorated for annual plans.",
    "contexts": ["Annual plans can be cancelled at any time. Refunds are issued on a prorated basis."],
}

result = requests.post(
    "https://api.eval.ninja/v1/evaluate",
    headers={"Authorization": f"Bearer {EVAL_NINJA_KEY}"},
    json={
        "user_input": sample["question"],
        "response": sample["answer"],
        "retrieved_contexts": sample["contexts"],
        "metrics": ["faithfulness"],
    },
).json()

faith = next(m for m in result["metrics"] if m["name"] == "faithfulness")
score = faith["score"]
if score < 0.85:
    print(f"FAIL: faithfulness score {score:.2f}")
    print(faith["interpretation"])
    sys.exit(1)

And the equivalent gate in JavaScript, with the same sample defined inline:

const sample = {
  question: 'What is the cancellation policy?',
  answer: 'You can cancel anytime. Refunds are prorated for annual plans.',
  contexts: ['Annual plans can be cancelled at any time. Refunds are issued on a prorated basis.'],
};

const res = await fetch('https://api.eval.ninja/v1/evaluate', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.EVAL_NINJA_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    user_input: sample.question,
    response: sample.answer,
    retrieved_contexts: sample.contexts,
    metrics: ['faithfulness'],
  })
});
const body = await res.json();
const faith = body.metrics.find((m) => m.name === 'faithfulness');
if (faith.score < 0.85) process.exit(1);

Frequently asked questions

What is LLM-as-a-judge?

LLM-as-a-judge is an evaluation technique where a capable language model scores another model's output against a rubric or reference answer. It replaces reference-based metrics like BLEU and ROUGE that break on open-ended generation tasks, providing human-like quality assessments at scale without requiring human annotators for every evaluation.

Is LLM-as-a-judge reliable?

Yes, when properly calibrated. Frontier models achieve 80–90% agreement with human annotators on well-defined rubrics — comparable to inter-annotator agreement between humans. Reliability degrades with vague rubrics, uncalibrated judges, or domain-specific tasks requiring expertise the judge lacks. Always calibrate against a human-labeled sample before using in production gates.

Which model should I use as a judge?

Use a frontier model from a different provider family than your generator. If you're generating with Claude, judge with GPT-4o. If you're generating with GPT, judge with Claude. For high-volume production sampling, drop to a mid-tier model (GPT-4o-mini, Claude Haiku) to control cost. For compliance environments, eval.ninja supports BYOK to local models via Ollama or AWS Bedrock so no data leaves your network.

What is position bias?

Position bias is the tendency of an LLM judge to favor whichever output appears first in a pairwise comparison prompt. It can be 5–10 percentage points in magnitude. The fix is simple: evaluate both orderings (A-then-B and B-then-A) and average. If both orderings agree on the winner, the signal is reliable. If they disagree, treat it as a tie.

How does LLM-as-a-judge differ from ROUGE and BLEU?

ROUGE and BLEU measure surface-level token overlap with a reference string. They work well for tasks where one correct phrasing exists (e.g., machine translation with a reference). They fail on open-ended tasks: two semantically identical answers with different wording score very differently. LLM-as-a-judge evaluates semantic quality and relevance, not surface overlap — a much better fit for evaluating chatbots, RAG systems, and any open-ended generation task.