BLEU and ROUGE date from the early 2000s: BLEU was built for machine translation, ROUGE for summarization. Both measure n-gram overlap between a generated output and a reference string. They work fine when there is essentially one correct phrasing. They fail on any open-ended generation task where multiple phrasings are valid, where quality is a matter of depth or tone, or where no single canonical reference exists.
LLM-as-a-judge replaced them. The idea: use a strong model to read an output and score it against a rubric, the same way a human evaluator would. This guide is a practical walkthrough — how to set it up, how to fight the biases, and how to run it in production without it becoming a cost or latency problem.
The three modes of LLM-as-a-judge
Mode 1: Single-output scoring without reference
The judge reads only the output and a rubric. No reference answer. Use this when:
- You're evaluating properties that don't require comparison — tone, format compliance, safety, coherence.
- You don't have a ground truth answer and don't want to generate one.
- You're evaluating open-ended creative or conversational outputs where no single reference is correct.
// Single-output without reference
{
"output": "Here are three Python libraries for data visualization: matplotlib for static charts, plotly for interactive ones, and seaborn for statistical plots.",
"rubric": "Rate the completeness and accuracy of this answer on a scale of 1-5. Consider whether it names specific tools, gives a brief description of each, and covers the major use cases."
}
Mode 2: Single-output scoring with reference
The judge reads the output, a reference answer, and a rubric. Use this when:
- You have a golden dataset with expected answers.
- You need to detect when key information from the reference is missing from the output.
- You're evaluating factual correctness against a known ground truth.
The reference doesn't have to be the only correct answer — a good rubric will instruct the judge to accept equivalent correct answers that don't match the reference verbatim.
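A minimal input for this mode, mirroring the Mode 1 example above. The answer, reference, and rubric here are illustrative, not a required schema:
// Single-output with reference
{
  "output": "You can cancel an annual plan at any time and receive a prorated refund.",
  "reference": "Annual plans can be cancelled at any time. Refunds are issued on a prorated basis. Monthly plans are not refundable.",
  "rubric": "Rate factual agreement with the reference on a scale of 1-5. Accept paraphrases and equivalent correct answers; penalize facts that are missing from or contradicted by the output (here, the monthly-plan detail is missing)."
}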
Mode 3: Pairwise comparison
The judge reads two outputs (A and B) and decides which is better on a given criterion. Use this when:
- You're comparing two versions of a prompt, model, or pipeline.
- You want an A/B evaluation without needing an absolute score.
- The quality dimension is hard to score absolutely but easy to compare relatively (e.g., "which answer is more concise while being accurate?").
Pairwise is powerful but requires care with position bias (see below). Always evaluate both orderings (A-then-B and B-then-A) and average the results.
Anatomy of a good judge prompt
The quality of your judge is the quality of your rubric. A vague rubric produces inconsistent scores. Here's a complete annotated judge prompt for evaluating RAG answer quality:
You are evaluating the quality of an AI assistant's answer to a user question.
## Question
{question}
## Retrieved Context
{context}
## Answer to Evaluate
{answer}
## Rubric
Score the answer from 1 to 5 on FAITHFULNESS: whether every claim in the answer
can be traced back to the provided context. Use these criteria:
1 - Answer contains multiple claims not supported by the context
2 - Answer contains at least one significant unsupported claim
3 - Answer is mostly grounded but contains minor speculation or inference
4 - Answer is fully grounded with only explicit information from context
5 - Answer is fully grounded AND correctly acknowledges when context is insufficient
## Instructions
Think step by step:
1. Identify each distinct factual claim in the answer.
2. For each claim, find the sentence in the context that supports it.
3. Flag any claims with no supporting evidence in the context.
4. Assign a score based on the rubric.
Return your response as JSON:
{
"reasoning": "step-by-step analysis",
"score": <integer 1-5>,
"unsupported_claims": ["list of any unsupported claims"]
}
The key elements, annotated:
- Clear rubric with concrete criteria: Each score level is defined with a specific behavioral description, not just "bad/ok/good".
- Chain-of-thought before score: Asking the model to reason step-by-step before assigning a score reliably improves consistency. The score appears after the reasoning, not before.
- Structured output: JSON output makes it easy to parse the score programmatically and log the explanation for debugging; a minimal calling-and-parsing sketch follows this list.
- Diagnostic fields: unsupported_claims gives you actionable information, not just a number.
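Putting the prompt to work is a single model call plus JSON parsing. Here is a minimal sketch, assuming the OpenAI Python SDK as the judge client; the model name and zero temperature are illustrative choices, and JUDGE_PROMPT stands in for the full prompt above:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The full faithfulness prompt from above, with {question}, {context},
# and {answer} placeholders left in the text.
JUDGE_PROMPT = """..."""

def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    # Use str.replace rather than str.format, because the template contains
    # literal JSON braces that would confuse format().
    prompt = (JUDGE_PROMPT
              .replace("{question}", question)
              .replace("{context}", context)
              .replace("{answer}", answer))
    resp = client.chat.completions.create(
        model="gpt-4o",                            # illustrative judge choice
        temperature=0.0,                           # keep scoring as deterministic as possible
        response_format={"type": "json_object"},   # ask for parseable JSON
        messages=[{"role": "user", "content": prompt}],
    )
    # Matches the prompt's schema: "reasoning", "score", "unsupported_claims"
    return json.loads(resp.choices[0].message.content)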
The biases you have to fight
LLM judges inherit the same biases as the models they're built on. Left unchecked, these biases will make your eval scores misleading.
Position bias (pairwise evaluations)
Problem: In pairwise comparison, the first output presented tends to score higher, regardless of actual quality. The magnitude varies by model but can be as large as 5–10 percentage points.
Fix: Run each pairwise comparison twice — once with A first, once with B first. Average the results. If both orderings agree, you have a reliable signal. If they disagree, treat the pair as a tie or escalate to human review.
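A minimal sketch of that fix, assuming a hypothetical pairwise_judge(question, first, second) helper that returns "first", "second", or "tie" for whichever output it prefers:

def debiased_pairwise(question: str, output_a: str, output_b: str) -> str:
    # Run the comparison in both orders to cancel position bias.
    r1 = pairwise_judge(question, first=output_a, second=output_b)
    r2 = pairwise_judge(question, first=output_b, second=output_a)

    # Map both verdicts back to A/B labels.
    v1 = {"first": "A", "second": "B", "tie": "tie"}[r1]
    v2 = {"first": "B", "second": "A", "tie": "tie"}[r2]

    if v1 == v2:
        return v1      # orderings agree: reliable signal
    return "tie"       # orderings disagree: treat as a tie or escalate to human review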
Verbosity bias
Problem: Longer, more elaborate answers score higher than shorter, equally correct answers. This systematically rewards padding and discourages conciseness.
Fix: Include explicit length guidance in your rubric: "A concise, accurate answer should score the same as a longer accurate answer. Do not reward length for its own sake." For high-stakes evals, also add: "Penalize answers that pad with hedging or unnecessary caveats."
Self-preference bias
Problem: A model family systematically scores outputs from the same family higher. GPT-4 evaluating GPT-4o outputs vs Claude outputs shows measurable preference for GPT outputs. The reverse is also true.
Fix: Use a judge from a different provider family than the model being evaluated. If your RAG generator is Claude-based, use GPT-4 as your judge. If it's GPT-based, use Claude. This is easy to implement and eliminates most self-preference effects.
Leniency drift
Problem: Judge scores tend to drift upward over time as models become more lenient with updates. If you're comparing eval runs from three months ago to today, some of the improvement may be the judge, not your pipeline.
Fix: Maintain a calibration set of 50–100 human-labeled examples. When you suspect drift (quarterly, or after a judge model update), re-run the calibration set and compare the judge's scores to the human labels. If the correlation has dropped, recalibrate by adjusting your rubric or switching judge versions.
Choosing a judge model
The right judge model depends on your use case, volume, and compliance constraints.
One practical rule: never use the same model as both generator and judge. The self-preference bias is real and measurable, and it produces artificially inflated scores for your pipeline while masking genuine quality issues.
Calibrating against human labels
You can't trust an uncalibrated judge. Before relying on any LLM judge for decisions (CI gates, production alerts, model selection), validate it against human labels.
The minimum viable calibration process
- Collect 50–100 examples that cover your evaluation task. Include examples across the full score range — don't just collect easy positives and obvious negatives.
- Have humans label them. Two annotators per example is ideal; one is acceptable if you can't afford two. If possible, have your domain experts label, not crowdworkers.
- Run your judge on the same examples.
- Compute agreement: Cohen's kappa for categorical scores (pass/fail, 1–5 scale), Spearman correlation for continuous scores. A minimal sketch of this computation follows the interpretation list below.
- Interpret the result:
- Kappa > 0.6 (or Spearman r > 0.7): the judge is reliable. Ship it.
- Kappa 0.4–0.6: marginal. Tighten your rubric and add more few-shot examples before using in CI gates.
- Kappa < 0.4: the rubric is too vague or the task is too ambiguous for automated judging. Redesign the rubric.
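A minimal sketch of the agreement computation, assuming the human and judge scores are aligned lists. It uses scikit-learn and SciPy; quadratic weighting for 1–5 scores is one reasonable choice, not the only one:

from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

human_labels = [5, 4, 2, 5, 3, 1, 4]   # human scores on the calibration set
judge_labels = [5, 4, 3, 5, 3, 2, 4]   # judge scores on the same examples

# Cohen's kappa for categorical/ordinal scores; quadratic weights penalize
# a 1-vs-5 disagreement more heavily than a 4-vs-5 one.
kappa = cohen_kappa_score(human_labels, judge_labels, weights="quadratic")

# Spearman rank correlation for continuous (or ordinal) scores.
rho, _pvalue = spearmanr(human_labels, judge_labels)

print(f"kappa={kappa:.2f}  spearman={rho:.2f}")
if kappa > 0.6 or rho > 0.7:
    print("Judge agreement is high enough to rely on for CI gates.")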
When to recalibrate
- When you update the judge model version (even minor updates can shift scoring behavior)
- When you modify the rubric
- On a quarterly schedule to catch leniency drift
- When you onboard a new evaluation task
Production patterns
Sampling: judge asynchronously, not on the critical path
Never run a judge synchronously on a user request. The judge adds latency equal to a full LLM call — unacceptable on any user-facing path. Instead, log requests to a queue and run the judge asynchronously. eval.ninja's latency is exactly the judge model's latency, with no application-layer overhead added.
import os
import random
import asyncio
import httpx

EVAL_NINJA_KEY = os.environ["EVAL_NINJA_KEY"]  # read the API key from the environment, not source
async def post_eval(question: str, answer: str, contexts: list[str]) -> dict:
async with httpx.AsyncClient() as client:
r = await client.post(
"https://api.eval.ninja/v1/evaluate",
headers={
"Authorization": f"Bearer {EVAL_NINJA_KEY}",
"Content-Type": "application/json",
},
json={
"user_input": question,
"response": answer,
"retrieved_contexts": contexts,
"metrics": ["faithfulness"],
},
)
r.raise_for_status()
return r.json()
async def handle_rag_request(question: str, context: list[str]) -> str:
answer = await generate_answer(question, context) # your RAG pipeline
# Sample 3% of traffic for judging — never blocks the response
if random.random() < 0.03:
asyncio.create_task(post_eval(question, answer, context))
    return answer
Storing judge outputs for debugging
The score alone is useless for debugging. Always store the full judge output — score, chain-of-thought reasoning, and any structured diagnostic fields. When a score drops, the reasoning tells you why. This is especially valuable when investigating sudden drops after a model update or data distribution change.
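A minimal sketch of that logging, assuming the judge verdict is the JSON dict described earlier and using a local JSONL file as a stand-in for whatever store you actually use:

import json
import time

def log_judge_verdict(question: str, answer: str, verdict: dict, path: str = "judge_logs.jsonl") -> None:
    # Persist the full verdict, not just the score, so regressions can be diagnosed later.
    record = {
        "ts": time.time(),
        "question": question,
        "answer": answer,
        "score": verdict.get("score"),
        "reasoning": verdict.get("reasoning"),
        "unsupported_claims": verdict.get("unsupported_claims", []),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")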
End-to-end code walkthrough
Here's a complete example: define a rubric, run the evaluation, parse the result, and fail CI if below threshold.
curl -X POST https://api.eval.ninja/v1/evaluate \
-H "Authorization: Bearer $EVAL_NINJA_KEY" \
-H "Content-Type: application/json" \
-d '{
"user_input": "What is the cancellation policy?",
"response": "You can cancel anytime. Refunds are prorated for annual plans.",
"retrieved_contexts": ["Annual plans can be cancelled at any time. Refunds are issued on a prorated basis."],
"metrics": ["faithfulness"]
}'
The same check as a Python script that exits nonzero so it can gate CI:
import os, requests, sys
EVAL_NINJA_KEY = os.environ["EVAL_NINJA_KEY"]
sample = {
"question": "What is the cancellation policy?",
"answer": "You can cancel anytime. Refunds are prorated for annual plans.",
"contexts": ["Annual plans can be cancelled at any time. Refunds are issued on a prorated basis."],
}
result = requests.post(
"https://api.eval.ninja/v1/evaluate",
headers={"Authorization": f"Bearer {EVAL_NINJA_KEY}"},
json={
"user_input": sample["question"],
"response": sample["answer"],
"retrieved_contexts": sample["contexts"],
"metrics": ["faithfulness"],
},
).json()
faith = next(m for m in result["metrics"] if m["name"] == "faithfulness")
score = faith["score"]
if score < 0.85:
print(f"FAIL: faithfulness score {score:.2f}")
print(faith["interpretation"])
    sys.exit(1)
The same call from JavaScript, assuming a sample object shaped like the one in the Python script:
const res = await fetch('https://api.eval.ninja/v1/evaluate', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.EVAL_NINJA_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
user_input: sample.question,
response: sample.answer,
retrieved_contexts: sample.contexts,
metrics: ['faithfulness'],
})
});
const body = await res.json();
const faith = body.metrics.find((m) => m.name === 'faithfulness');
if (faith.score < 0.85) process.exit(1);