Evaluating LLMs is the unsolved problem at the center of every production AI system. You can't ship a prompt change, a retrieval improvement, or a model upgrade without knowing whether it actually got better — or just better on the examples you happened to check. The engineers who solve this build systematic eval pipelines. The ones who don't spend their time debugging regressions in production.
Why Perplexity Is Not a Production Metric
Perplexity measures how well a language model predicts held-out text — lower is better. It's the standard pretraining metric. It's almost useless for evaluating whether your chatbot gives correct, faithful, useful answers.
Perplexity failure modes:
1. A model can have low perplexity and high hallucination rate
(fluent generation that confidently states wrong facts)
2. Perplexity measures distribution fit, not task performance
(a model might score well on next-token prediction but
fail at instruction following)
3. Perplexity is undefined for API-only models (GPT-4, Claude)
— you can't compute it without log-probabilities
Traditional NLP metrics fare no better for LLM outputs:
BLEU: n-gram overlap — punishes valid paraphrases
ROUGE: recall-oriented n-gram — designed for summarization,
ignores factual correctness
BERTScore: semantic similarity via BERT embeddings —
"The patient died" scores highly against
"The patient survived" (similar sentence structure)
What you actually need: metrics that measure correctness, faithfulness, and usefulness — not surface-level text similarity.
LLM-as-Judge: Using Models to Evaluate Models
The most scalable approach for evaluating free-form LLM outputs is using a capable model (GPT-4, Claude) as an evaluator. The judge reads the question, context, and answer, then scores along a specific rubric.
import os
import anthropic
from dataclasses import dataclass
from typing import Optional
client = anthropic.Anthropic()
JUDGE_MODEL = os.environ.get("JUDGE_MODEL", "your-judge-model")
FAITHFULNESS_PROMPT = """You are evaluating whether an AI assistant's answer is faithful to the provided context.
Context: {context}
Question: {question}
Answer: {answer}
Evaluate faithfulness on a scale of 1-5:
1 = Answer contains claims directly contradicting the context
2 = Answer has significant unsupported claims
3 = Answer mixes supported and unsupported claims
4 = Answer is mostly faithful with minor extrapolations
5 = Answer is entirely supported by the context
Return ONLY a JSON object:
{{"score": <1-5>, "reasoning": "<one sentence>", "unsupported_claims": ["<claim1>", ...]}}"""
RELEVANCY_PROMPT = """You are evaluating whether an AI assistant's answer directly addresses the question asked.
Question: {question}
Answer: {answer}
Evaluate answer relevancy on a scale of 1-5:
1 = Answer is completely off-topic
2 = Answer partially addresses the question
3 = Answer addresses the question but includes significant irrelevant content
4 = Answer mostly addresses the question with minor tangents
5 = Answer directly and completely addresses the question
Return ONLY a JSON object:
{{"score": <1-5>, "reasoning": "<one sentence>"}}"""
@dataclass
class EvalResult:
faithfulness: float
relevancy: float
reasoning: str
unsupported_claims: list[str]
def evaluate_rag_response(
question: str,
context: str,
answer: str
) -> EvalResult:
import json
faithfulness_response = client.messages.create(
model=JUDGE_MODEL,
max_tokens=500,
messages=[{
"role": "user",
"content": FAITHFULNESS_PROMPT.format(
context=context, question=question, answer=answer
)
}]
)
relevancy_response = client.messages.create(
model=JUDGE_MODEL,
max_tokens=200,
messages=[{
"role": "user",
"content": RELEVANCY_PROMPT.format(question=question, answer=answer)
}]
)
faith_data = json.loads(faithfulness_response.content[0].text)
rel_data = json.loads(relevancy_response.content[0].text)
return EvalResult(
faithfulness=faith_data["score"] / 5.0, # Normalize to 0-1
relevancy=rel_data["score"] / 5.0,
reasoning=faith_data["reasoning"],
unsupported_claims=faith_data.get("unsupported_claims", [])
)
Bias mitigation — the critical part most teams skip:
# LLM judges have well-documented biases:
# 1. Verbosity bias: longer answers score higher regardless of quality
# 2. Self-enhancement bias: models favor their own style
# 3. Position bias: in pairwise eval, the first option scores higher
# 4. Sycophancy: judges agree with confident-sounding answers
# Mitigation strategies:
# 1. Calibration: include gold standard examples in few-shot prompt
CALIBRATED_PROMPT = """
Here are examples of scoring:
- "Paris is the capital of Germany" against context saying it's France → faithfulness: 1
- "The study found mixed results" (context says clear positive results) → faithfulness: 2
...
Now evaluate:
"""
# 2. Swap order for pairwise comparison and check consistency
def pairwise_eval_with_swap(response_a, response_b, question):
score_ab = judge(question, response_a, response_b) # A vs B
score_ba = judge(question, response_b, response_a) # B vs A (swapped)
# If results disagree: mark as inconclusive (position bias detected)
if score_ab["winner"] == "A" and score_ba["winner"] == "A":
return "A" # Consistent: A wins
elif score_ab["winner"] == "B" and score_ba["winner"] == "B":
return "B" # Consistent: B wins
else:
return "tie" # Inconsistent: don't trust
# 3. Use a judge model that is different from the generating model when possible.
# Same-model evaluation can inflate scores via self-enhancement bias.
RAGAS: Evaluation Framework for RAG Pipelines
RAGAS provides four metrics specifically designed for RAG evaluation:
RAGAS Metrics:
Faithfulness:
Does the answer contain ONLY information from the retrieved context?
Fails when the model adds information from its training data not in context.
Answer Relevancy:
Does the answer address the question asked?
Low when the answer is technically correct but ignores the specific question.
Context Recall:
Does the retrieved context contain ALL information needed to answer?
Requires a ground truth answer. Low = retrieval is missing relevant docs.
Context Precision:
Are the retrieved documents actually relevant?
Low = too much irrelevant context retrieved (noise in the context window).
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision,
)
from datasets import Dataset
def run_ragas_evaluation(eval_samples: list[dict]) -> dict:
"""
eval_samples: list of dicts with keys:
- question: str
- answer: str (generated by your RAG system)
- contexts: list[str] (retrieved chunks)
- ground_truth: str (reference answer, for context_recall)
"""
dataset = Dataset.from_list(eval_samples)
results = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
return {
"faithfulness": results["faithfulness"],
"answer_relevancy": results["answer_relevancy"],
"context_recall": results["context_recall"],
"context_precision": results["context_precision"],
}
# Example catching a real hallucination:
sample = {
"question": "What is the default Kafka consumer timeout?",
"contexts": [
"The default session.timeout.ms for Kafka consumers is 45000 (45 seconds). "
"This can be configured in consumer.properties."
],
"answer": "The default Kafka consumer timeout is 30 seconds.",
# ← Model hallucinated 30s instead of 45s from context
"ground_truth": "The default Kafka consumer timeout (session.timeout.ms) is 45 seconds."
}
scores = run_ragas_evaluation([sample])
# faithfulness → 0.2 (low: answer contradicts context)
# context_recall → 1.0 (context had the answer)
# → Diagnosis: retrieval is fine, generation is hallucinating
This diagnostic pattern is powerful: low faithfulness + high context recall means your generator is hallucinating. Low context recall means your retriever is missing relevant documents. Different root causes, different fixes.
Building Automated Eval Pipelines
A production eval pipeline runs on every prompt change, model upgrade, or retrieval configuration change:
import json
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, asdict
@dataclass
class EvalCase:
id: str
question: str
ground_truth: str
tags: list[str] # e.g., ["factual", "multi-hop", "edge-case"]
@dataclass
class EvalRunResult:
case_id: str
answer: str
faithfulness: float
relevancy: float
latency_ms: float
passed: bool
class EvalHarness:
def __init__(self, rag_pipeline, eval_cases: list[EvalCase]):
self.rag = rag_pipeline
self.cases = eval_cases
self.results_dir = Path("eval_results")
self.results_dir.mkdir(exist_ok=True)
def run_suite(self, run_name: str) -> list[EvalRunResult]:
results = []
for case in self.cases:
import time
start = time.monotonic()
answer, contexts = self.rag.query(case.question)
latency_ms = (time.monotonic() - start) * 1000
scores = evaluate_rag_response(
question=case.question,
context="\n".join(contexts),
answer=answer
)
result = EvalRunResult(
case_id=case.id,
answer=answer,
faithfulness=scores.faithfulness,
relevancy=scores.relevancy,
latency_ms=latency_ms,
passed=scores.faithfulness >= 0.7 and scores.relevancy >= 0.7
)
results.append(result)
# Persist results for trend analysis
output_path = self.results_dir / f"{run_name}_{datetime.now().isoformat()}.json"
output_path.write_text(json.dumps([asdict(r) for r in results], indent=2))
return results
def regression_check(self, current_results, baseline_results, threshold=0.05):
"""Fail CI if metrics drop more than threshold vs baseline."""
current_faith = sum(r.faithfulness for r in current_results) / len(current_results)
baseline_faith = sum(r.faithfulness for r in baseline_results) / len(baseline_results)
if baseline_faith - current_faith > threshold:
raise AssertionError(
f"Faithfulness regression: {baseline_faith:.3f} → {current_faith:.3f} "
f"(delta {baseline_faith - current_faith:.3f} > threshold {threshold})"
)
Integrate into CI:
# .github/workflows/eval.yml
- name: Run LLM eval suite
run: python eval/run_suite.py --baseline eval_results/baseline.json
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
# Fails PR if faithfulness drops > 5% vs baseline
This is the point where evals become engineering infrastructure instead of a research notebook. Every run should have:
- the prompt version
- the model name and provider
- retrieval configuration
- embedding model version
- chunking strategy
- dataset version
- judge model version
- metric thresholds
- timestamp and commit SHA
Without this metadata, a score is hard to explain later. A faithfulness drop from 0.86 to 0.79 may be caused by a prompt change, a retrieval change, a model upgrade, or a dataset update. If the eval run does not record those inputs, you only know that something changed. You do not know what.
Production Release Gates
Treat LLM evals like performance tests: they should not be the only signal, but they should block obviously bad changes. A practical gate looks like this:
release_gate:
minimum_cases: 250
hard_fail:
faithfulness_p10_below: 0.70
pass_rate_below: 0.85
critical_regression_cases_failed: 1
regression_fail:
faithfulness_mean_drop_pp: 5
answer_relevancy_mean_drop_pp: 5
p95_latency_increase_percent: 25
manual_review:
judge_disagreement_rate_above: 0.15
cost_per_1000_requests_increase_percent: 20
The point is not to turn every prompt edit into a statistics dissertation. The point is to make unsafe changes visible. If faithfulness drops by 8 percentage points but latency improves, the pipeline should force a conversation instead of letting the change drift into production.
For high-risk domains such as medical, finance, legal, or security workflows, add human review to the release gate. LLM-as-judge is useful, but it is still a model. It can miss subtle domain-specific errors.
The Statistics of LLM Evaluation
The most common eval mistake: drawing conclusions from 20-50 examples. LLM outputs have high variance. You need enough samples to detect the difference you care about.
from scipy import stats
import numpy as np
def required_sample_size(
baseline_mean: float,
minimum_detectable_effect: float,
std_estimate: float = 0.2, # typical for LLM scores 0-1
alpha: float = 0.05,
power: float = 0.80
) -> int:
"""
How many eval cases needed to detect a given improvement?
Example: baseline faithfulness = 0.70
want to detect: improvement to 0.77 (0.07 effect)
std = 0.20 (typical for 0-1 scaled scores)
"""
effect_size = minimum_detectable_effect / std_estimate # Cohen's d
# For two-sample t-test:
n = (stats.norm.ppf(1 - alpha/2) + stats.norm.ppf(power))**2
n = n * 2 / (effect_size ** 2)
return int(np.ceil(n))
print(required_sample_size(0.70, 0.07))
# → ~250 samples per variant to detect a 7pp improvement with 80% power
# Compare two variants:
def compare_variants(scores_a, scores_b):
statistic, p_value = stats.mannwhitneyu(scores_a, scores_b, alternative='two-sided')
effect_size = (np.mean(scores_b) - np.mean(scores_a)) / np.std(scores_a + scores_b)
return {
"p_value": p_value,
"significant": p_value < 0.05,
"effect_size_d": effect_size,
"mean_a": np.mean(scores_a),
"mean_b": np.mean(scores_b),
"practical_improvement": np.mean(scores_b) - np.mean(scores_a)
}
Stratified eval sets: Your eval set should match the distribution of production queries — not just the easy cases or the ones that broke recently:
Eval set composition (recommended):
- 40% common queries (high-frequency, well-represented in training)
- 30% edge cases (ambiguous questions, out-of-domain)
- 20% regression cases (queries that previously failed)
- 10% adversarial cases (prompt injection attempts, contradictory context)
Hallucination Evaluation Specifically
Faithfulness scores catch hallucinations in RAG. For general LLM outputs without a retrieval context, use a consistency-check approach:
def check_hallucination_consistency(question: str, n_samples: int = 5) -> dict:
"""
Sample the same question multiple times.
High variance in answers = likely hallucination.
Consistent answers = more reliable (but not guaranteed correct).
"""
answers = []
for _ in range(n_samples):
response = client.messages.create(
model=JUDGE_MODEL,
max_tokens=500,
temperature=0.7, # Some randomness to surface inconsistency
messages=[{"role": "user", "content": question}]
)
answers.append(response.content[0].text)
# Use LLM to check consistency across samples
consistency_check = client.messages.create(
model=JUDGE_MODEL,
max_tokens=200,
messages=[{
"role": "user",
"content": f"Are these {n_samples} answers consistent with each other? "
f"Answer YES or NO, then explain any contradictions.\n\n"
+ "\n\n---\n\n".join(f"Answer {i+1}: {a}" for i, a in enumerate(answers))
}]
)
return {
"answers": answers,
"consistency_check": consistency_check.content[0].text
}
Production Eval Dashboard Metrics
Track these in Grafana/Datadog alongside your service metrics:
Eval metrics to monitor in production:
Daily eval run results:
llm.eval.faithfulness.p50 / p25 / p10 (distribution, not just mean)
llm.eval.relevancy.p50
llm.eval.pass_rate (% of cases above threshold)
Query-level sampling (1% of production traffic):
llm.production.faithfulness (estimated via lightweight judge)
llm.production.latency_p99
Regression detection:
Alert when: 7-day rolling avg faithfulness drops > 3pp vs 30-day avg
Alert when: pass_rate drops below 0.85
A 3-point drop in faithfulness over a week likely means either a model API change (the provider silently updated the model), prompt drift (upstream context format changed), or data drift (queries shifted to a domain your retrieval handles poorly). The eval pipeline tells you it happened. The metrics breakdown tells you why.
The teams that invest in eval infrastructure ship faster — not slower. When every change is evaluated automatically, you can move quickly without fear of silent regressions. The teams that skip eval infrastructure move fast until they break something important in production, then slow to a crawl debugging without signal.
