AI/ML

LLM Evaluation at Scale: LLM-as-Judge, RAGAS, and Building Automated Eval Pipelines

How to evaluate LLM systems in production: LLM-as-judge patterns with bias mitigation, RAGAS metrics for RAG pipelines (faithfulness, context recall, answer relevancy), BERTScore vs ROUGE trade-offs, building regression test suites for prompts, and the statistical rigor needed to trust eval results.

Sachin Sarawgi·April 8, 2026·11 min read
#llm evaluation#ragas#llm-as-judge#rag#bertScore#hallucination#ai#testing

Evaluating LLMs is the unsolved problem at the center of every production AI system. You can't ship a prompt change, a retrieval improvement, or a model upgrade without knowing whether it actually got better — or just better on the examples you happened to check. The engineers who solve this build systematic eval pipelines. The ones who don't spend their time debugging regressions in production.

Why Perplexity Is Not a Production Metric

Perplexity measures how well a language model predicts held-out text — lower is better. It's the standard pretraining metric. It's almost useless for evaluating whether your chatbot gives correct, faithful, useful answers.

Perplexity failure modes:
1. A model can have low perplexity and high hallucination rate
   (fluent generation that confidently states wrong facts)

2. Perplexity measures distribution fit, not task performance
   (a model might score well on next-token prediction but
    fail at instruction following)

3. Perplexity is undefined for API-only models (GPT-4, Claude)
   — you can't compute it without log-probabilities

Traditional NLP metrics fare no better for LLM outputs:
BLEU:  n-gram overlap — punishes valid paraphrases
ROUGE: recall-oriented n-gram — designed for summarization,
       ignores factual correctness
BERTScore: semantic similarity via BERT embeddings —
           "The patient died" scores highly against
           "The patient survived" (similar sentence structure)

What you actually need: metrics that measure correctness, faithfulness, and usefulness — not surface-level text similarity.

LLM-as-Judge: Using Models to Evaluate Models

The most scalable approach for evaluating free-form LLM outputs is using a capable model (GPT-4, Claude) as an evaluator. The judge reads the question, context, and answer, then scores along a specific rubric.

import os
import anthropic
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()
JUDGE_MODEL = os.environ.get("JUDGE_MODEL", "your-judge-model")

FAITHFULNESS_PROMPT = """You are evaluating whether an AI assistant's answer is faithful to the provided context.

Context: {context}

Question: {question}

Answer: {answer}

Evaluate faithfulness on a scale of 1-5:
1 = Answer contains claims directly contradicting the context
2 = Answer has significant unsupported claims
3 = Answer mixes supported and unsupported claims
4 = Answer is mostly faithful with minor extrapolations
5 = Answer is entirely supported by the context

Return ONLY a JSON object:
{{"score": <1-5>, "reasoning": "<one sentence>", "unsupported_claims": ["<claim1>", ...]}}"""

RELEVANCY_PROMPT = """You are evaluating whether an AI assistant's answer directly addresses the question asked.

Question: {question}

Answer: {answer}

Evaluate answer relevancy on a scale of 1-5:
1 = Answer is completely off-topic
2 = Answer partially addresses the question
3 = Answer addresses the question but includes significant irrelevant content
4 = Answer mostly addresses the question with minor tangents
5 = Answer directly and completely addresses the question

Return ONLY a JSON object:
{{"score": <1-5>, "reasoning": "<one sentence>"}}"""

@dataclass
class EvalResult:
    faithfulness: float
    relevancy: float
    reasoning: str
    unsupported_claims: list[str]

def evaluate_rag_response(
    question: str,
    context: str,
    answer: str
) -> EvalResult:
    import json

    faithfulness_response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_PROMPT.format(
                context=context, question=question, answer=answer
            )
        }]
    )

    relevancy_response = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": RELEVANCY_PROMPT.format(question=question, answer=answer)
        }]
    )

    faith_data = json.loads(faithfulness_response.content[0].text)
    rel_data = json.loads(relevancy_response.content[0].text)

    return EvalResult(
        faithfulness=faith_data["score"] / 5.0,  # Normalize to 0-1
        relevancy=rel_data["score"] / 5.0,
        reasoning=faith_data["reasoning"],
        unsupported_claims=faith_data.get("unsupported_claims", [])
    )

Bias mitigation — the critical part most teams skip:

# LLM judges have well-documented biases:
# 1. Verbosity bias: longer answers score higher regardless of quality
# 2. Self-enhancement bias: models favor their own style
# 3. Position bias: in pairwise eval, the first option scores higher
# 4. Sycophancy: judges agree with confident-sounding answers

# Mitigation strategies:

# 1. Calibration: include gold standard examples in few-shot prompt
CALIBRATED_PROMPT = """
Here are examples of scoring:
- "Paris is the capital of Germany" against context saying it's France → faithfulness: 1
- "The study found mixed results" (context says clear positive results) → faithfulness: 2
...
Now evaluate:
"""

# 2. Swap order for pairwise comparison and check consistency
def pairwise_eval_with_swap(response_a, response_b, question):
    score_ab = judge(question, response_a, response_b)   # A vs B
    score_ba = judge(question, response_b, response_a)   # B vs A (swapped)
    # If results disagree: mark as inconclusive (position bias detected)
    if score_ab["winner"] == "A" and score_ba["winner"] == "A":
        return "A"  # Consistent: A wins
    elif score_ab["winner"] == "B" and score_ba["winner"] == "B":
        return "B"  # Consistent: B wins
    else:
        return "tie"  # Inconsistent: don't trust

# 3. Use a judge model that is different from the generating model when possible.
# Same-model evaluation can inflate scores via self-enhancement bias.

RAGAS: Evaluation Framework for RAG Pipelines

RAGAS provides four metrics specifically designed for RAG evaluation:

RAGAS Metrics:

Faithfulness:
  Does the answer contain ONLY information from the retrieved context?
  Fails when the model adds information from its training data not in context.

Answer Relevancy:
  Does the answer address the question asked?
  Low when the answer is technically correct but ignores the specific question.

Context Recall:
  Does the retrieved context contain ALL information needed to answer?
  Requires a ground truth answer. Low = retrieval is missing relevant docs.

Context Precision:
  Are the retrieved documents actually relevant?
  Low = too much irrelevant context retrieved (noise in the context window).
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

def run_ragas_evaluation(eval_samples: list[dict]) -> dict:
    """
    eval_samples: list of dicts with keys:
      - question: str
      - answer: str (generated by your RAG system)
      - contexts: list[str] (retrieved chunks)
      - ground_truth: str (reference answer, for context_recall)
    """
    dataset = Dataset.from_list(eval_samples)

    results = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
    )

    return {
        "faithfulness": results["faithfulness"],
        "answer_relevancy": results["answer_relevancy"],
        "context_recall": results["context_recall"],
        "context_precision": results["context_precision"],
    }

# Example catching a real hallucination:
sample = {
    "question": "What is the default Kafka consumer timeout?",
    "contexts": [
        "The default session.timeout.ms for Kafka consumers is 45000 (45 seconds). "
        "This can be configured in consumer.properties."
    ],
    "answer": "The default Kafka consumer timeout is 30 seconds.",
    # ← Model hallucinated 30s instead of 45s from context
    "ground_truth": "The default Kafka consumer timeout (session.timeout.ms) is 45 seconds."
}

scores = run_ragas_evaluation([sample])
# faithfulness → 0.2 (low: answer contradicts context)
# context_recall → 1.0 (context had the answer)
# → Diagnosis: retrieval is fine, generation is hallucinating

This diagnostic pattern is powerful: low faithfulness + high context recall means your generator is hallucinating. Low context recall means your retriever is missing relevant documents. Different root causes, different fixes.

Building Automated Eval Pipelines

A production eval pipeline runs on every prompt change, model upgrade, or retrieval configuration change:

import json
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    id: str
    question: str
    ground_truth: str
    tags: list[str]  # e.g., ["factual", "multi-hop", "edge-case"]

@dataclass
class EvalRunResult:
    case_id: str
    answer: str
    faithfulness: float
    relevancy: float
    latency_ms: float
    passed: bool

class EvalHarness:
    def __init__(self, rag_pipeline, eval_cases: list[EvalCase]):
        self.rag = rag_pipeline
        self.cases = eval_cases
        self.results_dir = Path("eval_results")
        self.results_dir.mkdir(exist_ok=True)

    def run_suite(self, run_name: str) -> list[EvalRunResult]:
        results = []
        for case in self.cases:
            import time
            start = time.monotonic()
            answer, contexts = self.rag.query(case.question)
            latency_ms = (time.monotonic() - start) * 1000

            scores = evaluate_rag_response(
                question=case.question,
                context="\n".join(contexts),
                answer=answer
            )

            result = EvalRunResult(
                case_id=case.id,
                answer=answer,
                faithfulness=scores.faithfulness,
                relevancy=scores.relevancy,
                latency_ms=latency_ms,
                passed=scores.faithfulness >= 0.7 and scores.relevancy >= 0.7
            )
            results.append(result)

        # Persist results for trend analysis
        output_path = self.results_dir / f"{run_name}_{datetime.now().isoformat()}.json"
        output_path.write_text(json.dumps([asdict(r) for r in results], indent=2))
        return results

    def regression_check(self, current_results, baseline_results, threshold=0.05):
        """Fail CI if metrics drop more than threshold vs baseline."""
        current_faith = sum(r.faithfulness for r in current_results) / len(current_results)
        baseline_faith = sum(r.faithfulness for r in baseline_results) / len(baseline_results)

        if baseline_faith - current_faith > threshold:
            raise AssertionError(
                f"Faithfulness regression: {baseline_faith:.3f} → {current_faith:.3f} "
                f"(delta {baseline_faith - current_faith:.3f} > threshold {threshold})"
            )

Integrate into CI:

# .github/workflows/eval.yml
- name: Run LLM eval suite
  run: python eval/run_suite.py --baseline eval_results/baseline.json
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
# Fails PR if faithfulness drops > 5% vs baseline

This is the point where evals become engineering infrastructure instead of a research notebook. Every run should have:

  • the prompt version
  • the model name and provider
  • retrieval configuration
  • embedding model version
  • chunking strategy
  • dataset version
  • judge model version
  • metric thresholds
  • timestamp and commit SHA

Without this metadata, a score is hard to explain later. A faithfulness drop from 0.86 to 0.79 may be caused by a prompt change, a retrieval change, a model upgrade, or a dataset update. If the eval run does not record those inputs, you only know that something changed. You do not know what.

Production Release Gates

Treat LLM evals like performance tests: they should not be the only signal, but they should block obviously bad changes. A practical gate looks like this:

release_gate:
  minimum_cases: 250
  hard_fail:
    faithfulness_p10_below: 0.70
    pass_rate_below: 0.85
    critical_regression_cases_failed: 1
  regression_fail:
    faithfulness_mean_drop_pp: 5
    answer_relevancy_mean_drop_pp: 5
    p95_latency_increase_percent: 25
  manual_review:
    judge_disagreement_rate_above: 0.15
    cost_per_1000_requests_increase_percent: 20

The point is not to turn every prompt edit into a statistics dissertation. The point is to make unsafe changes visible. If faithfulness drops by 8 percentage points but latency improves, the pipeline should force a conversation instead of letting the change drift into production.

For high-risk domains such as medical, finance, legal, or security workflows, add human review to the release gate. LLM-as-judge is useful, but it is still a model. It can miss subtle domain-specific errors.

The Statistics of LLM Evaluation

The most common eval mistake: drawing conclusions from 20-50 examples. LLM outputs have high variance. You need enough samples to detect the difference you care about.

from scipy import stats
import numpy as np

def required_sample_size(
    baseline_mean: float,
    minimum_detectable_effect: float,
    std_estimate: float = 0.2,  # typical for LLM scores 0-1
    alpha: float = 0.05,
    power: float = 0.80
) -> int:
    """
    How many eval cases needed to detect a given improvement?

    Example: baseline faithfulness = 0.70
             want to detect: improvement to 0.77 (0.07 effect)
             std = 0.20 (typical for 0-1 scaled scores)
    """
    effect_size = minimum_detectable_effect / std_estimate  # Cohen's d
    # For two-sample t-test:
    n = (stats.norm.ppf(1 - alpha/2) + stats.norm.ppf(power))**2
    n = n * 2 / (effect_size ** 2)
    return int(np.ceil(n))

print(required_sample_size(0.70, 0.07))
# → ~250 samples per variant to detect a 7pp improvement with 80% power

# Compare two variants:
def compare_variants(scores_a, scores_b):
    statistic, p_value = stats.mannwhitneyu(scores_a, scores_b, alternative='two-sided')
    effect_size = (np.mean(scores_b) - np.mean(scores_a)) / np.std(scores_a + scores_b)
    return {
        "p_value": p_value,
        "significant": p_value < 0.05,
        "effect_size_d": effect_size,
        "mean_a": np.mean(scores_a),
        "mean_b": np.mean(scores_b),
        "practical_improvement": np.mean(scores_b) - np.mean(scores_a)
    }

Stratified eval sets: Your eval set should match the distribution of production queries — not just the easy cases or the ones that broke recently:

Eval set composition (recommended):
- 40% common queries (high-frequency, well-represented in training)
- 30% edge cases (ambiguous questions, out-of-domain)
- 20% regression cases (queries that previously failed)
- 10% adversarial cases (prompt injection attempts, contradictory context)

Hallucination Evaluation Specifically

Faithfulness scores catch hallucinations in RAG. For general LLM outputs without a retrieval context, use a consistency-check approach:

def check_hallucination_consistency(question: str, n_samples: int = 5) -> dict:
    """
    Sample the same question multiple times.
    High variance in answers = likely hallucination.
    Consistent answers = more reliable (but not guaranteed correct).
    """
    answers = []
    for _ in range(n_samples):
        response = client.messages.create(
            model=JUDGE_MODEL,
            max_tokens=500,
            temperature=0.7,  # Some randomness to surface inconsistency
            messages=[{"role": "user", "content": question}]
        )
        answers.append(response.content[0].text)

    # Use LLM to check consistency across samples
    consistency_check = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Are these {n_samples} answers consistent with each other? "
                       f"Answer YES or NO, then explain any contradictions.\n\n"
                       + "\n\n---\n\n".join(f"Answer {i+1}: {a}" for i, a in enumerate(answers))
        }]
    )

    return {
        "answers": answers,
        "consistency_check": consistency_check.content[0].text
    }

Production Eval Dashboard Metrics

Track these in Grafana/Datadog alongside your service metrics:

Eval metrics to monitor in production:

Daily eval run results:
  llm.eval.faithfulness.p50 / p25 / p10  (distribution, not just mean)
  llm.eval.relevancy.p50
  llm.eval.pass_rate  (% of cases above threshold)

Query-level sampling (1% of production traffic):
  llm.production.faithfulness  (estimated via lightweight judge)
  llm.production.latency_p99

Regression detection:
  Alert when: 7-day rolling avg faithfulness drops > 3pp vs 30-day avg
  Alert when: pass_rate drops below 0.85

A 3-point drop in faithfulness over a week likely means either a model API change (the provider silently updated the model), prompt drift (upstream context format changed), or data drift (queries shifted to a domain your retrieval handles poorly). The eval pipeline tells you it happened. The metrics breakdown tells you why.

The teams that invest in eval infrastructure ship faster — not slower. When every change is evaluated automatically, you can move quickly without fear of silent regressions. The teams that skip eval infrastructure move fast until they break something important in production, then slow to a crawl debugging without signal.

Read Next

📚

Recommended Resources

Building LLM Apps with LangChain — UdemyHot

Build RAG systems, agents, and LLM-powered apps with Python and LangChain.

View Course
Hands-On Large Language ModelsNew

Practical guide to training, fine-tuning, and deploying LLMs.

View on Amazon
AI Engineering by Chip Huyen

Building intelligent systems with foundation models — from retrieval to agents.

View on Amazon

Practical engineering notes

Get the next backend guide in your inbox

One useful note when a new deep dive is published: system design tradeoffs, Java production lessons, Kafka debugging, database patterns, and AI infrastructure.

No spam. Just practical notes you can use at work.

Sachin Sarawgi

Written by

Sachin Sarawgi

Engineering Manager and backend engineer with 10+ years building distributed systems across fintech, enterprise SaaS, and startups. CodeSprintPro is where I write practical guides on system design, Java, Kafka, databases, AI infrastructure, and production reliability.

Found this useful? Share it: