UpSkillZone

Fundamentals · 2026-05-06

LLM Evaluation: Why Offline Evals Must Come Before Online Metrics

Online metrics (latency, engagement, CSAT) tell you what happened. Offline evals tell you what will happen. Building your eval harness before you ship is not optional — it's the difference between iteration and guessing.


§1

The Two Eval Regimes

LLM evaluation operates in two distinct regimes: offline and online. Offline evals run against a fixed dataset before deployment — you know the expected outputs, you measure against them, and you get a score before any user sees the system. Online metrics are collected from a live system — latency, cost, user engagement, thumbs up/down ratings, CSAT surveys. Both are necessary. Neither alone is sufficient.

The mistake most teams make is building only online metrics and then trying to use them to guide development. This is backwards. Online metrics have a feedback loop of 24–72 hours at minimum (you need to ship, wait for traffic, and aggregate signals). They are noisy (user behavior is influenced by factors unrelated to quality). And they often lag quality by weeks — a user might stop using the product three sessions after quality degraded, not immediately.

Offline evals give you a feedback loop measured in minutes, not days. Run your test suite, get a score, and know immediately whether your prompt change broke something. This is the difference between engineering and guessing.


§2

Why Online Metrics Alone Fail

Online metrics measure user behavior, not system quality. A user who does not understand the system's limitations may give high ratings to confident wrong answers. A user with a sophisticated use case may give low ratings to technically correct answers that do not match their mental model. CSAT scores a user's experience of an interaction, which correlates with quality but is not the same thing.

Engagement metrics are even more dangerous. Time-on-site and session length can increase because users are confused, not because they are delighted. Click-through rates measure whether users clicked, not whether the click helped them. Optimizing an LLM system against engagement metrics without quality guardrails is how you build a system that is confidently wrong.

The final failure mode of online-only evaluation is that it gives you no ability to evaluate before shipping. Every change is a live experiment. For safety-critical applications (medical, legal, financial), this is not acceptable. Even for non-safety-critical applications, it means every regression ships to users before you can catch it. Offline evals catch regressions in CI, before merge.


§3

Anatomy of an Offline Eval Harness

An offline eval harness consists of four components: a test dataset (queries, expected outputs, and relevant context), a runner (code that takes a dataset, runs each query through the system, and collects outputs), a scorer (code or a model that compares outputs to expected outputs and produces a score), and a reporter (output that summarizes scores, flags regressions, and surfaces examples for review).
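
Concretely, a dataset entry can be a small record; a minimal sketch, with illustrative field names:

    from dataclasses import dataclass, field

    @dataclass
    class EvalCase:
        query: str       # input to the system under test
        expected: str    # golden answer, or criteria for the scorer
        relevant_ids: set[str] = field(default_factory=set)  # docs that should be retrieved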

The runner must be deterministic where possible. Set temperature to 0 for evaluation runs. Use fixed seeds. Cache model responses during development so you are not paying for API calls on every run. The scorer can be rule-based (exact match, regex, JSON schema validation), embedding-based (cosine similarity between expected and actual), or LLM-based (GPT-4 or Claude as judge). Each has different cost/accuracy tradeoffs.
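
A minimal sketch of such a runner, assuming a hypothetical call_model(prompt, temperature) wrapper around your provider's API:

    import hashlib
    import json
    import pathlib

    CACHE_DIR = pathlib.Path(".eval_cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def call_model(prompt: str, temperature: float = 0.0) -> str:
        raise NotImplementedError  # hypothetical wrapper around your provider's API

    def cached_call(prompt: str) -> str:
        # Key the cache on the exact prompt so repeated eval runs cost nothing.
        key = hashlib.sha256(prompt.encode()).hexdigest()
        path = CACHE_DIR / f"{key}.json"
        if path.exists():
            return json.loads(path.read_text())["output"]
        output = call_model(prompt, temperature=0.0)  # temperature 0 for determinism
        path.write_text(json.dumps({"prompt": prompt, "output": output}))
        return output

    def run_suite(dataset: list[dict]) -> list[dict]:
        # dataset: [{"query": ..., "expected": ...}]; attach the output to each case.
        return [{**case, "output": cached_call(case["query"])} for case in dataset]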

The reporter is where most teams cut corners. A score of "faithfulness: 0.83" is not actionable without knowing which examples failed and why. Your reporter should surface the failing examples with their inputs, expected outputs, actual outputs, and the scorer's reasoning. This is what enables rapid diagnosis and targeted improvement.
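
A sketch of a reporter that does this, assuming each result carries a score and the scorer's reasoning (field names are illustrative):

    def report(results: list[dict], threshold: float = 0.7) -> None:
        # results: [{"query", "expected", "output", "score", "reasoning"}, ...]
        scores = [r["score"] for r in results]
        print(f"mean score: {sum(scores) / len(scores):.2f} over {len(scores)} cases")
        failures = [r for r in results if r["score"] < threshold]
        print(f"{len(failures)} cases below {threshold}:")
        for r in failures:
            print("-" * 60)
            print(f"query:    {r['query']}")
            print(f"expected: {r['expected']}")
            print(f"actual:   {r['output']}")
            print(f"judge:    {r['reasoning']}")  # scorer's explanation for the low score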


§4

Metrics That Matter: Precision, Recall, Faithfulness, Coherence

For RAG systems, the primary metrics are retrieval recall@k (are the right documents being retrieved?), faithfulness (is the generated answer grounded in the retrieved context?), answer relevance (does the answer address the question?), and answer correctness (is the answer factually right?). For chat systems, coherence (is the response internally consistent and grammatically well-formed?) and instruction following (did the model do what it was asked?) are primary.
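
Retrieval recall@k is the easiest of these to compute yourself, once each query has labeled relevant documents; a minimal sketch:

    def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
        # Fraction of the relevant documents that appear in the top-k results.
        if not relevant_ids:
            return 0.0
        return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

    # Example: 2 of the 3 relevant docs appear in the top 5 -> 0.67
    print(recall_at_k(["d1", "d9", "d3", "d7", "d2"], {"d1", "d2", "d4"}, k=5))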

Faithfulness is the most important metric for production RAG and the hardest to measure. A faithful answer uses only information from the retrieved context — it does not bring in training knowledge that may be outdated or incorrect. Measuring faithfulness requires comparing each claim in the output to the retrieved context. RAGAS implements this with LLM-as-judge. The judge reads the context and the output and asks: "Can every factual claim in the output be supported by the context?"
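
A simplified sketch of that judge (RAGAS itself first decomposes the answer into individual claims; judge_model here is a hypothetical wrapper around whichever judge model you use):

    JUDGE_PROMPT = """Context:
    {context}

    Answer:
    {answer}

    Count the factual claims in the answer and how many are supported by the
    context. Reply with exactly one line: SUPPORTED: <n> TOTAL: <m>"""

    def judge_model(prompt: str) -> str:
        raise NotImplementedError  # hypothetical wrapper around the judge model's API

    def faithfulness(context: str, answer: str) -> float:
        reply = judge_model(JUDGE_PROMPT.format(context=context, answer=answer))
        # Parse "SUPPORTED: 4 TOTAL: 5" -> 0.8 (supported claims / total claims).
        parts = reply.split()
        supported = int(parts[parts.index("SUPPORTED:") + 1])
        total = int(parts[parts.index("TOTAL:") + 1])
        return supported / total if total else 1.0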

Do not track more than five metrics in your primary dashboard. Tracking fifteen metrics leads to optimization theater — teams report the metrics that improved and quietly ignore the ones that got worse. Pick the three to five metrics that most directly represent user value and make those the official measure of system quality.


§5

The Golden Dataset Problem

A golden dataset is a curated set of queries with verified correct answers (or relevant documents for retrieval evals). It is the foundation of offline evaluation. The problem is that building one is expensive: you need domain experts to label correct answers, you need coverage across the query distribution, and you need it to stay current as your documents and user needs evolve.

Three practical approaches: (1) Mine production logs for real queries, have domain experts label a sample, and use those as golden examples. (2) Generate synthetic queries from your documents using an LLM — for each document chunk, generate 3–5 plausible questions that the chunk answers. (3) Use adversarial examples — craft queries that are designed to expose known failure modes (out-of-scope questions, ambiguous queries, questions with no answer in the corpus).
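
A sketch of approach (2), assuming a hypothetical generate(prompt) wrapper around whichever model does the generation:

    GEN_PROMPT = """Document chunk:
    {chunk}

    Write {n} distinct questions that this chunk directly answers.
    Return one question per line, with no numbering."""

    def generate(prompt: str) -> str:
        raise NotImplementedError  # hypothetical wrapper around the generating model

    def synthetic_cases(chunks: list[dict], n: int = 3) -> list[dict]:
        # chunks: [{"id": ..., "text": ...}]. Each question's relevant document is
        # the chunk it came from -- verify a sample manually before trusting these.
        cases = []
        for chunk in chunks:
            reply = generate(GEN_PROMPT.format(chunk=chunk["text"], n=n))
            for line in reply.splitlines():
                if line.strip():
                    cases.append({"query": line.strip(), "relevant_ids": {chunk["id"]}})
        return cases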

The golden dataset must cover failure modes, not just the happy path. A test set that only includes queries your system already handles well tells you nothing about where it breaks. Allocate 30–40% of your golden examples to edge cases: questions where the answer is not in the corpus, questions that require synthesizing multiple documents, and questions that are subtly ambiguous.


§6

Eval-Driven Development

Eval-driven development (EDD) is the practice of writing eval cases before implementing changes, then using the eval results to guide implementation. It is the LLM analogue of test-driven development. The workflow: identify a failure mode or desired new behavior, write an eval case that captures it, run the eval against the current system (it fails), make a change, run the eval again, iterate until the eval passes.

This workflow has a forcing function: you cannot declare a feature done until the eval passes. It prevents the common pattern of shipping a change that handles the case you thought about while breaking the cases you did not. It also builds an ever-growing test suite that documents every failure mode the team has encountered and fixed.

The practical challenge is eval case quality. LLM behavior is non-deterministic — the same input can produce different outputs on different runs. Write evals that check for properties of the output (does the output contain a specific claim? does the JSON have the required keys? does the answer contradict the retrieved context?) rather than exact string matches. Property-based evals are more stable and more meaningful.
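
The first two properties are cheap predicates; the third (contradiction against the retrieved context) needs an LLM judge like the faithfulness scorer above. A sketch:

    import json

    def contains_claim(output: str, claim: str) -> bool:
        # Weak but stable property: the output states a required fact.
        return claim.lower() in output.lower()

    def has_required_keys(output: str, keys: set[str]) -> bool:
        # Structural property: the output parses as JSON with the required keys.
        try:
            return keys <= set(json.loads(output))
        except (json.JSONDecodeError, TypeError):
            return False

    assert has_required_keys('{"answer": "42", "sources": []}', {"answer", "sources"})
    assert contains_claim("Paris is the capital of France.", "capital of France")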


§7

Common Mistakes in LLM Evaluation

Mistake one: using the same model to generate test cases and evaluate outputs. If you use GPT-4 to generate your golden answers and GPT-4 as your judge, you are measuring whether the system produces GPT-4-like outputs, not whether it produces correct outputs. Use domain experts or a different model family for golden answer generation.

Mistake two: evaluating only on in-distribution queries. Your eval set should include queries that are slightly out of scope, queries with typos and informal language, multi-turn queries that depend on conversation history, and adversarial queries designed to elicit hallucination or inappropriate content. Systems that score well on clean in-distribution evals often fail badly on realistic production traffic.

Mistake three: ignoring latency and cost in evals. A system that achieves 0.95 faithfulness by retrieving 50 chunks and using a 128k context is not production-ready. Your eval harness should measure cost-per-query and latency alongside quality metrics. Track the Pareto frontier: for a given quality level, what is the minimum cost? For a given cost budget, what is the maximum quality?
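
A sketch of measuring both alongside quality, with placeholder prices and a rough 4-characters-per-token heuristic (substitute your provider's tokenizer and rates):

    import time

    PRICE_PER_1K_IN = 0.005    # placeholder rates, not real pricing
    PRICE_PER_1K_OUT = 0.015

    def measure(call, prompt: str) -> dict:
        # Wrap any model call to record latency and estimated cost per query.
        start = time.perf_counter()
        output = call(prompt)
        latency = time.perf_counter() - start
        cost = (len(prompt) / 4 / 1000) * PRICE_PER_1K_IN \
             + (len(output) / 4 / 1000) * PRICE_PER_1K_OUT
        return {"output": output, "latency_s": latency, "cost_usd": cost}

    def pareto_frontier(configs: list[dict]) -> list[dict]:
        # configs: [{"name", "quality", "cost_usd"}]. Keep configs that no other
        # config beats on cost while matching or exceeding on quality.
        return [c for c in configs
                if not any(o["cost_usd"] < c["cost_usd"] and o["quality"] >= c["quality"]
                           for o in configs)]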


FAQ

Frequently asked questions

What is an offline eval for an LLM?
An offline eval is a test suite that runs against a fixed dataset before deployment. You have a set of inputs (queries or prompts), expected outputs or evaluation criteria, and code that runs each input through the system and scores the output. Offline evals give you immediate feedback on whether a change improved or degraded system quality. They are the primary tool for catching regressions before they reach users. The opposite is online evaluation, which collects metrics from a live system serving real users — useful for measuring real-world impact but too slow and noisy to use as a development feedback loop.
What metrics should I track for RAG?
The core metrics for a RAG system are: Recall@k (what fraction of relevant documents appear in the top k retrieved results — this is the primary retrieval metric), faithfulness (what fraction of claims in the generated answer are supported by the retrieved context), answer relevance (does the answer address the question?), and answer correctness (is the answer factually right, judged against ground truth?). For practical implementation, the RAGAS framework provides implementations of faithfulness and answer relevance using LLM-as-judge. Supplement with latency percentiles (p50, p95, p99) and cost-per-query to ensure quality improvements are not coming at unacceptable operational cost.
How do I build a golden test set?
Three approaches, ordered by data quality: (1) Sample real production queries (once you have traffic), have domain experts label whether the system's answer was correct, and use those labeled examples. This gives you the most realistic distribution but requires production traffic first. (2) Have domain experts write questions and answers from your source documents directly — the most accurate but most expensive approach. (3) Generate synthetic queries using an LLM: for each document chunk, prompt a model to generate 3–5 questions that the chunk answers, then verify a sample manually. Synthetic generation is fast and scalable but introduces model bias. For most teams, combining approaches 2 and 3 is practical: generate synthetically, verify with domain experts, and add production failures as they are discovered.
Can I use GPT-4 to evaluate my LLM outputs?
Yes, with caveats. LLM-as-judge is a legitimate and widely used evaluation approach, but it has specific failure modes you need to account for. LLM judges are biased toward verbose, confident-sounding answers regardless of correctness. They are biased toward answers that sound like their own training distribution. And if you are using the same model family as both the system under test and the judge, you are measuring self-consistency rather than correctness. Mitigations: use a different model family as judge (evaluate a GPT-4 system with Claude, or vice versa), use structured scoring rubrics rather than open-ended judgment, and calibrate your judge against human labels on a sample to understand where it diverges from human judgment.
How often should I run offline evals?
Run offline evals on every pull request that touches the prompt, the retrieval logic, the chunking strategy, or the embedding model — this is the minimum. For active development teams, this means evals run dozens of times per day in CI. Run a full eval suite (including expensive LLM-as-judge metrics) nightly, and review the results before the next day's standup. Run a comprehensive eval including adversarial and edge-case examples before any major release. The cost is real — LLM-as-judge evaluations are not free — but cache model calls during development and use cheaper models (GPT-4o-mini, Haiku) for fast feedback, reserving expensive models for the nightly comprehensive run.

evals · fundamentals · quality