Evaluation Methods and Datasets
Learning Objectives
- You understand why LLM applications need explicit evaluation.
- You know the difference between code-based checks, human review, and model-based evaluation.
- You can design a small evaluation dataset for a CLI application.
Need for evaluation
When people try an LLM application informally, they often remember the best examples and the most surprising failures. That gives a rough impression, but it is not enough for engineering decisions.
An application needs evaluation because we want answers to questions such as:
- Does the system answer the right kinds of questions well?
- Does it keep the requested format?
- Does it cite the correct sources?
- Does a change improve one task while making another worse?
Evaluation turns those questions into something more systematic. In practice, when working on a system, evaluation is a part of the engineering loop, as shown in Figure 1 below.
Start with a small dataset
An evaluation dataset does not need to be large at first. A small, carefully chosen set of examples is already useful if it covers the main cases the system is supposed to handle.
A simple record might look like this:
{
  "question": "Where should API keys be stored?",
  "expectedCitations": ["security-guidelines.md#1"],
  "mustContain": ["environment variable"]
}
This example encodes both a source expectation and a content expectation.
It is often useful to keep the first dataset deliberately small and manually understandable. For early evaluation, even ten carefully chosen cases can already be sufficient if they cover the main workflows, the most important failure modes, and a few edge cases. The evaluation set can always be augmented later on.
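A record like the one above can be checked with a few lines of code. The sketch below is illustrative, not a fixed API: `run_system` is a hypothetical stand-in for the application under test, assumed to return a dict with `answer` and `citations` fields.

```python
def run_system(question):
    # Hypothetical stand-in for the CLI application under test.
    # Hard-coded here only so the sketch is runnable.
    return {
        "answer": "Store API keys in an environment variable.",
        "citations": ["security-guidelines.md#1"],
    }

def check_case(case, output):
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    for phrase in case.get("mustContain", []):
        if phrase not in output["answer"]:
            failures.append(f"missing phrase: {phrase!r}")
    for cid in case.get("expectedCitations", []):
        if cid not in output["citations"]:
            failures.append(f"missing citation: {cid}")
    return failures

case = {
    "question": "Where should API keys be stored?",
    "expectedCitations": ["security-guidelines.md#1"],
    "mustContain": ["environment variable"],
}
print(check_case(case, run_system(case["question"])))  # [] — the case passes
```

Because each failure is a concrete message rather than a bare pass/fail flag, the same loop later doubles as a debugging aid.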
Several evaluation methods
There are multiple ways to evaluate an LLM system. Common methods include:
- similarity-based methods, where the application outputs are compared against reference values,
- code-based checks, where the application checks exact fields, output format, or citations,
- human review, where a person judges usefulness, correctness, or clarity,
- LLM-as-judge, where another model is asked to score or compare outputs using stated criteria.
Each method has strengths and weaknesses.
Similarity-based methods are fast, but they may penalize correct answers that are merely worded differently. Code-based checks are fast and reproducible, but they only capture what you can specify clearly. Human review captures nuance, but it is slower and more expensive. LLM-as-judge can help scale evaluation, but it should not be treated as an unquestionable authority.
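The rewording pitfall is easy to demonstrate. The sketch below uses Python's standard-library `difflib` as a deliberately simple stand-in for more sophisticated similarity metrics: a correct but reworded answer scores far below an exact match.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()

reference = "Store API keys in an environment variable."
exact     = "Store API keys in an environment variable."
reworded  = "Keep your API keys in environment variables, not in code."

print(similarity(reference, exact))     # 1.0
print(similarity(reference, reworded))  # noticeably lower, despite being correct
```

A threshold on such a score would reject the reworded answer, which is exactly the kind of false negative the paragraph above warns about.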
Consider a small command-line RAG tool that answers documentation questions. A code-based check can verify that the output is valid JSON and that the cited chunk identifiers exist. A human reviewer can judge whether the answer is genuinely helpful. An LLM judge can help compare two prompt versions at a larger scale. None of those methods is enough by itself, but together they provide a more informative picture of quality.
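A code-based check for that tool might look like the sketch below. It assumes, for illustration, that the tool prints a JSON object with an `answer` string and a `citations` list, and that the set of known chunk identifiers is available to the check.

```python
import json

# Assumed set of chunk identifiers known to the retrieval index (illustrative).
KNOWN_CHUNKS = {"security-guidelines.md#1", "security-guidelines.md#4"}

def structural_check(raw_output):
    """Format-only check: valid JSON, expected fields, citations that exist."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "output is not valid JSON"
    if not isinstance(data.get("answer"), str):
        return False, "missing or non-string 'answer' field"
    unknown = [c for c in data.get("citations", []) if c not in KNOWN_CHUNKS]
    if unknown:
        return False, f"cited chunks do not exist: {unknown}"
    return True, "ok"

ok, reason = structural_check(
    '{"answer": "Use env vars.", "citations": ["security-guidelines.md#1"]}'
)
print(ok, reason)  # True ok
```

Note what this check deliberately does not do: it never asks whether the answer is helpful. That judgment is left to human review or an LLM judge.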
Good evaluation criteria are explicit
If the system produces free-form answers, the team still needs explicit quality criteria. Typical criteria include:
- correctness,
- groundedness in the supplied context,
- completeness for the intended task,
- formatting and structure,
- and safe behavior when information is missing.
For a RAG application, groundedness is especially important. A polished answer with the wrong citation is often worse than a cautious answer that says the documents do not contain enough information.
It also helps to separate “did the program follow the requested format?” from “was the answer actually good?”. Those are different questions. A program may return valid JSON every time and still choose weak evidence. Conversely, an answer may contain the right idea but fail a strict format check because one expected field is missing. The evaluation design should make those distinctions visible instead of collapsing everything into one vague pass-or-fail impression.
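One way to keep those questions separate is to report them as independent results per case instead of a single boolean. A minimal sketch, with illustrative field names:

```python
def evaluate_case(output, expected_citation):
    """Report format and evidence quality as separate results, not one pass/fail."""
    return {
        "format_ok": isinstance(output.get("answer"), str)
                     and isinstance(output.get("citations"), list),
        "evidence_ok": expected_citation in output.get("citations", []),
    }

# Valid format but weak evidence: both facts stay visible in the report.
print(evaluate_case(
    {"answer": "Use environment variables.",
     "citations": ["security-guidelines.md#4"]},
    expected_citation="security-guidelines.md#1",
))  # {'format_ok': True, 'evidence_ok': False}
```

A dashboard built on such records can show that format compliance is at 100% while groundedness lags, which is far more actionable than a single blended score.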
Turning results into feedback
Once a dataset exists, the next step is to use the failures constructively. Suppose that one evaluation case expects the citation security-guidelines.md#1, but the system keeps retrieving security-guidelines.md#4 instead. That result might point to a retrieval problem, a chunking problem, or a prompt that does not clearly require the most specific evidence. A good evaluation loop helps the team ask the right next question.
An evaluation should not end with a single score. Summary numbers are useful, but information about the failing cases usually teaches more than the average score. The most valuable output of an evaluation run is often a short list of concrete failures that the team can investigate next.
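Concretely, a run might report the pass rate and then list the failing cases for inspection. A sketch, assuming each result record carries the case's question and its failure messages:

```python
def summarize(results):
    """Print a pass rate plus a short list of concrete failures to investigate."""
    failures = [r for r in results if r["failures"]]
    rate = 1 - len(failures) / len(results)
    print(f"pass rate: {rate:.0%}")
    for r in failures:
        print(f"FAIL: {r['question']}: {'; '.join(r['failures'])}")
    return failures

summarize([
    {"question": "Where should API keys be stored?", "failures": []},
    {"question": "How is logging configured?",
     "failures": ["missing citation: logging.md#2"]},
])
# pass rate: 50%
# FAIL: How is logging configured?: missing citation: logging.md#2
```

The returned failure list is the artifact worth attaching to a pull request or a team discussion, not the percentage alone.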
Large language models can also help draft an evaluation rubric, provided that the request is explicit enough. For example:
Draft evaluation criteria for the following CLI question-answering system.
Return a short list of concrete criteria that a team could actually use.
Focus on quality dimensions such as correctness, groundedness, output structure, and behavior when information is missing.
Do not give generic praise.
This is a good example of prompt engineering serving a software engineering process rather than serving an end-user feature directly.