Knowledge, Retrieval, and Evaluation

Retrieval-Augmented Generation


Learning Objectives

  • You understand the main stages of a retrieval-augmented generation pipeline.
  • You know why retrieval can improve answers in domain-specific applications.
  • You can identify common failure points in a simple RAG system.

What RAG means

Retrieval-augmented generation, or RAG, is a pattern where the application retrieves relevant information from a document collection and includes it in the prompt before asking the model to answer.

Instead of relying only on the model’s built-in knowledge, the application provides external context at request time.

Figure 1 shows a typical RAG system with two parts: (1) an indexing step that stores the documents in a database and (2) a question-answering pipeline that retrieves relevant chunks and includes them in the prompt.

Fig 1. — A RAG system has both an indexing side and a question-answering side.

The indexing side

Before the application can answer questions, it usually needs to prepare its documents:

  • split documents into chunks,
  • create embeddings for those chunks,
  • and store the chunks together with their embeddings and metadata.

This is often called indexing. In a small course example, the index might be a JSON file. In larger systems, the index may live in a dedicated vector database or search system.

Database systems such as PostgreSQL also support vector search through extensions like pgvector, so a separate vector database is not always necessary.
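The indexing steps above can be sketched in a few lines of code. The `embed` function below is only a toy stand-in (a letter-frequency histogram) so that the example stays self-contained; a real system would call an embedding model instead:

```javascript
// Toy "embedding": a 26-slot letter-frequency histogram.
// Stands in for a real embedding model call.
const embed = (text) => {
  const vector = new Array(26).fill(0);
  for (const ch of text.toLowerCase()) {
    const i = ch.charCodeAt(0) - 97;
    if (i >= 0 && i < 26) vector[i] += 1;
  }
  return vector;
};

// Split each document into paragraph-sized chunks and store
// every chunk with an id, its text, and its embedding.
const buildIndex = (documents) =>
  documents.flatMap((doc) =>
    doc.text
      .split("\n\n")
      .filter((part) => part.trim().length > 0)
      .map((part, i) => ({
        id: `${doc.name}-${i}`, // chunk identifier, useful for citations
        text: part.trim(),
        embedding: embed(part), // stored alongside the chunk
      }))
  );

const index = buildIndex([
  {
    name: "policy",
    text: "Late submissions lose 5% per day.\n\nMedical exemptions are handled manually.",
  },
]);
console.log(index.length); // → 2
```

In a small course example, the resulting array could simply be written to a JSON file; the structure (id, text, embedding) stays the same when the storage moves to a vector database.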

The answering side

When the user asks a question, the application:

  1. creates an embedding for the question,
  2. compares it with stored chunk embeddings,
  3. retrieves the most similar chunks,
  4. inserts those chunks into the prompt,
  5. and asks the model to answer using that context.

The application may also instruct the model to cite chunk identifiers or document names. That makes the answer easier to inspect and evaluate.
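Steps 2 and 3 are commonly implemented with cosine similarity over the stored embeddings. A minimal sketch, assuming the embeddings are already plain arrays of numbers:

```javascript
// Dot product of two equal-length vectors.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Cosine similarity: 1 means identical direction, 0 means unrelated.
const cosineSimilarity = (a, b) =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

// Return the k chunks whose embeddings are most similar to the question's.
const retrieve = (questionEmbedding, chunks, k) =>
  chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(questionEmbedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);

// Tiny illustrative embeddings, not real model output.
const chunks = [
  { id: "policy-1", embedding: [1, 0, 0] },
  { id: "faq-1", embedding: [0, 1, 0] },
  { id: "calendar-1", embedding: [0.9, 0.1, 0] },
];
const top = retrieve([1, 0, 0], chunks, 2);
console.log(top.map((c) => c.id)); // → [ 'policy-1', 'calendar-1' ]
```

A linear scan like this is fine for a small index; larger systems replace it with an approximate nearest-neighbor search, but the interface (question embedding in, top-k chunks out) stays the same.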

A small worked example

Suppose that a course-support assistant receives the question:

Can students submit one day late without penalty?

Assume that the document collection contains three chunks:

  1. a policy chunk stating that late submissions lose 5% per day,
  2. a FAQ chunk stating that medical exemptions must be handled manually,
  3. a calendar chunk listing the deadline date.

If retrieval returns the first and third chunks, the model has enough local evidence to answer the question and cite the relevant policy text. If retrieval returns only the calendar chunk, the model may still answer confidently, but it no longer has the penalty rule that actually matters. In other words, retrieval quality changes what the model is able to know at answer time.

This is one reason to inspect retrieved chunks directly during development. A poor final answer does not always mean the prompt is weak. Sometimes the prompt is fine and the wrong chunks were retrieved.

RAG is not a silver bullet, however. In some cases, a simple search over documents may be enough. See, e.g., Comparing the Utility, Preference, and Performance of Course Material Search Functionality and Retrieval-Augmented Generation Large Language Model (RAG-LLM) AI Chatbots in Information-Seeking Tasks.

RAG does not guarantee correctness

RAG is a useful pattern, but it does not guarantee correctness. A system that uses RAG can still fail in multiple ways: the document collection may be incomplete, chunking may split information badly, retrieval may select the wrong text, the prompt may carry weak instructions, or the model may ignore the provided context.

The application needs both a way to gather context and a way to check whether that context leads to acceptable answers.

One particularly common failure case is partial relevance. A retrieved chunk may mention the right topic without containing the precise rule or fact that the user needs. The result can sound plausible while still omitting the decisive detail. That is why many RAG systems benefit from returning more than one chunk and from requiring explicit citations in the answer.


Prompt design still matters

Even with good retrieval, the prompt design remains important. A small example:

const buildMessages = ({ question, chunks }) => {
  // Prefix each chunk with its id so the model can cite it.
  const context = chunks
    .map((chunk) => `(${chunk.id}) ${chunk.text}`)
    .join("\n\n");

  return [
    {
      // The system message states the grounding rule explicitly.
      role: "system",
      content:
        "Answer only using the provided context. If the answer is missing, say that the documents do not contain enough information. Cite the chunk ids you used.",
    },
    {
      role: "user",
      content: `Question: ${question}\n\nContext:\n${context}`,
    },
  ];
};

The retrieval step provides the raw material. The prompt still decides how that material is presented and how the model is asked to behave.

In practice, a good RAG system is therefore built from several smaller choices rather than one big decision. The developer chooses how to chunk the documents, how to create and store embeddings, how many chunks to retrieve, how to present them in the prompt, and how to verify that the answer stayed grounded in that context. Retrieval improves the system only when those choices work together.


If you are designing the prompt itself before writing code, a compact template is often enough:

Answer the question using only the supplied context.
If the context is insufficient, say so clearly.
Cite the chunk ids you used.
Do not rely on outside knowledge.

This template is useful because it makes the grounding rule explicit. It also gives the evaluation step something concrete to check later.
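As a sketch of what that concrete check might look like: assuming the answer cites ids in the same `(id)` format used in the context, a small function can verify that every cited id was actually among the chunks supplied to the model:

```javascript
// Extract ids cited in the "(id)" format from an answer.
const citedIds = (answer) =>
  [...answer.matchAll(/\(([a-z0-9-]+)\)/gi)].map((m) => m[1]);

// Grounded here means: at least one citation, and every cited id
// was among the chunks actually provided in the prompt.
const isGrounded = (answer, providedIds) =>
  citedIds(answer).length > 0 &&
  citedIds(answer).every((id) => providedIds.includes(id));

console.log(
  isGrounded("Late work loses 5% per day (policy-1).", ["policy-1", "calendar-1"])
); // → true
console.log(
  isGrounded("The deadline is Friday (syllabus-9).", ["policy-1"])
); // → false
```

This only checks citation consistency, not factual correctness, but it catches answers that invent sources or ignore the grounding rule entirely.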
