Knowledge, Retrieval, and Evaluation

Overview


In the previous part, we connected an LLM application to tools. In this part, we focus on two questions that appear as soon as an application needs to answer domain-specific questions:

  • Where does the application get reliable context?
  • How do we know whether the answers are good enough for the intended use?

Large language models already contain a great deal of general knowledge, but many useful applications depend on material that is local, changing, or specific to one organization. A CLI assistant for course policies, internal engineering guidelines, or local documentation cannot rely only on what the model happened to learn during training.

Retrieval helps with that problem. The application can search its own documents, select relevant pieces of text, and include them in the prompt. But retrieval alone is not enough. An application may retrieve the wrong passage, assemble the context badly, or still generate a weak answer. That is why this part also introduces evaluation.

This part also introduces a second kind of model call. In the earlier application parts, we only sent chat-style requests. Here, the application may also create embeddings for documents and questions so that it can retrieve relevant chunks before generating an answer.
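To make the embed-then-retrieve step concrete, here is a minimal sketch of that second kind of model call. The `embed` function below is a deliberately crude stand-in (a bag-of-words vector built with the standard library); a real application would call an embeddings API instead, but the surrounding logic — embed the question, score every chunk, keep the best matches, build the prompt — stays the same. All names here are illustrative, not part of any specific library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a real embedding model call: a bag-of-words
    vector. A production app would call an embeddings API here."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the question, score every chunk, return the top k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Toy document store for a course-policy assistant.
chunks = [
    "Late submissions lose 10 percent per day.",
    "The course uses Python for all exercises.",
    "Exams are held twice per semester.",
]

question = "What happens if I submit late?"
top = retrieve(question, chunks, k=1)
prompt = f"Answer using this context:\n{top[0]}\n\nQuestion: {question}"
```

The key point is the ordering: the embedding calls and the similarity ranking happen before any chat-style request is made, and only the selected chunks end up in the prompt.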

Figure 1 summarizes the basic pattern that we will use in this part.

Figure 1 — Retrieval and evaluation turn a plain prompt-response application into a system that can ground answers in local data and check how well it performs.

Retrieval begins before answer generation

It is easy to think of RAG as “ask a question and get an answer with citations”. In practice, much of the engineering work happens before the answer is generated at all.

Someone has to decide what documents are included, how they are chunked, how identifiers are assigned, how embeddings are created, and how many chunks are retrieved for one question. Those decisions shape the quality of the final answer just as much as the wording of the final prompt does.
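One of those pre-generation decisions, chunking with stable identifiers, can be sketched as follows. This is an illustrative fixed-size character window with overlap; the `size` and `overlap` values are placeholder defaults, not recommendations, and real systems often chunk by sentences, headings, or tokens instead.

```python
def chunk_document(doc_id: str, text: str,
                   size: int = 200, overlap: int = 50) -> list[dict]:
    """Split a document into overlapping character windows,
    assigning each chunk a stable id like 'policies#2'.
    size and overlap are illustrative defaults."""
    chunks = []
    step = size - overlap
    for i, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + size]
        if piece.strip():
            chunks.append({"id": f"{doc_id}#{i}", "text": piece})
        if start + size >= len(text):
            break  # last window already covers the end of the text
    return chunks
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence that straddles a boundary is still retrievable as a whole from at least one chunk, and the `doc#index` ids make it possible to trace a retrieved chunk back to its source document later.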


The structure of this part is as follows:

Finally, Recap and Feedback summarizes the part and prepares you for the final part on security, limits, and responsible use.