Case Study: Assessing an LLM System
Learning Objectives
- You can assess an LLM system by looking at architecture, risks, and safeguards together.
- You can identify strengths and weaknesses in a realistic course-level design.
- You can propose concrete improvements grounded in concepts from the course.
This chapter brings the course together through one case study. The goal is not to find a single perfect answer, but to practice the kind of structured judgment that software engineering with LLMs requires.
The system
Suppose that a university department wants to build an internal command-line assistant for teaching staff. The assistant helps staff members answer student questions about course policies and create follow-up tickets for issues that need manual handling.
The system has these capabilities:
- it retrieves information from a collection of course policy documents and FAQs,
- it can look up a small schedule file with known deadlines,
- it can create a support ticket through a tool call,
- and it logs prompts, retrieved chunks, answers, and tool calls for later review.
Figure 1 shows a simplified architecture.
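The capabilities above can be pictured as a small handling loop: retrieve context, generate an answer, and log everything for later review. The sketch below is purely illustrative; every name in it (`Interaction`, `retrieve`, `answer_with_context`, `handle_query`) is invented for this chapter and the retrieval step is reduced to a keyword match standing in for real embedding search.

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """One logged exchange: prompt, retrieved chunks, answer, tool calls."""
    prompt: str
    retrieved: list = field(default_factory=list)
    answer: str = ""
    tool_calls: list = field(default_factory=list)

# Illustrative stand-in for the policy corpus.
POLICY_CHUNKS = ["Late submissions lose 5% per day."]

def retrieve(prompt):
    # A real system would use embedding search over policy documents and FAQs.
    return [c for c in POLICY_CHUNKS if "late" in prompt.lower()]

def answer_with_context(prompt, chunks):
    if not chunks:
        return "I could not find relevant policy text."
    return "Based on the policy: " + chunks[0]

LOG = []

def handle_query(prompt):
    record = Interaction(prompt=prompt)
    record.retrieved = retrieve(prompt)
    record.answer = answer_with_context(prompt, record.retrieved)
    LOG.append(record)  # prompts, chunks, and answers are logged for review
    return record.answer
```

Even this toy version makes one design fact visible: the log captures the full interaction record, which is exactly what later raises the privacy question discussed below.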
What looks strong in this design
Several design choices are promising:
- the assistant is scoped to internal staff use rather than unrestricted public use,
- it retrieves course-specific material instead of relying only on model memory,
- it uses explicit tools for schedule lookup and ticket creation,
- and it keeps logs that can support debugging and review.
These choices suggest that the team is trying to build a system, not only a prompt.
What could go wrong
The same design also creates risks.
Risk 1: Untrusted retrieved text
If the policy documents or FAQs contain stale, ambiguous, or manipulated text, retrieval may supply weak context. A model that sounds confident can still misstate the rules.
Risk 2: Over-automation of ticket creation
If the ticket creation tool is called too freely, the assistant may create unnecessary or misleading tickets. This risk grows if the model can interpret vague user input as a request for action.
Risk 3: Privacy in logs
If staff questions contain student names, identifiers, or sensitive circumstances, the logs may store more personal information than necessary.
Risk 4: False confidence
Even when the assistant cites retrieved chunks, the answer may still be incomplete or based on the wrong chunk. Users may trust a polished answer more than they should.
Questions to ask when assessing the system
When you review a design like this, useful questions include:
- Which inputs are trusted, and which are untrusted?
- What actions can happen without human approval?
- What data is stored in logs, and who can access it?
- How would the team know whether retrieval quality and answer quality are actually good enough?
- What happens if the system cannot find enough evidence in the documents?
These questions link directly back to the earlier parts of the course: specification, validation, retrieval, evaluation, tools, and oversight all matter here.
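The last question on the checklist, what happens when the documents do not contain enough evidence, can be turned into an explicit guard rather than left to the model's judgment. A minimal sketch, assuming the retriever returns `(chunk, score)` pairs and that `0.6` is a plausible relevance threshold; both the threshold and the pair format are assumptions to be tuned against real data:

```python
MIN_SCORE = 0.6   # assumed relevance threshold; tune against an evaluation set
MIN_CHUNKS = 1

def context_is_sufficient(scored_chunks):
    """scored_chunks: list of (chunk_text, relevance_score) pairs."""
    good = [chunk for chunk, score in scored_chunks if score >= MIN_SCORE]
    return len(good) >= MIN_CHUNKS, good

def answer_or_decline(question, scored_chunks, generate):
    """Answer only when the retrieved evidence clears the threshold."""
    ok, good = context_is_sufficient(scored_chunks)
    if not ok:
        return "I could not find enough evidence in the course documents to answer this."
    return generate(question, good)
```

The point is not the particular threshold but the decision structure: the system, not the model's tone, determines whether an evidence-backed answer is allowed.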
A short interaction to assess
It is often easier to judge a system when we look at one concrete interaction rather than only at the architecture diagram. Suppose that a staff member asks:
A student says they missed the deadline because of illness. Can they still submit, and should I create a support ticket?
Assume that the assistant retrieves one policy chunk about late penalties, one FAQ chunk saying that documented medical cases must be reviewed manually, and a schedule entry containing the original deadline. It then replies:
Students normally lose 5% per late day. Because illness was mentioned, I have created a support ticket for manual review.
This answer has one good property and one worrying property. The good property is that the assistant used local documentation instead of answering entirely from model memory. The worrying property is that it took an action immediately. A human reviewer should ask whether ticket creation really belongs on the same autonomy level as answering a question about policy. The system may need a confirmation step even if the policy answer itself is acceptable.
The same interaction also raises a logging question. If the original staff query included the student’s name or a medical detail, does the log store that information permanently? If so, the team should justify why that data is needed and whether it could be minimized or redacted.
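Minimization can start with simple pattern-based redaction before a log entry is stored. The patterns below are illustrative assumptions; a real deployment would need patterns matched to the institution's actual identifier formats, and pattern matching alone will miss free-text details such as descriptions of a medical situation:

```python
import re

# Illustrative redaction rules; both patterns are assumptions, not a
# complete or institution-specific identifier inventory.
PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b\d{7,10}\b"), "[STUDENT_ID]"),
]

def redact(text):
    """Replace matched identifiers with placeholders before logging."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Redacting at write time, rather than at review time, means the sensitive values never reach permanent storage in the first place.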
If the team wanted a safer starting point for the assistant behavior, it could treat the prompt as part of the safeguard design. For example, a safer system prompt might say:
You are an internal course-support assistant for teaching staff.
Answer using only the supplied policy and schedule context.
If the context is insufficient, say so clearly.
If the context suggests that a case requires manual handling, recommend manual review or ticket creation, but do not claim that a ticket has already been created unless the application explicitly confirms that action elsewhere.
This does not replace tool permissions or approval boundaries, but it aligns the prompt with them. The model is being asked to recommend a next step, not to blur recommendation and execution.
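That alignment can also be enforced in code rather than only in the prompt. The sketch below shows a confirmation gate in which the tool call is unreachable without explicit human approval; the function names and the `recommendation` shape are hypothetical stand-ins for the application's real components:

```python
def gated_create_ticket(recommendation, confirm, create_ticket):
    """Execute the ticket tool only after explicit human approval.

    recommendation: dict with 'summary' and 'evidence' produced by the model.
    confirm: asks a human reviewer; returns True only on explicit approval.
    create_ticket: the real tool call; never reached without approval.
    """
    if not confirm(recommendation):
        return {"status": "declined", "ticket_id": None}
    ticket_id = create_ticket(recommendation["summary"])
    return {"status": "created", "ticket_id": ticket_id}
```

With this structure, the model can only ever produce a recommendation; the decision to execute belongs to the surrounding application and its human reviewer.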
Possible improvements
A stronger version of the system could:
- require human confirmation before creating a ticket,
- redact or minimize sensitive information in logs,
- instruct the model to answer only from retrieved context and to say when evidence is missing,
- store citations with the answer so reviewers can inspect the source,
- and maintain a small evaluation set of common support questions.
None of these changes makes the system perfect. Together, they make it easier to understand, control, and improve.
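The last improvement on the list, a small evaluation set, can be as simple as a table of questions paired with a phrase the answer must contain and the document the citation should point at. Everything below is illustrative: the entries, the file names, and the assumed `answer_fn(question) -> (answer, cited_source)` contract are all invented for this sketch:

```python
# A tiny, hand-written evaluation set of common support questions.
EVAL_SET = [
    {"question": "What is the late penalty?",
     "must_contain": "5%",
     "expected_source": "late-policy.md"},
    {"question": "Can a documented medical case be handled automatically?",
     "must_contain": "manual review",
     "expected_source": "faq.md"},
]

def run_eval(answer_fn):
    """Run every case; return the questions that failed."""
    failures = []
    for case in EVAL_SET:
        answer, source = answer_fn(case["question"])
        if case["must_contain"] not in answer or source != case["expected_source"]:
            failures.append(case["question"])
    return failures
```

Run after every change to the prompts, documents, or retrieval settings, even a checklist this small tells the team whether a "small tweak" quietly broke an answer they care about.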
Another useful improvement would be to separate recommendation from execution more clearly. The assistant could suggest that a ticket may be needed, show the evidence it used, and then require explicit human confirmation before the tool call is allowed to run. This preserves the productivity benefit while reducing the risk of silent over-automation.
What this case study teaches
A realistic LLM system should be judged as a whole:
- the model,
- the prompts,
- the retrieved data,
- the tools,
- the validation logic,
- and the human workflow.
A narrow question such as “Is the model good?” misses most of what matters in actual software engineering.