Summaries and Derived Data

Practicing Summary Queries

Learning Objectives

You can write summary queries for unfamiliar domains starting from the question.
You can choose between grouping, joining, and counting strategies based on what the question asks.
You can recognize when a summary query needs care to avoid silent wrong answers.

In the previous chapters, we walked through summary techniques on the course-platform schema. That is a good starting point, but summary querying is a judgment skill, and judgment gets better with repetition on varied domains.

This chapter gives you two domains to practice on, both familiar from the Database Design and Modeling and the Schemas and Constraints parts: the recipe app and the library.

How to Use This Chapter

For each case below:

revisit the DDL for the domain, available in the Practicing Schema Implementation chapter of part 6,
read the questions and sketch a summary query for each; it does not have to be valid SQL, but it should be clear enough to be translatable into SQL,
then read the commented worked solutions and compare with your own.

The worked solutions show specific choices. Yours may differ, and many alternative queries are correct. What matters is whether your query answers the question, whether it handles the edge cases, and whether you would trust the result enough to put it on a page.

Do not peek

The worked solutions are genuinely more useful after you have sketched out your own queries. The idea is not that you should be able to solve these immediately, but that you invest some effort and time in planning out a way to solve them, and then study the worked solutions to see how they compare with your own.

The worked solutions also use techniques that we haven’t covered yet, so they are not meant to be a model for what you should be able to do right now. They are meant to show how the techniques from this part fit together in real queries, and to give you a sense of the variety of correct approaches that exist for each question.

Case 1: The Recipe App

Recall the recipe-app schema:

users(id, name, email),
recipes(id, user_id, title, description, created_at),
ingredients(id, name),
recipe_ingredients(recipe_id, ingredient_id, quantity, unit),
favorites(user_id, recipe_id).

Here are five summary questions:

How many recipes has each user authored?
Which ingredients are used in the most recipes?
What is the average number of ingredients per recipe?
How many recipes have not been favorited by anyone?
Which users have authored the most recipes per week over the last four weeks?

Try each yourself before reading on. Pay particular attention to questions 4 and 5, which require specific patterns that the previous chapters introduced.

A Commented Worked Set

Question 1: How many recipes has each user authored?

SELECT
  u.id,
  u.name,
  COUNT(r.id) AS recipe_count
FROM users AS u
LEFT JOIN recipes AS r ON r.user_id = u.id
GROUP BY u.id, u.name
ORDER BY recipe_count DESC;

The LEFT JOIN keeps users with no recipes visible (they get a count of 0). COUNT(r.id) instead of COUNT(*) ensures that users without recipes get 0 rather than 1, as discussed earlier in this part. Each result row represents one user with their authored-recipe count.

Question 2: Which ingredients are used in the most recipes?

SELECT
  i.id,
  i.name,
  COUNT(DISTINCT ri.recipe_id) AS recipe_count
FROM ingredients AS i
INNER JOIN recipe_ingredients AS ri ON ri.ingredient_id = i.id
GROUP BY i.id, i.name
ORDER BY recipe_count DESC
LIMIT 10;

This uses INNER JOIN because the question asks which ingredients are used most, not which ingredients exist. An ingredient never used would not be a meaningful answer here. COUNT(DISTINCT ri.recipe_id) is technically not necessary because (recipe_id, ingredient_id) is the primary key of recipe_ingredients, so each (recipe, ingredient) pair appears at most once. But adding DISTINCT makes the intent explicit and protects against future schema changes that might allow the same ingredient to appear twice in a recipe.

Question 3: What is the average number of ingredients per recipe?

SELECT AVG(ingredient_count) AS average_ingredients_per_recipe
FROM (
  SELECT recipe_id, COUNT(*) AS ingredient_count
  FROM recipe_ingredients
  GROUP BY recipe_id
) AS per_recipe;

This is a two-step summary: first count ingredients per recipe, then average those counts. A common mistake is to compute COUNT(*) / COUNT(DISTINCT recipe_id) directly on recipe_ingredients, which gives the same answer for non-empty cases but feels less clear and breaks for recipes with no ingredients (which would not appear in recipe_ingredients at all, and so would be excluded from the average even though they should arguably count as 0).

If the question intended to include empty recipes (recipes with no ingredients) as 0, the query should start from recipes and LEFT JOIN:

SELECT AVG(ingredient_count) AS average_ingredients_per_recipe
FROM (
  SELECT r.id, COUNT(ri.ingredient_id) AS ingredient_count
  FROM recipes AS r
  LEFT JOIN recipe_ingredients AS ri ON ri.recipe_id = r.id
  GROUP BY r.id
) AS per_recipe;

The choice between these two queries depends on what the question really means. This is exactly the kind of clarification that summary work often surfaces.

Loading Exercise...

Question 4: How many recipes have not been favorited by anyone?

SELECT COUNT(*) AS recipes_with_no_favorites
FROM recipes AS r
WHERE NOT EXISTS (
  SELECT 1 FROM favorites AS f WHERE f.recipe_id = r.id
);

The NOT EXISTS pattern from the chapter Subqueries and Set Operations of part 4 is a clean way to express “no related row exists.” An alternative is a LEFT JOIN with a WHERE ... IS NULL filter:

SELECT COUNT(*) AS recipes_with_no_favorites
FROM recipes AS r
LEFT JOIN favorites AS f ON f.recipe_id = r.id
WHERE f.recipe_id IS NULL;

Both queries return the same number. NOT EXISTS is usually clearer when the question is about absence; the LEFT JOIN version is sometimes preferred when other related data is also needed.

Loading Exercise...

Question 5: Which users have authored the most recipes per week over the last four weeks?

SELECT
  u.id,
  u.name,
  date_trunc('week', r.created_at) AS week,
  COUNT(*) AS recipe_count
FROM users AS u
INNER JOIN recipes AS r ON r.user_id = u.id
WHERE r.created_at >= CURRENT_TIMESTAMP - INTERVAL '4 weeks'
GROUP BY u.id, u.name, date_trunc('week', r.created_at)
ORDER BY week DESC, recipe_count DESC;

Three things to notice here:

date_trunc('week', r.created_at) buckets timestamps into weeks. This is a richer form of the time-bucketing pattern that we peeked at earlier.
The WHERE clause filters to the last four weeks, which is a row condition (it applies to individual recipes). Putting this in HAVING would be wrong.
The grouping is by both user and week, so one result row represents one user-and-week combination with the count of recipes that user authored in that week.

This query does not list users who authored zero recipes in some weeks, which may or may not match the intent. If the question wants every (user, week) cell including zeros, the query becomes substantially more complex (requiring a generated week series). For most realistic dashboards, the simpler form here is sufficient.

Case 2: The Library

Recall the library schema:

books(id, title, isbn),
copies(id, book_id, barcode),
members(id, name, email, joined_at),
loans(id, copy_id, member_id, borrowed_at, returned_at).

Here are five summary questions:

How many active loans does each member have right now?
Which books have the most copies?
What is the average loan duration in days for completed loans?
Which members have never borrowed a book?
How many books were borrowed each month over the past year?

Try each yourself before reading on.

Too Literal Translations

A frequent mistake is to write a summary query that follows the wording of the question without checking what the wording actually implies.

For example, “how many books were borrowed each month” sounds straightforward. But “borrowed” could mean “started a new loan in this month” or “had an active loan at any point in this month.” The first is a quick query; the second is much harder. A literal translation can produce a quick answer to the wrong question.

The safer habit is to read the question, propose a query, and then state in plain language what the query actually counts. If the statement matches the question, the query is right. If the statement reveals a mismatch, the question needs clarification.

A Commented Worked Set

Question 1: How many active loans does each member have right now?

SELECT
  m.id,
  m.name,
  COUNT(l.id) AS active_loan_count
FROM members AS m
LEFT JOIN loans AS l ON l.member_id = m.id AND l.returned_at IS NULL
GROUP BY m.id, m.name
ORDER BY active_loan_count DESC;

Two important details. First, l.returned_at IS NULL is what makes a loan “active” — the schema stores NULL for unreturned loans. Second, the IS NULL condition is in the ON clause of the LEFT JOIN, not in WHERE. This placement is important enough to walk through carefully with a concrete three-member example.

Suppose the loans table contains:

id	member_id	borrowed_at	returned_at
101	1 (Amina)	2026-03-01	2026-03-15
102	1 (Amina)	2026-04-10	`NULL`
103	2 (Alvee)	2026-02-01	2026-02-20

And the members table contains Amina (id 1), Alvee (id 2), and Rasmus (id 3, no loans ever).

The correct answer: Amina has 1 active loan, Alvee has 0, Rasmus has 0.

With the condition in the ON clause (as written above), the LEFT JOIN keeps one row for Amina that matches her active loan (102), produces one placeholder row for Alvee with NULL for the loan side (because his only loan has a non-NULL returned_at and so fails the ON condition), and produces one placeholder row for Rasmus with NULL for the loan side (because he has no loans at all). COUNT(l.id) returns 1, 0, 0 — correct.

With the condition moved to WHERE:

-- INCORRECT
SELECT
  m.id, m.name, COUNT(l.id) AS active_loan_count
FROM members AS m
LEFT JOIN loans AS l ON l.member_id = m.id
WHERE l.returned_at IS NULL
GROUP BY m.id, m.name;

The LEFT JOIN first produces the fully-expanded joined rows: Amina gets two rows (one for each of her loans), Alvee gets one row, Rasmus gets one placeholder row with NULL for the loan side. Then WHERE l.returned_at IS NULL runs. This keeps Amina’s active loan (102) and Rasmus’s placeholder row (because placeholder rows have NULL everywhere, including returned_at). But it drops Alvee’s row (his one loan has returned_at = 2026-02-20, not NULL) and drops Amina’s returned-loan row (101).

The resulting rows: Amina 1 (correct), Alvee missing entirely (wrong — should be 0), Rasmus 0 (correct). The query is still wrong because it no longer returns one row for every member. The member with only returned loans disappeared.

Moving IS NULL into the ON clause filters the join before it produces placeholder rows, so the LEFT JOIN behavior (keep every left-side row) is preserved. Moving it into WHERE breaks that behavior by dropping left-side rows whose related rows all fail the active-loan condition.

The general pattern: when a condition should filter the right-hand side of a LEFT JOIN without excluding left-hand rows, put the condition in ON.

Loading Exercise...

Question 2: Which books have the most copies?

SELECT
  b.id,
  b.title,
  COUNT(c.id) AS copy_count
FROM books AS b
LEFT JOIN copies AS c ON c.book_id = b.id
GROUP BY b.id, b.title
ORDER BY copy_count DESC
LIMIT 10;

Standard pattern. LEFT JOIN to keep books with no copies visible (they would have copy_count of 0). Each result row represents one book with its copy count.

Question 3: What is the average loan duration in days for completed loans?

SELECT AVG(EXTRACT(EPOCH FROM (returned_at - borrowed_at)) / 86400) AS average_days
FROM loans
WHERE returned_at IS NOT NULL;

The WHERE returned_at IS NOT NULL keeps only completed loans, where duration is defined. Active loans (with NULL for returned_at) are excluded because their duration cannot be computed yet. The arithmetic computes the difference as an interval, extracts seconds, and divides by 86400 to get days.

PostgreSQL has several ways to write this. A simpler version:

SELECT AVG(returned_at - borrowed_at) AS average_duration
FROM loans
WHERE returned_at IS NOT NULL;

This returns the average as an interval, which the application can then format. Both are valid; the choice depends on what the application expects.

Question 4: Which members have never borrowed a book?

SELECT id, name, email
FROM members AS m
WHERE NOT EXISTS (
  SELECT 1 FROM loans AS l WHERE l.member_id = m.id
);

This is a row-list query rather than a summary query, but it appears here because the question is the kind that summary thinking surfaces. The NOT EXISTS pattern is the same one used in the recipe app for “recipes with no favorites.”

This query does not need grouping — it returns one row per member, not one summary per group. Recognizing when a question is row-shaped rather than summary-shaped is itself a useful skill.

Question 5: How many books were borrowed each month over the past year?

SELECT
  date_trunc('month', borrowed_at) AS month,
  COUNT(*) AS loan_count
FROM loans
WHERE borrowed_at >= CURRENT_TIMESTAMP - INTERVAL '1 year'
GROUP BY date_trunc('month', borrowed_at)
ORDER BY month;

Time bucketing again, this time by month. The WHERE clause restricts the data to the last year. Each result row represents one month with the count of loans started that month.

This query has an important nuance: it counts loans by their start date, not by activity during the month. A loan that started in March and ran through May would be counted in March only. If the question really wanted “loans active during each month,” the query would be more complex, requiring overlapping interval logic.

The wording of summary questions almost always hides such details. Surfacing them — by asking the question carefully and stating clearly what is being counted — is part of the skill.

What the Two Cases Show Together

The recipe app and the library use the same SQL techniques but exercise different patterns.

The recipe app surfaces the NOT EXISTS vs LEFT JOIN ... IS NULL choice for “no related rows.” It also brings out the question of whether empty groups (recipes with no ingredients) should be included or excluded from averages.

The library surfaces the NULL semantics. Active loans are defined by returned_at IS NULL, and that condition has to be placed correctly in LEFT JOIN ... ON rather than in WHERE. The duration question shows interval arithmetic. The monthly counts question hides a “by start date or by activity” ambiguity that good summary work surfaces rather than guesses about.

Neither case has unique patterns. Both could appear in any domain. The point of practicing on more than one is to see the same techniques under different framings.

Same techniques, different problems

The five techniques used in both cases — grouping, LEFT JOIN, NOT EXISTS, time bucketing, and careful NULL handling — are the toolkit for almost any summary question. Knowing when each one fits is what summary practice teaches.

Check Your Understanding

Why is NOT EXISTS often the cleanest way to find rows with no related row?
Why does the IS NULL condition for active loans belong in the LEFT JOIN ... ON clause rather than in WHERE?
Why does a literal translation of a summary question sometimes produce a wrong answer?

← Summaries Across Related Tables

Readable Summary Queries →