Summaries and Derived Data

Summaries Across Related Tables


Learning Objectives

  • You can write grouped queries that summarize data across joined tables.
  • You can use LEFT JOIN to keep groups visible that would otherwise disappear.
  • You can recognize when a join silently inflates counts and predict when this will happen.
  • You can verify a summary query by cross-checking it against the underlying data.

Summary queries often summarize data across relationships. The same relationship paths that we practised in the part on Data in Multiple Tables support both grouped summaries and row queries.

Multi-table summaries also introduce a new hazard: a join can silently inflate counts, making the query look right while quietly returning the wrong numbers. The second half of this chapter introduces the verification habit that catches these mistakes.

Start from the Relationship Path

Suppose the question is: how many submissions has each exercise received?

SELECT
  e.id,
  e.title,
  COUNT(s.id) AS submission_count
FROM exercises AS e
LEFT JOIN submissions AS s ON s.exercise_id = e.id
GROUP BY e.id, e.title
ORDER BY e.id;

The query follows a familiar shape:

  1. start from exercises,
  2. attach matching submissions,
  3. group by exercise,
  4. count the related submission rows.

The result has one row per exercise, with the count of its submissions.

Notice that GROUP BY includes both e.id and e.title. The id alone would be sufficient for grouping, but e.title appears in SELECT and is therefore required in GROUP BY. This is a small but consistent constraint: every non-aggregate column in SELECT must be in GROUP BY.

Summaries and LEFT JOIN

If the question says that all exercises should stay visible even when they have no submissions, then LEFT JOIN is the right fit.

The choice of join type is still driven by the question, exactly as in Data in Multiple Tables. With the query above:

  • an exercise with several submissions gets a count equal to its submission count,
  • an exercise with no submissions still appears in the result,
  • and the count for that exercise is 0, because COUNT(s.id) counts only non-NULL values, and the LEFT JOIN produced NULL for s.id when no submission matched.

A common mistake is to use COUNT(*) instead of COUNT(s.id) here:

SELECT
  e.id,
  e.title,
  COUNT(*) AS submission_count
FROM exercises AS e
LEFT JOIN submissions AS s ON s.exercise_id = e.id
GROUP BY e.id, e.title;

For an exercise with no submissions, the LEFT JOIN still produces one result row (the exercise row with NULL for the submission columns). COUNT(*) counts that row as 1, even though there are no submissions. The exercise then incorrectly appears with a count of 1 instead of 0.

COUNT(s.id) fixes this because it counts only non-NULL submission identifiers. An exercise with no matching submissions has s.id as NULL, so it contributes nothing to the count.

This is a small detail with a real consequence. In summary queries that use LEFT JOIN, prefer COUNT(column) over COUNT(*).

Loading Exercise...

A Longer Summary Path

Summary queries can span more than two tables. Suppose the question is: how many submissions has each course received?

SELECT
  c.code,
  COUNT(s.id) AS submission_count
FROM courses AS c
INNER JOIN course_instances AS ci ON ci.course_id = c.id
INNER JOIN exercises AS e ON e.course_instance_id = ci.id
LEFT JOIN submissions AS s ON s.exercise_id = e.id
GROUP BY c.code
ORDER BY c.code;

The path is the same kind that we’ve looked at earlier in the course: courses -> course_instances -> exercises -> submissions. The new ingredient is grouping at the course level, even though the rows being summarized live four tables away.

For each result row:

one course code, with the count of all submissions across all instances and exercises of that course.

The LEFT JOIN for submissions is appropriate because the question keeps courses visible even if they have no submissions. The earlier joins are INNER JOIN because every course instance must belong to a real course and every exercise must belong to a real course instance.

Joins Can Silently Inflate Counts

This is the trickiest pattern in summary querying, and it deserves slow attention.

Consider this question: how many cards and how many tags does each deck have?

A first attempt might combine both into one query:

SELECT
  d.id,
  d.name,
  COUNT(c.id) AS card_count,
  COUNT(ct.tag_id) AS tag_count
FROM decks AS d
LEFT JOIN cards AS c ON c.deck_id = d.id
LEFT JOIN card_tags AS ct ON ct.card_id = c.id
GROUP BY d.id, d.name;

This looks reasonable, but the card_count is wrong. To see why, consider what the joined rows look like before grouping.

If a deck has 3 cards, and one of those cards has 4 tags, the LEFT JOIN to card_tags produces 4 rows for that one card (one row per tag). The other two cards each produce one row each (with NULL for the tag side, if they have no tags) or one row per tag if they have tags too.

So the rows entering the grouping for that deck are not 3 (one per card). They might be 6 or 10 or more, depending on how many tags the cards have. Then COUNT(c.id) counts the card on every joined row, so a card with 4 tags is counted 4 times.

The card_count for a deck is no longer the number of cards. It is the number of (card, tag) combinations, which is usually larger.

This is the silent-wrong-answer case. The query runs without errors. The numbers look plausible. But they do not mean what the column names suggest.

The chapter on Complex Summary Queries with CTEs returns to this query and shows the cleanest way to fix it. For now, the important point is to recognize the hazard, which leads directly to the verification habit.

Loading Exercise...

Why Summary Queries Need Verification

Three properties of summary queries make verification especially important.

The result is small. A query that should have produced 100 rows but only produced 99 is hard to notice. A summary query that produces one wrong number is even harder.

The query runs without errors. A syntactically correct summary query that returns a wrong answer cannot be detected by the database. The numbers come back, the page renders, and nothing complains.

Mistakes look reasonable. A count that is off by a factor of 4 because of join inflation is still a number, still a count, and still presented in the same column. Without cross-checking, the wrong number is indistinguishable from the right one.

For these reasons, “the query ran without errors” is not the same as “the query is correct.” Verification is what closes that gap.

Loading Exercise...

Five Verification Techniques

A small set of techniques covers most cases.

Run the ungrouped version. Take the same query but remove the aggregates and grouping. Inspect the raw rows. If the rows look right, the grouping should produce the right summary. If the rows look wrong (too many, too few, unexpected duplicates), the summary will reflect the same problem.

Check one known group by hand. Pick one specific group whose answer you can compute manually from the seed data. If exercise_id = 7 should have 3 submissions, check that the summary reports 3 for that exercise.

Check the totals. Compare aggregate over the grouped result to a separate aggregate over the whole table. For example, if you grouped submissions by exercise and then summed all the per-exercise counts, the total should equal the count of all submissions.

Try an extreme case. A group with no data, a group with one row, the largest group. Each tests a different edge of the query.

Cross-check with a different query. Compute the same number with a different SQL approach. If two different queries return the same number, both are likely correct. If they disagree, something is wrong with at least one.

Each technique catches its own kind of mistake. Going through even a few of them on every nontrivial summary query is a good habit in summary work.

Loading Exercise...

Worked Example: Verifying the Inflated Query

Take the inflated query from earlier in this chapter:

SELECT
  d.id,
  d.name,
  COUNT(c.id) AS card_count,
  COUNT(ct.tag_id) AS tag_count
FROM decks AS d
LEFT JOIN cards AS c ON c.deck_id = d.id
LEFT JOIN card_tags AS ct ON ct.card_id = c.id
GROUP BY d.id, d.name;

Suppose this returns:

idnamecard_counttag_count
1SQL Basics88
2Web Development42

The numbers look reasonable. But, let’s verify, and count cards directly.

SELECT deck_id, COUNT(*) AS true_card_count
FROM cards
GROUP BY deck_id;

This might return:

deck_idtrue_card_count
14
23

The disagreement is now visible. The grouped query reports 8 cards for deck 1, but a direct count gives 4. The difference is the inflation: the join to card_tags doubled the rows for deck 1 because each of its cards had on average two tags.

Without the cross-check, the inflated card_count would have been accepted and the page would have shown wrong numbers. With the cross-check, the bug is immediately apparent.

Another possibility would be to run the ungrouped version of the query. The ungrouped version would show 8 rows for deck 1, which would be a red flag that something is wrong.

Loading Exercise...

Keep Asking Questions

The habit of asking questions about what each row in a query represents is especially important in summary queries that join multiple tables. The joins can create intermediate rows that do not correspond to any stored record, and the grouping can then produce result rows that do not correspond to any stored record either.

In the first query of this chapter, one result row means one exercise. In the longer course-level query, one result row means one course. In the inflated card-and-tag query, one result row means one deck — but the column values do not mean what the column names suggest, which is exactly the problem.

Check Your Understanding

  1. Why is COUNT(joined_table.column) usually safer than COUNT(*) in a query with LEFT JOIN?
  2. What does it mean for a join to “inflate counts,” and when does this happen?
  3. Why is “the query ran without errors” not enough evidence that a summary is correct?

SQL Programming Exercise

This sequence begins with straightforward per-exercise counts and adds one more related table at a time. The last task asks you to keep two summaries in the same query without letting the joins silently inflate one of them, and to verify your result before trusting it.

Loading Exercise...