Summaries and Derived Data

Complex Summary Queries with CTEs


Learning Objectives

  • You can use common table expressions to structure larger summary queries.
  • You can use CTEs to fix the count-inflation problem.
  • You can read a summary query as a sequence of named intermediate steps.

As summary queries grow, readability starts to matter a lot. A query with three nested levels and four joins is hard to read, hard to verify, and hard to revise. Common table expressions, or CTEs, give a query named intermediate steps that can be read and reasoned about one at a time.

A First Example

Consider the question: which exercises have at least five graded submissions, and what is their average score?

Without CTEs, the query is workable but dense:

SELECT
  exercise_id,
  COUNT(*) AS graded_count,
  ROUND(AVG(score), 2) AS average_score
FROM submissions
WHERE status = 'graded'
GROUP BY exercise_id
HAVING COUNT(*) >= 5;

With a CTE, the query reads as a short sequence of ideas:

WITH graded_submissions AS (
  SELECT exercise_id, score
  FROM submissions
  WHERE status = 'graded'
)
SELECT
  exercise_id,
  COUNT(*) AS graded_count,
  ROUND(AVG(score), 2) AS average_score
FROM graded_submissions
GROUP BY exercise_id
HAVING COUNT(*) >= 5;

The CTE graded_submissions defines the input to the summary. The main query then groups and aggregates from that input.

For a query this short, the CTE adds line count without adding much clarity. The real benefit appears when the query grows.

Loading Exercise...

When Common Table Expressions Help

CTEs help most when:

  • the query has more than one logical step,
  • intermediate results have a meaningful name,
  • the same intermediate result is used in more than one place,
  • the query needs to fix the count-inflation problem.

A CTE turns a query into something that reads like:

first compute X, call it A. then compute Y from A, call it B. finally produce the result from B.

That structure is much easier to follow than a single nested expression with several joins and conditions piled on top of each other.

Using CTEs to Fix Count Inflation

The most concrete reason to reach for CTEs is the count-inflation problem from Chapter 5. Recall the failed attempt to count cards and tags in one query:

SELECT
  d.id,
  d.name,
  COUNT(c.id) AS card_count,
  COUNT(ct.tag_id) AS tag_count
FROM decks AS d
LEFT JOIN cards AS c ON c.deck_id = d.id
LEFT JOIN card_tags AS ct ON ct.card_id = c.id
GROUP BY d.id, d.name;

This silently inflates card_count because each card with multiple tags appears multiple times in the joined rows.

CTEs let each summary be computed in its own clean grouped query, then combined at the deck level:

WITH card_counts AS (
  SELECT deck_id, COUNT(*) AS card_count
  FROM cards
  GROUP BY deck_id
),
tag_counts AS (
  SELECT c.deck_id, COUNT(*) AS tag_count
  FROM cards AS c
  INNER JOIN card_tags AS ct ON ct.card_id = c.id
  GROUP BY c.deck_id
)
SELECT
  d.id,
  d.name,
  COALESCE(cc.card_count, 0) AS card_count,
  COALESCE(tc.tag_count, 0) AS tag_count
FROM decks AS d
LEFT JOIN card_counts AS cc ON cc.deck_id = d.id
LEFT JOIN tag_counts AS tc ON tc.deck_id = d.id;

The query now reads in three clear steps:

  1. count cards per deck (clean, no inflation),
  2. count tags per deck (clean, no inflation, joined through cards),
  3. combine the two counts at the deck level, with COALESCE to handle decks that have no cards or no tags.

Each intermediate summary can be inspected on its own. If card_counts returns wrong numbers, the bug is in that CTE. If the final result has unexpected NULLs, the issue is in the join in the main query. The decomposition makes debugging much easier than it would be in a single dense query.

This is the most reliable case for CTEs in summary work. Whenever a query needs to compute multiple summaries that involve different relationship paths, CTEs are usually the cleanest fix.

Loading Exercise...

Naming Intermediate Steps

A CTE name should describe what the intermediate result represents. Good names make the query read like prose.

Compare:

WITH t1 AS (
  SELECT exercise_id, COUNT(*) AS c FROM submissions GROUP BY exercise_id
)
SELECT * FROM t1 WHERE c >= 5;
WITH submission_counts_per_exercise AS (
  SELECT exercise_id, COUNT(*) AS submission_count
  FROM submissions
  GROUP BY exercise_id
)
SELECT exercise_id, submission_count
FROM submission_counts_per_exercise
WHERE submission_count >= 5;

The second version is longer but immediately readable. The CTE name (submission_counts_per_exercise) and the column name (submission_count) both explain what the data represents.

For introductory work, the rule is simple: if you cannot say what a CTE represents in one short phrase, the name is probably too vague.

A Richer Example

Suppose the question is: which decks have above-average card counts, and how does each compare?

WITH deck_card_counts AS (
  SELECT
    d.id,
    d.name,
    COUNT(c.id) AS card_count
  FROM decks AS d
  LEFT JOIN cards AS c ON c.deck_id = d.id
  GROUP BY d.id, d.name
),
overall_average AS (
  SELECT AVG(card_count) AS average_card_count
  FROM deck_card_counts
)
SELECT
  dcc.name,
  dcc.card_count,
  ROUND(oa.average_card_count, 2) AS overall_average,
  ROUND(dcc.card_count - oa.average_card_count, 2) AS difference_from_average
FROM deck_card_counts AS dcc
CROSS JOIN overall_average AS oa
WHERE dcc.card_count > oa.average_card_count
ORDER BY difference_from_average DESC;

Reading the query:

  1. compute card counts per deck,
  2. compute the overall average across those counts,
  3. show the decks above average, with their counts and the difference from the average.

CROSS JOIN here pairs every deck-count row with the single overall-average row, which is what lets the comparison happen. The subtraction uses the full-precision average and rounds only the displayed value, so a deck that is 0.4 above average is not reported as being 0 above average.

This kind of query is awkward without CTEs. With CTEs, each step is named and the intent is visible.

CTEs and Verification

CTEs are also a verification aid. Each named step can be inspected on its own by replacing the main query with SELECT * FROM cte_name:

WITH graded_submissions AS (
  SELECT exercise_id, score FROM submissions WHERE status = 'graded'
)
SELECT * FROM graded_submissions;  -- inspect the intermediate

This is the same idea as the “run the ungrouped version” verification technique outlined in the chapter Summaries Across Related Tables, but applied to each named step independently.

When building complex summary queries, the ability to inspect each step in isolation is highly useful and perhaps one of the strongest reasons to use CTEs.

Loading Exercise...

CTEs vs. Subqueries

The same query can often be written as a CTE or as a subquery. For example, the above-average-decks query could in principle be inlined:

SELECT
  d.id,
  d.name,
  COUNT(c.id) AS card_count
FROM decks AS d
LEFT JOIN cards AS c ON c.deck_id = d.id
GROUP BY d.id, d.name
HAVING COUNT(c.id) > (
  SELECT AVG(card_count) FROM (
    SELECT COUNT(c.id) AS card_count
    FROM decks AS d
    LEFT JOIN cards AS c ON c.deck_id = d.id
    GROUP BY d.id
  ) AS x
);

This is much harder to read. Notice that the decks LEFT JOIN cards GROUP BY d.id computation has to appear twice — once to produce the deck-level counts in the outer query, and again inside the inner subquery to compute the overall average. The same logic appears twice, the nesting is deep, and there is no name for any of the intermediate ideas. The CTE version spreads the same logic across named steps that anyone can follow.

The general guidance is: if a subquery is short and used in only one place, inline it. If a subquery is long, used more than once, or hides an important intermediate idea, lift it into a CTE.

CTEs and Views

A CTE is local to one query. A view is a stored named query that any other query can use.

CREATE VIEW deck_card_counts AS
SELECT d.id, d.name, COUNT(c.id) AS card_count
FROM decks AS d
LEFT JOIN cards AS c ON c.deck_id = d.id
GROUP BY d.id, d.name;

After defining this view, any query can use deck_card_counts as if it were a table. That is useful when the same intermediate result is needed across many queries.

For the purposes of this course, CTEs are usually preferred because they keep each query self-contained. Views become valuable when an organization or a project has stable summary definitions that should be shared across many places.

Loading Exercise...

Check Your Understanding

  1. What is the most concrete reason to use a CTE in a multi-summary query?
  2. Why is “use a clear name for each CTE” worth the extra typing?
  3. When is a view a better choice than a CTE?

SQL Programming Exercise

This sequence starts with one small counted step and then grows toward the two main reasons this chapter introduced CTEs at all: separating summaries that should not inflate each other, and naming intermediate results so that a comparison query stays readable. The final task asks you to build a query from two named steps instead of trying to hide the same logic inside one dense statement.

Loading Exercise...