Summaries and Derived Data

Aggregate Functions


Learning Objectives

  • You can use the most common SQL aggregate functions to answer summary questions.
  • You can explain how each aggregate treats NULL values and what it returns for an empty input.
  • You can choose between COUNT(*), COUNT(column), and COUNT(DISTINCT column) based on what is actually being counted.

Aggregate functions summarize many rows into a smaller result. They are useful whenever the question is about totals, averages, extremes, or counts rather than individual records.

This chapter covers the small set of aggregates that handle the large majority of summary questions, and pays careful attention to two situations that often produce surprising results: NULL values and empty inputs.

The Common Aggregates

A small set of aggregate functions covers most summary work:

  • COUNT(*) — number of rows,
  • COUNT(expression) — number of non-NULL values,
  • SUM(expression) — total of the values,
  • AVG(expression) — average of the values,
  • MIN(expression) — smallest value,
  • MAX(expression) — largest value.

For example:

SELECT COUNT(*) FROM submissions;

returns the number of rows in the submissions table.

SELECT AVG(score) FROM submissions;

returns one number: the average of the non-NULL scores across all submissions.

These look simple, and most of the time they are. The interesting questions arise when NULL values are present, when no rows match the filter, or when the same row contains values that should be counted differently.

Counting Rows vs. Counting Values

The most common confusion in aggregates is between COUNT(*) and COUNT(column).

COUNT(*) counts rows. Every row in the result of the underlying query contributes one to the count, regardless of any column being NULL.

COUNT(column) counts non-NULL values in that column. Rows where the column is NULL do not contribute.

Suppose the submissions table contains:

idexercise_iduser_idscore
17208
2721NULL
382010

Then:

  • COUNT(*) returns 3 (three rows exist),
  • COUNT(score) returns 2 (two rows have a non-NULL score),
  • COUNT(exercise_id) returns 3 (all rows have a non-NULL exercise),
  • COUNT(DISTINCT exercise_id) returns 2 (the two distinct exercise values are 7 and 8).

A counting query is only as meaningful as what it actually counts. “How many submissions exist?” is COUNT(*). “How many submissions have been graded?” is COUNT(score), because grading sets the score and ungraded submissions have NULL. “How many distinct exercises have submissions?” is COUNT(DISTINCT exercise_id).

Choosing the wrong form silently produces a different number, and silent wrong answers are the main hazard of summary queries.

Loading Exercise...

How NULL Affects the Other Aggregates

SUM, AVG, MIN, and MAX all ignore NULL values. They compute their result from the non-NULL values only.

In the example above:

  • SUM(score) returns 18 (8 + 10, with the NULL ignored),
  • AVG(score) returns 9 (the average of the two non-NULL scores),
  • MIN(score) returns 8,
  • MAX(score) returns 10.

This is usually what the question wants. An average of submission scores normally means “average of the scores that exist,” not “treat missing scores as zero.” But this default does have implications:

  • AVG(score) is not the same as SUM(score) / COUNT(*). It is SUM(score) / COUNT(score). The denominator changes.
  • Rows that are entirely missing the value still contribute to COUNT(*) but not to AVG(value).
The default ignores NULL on purpose

SUM, AVG, MIN, and MAX ignore NULL because that is almost always what summary questions mean. If the intent is to include missing values somehow, the query has to express that explicitly.


Loading Exercise...

What Aggregates Return on Empty Input

Another situation where aggregates surprise people is when the underlying query matches no rows.

If a WHERE clause excludes everything, or if the table is empty:

  • COUNT(*) returns 0,
  • COUNT(column) returns 0,
  • SUM(expression) returns NULL,
  • AVG(expression) returns NULL,
  • MIN(expression) returns NULL,
  • MAX(expression) returns NULL.

Notice the asymmetry. COUNT always returns a number, even when there is nothing to count. The other aggregates return NULL when there is nothing to aggregate, because no answer is defined.

This matters for application code. A page that shows “average score: NULL” is rarely what the designer intended. The fix is usually COALESCE:

SELECT COALESCE(AVG(score), 0) AS average_score FROM submissions WHERE exercise_id = 999;

If exercise 999 has no submissions, the bare AVG(score) would return NULL. The COALESCE replaces it with 0 for display.

As a reminder from Simple Expressions in Queries, COALESCE can be used to provide a fallback value when the input is NULL: COALESCE(value, fallback).

Note that a fallback of 0 for “no data” can be misleading. A real average of zero is not the same thing as “no submissions yet.” When the distinction matters, the query should keep NULL and let the application render it as something meaningful, like a string “not yet graded.”

Loading Exercise...

Aggregates in Different Domains

The same aggregate functions show up in different domains with the same shape but different meanings.

For the recipe app:

  • COUNT(*) FROM recipes — total number of recipes,
  • COUNT(DISTINCT user_id) FROM recipes — number of users who have written at least one recipe,
  • MAX(created_at) FROM recipes — the most recent recipe’s timestamp,
  • AVG(quantity) FROM recipe_ingredients — average ingredient quantity (this is probably meaningless without a unit, which is one of the reasons unit belongs to the relationship).

For the library:

  • COUNT(*) FROM loans — total loans, ever,
  • COUNT(*) FROM loans WHERE returned_at IS NULL — currently active loans,
  • COUNT(DISTINCT member_id) FROM loans — number of members who have ever borrowed something,
  • MIN(borrowed_at) FROM loans — the earliest loan in the system.

The library COUNT(*) FROM loans WHERE returned_at IS NULL is a good example of the use of NULL. The NULL value of returned_at means “not yet returned,” and it is exactly that meaning that lets us count active loans.

When Aggregates Are Not the Right Tool

Aggregate queries are well suited for questions about totals, averages, extremes, or counts. They are not the right tool when the goal is to inspect individual rows one by one.

If the question is “what is the average submission score for exercise 7?”, AVG is the answer. If the question is “what scores were given for exercise 7?”, an ordinary SELECT score is the answer. The two questions feel similar but want different shapes of result.

A good habit at this stage is to read the question carefully and ask: do I need one number per group, or do I need a list of stored rows? The answer decides whether to reach for aggregates.

Loading Exercise...

Aggregates and Data Meaning

Even simple aggregates require interpretation: a summary query is part of communication, not just calculation.

For example, an average score only makes sense if you know what the score means, what range it uses, and whether ungraded submissions should be included or excluded. SQL can be used to compute the numbers: you need to understand what the numbers represent.

Check Your Understanding

  1. What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
  2. What does AVG(score) return when no rows have a non-NULL score?
  3. Why is AVG(score) usually not the same as SUM(score) / COUNT(*)?

SQL Programming Exercise

This sequence starts with short aggregate questions where the main job is choosing the right form of COUNT. It then combines several aggregates in one query so that you practice reading what each result column really means.

Loading Exercise...