Aggregate Functions
Learning Objectives
- You can use the most common SQL aggregate functions to answer summary questions.
- You can explain how each aggregate treats
NULLvalues and what it returns for an empty input. - You can choose between
COUNT(*),COUNT(column), andCOUNT(DISTINCT column)based on what is actually being counted.
Aggregate functions summarize many rows into a smaller result. They are useful whenever the question is about totals, averages, extremes, or counts rather than individual records.
This chapter covers the small set of aggregates that handle the large majority of summary questions, and pays careful attention to two situations that often produce surprising results: NULL values and empty inputs.
The Common Aggregates
A small set of aggregate functions covers most summary work:
COUNT(*)— number of rows,COUNT(expression)— number of non-NULLvalues,SUM(expression)— total of the values,AVG(expression)— average of the values,MIN(expression)— smallest value,MAX(expression)— largest value.
For example:
SELECT COUNT(*) FROM submissions;
returns the number of rows in the submissions table.
SELECT AVG(score) FROM submissions;
returns one number: the average of the non-NULL scores across all submissions.
These look simple, and most of the time they are. The interesting questions arise when NULL values are present, when no rows match the filter, or when the same row contains values that should be counted differently.
Counting Rows vs. Counting Values
The most common confusion in aggregates is between COUNT(*) and COUNT(column).
COUNT(*) counts rows. Every row in the result of the underlying query contributes one to the count, regardless of any column being NULL.
COUNT(column) counts non-NULL values in that column. Rows where the column is NULL do not contribute.
Suppose the submissions table contains:
| id | exercise_id | user_id | score |
|---|---|---|---|
| 1 | 7 | 20 | 8 |
| 2 | 7 | 21 | NULL |
| 3 | 8 | 20 | 10 |
Then:
COUNT(*)returns3(three rows exist),COUNT(score)returns2(two rows have a non-NULLscore),COUNT(exercise_id)returns3(all rows have a non-NULLexercise),COUNT(DISTINCT exercise_id)returns2(the two distinct exercise values are 7 and 8).
A counting query is only as meaningful as what it actually counts. “How many submissions exist?” is COUNT(*). “How many submissions have been graded?” is COUNT(score), because grading sets the score and ungraded submissions have NULL. “How many distinct exercises have submissions?” is COUNT(DISTINCT exercise_id).
Choosing the wrong form silently produces a different number, and silent wrong answers are the main hazard of summary queries.
How NULL Affects the Other Aggregates
SUM, AVG, MIN, and MAX all ignore NULL values. They compute their result from the non-NULL values only.
In the example above:
SUM(score)returns18(8 + 10, with theNULLignored),AVG(score)returns9(the average of the two non-NULLscores),MIN(score)returns8,MAX(score)returns10.
This is usually what the question wants. An average of submission scores normally means “average of the scores that exist,” not “treat missing scores as zero.” But this default does have implications:
AVG(score)is not the same asSUM(score) / COUNT(*). It isSUM(score) / COUNT(score). The denominator changes.- Rows that are entirely missing the value still contribute to
COUNT(*)but not toAVG(value).
SUM, AVG, MIN, and MAX ignore NULL because that is almost always what summary questions mean. If the intent is to include missing values somehow, the query has to express that explicitly.
What Aggregates Return on Empty Input
Another situation where aggregates surprise people is when the underlying query matches no rows.
If a WHERE clause excludes everything, or if the table is empty:
COUNT(*)returns0,COUNT(column)returns0,SUM(expression)returnsNULL,AVG(expression)returnsNULL,MIN(expression)returnsNULL,MAX(expression)returnsNULL.
Notice the asymmetry. COUNT always returns a number, even when there is nothing to count. The other aggregates return NULL when there is nothing to aggregate, because no answer is defined.
This matters for application code. A page that shows “average score: NULL” is rarely what the designer intended. The fix is usually COALESCE:
SELECT COALESCE(AVG(score), 0) AS average_score FROM submissions WHERE exercise_id = 999;
If exercise 999 has no submissions, the bare AVG(score) would return NULL. The COALESCE replaces it with 0 for display.
As a reminder from Simple Expressions in Queries,
COALESCEcan be used to provide a fallback value when the input isNULL:COALESCE(value, fallback).
Note that a fallback of 0 for “no data” can be misleading. A real average of zero is not the same thing as “no submissions yet.” When the distinction matters, the query should keep NULL and let the application render it as something meaningful, like a string “not yet graded.”
Aggregates in Different Domains
The same aggregate functions show up in different domains with the same shape but different meanings.
For the recipe app:
COUNT(*) FROM recipes— total number of recipes,COUNT(DISTINCT user_id) FROM recipes— number of users who have written at least one recipe,MAX(created_at) FROM recipes— the most recent recipe’s timestamp,AVG(quantity) FROM recipe_ingredients— average ingredient quantity (this is probably meaningless without a unit, which is one of the reasons unit belongs to the relationship).
For the library:
COUNT(*) FROM loans— total loans, ever,COUNT(*) FROM loans WHERE returned_at IS NULL— currently active loans,COUNT(DISTINCT member_id) FROM loans— number of members who have ever borrowed something,MIN(borrowed_at) FROM loans— the earliest loan in the system.
The library COUNT(*) FROM loans WHERE returned_at IS NULL is a good example of the use of NULL. The NULL value of returned_at means “not yet returned,” and it is exactly that meaning that lets us count active loans.
When Aggregates Are Not the Right Tool
Aggregate queries are well suited for questions about totals, averages, extremes, or counts. They are not the right tool when the goal is to inspect individual rows one by one.
If the question is “what is the average submission score for exercise 7?”, AVG is the answer. If the question is “what scores were given for exercise 7?”, an ordinary SELECT score is the answer. The two questions feel similar but want different shapes of result.
A good habit at this stage is to read the question carefully and ask: do I need one number per group, or do I need a list of stored rows? The answer decides whether to reach for aggregates.
Even simple aggregates require interpretation: a summary query is part of communication, not just calculation.
For example, an average score only makes sense if you know what the score means, what range it uses, and whether ungraded submissions should be included or excluded. SQL can be used to compute the numbers: you need to understand what the numbers represent.
Check Your Understanding
- What is the difference between
COUNT(*),COUNT(column), andCOUNT(DISTINCT column)? - What does
AVG(score)return when no rows have a non-NULLscore? - Why is
AVG(score)usually not the same asSUM(score) / COUNT(*)?
SQL Programming Exercise
This sequence starts with short aggregate questions where the main job is choosing the right form of COUNT. It then combines several aggregates in one query so that you practice reading what each result column really means.