Summaries and Derived Data

Grouping and Filtering Groups


Learning Objectives

  • You can write grouped queries using GROUP BY for one or more columns.
  • You can name what one result row represents in any grouped query you write.
  • You can distinguish WHERE from HAVING and choose the right one for a given filter.

An aggregate over a whole table gives one summary for all matching rows. Often, however, the question wants one summary per group. That is what GROUP BY is for.

Once rows have been grouped, the next natural question is whether all groups should appear in the result or only some of them. That is what HAVING is for: WHERE filters rows; HAVING filters groups.

This chapter covers both clauses together, because in practice they are almost always thought about as one pair.

A First Grouped Query

Suppose the question is: how many submissions does each exercise have?

SELECT
  exercise_id,
  COUNT(*) AS submission_count
FROM submissions
GROUP BY exercise_id;

This returns one row per exercise_id. The COUNT(*) is computed within each group rather than across the whole table.

A possible result might look like this:

exercise_idsubmission_count
73
81
125

Three exercises had submissions; for each one, the count of its submissions appears alongside.

Loading Exercise...

What Grouping Really Does

A grouped query works in two conceptual steps:

  1. partition the rows into groups, where rows that share the same grouping value or values go into the same group,
  2. compute the aggregate separately inside each group.

The SELECT clause then produces one row per group. Each group becomes one result row.

This explains why the example above has one count per exercise rather than one count for the whole table. Each exercise_id value defines one group; each group contributes one row to the result.

Ask What One Result Row Represents

Before grouping, one result row from SELECT * FROM submissions represents one stored submission. After grouping by exercise_id, one result row represents one exercise together with a summary about its submissions.

That shift in row meaning is the most important mental move in this part. Every grouped query has a different “what does one result row represent?” answer, and the best habit when reading or writing one is to pause and name that answer in plain language.

For the query above, one result row represents:

one exercise, and the count of submissions made for that exercise.

For a query grouped by (exercise_id, status), one result row would represent:

one exercise-and-status combination, and the count of submissions in that combination.

If the row meaning is unclear, the query is usually unclear too.

Loading Exercise...

A Grouped Query Restricts SELECT

A grouped query cannot freely select any column it wants. Selected columns must be either:

  • columns that appear in the GROUP BY, or
  • aggregate expressions like COUNT(*) or AVG(score).

The reason is that the query produces one row per group. If a column is neither grouped nor aggregated, the query has no clear value to display for that group. There may be many different values of that column within one group, and the DBMS will not guess which to show.

For example, this query is invalid:

SELECT
  exercise_id,
  user_id,
  COUNT(*)
FROM submissions
GROUP BY exercise_id;

Within one group of exercise_id = 7, there are several rows, each with possibly different user_id values. Which user_id should appear in the result row for exercise 7? The DBMS cannot say, so the query is rejected.

The fix depends on the question. If the question is “for each exercise, how many distinct users submitted?”, then:

SELECT
  exercise_id,
  COUNT(DISTINCT user_id) AS distinct_user_count,
  COUNT(*) AS submission_count
FROM submissions
GROUP BY exercise_id;

The user_id is no longer selected directly; instead, an aggregate over it is computed. That respects the rule and answers a clearer question.

Loading Exercise...

Grouping by More Than One Column

A query can group by several columns. The groups are then formed by every distinct combination of those columns.

SELECT
  exercise_id,
  status,
  COUNT(*) AS submission_count
FROM submissions
GROUP BY exercise_id, status;

This produces one row for each (exercise_id, status) combination. The result might look like this:

exercise_idstatussubmission_count
7submitted1
7graded2
8graded1

Three exercise-and-status combinations existed in the data; each gets its own row.

The same row-meaning habit applies. One result row here represents:

one exercise-and-status combination, and the count of submissions in that combination.

Time Bucketing as Grouping

A common pattern is to group by a derived value rather than a stored column. For example, grouping submissions by the day they were submitted:

SELECT
  submitted_at::date AS submission_day,
  COUNT(*) AS submission_count
FROM submissions
GROUP BY submitted_at::date
ORDER BY submission_day;

The expression submitted_at::date extracts the date part of the timestamp. Two submissions from different times of the same day fall into the same group. The result is one row per day:

submission_daysubmission_count
2026-03-092
2026-03-105
2026-03-111

This is a foreshadowing of the time-bucketing patterns that come back later in this part. For now, the important point is that the grouping expression does not have to be a column name. Any expression can group, as long as the same expression appears in SELECT.

Loading Exercise...

Filtering Groups with HAVING

Suppose the question now becomes: which exercises have at least five submissions?

SELECT
  exercise_id,
  COUNT(*) AS submission_count
FROM submissions
GROUP BY exercise_id
HAVING COUNT(*) >= 5;

This filters groups after the counting has happened. Each group contributes one row to the intermediate result; HAVING then keeps only the groups whose count meets the condition.

A possible result:

exercise_idsubmission_count
75
128

Exercises with fewer than five submissions are not in the result. Their rows in submissions did not disappear; they were grouped, the count was computed, and then the group was dropped because its count failed the condition.

WHERE vs. HAVING

The cleanest way to remember the distinction:

  • WHERE filters rows before grouping,
  • HAVING filters groups after grouping.

That is not just a clause-order detail. It reflects two different questions.

WHERE status = 'graded' says only graded rows should even enter the groups. The grouping then sees only graded rows.

HAVING COUNT(*) >= 3 says keep only those groups that are large enough after grouping. The grouping sees all rows, and then the small groups are discarded.

These two conditions can co-exist in one query, because they target different stages.

Loading Exercise...

Combining WHERE and HAVING

A query may use both:

SELECT
  exercise_id,
  COUNT(*) AS graded_submission_count
FROM submissions
WHERE status = 'graded'
GROUP BY exercise_id
HAVING COUNT(*) >= 3;

Step by step:

  1. WHERE status = 'graded' keeps only graded submission rows,
  2. GROUP BY exercise_id partitions those rows into groups by exercise,
  3. HAVING COUNT(*) >= 3 keeps only the exercise groups with at least three graded submissions.

The result shows exercises that have at least three graded submissions, with the count of those graded submissions per exercise.

A useful exercise is to predict how the result would change if either clause were removed:

  • without the WHERE, the count would include ungraded submissions too, so the groups would be larger,
  • without the HAVING, the result would include exercises with fewer than three graded submissions.

Each clause shapes the result in a way the other cannot.

Rows or Groups?

The phrasing of the question usually makes it clear whether the condition is about rows or groups. “Keep only exercises with at least three submissions” is a group condition: it applies to the count of submissions per exercise, so it belongs in HAVING. “Keep only graded submissions” is a row condition: it applies to each submission individually, so it belongs in WHERE.

A Common Mistake

Novices sometimes try to write aggregate conditions in WHERE:

SELECT exercise_id, COUNT(*)
FROM submissions
WHERE COUNT(*) >= 5
GROUP BY exercise_id;

This does not work because WHERE runs before grouping. At that stage, no group exists yet, so COUNT(*) cannot be computed. The DBMS rejects the query.

The fix is to move the condition to HAVING:

SELECT exercise_id, COUNT(*)
FROM submissions
GROUP BY exercise_id
HAVING COUNT(*) >= 5;

Now the query is well-formed: rows are grouped first, the count is computed for each group, and the condition is applied to the computed count.

Loading Exercise...

If the condition uses an aggregate, it goes in HAVING

A simple test: does the condition contain COUNT, SUM, AVG, MIN, or MAX? If yes, it has to be in HAVING, because WHERE cannot see those values. If no, it almost always belongs in WHERE, because filtering before grouping is more efficient and clearer.

A Borderline Case: Filtering on a Grouping Column

A condition that filters on the grouping column itself can go in either clause and produces the same result. For example, “show grouped counts only for exercise 7”:

SELECT exercise_id, COUNT(*)
FROM submissions
WHERE exercise_id = 7
GROUP BY exercise_id;

and

SELECT exercise_id, COUNT(*)
FROM submissions
GROUP BY exercise_id
HAVING exercise_id = 7;

Both return the same single row. The first version is preferred because filtering before grouping does less work, but both are syntactically valid.

This is the only case where the two clauses are interchangeable. Once an aggregate is involved, only HAVING works.

Loading Exercise...

GROUP BY and DISTINCT Sometimes Overlap

The chapter Sorting and Limiting Results introduced DISTINCT as a way to eliminate duplicate rows from a result. Grouping has a similar effect for columns that are not aggregated.

SELECT DISTINCT status FROM submissions;

and

SELECT status FROM submissions GROUP BY status;

return the same rows. Both list the distinct status values that appear in the table.

The difference matters once aggregates enter the picture. DISTINCT cannot compute counts or averages per value; it only removes duplicates. GROUP BY can. So:

  • if the question is “what distinct values appear?”, DISTINCT is usually clearer,
  • if the question is “what distinct values appear, and what is the count or average for each?”, GROUP BY is the right tool.

This distinction comes back in later chapters when the question is borderline.

Loading Exercise...

Check Your Understanding

  1. What changes in the meaning of a result row when GROUP BY is added?
  2. Why does grouping restrict which columns may appear in SELECT?
  3. What does WHERE filter, what does HAVING filter, and why does an aggregate condition like COUNT(*) >= 5 have to go in HAVING rather than WHERE?

SQL Programming Exercise

This sequence starts with one grouped count and grows toward queries that use both WHERE and HAVING in the same statement. The goal is to make the row-meaning habit and the row-filter versus group-filter distinction feel routine rather than theoretical.

Loading Exercise...