Transactions, Normalization, and Performance

Denormalization as a Deliberate Trade-Off


Learning Objectives

  • You can describe denormalization as a deliberate trade-off, not as poor design.
  • You can list the named patterns for storing redundant data and the consistency machinery each one needs.
  • You can decide when denormalization is the right exception and when normalization should remain the default.

The previous chapters argued for normalization as the default. Each fact in one place, each dependency reflected by structure, no anomalies hiding in the schema.

This chapter takes up the question of when to deviate from that default. Real systems do denormalize, sometimes for good reasons. The discipline is to recognize the difference between deliberate denormalization (which is engineering) and accidental denormalization (which is a bug).

Redundancy Can Make Sense

Denormalization is the deliberate decision to violate a normalization rule for a stated reason, usually performance. The key word is “deliberate.” A schema with redundancy that the team has not explicitly chosen is not denormalized — it is just badly normalized. A schema with redundancy that the team has chosen, documented, and built consistency machinery for is denormalized.

There are a few cases where storing a value redundantly is appropriate, but each one is a deliberate trade-off. Some examples:

Historical snapshots. A submission’s score at the moment it was graded should be stored, not derived, even if the grading rubric later changes. The redundancy here is intentional and correct: the historical fact and the current rule are two different things.

Computed summaries with consistency machinery. A precomputed card_count on a deck row is redundant with the actual count of cards, but it can be the right choice when the count is queried often and the application has machinery (triggers, materialized views, scheduled refresh) to keep it correct.

Cached external data. Pulling a value from another system once and storing it is reasonable if the cost of querying the other system is high.

The common thread: redundancy is acceptable when the team has explicitly chosen it, knows the consistency cost, and has a plan for keeping the redundant copies aligned. Redundancy is a hazard when it appears by accident, because no one has thought through what should happen when one copy diverges.

Loading Exercise...

Considering Denormalization

A few signals suggest that denormalization is worth considering.

The derivation is genuinely expensive. A summary that requires joining many tables and aggregating millions of rows on every page load may be too slow. Storing the precomputed summary, with machinery to keep it correct, can be the right answer.

The derivation is impossible to express in SQL. Some derived values depend on application logic that SQL cannot easily compute. Storing the result of that computation is reasonable when the computation itself is the bottleneck.

The derivation depends on point-in-time data. A submission’s score at the moment it was graded is a fact, not a derivation. Storing it is correct, even if the grading rubric later changes. This is not really denormalization — it is just stored data — but it is sometimes confused with redundancy because there might be a separate derived “current score” alongside it.

The system has read-heavy workloads with strict latency requirements. Reporting databases, dashboards, and analytical systems often denormalize aggressively because their access pattern is read many times for each write. The cost of keeping the denormalized data correct is paid once per write but saved many times per read.

For most projects, none of these conditions apply. The default is normalization, and denormalization is a deliberate exception.

Denormalize for measured reasons

Each denormalization should answer a specific question: which query is slow, by how much, and how does this change make it fast enough? “It seems faster” is not a measured reason; it is a guess.


Loading Exercise...

Denormalization Patterns

A few specific patterns appear often enough to have names.

Precomputed counts and aggregates. A card_count column on decks that is kept in sync with the actual count of cards in each deck. The query becomes a single column read instead of a COUNT(*) over cards.

Copied facts. A denormalized user name on a posts table, kept in sync when the user changes their name. The query for “show this post with its author’s name” becomes a single-table read.

Loading Exercise...

Materialized aggregates. A nightly summary table that contains daily activity counts, refreshed once per day. Dashboard queries read the pre-aggregated rows instead of recomputing from raw activity logs.

Materialized views. PostgreSQL’s built-in feature for storing the result of a query and refreshing it on demand. This is the most disciplined version of “store an aggregate,” because the source query is part of the schema.

Each pattern has a different consistency-machinery shape, which the next section describes.

Keeping Denormalized Data Consistent

A stored summary is a promise that the system will keep it correct. The mechanism for keeping that promise is part of the design.

Update on write. Every operation that changes the underlying data also updates the denormalized value. For example, if the team chooses to store a card_count on decks, this means every insert into cards includes an UPDATE decks SET card_count = card_count + 1, every delete includes a decrement, and every move of a card between decks includes both. This is the most invasive pattern: every code path that touches the underlying table has to remember the summary.

Transactions are essential here: without them, a failure between the two writes would leave card_count out of sync. Update-on-write requires transactional discipline everywhere.

Loading Exercise...

Triggers. It is possible to create a trigger — code that fires automatically on certain conditions — to keep the summary up to date. For example, a trigger on cards could update the card_count on the corresponding deck whenever a card is inserted, deleted, or moved.

For the card_count example, in PostgreSQL, the trigger function might look like this:

CREATE FUNCTION update_deck_card_count()
RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE decks SET card_count = card_count + 1 WHERE id = NEW.deck_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE decks SET card_count = card_count - 1 WHERE id = OLD.deck_id;
    ELSIF TG_OP = 'UPDATE' AND OLD.deck_id <> NEW.deck_id THEN
        UPDATE decks SET card_count = card_count - 1 WHERE id = OLD.deck_id;
        UPDATE decks SET card_count = card_count + 1 WHERE id = NEW.deck_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cards_update_count
AFTER INSERT OR UPDATE OR DELETE ON cards
FOR EACH ROW EXECUTE FUNCTION update_deck_card_count();

This centralizes the update logic, so adding a new code path to the application that inserts a card does not require remembering the summary. The cost is that triggers are part of the schema and have to be migrated, debugged, and maintained as code.

The developers of the application would also have to be aware of the triggers and understand that they exist. Triggers can sometimes make the flow of data less transparent, which can lead to confusion during debugging or when new developers join the team.

Loading Exercise...

Periodic recomputation. A scheduled job recomputes the denormalized values from scratch and overwrites the stored values. This is simpler than triggers but means the denormalized values are sometimes stale.

-- Run nightly
UPDATE decks
SET card_count = (
    SELECT COUNT(*) FROM cards WHERE cards.deck_id = decks.id
);

For dashboards that show “as of midnight” data, periodic recomputation is often perfectly acceptable. For UI elements that need to reflect changes immediately, it is not.

Loading Exercise...

Materialized views. PostgreSQL provides a built-in mechanism for storing the result of a query. The view definition is part of the schema; the data is refreshed on command.

CREATE MATERIALIZED VIEW deck_summary AS
SELECT
    d.id,
    d.name,
    COUNT(DISTINCT c.id) AS card_count,
    COUNT(r.id) AS review_count
FROM decks AS d
LEFT JOIN cards AS c ON c.deck_id = d.id
LEFT JOIN reviews AS r ON r.card_id = c.id
GROUP BY d.id, d.name;

-- Refresh on demand
REFRESH MATERIALIZED VIEW deck_summary;

The materialized view can be queried like a table but holds precomputed results. REFRESH MATERIALIZED VIEW rebuilds it. For summaries that take significant time to compute and are read often, this is an excellent pattern.

Each of the four patterns trades off freshness, complexity, and code-discipline differently. The right choice depends on how stale the value is allowed to be and how much consistency machinery the team is willing to maintain.

Loading Exercise...

Worked Example: Storing review_count on Cards

Suppose the study tracker dashboard frequently shows the number of reviews each card has received, and profiling reveals that the COUNT(*) FROM reviews WHERE card_id = ? query is the largest cost on the dashboard page.

The team considers denormalizing by storing review_count on the cards table.

The cost of storing. Every insert into reviews becomes two writes: the new review row and the update to the card’s count. Every delete from reviews becomes two writes too. The transaction must wrap both writes to prevent the count from drifting on partial failure. This is a real maintenance burden.

The cost of not storing. The dashboard query takes whatever time it takes to count, scaled by the number of reviews. For a project with 10,000 reviews, this is milliseconds, not seconds.

The decision. For this scale, storing is not worth it. The derived query is fast enough; the maintenance cost is not justified.

If, much later, the dashboard becomes slow because the project has grown to millions of reviews, the team can revisit the decision. At that point, the materialized-view pattern is probably the cleanest fix: define a card_review_summary materialized view, refresh it on a schedule, and have the dashboard read from it. This isolates the denormalization to one place rather than spreading update-on-write logic through every code path that touches reviews.

This is the recurring shape of the decision: derive by default, denormalize only with measured pressure, and choose the pattern that puts the consistency machinery in the most contained place. A read-heavy reporting system might denormalize aggressively from the start, because the data is point-in-time and never modified after generation; for an operational schema like the study tracker, the default of deriving is almost always the right starting choice.

Measure before denormalizing

A common mistake is to denormalize early, before any measured need. The justification might go “joins are slow, so let’s avoid them by storing the value directly.”

This reasoning is usually wrong.

If a denormalization is justified by “joins are slow,” the next question should be “how slow, on what query, with what indexes?” Without those numbers, the denormalization is a guess.

Premature denormalization buys nothing measurable, costs real maintenance burden, and often creates the very bugs (silent inconsistency) that the schema would have avoided otherwise. The pattern to follow is: design normalized, measure, denormalize where the measurement supports it.


Loading Exercise...

SQL Programming Exercise

The following SQL sequence contrasts deriving a value with storing it deliberately. It begins with a grouped count, then practices backfilling, and then practices update-on-write maintenance.

For the query that uses a transaction, note that the grader does not check that you use BEGIN, COMMIT, and ROLLBACK correctly. It only checks the final state of the database. As the point of the exercise is to practice the transaction shape, do use BEGIN and COMMIT appropriately.

Loading Exercise...

Check Your Understanding

  1. What is the difference between accidental redundancy and deliberate denormalization?
  2. Why is backfilling a denormalized column a one-time cost, while update-on-write is an ongoing cost?
  3. Why does update-on-write require transactional discipline everywhere it appears?