Normalization: From Anomalies to Normal Forms
Learning Objectives
- You can recognize design problems caused by redundancy, expressed as update, insert, and delete anomalies.
- You can describe a functional dependency in plain language and write one for a small schema.
- You can apply 1NF, 2NF, 3NF, and BCNF as practical checks on a schema.
- You can recognize when BCNF catches a redundancy that 3NF lets pass.
The Database Design and Modeling part taught how to recognize entities and relationships and turn them into a design package. The Schemas and Constraints part implemented that design as a real schema. The result, when done well, was a schema where each fact lived in one natural place.
This chapter returns to the same territory from a different direction. Instead of asking “how do we design well?” it asks “what specifically goes wrong when we don’t, and how do we systematically fix it?” The first half of the chapter shows three classic failure modes — update, insert, and delete anomalies. The second half introduces functional dependencies and normal-form rules that systematically remove the conditions for those failures.
We’ve already discussed update, insert, and delete anomalies in the chapter Data Distribution and Redundancy of part 4. This chapter revisits those concepts and connects them to the design and schema choices made in the earlier parts.
Redundancy Is a Correctness Problem
A schema with redundancies creates real problems with real data.
If the same fact is stored in many rows, then any change to that fact requires updating many rows. If even one row is missed, the database becomes inconsistent: different rows now report different versions of the “same” fact, and there is no way for a query to know which one is right.
A Deliberately Poor Table
Consider a wide table that tries to do too much:
CREATE TABLE course_activity (
user_name TEXT,
user_email TEXT,
course_code TEXT,
course_title TEXT,
term TEXT,
exercise_title TEXT,
deadline_at TIMESTAMP,
submitted_at TIMESTAMP,
score INTEGER
);
In one way, this is convenient, as everything is in one place. Querying one table gives you a complete picture of submission activity.
There are, however, several types of facts that are being mixed in the same table:
- facts about users (
user_name,user_email), - facts about courses (
course_code,course_title), - facts about course instances (
term), - facts about exercises (
exercise_title,deadline_at), - facts about submissions (
submitted_at,score).
For a course with 30 students and 10 exercises, the table holds 300 submission rows. The same course title appears 300 times. The same exercise title appears 30 times each. The same user name appears 10 times each. Every fact except the submission itself is replicated many times over.
This is the type of schema you have already learned to avoid — we’ve been practising putting a single fact in one place.
The schema you implemented earlier in the course — users, courses, course_instances, exercises, submissions — is the normalized version of course_activity. The work this chapter motivates is the work the earlier parts already did. What is new here is the explicit name for what could go wrong otherwise.
Update Anomaly
Suppose a course is renamed. In the wide table, this means every row mentioning that course has to be updated:
UPDATE course_activity SET course_title = 'Databases for Data Science'
WHERE course_code = 'CS-A1150';
If the table has 300 rows for this course, the update touches 300 rows. If it has 30,000 rows, it touches 30,000.
That much is just inefficient. The real problem is what happens if any row is missed — perhaps because the WHERE clause was slightly wrong, or because a script crashed halfway, or because a new row was inserted with the old title between the script reading the data and writing back. Now some rows say “Databases for Data Science” and other rows say the old title. A query cannot tell which one is the truth.
In the normalized schema, the course title lives in one place: courses.title. Renaming the course is one update to one row. There is no way for the title to be inconsistent across submissions because no submission stores a title at all.
The update anomaly is the problem that a fact stored in many places can become inconsistent across those places after an update.
Insert Anomaly
Suppose the system needs to add a new course before any submissions for it have been made. In the wide table, every row needs values for every column, including submitted_at and score. But there is no submission yet.
The options are all bad. Insert a row with NULL for the submission columns? That creates a strange row that is “about a course” but pretends to be “about a submission.” The row appears in queries about submissions and produces confusing results.
In the normalized schema, adding a course is one insert into courses. No submission row is needed because the schema does not require one. Courses can exist before any submission is made, exactly as the domain allows.
The insert anomaly is the problem that facts about one kind of entity cannot be added until facts about a different kind of entity exist, because the schema mixes them together.
Delete Anomaly
Suppose the last submission for a course is deleted. In the wide table, deleting that row also deletes every fact about the course (its title, its code) and the related exercise (its title, its deadline), because those facts only existed inside that submission row.
The course has not been removed from the system in any meaningful sense. It still exists as a real course; it has just stopped having submissions. But the database now believes the course does not exist at all.
In the normalized schema, deleting a submission removes one row from submissions. The course, the course instance, and the exercise stay where they are. They never depended on the submission for their existence.
The delete anomaly is the problem that deleting one fact unintentionally removes other facts that should have been preserved.
All three anomalies share the same root cause: a fact is stored in the same place as other facts that should be independent. The course title is stored in the same place as the submission facts, even though it really belongs to the course. The exercise deadline is stored in the same place as the submission facts, even though it really belongs to the exercise. The user name is stored in the same place as the submission facts, even though it really belongs to the user.
Having facts that should be independent stored together creates unnecessary dependencies between them. The anomalies are the symptoms of those dependencies.
Functional Dependencies
To formalize what “depends on” means, the database textbook vocabulary uses the term functional dependency: a statement that says, if you know one value, you automatically know another.
Some examples:
- if you know a course’s
id, you know itscodeandtitle(because there is exactly one row incoursesfor each id), - if you know a user’s
id, you know theirnameandemail, - if you know a course code, you know that course’s title (because each code identifies one course).
These are written as X -> Y, read as “X determines Y.” So:
course_id -> course_code, course_titleuser_id -> name, emailcourse_code -> course_title
The arrow does not mean “causes.” It means “given the value on the left, the value on the right is uniquely determined.”
A useful test for whether X -> Y holds: in the data the system actually stores, can two different rows have the same value of X but different values of Y? If yes, the dependency does not hold. If no, the dependency holds.
Functional Dependencies for Diagnostics
Functional dependencies are the diagnostic tool for redundancy. The anomalies above all share a pattern: a functional dependency that is true in the data but is not reflected by the schema’s structure.
In the wide course_activity table, the dependency course_code -> course_title holds (each code identifies one title). But the schema does not capture this; the title is repeated on every row. The mismatch between “what is logically true” and “how the data is stored” is exactly what makes the title vulnerable to update anomalies.
Normalization is the process of changing the schema so that every functional dependency is reflected by the structure: if X -> Y, then X should be a key (or part of a key) of the table that contains Y. When this is true everywhere, redundancy disappears.
The practical normalization workflow is: list the functional dependencies that hold in the domain, then check whether each one is reflected by the schema’s keys. Where it isn’t, the table needs to be split.
First Normal Form
The first rule, 1NF, says that each column holds one atomic value. No lists, no comma-separated strings, no repeating column groups.
A column like enrolled_course_ids = '3,5,8' violates 1NF because it stores three values in one column. Querying it requires string parsing. Updating it requires rewriting the whole string. The list is a hidden relationship that should be a separate table.
The fix is to pull the list out into its own table:
CREATE TABLE enrollments (
user_id INTEGER REFERENCES users(id),
course_instance_id INTEGER REFERENCES course_instances(id),
PRIMARY KEY (user_id, course_instance_id)
);
Each enrollment is now its own row. Queries can join, count, and filter without parsing strings.
The same applies to repeating column groups: exercise1_score, exercise2_score, exercise3_score is the same problem in a different shape. The fix is the same: move the repeated structure into a separate table.
For most realistic projects, 1NF is satisfied without thinking about it because relational schemas naturally encourage one value per column. The cases that violate 1NF are usually shortcuts that someone took deliberately, often before the relationship was fully understood.
Second Normal Form
2NF matters mainly when a table has a composite primary key. The rule: every non-key column must depend on the whole key, not just on part of it.
Suppose an enrollment table is structured this way:
CREATE TABLE enrollments_bad (
user_id INTEGER,
course_instance_id INTEGER,
user_email TEXT,
course_title TEXT,
joined_at TIMESTAMP,
PRIMARY KEY (user_id, course_instance_id)
);
The primary key is (user_id, course_instance_id). But user_email only depends on user_id, not on the whole key. And course_title only depends on course_instance_id, not on the whole key. The dependencies are:
user_id -> user_emailcourse_instance_id -> course_title(user_id, course_instance_id) -> joined_at
Only the third one uses the whole composite key. The first two are partial dependencies, and they are exactly what 2NF prohibits.
The fix is to put user_email in users and course_title in courses (or in course_instances, depending on where it really belongs). The enrollments table keeps only what truly depends on the pair:
CREATE TABLE enrollments (
user_id INTEGER REFERENCES users(id),
course_instance_id INTEGER REFERENCES course_instances(id),
joined_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, course_instance_id)
);
This is exactly the design from the design and modeling part. The 2NF rule is the formal reason why the design rationale separated user facts, course-instance facts, and enrollment facts.
When the table has a single-column primary key (which most tables do, with SERIAL PRIMARY KEY), 2NF is automatically satisfied. The rule only does interesting work when the key is composite.
Third Normal Form
3NF addresses transitive dependencies: cases where a non-key column depends on another non-key column.
Suppose a table is structured this way:
CREATE TABLE exercises_bad (
id SERIAL PRIMARY KEY,
course_instance_id INTEGER,
course_code TEXT,
course_title TEXT,
title TEXT,
deadline_at TIMESTAMP
);
The dependencies are:
id -> course_instance_id, title, deadline_atcourse_instance_id -> course_code(each course instance belongs to one course, which has one code)course_code -> course_title
That last one is a transitive dependency. course_title depends on course_code, which depends on course_instance_id, which depends on the primary key. So course_title is not directly dependent on the key — it is hanging off course_code, which is itself derived.
The result is the same kind of redundancy as before: every exercise row stores the course title, even though the title really belongs to the course.
The 3NF fix is to separate the transitively-dependent fact:
courses(id, code, title)course_instances(id, course_id, term, year)—course_codeis no longer stored; it can be looked up viacoursesexercises(id, course_instance_id, title, deadline_at)
Now course_title lives in courses exactly once. The exercises table contains only facts that depend directly on the exercise itself.
This is, again, the design from the earlier parts. The 3NF rule is the formal reason why those designs separated courses, course_instances, and exercises into three tables instead of one.
Boyce-Codd Normal Form
BCNF (Boyce-Codd Normal Form) handles a subtle case that 3NF leaves open. The rule is stricter than 3NF: for every non-trivial functional dependency X -> Y in the table, X must be a superkey — that is, X on its own must uniquely identify a row.
In practice, this only matters when a table has overlapping candidate keys with a dependency that crosses between them. That sounds abstract, so consider a concrete case.
Suppose the course platform assigns each student a personal mentor for each course instance they take. The department has two rules:
- for each pair of
(student, course_instance), exactly one mentor is assigned, - each mentor mentors students for only one course instance (a mentor specialises and does not split attention across courses).
A first attempt might collapse all of this into one table:
CREATE TABLE student_mentors_bad (
student_id INTEGER,
course_instance_id INTEGER,
mentor_id INTEGER,
PRIMARY KEY (student_id, course_instance_id)
);
The dependencies that hold here are:
(student_id, course_instance_id) -> mentor_id(each student-instance pair has one mentor, by the first rule),mentor_id -> course_instance_id(each mentor works for one course instance, by the second rule).
This table has two candidate keys: (student_id, course_instance_id) (the primary key) and (student_id, mentor_id). The second pair is a candidate key because mentor_id determines course_instance_id, so a student-mentor pair determines the instance and therefore the whole row.
Is this in 3NF? Yes. The 3NF rule has two ways for a functional dependency X -> A to pass: either X is a superkey, or A is a prime attribute (an attribute that appears in some candidate key). For mentor_id -> course_instance_id, the left side mentor_id is not a superkey on its own, but the right side course_instance_id is part of the candidate key (student_id, course_instance_id). So course_instance_id is prime, and the dependency passes 3NF on the second escape route.
But the redundancy is still there. Every row that mentions a particular mentor also mentions the course instance that mentor works for. If the mentor switches to a different course instance, every row has to be updated. If a mentor exists but has no students yet, there is no place to record which course instance they belong to. The classic anomalies are back.
BCNF catches this. The dependency mentor_id -> course_instance_id has mentor_id on the left, and mentor_id by itself is not a superkey of this table. BCNF requires the left side of every non-trivial dependency to be a superkey, with no escape clause for the right side being prime. So the table is in 3NF but not in BCNF.
The fix is the same shape as the earlier normalization moves: separate the dependency into its own table.
mentors(id, name, course_instance_id)— the course instance assignment belongs to the mentor,student_mentors(student_id, mentor_id)— the assignment of a mentor to a student.
Now mentor_id -> course_instance_id is reflected by structure: mentor_id is the key of mentors, and course_instance_id lives there. The student-mentor table no longer stores the instance.
This decomposition removes the redundancy caused by mentor_id -> course_instance_id. One trade-off is that the original dependency (student_id, course_instance_id) -> mentor_id is no longer enforced by a simple key on one table. In a production schema, enforcing “one mentor per student per course instance” would need additional machinery, such as a trigger, an exclusion-style design, or accepting a 3NF design that preserves the dependency more directly.
In everyday work, schemas designed by following the workflow in Database Design and Modeling and Schemas and Constraints usually land in BCNF without anyone explicitly checking. The cases where 3NF and BCNF differ all involve overlapping candidate keys with cross-dependencies like the one above, which are rare in domains where each entity has a clean surrogate primary key.
Higher normal forms exist. 4NF addresses multi-valued dependencies — cases where one table independently stores two unrelated multi-valued attributes of the same entity. 5NF handles certain join-related dependencies that appear when a table can be reconstructed by joining its projections. Both are rarely relevant in everyday domains, and most well-designed schemas satisfy them without anyone explicitly checking.
This part stops at BCNF deliberately. The goal is to give the practical tools that handle the common cases, while still omitting the more complex rules that only apply in edge cases.
A Practical Workflow
Most teaching presentations of normalization start with definitions and then show examples. In practice, it usually works better to start with the schema and work backwards.
A useful workflow:
- list the functional dependencies that hold in the domain (
course_id -> course_code, course_title,course_code -> course_title, etc.), - for each table, check that every functional dependency where the right side is a column in the table has its left side as a key (or part of a key) of that table,
- when a dependency violates this rule, split the table so that the dependency is captured by structure,
- re-check.
This workflow is what produces a 3NF schema, and in most cases a BCNF schema as well — the BCNF section above shows the rare situation where the workflow lands at 3NF but stops short of BCNF. The workflow can be applied without ever naming the normal forms; the normal form names are useful when discussing why a particular split was made, but the work itself is the same.
Anomalies in the Recipe and Library Domains
The same three anomalies apply across domains, but they are worth examining a little more carefully here, because they illustrate something the normal-form rules alone do not catch.
Recipe app. Suppose recipe_ingredients stored the ingredient name directly instead of referencing the ingredients table:
CREATE TABLE recipe_ingredients_bad (
recipe_id INTEGER REFERENCES recipes(id),
ingredient_name TEXT,
quantity NUMERIC,
PRIMARY KEY (recipe_id, ingredient_name)
);
Renaming “all-purpose flour” to “flour” would require updating every recipe that uses it (update anomaly). A new ingredient that no recipe uses yet could not be added (insert anomaly). Removing a recipe would also remove the only mention of any ingredient that only appeared in that recipe (delete anomaly).
The normalized form — ingredients as its own table, recipe_ingredients referencing it by id — avoids all three.
But notice something subtle: the bad table is technically in 3NF, and even in BCNF. The only functional dependency that holds inside it is (recipe_id, ingredient_name) -> quantity, and that dependency uses the whole composite key. There is no transitive dependency to flag, no partial dependency, no overlapping candidate key. By the rules introduced in this chapter, the table passes every check.
The reason the table still suffers the three anomalies is that the redundancy lives outside the table’s own functional dependencies. The implicit fact “this ingredient name refers to a real ingredient with an identity of its own” is not represented by any column, so no normal-form check can see it. Recognizing that “ingredient” deserves to be its own entity is a design judgment from the Database Design and Modeling part. Normalization formalizes the rules for tables whose entities are already correctly identified; it does not, on its own, tell you which entities should exist.
Library. Suppose loans stored the book title directly instead of going through copies:
CREATE TABLE loans_bad (
id SERIAL PRIMARY KEY,
member_id INTEGER REFERENCES members(id),
book_title TEXT,
borrowed_at TIMESTAMP
);
The same observation applies here, and even more cleanly. With id as the primary key, the only nontrivial functional dependency in this table is the trivial one from id to the other columns — which any table with a surrogate primary key satisfies by definition. There are no transitive dependencies, no partial dependencies, no overlapping candidate keys. The table is in BCNF by inspection.
And yet the anomalies are obvious: renaming a book requires updating every loan record for it; a new book that has never been borrowed cannot be added through this table; removing the last loan of a book loses the book’s title. The structural rules pass; the design is still wrong.
The normalized form — books as its own table, copies referencing books by id, loans referencing copies by id — keeps each fact where it belongs.
The normal-form rules catch redundancies within the columns of a table. They cannot catch redundancies that come from missing tables — from facts that should have been their own entities but were collapsed into someone else’s columns.
For introductory work, the recipe-app and library examples here are the most common shape of this gap: a string field (an ingredient name, a book title) standing in for an entity that deserved its own table. The normal-form check passes, but the schema is still wrong.
The remedy is the same as the original design discipline: ask what the system needs to remember as an entity in its own right, and give each such entity its own table. Normalization formalizes the consistency of a schema; the design parts of the course taught how to recognize which entities should exist in the first place.
Check Your Understanding
- Why is redundancy a correctness problem rather than only a tidiness problem?
- What does a functional dependency say in plain language, and why is it a useful diagnostic tool?
- Why does 2NF only do interesting work when the table has a composite primary key, and why do most real-world schemas land in BCNF without anyone explicitly checking?