Transactions, Normalization, and Performance

Indexes and Query Plans


Learning Objectives

  • You can explain what an index is, why some queries become much faster with the right index, and what an index costs.
  • You can read a basic EXPLAIN output and identify the main scan and join strategies it shows.
  • You can use EXPLAIN ANALYZE to confirm that an index is being used and that a change actually improves performance.

We’ve mentioned the term index a couple of times in this part, but we haven’t defined it yet. An index is a separate data structure that the database maintains to speed up lookups by a specific column. They do not change what a query means, and they do not change what data the database stores. They change how quickly the database can find the rows the query is asking for.

Like every other design decision in this course, index choice should be deliberate. The questions to ask:

  • which queries does the application run frequently?
  • which of those queries are slow today, or might become slow as the data grows?
  • which columns appear in the WHERE, JOIN, and ORDER BY clauses of those queries?
  • what is the cost (in storage, in write performance) of indexing those columns?
  • does EXPLAIN ANALYZE confirm the index is doing the work you expected?

The answers might lead to two or three indexes for a small project, or to dozens for a complex production system. There is no universal right number.

This chapter introduces both indexes as a deliberate design decision, and EXPLAIN as the way to read what the database is actually doing.

Indexes and Sequential Scans

A database table is a collection of rows. Without any index, finding a specific row requires looking through the rows one by one, comparing each one to the search condition, and collecting the matches. This is called a sequential scan or table scan.

For a small table, a sequential scan is fine. For 100 rows, looking at all 100 takes negligible time. For 100,000 rows, looking at all of them is noticeably slower. For 100 million, it can be the dominant cost of the query.

An index is a separate data structure that lets the database find rows by a specific column without scanning the whole table. Conceptually, it is like the index at the back of a book: instead of reading every page to find mentions of “transactions,” you flip to the index, find “transactions,” and jump straight to the listed page numbers.

For a query like:

SELECT * FROM cards WHERE deck_id = 3;

with no index, the database scans every row of cards. With an index on deck_id, the database goes directly to the index entry for deck_id = 3, finds the list of matching row locations, and reads only those rows.

The bigger the table, the bigger the difference.

Loading Exercise...

Index Types

The indexes that DBMSs use differ to some extent. The most commonly used index type is the B-tree, which is the default index in PostgreSQL. A B-tree organizes the indexed values into a tree structure where each lookup takes a small number of steps regardless of how big the table grows.

For a table with one million rows, a B-tree lookup typically takes about 3-4 steps. For a table with one billion rows, it takes about 5 steps. The lookup cost grows logarithmically with the table size, not linearly.

B-trees are not binary trees; each node can have many children, which keeps the tree shallow and efficient for disk access. The exact number of steps depends on the branching factor of the tree and the distribution of values, but the key point is that it does not grow proportionally to the number of rows.

Loading Exercise...

Other index types exist in DBMSs — hash indexes, GIN indexes for full-text search and JSONB, GiST for geometric data, BRIN for very large append-only tables — but B-tree handles the vast majority of normal cases. For basic databases work, “index” almost always means “B-tree index” unless stated otherwise.

Creating an Index

The syntax is straightforward:

CREATE INDEX cards_deck_id_idx ON cards (deck_id);

This creates a B-tree index on the deck_id column of cards. PostgreSQL will use this index for queries that filter, join, or order by deck_id.

The index name (cards_deck_id_idx) is conventional but not required. PostgreSQL will generate a name automatically if you omit it. Naming explicitly makes later maintenance easier — you can drop a specific index by name, or grant permissions on it.

Index creation is a schema change, so it should be done through a migration. This ensures that the index is created in all environments and that the schema state is consistent.

Indexes can also cover two or more columns together:

CREATE INDEX submissions_user_exercise_idx ON submissions (user_id, exercise_id);

This index is useful for queries that filter on both user_id and exercise_id, or that filter on user_id alone (because of how B-tree indexes work — the leftmost column can be used independently). It is not useful for queries that filter on exercise_id alone, because the index is sorted primarily by user_id.

Multi-column indexes are powerful but easy to over-create. Usually a single-column index on each frequently-filtered column is enough; multi-column indexes are worth adding only when a specific query is shown to need them.

Loading Exercise...

Confirming the Index Is Used: EXPLAIN

Once an index exists, the natural next question is: is it actually being used?

EXPLAIN answers that question. It shows the query plan — the strategy PostgreSQL will use to execute a query — without actually running it.

EXPLAIN SELECT * FROM cards WHERE deck_id = 3;

The output looks something like this:

                              QUERY PLAN
-----------------------------------------------------------------------
 Index Scan using cards_deck_id_idx on cards  (cost=0.29..8.35 rows=5 width=64)
   Index Cond: (deck_id = 3)

This says: PostgreSQL plans to use the cards_deck_id_idx index to find rows where deck_id = 3, expects to find 5 rows, and estimates the work at a cost of 0.29 to 8.35 (in arbitrary units; what matters is comparison between plans, not the absolute number).

The single most important question this output answers is whether an index is being used. If you see “Index Scan,” the index is doing work. If you see “Seq Scan” instead, the database is reading the whole table.

The three most common ways to read rows, visible in the explain output, are:

  • Sequential scan (Seq Scan). Read every row in the table, in storage order. This is what happens when there is no useful index, when the index would not help (the query asks for most rows anyway), or when the table is small enough that scanning is faster than going through an index.

  • Index scan (Index Scan). Use an index to find the matching row pointers, then read those rows from the table.

  • Index-only scan (Index Only Scan). Same as an index scan, but the query only needs columns that are in the index itself, so the database does not even read the table rows.

Sequential scan is not always bad

A Seq Scan is not automatically a problem. For a small table, or for a query that has to read most rows anyway, sequential scanning is the right choice. The optimizers that DBMSs use often pick a sequential scan deliberately when an index would not actually help. The judgment is whether the strategy fits the query, not whether it is “Index” or “Seq.”


Loading Exercise...

Good Index Candidates

A column is a good index candidate when it appears in:

  • WHERE clauses that filter the table down to a small subset, especially on large tables,
  • JOIN conditions, particularly on the side that the join is filtering through,
  • ORDER BY clauses on large result sets, where the index can serve the sort directly,
  • frequent equality lookups, especially when the column is selective (each value matches few rows).

The key word is “selective.” An index on a column where every value is unique (like users.email) is highly selective: looking up one email finds at most one row. An index on a column where most rows share the same value (like a status column where 99% of rows are 'active') is much less selective, and the database may not bother using it.

For the study tracker project, good index candidates include:

  • cards.deck_id — every card lookup by deck filters on this,
  • reviews.card_id — every review lookup by card filters on this,
  • card_tags.tag_id — for filtering cards by tag.

Each of these supports a frequent query pattern. Each is reasonably selective (a single deck has, say, 50 cards rather than half the table). Each is worth adding. The project checkpoints later in this part are where these indexes will actually be added to the project, in a migration tied to the first checkpoint that depends on them.

Foreign keys are not auto-indexed in PostgreSQL

Some DBMSs automatically create an index when a foreign key is declared. PostgreSQL does not. The foreign key constraint enforces referential integrity, but it does not speed up queries that filter on the foreign-key column. If you want fast lookups by cards.deck_id, you need to create the index explicitly.

This is one of the most common surprises for people coming from other DBMSs like MySQL or SQL Server. The fix is simple — add the index — but recognizing the situation requires knowing PostgreSQL’s behavior (or digging into why an application is behaving unexpectedly).


Loading Exercise...

Poor Index Candidates

Not every column benefits from an index.

Small tables. If a table has only a few hundred rows, a sequential scan is already fast. Adding an index does not help and adds cost.

Columns with low selectivity. If 90% of rows have the same value, an index on that column does not help much. The database may decide a sequential scan is faster anyway.

Columns rarely used in filters. If no query ever filters or joins on a column, indexing it is pure cost with no benefit.

Columns that change frequently. Every change to an indexed column requires updating the index too. For columns that are written far more often than they are queried, the cost can outweigh the benefit.

The right way to choose indexes is to look at the queries the application actually runs and to verify with EXPLAIN whether the index is actually used. The next sections show how.

Loading Exercise...

EXPLAIN ANALYZE for Real Timings

EXPLAIN shows the planned strategy. EXPLAIN ANALYZE actually runs the query and reports actual times.

EXPLAIN ANALYZE SELECT * FROM cards WHERE deck_id = 3;
Index Scan using cards_deck_id_idx on cards
  (cost=0.29..8.35 rows=5 width=64) (actual time=0.012..0.018 rows=5 loops=1)
  Index Cond: (deck_id = 3)
Planning Time: 0.072 ms
Execution Time: 0.038 ms

The new pieces are actual time, actual rows, Planning Time, and Execution Time. These tell you what really happened, not just what was planned.

The most useful comparison is rows=5 (the optimizer’s estimate) versus actual rows=5. When these match closely, the optimizer’s plan is well-informed. When they differ wildly (estimated 5, actual 50,000), the optimizer was working with bad statistics, and the plan it chose may not be the right one.

EXPLAIN ANALYZE is the tool for verifying that an optimization actually helped. Run it before and after adding an index; if the execution time drops noticeably, the index is doing real work. If the time stays the same, the optimizer is not using the index, and something is wrong.

EXPLAIN ANALYZE actually runs the query

Unlike plain EXPLAIN, EXPLAIN ANALYZE executes the query. For SELECT, this is harmless. For INSERT, UPDATE, or DELETE, the changes really happen. If you want to see the plan for a destructive query without applying the changes, wrap it in a transaction and roll back.


Loading Exercise...

Before-and-After: Index vs. No Index

The main benefit from EXPLAIN ANALYZE comes from comparing a query before and after adding an index.

Before, no index on reviews.card_id:

EXPLAIN ANALYZE SELECT COUNT(*) FROM reviews WHERE card_id = 42;
Aggregate  (cost=2200.00..2200.01 rows=1 width=8) (actual time=15.234..15.235 rows=1 loops=1)
  ->  Seq Scan on reviews  (cost=0.00..2150.00 rows=20 width=0)
        (actual time=0.011..14.892 rows=20 loops=1)
        Filter: (card_id = 42)
        Rows Removed by Filter: 99980
Planning Time: 0.094 ms
Execution Time: 15.267 ms

The sequential scan reads all 100,000 rows in the table and discards 99,980 of them, keeping only the 20 that match card_id = 42. Total time: about 15 ms. Even though only 20 rows match, the scan still has to look at every row to find them.

After adding CREATE INDEX reviews_card_id_idx ON reviews (card_id):

Aggregate  (cost=8.50..8.51 rows=1 width=8) (actual time=0.087..0.088 rows=1 loops=1)
  ->  Index Only Scan using reviews_card_id_idx on reviews
        (cost=0.42..8.32 rows=20 width=0) (actual time=0.024..0.082 rows=20 loops=1)
        Index Cond: (card_id = 42)
Planning Time: 0.082 ms
Execution Time: 0.105 ms

Now the index serves the count directly. The same 20 rows are found, but the database goes straight to them through the index instead of scanning the whole table. Total time: about 0.1 ms — a 150x speedup.

This is the kind of confirmation EXPLAIN ANALYZE provides. Without it, you would be guessing whether the index helped. With it, you have a measured before-and-after.

A Worked EXPLAIN Reading

Consider a slightly more involved query:

EXPLAIN
SELECT decks.name, COUNT(cards.id) AS card_count
FROM decks
LEFT JOIN cards ON cards.deck_id = decks.id
GROUP BY decks.id, decks.name
ORDER BY decks.name;

A possible plan might look like:

Sort  (cost=12.50..12.55 rows=20 width=40)
  Sort Key: decks.name
  ->  HashAggregate  (cost=11.20..11.50 rows=20 width=40)
        Group Key: decks.id
        ->  Hash Right Join  (cost=2.45..10.20 rows=200 width=36)
              Hash Cond: (cards.deck_id = decks.id)
              ->  Seq Scan on cards  (cost=0.00..6.00 rows=200 width=8)
              ->  Hash  (cost=2.20..2.20 rows=20 width=32)
                    ->  Seq Scan on decks  (cost=0.00..2.20 rows=20 width=32)

A plan is a tree, displayed indented. The deepest indented operation runs first; results bubble up to the top. Reading bottom-up:

  • Seq Scan on decks — read all 20 deck rows,
  • Seq Scan on cards — read all 200 card rows,
  • Hash Right Join — combine them with a hash join,
  • HashAggregate — group the joined rows by deck,
  • Sort — sort the result by deck name.

For these table sizes, Seq Scan is the right choice for both tables. They are small enough that scanning is faster than going through an index. If the tables grew to 10,000 decks and 100,000 cards, the optimizer might switch to a different plan, perhaps using an index on cards.deck_id.

The plan shows the work that will be done. Whether that work is reasonable depends on the size of the data and the alternatives.

A simple rule for reading any plan: start at the most indented line, work upward, and at each level ask “what does this operation do with the rows from below?” By the time you reach the top, you have understood the whole strategy.

Loading Exercise...

Indexes Have Costs

An index is not free. It has three kinds of cost.

Storage. Each index occupies disk space, roughly proportional to the size of the indexed column. For a table with several columns indexed, the indexes might even take more space than the table itself.

Write performance. Every insert, update, or delete that touches an indexed column also has to update the index. A table with no indexes accepts inserts as fast as it can write rows. A table with five indexes does six writes per insert (one to the table, five to the indexes). For write-heavy tables, this matters.

Maintenance complexity. Indexes have to be created in migrations, rebuilt occasionally on bloated tables, monitored for usefulness. Each one is a small ongoing concern.

For a lookup-heavy table where queries dominate the workload, three or four well-chosen indexes are usually fine. For a write-heavy table where inserts dominate, the same number of indexes can become a noticeable cost.

Loading Exercise...

SQL Programming Exercise

The following SQL sequence is sandbox practice with creating and inspecting indexes — not a change to the study tracker project. The tasks are small: in a scratch schema, create one or two targeted indexes, then verify from PostgreSQL’s index catalog that they exist and that a sample query uses them. The project’s actual baseline indexes will be added in the first project checkpoint of this part.

Loading Exercise...

Check Your Understanding

  1. Why does an index let the database find a specific row much faster than a sequential scan?
  2. What does it mean when a plan shows Seq Scan instead of Index Scan, and is that always a problem?
  3. Why are the three costs of an index (storage, write performance, maintenance complexity) reasons that “more indexes” is not the same as “better performance”?