Data and Databases

From Files to Databases

Learning Objectives

You can explain the main weaknesses of file-based data storage.
You can explain at a high level what a database is and what kinds of problems databases are designed to solve.
You can recognize that files and databases often coexist rather than replace one another completely.

Many applications can start with files. A text file, a CSV file, or a JSON document is often enough for a small prototype. But as the amount of data, the number of users, or the number of questions grows, file-based storage becomes awkward.

A Simple File-Based Example

Imagine a small course platform storing submissions in a CSV file.

student_name,student_email,course_code,term,exercise_title,submitted_at,score
Jane Doe,jane@example.com,CS-A1150,spring-2026,SQL Basics,2026-03-10 20:14,
John Doe,john@example.com,CS-C3170,spring-2026,HTML,2026-03-10 21:03,9
Jane Doe,jane.doe@example.com,CS-A1150,spring-2026,Filtering Queries,2026-03-17 22:58,10

At first, this seems fine. There is one file, the contents are easy to read, and simple scripts can process it.

Suppose we want to answer a simple question: how many submissions have arrived for CS-A1150?

Reading the File with a Small Program

For a small file, a short script is enough.

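import csv

count = 0

# newline="" lets the csv module handle line endings itself.
with open("submissions.csv", newline="", encoding="utf-8") as f:
    # DictReader uses the header row as keys, so fields can be read by name.
    reader = csv.DictReader(f)
    for row in reader:
        if row["course_code"] == "CS-A1150":
            count += 1

print(count)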

This works well for a first question. The file is readable, the program is short, and the result is easy to check: with the three rows shown above, the script prints 2.

What If the File Is Wrong or Incomplete?

The small script above assumes that every row is well-formed and that the course code is always written correctly. Real data is messier.

Questions appear quickly:

What if one row contains CS-A150 instead of CS-A1150?
What if one row is missing a field entirely?
What if a program crashes while writing and leaves behind a partial row?

The file and the script do not by themselves prevent these problems. They also do not give us much help in detecting or recovering from them.
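
To make this concrete, here is a minimal sketch of what detecting such problems by hand might look like. It assumes that the set of valid course codes is known in advance (KNOWN_COURSES), and it uses sample rows that were deliberately damaged for illustration; the function name find_problems is likewise made up. The data is embedded in the script so that it runs on its own.

```python
import csv
import io

# Illustrative data: row 3 has a mistyped course code, row 4 is cut short.
SAMPLE = """student_name,student_email,course_code,term,exercise_title,submitted_at,score
Jane Doe,jane@example.com,CS-A1150,spring-2026,SQL Basics,2026-03-10 20:14,
John Doe,john@example.com,CS-C3170,spring-2026,HTML,2026-03-10 21:03,9
Jane Doe,jane.doe@example.com,CS-A150,spring-2026,Filtering Queries,2026-03-17 22:58,10
Jane Doe,jane@example.com,CS-A1150
"""

# Assumption: we know which course codes are valid.
KNOWN_COURSES = {"CS-A1150", "CS-C3170"}


def find_problems(text):
    """Return (line_number, description) pairs for suspicious rows."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        # DictReader fills fields missing from a short row with None.
        missing = [name for name, value in row.items() if value is None]
        if missing:
            problems.append((reader.line_num, "missing fields: " + ", ".join(missing)))
        elif row["course_code"] not in KNOWN_COURSES:
            problems.append((reader.line_num, "unknown course code: " + row["course_code"]))
    return problems


problems = find_problems(SAMPLE)
for line_number, description in problems:
    print(f"line {line_number}: {description}")
```

Note that the script only reports problems after they are already in the file. It does not stop a bad row from being written, and every program that reads the file would need similar checks.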

Repeated Data

The same student name and email may appear many times. The same course code, term, and exercise title may also appear many times. If an email changes, the file may contain old and new versions of the same fact.

In the example above, the same student name appears with two different email addresses. Maybe the student changed email, maybe one row contains a typo, or maybe the rows refer to two different people with similar names. The file itself does not help us resolve the situation cleanly.

This is one of the core motivations for better structured data design: avoid storing the same fact in many places unless there is a good reason.
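
As a small sketch, the following script finds student names that appear with more than one email address in the example rows. It can point out the conflict, but it cannot tell which of the possible explanations is the right one. The sample data is embedded so that the script runs on its own, and the variable names are chosen only for this illustration.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """student_name,student_email,course_code,term,exercise_title,submitted_at,score
Jane Doe,jane@example.com,CS-A1150,spring-2026,SQL Basics,2026-03-10 20:14,
John Doe,john@example.com,CS-C3170,spring-2026,HTML,2026-03-10 21:03,9
Jane Doe,jane.doe@example.com,CS-A1150,spring-2026,Filtering Queries,2026-03-17 22:58,10
"""

# Collect every email address seen for each student name.
emails_by_name = defaultdict(set)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    emails_by_name[row["student_name"]].add(row["student_email"])

# Keep only the names that appear with more than one address.
conflicts = {
    name: sorted(emails)
    for name, emails in emails_by_name.items()
    if len(emails) > 1
}

print(conflicts)
```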

New Questions Mean New Scripts

Suppose we want to ask:

Which exercises in CS-A1150 have already received submissions?
How many distinct students submitted during the first week of the course?
Which submissions are still ungraded?
Which students submitted both Exercise A and Exercise B?

These questions can all be answered from the file, but doing so reliably with ad hoc file processing quickly becomes awkward. Each question needs a new script that reads the file and processes it in its own way. The more questions we have, the more scripts we need, and the more chances there are for mistakes.
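
As a rough sketch, here is what answering just two of these questions might look like. The helper names load_rows, ungraded, and submitted_both are made up for this illustration, and the exercises SQL Basics and Filtering Queries stand in for Exercise A and Exercise B. The sketch also assumes that an empty score field means "not graded yet" and that a student can be identified by name, which the previous section already showed to be fragile.

```python
import csv
import io

SAMPLE = """student_name,student_email,course_code,term,exercise_title,submitted_at,score
Jane Doe,jane@example.com,CS-A1150,spring-2026,SQL Basics,2026-03-10 20:14,
John Doe,john@example.com,CS-C3170,spring-2026,HTML,2026-03-10 21:03,9
Jane Doe,jane.doe@example.com,CS-A1150,spring-2026,Filtering Queries,2026-03-17 22:58,10
"""


def load_rows(text):
    """Parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def ungraded(rows):
    """Question 1: which submissions have no score yet?"""
    # Assumption: an empty score field means the submission is ungraded.
    return [(r["student_name"], r["exercise_title"]) for r in rows if r["score"] == ""]


def submitted_both(rows, first_title, second_title):
    """Question 2: which students submitted both exercises?"""
    # Assumption: students are identified by name only.
    titles_by_student = {}
    for r in rows:
        titles_by_student.setdefault(r["student_name"], set()).add(r["exercise_title"])
    wanted = {first_title, second_title}
    return sorted(name for name, titles in titles_by_student.items() if wanted <= titles)


rows = load_rows(SAMPLE)
print(ungraded(rows))
print(submitted_both(rows, "SQL Basics", "Filtering Queries"))
```

Each function has its own loop, its own assumptions, and its own opportunities for mistakes, even though both read exactly the same rows.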

A file usually stores data as it happened to be written. A more structured solution should let the same stored facts support many different questions later.

Shared data also becomes harder to manage when several people or services read and write it at the same time. Loose files give us very little help with that.

From File Problems to Databases

A database is an organized collection of related data that is kept so that programs and users can store it, find it, update it, and use it later.

At a very high level, we can think of it like this:

Figure 1 — A database brings related data together so the same stored facts can support many questions and features.

In this chapter, the related data happens to be about courses and submissions. The same idea applies just as well to books and loans, patients and appointments, customers and orders, or sensor readings collected over time.

For now, it is enough to think of a database this way: a database stores related facts in an organized form so they can be used reliably later. In the next chapter, we separate the stored data itself from the system that manages it.

When people move from loose files to databases, they are usually not looking only for more storage. They also want a more reliable way to keep related data together, ask many questions of it, update it more safely, and share it across users or services.

A database system gives us at least these benefits:

A structured way to store related data
A query language for retrieving and modifying that data
Integrity rules for preventing invalid states
Support for multiple users and applications
A more systematic way to answer new questions later
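
To give a first impression of some of these benefits, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table name submission and its column layout are assumptions made for this illustration only; a real design would likely split students, courses, and exercises into separate tables. The sketch stores the example rows, answers two different questions from them, and shows an integrity rule rejecting a row that lacks a course code.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout; NOT NULL is a simple integrity rule.
conn.execute("""
    CREATE TABLE submission (
        student_email  TEXT NOT NULL,
        course_code    TEXT NOT NULL,
        exercise_title TEXT NOT NULL,
        submitted_at   TEXT NOT NULL,
        score          INTEGER
    )
""")

conn.executemany(
    "INSERT INTO submission VALUES (?, ?, ?, ?, ?)",
    [
        ("jane@example.com", "CS-A1150", "SQL Basics", "2026-03-10 20:14", None),
        ("john@example.com", "CS-C3170", "HTML", "2026-03-10 21:03", 9),
        ("jane.doe@example.com", "CS-A1150", "Filtering Queries", "2026-03-17 22:58", 10),
    ],
)

# Two different questions, answered from the same stored rows.
count = conn.execute(
    "SELECT COUNT(*) FROM submission WHERE course_code = ?", ("CS-A1150",)
).fetchone()[0]
ungraded = conn.execute(
    "SELECT exercise_title FROM submission WHERE score IS NULL"
).fetchall()

# The database refuses a row that is missing its course code.
try:
    conn.execute(
        "INSERT INTO submission VALUES (?, ?, ?, ?, ?)",
        ("x@example.com", None, "HTML", "2026-03-11 09:00", 5),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(count, ungraded, rejected)
conn.close()
```

The queries describe what we want rather than how to loop over the data, and the invalid row is rejected at the moment it is written rather than discovered later.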

Files Still Have a Place

It is important not to overstate the case. Files are useful and often necessary.

A small application might use:

a database for users, content, and activity history
files for uploaded images and large media assets
exported CSV files for reporting or analysis
logs, backups, and configuration

The point is not that files are bad. The point is that databases are better suited for persistent, structured, shared data that needs to be queried and updated reliably.

A mixed setup is common and sensible. A good engineer recognizes when a task calls for structured, queryable storage and when a simple file is enough.

Check Your Understanding

Why does a folder full of files become difficult to manage once several users need the same data?
What kinds of mistakes become more common when the same fact is copied into many files?
Why is searching across many files not the same thing as querying a database?
