Data and Databases

Data Ethics and Privacy

Learning Objectives

You can examine a simple table structure or field list and identify privacy questions about what should be stored there.
You can explain the difference between data people provide and data systems record, and why both deserve careful design.
You can explain how data minimization, privacy by design, and deletion and retention affect database design.

In the previous chapter, you learned to read what a schema says about a system. A useful next step is to ask whether the schema should contain all of those fields in the first place.

A database is not only a technical artifact. It is also a record of what an organization or application has decided to remember. For that reason, database design also includes privacy decisions: which fields to store, who should be allowed to see them, and when some of them should stop being kept.

From Reading Structure to Questioning Structure

Suppose a small course platform has these two tables:

Table	Example Fields
`users`	`id`, `name`, `email`, `date_of_birth`, `home_address`
`submissions`	`user_id`, `exercise_id`, `submitted_at`, `score`

In the previous chapter, the main question would have been: what can this structure tell us about the system?

Here, the questions become more reflective:

Which fields are really needed?
Which fields identify a person directly?
Which fields could still reveal sensitive information when combined with others?
Which fields should perhaps not exist in the schema at all?

That shift matters. A schema does not only support queries and features. It also expresses a decision about what information people will be asked to provide and what information the system will record and retain as they use it.

Personal Data and PII

Some stored data directly identifies a person. Some does not identify a person on its own, but becomes identifying when combined with other data.

A useful practical term is personally identifiable information, or PII. Names, emails, phone numbers, student numbers, and exact addresses are common examples. In many real settings, the broader idea of personal data is even more useful: data that relates to an identifiable person, either directly or through combination with other fields.

This also means that removing one identifying field does not automatically make data harmless. In pseudonymized data, direct identifiers have been removed or replaced, but a person may still be identifiable indirectly. Such data still deserves careful handling if a person can be recognized from a combination such as course participation, timestamps, and a partial identifier.
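As a minimal sketch of pseudonymization, assuming Python, the following replaces an email address with a keyed hash. The key, the function name `pseudonymize`, and the sample rows are hypothetical and exist only for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key; a real system would store it separately from the data.
SECRET_KEY = b"example-key-for-illustration-only"


def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256), shortened for readability."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Made-up sample rows: (email, exercise_id, submitted_at)
submissions = [
    ("ada@example.com", 1, "2024-03-01T23:58:00"),
    ("ada@example.com", 2, "2024-03-02T23:59:00"),
    ("alan@example.com", 1, "2024-03-01T09:15:00"),
]

# The email is replaced, but the other columns are kept unchanged.
pseudonymized = [
    (pseudonymize(email), exercise_id, submitted_at)
    for email, exercise_id, submitted_at in submissions
]

for row in pseudonymized:
    print(row)
```

The email no longer appears, but the same pseudonym is attached to both of one person's rows. Anyone who already knows that a particular student submitted just before midnight on both days could still recognize that student, which is why pseudonymized data is not automatically harmless.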

For a database designer, the first useful question is simple: what do we really need to remember about a person for this system to work? In the example above, `name` and `email` have a clear connection to the application, while `date_of_birth` and `home_address` do not. Even a field like `submitted_at`, which is recorded from system use rather than typed in by the user, can still become part of identifying information when combined with other data.

Data People Provide and Data Systems Record

In many systems, some data is provided directly by people, and some data is recorded because the system observes an action or stores a result.

That distinction matters because both kinds of data create responsibility. A home address may be unnecessary even if it is easy to ask for, and a detailed record of every small interaction may be unnecessary even if the system can collect it automatically.

In the course-platform example, this means looking at both profile-style fields such as `name` and activity data such as `submitted_at`.

One practical way to reason about this is to list candidate fields and ask what job each one does.

Data Item	How It Appears	Needed?	Why?
`name`	provided by user	yes	identify the user in the system
`email`	provided by user	yes	communication and login identity
`submitted_at`	recorded during system use	yes	record when work was submitted
`home_address`	provided by user	no	an exact address is not needed for exercises or grading
`date_of_birth`	provided by user	probably no	usually unrelated to course tasks
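The same review can be written down as a small data structure, so that the reasoning about each field is kept next to the schema. This is only a sketch, assuming Python; the `FieldReview` class and the helper functions are hypothetical names, and the entries mirror the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldReview:
    name: str
    source: str   # "provided" by the user or "recorded" by the system
    needed: bool
    purpose: str


# Entries mirror the table above. date_of_birth is "probably no" in the table
# and is treated as not needed in this sketch.
inventory = [
    FieldReview("name", "provided", True, "identify the user in the system"),
    FieldReview("email", "provided", True, "communication and login identity"),
    FieldReview("submitted_at", "recorded", True, "record when work was submitted"),
    FieldReview("home_address", "provided", False, "not needed for exercises or grading"),
    FieldReview("date_of_birth", "provided", False, "usually unrelated to course tasks"),
]


def fields_to_keep(reviews):
    """Names of the fields that have a stated job in the system."""
    return [review.name for review in reviews if review.needed]


def fields_to_question(reviews):
    """Names of the fields that should be justified or dropped."""
    return [review.name for review in reviews if not review.needed]


print("keep:", fields_to_keep(inventory))
print("question:", fields_to_question(inventory))
```

Note that the review covers both kinds of data: `submitted_at` is recorded by the system, yet it goes through the same "what job does this do?" check as the fields users type in.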

Data Minimization

Data minimization means storing only the data that is actually needed for the intended purpose.

In practice, this means resisting the temptation to keep extra fields “just in case.” Every unnecessary field creates extra responsibility: it has to be protected, justified, and sometimes deleted later. A field with no clear purpose is not only clutter. It is also an extra privacy decision that someone must defend.

In the course-platform example, that is why fields such as `date_of_birth` and `home_address` are hard to justify.

The smaller and more focused the dataset is, the easier it is to explain why each field exists and the easier it is to reduce harm if something goes wrong.
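A minimized version of the course-platform schema might look like the sketch below, written in Python with the standard `sqlite3` module and an in-memory database. The column types and constraints are assumptions made for illustration, since the chapter only lists field names, and `column_names` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types and constraints are assumed; only the field names come from the example.
conn.executescript(
    """
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT NOT NULL UNIQUE
        -- date_of_birth and home_address are deliberately left out
    );

    CREATE TABLE submissions (
        user_id      INTEGER NOT NULL REFERENCES users(id),
        exercise_id  INTEGER NOT NULL,
        submitted_at TEXT NOT NULL,
        score        INTEGER
    );
    """
)


def column_names(connection, table):
    """Return the column names of a table, in definition order."""
    rows = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]  # index 1 of each row is the column name


user_columns = column_names(conn, "users")
print(user_columns)
```

A field that is never created does not need to be protected, justified, or deleted later, which is the practical payoff of minimizing at design time rather than cleaning up afterwards.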

Privacy by Design in Schema Decisions

Privacy is not something added at the very end. Good systems consider privacy from the start, when deciding what to store, how long to keep it, and who may access it.

This idea is often called privacy by design.

In practice, that shows up in concrete design choices. A system might store one contact email instead of several overlapping identifiers, show result data only to the people who need to see it, and avoid columns that collect unrelated background information.

In the course-platform example, this could mean showing scores only to the student and relevant teachers, and separating short-lived operational logs from longer-lived course records.

The same idea applies both to fields that users enter and to fields the system records automatically.
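The rule about who may see a score can be sketched as a small access check. This is an illustration in Python under stated assumptions: the `Viewer` and `Submission` classes, the role names, and the idea that a teacher is linked to specific exercises are all hypothetical, not part of the course platform as described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewer:
    user_id: int
    role: str  # hypothetical role names: "student" or "teacher"
    teaches_exercises: frozenset = frozenset()


@dataclass(frozen=True)
class Submission:
    user_id: int
    exercise_id: int
    score: int


def can_view_score(viewer: Viewer, submission: Submission) -> bool:
    """Allow the submitting student and teachers of that exercise; deny everyone else."""
    if viewer.user_id == submission.user_id:
        return True
    if viewer.role == "teacher" and submission.exercise_id in viewer.teaches_exercises:
        return True
    return False


submission = Submission(user_id=1, exercise_id=10, score=8)
owner = Viewer(user_id=1, role="student")
classmate = Viewer(user_id=2, role="student")
teacher = Viewer(user_id=3, role="teacher", teaches_exercises=frozenset({10}))
other_teacher = Viewer(user_id=4, role="teacher", teaches_exercises=frozenset({99}))

for viewer in (owner, classmate, teacher, other_teacher):
    print(viewer.user_id, viewer.role, can_view_score(viewer, submission))
```

The check denies access by default and grants it only for named reasons. Writing it this way also shows a schema consequence: the system needs some record of which teacher is responsible for which exercise, or the rule cannot be enforced.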

Deletion and Retention

A responsible design also asks when data should stop being kept. Some records may need to remain for grading or course administration, while others should be removed or anonymized once they are no longer needed.

In any system, this means deciding which records should remain as core history and which should expire once their short-term purpose has passed. These are not only operational questions. They influence what tables exist, which columns are optional, and what kinds of maintenance tasks the system may later need.

In the course-platform example, that could mean asking whether old operational logs should expire earlier than final grades, whether inactive user accounts should eventually be removed, and whether every detailed activity record really needs to remain once its short-term purpose has passed.
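One way such a retention rule could turn into a maintenance task is sketched below, using Python's standard `sqlite3` module. The table names `operational_logs` and `final_grades`, their columns, the 30-day period, and the function `expire_old_logs` are hypothetical choices made for the example.

```python
import sqlite3
from datetime import datetime, timedelta

LOG_RETENTION = timedelta(days=30)  # hypothetical retention period for logs

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE operational_logs (
        id INTEGER PRIMARY KEY, message TEXT, created_at TEXT NOT NULL
    );
    CREATE TABLE final_grades (
        user_id INTEGER, course TEXT, grade INTEGER, recorded_at TEXT NOT NULL
    );
    """
)

now = datetime(2024, 6, 1, 12, 0, 0)  # fixed "current time" so the example is repeatable

# Made-up rows: two old log entries, one recent one, and one old grade.
conn.executemany(
    "INSERT INTO operational_logs (message, created_at) VALUES (?, ?)",
    [
        ("login", (now - timedelta(days=90)).isoformat()),
        ("page view", (now - timedelta(days=45)).isoformat()),
        ("login", (now - timedelta(days=2)).isoformat()),
    ],
)
conn.execute(
    "INSERT INTO final_grades VALUES (?, ?, ?, ?)",
    (1, "databases", 5, (now - timedelta(days=400)).isoformat()),
)


def expire_old_logs(connection, current_time):
    """Delete log rows older than LOG_RETENTION and return how many were removed."""
    # ISO 8601 strings in one consistent format sort in chronological order.
    cutoff = (current_time - LOG_RETENTION).isoformat()
    cursor = connection.execute(
        "DELETE FROM operational_logs WHERE created_at < ?", (cutoff,)
    )
    connection.commit()
    return cursor.rowcount


deleted = expire_old_logs(conn, now)
remaining_logs = conn.execute("SELECT COUNT(*) FROM operational_logs").fetchone()[0]
remaining_grades = conn.execute("SELECT COUNT(*) FROM final_grades").fetchone()[0]
print(deleted, remaining_logs, remaining_grades)
```

Keeping logs and grades in separate tables is what makes the rule simple: the short-lived data can expire with a single statement, and the long-lived grade record, which is older than any of the logs here, is never touched.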

In other words, once you have learned to read structure, you can also start asking which parts of that structure should be temporary, which should remain, and which should perhaps never have been stored at all.

Check Your Understanding

When reading a schema, what kinds of fields should make you pause and ask whether they are really needed?
How can a field become identifying even if it does not directly name a person?
Why is it useful to distinguish between data people provide and data systems record?
How does privacy by design affect decisions about tables, columns, and access?
Why should deletion and retention be considered as early as the design of a database structure?
