Privacy, Copyright, Bias, and Responsible Use
Learning Objectives
- You understand the main privacy, copyright, and bias concerns in LLM application design.
- You can connect those concerns to practical engineering choices.
- You can describe what responsible use means in a course-level software project.
Responsible use is part of engineering
Privacy, copyright, and bias are sometimes treated as separate from software engineering, but in LLM systems they often appear inside ordinary technical decisions:
- what data is sent to a provider,
- what is stored in logs,
- what documents are indexed,
- and how the system presents uncertainty to users.
Responsible use is therefore not only a policy issue. It is part of system design.
Privacy and data minimization
A useful default is to send and store as little sensitive data as possible.
Questions worth asking include:
- Does the model call really need this personal or confidential information?
- Can sensitive fields be removed before the request?
- Who can access prompts, logs, transcripts, and evaluation data?
- How long is this data retained?
Many privacy problems come from convenience. It is easy to log everything “just in case” and only later realize that the logs contain more than the system should have stored.
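The minimization questions above can be turned into a small preprocessing step that runs before any text leaves the local system. The sketch below is illustrative only: the patterns, the placeholder tokens, and the idea of a six-digit student ID are assumptions, not a complete PII detector.

```python
import re

# Illustrative redaction rules; a real system would need a broader,
# tested set of patterns for its own data.
ID_PATTERN = re.compile(r"\b\d{6}\b")              # assumed six-digit student IDs
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def minimize(text: str) -> str:
    """Remove obvious identifiers before the text is sent or logged."""
    text = ID_PATTERN.sub("[ID]", text)
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return text

request = "Student 123456 (anna@example.org) asks for an extension."
print(minimize(request))
# Both the model call and the log entry receive only the minimized text.
```

The point is architectural rather than technical: redaction happens once, at the boundary, so neither the provider nor the long-term logs ever see the original identifiers.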
Example: A support assistant and sensitive details
Suppose that a staff member asks a course-support assistant:
Student 123456 reports a medical emergency and asks for an extension until next week.
If the full message is sent to an external provider and copied into logs, the system has now duplicated sensitive personal information in at least two places.
A better design might:
- remove the student identifier before the model call,
- keep sensitive case details out of long-term logs,
- and use the model only for policy interpretation while storing the actual decision process in a separate controlled system.
This kind of example shows why privacy is not only a legal topic. It is a design question about what data actually needs to move through the LLM workflow.
The same idea can be expressed as a prompt-engineering habit. Instead of sending the original request directly to the model, the application can first minimize it. For example, this is weaker:
Student 123456 reports a medical emergency and asks for an extension until next week.
This is often safer:
A student is requesting an extension because of a serious personal situation.
Summarize the policy-relevant issue without adding private details.
The second version keeps the task-relevant meaning while removing identifiers and unnecessary sensitive detail.
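As application code, this habit means the original request never becomes the prompt directly. The following sketch hard-codes the generalized wording for illustration; in a real system the mapping from raw request to safe prompt might be a local rule set or a locally-run model.

```python
def build_safe_prompt(raw_request: str) -> str:
    """Sketch: forward the policy-relevant task, never the original wording.

    The raw request is deliberately not interpolated into the prompt.
    """
    return (
        "A student is requesting an extension because of a serious "
        "personal situation.\n"
        "Summarize the policy-relevant issue without adding private details."
    )

prompt = build_safe_prompt(
    "Student 123456 reports a medical emergency and asks for an extension."
)
assert "123456" not in prompt  # identifiers never leave the application
```

The assertion documents the design intent: whatever the raw request contains, the outgoing prompt carries only the generalized task.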
Secrets and source code
Projects that work with LLMs often handle source code, configuration, and API keys. That creates two separate concerns:
- do not expose secrets in prompts or repositories,
- and do not assume that all code can be shared with every provider or service.
Engineering teams should be explicit about which materials are allowed to leave the local environment and which are not.
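One way to make that explicitness concrete is a check that refuses to send material that looks like a credential. The patterns below are assumptions for illustration (including the `sk-` key shape); real secret scanners use much larger rule sets.

```python
import re

# Illustrative secret patterns only; not a complete scanner.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),            # assumed API-key shape
    re.compile(r"(?i)api[_-]?key\s*=\s*\S+"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def contains_secret(text: str) -> bool:
    """Return True if the text appears to contain a credential."""
    return any(p.search(text) for p in SECRET_PATTERNS)

snippet = 'API_KEY = "sk-abc123def456ghi789jkl"'
if contains_secret(snippet):
    # Refuse the request; the snippet stays in the local environment.
    print("blocked: possible secret in prompt")
```

A check like this belongs at the same boundary as the privacy minimization step: the last point where the application still controls what leaves the local environment.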
Copyright and licensing
When an application indexes documents or generates code or text, copyright and licensing questions arise.
Practical issues include:
- whether the indexed documents may be processed this way,
- whether generated output resembles licensed material too closely,
- and whether the team understands the terms of the models and services it uses.
This course does not provide legal advice, but it is good engineering practice to treat licenses and content ownership as real constraints, not as afterthoughts.
For additional details on licensing concerns in code-oriented training datasets, see An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets.
Bias and uneven performance
LLM systems can perform unevenly across languages, writing styles, user groups, and task types. A system may appear strong in the examples chosen by the developer while failing more often for other users.
This is one reason to build evaluation sets carefully and to inspect failures rather than only average scores. Bias is not always dramatic or obvious. It can show up as uneven helpfulness, different error rates, or weaker support for some kinds of input.
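A simple way to inspect this in an evaluation set is to report accuracy per group alongside the overall average. The groups and results below are made-up illustration data; in practice the groups might be languages, writing styles, or user segments.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group label, whether the answer was correct).
results = [
    ("english", True), ("english", True), ("english", True), ("english", False),
    ("finnish", True), ("finnish", False), ("finnish", False), ("finnish", False),
]

totals = defaultdict(lambda: [0, 0])   # group -> [correct, total]
for group, ok in results:
    totals[group][0] += int(ok)
    totals[group][1] += 1

overall = sum(ok for _, ok in results) / len(results)
print(f"overall accuracy: {overall:.2f}")          # the average hides the gap
for group, (correct, total) in totals.items():
    print(f"{group}: {correct}/{total} = {correct / total:.2f}")
```

Here the overall accuracy of 0.50 looks unremarkable, while the per-group breakdown reveals that one group is served far worse than the other. That gap, not the average, is what a responsible evaluation should surface.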
Transparency toward users
Users should not be misled about what the system is doing. If an answer comes from an LLM-based assistant with retrieved context and automated ranking, the surrounding interface and documentation should make that clear enough that users can apply appropriate caution.