Generative AI systems learn from massive datasets — text scraped from the web, books, images, code, conversations. These datasets aren’t neutral. They carry the assumptions, biases, and interests of whoever collected them, and of the people whose work (or data) was collected. As people called to pursue shalom — right relationships with God, others, and creation — how we think about training data is not just a technical question.
This Discussion addresses the course objective Overall-Impact and connects to OG-SelfSupervised.
Initial Post
Find a specific, sourced example of a training data issue that matters to you. This could connect to your major, your community, your creative interests, your faith, or something you’ve encountered using AI tools.
Search for a recent news article, research paper, blog post, legal filing, or firsthand account. This area is moving fast — look for current reporting on training data lawsuits and legislation, AI-generated content feeding back into training sets (“model collapse”), bias in generated images or text, the working conditions of people hired to label and filter training data, or how specific communities (artists, writers, open-source developers, speakers of minority languages) have been affected. Good starting points include major news outlets’ AI coverage, the PAIR Explorables interactive essays, arXiv preprints, ACM opinion pieces, or your own experience.
Some angles to consider:
- Whose work is in the data? Copyright, consent, compensation — and what happens when creators can’t opt out.
- Whose world does the data reflect? Which voices, languages, and perspectives are overrepresented or missing, and how that shapes what models generate.
- What happens at scale? When models train on their own outputs, when training sets include benchmark answers, when AI-generated content displaces human participation in online communities.
- Who benefits and who bears the cost? The people paid pennies per task to label toxic content, the energy and water consumed by data centers, the companies that profit versus the communities whose data was collected.
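If “model collapse” feels abstract, here is a toy sketch you can run to build intuition (it is for intuition only, not part of the assignment, and it is not how any real system trains). The “model” is just a Gaussian that is repeatedly re-fitted to a finite sample of its own outputs; each generation, estimation noise compounds and the distribution’s diversity shrinks:

```python
# Toy illustration of "model collapse": a "model" (here, a Gaussian)
# repeatedly re-fitted to its own finite samples loses diversity.
# Hypothetical toy example; real training pipelines are far more complex.
import random
import statistics

random.seed(0)

mean, stdev = 0.0, 1.0  # generation 0: the original "human" data distribution
SAMPLES_PER_GEN = 20    # small samples make the effect visible quickly

for generation in range(200):
    # Generate a finite dataset from the current model...
    samples = [random.gauss(mean, stdev) for _ in range(SAMPLES_PER_GEN)]
    # ...then refit the model to those samples (maximum-likelihood Gaussian).
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)

# The spread collapses toward zero: later generations reproduce an
# ever-narrower slice of what the original distribution contained.
print(f"after 200 generations: stdev = {stdev:.4f} (started at 1.0)")
```

The mechanism to notice is that each generation inherits only what the previous one happened to sample; rare cases vanish first, and the loss compounds. That is the dynamic worried about when AI-generated content feeds back into training sets.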
In your post (~150-250 words):
- Describe the issue with a specific example. Name the model, dataset, company, or community involved.
- Ground your analysis in a framework you find compelling (see below). Don’t just say “this is bad” — articulate what value or obligation is at stake and why it matters.
- Take a position: What should be done differently? By whom?
Cite your source clearly enough that a classmate could find it.
Frameworks for Ethical Analysis
You’re welcome to draw on any ethical tradition you find genuinely useful. Here are some concrete starting points — pick what resonates, or bring your own:
Reformed Christian concepts:
- Stewardship (Genesis 1:28, Psalm 24:1) — we don’t own creation, we tend it. Does scraping the internet’s creative output look like tending or extracting?
- Image of God (Genesis 1:27) — every person has inherent dignity. What does that say about data laborers, or about communities whose likeness is reproduced without consent?
- Shalom — the biblical vision of things being as they ought to be (Cornelius Plantinga, Not the Way It’s Supposed to Be). Where is shalom broken in how training data is collected or used?
- Justice and the vulnerable (Proverbs 31:8-9, Micah 6:8) — who has power in this situation, and who doesn’t?
- Common grace — shared gifts (knowledge, language, art) are meant for the common good. When they’re enclosed in a dataset, who gains and who loses?
Other ethical frameworks:
- Distributive justice (Rawls) — would this arrangement be fair if you didn’t know which role you’d play?
- Virtue ethics — what virtues (honesty, humility, courage) or vices (greed, indifference) are on display?
- Care ethics — who is being cared for, and whose needs are invisible?
- Digital commons — is the open internet a shared resource being depleted?
You don’t need to be a theologian or philosopher. A sentence or two connecting your example to a specific concept is enough.
Replies
Reply to at least two classmates (~75-150 words each). Your replies should do both of the following:
- Engage with their argument: Do you agree with their position? What would you add, push back on, or complicate?
- Bring a different lens: If they used a theological framework, try responding with a secular one (or vice versa). If they focused on creators’ rights, consider the perspective of users or model developers. The goal is to deepen the conversation, not just agree.
Rubric
- Initial post is on time and addresses the prompt
- Post identifies a specific, sourced training data issue
- Post grounds its analysis in a named ethical or theological concept
- Post takes a clear position on what should change
- Replies engage substantively with classmates’ arguments
- Replies bring a different perspective or framework
- Writing is clear, specific, and well-cited