376 Unit 3: Architectures

376 Preparation 3 (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Notes: Neural Architectures

These notes are reference material for Unit 3 (Architectures). The primary focus of class is self-attention and Transformers; these notes cover other architectures for comparison.

Deep Neural Net = Stack of Layers

Neural networks are built from modular components, often connected sequentially: the output of one layer becomes the input to the next.

The key difference between architectures is their connectivity structure: which parts of the input each output is allowed to depend on.

Feed-Forward / MLP

A feed-forward network (or multi-layer perceptron) is a stack of linear transformations with nonlinearities between them:

$$f(x) = f_2(\text{ReLU}(f_1(x)))$$

where $f_1$ and $f_2$ are both linear transformations ($f_i(x) = x W_i + b_i$) and $\text{ReLU}(x) = \max(0, x)$ is applied elementwise. Other nonlinearities (GELU, SiLU, etc.) are sometimes used instead of ReLU.
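
As a concrete sketch, the two-layer MLP above can be written in a few lines of NumPy (the dimensions and random weights are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights for the two linear maps f_i(x) = x W_i + b_i (illustrative sizes)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input dim 4 -> hidden dim 8
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden dim 8 -> output dim 3

def relu(x):
    return np.maximum(0, x)

def mlp(x):
    # f(x) = f2(ReLU(f1(x)))
    return relu(x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(4,))
print(mlp(x).shape)  # (3,)
```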

Key properties:

  • Fully connected: every output depends on every input.
  • Expects a fixed input size; no parameter sharing across positions.
  • Simple and general; a classic fit for tabular data.

Applying an MLP to a Sequence

There are two options for processing a sequence with an MLP:

|  | Option 1: Concatenate | Option 2: Per-element |
| --- | --- | --- |
| Approach | Concatenate the sequence into one giant vector | Apply the MLP to each element independently |
| Interactions | Can capture interactions between elements | Cannot capture interactions between elements |
| Variable length | Cannot handle variable-length sequences | Can handle variable-length sequences |
| Parameters | Huge number of parameters | Fewer parameters (the same weights are reused for each element) |

In Transformers, the MLP layers use Option 2 (applied independently to each position). Information sharing between positions happens in the attention layers instead.
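A small sketch of Option 2: because matrix multiplication broadcasts over leading dimensions, applying the shared weights to a whole `(seq_len, d)` array is the same as applying the MLP to each element separately (the weights and sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 4))  # sequence of 5 elements, each a 4-dim vector

W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)

def mlp(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Option 2: the same weights are applied at every position. One call
# processes all positions independently and in parallel.
out = mlp(seq)                                    # shape (5, 4)
per_element = np.stack([mlp(seq[i]) for i in range(5)])
assert np.allclose(out, per_element)
```

No position's output depends on any other position, which is why Transformers need attention layers to share information.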

Convolutional Networks (CNN)

A convolutional layer is essentially a feed-forward network applied to a small patch (or “window”) of the input, slid across the entire input to produce many outputs.

Key properties:

  • Local connectivity: each output depends only on a small spatial neighborhood.
  • Parameter sharing: the same kernel is applied at every position.
  • Fully parallel across positions; long-range dependencies require stacking many layers.

How they work: A small set of learnable weights (the “kernel” or “filter”) slides across the input. At each position, the kernel computes a weighted sum of the local patch. Different kernels detect different features — edges, textures, patterns. Stacking multiple convolutional layers lets the network build up from simple local features to complex high-level concepts.
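
The sliding-window computation can be sketched in a few lines; the signal and kernel below are made-up examples:

```python
import numpy as np

def conv1d(x, kernel):
    """Slide `kernel` across `x`, taking a weighted sum of each local patch."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

signal = np.array([0., 0., 1., 1., 1., 0., 0.])
edge_kernel = np.array([-1., 1.])   # responds to upward steps in the signal

print(conv1d(signal, edge_kernel))  # [ 0.  1.  0.  0. -1.  0.]
```

The same kernel weights are reused at every position, which is where the parameter savings come from.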

Resources

Recurrent Networks (RNN, LSTM)

A recurrent network processes a sequence one step at a time, maintaining a “hidden state” that summarizes everything seen so far. At each time step, the network takes the current input and the previous hidden state, and produces an updated hidden state and an output.

Key properties:

  • Processes inputs one step at a time, so it naturally handles variable-length sequences.
  • The same weights are used at every time step.
  • Hard to parallelize (each step depends on the previous one); long-range dependencies are difficult because of vanishing gradients.

LSTM (Long Short-Term Memory) is a variant of RNN designed to mitigate the long-range dependency problem. It uses a gating mechanism to selectively remember or forget information, which helps gradients flow over longer sequences. LSTMs were the dominant architecture for sequence tasks (translation, speech recognition, text generation) before Transformers.
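
The basic (non-LSTM) recurrence can be sketched as follows; the dimensions and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 4

W_x = rng.normal(size=(d_in, d_hidden))      # input -> hidden
W_h = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden (the recurrence)
b = np.zeros(d_hidden)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the previous state.
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

h = np.zeros(d_hidden)                 # initial hidden state
sequence = rng.normal(size=(6, d_in))  # 6 time steps
for x_t in sequence:                   # steps must run in order: h_t depends on h_{t-1}
    h = rnn_step(x_t, h)
print(h.shape)  # (4,)
```

The loop is inherently sequential, which is the parallelism limitation noted in the comparison table below.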

Compare and Contrast

|  | MLP | CNN | RNN/LSTM | Transformer |
| --- | --- | --- | --- | --- |
| Connectivity | Fully connected | Local (spatial neighbors) | Temporal (previous step) | Dynamic (learned attention) |
| Sequence handling | Fixed length (concatenate) or no interaction (per-element) | Local context via sliding window | Naturally sequential, variable length | Full context, variable length |
| Parallelism | Fully parallel | Fully parallel | Sequential (hard to parallelize) | Fully parallel |
| Long-range dependencies | Only if concatenated (expensive) | Requires many stacked layers | Difficult (vanishing gradients) | Direct (any token attends to any other) |
| Parameter sharing | None across positions | Same kernel at all positions | Same weights at all time steps | Same projection weights at all positions |
| Primary strength | Simple, general | Spatial/local patterns | Sequential data with short-range dependencies | Flexible, scalable, long-range |
| Classic applications | Tabular data, small models | Image recognition, object detection | Early machine translation, speech | Modern LLMs, vision (ViT), multimodal |

When to Use Each

Discussion 376.2: Training Data as Stewardship

Generative AI systems learn from massive datasets — text scraped from the web, books, images, code, conversations. These datasets aren’t neutral. They carry the assumptions, biases, and interests of whoever collected them, and of the people whose work (or data) was collected. For people called to pursue shalom — right relationships with God, others, and creation — how we think about training data is not just a technical question.

This Discussion addresses the course objective Overall-Impact and connects to OG-SelfSupervised.

Initial Post

Find a specific, sourced example of a training data issue that matters to you. This could connect to your major, your community, your creative interests, your faith, or something you’ve encountered using AI tools.

Search for a recent news article, research paper, blog post, legal filing, or firsthand account. Topics are moving fast — look for current reporting on training data lawsuits and legislation, AI-generated content feeding back into training sets (“model collapse”), bias in generated images or text, the working conditions of people hired to label and filter training data, or how specific communities (artists, writers, open-source developers, speakers of minority languages) have been affected. Good starting points include major news outlets’ AI coverage, the PAIR Explorables interactive essays, arXiv preprints, ACM opinion pieces, or your own experience.

Some angles to consider:

In your post (~150-250 words):

  1. Describe the issue with a specific example. Name the model, dataset, company, or community involved.
  2. Ground your analysis in a framework you find compelling (see below). Don’t just say “this is bad” — articulate what value or obligation is at stake and why it matters.
  3. Take a position: What should be done differently? By whom?

Cite your source clearly enough that a classmate could find it.

Frameworks for Ethical Analysis

You’re welcome to draw on any ethical tradition you find genuinely useful. Here are some concrete starting points — pick what resonates, or bring your own:

Reformed Christian concepts:

  • Stewardship (Genesis 1:28, Psalm 24:1) — we don’t own creation, we tend it. Does scraping the internet’s creative output look like tending or extracting?
  • Image of God (Genesis 1:27) — every person has inherent dignity. What does that say about data laborers, or about communities whose likeness is reproduced without consent?
  • Shalom — the biblical vision of things being as they ought to be (Cornelius Plantinga, Not the Way It’s Supposed to Be). Where is shalom broken in how training data is collected or used?
  • Justice and the vulnerable (Proverbs 31:8-9, Micah 6:8) — who has power in this situation, and who doesn’t?
  • Common grace — shared gifts (knowledge, language, art) are meant for the common good. When they’re enclosed in a dataset, who gains and who loses?

Other ethical frameworks:

  • Distributive justice (Rawls) — would this arrangement be fair if you didn’t know which role you’d play?
  • Virtue ethics — what virtues (honesty, humility, courage) or vices (greed, indifference) are on display?
  • Care ethics — who is being cared for, and whose needs are invisible?
  • Digital commons — is the open internet a shared resource being depleted?

You don’t need to be a theologian or philosopher. A sentence or two connecting your example to a specific concept is enough.

Replies

Reply to at least two classmates (~75-150 words each). Your replies should do both of the following:

  1. Engage with their argument: Do you agree with their position? What would you add, push back on, or complicate?
  2. Bring a different lens: If they used a theological framework, try responding with a secular one (or vice versa). If they focused on creators’ rights, consider the perspective of users or model developers. The goal is to deepen the conversation, not just agree.

Rubric

Exercise 376.2: Perplexity

Overview

In the u09 logits notebook, you implemented perplexity for a single model on short texts. In this assignment, you will extend that work to compare multiple language models of different sizes, analyzing how performance scales with model size.

Learning Objectives

This assignment addresses the following course objectives:

Students may also use this exercise to demonstrate additional objectives, such as:

Task

Your goal is to evaluate how language model performance (measured by perplexity) changes with model size:

  1. Select two or more language models from the Qwen2.5 family (e.g., 0.5B, 1.5B, 3B parameters)
  2. Evaluate these models on a set of short stories by computing perplexity for each model on the same stories
  3. Create a plot showing how perplexity changes with model size
  4. Analyze which stories or which parts of stories are most challenging for the models

Model Options

Use models from the Qwen2.5 family, available on the Hugging Face model hub (e.g., Qwen/Qwen2.5-0.5B, Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-3B).

All Qwen2.5 sizes share the same tokenizer, which makes perplexity comparison across sizes fair.

Memory tip: Load models with torch_dtype=torch.float16 (or torch.bfloat16) to reduce memory usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

Dataset

Use the TinyStories dataset, which contains short stories. You can load it from the Hugging Face hub using the datasets library:

from datasets import load_dataset
dataset = load_dataset("roneneldan/TinyStories", split="train[:100]")
# Take a sample of stories for evaluation
stories = dataset.select(range(50))

Computing Perplexity

Adapt your work from the u09 logits notebook, where you began implementing perplexity. I still recommend computing the loss manually (extracting logits, indexing the correct token probabilities, averaging) rather than using shortcut approaches like passing labels to the model. The main addition here is running it across multiple models and stories.

Here’s a suggested function signature (you may want to return additional values for token-level analysis, or take additional arguments for context to prepend):

def compute_perplexity(model, tokenizer, text):
    """
    Compute the perplexity of a model on a given text.
    
    Args:
        model: A language model that returns logits
        tokenizer: The tokenizer associated with the model
        text: The text to evaluate
        
    Returns:
        float: The perplexity of the model on the text
    """
    # Your implementation here

Caution about indexing: Pay careful attention to token positions! Remember that when predicting the token at position i, you use the logits from position i-1. This off-by-one error is easy to make.
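
To see the shift concretely, here is a minimal NumPy sketch (a hypothetical helper, not the required signature) that scores each token with the logits from the previous position. A useful sanity check: a uniform model over a vocabulary of size V should give perplexity exactly V.

```python
import numpy as np

def perplexity_from_logits(logits, token_ids):
    """
    logits: (seq_len, vocab) array where logits[i] predicts the token at i+1.
    token_ids: (seq_len,) array of the actual token ids.
    """
    # Shift: the logits at position i-1 score the token at position i.
    shifted_logits = logits[:-1]   # predictions for positions 1..seq_len-1
    targets = token_ids[1:]        # the tokens those predictions are scored on

    # Log-softmax (real code should subtract the max logit first for stability),
    # then pick out the log-probability of each actual next token.
    log_probs = shifted_logits - np.log(np.exp(shifted_logits).sum(axis=-1, keepdims=True))
    token_log_probs = log_probs[np.arange(len(targets)), targets]

    # Perplexity = exp of the average negative log-probability.
    return float(np.exp(-token_log_probs.mean()))

# Sanity check: uniform logits over a vocab of 10 give perplexity 10.
print(perplexity_from_logits(np.zeros((5, 10)), np.array([3, 1, 4, 1, 5])))
```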

For data collection, consider creating a structure like:

results = []

# For each model and story
for model_name in model_names:
    # Load the model and tokenizer for this size (see the memory tip above)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    for story_idx, story in enumerate(stories):
        # Compute perplexity
        perplexity = compute_perplexity(model, tokenizer, story["text"])

        # Store results
        results.append({
            "model_name": model_name,
            "story_idx": story_idx,
            "perplexity": perplexity,
        })

    # Optional: free memory before loading the next model
    del model

# Convert to DataFrame for easier analysis
import pandas as pd
results_df = pd.DataFrame(results)
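
Once the results are in a DataFrame, the model-by-story table and the per-model averages for your plot fall out of a pivot and a groupby. The values below are made up purely for illustration:

```python
import pandas as pd

# Toy results in the same shape the collection loop produces (values made up)
results_df = pd.DataFrame([
    {"model_name": "Qwen/Qwen2.5-0.5B", "story_idx": 0, "perplexity": 12.3},
    {"model_name": "Qwen/Qwen2.5-0.5B", "story_idx": 1, "perplexity": 15.1},
    {"model_name": "Qwen/Qwen2.5-1.5B", "story_idx": 0, "perplexity": 9.8},
    {"model_name": "Qwen/Qwen2.5-1.5B", "story_idx": 1, "perplexity": 11.4},
])

# Model-by-story table (step 2 of the analysis)
table = results_df.pivot(index="story_idx", columns="model_name", values="perplexity")

# Mean perplexity per model, for the size-vs-perplexity plot (step 3)
means = results_df.groupby("model_name")["perplexity"].mean()

print(table)
print(means)
```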

Analysis and Submission

Create a Jupyter notebook that includes:

  1. Implementation of the perplexity calculation
  2. A table showing perplexity for each model on each story
  3. A plot showing how perplexity changes with model size
  4. Analysis of results:
    • Which models performed best?
    • Is there a consistent relationship between model size and perplexity?
    • Which stories had the highest/lowest perplexity across models? (look at their full text, don’t make assumptions)
    • Optional: Identify specific tokens or sentence positions that were most challenging for the models

Grading Rubric

Objective Level P (Progressing) Level M (Met) Level E (Excellent)
TM-LLM-Generation (extracting logits) Loads at least one model and extracts logits to compute perplexity Correctly computes perplexity for all chosen models Performs token-level analysis showing which specific tokens contribute most to perplexity
OG-LLM-Eval (evaluation strategy) Computes perplexity for at least one model on the dataset Compares perplexity across multiple models; identifies which model performs best Critically analyzes what perplexity captures and misses as an evaluation metric
TM-Scaling (size vs. performance) Reports perplexity values for different model sizes Creates a clear plot of perplexity vs. model size and describes the trend Connects findings to scaling laws; analyzes whether improvement is linear, logarithmic, etc.; discusses diminishing returns
OG-Eval-Experiment (experimental design) Runs the comparison on a small sample Uses a sufficient sample of stories and reports results systematically Controls for confounds (e.g., story length, genre); reports variance or confidence intervals

Extension (for E-level work)

Implement context: instead of computing perplexity on the full story, compute it only on part of the story, providing the preceding sentences as context.
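
One way to sketch this, assuming per-token log-probabilities as in the main task: run the model on the context and target together, but average the loss only over the target tokens. Illustrated below with a uniform toy model (the helper name and signature are hypothetical):

```python
import numpy as np

def perplexity_on_target(logits, token_ids, n_context):
    """
    Score only the tokens after the first `n_context` tokens, while the
    model still conditions on the context tokens.
    """
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # logits[i-1] predicts token i, so token i's log-prob lives at row i-1.
    token_log_probs = log_probs[np.arange(len(token_ids) - 1), token_ids[1:]]
    # Keep only positions at or after the end of the context.
    target_log_probs = token_log_probs[max(n_context - 1, 0):]
    return float(np.exp(-target_log_probs.mean()))

# Uniform toy model over a vocab of 8: perplexity is 8 regardless of context split.
vocab = 8
logits = np.zeros((6, vocab))
tokens = np.array([2, 5, 1, 0, 7, 3])
print(perplexity_on_target(logits, tokens, n_context=3))  # 8.0
```

For a real model, context should lower the target's perplexity compared to scoring it cold; comparing the two is a natural analysis to report.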

Lab 376.3: Implementing Self-Attention

In this lab, you’ll trace through parts of the implementation of a Transformer language model, focusing on the self-attention mechanism. We’ll compare the performance of a Transformer model with a baseline that only uses a feedforward network (MLP).

This lab addresses the following course objectives:

Task

Start with this notebook:

Implementing self-attention (name: u10n1-implement-transformer.ipynb; show preview, open in Colab)

You may find it helpful to refer to The Illustrated GPT-2 (Visualizing Transformer Language Models) – Jay Alammar – Visualizing machine learning one concept at a time.

Extension idea

Optional Extension: Architectural Experimentation (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.