These notes are reference material for Unit 3 (Architectures). The primary focus of class is self-attention and Transformers; these notes cover other architectures for comparison.
Neural networks are built from modular components, often connected sequentially:
The key difference between architectures is their connectivity structure:
A feed-forward network (or multi-layer perceptron) is a stack of linear transformations with nonlinearities between them:
$$f(x) = f_2(\text{ReLU}(f_1(x)))$$ where $f_1$ and $f_2$ are both linear transformations ($f_i(x) = x W_i + b_i$) and $\text{ReLU}(x) = \max(0, x)$ is applied elementwise. Other nonlinearities (GELU, SiLU, etc.) are sometimes used instead of ReLU.
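The two-layer network above can be written directly in PyTorch. This is a minimal sketch; the layer sizes here are arbitrary, chosen only for illustration:

```python
import torch
import torch.nn as nn

# A two-layer MLP matching f(x) = f2(ReLU(f1(x))).
# The dimensions (16 -> 32 -> 4) are made up for this example.
mlp = nn.Sequential(
    nn.Linear(16, 32),   # f1: linear transformation x W1 + b1
    nn.ReLU(),           # elementwise nonlinearity max(0, x)
    nn.Linear(32, 4),    # f2: linear transformation x W2 + b2
)

x = torch.randn(8, 16)   # a batch of 8 input vectors
y = mlp(x)
print(y.shape)           # torch.Size([8, 4])
```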
Key properties:
There are two options for processing a sequence with an MLP:
| | Option 1: Concatenate | Option 2: Per-element |
|---|---|---|
| Approach | Concatenate the sequence into one giant vector | Apply the MLP to each element independently |
| Interactions | Can capture interactions between elements | Cannot capture interactions between elements |
| Variable length | Cannot handle variable-length sequences | Can handle variable-length sequences |
| Parameters | Huge number of parameters | Fewer parameters (reuse the same weights for each element) |
In Transformers, the MLP layers use Option 2 (applied independently to each position). Information sharing between positions happens in the attention layers instead.
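Option 2 falls out naturally in PyTorch: `nn.Linear` acts on the last dimension of its input, so the same weights are applied independently at every position of a (batch, sequence, features) tensor. A minimal sketch (the sizes are made up):

```python
import torch
import torch.nn as nn

# The same MLP weights are reused at every sequence position.
mlp = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

x = torch.randn(2, 10, 16)   # (batch, sequence length, features)
y = mlp(x)                   # applied independently at each of the 10 positions
print(y.shape)               # torch.Size([2, 10, 16])

# Each position is processed independently: the output at position 0
# depends only on the input at position 0.
```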
A convolutional layer is essentially a feed-forward network applied to a small patch (or “window”) of the input, slid across the entire input to produce many outputs.
Key properties:
How they work: A small set of learnable weights (the “kernel” or “filter”) slides across the input. At each position, the kernel computes a weighted sum of the local patch. Different kernels detect different features — edges, textures, patterns. Stacking multiple convolutional layers lets the network build up from simple local features to complex high-level concepts.
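The sliding-window behavior is easy to see from the output shape of a convolutional layer. A sketch with made-up sizes:

```python
import torch
import torch.nn as nn

# One 1-D convolutional layer: 4 learnable kernels, each looking at a
# window of 3 neighboring positions (sizes chosen for illustration).
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3)

x = torch.randn(1, 1, 20)    # (batch, channels, sequence length)
y = conv(x)
print(y.shape)               # torch.Size([1, 4, 18]): one output per window
```

With no padding, a length-20 input yields 20 − 3 + 1 = 18 window positions, and each of the 4 kernels produces one output channel.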
A recurrent network processes a sequence one step at a time, maintaining a “hidden state” that summarizes everything seen so far. At each time step, the network takes the current input and the previous hidden state, and produces an updated hidden state and an output.
Key properties:
LSTM (Long Short-Term Memory) is a variant of RNN designed to mitigate the long-range dependency problem. It uses a gating mechanism to selectively remember or forget information, which helps gradients flow over longer sequences. LSTMs were the dominant architecture for sequence tasks (translation, speech recognition, text generation) before Transformers.
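In PyTorch, the step-by-step recurrence is wrapped up in a single module call. A minimal sketch with arbitrary dimensions:

```python
import torch
import torch.nn as nn

# An LSTM processes the sequence one step at a time internally,
# carrying hidden state forward (dimensions are made up).
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

x = torch.randn(2, 5, 8)     # (batch, time steps, features)
output, (h_n, c_n) = lstm(x)
print(output.shape)          # torch.Size([2, 5, 16]): one output per time step
print(h_n.shape)             # torch.Size([1, 2, 16]): final hidden state
```

Note that even though the call looks parallel, the computation inside must proceed sequentially: step $t$ cannot start until the hidden state from step $t-1$ is available.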
| | MLP | CNN | RNN/LSTM | Transformer |
|---|---|---|---|---|
| Connectivity | Fully connected | Local (spatial neighbors) | Temporal (previous step) | Dynamic (learned attention) |
| Sequence handling | Fixed length (concatenate) or no interaction (per-element) | Local context via sliding window | Naturally sequential, variable length | Full context, variable length |
| Parallelism | Fully parallel | Fully parallel | Sequential (hard to parallelize) | Fully parallel |
| Long-range dependencies | Only if concatenated (expensive) | Requires many stacked layers | Difficult (vanishing gradients) | Direct (any token attends to any other) |
| Parameter sharing | None across positions | Same kernel at all positions | Same weights at all time steps | Same projection weights at all positions (attention patterns computed dynamically) |
| Primary strength | Simple, general | Spatial/local patterns | Sequential data with short-range dependencies | Flexible, scalable, long-range |
| Classic applications | Tabular data, small models | Image recognition, object detection | Early machine translation, speech | Modern LLMs, vision (ViT), multimodal |
Generative AI systems learn from massive datasets — text scraped from the web, books, images, code, conversations. These datasets aren’t neutral. They carry the assumptions, biases, and interests of whoever collected them, and the people whose work (or data) was collected. As people called to pursue shalom — right relationships with God, others, and creation — how we think about training data is not just a technical question.
This Discussion addresses the course objective Overall-Impact and connects to OG-SelfSupervised.
Find a specific, sourced example of a training data issue that matters to you. This could connect to your major, your community, your creative interests, your faith, or something you’ve encountered using AI tools.
Search for a recent news article, research paper, blog post, legal filing, or firsthand account. Topics are moving fast — look for current reporting on training data lawsuits and legislation, AI-generated content feeding back into training sets (“model collapse”), bias in generated images or text, the working conditions of people hired to label and filter training data, or how specific communities (artists, writers, open-source developers, speakers of minority languages) have been affected. Good starting points include major news outlets’ AI coverage, the PAIR Explorables interactive essays, arXiv preprints, ACM opinion pieces, or your own experience.
Some angles to consider:
In your post (~150-250 words):
Cite your source clearly enough that a classmate could find it.
You’re welcome to draw on any ethical tradition you find genuinely useful. Here are some concrete starting points — pick what resonates, or bring your own:
Reformed Christian concepts:
Other ethical frameworks:
You don’t need to be a theologian or philosopher. A sentence or two connecting your example to a specific concept is enough.
Reply to at least two classmates (~75-150 words each). Your replies should do both of the following:
In the u09 logits notebook, you implemented perplexity for a single model on short texts. In this assignment, you will extend that work to compare multiple language models of different sizes, analyzing how performance scales with model size.
This assignment addresses the following course objectives:
Students may also use this exercise to demonstrate additional objectives, such as:
Your goal is to evaluate how language model performance (measured by perplexity) changes with model size:
Use models from the Qwen2.5 family, which are available on the Hugging Face model hub:
- `Qwen/Qwen2.5-0.5B` (0.5 billion parameters)
- `Qwen/Qwen2.5-1.5B` (1.5 billion parameters)
- `Qwen/Qwen2.5-3B` (3 billion parameters, if your system can handle it)

All Qwen2.5 sizes share the same tokenizer, which makes perplexity comparison across sizes fair.
Memory tip: Load models with torch_dtype=torch.float16 (or torch.bfloat16) to reduce memory usage:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
```
Use the TinyStories dataset, which contains short stories. You can load it from the Hugging Face hub using the datasets library:
```python
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories", split="train[:100]")

# Take a sample of stories for evaluation
stories = dataset.select(range(50))
```
Adapt your work from the u09 logits notebook, where you began implementing perplexity. I still recommend computing the loss manually (extracting logits, indexing the correct token probabilities, averaging) rather than using shortcut approaches like passing labels to the model. The main addition here is running it across multiple models and stories.
Here’s a suggested function signature (you may want to return additional values for token-level analysis, or take additional arguments for context to prepend):
```python
def compute_perplexity(model, tokenizer, text):
    """
    Compute the perplexity of a model on a given text.

    Args:
        model: A language model that returns logits
        tokenizer: The tokenizer associated with the model
        text: The text to evaluate

    Returns:
        float: The perplexity of the model on the text
    """
    # Your implementation here
```
Caution about indexing: Pay careful attention to token positions! Remember that when predicting the token at position i, you use the logits from position i-1. This off-by-one error is easy to make.
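One way to keep the alignment straight is to shift both tensors once, then gather from the shifted versions. This sketch uses random numbers in place of real model logits (the sequence length and vocabulary size are made up), but the indexing is the same as with a real model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, V = 8, 100                        # sequence length, vocab size (made up)
input_ids = torch.randint(V, (1, T))
logits = torch.randn(1, T, V)        # stands in for model(input_ids).logits

# The logits at position i predict the token at position i+1,
# so drop the last logit row and the first target token.
shift_logits = logits[:, :-1, :]     # shape (1, T-1, V)
shift_targets = input_ids[:, 1:]     # shape (1, T-1)

log_probs = F.log_softmax(shift_logits, dim=-1)
token_log_probs = log_probs.gather(-1, shift_targets.unsqueeze(-1)).squeeze(-1)
perplexity = torch.exp(-token_log_probs.mean()).item()
print(perplexity)                    # large, since these logits are random
```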
For data collection, consider creating a structure like:
```python
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

results = []
for model_name in model_names:
    # Load each model (and its tokenizer) before evaluating it
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    for story_idx, story in enumerate(stories):
        # Compute and store the perplexity for this model/story pair
        perplexity = compute_perplexity(model, tokenizer, story["text"])
        results.append({
            "model_name": model_name,
            "story_idx": story_idx,
            "perplexity": perplexity,
        })

# Convert to DataFrame for easier analysis
results_df = pd.DataFrame(results)
```
Create a Jupyter notebook that includes:
| Objective | Level P (Progressing) | Level M (Met) | Level E (Excellent) |
|---|---|---|---|
| TM-LLM-Generation (extracting logits) | Loads at least one model and extracts logits to compute perplexity | Correctly computes perplexity for all chosen models | Performs token-level analysis showing which specific tokens contribute most to perplexity |
| OG-LLM-Eval (evaluation strategy) | Computes perplexity for at least one model on the dataset | Compares perplexity across multiple models; identifies which model performs best | Critically analyzes what perplexity captures and misses as an evaluation metric |
| TM-Scaling (size vs. performance) | Reports perplexity values for different model sizes | Creates a clear plot of perplexity vs. model size and describes the trend | Connects findings to scaling laws; analyzes whether improvement is linear, logarithmic, etc.; discusses diminishing returns |
| OG-Eval-Experiment (experimental design) | Runs the comparison on a small sample | Uses a sufficient sample of stories and reports results systematically | Controls for confounds (e.g., story length, genre); reports variance or confidence intervals |
Implement context: instead of just computing perplexity on the full story, only compute it on part of the story while providing the previous sentences as context.
Qwen2.5-0.5B vs Qwen2.5-0.5B-Instruct)? For the Instruct models, you may want to use the chat template to format the input as a conversation.In this lab, you’ll trace through parts of the implementation of a Transformer language model, focusing on the self-attention mechanism. We’ll compare the performance of a Transformer model with a baseline that only uses a feedforward network (MLP).
This lab addresses the following course objectives:
Start with this notebook:
Implementing self-attention (name: u10n1-implement-transformer.ipynb; open in Colab)
You may find it helpful to refer to Jay Alammar's The Illustrated GPT-2 (Visualizing Transformer Language Models).
Extension idea
torch.compile it first)