Build a simple language model, then a transformer-based language model
Trace data flow step-by-step to understand how transformers work
Visualize attention patterns
Generate text from both models to compare their capabilities
Example Generated Text
MLP-only model: Andarashe he s war t ay fout t he s s immoo hang g wan as he was ga s w te t awe hang Lind and s s t
Transformer model: Once upon a time, there was a lounng a boy named in the was a very a specian a salw a so peciaing fo
Why the dramatic difference?
Why Transformers?
Transformers have revolutionized NLP since 2017
Key innovation: self-attention mechanism
Allows models to capture long-range dependencies
Underlying architecture of modern LLMs (OpenAI’s GPT family, Meta’s Llama, Google’s Gemma)
Scales effectively with more data and parameters
Self-Attention: The Key Insight
The core of intelligence is contextually adaptive behavior.
So modeling context is critical.
Self-attention dramatically boosted the ability to model context
because the network can “rewire itself” based on context!
Adaptive wiring: connections between input and output can change
Self-Attention: Details
Each token can “attend” to all previous tokens
Leads to context-aware representations
Weights determined dynamically based on content
Multiple attention heads can focus on different patterns
Each Attention Head
Query: What do I want to know?
Key: What information is available?
Value: What is the answer?
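Putting query, key, and value together, a single causal attention head can be sketched as follows. This is a minimal sketch with illustrative names (AttentionHead, emb_dim, head_dim), not the exact lab code:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    def __init__(self, emb_dim, head_dim):
        super().__init__()
        # Each token is projected into a query, a key, and a value
        self.q_proj = nn.Linear(emb_dim, head_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, head_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, head_dim, bias=False)

    def forward(self, x):                            # x: (seq_len, emb_dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Similarity of every query with every key, scaled for stability
        scores = q @ k.T / math.sqrt(k.shape[-1])    # (seq_len, seq_len)
        # Causal mask: a token may only attend to itself and earlier tokens
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)          # the attention pattern
        return weights @ v                           # context-aware output

head = AttentionHead(emb_dim=32, head_dim=16)
out = head(torch.randn(10, 32))                      # 10 tokens -> (10, 16)

The attention weights are computed from the tokens themselves, which is the "adaptive wiring" idea: different inputs produce different connection patterns.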
Our Journey Today
We’ll build character-level language models of increasing complexity:
Simple MLP (no context)
Self-attention transformer (with context)
Lab Structure
Set up environment and dataset (TinyStories)
Implement character-level tokenization
Build and train a simple MLP language model
Trace through the MLP to understand limitations
Implement self-attention mechanism
Build a transformer-based language model
Trace through the transformer to understand attention
Dataset: TinyStories
Simple, short stories generated by GPT-3.5
Perfect for experimentation with small models
We’ll predict the next character in the sequence
Character-level tokenization (simpler than BPE/WordPiece)
# Example from dataset
print(example['text'][:100])
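For reference, one way to load the data is through the Hugging Face datasets library and the public roneneldan/TinyStories dataset. This is a sketch; the lab environment may set things up differently:

from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories", split="train")
example = dataset[0]
print(example['text'][:100])   # first 100 characters of the first story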
Part 1: Tokenization
def encode_doc(doc):
    token_ids = torch.tensor([ord(x) for x in doc], device=device)
    # Remove any tokens that are out-of-vocabulary
    token_ids = token_ids[token_ids < n_vocab]
    return token_ids
Using Unicode code points (ASCII primarily)
Simple byte-level vocab (n_vocab=256)
Each character maps to a unique integer
No need for complex tokenization
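For completeness, decoding is just the inverse mapping. Here decode_doc is an illustrative name, and the round-trip example assumes encode_doc and its globals (device, n_vocab) are already defined:

def decode_doc(token_ids):
    # Map each integer back to its character and join into a string
    return "".join(chr(int(t)) for t in token_ids)

ids = encode_doc("Once upon a time")
print(decode_doc(ids))   # round trip recovers the original in-vocabulary text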
Part 2: MLP Language Model
class FeedForwardLM(nn.Module):
    def __init__(self, n_vocab, emb_dim, n_hidden):
        super().__init__()
        self.word_to_embedding = nn.Embedding(n_vocab, emb_dim)
        self.model = MLP(emb_dim=emb_dim, n_hidden=n_hidden)
        self.lm_head = nn.Linear(emb_dim, n_vocab, bias=False)
        # Use the token embeddings for the LM head ("tie weights")
        self.lm_head.weight = self.word_to_embedding.weight
Simple architecture, can only look at one token at a time (no context awareness)
Three key components:
Token embeddings
MLP network
Language model head
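The forward pass wires these three components together. Roughly, and assuming the lab's MLP maps emb_dim back to emb_dim (a sketch, not the exact lab code):

def forward(self, token_ids):                  # token_ids: (seq_len,)
    emb = self.word_to_embedding(token_ids)    # token embeddings: (seq_len, emb_dim)
    hidden = self.model(emb)                   # MLP applied to each position independently
    logits = self.lm_head(hidden)              # scores over the vocabulary: (seq_len, n_vocab)
    return logits

Note that every position is transformed on its own; no information flows between tokens.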
Limitations of the MLP Model
Processes each token independently
No ability to use context from previous tokens
Cannot model dependencies between characters
Example: in “Once upon a time”, the MLP sees only one character at a time, so nothing in “Once” can influence its predictions for “upon”
Part 3: Tracing the MLP Model
Understanding the flow of data step-by-step:
Embedding lookup: Convert character to vector
MLP processing: Transform the embedding
LM head: Project back to vocabulary space
Why trace? To build intuition for how neural networks transform data.
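A trace of a single character might look like this (a sketch assuming the FeedForwardLM and MLP classes from Part 2; the sizes are illustrative):

import torch

model = FeedForwardLM(n_vocab=256, emb_dim=64, n_hidden=128)

token_id = torch.tensor([ord("O")])         # the character "O" as a token id
emb = model.word_to_embedding(token_id)     # 1. embedding lookup: (1, 64)
hidden = model.model(emb)                   # 2. MLP processing:   (1, 64)
logits = model.lm_head(hidden)              # 3. LM head:          (1, 256)
probs = torch.softmax(logits, dim=-1)       # distribution over the next character
print(probs.argmax(dim=-1))                 # id of the most likely next character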