Notes: Neural Architectures

These notes are reference material for Unit 3 (Architectures). The primary focus of class is self-attention and Transformers; these notes cover other architectures for comparison.

Deep Neural Net = Stack of Layers

Neural networks are built from modular components (layers), often connected sequentially: each layer's output is the next layer's input.

The key difference between architectures is their connectivity structure: which outputs depend on which inputs, and how parameters are shared across positions.

Feed-Forward / MLP

A feed-forward network (or multi-layer perceptron) is a stack of linear transformations with nonlinearities between them:

$$f(x) = f_2(\text{ReLU}(f_1(x)))$$

where $f_1$ and $f_2$ are both linear (affine) transformations ($f_i(x) = x W_i + b_i$) and $\text{ReLU}(x) = \max(0, x)$ is applied elementwise. Other nonlinearities (GELU, SiLU, etc.) are sometimes used instead of ReLU.
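A minimal sketch of this two-layer MLP in NumPy (the dimensions here are arbitrary choices for illustration, not from the formula above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4-dim input, 8 hidden units, 3 outputs.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def relu(x):
    return np.maximum(0, x)      # elementwise max(0, x)

def mlp(x):
    h = relu(x @ W1 + b1)        # f_1 followed by the nonlinearity
    return h @ W2 + b2           # f_2 (no nonlinearity on the output)

x = rng.normal(size=(4,))
y = mlp(x)
print(y.shape)                   # (3,)
```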

Key properties:

- Fully connected: every output depends on every input.
- Expects a fixed-size input vector.
- No parameter sharing: each input dimension gets its own weights.

Applying an MLP to a Sequence

There are two options for processing a sequence with an MLP:

| | Option 1: Concatenate | Option 2: Per-element |
|---|---|---|
| Approach | Concatenate the sequence into one giant vector | Apply the MLP to each element independently |
| Interactions | Can capture interactions between elements | Cannot capture interactions between elements |
| Variable length | Cannot handle variable-length sequences | Can handle variable-length sequences |
| Parameters | Huge number of parameters | Fewer parameters (reuse the same weights for each element) |

In Transformers, the MLP layers use Option 2 (applied independently to each position). Information sharing between positions happens in the attention layers instead.
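Option 2 falls out naturally from batched matrix multiplication: stacking the sequence as rows of a matrix applies the same weights independently at every position. A sketch (the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 5, 16, 32       # illustrative sizes

W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))      # one row per sequence position

# The same weights are applied independently at every position:
h = np.maximum(0, x @ W1 + b1)               # (seq_len, d_hidden)
out = h @ W2 + b2                            # (seq_len, d_model)

# Each output row depends only on the matching input row, so
# changing seq_len changes nothing else: variable lengths are fine.
```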

Convolutional Networks (CNN)

A convolutional layer is essentially a feed-forward network applied to a small patch (or “window”) of the input, slid across the entire input to produce many outputs.

Key properties:

- Local connectivity: each output depends only on a small neighborhood of the input.
- Parameter sharing: the same kernel is applied at every position.
- Far fewer parameters than a fully connected layer over the same input.

How they work: A small set of learnable weights (the “kernel” or “filter”) slides across the input. At each position, the kernel computes a weighted sum of the local patch. Different kernels detect different features — edges, textures, patterns. Stacking multiple convolutional layers lets the network build up from simple local features to complex high-level concepts.
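The sliding-window computation described above can be sketched in a few lines for the 1D case (the "edge" kernel is a standard illustrative example, not from the notes):

```python
import numpy as np

def conv1d(x, kernel):
    """Slide `kernel` across `x`; each output is a weighted sum of one local patch."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
edge = np.array([-1.0, 1.0])     # a difference kernel: responds to local change
print(conv1d(x, edge))           # [1. 2. 3. 4.]
```

Note that the same two kernel weights produce every output, which is the parameter sharing that makes convolutions so compact.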

Recurrent Networks (RNN, LSTM)

A recurrent network processes a sequence one step at a time, maintaining a “hidden state” that summarizes everything seen so far. At each time step, the network takes the current input and the previous hidden state, and produces an updated hidden state and an output.

Key properties:

- Sequential: steps must be processed in order, which limits parallelism.
- Handles variable-length sequences naturally.
- Parameter sharing: the same weights are used at every time step.
- Long-range dependencies are difficult to learn (vanishing gradients).

LSTM (Long Short-Term Memory) is a variant of RNN designed to mitigate the long-range dependency problem. It uses a gating mechanism to selectively remember or forget information, which helps gradients flow over longer sequences. LSTMs were the dominant architecture for sequence tasks (translation, speech recognition, text generation) before Transformers.
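The recurrence described above (current input plus previous hidden state produces the next hidden state) can be sketched for a plain RNN as follows; the sizes and the tanh nonlinearity are conventional choices, not specifics from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 4                      # illustrative sizes

W_xh = rng.normal(size=(d_in, d_hidden))   # input -> hidden
W_hh = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden
b_h = np.zeros(d_hidden)

def rnn(sequence):
    h = np.zeros(d_hidden)                 # hidden state: summary of the past
    for x in sequence:                     # one step at a time -- inherently sequential
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
    return h                               # final state summarizes the whole sequence

seq = rng.normal(size=(6, d_in))           # a length-6 sequence
h_final = rnn(seq)
print(h_final.shape)                       # (4,)
```

The explicit loop is the point: each step depends on the previous one, which is exactly why RNNs are hard to parallelize over time.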

Compare and Contrast

| | MLP | CNN | RNN/LSTM | Transformer |
|---|---|---|---|---|
| Connectivity | Fully connected | Local (spatial neighbors) | Temporal (previous step) | Dynamic (learned attention) |
| Sequence handling | Fixed length (concatenate) or no interaction (per-element) | Local context via sliding window | Naturally sequential, variable length | Full context, variable length |
| Parallelism | Fully parallel | Fully parallel | Sequential (hard to parallelize) | Fully parallel |
| Long-range dependencies | Only if concatenated (expensive) | Requires many stacked layers | Difficult (vanishing gradients) | Direct (any token attends to any other) |
| Parameter sharing | None across positions | Same kernel at all positions | Same weights at all time steps | Same projection weights at all positions |
| Primary strength | Simple, general | Spatial/local patterns | Sequential data with short-range dependencies | Flexible, scalable, long-range |
| Classic applications | Tabular data, small models | Image recognition, object detection | Early machine translation, speech | Modern LLMs, vision (ViT), multimodal |

When to Use Each

- MLP: fixed-size inputs with no spatial or sequential structure (e.g., tabular data).
- CNN: inputs with local spatial structure (images, audio).
- RNN/LSTM: sequential data, especially when short-range dependencies dominate.
- Transformer: most modern large-scale sequence tasks, where long-range context and parallel training matter.
