After this course, I will be able to say:
“I trained a neural net classifier from scratch.”
You might have heard (or experienced) that Python is slow. So how can Python be the language behind basically all of the recent advances in AI, which all require huge amounts of computing? The secret is array computing. The Python code is orchestrating operations that happen on powerful “accelerator” hardware like GPUs and TPUs. Those operations typically involve repeatedly applying an operation to big (usually rectangular) arrays of numbers, hence, array computing.
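To see the difference in style, here's a tiny sketch (not from the course notebooks) of the same computation written first as a Python loop and then as a single array operation in PyTorch:

```python
import torch

x = torch.arange(100_000, dtype=torch.float64)

# Loop version: Python interprets every iteration, one element at a time.
total = 0.0
for value in x:
    total += value.item() ** 2

# Array version: one call; the work runs in fast compiled code (possibly on a GPU).
total_fast = (x ** 2).sum()

print(total)              # same answer either way...
print(total_fast.item())  # ...but the loop is dramatically slower
```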
For those used to writing loops, this sort of coding can take some getting used to. Here are two exercises that previous students have found very helpful in getting their minds around how arrays work in PyTorch. (The concepts are basically identical in other libraries like TensorFlow, NumPy, and JAX.)
(name: u02n1-pytorch.ipynb; show preview, open in Colab)
for loop approach
The reference below is an AI-generated summary of the material in the notebook.
A dot product is a fundamental operation in neural networks, particularly in linear (Dense) layers. Key concepts:
- Basic form: y = w1*x1 + w2*x2 + ... + wN*xN + b
- Each input x[i] is multiplied by its corresponding weight w[i]
- b is the bias (can be omitted for simplicity)
- Ways to compute it in PyTorch: torch.dot(w, x), w @ x, or (w * x).sum() (elementwise multiply w * x, then sum)

A linear transformation is the basic building block of neural networks:

- y = w*x + b
- w represents weights
- b represents bias
- w * x multiplies corresponding elements

Common reduction methods:

- sum(): Adds all elements
- mean(): Computes average
- max(): Finds maximum value
- argmax(): Finds index of maximum value
- Can be called as methods (x.sum()) or functions (torch.sum(x))
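Here's a minimal sketch (with made-up numbers) of those operations in PyTorch:

```python
import torch

w = torch.tensor([0.5, -1.0, 2.0])   # weights
x = torch.tensor([1.0, 2.0, 3.0])    # inputs
b = torch.tensor(0.1)                # bias

# Three equivalent ways to compute the dot product of w and x
y1 = torch.dot(w, x)
y2 = w @ x
y3 = (w * x).sum()        # elementwise multiply, then reduce with sum()

y = y1 + b                # linear transformation: y = w·x + b

# Reductions can be called as methods or as functions
print(x.sum(), x.mean(), torch.max(x), x.argmax())
```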
Mean squared error (MSE) is a common error metric for regression tasks.
Formula: MSE = (1/n)Σ(y_true - y_pred)²
Implementation steps:
- y_true - y_pred
- (y_true - y_pred)**2
- ((y_true - y_pred)**2).mean()

PyTorch provides built-in implementations:

- F.mse_loss(y_pred, y_true)
- nn.MSELoss()

Reductions can take an axis parameter: x.sum(axis=1) sums along axis 1.

The simplest “neural computation” model is linear regression. We’ll implement it today so that we can understand each part of how it works.
To keep things simple we’ll use a “black box” optimizer, a function that just takes the data and the loss function and finds the best values of the parameters.
Later we’ll study how to optimize a function using stochastic gradient descent, and how to compute the gradients of the loss function with respect to the parameters using backpropagation.
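To make this concrete, here's a minimal sketch (not the notebook's code) of single-feature linear regression: the MSE loss is computed step by step as described above, and scipy.optimize.minimize stands in as the "black box" optimizer that finds the best values of w and b for us.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: y is roughly 3*x + 1 plus a little noise (made-up example)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y_true = 3 * x + 1 + 0.1 * rng.normal(size=50)

def mse_loss(params):
    w, b = params
    y_pred = w * x + b                        # the linear model
    return ((y_true - y_pred) ** 2).mean()    # difference -> square -> mean

# The "black box": hand it the loss function and a starting guess,
# get back the parameter values that minimize the loss.
result = minimize(mse_loss, x0=[0.0, 0.0])
print(result.x)   # should be close to [3, 1]
```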
Go to the interactive figures for the Understanding Deep Learning book.
Work through this notebook to practice linear regression with a single feature.
(name: u03n1-linreg-manual.ipynb; show preview, open in Colab)

To reinforce our understanding of linear regression and extend it to multiple features, we’ll work through this notebook:
(name: u04n1-multi-linreg-manual.ipynb; show preview, open in Colab)

- softmax input: a vector x of shape (n,)
- softmax output: a vector y of shape (n,) where y[i] is the probability that x is in class i
- y[i] = exp(x[i]) / sum(exp(x))

Jargon:
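Here's a minimal sketch of that formula (with made-up logits), alongside PyTorch's built-in softmax:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, -1.0])     # logits (the inputs to softmax)

y_manual = x.exp() / x.exp().sum()     # y[i] = exp(x[i]) / sum(exp(x))
y_builtin = F.softmax(x, dim=0)        # built-in version

print(y_manual)         # a probability for each class
print(y_builtin)        # should match the manual version
print(y_manual.sum())   # the probabilities sum to 1
```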
Open the softmax and cross-entropy interactive demo that Prof Arnold created.
Try adjusting the logits (the inputs to softmax) and get a sense for how the outputs change. Describe the outputs when:
Finally, describe the input that gives the largest possible value for output 1.
Softmax, part 1
(name: u04n2-softmax.ipynb; show preview, open in Colab)
- logistic_regression input: an array X of shape (samples, features)
- logistic_regression output: an array y of shape (samples, classes) where y[i, j] is the probability that sample i is in class j
- logits = X @ W + b, where W is an array of shape (features, classes) and b is an array of shape (classes,)
- logits is then passed through the softmax function to get the output probabilities: y = softmax(logits)
- softmax(x) = exp(x) / sum(exp(x)), where the sums are taken across the classes
- loss_i = -sum(y_true_onehot_i * log(y_pred_i))

Jargon:
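Putting those pieces together, here's a minimal sketch (shapes and values are made up) of the logistic regression forward pass and the cross-entropy loss in PyTorch:

```python
import torch
import torch.nn.functional as F

samples, features, classes = 5, 2, 3
X = torch.randn(samples, features)
W = torch.randn(features, classes)
b = torch.zeros(classes)
labels = torch.tensor([0, 2, 1, 1, 0])    # the true class index for each sample

logits = X @ W + b                        # shape: (samples, classes)
y_pred = F.softmax(logits, dim=1)         # probabilities; each row sums to 1

# Cross-entropy written out with one-hot labels...
y_true_onehot = F.one_hot(labels, num_classes=classes).float()
loss_manual = -(y_true_onehot * y_pred.log()).sum(dim=1).mean()

# ...and with the built-in, which applies softmax internally, so it takes raw logits
loss_builtin = F.cross_entropy(logits, labels)
print(loss_manual, loss_builtin)          # these should match
```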
Imports:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
Building a model object with the desired architecture (structure):
model = nn.Linear(in_features=2, out_features=3, bias=True)
# or
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=2, out_features=3, bias=True)

    def forward(self, x):
        return self.linear(x)

model = Model()

# or
n_hidden = 100
model = nn.Sequential(
    nn.Linear(in_features=2, out_features=n_hidden, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=n_hidden, out_features=3, bias=True),
)
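Whichever form you choose, a quick shape check (with a made-up input batch) confirms that the model maps 2 input features to 3 outputs:

```python
import torch
import torch.nn as nn

model = nn.Linear(in_features=2, out_features=3, bias=True)  # any of the versions above works the same way
x = torch.randn(5, 2)     # a batch of 5 samples, 2 features each
print(model(x).shape)     # torch.Size([5, 3])
```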
Training a model:
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is the learning rate (step size)

# in a training loop ...
optimizer.zero_grad()           # clear gradients left over from the previous step
y_pred = model(x)               # forward pass
loss = loss_fn(y_pred, y_true)  # how far off are the predictions?
loss.backward()                 # compute gradients of the loss w.r.t. the parameters
optimizer.step()                # update the parameters
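Here's a minimal end-to-end sketch putting those pieces into a complete loop (toy random data; the sizes, learning rate, and number of steps are made up):

```python
import torch
import torch.nn as nn

# Toy regression data: 100 samples, 2 features in, 3 outputs
x = torch.randn(100, 2)
y_true = torch.randn(100, 3)

model = nn.Linear(in_features=2, out_features=3, bias=True)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    optimizer.zero_grad()             # clear old gradients
    y_pred = model(x)                 # forward pass
    loss = loss_fn(y_pred, y_true)    # how far off are we?
    loss.backward()                   # compute gradients
    optimizer.step()                  # nudge the parameters downhill

print(loss.item())
```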
Given on paper.
We’re classifying houses as low/medium/high price based on longitude and latitude using logistic regression. The model outputs 3 scores, one for each class. For 100 houses (processed all at once in a “batch” of samples):
a. What shape is X? X.shape =
b. What shape should W (the array of weights) be? W.shape =
c. What shape should b (the array of biases) be? b.shape =
d. What shape will the output have? (X @ W + b).shape =
For one house, if our model outputs scores [1.0, 2.0, -1.0] for low/med/high prices:
Write the steps to convert these scores to probabilities that sum to 1. (You can use words or math notation.)
If the true label for this house is “medium”, what’s the model’s accuracy and loss for this house? (You can use words or math notation.)
From Linear Regression in NumPy to Logistic Regression in PyTorch
(name: u04n3-logreg-pytorch.ipynb; show preview, open in Colab)
Try the Gradient Game: How few calls do you need to get the Loss small? How do you do it?
MNIST with PyTorch
(name: u06n1-mnist-torch.ipynb; show preview, open in Colab)
Example: Sentence Embeddings
(name: u08s1-sentence-embeddings.ipynb; show preview, open in Colab)
Slides (including graphics)
(name: u07n1-image-embeddings.ipynb; show preview, open in Colab)

We’ll discuss this much more in CS 376, but here are some ideas for further exploration:
grad(f, x): the gradient of the function f with respect to its input x.
PyTorch uses automatic differentiation to compute gradients.
Example:
import torch
# Let's call it "w" as if it were a weight in a neural network
w = torch.tensor(2.0, requires_grad=True)
y = w**2
y.backward()
print(w.grad)  # tensor(4.), since dy/dw = 2*w = 4
After calling y.backward(), the gradient of y with respect to w is stored in w.grad.
If we want to minimize a function $f(w)$, we can use gradient descent: repeatedly update $w \leftarrow w - \alpha \cdot \mathrm{grad}(f, w)$, where $\alpha$ is a small step size (the learning rate).
If the function depends on some data (e.g., it’s the loss of a neural network computed on a batch of data), we often use stochastic gradient descent (SGD): the same update, but with the gradient computed on a small randomly chosen batch of data at each step rather than on the whole dataset.
We call it stochastic because it uses a stochastic estimate of the gradient.
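Here's a minimal sketch of that update rule using PyTorch's autograd (the function, learning rate, and number of steps are made up):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1                                   # learning rate (step size)

for step in range(100):
    loss = (w - 3) ** 2                    # a toy function whose minimum is at w = 3
    loss.backward()                        # compute the gradient; it lands in w.grad
    with torch.no_grad():
        w -= lr * w.grad                   # the gradient descent update
    w.grad.zero_()                         # clear the gradient for the next step

print(w.item())                            # should be close to 3
```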
Suppose we’re trying to minimize a function $f(p) = p^2 + 2p + 1$ using gradient descent:
from random import random  # random() returns a starting value in [0, 1)

def f(p):
    return p**2 + 2*p + 1

def grad_f(p):
    # gradient of f with respect to p
    return ______________________

print(f(3))       # ______
print(grad_f(3))  # ______

# fill in the blank to minimize
p = random()
for i in range(100):
    p = ______________________
What should we put in the blanks to minimize the function?
Compute gradients using PyTorch
(name: u06n2-compute-grad-pytorch.ipynb; show preview, open in Colab)