Neural Computation

Key questions

Key objectives

After this course, I will be able to:

Learning Path

CS 375

“I trained a neural net classifier from scratch.”

  1. Basic array/“tensor” operations in PyTorch
  2. Linear Regression “the hard way” (but with a black-box optimizer)
  3. Logistic Regression “the hard way”
  4. Multi-layer Perceptron
  5. Gradient Descent
  6. Data Loaders

Materials

Contents

Intro to Array Computing

You might have heard (or experienced) that Python is slow. So how can Python be the language behind basically all of the recent advances in AI, which all require huge amounts of computing? The secret is array computing. The Python code orchestrates operations that happen on powerful “accelerator” hardware like GPUs and TPUs. Those operations typically involve repeatedly applying an operation to big (usually rectangular) arrays of numbers; hence, array computing.
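As a quick illustration of the difference (my own toy example, not from the course notebooks), here is the same computation written as a Python loop and as one vectorized PyTorch expression; the second form hands all the work to optimized array kernels:

import torch

x = torch.arange(100_000, dtype=torch.float32)

# Loop version: Python executes one multiply-add at a time (slow).
total = 0.0
for value in x:
    total += 3 * value.item() + 1

# Array version: a single expression, executed by fast array kernels
# (on a GPU or TPU if the tensor lives there).
total_vectorized = (3 * x + 1).sum()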

For those used to writing loops, this sort of coding can take some getting used to. Here are two exercises that previous students have found very helpful in getting their minds around how arrays work in PyTorch. (The concepts are basically identical in other libraries like TensorFlow, NumPy, and JAX.)

Objectives

Notebooks

The reference below is an AI-generated summary of the material in the notebook.

Dot Products

A dot product is a fundamental operation in neural networks, particularly in linear (Dense) layers. Key concepts:

Intuitions

Mathematical Form

Basic form: $y = w_1 x_1 + w_2 x_2 + \dots + w_N x_N + b$

Implementation Methods

  1. Using PyTorch’s built-in operations:
    • torch.dot(w, x) or w @ x
  2. Using elementwise operations:
    • Multiply corresponding elements: w * x
    • Sum the results: (w * x).sum()
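For example, all three of these compute the same value (the numbers here are made up for illustration):

import torch

w = torch.tensor([0.5, -1.0, 2.0])   # weights
x = torch.tensor([1.0, 2.0, 3.0])    # inputs
b = 0.1                              # bias

y1 = torch.dot(w, x) + b     # built-in dot product
y2 = w @ x + b               # @ does the same thing for 1-D tensors
y3 = (w * x).sum() + b       # elementwise multiply, then sum
# y1 == y2 == y3 == 0.5*1 + (-1.0)*2 + 2.0*3 + 0.1 = 4.6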

Linear Transformations

A linear transformation ($y = Wx + b$) is the basic building block of neural networks.
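As a sketch (my example, assuming the weight and bias layout that nn.Linear uses), here is the same transformation computed with a layer object and written out by hand:

import torch
import torch.nn as nn

X = torch.randn(5, 2)                         # 5 samples, 2 features each
layer = nn.Linear(in_features=2, out_features=3)

y_from_layer = layer(X)                       # shape (5, 3)
y_by_hand = X @ layer.weight.T + layer.bias   # same computation, written out
print(torch.allclose(y_from_layer, y_by_hand, atol=1e-6))  # True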

PyTorch Operations

Elementwise Operations

Reduction Operations

Common reduction methods include sum, mean, max, and min. These can be called as methods (x.sum()) or as functions (torch.sum(x)).
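A quick sketch (my own toy tensor):

import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

x.sum(), torch.sum(x)     # both: tensor(10.)
x.mean(), torch.mean(x)   # both: tensor(2.5000)
x.max(), torch.max(x)     # both: tensor(4.)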

Mean Squared Error (MSE)

Common error metric for regression tasks.

Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{true},i} - y_{\text{pred},i}\right)^2$

Implementation steps:

  1. Compute residuals: y_true - y_pred
  2. Square residuals: (y_true - y_pred)**2
  3. Take mean: ((y_true - y_pred)**2).mean()

PyTorch provides built-in implementations: nn.MSELoss (module form) and torch.nn.functional.mse_loss (functional form).
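Here is a minimal check that the by-hand version and the built-ins agree (the numbers are made up):

import torch
import torch.nn.functional as F

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5,  0.0, 2.0, 8.0])

mse_by_hand = ((y_true - y_pred) ** 2).mean()
mse_functional = F.mse_loss(y_pred, y_true)
mse_module = torch.nn.MSELoss()(y_pred, y_true)
# all three: tensor(0.3750)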

Multidimensional Arrays

Key Concepts

Reduction Operations on Multiple Dimensions

Tensor Products
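These headings cover working with tensors that have more than one dimension. A minimal sketch of the kinds of operations involved (my own examples, not the notebook’s):

import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)   # shape (2, 3)

A.sum()         # reduce over everything -> tensor(15.)
A.sum(dim=0)    # reduce down the rows   -> shape (3,)
A.sum(dim=1)    # reduce across columns  -> shape (2,)

B = torch.randn(3, 4)
C = A @ B       # matrix product: (2, 3) @ (3, 4) -> (2, 4)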

Linear Regression the Hard Way

The simplest “neural computation” model is linear regression. We’ll implement it today so that we can understand each part of how it works.

To keep things simple we’ll use a “black box” optimizer, a function that just takes the data and the loss function and finds the best values of the parameters.

Later we’ll study how to optimize a function using stochastic gradient descent, and how to compute the gradients of the loss function with respect to the parameters using backpropagation.
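The notebook has its own setup, but as a rough sketch of the “black box” idea, here is a linear regression fit with scipy.optimize.minimize on made-up data; we only supply the loss function and an initial guess, and the optimizer finds the parameters:

import numpy as np
from scipy.optimize import minimize

# Made-up data: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=50)

def mse_loss(params):
    w, b = params
    y_pred = w * x + b
    return np.mean((y - y_pred) ** 2)

# The "black box": give it the loss and a starting guess; it returns
# the parameter values that make the loss as small as it can.
result = minimize(mse_loss, x0=[0.0, 0.0])
print(result.x)   # approximately [2, 1]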

Warm-Up Activity

Go to the interactive figures for the Understanding Deep Learning book.

  1. Go to Figure 2.2 (Least squares loss). Adjust the sliders to try to make the loss bigger or smaller. What are the highest and lowest values you can get for the loss? What does the plot look like at those different settings? (consider the line, the data points, and the dashed lines).
  2. How can you tell if you got a good setting for the sliders? Can you tell just by observing the loss (without looking at the plot of the data and the line)?

Notebooks

Single Linear Regression

Work through this notebook to practice linear regression with a single feature.

Multiple Linear Regression

To reinforce and extend our understanding of linear regression to multiple features, we’ll work through this notebook:

From Logistic Regression to the Multi-Layer Perceptron (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Softmax

Background

Jargon:

Warm-Up Activity

Open the softmax and cross-entropy interactive demo that Prof Arnold created.

Try adjusting the logits (the inputs to softmax) and get a sense for how the outputs change. Describe the outputs when:

  1. All of the inputs are the same value. (Does it matter what the value is?)
  2. One input is much bigger than the others.
  3. One input is much smaller than the others.

Finally, describe the input that gives the largest possible value for output 1.
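If you want to check your answers numerically, here is a minimal sketch of computing softmax directly in PyTorch (plug in the logits from each case above):

import torch

logits = torch.tensor([1.0, 2.0, -1.0])   # try the cases from the warm-up here
probs = torch.softmax(logits, dim=-1)
print(probs)          # each entry is between 0 and 1
print(probs.sum())    # tensor(1.) -- the outputs always sum to 1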

Notebooks

Softmax, part 1 (name: u04n2-softmax.ipynb; show preview, open in Colab)

PyTorch and Logistic Regression

Logistic Regression

[Diagram: input features → Model → logits (a vector of n_classes scores) → softmax → probs (a vector of n_classes probabilities) → cross-entropy (compared with the correct answer) → loss (1 number)]

Jargon:

PyTorch

Imports:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

Building a model object with the desired architecture (structure)

model = nn.Linear(in_features=2, out_features=3, bias=True)

# or

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=2, out_features=3, bias=True)
    
    def forward(self, x):
        return self.linear(x)
model = Model()

# or

n_hidden = 100
model = nn.Sequential(
    nn.Linear(in_features=2, out_features=n_hidden, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=n_hidden, out_features=3, bias=True)
)

Training a model:

loss_fn = nn.CrossEntropyLoss()  # cross-entropy for a classifier (applies softmax internally)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is the step size
# in a training loop ...
optimizer.zero_grad()           # clear gradients from the previous step
y_pred = model(x)               # logits, shape (batch, n_classes)
loss = loss_fn(y_pred, y_true)  # y_true holds class indices, shape (batch,)
loss.backward()                 # compute gradients of the loss
optimizer.step()                # update the parameters
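To see all of the pieces working end to end, here is a self-contained sketch on a tiny synthetic dataset (the random data, learning rate, and number of steps are arbitrary illustrations, not the course’s settings):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic "houses": 100 samples, 2 features, 3 price classes.
X = torch.randn(100, 2)
y_true = torch.randint(0, 3, (100,))

model = nn.Linear(in_features=2, out_features=3, bias=True)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    optimizer.zero_grad()
    logits = model(X)                 # shape (100, 3)
    loss = loss_fn(logits, y_true)
    loss.backward()
    optimizer.step()

# After training: turn logits into probabilities and predicted classes.
with torch.no_grad():
    probs = F.softmax(model(X), dim=-1)      # rows sum to 1
    preds = probs.argmax(dim=-1)
    accuracy = (preds == y_true).float().mean()
print(loss.item(), accuracy.item())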

Warm-Up Activity

Given on paper.

  1. We’re classifying houses as low/medium/high price based on longitude and latitude using logistic regression. The model outputs 3 scores, one for each class. For 100 houses (processed all at once in a “batch” of samples):

    a. What shape is X? X.shape =

    b. What shape should W (the array of weights) be? W.shape =

    c. What shape should b (the array of biases) be? b.shape =

    d. What shape will the output have? (X @ W + b).shape =

  2. For one house, if our model outputs scores [1.0, 2.0, -1.0] for low/med/high prices:

    Write the steps to convert these scores to probabilities that sum to 1. (You can use words or math notation.)

  3. If the true label for this house is “medium”, what’s the model’s accuracy and loss for this house? (You can use words or math notation.)

Notebooks

From Linear Regression in NumPy to Logistic Regression in PyTorch (name: u04n3-logreg-pytorch.ipynb; show preview, open in Colab)

Training an MLP by Gradient Descent in PyTorch

Gradient Game

Try the Gradient Game: How few calls do you need to get the Loss small? How do you do it?

Notebooks

MNIST with PyTorch (name: u06n1-mnist-torch.ipynb; show preview, open in Colab)

Linear Regression the Really Hard Way (gradient descent) (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Embeddings

Example: Sentence Embeddings (name: u08s1-sentence-embeddings.ipynb; show preview, open in Colab)

Slides (including graphics)

Notebooks

Further Exploration

We’ll discuss this much more in CS 376, but here are some ideas for further exploration:

PyTorch Autograd and SGD

How to Compute Gradients

PyTorch Autograd

PyTorch uses automatic differentiation to compute gradients.

Example:

import torch

# Let's call it "w" as if it were a weight in a neural network
w = torch.tensor(2.0, requires_grad=True)  # track operations on w
y = w**2
y.backward()       # compute dy/dw
print(w.grad)      # tensor(4.)  since dy/dw = 2*w = 4

After calling y.backward(), the gradient of y with respect to w is stored in w.grad.
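The same mechanism works for tensors with many elements; a quick sketch (my example):

import torch

w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()    # a scalar function of all three elements
loss.backward()
print(w.grad)            # tensor([ 2., -4.,  6.]), i.e. 2*w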

(Stochastic) Gradient Descent

If we want to minimize a function $f(w)$, we can use gradient descent:

  1. Initialize $w$ randomly
  2. Repeat:
    • Compute the gradient of $f$ with respect to $w$
    • Update $w$ by moving in the opposite direction of the gradient

If the function depends on some data (e.g., it’s the loss of a neural network computed on a batch of data), we often use stochastic gradient descent (SGD):

  1. Initialize $w$ randomly
  2. Repeat:
    • Sample a batch of data
    • Compute the gradient of the loss with respect to $w$ on the batch
    • Update $w$ by moving in the opposite direction of the gradient

We call it stochastic because each batch gives only a noisy (stochastic) estimate of the gradient over the full dataset.
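Here is a minimal sketch of plain gradient descent written with autograd; the function being minimized and the learning rate are arbitrary choices for illustration:

import torch

# Minimize f(w) = (w - 3)^2.
w = torch.randn(1, requires_grad=True)    # 1. initialize w randomly
lr = 0.1

for step in range(100):                   # 2. repeat:
    loss = ((w - 3) ** 2).sum()
    loss.backward()                       #    gradient of f with respect to w
    with torch.no_grad():
        w -= lr * w.grad                  #    move opposite to the gradient
    w.grad.zero_()                        #    clear the stored gradient

print(w.item())   # close to 3.0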

Warm-Up Activity

Suppose we’re trying to minimize a function $f(p) = p^2 + 2p + 1$ using gradient descent:

def f(p):
  return p**2 + 2*p + 1

def grad_f(p):
  # gradient of f with respect to p
  return ______________________

print(f(3)) # ______
print(grad_f(3)) # ______

# fill in the blank to minimize
p = random()
for i in range(100):
  p = ______________________

What should we put in the blanks to minimize the function?

Notebooks

Compute gradients using PyTorch (name: u06n2-compute-grad-pytorch.ipynb; show preview, open in Colab)