Students who complete this unit will demonstrate that they can:
Describe the basic structure of a machine learning model.
Describe the overall approach of Stochastic Gradient Descent: how does it use information from a batch of data to improve the model's performance on that batch and on other data?
Describe the parameters of a linear layer and how they are used to compute its output.
Identify the following loss functions: Mean Squared Error and Mean Absolute Difference.
Define what a Multi-Layer Perceptron (MLP) is and identify the terms “input features”, “hidden features”, “activation function”, and “output features” (a small sketch appears after this list).
Trace the execution of a basic image classifier model using a fully-connected network.
Apply automatic differentiation (as implemented in PyTorch) to compute the gradients of programs (see the autodiff example after this list).
(Note that we’re focusing on regression models this week; next week we’ll add classification.)
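To make the linear-layer, MLP, and loss-function objectives concrete, here is one way these pieces might fit together in PyTorch. The layer sizes (10 input features, 32 hidden features, 1 output feature) and the batch of random data are arbitrary choices for illustration, not values from the course:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal MLP: input features -> hidden features -> output features.
model = nn.Sequential(
    nn.Linear(10, 32),   # parameters: weight of shape (32, 10), bias of shape (32,)
    nn.ReLU(),           # activation function between the layers
    nn.Linear(32, 1),
)
print(model[0].weight.shape)  # torch.Size([32, 10])
print(model[0].bias.shape)    # torch.Size([32])

x = torch.randn(4, 10)        # a batch of 4 examples (random, for illustration)
y_true = torch.randn(4, 1)
y_pred = model(x)             # output shape (4, 1)

mse = F.mse_loss(y_pred, y_true)  # Mean Squared Error
mae = F.l1_loss(y_pred, y_true)   # Mean Absolute Difference (L1 loss)
```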
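And a minimal autodiff example: PyTorch records the operations applied to a tensor created with `requires_grad=True`, and `backward()` then computes gradients with respect to it. The function y = x² + 2x here is just a toy example:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)  # track operations on x
y = x**2 + 2*x                             # y = x^2 + 2x

y.backward()   # backpropagate: compute dy/dx
print(x.grad)  # tensor(8.) since dy/dx = 2x + 2 = 8 at x = 3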
Preparation
Check your prior knowledge:
can you define the Mean Squared Error (MSE) of a linear regression (i.e., y = m*x + b)?
can you write an algorithm that, given some data and a starting m and b, returns a new m and b that give a lower MSE? (One possible sketch appears below.)
For this week, focus on how things are used rather than the underlying math, especially for tensors (which have several different definitions) and derivatives (which we’ll get to shortly).
The book uses “rank” to refer to the number of axes of a tensor, but “rank” means something different in linear algebra. To avoid confusion, let’s call it “number of axes”, or perhaps “number of dimensions” (abbreviated “ndim” in PyTorch).
For example, a length-5 column vector times a length-4 row vector would give a matrix (tensor) with two axes (2-dimensional), with shape (5, 4) and rank 1 in the linear algebra sense. See this notebook.
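Here is the same example as a small PyTorch sketch (the tensor values are random; only the shapes matter):

```python
import torch

col = torch.randn(5, 1)   # length-5 column vector
row = torch.randn(1, 4)   # length-4 row vector
outer = col @ row         # matrix product -> a matrix with two axes

print(outer.ndim)   # 2 ("number of axes"; the book would call this rank 2)
print(outer.shape)  # torch.Size([5, 4])
print(torch.linalg.matrix_rank(outer))  # tensor(1): rank 1 in the linear algebra sense
```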