PyTorch Warmup¶

PyTorch is the open-source machine learning framework that we'll be using in this class. It has a wide range of functionality; for now we'll just get started with some of its very basic array-processing functionality.

In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F

Dot Products¶

The most common basic primitive in a neural network is a linear layer (you'll sometimes see it called a "Dense" layer). These are where almost all of the parameters go in a network. (Some architectures use a variant called a convolutional layer.) At its core, a linear layer does a bunch of dot products between its input vector and its (learned) weight vectors.

A few intuitions to understand what a dot product is:

  1. It measures similarity, in the sense of alignment. The following statements loosely capture it:
    • "How much does the input look like this?"
    • "How big is the input in this direction?"
    • "How aligned is the input with this direction?"
    • "What's the cosine of the angle between the input vector and this vector?"
  2. A bunch of dot products all together (like in a Linear layer) rotates and stretches the input space, like moving a camera around a scene.
  3. It's how a multiple linear regression computes its output: a weighted mixture of each part of its input.

Recall that we can describe a line with an expression like y = w*x + b. (Some of you may remember mx + b, but we'll use w for the weight(s) instead.)

That's a multiplication followed by a sum. We can extend that to lots of x's, each of which needs a corresponding w:

y = w1*x1 + w2*x2 + ... + wN*xN + b

For simplicity, let's ignore the b ("bias") for now (we'll bring it back later). So we're left with

y = w1*x1 + w2*x2 + ... + wN*xN

that is, multiply each number in w by its corresponding number in x and add up the results: sum(w[i] * x[i] for i in range(N)). Or, in mathematical notation: $\sum_{i=1}^{N} w_i x_i.$

The result is called a dot product, and is one of the fundamental operations in linear algebra. At this point you don't need to understand all of the linear algebra behind it; we're just implementing a common calculation.

Let's do that in pure Python, and then in PyTorch. To start, let's make an array for the weights (called w) and an array for the inputs (called x).

In [ ]:
w = torch.tensor([2.0, -1.0])
w
Out[ ]:
tensor([ 2., -1.])
In [ ]:
x = torch.tensor([1.5, -3.0])
x
Out[ ]:
tensor([ 1.5000, -3.0000])

The shapes of w and x must match.

In [ ]:
N = len(w)
assert N == len(x)
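
For the w and x above, the dot product works out to $2 \cdot 1.5 + (-1) \cdot (-3.0) = 3 + 3 = 6$, which is the value the exercises below should reproduce.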

for loop approach¶

Task: Write a function that uses a for loop to compute the dot product of w and x. Name the function dot_loop. Check that you get 6.0 for the w and x provided in the template.

In [ ]:
def dot_loop(w, x):
    return 0.0 # FIXME
dot_loop(w, x)
Out[ ]:
tensor(6.)

Here are some test cases that dot_loop should pass. You don't need to understand how this code works yet, but it would reward some study.

In [ ]:
test_cases = [
    ([0.], [500.], 0.0),
    ([1., 0.0], [50.0, .5], 50.0),
    ([-1.0, 1.0], [-1.0, 1.0], 2.0)
]
def run_dot_tests(f):
    for w, x, prod in test_cases:
        w = torch.tensor(w)
        x = torch.tensor(x)
        print(f"Testing dot_loop({w}, {x})")
        result = f(w, x)
        if not isinstance(result, torch.Tensor):
            result = torch.tensor(result)
        assert torch.isclose(
            result,
            torch.tensor(prod)
        )
    print("All tests passed")
run_dot_tests(dot_loop)
Testing dot_loop(tensor([0.]), tensor([500.]))
Testing dot_loop(tensor([1., 0.]), tensor([50.0000,  0.5000]))
Testing dot_loop(tensor([-1.,  1.]), tensor([-1.,  1.]))
All tests passed

Torch Elementwise Operations¶

But that's a lot of typing for a concept that we're going to use very frequently. To shorten it (and make it run way faster too!), we'll start taking advantage of some of Torch's built-in functionality.

First, we'll learn about elementwise operations (called pointwise operations in the PyTorch docs).

If you try to * two Python lists together, you get a TypeError (how do you multiply lists??). But in PyTorch (and NumPy, whose interface it closely follows), array operations happen element-by-element (sometimes called elementwise): to multiply two tensors that have the same shape, multiply each number in the first tensor by the corresponding number in the second tensor. The result is a new tensor of the same shape, holding all the elementwise products.
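
Here's a small illustration with a fresh pair of tensors (made-up values, separate from the w and x above):

a = torch.tensor([1., 2., 3.])
b = torch.tensor([10., 20., 30.])
a * b   # elementwise multiply: tensor([10., 40., 90.])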

Task: Predict what you'll get from running w * x. Then try it and compare with your prediction. (No need to write an explanation here.)

In [ ]:
# your code here

Torch Reduction Ops¶

Torch also provides reduction methods, so named because they reduce the number of elements in a Tensor.

One really useful reduction op is .sum. (I also frequently use .mean, .max, and .argmax).
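
For instance, on a small made-up tensor (separate from the x above):

t = torch.tensor([1., 2., 3., 4.])
t.sum()     # tensor(10.)
t.mean()    # tensor(2.5000)
t.max()     # tensor(4.)
t.argmax()  # tensor(3), the index of the largest element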

Task: Predict the output of running x.sum(). Then try it and compare with your prediction.

You can also write that as torch.sum(x).

In [ ]:
# your code here

Building a dot product out of Torch ops¶

Now make a new version of dot_loop, called dot_ops, that uses an elementwise op to multiply corresponding numbers and a reduction op to sum the result. Check that the result still passes the tests. (Don't use @ yet.)

In [ ]:
def dot_ops(w, x):
    return 0.0 # FIXME
dot_ops(w, x)
Out[ ]:
tensor(6.)
In [ ]:
run_dot_tests(dot_ops)
Testing dot_ops(tensor([0.]), tensor([500.]))
Testing dot_ops(tensor([1., 0.]), tensor([50.0000,  0.5000]))
Testing dot_ops(tensor([-1.,  1.]), tensor([-1.,  1.]))
All tests passed

Finally, since dot is such an important operation, PyTorch provides it directly:

torch.dot(w, x)

Python (since version 3.5) also provides a "matmul" operator, @, which does the same thing for 1-D tensors like these:

w @ x
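
Both of these give tensor(6.) for the w and x above, the same value your dot_loop and dot_ops should compute.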

To apply this knowledge, let's try writing a slightly more complex function: a linear transformation layer.

Linear Layer¶

The most basic component of a neural network (and many other machine learning methods) is a linear transformation layer. Going back to our y = w*x + b example, the w*x + b is the linear transformation: given an x, dot it with some weights and add a bias.
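
With the w and x from earlier and a bias of $-4$, that's $2 \cdot 1.5 + (-1) \cdot (-3.0) + (-4) = 6 - 4 = 2$, which matches the expected output below.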

Task: Write a function that performs a linear transformation of a vector x. Use PyTorch's built-in functionality for dot products (torch.dot() or @).

In [ ]:
def linear(weights, bias, x):
    return 0.0 # FIXME
linear(w, -4.0, x)
Out[ ]:
tensor(2.)
In [ ]:
assert torch.isclose(linear(w, -4.0, x), torch.tensor(2.0))
assert torch.isclose(linear(w, 0.0, x), torch.tensor(6.0))

Linear layer, Module-style¶

Notice that linear's job is to transform x, but it needed 3 arguments, not just 1. It would be convenient to view the linear function as simply a function of x, with weights and bias being internal details.

One way to do this is to make a Linear class that has these as parameters.

Task: Fill in the blanks in the template code to do this. (This is roughly how PyTorch's implementation works).

In [ ]:
class Linear:
    def __init__(self, weights, bias):
        self.weights = ...
        self.bias = ...
        
    def forward(self, x):
        return ...

layer = Linear(weights=w, bias=1.0)
layer.forward(x)
Out[ ]:
tensor(7.)

Note: PyTorch's Linear layer gives a vector-valued output, so to make the dimensionality work out, it actually computes x @ weight.T + bias, where .T takes the transpose of the weight matrix.
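
As a quick check of that note, here's a small sketch (nn.Linear initializes its weight and bias randomly, so we just compare its output against the manual computation):

torch_layer = nn.Linear(in_features=2, out_features=3)   # weight has shape (3, 2), bias has shape (3,)
manual = x @ torch_layer.weight.T + torch_layer.bias     # x is the 2-element tensor from above
assert torch.allclose(torch_layer(x), manual)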

Mean Squared Error¶

Now let's apply what you just learned about elementwise operations on PyTorch tensors to another very common building block in machine learning: measuring error.

Once we make some predictions, we usually want to be able to measure how good the predictions were. For regression tasks, i.e., tasks where we're predicting numbers, one very common measure is the mean squared error. Here's an algorithm to compute it:

  • compute resid as true (y_true) minus predicted (y_pred).
  • compute squared_error by squaring each number in resid.
  • compute mean_squared_error by taking the mean of squared_error.

Technical note: This process implements the mean squared error loss function. That is a function that is given some true values (call them $y_1$ through $y_n$) and some predicted values (call them $\hat{y}_1$ through $\hat{y}_n$) and returns $$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2.$$
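
To make the formula concrete, here's a tiny worked example in plain Python with made-up numbers (not the y_true and y_pred used below):

true = [1.0, 2.0]
pred = [1.0, 4.0]
resid = [t - p for t, p in zip(true, pred)]   # [0.0, -2.0]
squared = [r ** 2 for r in resid]             # [0.0, 4.0]
mse = sum(squared) / len(squared)             # 2.0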

Generally you'd get the predicted values, y_pred, by calling a function that implements a model (like linear.forward() above). But to focus our attention on the error computation, we've provided sample values for y_true and y_pred that you can just use as-is.

In [ ]:
y_true = torch.tensor([3.14, 1.59, 2.65])
y_pred = torch.tensor([2.71, 8.28, 1.83])

Task:

  1. Implement each line of the above algorithm in PyTorch code.
    • Use separate cells so you can check the results along the way. For example, the first cell should have two lines, the first to assign (resid = ...) and the second to show the result (resid).
    • You should not need to write any loops.
    • Try using both squared_error.mean() and torch.mean(squared_error).
  2. Now, write the entire computation in a single succinct expression (i.e., without having to create intermediate variables for resid and squared_error). Check that you get the same result.

Notes:

  • Recall that Python's exponentiation operator is **.
  • PyTorch tensors also have a .pow() method, so you can also write .pow(2); you might see this in other people's code. (A quick example of both forms is just below.)
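
For example, on a small made-up tensor:

t = torch.tensor([3., -2.])
t ** 2      # tensor([9., 4.])
t.pow(2)    # tensor([9., 4.])
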
In [ ]:
resid = ...
resid
Out[ ]:
tensor([ 0.4300, -6.6900,  0.8200])
In [ ]:
squared_error = ...
squared_error
Out[ ]:
tensor([ 0.1849, 44.7561,  0.6724])
In [ ]:
# your code here to compute MSE from squared_error
Out[ ]:
tensor(15.2045)
In [ ]:
# your code here to do the whole thing in one line
Out[ ]:
tensor(15.2045)

Multidimensional arrays¶

NumPy / PyTorch arrays can have more than one axis. Think of these like lists of lists (of lists of lists of ...).
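
For example, a list of lists becomes a tensor with two axes:

m = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])
m.shape   # torch.Size([2, 3]): 2 rows, 3 columns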

In [ ]:
torch.manual_seed(1234)
x = torch.rand(4, 2)
x
Out[ ]:
tensor([[0.0290, 0.4019],
        [0.2598, 0.3666],
        [0.0583, 0.7006],
        [0.0518, 0.4681]])

Task: Use indexing to get out the top-left number, the top-right number, the bottom-left, and the bottom-right. One of them is done for you; study how that works.

In [ ]:
bottom_right = x[-1, -1]
assert bottom_right == x[3, 1] and bottom_right == x[3][-1]
bottom_right
Out[ ]:
tensor(0.4681)
In [ ]:
top_left = ...
top_right = ...
bottom_left = ...
print(f"Top-Left: {top_left:.2f}, Top-Right: {top_right:.2f}, Bottom-Left: {bottom_left:.2f}")
Top-Left: 0.03, Top-Right: 0.40, Bottom-Left: 0.05

We can apply a reduction operation "along" an axis, e.g.,

In [ ]:
x.sum(axis=1)
Out[ ]:
tensor([0.4309, 0.6265, 0.7589, 0.5199])

Task: Is summing on axis=1 summing each row or summing each column?

Your answer here

There's a general rule for what happens when you reduce along an axis: that axis "goes away". A short illustration of the rule is below; then, to think about the rule and its implications, try the exercise that follows:
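
As a sketch with a fresh tensor (separate from the z in the exercise):

ones = torch.ones(2, 3)
ones.sum(axis=0).shape   # torch.Size([3]): the size-2 axis went away
ones.sum(axis=1).shape   # torch.Size([2]): the size-3 axis went away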

Task: Predict what the .shape of each of the following operations will be. Then try each one and check if you were correct. For example, for the first operation, z.mean(axis=0), the shape should be (6, 7); check that it's true and make sure you can explain why (but you don't have to write up that explanation here).

In [ ]:
z = torch.rand(5, 6, 7)
print(f"z.shape is {z.shape}")

print(f"z.mean(axis=0).shape is {z.mean(axis=0).shape}")
# z.mean(axis=1).shape
# z.mean(axis=2).shape
# z.mean(axis=-1).shape
# Note: indexing is kind of like a reduction operation
# z[0]  
# z[1].mean(axis=1).shape
z.shape is torch.Size([5, 6, 7])
z.mean(axis=0).shape is torch.Size([6, 7])

Finally, matrix multiplication (the @ / torch.matmul operation we saw above) is a reduction that happens between two arrays / tensors: it reduces "along" the shared middle axis (the last axis of the first tensor, which must line up with a matching axis of the second), and that axis disappears from the result.
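
For example, in the plain 2-D case (shapes chosen just for illustration):

a = torch.rand(3, 4)
b = torch.rand(4, 5)
(a @ b).shape   # torch.Size([3, 5]): the shared size-4 axis got summed away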

Task: Try to find several different shapes that make the following code succeed. Nothing to "turn in" here, though.

In [ ]:
shape1 = (2, 3, 4) # try to find examples with 1, 2, or 3 different numbers here.
shape2 = (4, 2) # try to find examples with 1, 2, or 3 different numbers here.
x = torch.rand(shape1)
y = torch.rand(shape2)
(x @ y).shape
Out[ ]:
torch.Size([2, 3, 2])

Appendix¶

For comparison and future reference, here's PyTorch's built-in MSE loss. There are two ways to access it: the functional style...

In [ ]:
F.mse_loss(y_pred, y_true)
Out[ ]:
tensor(15.2045)

and the module style:

In [ ]:
loss_fn = nn.MSELoss()
loss_fn(y_pred, y_true)
Out[ ]:
tensor(15.2045)