Lab 4: Nonlinear Regression

The following template is provided in your Portfolio repositories under narrative/lab04-nn-regression.ipynb.

Common gotchas on this lab:

  • Remember to instantiate your modules (e.g., nn.MSELoss()), and use PyTorch losses, not mean_squared_error from sklearn.
  • For Step 1, as long as either Train or Valid loss is below the threshold you’re okay.
  • Name your model model so that the plot code works (sorry).
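The first gotcha trips people up often, so here is a minimal sketch of the difference (the tensor values are made up for illustration):

```python
import torch
import torch.nn as nn

# Instantiate the module (note the parentheses), then call the instance.
loss_func = nn.MSELoss()   # correct: an instance you can call
# loss_func = nn.MSELoss   # wrong: this is the class itself, not a loss function

pred = torch.tensor([0.0, 2.0])
target = torch.tensor([0.0, 0.0])
loss_func(pred, target)    # tensor(2.)
```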


Task: Use nn.Sequential to add layers to a model; use a nonlinearity between layers to increase the model’s capacity.

Setup

from fastai.vision.all import *

This function will make a DataLoaders object out of an array dataset.

def make_dataloaders(x, y_true, splitter, batch_size):
    data = L(zip(x, y_true))
    train_indices, valid_indices = splitter(data)
    return DataLoaders(
        DataLoader(data[train_indices], batch_size=batch_size, shuffle=True),
        DataLoader(data[valid_indices], batch_size=batch_size)
    )   

Here are utility functions to plot the first axis of a dataset and a model’s predictions.

def plot_data(x, y): plt.scatter(x[:, 0], y[:, 0], s=.5, color='#bbbbbb')
def plot_model(x, model):
    x = x.sort(dim=0).values
    y_pred = model(x).detach()
    plt.plot(x[:, 0], y_pred[:, 0], 'r')

The following Callback can be added to your Learner to plot the data and model after each epoch:

learner = Learner(
    ...
    cbs=[ShowPredictions(), ShowGraphCallback()],
    ...
)
# Inspired by https://gist.github.com/oguiza/c7559da6de0e036f01d7dee15e2f15e4
class ShowPredictions(Callback):
    def __init__(self): self.graph_fig = None # keep a reference to a figure object to update
    def before_fit(self):
        self.run = not hasattr(self.learn, 'lr_finder') and not hasattr(self, 'gather_preds')
    def after_fit(self): plt.close(self.graph_fig)
    def after_epoch(self):
        if self.graph_fig is None:
            self.graph_fig, self.graph_ax = plt.subplots(1)
            self.graph_out = display(self.graph_ax.figure, display_id=True)
        plt.sca(self.graph_ax)
        self.graph_ax.clear()
        # Plot code. Replace this if needed:
        plot_data(x, y_true)
        plot_model(x, model)
        # Update the graph.
        self.graph_out.update(self.graph_ax.figure)

Task

Most applications of neural net models work in very high dimensions (e.g., each individual pixel in an image!) so it’s hard to visualize what the model is actually learning. Here, we’ll revisit the simple linear model that we looked at in Fundamentals 006 and 009, which learned to predict a single continuous outcome variable y from a single continuous input feature x. So we can visualize the network’s behavior just like any other univariate function: by plotting y vs x.

But this time the data isn’t just a straight line; it’s a fancy function of x.

num_points = 5000

set_seed(40)
x = torch.rand(num_points, 1)
noise = torch.rand_like(x) * 1.
y_true = .5 * (x*6).sin() + x + (x - .75) * noise
# standardize y, just to make it well behaved.
y_true -= y_true.mean()
y_true /= y_true.std()

plot_data(x, y_true)

(figure: scatter plot of the data)

In 006 and 009, we dealt with models that could only ever make straight lines. They couldn’t even make a curve like 3 * x**2 + 2*x + 1, let alone that one!

But you may remember from your math or stats studies that a curve like that is actually linear if you transform your data, e.g., using z = [x, x**2] as the input; then the model is 3 * z[1] + 2 * z[0] + 1, which is linear in z.

So if we transform our data before giving it to the linear model, we can actually get interesting functions from a linear model. But how do we transform the data?

The classic approach is to specify what transformation to make. e.g., in polynomial regression we put in a bunch of powers of x (x**2, x**3, …, x**10, …), but that gets numerically unstable with high powers. There are other “basis functions” that are better behaved, like splines.
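To make the “transform, then fit linearly” idea concrete, here is a small sketch (plain NumPy, separate from the lab code) that recovers the coefficients of 3 * x**2 + 2*x + 1 by ordinary least squares on the transformed features [1, x, x**2]:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3 * x**2 + 2 * x + 1          # nonlinear in x...

# ...but linear in the transformed features z = [1, x, x**2]
Z = np.stack([np.ones_like(x), x, x**2], axis=1)
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(coef)  # ≈ [1. 2. 3.]
```

The model never saw anything but a linear fit; all of the curvature came from the feature transformation.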

But neural nets take a different approach: they learn the transformation, based on whatever is needed to accomplish their objective.

Instructions:

  1. Fit a line to this data (minimizing the MSE). Evaluate the MSE. By eye, how well does it fit?
  2. Add a layer: Use nn.Sequential to put two nn.Linear layers back to back. Use 500 dimensions as the hidden dimension (the out_features of the first and the in_features of the second). Evaluate the MSE. How well does it fit?
  3. Add a nonlinearity: Add a nn.ReLU between the two linear layers. Evaluate the MSE. How well does it fit?

Details and tips are given inline below.

Solution

Make a DataLoaders for this data. This step has been done for you.

We increased the dataset size and the batch size to make the learning better-behaved. Once you get this to work, you might see if you can deal with a smaller batch size or less data overall.

splitter = RandomSplitter(valid_pct=0.2, seed=42)
batch_size = 100
dataloaders = make_dataloaders(x, y_true, splitter, batch_size=batch_size)

Step 1: Fit a Line

Fit a line to this data (minimizing the MSE).

  • Use a nn.Linear module as your model
  • Use Learner with opt_func=SGD, as you did in 009.
  • Pass cbs=[ShowPredictions(), ShowGraphCallback()] to the Learner to show the training progress.

Tune the learning rate and number of epochs until you reliably get an MSE below 0.76.
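If you want to sanity-check the mechanics outside of fastai, the same fit can be sketched as a bare PyTorch loop. This is just a sketch on made-up straight-line data, not the expected solution (the lab asks you to use Learner as described above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(500, 1)
y = 2 * x + 1 + 0.1 * torch.randn_like(x)  # noisy line, for illustration only

model = nn.Linear(1, 1)                    # a line: y = w*x + b
loss_func = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(200):
    loss = loss_func(model(x), y)          # forward pass + loss
    opt.zero_grad()
    loss.backward()                        # compute gradients
    opt.step()                             # gradient descent step

print(loss.item())  # close to the noise floor (~0.01) on this toy data
```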

# To determine the ideal MSE of a linear model, one approach is:
# from sklearn.linear_model import LinearRegression
# from sklearn.metrics import mean_squared_error
# mean_squared_error(
#     to_np(y_true),
#     LinearRegression().fit(x, y_true).predict(to_np(x))
# )
0.7543805
# your code here
epoch train_loss valid_loss mae time
0 0.950252 0.885827 0.719921 00:00
1 0.859075 0.808779 0.712994 00:00
2 0.807587 0.781937 0.713865 00:00
3 0.783546 0.768974 0.715225 00:00
4 0.768208 0.763778 0.716662 00:00
5 0.760159 0.762300 0.718045 00:00
6 0.753892 0.761404 0.719000 00:00
7 0.758386 0.763348 0.720821 00:00
8 0.751313 0.760574 0.719598 00:00
9 0.748924 0.760591 0.719648 00:00

(figures: model predictions over the data, and the training/validation loss graph)

Evaluate the MSE. By eye, how well does it fit?

Your narrative response here

Step 2: Add a Layer

Use nn.Sequential to put two nn.Linear layers back to back.

Use 500 dimensions as the hidden dimension (the out_features of the first and the in_features of the second).

You may notice that the training is much less stable, and is rather sensitive to initializations (run the same thing multiple times and see that it will sometimes converge much better than other times). To improve training, try the following:

  • Instead of learner.fit, use learner.fit_one_cycle. This starts the learning rate low, gradually ramps it up, then ramps it back down. It also enables momentum, which tends to make gradient descent both faster and more stable. Can fit_one_cycle handle a larger learning rate (lr_max=XXX) than fit (lr=XXX)?
  • Instead of opt_func=SGD, omit the opt_func parameter so it uses the default “Adam” optimizer. Adam adapts the effective learning rate for every parameter based on how big its gradients have been recently. As Sebastian Ruder puts it: “Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.” Does changing to Adam have much effect here?
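The “start low, ramp up, ramp back down” shape behind fit_one_cycle can be sketched as a simple schedule. This is only the shape of the idea, not fastai’s actual implementation (which uses cosine annealing and also cycles momentum); the function name and parameters here are made up for illustration:

```python
import numpy as np

def one_cycle_lr(step, total_steps, lr_max, pct_start=0.25, lr_start_frac=0.1):
    """Linear warmup to lr_max, then linear decay back toward zero (simplified sketch)."""
    warmup = int(total_steps * pct_start)
    lr_start = lr_max * lr_start_frac
    if step < warmup:
        # ramp up from a low starting rate to the peak
        return lr_start + (lr_max - lr_start) * step / warmup
    # ramp back down toward zero for the rest of training
    return lr_max * (1 - (step - warmup) / (total_steps - warmup))

lrs = [one_cycle_lr(s, 100, lr_max=0.1) for s in range(100)]
# starts low, peaks a quarter of the way through, decays toward zero
```

The high middle portion lets the optimizer take large steps early (and acts as a regularizer); the low start and finish keep the beginning and end of training stable.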
model = nn.Sequential(
    ...
    ...
)
...
epoch train_loss valid_loss mae time
0 0.920632 0.789066 0.750009 00:00
1 0.825186 0.832955 0.750558 00:00
2 0.965909 1.005041 0.817890 00:00
3 0.863089 0.779145 0.721426 00:00
4 0.829559 0.824151 0.726931 00:00
5 0.809385 0.768478 0.728267 00:00
6 0.801617 0.829258 0.718980 00:00
7 0.783818 0.797611 0.752593 00:00
8 0.784719 0.797029 0.719279 00:00
9 0.795053 0.774839 0.742119 00:00
10 0.783517 0.781589 0.717142 00:00
11 0.771781 0.792023 0.723107 00:00
12 0.767133 0.764769 0.728470 00:00
13 0.768351 0.760104 0.720836 00:00
14 0.765217 0.760157 0.723078 00:00
15 0.763762 0.764044 0.716706 00:00
16 0.762276 0.760031 0.720895 00:00
17 0.757698 0.761009 0.719210 00:00
18 0.756880 0.762340 0.718883 00:00
19 0.754772 0.761380 0.719282 00:00

(figures: model predictions over the data, and the training/validation loss graph)

Evaluate the MSE. By eye, how well does it fit?

Your narrative response here

Step 3: Add a nonlinearity

Add a nn.ReLU between the two linear layers.

  • Definitely use fit_one_cycle here!
  • You will probably need more epochs to fit this model.
  • Try several different seeds (via set_seed) here to ensure that your results aren’t a fluke.
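Structurally, the only change from Step 2 is the nn.ReLU() between the two linear layers. As a quick shape check of that architecture (hidden size 500, as specified above; the batch of 8 random inputs is just for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 500),   # 1 input feature -> 500 hidden units
    nn.ReLU(),           # the nonlinearity that gives the model its capacity
    nn.Linear(500, 1),   # 500 hidden units -> 1 output
)
x = torch.rand(8, 1)
model(x).shape  # torch.Size([8, 1])
```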
set_seed(...)
...

epoch train_loss valid_loss mae time
0 0.693079 0.550056 0.593399 00:00
1 0.435455 0.226713 0.368506 00:00
2 0.318465 0.289056 0.387823 00:00
3 0.279852 0.261593 0.415863 00:00
4 0.232801 0.195240 0.315986 00:00
5 0.211104 0.184315 0.320228 00:00
6 0.208310 0.211015 0.375118 00:00
7 0.199087 0.211851 0.360161 00:00
8 0.207838 0.207093 0.346923 00:00
9 0.201181 0.173332 0.298741 00:00
10 0.202531 0.222609 0.324228 00:00
11 0.198644 0.256327 0.406994 00:00
12 0.202352 0.198207 0.331616 00:00
13 0.194324 0.255945 0.427632 00:00
14 0.196672 0.173948 0.308988 00:00
15 0.189098 0.215841 0.320294 00:00
16 0.185658 0.173194 0.298018 00:00
17 0.180072 0.177394 0.303801 00:00
18 0.178259 0.181867 0.302603 00:00
19 0.175262 0.174144 0.299257 00:00
20 0.173036 0.180167 0.323992 00:00
21 0.171939 0.169455 0.295680 00:00
22 0.170063 0.169196 0.297448 00:00
23 0.169878 0.169218 0.297110 00:00
24 0.167490 0.168323 0.293499 00:00
25 0.166276 0.169995 0.296735 00:00
26 0.164923 0.168415 0.293780 00:00
27 0.164030 0.168199 0.293551 00:00
28 0.163613 0.168040 0.293115 00:00
29 0.163334 0.167990 0.293041 00:00

(figures: model predictions over the data, and the training/validation loss graph)

Evaluate the MSE. How well does it fit?

Your narrative response here

Analysis

Despite having a hidden layer like the final model, the second model never gave us anything more than a straight line. Why not?

Your narrative response here

Watch the model plot in Step 3 as the model fits (use at least 30 epochs to be able to watch this clearly). What do you notice about the plot during three regimes of training:

  1. Within the first epoch (right at the start of training)
  2. Mid-training
  3. In the last epoch or two

Your narrative response here

Extension (optional)

What effect does the size of the hidden layer have on the quality of the fit?

What effect does the choice of nonlinearity (“activation function”) have? Try a few others: Tanh, Sigmoid, LeakyReLU, PReLU, … Early research in neural nets used smooth functions like Tanh and Sigmoid almost exclusively; do you think the ReLU was a good idea?

Ken Arnold
Assistant Professor of Computer Science