Nonlinear Regression
Task: Use nn.Sequential
to add layers to a model; use a nonlinearity between layers to increase the model's capacity; train using Stochastic Gradient Descent (SGD).
Setup
from fastai.vision.all import *
This function will make a DataLoaders
object out of an array dataset.
def make_dataloaders(x, y_true, splitter, batch_size):
data = L(zip(x, y_true))
train_indices, valid_indices = splitter(data)
return DataLoaders(
DataLoader(data[train_indices], batch_size=batch_size, shuffle=True),
DataLoader(data[valid_indices], batch_size=batch_size)
)
Here are utility functions to plot the first axis of a dataset and a model's predictions.
def plot_data(x, y): plt.scatter(x[:, 0], y[:, 0], s=.5, color='#bbbbbb')
def plot_model(x, model):
x = x.sort(dim=0).values
y_pred = model(x).detach()
plt.plot(x[:, 0], y_pred[:, 0], 'r')
The following Callback
can be added to your Learner
to plot the data and model after each epoch:
learner = Learner(
    ...
    cbs=[ShowPredictions(), ShowGraphCallback()],
    ...
)
# Inspired by https://gist.github.com/oguiza/c7559da6de0e036f01d7dee15e2f15e4
class ShowPredictions(Callback):
def __init__(self): self.graph_fig = None # keep a reference to a figure object to update
def before_fit(self):
self.run = not hasattr(self.learn, 'lr_finder') and not hasattr(self, 'gather_preds')
def after_fit(self): plt.close(self.graph_fig)
def after_epoch(self):
if self.graph_fig is None:
self.graph_fig, self.graph_ax = plt.subplots(1)
self.graph_out = display(self.graph_ax.figure, display_id=True)
plt.sca(self.graph_ax)
self.graph_ax.clear()
# Plot code. Replace this if needed:
plot_data(x, y_true)
plot_model(x, model)
# Update the graph.
self.graph_out.update(self.graph_ax.figure)
Task
Most applications of neural net models work in very high dimensions (e.g., each individual pixel in an image is its own dimension!), so it's hard to visualize what the model is actually learning. Here, we'll revisit the simple linear model that we looked at in Fundamentals 006
and 009
, which learned to predict a single continuous outcome variable y
from a single continuous input feature x
. So we can visualize the network's behavior just like any other univariate function: by plotting y vs x.
But this time the data isn't just a straight line; it's a fancy function of x
.
num_points = 5000
set_seed(40)
x = torch.rand(num_points, 1)
noise = torch.rand_like(x) * 1.
y_true = .5 * (x*6).sin() + x + (x - .75) * noise
# standardize y, just to make it well behaved.
y_true -= y_true.mean()
y_true /= y_true.std()
plot_data(x, y_true)
In previous notebooks, we dealt with models that could only ever make straight lines. They couldn't even make a curve like 3 * x**2 + 2*x + 1
, let alone that one!
But you may remember from your math or stats studies that a curve like that is actually linear if you transform your data, e.g., using z = [x, x**2]
as the input; then the model is 3 * z[1] + 2 * z[0] + 1
, which is linear in z
.
So if we transform our data before giving it to the linear model, we can actually get interesting functions from a linear model. But how do we transform the data?
The classic approach is to specify what transformation to make; e.g., in polynomial regression we put in a bunch of powers of x
(x**2
, x**3
, ..., x**10
, ...), but that gets numerically unstable with high powers. There are other "basis functions" that are better behaved, like splines.
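To make the "linear in transformed features" idea concrete, here's a minimal sketch (not part of the exercise) of hand-specifying the transformation: stack a few powers of x as extra input columns and fit an ordinary linear model on them.
# Hand-specified feature transformation: the model is linear in z,
# but its predictions trace a cubic curve when plotted against x.
# (Illustrative sketch only; the exercise below learns the transformation instead.)
z = torch.cat([x, x**2, x**3], dim=1)   # shape: (num_points, 3)
poly_model = nn.Linear(3, 1)            # train this just like any other linear model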
But neural nets take a different approach: they learn whatever transformation is needed to accomplish their objective.
Instructions:
- Fit a line to this data (minimizing the MSE). Evaluate the MSE. By eye, how well does it fit?
- Add a layer: Use nn.Sequential to put two nn.Linear layers back to back. Use 50 dimensions as the hidden dimension (the out_features of the first and the in_features of the second). Evaluate the MSE. How well does it fit?
- Add a nonlinearity: Add a nn.ReLU between the two linear layers. Evaluate the MSE. How well does it fit?
Details and tips are given inline below.
Solution
Make a DataLoaders
for this data. This step has been done for you.
We increased the dataset size and the batch size to make the learning better-behaved. Once you get this to work, you might see if you can deal with a smaller batch size or less data overall.
splitter = RandomSplitter(valid_pct=0.2, seed=42)
batch_size = 100
dataloaders = make_dataloaders(x, y_true, splitter, batch_size=batch_size)
Step 1: Fit a Line
Fit a line to this data (minimizing the MSE).
- Use a nn.Linear module as your model.
- Use Learner with opt_func=SGD, as you did in 009.
- Pass cbs=[ShowPredictions(), ShowGraphCallback()] to the Learner to show the training progress.
Tune the learning rate and number of epochs until you reliably get an MSE around 0.76.
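If you're not sure where to start, here is a minimal sketch of one possible setup; the loss and metric (MSELossFlat and mae) are assumptions chosen to match the tables below, and the learning rate and epoch count are placeholders to tune.
model = nn.Linear(1, 1)
learner = Learner(
    dataloaders, model,
    loss_func=MSELossFlat(),  # mean squared error
    opt_func=SGD,
    metrics=mae,
    cbs=[ShowPredictions(), ShowGraphCallback()],
)
learner.fit(10, lr=1e-1)      # placeholder values; tune these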
# your code here
# For some reason, the graph sometimes disappears after training is done. Show it again.
plt.figure()
plot_data(x, y_true)
plot_model(x, model)
epoch | train_loss | valid_loss | mae | time |
---|---|---|---|---|
0 | 0.950252 | 0.885827 | 0.719921 | 00:00 |
1 | 0.859075 | 0.808779 | 0.712994 | 00:00 |
2 | 0.807587 | 0.781937 | 0.713865 | 00:00 |
3 | 0.783546 | 0.768974 | 0.715225 | 00:00 |
4 | 0.768208 | 0.763778 | 0.716662 | 00:00 |
5 | 0.760159 | 0.762300 | 0.718045 | 00:00 |
6 | 0.753892 | 0.761404 | 0.719000 | 00:00 |
7 | 0.758386 | 0.763348 | 0.720821 | 00:00 |
8 | 0.751313 | 0.760574 | 0.719598 | 00:00 |
9 | 0.748924 | 0.760591 | 0.719648 | 00:00 |
Evaluate the MSE. By eye, how well does it fit?
Your narrative response here
# aside: is this the best possible line?
# To determine the ideal MSE of a linear model, one approach is:
# from sklearn.linear_model import LinearRegression
# from sklearn.metrics import mean_squared_error
# mean_squared_error(
# to_np(y_true),
# LinearRegression().fit(to_np(x), to_np(y_true)).predict(to_np(x))
# )
Step 2: Add a Layer
Use nn.Sequential
to put two nn.Linear
layers back to back.
Use 50 dimensions as the hidden dimension (the out_features
of the first and the in_features
of the second).
You may notice that the training is much less stable, and is rather sensitive to initializations (run the same thing multiple times and see that it will sometimes converge much better than other times). To improve training, try the following:
- Instead of learner.fit, use learner.fit_one_cycle. This starts the learning rate low, gradually ramps it up, then ramps it back down. It also enables momentum, which tends to make gradient descent both faster and more stable. Can fit_one_cycle handle a larger learning rate (lr_max=XXX) than fit (lr=XXX)?
- Instead of opt_func=SGD, omit the opt_func parameter so it uses the default "Adam" optimizer. Adam adapts the effective learning rate for every parameter based on how big its gradients have been recently. As Sebastian Ruder puts it: "Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface." Does changing to Adam have much effect here?
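As a starting point, here is a minimal sketch of the two-layer setup (hidden size 50; the Learner arguments mirror Step 1, minus opt_func so the default Adam is used; the hyperparameters are placeholders).
model = nn.Sequential(
    nn.Linear(1, 50),   # hidden dimension of 50
    nn.Linear(50, 1),
)
learner = Learner(dataloaders, model, loss_func=MSELossFlat(), metrics=mae,
                  cbs=[ShowPredictions(), ShowGraphCallback()])
learner.fit_one_cycle(20, lr_max=1e-1)  # placeholder values; tune these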
model = nn.Sequential(
...
...
)
...
plt.figure()
plot_data(x, y_true)
plot_model(x, model)
epoch | train_loss | valid_loss | mae | time |
---|---|---|---|---|
0 | 0.920632 | 0.789066 | 0.750009 | 00:00 |
1 | 0.825186 | 0.832955 | 0.750558 | 00:00 |
2 | 0.965909 | 1.005041 | 0.817890 | 00:00 |
3 | 0.863089 | 0.779145 | 0.721426 | 00:00 |
4 | 0.829559 | 0.824151 | 0.726931 | 00:00 |
5 | 0.809385 | 0.768478 | 0.728267 | 00:00 |
6 | 0.801617 | 0.829258 | 0.718980 | 00:00 |
7 | 0.783818 | 0.797611 | 0.752593 | 00:00 |
8 | 0.784719 | 0.797029 | 0.719279 | 00:00 |
9 | 0.795053 | 0.774839 | 0.742119 | 00:00 |
10 | 0.783517 | 0.781589 | 0.717142 | 00:00 |
11 | 0.771781 | 0.792023 | 0.723107 | 00:00 |
12 | 0.767133 | 0.764769 | 0.728470 | 00:00 |
13 | 0.768351 | 0.760104 | 0.720836 | 00:00 |
14 | 0.765217 | 0.760157 | 0.723078 | 00:00 |
15 | 0.763762 | 0.764044 | 0.716706 | 00:00 |
16 | 0.762276 | 0.760031 | 0.720895 | 00:00 |
17 | 0.757698 | 0.761009 | 0.719210 | 00:00 |
18 | 0.756880 | 0.762340 | 0.718883 | 00:00 |
19 | 0.754772 | 0.761380 | 0.719282 | 00:00 |
What does your function look like? Evaluate the MSE. By eye, how well does it fit? Was this worth it at all???
Your narrative response here
Step 3: Add a nonlinearity
Add a nn.ReLU
between the two linear layers.
- Definitely use fit_one_cycle here!
- You will probably need more epochs to fit this model.
- Try several different set_seed values here to ensure that your results aren't a fluke.
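A minimal sketch of the three-layer setup (same assumptions as in the earlier sketches; the seed and hyperparameters are placeholders).
set_seed(42)  # try several different seeds
model = nn.Sequential(
    nn.Linear(1, 50),
    nn.ReLU(),          # the nonlinearity between the two linear layers
    nn.Linear(50, 1),
)
learner = Learner(dataloaders, model, loss_func=MSELossFlat(), metrics=mae,
                  cbs=[ShowPredictions(), ShowGraphCallback()])
learner.fit_one_cycle(30, lr_max=1e-1)  # placeholder values; tune these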
set_seed(...)
...
plt.figure()
plot_data(x, y_true)
plot_model(x, model)
epoch | train_loss | valid_loss | mae | time |
---|---|---|---|---|
0 | 0.902750 | 0.806636 | 0.716210 | 00:00 |
1 | 0.780626 | 0.676869 | 0.687579 | 00:00 |
2 | 0.679084 | 0.533563 | 0.610950 | 00:00 |
3 | 0.577986 | 0.457662 | 0.562425 | 00:00 |
4 | 0.533933 | 0.457597 | 0.556661 | 00:00 |
5 | 0.491961 | 0.421448 | 0.521734 | 00:00 |
6 | 0.385747 | 0.247638 | 0.377729 | 00:00 |
7 | 0.295972 | 0.208983 | 0.347325 | 00:00 |
8 | 0.252144 | 0.223541 | 0.348319 | 00:00 |
9 | 0.236473 | 0.206575 | 0.338506 | 00:00 |
10 | 0.224881 | 0.271913 | 0.392909 | 00:00 |
11 | 0.217470 | 0.234632 | 0.350006 | 00:00 |
12 | 0.227044 | 0.247618 | 0.387190 | 00:00 |
13 | 0.225236 | 0.220354 | 0.359031 | 00:00 |
14 | 0.218467 | 0.196625 | 0.331336 | 00:00 |
15 | 0.208030 | 0.223110 | 0.379284 | 00:00 |
16 | 0.208443 | 0.192583 | 0.325148 | 00:00 |
17 | 0.203234 | 0.195509 | 0.323822 | 00:00 |
18 | 0.196995 | 0.192162 | 0.331088 | 00:00 |
19 | 0.194206 | 0.186345 | 0.320385 | 00:00 |
20 | 0.192301 | 0.186750 | 0.326519 | 00:00 |
21 | 0.187593 | 0.187173 | 0.327371 | 00:00 |
22 | 0.186603 | 0.184201 | 0.314130 | 00:00 |
23 | 0.184588 | 0.191887 | 0.319849 | 00:00 |
24 | 0.183695 | 0.183230 | 0.315446 | 00:00 |
25 | 0.180800 | 0.182556 | 0.312494 | 00:00 |
26 | 0.178944 | 0.181736 | 0.314476 | 00:00 |
27 | 0.177945 | 0.180996 | 0.311455 | 00:00 |
28 | 0.177578 | 0.181027 | 0.310830 | 00:00 |
29 | 0.176955 | 0.180832 | 0.310966 | 00:00 |
Evaluate the MSE. How well does it fit?
your narrative response here
Analysis
What did it learn? Let's look. First we'll look at the weights in the last layer. The ones that have the largest magnitudes (hence .abs
) might suggest which features are most important. (This isn't exactly true if the features are on different scales; a large weight on a feature with small values might not actually be very important. For the examples below, I fit the model above with some weight decay enabled (wd=0.1
), so the features should be on more comparable scales.)
# Get the last layer, extract its weights, take the absolute value, and get the top 10.
most_important_features = model[-1].weight.abs().topk(10)
most_important_features
torch.return_types.topk( values=tensor([[4.2692, 4.1945, 2.4745, 1.6662, 1.4536, 0.6439, 0.6395, 0.6335, 0.6302, 0.3595]], grad_fn=<TopkBackward0>), indices=tensor([[ 3, 14, 8, 27, 23, 16, 47, 32, 9, 37]]))
Now we'll look at what those features actually look like. We'll just pass all of the data through the first layer (and the ReLU) to get the data with all features.
# Pass all the data through the first linear and ReLU layers of the model.
# We'll sort the data for convenience in plotting.
x_sorted = x.sort(dim=0).values
# Now we'll get the first layer and ReLU, and pass the data through them.
with torch.no_grad():
first_layer_activations_ = model[1](model[0](x_sorted))
first_layer_activations_.shape
torch.Size([5000, 50])
Finally, we'll plot those features. Since the data is one-dimensional, we can just plot each feature as a line. I'll include a legend giving each feature's weight.
plt.plot(x_sorted, first_layer_activations_[:, most_important_features.indices.flatten()], label=[f'{w:.2f}' for w in most_important_features.values.flatten()])
plt.legend();
Or we could say (read the plot top to bottom):
fig, axs = plt.subplots(nrows=11, ncols=1, figsize=(10, 10))
axs[0].plot(x_sorted, model(x_sorted).detach(), 'r')
axs[0].set_ylabel('y_pred = ', rotation='horizontal', ha='right')
for ax_idx, act_idx in enumerate(most_important_features.indices.flatten()):
ax = axs[ax_idx + 1]
ax.plot(x_sorted, first_layer_activations_[:, act_idx])
op = '\u2248' if ax_idx == 0 else '+'
ax.set_ylabel(f"{op} {most_important_features.values.flatten()[ax_idx]:.2f} * ", rotation='horizontal', ha='right')
Notice how the bendy curve is a linear function of those features!
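If you want to verify that claim numerically (a quick sanity check, not part of the exercise): the model's output should match the hidden activations multiplied by the last layer's weights, plus its bias.
with torch.no_grad():
    # Reconstruct the prediction by hand from all 50 hidden features.
    manual_pred = first_layer_activations_ @ model[-1].weight.T + model[-1].bias
    print(torch.allclose(manual_pred, model(x_sorted)))  # expect: True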
Despite having a hidden layer like the final model, the second model never gave us anything more than a straight line. Why not?
Hint
- The first layer's output is act1 = x W1 + b1, where W1.shape is (1, 50) and b1.shape is (50,).
- The second layer's output is y = act1 W2 + b2, where W2.shape is (50, 1) and b2.shape is (1,).
- Write y as a function of x. Notice that you can write it in the form y = x Q + r, where Q and r are determined by W1, W2, b1, and b2. This means that the combination of two linear layers is just a linear layer.
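If you'd like to check that algebra numerically, here's a small sketch using a fresh (untrained) pair of linear layers rather than the trained model; it builds Q and r from W1, W2, b1, and b2 and compares against the composed layers.
lin1, lin2 = nn.Linear(1, 50), nn.Linear(50, 1)
with torch.no_grad():
    W1, b1 = lin1.weight.T, lin1.bias   # shapes (1, 50) and (50,)
    W2, b2 = lin2.weight.T, lin2.bias   # shapes (50, 1) and (1,)
    Q = W1 @ W2                         # shape (1, 1)
    r = b1 @ W2 + b2                    # shape (1,)
    # Composing the two linear layers is the same as the single linear map x Q + r.
    print(torch.allclose(lin2(lin1(x)), x @ Q + r, atol=1e-6))  # expect: True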
your narrative response here
Watch the model plot in Step 3 as the model fits (use at least 30 epochs to be able to watch this clearly). What do you notice about the plot during three regimes of training:
- Within the first epoch (right at the start of training)
- Mid-training
- In the last epoch or two
your narrative response here
Extension (optional)
What effect does the size of the hidden layer have on the quality of the fit?
What effect does the choice of nonlinearity ("activation function") have? Try a few others: Tanh
, Sigmoid
, LeakyReLU
, PReLU
, ... Early research in neural nets used smooth functions like Tanh
and Sigmoid
almost exclusively; do you think the ReLU
was a good idea?
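One quick way to run that comparison is sketched below (the hidden size, seed, and hyperparameters are placeholders, as in the earlier sketches).
for Act in [nn.ReLU, nn.Tanh, nn.Sigmoid, nn.LeakyReLU, nn.PReLU]:
    set_seed(42)  # same seed each time, so the comparison is fair-ish
    model = nn.Sequential(nn.Linear(1, 50), Act(), nn.Linear(50, 1))
    learner = Learner(dataloaders, model, loss_func=MSELossFlat(), metrics=mae)
    learner.fit_one_cycle(30, lr_max=1e-1)
    print(Act.__name__, learner.validate())  # [valid_loss, mae]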