Computing

Ken Arnold

What is Neural Computing

Key Questions

  • How do neural nets compute? (How does that differ from traditional programming?)
  • What are the “data structures” of neural computing, and what efficient operations can we do with them?
  • How can we update parameters to optimize an objective function?

Learning Path

“I trained a neural net classifier from scratch.”

  1. Basic array/“tensor” operations in PyTorch
    • Code: array operations
    • Concepts: dot product, mean squared error
  2. Linear Regression “the hard way” (but with a black-box optimizer)
    • Code: Representing data as arrays
    • Concepts: loss function, forward pass, optimizer
  3. Logistic Regression “the hard way”
    • Concepts: softmax, cross-entropy loss
  4. Multi-layer Perceptron
    • Concepts: nonlinearity (ReLU), initialization
  5. Gradient Descent
    • Concepts: backpropagation, training loop
  6. Data Loaders
    • Concepts: batching, shuffling

Foundations

Jupyter Notebooks

  • notebook = prose + code + output
  • interfaces for notebooks: Jupyter (classic and Lab), VS Code, Kaggle, Google Colab (view-only: github, nbviewer)
  • cell types
    • Markdown (GitHub Docs, spec)
    • Code
      • Each code cell feeds its input to a hidden Python REPL (“Shell” in Thonny)
        • Possible to run code out of order
        • Changing something doesn’t make dependent code re-run!
      • Outputs: anything explicitly display()ed, print()ed, or plot()ted, plus the result of the last expression in the cell

Model Training and Evaluation

  • Outline of notebooks
    1. Load the data
      1. Download the dataset.
      2. Set up the dataloaders (which handle the train-validation split, batching, and resizing)
    2. Train a model
      1. Get a foundation model (an EfficientNet in our case)
      2. Fine-tune it.
    3. Get the model’s predictions on an image.
  • Evaluating a model
    • Accuracy: correct or incorrect?
    • Loss:
      • partial credit
      • when the model is right, it should be confident
      • when it’s wrong, it shouldn’t be confident
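A minimal sketch (not from the course notebooks) of the difference: both hypothetical predictions below are correct, so accuracy counts them the same, but the loss gives more credit to the confident one.

import numpy as np

# Predicted probabilities for a 3-class problem; the correct class is class 0.
confident = np.array([0.9, 0.05, 0.05])
hesitant = np.array([0.4, 0.3, 0.3])

for probs in (confident, hesitant):
    correct = np.argmax(probs) == 0    # accuracy: both predictions count as correct
    loss = -np.log(probs[0])           # loss: much lower for the confident prediction
    print(correct, round(loss, 3))     # True 0.105, then True 0.916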

Markdown Tips

aka, things to make your work look more professional

  • Headings: space between # and the heading text
  • Multiple lines collapse into a single paragraph unless you:
    • Use a list (- abc)
    • Add a blank line between them
    • manually add space (advanced technique)
  • Use backticks when you’re talking about code (e.g., functions, variable names)

Random Seeds

import random
random.seed(1234)
print(random.random())
print(random.random())
random.seed(1234)
print(random.random())
print(random.random())
Which of these will happen?

  1. All four numbers will be different.
  2. The first two numbers will be the same, and the second two numbers will be the same.
  3. All four numbers will be the same.
import random
random.seed(1234)
print(random.random())
print(random.random())
0.9664535356921388
0.4407325991753527
random.seed(1234)
print(random.random())
print(random.random())
0.9664535356921388
0.4407325991753527

Logistic Regression

Code Example

# inputs:
# - x_train (num_samples, num_features)
# - y_train (num_samples, num_classes), one-hot encoded
import numpy as np
from scipy.special import softmax  # softmax with an axis argument

num_features_in = x_train.shape[1]
num_classes = y_train.shape[1]
W = np.random.randn(num_features_in, num_classes)
b = np.random.randn(num_classes)

scores = x_train @ W + b         # (num_samples, num_classes)
probs = softmax(scores, axis=1)  # each row sums to 1

# probability assigned to each sample's correct class
# (argmax recovers the integer label from the one-hot row)
probs_of_correct = probs[np.arange(len(y_train)), y_train.argmax(axis=1)]
loss = -np.log(probs_of_correct).mean()

Check-in question: what loss function is this?

Where do the “right” scores come from?

  • In linear regression, the training data gave us the right output for each example.
  • In classification, we only get the correct class label; the model has to learn from data what scores to assign.

Nonlinear Features

ReLU

Chop off the negative part of its input.

y = max(0, x)

(Gradient is 1 for positive inputs, 0 for negative inputs)
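A minimal PyTorch sketch of both facts (input values chosen arbitrarily):

import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = torch.relu(x)       # chops off the negative part: [0.0, 0.0, 0.5, 2.0]
y.sum().backward()      # gradients flow only through the positive inputs
print(x.grad)           # tensor([0., 0., 1., 1.])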

Why is ReLU Useful?

In 2D

Interactive Activity

ReLU interactive notebook: u04n00-relu.ipynb (preview or open in Colab)

Gradient Descent

(Stochastic) Gradient Descent algorithm

  1. Get some data
  2. Forward pass: compute model’s predictions
  3. Loss computation: compare predictions to targets
  4. Backward pass: compute gradient of loss with respect to model parameters
  5. Update: adjust model parameters in a direction that reduces loss (“optimization”)
  • Gradient descent: do this on the whole dataset
  • Stochastic Gradient Descent (SGD): do this on subsets (mini-batches) of the dataset

What it looks like in different libraries

PyTorch

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# ... (get a batch of x and y)
optimizer.zero_grad()        # clear gradients from the previous step
y_pred = model(x)            # forward pass
loss = loss_fn(y_pred, y)    # loss computation
loss.backward()              # backward pass: compute gradients
optimizer.step()             # update parameters to reduce the loss

TensorFlow (low-level)

import tensorflow as tf
import keras

model = keras.layers.Dense(1, input_shape=(3,))
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD()
# ... (get a batch of x and y)
with tf.GradientTape() as tape:   # record the forward pass
    y_pred = model(x)
    loss = loss_fn(y, y_pred)
grads = tape.gradient(loss, model.trainable_weights)            # backward pass
optimizer.apply_gradients(zip(grads, model.trainable_weights))  # update

Automatic Differentiation

  • Programmer computes the forward pass
  • Library computes the backward pass (gradients) using the backpropagation algorithm
    • Start at the end (loss)
    • Work backwards through the computation graph
    • Use the chain rule to compute gradients

Upshot: we can differentiate programs.
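A minimal PyTorch sketch: the programmer writes only the forward computation, and the library fills in the gradient.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x     # forward pass (what we write)
y.backward()           # backward pass (what the library does): dy/dx = 2x + 2
print(x.grad)          # tensor(8.)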

A Story of A Classifier

Images as Input

  • Each image is a 3D array of numbers (color channel, height, width) with rich structure: adjacent pixels are related, color channels are related, etc.
  • Eventually we’ll handle this, but to simplify for now:
    • Ignore color information: use grayscale images: 28x28 array of numbers from 0 to 255
    • Ignore spatial relationships: flatten to 1D array of 784 numbers

Work through how this is done in the Chapter 2 notebook
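As a minimal sketch of the simplification (not the Chapter 2 code), with a random array standing in for a real image:

import numpy as np

image = np.random.randint(0, 256, size=(28, 28))   # grayscale: 0 = black, 255 = white
flat = image.reshape(-1)                           # throw away the spatial layout
print(image.shape, flat.shape)                     # (28, 28) (784,)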

Classification the Wrong Way

  1. Compute y = x @ W + b to get a single (one-dimensional) output y for each image.
  2. Compute the MSE loss between y and the desired number (0-9).
  3. Optimize the weights to minimize this loss.
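A sketch of that (wrong) setup with made-up data so the shapes are concrete; x_train and y_train here are stand-ins, not the notebook’s variables:

import numpy as np

x_train = np.random.rand(100, 784)             # 100 flattened images
y_train = np.random.randint(0, 10, size=100)   # the digit itself, 0-9

W = np.random.randn(784, 1)
b = np.random.randn(1)

y_pred = x_train @ W + b                             # one number per image
loss = ((y_pred.squeeze() - y_train) ** 2).mean()    # MSE against the digit label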

Discuss with neighbors:

  1. Suppose you’re trying to classify an image of a 4. What might a prediction from this model look like? Compute the loss.
  2. What’s wrong with this approach? What might you do instead?

Fix 1: Multiple Outputs

We want a score for each digit (0-9), so we need 10 outputs.

Each target output is 1 for the correct digit and 0 for the others (“one-hot encoding”).
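A minimal sketch of one-hot encoding a single label, assuming 10 classes:

import numpy as np

label = 4
one_hot = np.zeros(10)
one_hot[label] = 1.0
print(one_hot)   # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]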

Discuss with neighbors:

  1. What’s the shape of the output?
  2. What must be the shape of the weights?
  3. What might an output from this model look like when given an image of a 4? What’s the loss?

Fix 2: Softmax (Large Values)

Suppose the network predicted 1.5 for the correct output.

Suppose instead the network predicted 0.5 for the correct output.

What’s the loss in each case?

Fix: make outputs be probabilities.

  1. Exponentiate each output
  2. Divide by the sum of all exponentiated outputs
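A minimal (numerically naive) sketch of those two steps:

import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores)            # 1. exponentiate each output
    return exp_scores / exp_scores.sum()   # 2. divide by the sum

print(softmax(np.array([1.5, 0.5, -1.0])))   # probabilities that sum to 1

In practice, libraries subtract the maximum score before exponentiating so that large scores don’t overflow.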

Discuss with neighbors:

  1. What might an output from this model look like? What’s the loss for your running example?

Fix 3: Cross-Entropy Loss

Negative log of the probability of the correct class.

Compute the loss for your running example.
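For instance (a made-up probability, not your running example): if the model puts probability 0.2 on the correct class,

import numpy as np

print(-np.log(0.2))   # about 1.609; it would be 0 for probability 1.0, and huge for a tiny probability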

Embeddings

The internal data structure of neural networks.

General Neural Network Architecture

  • Initial layers extract features from input
  • Final layers make decisions based on those features

flowchart LR
    A[Input] --> B[Feature Extractor]
    B --> C[Linear Classifier]
    C --> D[Output]

Example:

flowchart LR
    A[Input] --> B["Pre-trained CNN"]
    B --> C["Linear layer with 3 outputs"]
    C --> D["Softmax"]
    D --> E["Predicted probabilities"]
    style B stroke-width:4px

Your Homework 1 Architecture

  • The Feature Extractor is composed of:
    • a convolutional neural network …
    • pre-trained on some classification task …
    • with the linear classifier removed
  • Linear Classifier: a linear layer plus a softmax (like we’ve been using)

The feature extractor constructs a representation of the input that’s useful for classification.

  • A linear classifier on the raw pixels couldn’t learn much.
  • A linear classifier on the features extracted by a CNN can learn a lot.
  • The CNN computes an embedding of the input image.
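A hedged PyTorch sketch of this architecture (not the actual Homework 1 code): the homework uses an EfficientNet, but ResNet-18 stands in here because its classifier is a single attribute; the 512-dimensional embedding and the 3 classes are illustrative.

import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained CNN
backbone.fc = nn.Identity()   # remove the original linear classifier

model = nn.Sequential(
    backbone,                 # feature extractor: image -> 512-number embedding
    nn.Linear(512, 3),        # linear classifier: embedding -> 3 class scores
)
# A softmax over the 3 scores (often folded into the loss, e.g. nn.CrossEntropyLoss)
# turns them into predicted probabilities.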

The Data Structures of Neural Computing

  • Array / Tensor: the basic data structure
  • We’ve used them in a few different ways
    • Inputs to the model
      • sometimes each entry is meaningful (e.g., characteristics of a home, vitals of a patient)
      • sometimes entries only meaningful in aggregate (e.g., pixels in an image)
    • Outputs of the model (predictions, targets)
    • Parameters of the model (weights, biases)
    • Intermediate computations (e.g., logits, gradients)
    • Embedding: what we’ll talk about today

Defining Embeddings

Embedding noun: a vector representation of an object, constructed to be useful for some task (not necessarily human-interpretable); verb: to construct such a representation.

Similar items get similar embeddings.

Similarity can be defined as:

  • Euclidean distance
  • Dot product
  • Cosine similarity

(Note: some sources describe “embedding” as a specific lookup operation, but we’ll use it more broadly.)
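A minimal sketch of the three similarity measures for two made-up embedding vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])

euclidean = np.linalg.norm(a - b)                       # smaller = more similar
dot = a @ b                                             # larger = more similar
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 = same direction
print(euclidean, dot, cosine)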

Example: Word Embeddings

Source: Jurafsky and Martin. Speech and Language Processing 3rd ed

See also: Word embeddings quantify 100 years of gender and ethnic stereotypes (Garg et al, PNAS 2018)

Source: GloVe project

Example: Movie Recommendations

  • Each movie gets an embedding vector (analogy: genres)
  • Each user gets an embedding vector (analogy: genre preferences)
  • Predict rating as dot product of user and movie vectors

Similar people end up with similar vectors because they like similar movies.
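A minimal sketch with made-up 2-dimensional embeddings (imagine the dimensions as “likes action” and “likes romance”):

import numpy as np

user = np.array([0.9, 0.1])            # this user mostly likes action
action_movie = np.array([0.8, 0.0])
romance_movie = np.array([0.1, 0.9])

print(user @ action_movie)    # higher predicted rating (0.72)
print(user @ romance_movie)   # lower predicted rating (0.18)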