Unexpected behavior can emerge from optimizing an objective.
If anyone uncovers a pit or digs one and fails to cover it and an ox or a donkey falls into it, the one who opened the pit must pay the owner for the loss and take the dead animal in exchange.
If anyone’s bull injures someone else’s bull and it dies, the two parties are to sell the live one and divide both the money and the dead animal equally. However, if it was known that the bull had the habit of goring, yet the owner did not keep it penned up, the owner must pay, animal for animal, and take the dead animal in exchange.
Exodus 21:33-36 (NIV)
Discuss with a different person than last week:
Explain how feature_vectors @ weight_matrix computes how much each image is like each of the prototypes. Discuss what feature_vectors represents, what weight_matrix represents, and how the matrix multiply computes similarities. (A sketch follows after these prompts.)
If we wanted to predict a single number (e.g., the number of petals on the flower), what would the shape of the weight matrix be?
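For the first prompt, here is a minimal numpy sketch; the shapes, feature values, and names below (e.g., petal_weights) are illustrative assumptions, not the exact ones from class.

```python
import numpy as np

# Hypothetical shapes: 4 images, 3 features per image, 2 prototypes.
# Each row of feature_vectors is one image's features.
feature_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.0, 0.3, 0.9],
    [0.5, 0.5, 0.2],
])  # shape (4, 3)

# Each column of weight_matrix is one prototype, expressed in the same
# feature space: entry [f, p] says how strongly prototype p has feature f.
weight_matrix = np.array([
    [1.0, 0.0],
    [0.5, 0.2],
    [0.0, 1.0],
])  # shape (3, 2)

# Row i, column j of the product is the dot product of image i's features
# with prototype j's weights: a similarity score for every (image, prototype) pair.
scores = feature_vectors @ weight_matrix  # shape (4, 2)
print(scores)

# To predict a single number per image (e.g., petal count), the weight
# matrix would need just one column: shape (num_features, 1).
petal_weights = np.array([[2.0], [1.0], [3.0]])  # shape (3, 1)
print(feature_vectors @ petal_weights)           # shape (4, 1)
```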

Imagine the data tables that YouTube might be using. What are the columns? Rows?
| timestamp | Viewer | Video | Watch time |
|---|---|---|---|
| 1616963421 | UC2nEn-yNA1BtdDNWziphPGA | WK_Nr4tUtl8 | 600 |
| 1616963422 | UCYO_jab_esuFRV4b17AJtAw | aircAruvnKk | 1153 |
| … | … | … | … |
So we need a way to measure similarity for both users and items
During development, we make extensive use of offline metrics (precision, recall, ranking loss, etc.) to guide iterative improvements to our system. However, for the final determination of the effectiveness of an algorithm or model, we rely on A/B testing via live experiments. In a live experiment, we can measure subtle changes in click-through rate, watch time, and many other metrics that measure user engagement. This is important because live A/B results are not always correlated with offline experiments.
Intuition:
Key challenge: how to represent text data.
For words not in the vocabulary, use an “unknown word” token.
Vocabulary: [“the”, “cat”, “dog”, “is”, “on”, “under”, “table”]
Document: “the cat is on the table”
Bag of words: [1, 1, 0, 1, 1, 0, 1]
Document: “the dog is under the table”
Bag of words: [1, 0, 1, 1, 0, 1, 1]
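A small sketch of the binary bag-of-words encoding above; the unknown-word handling shown is one common convention (the slide's example vectors omit that extra slot).

```python
# Binary bag-of-words over a fixed vocabulary, with an "unknown word" slot.
vocabulary = ["the", "cat", "dog", "is", "on", "under", "table"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def bag_of_words(document: str) -> list[int]:
    # One slot per vocabulary word, plus a final slot for unknown words.
    counts = [0] * (len(vocabulary) + 1)
    for word in document.lower().split():
        counts[word_to_index.get(word, len(vocabulary))] = 1
    return counts

# The first seven slots match the slide's vectors; the last slot is the unknown-word flag.
print(bag_of_words("the cat is on the table"))     # [1, 1, 0, 1, 1, 0, 1, 0]
print(bag_of_words("the dog is under the table"))  # [1, 0, 1, 1, 0, 1, 1, 0]
```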
Task: given a single word, predict the next word
What will you predict? How?
| word | lentil | chickpea | recipe |
|---|---|---|---|
| is an ingredient | 1 | 1 | 0 |
| is a legume | 1 | 1 | 0 |
| is a color | 0 | 0 | 0 |
| is information | 0 | 0 | 1 |
| described by an ingredient | 0 | 0 | 1 |
Option A: hire an army of linguists (and food experts etc.)
Option B: learn it from data.
Source: Jurafsky and Martin, Speech and Language Processing, 3rd ed.


See also: Word embeddings quantify 100 years of gender and ethnic stereotypes (Garg et al., PNAS 2018)
Source: GloVe project
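To make "learn it from data" concrete: once word vectors have been learned (e.g., by GloVe), word similarity is just a vector comparison. A sketch with made-up 3-dimensional vectors; real GloVe vectors have 50 to 300 dimensions.

```python
import numpy as np

# Made-up 3-d "embeddings"; real learned vectors (e.g., GloVe) are much longer.
vectors = {
    "lentil":   np.array([0.9, 0.8, 0.1]),
    "chickpea": np.array([0.8, 0.9, 0.2]),
    "recipe":   np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    # Dot product of the two vectors, normalized by their lengths.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(vectors["lentil"], vectors["chickpea"]))  # high: similar words
print(cosine_similarity(vectors["lentil"], vectors["recipe"]))    # low: dissimilar words
```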
Which data do you absolutely need in your database for training a video recommender system based on collaborative filtering?
YouTube’s home-page recommender used embeddings to represent videos and users. What were these embeddings, and how were they used?
Which of these can be optimized by gradient descent: percent correct, precision, recall, MSE, MAE, categorical cross-entropy?
Suppose you carefully study past exams and notice a pattern:
When the question includes the letter “m”, the answer is always “B”.
Is this a good pattern to learn?
Error = Bias^2 + Variance + “Noise”
(For MSE loss this is an exact equation, with the Bias term squared as written. For other losses, the decomposition is only approximate.)
Imagine sampling many training sets and training a model on each. Compute all of those models’ predictions for a new sample.
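A rough simulation of that procedure (a sketch; the sine-wave data and polynomial models are illustrative assumptions): a low-capacity model's predictions cluster away from the truth (bias), while a high-capacity model's predictions scatter widely across training sets (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def sample_training_set(n=30, noise=0.3):
    x = rng.uniform(0, 1, n)
    return x, true_fn(x) + rng.normal(0, noise, n)

x_new = 0.25  # the new sample we care about

# Fit many models, each on its own freshly sampled training set, and look at
# how their predictions at x_new spread out.
for degree in [1, 7]:
    preds = np.array([
        np.polyval(np.polyfit(*sample_training_set(), degree), x_new)
        for _ in range(500)
    ])
    bias = preds.mean() - true_fn(x_new)
    print(f"degree={degree}: bias^2={bias**2:.3f}, variance={preds.var():.3f}")
```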
MLU-Explain on Bias and Variance - see the LOESS example.
Overfitting: When improving the model’s fit to training data doesn’t help it generalize.
Training dynamics can be a clue (e.g., validation loss starting to go up).
Usually because of increasing variance.
All of these have equivalent training-set loss. Which should we choose?
Which would need larger weights?
Weight decay encourages #2
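The "#2" refers to a figure not reproduced here, but the idea can be shown with a toy example (a sketch with assumed data): several weight vectors fit the training data equally well, and weight decay favors the one with the smallest weights.

```python
import numpy as np

# Two identical copies of the same feature, so many weight vectors fit y exactly.
x = np.linspace(0, 1, 20)
X = np.column_stack([x, x])
y = x.copy()

# Both candidates reproduce y exactly (zero training loss)...
for w in [np.array([2.0, -1.0]), np.array([0.5, 0.5])]:
    train_mse = float(np.mean((X @ w - y) ** 2))
    penalty = float(np.sum(w ** 2))  # the weight-decay (L2) term
    print(f"w={w}: train MSE={train_mse:.4f}, sum of squared weights={penalty:.2f}")
# ...but weight decay penalizes [2, -1] far more than [0.5, 0.5].
```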
Think of a few experiences you’ve had with recommender systems.
(In practice we don’t actually know all of the “genre” features. Soon we’ll see how we can learn them.)
Intuition: represent a movie by its genre vector.
| title | Star Wars (1977) | Contact (1997) | Fargo (1996) | Return of the Jedi (1983) | Liar Liar (1997) | English Patient, The (1996) | Scream (1996) | Toy Story (1995) | Air Force One (1997) | Independence Day (ID4) (1996) |
|---|---|---|---|---|---|---|---|---|---|---|
| Action | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| Adventure | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Animation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Children’s | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Comedy | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| Crime | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Documentary | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Drama | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Fantasy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Film-Noir | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Horror | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Musical | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Mystery | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Romance | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| Sci-Fi | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Thriller | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| War | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| Western | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Can you think of a math operation that would give us the number of genres in common? Perhaps from linear algebra?
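One answer, sketched with columns copied from the table above: the dot product of two genre vectors counts the genres they share.

```python
import numpy as np

# Genre vectors copied from the table above (order: Action, Adventure, Animation,
# Children's, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror,
# Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western).
star_wars          = np.array([1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0])
return_of_the_jedi = np.array([1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0])
fargo              = np.array([0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0])

# A dot product multiplies matching entries and sums them, so it counts
# the genres two movies have in common.
print(star_wars @ return_of_the_jedi)  # 5 genres in common
print(star_wars @ fargo)               # 0 genres in common
```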
What would your user vector look like?
How might we do that?
Same as usual:
Work with tens of numbers instead of tens of thousands of movies.
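A hedged sketch of what "learning the features" could look like: give every user and every movie a short vector of free parameters, then fit them by gradient descent so that dot products match observed ratings. The data, sizes, and learning rate here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, dim = 5, 8, 3    # tiny made-up sizes

# Observed (user, movie, rating) triples; everything else is unknown.
ratings = [(0, 1, 5.0), (0, 2, 1.0), (1, 1, 4.0), (2, 3, 2.0), (3, 5, 5.0), (4, 2, 3.0)]

# Start every user and every movie at a small random vector of `dim` numbers.
user_vecs = 0.1 * rng.normal(size=(n_users, dim))
movie_vecs = 0.1 * rng.normal(size=(n_movies, dim))

lr = 0.05
for _ in range(2000):
    for u, m, r in ratings:
        err = user_vecs[u] @ movie_vecs[m] - r      # prediction error for this rating
        grad_u = err * movie_vecs[m]                # gradient wrt the user vector
        grad_m = err * user_vecs[u]                 # gradient wrt the movie vector
        user_vecs[u] -= lr * grad_u
        movie_vecs[m] -= lr * grad_m

print(user_vecs[0] @ movie_vecs[1])  # ≈ 5.0 (the observed rating)
print(user_vecs[0] @ movie_vecs[2])  # ≈ 1.0
```

Real systems add bias terms, regularization, and vastly more data, but the core idea is the same: similarity is computed in a learned low-dimensional space.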