Objectives
- Practice with the `fit` and `predict` interface of sklearn models
- Use linear models and decision trees for classification tasks
- Explain the difference between accuracy and log loss
- Explain underfitting and overfitting
Notebooks
Classification in scikit-learn
(u03n2-sklearn-classification.ipynb)
For reference, you may want to have your sklearn-regression notebook open: Regression in scikit-learn
(u02n2-sklearn-regression.ipynb)
Key Concepts
Reframing Regression as Classification
Rather than predicting exact home prices (regression), we’ll predict price categories (classification):
- Low (bottom third)
- Medium (middle third)
- High (top third)
Why? The price distribution is highly skewed, so even a small relative error on an expensive home is a large absolute error that dominates the loss. Classification evens out each home's importance. Really, we're mostly doing this to help you see the difference between these two types of problems.
Watch out: Notice that the class is expressed as a number (0, 1, 2), but this is not a regression problem. The numbers are just labels for the classes.
Cross-Entropy Loss
- Measures how surprised the model is by the correct answer
- Example: Model predicts probabilities [0.7, 0.2, 0.1] for classes [Low, Med, High]
- If true class is Low: -log(0.7) ≈ 0.36 (not very surprised)
- If true class is High: -log(0.1) ≈ 2.30 (very surprised!)
- Perfect prediction: probability 1.0 on correct answer, 0 elsewhere → loss = 0
- Random guessing: probabilities [1/3, 1/3, 1/3] → loss = -log(1/3) ≈ 1.10
True Class
↓
[0.7, 0.2, 0.1] → -log(prob of true class)
↑
Model's predicted probabilities
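The arithmetic above is easy to check in a couple of lines; sklearn's `log_loss` computes exactly this average of -log(prob of true class). The probabilities below are the made-up example values from this page, not real model output:

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted probabilities for classes [Low, Med, High]
probs = np.array([[0.7, 0.2, 0.1]])

# "Surprise" for each possible true class
print(-np.log(0.7))  # true class Low  -> ~0.36 (not very surprised)
print(-np.log(0.1))  # true class High -> ~2.30 (very surprised)

# sklearn's log_loss averages -log(prob of true class) over examples;
# with one example whose true class is Low, it matches -log(0.7)
print(log_loss([0], probs, labels=[0, 1, 2]))

# Random guessing over three classes
print(-np.log(1 / 3))  # ~1.10
```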
Working Through the Notebook
- Data Setup
- Use same X (lat/long) as regression
- New y: price_bin column (0=low, 1=med, 2=high)
- Split into train/validation sets
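A minimal sketch of that setup, assuming hypothetical column names (`latitude`, `longitude`, `price` and a tiny made-up dataset; the notebook's actual names and data may differ). `pd.qcut` with `q=3` bins prices into equal-sized thirds:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the notebook's housing dataset
df = pd.DataFrame({
    "latitude":  [34.1, 34.2, 33.9, 34.0, 34.3, 33.8],
    "longitude": [-118.3, -118.1, -118.5, -118.2, -118.0, -118.4],
    "price":     [200_000, 450_000, 900_000, 310_000, 700_000, 150_000],
})

# Bin prices into thirds: 0 = low, 1 = medium, 2 = high
df["price_bin"] = pd.qcut(df["price"], q=3, labels=False)

X = df[["latitude", "longitude"]]  # same features as the regression task
y = df["price_bin"]                # new categorical target
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)
```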
- Three Models
- Logistic Regression: Linear boundaries between classes
- Decision Tree: Box-like regions
- Random Forest: Smooth combination of many trees
- For Each Model
- Fit on training data
- Plot probability contours
- Compute accuracy and cross-entropy loss
- Compare training vs validation performance
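The fit-and-evaluate steps above might look like this sketch, using synthetic data in place of the notebook's lat/long features (the probability-contour plotting is omitted):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's two features and three price bins
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, Xs, ys in [("train", X_train, y_train), ("valid", X_valid, y_valid)]:
        acc = accuracy_score(ys, model.predict(Xs))
        ce = log_loss(ys, model.predict_proba(Xs), labels=[0, 1, 2])
        print(f"{name:>19} {split}: accuracy={acc:.2f}  cross-entropy={ce:.2f}")
```

Note that accuracy uses the hard `predict` labels, while cross-entropy needs the full `predict_proba` distribution.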
Analysis Tips
- Look at probability plots:
- Sharp vs smooth boundaries?
- Simple vs complex shapes?
- High vs low confidence?
- Compare metrics:
- Accuracy: Fraction correct
- Cross-entropy: average negative log of the model's confidence in the correct answer
- A model can have high accuracy but high cross-entropy if either (1) not confident enough when it’s correct or (2) over-confident when it’s wrong
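A toy illustration of case (2), with made-up probabilities: two models get the same 3-of-4 answers right, but the over-confident one pays a much larger cross-entropy penalty for its single wrong answer:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = [0, 0, 0, 0]  # true class is Low every time

# Both models get examples 1-3 right and example 4 wrong...
confident = np.array([[0.99, 0.005, 0.005]] * 3 + [[0.01, 0.98, 0.01]])
cautious  = np.array([[0.40, 0.30, 0.30]] * 3 + [[0.30, 0.40, 0.30]])

# ...so their accuracies (from argmax predictions) are identical: 0.75
print(accuracy_score(y_true, confident.argmax(axis=1)))
print(accuracy_score(y_true, cautious.argmax(axis=1)))

# ...but the over-confident wrong answer (-log(0.01)) dominates the average
print(log_loss(y_true, confident, labels=[0, 1, 2]))  # higher loss
print(log_loss(y_true, cautious, labels=[0, 1, 2]))   # lower loss
```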
- Identify overfitting/underfitting:
- Underfitting: can’t capture patterns in training data
- Overfitting: captured patterns that aren’t real / won’t generalize
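One way to see both failure modes is to sweep a decision tree's `max_depth` on noisy synthetic data and compare train vs. validation accuracy (a sketch, not the notebook's exact experiment): a depth-1 stump underfits (poor on both splits), while an unlimited-depth tree overfits (near-perfect on train, worse on validation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so memorizing the training set won't generalize
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           flip_y=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

for depth in [1, 3, None]:  # too shallow, moderate, unlimited
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"valid={tree.score(X_valid, y_valid):.2f}")
```

A large gap between train and validation scores signals overfitting; low scores on both signal underfitting.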