Classification Models

Objectives

Notebooks

Classification in scikit-learn (u03n2-sklearn-classification.ipynb)

For reference, you may also want to keep your regression notebook open: Regression in scikit-learn (u02n2-sklearn-regression.ipynb)

Key Concepts

Reframing Regression as Classification

Rather than predicting exact home prices (regression), we'll predict price categories (classification).

Why? The price distribution is highly skewed: even a modest relative error on an expensive home is a large absolute error, so the most expensive homes dominate the loss. Binning prices into categories gives every home roughly equal importance. Mostly, though, we're doing this to help you see the difference between these two types of problems.

Watch out: Notice that the class is expressed as a number (0, 1, 2), but this is not a regression problem. The numbers are just labels for the classes.
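As a sketch of the binning step (the prices below are made up; the notebook derives the bins from the real dataset), pandas can split a price column into three quantile-based classes:

```python
import pandas as pd

# Hypothetical prices; the notebook's price_bin comes from the actual housing data.
prices = pd.Series([120_000, 250_000, 480_000, 95_000, 310_000, 1_200_000])

# qcut splits at quantiles, so the three classes are roughly equal-sized:
# 0 = low, 1 = med, 2 = high.
price_bin = pd.qcut(prices, q=3, labels=[0, 1, 2])
print(price_bin.tolist())  # → [0, 1, 2, 0, 1, 2]
```

Quantile-based cuts (rather than fixed price thresholds) keep the classes balanced even when the price distribution is skewed.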

Cross-Entropy Loss

      True Class
      ↓ 
[0.7, 0.2, 0.1] → -log(prob of true class)
 ↑
Model's predicted probabilities
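For instance, if the true class of the sample above were class 0, the loss would be -log(0.7) ≈ 0.36. A quick check against scikit-learn's log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted probabilities for one sample over classes 0, 1, 2.
probs = np.array([[0.7, 0.2, 0.1]])
true_class = [0]  # suppose the true class is 0

# Cross-entropy is -log of the probability assigned to the true class.
manual = -np.log(probs[0, 0])
sk = log_loss(true_class, probs, labels=[0, 1, 2])
print(manual, sk)  # both ≈ 0.357
```

Note that the loss only looks at the probability the model gave to the true class; a confident wrong answer (true class with probability 0.1) would cost -log(0.1) ≈ 2.3.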

Working Through the Notebook

  1. Data Setup

    • Use the same X (lat/long) features as in the regression notebook
    • New y: price_bin column (0=low, 1=med, 2=high)
    • Split into train/validation sets
  2. Three Models

    • Logistic Regression: Linear boundaries between classes
    • Decision Tree: Box-like regions
    • Random Forest: Smoother boundaries from averaging many trees
  3. For Each Model

    • Fit on training data
    • Plot probability contours
    • Compute accuracy and cross-entropy loss
    • Compare training vs validation performance
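The steps above can be sketched end to end on synthetic data (the two features and the class rule below are made-up stand-ins for the notebook's lat/long and price_bin):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss

# Synthetic stand-in for the housing data: two "coordinate" features
# and a three-class label (0, 1, 2) defined by simple geometric rules.
rng = np.random.default_rng(0)
X = rng.uniform(size=(600, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int) + (X[:, 0] > 0.8)

# Step 1: split into train/validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Step 2: the three model families compared in the notebook.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Step 3: fit each model, then compare train vs validation metrics.
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, Xs, ys in [("train", X_train, y_train), ("val", X_val, y_val)]:
        acc = accuracy_score(ys, model.predict(Xs))
        ce = log_loss(ys, model.predict_proba(Xs), labels=[0, 1, 2])
        print(f"{name:20s} {split:5s}  acc={acc:.3f}  cross-entropy={ce:.3f}")
```

Accuracy only checks the top predicted class, while cross-entropy also rewards well-calibrated probabilities, so the two metrics can rank the models differently. (The probability-contour plots in the notebook are omitted here.)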

Analysis Tips

Generalization and a Kaggle Competition