{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Week 2 — Wednesday: Visualization with Plotly Express\n",
    "\n",
    "**DATA 202 · Calvin University**\n",
    "\n",
    "*(First 15 min: devotions, announcements, retrieval quiz on Week 1 content)*\n",
    "\n",
    "---\n",
    "\n",
    "A DataFrame full of numbers tells you very little by itself. Visualization translates data into **visual metaphors** — distance, position, color, size — that our eyes can process instantly.\n",
    "\n",
    "Today's goal: learn to **map** variables to visual properties using Plotly Express, and understand when each mapping is appropriate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 1: Why Visualize?\n",
    "\n",
    "### Summary statistics can lie\n",
    "\n",
    "Consider the Datasaurus Dozen — a set of very different datasets that share nearly identical summary statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import plotly.express as px\n",
    "\n",
    "datasaurus = pd.read_csv(\"https://cs.calvin.edu/courses/data/202/fsantos/datasets/datasaurus.csv\")\n",
    "\n",
    "sample = datasaurus[datasaurus[\"dataset\"].isin([\"away\", \"bullseye\", \"dino\", \"star\", \"dots\"])]\n",
    "sample.groupby(\"dataset\")[[\"x\", \"y\"]].mean().round(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same means — but look at the actual shapes\n",
    "px.scatter(sample, x=\"x\", y=\"y\", facet_col=\"dataset\", facet_col_wrap=5,\n",
    "           width=950, height=280, title=\"Same statistics, completely different data\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Takeaway:** always plot your data before trusting any summary. In ML, this matters even more — a model trained blindly on any of these datasets would behave very differently."
   ]
  },
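  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same effect can be reproduced without the Datasaurus file. The sketch below is a minimal illustration, assuming two synthetic shapes, a circle and a random cloud (the `toy` frame and its values are made up for this example). Each column is standardized within its dataset, so the summary table cannot tell the shapes apart, while a faceted scatter plot can.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import plotly.express as px\n",
    "\n",
    "# Hypothetical toy data: a circle and a Gaussian cloud, 200 points each\n",
    "t = np.linspace(0, 2 * np.pi, 200, endpoint=False)\n",
    "rng = np.random.default_rng(0)\n",
    "toy = pd.concat([\n",
    "    pd.DataFrame({\"x\": np.cos(t), \"y\": np.sin(t), \"dataset\": \"circle\"}),\n",
    "    pd.DataFrame({\"x\": rng.normal(size=200), \"y\": rng.normal(size=200), \"dataset\": \"cloud\"}),\n",
    "], ignore_index=True)\n",
    "\n",
    "# Standardize x and y within each dataset: mean 0, standard deviation 1\n",
    "toy[[\"x\", \"y\"]] = toy.groupby(\"dataset\")[[\"x\", \"y\"]].transform(\n",
    "    lambda col: (col - col.mean()) / col.std()\n",
    ")\n",
    "\n",
    "summary = toy.groupby(\"dataset\")[[\"x\", \"y\"]].agg([\"mean\", \"std\"]).round(2)\n",
    "print(summary)\n",
    "\n",
    "# Display with fig.show(), or end a notebook cell with fig\n",
    "fig = px.scatter(toy, x=\"x\", y=\"y\", facet_col=\"dataset\", width=600, height=320)\n",
    "```\n",
    "\n",
    "Here the statistics match by construction, because every column was standardized. The plot is the only view that separates the ring from the cloud."
   ]
  },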
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Our dataset: product sales\n",
    "\n",
    "Simulated sales data for 50 products across categories, seasons, and suppliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sales = pd.read_csv(\"https://cs.calvin.edu/courses/data/202/fsantos/datasets/product_sales.csv\")\n",
    "sales.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Variable types matter for choosing encodings\n",
    "\n",
    "Before plotting, classify each variable:\n",
    "\n",
    "| Type | Sub-type | Example | Good encodings |\n",
    "|---|---|---|---|\n",
    "| **Numerical** | Continuous | Sales amount | x/y axis, color gradient, size |\n",
    "| **Numerical** | Discrete | Units sold | x/y axis, size |\n",
    "| **Categorical** | Unordered | Product category | color (distinct), symbol, facet |\n",
    "| **Categorical** | Ordered | Season (Spring→Summer→Fall→Winter) | color (sequential), x-axis order |"
   ]
  },
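  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to record these types in pandas is sketched below, assuming a small made-up frame (`toy` and its values are hypothetical, not rows from `product_sales.csv`). Numeric dtypes separate continuous from discrete columns, and `pd.Categorical` with `ordered=True` stores the season order from the table so that sorting follows it instead of the alphabet.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical values, one column per variable type\n",
    "toy = pd.DataFrame({\n",
    "    \"Sales\": [120.5, 80.0, 310.25, 95.0],              # numerical, continuous\n",
    "    \"Units Sold\": [12, 8, 31, 9],                      # numerical, discrete\n",
    "    \"Category\": [\"Toys\", \"Books\", \"Toys\", \"Garden\"],   # categorical, unordered\n",
    "    \"Season\": [\"Winter\", \"Spring\", \"Fall\", \"Summer\"],  # categorical, ordered\n",
    "})\n",
    "\n",
    "# Declare the order explicitly; pandas would otherwise sort strings alphabetically\n",
    "season_order = [\"Spring\", \"Summer\", \"Fall\", \"Winter\"]\n",
    "toy[\"Season\"] = pd.Categorical(toy[\"Season\"], categories=season_order, ordered=True)\n",
    "\n",
    "print(toy.dtypes)\n",
    "sorted_seasons = toy.sort_values(\"Season\")[\"Season\"].tolist()\n",
    "print(sorted_seasons)\n",
    "```\n",
    "\n",
    "Plotly Express accepts a similar hint through its `category_orders` argument, for example `category_orders={\"Season\": season_order}`."
   ]
  },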
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### 🔨 Task 1 — Classify variables (~5 min)\n",
    "\n",
    "For each column in `sales`, identify its type (numerical continuous / numerical discrete / categorical unordered / categorical ordered):\n",
    "\n",
    "| Column | Type | Notes |\n",
    "|---|---|---|\n",
    "| Sales | | |\n",
    "| Returns | | |\n",
    "| Units Sold | | |\n",
    "| Profit | | |\n",
    "| Advertising Spend | | |\n",
    "| Category | | |\n",
    "| Season | | |\n",
    "| Supplier | | |\n",
    "\n",
    "Which column(s) would you put on the x-axis if you wanted to predict Profit? Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*(Answer in your own words here — double-click to edit)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 2: Mapping Variables to Visual Encodings\n",
    "\n",
    "### The core idea\n",
    "\n",
    "Plotly Express works by **mapping** DataFrame columns to visual properties:\n",
    "\n",
    "```python\n",
    "px.scatter(df, x=\"col_a\", y=\"col_b\", color=\"col_c\", size=\"col_d\", ...)\n",
    "```\n",
    "\n",
    "Each argument name is a **visual channel**. Each value is a **column name**. The data drives the visual."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Start simple: two numerical variables\n",
    "px.scatter(sales, x=\"Advertising Spend\", y=\"Profit\",\n",
    "           title=\"Advertising Spend vs. Profit\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add a third variable via color (categorical → distinct hues)\n",
    "px.scatter(sales, x=\"Advertising Spend\", y=\"Profit\",\n",
    "           color=\"Category\",\n",
    "           title=\"Advertising Spend vs. Profit by Category\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add a fourth variable via size (numerical → point area)\n",
    "px.scatter(sales, x=\"Advertising Spend\", y=\"Profit\",\n",
    "           color=\"Category\", size=\"Units Sold\",\n",
    "           title=\"Advertising Spend vs. Profit (size = Units Sold)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Facets: split into small multiples by a categorical variable\n",
    "px.scatter(sales, x=\"Advertising Spend\", y=\"Profit\",\n",
    "           color=\"Category\", size=\"Units Sold\",\n",
    "           facet_col=\"Season\",\n",
    "           title=\"Advertising Spend vs. Profit by Season\")"
   ]
  },
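  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what a `color` mapping does to a figure, the sketch below inspects the traces Plotly Express builds. It assumes a tiny made-up frame (`toy` is hypothetical). With a categorical column mapped to `color`, the figure holds one trace per category, and each trace carries the category name used in the legend.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import plotly.express as px\n",
    "\n",
    "# Hypothetical values with two categories\n",
    "toy = pd.DataFrame({\n",
    "    \"Advertising Spend\": [10, 20, 30, 40],\n",
    "    \"Profit\": [5, 9, 4, 12],\n",
    "    \"Category\": [\"Toys\", \"Books\", \"Toys\", \"Books\"],\n",
    "})\n",
    "\n",
    "plain = px.scatter(toy, x=\"Advertising Spend\", y=\"Profit\")\n",
    "mapped = px.scatter(toy, x=\"Advertising Spend\", y=\"Profit\", color=\"Category\")\n",
    "\n",
    "# Without a color mapping all points share a single trace\n",
    "print(len(plain.data))\n",
    "\n",
    "# With the mapping, the data decides how many traces exist\n",
    "trace_names = [trace.name for trace in mapped.data]\n",
    "print(trace_names)\n",
    "```\n",
    "\n",
    "Adding a third category to the data would add a third trace with no change to the plotting call, which is what \"the data drives the visual\" means in practice."
   ]
  },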
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Simpson's Paradox — when grouping reveals the truth\n",
    "\n",
    "Look at this overall trend:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "px.scatter(sales, x=\"Advertising Spend\", y=\"Profit\",\n",
    "           trendline=\"ols\", title=\"Overall trend\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now break it out by Category — does the trend hold within each group?\n",
    "px.scatter(sales, x=\"Advertising Spend\", y=\"Profit\",\n",
    "           color=\"Category\", facet_col=\"Category\", facet_col_wrap=3,\n",
    "           trendline=\"ols\", title=\"Trend within each Category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Simpson's Paradox:** a trend visible in the whole dataset can reverse — or disappear — within subgroups. Adding `color` or `facet_col` is often what reveals it.\n",
    "\n",
    "In ML: this is why we always check whether a model's performance holds across subgroups, not just overall."
   ]
  },
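  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small numeric sketch of the reversal, assuming a made-up two-group dataset (the `toy` frame, the group labels, and the `slope` helper are hypothetical). Within each group `y` falls as `x` rises, yet the line fitted to all points slopes upward, because group B sits higher on both axes.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical data: each group follows y = constant - x\n",
    "toy = pd.DataFrame({\n",
    "    \"x\": [1, 2, 3, 4, 6, 7, 8, 9],\n",
    "    \"y\": [5, 4, 3, 2, 10, 9, 8, 7],\n",
    "    \"group\": [\"A\"] * 4 + [\"B\"] * 4,\n",
    "})\n",
    "\n",
    "def slope(df):\n",
    "    \"\"\"Least-squares slope of y on x (np.polyfit returns the slope first).\"\"\"\n",
    "    return np.polyfit(df[\"x\"], df[\"y\"], deg=1)[0]\n",
    "\n",
    "overall_slope = slope(toy)\n",
    "group_slopes = {name: slope(part) for name, part in toy.groupby(\"group\")}\n",
    "\n",
    "print(round(overall_slope, 2))                             # positive\n",
    "print({k: round(v, 2) for k, v in group_slopes.items()})   # negative in both groups\n",
    "```\n",
    "\n",
    "Plotting `toy` with `color=\"group\"` and `trendline=\"ols\"` would show the same reversal visually."
   ]
  },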
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### 🔨 Task 2 — Explore encodings (~5 min)\n",
    "\n",
    "Create a single scatter plot of `Sales` vs. `Returns` that encodes **at least three additional variables** beyond x and y — use any combination of `color`, `size`, `symbol`, `facet_col`, `facet_row`, or `text`.\n",
    "\n",
    "Then answer: which of your encodings is most useful? Which feels cluttered or misleading? Why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*(Reflection — double-click to edit)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 3: Mapping vs. Styling\n",
    "\n",
    "### The key distinction\n",
    "\n",
    "| | **Mapping** | **Styling** |\n",
    "|---|---|---|\n",
    "| **What it is** | Data drives the visual property | You choose the visual property regardless of data |\n",
    "| **In code** | `color=\"Category\"` | `color_discrete_sequence=[\"red\",\"blue\"]` |\n",
    "| **Changes with data?** | Yes | No |\n",
    "| **Purpose** | Show a variable | Make the chart readable / accessible |\n",
    "\n",
    "A common confusion: `color=\"Category\"` is a **mapping**. `color_discrete_sequence=...` is a **style** that affects how that mapping looks."
   ]
  },
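  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The distinction can be checked on a figure object. The sketch below assumes a tiny made-up frame (`toy` is hypothetical) and builds the same mapped plot twice, once with default colors and once with `color_discrete_sequence`. The grouping of points into traces comes from the mapping and stays the same; only the marker colors change.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import plotly.express as px\n",
    "\n",
    "# Hypothetical values with two categories\n",
    "toy = pd.DataFrame({\n",
    "    \"Advertising Spend\": [10, 20, 30, 40],\n",
    "    \"Profit\": [5, 9, 4, 12],\n",
    "    \"Category\": [\"Toys\", \"Books\", \"Toys\", \"Books\"],\n",
    "})\n",
    "\n",
    "default = px.scatter(toy, x=\"Advertising Spend\", y=\"Profit\", color=\"Category\")\n",
    "styled = px.scatter(toy, x=\"Advertising Spend\", y=\"Profit\", color=\"Category\",\n",
    "                    color_discrete_sequence=[\"red\", \"blue\"])\n",
    "\n",
    "# Mapping: which traces exist is decided by the data in both figures\n",
    "default_names = [trace.name for trace in default.data]\n",
    "styled_names = [trace.name for trace in styled.data]\n",
    "print(default_names == styled_names)\n",
    "\n",
    "# Styling: the colors are the ones chosen by hand\n",
    "styled_colors = [trace.marker.color for trace in styled.data]\n",
    "print(styled_colors)\n",
    "```\n",
    "\n",
    "Removing `color=\"Category\"` changes what the chart says about the data; changing the palette only changes how it looks."
   ]
  },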
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Styling: titles and axis labels (never leave these as column names)\n",
    "px.scatter(sales,\n",
    "           x=\"Advertising Spend\", y=\"Profit\",\n",
    "           color=\"Category\",\n",
    "           title=\"Profit vs. Advertising Spend by Product Category\",\n",
    "           labels={\n",
    "               \"Advertising Spend\": \"Advertising Spend ($)\",\n",
    "               \"Profit\": \"Profit ($)\",\n",
    "               \"Category\": \"Product Category\"\n",
    "           })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Styling: color-blind-friendly palette\n",
    "# Default colors can be hard to distinguish — always consider accessibility\n",
    "px.scatter(sales,\n",
    "           x=\"Advertising Spend\", y=\"Profit\",\n",
    "           color=\"Category\",\n",
    "           color_discrete_sequence=px.colors.qualitative.Safe,\n",
    "           title=\"Profit vs. Advertising Spend (color-blind safe palette)\",\n",
    "           labels={\n",
    "               \"Advertising Spend\": \"Advertising Spend ($)\",\n",
    "               \"Profit\": \"Profit ($)\"\n",
    "           })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Styling: overall theme\n",
    "px.scatter(sales,\n",
    "           x=\"Sales\", y=\"Profit\",\n",
    "           color=\"Category\",\n",
    "           color_discrete_sequence=px.colors.qualitative.Safe,\n",
    "           template=\"simple_white\",   # try: \"plotly\", \"ggplot2\", \"seaborn\", \"plotly_dark\"\n",
    "           title=\"Clean theme example\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "### 🔨 Task 3 — Polish a plot (~5 min)\n",
    "\n",
    "Take this bare-bones chart and improve it:\n",
    "\n",
    "```python\n",
    "px.scatter(sales, x=\"Units Sold\", y=\"Revenue\", color=\"Season\")\n",
    "```\n",
    "\n",
    "1. Fix the axis labels to be human-readable.\n",
    "2. Add a descriptive title.\n",
    "3. Use a color-blind-friendly palette (`px.colors.qualitative.Safe` or `.Set2`).\n",
    "4. Apply a clean template.\n",
    "\n",
    "**Discuss:** which of your changes were *mappings* and which were *styling*?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here\n",
    "# (Note: \"Revenue\" does not exist in this dataset — use a column that does)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Reference: Visual Encoding Cheatsheet\n",
    "\n",
    "| px argument | Variable type it works best with | Notes |\n",
    "|---|---|---|\n",
    "| `x`, `y` | Numerical (continuous or discrete) | The primary axes |\n",
    "| `color` | Categorical (unordered) or Numerical | Distinct hues for categories; gradient for numbers |\n",
    "| `size` | Numerical (positive) | Area encodes magnitude — use carefully |\n",
    "| `symbol` | Categorical (few levels, ≤6) | Redundant with color for accessibility |\n",
    "| `text` | Any (short strings) | Labels on points — gets crowded fast |\n",
    "| `facet_col` / `facet_row` | Categorical | Small multiples — great for comparisons |\n",
    "| `animation_frame` | Categorical or ordered | Animated over time or groups |\n",
    "\n",
    "**Coming up — Friday:** Practice 1 + Quiz 1 (covers SLOs 02A, 02B, 02C)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}