{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5d5a6ca3",
   "metadata": {},
   "source": [
    "# Week 1 — Friday: DataFrame Basics\n",
    "\n",
    "**DATA 202 · Calvin University**\n",
    "\n",
    "Machine learning models learn from data. Before any model sees a number, someone had to collect, clean, and organize that data into a structure the model can work with. The most common structure is a **table** — and in Python, tables live in **DataFrames**.\n",
    "\n",
    "Today we cover:\n",
    "1. **Access** — read rows, columns, and specific cells\n",
    "2. **Filter & Sort** — ask questions of the data\n",
    "3. **Modify** — add and delete rows and columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74b429a7",
   "metadata": {},
   "source": [
    "---\n",
    "## Tables are ancient\n",
    "\n",
    "The first systematically structured tables appeared in Mesopotamia around **1850 BCE**: column headings, row labels, subtotals, blank cells for missing values. Sound familiar?\n",
    "\n",
    "Our dataset today is a table of Roman emperors — deliberately messy, because real data always is.\n",
    "\n",
    "In ML terms:\n",
    "- **rows** = *samples* (one emperor per row)\n",
    "- **columns** = *features* (dynasty, region, cause of death...)\n",
    "- a future column = a *label* to predict: \"was this emperor assassinated?\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14452a68",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 1: Loading and Exploring\n",
    "\n",
    "### Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecc7fb4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Replace with the URL or file path shared in class\n",
    "DATA_URL = \"https://raw.githubusercontent.com/fpasquinisantos/data202/main/datasets/roman_emperors.csv\"\n",
    "\n",
    "emperors = pd.read_csv(DATA_URL)\n",
    "print(\"Loaded:\", emperors.shape[0], \"rows,\", emperors.shape[1], \"columns\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "546813a7",
   "metadata": {},
   "source": [
    "### First look\n",
    "\n",
    "Before doing anything else, always look at your data. A model trained without understanding the data will fail in ways you will not expect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae041755",
   "metadata": {},
   "outputs": [],
   "source": [
    "emperors.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "822a00b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "emperors.info()    # column names, types, and how many values are non-null"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e806ca97",
   "metadata": {},
   "source": [
    "**What to read here:**\n",
    "- `Non-Null Count` less than total rows = **missing values** in that column.\n",
    "- `Dtype: object` = pandas' default for text (and anything it could not type more specifically).\n",
    "- `Reign_Start` is text, not a number — some values are `\"27 BC\"`. We will deal with this later.\n",
    "\n",
    "*Garbage in, garbage out.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d22242e",
   "metadata": {},
   "source": [
    "### Selecting columns\n",
    "\n",
    "Use `df[\"col\"]` for one column (returns a **Series**), `df[[\"col1\", \"col2\"]]` for several (returns a **DataFrame**)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aeefaaf5",
   "metadata": {},
   "outputs": [],
   "source": [
    "emperors[\"Name\"]                                  # one column → Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01db2a71",
   "metadata": {},
   "outputs": [],
   "source": [
    "emperors[[\"Name\", \"Dynasty\", \"Cause_of_Death\"]]   # multiple columns → DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aceedf3b",
   "metadata": {},
   "source": [
    "### Selecting rows with `iloc`\n",
    "\n",
    "`iloc` selects rows (and optionally columns) by **integer position**, like a list index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10ddc2fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "emperors.iloc[0]          # first row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d564e9d",
   "metadata": {},
   "outputs": [],
   "source": [
    "emperors.iloc[0:5, 0:3]   # rows 0-4, first 3 columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cd4c862",
   "metadata": {},
   "source": [
    "---\n",
    "### 🔨 Task 1 — Explore (~5 min)\n",
    "\n",
    "1. How many emperors have **no Dynasty** listed? Use `emperors[\"Dynasty\"].isnull().sum()`.\n",
    "2. Display only the `Name`, `Region`, and `Legitimacy` columns for rows 10 through 19."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "faa1c23e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Missing dynasty count\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "393fa161",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Selected columns and rows\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b79b184f",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 2: Filtering and Sorting\n",
    "\n",
    "### Filtering\n",
    "\n",
    "Boolean filtering works by creating a True/False mask and applying it. Every row where the mask is `True` is kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29da6b5b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1: a mask is just a Series of True/False\n",
    "emperors[\"Cause_of_Death\"] == \"Assassination\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eaae2093",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 2: apply the mask to the DataFrame\n",
    "emperors[emperors[\"Cause_of_Death\"] == \"Assassination\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b6571b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Combine conditions: & (AND), | (OR), ~ (NOT)\n",
    "# Always wrap each condition in its own parentheses\n",
    "emperors[\n",
    "    (emperors[\"Dynasty\"] == \"Julio-Claudian\") &\n",
    "    (emperors[\"Cause_of_Death\"] == \"Assassination\")\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2b30769",
   "metadata": {},
   "source": [
    "### Sorting\n",
    "\n",
    "`sort_values()` returns a **new** DataFrame — the original is not changed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26e42abb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by one column\n",
    "emperors.sort_values(\"Legitimacy\").head(8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4af721db",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sort by multiple columns: Dynasty first, then Name within each dynasty\n",
    "emperors.sort_values([\"Dynasty\", \"Name\"]).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29955a5a",
   "metadata": {},
   "source": [
    "---\n",
    "### 🔨 Task 2 — Filter and Sort (~5 min)\n",
    "\n",
    "1. Filter to emperors in the `\"East\"` region. How many are there?\n",
    "2. From that filtered result, show only `Name` and `Dynasty`, sorted alphabetically by `Name`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4bf74641",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Eastern emperors\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "09c55687",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Name + Dynasty, sorted\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c347111",
   "metadata": {},
   "source": [
    "---\n",
    "## Part 3: Modifying DataFrames\n",
    "\n",
    "### Adding columns\n",
    "\n",
    "Assign to a new column name. Work on a **copy** so the original stays intact and cells are safe to re-run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6dfedc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "emp = emperors.copy()\n",
    "\n",
    "# Boolean column: True/False flag\n",
    "emp[\"Is_Assassinated\"] = emp[\"Cause_of_Death\"] == \"Assassination\"\n",
    "\n",
    "# Derived column: string method applied element-wise\n",
    "emp[\"Legitimacy_Upper\"] = emp[\"Legitimacy\"].str.upper()\n",
    "\n",
    "emp[[\"Name\", \"Cause_of_Death\", \"Is_Assassinated\", \"Legitimacy_Upper\"]].head(8)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dd89e97",
   "metadata": {},
   "source": [
    "### Deleting columns\n",
    "\n",
    "`drop()` with `axis=1` removes columns. Returns a **new** DataFrame by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e3c7fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove one column\n",
    "emp_clean = emp.drop(\"Notes\", axis=1)\n",
    "\n",
    "# Remove several at once\n",
    "emp_slim = emp.drop([\"Notes\", \"Legitimacy_Upper\"], axis=1)\n",
    "emp_slim.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c341053a",
   "metadata": {},
   "source": [
    "### Deleting rows\n",
    "\n",
    "The most common approach: **filter out** the rows you do not want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0da44576",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep only emperors with a known dynasty\n",
    "emp_known = emp[emp[\"Dynasty\"].notna()]\n",
    "print(f\"{len(emp_known)} of {len(emp)} emperors have a known dynasty\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc79d963",
   "metadata": {},
   "source": [
    "---\n",
    "### 🔨 Task 3 — Modify (~5 min)\n",
    "\n",
    "Starting from `emperors.copy()`:\n",
    "\n",
    "1. Add a column `Violent_Death` that is `True` if `Cause_of_Death` contains any of: `Assassination`, `Execution`, `Killed`, or `Suicide`.\n",
    "   - *Hint:* `df[\"col\"].str.contains(\"word1|word2\", na=False)`\n",
    "2. Drop the `Notes` column.\n",
    "3. Keep only rows where `Dynasty` is not missing.\n",
    "4. Print how many of those emperors died violently: `result[\"Violent_Death\"].sum()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a33a17c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here\n",
    "result = emperors.copy()\n",
    "\n",
    "# 1. Add Violent_Death column\n",
    "\n",
    "# 2. Drop Notes\n",
    "\n",
    "# 3. Keep only known-dynasty rows\n",
    "\n",
    "# 4. Count violent deaths\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Coming Up\n",
    "\n",
    "| Week | Topic | Builds on today |\n",
    "|---|---|---|\n",
    "| Week 2 | Plotly basics | visualizing a DataFrame column |\n",
    "| Week 3 | Grouping | `groupby()` — aggregating across rows |\n",
    "| Week 4 | Joining tables | combining two DataFrames on a shared key |\n",
    "| Week 6+ | ML models | DataFrames become feature matrices fed to scikit-learn |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}