class: center, middle, inverse, title-slide

# Conceptual Review
### Ken Arnold
### 2021-02-22

---

## Reflection: Surrounded by rich data

[Genesis 1](https://www.biblegateway.com/passage/?search=Genesis%202&version=NIV): Ctrl-F "saw"

---

## Logistics

* Attendance sheet
* Piazza: Live Q&A during class today
* Homework 1:
  * How accurately can you label inside / outside?
  * How much data do you need?
  * How confident are you in these conclusions? Why?

---

class: middle, center

## Data<br>*fuel for machine learning*

---

.pull-left[
<img src="https://www.melissaanddoug.com/dw/image/v2/BBDH_PRD/on/demandware.static/-/Sites-master-catalog/default/dwac79566a/large/000413_1.jpg?sw=562&sh=570&sm=fit">
]

.pull-right[
<img src="https://media.nature.com/lw800/magazine-assets/d41586-021-00340-4/d41586-021-00340-4_18841126.jpg" style="height: 700px">
]

.floating-source[<https://www.nature.com/articles/d41586-021-00340-4>]

---

## Overfitting

.center[<img src="https://3.bp.blogspot.com/-dc6B2h_o1fc/VYITir_QCgI/AAAAAAAAAlU/Ysi0_reQTpI/s1600/dumbbells.png" style="width: 100%">]

.small[.right[*"Dumbbell"*]]

.floating-source[<https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html>]

---

### An aside for the interested

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The Bias-Variance Trade-Off &amp; "DOUBLE DESCENT" 🧵<br><br>Remember the bias-variance trade-off? It says that models perform well for an "intermediate level of flexibility". You've seen the picture of the U-shape test error curve.<br><br>We try to hit the "sweet spot" of flexibility.<br><br>1/🧵 <a href="https://t.co/HPk05izkZh">pic.twitter.com/HPk05izkZh</a></p>&mdash; Dr.
Daniela Witten (@daniela_witten) <a href="https://twitter.com/daniela_witten/status/1292293102103748609?ref_src=twsrc%5Etfw">August 9, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<!--
*slight misnomer*: In modern big networks, it's possible to fit the training set almost perfectly without sacrificing generalization. A wide range of models can fit the training set well; the trick is to encourage the learner to find models that also generalize.
-->

---

## Data Splitting

**Why**: Confidence about deployment

**How**:

* Hide a test set
* Single split (train-valid): [`RandomSplitter`](https://docs.fast.ai/data.transforms.html#RandomSplitter) or similar
* or: cross-validation

---

## Data Loading

.pull-left[
**Why**: Model expects data in ideal format, cleanly chunked... real world is messy.

**How**: pipeline of transformations (to *items*, to *batches*)
]

.pull-right[
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Here's an animation of a <a href="https://twitter.com/PyTorch?ref_src=twsrc%5Etfw">@PyTorch</a> DataLoader.<br><br>It turns your dataset into a shuffled, batched tensors iterator.<br><br>(This is my first animation using <a href="https://twitter.com/manim_community?ref_src=twsrc%5Etfw">@manim_community</a>, the community fork of <a href="https://twitter.com/3blue1brown?ref_src=twsrc%5Etfw">@3blue1brown</a>'s manim)<br><br>Here's a little summary of the different parts for those curious:<br>1/5 <a href="https://t.co/t6NtOxVRMm">pic.twitter.com/t6NtOxVRMm</a></p>&mdash; Scott Condron (@_ScottCondron) <a href="https://twitter.com/_ScottCondron/status/1363494433715552259?ref_src=twsrc%5Etfw">February 21, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
]

---

**Code**: Work through the [`DataBlock` tutorial](https://docs.fast.ai/tutorial.datablock.html).
`ImageDataLoaders.from_name_func` is a convenient way to construct a `DataBlock` and `DataLoaders`.

---

## Batches

**Why**: Often more efficient to process several items at once; more confident in how to update weights

**How**:

* need to align sizes of each image (or text document, or sound, or ...)
* limited by GPU memory (especially for the *backward* pass)

---

## Diagnosing Classifiers

**Why**: Better *quantify* performance (e.g., sensitivity vs specificity), better *understand* performance (analyze )

**How**: Confusion matrix; FP/FN / precision/recall

---

## Data Augmentation

**Why**: Discourage overfitting. Encourage generalization.

**How**: move the camera around (images), skip words (text), add / subtract stuff.

*Related*: mess with the model itself, e.g., Dropout.

???

Examples of what specific transforms can be used

---

## Running Experiments

* Reproducibility
  * `set_seed`
  * Restart and Run All
* Variability
  * try different seeds
  * try different modeling choices
* **Keep a log** (e.g., spreadsheet with one line per run)

---

## Previews of concepts from next week

* loss
* NN architectures

---

class: center, middle

## Tools

---

## Newline-delimited JSON

```python
import json

# filename: path to a file with one JSON object per line
data = []
with open(filename) as f:
    for line in f:
        data.append(json.loads(line))
```

---

## List Comprehensions

Often readable and declarative. Use judiciously.

.pull-left[
```python
result = []
for i in range(4):
    if i % 2 == 1:
        for j in range(i):
            result.append(f'i={i} j={j}')
result
```

```
## ['i=1 j=0', 'i=3 j=0', 'i=3 j=1', 'i=3 j=2']
```
]

.pull-right[
```python
[
    f'i={i} j={j}'
    for i in range(4)
    if i % 2 == 1
    for j in range(i)
]
```

```
## ['i=1 j=0', 'i=3 j=0', 'i=3 j=1', 'i=3 j=2']
```
]

* `dict` and `set` comprehensions work similarly.
* Modern libraries often provide "broadcast" alternatives:
  * `x * y` works elementwise when either is an `np.ndarray` or a `torch.Tensor`
  * `paths.attrgot('name')`, if `paths` is a `fastcore` `L`

---

## Jupyter Notebooks

* execution order issues: Restart and Run All
* kernel vs webpage: it's still running when you close!

---

## Markdown

```markdown
*italic*, **bold**

# heading

## heading 2

[link](https://example.com)

* list
* list 2
```

---

## Git(Hub)

* stage / commit / push
* nbdime?
* integration with Colab, Jupyter Lab

Remember to save your Colab!

---

## Collaboration

* Teams screenshare

---

## fastcore

[`fastcore.L`](https://fastcore.fast.ai/#L) is like a `list` but adds handy stuff:

```python
paths = path.ls()      # `path` is a `Path`; fastcore adds `.ls()`, which returns an `L`
paths.attrgot('name')  # the `.name` attribute of every item
```

I haven't used it much yet, but I'm guessing these will be useful:

* [`first`](https://fastcore.fast.ai/basics.html#first)
* [`chunked`](https://fastcore.fast.ai/basics.html#chunked)
* [`argwhere`](https://fastcore.fast.ai/basics.html#argwhere)

Prefer native Python functionality where available; e.g., `filter_keys` is better expressed as a dict comprehension.

---

## Example of Peer Review

Sample hw1 in:

* [nbviewer](https://nbviewer.jupyter.org/github/kcarnold/cs344/blob/main/src/Sample_HW1.ipynb)
* [Colab](https://colab.research.google.com/github/kcarnold/cs344/blob/main/src/Sample_HW1.ipynb)

---

## Practice

<a href="https://colab.research.google.com/github/kcarnold/cs344/blob/main/src/A_Batch_of_Practice_(blank).ipynb">Colab</a>

---

## Questions?

???

If time: review Lab 1 + extensions
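
---

## Sketch: precision and recall by hand

A rough pure-Python sketch of the "Diagnosing Classifiers" ideas: count the four confusion-matrix cells for one class of interest, then compute precision and recall from them. The function names (`confusion_counts`, `precision_recall`) and the labels are made up for illustration; they are not from fastai or from any homework results.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive):
    """Count TP / FP / FN / TN, treating `positive` as the class of interest."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def precision_recall(counts):
    """Precision = TP / (TP + FP); recall (sensitivity) = TP / (TP + FN)."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


# Hypothetical labels, for illustration only.
y_true = ["inside", "inside", "outside", "outside", "inside", "outside"]
y_pred = ["inside", "outside", "outside", "inside", "inside", "outside"]

counts = confusion_counts(y_true, y_pred, positive="inside")
precision, recall = precision_recall(counts)
```

Swapping which class counts as `positive` changes which mistakes are FPs and which are FNs, so it is worth saying which one you mean when reporting either number.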