Conceptual Review

# Conceptual Review
### Ken Arnold
### 2021-02-22

---

## Reflection: Surrounded by rich data

[Genesis 1](https://www.biblegateway.com/passage/?search=Genesis%202&version=NIV): Ctrl-F "saw"

---

## Logistics

* Attendance sheet
* Piazza: Live Q&A during class today
* Homework 1:
  * How accurately can you label inside / outside?
  * How much data do you need?
  * How confident are you in these conclusions? Why?

---

## Data<br>*fuel for machine learning*

---

.pull-left[
<img src="https://www.melissaanddoug.com/dw/image/v2/BBDH_PRD/on/demandware.static/-/Sites-master-catalog/default/dwac79566a/large/000413_1.jpg?sw=562&sh=570&sm=fit">
]

.pull-right[
<img src="https://media.nature.com/lw800/magazine-assets/d41586-021-00340-4/d41586-021-00340-4_18841126.jpg" style="height: 700px">
]

---

## Overfitting

.center[<img src="https://3.bp.blogspot.com/-dc6B2h_o1fc/VYITir_QCgI/AAAAAAAAAlU/Ysi0_reQTpI/s1600/dumbbells.png" style="width: 100%">]
.small[.right[*"Dumbbell"*]]

---

### An aside for the interested

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The Bias-Variance Trade-Off &amp; &quot;DOUBLE DESCENT&quot; 🧵<br><br>Remember the bias-variance trade-off? It says that models perform well for an &quot;intermediate level of flexibility&quot;. You&#39;ve seen the picture of the U-shape test error curve.<br><br>We try to hit the &quot;sweet spot&quot; of flexibility.<br><br>1/🧵 <a href="https://t.co/HPk05izkZh">pic.twitter.com/HPk05izkZh</a></p>&mdash; Dr. Daniela Witten (@daniela_witten) <a href="https://twitter.com/daniela_witten/status/1292293102103748609?ref_src=twsrc%5Etfw">August 9, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## Data Splitting

**Why**: Confidence about deployment

**How**:

* Hide a test set
* Single split (train-valid): [`RandomSplitter`](https://docs.fast.ai/data.transforms.html#RandomSplitter) or similar
* or: cross-validation

---

## Data Loading

**How**: pipeline of transformations (to *items*, to *batches*)
]

.pull-right[
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Here&#39;s an animation of a <a href="https://twitter.com/PyTorch?ref_src=twsrc%5Etfw">@PyTorch</a> DataLoader.<br><br>It turns your dataset into a shuffled, batched tensors iterator.<br><br>(This is my first animation using <a href="https://twitter.com/manim_community?ref_src=twsrc%5Etfw">@manim_community</a>, the community fork of <a href="https://twitter.com/3blue1brown?ref_src=twsrc%5Etfw">@3blue1brown</a>&#39;s manim)<br><br>Here&#39;s a little summary of the different parts for those curious:<br>1/5 <a href="https://t.co/t6NtOxVRMm">pic.twitter.com/t6NtOxVRMm</a></p>&mdash; Scott Condron (@_ScottCondron) <a href="https://twitter.com/_ScottCondron/status/1363494433715552259?ref_src=twsrc%5Etfw">February 21, 2021</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 
]

---

**Code**:

Work through the [`DataBlock` tutorial](https://docs.fast.ai/tutorial.datablock.html).

`ImageDataLoader.from_name_func` is convenient way to construct a `DataBlock`
and `DataLoader`

---

## Batches

**Why**: Often more efficient to process several items at once; more confident in how to update weights

**How**:

* need to align sizes of each image (or text document, or sound, or ...)
* limited by GPU memory (especially for the *backward* pass)

---

## Diagnosing Classifiers

**Why**: Better *quantify* performance (e.g., sensitivity vs specificity), better *understand* performance (analyze )

**How**: Confusion matrix; FP/FN / precision/recall

---

## Data Augmentation

**Why**: Discourage overfitting. Encourage generalization.

**How**: move the camera around (images), skip words (text), add / subtract stuff.

*Related*: mess with the model itself, e.g., Dropout.

???

Examples of what specific transforms can be used

---

## Running Experiments

* Reproducibility
  * `set_seed`
  * Restart and Run All
* Variability
  * try different seeds
  * try different modeling choices
  * **Keep a log** (e.g., spreadsheet with one line per run)

---

## Previews of concepts from next week

* loss
* NN architectures

---

## Tools

---

## Newline-delimited JSON

```python
import json

data = []
with open(filename) as f:
    for line in f:
        data.append(json.loads(line))
```

---

## List Comprehensions

Often readable and declarative. Use judiciously.

```python
result = []
for i in range(4):
  if i % 2 == 1:
    for j in range(i):
      result.append(f'i={i} j={j}')
result
```

```
## ['i=1 j=0', 'i=3 j=0', 'i=3 j=1', 'i=3 j=2']
```
]

```python
[
  f'i={i} j={j}'
  for i in range(4)
  if i % 2 == 1
  for j in range(i)
]
```

```
## ['i=1 j=0', 'i=3 j=0', 'i=3 j=1', 'i=3 j=2']
```
]

* `dict` and `set` comprehensions work similarly.
* Modern libraries often provide "broadcast" alternatives:
  * `x * y` works elementwise when either is `np.array` or `torch.Tensor`
  * `paths.attrgot('name')`, if `paths` is a `fastcore` `L`

---

## Jupyter Notebooks

* execution order issues: Restart and Run All
* kernel vs webpage: it's still running when you close!

---

## Markdown

```markdown
*italic*, **bold**

# heading

## heading 2

[link](https://example.com)

* list
* list 2
```

---

## Git(Hub)

* stage / commit / push
* nbdime?
* integration with Colab, Jupyter Lab

Remember to save your Colab!

---

## Collaboration

* Teams screenshare

---

## fastcore

[`fastcore.L`](https://fastcore.fast.ai/#L) is like a `list` but adds handy stuff:

```python
paths = path.ls()
paths.attrgot('name')
```
  
I haven't used it much yet, but I'm guessing these will be useful:

* [`first`](https://fastcore.fast.ai/basics.html#first)
* [`chunked`](https://fastcore.fast.ai/basics.html#chunked)
* [`argwhere`](https://fastcore.fast.ai/basics.html#argwhere)

Prefer native Python functionality where available. e.g., `filter_keys` is better expressed as a dict comprehension.

---

## Example of Peer Review

Sample hw1 in:

* [nbviewer](https://nbviewer.jupyter.org/github/kcarnold/cs344/blob/main/src/Sample_HW1.ipynb)
* [Colab](https://colab.research.google.com/github/kcarnold/cs344/blob/main/src/Sample_HW1.ipynb)

---

## Practice

<a href="https://colab.research.google.com/github/kcarnold/cs344/blob/main/src/A_Batch_of_Practice_(blank).ipynb">Colab</a>

---

## Questions?

???

If time: review Lab 1 + extensions