Visualization 1 Principles

class: left, top, title-slide

.title[
# Visualization 1 Principles
]
.author[
### Keith VanderLinden, Calvin University
]

---

# The Purpose of Data Visualization

.pull-left[

```r
library(tidyverse)
library(datasauRus)
datasaurus_dozen
```

```
## # A tibble: 1,846 × 3
##    dataset     x     y
##    <chr>   <dbl> <dbl>
##  1 dino     55.4  97.2
##  2 dino     51.5  96.0
##  3 dino     46.2  94.5
##  4 dino     42.8  91.4
##  5 dino     40.8  88.3
##  6 dino     38.7  84.9
##  7 dino     35.6  79.9
##  8 dino     33.1  77.6
##  9 dino     29.0  74.5
## 10 dino     26.2  71.4
## # … with 1,836 more rows
```
]
???
This tibble stacks several datasets, each identified by the `dataset` column and containing many (x, y) observations.
It's hard to see any patterns by scanning the raw rows because there are so many observations.
--
.pull-right[

```r
datasaurus_dozen %>% 
  group_by(dataset) %>%
  summarize(n = n())
```

```
## # A tibble: 13 × 2
##    dataset        n
##    <chr>      <int>
##  1 away         142
##  2 bullseye     142
##  3 circle       142
##  4 dino         142
##  5 dots         142
##  6 h_lines      142
##  7 high_lines   142
##  8 slant_down   142
##  9 slant_up     142
## 10 star         142
## 11 v_lines      142
## 12 wide_lines   142
## 13 x_shape      142
]
???
There are 142 observations for each of the 13 datasets.

---
# The Purpose of Data Visualization

```r
datasaurus_dozen %>% 
  filter(dataset %in% c("away", "bullseye", "dots", "star", "dino")) %>%
  group_by(dataset) %>% 
  summarize(
    mean_x = mean(x), mean_y = mean(y),
    std_dev_x = sd(x), std_dev_y = sd(y),
    corr_x_y = cor(x, y)
  )
```

```
## # A tibble: 5 × 6
##   dataset  mean_x mean_y std_dev_x std_dev_y corr_x_y
##   <chr>     <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 away       54.3   47.8      16.8      26.9  -0.0641
## 2 bullseye   54.3   47.8      16.8      26.9  -0.0686
## 3 dino       54.3   47.8      16.8      26.9  -0.0645
## 4 dots       54.3   47.8      16.8      26.9  -0.0603
## 5 star       54.3   47.8      16.8      26.9  -0.0630

???
Interestingly, the means, standard deviations, and Pearson correlations are nearly identical across all five datasets.
And yet...

---
# The Purpose of Data Visualization

```r
datasaurus_dozen %>% 
  filter(dataset %in% c("away", "bullseye", "dots", "star", "dino")) %>%
  ggplot(aes(x=x, y=y, colour=dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(~dataset, nrow=1)
```

![](foundations_files/figure-html/unnamed-chunk-4-1.png)

.footnote[See: https://cran.r-project.org/web/packages/datasauRus/vignettes/Datasaurus.html]

???
Never trust summary statistics alone. 
Graphics can help distill *information* from *data*.

“The simple graph has brought more *information* to the *data* analyst’s mind than any other device.” (John Tukey, cited in [R4DS](https://r4ds.had.co.nz/data-visualisation.html))

References:
- http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
- https://en.wikipedia.org/wiki/Anscombe%27s_quartet
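
Anscombe's quartet, linked above, makes the same point with data that ships with base R, so it can be checked without installing `datasauRus`. The sketch below computes the same five summary statistics for each of the four sets and then plots them. The name `quartet_stats` is illustrative and not part of any package.

```r
# `anscombe` is a built-in data frame with columns x1..x4 and y1..y4.
# `quartet_stats` is an illustrative name, not part of any package.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set = i,
    mean_x = mean(x), mean_y = mean(y),
    std_dev_x = sd(x), std_dev_y = sd(y),
    corr_x_y = cor(x, y)
  )
}))

# Rounded to two decimals, the statistics agree across all four sets.
print(round(quartet_stats, 2))

# Plotting the four sets side by side shows how different they really are.
op <- par(mfrow = c(1, 4))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)
```

As with the Datasaurus Dozen, the table alone gives no hint that one set is linear, one is curved, and two are dominated by a single outlying point.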

---
# The Foundations of Data Visualization

*Data Visualization* uses visual representations to help data scientists discover and present patterns in data. It takes advantage of the well-developed human visual system.

Designing an effective visualization requires that we determine:

- The *purpose* of the visualization by identifying:
  - the key question
  - the context

- The most appropriate visual *composition* using:
  - principles of effective graphic design
  - N. Yau's taxonomy (visual cues, coordinate systems, scale, context)

These ideas are summarized in the Novartis [Graphics Principles Cheat Sheet](https://github.com/GraphicsPrinciples/CheatSheet/blob/master/NVSCheatSheet.pdf).

.footnote[See: https://github.com/GraphicsPrinciples/CheatSheet/blob/master/NVSCheatSheet.pdf Cf. M. Vandemeulebroecke et al., *Effective Visual Communication for the Quantitative Scientist*, https://ascpt.onlinelibrary.wiley.com/doi/full/10.1002/psp4.12455]

???
- Purpose/Function/Strategy (see: *Cheatsheet "Planning"*)
  - Identify the key *question* (cf. question-driven scientific inquiry)
  - Communicating vs. exploring
  - Know the *context* (e.g., audience, domain, …)
- Composition/Form/Tactics
  - Principles of effective graphic design (for information)
    - Review: **Cheatsheet "Principles of Effective Graphic Design"**.
    - Review: **Cheatsheet "Selecting the right base graph"** (n.b., no pie charts).
    - Review: **Cheatsheet "Proximity improves association"**.
  - Yau's taxonomy (MDSR 2.2 focuses on these.)
    - *Visual cues*: **Cheatsheet "Effectiveness Ranking"**.
    - *Color*: **Cheatsheet "Color for emphasis or distinction"**.
    - *Coordinate system*: Cartesian; polar; geographical
    - *Scale*: numeric (linear/logarithmic/percentage); categorical; time
    - *Context*: titles/labels
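
As a rough illustration of how the elements of Yau's taxonomy listed above map onto ggplot2 code, the sketch below annotates one plot layer by layer. The `mpg` dataset ships with ggplot2; the choice of variables, the log scale, and the name `taxonomy_plot` are assumptions made for the example, not recommendations taken from the cheat sheet.

```r
library(ggplot2)

# `taxonomy_plot` is an illustrative name; `mpg` ships with ggplot2.
taxonomy_plot <- ggplot(
  mpg,
  # Visual cues: position for the two key variables, colour for a category.
  aes(x = displ, y = hwy, colour = drv)
) +
  geom_point() +
  # Coordinate system: Cartesian (the default, written out for clarity).
  coord_cartesian() +
  # Scale: logarithmic x-axis (chosen only to show where scales are set).
  scale_x_log10() +
  # Context: title, axis labels, and legend title.
  labs(
    title = "Engine size and highway mileage",
    x = "Engine displacement (litres, log scale)",
    y = "Highway miles per gallon",
    colour = "Drive train"
  )

taxonomy_plot
```

Swapping `coord_cartesian()` for `coord_polar()`, or `scale_x_log10()` for `scale_x_continuous()`, changes one taxonomy element while leaving the others untouched, which is a convenient way to discuss each choice separately.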
 
References:
- See: [Vandemeulebroecke et al., Figure 1](https://ascpt.onlinelibrary.wiley.com/doi/full/10.1002/psp4.12455#psp412455-fig-0001)
- See: [Vandemeulebroecke et al., Figure 2](https://ascpt.onlinelibrary.wiley.com/doi/full/10.1002/psp4.12455#psp412455-fig-0002)
- Graphics Principles
 - Website: <https://graphicsprinciples.github.io>
 - Paper: <https://ascpt.onlinelibrary.wiley.com/doi/full/10.1002/psp4.12455>
 - [DSBox Visualisation Tips](https://rstudio-education.github.io/datascience-box/course-materials/slides/u2-d14-effective-dataviz/u2-d14-effective-dataviz.html)
- Key researchers: J. Tukey, E. Tufte, W. Cleveland [& R. McGill]
 
---
# Analyzing a Visualization

.pull-left[
![](foundations_files/figure-html/health-and-wealth-1.png)
]
.pull-right[
- Purpose
  - Question
  - Context
- Composition
  - Graphic Design
  - Yau's taxonomy
]

???
1. Purpose: 
  - To communicate the relationship between child health and wealth over time across the world.
  - Narrowly, for his world health course; generally, for everyone

2. Composition
  - General principles of graphic design (see below)
  - N. Yau's taxonomy
      - *Visual cues*: 
          - Position (Scatter plot) shows correlation
          - Color (continent) & area (population) for less important data
      - *Coordinate system*
          - Cartesian (x-y) for most important numerical correlation
          - Categorical (color) for continent
      - *Scale*: numeric (logarithmic)
      - *Context*: titles/labels

---
# Identifying Visualization Issues

Evaluate the following visualizations.

- [In the Barrel…](images/oil_prices.png)

- [Four score and seven years…](images/gettysburg_address.png)

- To truncate or not to truncate
  - [taxes](images/tax_cuts.png)
  - [temperatures](images/temperature_change.png)

- [Snow on Cholera](images/snow_cholera_map.jpg)

- [Napoleon's long, cold Russian winter](images/napoleon_march.png)

???
A picture can be worth a thousand words: either a thousand valuable words or a thousand malicious lies.

Notes:
- OPEC and the energy crisis
    - Tufte's *lie factor*
    - Dig up my old notes on this example.

- Gettysburg address
 - Useless graph
 - See: <http://www.norvig.com/Gettysburg/>

- Scaling examples
 - Reference: <https://engineering.tableau.com/truncating-the-y-axis-threat-or-menace-d0bce66d4d08>

- J. Snow's cholera map
    - famous

- C. Minard's visualization of Napoleon's army in the Russian campaign, 1812-1813
    - famous
    - variables:
        - Location: x/y geographical points (see rivers and towns)
        - Army Size: line width (422K at start near Niemen; 100K in Moscow; 10K on leaving Russia)
        - Time: left to right in brown; right to left in black.
        - Temperature: height (i.e., length) of temperature line at bottom

- Consider the scaling used in the following two visualizations.
    - ![Bush Tax Cuts Visualization (Fox News) ](images/tax_cuts.png)
    - That climate change temperature chart that looked flat. Where did I see it?

- The text's Challenger example
 - If we do this, keep Norman's critique in mind (see <https://jnd.org/in_defense_of_powerpoint/>).
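
Tufte's *lie factor*, mentioned in the notes above, is the relative effect shown in the graphic divided by the relative effect in the data; a value near 1 means the graphic is faithful. The sketch below applies it to a two-bar chart whose y-axis may be truncated. The function name `lie_factor`, its arguments, and the two bar values are hypothetical, chosen only to illustrate the arithmetic.

```r
# Lie factor for a two-bar chart, assuming bar height (measured from where
# the y-axis starts) is the visual cue. All names and values are illustrative.
lie_factor <- function(first, second, axis_start = 0) {
  # Relative change actually present in the data.
  data_effect <- (second - first) / first
  # Relative change in drawn bar height when the axis starts at `axis_start`.
  graphic_effect <- (second - first) / (first - axis_start)
  graphic_effect / data_effect
}

# Hypothetical bar values: 35 and 39.6.
honest <- lie_factor(35, 39.6, axis_start = 0)
truncated <- lie_factor(35, 39.6, axis_start = 34)

print(c(honest = honest, truncated = truncated))
```

With the axis starting at zero the lie factor is 1. Starting it at 34 turns a difference of roughly 13% in the data into bars whose heights differ by 460%, a lie factor of about 35. Whether that distortion is acceptable depends on the purpose and context of the chart, which is the question the truncation examples raise.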

Here are some more ugly chart examples:

- Flowing Data - <https://flowingdata.com/category/visualization/ugly-visualization/>
- Reddit - <https://www.reddit.com/r/dataisugly/>
- DataVis - <https://www.datavis.ca/gallery/missed.php>
- (Mostly Bad) Graphics and Tables - <http://users.stat.umn.edu/~rend0020/Teaching/STAT8801-resources/graphics/index.html>

References: 
- E. Tufte, *lie factor*, *data density*
- D. Huff, *How to Lie with Statistics*
- H. Wainer, *How to Display Data Badly*
 - <https://www.wisdom.weizmann.ac.il/~zvika/course2015/announcements/WainerAmericanStatistician1984.pdf>