library(tidyverse)
These examples are adapted from Data Science in a Box.
We use a subset of a dataset of financial loans from LendingClub, provided by OpenIntro.
library(openintro)
loans <- loans_full_schema %>%
  mutate(grade = factor(grade, ordered = TRUE)) %>%
  select(loan_amount, interest_rate, term, grade, state, annual_income,
         homeownership, debt_to_income)
loans
## # A tibble: 10,000 × 8
## loan_amount interest_rate term grade state annual_income homeowner…¹ debt_…²
## <int> <dbl> <dbl> <ord> <fct> <dbl> <fct> <dbl>
## 1 28000 14.1 60 C NJ 90000 MORTGAGE 18.0
## 2 5000 12.6 36 C HI 40000 RENT 5.04
## 3 2000 17.1 36 D WI 40000 RENT 21.2
## 4 21600 6.72 36 A PA 30000 RENT 10.2
## 5 23000 14.1 36 C CA 35000 RENT 58.0
## 6 5000 6.72 36 A KY 34000 OWN 6.46
## 7 24000 13.6 60 C MI 35000 MORTGAGE 23.7
## 8 20000 12.0 60 B AZ 110000 MORTGAGE 16.2
## 9 20000 13.6 36 C NV 65000 MORTGAGE 36.5
## 10 6400 6.71 36 A IL 30000 RENT 18.9
## # … with 9,990 more rows, and abbreviated variable names ¹homeownership,
## # ²debt_to_income
Here are the characteristics of the data.
Variable | Type |
---|---|
loan_amount | Numerical, Continuous |
interest_rate | Numerical, Continuous |
term | Numerical, Discrete (length in whole months) |
grade | Categorical, Ordinal (values A through G) |
state | Categorical, not Ordinal |
annual_income | Numerical, Continuous |
homeownership | Categorical, not Ordinal (owns, mortgage, rents) |
debt_to_income | Numerical, Continuous (Debt-to-income ratio) |
We now demo plots appropriate for different types of data using various combinations of aesthetic and geometric settings.
Density plots are good for displaying the distribution of a continuous, numerical variable.
loans %>%
  ggplot() +
  aes(
    x = loan_amount,
    # fill = homeownership,
  ) +
  geom_density(adjust = 1.0,
               # alpha = 0.5,
  )
  # geom_histogram(binwidth = 5000)
  # facet_wrap(vars(homeownership), nrow = 3)
Demo the following changes:
- Adjust the bandwidth: geom_density(adjust = X) for X = 0.5-2.0.
- Switch to a histogram, which requires binning the continuous variable: geom_histogram(binwidth = X) for X = 1000-20000.
- Add a categorical variable: aes(... fill=homeownership) (n.b., we couldn’t fill with a numerical value).
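To make the role of adjust concrete, here is a minimal sketch using base R's density() directly. It assumes that geom_density() relies on stats::density() with its default bandwidth rule, which is my understanding of ggplot2 rather than something stated in these notes. The variable names are illustrative.

```r
library(openintro)  # provides loans_full_schema

amounts <- loans_full_schema$loan_amount

# Estimate the same density twice, changing only the adjust multiplier.
d_default <- density(amounts, adjust = 1, na.rm = TRUE)
d_smooth  <- density(amounts, adjust = 2, na.rm = TRUE)

# adjust multiplies the default bandwidth, so the second value is twice the first.
c(default = d_default$bw, smooth = d_smooth$bw)
```

A larger bandwidth averages over a wider window, which is why the curve gets smoother as adjust grows.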
Bar/Column plots are good for either binned numerical data or categorical data.
loans %>%
  ggplot() +
  aes(x = homeownership) +
  # coord_flip() +
  geom_bar()
Demo the following changes:
- Try plotting by state, which gives too many columns. Horizontal bars are better for that. Switch from a column plot to a bar plot: + coord_flip() or aes(y=grade)
- Ordering is good for contests with winners and losers. Reorder the output using x = fct_rev(fct_infreq(state))
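The reordering helpers can be checked on a small factor before applying them to the plot. This sketch uses a hypothetical toy vector, not values taken from the loans data.

```r
library(forcats)

# Hypothetical toy data: CA appears 3 times, NJ twice, HI once.
state <- factor(c("CA", "NJ", "CA", "HI", "CA", "NJ"))

by_freq     <- fct_infreq(state)   # levels ordered from most to least frequent
by_freq_rev <- fct_rev(by_freq)    # same ordering, reversed

levels(by_freq)      # "CA" "NJ" "HI"
levels(by_freq_rev)  # "HI" "NJ" "CA"
```

Reversing the frequency order is useful with horizontal bars, where the first level is drawn at the bottom of the axis, so the most frequent category ends up on top.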
Scatter plots are good for showing co-variation between two numerical variables.
loans %>%
  # filter(debt_to_income < 100) %>%
  ggplot() +
  aes(x = debt_to_income,
      y = interest_rate) +
  geom_point()
  # geom_hex()
This is an unusual scatter plot with:
- overplotting: Address this by using a hex plot to bin the data.
- outliers (debt-income > 100%): Ignore outliers by filtering them out.
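Before filtering out the outliers, it can help to count what the filter removes. The sketch below tallies the rows on each side of the cutoff; the names dti_summary and kept are illustrative. Note that filter() also drops rows where debt_to_income is missing, because the comparison evaluates to NA for them.

```r
library(dplyr)
library(openintro)  # provides loans_full_schema

# Count all rows, rows at or above the 100% cutoff, and rows with a missing ratio.
dti_summary <- loans_full_schema %>%
  summarise(
    total    = n(),
    over_100 = sum(debt_to_income >= 100, na.rm = TRUE),
    missing  = sum(is.na(debt_to_income))
  )

# filter() keeps only rows where the condition is TRUE, so NA rows are dropped too.
kept <- loans_full_schema %>%
  filter(debt_to_income < 100)

dti_summary
nrow(kept)
```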
A scatter plot with time on one axis is called a time series plot.
library(gapminder)
gapminder %>%
  ggplot() +
  aes(x = year, y = lifeExp) +
  geom_smooth()
Box plots are good for visualizing the spread of numerical variables and outliers. They can be used for univariate, numerical data as well.
loans %>%
  ggplot() +
  aes(x = grade,
      y = interest_rate) +
  geom_boxplot()
Boxplots show five summary statistics:
- the median
- two hinges (1st & 3rd quartiles)
- two whiskers (each extends from a hinge to the most extreme value within 1.5*IQR of that hinge)

All “outlying” points beyond the whiskers are plotted individually.
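The whisker and outlier rule for box plots can be reproduced by hand. This is a minimal sketch on hypothetical toy data, assuming the hinges are the quartiles returned by quantile() with its default settings. Other tools, such as base R's boxplot(), may compute hinges slightly differently.

```r
# Hypothetical toy data with one extreme value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q   <- quantile(x, c(0.25, 0.5, 0.75))  # lower hinge, median, upper hinge
iqr <- q[[3]] - q[[1]]

# Fences sit 1.5 * IQR beyond each hinge.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers end at the most extreme observations that are still inside the fences.
lower_whisker <- min(x[x >= lower_fence])
upper_whisker <- max(x[x <= upper_fence])

# Everything outside the fences is drawn as an individual point.
outliers <- x[x < lower_fence | x > upper_fence]

c(lower_whisker = lower_whisker, upper_whisker = upper_whisker)
outliers
```

Here the upper whisker stops at 9, the largest value inside the fence, and 30 is the only outlier.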
Demo the following changes:
- Focus first on annual income only to show outliers: aes(y=annual_income)
- Highlight outliers using color: geom_boxplot(outlier.colour="red")
Mosaic plots (mentioned in the text) can be used to show category/category relationships, but they’re not supported by ggplot.
Bar plots can do this as well, but only by filling or faceting.
loans %>%
  ggplot() +
  aes(y = homeownership, fill = grade) +
  geom_bar()
# loans %>%
#   ggplot() +
#   aes(y = homeownership, fill = grade) +
#   geom_bar(position = "fill")
Demo the following changes:
- To focus on the relative percentages of the whole, use: geom_bar(position = "fill")
- Which form is better for visualizing the relationship between home ownership and loan grade?
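The proportions that a filled bar plot displays can also be computed as a table. This is a sketch with dplyr, not part of the original demo; grade_shares and share are illustrative names.

```r
library(dplyr)
library(openintro)  # provides loans_full_schema

# For each homeownership category, compute the share of loans in each grade.
grade_shares <- loans_full_schema %>%
  count(homeownership, grade) %>%
  group_by(homeownership) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

grade_shares
```

Within each homeownership category the shares sum to 1, which is why every bar in the filled plot has the same length.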
These plot types, some not supported by ggplot, are included here for reference.
library(ggridges)
loans %>%
  ggplot() +
  aes(x = loan_amount,
      y = grade,
      fill = grade,
      color = grade,
  ) +
  geom_density_ridges(
    alpha = 0.5
  )
R provides a wide variety of customizable plotting primitives, including:
- The ggplot2 plots demoed here.
- Plots provided by other packages, e.g.:
  - Mosaic plots
  - Maps
  - Network graphs
See the RStudio ggplot cheat sheet.