class: center, middle, inverse, title-slide

# Visualising numerical and categorical data

### K. Arnold, based on datasciencebox.org

---
class: middle

# Terminology

---

## Number of variables involved

- **Univariate** data analysis - distribution of a single variable
- **Bivariate** data analysis - relationship between two variables
- **Multivariate** data analysis - relationship between many variables at once
  - usually focusing on bivariate relationships while conditioning on the others

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete** based on whether the variable can take any value in a range or only separate, countable values (typically non-negative whole numbers), respectively.
- If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering.

---
class: middle

# Data

---

## Data: Lending Club

.pull-left-wide[
- Thousands of loans made through the Lending Club, which is a platform that allows individuals to lend to other individuals
- Not all loans are created equal -- ease of getting a loan depends on (apparent) ability to pay back the loan
- Data includes loans *made*; these are not loan applications
]
.pull-right-narrow[
<img src="img/lending-club.png" width="100%" style="display: block; margin: auto;" />
]

---

## Take a peek at data

```r
library(tidyverse)  # provides glimpse(), %>%, select(), and ggplot()
library(openintro)
glimpse(loans_full_schema)
```

```
## Rows: 10,000
## Columns: 55
## $ emp_title <chr> "global config enginee…
## $ emp_length <dbl> 3, 10, 3, 1, 10, NA, 1…
## $ state <fct> NJ, HI, WI, PA, CA, KY…
## $ homeownership <fct> MORTGAGE, RENT, RENT, …
## $ annual_income <dbl> 90000, 40000, 40000, 3…
## $ verified_income <fct> Verified, Not Verified…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10…
## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000,…
## $ verification_income_joint <fct> , , , , Verified, , No…
## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66,…
## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, 1…
## $ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3,…
## $ earliest_credit_line <dbl> 2001, 1996, 2006, 2007…
## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, 1…
## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32,…
## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12,…
## $ total_credit_limit <int> 70795, 28800, 24193, 2…
## $ total_credit_utilized <int> 38767, 4321, 16000, 49…
## $ num_collections_last_12m <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay <int> 0, 1, 0, 1, 0, 0, 0, 0…
## $ months_since_90d_late <int> 38, NA, 28, NA, NA, 60…
## $ current_accounts_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever <int> 1250, 0, 432, 0, 0, 0,…
## $ current_installment_accounts <int> 2, 0, 1, 1, 1, 0, 2, 2…
## $ accounts_opened_24m <int> 5, 11, 13, 1, 6, 2, 1,…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, …
## $ num_satisfactory_accounts <int> 10, 14, 10, 4, 16, 12,…
## $ num_accounts_120d_past_due <int> 0, 0, 0, 0, 0, 0, 0, N…
## $ num_accounts_30d_past_due <int> 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts <int> 2, 3, 3, 2, 10, 1, 3, …
## $ total_debit_limit <int> 11100, 16500, 4300, 19…
## $ num_total_cc_accounts <int> 14, 24, 14, 3, 20, 27,…
## $ num_open_cc_accounts <int> 8, 14, 8, 3, 15, 12, 7…
## $ num_cc_carrying_balance <int> 6, 4, 6, 2, 13, 5, 6, …
## $ num_mort_accounts <int> 1, 0, 0, 0, 0, 3, 2, 7…
## $ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100…
## $ tax_liens <int> 0, 0, 0, 1, 0, 0, 0, 0…
## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0…
## $ loan_purpose <fct> moving, debt_consolida…
## $ application_type <fct> individual, individual…
## $ loan_amount <int> 28000, 5000, 2000, 216…
## $ term <dbl> 60, 36, 36, 36, 36, 36…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6…
## $ installment <dbl> 652.53, 167.54, 71.40,…
## $ grade <ord> C, C, D, A, C, A, C, B…
## $ sub_grade <fct> C3, C1, D1, A3, C3, A3…
## $ issue_month <fct> Mar-2018, Feb-2018, Fe…
## $ loan_status <fct> Current, Current, Curr…
## $ initial_listing_status <fct> whole, whole, fraction…
## $ disbursement_method <fct> Cash, Cash, Cash, Cash…
## $ balance <dbl> 27015.86, 4651.37, 182…
## $ paid_total <dbl> 1999.330, 499.120, 281…
## $ paid_principal <dbl> 984.14, 348.63, 175.37…
## $ paid_interest <dbl> 1015.19, 150.49, 106.4…
## $ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0…
```

---

## Selected variables

```r
loans <- loans_full_schema %>%
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income)
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 8
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
```

---

.smaller-table[

```r
loans %>%
  head() %>%
  knitr::kable()
```

| loan_amount| interest_rate| term|grade |state | annual_income|homeownership | debt_to_income|
|-----------:|-------------:|----:|:-----|:-----|-------------:|:-------------|--------------:|
| 28000| 14.07| 60|C |NJ | 90000|MORTGAGE | 18.01|
| 5000| 12.61| 36|C |HI | 40000|RENT | 5.04|
| 2000| 17.09| 36|D |WI | 40000|RENT | 21.15|
| 21600| 6.72| 36|A |PA | 30000|RENT | 10.16|
| 23000| 14.07| 36|C |CA | 35000|RENT | 57.96|
| 5000| 6.72| 36|A |KY | 34000|OWN | 6.46|

]

---

## Selected variables

<br>

.midi[
variable | description
----------------|-------------
`loan_amount` | Amount of the loan received, in US dollars
`interest_rate` | Interest rate on the loan, as an annual percentage
`term` | The length of the loan, which is always set as a whole number of months
`grade` | Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid
`state` | US state where the borrower resides
`annual_income` | Borrower’s annual income, including any second income, in US dollars
`homeownership` | Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income` | Debt-to-income ratio
]

---

## Variable types

<br>

variable | type
----------------|-------------
`loan_amount` | numerical, continuous
`interest_rate` | numerical, continuous
`term` | numerical, discrete
`grade` | categorical, ordinal
`state` | categorical, not ordinal
`annual_income` | numerical, continuous
`homeownership` | categorical, not ordinal
`debt_to_income` | numerical, continuous

---
class: middle

# Visualizing numerical data

---

## Describing shapes of numerical distributions

- center: mean (`mean`), median (`median`), mode (not always useful)
- spread: range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
- shape:
  - skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
  - modality: unimodal, bimodal, multimodal, uniform
- unusual observations

---
class: middle

# Histogram

---

## Histogram

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-6-1.png" width="75%" style="display: block; margin: auto;" />

---

## Histograms and binwidth

.panelset[
.panel[.panel-name[binwidth = 1000]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 1000)
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-7-1.png" width="75%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 5000]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000)
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-8-1.png" width="75%" style="display: block; margin: auto;" />
]
.panel[.panel-name[binwidth = 20000]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 20000)
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-9-1.png" width="75%" style="display: block; margin: auto;" />
]
]

---

## Customizing histograms

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_histogram(binwidth = 5000) +
* labs(
*   x = "Loan amount ($)",
*   y = "Frequency",
*   title = "Amounts of Lending Club loans"
* )
```
]
]

---

## Fill with a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-11-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = loan_amount,
*                 fill = homeownership)) +
  geom_histogram(binwidth = 5000,
*                alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  )
```
]
]

---

## Facet with a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = loan_amount, fill = homeownership)) +
  geom_histogram(binwidth = 5000) +
  labs(
    x = "Loan amount ($)",
    y = "Frequency",
    title = "Amounts of Lending Club loans"
  ) +
* facet_wrap(vars(homeownership), nrow = 3)
```
]
]

---
class: middle

# Density plot

---

## Density plot

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" />

---

## Density plots and adjusting bandwidth

.panelset[
.panel[.panel-name[adjust = 0.5]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 0.5)
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-14-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 1]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 1) # default bandwidth
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[adjust = 2]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2)
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-16-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Customizing density plots

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = loan_amount)) +
  geom_density(adjust = 2) +
* labs(
*   x = "Loan amount ($)",
*   y = "Density",
*   title = "Amounts of Lending Club loans"
* )
```
]
]

---

## Adding a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = loan_amount,
*                 fill = homeownership)) +
  geom_density(adjust = 2,
*              alpha = 0.5) +
  labs(
    x = "Loan amount ($)",
    y = "Density",
    title = "Amounts of Lending Club loans",
*   fill = "Homeownership"
  )
```
]
]

---
class: middle

# Box plot

---

## Box plot

```r
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" />

---

## Box plot and outliers

```r
ggplot(loans, aes(x = annual_income)) +
  geom_boxplot()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" />

---

## Customizing box plots

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = NULL,
    title = "Interest rates of Lending Club loans"
  ) +
* theme(
*   axis.ticks.y = element_blank(),
*   axis.text.y = element_blank()
* )
```
]
]

---

## Adding a categorical variable

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
ggplot(loans, aes(x = interest_rate,
*                 y = grade)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
*   subtitle = "by grade of loan"
  )
```
]
]

---
class: middle

# Relationships between numerical variables

---

## Scatterplot

```r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_point()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" />

---

## Hex plot

```r
ggplot(loans, aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" />

---

## Hex plot

```r
ggplot(loans %>% filter(debt_to_income < 100),
       aes(x = debt_to_income, y = interest_rate)) +
  geom_hex()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" />

---
class: middle

# Categorical Data

---

### Which variables are *categorical*?

```r
glimpse(loans)
```

```
## Rows: 10,000
## Columns: 8
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000, 2…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.72, …
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 36, …
## $ grade <ord> C, C, D, A, C, A, C, B, C, A, C, B, C, B…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, IL, …
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34000…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN, M…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.46, …
```

---
class: middle

# Bar plot

---

## Bar plot

```r
ggplot(loans, aes(x = homeownership)) +
  geom_bar()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-27-1.png" width="60%" style="display: block; margin: auto;" />

---

## Bar plot with lots of categories

```r
ggplot(loans, aes(x = state)) +
  geom_bar()
```

<img src="w03-viz-numerical_files/figure-html/barplot-state-x-1.png" width="60%" style="display: block; margin: auto;" />

---

## Flip!

```r
ggplot(loans, aes(y = state)) +
  geom_bar()
```

<img src="w03-viz-numerical_files/figure-html/barplot-state-y-1.png" width="60%" style="display: block; margin: auto;" />

---

## Use a meaningful order!

.pull-left[

```r
ggplot(loans, aes(y = fct_infreq(state))) +
  geom_bar()
```

<img src="w03-viz-numerical_files/figure-html/barplot-state-y-infreq-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

```r
# bonus!
ggplot(loans, aes(y = state %>%
                    fct_infreq() %>%
                    fct_lump_n(15) %>%
                    fct_rev())) +
  geom_bar()
```

<img src="w03-viz-numerical_files/figure-html/barplot-state-y-other-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Segmented bar plot

```r
ggplot(loans, aes(y = homeownership,
*                 fill = grade)) +
  geom_bar()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" />

---

## Segmented bar plot

```r
ggplot(loans, aes(y = homeownership, fill = grade)) +
* geom_bar(position = "fill")
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-29-1.png" width="60%" style="display: block; margin: auto;" />

---

.question[
Which bar plot is a more useful representation for visualizing the relationship between homeownership and grade?
]

.pull-left[
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-30-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Customizing bar plots

.panelset[
.panel[.panel-name[Plot]
<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-32-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

```r
*ggplot(loans, aes(y = homeownership, fill = grade)) +
  geom_bar(position = "fill") +
* labs(
*   x = "Proportion",
*   y = "Homeownership",
*   fill = "Grade",
*   title = "Grades of Lending Club loans",
*   subtitle = "and homeownership of lendee"
* )
```
]
]

---

## Gotcha: `geom_bar` summarizes the data for you!

.panelset[
.panel[.panel-name[Counting]

```r
loan_proportions <- loans %>%
  group_by(homeownership, grade) %>%
  summarize(count = n()) %>%
  group_by(homeownership) %>%
  mutate(prop = count / sum(count))

loan_proportions
```

```
## # A tibble: 21 × 4
## # Groups: homeownership [3]
## homeownership grade count prop
## <fct> <ord> <int> <dbl>
## 1 MORTGAGE A 1285 0.268
## 2 MORTGAGE B 1499 0.313
## 3 MORTGAGE C 1234 0.258
## 4 MORTGAGE D 587 0.123
## 5 MORTGAGE E 148 0.0309
## 6 MORTGAGE F 32 0.00668
## # … with 15 more rows
```
]
.panel[.panel-name[Plotting]

```r
ggplot(loan_proportions, aes(y = homeownership, fill = grade,
*                            x = prop)) +
* geom_col()
```

<img src="w03-viz-numerical_files/figure-html/manual-counts-1.png" width="60%" style="display: block; margin: auto;" />
]
]

---
class: middle

# Relationships between numerical and categorical variables

---

## Already talked about...

- Colouring and faceting histograms and density plots
- Side-by-side box plots

---

## Violin plots

```r
ggplot(loans, aes(y = homeownership, x = loan_amount)) +
  geom_violin()
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" />

---

## Ridge plots

```r
library(ggridges)

ggplot(loans, aes(x = loan_amount, y = grade, fill = grade, color = grade)) +
  geom_density_ridges(alpha = 0.5)
```

<img src="w03-viz-numerical_files/figure-html/unnamed-chunk-35-1.png" width="60%" style="display: block; margin: auto;" />
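
---

## Numerical summaries by group

The violin and ridge plots can be paired with the summary functions named earlier (`mean`, `median`, `sd`, `IQR`), computed within each level of a categorical variable. The following is a minimal sketch, not part of the original deck: it assumes the `dplyr` and `openintro` packages are installed, and the names `amount_by_grade`, `n_loans`, `median_amount`, `iqr_amount`, `mean_amount`, and `sd_amount` are made up for illustration.

```r
library(dplyr)
library(openintro)

# Illustrative sketch: summarize loan amounts within each loan grade.
# Column names on the left-hand side are hypothetical, chosen for this example.
amount_by_grade <- loans_full_schema %>%
  group_by(grade) %>%
  summarize(
    n_loans = n(),                                      # group size
    median_amount = median(loan_amount, na.rm = TRUE),  # center
    iqr_amount = IQR(loan_amount, na.rm = TRUE),        # spread
    mean_amount = mean(loan_amount, na.rm = TRUE),      # center
    sd_amount = sd(loan_amount, na.rm = TRUE)           # spread
  )

amount_by_grade
```

The median and IQR are less sensitive to skew and outliers than the mean and standard deviation, so it can help to look at both pairs alongside the plots.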