This document demonstrates data cleansing in the context of a pet-ownership dataset.

library(tidyverse)

Dataset: Dog Owners

We’ll work with the following (contrived) dataset of dog ownership information. This dataset is inspired by an example from DSBox.

dogs <- read_csv("data/dog_lovers.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, number_of_dogs, handedness, report_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dogs
## # A tibble: 60 × 4
##    name           number_of_dogs handedness report_date
##    <chr>          <chr>          <chr>      <chr>      
##  1 Bernice Warren 0              left       16-Apr-10  
##  2 Woodrow Stone  0              left       5-May-10   
##  3 Willie Bass    1              left       5-Aug-10   
##  4 Tyrone Estrada 2              left       23-Nov-10  
##  5 Alex Daniels   3              left       17-Feb-11  
##  6 Jane Bates     2              left       28-Mar-11  
##  7 Latoya Simpson 1              left       3-Dec-12   
##  8 Darin Woods    1              left       11-Dec-12  
##  9 Agnes Cobb     1              left       2-Jan-13   
## 10 Tabitha Grant  0              left       29-Oct-13  
## # … with 50 more rows

Computing the Average Number of Dogs

We’d like to compute the average number of dogs for the owners listed in the dataset.

dogs %>%
  summarize(mean_dog_count = mean(number_of_dogs)) 
## Warning in mean.default(number_of_dogs): argument is not numeric or logical:
## returning NA
## # A tibble: 1 × 1
##   mean_dog_count
##            <dbl>
## 1             NA

Use the error or warning message to debug the problem. Somehow, the number_of_dogs field is non-numeric. Looking through the data, we see that there are some NA values, presumably for people who didn’t enter a value for this field. We can fix those by ignoring NA values when computing the average.

dogs %>%
  summarize(mean_dog_count = mean(number_of_dogs, na.rm = TRUE)) 
## Warning in mean.default(number_of_dogs, na.rm = TRUE): argument is not numeric
## or logical: returning NA
## # A tibble: 1 × 1
##   mean_dog_count
##            <dbl>
## 1             NA

We still get an NA result, and on going back to the original data and see all sorts of odd values for the number_of_dogs field. It appears that we’ll need to clean up the values for Tony and Ginger as well.

dogs_cleansed <- dogs %>%
  mutate(
    number_of_dogs = case_when(                                
      name == "Tony Garcia"         ~ 2,
      name == "Ginger Clark"        ~ 2,
      TRUE                          ~ as.numeric(number_of_dogs)
    )
  )
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion
dogs_cleansed %>%
  summarize(mean_dog_count = mean(number_of_dogs, na.rm = TRUE)) 
## # A tibble: 1 × 1
##   mean_dog_count
##            <dbl>
## 1           17.7

Finally, we get a number, but it’s rather high. On yet further investigation, we se that the entries for Terrance and Travis are probably out-of-range. We can fix that using a function that sets outliers to NA (a version of NA designed for double values).

remove_outliers <- function(x, 
                            min_limit=0, 
                            max_limit=10) {
  case_when(
    x < min_limit ~ NA_real_,
    x > max_limit ~ NA_real_,
    TRUE          ~ x
  )
}

This function is a (self-documenting) abstraction that we can use to cleanse outliers.

dogs_cleansed <- dogs_cleansed %>%
  mutate(
    count = remove_outliers(
      as.numeric(number_of_dogs)
      )
    )

dogs_cleansed %>%
  select(name, count, number_of_dogs) %>%
  arrange(desc(is.na(count)))
## # A tibble: 60 × 3
##    name                count number_of_dogs
##    <chr>               <dbl>          <dbl>
##  1 Woodrow Elliott        NA             NA
##  2 Terrence Harrington    NA             -1
##  3 Travis Garner          NA           1000
##  4 Bernice Warren          0              0
##  5 Woodrow Stone           0              0
##  6 Willie Bass             1              1
##  7 Tyrone Estrada          2              2
##  8 Alex Daniels            3              3
##  9 Jane Bates              2              2
## 10 Latoya Simpson          1              1
## # … with 50 more rows

Finally, we can get a reasonable average value.

dogs_cleansed %>%
  summarize(mean_dog_count = mean(count, na.rm = TRUE)) 
## # A tibble: 1 × 1
##   mean_dog_count
##            <dbl>
## 1          0.807

The moral of this story is that you can’t assume that data entered freely by users is clean.