This document demonstrates data cleansing in the context of a pet-ownership dataset.
library(tidyverse)
We’ll work with the following (contrived) dataset of dog ownership information. This dataset is inspired by an example from DSBox.
dogs <- read_csv("data/dog_lovers.csv")
## Rows: 60 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, number_of_dogs, handedness, report_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dogs
## # A tibble: 60 × 4
## name number_of_dogs handedness report_date
## <chr> <chr> <chr> <chr>
## 1 Bernice Warren 0 left 16-Apr-10
## 2 Woodrow Stone 0 left 5-May-10
## 3 Willie Bass 1 left 5-Aug-10
## 4 Tyrone Estrada 2 left 23-Nov-10
## 5 Alex Daniels 3 left 17-Feb-11
## 6 Jane Bates 2 left 28-Mar-11
## 7 Latoya Simpson 1 left 3-Dec-12
## 8 Darin Woods 1 left 11-Dec-12
## 9 Agnes Cobb 1 left 2-Jan-13
## 10 Tabitha Grant 0 left 29-Oct-13
## # … with 50 more rows
We’d like to compute the average number of dogs for the owners listed in the dataset.
dogs %>%
summarize(mean_dog_count = mean(number_of_dogs))
## Warning in mean.default(number_of_dogs): argument is not numeric or logical:
## returning NA
## # A tibble: 1 × 1
## mean_dog_count
## <dbl>
## 1 NA
Use the error or warning message to debug the problem. Somehow, the
number_of_dogs field is non-numeric. Looking through the
data, we see that there are some NA values, presumably for
people who didn’t enter a value for this field. We can fix those by
ignoring NA values when computing the average.
dogs %>%
summarize(mean_dog_count = mean(number_of_dogs, na.rm = TRUE))
## Warning in mean.default(number_of_dogs, na.rm = TRUE): argument is not numeric
## or logical: returning NA
## # A tibble: 1 × 1
## mean_dog_count
## <dbl>
## 1 NA
We still get an NA result, and on going back to the
original data and see all sorts of odd values for the
number_of_dogs field. It appears that we’ll need to clean
up the values for Tony and Ginger as well.
dogs_cleansed <- dogs %>%
mutate(
number_of_dogs = case_when(
name == "Tony Garcia" ~ 2,
name == "Ginger Clark" ~ 2,
TRUE ~ as.numeric(number_of_dogs)
)
)
## Warning in eval_tidy(pair$rhs, env = default_env): NAs introduced by coercion
dogs_cleansed %>%
summarize(mean_dog_count = mean(number_of_dogs, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_dog_count
## <dbl>
## 1 17.7
Finally, we get a number, but it’s rather high. On yet further
investigation, we se that the entries for Terrance and Travis are
probably out-of-range. We can fix that using a function that sets
outliers to NA (a version of NA designed for
double values).
remove_outliers <- function(x,
min_limit=0,
max_limit=10) {
case_when(
x < min_limit ~ NA_real_,
x > max_limit ~ NA_real_,
TRUE ~ x
)
}
This function is a (self-documenting) abstraction that we can use to cleanse outliers.
dogs_cleansed <- dogs_cleansed %>%
mutate(
count = remove_outliers(
as.numeric(number_of_dogs)
)
)
dogs_cleansed %>%
select(name, count, number_of_dogs) %>%
arrange(desc(is.na(count)))
## # A tibble: 60 × 3
## name count number_of_dogs
## <chr> <dbl> <dbl>
## 1 Woodrow Elliott NA NA
## 2 Terrence Harrington NA -1
## 3 Travis Garner NA 1000
## 4 Bernice Warren 0 0
## 5 Woodrow Stone 0 0
## 6 Willie Bass 1 1
## 7 Tyrone Estrada 2 2
## 8 Alex Daniels 3 3
## 9 Jane Bates 2 2
## 10 Latoya Simpson 1 1
## # … with 50 more rows
Finally, we can get a reasonable average value.
dogs_cleansed %>%
summarize(mean_dog_count = mean(count, na.rm = TRUE))
## # A tibble: 1 × 1
## mean_dog_count
## <dbl>
## 1 0.807
The moral of this story is that you can’t assume that data entered freely by users is clean.