class: left, top, title-slide

# Programming
## Data Cleansing
### Keith VanderLinden
### Calvin University

---

# Data Cleansing

As mentioned earlier, we distinguish between:

1. Data *Cleansing*
2. Data *Tidying*
3. Data *Wrangling*

This unit focuses on *cleansing* data, the process of transforming “raw” data values into forms appropriate for computation. Computation requires that data:

- be represented using an appropriate data type, either numerical or categorical.
- be consistent with respect to in-range and out-of-range values.

???

- Type examples: integers/doubles/strings; dates; factors
- Consistency examples:
  - Date strings, e.g., "Feb 7, 1961" vs "02-07-61"
  - Category names, e.g., "Joe Biden" vs "Joseph R. Biden"

---

# Converting Between Types

.pull-left[

*Explicit* Type Conversion

```r
as.character(list(1, 2.0, "three", TRUE))
```

```
[1] "1"     "2"     "three" "TRUE"
```

```r
as.logical(list(0, 1, -2, 3.3, "FALSE"))
```

```
[1] FALSE  TRUE  TRUE  TRUE FALSE
```

```r
as.integer(list(1, 1.2, -2.3, "one", FALSE))
```

```
Warning: NAs introduced by coercion
```

```
[1]  1  1 -2 NA  0
```

```r
parse_number("$1,000")
```

```
[1] 1000
```

]

.pull-right[

*Implicit* Type Conversion

```r
c(1, "one", 1.0)
```

```
[1] "1"   "one" "1"
```

```r
c(1, 1.2)
```

```
[1] 1.0 1.2
```

```r
c(TRUE, "True")
```

```
[1] "TRUE" "True"
```

]

???

Be clear on the implicit conversion order: logical, then integer, then double, then character.

Compare & contrast `as.numeric()` vs. `parse_number()` (from the readr package).

- https://mdsr-book.github.io/mdsr2e/ch-dataII.html#from-strings-to-numbers

---

# Handling Special Values

.pull-left[

Doubles have three special values that represent known but undefined values.

```r
c(-1, 0, 1) / 0
```

```
[1] -Inf  NaN  Inf
```

`NA` represents missing values.

```r
l <- c(1, 2, 3, NA)
mean(l)
```

```
[1] NA
```

```r
mean(l, na.rm = TRUE)
```

```
[1] 2
```

]

.pull-right[

`NA` is a logical value and is treated as “unknown” using three-valued logic.

```r
typeof(NA)
```

```
[1] "logical"
```

```r
TRUE | NA
```

```
[1] TRUE
```

```r
FALSE | NA
```

```
[1] NA
```

```r
TRUE & NA
```

```
[1] NA
```

]

???
- `NA`: Not available
- `NaN`: Not a number
- `Inf`: Positive infinity
- `-Inf`: Negative infinity
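
If students ask how to tell these values apart in practice, the following is a minimal sketch using only base R predicates. The vector `x` and the recoding step at the end are illustrative assumptions, not data or a procedure from the slides.

```r
# Hypothetical vector mixing ordinary and special values
x <- c(1, NA, NaN, Inf, -Inf, 3)

is.na(x)        # TRUE for NA and for NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE only for Inf and -Inf
is.finite(x)    # TRUE only for ordinary numbers (here 1 and 3)

# Comparing with == propagates the unknown, so test with is.na() instead
NA == NA        # NA, not TRUE

# One possible cleansing step (an assumption, not the only option):
# recode every non-finite value as a missing double, then summarize
x_clean <- ifelse(is.finite(x), x, NA_real_)
mean(x_clean, na.rm = TRUE)  # 2
```

Recoding `Inf` and `NaN` as `NA` discards the distinction between "undefined" and "missing", so it is only appropriate when the later analysis treats both the same way.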