class: left, top, title-slide

.title[
# Predictive Analytics Unit 9: Clustering and Segmentation
]
.author[
### Ken Arnold
Calvin University
]

---

## Unsupervised Learning

* So far we have been doing *supervised* learning, where we have a *target* we're trying to predict.
  * "How much will these homes sell for?"
  * "How long will this person spend watching this video?"
* **Unsupervised** learning applies when we don't have an exact target to predict, or when we want to explore relationships in the data.
  * "What general types of homes are on the market right now?"
  * "What are some different segments of our customer base?"
  * "[Are there distinct types of Covid-19 symptoms?](https://covid.joinzoe.com/us-post/covid-clusters)"
* **Clustering** is one very common type of unsupervised learning.

---

## Clustering

Goal: put observations into groups

* Those in the *same* group should be *similar to each other*.
* Those in *different* groups should be *different*.

Crucial questions:

* How many groups?
* How do we define "similar" / "different"?

---



.floating-source[Artwork by [@allison_horst](https://github.com/allisonhorst/stats-illustrations)]

---

## *Many* types of clustering algorithms

<img src="img/sphx_glr_plot_cluster_comparison_001.png" width="100%" style="display: block; margin: auto;" />

.floating-source[Source: [sklearn documentation](https://scikit-learn.org/stable/modules/clustering.html)]

---

```r
# Fix the seed so the random starting centers are reproducible.
set.seed(20201120)

data_for_clustering <- ames_all %>%
  select(Latitude, Longitude) %>%
  #select(Year_Built, Gr_Liv_Area) %>%
  mutate(
    #Gr_Liv_Area = rescale(Gr_Liv_Area, to = c(0, 1)),
    #Year_Built = rescale(Year_Built, to = c(0, 1))
  )

# nstart = 4: try 4 random sets of starting centers and keep the best fit.
clustering_results <- kmeans(
  data_for_clustering,
  nstart = 4,
  centers = 3
)

# Attach each home's cluster assignment to the full data set.
ames_with_clusters <- ames_all %>%
  mutate(cluster = as.factor(clustering_results$cluster))
```

```r
glance(clustering_results)
```

```
# A tibble: 1 × 4
  totss tot.withinss betweenss  iter
  <dbl>        <dbl>     <dbl> <int>
1  2.43        0.687      1.75     3
```

```r
tidy(clustering_results)
```

```
# A tibble: 3 × 5
  Latitude Longitude  size withinss cluster
     <dbl>     <dbl> <int>    <dbl> <fct>
1     42.0     -93.7   801    0.251 1
2     42.0     -93.6   845    0.326 2
3     42.1     -93.6   766    0.110 3
```

---

.small-code[

```r
latlong_plot <- ggplot(ames_with_clusters, aes(y = Latitude, x = Longitude, color = cluster)) +
  geom_point(alpha = .5)
year_area_plot <- ggplot(ames_with_clusters, aes(x = Gr_Liv_Area, y = Year_Built, color = cluster)) +
  geom_point(alpha = .5)

library(patchwork)
latlong_plot + year_area_plot + plot_layout(guides='collect')
```

<img src="slides09cluster_files/figure-html/cluster-plots-1.png" width="90%" style="display: block; margin: auto;" />

]
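
---

## One way to approach "How many groups?"

The example above fixed `centers = 3`. One common heuristic (not the only one) is to fit k-means for several values of `k` and look for an "elbow" in the total within-cluster sum of squares, the same `tot.withinss` that `glance()` reports. The sketch below is an illustration under stated assumptions: it uses simulated stand-in data (`toy_data`, three made-up blobs) rather than `ames_all`, and base R plotting so it runs without extra packages. The names `toy_data`, `candidate_k`, and `elbow` are hypothetical.

```r
# Simulated stand-in data (NOT the Ames housing data): three blobs in 2-D.
set.seed(20201120)
centers_true <- rbind(c(0, 0), c(5, 5), c(0, 5))
toy_data <- do.call(rbind, lapply(1:3, function(i) {
  cbind(
    x = rnorm(100, mean = centers_true[i, 1], sd = 0.5),
    y = rnorm(100, mean = centers_true[i, 2], sd = 0.5)
  )
}))

# Fit k-means for each candidate number of clusters and record the
# total within-cluster sum of squares (lower means tighter clusters).
candidate_k <- 1:8
tot_withinss <- sapply(candidate_k, function(k) {
  kmeans(toy_data, centers = k, nstart = 10)$tot.withinss
})

elbow <- data.frame(k = candidate_k, tot_withinss = tot_withinss)
print(elbow)

# The "elbow" is roughly where adding another cluster stops helping much.
plot(elbow$k, elbow$tot_withinss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

Within-cluster sum of squares generally keeps shrinking as `k` grows, so the smallest value is not the goal; the bend in the curve is only a rough guide, and real data such as the Ames homes may not show a clear one.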