class: center, middle, inverse, title-slide

# Making Inferences about Features
### K Arnold

---

## Q&A

> Logistics impact of moving online?

* We'll continue normal class meetings. **Your job** is to ask questions, interrupting if necessary.

> Projects?

* Our [final exam slot](https://calvin.edu/offices-services/center-for-student-success/registration/exam-schedule/) is Friday Dec. 11 at 1:30 pm.
* Final project presentations will be virtual, details TBA

---

## Q&A

> Is web scraping legal??

* Publicly available data? [Yes.](https://realpython.com/podcasts/rpp/12/) (see [EFF article](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data))
* If you need a login? Maybe. Ask for permission.

> How hard is it to scrape Twitter?

Constraints about what you can access without paying. But [tools available](https://guides.lib.utexas.edu/c.php?g=743999&p=5326728)

> What's a good Python dev environment?

I use: RStudio when integrating with R, VS Code for command-line or web dev, Jupyter Notebooks for pure-Python data science.

---

## What's your goal?

* **Predict** unseen labels
  * How much will this house sell for?
  * Does this child have autism?
  * Is this a positive or negative movie review?
* **Infer** relationships between features and labels
  * How much does home size affect price?
  * Is DNA methylation a marker of autism?
  * Does "sick" indicate a positive or negative review?
* Understand the **causal** effect of interventions
  * How much will building an addition increase the price of my home?
  * Will antioxidants prevent autism?
  * Will cutting this scene make my movie get better reviews?

---

## Techniques for Inference

* Classical statistical inference
  * 2-sample t tests, chi squared tests, ANOVA, ...
  * inference about model parameters (coefficient standard errors etc.)
* Variable importance plots
* Benefit of adding each feature

---

## Objectives for Today

* Identify several different approaches for drawing conclusions about features
* Recognize potential challenges in making those inferences

---

## Palmer Penguins

.pull-left[
<img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png" width="50%" style="display: block; margin: auto;" />

```r
library(palmerpenguins)
```
]

.pull-right[
<img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png" width="100%" style="display: block; margin: auto;" />
]

.floating-source[
[Artwork by @allison_horst](https://allisonhorst.github.io/palmerpenguins/articles/art.html)
]

---

## How does bill length relate to bill depth?

.small-code[
<img src="w12d2-inference_files/figure-html/penguins-simpsons-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## How does bill length relate to bill depth?

.small-code[
<img src="w12d2-inference_files/figure-html/penguins-with-species-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Variable Importance Plots

.small-code[

```r
regresion_workflow <- workflow() %>%
  add_model(decision_tree(mode = "regression") %>% set_engine('rpart'))

model <- regresion_workflow %>%
  add_recipe(recipe(Sale_Price ~ ., data = ames_train)) %>%
  fit(data = ames_train)
model %>% pull_workflow_fit() %>% vip::vip(num_features = 15L)
```

<img src="w12d2-inference_files/figure-html/tree-vip-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## How much does it help to have a feature in?
```r
set.seed(20201118)
resamples <- vfold_cv(ames_train, v = 10)
```

```r
regresion_workflow %>%
  add_recipe(recipe(Sale_Price ~ ., data = ames_train)) %>%
  fit_resamples(resamples = resamples, metrics = metric_set(mae, rmse)) %>%
  collect_metrics()
```

| .metric | .estimator |     mean |  n |   std_err |
|:--------|:-----------|---------:|---:|----------:|
| mae     | standard   | 25.53709 | 10 | 0.5663083 |
| rmse    | standard   | 36.71345 | 10 | 0.9225024 |

```r
regresion_workflow %>%
* add_recipe(recipe(Sale_Price ~ ., data = ames_train) %>% step_rm(ends_with("Qual"))) %>%
  fit_resamples(resamples = resamples, metrics = metric_set(mae, rmse)) %>%
  collect_metrics()
```

| .metric | .estimator |     mean |  n |   std_err |
|:--------|:-----------|---------:|---:|----------:|
| mae     | standard   | 25.36555 | 10 | 0.6889984 |
| rmse    | standard   | 36.45790 | 10 | 0.9566548 |

---

## Variable Importance without the Quality Features

```r
regresion_workflow %>%
* add_recipe(recipe(Sale_Price ~ ., data = ames_train) %>% step_rm(ends_with("Qual"))) %>%
  fit(data =
      ames_train) %>%
  pull_workflow_fit() %>%
  vip::vip(num_features = 15L)
```

<img src="w12d2-inference_files/figure-html/tree-vip-without-qual-1.png" width="100%" style="display: block; margin: auto;" />

---

## Appendix: code

.small-code[

```r
include_graphics("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png")
include_graphics("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png")
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Penguin bill dimensions",
       subtitle = "Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Bill depth (mm)")
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin bill dimensions",
       subtitle = "Bill length and depth for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Bill depth (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.85, 0.15),
        legend.background = element_rect(fill = "white", color = NA))
#data(ames, package = "modeldata")
ames <- AmesHousing::make_ames()
ames_all <- ames %>%
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>%
  mutate(across(where(is.integer), as.double)) %>%
  mutate(Sale_Price = Sale_Price / 1000)
rm(ames)
set.seed(10) # Seed the random number generator
ames_split <- initial_split(ames_all, prop = 2 / 3)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```

]
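
---

## Appendix: pooled vs. per-species slopes

The two "How does bill length relate to bill depth?" plots draw one trend line for all penguins and then one per species. The sketch below is not part of the original slides. It puts numbers on those lines with plain `lm()` fits, and it shows the coefficient standard errors mentioned under classical inference. The names `bills`, `pooled`, `pooled_slope`, and `species_slopes` are made up for this illustration.

```r
library(palmerpenguins)

# Keep only rows where both bill measurements are present,
# so the pooled and per-species fits use the same penguins.
bills <- subset(penguins, !is.na(bill_length_mm) & !is.na(bill_depth_mm))

# Pooled fit: a single slope for all penguins, ignoring species.
pooled <- lm(bill_depth_mm ~ bill_length_mm, data = bills)
pooled_slope <- coef(pooled)[["bill_length_mm"]]

# One separate fit per species; keep just the bill-length slope from each.
species_slopes <- sapply(split(bills, bills$species), function(d) {
  coef(lm(bill_depth_mm ~ bill_length_mm, data = d))[["bill_length_mm"]]
})

# Coefficient table for the pooled fit:
# estimate, standard error, t value, and p value for each term.
print(coef(summary(pooled)))

# Compare the sign of the pooled slope with the per-species slopes.
print(pooled_slope)
print(species_slopes)
```

If the pooled slope and the per-species slopes differ in sign, the conclusion about the feature depends on whether species is in the model. That is one of the challenges in making inferences about features that these slides point to.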