class: center, middle, inverse, title-slide # Databases and Data Formats ### K Arnold --- ## Q&As > How can we control our personal use of tech when it's autonomously engineering itself for maximum attention? * As individuals: Set boundaries (a "rule of life"), and times of reflection ("examen"). Authors: [Justin W. Earley](https://www.thecommonrule.org/), James K.A. Smith, Andy Crouch. * As developers * Suggest topics/questions instead of specific videos? * Develop UI for helping people *reflect on* and *control* their engagement ("you often click on X, Y, and Z. How many minutes per day do you want to spend watching videos about X?") > Do companies have ethics requirements? Many do, but they can be vague. Everyone needs to be responsible. ACM Code of Ethics. --- ## Q&As > How has DS changed recently? * *Much* more data, *much* more compute. * Covid: data-inspired decisions are shaping our lives. --- ## Q&As > My thoughts on social media censoring? * Some censorship seems essential (child porn etc.) * Some choices we have to make: * Subreddits (moderators decide) vs Twitter (committee decides)? * Control at the source (Twitter) vs recommendations/ads (Facebook) * Where does responsibility lie: publisher? platform? algorithm? reader? * Should we even have algorithms recommending / ranking content? * Recommendation algorithms are opaquely amplifying certain content (and implicitly censoring others). * Recommendation subtly implies endorsement. So some censorship is required. --- ## Q&As > My thoughts on social media censoring? An issue of the heart: * Entirely uncensored anonymous communities tend to become hostile * We'll find ways to "satisfy our itching ears" (if not YouTube then talk radio, moving to avoid uncomfortable people/ideas, ...) * We must seek out, amplify, and create what is "true, honorable, just, pure", ... (Philippians 4) --- ## Rest of the semester * Today: Data formats, APIs, SQL * Wednesday: Text classification, bias * Friday: advanced modeling and forecasting * Monday: communication, publishing, more reflections on ethics and impact All at-home focus is on projects. I'm finally almost done with midterm project feedback! --- ## Data Formats * Tabular, delimited (CSV, TSV) or fixed-width * Tabular, structured: Excel, SPSS/Stata/SAS, etc. * Hierarchical * JSON * XML * HTML * Database * SQLite (most apps) * PostgreSQL / MySQL / Oracle / Microsoft SQL Server * Google BigQuery (For a big table, see the [`rio` package vignette](https://cloud.r-project.org/web/packages/rio/vignettes/rio.html)) --- ## Spreadsheets vs Databases .pull-left[ * Often exchanged by email * Hope the format doesn't change * [Capacity limited](https://www.bbc.com/news/technology-54423988) ([other article](https://theconversation.com/why-you-should-never-use-microsoft-excel-to-count-coronavirus-cases-147681)) * Slow to query * Decentralized (resilient?) ] .pull-right[ * Centralized, highly available servers * Documented schema * Large capacity * Fast queries ] --- ## APIs .question[ Name as many APIs as you can think of. ] --- class: larger-table ## API vs Local Data | Local Data | API | |-----|--------| | Any query you want | Only queries the API exposes | | As much as you want | Often rate-limited | | Full dataset must fit on your computer | Practically unlimited data | | | ... but you can only see a small part of it | | Must be complete dataset | Can stream in new data | --- ## Example API: Bike Share Feeds > The [General Bikeshare Feed Specification](https://github.com/NABSA/gbfs), known as GBFS, is the open data standard for bikeshare. GBFS makes real-time data feeds in a uniform format publicly available online, with an emphasis on findability. * [Capital Bikeshare data](https://gbfs.capitalbikeshare.com/gbfs/gbfs.json) * Station Information: <https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json> --- ### Step 1: Retrieve data via HTTP(S) ```r station_info_response <- httr::GET("https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json") station_info_response ``` ``` ## Response [https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json] ## Date: 2020-11-30 04:11 ## Status: 200 ## Content-Type: application/json ## Size: 305 kB ``` ```r station_info_text <- httr::content(station_info_response, as = "text") str_sub(station_info_text, end = 500) %>% cat() ``` ``` ## {"data":{"stations":[{"rental_methods":["CREDITCARD","KEY"],"station_type":"classic","lat":38.858971,"electric_bike_surcharge_waiver":false,"rental_uris":{"ios":"https://dc.lft.to/lastmile_qr_scan","android":"https://dc.lft.to/lastmile_qr_scan"},"short_name":"31000","station_id":"1","name":"Eads St & 15th St S","lon":-77.05323,"has_kiosk":true,"region_id":"41","legacy_id":"1","external_id":"082469cc-1f3f-11e7-bf6b-3863bb334450","eightd_station_services":[],"eightd_has_key_dispenser":false,"capac ``` --- ### Step 2: Parse the resulting JSON ```r station_info <- jsonlite::fromJSON(station_info_text, simplifyDataFrame = TRUE) stations <- station_info$data$stations stations %>% glimpse() ``` ``` ## Rows: 610 ## Columns: 16 ## $ rental_methods <list> [<"CREDITCARD", "KEY">, <"CREDITCARD", "KEY">, <"CREDITCA… ## $ station_type <chr> "classic", "classic", "classic", "classic", "classic", "cl… ## $ lat <dbl> 38.85897, 38.85725, 38.85643, 38.86017, 38.85787, 38.86230… ## $ electric_bike_surcharge_waiver <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA… ## $ rental_uris <df[,2]> <data.frame[33 x 2]> ## $ short_name <chr> "31000", "31001", "31002", "31003", "31004", "31005", "310… ## $ station_id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12", … ## $ name <chr> "Eads St & 15th St S", "18th St & S Eads St", "Crystal Dr … ## $ lon <dbl> -77.05323, -77.05332, -77.04923, -77.04959, -77.05949, -77… ## $ has_kiosk <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE… ## $ region_id <chr> "41", "41", "41", "41", "41", "41", "41", "41", "41", "41"… ## $ legacy_id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "10", "11", "12", … ## $ external_id <chr> "082469cc-1f3f-11e7-bf6b-3863bb334450", "08246b69-1f3f-11e… ## $ eightd_station_services <list> [[], [], [], [], [], [], [], [], [], [], [], [], [], [], … ## $ eightd_has_key_dispenser <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA… ## $ capacity <int> 15, 11, 17, 11, 11, 19, 15, 19, 11, 11, 17, 11, 15, 31, 27… ``` --- ### Step 3: Profit .small-code[ ```r capacity_palette <- leaflet::colorNumeric("GnBu", domain = stations$capacity) lp <- stations %>% leaflet() %>% addProviderTiles(providers$CartoDB.Positron) %>% addCircleMarkers(~ lon, ~ lat, color = ~capacity_palette(capacity), radius = 2, stroke = FALSE, fillOpacity = 1) %>% addLegend("bottomright", pal = capacity_palette, values = ~capacity) lp$height <- 300 lp ```
] See `leaflet` docs on [markers](http://rstudio.github.io/leaflet/markers.html) and [colors](http://rstudio.github.io/leaflet/colors.html) --- ## Levels of API security 1. No key needed 2. Shared secret 3. Access delegation (typically [OAuth](https://en.wikipedia.org/wiki/OAuth)) --- ### A typical access delegation dance .center[ <img src="https://developer.spotify.com/assets/AuthG_AuthoriztionCode.png" width="50%" style="display: block; margin: auto;" /> ] .floating-source[ Source: [Spotify Developer Docs](https://developer.spotify.com/documentation/general/guides/authorization-guide/) ] --- class: center, middle ## SQL --- ## Grammar of Data | dplyr (`data %>% `) | Pandas | SQL | |-------|--------|-----| | `select(col1, col2)` | `data[['col1', 'col2']]` | `SELECT col1, col2 FROM data` | | `filter(col1 > 5)` | `data.query('col1 > 5')` | `SELECT * FROM data WHERE col1 > 5` | | `left_join(data2, by = "col2")` | `pd.merge(data, data2, by="col2", type="left")` | `SELECT * FROM data LEFT JOIN data2 ON data.col2 == data2.col2` | | `group_by(col2) %>% summarize(m = max(col1))` | `data.group_by('col2')['col1'].max()` | `SELECT max(col1) AS m FROM data GROUP BY col2` | | `pivot_longer()` | `data.melt()` or `data.pivot()` | **No standard approach** | * [SQLite syntax](https://sqlite.org/lang_select.html) * [More examples](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html) --- ## BigQuery To use BigQuery, you need to create a project in Google Cloud Platform to use for billing. ```r billing_project <- "calvindsdev" ``` Then you can use `bigrquery` to get set up to make BigQuery API calls. ```r library(bigrquery) bigrquery::bq_auth(email = TRUE) bq_connection <- DBI::dbConnect( bigrquery::bigquery(), project = "bigquery-public-data", billing = billing_project ) ``` --- ## A COVID Example ```r case_counts %>% kable() ``` |country_code |country_name | cumulative_confirmed| population_2018| cases_per_capita| |:------------|:------------------------|--------------------:|---------------:|----------------:| |US |United States of America | 13091758| 327167434| 0.0400155| |BR |Brazil | 6238350| 209469333| 0.0297817| |RU |Russia | 2215533| 144478050| 0.0153347| |MX |Mexico | 1078594| 126190788| 0.0085473| |IN |India | 9351109| 1352617328| 0.0069133| |PH |Philippines | 425918| 106651922| 0.0039935| |BD |Bangladesh | 458711| 161356039| 0.0028428| |ID |Indonesia | 522581| 267663435| 0.0019524| |PK |Pakistan | 392356| 212215030| 0.0018489| |JP |Japan | 142068| 126529100| 0.0011228| --- ```r covid_open_data <- bq_connection %>% tbl("bigquery-public-data.covid19_open_data.covid19_open_data") covid_open_data %>% glimpse() ``` ``` ## Rows: ?? ## Columns: 45 ## Database: BigQueryConnection ## $ date <date> 2020-11-25, 2020-11-26, 2020-11-27, 2020-11-28, 2020-… ## $ location_key <chr> "US_WA_53053", "US_WA_53053", "US_WA_53053", "US_WA_53… ## $ country_code <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", … ## $ country_name <chr> "United States of America", "United States of America"… ## $ subregion1_code <chr> "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", … ## $ subregion1_name <chr> "Washington", "Washington", "Washington", "Washington"… ## $ subregion2_code <chr> "53053", "53053", "53053", "53053", "53053", "53053", … ## $ subregion2_name <chr> "Pierce County", "Pierce County", "Pierce County", "Pi… ## $ iso_3166_1_alpha_2 <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", … ## $ iso_3166_1_alpha_3 <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"… ## $ aggregation_level <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ new_confirmed <int> 288, 0, 135, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ new_deceased <int> 6, 0, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ new_recovered <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ new_tested <int> 676, 611, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ cumulative_confirmed <int> 15658, 15658, 17290, NA, NA, NA, NA, NA, NA, NA, NA, N… ## $ cumulative_deceased <int> 274, 274, 308, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… ## $ cumulative_recovered <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ cumulative_tested <int> 117125, 117736, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ new_hospitalized_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ new_intensive_care_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ new_ventilator_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ cumulative_hospitalized_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ cumulative_intensive_care_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ cumulative_ventilator_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ current_hospitalized_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ current_intensive_care_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ current_ventilator_patients <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ mobility_transit_stations <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ mobility_retail_and_recreation <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ mobility_grocery_and_pharmacy <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ mobility_parks <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ mobility_residential <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ mobility_workplaces <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… ## $ wikidata_id <chr> "Q156459", "Q156459", "Q156459", "Q156459", "Q156459",… ## $ datacommons_id <chr> "geoId/53053", "geoId/53053", "geoId/53053", "geoId/53… ## $ openstreetmap_id <int> 1153347, 1153347, 1153347, 1153347, 1153347, 1153347, … ## $ latitude <dbl> 47.05, 47.05, 47.05, 47.05, 47.05, 47.05, 48.57, 48.57… ## $ longitude <dbl> -122.11, -122.11, -122.11, -122.11, -122.11, -122.11, … ## $ location_geometry <wk_wkt> POINT (-122.11 47.05), POINT (-122.11 47.05), POINT… ## $ average_temperature_celsius <dbl> NA, NA, NA, NA, NA, NA, 10.000000, 6.155556, 8.277778,… ## $ minimum_temperature_celsius <dbl> NA, NA, NA, NA, NA, NA, 8.077778, 4.033333, 4.388889, … ## $ maximum_temperature_celsius <dbl> NA, NA, NA, NA, NA, NA, 11.555556, 9.311111, 11.633333… ## $ rainfall_mm <dbl> NA, NA, NA, NA, NA, NA, 5.207000, 0.095250, 4.159250, … ## $ snowfall_mm <dbl> NA, NA, NA, NA, NA, NA, NA, 10.16, NA, NA, 10.16, NA, … ``` --- ```r world_bank_global_pop <- bq_connection %>% tbl("bigquery-public-data.world_bank_global_population.population_by_country") world_bank_global_pop %>% glimpse() ``` ``` ## Rows: ?? ## Columns: 62 ## Database: BigQueryConnection ## $ country <chr> "Sint Maarten (Dutch part)", "Not classified", "West Bank and Gaza", "Serbia… ## $ country_code <chr> "SXM", "INX", "PSE", "SRB", "KHM", "ARE", "CYP", "LMC", "MDV", "LIC", "PER",… ## $ year_1960 <int> NA, NA, NA, NA, 5722370, 92418, 572930, 928490499, 89863, 158779911, 1015501… ## $ year_1961 <int> NA, NA, NA, NA, 5872966, 100796, 576395, 949753699, 92325, 162259865, 104466… ## $ year_1962 <int> NA, NA, NA, NA, 6028431, 112118, 577691, 971742838, 94905, 165830670, 107494… ## $ year_1963 <int> NA, NA, NA, NA, 6183584, 125130, 577913, 994420244, 97551, 169529568, 110623… ## $ year_1964 <int> NA, NA, NA, NA, 6331449, 138039, 578625, 1017725630, 100171, 173410029, 1138… ## $ year_1965 <int> NA, NA, NA, NA, 6467197, 149857, 580966, 1041611332, 102718, 177509185, 1171… ## $ year_1966 <int> NA, NA, NA, NA, 6585035, 159976, 585309, 1066074552, 105136, 181839891, 1204… ## $ year_1967 <int> NA, NA, NA, NA, 6685960, 169771, 591308, 1091126634, 107478, 186385902, 1238… ## $ year_1968 <int> NA, NA, NA, NA, 6779787, 182627, 598495, 1116788548, 109891, 191120090, 1273… ## $ year_1969 <int> NA, NA, NA, NA, 6880623, 203106, 606116, 1143088505, 112582, 196001924, 1309… ## $ year_1970 <int> NA, NA, NA, NA, 6996576, 234514, 613618, 1170050244, 115691, 201001375, 1345… ## $ year_1971 <int> NA, NA, NA, NA, 7139647, 277471, 620859, 1197695284, 119302, 206110337, 1383… ## $ year_1972 <int> NA, NA, NA, NA, 7302111, 330974, 627998, 1226045232, 123354, 211335895, 1422… ## $ year_1973 <int> NA, NA, NA, NA, 7449238, 394624, 635107, 1255126850, 127701, 216681265, 1461… ## $ year_1974 <int> NA, NA, NA, NA, 7533336, 467451, 642332, 1284982577, 132104, 222154521, 1501… ## $ year_1975 <int> NA, NA, NA, NA, 7524447, 548301, 649751, 1315638970, 136417, 227761362, 1542… ## $ year_1976 <int> NA, NA, NA, NA, 7404684, 637922, 657527, 1347077568, 140563, 233513206, 1583… ## $ year_1977 <int> NA, NA, NA, NA, 7196042, 735344, 665525, 1379310126, 144631, 239409309, 1625… ## $ year_1978 <int> NA, NA, NA, NA, 6957265, 835508, 673247, 1412439164, 148780, 245430650, 1668… ## $ year_1979 <int> NA, NA, NA, NA, 6770396, 931749, 680005, 1446597778, 153276, 251550765, 1711… ## $ year_1980 <int> NA, NA, NA, NA, 6693764, 1019509, 685405, 1481859807, 158272, 257758125, 175… ## $ year_1981 <int> NA, NA, NA, NA, 6749847, 1096610, 689173, 1518293900, 163824, 264051673, 179… ## $ year_1982 <int> NA, NA, NA, NA, 6919801, 1164806, 691702, 1555686714, 169851, 270462062, 184… ## $ year_1983 <int> NA, NA, NA, NA, 7169997, 1228459, 694077, 1593969308, 176244, 277049449, 188… ## $ year_1984 <int> NA, NA, NA, NA, 7447853, 1293971, 697714, 1632885169, 182850, 283893524, 193… ## $ year_1985 <int> NA, NA, NA, NA, 7714880, 1366164, 703685, 1672196713, 189537, 291057942, 197… ## $ year_1986 <int> NA, NA, NA, NA, 7960949, 1446380, 712340, 1711883766, 196264, 298550548, 202… ## $ year_1987 <int> NA, NA, NA, NA, 8198082, 1533536, 723376, 1751925747, 203028, 306374108, 206… ## $ year_1988 <int> NA, NA, NA, NA, 8435912, 1627066, 736474, 1792245451, 209785, 314578310, 211… ## $ year_1989 <int> NA, NA, NA, NA, 8691334, 1725681, 751047, 1832773126, 216510, 323219957, 216… ## $ year_1990 <int> NA, NA, 1978248, 7586000, 8975597, 1828432, 766615, 1875358869, 223158, 3323… ## $ year_1991 <int> NA, NA, 2068845, 7595636, 9289299, 1937153, 783124, 1915913909, 229751, 3419… ## $ year_1992 <int> NA, NA, 2163591, 7646424, 9623889, 2052891, 800611, 1956554979, 236265, 3520… ## $ year_1993 <int> NA, NA, 2262676, 7699307, 9970733, 2173139, 818746, 1997079030, 242605, 3625… ## $ year_1994 <int> NA, NA, 2366298, 7734639, 10317899, 2294385, 837110, 2037399623, 248588, 373… ## $ year_1995 <int> NA, NA, 2474666, 7625357, 10656138, 2415090, 855383, 2077790619, 254139, 383… ## $ year_1996 <int> NA, NA, 2587997, 7617794, 10982917, 2539126, 873423, 2118416707, 259183, 394… ## $ year_1997 <int> NA, NA, 2706518, 7596501, 11298600, 2671362, 891190, NA, 263842, 405123344, … ## $ year_1998 <int> 31240, NA, 2776568, 7567745, 11600508, 2813214, 908702, NA, 268455, 41595941… ## $ year_1999 <int> 31084, NA, 2848431, 7540401, 11886458, 2966034, 926046, NA, 273527, 42714044… ## $ year_2000 <int> 30519, NA, 2922153, 7516346, 12155239, 3134062, 943290, NA, 279398, 43879943… ## $ year_2001 <int> 30600, NA, 2997784, 7503433, 12405408, 3302719, 960276, NA, 286306, 45096207… ## $ year_2002 <int> 30777, NA, 3075373, 7496522, 12637727, 3478777, 976964, NA, 294186, 46358476… ## $ year_2003 <int> 31472, NA, 3154969, 7480591, 12856163, 3711932, 993563, NA, 302683, 47664537… ## $ year_2004 <int> 32488, NA, 3236626, 7463157, 13066469, 4068570, 1010408, NA, 311265, 4900999… ## $ year_2005 <int> 33011, NA, 3320396, 7440769, 13273354, 4588225, 1027662, NA, 319608, 5039071… ## $ year_2006 <int> 33441, NA, 3406334, 7411569, 13477709, 5300174, 1045507, NA, 327487, 5180974… ## $ year_2007 <int> 33811, NA, 3494496, 7381579, 13679962, 6168838, 1063713, NA, 335169, 5326635… ## $ year_2008 <int> 33964, NA, 3591977, 7350222, 13883834, 7089487, 1081563, NA, 343452, 5474812… ## $ year_2009 <int> 34238, NA, 3689099, 7320807, 14093604, 7917372, 1098083, NA, 353398, 5623860… ## $ year_2010 <int> 34056, NA, 3786161, 7291436, 14312212, 8549988, 1112612, NA, 365734, 5772740… ## $ year_2011 <int> 33435, NA, 3882986, 7234099, 14541423, 8946777, 1124833, NA, 380495, 5920921… ## $ year_2012 <int> 34640, NA, 3979998, 7199077, 14780454, 9141596, 1135046, NA, 397237, 6069133… ## $ year_2013 <int> 36607, NA, 4076708, 7164132, 15026332, 9197910, 1143866, NA, 415593, 6219158… ## $ year_2014 <int> 37685, NA, 4173398, 7130576, 15274503, 9214175, 1152285, NA, 435015, 6373506… ## $ year_2015 <int> 38825, NA, 4270092, 7095383, 15521436, 9262900, 1160985, NA, 454915, 6534052… ## $ year_2016 <int> 39969, NA, 4367088, 7058322, 15766293, 9360980, 1170187, NA, 475513, 6701173… ## $ year_2017 <int> 40574, NA, 4454805, 7020858, 16009414, 9487203, 1179680, NA, 496402, 6874495… ## $ year_2018 <int> 40654, NA, 4569087, 6982084, 16249798, 9630959, 1189265, NA, 515696, 7054173… ## $ year_2019 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- .pull-left[ R (`dbplyr`) ```r country_pop <- world_bank_global_pop %>% select( iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) ``` ] .pull-right[ BigQuery SQL ```sql SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) ``` ] --- ## Common Table Expression (`WITH`); `JOIN` .pull-left[ ```r country_pop <- world_bank_global_pop %>% select( iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) case_counts_query <- covid_open_data %>% * left_join( * country_pop, by = "iso_3166_1_alpha_3") ``` ] .pull-right[ ```sql *WITH * country_pop AS ( SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) SELECT * FROM `bigquery-public-data.covid19_open_data.covid19_open_data` *JOIN country_pop USING (iso_3166_1_alpha_3) ``` ] --- .pull-left[ ```r country_pop <- world_bank_global_pop %>% select( iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) case_counts_query <- covid_open_data %>% left_join( country_pop, by = "iso_3166_1_alpha_3") %>% * filter( * date == '2020-11-27', * aggregation_level == 0, * population_2018 > 100000000) ``` ] .pull-right[ ```sql WITH country_pop AS ( SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) SELECT * FROM `bigquery-public-data.covid19_open_data.covid19_open_data` JOIN country_pop USING (iso_3166_1_alpha_3) *WHERE * date = '2020-11-27' * AND aggregation_level = 0 * AND population_2018 > 100000000 ``` ] --- .pull-left[ ```r country_pop <- world_bank_global_pop %>% select( iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) case_counts_query <- covid_open_data %>% left_join( country_pop, by = "iso_3166_1_alpha_3") %>% filter( date == '2020-11-27', aggregation_level == 0, population_2018 > 100000000) %>% * mutate( * cases_per_capita = * cumulative_confirmed / population_2018 * ) ``` ] .pull-right[ ```sql WITH country_pop AS ( SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) SELECT *, * cumulative_confirmed/population_2018 AS cases_per_capita FROM `bigquery-public-data.covid19_open_data.covid19_open_data` JOIN country_pop USING (iso_3166_1_alpha_3) WHERE date = '2020-11-27' AND aggregation_level = 0 AND population_2018 > 100000000 ``` ] --- .pull-left[ ```r country_pop <- world_bank_global_pop %>% select( iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) case_counts_query <- covid_open_data %>% left_join( country_pop, by = "iso_3166_1_alpha_3") %>% filter( date == '2020-11-27', aggregation_level == 0, population_2018 > 100000000) %>% mutate( cases_per_capita = cumulative_confirmed / population_2018 ) %>% * select( * country_code, country_name, * cumulative_confirmed, * population_2018, * cases_per_capita) ``` ] .pull-right[ ```sql WITH country_pop AS ( SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) SELECT * country_code, * country_name, * cumulative_confirmed, * population_2018, * cumulative_confirmed/population_2018 AS cases_per_capita FROM `bigquery-public-data.covid19_open_data.covid19_open_data` JOIN country_pop USING (iso_3166_1_alpha_3) WHERE date = '2020-11-27' AND aggregation_level = 0 AND population_2018 > 100000000 ``` ] --- .pull-left[ ```r country_pop <- world_bank_global_pop %>% select( iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) case_counts_query <- covid_open_data %>% left_join( country_pop, by = "iso_3166_1_alpha_3") %>% filter( date == '2020-11-27', aggregation_level == 0, population_2018 > 100000000) %>% mutate( cases_per_capita = cumulative_confirmed / population_2018 ) %>% select( country_code, country_name, cumulative_confirmed, population_2018, cases_per_capita) %>% * arrange(desc(cases_per_capita)) ``` ] .pull-right[ ```sql WITH country_pop AS ( SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) SELECT country_code, country_name, cumulative_confirmed, population_2018, cumulative_confirmed/population_2018 AS cases_per_capita FROM `bigquery-public-data.covid19_open_data.covid19_open_data` JOIN country_pop USING (iso_3166_1_alpha_3) WHERE date = '2020-11-27' AND aggregation_level = 0 AND population_2018 > 100000000 *ORDER BY * cases_per_capita DESC ``` ] --- ```r country_pop <- world_bank_global_pop %>% select(iso_3166_1_alpha_3 = country_code, population_2018 = year_2018) case_counts_query <- covid_open_data %>% filter(date == '2020-11-27', aggregation_level == 0) %>% left_join(country_pop, by = c("iso_3166_1_alpha_3")) %>% filter(population_2018 > 100000000) %>% mutate(cases_per_capita = cumulative_confirmed / population_2018) %>% select(country_code, country_name, cumulative_confirmed, population_2018, cases_per_capita) %>% arrange(desc(cases_per_capita)) %>% head(10) case_counts_query %>% show_query() ``` ``` ## <SQL> ## SELECT `country_code`, `country_name`, `cumulative_confirmed`, `population_2018`, `cumulative_confirmed` / `population_2018` AS `cases_per_capita` ## FROM (SELECT `date`, `location_key`, `country_code`, `country_name`, `subregion1_code`, `subregion1_name`, `subregion2_code`, `subregion2_name`, `iso_3166_1_alpha_2`, `LHS`.`iso_3166_1_alpha_3` AS `iso_3166_1_alpha_3`, `aggregation_level`, `new_confirmed`, `new_deceased`, `new_recovered`, `new_tested`, `cumulative_confirmed`, `cumulative_deceased`, `cumulative_recovered`, `cumulative_tested`, `new_hospitalized_patients`, `new_intensive_care_patients`, `new_ventilator_patients`, `cumulative_hospitalized_patients`, `cumulative_intensive_care_patients`, `cumulative_ventilator_patients`, `current_hospitalized_patients`, `current_intensive_care_patients`, `current_ventilator_patients`, `mobility_transit_stations`, `mobility_retail_and_recreation`, `mobility_grocery_and_pharmacy`, `mobility_parks`, `mobility_residential`, `mobility_workplaces`, `wikidata_id`, `datacommons_id`, `openstreetmap_id`, `latitude`, `longitude`, `location_geometry`, `average_temperature_celsius`, `minimum_temperature_celsius`, `maximum_temperature_celsius`, `rainfall_mm`, `snowfall_mm`, `population_2018` ## FROM (SELECT * ## FROM `bigquery-public-data.covid19_open_data.covid19_open_data` ## WHERE ((`date` = '2020-11-27') AND (`aggregation_level` = 0.0))) `LHS` ## LEFT JOIN (SELECT `country_code` AS `iso_3166_1_alpha_3`, `year_2018` AS `population_2018` ## FROM `bigquery-public-data.world_bank_global_population.population_by_country`) `RHS` ## ON (`LHS`.`iso_3166_1_alpha_3` = `RHS`.`iso_3166_1_alpha_3`) ## ) `q01` ## WHERE (`population_2018` > 100000000.0) ## ORDER BY `cases_per_capita` DESC ## LIMIT 10 ``` --- ### Getting the results ```r case_counts <- case_counts_query %>% collect() ``` ```r case_counts %>% kable() ``` |country_code |country_name | cumulative_confirmed| population_2018| cases_per_capita| |:------------|:------------------------|--------------------:|---------------:|----------------:| |US |United States of America | 13091758| 327167434| 0.0400155| |BR |Brazil | 6238350| 209469333| 0.0297817| |RU |Russia | 2215533| 144478050| 0.0153347| |MX |Mexico | 1078594| 126190788| 0.0085473| |IN |India | 9351109| 1352617328| 0.0069133| |PH |Philippines | 425918| 106651922| 0.0039935| |BD |Bangladesh | 458711| 161356039| 0.0028428| |ID |Indonesia | 522581| 267663435| 0.0019524| |PK |Pakistan | 392356| 212215030| 0.0018489| |JP |Japan | 142068| 126529100| 0.0011228| --- <https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-open-data?filter=solution-type:dataset&q=covid19&project=calvindsdev> ``` WITH country_pop AS ( SELECT country_code AS iso_3166_1_alpha_3, year_2018 AS population_2018 FROM `bigquery-public-data.world_bank_global_population.population_by_country`) SELECT country_code, country_name, cumulative_confirmed, population_2018, cumulative_confirmed/population_2018 AS cases_per_capita FROM `bigquery-public-data.covid19_open_data.covid19_open_data` JOIN country_pop USING (iso_3166_1_alpha_3) WHERE date = '2020-11-13' AND aggregation_level = 0 AND population_2018 > 100000000 ORDER BY case_percent DESC ``` See also: <https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/> from <https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/>