This document demonstrates data queries using both dplyr and SQL commands. The data is from a toy suppliers-parts-jobs database running on a cloud, SQL data service.

library(tidyverse)

Loading the Database

We start by establishing a connection to the SPJ database currently stored on ElephantSQL, a cloud-based database system. ElephantSql hosts this database in a manner similar to the AWS-hosted database used in the textbook. And we connect to it in same manner.

library(DBI)

db <- DBI::dbConnect(RPostgres::Postgres(), 
                     host = DB_HOST, 
                     port = 5432,
                     user = DB_USER,
                     password = DB_PASSWORD)

dbListTables(db)
## [1] "pg_stat_statements" "supplier"           "part"              
## [4] "job"                "partjob"
supplier <- tbl(db, "supplier")
part <- tbl(db, "part")
job <- tbl(db, "job")
partjob <- tbl(db, "partjob")

As a reminder, here is the SQL schema.

SPJ Schema

Querying Data

Queries retrieve data from databases. As with dplyr wrangling, the nature of the query depends on the structure of the underlying data.

Single-Table Queries

First, we retrieve the data from a single table (part). Note that the result acts like a tibble but is actually taken directly from the cloud-based, PostgreSQL database (see the “Source” and “Database:” values).

part %>% 
  show_query()
## <SQL>
## SELECT *
## FROM "part"
part
## # Source:   table<part> [?? x 4]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
##      id name             price supplierid
##   <int> <chr>            <dbl>      <int>
## 1     1 birdseed           5            1
## 2     2 TNT               90            1
## 3     3 lead shot         20            1
## 4     4 pocket fisherman  20.0          3

We include both the retrieved data and the SQL command used to do the retrieval. You can compare and contrast the equivalent queries in dplyr and SQL.

We can run (basically) the same SQL query directly in the cloud service. Note the use of * to query all fields.

SELECT * 
  FROM Part

Note also that if we modify the data on the cloud service, the local query results change as well. Students should not modify this shared cloud resource.

Multiple-Table Queries: 1-M

Second, we retrieve the data from a 1-to-many join. In the SPJ database, a supplier can supply many parts, but a part must come from one supplier. This may not be realistic, but for the sake of the example we assume that this is a business rule, perhaps intended to ensure the consistency of parts used on jobs.

part %>%
  left_join(supplier, 
            by = c("supplierid" = "id"),
            suffix = c(".part", ".supplier"),
            ) %>%
  select(name.part, price, name.supplier) %>%
  show_query()
## <SQL>
## SELECT "name.part", "price", "name.supplier"
## FROM (SELECT "LHS"."id" AS "id", "LHS"."name" AS "name.part", "price", "supplierid", "RHS"."name" AS "name.supplier", "loginid", "password"
## FROM "part" AS "LHS"
## LEFT JOIN "supplier" AS "RHS"
## ON ("LHS"."supplierid" = "RHS"."id")
## ) "q01"
part %>%
  left_join(supplier, 
            by = c("supplierid" = "id"),
             suffix = c(".part", ".supplier"),
            ) %>%
  select(name.part, price, name.supplier)
## # Source:   lazy query [?? x 3]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
##   name.part        price name.supplier
##   <chr>            <dbl> <chr>        
## 1 birdseed           5   Acme         
## 2 TNT               90   Acme         
## 3 lead shot         20   Acme         
## 4 pocket fisherman  20.0 Popiel

Note, that the output of this query includes the sort of redundant data we’d prefer to avoid in our database structures, but that this redundancy is being automatically produced by the query — the underlying database is normalized.

Here’s a cleaned up, SQL-standard version of this join query.

SELECT Part.name AS PartName, price, Supplier.name as SupplierName 
  FROM Part
  JOIN Supplier ON supplier.ID = Part.supplierID

Note that the SQL AS command performs the same basic renaming function performed by the dplyr suffix argument.

Multiple-Table Queries: M-M

Finally, we retrieve data from a many-to-many join. In the SPJ database, a part can be supplied to many different jobs, and a job can use many different parts.

partjob %>%
  inner_join(part, 
             by = c("partid" = "id"),
             ) %>%
  inner_join(job, 
             by = c("jobid" = "id"),
             suffix = c(".part", ".job"),
             ) %>%
  select(name.part, price, name.job, quantity, address, date) %>%
  show_query()
## <SQL>
## SELECT "name.part", "price", "name.job", "quantity", "address", "date"
## FROM (SELECT "partid", "jobid", "quantity", "date", "LHS"."name" AS "name.part", "price", "supplierid", "RHS"."name" AS "name.job", "address"
## FROM (SELECT "partid", "jobid", "quantity", "date", "name", "price", "supplierid"
## FROM "partjob" AS "LHS"
## INNER JOIN "part" AS "RHS"
## ON ("LHS"."partid" = "RHS"."id")
## ) "LHS"
## INNER JOIN "job" AS "RHS"
## ON ("LHS"."jobid" = "RHS"."id")
## ) "q01"
partjob %>%
  inner_join(part, 
             by = c("partid" = "id"),
             ) %>%
  inner_join(job, 
             by = c("jobid" = "id"),
             suffix = c(".part", ".job"),
             ) %>%
  select(name.part, price, name.job, quantity, address, date)
## # Source:   lazy query [?? x 6]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
##   name.part price name.job      quantity address date      
##   <chr>     <dbl> <chr>            <int> <chr>   <date>    
## 1 birdseed      5 Student Union       10 Calvin  2018-10-20
## 2 TNT          90 Student Union        2 Calvin  2018-10-21
## 3 birdseed      5 <NA>                 2 Amway   2018-10-21

Here’s a cleaned up, SQL-standard version of this join query.

SELECT Part.name AS PartName, price, Job.name as JobName, quantity, address, date
  FROM PartJob
  JOIN Part ON PartJob.partID = Part.ID
  JOIN Job ON PartJob.jobID = Job.ID

Summary

This document has demoed the most common data query types using both dplyr and SQL.