Database Query Demo

Loading the Database

We start by establishing a connection to the SPJ database currently stored on ElephantSQL, a cloud-based database system. ElephantSql hosts this database in a manner similar to the AWS-hosted database used in the textbook. And we connect to it in same manner.

library(DBI)

db <- DBI::dbConnect(RPostgres::Postgres(), 
                     host = DB_HOST, 
                     port = 5432,
                     user = DB_USER,
                     password = DB_PASSWORD)

dbListTables(db)

## [1] "partjob"            "supplier"           "part"              
## [4] "job"                "pg_stat_statements"

supplier <- tbl(db, "supplier")
part <- tbl(db, "part")
job <- tbl(db, "job")
partjob <- tbl(db, "partjob")

As a reminder, here is the SQL schema.

SPJ Schema

Querying Data

Queries retrieve data from databases. As with dplyr wrangling, the nature of the query depends on the structure of the underlying data.

Single-Table Queries

First, we retrieve the data from a single table (part). Note that the result acts like a tibble but is actually taken directly from the cloud-based, PostgreSQL database (see the “Source” and “Database:” values).

part %>% 
  show_query()

## <SQL>
## SELECT *
## FROM "part"

part

## # Source:   table<part> [4 x 4]
## # Database: postgres  [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
##      id name             price supplierid
##   <int> <chr>            <int>      <int>
## 1     1 Birdseed           500          1
## 2     2 TNT              10000          1
## 3     3 Lead shot         2000         NA
## 4     4 Pocket Fisherman  1999          3

We include both the retrieved data and the SQL command used to do the retrieval. You can compare and contrast the equivalent queries in dplyr and SQL.

We can run (basically) the same SQL query directly in the cloud service. Note the use of * to query all fields.

SELECT * 
  FROM Part

Note also that if we modify the data on the cloud service, the local query results change as well. Students should not modify this shared cloud resource.

Multiple-Table Queries: 1-M

Second, we retrieve the data from a 1-to-many join. In the SPJ database, a supplier can supply many parts, but a part must come from one supplier. This may not be realistic, but for the sake of the example we assume that this is a business rule, perhaps intended to ensure the consistency of parts used on jobs.

part %>%
  left_join(supplier, 
            by = c("supplierid" = "id"),
            suffix = c(".part", ".supplier"),
            ) %>%
  select(name.part, price, name.supplier) %>%
  show_query()

## <SQL>
## SELECT
##   "part"."name" AS "name.part",
##   "price",
##   "supplier"."name" AS "name.supplier"
## FROM "part"
## LEFT JOIN "supplier"
##   ON ("part"."supplierid" = "supplier"."id")

part %>%
  left_join(supplier, 
            by = c("supplierid" = "id"),
             suffix = c(".part", ".supplier"),
            ) %>%
  select(name.part, price, name.supplier)

## # Source:   SQL [4 x 3]
## # Database: postgres  [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
##   name.part        price name.supplier
##   <chr>            <int> <chr>        
## 1 Birdseed           500 Acme         
## 2 TNT              10000 Acme         
## 3 Lead shot         2000 <NA>         
## 4 Pocket Fisherman  1999 Popeil

Note, that the output of this query includes the sort of redundant data we’d prefer to avoid in our database structures, but that this redundancy is being automatically produced by the query — the underlying database is normalized.

Here’s a cleaned up, SQL-standard version of this join query.

SELECT Part.name AS PartName, price, Supplier.name as SupplierName 
  FROM Part
  JOIN Supplier ON supplier.ID = Part.supplierID

Note that the SQL AS command performs the same basic renaming function performed by the dplyr suffix argument.

Multiple-Table Queries: M-M

Finally, we retrieve data from a many-to-many join. In the SPJ database, a part can be supplied to many different jobs, and a job can use many different parts.

partjob %>%
  inner_join(part, 
             by = c("partid" = "id"),
             ) %>%
  inner_join(job, 
             by = c("jobid" = "id"),
             suffix = c(".part", ".job"),
             ) %>%
  select(name.part, price, name.job, quantity, address, date) %>%
  show_query()

## <SQL>
## SELECT
##   "part"."name" AS "name.part",
##   "price",
##   "job"."name" AS "name.job",
##   "quantity",
##   "address",
##   "date"
## FROM "partjob"
## INNER JOIN "part"
##   ON ("partjob"."partid" = "part"."id")
## INNER JOIN "job"
##   ON ("partjob"."jobid" = "job"."id")

partjob %>%
  inner_join(part, 
             by = c("partid" = "id"),
             ) %>%
  inner_join(job, 
             by = c("jobid" = "id"),
             suffix = c(".part", ".job"),
             ) %>%
  select(name.part, price, name.job, quantity, address, date)

## # Source:   SQL [4 x 6]
## # Database: postgres  [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
##   name.part        price name.job      quantity address date      
##   <chr>            <int> <chr>            <int> <chr>   <date>    
## 1 Birdseed           500 Student Union       10 Calvin  2018-10-20
## 2 TNT              10000 <NA>                 2 Calvin  2018-10-21
## 3 Birdseed           500 <NA>                 3 Calvin  2018-10-21
## 4 Pocket Fisherman  1999 Student Union        2 Calvin  2018-10-22

Here’s a cleaned up, SQL-standard version of this join query.

SELECT Part.name AS PartName, price, Job.name as JobName, quantity, address, date
  FROM PartJob
  JOIN Part ON PartJob.partID = Part.ID
  JOIN Job ON PartJob.jobID = Job.ID

Summary

This document has demoed the most common data query types using both dplyr and SQL.