This document demonstrates data queries using both dplyr and SQL commands. The data is from a toy suppliers-parts-jobs database running on a cloud, SQL data service.
library(tidyverse)
We start by establishing a connection to the SPJ database currently stored on ElephantSQL, a cloud-based database system. ElephantSql hosts this database in a manner similar to the AWS-hosted database used in the textbook. And we connect to it in same manner.
library(DBI)
db <- DBI::dbConnect(RPostgres::Postgres(),
host = DB_HOST,
port = 5432,
user = DB_USER,
password = DB_PASSWORD)
dbListTables(db)
## [1] "pg_stat_statements" "supplier" "part"
## [4] "job" "partjob"
supplier <- tbl(db, "supplier")
part <- tbl(db, "part")
job <- tbl(db, "job")
partjob <- tbl(db, "partjob")
As a reminder, here is the SQL schema.
SPJ Schema
Queries retrieve data from databases. As with dplyr wrangling, the nature of the query depends on the structure of the underlying data.
First, we retrieve the data from a single table (part). Note that the result acts like a tibble but is actually taken directly from the cloud-based, PostgreSQL database (see the “Source” and “Database:” values).
part %>%
show_query()
## <SQL>
## SELECT *
## FROM "part"
part
## # Source: table<part> [?? x 4]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
## id name price supplierid
## <int> <chr> <dbl> <int>
## 1 1 birdseed 5 1
## 2 2 TNT 90 1
## 3 3 lead shot 20 1
## 4 4 pocket fisherman 20.0 3
We include both the retrieved data and the SQL command used to do the retrieval. You can compare and contrast the equivalent queries in dplyr and SQL.
We can run (basically) the same SQL query directly in the cloud service. Note the use of * to query all fields.
SELECT *
FROM Part
Note also that if we modify the data on the cloud service, the local query results change as well. Students should not modify this shared cloud resource.
Second, we retrieve the data from a 1-to-many join. In the SPJ database, a supplier can supply many parts, but a part must come from one supplier. This may not be realistic, but for the sake of the example we assume that this is a business rule, perhaps intended to ensure the consistency of parts used on jobs.
part %>%
left_join(supplier,
by = c("supplierid" = "id"),
suffix = c(".part", ".supplier"),
) %>%
select(name.part, price, name.supplier) %>%
show_query()
## <SQL>
## SELECT "name.part", "price", "name.supplier"
## FROM (SELECT "LHS"."id" AS "id", "LHS"."name" AS "name.part", "price", "supplierid", "RHS"."name" AS "name.supplier", "loginid", "password"
## FROM "part" AS "LHS"
## LEFT JOIN "supplier" AS "RHS"
## ON ("LHS"."supplierid" = "RHS"."id")
## ) "q01"
part %>%
left_join(supplier,
by = c("supplierid" = "id"),
suffix = c(".part", ".supplier"),
) %>%
select(name.part, price, name.supplier)
## # Source: lazy query [?? x 3]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
## name.part price name.supplier
## <chr> <dbl> <chr>
## 1 birdseed 5 Acme
## 2 TNT 90 Acme
## 3 lead shot 20 Acme
## 4 pocket fisherman 20.0 Popiel
Note, that the output of this query includes the sort of redundant data we’d prefer to avoid in our database structures, but that this redundancy is being automatically produced by the query — the underlying database is normalized.
Here’s a cleaned up, SQL-standard version of this join query.
SELECT Part.name AS PartName, price, Supplier.name as SupplierName
FROM Part
JOIN Supplier ON supplier.ID = Part.supplierID
Note that the SQL AS command performs the same basic renaming function performed by the dplyr suffix argument.
Finally, we retrieve data from a many-to-many join. In the SPJ database, a part can be supplied to many different jobs, and a job can use many different parts.
partjob %>%
inner_join(part,
by = c("partid" = "id"),
) %>%
inner_join(job,
by = c("jobid" = "id"),
suffix = c(".part", ".job"),
) %>%
select(name.part, price, name.job, quantity, address, date) %>%
show_query()
## <SQL>
## SELECT "name.part", "price", "name.job", "quantity", "address", "date"
## FROM (SELECT "partid", "jobid", "quantity", "date", "LHS"."name" AS "name.part", "price", "supplierid", "RHS"."name" AS "name.job", "address"
## FROM (SELECT "partid", "jobid", "quantity", "date", "name", "price", "supplierid"
## FROM "partjob" AS "LHS"
## INNER JOIN "part" AS "RHS"
## ON ("LHS"."partid" = "RHS"."id")
## ) "LHS"
## INNER JOIN "job" AS "RHS"
## ON ("LHS"."jobid" = "RHS"."id")
## ) "q01"
partjob %>%
inner_join(part,
by = c("partid" = "id"),
) %>%
inner_join(job,
by = c("jobid" = "id"),
suffix = c(".part", ".job"),
) %>%
select(name.part, price, name.job, quantity, address, date)
## # Source: lazy query [?? x 6]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
## name.part price name.job quantity address date
## <chr> <dbl> <chr> <int> <chr> <date>
## 1 birdseed 5 Student Union 10 Calvin 2018-10-20
## 2 TNT 90 Student Union 2 Calvin 2018-10-21
## 3 birdseed 5 <NA> 2 Amway 2018-10-21
Here’s a cleaned up, SQL-standard version of this join query.
SELECT Part.name AS PartName, price, Job.name as JobName, quantity, address, date
FROM PartJob
JOIN Part ON PartJob.partID = Part.ID
JOIN Job ON PartJob.jobID = Job.ID
This document has demoed the most common data query types using both dplyr and SQL.