This document demonstrates data queries using both dplyr
and SQL commands. The data is from a toy suppliers-parts-jobs database
running on a cloud, SQL data service.
library(tidyverse)
We start by establishing a connection to the SPJ database currently stored on ElephantSQL, a cloud-based database system. ElephantSql hosts this database in a manner similar to the AWS-hosted database used in the textbook. And we connect to it in same manner.
library(DBI)
db <- DBI::dbConnect(RPostgres::Postgres(),
host = DB_HOST,
port = 5432,
user = DB_USER,
password = DB_PASSWORD)
dbListTables(db)
## [1] "partjob" "supplier" "part"
## [4] "job" "pg_stat_statements"
supplier <- tbl(db, "supplier")
part <- tbl(db, "part")
job <- tbl(db, "job")
partjob <- tbl(db, "partjob")
As a reminder, here is the SQL schema.
Queries retrieve data from databases. As with dplyr
wrangling, the nature of the query depends on the structure of the
underlying data.
First, we retrieve the data from a single table (part
).
Note that the result acts like a tibble but is actually taken directly
from the cloud-based, PostgreSQL database (see the “Source” and
“Database:” values).
part %>%
show_query()
## <SQL>
## SELECT *
## FROM "part"
part
## # Source: table<part> [4 x 4]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
## id name price supplierid
## <int> <chr> <int> <int>
## 1 1 Birdseed 500 1
## 2 2 TNT 10000 1
## 3 3 Lead shot 2000 NA
## 4 4 Pocket Fisherman 1999 3
We include both the retrieved data and the SQL command used to do the
retrieval. You can compare and contrast the equivalent queries in
dplyr
and SQL.
We can run (basically) the same SQL query directly in the cloud
service. Note the use of *
to query all fields.
SELECT *
FROM Part
Note also that if we modify the data on the cloud service, the local query results change as well. Students should not modify this shared cloud resource.
Second, we retrieve the data from a 1-to-many join. In the SPJ database, a supplier can supply many parts, but a part must come from one supplier. This may not be realistic, but for the sake of the example we assume that this is a business rule, perhaps intended to ensure the consistency of parts used on jobs.
part %>%
left_join(supplier,
by = c("supplierid" = "id"),
suffix = c(".part", ".supplier"),
) %>%
select(name.part, price, name.supplier) %>%
show_query()
## <SQL>
## SELECT
## "part"."name" AS "name.part",
## "price",
## "supplier"."name" AS "name.supplier"
## FROM "part"
## LEFT JOIN "supplier"
## ON ("part"."supplierid" = "supplier"."id")
part %>%
left_join(supplier,
by = c("supplierid" = "id"),
suffix = c(".part", ".supplier"),
) %>%
select(name.part, price, name.supplier)
## # Source: SQL [4 x 3]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
## name.part price name.supplier
## <chr> <int> <chr>
## 1 Birdseed 500 Acme
## 2 TNT 10000 Acme
## 3 Lead shot 2000 <NA>
## 4 Pocket Fisherman 1999 Popeil
Note, that the output of this query includes the sort of redundant data we’d prefer to avoid in our database structures, but that this redundancy is being automatically produced by the query — the underlying database is normalized.
Here’s a cleaned up, SQL-standard version of this join query.
SELECT Part.name AS PartName, price, Supplier.name as SupplierName
FROM Part
JOIN Supplier ON supplier.ID = Part.supplierID
Note that the SQL AS
command performs the same basic
renaming function performed by the dplyr
suffix
argument.
Finally, we retrieve data from a many-to-many join. In the SPJ database, a part can be supplied to many different jobs, and a job can use many different parts.
partjob %>%
inner_join(part,
by = c("partid" = "id"),
) %>%
inner_join(job,
by = c("jobid" = "id"),
suffix = c(".part", ".job"),
) %>%
select(name.part, price, name.job, quantity, address, date) %>%
show_query()
## <SQL>
## SELECT
## "part"."name" AS "name.part",
## "price",
## "job"."name" AS "name.job",
## "quantity",
## "address",
## "date"
## FROM "partjob"
## INNER JOIN "part"
## ON ("partjob"."partid" = "part"."id")
## INNER JOIN "job"
## ON ("partjob"."jobid" = "job"."id")
partjob %>%
inner_join(part,
by = c("partid" = "id"),
) %>%
inner_join(job,
by = c("jobid" = "id"),
suffix = c(".part", ".job"),
) %>%
select(name.part, price, name.job, quantity, address, date)
## # Source: SQL [4 x 6]
## # Database: postgres [fnujlqce@lallah.db.elephantsql.com:5432/fnujlqce]
## name.part price name.job quantity address date
## <chr> <int> <chr> <int> <chr> <date>
## 1 Birdseed 500 Student Union 10 Calvin 2018-10-20
## 2 TNT 10000 <NA> 2 Calvin 2018-10-21
## 3 Birdseed 500 <NA> 3 Calvin 2018-10-21
## 4 Pocket Fisherman 1999 Student Union 2 Calvin 2018-10-22
Here’s a cleaned up, SQL-standard version of this join query.
SELECT Part.name AS PartName, price, Job.name as JobName, quantity, address, date
FROM PartJob
JOIN Part ON PartJob.partID = Part.ID
JOIN Job ON PartJob.jobID = Job.ID
This document has demoed the most common data query types using both
dplyr
and SQL.