Data Engineering: Databases

Keith VanderLinden
Calvin University

Database Management Systems

Database Management Systems (DBMS) provide:

  • Reliability
  • Scalability
  • Flexibility
  • Usability

Data scientists benefit from being able to manage data stored in a variety of types of data systems.

Database Models

Relational

  • Relations
    • Structured schema
    • Normalized (often)
  • Vertical scaling
  • SQL

Non-Relational

  • Multi-mode
    • Semi/Un-structured
    • De-normalized (often)
  • Vertical & horizontal scaling
  • Not SQL

This course will focus on non-relational database systems.

MongoDB

A MongoDB database is a set of collections, each of which is set of documents. Here’s a collection of three documents.

[
    {
        "name": "Halley's Comet",
        "officialName": "1P/Halley",
        "specs": {"orbitalPeriod": 75, "radius": 3.4175, "mass": 2.2e14 }
    },
    {
        "name": "Wild2",
        "officialName": "81P/Wild",
        "specs": {"orbitalPeriod": 6.41, "radius": 1.5534, "mass": 2.3e13 }
    },
    {
        "name": "Comet Hyakutake",
        "officialName": "C/1996 B2",
        "specs": {"orbitalPeriod": 17000, "radius": 0.77671, "mass": 8.8e12 }
    }
]

MongoDB Queries

MongoDB

collection.find()

 

collection.find(
    {
        "name": "Wild2"
    }
)

 

collection.find({
    "specs.orbitalPeriod": {
        "$gt": 5,
        "$lt": 85
    }
})

SQL

SELECT * FROM collection

 

SELECT *
FROM collection
WHERE name = "Wild2"

 

SELECT *
FROM collection
WHERE specs.orbitalPeriod > 5
  AND specs.orbitalPeriod < 85

MongoDB Aggregation Pipelines

Aggregation Pipelines are a more flexible way to query data.

# Standard approach

collection.find(
    { "specs.orbitalPeriod": { "$lt": 100 } },
    { "name": 1, "specs.orbitalPeriod": 1, "_id": 0}
).sort("orbitalPeriod", 1)
# Aggregation version of the same query

collection.aggregate([
    { "$match": { "specs.orbitalPeriod": { "$lt": 100 } } },
    { "$project": { "name": 1, "specs.orbitalPeriod": 1, "_id": 0} },
    { "$sort": { "specs.orbitalPeriod": 1 } }
])

MongoDB Join Queries

orders.insert_many( [
    { "_id" : 1, "item" : "almonds", "price" : 12, "quantity" : 2 },
    { "_id" : 2, "item" : "pecans", "price" : 20, "quantity" : 1 }
] )
inventory.insert_many( [
    { "_id" : 1, "sku" : "almonds", "in-stock" : 120 },
    { "_id" : 3, "sku" : "cashews", "in-stock" : 60 },
    { "_id" : 4, "sku" : "pecans", "in-stock" : 70 }
] )
orders.aggregate([
    { "$lookup": {
            "from": "inventory",
            "localField": "item",
            "foreignField": "sku",
            "as": "inventory_documents"
        }
    } ])
   _id     item  price  quantity  inventory_documents
0    1  almonds     12         2  [{'_id': 1, 'sku': 'almonds', "in-stock": 120}]
1    2   pecans     20         1  [{'_id': 4, 'sku': 'pecans', "in-stock": 70 }]