stuff removed from class 10.
ETL, the task of collecting and managing data for analysis, requires that we extract data from its sources, transform it into a usable form, and load it into a data store.
This task can be complex because real data come in various forms from various sources at various velocities.
??? The ETL example in the text is quite tame. Most real data is not already formatted as an R dataset.
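To make the three ETL stages (extract, transform, load) concrete, here is a minimal sketch in Python using only the standard library. The CSV text, the column names, and the `extract`/`transform`/`load` helpers are all invented for illustration; a real pipeline would read from files, APIs, or streams and would need far more cleaning.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: a tiny CSV with messy values (invented for illustration).
RAW_CSV = """carrier,flight,dep_delay
UA,1545,2
AA,1141,
 dl ,461,-6
"""

def extract(text):
    """Extract: read raw rows (all strings) from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize carrier codes, convert types, map blanks to None."""
    cleaned = []
    for row in rows:
        delay = row["dep_delay"].strip()
        cleaned.append((
            row["carrier"].strip().upper(),
            int(row["flight"]),
            int(delay) if delay else None,
        ))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE flights (carrier TEXT, flight INTEGER, dep_delay INTEGER)"
    )
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

loaded = conn.execute(
    "SELECT carrier, flight, dep_delay FROM flights ORDER BY flight"
).fetchall()
print(loaded)
```

Even in this toy case the transform step has to deal with stray whitespace, inconsistent case, and a missing value, which hints at why real ETL work is rarely tame.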
Using foreign key values to link records in different tables is critical in allowing databases to reside in secondary storage rather than in main memory, but it has a weakness: retrieving a particular record from a table by key value is less efficient than following a direct in-memory reference.
Indexes are data structures that speed retrieval of records based on index values.
??? Demo this using the relationship between records in the flights and airlines datasets.
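One way to run such a demo is sketched below in Python with an in-memory SQLite database. The table layouts, the handful of rows, and the index name `idx_flights_carrier` are assumptions made for illustration, not the actual datasets. The sketch links `flights` to `airlines` through a `carrier` foreign key, then asks SQLite for its query plan before and after adding an index on that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Toy schema: flights.carrier is a foreign key into airlines.carrier.
conn.execute("CREATE TABLE airlines (carrier TEXT PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE flights (
        id      INTEGER PRIMARY KEY,
        carrier TEXT REFERENCES airlines(carrier),
        flight  INTEGER
    )
""")

# A few illustrative rows (not the full datasets).
conn.executemany(
    "INSERT INTO airlines VALUES (?, ?)",
    [("UA", "United Air Lines Inc."), ("DL", "Delta Air Lines Inc.")],
)
conn.executemany(
    "INSERT INTO flights (carrier, flight) VALUES (?, ?)",
    [("UA", 1545), ("DL", 461), ("UA", 1714)],
)

# Records are linked by matching key values, not by pointers.
rows = conn.execute("""
    SELECT f.flight, a.name
    FROM flights AS f
    JOIN airlines AS a ON f.carrier = a.carrier
    ORDER BY f.flight
""").fetchall()
print(rows)

def plan(sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    cursor = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[3] for row in cursor)

query = "SELECT flight FROM flights WHERE carrier = ?"

before = plan(query, ("UA",))  # no index on carrier yet: expect a full table scan
conn.execute("CREATE INDEX idx_flights_carrier ON flights (carrier)")
after = plan(query, ("UA",))   # the planner can now search via the index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text varies across SQLite versions, but the first plan should report a scan of `flights` and the second a search using `idx_flights_carrier`. On three rows the difference is invisible; on a table with millions of rows it is the difference between reading every record and reading only the matching ones.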
Relational data is not like in-memory data structures (cf. CS 212).
This was a key stumbling block in the early deployment of relational databases. E.F. Codd recounts battles with proponents of network databases over efficiency (for which his time in the RAF in WWII was useful training!).
We won’t cover index types or construction in this course. But picking good indexes is critical in making production databases “produce”.