Data Engineering

Keith VanderLinden
Calvin University

Big Data

The Vs of big data:

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Value

Data engineering is the process of designing, building, and managing the infrastructure for big data.

Data Formats

Data are stored in a variety of formats.

  • CSV — Comma Separated Values
  • TSV — Tab Separated Values
  • JSON — JavaScript Object Notation
  • ZIP — Compressed Archive
  • Pickle — Python-specific binary format
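As a minimal sketch of how these formats differ in practice, the following serializes the same records as CSV, TSV, JSON, and pickle using only the Python standard library. The sample records and the helper names `to_delimited` and `from_delimited` are hypothetical, chosen for illustration only.

```python
import csv
import io
import json
import pickle

# Hypothetical sample records, used only for illustration.
records = [
    {"id": 1, "name": "Ada", "score": 91.5},
    {"id": 2, "name": "Grace", "score": 88.0},
]


def to_delimited(rows, delimiter=","):
    """Serialize a list of dicts to delimited text (CSV by default, TSV with a tab)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_delimited(text, delimiter=","):
    """Parse delimited text back into a list of dicts; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_text = to_delimited(records)
tsv_text = to_delimited(records, delimiter="\t")
json_text = json.dumps(records)
pickle_bytes = pickle.dumps(records)

csv_back = from_delimited(csv_text)       # types are lost: all values are strings
json_back = json.loads(json_text)         # numbers and strings are preserved
pickle_back = pickle.loads(pickle_bytes)  # Python objects are restored as they were
```

Note the trade-offs: the text formats are readable by nearly any tool but CSV and TSV carry no type information, while pickle preserves Python objects exactly but is readable only from Python and should never be loaded from untrusted sources.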

Row vs Column Storage

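Whether data are laid out in row-major or column-major order can affect performance: row-major layouts favor reading or writing whole records, while column-major layouts favor scans and aggregates over a few fields.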
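As a rough illustration of the two layouts, the sketch below holds the same table as a list of records (row-oriented) and as a dictionary of lists (column-oriented). The sample data and the helper `rows_to_columns` are hypothetical; real storage engines lay out bytes on disk or in memory, which plain Python lists only approximate.

```python
# Hypothetical records for illustration.
rows = [
    {"id": 1, "city": "Grand Rapids", "temp": 21.0},
    {"id": 2, "city": "Holland", "temp": 19.0},
    {"id": 3, "city": "Lansing", "temp": 23.0},
]


def rows_to_columns(rows):
    """Convert row-oriented records into a column-oriented dict of lists."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns


columns = rows_to_columns(rows)

# Row-oriented: fetching one whole record is a single lookup.
second_record = rows[1]

# Column-oriented: aggregating one field reads only that field's list
# and never touches the other columns.
mean_temp = sum(columns["temp"]) / len(columns["temp"])
```

Workloads that insert or fetch whole records tend to suit the first layout; analytical queries that aggregate a few fields across many records tend to suit the second.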

Versioning

Reproducibility requires maintaining a version history for every artifact that affects a result.

  • Code
  • Configuration
  • Data
  • Models
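One way to tie the four artifact types together is to record a content hash of each in a manifest, so that a result can later be matched to the exact inputs that produced it. The following is a minimal sketch under that assumption; the function names `fingerprint` and `build_manifest` and the manifest layout are hypothetical, and in practice a version control system would usually handle the code itself.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(code: bytes, config: dict, data: bytes, model: bytes) -> dict:
    """Record one content hash per artifact type (hypothetical manifest layout)."""
    # Sort keys so that logically equal configurations hash identically.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "code": fingerprint(code),
        "configuration": fingerprint(config_bytes),
        "data": fingerprint(data),
        "models": fingerprint(model),
    }


# Hypothetical artifacts, stood in for by small byte strings.
manifest = build_manifest(
    code=b"print('train')",
    config={"learning_rate": 0.01, "epochs": 5},
    data=b"id,score\n1,91.5\n",
    model=b"\x00\x01\x02",
)
```

If any artifact changes, its hash changes, so two runs with identical manifests used identical code, configuration, data, and model bytes.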

Data Storage

Data storage systems are classified into different types.