Data Engineering: DataFlow

Keith VanderLinden
Calvin University

Data Stacks

Modern data stacks are designed to handle large datasets and to support the flow of that data through an ML system. Such a stack generally comprises:

  • Data Storage (i.e., sources, formats, models, engines)
  • Data Processing
  • Dataflow

Data Processing

Traditional systems distinguish between two types of data processing:

  • Transactional processing (OLTP)
  • Analytical processing (OLAP)

DMLS argues that this distinction is less relevant today.
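The contrast between the two workloads can be sketched against a single table. This is a minimal illustration using Python's sqlite3; the "orders" table, its rows, and the amounts-in-cents column are hypothetical.

```python
import sqlite3

# One table serving both workload styles.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, cents INTEGER)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "widget", 999), (2, "gadget", 2450), (3, "widget", 999)],
)

# OLTP-style: touch a single row inside a short transaction.
con.execute("UPDATE orders SET cents = 1199 WHERE id = 1")
con.commit()

# OLAP-style: scan and aggregate every row for analysis.
totals = con.execute(
    "SELECT product, SUM(cents) FROM orders GROUP BY product ORDER BY product"
).fetchall()
print(totals)  # [('gadget', 2450), ('widget', 2198)]
```

The same engine answers both queries here, which is part of the DMLS point: the OLTP/OLAP split is a workload distinction more than a hard technology boundary.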

Transaction Models

When performing transactions on data, we set different goals depending on the context.

  • ACID: Atomic, Consistent, Isolated, Durable
  • BASE: Basically Available, Soft state, Eventual consistency

These transaction principles are related to Brewer’s CAP theorem (Consistency, Availability, Partition tolerance).
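The "A" in ACID (atomicity) can be demonstrated with a rollback: a multi-step transaction either commits completely or leaves the data untouched. A minimal sketch using sqlite3; the "accounts" table and balances are hypothetical.

```python
import sqlite3

# A transfer that fails partway through must not leave a partial update.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, cents INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 500), ("bob", 100)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET cents = cents - 700 WHERE name = 'alice'")
        # Simulate a failure mid-transaction, before bob is credited.
        raise ValueError("insufficient funds")
except ValueError:
    pass

# The partial debit was rolled back; alice's balance is unchanged.
balance = con.execute(
    "SELECT cents FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 500
```

BASE systems relax exactly this guarantee: a write may be visible on some replicas before others, converging only eventually.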

Data Processing Tasks

Preparing data for analysis requires three traditional data processing tasks:

  • Extract
  • Transform
  • Load

These tasks can be ordered to suit the context: ETL transforms data before loading it into the target store; ELT loads the raw data first and transforms it inside the target store later.
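The three tasks can be sketched end-to-end in a few lines. This is a toy ETL pipeline; the CSV content, the "scores" schema, and the in-memory SQLite "warehouse" are all hypothetical stand-ins. In an ELT arrangement, the raw rows would be loaded first and the transform step would run inside the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data, as it might arrive from an upstream system.
raw = "name,score\nana,80\nben,95\n"

# Extract: read records out of the source format.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean and reshape the records for analysis.
records = [(r["name"].title(), int(r["score"])) for r in rows]

# Load: write the transformed records into the analytical store.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
wh.executemany("INSERT INTO scores VALUES (?, ?)", records)
loaded = wh.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
print(loaded)  # [('Ana', 80), ('Ben', 95)]
```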

Dataflow

In production systems, data generally flows between separate processes.

  • DBMS-based flow (processes share data through a database)
  • Service-based flow (processes exchange data via request/response APIs)
  • Real-time transport-based flow (processes publish and subscribe to events via a broker, e.g., Kafka)
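The real-time transport style can be sketched in-process: a producer publishes events to a broker-like queue and a consumer processes them asynchronously. Here the standard-library queue.Queue stands in for a real broker such as Kafka, and the event shape and doubling "processing" step are invented for illustration.

```python
import queue
import threading

# The queue plays the role of the real-time transport (broker).
broker = queue.Queue()
results = []

def consumer():
    # Subscriber: pull events off the transport and process each one.
    while True:
        event = broker.get()
        if event is None:  # sentinel value signals shutdown
            break
        results.append(event["value"] * 2)  # stand-in downstream processing

t = threading.Thread(target=consumer)
t.start()

# Producer: publish events without waiting on the consumer.
for v in (1, 2, 3):
    broker.put({"value": v})
broker.put(None)
t.join()
print(results)  # [2, 4, 6]
```

The producer and consumer never call each other directly; the transport decouples them, which is what distinguishes this style from DBMS-based or service-based flow.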