Data Engineering: DataFlow
Keith VanderLinden
Calvin University
Data Stacks
Modern data stacks are designed to handle large datasets and to support the flow of that data through an ML system. Such a stack generally comprises:
- Data Storage (i.e., sources, formats, models, engines)
- Data Processing
- Dataflow
Data Processing
Traditional systems distinguish between two types of data processing:
- Transactional processing (OLTP)
- Analytical processing (OLAP)
DMLS argues that this distinction is less relevant today.
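The OLTP/OLAP contrast can be seen in the shape of the queries themselves. A minimal sketch using Python's built-in sqlite3 module and a hypothetical `orders` table: the OLTP statement touches one row; the OLAP query scans and aggregates many.

```python
import sqlite3

# In-memory database with a small, illustrative orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 25.0), ("ann", 5.0)])

# OLTP: a short, row-oriented transaction touching one record.
conn.execute("UPDATE orders SET amount = 12.0 WHERE id = 1")

# OLAP: an analytical query scanning and aggregating the whole table.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('ann', 17.0), ('bob', 25.0)]
```

DMLS's point is that modern engines increasingly serve both workloads, so the same store may answer both kinds of query.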
Transaction Models
When performing transactions on data, we set different goals depending on the context.
- ACID: Atomic, Consistent, Isolated, Durable
- BASE: Basically Available, Soft state, Eventual consistency
These transaction principles are related to Brewer’s CAP theorem: under a network partition, a distributed system must trade consistency against availability, and ACID and BASE sit at opposite ends of that trade-off.
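The “A” in ACID (atomicity) can be demonstrated with sqlite3: a transfer between two hypothetical accounts either fully commits or fully rolls back, so a failure mid-transaction loses no money. The account names and amounts here are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ann", 100.0), ("bob", 0.0)])
conn.commit()

# Atomicity: the transfer is all-or-nothing.
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'ann'")
    raise RuntimeError("simulated crash before the credit is applied")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # the debit is undone; balances are unchanged

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'ann': 100.0, 'bob': 0.0}
```

A BASE-style system would instead accept the debit immediately and propagate the credit later, converging on a consistent state eventually.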
Data Processing Tasks
Preparing data for analysis requires three traditional data processing tasks:
- Extract — pull raw data from the source systems
- Transform — clean and reshape the data
- Load — write the data into the target store
These tasks can be ordered as appropriate for the context: ETL vs. ELT.
Dataflow
In production systems, data generally flows between separate processes.
- DBMS-based flow: processes share data through a common database
- Service-based flow: processes exchange data through request-driven APIs (e.g., REST or gRPC)
- Real-time transport-based flow: processes exchange data through a broker (e.g., Kafka or RabbitMQ)
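The real-time transport pattern can be sketched with Python's standard library, using a `queue.Queue` as a stand-in for a broker and threads as stand-ins for separate processes. The event payloads and the `None` sentinel are illustrative conventions, not part of any real broker's API.

```python
import queue
import threading

# A queue.Queue standing in for a real-time transport (e.g., a message broker).
broker = queue.Queue()
results = []

def producer():
    # One process publishes events onto the transport...
    for i in range(3):
        broker.put({"event": i})
    broker.put(None)  # sentinel: no more events

def consumer():
    # ...while a separate process subscribes and reacts to them.
    while True:
        msg = broker.get()
        if msg is None:
            break
        results.append(msg["event"])

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 2]
```

Unlike the DBMS- and service-based flows, the producer here never waits for (or even knows about) the consumer, which is what makes the transport suitable for streaming data between ML-system components.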