Data Engineering: DataFlow

Keith VanderLinden
Calvin University

Data Stacks

Modern data stacks are designed to handle large datasets and to support the flow of that data through an ML system. Such a stack generally comprises:

  • Data Storage (i.e., sources, formats, models, engines)
  • Data Processing
  • Dataflow

Data Processing

Traditional systems distinguish between two types of data processing:

  • Transactional processing (OLTP)
  • Analytical processing (OLAP)

DMLS argues that this distinction is less relevant today.
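The contrast between the two workloads can be sketched against a single table. This is a minimal illustration using Python's sqlite3; the "orders" table, its rows, and the amounts-in-cents column are hypothetical.

```python
import sqlite3

# One table serving both workload styles.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, cents INTEGER)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "widget", 999), (2, "gadget", 2450), (3, "widget", 999)],
)

# OLTP-style: touch a single row inside a short transaction.
con.execute("UPDATE orders SET cents = 1199 WHERE id = 1")
con.commit()

# OLAP-style: scan and aggregate every row for analysis.
totals = con.execute(
    "SELECT product, SUM(cents) FROM orders GROUP BY product ORDER BY product"
).fetchall()
print(totals)  # [('gadget', 2450), ('widget', 2198)]
```

The same engine answers both queries here, which is part of the DMLS point: the OLTP/OLAP split is a workload distinction more than a hard technology boundary.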

Transaction Models

When performing transactions on data, we set different goals depending on the context.

  • ACID: Atomic, Consistent, Isolated, Durable
  • BASE: Basically Available, Soft state, Eventual consistency

These transaction principles are related to Brewer’s CAP theorem (Consistency, Availability, Partition tolerance).
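The "A" in ACID (atomicity) can be demonstrated with a rollback: a multi-step transaction either commits completely or leaves the data untouched. A minimal sketch using sqlite3; the "accounts" table and balances are hypothetical.

```python
import sqlite3

# A transfer that fails partway through must not leave a partial update.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, cents INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 500), ("bob", 100)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET cents = cents - 700 WHERE name = 'alice'")
        # Simulate a failure mid-transaction, before bob is credited.
        raise ValueError("insufficient funds")
except ValueError:
    pass

# The partial debit was rolled back; alice's balance is unchanged.
balance = con.execute(
    "SELECT cents FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 500
```

BASE systems relax exactly this guarantee: a write may be visible on some replicas before others, converging only eventually.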

Data Processing Tasks

Preparing data for analysis requires three traditional data processing tasks:

  • Extract
  • Transform
  • Load

These tasks can be ordered to suit the context: ETL transforms data before loading it into the target store; ELT loads the raw data first and transforms it inside the target store later.
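The three tasks can be sketched end-to-end in a few lines. This is a toy ETL pipeline; the CSV content, the "scores" schema, and the in-memory SQLite "warehouse" are all hypothetical stand-ins. In an ELT arrangement, the raw rows would be loaded first and the transform step would run inside the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data, as it might arrive from an upstream system.
raw = "name,score\nana,80\nben,95\n"

# Extract: read records out of the source format.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean and reshape the records for analysis.
records = [(r["name"].title(), int(r["score"])) for r in rows]

# Load: write the transformed records into the analytical store.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
wh.executemany("INSERT INTO scores VALUES (?, ?)", records)
loaded = wh.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
print(loaded)  # [('Ana', 80), ('Ben', 95)]
```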

Dataflow

In production systems, data generally flows between separate processes.

  • DBMS-based flow (processes share data through a database)
  • Service-based flow (processes exchange data via request/response APIs)
  • Real-time transport-based flow (processes publish and subscribe to events via a broker, e.g., Kafka)
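The real-time transport style can be sketched in-process: a producer publishes events to a broker-like queue and a consumer processes them asynchronously. Here the standard-library queue.Queue stands in for a real broker such as Kafka, and the event shape and doubling "processing" step are invented for illustration.

```python
import queue
import threading

# The queue plays the role of the real-time transport (broker).
broker = queue.Queue()
results = []

def consumer():
    # Subscriber: pull events off the transport and process each one.
    while True:
        event = broker.get()
        if event is None:  # sentinel value signals shutdown
            break
        results.append(event["value"] * 2)  # stand-in downstream processing

t = threading.Thread(target=consumer)
t.start()

# Producer: publish events without waiting on the consumer.
for v in (1, 2, 3):
    broker.put({"value": v})
broker.put(None)
t.join()
print(results)  # [2, 4, 6]
```

The producer and consumer never call each other directly; the transport decouples them, which is what distinguishes this style from DBMS-based or service-based flow.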