Data Integration
No single tool does it all. A complex application typically needs to combine several systems.
Systems of Record vs. Derived Data
- System of Record: The authoritative source of truth (e.g., your primary OLTP database). New data is written here first, and it is typically normalized and durable.
- Derived Data: Data that is transformed or aggregated from the system of record (e.g., a cache, search index, or data warehouse). It can be recreated from the source if lost.
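As a minimal sketch of that last point, the snippet below treats two in-memory structures as the system of record and rebuilds a derived, denormalized view from them after it is lost. The table contents and the function name `build_order_totals` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# System of record: normalized rows (hypothetical example data).
users = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"id": 100, "user_id": 1, "total": 30},
    {"id": 101, "user_id": 1, "total": 12},
    {"id": 102, "user_id": 2, "total": 50},
]


def build_order_totals(users, orders):
    """Derive a denormalized, aggregated view from the source tables."""
    totals = defaultdict(int)
    for order in orders:
        totals[order["user_id"]] += order["total"]
    # Join against users so readers of the view need no further lookup.
    return {users[uid]["name"]: total for uid, total in totals.items()}


derived_view = build_order_totals(users, orders)

# Simulate losing the derived data (for example, a cache flush).
derived_view.clear()

# Nothing is permanently lost: the view is rebuilt from the source.
derived_view = build_order_totals(users, orders)
print(derived_view)
```

The reverse does not hold: if `orders` were lost, the aggregated totals could not be used to recover the individual order rows.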
Evolution of Architectures
- Lambda Architecture: Run two parallel systems over the same input and merge their results at query time.
  - Batch Layer: Processes all historical data (accurate but slow).
  - Speed Layer: Processes recent data (fast but approximate).
  - Problem: You have to maintain two codebases (one for batch, one for stream) that implement the same logic and must be kept in sync.
- Kappa Architecture: "Stream processing is all you need."
  - Treat the log of all events as the definitive record.
  - To reprocess historical data (e.g., to fix a bug or build a new index), simply start a new stream consumer from the beginning of the log.
  - Requires a log that can be retained indefinitely (like Kafka with long retention or tiered storage).
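The contrast between the two architectures can be sketched with a small in-memory example. Everything here is an assumption made for illustration: the log is a plain Python list whose index plays the role of an offset, and `batch_count`, `SpeedLayer`, `lambda_query`, and `consume` are hypothetical names, not the API of any real system. In this toy version both Lambda layers are exact; in practice the speed layer often trades accuracy for latency.

```python
from collections import Counter

# Hypothetical append-only event log; the list index acts as the offset.
event_log = ["view:/home", "view:/docs", "view:/home", "view:/pricing", "view:/home"]


# --- Lambda: the same counting logic is written twice ---
def batch_count(events):
    """Batch layer: recompute page counts over all events it is given."""
    return Counter(e.split(":", 1)[1] for e in events)


class SpeedLayer:
    """Speed layer: incrementally counts events the batch job has not seen."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event.split(":", 1)[1]] += 1


def lambda_query(batch_view, speed_view, page):
    """Query time: merge the batch view with the speed view."""
    return batch_view[page] + speed_view[page]


batch_view = batch_count(event_log[:3])  # last batch run covered offsets 0..2
speed = SpeedLayer()
for event in event_log[3:]:              # newer events arrive as a stream
    speed.on_event(event)
lambda_home = lambda_query(batch_view, speed.counts, "/home")


# --- Kappa: one stream processor; reprocessing is a replay from offset 0 ---
def consume(log, handler, start_offset=0):
    """Feed every event from start_offset onward to handler."""
    for offset in range(start_offset, len(log)):
        handler(log[offset])
    return len(log)  # next offset to read


kappa_counts = Counter()


def count_page(event):
    kappa_counts[event.split(":", 1)[1]] += 1


# Building a new view (or rebuilding after a bug fix) starts at offset 0.
next_offset = consume(event_log, count_page, start_offset=0)

print(lambda_home, kappa_counts["/home"], next_offset)
```

Note that `batch_count` and `SpeedLayer.on_event` duplicate the same parsing and counting rule, which is the maintenance burden described above, while the Kappa path has a single handler that serves both live processing and reprocessing.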
Knowledge Check
What is the main operational disadvantage of the Lambda Architecture?
It cannot handle real-time data.
Maintaining the same logic in two separate codebases (batch and stream) is complex.
It relies on a single leader.