Successfully started Week 1: Dimensional Data Modeling of the "Finish the YouTube Boot Camp" hosted by Zach Wilson and DataExpert.io, and completed the Day 1 lecture and lab sessions 💪
Here's the gist of the topics and my insights from the lesson:
Day 1: Working with Complex Data Types - Struct, Array, etc.
1. Know Your Consumer
Before beginning any data modeling, it is imperative to understand in depth who the end user of the data is. Whether the data is used for analytics by data analysts and data scientists, consumed by downstream jobs managed by data engineers to curate master data with other pipeline dependencies, or fed into ML models and executive dashboards, the shape of the data model varies distinctly in terms of complex vs. flat data types, storage and compression, and ease of querying and accessibility.
2. OLTP vs. Master Data vs. OLAP Continuum
Understanding the differences when modeling a transactional system, like an application database that needs low-latency, low-volume queries, versus an analytical system, such as OLAP cubes built for fast analysis on aggregations, while also finding the sweet spot in between where master data sits: deduped and optimized for completeness of entity definitions, so that other datasets can be derived from it.
3. Cumulative Table Design
Cumulative table designs are very commonly used to create master data: you hold on to all of the dimensions that ever existed right up until a specific point in time (until purged or hibernated). Such designs are beneficial for tracking state transitions across metrics, e.g. for growth accounting, which can then be used to analyze and model patterns. The design especially shines when computing cumulative metrics, using complex data types such as an array of struct to accumulate the changing values, as shown in the sketch below.
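To make the pattern concrete, here's a minimal PySpark sketch of one common way to build a cumulative table: FULL OUTER JOIN yesterday's cumulated state with today's snapshot. The table names (users_cumulated, users_daily), the is_active flag, and the dates are my own illustrative assumptions, not from the lecture.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: yesterday's cumulated state and today's snapshot.
yesterday = spark.table("users_cumulated").where("snapshot_date = '2024-01-01'")
today = spark.table("users_daily").where("event_date = '2024-01-02'")

# FULL OUTER JOIN keeps users seen only in history, only today, or in both.
cumulated = (
    yesterday.alias("y")
    .join(today.alias("t"), F.col("y.user_id") == F.col("t.user_id"), "full_outer")
    .select(
        F.coalesce(F.col("t.user_id"), F.col("y.user_id")).alias("user_id"),
        # Append today's activity flag to the running history array.
        F.concat(
            F.coalesce(F.col("y.activity_history"), F.array().cast("array<boolean>")),
            F.array(F.coalesce(F.col("t.is_active"), F.lit(False))),
        ).alias("activity_history"),
        F.lit("2024-01-02").alias("snapshot_date"),
    )
)
```

Running this daily folds each user's full activity history into one array per row, so growth-accounting states (new, retained, churned, resurrected) can be derived without rescanning the entire history.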
4. Complex Data Types
The choice of complex data types depends on the type of modeling and the end user, ranging from most compact for transactional purposes to most usable for analytics, with upstream staging or master data sitting somewhere in between. Commonly used complex data types such as struct, map, array, and nested types like array of struct come in handy for compacting datasets.
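As a small illustration, here's a PySpark sketch that folds a flat table into an array of struct and a map; the player/game schema is a hypothetical example, not from the lab.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A flat table: one row per player per game date.
df = spark.createDataFrame(
    [("p1", "2024-01-01", 7.5), ("p1", "2024-01-02", 9.0)],
    ["player_id", "game_date", "points"],
)

compacted = df.groupBy("player_id").agg(
    # Array of struct: all of a player's games folded into a single row.
    F.collect_list(F.struct("game_date", "points")).alias("game_stats"),
    # Map: the same pairs exposed as key/value lookups.
    F.map_from_entries(
        F.collect_list(F.struct("game_date", "points"))
    ).alias("points_by_date"),
)
compacted.printSchema()
```

The array-of-struct form preserves every entry, duplicates included, while the map trades that for direct key lookups: roughly the compact-vs.-usable trade-off described above.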
5. Temporal Cardinality Explosion, Compression & Run-length Encoding
Explored the importance of considering cardinality when working with dimensions that have a temporal aspect, and the need to sort data correctly before compressing it, e.g. when using the Parquet format with run-length encoding. Complex data types such as array of struct can also be used to fold the temporal dimension values into one row, which prevents a Spark shuffle from ruining compression in distributed environments.
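Here's a hedged sketch of that sorting step in PySpark; the table name, columns, and output path are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.table("player_seasons")  # hypothetical source table

# sortWithinPartitions restores local order after the shuffle, keeping
# low-cardinality values contiguous on disk so Parquet's run-length
# encoding can collapse the repeated values.
(
    events
    .repartition("season")                   # colocate rows for each season
    .sortWithinPartitions("season", "team")  # sort low -> high cardinality
    .write.mode("overwrite")
    .parquet("/tmp/player_seasons_sorted")   # illustrative output path
)
```

Sorting from the lowest-cardinality column outward maximizes run lengths; a shuffle landing after the sort would scramble that order, which is exactly the problem the array-of-struct approach above sidesteps.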
Thank you, Zach Wilson & DataExpert.io, for the incredible session!
Day 2, loading 🔜 🚀
#bootcamp #zachwilson #dataexpertio #dataengineering #freeyoutubebootcamp #finishtheyoutubebootcamp #rampup #upskilling #onwardsandupwards