Data Engineering 101 Sample DW
Motivation for data warehouse
➔ Organizations face increasingly complex challenges in achieving operational goals, so they need
analysis tools to support decision making.
➔ Traditional operational (transactional) databases do not satisfy the requirements of data analysis
◆ Designed and optimized to support daily business operations; primary concern: concurrent access
and recovery techniques that guarantee data consistency.
◆ Contain detailed data and perform poorly on complex queries that join many tables or
aggregate large volumes of data, which adds cost and consumes extra resources.
Motivation for data warehouse
➔ To analyze the behavior of an organization, data from several operational systems must be
integrated
◆ Difficult to accomplish because the systems differ in data definitions and content
◆ Time-consuming and costly because analyzing transactional data is complex
Motivation for data warehouse
● Subject-oriented: Organized around business entities (e.g., customers, products, and employees)
rather than business processes (OrderHeader, OrderDetail, PurchaseItems,...)
● Integrated: many transformations to unify source data from independent data sources (units of
measure, data formats, naming conventions)
● Time-variant: historical data, snapshots of business processes captured at different points in time
● Designed for analysis (non-volatile): new data is appended periodically; existing data is not changed
OLTP vs OLAP
● Operational databases (online transaction processing, or OLTP, systems) are not suitable for data analysis
○ Contain detailed data; historical data is isolated; perform poorly on complex queries because of
normalization
○ Do not contain measures
○ Hold primary data from transactions
○ Support daily operations and short-term decisions
● Online analytical processing (OLAP): allows decision makers to perform interactive analysis of data
○ Transformed, secondary data
○ Aggregated data and derived measures
○ Medium- and long-term decisions
OLTP vs OLAP
Important
Characteristic | Operational Database (OLTP) | Data Warehouse (OLAP)
Query | Insert, update, and delete information in the database | Mostly select operations. Data load mechanisms: full load, incremental load (Insert + Delete pattern is preferred)
Update | Data level | Not recommended; possible, but insert-only is recommended
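The "Insert + Delete" incremental load mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `fact_sales` table partitioned by a `load_date` column and using an in-memory SQLite database as a stand-in for the warehouse; the helper name `load_partition` is also an assumption, not something from the slides.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (load_date TEXT, product TEXT, amount REAL)")


def load_partition(conn, load_date, rows):
    """Replace one day's rows: delete the partition, then insert the new batch."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM fact_sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO fact_sales (load_date, product, amount) VALUES (?, ?, ?)",
            [(load_date, product, amount) for product, amount in rows],
        )


# First load, then a re-run of the same day with a corrected amount.
load_partition(conn, "2024-01-01", [("Laptop", 100.0), ("Phone", 30.0)])
load_partition(conn, "2024-01-01", [("Laptop", 110.0), ("Phone", 30.0)])

total = conn.execute("SELECT COUNT(*), SUM(amount) FROM fact_sales").fetchone()
print(total)
```

Because the partition is deleted before it is re-inserted, re-running a load for the same day does not duplicate rows and no row-level update is needed.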
Multidimensional Model
● DWs and OLAP use a multidimensional view of data
● Represented as a Data CUBE
○ Dimensions: Perspectives for analyzing data
○ Cells (facts): Contain measures, values that are to be analyzed
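As a rough illustration of dimensions and cells, a cube can be modelled as a mapping from one member per dimension to a measure. The data, the dimension names, and the `cell` helper below are hypothetical and only meant to make the terms concrete.

```python
# Hypothetical toy cube: each cell is addressed by one member per dimension
# (quarter, product, store) and holds one measure (sales amount).
dimensions = ("quarter", "product", "store")
cube = {
    ("Q1", "Laptop", "Paris"): 120.0,
    ("Q1", "Phone", "Paris"): 80.0,
    ("Q2", "Laptop", "Lyon"): 95.0,
    ("Q2", "Phone", "Berlin"): 60.0,
}


def cell(quarter, product, store):
    """Return the measure stored in one cell, or 0.0 if no fact exists there."""
    return cube.get((quarter, product, store), 0.0)


print(cell("Q1", "Laptop", "Paris"))
print(cell("Q1", "Laptop", "Lyon"))
```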
Hierarchies
● As with related entities in a 3rd Normal Form schema, the dimensions of a cube are organized into hierarchies
● Data granularity: level of detail at which measures are represented for each dimension of the cube
● Data analyzed at different granularities (abstraction levels)
● Hierarchies relate low-level (detailed) concepts to higher-level (general) concepts
○ Example: Store – City – Region/Province – Country
● Given two related levels in a hierarchy, lower level is called child, higher level is called parent
● Instances of these levels are called members
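The parent/child terminology can be sketched with a simple lookup table. The members below (store names, regions) are made up for illustration, and the `ancestors` and `children` helpers are hypothetical names.

```python
# Hypothetical members of the Store - City - Region/Province - Country hierarchy.
# Each key is a child member, each value is its parent member.
parent = {
    "Store A": "Paris",                  # store -> city
    "Store B": "Lyon",
    "Paris": "Ile-de-France",            # city -> region
    "Lyon": "Auvergne-Rhone-Alpes",
    "Ile-de-France": "France",           # region -> country
    "Auvergne-Rhone-Alpes": "France",
}


def ancestors(member):
    """Walk from a member up to the top of the hierarchy."""
    path = []
    while member in parent:
        member = parent[member]
        path.append(member)
    return path


def children(member):
    """Return the members whose parent is the given member."""
    return sorted(child for child, p in parent.items() if p == member)


print(ancestors("Store A"))
print(children("France"))
```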
OLAP Operations: Roll up
● Transforms detailed measures into summarized ones by moving up in a hierarchy
● Example: roll up total sales by quarter and product for each store to obtain total sales by quarter and
product for each country
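A minimal sketch of this roll-up, assuming hypothetical store names and a store-to-country lookup: store-level cells are summed into their parent country.

```python
from collections import defaultdict

# Hypothetical mapping from the store level to the country level.
store_country = {"Paris-1": "France", "Lyon-1": "France", "Berlin-1": "Germany"}

# Detailed facts: (quarter, product, store) -> total sales.
sales = {
    ("Q1", "Laptop", "Paris-1"): 100,
    ("Q1", "Laptop", "Lyon-1"): 50,
    ("Q1", "Laptop", "Berlin-1"): 70,
    ("Q2", "Phone", "Paris-1"): 30,
}


def roll_up_store_to_country(facts):
    """Aggregate store-level cells into country-level cells."""
    out = defaultdict(int)
    for (quarter, product, store), amount in facts.items():
        out[(quarter, product, store_country[store])] += amount
    return dict(out)


by_country = roll_up_store_to_country(sales)
print(by_country)
```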
OLAP Operations: Drill down
● The opposite of roll-up: moves from a more general level to a more detailed level in a
hierarchy
● Example: drill down on total sales by store and product for each quarter to obtain total sales by
product and store for each month
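Drill-down only works if the detailed facts are still available. The sketch below assumes hypothetical month-level facts and a month-to-quarter lookup: the quarterly view is derived from the monthly data, and drilling down returns the monthly cells behind one quarter.

```python
from collections import defaultdict

# Hypothetical mapping from the month level to the quarter level.
month_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Detailed facts: (month, product, store) -> total sales.
monthly = {
    ("Jan", "Laptop", "Paris"): 40,
    ("Feb", "Laptop", "Paris"): 35,
    ("Mar", "Laptop", "Paris"): 25,
    ("Apr", "Laptop", "Paris"): 60,
}


def by_quarter(facts):
    """Summarized view: total sales per (quarter, product, store)."""
    out = defaultdict(int)
    for (month, product, store), amount in facts.items():
        out[(month_quarter[month], product, store)] += amount
    return dict(out)


def drill_down(facts, quarter):
    """Return the month-level cells that make up one quarter."""
    return {key: amount for key, amount in facts.items()
            if month_quarter[key[0]] == quarter}


quarterly = by_quarter(monthly)
q1_detail = drill_down(monthly, "Q1")
print(quarterly)
print(q1_detail)
```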
OLAP Operations: Pivot or Rotate
● Rotates the axes of a cube to provide an alternative presentation of the data
● Changes which dimensions are on the axes so the data can be analyzed from a different angle
● Example: analyze store sales by city instead of by product
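For a two-dimensional view, pivoting amounts to swapping rows and columns. The table contents and the `pivot` helper below are hypothetical.

```python
# Hypothetical 2-D view of a cube: rows are products, columns are cities.
table = {
    "Laptop": {"Paris": 100, "Lyon": 50},
    "Phone": {"Paris": 30, "Lyon": 20},
}


def pivot(tbl):
    """Swap the row and column dimensions of a nested-dict table."""
    rotated = {}
    for row_key, row in tbl.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


by_city = pivot(table)
print(by_city)
```

The values are unchanged; only the dimension used for the rows differs, so pivoting twice gives back the original table.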
OLAP Operations: Slice
● Performs a selection on one dimension of a cube, resulting in a subcube
● Example: slice total sales by quarter and product for each store to obtain total sales by product and
quarter for the store “Paris”
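A slice can be sketched as fixing one dimension to a single member and dropping that dimension from the result. The facts and the `slice_store` helper are hypothetical.

```python
# Hypothetical facts: (quarter, product, store) -> total sales.
sales = {
    ("Q1", "Laptop", "Paris"): 100,
    ("Q1", "Phone", "Paris"): 30,
    ("Q1", "Laptop", "Lyon"): 50,
    ("Q2", "Laptop", "Paris"): 80,
}


def slice_store(facts, store):
    """Fix the store dimension to one member; the subcube keeps quarter and product."""
    return {(quarter, product): amount
            for (quarter, product, s), amount in facts.items()
            if s == store}


paris = slice_store(sales, "Paris")
print(paris)
```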
OLAP Operations: Dice
● Defines a selection on two or more dimensions, again resulting in a subcube
● Example: dice total sales by product and store for each quarter to obtain total sales by product where
the store's country is France and the quarter is Q1 or Q2
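A dice can be sketched as a filter on several dimensions at once. The facts, the store-to-country lookup, and the `dice` helper below are hypothetical.

```python
# Hypothetical mapping from store to country.
store_country = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}

# Hypothetical facts: (quarter, product, store) -> total sales.
sales = {
    ("Q1", "Laptop", "Paris"): 100,
    ("Q2", "Laptop", "Lyon"): 50,
    ("Q3", "Laptop", "Paris"): 80,
    ("Q1", "Laptop", "Berlin"): 70,
}


def dice(facts, quarters, country):
    """Keep cells whose quarter is in `quarters` and whose store is in `country`."""
    return {key: amount for key, amount in facts.items()
            if key[0] in quarters and store_country[key[2]] == country}


subcube = dice(sales, {"Q1", "Q2"}, "France")
print(subcube)
```

Unlike a slice, the result keeps all three dimensions; it only restricts which members appear on two of them.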
OLAP Operations – Summary
Important
Operation | Purpose (example) | How
Dice | Focus attention on a subset of dimensions (Q1 and Q2, France) | Replace a dimension with two or more values
Drill-down | Obtain more detail about a dimension (Month) | Navigate from a more general level to a more specific level
Roll-up | Summarize details about a dimension (Country) | Navigate from a more specific level to a more general level
Pivot | Present data in a different order/angle (Product ⇒ Store) | Rearrange the dimensions in a data cube
Top-Down Architecture
Top-Down (Inmon)
● Warehouse feeds marts
● Enterprise data warehouse
● Higher integration levels
● Logically centralized
● Larger project scope
Bottom-up Architecture
Bottom-Up (Kimball)
Data Mesh & Data Fabric
Trending
NEW Modern Data Landscape
➔ With data mesh and data fabric frameworks, the specific tools you use matter less.
➔ Tools can be plugged in and combined ⇒ Learn the foundations
General Architecture
This pattern applies to all data platforms.
● Data sources
○ Operational databases
○ Other internal or external sources of information (e.g., files)
● Back-end tier
○ Extraction-Transformation-Loading (ETL) tools for manipulating data from sources
○ Data staging area: intermediate database where manipulation is done
● Warehouse tier: centralizes logic, feeds marts, manages metadata and governance
● OLAP tier
○ OLAP server: supports multidimensional data and operations
● Front-end tier: deals with data analysis and visualization
○ Composed of OLAP tools, reporting tools, statistical tools, data-mining tools, …
Data pipelines
1. What is a data pipeline?
2. ETL
3. Extraction
4. Cleaning
5. Transformation
6. Data Loading
7. ETL vs ELT
8. Batching, Streaming, Lambda
What is a data pipeline?
Data pipelines consist of three essential elements: source, processing steps, and destination.
● Source
○ Sources are where data comes from. Common sources include relational database
management systems, CRMs, ERPs, social media management tools, and even IoT device
sensors.
● Processing steps
○ In general, data is extracted from sources, manipulated and changed according to
business needs, and then deposited at its destination. Processing steps apply both
engineering and business logic.
○ Types of processing: transformation, augmentation, filtering, grouping, and aggregation.
● Destination
○ A destination is where the data arrives at the end of its processing, typically a data lake or data
warehouse for analysis.
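The three elements can be sketched as three small functions. This is a minimal sketch under stated assumptions: the CSV text stands in for a real source, a plain dictionary stands in for the data warehouse, and the function names `extract`, `transform`, and `load` are chosen for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical source: CSV text standing in for an export from an operational system.
SOURCE = """order_id,country,amount
1,France,100
2,France,
3,Germany,70
4,France,50
"""


def extract(text):
    """Source: read raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Processing: filter out incomplete rows, then group and aggregate by country."""
    totals = defaultdict(float)
    for row in rows:
        if not row["amount"]:  # filtering: skip rows with a missing amount
            continue
        totals[row["country"]] += float(row["amount"])  # grouping + aggregation
    return dict(totals)


def load(totals, destination):
    """Destination: write the aggregated result into the target store."""
    destination.update(totals)


warehouse = {}  # stand-in for a data lake or data warehouse table
load(transform(extract(SOURCE)), warehouse)
print(warehouse)
```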