Data Engineering 101 Sample DW

Uploaded by

vietquang90dn
Data Warehouse

1. Motivation for data warehouse
2. OLTP vs OLAP
3. Multidimensional Model
4. Hierarchies Dimension
5. OLAP Operations
6. Data warehouse Architectures

[Figure: Data evolution (OLTP, OLAP, OLTP)]

Motivation for data warehouse
➔ Organizations face increasingly complex challenges in achieving their operational goals, so they need analysis tools to support decision-making.
➔ Traditional operational or transactional databases do not satisfy the requirements for data analysis
◆ Designed/optimized to support daily business operations; primary concern: concurrent access and recovery techniques to guarantee data consistency.
◆ Contain detailed data and perform poorly on complex queries that involve many tables or aggregate large volumes of data, which costs extra time and resources.

Motivation for data warehouse
➔ To analyze the behavior of an organization, data from several operational systems must be integrated
◆ Difficult to accomplish because data definitions and content differ across systems
◆ Time-consuming and costly because analysis over transactional data is complex
➔ A data warehouse addresses the requirements of decision-making
◆ Populated from operational databases and external data sources
◆ Integrated and transformed data
◆ Optimized for reporting and periodic integration

Motivation for data warehouse
● Subject-oriented: organized around business entities (e.g., customers, products, and employees) rather than business processes (OrderHeader, OrderDetail, PurchaseItems, ...)
● Integrated: many transformations unify source data from independent data sources (units of measure, data formats, naming conventions)
● Time-variant: historical data, snapshots of business processes captured at different points in time
● Non-volatile (designed for analysis): new data is appended periodically; existing data is not changed

OLTP vs OLAP
● Operational databases (online transaction processing systems, or OLTP) are not suitable for data analysis
○ Contain detailed data; historical data is isolated; perform poorly on complex queries because normalization forces many joins
○ Do not contain measures
○ Primary data from transactions
○ Support daily operations and short-term decisions
● Online analytical processing (OLAP) allows decision-making users to perform interactive analysis of data
○ Transformed secondary data
○ Aggregated data and derived measures
○ Support medium- and long-term decisions
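
The contrast can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table, its columns, and the sample rows are assumptions made for this example, not part of the original material. The first half performs OLTP-style work (short transactions on individual rows), the second half an OLAP-style aggregate query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, store TEXT, product TEXT, amount REAL)"
)

# OLTP-style work: short transactions that insert or update individual rows.
with conn:
    conn.executemany(
        "INSERT INTO orders (store, product, amount) VALUES (?, ?, ?)",
        [
            ("Paris", "Laptop", 1200.0),
            ("Paris", "Phone", 800.0),
            ("Lyon", "Laptop", 1100.0),
        ],
    )
with conn:
    conn.execute("UPDATE orders SET amount = 750.0 WHERE order_id = 2")

# OLAP-style work: a read-only query that aggregates many rows into a measure.
sales_by_store = dict(
    conn.execute("SELECT store, SUM(amount) FROM orders GROUP BY store").fetchall()
)
```

In a real platform the two workloads would normally run on separate systems; a single in-memory database is used here only to keep the sketch self-contained.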

OLTP vs OLAP
Important
| Characteristic | Operational Database (OLTP) | Data Warehouse (OLAP) |
|---|---|---|
| User | It is an online transactional system. It manages database modification. | Data retrieving process. |
| Function | Day to day operations | Decision support |
| Details level | Individual (Transaction of event) | Individual and summary (Transaction of measurement/fact) |
| Query | Insert, Update, and Delete information from the database. | Mostly select operations. Data Loads Mechanism: Full load, Incremental Load (Insert + Delete pattern is preferred) |
| Record per request | Few | Thousands |
| Update data level | Not recommended | Could be, recommend insert only |
| Data model | Relational | Relational (star schema) and multidimensional (data cubes) |
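
The two load mechanisms named in the table can be sketched as follows. The sketch reads "Insert + Delete pattern" as "delete the slice being reloaded, then insert its fresh rows", which is an interpretation, not something the slides spell out. The `fact_sales` table and the helper names `full_load` and `incremental_load` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, store TEXT, amount REAL)")


def full_load(rows):
    """Full load: discard everything and reload the whole table."""
    with conn:
        conn.execute("DELETE FROM fact_sales")
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)


def incremental_load(sale_date, rows):
    """Incremental load: delete one day's slice, then insert its fresh rows."""
    with conn:
        conn.execute("DELETE FROM fact_sales WHERE sale_date = ?", (sale_date,))
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)


full_load([("2024-01-01", "Paris", 100.0), ("2024-01-02", "Paris", 50.0)])

# Running the same incremental load twice leaves the table in the same state,
# because the day's old rows are removed before the new ones are inserted.
incremental_load("2024-01-02", [("2024-01-02", "Paris", 70.0)])
incremental_load("2024-01-02", [("2024-01-02", "Paris", 70.0)])

total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

One reason such a pattern may be preferred over row-level updates is that a failed or repeated run can simply be re-executed without creating duplicates.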
OLTP vs. OLAP Schema Comparison
Do not confuse the relational data model with the dimensional data model: they differ in structure, design, functionality, purpose, etc.

[Figure: schema of an operational database compared with the schema of a data warehouse]

Multidimensional Model
● DWs and OLAP use a multidimensional view of data
● Represented as a Data CUBE
○ Dimensions: Perspectives for analyzing data
○ Cells (facts): Contain measures, values that are to be analyzed

- Measure Aggregation and Summarizability
- Measure Classification
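
A minimal sketch of a cube, assuming three dimensions (product, store, quarter) and one additive measure (sales). The dimension names and the sample facts are made up for illustration.

```python
from collections import defaultdict

# Each fact carries one member per dimension plus a sales measure.
facts = [
    ("Laptop", "Paris", "Q1", 10),
    ("Laptop", "Paris", "Q1", 5),
    ("Phone", "Lyon", "Q2", 7),
]

# The cube maps a cell's coordinates (product, store, quarter) to its measure.
cube = defaultdict(int)
for product, store, quarter, sales in facts:
    cube[(product, store, quarter)] += sales  # SUM is the aggregation function
```

Sales is summed here because it is an additive measure; a measure such as a unit price would need a different aggregation function (for example an average).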

Hierarchies
● As in a 3rd Normal Form model, the dimensions of a CUBE are organized into hierarchies
● Data granularity: level of detail at which measures are represented for each dimension of the cube
● Data is analyzed at different granularities (abstraction levels)
● Hierarchies relate low-level (detailed) concepts to higher-level (general) concepts
○ Example: Store – City – Region/Province – Country
● Given two related levels in a hierarchy, the lower level is called the child and the higher level the parent
● Instances of these levels are called members
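
The Store – City – Region/Province – Country example can be sketched as a child-to-parent mapping between members. The store names are hypothetical, and `ancestors` is a helper introduced only for this illustration.

```python
# Child -> parent links between members of adjacent levels.
PARENT = {
    "Store A": "Paris",          # store -> city
    "Store B": "Paris",
    "Paris": "Ile-de-France",    # city -> region
    "Ile-de-France": "France",   # region -> country
}


def ancestors(member):
    """Return the chain of parents from a member up to the top level."""
    chain = []
    while member in PARENT:
        member = PARENT[member]
        chain.append(member)
    return chain


path = ancestors("Store A")
```

Here `path` lists the city, region, and country that "Store A" rolls up to, in that order.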

[Figure: example hierarchies of the Product, Time, and Customer dimensions]

OLAP Operations
➔ Roll up
➔ Drill down
➔ Pivot or Rotate
➔ Slice
➔ Dice

OLAP Operations: Roll up
● Transforms detailed measures into summarized ones by moving up in a hierarchy
● Example: roll up total sales by quarter and product for each store to obtain total sales by quarter and product for each country
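
A minimal sketch of roll-up from the store level to the country level. The cube is modeled as a dictionary keyed by (store, product, quarter); the store-to-country mapping and the figures are assumptions for illustration.

```python
from collections import defaultdict

STORE_TO_COUNTRY = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}

# Cells keyed by (store, product, quarter) -> total sales.
sales = {
    ("Paris", "Laptop", "Q1"): 10,
    ("Lyon", "Laptop", "Q1"): 4,
    ("Berlin", "Laptop", "Q1"): 6,
    ("Paris", "Phone", "Q2"): 3,
}


def roll_up_store_to_country(cells):
    """Replace each store by its parent country and sum the merged cells."""
    out = defaultdict(int)
    for (store, product, quarter), amount in cells.items():
        out[(STORE_TO_COUNTRY[store], product, quarter)] += amount
    return dict(out)


by_country = roll_up_store_to_country(sales)
```

The two French stores collapse into a single "France" cell per product and quarter, which is the summarization the slide describes.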

OLAP Operations: Drill down
● The opposite of roll-up: moves from a more general level to a more detailed level in a hierarchy
● Example: drill down on total sales by store and product for each quarter to obtain total sales by store and product for each month
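
A sketch of drill-down from quarter to month. It assumes the base facts are stored at month granularity, since a summary cannot be broken down unless the finer-grained data was kept; the mapping and figures are made up.

```python
from collections import defaultdict

MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Base facts at month granularity: (store, product, month) -> sales.
monthly = {
    ("Paris", "Laptop", "Jan"): 4,
    ("Paris", "Laptop", "Feb"): 6,
    ("Paris", "Laptop", "Apr"): 5,
}


def by_quarter(cells):
    """Summarized view: aggregate months into quarters."""
    out = defaultdict(int)
    for (store, product, month), amount in cells.items():
        out[(store, product, MONTH_TO_QUARTER[month])] += amount
    return dict(out)


def drill_down(cells, store, product, quarter):
    """Return the monthly cells that make up one quarterly cell."""
    return {
        key: amount
        for key, amount in cells.items()
        if key[0] == store
        and key[1] == product
        and MONTH_TO_QUARTER[key[2]] == quarter
    }


quarterly = by_quarter(monthly)
detail = drill_down(monthly, "Paris", "Laptop", "Q1")
```

The monthly values in `detail` add up to the quarterly cell they were drilled down from.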

OLAP Operations: Pivot or Rotate
● Rotates the axes of a cube to provide an alternative presentation of the data
● Swaps dimensions so the same data can be analyzed from a different angle
● Example: analyze stores by city instead of by product
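
A sketch of pivot on a two-dimensional view of the cube, modeled as a nested dictionary. The helper name `pivot` and the figures are assumptions for illustration.

```python
# A two-dimensional view of the cube: rows are products, columns are stores.
by_product = {
    "Laptop": {"Paris": 10, "Lyon": 4},
    "Phone": {"Paris": 3},
}


def pivot(table):
    """Swap the row and column axes of a nested-dict table."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


by_store = pivot(by_product)
```

No value changes during a pivot; only the arrangement of the axes does, so pivoting twice returns the original view.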

OLAP Operations: Slice
● Performs a selection on one dimension of a cube, resulting in a subcube
● Example: slice total sales by quarter and product for each store to obtain total sales by product and quarter for the store “Paris”
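
A sketch of slice, assuming the same dictionary-based cube keyed by (store, product, quarter). The helper name `slice_store` and the figures are hypothetical.

```python
sales = {
    ("Paris", "Laptop", "Q1"): 10,
    ("Paris", "Phone", "Q2"): 3,
    ("Lyon", "Laptop", "Q1"): 4,
}


def slice_store(cells, store):
    """Fix the store dimension to one member and drop it from the keys."""
    return {
        (product, quarter): amount
        for (s, product, quarter), amount in cells.items()
        if s == store
    }


paris = slice_store(sales, "Paris")
```

The result has one dimension fewer than the input, because the store dimension has been fixed to a single member.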

OLAP Operations: Dice
● Defines a selection on two or more dimensions, again resulting in a subcube
● Example: dice total sales by product and store for each quarter to obtain total sales by product where the store's country is France and the quarter is Q1 or Q2
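
A sketch of dice on the same kind of dictionary-based cube. The selection applies to two dimensions at once (store, via its country, and quarter); the mapping, the helper name `dice`, and the figures are made up.

```python
STORE_TO_COUNTRY = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}

# Cells keyed by (store, product, quarter) -> total sales.
sales = {
    ("Paris", "Laptop", "Q1"): 10,
    ("Lyon", "Laptop", "Q3"): 4,
    ("Berlin", "Laptop", "Q1"): 6,
    ("Paris", "Phone", "Q2"): 3,
}


def dice(cells, countries, quarters):
    """Keep only the cells whose store country and quarter are both selected."""
    return {
        key: amount
        for key, amount in cells.items()
        if STORE_TO_COUNTRY[key[0]] in countries and key[2] in quarters
    }


french_first_half = dice(sales, {"France"}, {"Q1", "Q2"})
```

Unlike a slice, the subcube keeps all of its dimensions; each one is only restricted to a subset of its members.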

OLAP Operations – Summary
Important

| Operations | Purpose | Description |
|---|---|---|
| Slice | Focus attention on one member value of a dimension (Paris) | Replace a dimension with a single member value or with a summary of its measure values |
| Dice | Focus attention on a subset of members in two or more dimensions (France; Q1 and Q2) | Replace a dimension with two or more of its values |
| Drill-down | Obtain more detail about a dimension (Month) | Navigate from a more general level to a more specific level |
| Roll-up | Summarize details about a dimension (Country) | Navigate from a more specific level to a more general level |
| Pivot | Present data in a different order/angle (Product ⇒ Store) | Rearrange the dimensions in a data cube |

Top-Down Architecture
Top-Down (Inmon)
● Warehouse feeds marts
● Enterprise data warehouse
● Higher integration levels
● Logically centralized
● Larger project scope

Bottom-up Architecture
Bottom-Up (Kimball)
● Marts compose the warehouse
● Independent data marts
● Lower integration levels
● Logically decentralized
● Smaller project scope

Data Mesh & Data Fabric
Trending

NEW Modern Data Landscape
➔ When you apply data mesh and data fabric frameworks, it does not matter which tools you use.
➔ Tools can be plugged in and used together ⇒ learn the foundations

[Figure: diagram combining leading services]

General Architecture
This is the general pattern for data platforms.
● Data sources
○ Operational databases
○ Other internal or external sources of information (e.g. files)
● Back-end tier
○ Extraction-Transformation-Loading (ETL) tools for manipulating data from sources
○ Data staging area: intermediate database where manipulation is done
● Warehouse tier: centralizes logic, feeds marts, manages metadata and governance
● OLAP tier
○ OLAP server: supports multidimensional data and operations
● Front-end tier: deals with data analysis and visualization
○ Composed of OLAP tools, reporting tools, statistical tools, data-mining tools, …
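
The tiers can be sketched as a chain of small functions. Everything here is a stand-in: the sources are plain lists, the warehouse is a list, and the function names are hypothetical, so this shows only how data flows between tiers, not how any real tool implements them.

```python
from collections import defaultdict

# Data sources: an operational database and a file, both mocked as lists.
operational_db = [
    {"store": "paris", "amount": "100.5"},
    {"store": "Lyon", "amount": "20"},
]
csv_file = [{"store": "PARIS", "amount": "9.5"}]


def extract():
    """Back-end tier: copy raw rows from every source into a staging area."""
    return operational_db + csv_file


def transform(staging):
    """Back-end tier: clean the staged rows (consistent names and types)."""
    return [
        {"store": row["store"].title(), "amount": float(row["amount"])}
        for row in staging
    ]


def load(warehouse, rows):
    """Warehouse tier: append the integrated rows."""
    warehouse.extend(rows)


def olap_total_by_store(warehouse):
    """OLAP tier: aggregate a measure along one dimension."""
    totals = defaultdict(float)
    for row in warehouse:
        totals[row["store"]] += row["amount"]
    return dict(totals)


warehouse = []
load(warehouse, transform(extract()))
report = olap_total_by_store(warehouse)  # what a front-end tool would display
```

The cleaning step is what lets the rows spelled "paris" and "PARIS" in different sources end up in the same "Paris" total.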

Data pipelines

1. What is a data pipeline
2. ETL
3. Extraction
4. Cleaning
5. Transformation
6. Data Loading
7. ETL vs ELT
8. Batching, Streaming, Lambda

What is a data pipeline?
Data pipelines consist of three essential elements: source, processing steps, and destination.
● Source
○ Sources are where data comes from. Common sources include relational database management systems, CRMs, ERPs, social media management tools, and even IoT device sensors.
● Processing steps
○ In general, data is extracted from sources, manipulated and changed according to business needs, and then deposited at its destination.
○ Common types of processing, driven by engineering and business needs: transformation, augmentation, filtering, grouping, and aggregation.
● Destination
○ A destination is where the data arrives at the end of its processing, typically a data lake or data warehouse for analysis.
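
The three elements can be sketched with Python generators. The sensor readings, field names, and step functions below are made up for illustration; the processing steps shown are filtering, augmentation, and grouping with aggregation.

```python
from collections import defaultdict


def source():
    """Source: yields raw events (a stand-in for a database or sensor feed)."""
    yield {"device": "a", "temp_c": 20.0}
    yield {"device": "a", "temp_c": None}  # bad reading
    yield {"device": "b", "temp_c": 30.0}
    yield {"device": "a", "temp_c": 25.0}


def filter_valid(events):
    """Filtering: drop events without a reading."""
    return (e for e in events if e["temp_c"] is not None)


def augment(events):
    """Augmentation: add a derived Fahrenheit field to each event."""
    for e in events:
        yield {**e, "temp_f": e["temp_c"] * 9 / 5 + 32}


def average_by_device(events):
    """Grouping and aggregation: mean Fahrenheit temperature per device."""
    sums, counts = defaultdict(float), defaultdict(int)
    for e in events:
        sums[e["device"]] += e["temp_f"]
        counts[e["device"]] += 1
    return {device: sums[device] / counts[device] for device in sums}


# Destination: a dict standing in for a data lake or warehouse table.
destination = {}
destination.update(average_by_device(augment(filter_valid(source()))))
```

Because each step consumes and produces an iterable, steps can be added, removed, or reordered without changing the source or the destination.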
