SUMMARIZING DATA MODULE 01
Do you have the pre-requisite knowledge?
Relational Model
Basic SELECT
Functions in SQL
JOINing tables
We work with LOTS
of DIFFERENT systems…
From Small Businesses to Large Enterprises, multiple systems are in place
Examples: Applicants, Students, Human Resource, Enrollment, Libraries, AnimoSpace
Usage of INDEPENDENT systems
leads to ISLANDS of DATA …
with possibly different formats
For example…
How can these
ISLANDS OF DATA
be utilized for
BUSINESS INTELLIGENCE
for planning and
decision-making activities?
In this module, you’ll learn about…
• Online Analytical Processing (OLAP)
• Revisiting SQL
TOPIC ONLINE ANALYTICAL PROCESSING
What did your readings
tell you about…
• OLAP?
• Comparison with
Relational models?
• Data Warehouse?
• Analytics?
• Related terms?
https://www.cleverism.com/what-is-olap/
[Figure: Data, Analysis, Operations, Reports, Decision-Making]
Online Analytical Processing
(OLAP)
• Fast multidimensional analysis of large volumes of data for business intelligence and
decision support
• Extracts data from multiple relational datasets and reorganizes it into a
multidimensional format to enable fast processing and analysis
https://www.ibm.com/cloud/learn/olap
• A data discovery tool
• Enables users to perform multidimensional analysis of data from different
perspectives or points of view
https://www.cleverism.com/what-is-olap/
OLTP vs OLAP
OLTP
• Online transaction processing
• Large volume of short transaction operations (INSERT, UPDATE, DELETE) on the database
• Fast query processing using basic SELECT statements
• Maintains data integrity within a multi-user, multi-access environment
• Queries contain detailed and current data
OLAP
• Online analytical processing
• Low volume of very complex queries that involve data aggregation
• Generated analytical reports can aid in business intelligence and decision making
• Modern terms – data mining, data warehousing, data analytics, business intelligence
• Give some examples
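One possible example of the contrast, sketched in Python with the standard-library sqlite3 module. The `orders` table and its values are hypothetical and exist only for illustration; a real OLAP workload would normally run against a separate warehouse rather than the transactional table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: many short statements, each touching a single row.
conn.execute("INSERT INTO orders VALUES (1, 'North', 100.0)")
conn.execute("INSERT INTO orders VALUES (2, 'North', 50.0)")
conn.execute("INSERT INTO orders VALUES (3, 'South', 75.0)")
conn.execute("UPDATE orders SET amount = 80.0 WHERE order_id = 3")
conn.commit()

# OLTP-style read: a basic SELECT returning detailed, current data.
current_order = conn.execute(
    "SELECT region, amount FROM orders WHERE order_id = 3"
).fetchone()

# OLAP-style read: one query that aggregates over many rows.
sales_by_region = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()

print(current_order)
print(sales_by_region)
```

The first read answers "what is this order right now?", while the second answers "how are sales distributed?", which is the kind of question analytical reports are built on.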
Transactional DB vs Analytical DB
Operational Database | Analytical Database, e.g., Data Warehouse
Day-to-day transaction processing | Historical analytical processing
Used by operational users (clerks, DBAs, DB professionals) | Used by knowledge workers (analysts, managers, executives)
Used to run the business | Used to analyze the business
Narrow, planned and simple updates and queries | Broad, ad hoc, complex queries and analysis
Focuses on Data In (insert, modify, retrieve) | Focuses on Information Out (read only)
Based on Entity Relationship and Relational Models | Based on Star, Snowflake and Constellation Schema
Primitive and highly detailed, flat relational view of data | Summarized and consolidated, multidimensional view of data
DB size: 100MB to 100GB | DB size: 100GB to 100TB
Number of users: thousands | Number of users: hundreds
http://www.tutorialspoint.com/dwh/dwh_overview.htm
The OLAP Process
[Figure: application-oriented heterogeneous data is integrated into a unified view of data]
Image courtesy of: https://smartboost.com/blog/how-to-use-online-analytical-processing-olap-in-marketing/
Connolly & Begg, 2015
Data Warehouse
• Refers to a data repository that is maintained separately from an organization’s
operational databases
• A subject-oriented, integrated, time-variant, and nonvolatile collection of data in
support of management’s decision-making process
• Generalizes and consolidates data in multidimensional space
• Provides OLAP tools for business executives to systematically organize, understand,
and use their multidimensional data of varied granularities for generalization and
data mining in strategic decision-making activities
Han, Kamber & Pei, 2012
Dimensions of Data
• Business data have multiple dimensions
• Dimensions
• The entities with respect to which an enterprise preserves the records
• Example: Sales data dimensions
• Location – region, country, province, city, store
• Time – year, quarter, month, week, day
• Product – type (clothing, food, devices), brand, price
https://www.ibm.com/cloud/learn/olap
Dimensionality Modeling
• Logical design technique
• Aims to present the data in a standard, intuitive form that allows for
high-performance access
• Every dimensional model (DM) is composed of:
• One (1) Fact table with a composite primary key
• Two (2) or more Dimension tables, each with a simple primary key that references one of
the components of the composite key in the Fact table
Connolly & Begg, 2015
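A minimal sketch of this structure, assuming SQLite through Python's sqlite3 module. The table and column names (`product_dim`, `store_dim`, `sales_fact`) are hypothetical. Each dimension table has a simple primary key, and the fact table's composite primary key is built from references to those keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key INTEGER NOT NULL REFERENCES store_dim(store_key),
    units_sold INTEGER,
    PRIMARY KEY (product_key, store_key)  -- composite primary key
);
""")

conn.execute("INSERT INTO product_dim VALUES (1, 'Shirt')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Manila')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 10)")


def try_insert(row):
    """Return True if the fact row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((1, 1, 5))  # repeats an existing composite key
orphan_ok = try_insert((99, 1, 5))    # references a product that does not exist
print(duplicate_ok, orphan_ok)
```

Both inserts are rejected: the first by the composite primary key, the second by the reference to the dimension table.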
Dimensional Model
• Fact table
• Contains tuples of recorded factual data
• Facts
• Generated by events that occurred in the past
• Are unlikely to change, regardless of how they are analyzed
• Dimension table
• Contains tuples of attributes describing reference data
• Attributes are used as the constraints in data warehouse queries
• Dimensions
• The entities with respect to which an enterprise preserves the records (TutorialsPoint.com)
Connolly & Begg, 2015
Dimensional Model
• Star schema
• A logical structure that has a Fact table in the center, surrounded by
denormalized Dimension tables
• Can be used to speed up query performance by denormalizing reference information
into a single dimension table
• Excellent for ad hoc queries, but bad for OLTP
Connolly & Begg, 2015
Star Schema - Components
• Fact tables contain factual or quantitative data
• Dimension tables contain descriptions about the subjects of the business
• Dimension tables are denormalized to maximize performance
• 1:N relationship between dimension tables and fact tables
Hoffer, Ramesh & Topi, 2018
Star Schema - Example
Fact table provides statistics for
sales broken down by product,
period and store dimensions
Hoffer, Ramesh & Topi, 2018
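The original figure is not available here, so the following is only a sketch of a star schema with product, period and store dimensions, assuming SQLite through Python's sqlite3 module. All table names, column names and rows are hypothetical. The query joins the fact table to two of its dimensions and summarizes units sold by city and quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period (period_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store (store_key INTEGER PRIMARY KEY, city TEXT);
-- Fact table in the center: one row per product, period and store.
CREATE TABLE sales (
    product_key INTEGER REFERENCES product(product_key),
    period_key INTEGER REFERENCES period(period_key),
    store_key INTEGER REFERENCES store(store_key),
    units_sold INTEGER,
    PRIMARY KEY (product_key, period_key, store_key)
);
INSERT INTO product VALUES (1, 'Shirt'), (2, 'Phone');
INSERT INTO period VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO store VALUES (1, 'Manila'), (2, 'Cebu');
INSERT INTO sales VALUES (1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 5), (2, 2, 1, 7);
""")

# Dimension attributes act as the grouping and filtering constraints.
rows = conn.execute("""
    SELECT st.city, pe.quarter, SUM(sa.units_sold)
    FROM sales sa
    JOIN store st ON st.store_key = sa.store_key
    JOIN period pe ON pe.period_key = sa.period_key
    GROUP BY st.city, pe.quarter
    ORDER BY st.city, pe.quarter
""").fetchall()

for row in rows:
    print(row)
```

Each dimension is reached with a single join from the fact table, which is what makes this layout convenient for ad hoc queries.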
Dimensional Model
• Snowflake schema
• Variant of the star schema
• Some dimension tables are normalized to form a hierarchy
Connolly & Begg, 2015 https://www.sciencedirect.com/topics/computer-science/star-schema
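As an illustration, a product dimension could be normalized into a product table and a category table. This sketch uses SQLite through Python's sqlite3 module, and every name and value in it is hypothetical. Summarizing by category then takes one more join than it would in a star schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category level is split out of the product dimension (normalized).
CREATE TABLE category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES category(category_key)
);
CREATE TABLE sales (
    product_key INTEGER REFERENCES product(product_key),
    units_sold INTEGER
);
INSERT INTO category VALUES (1, 'Clothing'), (2, 'Devices');
INSERT INTO product VALUES (1, 'Shirt', 1), (2, 'Jacket', 1), (3, 'Phone', 2);
INSERT INTO sales VALUES (1, 10), (2, 4), (3, 6);
""")

# Fact -> product -> category: the hierarchy is walked with two joins.
units_by_category = conn.execute("""
    SELECT c.category_name, SUM(s.units_sold)
    FROM sales s
    JOIN product p ON p.product_key = s.product_key
    JOIN category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()

print(units_by_category)
```

The category name is stored once instead of being repeated on every product row, at the cost of the extra join.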
Surrogate Keys
• All natural keys are replaced with surrogate keys (non-intelligent and non-business
related)
• Every join between fact and dimension tables is based on surrogate keys, not
natural keys
• Business keys may change over time. Surrogate keys:
• Allow the data in the warehouse to have some independence from the data
used and produced by the OLTP systems
• Help keep track of non-key attribute values for a given production key
• Are simpler and shorter
• Can be same length and format for all keys
Connolly & Begg, 2015; Hoffer, Ramesh & Topi, 2018
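A small sketch of the idea, assuming SQLite through Python's sqlite3 module. The `customer_dim` table, the `add_version` helper and the sample values are hypothetical. The warehouse generates its own integer key, and the business key from the OLTP system is kept as an ordinary attribute, so two versions of the same customer can coexist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, generated by the warehouse
    customer_code TEXT,                -- business (production) key from the OLTP system
    city TEXT                          -- non-key attribute that may change over time
)""")


def add_version(customer_code, city):
    """Insert a new dimension row and return the surrogate key assigned to it."""
    cur = conn.execute(
        "INSERT INTO customer_dim (customer_code, city) VALUES (?, ?)",
        (customer_code, city),
    )
    return cur.lastrowid


first_key = add_version("C-001", "Manila")
second_key = add_version("C-001", "Cebu")  # same customer after moving

history = conn.execute(
    "SELECT customer_key, city FROM customer_dim "
    "WHERE customer_code = ? ORDER BY customer_key",
    ("C-001",),
).fetchall()
print(history)
```

Fact rows loaded before the move would keep pointing at the first surrogate key, so earlier sales stay associated with the earlier city.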
Granularity of the Fact Table
• What level of detail do you want?
• Transactional grain - finest level
• Aggregated grain - more summarized
• Finer grains
• Better market analysis capability
• More dimension tables, more rows in fact table
• In Web-based commerce, finest granularity is a click
Hoffer, Ramesh & Topi, 2018
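The trade-off can be sketched as follows, assuming SQLite through Python's sqlite3 module and a hypothetical `sales_txn` table. Rows at transactional grain are summarized to a monthly grain per product, which leaves fewer rows but no longer distinguishes individual sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Transactional grain: one row per individual sale (hypothetical data).
CREATE TABLE sales_txn (
    txn_id INTEGER PRIMARY KEY,
    sale_date TEXT,
    product TEXT,
    amount REAL
);
INSERT INTO sales_txn (sale_date, product, amount) VALUES
    ('2024-01-05', 'Shirt', 20.0),
    ('2024-01-05', 'Shirt', 20.0),
    ('2024-01-17', 'Shirt', 25.0),
    ('2024-02-02', 'Shirt', 20.0),
    ('2024-02-02', 'Phone', 300.0);
""")

txn_count = conn.execute("SELECT COUNT(*) FROM sales_txn").fetchone()[0]

# Aggregated grain: one row per month and product.
monthly = conn.execute("""
    SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount)
    FROM sales_txn
    GROUP BY month, product
    ORDER BY month, product
""").fetchall()

print(txn_count, len(monthly))
print(monthly)
```

The monthly rows can always be derived from the transactional ones, but not the other way around, which is why the grain has to be chosen before the fact table is loaded.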
Size of the Fact Table
• Depends on the number of dimensions and the grain of the fact table
Number of rows = product of the number of possible values for each dimension associated with the fact table
• Given the following values:
Total rows calculated as follows
(assuming only half of the total products have recorded sales for a given month):
Hoffer, Ramesh & Topi, 2018
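The slide's own figures are not reproduced here, so the sketch below applies the same rule to hypothetical values: 10 stores, 200 products and 12 months, with only half of the products recording sales in a given month. The function name and its parameters are assumptions made for illustration.

```python
from math import prod


def estimate_fact_rows(dimension_counts, active_fraction=None):
    """Estimate fact table rows as the product of the dimension value counts.

    dimension_counts maps a dimension name to its number of possible values.
    active_fraction optionally maps a dimension name to the share of its
    values that actually appear in the fact table (defaults to 1.0).
    """
    active_fraction = active_fraction or {}
    return round(
        prod(
            count * active_fraction.get(name, 1.0)
            for name, count in dimension_counts.items()
        )
    )


# Hypothetical values, not the ones from the original slide.
dims = {"store": 10, "product": 200, "month": 12}

upper_bound = estimate_fact_rows(dims)
half_products = estimate_fact_rows(dims, {"product": 0.5})
print(upper_bound, half_products)
```

With these numbers the upper bound is 10 × 200 × 12 = 24,000 rows, and 10 × 100 × 12 = 12,000 rows once only half of the products are counted.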
ETL
Non-volatile Data
• Data in the Warehouse
• Comes from multiple heterogeneous sources
• Not updated in real-time but is refreshed from operational systems on a regular basis
• New data is always added as a supplement to the database, rather than a
replacement
• Update-driven approach
• Integrated data is available for direct querying and analysis
Connolly & Begg, 2015
Update Driven Approach: When to Gather Data?
• Source Driven: Data sources transmit new information to warehouse, either continuously or periodically
• Destination Driven: Warehouse periodically requests new information from data sources
Keeping warehouse exactly synchronized with data sources is too expensive.
Silberschatz, Korth & Sudarshan, 2019
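A destination-driven refresh might look like the sketch below, assuming two in-memory SQLite databases standing in for an operational system and the warehouse. The table names and the `refresh` function are hypothetical. Each call pulls only the rows that are newer than what the warehouse already holds and appends them, rather than replacing existing data.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # stands in for an operational system
warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse
source.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
warehouse.execute(
    "CREATE TABLE orders_fact (order_id INTEGER PRIMARY KEY, amount REAL)"
)


def refresh():
    """Pull rows newer than the highest order_id already loaded and append them."""
    last_loaded = warehouse.execute(
        "SELECT COALESCE(MAX(order_id), 0) FROM orders_fact"
    ).fetchone()[0]
    new_rows = source.execute(
        "SELECT order_id, amount FROM orders WHERE order_id > ?", (last_loaded,)
    ).fetchall()
    warehouse.executemany("INSERT INTO orders_fact VALUES (?, ?)", new_rows)
    warehouse.commit()
    return len(new_rows)


source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
first = refresh()   # first periodic request from the warehouse

source.execute("INSERT INTO orders VALUES (3, 30.0)")
second = refresh()  # a later request picks up only the new row

total = warehouse.execute("SELECT COUNT(*) FROM orders_fact").fetchone()[0]
print(first, second, total)
```

Between two refreshes the warehouse lags behind the source, which is the accepted cost of not keeping the two exactly synchronized.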
Considerations in Building a DW
• The design of the DW should support ad-hoc querying
• Acquisition of data for the warehouse
• Data must be extracted from multiple heterogeneous sources
• Data must be formatted for consistency within the warehouse
• Data must be cleaned to ensure validity
• Data must be fitted into the data model of the warehouse
• Data must be loaded into the warehouse
• Ensures data storage meets the query requirements efficiently
• Gives full consideration to the environment in which the data resides
Elmasri & Navathe, 2016
ETL Manager
• Extraction
• Targets one or more internal data sources, e.g., OLTP databases, personal databases and spreadsheets, Enterprise Resource Planning (ERP) files, web usage log files
• May include external sources from suppliers and customers
• Transformation
• Applies a series of rules or functions to the extracted data to prepare them for analysis
• May involve data summations, data encoding, data merging, data splitting, data calculations, and creation of surrogate keys
• Loading
• Additional constraints defined in the database schema can be activated, e.g., uniqueness, referential integrity, and mandatory fields
Hands-on Activity 01
Connolly & Begg, 2015
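The three ETL steps can be sketched end to end in plain Python. This is not the tool used in the hands-on activity: the CSV text, the `enrollment` table and the function names are all hypothetical. Extraction reads a raw source, transformation cleans and re-encodes the rows, and loading relies on schema constraints (uniqueness and mandatory fields) plus a generated surrogate key.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; a real one would come from OLTP tables,
# spreadsheets, or log files.
RAW = """student_id,name,units
S1, ana ,3
S2,Ben,
S1, ana ,3
"""


def extract(text):
    """Extraction: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim and re-encode values, drop incomplete and duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        student_id = row["student_id"].strip()
        units = row["units"].strip()
        if not units or student_id in seen:
            continue  # missing mandatory field, or student already processed
        seen.add(student_id)
        cleaned.append((student_id, row["name"].strip().title(), int(units)))
    return cleaned


def load(conn, rows):
    """Loading: insert into a table whose constraints are active."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS enrollment (
            student_key INTEGER PRIMARY KEY,   -- surrogate key
            student_id TEXT NOT NULL UNIQUE,   -- uniqueness constraint
            name TEXT NOT NULL,                -- mandatory field
            units INTEGER NOT NULL             -- mandatory field
        )""")
    conn.executemany(
        "INSERT INTO enrollment (student_id, name, units) VALUES (?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
loaded = conn.execute(
    "SELECT student_key, student_id, name, units FROM enrollment"
).fetchall()
print(loaded)
```

Only one of the three raw rows survives: the second lacks a mandatory value and the third duplicates the first.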
We can now perform
ANALYSIS across
HETEROGENEOUS data sources
without disrupting
TRANSACTIONAL performance
Learning Activities
Take Ex 02: Advanced SQL Self-Assessment
Perform Hands-on Activity: H 01: ETL Tool
Take Ex 03: OLAP Self-Assessment
References
Chapter 32: Data Warehouse Design
Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation,
and Management, 6th Edition. Harlow, Essex: Addison-Wesley
Chapter 29: Overview of Data Warehousing and OLAP
Elmasri, R. & Navathe, S. (2016). Fundamentals of Database Systems, 7th Edition. Boston:
Pearson/Addison Wesley
Chapter 9: Data Warehousing
Hoffer, J., Ramesh, V. and Topi, H. (2018). Modern Database Management, 12th Edition. Upper Saddle River,
N.J.: Pearson/Prentice Hall
Chapter 11: Data Analytics
Silberschatz, A., Korth, H. & Sudarshan, S. (2019). Database System Concepts, 7th Edition. McGraw-Hill Book Co.
Chapter 04: Data Warehousing and OLAP
Han, J., Kamber, M. & Pei, J. (2012). Data Mining, 3rd Edition. The Morgan Kaufmann Series in Data Management
Systems, ScienceDirect (DLSU Institutional Access: https://www.sciencedirect.com/book/9780123814791/data-mining-concepts-and-techniques)
Online
www.tutorialspoint/dwh/index.htm
TutorialsPoint.com, Data Warehousing Tutorial
Miscellaneous References
• https://www.cleverism.com/what-is-olap/
• https://www.guru99.com/online-analytical-processing.html
• https://www.ibm.com/cloud/learn/olap
• https://www.stitchdata.com/resources/oltp-vs-olap/
• https://www.commbox.io/how-data-analysis-and-reports-can-improve-customer-service/