2024 Meeting 1 - Data Warehouse Fundamentals
Data Warehousing
Meeting Agenda
Introductory Notes
Contact details
Study material
Highly Recommended
Title: Data Warehousing Fundamentals for IT Professionals
Author: Paulraj Ponniah
Edition: 2
Publisher: John Wiley & Sons, 2011
ISBN: 1118211308, 9781118211304
Additional Recommended
Title: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling
Author: Ralph Kimball and Margy Ross
Publisher: John Wiley & Sons
ISBN: 0471200247 (ISBN-10), 978-0471200246 (ISBN-13)
Responsibilities / Contributions
YEAR LEVEL       LECTURER Contribution*   STUDENT Contribution*   ASSESSED on
First year       100%                     0%                      100%
Second Year      75%                      25%                     100%
Third Year       50%                      50%                     100%
Honours Level    25%                      75%                     100%
* Contribution refers to essential content delivery, self-study and the application thereof.
As may be seen, an honours-level student will be required to engage both in what is known and guided (lecturer contribution) and in investigation (student contribution).
Learning presumed to be in place
• Study guide
• ClickUP
• Departmental Brochure
• Honours Brochure
Meeting 1: Fundamentals
Meeting 2: Planning
Meeting 3
A very expensive exercise: measure twice, cut once.
Strategic Level
Tactical Level
Operational Level
Day-to-day running support
Overview: Data Warehousing
1. Basic concepts of data warehousing
2. Data warehouse architectures
3. Some characteristics of data warehouse data
4. Design aspect – Star Schemas
Motivation
Data Warehousing:
• The process of constructing and using a data warehouse.
Data Warehouse: Subject-Oriented
The time horizon for the data warehouse is significantly longer than that of operational systems.
• Operational database: current value data.
• Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years).
Data warehouse architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Active Warehouse
• Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL).
Figure: data flow through the data warehouse
Sources: External Data (unstructured; structured (scraped)) and Internal Data
EXTRACT
TRANSFORM: Transforming, Cleaning, Processing, Reconciling, Deriving, Matching
LOAD
DATA STORAGE: Metadata Storage, Data Warehouse Data Storage
FEED
USER PRESENTATION: End-User Presentation Tools, Ad Hoc Query Tools, Report Writers, Modelling and Mining, Visualization
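The extract, transform and load stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the source rows, the column names and the `sales` table are hypothetical, and SQLite from the standard library stands in for the warehouse data store.

```python
import sqlite3

# Hypothetical raw rows, standing in for an internal operational source.
RAW_ROWS = [
    {"store": " Pretoria ", "amount": "40.00", "qty": "2"},
    {"store": "pretoria", "amount": "20.00", "qty": "1"},
    {"store": "Durban", "amount": "", "qty": "3"},  # missing amount
]


def extract():
    """Extract: read rows from the source (here, an in-memory list)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: clean, reconcile and derive."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # cleaning: drop rows with no amount
        store = row["store"].strip().title()  # reconciling: one spelling per store
        amount = float(row["amount"])
        qty = int(row["qty"])
        cleaned.append((store, qty, amount, amount / qty))  # deriving: unit price
    return cleaned


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(store TEXT, qty INTEGER, amount REAL, unit_price REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT store, qty, amount, unit_price FROM sales ORDER BY rowid"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors the EXTRACT, TRANSFORM and LOAD boxes: each stage can be changed or tested without touching the others.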
Generic two-level architecture
A simplified architecture that divides the data warehouse environment into a client tier and a database tier (data tier).
Client Tier (Presentation Layer)
• Acts as the interface between the end-user and the data
warehouse, providing tools for querying, reporting, and data
analysis.
• Includes Business Intelligence (BI) tools, analytics
applications, and reporting tools.
Database Tier (Data Tier)
• The central repository where all the data is stored. It
includes the data warehouse database itself along with the
ETL (Extract, Transform, Load) processes.
• Handles data storage, management, and retrieval. It's where
data is cleansed, integrated, and stored from various source
systems.
• Utilizes Database Management Systems (DBMS) optimized
for large-scale data processing and complex queries.
The two-tier architecture consists of two layers: the Client Tier and the Database Tier (Data Tier).
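A minimal sketch of the two tiers, assuming SQLite as a stand-in for the data tier and a plain Python function as the client tier. The table name `sales`, its columns and the helper `report_total_by_region` are hypothetical names chosen for illustration.

```python
import sqlite3


def build_data_tier():
    """Data tier: stores the data and answers SQL queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("North", 100.0), ("South", 50.0), ("North", 25.0)],
    )
    return conn


def report_total_by_region(conn):
    """Client tier: a reporting tool that only issues queries and formats results."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    return {region: total for region, total in rows}


conn = build_data_tier()
report = report_total_by_region(conn)
print(report)
```

The client function holds no data of its own; everything it reports comes from a query against the data tier, which is the separation the two-level architecture describes.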
Independent Data Mart
• An Independent Data Mart is a stand-alone system designed for a specific business function or department, without relying on a centralized data warehouse.
• Focuses on meeting the specific, often immediate,
analytical needs of individual departments or business
units.
• A data mart filled with data extracted from the operational environment without the benefits of a data warehouse.
• An independent data mart does not get data from the
central or the main data warehouse.
• Therefore, an independent data mart does not have any
association with the main data warehouse or other data
marts.
• Storing and performing analytics on each data mart is a
separate task. Mostly, this data mart type is suitable for
small groups or sections within an organization.
A stand-alone system (created without the use of a data warehouse) whose focus is one subject area or business function.
Dependent data mart vs. operational data store
A Dependent Data Mart sources its data from a centralized data warehouse. It is designed to serve the specific needs of a particular business segment or department.
• Data Source: Directly integrated with the enterprise data warehouse, ensuring consistency and reliability of data.
• Purpose: Tailored to support decision-making in specific
business areas with a high level of data integrity and alignment
with the overall data strategy.
• Update Frequency: Data is refreshed based on the central
warehouse's update cycles, which can be scheduled or
triggered by specific events.
Dependent data marts draw data from a central data warehouse that has already been created.
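One way to picture a dependent data mart is as a table that is rebuilt from the central warehouse on each refresh. The sketch below assumes SQLite as the store; `warehouse_sales`, `mart_sales` and `refresh_dependent_mart` are hypothetical names, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Central warehouse table (hypothetical), already loaded and integrated.
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("Finance", "Audit", 300.0),
        ("Marketing", "Campaign", 120.0),
        ("Finance", "Payroll", 200.0),
    ],
)


def refresh_dependent_mart(conn, department):
    """Rebuild the department's mart from the warehouse, never from source systems."""
    conn.execute("DROP TABLE IF EXISTS mart_sales")
    conn.execute("CREATE TABLE mart_sales (product TEXT, amount REAL)")
    conn.execute(
        "INSERT INTO mart_sales "
        "SELECT product, amount FROM warehouse_sales WHERE department = ?",
        (department,),
    )
    conn.commit()


refresh_dependent_mart(conn, "Finance")
mart_rows = conn.execute(
    "SELECT product, amount FROM mart_sales ORDER BY product"
).fetchall()
print(mart_rows)
```

Because the mart is only ever filled from `warehouse_sales`, it inherits whatever cleaning and integration the warehouse has already applied, and it is refreshed on the warehouse's update cycle.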
Active data warehouse
An Active Data Warehouse (ADW) is a form of data warehousing that supports real-time data integration, analysis, and reporting, enabling immediate decision-making and action-taking.
Key Features:
• Real-Time Data Processing: Incorporates
data as soon as it becomes available,
allowing for up-to-the-minute analysis.
• Event-Driven Actions: Can trigger actions
or alerts based on specific data conditions
or business events.
• Highly Interactive: Supports complex, ad hoc queries and analyses with minimal latency.
Capture data continuously. Deliver real-time data. Provide a single integrated view across multiple business lines.
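The event-driven behaviour can be illustrated with a toy in-memory model: each incoming event is applied immediately and a business rule is checked straight away. The class `ActiveWarehouse`, the low-stock rule and the sample events are assumptions made for this sketch; a real active warehouse would sit on a streaming or change-data-capture feed.

```python
from dataclasses import dataclass, field


@dataclass
class ActiveWarehouse:
    """Toy in-memory store that evaluates a rule on every incoming event."""

    low_stock_threshold: int = 5  # hypothetical business rule
    stock: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def ingest(self, product, change):
        """Apply a stock change as soon as it arrives, then check the rule."""
        self.stock[product] = self.stock.get(product, 0) + change
        if self.stock[product] < self.low_stock_threshold:
            self.alerts.append(
                f"Low stock: {product} ({self.stock[product]} left)"
            )


warehouse = ActiveWarehouse()
for product, change in [("kettle", 10), ("kettle", -4), ("kettle", -3)]:
    warehouse.ingest(product, change)

print(warehouse.stock)
print(warehouse.alerts)
```

The alert is raised by the third event alone, without waiting for a scheduled batch load, which is the difference between an active warehouse and a periodically refreshed one.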
Characteristics: Data Warehouse vs Data Mart
Data Subjects
Data Warehouse:
• Encompasses a broad spectrum of subjects, aiming for a complete organizational overview.
• Understanding of business operations.
Data Mart:
• Targets specific subjects like sales or finance.
• Dimensional model approach to efficiently organize data around measurable events.
Size
Data Warehouse:
• Larger, reflecting its comprehensive scope.
• Scalability challenges and a need for robust infrastructure.
Data Mart:
• Smaller, designed for speed and agility.
• Quicker query responses and easier maintenance.
Users
Data Warehouse:
• Broad user base, from executives to analysts.
• Used in strategic decision-making across the enterprise.
Data Mart:
• Departmental users with specific, tactical needs.
• User-friendly, dimensional models for self-service BI.
Complexity
Data Warehouse:
• High, due to the integration of data across the organization.
• Sophisticated ETL processes required.
Data Mart:
• Lower, with a focus on simplicity and relevance to users.
• Use of dimensional models for ease of understanding.
Update Frequency
Data Warehouse:
• Regular updates to reflect current business conditions.
• Strategies for real-time data warehousing required.
Data Mart:
• May vary, but generally less frequent than a data warehouse.
• Designing for the refresh needs of the business area.
Design Approach
Data Warehouse:
• Top-down, emphasizing a structured, enterprise-wide view.
• Comprehensive design phase to align with business goals.
Data Mart:
• Bottom-up, starting with the most critical business needs and expanding over time.
Designing Data Warehouses
The Star Schema
• ……. is a simple database design in which dimensional data (describing how data are commonly aggregated) are separated from fact or event data.
• A star schema consists of two types of tables: fact tables and dimension tables.
• A fact table is the central table in a star schema of a data warehouse. It is designed to
store quantitative information for analysis and is typically surrounded by dimension
tables.
• Quantitative Metrics: Stores numerical measurements and metrics that businesses want to
analyse, such as sales amounts, quantities sold, or hours worked.
• Foreign Keys: Contains foreign keys that uniquely identify rows in dimension tables, establishing
relationships between facts and dimensions.
• Granularity: The level of detail represented by a row in the fact table, which could range from an
individual transaction to daily summaries.
• Time Variant: Fact table data is often associated with a specific point in time, making it possible
to track changes and trends over time.
• Large Volume: Typically contains a large number of rows due to the detailed level of tracking it
provides.
• Sparse Data: In some cases, especially with high granularity, fact tables may contain a lot of null
or zero values, known as sparsity.
Sample Fact Table: Sales Transactions
TransactionID  DateKey   ProductKey  StoreKey  EmployeeKey  QuantitySold  SalesAmount  DiscountAmount
1              20240101  101         10        500          2             40.00        5.00
2              20240101  102         11        501          1             20.00        2.50
3              20240102  103         10        502          3             60.00        0.00
4              20240102  101         12        500          2             40.00        4.00
5              20240103  104         11        503          1             30.00        3.00
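To make granularity concrete, the sample rows can be loaded into a table and rolled up from individual transactions to daily summaries. This is a sketch only: the table name `fact_sales` and the use of SQLite are assumptions, while the column names and values are taken from the sample fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fact_sales (
        TransactionID  INTEGER PRIMARY KEY,
        DateKey        INTEGER,
        ProductKey     INTEGER,
        StoreKey       INTEGER,
        EmployeeKey    INTEGER,
        QuantitySold   INTEGER,
        SalesAmount    REAL,
        DiscountAmount REAL
    )
    """
)

# Rows copied from the sample fact table (one row per transaction).
rows = [
    (1, 20240101, 101, 10, 500, 2, 40.00, 5.00),
    (2, 20240101, 102, 11, 501, 1, 20.00, 2.50),
    (3, 20240102, 103, 10, 502, 3, 60.00, 0.00),
    (4, 20240102, 101, 12, 500, 2, 40.00, 4.00),
    (5, 20240103, 104, 11, 503, 1, 30.00, 3.00),
]
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Roll the transaction-level grain up to a daily grain.
daily = conn.execute(
    """
    SELECT DateKey, SUM(QuantitySold), SUM(SalesAmount), SUM(DiscountAmount)
    FROM fact_sales
    GROUP BY DateKey
    ORDER BY DateKey
    """
).fetchall()
print(daily)
```

The five transaction rows collapse into three daily rows. A fact table stored at the daily grain would be smaller, but it could no longer answer questions about individual transactions.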
• Descriptive Attributes: Contains attributes that describe the business entities referenced in the fact table, such as
names, descriptions, and categories.
• Primary Key: Each row has a unique primary key that is used to link data back to the fact table.
• Hierarchies and Levels: Often includes hierarchies that allow data to be analysed at various levels of granularity,
such as region > country > city.
• Relatively Static: While they can change, dimension tables are updated less frequently than fact tables. Changes are often managed through slowly changing dimension (SCD) techniques.
• Smaller Size: Compared to fact tables, dimension tables are usually smaller since they contain less granular,
more descriptive data.
• Supports Readability: The structure and data in dimension tables are designed to make the data warehouse
user-friendly for analysts and decision-makers.
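As an illustration of a slowly changing dimension technique, the sketch below shows the commonly used Type 2 approach, where a change closes the current row and adds a new version so that history is preserved. The column names (`surrogate_key`, `valid_from`, `valid_to`, `is_current`), the helper `apply_scd2` and the sample product are assumptions for this example, not part of the course material.

```python
from datetime import date

# Hypothetical product dimension; surrogate_key is the dimension's primary key.
dim_product = [
    {
        "surrogate_key": 1,
        "product_id": 101,
        "category": "Hardware",
        "valid_from": date(2023, 1, 1),
        "valid_to": None,
        "is_current": True,
    },
]


def apply_scd2(dim, product_id, new_category, change_date):
    """Type 2 change: close the current row and append a new version."""
    current = next(
        row for row in dim if row["product_id"] == product_id and row["is_current"]
    )
    if current["category"] == new_category:
        return  # nothing changed, keep the current row open
    current["valid_to"] = change_date
    current["is_current"] = False
    dim.append(
        {
            "surrogate_key": max(row["surrogate_key"] for row in dim) + 1,
            "product_id": product_id,
            "category": new_category,
            "valid_from": change_date,
            "valid_to": None,
            "is_current": True,
        }
    )


apply_scd2(dim_product, 101, "Appliances", date(2024, 1, 1))
for row in dim_product:
    print(row)
```

Fact rows recorded before the change keep pointing at surrogate key 1, so older sales are still reported under the category that applied at the time.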
Sample Dimension Table: Product Information
ProductKey: A unique identifier for each product. This key is used to link the product information to the sales transactions in the fact table.
ProductName: The name of the product.
Category: The category to which the product belongs, such as Electronics, Home Goods, Hardware, or Appliances.
Price: The standard price of the product. Note that actual sales prices, after discounts, are recorded in the fact table.
SupplierName: The name of the supplier or manufacturer of the product.
SupplierRegion: The geographical region where the supplier is located.
Discontinued: Indicates whether the product is still available for sale or has been discontinued.
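The link between ProductKey in the dimension table and the fact table can be sketched as a join that rolls sales up by category. The keys and sales amounts follow the sample fact table; the product names and their category assignments are invented for illustration, and the table names `dim_product` and `fact_sales` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_product ("
    "ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT)"
)
conn.execute(
    "CREATE TABLE fact_sales ("
    "TransactionID INTEGER PRIMARY KEY, "
    "ProductKey INTEGER REFERENCES dim_product(ProductKey), "
    "SalesAmount REAL)"
)

# Product names and categories are hypothetical.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [
        (101, "Desk Lamp", "Home Goods"),
        (102, "USB Cable", "Electronics"),
        (103, "Hammer", "Hardware"),
        (104, "Toaster", "Appliances"),
    ],
)
# Keys and amounts follow the sample fact table.
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 101, 40.00), (2, 102, 20.00), (3, 103, 60.00), (4, 101, 40.00), (5, 104, 30.00)],
)

# Join each fact row to its product, then aggregate by a descriptive attribute.
by_category = conn.execute(
    """
    SELECT p.Category, SUM(f.SalesAmount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.ProductKey = f.ProductKey
    GROUP BY p.Category
    ORDER BY p.Category
    """
).fetchall()
print(by_category)
```

The fact table carries only keys and measures; the readable labels used for grouping come from the dimension table.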
Figure: star schema
• Fact tables contain factual or quantitative data.
• 1:N relationship between dimension tables and fact tables.
• Dimension tables are denormalized to maximize performance.
Time dimension: time_key, day, day_of_the_week, month, quarter, year
Item dimension: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Star schema example
The fact table provides statistics for sales broken down by product, period and store dimensions.
Star schema with sample data
Snowflake schema
Snowflake schema is an expanded version of a star schema
in which dimension tables are normalized into several related
tables.
Advantages
• Small saving in storage space
• Normalized structures are easier to update and maintain
Disadvantages
Figure: snowflake schema
Sales Fact Table (first panel): Item_id, Store_id, Sales_dollars, Sales_units
Sales Fact Table (second panel): time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_street, country
Other surviving labels: year (time dimension), supplier_key (item dimension)
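The normalization step that turns a star dimension into a snowflake can be sketched by splitting a location dimension into `location` and `city` tables, following the key names in the figure. The sample addresses and the helper table `location_star` are hypothetical, and SQLite is used only as a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star form: one denormalized location dimension.
    CREATE TABLE location_star (
        location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT
    );
    INSERT INTO location_star VALUES
        (1, '1 Main Rd', 'Pretoria', 'South Africa'),
        (2, '9 Church St', 'Pretoria', 'South Africa'),
        (3, '5 Beach Rd', 'Durban', 'South Africa');

    -- Snowflake form: city attributes move to their own table.
    CREATE TABLE city (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE location (
        location_key INTEGER PRIMARY KEY,
        street TEXT,
        city_key INTEGER REFERENCES city(city_key)
    );
    INSERT INTO city (city, country)
        SELECT DISTINCT city, country FROM location_star ORDER BY city;
    INSERT INTO location
        SELECT s.location_key, s.street, c.city_key
        FROM location_star AS s
        JOIN city AS c ON c.city = s.city AND c.country = s.country;
    """
)

city_count = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]

# Rebuilding the flat view now needs one extra join.
rebuilt = conn.execute(
    """
    SELECT l.location_key, l.street, c.city, c.country
    FROM location AS l
    JOIN city AS c ON c.city_key = l.city_key
    ORDER BY l.location_key
    """
).fetchall()
original = conn.execute(
    "SELECT location_key, street, city, country FROM location_star ORDER BY location_key"
).fetchall()
print(city_count, rebuilt == original)
```

Each city is now stored once, which is where the small storage saving and the easier updates come from. The price, in this sketch, is the additional join needed whenever a query wants city or country alongside a location.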
TIME_DIMENSION
time_key  day         day_of_the_week  month  quarter  year
1         2023-01-01  Sunday           Jan    1        2023
2         2023-01-02  Monday           Jan    1        2023
3         2023-01-03  Tuesday          Jan    1        2023
4         2023-01-04  Wednesday        Jan    1        2023
5         2023-01-05  Thursday         Jan    1        2023
Figure: Coffee Sales Over Time
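Rows of a time dimension like the one above are usually generated rather than typed in. The sketch below derives every attribute from the calendar date; the function name `build_time_dimension` and the choice of a sequential `time_key` are assumptions for illustration.

```python
from datetime import date, timedelta

# Fixed English names, so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_time_dimension(start, days):
    """Return one dimension row per calendar day, starting at `start`."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append(
            {
                "time_key": offset + 1,
                "day": d.isoformat(),
                "day_of_the_week": DAY_NAMES[d.weekday()],  # weekday(): Monday is 0
                "month": MONTH_ABBR[d.month - 1],
                "quarter": (d.month - 1) // 3 + 1,
                "year": d.year,
            }
        )
    return rows


time_dimension = build_time_dimension(date(2023, 1, 1), 5)
for row in time_dimension:
    print(row)
```

Storing day_of_the_week, month, quarter and year as plain columns is what lets a query group sales by any of these levels without date arithmetic.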
Fact constellation
• Fact tables can vary in granularity, meaning some may capture very detailed data while others summarize data at a higher level.
• Shared Dimension Tables: Fact tables in a constellation schema often share common dimension tables. For example, both sales and inventory fact tables might link to the same Time, Product, and Store dimensions. This shared dimensionality ensures consistency across different areas of analysis.
• Support for Complex Queries: By providing a more nuanced structure that reflects different aspects of the business, a fact constellation schema enables complex queries and analyses. Analysts can cross-reference data from different fact tables through their common dimensions.
• Efficient Data Analysis: The schema is designed to optimize data analysis and reporting by organizing data in a way that aligns closely with the analytical needs of the business. It supports efficient data retrieval for a wide range of queries.
• Scalability and Flexibility: Fact constellations offer a scalable and flexible approach to data warehouse design. New fact tables can be added to the schema as business requirements evolve, without disrupting existing structures.
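Cross-referencing two fact tables through a shared dimension can be sketched as follows. The tables `fact_sales`, `fact_inventory` and `dim_product`, together with their rows, are hypothetical; the point is only that both fact tables carry the same `product_key`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
    CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);

    INSERT INTO dim_product VALUES (1, 'Espresso'), (2, 'Latte');
    INSERT INTO fact_sales VALUES (1, 5), (1, 3), (2, 4);
    INSERT INTO fact_inventory VALUES (1, 20), (2, 2);
    """
)

# Aggregate each fact table to the shared grain first, then join on the dimension.
rows = conn.execute(
    """
    SELECT p.product_name, s.sold, i.on_hand
    FROM dim_product AS p
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) AS s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
          FROM fact_inventory GROUP BY product_key) AS i
      ON i.product_key = p.product_key
    ORDER BY p.product_name
    """
).fetchall()
print(rows)
```

Each fact table is summarized separately before the join. Joining the raw fact tables directly to each other would pair every sales row with every inventory row for the same product and inflate the totals.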
Extraction, Transformation, Loading
So what did we cover today?
Agenda - Next meeting
2024 - Meeting 2