2024 Meeting 1 - Data Warehouse Fundamentals

The document discusses data warehousing fundamentals including definitions, motivations, and architectures. It covers topics like basic concepts, characteristics of data warehouse data, and design aspects focusing on star schemas. It also outlines an agenda for future meetings on related topics such as planning, profiling, delivery and trends.

Uploaded by

alexisterblanche

INF 485 | 785

Data Warehousing

2024 Meeting 1
Data Warehouse Fundamentals
Meeting Agenda

Introductory Notes

Data Warehouse Fundamentals

Data Warehouse Architectures

Designing Data Warehouses

Agenda - Next meeting


Introductory Notes
Dr. JP van Deventer [email protected]

Prof. O Daramola [email protected]

Contact details
Study material
Highly Recommended
Title Data Warehousing Fundamentals for IT Professionals
Author Paulraj Ponniah
Edition 2
Publisher John Wiley & Sons, 2011
ISBN 1118211308, 9781118211304

Additional Recommended
Title The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling
Author Ralph Kimball and Margy Ross
Publisher John Wiley & Sons
ISBN ISBN-10: 0471200247, ISBN-13: 978-0471200246
Responsibilities / Contributions
YEAR LEVEL      LECTURER Contribution*   STUDENT Contribution*   ASSESSED on
First year      100%                     0%                      100%
Second year     75%                      25%                     100%
Third year      50%                      50%                     100%
Honours level   25%                      75%                     100%
* Contribution refers to essential content delivery, self-study and the application thereof.
As may be seen, an honours level student will be required to engage in what is known and guided (lecturer contribution) investigation (student contribution).
Learning presumed to be in place

• Fundamental knowledge of database management systems (DBMS) or equivalent


• Knowledge of Entity Relationship Modelling by means of Crow’s Foot or Object-oriented
notation is essential.
• Knowledge of data normalization and denormalization is essential.
• Knowledge of how database design influences data relationships and patterns in the data
schema is essential. This is critical to ensure that data discovery for DW is applied
appropriately.
• The aforementioned will not be repeated in class. Without the aforementioned
knowledge a student will struggle to keep up, especially when we start modelling /
designing data warehouses.
Important

All the important information you need will be in the following:

• Study guide
• ClickUP
• Departmental Brochure
• Honours Brochure
Roadmap
Meeting 1: Fundamentals
Meeting 2: Planning. A very expensive exercise: measure twice, cut once.
Meeting 3: Profiling and a little bit of data mining fundamentals in SQL
Meeting 4: Designing. Converting existing transactional databases in preparation for ETL and staging.
Meeting 5: ETL, SQL, data transformation
Meeting 6: Delivery
Meeting 7: Data. Data is data, no matter how big, no matter how small.
Meeting 8: Future Trends
Data Warehouse Fundamentals
CONTEXT

Strategic Level: Long-term, strategic decisions made by managers

Tactical Level: Mid-term decisions made by middle managers

Operational Level: Day-to-day running support
Overview: Data Warehousing
1. Basic concepts of data warehousing
2. Data warehouse architectures
3. Some characteristics of data warehouse data
4. Design aspect – Star Schemas
Motivation

“Modern organization is drowning in data but starving for information”.

• Operational processing (transaction processing) captures, stores and manipulates data to support daily operations.
• Information processing is the analysis of data or other forms of information to support decision making.
• A data warehouse can consolidate and integrate information from many internal and external sources and arrange it in a meaningful format for making business decisions.
Definition
Data Warehouse: (W.H. Inmon)
• A subject-oriented, integrated, time-variant, non-updatable (non-volatile) collection
of data used in support of management decision-making processes.
• Subject-oriented: e.g. customers, patients, students, products.
• Integrated: Consistent naming conventions, formats, encoding structures; from
multiple data sources.
• Time-variant: Can study trends and changes.
• Non-updatable: Read-only, periodically refreshed.

Data Warehousing:
• The process of constructing and using a data warehouse.
Data Warehouse: Subject-Oriented

• Organized around major subjects, such as customer, product, sales.
• Focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing.
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated

Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
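
The hotel price example can be made concrete with a minimal sketch. Everything specific here is an assumption made for illustration: the two source layouts, the field names, the fixed exchange rate and the tax rate are hypothetical. The point is only that each source is converted to one agreed representation before it is stored.

```python
from dataclasses import dataclass

# Hypothetical fixed rate; a real warehouse would look this up per date.
ZAR_PER_USD = 18.0


@dataclass(frozen=True)
class HotelPrice:
    """Agreed warehouse representation: ZAR, tax included, breakfast flagged."""
    hotel: str
    price_zar_incl_tax: float
    breakfast_included: bool


def from_source_a(row: dict) -> HotelPrice:
    # Source A (assumed): price in USD, tax excluded, breakfast coded "Y"/"N".
    gross_usd = row["price_usd"] * (1 + row["tax_rate"])
    return HotelPrice(row["name"], round(gross_usd * ZAR_PER_USD, 2), row["bfast"] == "Y")


def from_source_b(row: dict) -> HotelPrice:
    # Source B (assumed): price already in ZAR with tax, breakfast as a boolean.
    return HotelPrice(row["hotel_name"], round(row["rate_zar"], 2), bool(row["breakfast"]))


warehouse_rows = [
    from_source_a({"name": "Hotel One", "price_usd": 100.0, "tax_rate": 0.15, "bfast": "Y"}),
    from_source_b({"hotel_name": "Hotel Two", "rate_zar": 1500.0, "breakfast": False}),
]
```

After conversion both rows share the same currency, tax treatment and breakfast encoding, so they can be compared directly.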


Data Warehouse: Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain “time element”.
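
A small sketch of the difference, using Python's built-in sqlite3 module. The table names, columns and dates are hypothetical. The operational table keeps only the current value, while the warehouse table carries a date in its key and therefore keeps history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Operational table: one row per customer, current value only.
    CREATE TABLE op_customer (customer_id INTEGER PRIMARY KEY, city TEXT);
    -- Warehouse table: the key includes a time element, so history is kept.
    CREATE TABLE dw_customer (
        customer_id   INTEGER,
        snapshot_date TEXT,
        city          TEXT,
        PRIMARY KEY (customer_id, snapshot_date)
    );
""")


def snapshot(day: str) -> None:
    """Copy the current operational state into the warehouse under a date."""
    conn.execute(
        "INSERT INTO dw_customer SELECT customer_id, ?, city FROM op_customer",
        (day,),
    )


conn.execute("INSERT INTO op_customer VALUES (1, 'Pretoria')")
snapshot("2023-12-31")
# The operational update overwrites the old value...
conn.execute("UPDATE op_customer SET city = 'Cape Town' WHERE customer_id = 1")
snapshot("2024-12-31")

current = conn.execute("SELECT city FROM op_customer").fetchall()
# ...but the warehouse still holds both points in time.
history = conn.execute(
    "SELECT snapshot_date, city FROM dw_customer ORDER BY snapshot_date"
).fetchall()
```

Here `current` holds one row, while `history` holds one row per snapshot date, which is what makes trend analysis possible.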
Data Warehouse: Non-updatable (Non-volatile)

A physically separate store of data transformed from the operational environment.

Operational update of data does not occur in the data warehouse environment.
• Does not require transaction processing, recovery, and concurrency control mechanisms.
• Requires only two operations in data accessing: the initial loading of data and access of data.
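
The two operations can be sketched as follows. This is only an illustration: SQLite's `query_only` pragma is used here as a stand-in for the permissions or read-only replicas a real warehouse would rely on, and the table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operation 1: the initial (or periodic) load.
conn.executescript("""
    CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO dw_sales VALUES (1, 40.0), (2, 20.0);
""")

# After loading, refuse further changes on this connection.
conn.execute("PRAGMA query_only = ON")

# Operation 2: access.
total = conn.execute("SELECT SUM(amount) FROM dw_sales").fetchone()[0]

# An operational-style update is rejected rather than applied.
try:
    conn.execute("UPDATE dw_sales SET amount = 0 WHERE sale_id = 1")
    update_rejected = False
except sqlite3.DatabaseError:
    update_rejected = True
```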
Need for Data Warehousing

Integrated, company-wide view of high-quality information (from disparate databases)

Separation of operational and informational systems and data (for improved performance)

            OPERATIONAL Systems                                         INFORMATIONAL Systems
Purpose     Run the business on a current basis                         Support managerial decision making
Data        Current representation of state of the business             Historical point-in-time (snapshots) and predictions
Main users  Clerks, salespersons, administrators                        Managers, business analysts, customers
Scope       Narrow, planned, and simple updates and queries             Broad, ad hoc, complex queries and analysis
Goal        Performance throughput, availability                        Ease of flexible access and use
Volume      Many, constant updates. Queries on one or a few table rows  Periodic batch updates. Queries on many or all rows
Need to separate operational and informational systems

Three primary factors:
• A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them available for decision support (DS).
• A well-designed data warehouse adds value to data by improving their quality and consistency.
• A separate data warehouse eliminates much of the contention for resources that results when information applications are mixed with operational processing.
Data Warehouse Architectures
Data Architecture Reference
DATA SOURCE: Internal Data; External Data (Structured); External Data (Unstructured, scraped)
EXTRACT
DATA STAGING: Processing
TRANSFORM: Cleaning, Reconciling, Deriving, Matching, Transforming
LOAD
DATA STORAGE: Metadata Storage; Data Warehouse Data Storage
FEED
USER PRESENTATION: End-User Presentation Tools; Ad Hoc Query Tools; Report Writers; Modelling and Mining; Visualization

Data Warehouse Architectures
All involve some form of extraction, transformation and loading (ETL)
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Active Warehouse
• Three-Layer architecture
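
Since every architecture listed involves some form of ETL, a minimal sketch of the three steps may help. The source rows, column names and the "large sale" threshold are hypothetical, and the in-memory SQLite table merely stands in for warehouse storage.

```python
import sqlite3


def extract() -> list[dict]:
    # Stand-in for reading from internal and external sources.
    return [
        {"id": "1", "customer": " alice ", "amount": "100.50"},
        {"id": "2", "customer": "BOB", "amount": "n/a"},       # unusable amount
        {"id": "1", "customer": "Alice", "amount": "100.50"},  # duplicate id
    ]


def transform(rows: list[dict]) -> list[tuple]:
    cleaned, seen = [], set()
    for row in rows:
        try:
            amount = float(row["amount"])        # cleaning: reject unparsable values
        except ValueError:
            continue
        key = int(row["id"])
        if key in seen:                          # matching: drop duplicate keys
            continue
        seen.add(key)
        name = row["customer"].strip().title()   # reconciling: one naming format
        cleaned.append((key, name, amount, amount >= 100))  # deriving: a flag
    return cleaned


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL, is_large INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT id, customer, amount, is_large FROM sales").fetchall()
```

Of the three extracted rows only one survives: the unparsable amount is rejected during cleaning and the repeated id is dropped during matching.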
Generic two-level architecture
A simplified architecture that divides the data warehouse environment into a client tier and a database tier (data tier).
Client Tier (Presentation Layer)
• Acts as the interface between the end-user and the data warehouse, providing tools for querying, reporting, and data analysis.
• Includes Business Intelligence (BI) tools, analytics applications, and reporting tools.
Database Tier (Data Tier)
• The central repository where all the data is stored. It includes the data warehouse database itself along with the ETL (Extract, Transform, Load) processes.
• Handles data storage, management, and retrieval. It's where data is cleansed, integrated, and stored from various source systems.
• Utilizes Database Management Systems (DBMS) optimized for large-scale data processing and complex queries.

Two-tier architecture consists of two layers: Client Tier and Database Tier (Data Tier).
Independent Data Mart
• An Independent Data Mart is a stand-alone system designed for a specific business function or department, without relying on a centralized data warehouse.
• Focuses on meeting the specific, often immediate, analytical needs of individual departments or business units.
• A data mart filled with data extracted from the operational environment without the benefits of a data warehouse.
• An independent data mart does not get data from the central or the main data warehouse.
• Therefore, an independent data mart does not have any association with the main data warehouse or other data marts.
• Storing and performing analytics on each data mart is a separate task. Mostly, this data mart type is suitable for small groups or sections within an organization.

Stand-alone system (created without the use of a data warehouse); the focus is one subject area or business function.
Dependent data mart -vs- Operational data store
A Dependent Data Mart sources its data from a centralized data warehouse. It is designed to serve the specific needs of a particular business segment or department.
• Data Source: Directly integrated with the enterprise data warehouse, ensuring consistency and reliability of data.
• Purpose: Tailored to support decision-making in specific business areas with a high level of data integrity and alignment with the overall data strategy.
• Update Frequency: Data is refreshed based on the central warehouse's update cycles, which can be scheduled or triggered by specific events.

An Operational Data Store (ODS) is a centralized database that aggregates data from multiple sources for operational reporting and near real-time analyses.
• Data Source: Collects data from various transactional systems, providing a consolidated view for operational needs.
• Purpose: Designed to support operational processes and real-time decision-making with up-to-date data.
• Update Frequency: Highly frequent updates, often in real-time or near real-time, to reflect the latest operational data.

Dependent data marts draw data from a central data warehouse that has already been created.
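
One simple way to picture a dependent data mart is as a departmental subset defined on top of the warehouse. The sketch below uses an SQLite view for this; the table, departments and amounts are hypothetical, and in practice a mart is often a separately stored and refreshed copy rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central warehouse table (already cleaned and integrated).
    CREATE TABLE dw_sales (department TEXT, product TEXT, amount REAL);
    INSERT INTO dw_sales VALUES
        ('Finance',   'Audit',    900.0),
        ('Marketing', 'Campaign', 250.0),
        ('Marketing', 'Survey',   100.0);

    -- Dependent mart: sourced from the warehouse, never from source systems.
    CREATE VIEW mart_marketing AS
        SELECT product, amount FROM dw_sales WHERE department = 'Marketing';
""")

mart_rows = conn.execute(
    "SELECT product, amount FROM mart_marketing ORDER BY product"
).fetchall()
```

Because the mart only ever reads from `dw_sales`, it inherits whatever consistency the warehouse already guarantees.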
Active data warehouse
An Active Data Warehouse (ADW) is a form of data warehousing that supports real-time data integration, analysis, and reporting, enabling immediate decision-making and action-taking.

Key Features:
• Real-Time Data Processing: Incorporates data as soon as it becomes available, allowing for up-to-the-minute analysis.
• Event-Driven Actions: Can trigger actions or alerts based on specific data conditions or business events.
• Highly Interactive: Supports complex, ad-hoc queries and analyses with minimal latency.

Capture data continuously. Deliver real time data. Single integrated view across multiple business lines.
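
The event-driven idea can be illustrated with a toy in-memory store. The class `ActiveStore`, its methods and the stock threshold are all hypothetical names invented for this sketch; a real active warehouse would rely on streaming ingestion and database-side triggers or rules.

```python
from typing import Callable


class ActiveStore:
    """Toy store that evaluates alert rules as each record arrives."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.rules: list[tuple[Callable[[dict], bool], Callable[[dict], None]]] = []

    def on(self, condition: Callable[[dict], bool], action: Callable[[dict], None]) -> None:
        """Register an action to run whenever a new row satisfies the condition."""
        self.rules.append((condition, action))

    def ingest(self, row: dict) -> None:
        self.rows.append(row)          # data is queryable as soon as it arrives
        for condition, action in self.rules:
            if condition(row):
                action(row)            # event-driven action or alert


alerts: list[str] = []
store = ActiveStore()
store.on(lambda r: r["stock"] < 10, lambda r: alerts.append(f"reorder {r['item']}"))

store.ingest({"item": "coffee", "stock": 50})
store.ingest({"item": "tea", "stock": 4})
```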
Data Warehouse -vs- Data Mart

Scope
Data Warehouse:
• Enterprise-wide, serving as a centralized repository for all organizational data.
• Role in integrating data across the organization.
Data Mart:
• Focused on specific business areas.
• Role in serving departmental needs efficiently.

Data Subjects
Data Warehouse:
• Encompasses a broad spectrum of subjects, aiming for a complete organizational overview.
• Understanding of business operations.
Data Mart:
• Targets specific subjects like sales or finance.
• Dimensional model approach to efficiently organize data around measurable events.

Data Sources
Data Warehouse:
• Integrates diverse sources, including external data, to provide a comprehensive data landscape.
• Achieving a unified organizational view.
Data Mart:
• Primarily internal sources.
• Can also integrate external data as long as it's relevant to the specific business function.

Size
Data Warehouse:
• Larger, reflecting its comprehensive scope.
• Scalability challenges and a need for robust infrastructure.
Data Mart:
• Smaller, designed for speed and agility.
• Quicker query responses and easier maintenance.

Users
Data Warehouse:
• Broad user base, from executives to analysts.
• Used in strategic decision-making across the enterprise.
Data Mart:
• Departmental users with specific, tactical needs.
• User-friendly, dimensional models for self-service BI.

Complexity
Data Warehouse:
• High, due to the integration of data across the organization.
• Sophisticated ETL processes required.
Data Mart:
• Lower, with a focus on simplicity and relevance to users.
• Use of dimensional models for ease of understanding.

Purpose
Data Warehouse:
• Strategic decision-making.
• Long-term planning and organizational alignment.
Data Mart:
• Tactical decision-making.
• Effectiveness in addressing immediate business questions.

Update Frequency
Data Warehouse:
• Regular updates to reflect current business conditions.
• Strategies for real-time data warehousing required.
Data Mart:
• May vary, but generally less frequent than a data warehouse.
• Designing for the refresh needs of the business area.

Design Approach
Data Warehouse:
• Top-down, emphasizing a structured, enterprise-wide view.
• Comprehensive design phase to align with business goals.
Data Mart:
• Bottom-up: starting with the most critical business needs and expanding over time.
Designing Data Warehouses
The Star Schema

• A star schema is a simple database design in which dimensional data (describing how data are commonly aggregated) are separated from fact or event data.
• A star schema consists of two types of tables: fact tables and dimension tables.
• A fact table is the central table in a star schema of a data warehouse. It is designed to store quantitative information for analysis and is typically surrounded by dimension tables.
• Dimension tables store the context (qualitative information) necessary to understand the facts recorded in the fact table. They describe the "who, what, where, when, and how" associated with the facts.
Properties of Data in a Fact Table

• Quantitative Metrics: Stores numerical measurements and metrics that businesses want to
analyse, such as sales amounts, quantities sold, or hours worked.
• Foreign Keys: Contains foreign keys that uniquely identify rows in dimension tables, establishing
relationships between facts and dimensions.
• Granularity: The level of detail represented by a row in the fact table, which could range from an
individual transaction to daily summaries.
• Time Variant: Fact table data is often associated with a specific point in time, making it possible
to track changes and trends over time.
• Large Volume: Typically contains a large number of rows due to the detailed level of tracking it
provides.
• Sparse Data: In some cases, especially with high granularity, fact tables may contain a lot of null
or zero values, known as sparsity.
Sample Fact Table: Sales Transactions
TransactionID  DateKey   ProductKey  StoreKey  EmployeeKey  QuantitySold  SalesAmount  DiscountAmount
1              20240101  101         10        500          2             40.00        5.00
2              20240101  102         11        501          1             20.00        2.50
3              20240102  103         10        502          3             60.00        0.00
4              20240102  101         12        500          2             40.00        4.00
5              20240103  104         11        503          1             30.00        3.00

TransactionID: A unique identifier for each sales transaction.
DateKey: A reference to the Date dimension table, indicating when the transaction occurred.
ProductKey: A reference to the Product dimension table, identifying the product sold.
StoreKey: A reference to the Store dimension table, indicating where the sale took place.
EmployeeKey: A reference to the Employee dimension table, identifying the employee who made the sale.
QuantitySold: The number of items sold in the transaction.
SalesAmount: The total amount of money generated from the sale, before discounts.
DiscountAmount: The amount of discount applied to the sales transaction.
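
As an illustration of granularity, the sketch below loads the five sample rows into an in-memory SQLite table and rolls them up from individual transactions to daily summaries. The table name `sales_fact` is an assumption; the column names and values are taken from the sample above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        TransactionID INTEGER PRIMARY KEY, DateKey INTEGER, ProductKey INTEGER,
        StoreKey INTEGER, EmployeeKey INTEGER, QuantitySold INTEGER,
        SalesAmount REAL, DiscountAmount REAL)
""")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 20240101, 101, 10, 500, 2, 40.00, 5.00),
        (2, 20240101, 102, 11, 501, 1, 20.00, 2.50),
        (3, 20240102, 103, 10, 502, 3, 60.00, 0.00),
        (4, 20240102, 101, 12, 500, 2, 40.00, 4.00),
        (5, 20240103, 104, 11, 503, 1, 30.00, 3.00),
    ],
)

# Roll up from transaction grain to daily grain: units and net sales per day.
daily = conn.execute("""
    SELECT DateKey, SUM(QuantitySold), SUM(SalesAmount - DiscountAmount)
    FROM sales_fact
    GROUP BY DateKey
    ORDER BY DateKey
""").fetchall()
```

The roll-up only works in this direction: a fact table stored at daily grain could not be broken back down into individual transactions.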
Properties of Data in a Dimension Table

• Descriptive Attributes: Contains attributes that describe the business entities referenced in the fact table, such as
names, descriptions, and categories.
• Primary Key: Each row has a unique primary key that is used to link data back to the fact table.
• Hierarchies and Levels: Often includes hierarchies that allow data to be analysed at various levels of granularity,
such as region > country > city.
• Relatively Static: While they can change, dimension tables are updated less frequently than fact tables. Changes
are often managed through slowly changing dimensions techniques.
• Smaller Size: Compared to fact tables, dimension tables are usually smaller since they contain less granular,
more descriptive data.
• Supports Readability: The structure and data in dimension tables are designed to make the data warehouse
user-friendly for analysts and decision-makers.
Sample Dimension Table: Product Information

ProductKey  ProductName   Category     Price   SupplierName       SupplierRegion  Discontinued
101         Widget A      Electronics  20.00   TechCorp           North America   No
102         Gadget B      Home Goods   20.00   HomeSupplies Inc.  Europe          No
103         Toolset C     Hardware     20.00   BuildIt Right      Asia            Yes
104         Appliance D   Appliances   30.00   KitchenTech        Europe          No
105         Smartphone E  Electronics  600.00  SmartTech          Asia            No

ProductKey: A unique identifier for each product. This key is used to link the product information to the sales transactions in the fact table.
ProductName: The name of the product.
Category: The category to which the product belongs, such as Electronics, Home Goods, Hardware, or Appliances.
Price: The standard price of the product. Note that the actual sales amounts and the discounts applied are recorded in the fact table.
SupplierName: The name of the supplier or manufacturer of the product.
SupplierRegion: The geographical region where the supplier is located.
Discontinued: Indicates whether the product is still available for sale or has been discontinued.
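
A short sketch of how the dimension gives context to the facts: the Sales Transactions sample is joined to the Product Information sample on ProductKey and summed per Category. Only the columns needed for the query are included, and the table names `product_dim` and `sales_fact` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT);
    CREATE TABLE sales_fact (
        TransactionID INTEGER PRIMARY KEY,
        ProductKey INTEGER REFERENCES product_dim (ProductKey),
        SalesAmount REAL);

    INSERT INTO product_dim VALUES
        (101, 'Widget A', 'Electronics'), (102, 'Gadget B', 'Home Goods'),
        (103, 'Toolset C', 'Hardware'),   (104, 'Appliance D', 'Appliances'),
        (105, 'Smartphone E', 'Electronics');
    INSERT INTO sales_fact VALUES
        (1, 101, 40.0), (2, 102, 20.0), (3, 103, 60.0), (4, 101, 40.0), (5, 104, 30.0);
""")

# The fact table supplies the numbers; the dimension supplies the grouping.
by_category = conn.execute("""
    SELECT p.Category, SUM(f.SalesAmount)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.ProductKey = f.ProductKey
    GROUP BY p.Category
    ORDER BY p.Category
""").fetchall()
```

Smartphone E (105) has no sales rows, so it contributes nothing to the Electronics total; an outer join from the dimension would be needed to report products with zero sales.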
• Fact tables contain factual or quantitative data.
• Dimension tables contain descriptions about the subjects of the business.
• Dimension tables are denormalized to maximize performance.
• 1:N relationship between dimension tables and fact tables.
Example of star schema
Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
Star schema example
Fact table provides statistics for
sales broken down by product,
period and store dimensions
Star schema with sample data
Snowflake schema
Snowflake schema is an expanded version of a star schema in which dimension tables are normalized into several related tables.

Advantages
• Small saving in storage space
• Normalized structures are easier to update and maintain

Disadvantages
• Schema less intuitive
• Ability to browse through the content difficult
• Degraded query performance because of additional joins.

Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
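
The cost of the additional joins can be seen in a small sketch based on the Store and District tables. The rows are hypothetical and only the columns needed are created. Because the district description lives in its own table, a question such as "sales per district" needs two joins, where a denormalized star-schema store dimension would need one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE district (District_id INTEGER PRIMARY KEY, District_desc TEXT);
    CREATE TABLE store (
        Store_id INTEGER PRIMARY KEY, Store_desc TEXT,
        District_id INTEGER REFERENCES district (District_id));
    CREATE TABLE sales_fact (
        Item_id INTEGER, Store_id INTEGER REFERENCES store (Store_id),
        Sales_dollars REAL, Sales_units INTEGER);

    INSERT INTO district VALUES (1, 'North'), (2, 'South');
    INSERT INTO store VALUES (10, 'Store A', 1), (11, 'Store B', 1), (12, 'Store C', 2);
    INSERT INTO sales_fact VALUES (1, 10, 100.0, 4), (1, 11, 50.0, 2), (2, 12, 75.0, 3);
""")

# Fact -> store -> district: the second join exists only because of normalization.
by_district = conn.execute("""
    SELECT d.District_desc, SUM(f.Sales_dollars)
    FROM sales_fact AS f
    JOIN store AS s ON s.Store_id = f.Store_id
    JOIN district AS d ON d.District_id = s.District_id
    GROUP BY d.District_desc
    ORDER BY d.District_desc
""").fetchall()
```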
Example of snowflake schema
Sales Fact Table: time_key, item_key, branch_key, location_key; Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_state, country
TIME_DIMENSION
time_key  day         day_of_the_week  month  quarter  year
1         2023-01-01  Sunday           Jan    1        2023
2         2023-01-02  Monday           Jan    1        2023
3         2023-01-03  Tuesday          Jan    1        2023
4         2023-01-04  Wednesday        Jan    1        2023
5         2023-01-05  Thursday         Jan    1        2023
6         2023-01-06  Friday           Jan    1        2023
7         2023-01-07  Saturday         Jan    1        2023

PRODUCT_SALES_FACT
sales_key  time_key  product_key  sales_amount  quantity_sold
1          1         1            500.00        100
2          2         1            450.00        90
3          3         1            550.00        110
4          4         1            600.00        120
5          5         1            400.00        80
6          6         1            650.00        130
7          7         1            475.00        95

PRODUCT_TABLE
product_key  product_name  category  price
1            Coffee        Beverage  5.00

[Figure: Coffee Sales Over Time. Time series (changes over time) of Sales Amount ($) against Date, 1 to 7 January 2023.]
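
Time dimension rows like those in TIME_DIMENSION are usually generated rather than captured. The sketch below is one way to do so with Python's standard datetime module; the function name `build_time_dimension` is hypothetical, and the day and month names assume the default English (C) locale.

```python
from datetime import date, timedelta


def build_time_dimension(start: date, days: int) -> list[tuple]:
    """Generate one time-dimension row per calendar day, starting at `start`."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        rows.append((
            offset + 1,               # time_key (surrogate key)
            d.isoformat(),            # day
            d.strftime("%A"),         # day_of_the_week
            d.strftime("%b"),         # month
            (d.month - 1) // 3 + 1,   # quarter
            d.year,                   # year
        ))
    return rows


time_dimension = build_time_dimension(date(2023, 1, 1), 7)
```

Storing these attributes once per day is what lets a fact table be grouped by weekday, month, quarter or year without any date arithmetic in the query.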
Why time?
And then there is Fact Constellation
• A complex schema that incorporates multiple fact tables sharing some dimension tables.
• This model is an extension of the simpler star and snowflake schemas and is designed to support a wider range of business queries and data analysis needs.
• Accommodates various levels of granularity and different perspectives within the same overall data warehouse architecture.

Key Characteristics of a Fact Constellation
• Multiple Fact Tables: Unlike a single fact table in a star schema, a fact constellation schema includes several fact tables, each representing different business processes or events. These fact tables can vary in granularity, meaning some may capture very detailed data while others summarize data at a higher level.
• Shared Dimension Tables: Fact tables in a constellation schema often share common dimension tables. For example, both sales and inventory fact tables might link to the same Time, Product, and Store dimensions. This shared dimensionality ensures consistency across different areas of analysis.
• Support for Complex Queries: By providing a more nuanced structure that reflects different aspects of the business, a fact constellation schema enables complex queries and analyses. Analysts can cross-reference data from different fact tables through their common dimensions.
• Efficient Data Analysis: The schema is designed to optimize data analysis and reporting by organizing data in a way that aligns closely with the analytical needs of the business. It supports efficient data retrieval for a wide range of queries.
• Scalability and Flexibility: Fact constellations offer a scalable and flexible approach to data warehouse design. New fact tables can be added to the schema as business requirements evolve, without disrupting existing structures.
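
Cross-referencing two fact tables through a shared dimension can be sketched as follows. The sales and inventory fact tables, their columns and their rows are hypothetical. Each fact table is aggregated on its own first and the results are then joined on the shared product key, which avoids double counting when the two tables have different numbers of rows per product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shared dimension.
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    -- Two fact tables for two business processes.
    CREATE TABLE sales_fact (product_key INTEGER, quantity_sold INTEGER);
    CREATE TABLE inventory_fact (product_key INTEGER, quantity_on_hand INTEGER);

    INSERT INTO product_dim VALUES (1, 'Coffee'), (2, 'Tea');
    INSERT INTO sales_fact VALUES (1, 100), (1, 90), (2, 30);
    INSERT INTO inventory_fact VALUES (1, 500), (2, 20);
""")

drill_across = conn.execute("""
    WITH sold AS (
        SELECT product_key, SUM(quantity_sold) AS total_sold
        FROM sales_fact GROUP BY product_key
    ),
    stock AS (
        SELECT product_key, SUM(quantity_on_hand) AS on_hand
        FROM inventory_fact GROUP BY product_key
    )
    SELECT p.product_name, sold.total_sold, stock.on_hand
    FROM product_dim AS p
    JOIN sold ON sold.product_key = p.product_key
    JOIN stock ON stock.product_key = p.product_key
    ORDER BY p.product_name
""").fetchall()
```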
Extraction, Transformation, Loading
So what did we cover today?
Agenda - Next meeting
2024 - Meeting 2

• Class Test 1: first 15 minutes.
• DW Project Planning and Project Management.
• Class attendance at end of session.
