Data Warehousing: Modern Database Management

Uploaded by

Ngọc Trâm

CHAPTER 9:

DATA WAREHOUSING

Modern Database Management


12th Edition
Jeff Hoffer, Ramesh Venkataraman,
Heikki Topi

Copyright © 2016 Pearson Education, Inc.


OBJECTIVES
• Define terms
• Explain the gap between information needs and information availability
• List the reasons data warehousing is needed
• Describe three levels of data warehouse architecture
• Describe the two components of a star schema
• Estimate fact table size
• Design a data mart
• Develop requirements for a data mart
• Understand future data warehousing trends


DEFINITIONS
• Data warehouse
  - A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes. Key terms:
    - Subject-oriented: organized around key subjects such as customers, patients, students, products
    - Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
    - Time-variant: contains a time dimension so that trends and changes can be studied
    - Non-updatable: read-only; periodically refreshed from operational systems, not by end users
• Data mart
  - A data warehouse that is limited in scope


DATA WAREHOUSING
• The process by which organizations create and maintain data warehouses
• Through these warehouses, organizations extract meaning from their informational assets and inform decision making


NEED FOR DATA WAREHOUSING
• Integrated, company-wide view of high-quality information (from different databases)
• Separation of operational and informational systems and data (for improved performance)


CONTENT OF A DATA WAREHOUSE
• A data warehouse stores these types of data:
  - Historical data: data recorded over time
  - Derived data: data that have been filtered and transformed into information
  - Metadata: data that describe the data and the schema objects


EXAMPLE:
• Give some examples of each of those types of data.


ISSUES WITH COMPANY-WIDE VIEW (FIG 9-1)
• Inconsistent key structures: the first and second tables use a numeric key, while the last uses a string
• Synonyms: StudentID and the student number are the same attribute under different names
• Free-form vs. structured fields: in Student Health, StudentName holds the first and last name in one field, whereas in Student Data the name is broken into parts
• Inconsistent data values: Mr Smith's phone numbers conflict (one number in one source, two in another)
• Missing data: the Insurance value is missing
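The issues listed above can be sketched in code. This is only an illustrative sketch of how an integration step might reconcile such sources; the records, the field names (StudentNo, StudentName, Telephone, Insurance), the 9-character key format, and the "UNKNOWN" placeholder are all assumptions and are not taken from Figure 9-1.

```python
import re

# Hypothetical records from two source systems; names and values are invented.
student_data = {"StudentNo": 123456, "LastName": "Smith", "FirstName": "Elaine",
                "Telephone": "555-0101", "Insurance": None}
student_health = {"StudentName": "Elaine Smith", "Telephone": "555 0101 "}

def normalize_key(value):
    """Store every key as a zero-padded string so numeric and string keys match."""
    return str(value).strip().zfill(9)

def split_name(free_form):
    """Break a free-form 'First Last' name into structured parts."""
    first, _, last = free_form.strip().partition(" ")
    return first, last

def normalize_phone(value):
    """Keep digits only so formatting differences do not look like conflicts."""
    return re.sub(r"\D", "", value)

def integrate(record):
    """Map either source layout onto one warehouse layout."""
    # Synonyms: StudentNo and StudentID are treated as the same attribute.
    key = record.get("StudentNo", record.get("StudentID"))
    if "StudentName" in record:
        first, last = split_name(record["StudentName"])
    else:
        first, last = record["FirstName"], record["LastName"]
    return {
        "student_key": normalize_key(key) if key is not None else None,
        "first_name": first,
        "last_name": last,
        "phone": normalize_phone(record["Telephone"]),
        # Missing data stays explicit instead of being silently dropped.
        "insurance": record.get("Insurance") or "UNKNOWN",
    }
```

A real ETL process would also have to decide which source wins when values genuinely conflict; this sketch only removes differences of format.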


Figure 9-1 Examples of heterogeneous data


ORGANIZATIONAL TRENDS MOTIVATING DATA WAREHOUSES
• No single system of record
  - Data are split across several databases
• Multiple systems not synchronized
  - Data from the separate systems must be synchronized into an additional database
• Organizational need to analyze activities in a balanced way
  - Results must be consistent, so a data warehouse is necessary

ORGANIZATIONAL TRENDS MOTIVATING DATA WAREHOUSES (CONT.)
• Customer relationship management
  - To view the overall picture of activity with a customer across all touch points
• Supplier relationship management
  - To view the overall picture of activity with a supplier across all touch points, from billing, meetings, quality control, and pricing to support


SEPARATING OPERATIONAL AND INFORMATIONAL SYSTEMS
• Operational system: a system that is used to run a business in real time, based on current data; also called a system of record
  - For example: reservation systems, sales order processing systems, …


SEPARATING OPERATIONAL AND INFORMATIONAL SYSTEMS (CONT.)
• Informational system: a system designed to support decision making based on historical point-in-time and prediction data, for complex queries or data-mining applications
  - For example: sales trend analysis, human resource planning


THE NEED TO SEPARATE INFORMATIONAL AND OPERATIONAL SYSTEMS
• A data warehouse centralizes data that are scattered throughout disparate operational systems and makes them readily available for decision support applications.
• A properly designed data warehouse adds value to data by improving their quality and consistency.
• A separate data warehouse eliminates much of the contention for resources that results when informational applications are mixed with operational processing.


DATA WAREHOUSE ARCHITECTURES
• Independent data mart
• Dependent data mart and operational data store
• Logical data mart and real-time data warehouse
• Three-layer architecture
All involve some form of extract, transform, and load (ETL)


Figure 9-2 Independent data mart data warehousing architecture
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts

LIMITATIONS OF INDEPENDENT DATA MARTS
• Separate ETL process for each data mart → redundant data and processing
• Inconsistency between data marts
• Difficult to drill down for related facts between data marts, so analysis is limited
• Excessive scaling costs as more applications are built, because each new data mart repeats the ETL process
• High cost of obtaining consistency between marts

Figure 9-3 Dependent data mart with operational data store: a three-level architecture
• ODS provides an option for obtaining current data
• Single ETL for the enterprise data warehouse (EDW)
• Dependent data marts are loaded from the EDW
• Simpler data access

Figure 9-4 Logical data mart and real-time warehouse architecture
• ODS and data warehouse are one and the same
• Near real-time ETL for the data warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse → easier to create new data marts

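The idea that a logical data mart is a view rather than a separate database can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely for convenience; the table name, view name, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical warehouse table; names and rows are invented for illustration.
    CREATE TABLE warehouse_sales (
        region TEXT, product TEXT, sale_date TEXT, amount REAL
    );
    INSERT INTO warehouse_sales VALUES
        ('East', 'Desk',  '2016-01-05', 250.0),
        ('West', 'Chair', '2016-01-06',  80.0),
        ('East', 'Chair', '2016-01-07',  75.0);

    -- The "data mart" is only a view: no second copy of the data, no second ETL.
    CREATE VIEW east_sales_mart AS
        SELECT product, sale_date, amount
        FROM warehouse_sales
        WHERE region = 'East';
""")

east_rows = conn.execute(
    "SELECT product, amount FROM east_sales_mart ORDER BY sale_date"
).fetchall()
```

Because the view is evaluated against the warehouse table at query time, a row loaded into the warehouse is visible in the mart immediately, and creating another mart means defining another view.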
Figure 9-5 Three-layer data architecture for a data warehouse



DATA CHARACTERISTICS: STATUS VS. EVENT DATA
Figure 9-6 Example of DBMS log entry
• Status (shown before and after the event)
• Event = a database action (create/update/delete) that results from a transaction


DATA CHARACTERISTICS: TRANSIENT VS. PERIODIC DATA
Figure 9-7 Transient operational data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content.


DATA CHARACTERISTICS: TRANSIENT VS. PERIODIC DATA
Figure 9-8 Periodic warehouse data
Periodic data are never physically altered or deleted once they have been added to the store.
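The contrast between transient and periodic data can be sketched as two small in-memory stores. This is a toy illustration, not the layout of Figures 9-7 and 9-8; the key "001", the balances, the dates, and the function names are hypothetical.

```python
from datetime import date

# Transient store: one slot per key, so an update destroys the previous value.
transient = {}

def transient_write(key, balance):
    transient[key] = balance  # overwrite in place

# Periodic store: append-only list of (key, as_of_date, balance, action) rows.
periodic = []

def periodic_write(key, as_of, balance, action):
    periodic.append((key, as_of, balance, action))  # never alter or delete

def balance_as_of(key, as_of):
    """Return the most recent balance recorded on or before the given date."""
    rows = [r for r in periodic if r[0] == key and r[1] <= as_of]
    return max(rows, key=lambda r: r[1])[2] if rows else None

# The same two changes applied to both stores (hypothetical values).
transient_write("001", 100)
transient_write("001", 150)
periodic_write("001", date(2016, 10, 9), 100, "create")
periodic_write("001", date(2016, 10, 10), 150, "update")
```

Only the periodic store can answer "what was the balance on 9 October?", which is why warehouse data are kept in periodic form.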


OTHER DATA WAREHOUSE CHANGES THAT NEED TO BE ACCOMMODATED
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes (= new table)
• Descriptive attributes become more refined
• Descriptive data are related to one another
• New source of data

DERIVED DATA
• Objectives
  - Ease of use for decision support applications
  - Fast response to predefined user queries
  - Customized data for particular target audiences
  - Ad hoc query support
  - Data mining capabilities
• Characteristics
  - Detailed (mostly periodic) data
  - Aggregate (for summary)
  - Distributed (to departmental servers)
• Most common data model = dimensional model (usually implemented as a star schema)

Figure 9-9 Components of a star schema
• Fact tables contain factual or quantitative data
• Dimension tables contain descriptions about the subjects of the business
• 1:N relationship between dimension tables and fact tables
• Dimension tables are denormalized to maximize performance
• Excellent for ad hoc queries, but bad for online transaction processing

Figure 9-10 Star schema example (data recorded daily)
• Fact table provides statistics for sales broken down by product, period, and store dimensions
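A star schema of this shape (one sales fact table with product, period, and store dimensions) might be sketched as follows. The column names, data types, and rows are assumptions chosen for illustration and do not reproduce the figure; sqlite3 is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one surrogate key each.
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: a composite key of dimension keys plus numeric measures.
    CREATE TABLE sales (
        product_key  INTEGER REFERENCES product(product_key),
        period_key   INTEGER REFERENCES period(period_key),
        store_key    INTEGER REFERENCES store(store_key),
        units_sold   INTEGER,
        dollars_sold REAL,
        PRIMARY KEY (product_key, period_key, store_key)
    );

    INSERT INTO product VALUES (1, 'Desk'), (2, 'Chair');
    INSERT INTO period  VALUES (1, 2016, 1), (2, 2016, 2);
    INSERT INTO store   VALUES (1, 'Austin'), (2, 'Tampa');
    INSERT INTO sales   VALUES
        (1, 1, 1, 3, 750.0), (2, 1, 1, 10, 800.0),
        (1, 2, 2, 1, 250.0), (2, 2, 1,  5, 400.0);
""")

# Ad hoc query: join the fact table to one dimension and aggregate a measure.
dollars_by_product = conn.execute("""
    SELECT p.description, SUM(s.dollars_sold)
    FROM sales s JOIN product p ON p.product_key = s.product_key
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
```

Swapping `product` for `store` or `period` in the join gives the same total sliced by a different dimension, which is the kind of ad hoc analysis the star schema is designed for.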


Figure 9-11 Star schema with sample data



SURROGATE (REPRESENTATIVE) KEYS
• Dimension table keys should be surrogate (non-intelligent and non-business related), because:
  - Business keys may change over time
  - Surrogate keys help keep track of different nonkey attribute values for a given production key
  - Surrogate keys are simpler and shorter
  - Surrogate keys can be the same length and format for all keys
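One way to picture surrogate keys is a small key map that hands out meaningless integers. This is a hedged sketch: the class `SurrogateKeyMap`, the business key "P-1001", and the `package_size` attribute are hypothetical names invented for the example.

```python
import itertools

class SurrogateKeyMap:
    """Hand out small integer keys that carry no business meaning."""

    def __init__(self):
        self._next = itertools.count(1)
        self._rows = []  # (surrogate_key, business_key, attributes)

    def add_version(self, business_key, attributes):
        """Each new version of a business key gets its own surrogate key."""
        key = next(self._next)
        self._rows.append((key, business_key, dict(attributes)))
        return key

    def versions(self, business_key):
        """All rows recorded for one business (production) key, oldest first."""
        return [row for row in self._rows if row[1] == business_key]

# Hypothetical product whose package size changes while its business key does not.
keys = SurrogateKeyMap()
k1 = keys.add_version("P-1001", {"package_size": "12 oz"})
k2 = keys.add_version("P-1001", {"package_size": "16 oz"})
```

Fact rows would store `k1` or `k2`, so each sale stays linked to the attribute values that were valid when it happened.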


GRAIN OF THE FACT TABLE
• Granularity of the fact table: the level of detail in the fact table
  - Transactional grain: finest level
  - Aggregated grain: more summarized
  - Finer grain → better market basket analysis capability
  - Finer grain → more dimension tables, more rows in the fact table


DURATION OF THE DATABASE
• Amount of history to be kept in the database
• Natural duration: 13 months or 5 quarters
• Financial institutions may need a longer duration
• Older data are more difficult to source and cleanse


SIZE OF FACT TABLE
• Depends on the number of dimensions and the grain of the fact table
• Number of rows = product of the number of possible values for each dimension associated with the fact table


SIZE OF FACT TABLE
• Example: assume the following for Figure 9-11:
• Total rows calculated as follows (assuming only half the products record sales for a given month):
• If the fact table contains 6 fields of 4 bytes each:
  120,000,000 rows × 6 fields × 4 bytes = 2.88 GB of data
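The size estimate can be reproduced with a few lines of arithmetic. The dimension counts below (1,000 stores, 10,000 products, 24 monthly periods) are assumptions: they are not stated in the text above, but they are consistent with the 120,000,000 rows and 2.88 GB quoted there.

```python
from math import prod

def fact_table_rows(dimension_sizes):
    """Upper bound on rows: product of the value counts of every dimension."""
    return prod(dimension_sizes)

def fact_table_bytes(rows, fields, bytes_per_field):
    """Raw data size, ignoring indexes and storage overhead."""
    return rows * fields * bytes_per_field

# Assumed inputs (not stated in the text above): 1,000 stores, 10,000 products
# of which half record sales in a given month, and 24 monthly periods.
stores, products, months = 1_000, 10_000, 24
active_products = products // 2

rows = fact_table_rows([stores, active_products, months])
size_bytes = fact_table_bytes(rows, fields=6, bytes_per_field=4)
size_gb = size_bytes / 1e9
```

Moving to a finer grain, for example daily instead of monthly periods, multiplies the period count and therefore the row count by roughly 30.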


Figure 9-12 Modeling dates and time
• Fact tables contain time-period data → date dimensions are important

VARIATIONS OF THE STAR SCHEMA
• Multiple fact tables
  - Can improve performance
  - Often used to store facts for different combinations of dimensions
  - Conformed dimensions
• Factless fact tables
  - No nonkey data, but foreign keys for associated dimensions
  - Used for:
    - Tracking events
    - Inventory coverage

Figure 9-13 Conformed dimensions
• Two fact tables → two (connected) star schemas
• Conformed dimension: associated with multiple fact tables; here, date and product


Figure 9-14a Factless fact table showing occurrence of an event
• No data in the fact table, just keys associating dimension records
• Fact table forms an n-ary relationship between dimensions
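A factless fact table can be sketched as a table holding nothing but foreign keys. The attendance scenario, table names, and rows below are hypothetical and are not the content of Figure 9-14a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical attendance example; table names and rows are illustrative only.
    CREATE TABLE student (student_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_key  INTEGER PRIMARY KEY, title TEXT);

    -- Factless fact table: only foreign keys, no measures.
    CREATE TABLE attendance (
        student_key INTEGER REFERENCES student(student_key),
        course_key  INTEGER REFERENCES course(course_key),
        date_key    TEXT,
        PRIMARY KEY (student_key, course_key, date_key)
    );

    INSERT INTO student VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO course  VALUES (10, 'Databases');
    INSERT INTO attendance VALUES
        (1, 10, '2016-09-01'), (2, 10, '2016-09-01'), (1, 10, '2016-09-08');
""")

# With no measures to sum, analysis is done by counting rows.
attendance_counts = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM attendance a JOIN student s ON s.student_key = a.student_key
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
```

Each row records only that an event occurred for a given combination of dimension members.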


NORMALIZING DIMENSION TABLES
• Multivalued dimensions
  - Facts qualified by a set of values for the same business subject
  - Normalization involves creating a table for an associative entity between dimensions
• Hierarchies
  - Sometimes a dimension forms a natural, fixed-depth hierarchy
  - Design options
    - Include all information for each level in a single denormalized table
    - Normalize the dimension into a nested set of 1:M table relationships


Figure 9-15 Multivalued dimension
• Helper table is an associative entity that implements an M:N relationship between dimension and fact.


Figure 9-16 Fixed product hierarchy
• Dimension hierarchies help to provide levels of aggregation for users wanting summary information in a data warehouse.
• Dimension tables are normalized into several related tables


SLOWLY CHANGING DIMENSIONS (SCD)
• How to maintain knowledge of the past
• Kimball's approaches:
  - Type 1: just replace old data with new (historical data are lost)
  - Type 2: create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change. Most common approach
  - Type 3: for each changing attribute, create a current-value field and several old-value fields (multivalued)


TYPE 1
• Consider the [Supplier] table
• When Type 1 is applied, the changed value overwrites the old one
• No change history is tracked


TYPE 2
• Add a new version column
• Or add start/end date columns
• Or add a date with a flag (Y: current version)


TYPE 4
• Add a new history log table
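The difference between overwriting a row (Type 1) and adding a versioned row with start/end dates and a current flag (Type 2) can be sketched as below. This is a minimal illustration under assumed names: the supplier code "ABC", the `state` attribute, the dates, and the column names are hypothetical.

```python
from datetime import date

def apply_type1(rows, supplier_code, new_state):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in rows:
        if row["supplier_code"] == supplier_code:
            row["state"] = new_state

def apply_type2(rows, supplier_code, new_state, change_date):
    """Type 2: close the current row and append a new row with a new surrogate key."""
    current = next(r for r in rows
                   if r["supplier_code"] == supplier_code and r["current"] == "Y")
    current["end_date"] = change_date
    current["current"] = "N"
    rows.append({
        "supplier_key": max(r["supplier_key"] for r in rows) + 1,
        "supplier_code": supplier_code,
        "state": new_state,
        "start_date": change_date,
        "end_date": None,
        "current": "Y",
    })

def new_dimension():
    # Hypothetical supplier row; the values are invented for illustration.
    return [{"supplier_key": 1, "supplier_code": "ABC", "state": "CA",
             "start_date": date(2015, 1, 1), "end_date": None, "current": "Y"}]

type1_rows = new_dimension()
apply_type1(type1_rows, "ABC", "IL")

type2_rows = new_dimension()
apply_type2(type2_rows, "ABC", "IL", date(2016, 6, 1))
```

After the same change, the Type 1 table still has one row and no trace of "CA", while the Type 2 table has two rows, so facts recorded before the change can still be reported against the old value.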


10 ESSENTIAL RULES FOR DIMENSIONAL MODELING
• Use atomic facts
• Create single-process fact tables
• Include a date dimension for each fact table
• Enforce consistent grain
• Disallow null keys in fact tables
• Honor hierarchies
• Decode dimension tables
• Use surrogate keys
• Conform dimensions
• Balance requirements with actual data


THE FUTURE OF DATA WAREHOUSING: INTEGRATION WITH BIG DATA AND ANALYTICS
• Issue of big data (huge volume, often unstructured)
• Speed of processing
  - Design/purchase storage, database, and networking aspects in tandem
  - Use in-memory databases (RAM instead of disk)
  - Add analytics capabilities closer to the original data sources instead of in separate data warehouses
• Cost of data storage
  - Move the data warehouse to the cloud
  - Use columnar databases for storage optimization
• Unstructured data
  - NoSQL ("Not only SQL")
  - Hadoop

