Unit #1 - Data Warehouse and Data Mining
Prof. Dr. M. S. Memon
[email protected]
M.S. Memon
05/20/23 Department of CSE, QUEST 1
Outline
• Introduction to Data Warehouse
• OLTP vs. DW
• Applications of DW
Source: www.stonebridgegroup.com
Data Warehouse
• Purpose of the Data Warehouse
– Value of the DATA - Realize!!!
• Data / Information is an asset
• Data / Information can be sold
• Methods to realize the VALUE – Reporting, Analysis, Data
Mining, etc
• Make better decisions!!!
– Turn data into Information
– Create competitive advantages
– Methods to support decision making process – DSS etc
Why data warehouse?
• Bad decisions can lead to disasters
– Data Warehousing is at the base of decision support
systems
• Data warehousing is a data-driven decision-
support system
• Data warehousing helps to
– Understand the information hidden within the
organization’s data
• See data from different angles: product, client, time,
geographical area
• Get a glimpse of the future.
Why data warehouse?
• DBMS Approach
– List of all items that were sold last month?
Why data warehouse?
• Intelligent Enterprise
– Which items sell together? Which items to stock?
What is Data warehouse?
• Basically a very large database…
– Not all very large databases are data warehouses, but
all data warehouses are pretty large databases
What is Data warehouse?
• More specifically, a collective data repository
– Containing snapshots of the operational data (history)
– Obtained through data cleansing and ETL (Extract-Transform-Load)
– Useful for analytics
What is Data warehouse?
• Compared to other solutions it…
– Is suitable for tactical/strategic focus
Definition
• Ralph Kimball: “a copy of transaction data
specifically structured for query and analysis”
Data Warehouse (definitions)
• Used for decision making, Duplicates existing
data, Combination of hardware, specialized
software and data – Dyche
• A copy of transaction data specifically structured
for query and analysis – Kimball
• A single, complete and consistent store of data
obtained from a variety of different sources made
available to end users in a way that can be
understood and used in business context – Barry
Devlin
Data Warehouse (definitions)
• A data warehouse is a database where data is
collected for the purpose of being analyzed
Data Warehouse
• Subject oriented: Data is arranged by subject
area rather than by application. Data is
organized so that all the data elements relating
to the same real-world event or object are
linked together
Data Warehouse
• Subject oriented:
– Example: customer as subject in a DW
• DW is organized in this case by the customer
• It may consist of 10, 100 or more physical tables, all
related
Data Warehouse
• Integrated: Data is collected from multiple, diverse
operational systems of an organization and stored in a
consistent form
– E.g. gender, measurement, conflicting keys, consistency, …
Data Warehouse
• Non-volatile: Data in the data warehouse is never
over-written or deleted - once committed, the
data is static, read-only, and retained for future
reporting. Data is loaded, but not updated
– When subsequent changes occur, a new snapshot record
is written.
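The non-volatile rule can be illustrated with a minimal sketch. It assumes a hypothetical `customer_snapshot` table in an in-memory SQLite database; the table, the columns and the sample values are illustrative and not part of the lecture material. A change in the source does not overwrite the old row but appends a new snapshot record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical snapshot table: one row per customer per load date.
conn.execute(
    "CREATE TABLE customer_snapshot ("
    "customer_id INTEGER, city TEXT, snapshot_date TEXT)"
)

def record_snapshot(customer_id, city, snapshot_date):
    """Append a new snapshot row; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        (customer_id, city, snapshot_date),
    )

record_snapshot(1, "City A", "2023-01-01")
record_snapshot(1, "City B", "2023-05-01")  # customer moved: new row, old row kept

# The full history stays available for reporting.
history = conn.execute(
    "SELECT city, snapshot_date FROM customer_snapshot "
    "WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()
```

Because only inserts are issued, both the old and the new state of the customer remain queryable.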
Data Warehouse
• Time-variant: The changes to the data in the
data warehouse are tracked and recorded so
that reports can be produced showing changes
over time.
– Different environments have different time horizons associated
• While for operational systems a 60-to-90 day time horizon is
normal, a data warehouse has a 5-to-10 year horizon
General Definition
• More generally, a DW is
– A repository of an organization’s electronically stored data
– Designed to facilitate reporting and analysis
General Definition
A complete repository of historical corporate data
extracted from transaction systems that is available
for ad-hoc access by knowledge workers
•Transaction Systems
– Management Information System (MIS)
•Ad-hoc access
– Does not have a certain access pattern
– Queries not known in advance
– Difficult to write SQL in advance
•Knowledge workers
– Typically NOT IT literate (Executives, Analysts, Managers)
Data Warehousing
• A paradigm specifically designed for
strategic business information or decision
making
Typical Features
• DW typically…
– Reside on computers dedicated to this function
– Run on DBMS such as Oracle, IBM DB2, Teradata or
Microsoft SQL Server
– Retain data for long periods of time
– Consolidate data obtained from a variety of sources
– Are built around their own carefully designed data
model
What can be warehoused?
• Customer records
• Customer purchases
• Click stream, web traffic
• Product records
• Product purchase records
• Inventory movement
How does it work?
• Business user needs info
• User requests IT people
• IT people do system analysis and design
• IT people create reports
• IT people send reports to business user
• Business user may get answers
• Answers result in more questions
Data Warehouse vs. Operational Database
• Data warehouse: non-volatile
• Operational database: updateable
On Line Transaction Processing
• OLTP (OnLine Transaction Processing):
– Also known under the name of operational data, it
represents day-to-day operational business activities:
• Purchasing, sales, production distribution, …
– Typically for data entry and retrieval transaction
processing
– Reflects only the current state of the data
On Line Analytical Processing
• OLAP (OnLine Analytical Processing):
– Represents front-end analytics based on a DW
repository
– It provides information for activities like:
• Resource planning, capital budgeting, marketing initiatives,...
– It is decision oriented
OLTP vs. DW
• Properties
Operational DB             DW
Mostly updates             Mostly reads
Many small transactions    Long, complex queries
MB-TB of data              GB-PB of data
Raw data                   Summarized data
Clerical users             Decision makers
Up-to-date data            May be slightly outdated
OLTP vs. DW
                    OLTP                          Data Warehouse
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical, summarized,
                    detailed, flat relational,    multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write,                   lots of scans
                    index/hash on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100 MB-GB                     100 GB-TB
metric              transaction throughput        query throughput, response
OLTP vs. DW
• Consider a normalized database for a store; the
tables would look like:
OLTP vs. DW
• DW for that store would start by building the
following star schema:
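A star schema for a store could look like the following sketch. All table and column names (`fact_sales`, `dim_product`, `dim_store`, `dim_date`) and the sample row are hypothetical and are not taken from the slide; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the what, where and when of a sale.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, month TEXT, year INTEGER);
-- The central fact table holds the measures plus one key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);
""")
conn.execute("INSERT INTO dim_product VALUES (1, 'Tea', 'Beverages')")
conn.execute("INSERT INTO dim_store VALUES (1, 'City A')")
conn.execute("INSERT INTO dim_date VALUES (1, 'January', 2023)")
conn.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 3, 6.0)")

# A typical analytical query joins the fact table to the dimensions it needs.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.month
""").fetchall()
```

The fact table sits in the middle and every dimension is reachable with a single join, which is what gives the schema its star shape.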
OLTP vs. DW
• Basic insights from comparing OLTP and DWs
– A DW is a separate (RDBMS) installation that contains
copies of data from on-line systems
• Physically separate hardware may not be absolutely necessary
if one has lots of extra computing power, but it is
recommended
– With an optimistic locking DBMS one might even be
able to get away for a while with keeping just one copy
of its data
OLTP vs. DW
• There is an essentially different pattern of
hardware utilization between on-line and
analytical processing
Applications of DW
• Typical questions which can be answered with
DW & OLAP
– How much did sales unit A earn in January?
– How much did sales unit B earn in February?
– What was their combined sales amount for the first quarter?
• Answering these questions with SQL-queries is
difficult
– Complex query formulation necessary
– Process is likely to be slow due to complex joins and
multiple scans
Applications of DW
• Why can such questions be answered better with
a DW?
– Because in a DW tables are rearranged and
pre-aggregated (known as computing cubes)
• The table arrangement is subject oriented, usually some star
schema
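The effect of pre-aggregation can be sketched in a few lines. The sales records, unit names and amounts below are made up for illustration, and a plain dictionary stands in for a real cube; the point is only that the three example questions become lookups on a small aggregate.

```python
from collections import defaultdict

# Hypothetical detailed sales records: (sales_unit, month, amount).
sales = [
    ("A", "Jan", 100.0), ("A", "Jan", 50.0), ("A", "Feb", 70.0),
    ("B", "Jan", 20.0), ("B", "Feb", 80.0), ("B", "Mar", 30.0),
]

# Pre-aggregate once (for example during the nightly load) by unit and month.
cube = defaultdict(float)
for unit, month, amount in sales:
    cube[(unit, month)] += amount

# Queries now read the small aggregate instead of scanning detail rows.
unit_a_january = cube[("A", "Jan")]
unit_b_february = cube[("B", "Feb")]
first_quarter = sum(
    total for (unit, month), total in cube.items()
    if month in ("Jan", "Feb", "Mar")
)
```

The aggregate is computed once and reused by every query, which trades load-time work for faster answers.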
Applications of DW
• A DW is the base repository for front-end analytics
– OLAP
– KDD
– Data visualization
– Reporting
KDD (Knowledge Discovery in Databases): a data mining process
Applications of DW
• OLAP is a form of information processing and thus
needs to provide timely, accurate and
understandable information
– timely is however a relative term:
• In OLTP one expects an update to go through in a matter of
seconds
• In OLAP the time to answer a query can take minutes, hours
or even longer
• There are many flavors of OLAP
– ROLAP, DOLAP, MOLAP, WOLAP, HOLAP,…
Applications of DW
– Data mining might return the following set of rules for
customers spending more than €100:
• IF AGE > 35 AND CAR = ‘MINIVAN’ THEN TOTAL SPENT >
€100
• IF SEX = ‘M’ AND ZIP = 38106 THEN TOTAL SPENT > €100
– It answers questions like
• Which products or customers are more profitable
• Which outlets have sold the least this year
– In consequence it motivates decisions like
• Which products should have their production increased
• Which customers should be targeted for special promotions
• Which outlets should be closed
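Once such rules have been mined, applying them is straightforward. The sketch below encodes the two example rules as a plain function; the customer records and field names are hypothetical, and the function only applies the rules, it does not discover them.

```python
def predicts_high_spender(customer):
    """Apply the two example rules; True if either rule fires."""
    rule_1 = customer["age"] > 35 and customer["car"] == "MINIVAN"
    rule_2 = customer["sex"] == "M" and customer["zip"] == 38106
    return rule_1 or rule_2

# Hypothetical customer records.
customers = [
    {"id": 1, "age": 42, "car": "MINIVAN", "sex": "F", "zip": 10115},
    {"id": 2, "age": 29, "car": "SEDAN", "sex": "M", "zip": 38106},
    {"id": 3, "age": 51, "car": "SEDAN", "sex": "F", "zip": 38106},
]

# Customers matched by a rule are candidates for a special promotion.
targets = [c["id"] for c in customers if predicts_high_spender(c)]
```

Customer 1 matches the first rule, customer 2 the second, and customer 3 matches neither.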
DW User
• Users of DW are called DSS analysts and usually
are business persons
– Their primary job is to define and discover information
used in corporate decision-making
– The way they think
• “Give me what I say I want, and then I can tell you what I really
want”
• They work in an explorative manner
DW User
– Typical explorative line of work
• “Ah! Now that I see what the possibilities are, I can tell what I
really want to see. But until I know what the possibilities are, I
cannot describe exactly what I want...”
Lifecycle of Data warehouse
Outline
• Lifecycle of DW
• Operating DW
Lifecycle of DW
• Prototype
– Objective is to constrain and in some cases reframe
end-user requirements
• Deployment
– Development of documentation
– Training
– Operations and management processes
• Operation
– Day-to-day maintenance of the DW needs a good
management of ongoing Extraction, Transformation and
Loading (ETL) process
Lifecycle of DW
• Enhancement needs the modification of
– HW - physical components
– Operations and management processes
– Logical schema designs
Operating a DW: Monitoring
• Monitoring
– Surveillance of the data sources
– Identification of data modification which is relevant to the
DW
– Monitoring plays an important role in the whole process:
it decides which data the next steps will be applied to
• Monitoring techniques
– Active mechanisms - Event Condition Action (ECA)
rules:
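An ECA rule reads as "on event, if condition, then action". A minimal sketch in Python, assuming hypothetical class names (`EcaRule`, `Monitor`) and a made-up `orders` table; a real DBMS would typically express this with triggers.

```python
class EcaRule:
    """Event-Condition-Action rule: on `event`, if `condition` holds, run `action`."""

    def __init__(self, event, condition, action):
        self.event = event
        self.condition = condition
        self.action = action


class Monitor:
    """Dispatches source-system events to the registered ECA rules."""

    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def notify(self, event, row):
        for rule in self.rules:
            if rule.event == event and rule.condition(row):
                rule.action(row)


pending_changes = []  # changes that the next extraction should pick up
monitor = Monitor()
monitor.register(EcaRule(
    event="update",
    condition=lambda row: row["table"] == "orders",  # only DW-relevant tables
    action=pending_changes.append,
))

monitor.notify("update", {"table": "orders", "id": 7})     # event and condition match
monitor.notify("update", {"table": "audit_log", "id": 8})  # condition fails
monitor.notify("insert", {"table": "orders", "id": 9})     # event does not match
```

Only the first notification satisfies both the event and the condition, so only that change is queued.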
Operating a DW: Monitoring
• Monitoring techniques
– Replication mechanisms
• Snapshot:
– Local copy of data, similar to a View
– Used by Oracle 9i
• Data replication
– Replicates and maintains data in destination tables through data
propagation processes
– Used by IBM
Operating a DW: Monitoring
• Monitoring techniques
– Log-based (protocol based) mechanisms
• Since a DBMS writes log (protocol) data for transaction
management, the log can also be used for monitoring
• Difficult because the log format is proprietary and
subject to change
– Application managed mechanisms
• Hard to implement for legacy systems
• Based on time stamping or data comparison
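The two application-managed approaches can be sketched as follows. The function names, the `modified_at` field and the sample data are assumptions made for illustration: time stamping filters rows by their last-modified time, while data comparison diffs two full snapshots keyed by primary key.

```python
def changed_since(rows, last_extraction):
    """Time stamping: keep rows modified after the previous extraction."""
    return [r for r in rows if r["modified_at"] > last_extraction]


def diff_snapshots(old, new):
    """Data comparison: compare two snapshots keyed by primary key."""
    inserted = [k for k in new if k not in old]
    deleted = [k for k in old if k not in new]
    updated = [k for k in new if k in old and new[k] != old[k]]
    return inserted, updated, deleted


# ISO 8601 timestamps compare correctly as plain strings.
rows = [
    {"id": 1, "modified_at": "2023-05-18T10:00:00"},
    {"id": 2, "modified_at": "2023-05-20T09:30:00"},
]
recent = changed_since(rows, "2023-05-19T00:00:00")

old = {1: "Tea", 2: "Coffee", 3: "Milk"}
new = {1: "Tea", 2: "Espresso", 4: "Juice"}
inserted, updated, deleted = diff_snapshots(old, new)
```

Time stamping needs a reliable modification column in the source, which legacy systems often lack; data comparison works without it but has to read both snapshots in full.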
Operating a DW: Extraction
• Extraction
– Reads the data which was selected throughout the
monitoring phase and inserts it in the data structures of
the workplace
– Due to large data volume, compression can be used
– The time-point for performing extraction can be:
• Periodical:
– Weather or stock market information can be updated several times
a day, while product specifications can be updated over a longer
period of time
• On request:
– For example when a new item is added to a product group
Operating a DW: Extraction
• Extraction
– The time-point for performing extraction can be:
• Event driven:
– Event driven extraction can be helpful in scenarios where the
passing of time, or the number of modifications exceeding a
specified threshold, triggers the extraction. For example, an
extraction is performed each night at 03:00 or each time 50 new
modifications took place
• Immediate:
– In some special cases like the stock market it can be necessary
that the changes propagate immediately to the warehouse
– The extraction largely depends on the hardware and
software used for the DW and the data source
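The event-driven case can be sketched as a small decision function. The threshold of 50 modifications and the 03:00 run time come from the example above; the function name and its parameters are hypothetical, and a real scheduler would add error handling and persistence.

```python
from datetime import date, datetime, time

MODIFICATION_THRESHOLD = 50  # from the example: every 50 new modifications
NIGHTLY_RUN = time(3, 0)     # from the example: each night at 03:00


def should_extract(pending_modifications, now, last_run_date):
    """Return True when either event from the example has occurred."""
    threshold_reached = pending_modifications >= MODIFICATION_THRESHOLD
    # Nightly run is due once 03:00 has passed and today has not run yet.
    nightly_due = now.time() >= NIGHTLY_RUN and last_run_date < now.date()
    return threshold_reached or nightly_due


by_threshold = should_extract(50, datetime(2023, 5, 20, 1, 0), date(2023, 5, 20))
by_schedule = should_extract(10, datetime(2023, 5, 20, 3, 5), date(2023, 5, 19))
already_ran = should_extract(10, datetime(2023, 5, 20, 3, 5), date(2023, 5, 20))
```

An immediate strategy would instead call the extraction on every single change, without any such check.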
Operating a DW: Transforming
• Transforming
– Implies adapting data, schema as well as data quality
to the application requirements
– Data integration:
• Transformation into de-normalized data structures
• Handling of key attributes
• Adaptation of different types of the same data
• Conversion of encoding:
– “Buy”, “Sell” → 1, 2 vs. B, S → 1, 2
• Normalization:
– “Michael Schumacher” → “Michael, Schumacher” vs.
“Schumacher Michael” → “Michael, Schumacher”
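Both conversions can be sketched with a lookup table and a small helper. The source names `system_a` and `system_b`, the function names and the `order` parameter are assumptions for illustration; the sketch also assumes two-part names, which real data will not always satisfy.

```python
# Each source system uses its own codes; all are mapped to one target encoding.
ENCODING_MAPS = {
    "system_a": {"Buy": 1, "Sell": 2},
    "system_b": {"B": 1, "S": 2},
}


def convert_code(source, value):
    """Translate a source-specific code into the warehouse encoding."""
    return ENCODING_MAPS[source][value]


def normalize_name(raw, order):
    """Return the name in one target form; `order` says how the source writes it."""
    first, second = raw.split()
    if order == "last_first":
        first, second = second, first
    return f"{first}, {second}"


codes = [convert_code("system_a", "Sell"), convert_code("system_b", "S")]
names = [
    normalize_name("Michael Schumacher", "first_last"),
    normalize_name("Schumacher Michael", "last_first"),
]
```

After the transformation both sources deliver the same code and the same spelling, so the records can be matched in the warehouse.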
Operating a DW: Transforming
• Transforming
– Data integration:
• Date handling:
– “MM-DD-YYYY” → “MM.DD.YYYY”
• Measurement units and scaling:
– 10 inch → 25.4 cm
– 30 mph → 48.28 km/h
• Save calculated values
– Price_incl_VAT = Price_excl_VAT * 1.19
• Aggregation
– Daily sums can be added into weekly ones
– Different levels of granularity can be used
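These integration steps map directly onto small conversion functions. A minimal sketch: the function names are hypothetical, the VAT factor 1.19 is the one used in the example above, and the conversion constants are the standard definitions of inch and mile.

```python
from datetime import datetime

CM_PER_INCH = 2.54
KMH_PER_MPH = 1.609344
VAT_FACTOR = 1.19  # rate taken from the example above


def reformat_date(value):
    """Convert "MM-DD-YYYY" to "MM.DD.YYYY", validating the date on the way."""
    return datetime.strptime(value, "%m-%d-%Y").strftime("%m.%d.%Y")


def inch_to_cm(inches):
    return inches * CM_PER_INCH


def mph_to_kmh(mph):
    return mph * KMH_PER_MPH


def price_incl_vat(price_excl_vat):
    """Calculated value that is stored instead of being recomputed per query."""
    return round(price_excl_vat * VAT_FACTOR, 2)


def weekly_sum(daily_sums):
    """Aggregate daily sums into one weekly sum (coarser granularity)."""
    return sum(daily_sums)


converted_date = reformat_date("05-20-2023")
week_total = weekly_sum([1, 2, 3, 4, 5, 6, 7])
```

Parsing the date with `strptime` instead of replacing characters has the side effect of rejecting invalid dates during the transformation step.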
Operating a DW: Transforming
• Transforming
– Data cleaning:
• Consistency check
– Flag records where Delivery_date < Order_date
• Completeness
– Management of missing values as well as NULL values
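A consistency and completeness check can be sketched as one validation function per record. The field names and the list of required fields are assumptions for illustration; what to do with a flagged record (reject, repair, or load with a marker) is a separate design decision.

```python
from datetime import date

REQUIRED_FIELDS = ("order_date", "delivery_date", "customer_id")  # hypothetical


def check_order(order):
    """Return a list of data-quality problems found in one order record."""
    problems = []
    # Completeness: required fields must be present and not None (NULL).
    missing = [f for f in REQUIRED_FIELDS if order.get(f) is None]
    if missing:
        problems.append("missing: " + ", ".join(missing))
    # Consistency: a delivery cannot happen before the order was placed.
    ordered, delivered = order.get("order_date"), order.get("delivery_date")
    if ordered is not None and delivered is not None and delivered < ordered:
        problems.append("delivery before order")
    return problems


ok = {"order_date": date(2023, 5, 1), "delivery_date": date(2023, 5, 3), "customer_id": 1}
inconsistent = {"order_date": date(2023, 5, 3), "delivery_date": date(2023, 5, 1), "customer_id": 1}
incomplete = {"order_date": date(2023, 5, 1), "delivery_date": date(2023, 5, 3), "customer_id": None}
```

Returning a list of problems rather than a single flag lets the cleaning step report every issue of a record at once.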
Operating a DW: Loading
• Loading
– Loading usually takes place during weekends or nights
when the system is not under user stress
– Split between initial load to initialize the DW and the
periodical load to keep the DW updated
– Initial loading
• Implies big volumes of data and for this reason a bulk loader is
used
– Usually performed by partitioning, parallelization and
incremental updating
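The idea behind loading in partitions can be sketched with batched inserts. This is only an illustration using SQLite from the Python standard library, not the bulk loader of a real DW product; the `fact_sales` table, the batch size and the generated rows are hypothetical.

```python
import sqlite3


def bulk_load(conn, rows, batch_size=1000):
    """Insert rows in batches, committing once per batch rather than per row."""
    loaded = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", batch)
        loaded += len(batch)
    return loaded


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL)"
)
rows = [(i % 10, i % 30, float(i)) for i in range(2500)]
loaded = bulk_load(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Batches are independent of each other, so in a real system they could also be loaded in parallel or restarted individually after a failure.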
Operating a DW: Analyzing
• Analyze
– Data access
• Useful for extracting goal oriented information:
– How many iPhone 3G units were sold in the Braunschweig stores of
T-Mobile in the last 3 calendar weeks of 2008?
– Although it is a common OLTP query, it might be too complex for the
operational environment to handle
– OLAP
• Often wrongly equated with DW, because it is used to analyze
data contained in a DW
• Used to answer requests like:
– In which district does a product group register the highest profit?
– How did the profit change in comparison to the previous month?
Operating a DW: Analyzing
• Analyze
– OLAP
• Mostly known as organized on a multidimensional data model
• Common analysis operations are:
» Pivoting/Rotation
» Roll-up, Drill-down and Drill-across
» Slice and Dice
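Roll-up, slice and dice can be sketched on a tiny in-memory cube. The fact rows, the three dimensions and the function names are hypothetical, and plain Python lists stand in for an OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, city, month, amount).
facts = [
    ("Tea", "City A", "Jan", 10.0),
    ("Tea", "City B", "Jan", 5.0),
    ("Coffee", "City A", "Jan", 7.0),
    ("Coffee", "City A", "Feb", 3.0),
]
DIMS = ("product", "city", "month")


def roll_up(rows, keep):
    """Roll-up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [r for r in rows if r[i] == value]


def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [
        r for r in rows
        if all(r[DIMS.index(d)] in values for d, values in allowed.items())
    ]


by_product = roll_up(facts, ["product"])
january = slice_(facts, "month", "Jan")
sub_cube = dice(facts, product={"Tea", "Coffee"}, city={"City A"})
```

Drill-down is the inverse of roll-up: calling `roll_up(facts, ["product", "city"])` returns the same totals at a finer level of detail.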
– Data mining
• Useful for identifying hidden patterns
• Refers to two separate processes:
– KDD (Knowledge Discovery in Databases)
– Prediction
Operating a DW: Analyzing
• Analyze
– Data mining
• Useful for answering questions like:
– How did the sales of this product group evolve?
• Methods and procedures for data mining
– Clustering, Classification, Regression, Association rule learning
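As an illustration of association rule learning, the two basic measures, support and confidence, can be computed directly. The baskets below are made up, and the sketch only evaluates a given rule; it does not search for rules the way an algorithm such as Apriori would.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

bread_butter_support = support(baskets, {"bread", "butter"})
bread_to_butter = confidence(baskets, {"bread"}, {"butter"})
```

Here bread and butter appear together in half of the baskets, and two of the three baskets containing bread also contain butter.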