0% found this document useful (0 votes)
49 views

Chapter 2 and 3

This document discusses key concepts related to data warehousing and online analytical processing (OLAP). It begins with an introduction to data warehouses and their purpose in supporting business decision making. The document then covers architectural components of data warehouses including data sources, the data warehouse itself, and business intelligence tools. It also discusses dimensional modeling, extract-transform-load processes, and techniques for efficiently querying and indexing multidimensional data.

Uploaded by

Harshad Nawghare
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
49 views

Chapter 2 and 3

This document discusses key concepts related to data warehousing and online analytical processing (OLAP). It begins with an introduction to data warehouses and their purpose in supporting business decision making. The document then covers architectural components of data warehouses including data sources, the data warehouse itself, and business intelligence tools. It also discusses dimensional modeling, extract-transform-load processes, and techniques for efficiently querying and indexing multidimensional data.

Uploaded by

Harshad Nawghare
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 89

Data Warehousing & Mining

B. Tech and MBA. Tech. (Information Technology)


Semester-V

Chapter 2 Architecture and Infrastructure & Data


Representation
Chapter 3 Information access and delivery

By. Prof. Bhushan Inje


Outline
• Architectural components
• Infrastructure and metadata
• Principles of dimensional modeling
• Dimensional modeling advance topics
• Data Extraction, Transformation and Loading
• Data quality

By. Prof. Bhushan Inje


Introduction
• A data warehouse is a collection of corporate information,
derived directly from operational systems and some external data sources.
• Its specific purpose is to support business decisions, not business
operations.
• This is what a data warehouse is all about, helping your business ask
“What if?” questions.
• The answers to these questions will ensure your business is proactive,
instead of reactive, a necessity in today’s information age.
What is a Data Warehouse?
• Data warehouse provides architectures and tools for business executives to
systematically organise, understand, and use their data to make strategic
decisions.
• In simple terms, a data warehouse refers to a database that is maintained
separately from an organization’s operational databases.
• According to W. H. Inman, a leading architect in the construction of data
warehouse systems,
“a data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision making
process.”
Cont.…
• Let us understand the four key words in more detail as
follows:
– Subject-oriented
– Integrated
– Time-variant
– Non-volatile
Use of Data Warehouses in organizations
• Many organizations are creating data warehouse to support
business decision-making activities for the following reasons:
– To increasing customer focus.
– To reposition products and managing product portfolios.
– To analyzing operations and looking for sources of profit.
– To managing the customer relationships
– Data warehousing is also very useful from the point of view of
heterogeneous database integration.
Differences between operational Database systems
and Data Warehouses
• On Line Transactional Processing (OLTP)
• On Line Analytical Processing (OLAP).
Characteristics of Data Warehouse
• Subject oriented
• Integrated
• Time variant
• Non volatile
Data Warehouse components
Data Warehouse components
• Data sources
• Data Warehouse
• Reporting
– Business intelligence tools
– Executive information systems (known more widely as Dashboard (business)
– OLAP Tools
– Data Mining

• Metadata
• Operations

Data Warehouse components Cont..

• Optional components
– Dependent Data Marts
– Logical Data Marts
– Operational Data Store
Designing the Data Warehouse
Data Warehouse Architecture
• Why do Business analysts need Data Warehouse?
Process of Data Warehouse Design
A data warehouse can be built using three approaches:
1. A top-down approach
2. A bottom-up approach
3. A combination of both approaches
• In general, the warehouse design process consists of the
following steps:
A three-tier Data
Warehouse
architecture
OLAP server architectures
There are three different possible designs:
1. Relational OLAP (ROLAP)
2. Multidimensional OLAP (MOLAP)
3. Hybrid OLAP (HOLAP)
Getting Multidimensional Data out of the Warehouse
A Multidimensional Data Model
• From Tables and Spreadsheets to Data Cubes
• What is a data cube?
– A data cube allows data to be modeled and viewed in
multiple dimensions. It is defined by dimensions and
facts
– Dimensions are the perspectives or entities with respect
to which an organization wants to keep records.
Cont.…
Cont.…
Schemas for Multidimensional Databases

• Stars, Snowflakes, and Fact Constellations:


• Star schema:
– (1) a large central table (fact table) containing the bulk of the data,
with no redundancy, and
– (2) a set of smaller attendant tables (dimension tables), one for each
dimension.
Examples for Defining Star, Snowflake,
and Fact Constellation Schemas
• Data warehouses and data marts can be defined using two language primitives,
one for cube definition and one for dimension definition. The cube definition
statement has the following syntax:
Measures: Their Categorization and Computation

• Note that a multidimensional point in the data cube space can be defined
by a set of dimension-value pairs, for example, (time = “Q1”, location =
“Vancouver”, item = “computer”).
• A data cube measure is a numerical function that can be evaluated
at each point in the data cube space.
• Measures can be organized into three categories (i.e., distributive,
algebraic, holistic), based on the kind of aggregate functions used.
• Distributive: An aggregate function is distributive if it can be
computed in a distributed manner
• Algebraic: An aggregate function is algebraic if it can be
computed by an algebraic function with M arguments
• Holistic: An aggregate function is holistic if there is no constant
bound on the storage size needed to describe a sub aggregate.
Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from a set of low-
level concepts to higher-level, more general concepts.
OLAP Operations in the Multidimensional Data Model
• “How are concept hierarchies useful in OLAP?”
OLAP Operations
• Roll-up- The roll-up operation (also called the drill-up operation by some vendors)
performs aggregation on a data cube, either by climbing up a concept hierarchy for
a dimension or by dimension reduction.
• Drill-down- Drill-down is the reverse of roll-up. It navigates from less detailed data
to more detailed data. Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions.
• Slice and dice- The slice operation performs a selection on one dimension of the
given cube, resulting in a sub-cube
• Pivot (rotate)- Pivot (also called rotate) is a visualization operation that rotates the
data axes in view in order to provide an alternative presentation of the data.
• Other OLAP operations
– Drill-across
– Drill-through
OLAP Systems versus Statistical Databases

• Many of the characteristics of OLAP systems, such as the use


of a multidimensional data model and concept hierarchies, the
association of measures with dimensions, and the notions of
roll-up and drill-down, also exist in earlier work on statistical
databases (SDBs).
• A statistical database is a database system that is designed to
support statistical applications. Similarities between the two
types of systems are rarely discussed, mainly due to
differences in terminology and application domains.
A Starnet Query Model for Querying Multidimensional
Databases
A Starnet Query Model for Querying Multidimensional
Databases
Data Warehouse Back-End Tools and Utilities
• Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources
• Data cleaning, which detects errors in the data and rectifies them when possible
• Data transformation, which converts data from legacy or host format to
warehouse format
• Load, which sorts, summarizes, consolidates, computes views, checks integrity,
and builds indices and partitions
• Refresh, which propagates the updates from the data sources to the warehouse
Types of OLAP Servers: ROLAP versus MOLAP
versus HOLAP
• Relational OLAP (ROLAP) servers:
• Multidimensional OLAP (MOLAP) servers:
• Hybrid OLAP (HOLAP) servers:
• Specialized SQL servers:
“How are data actually stored in ROLAP and MOLAP
architectures?”
Data Warehouse Implementation
• Efficient Computation
of Data Cubes
– The compute cube
Operator and the
Curse of
Dimensionality
“How many cuboids are there in an n-dimensional data cube?”

• If there were no hierarchies associated with each dimension, then the total
number of cuboids for an n-dimensional data cube, as we have seen above,
is 2n.
• such as in the hierarchy “day < month < quarter < year”.

where Li is the number of levels associated with dimension i.


Partial Materialization: Selected Computation of Cuboids

• There are three choices for data cube materialization given a base cuboid:
1. No materialization:
2. Full materialization:
3. Partial materialization:

• The partial materialization of cuboids or subcubes should consider three factors:


(1) identify the subset of cuboids or subcubes to materialize;
(2) exploit the materialized cuboids or subcubes during query processing; and
(3) efficiently update the materialized cuboids or subcubes during load and refresh.
Indexing OLAP Data
• How to index OLAP data by bitmap indexing and join indexing.
Efficient Processing of OLAP Queries

1. Determine which operations should be performed on the


available cuboids:
2.Determine to which materialized cuboid(s) the relevant
operations should be applied:
• “Which of the above four cuboids should be
selected to process the query?”
• “How would the costs of each cuboid
compare if used to process the query?”
From Data Warehousing to Data Mining
• “How do data warehousing and OLAP relate to data mining?”
• Data Warehouse Usage
– There are three kinds of data warehouse applications:
• Information processing
• Analytical processing
• Data mining
• “How does data mining relate to information processing and
on-line analytical processing?”
• “Do OLAP systems perform data mining?
• Are OLAP systems actually data mining systems?”
From On-Line Analytical Processing to
On-Line Analytical Mining
• On-line analytical mining (OLAM) (also called OLAP mining)
• OLAM is particularly important for the following reasons:
– High quality of data in data warehouses
– Available information processing infrastructure surrounding data
warehouses
– OLAP-based exploratory data analysis
– On-line selection of data mining functions
Architecture for
On-Line
Analytical Mining
Data
Warehouse
Deployment
Data Warehouse Deployment
• Lifecycle for data warehouse deployment project:
– 0. Project Scoping and Planning
– 1. Requirement
– 2. Front-End Design
– 3. Warehouse Schema Design
– 4. OLTP to data warehouse mapping
– 5. Implementation
– 6. Deployment
– 7. Management and Maintenance of the system
Growth and maintenance of Data warehouse

• Monitoring The Data Warehouse


– Collection of Statistics
– The following is a random list that includes statistics for different uses. You will find most
of these applicable to your environment.
• Physical disk storage space utilization
• Number of times the DBMS is looking for space in blocks or causes fragmentation
• Memory buffer activity
Collection of Statistics Cont..
• Buffer cache usage
• Input–output performance
• Memory management Profile of the warehouse content, giving number of distinct
entity occurrences (example: number of customers, products, etc.)
• Size of each database table Accesses to fact table records
• Usage statistics relating to subject areas
• Numbers of completed queries by time slots during the day
• Time each user stays online with the data warehouse
• Total number of distinct users per day
• Maximum number of users during time slots daily
• Duration of daily incremental loads
• Count of valid users
• Query response times
• Number of reports run each day
• Number of active tables in the database
Using Statistics for Growth Planning
• We indicate below the types of action that are prompted by the monitoring
statistics:
– Allocate more disk space to existing database tables Plan for new disk
space for additional tables
– Modify file block management parameters to minimize fragmentation
– Create more summary tables to handle large number of queries looking
for summary information
– Reorganize the staging area files to handle more data volume
– Add more memory buffers and enhance buffer management Upgrade
database servers
– Offload report generation to another middle tier
– Smooth out peak usage during the 24-hour cycle
– Partition tables to run loads in parallel and to manage backups
MANAGING THE DATA WAREHOUSE
Platform
Upgrades

Ongoing Managing
Fine- Data
Tuning Growth

Information
Delivery Storage
Enhanceme Management
nts

Data
ETL
Model Management
Revisions
Thank You

You might also like