UNIT2DM

This document provides an overview of data warehousing, including its definition, characteristics, and architecture. It explains key concepts such as the multidimensional data model, OLAP operations, and different warehouse schemas like star and snowflake schemas. Additionally, it discusses the roles of metadata and OLAP engines in facilitating data analysis and decision-making processes within organizations.

Uploaded by

agastua8

UNIT-2

DATA WAREHOUSING
Presentation Flow
DATA WAREHOUSE
1. INTRODUCTION
2. WHAT IS DATA WAREHOUSE
3. DEFINITION
4. MULTIDIMENSIONAL DATA MODEL
5. OLAP OPERATIONS
6. WAREHOUSE SCHEMA
7. DATA WAREHOUSE ARCHITECTURE
8. WAREHOUSE SERVER
9. META DATA
10. DATA WAREHOUSE BACKEND PROCESS
INTRODUCTION
 A data warehouse is an information system that contains
historical and cumulative data from single or multiple
sources.
 Data warehouse concepts simplify the reporting and analysis
processes of organizations.
 The concept of the data warehouse has existed since the 1980s,
when it was developed to help transition data from merely
powering operations to fueling decision support systems that
reveal business intelligence.
WHAT IS DATA WAREHOUSE
A single, complete and consistent store of data obtained from a
variety of different sources, made available to end users in a
form they can understand and use in a business context.

A data warehouse is a large collection of business data used to
help an organization make decisions.

The large amount of data in a data warehouse comes from
different places: internal applications such as marketing,
sales, and finance; customer-facing apps; and external partner
systems, among others.
DEFINE DATA WAREHOUSE
A data warehouse is a subject-oriented, integrated, time-variant,
non-volatile collection of data in support of the management's
decision-making process.
CHARACTERISTICS OF
WAREHOUSE
SUBJECT-ORIENTED:-
 A data warehouse is organized around major subjects such as
customers, products, sales, etc.
 Data are organized according to subject instead of application.
 For example, an insurance company using a data warehouse
would organize its data by customer, premium, and claim
instead of by different products (bike, car, auto).
 The data organized by subject contain only the information
necessary for decision support processing.
INTEGRATED:-
 A data warehouse is usually constructed by integrating
multiple, heterogeneous sources such as relational databases,
flat files and OLTP files.
 Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources.
– When data is moved to the warehouse, it is converted.
TIME-VARIANT:-
Data are stored in a data warehouse to provide a historical
perspective. Every key structure in the data warehouse
contains, implicitly or explicitly, an element of time.
NON-VOLATILE:-
 A data warehouse is always a physically separate store of
data, which is transformed from the application data found in
the operational environment.
 The existing data in the data warehouse are not updated or
changed in any way once they enter the data warehouse.
DISTINGUISH BETWEEN DATA
WAREHOUSE AND DBMS
DATA WAREHOUSE                 DBMS
1. It is subject-oriented      1. It is application-oriented
2. Integrated [dependent]      2. It is autonomous [independent]
3. Time-variant                3. Current data
4. Non-volatile                4. Volatile in nature
5. Multidimensional            5. Two-dimensional
6. Summary data                6. Transaction data
MULTIDIMENSIONAL DATA MODEL
 A multidimensional model views data in the form of a data
cube. A data cube enables data to be modelled and viewed in
multiple dimensions. It is defined by dimensions and facts.
 The dimensions are the perspectives or entities with respect to
which an organization keeps records.
DATA CUBE:-
 A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
 A data cube is a structure that enables OLAP to achieve
multidimensional functionality.
 Data cubes are an easy way to look at the data in a database (they allow us to
look at complex data in a simple format).
 Although it is called a cube, it can be 2-dimensional, 3-dimensional
or higher-dimensional.
Data cube types are:-
1. Multidimensional data cube
2. Relational data cube
A data cube has two categories of data:
1) Measure:- represents some number, such as cost or units of
service.
2) Dimension:- represents descriptive categories of data, such as time
or location.
Dimension Modeling:-
Dimension modeling is a special technique for structuring data
around business concepts, unlike ER modeling, which
describes entities and relationships.
Dimension modeling structures the numeric measures and the
dimensions.
[Figure: example dimension hierarchies. Gender: Male, Female. Profession: Engineer (Civil, Software), Teacher (Primary, High), Secretary (Executive, Junior).]
Lattice of Cuboids:-
 The dimension hierarchy helps us view the multidimensional
data in several different data cube representations.
 Multidimensional data can be viewed as a lattice of cuboids.
 The cuboid C[A1, A2, ..., An] at the finest level of granularity is
called the base cuboid, and it consists of all data cells.
 The coarsest level consists of one cell holding the numeric measures
aggregated over all n dimensions. This is called the apex cuboid.
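The lattice described above can be sketched in a few lines of Python. This is only an illustration: the dimension names are hypothetical, and each cuboid is represented simply by the tuple of dimensions it keeps, ignoring hierarchies within a dimension.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every cuboid (subset of dimensions) by how many dimensions it keeps."""
    lattice = {}
    for k in range(len(dimensions), -1, -1):
        lattice[k] = [tuple(c) for c in combinations(dimensions, k)]
    return lattice

dims = ("time", "location", "product")  # hypothetical dimension names
lattice = cuboid_lattice(dims)

base_cuboid = lattice[len(dims)][0]  # keeps all n dimensions: finest granularity
apex_cuboid = lattice[0][0]          # keeps no dimension: one fully aggregated cell
total_cuboids = sum(len(level) for level in lattice.values())  # 2 ** n cuboids
```

With n dimensions and no hierarchies, the lattice has 2^n cuboids, which is why precomputing all of them quickly becomes expensive as n grows.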
Summary Measures:-
1. Distributive:-
A numeric measure is distributive if it can be computed in a
distributed manner: the data are partitioned, the measure is
computed for each partition, and the result is obtained by
aggregating the partition results. Ex:- count, sum, min and
max are distributive measures.
2. Algebraic:-
An aggregate function is algebraic if it can be computed by an
algebraic function with a bounded number of arguments, each of
which is obtained by a distributive function. For example,
average is obtained by sum/count.
3. Holistic:-
An aggregate function is holistic if there is no
constant bound on the storage size needed to describe a
sub-aggregate. That is, there does not exist an algebraic function
with a bounded number of arguments that can be used to compute it.
Examples:- median, mode, most frequent, etc.
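The difference between the three kinds of measure can be seen on a small example. The following Python sketch uses made-up partition values; it shows that sum and count combine across partitions, that average follows from them, and that the median cannot be obtained from per-partition medians.

```python
from statistics import median

# Hypothetical measure values, split across three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: aggregate each partition, then aggregate the partial results.
total = sum(sum(part) for part in partitions)
count = sum(len(part) for part in partitions)

# Algebraic: average needs only two distributive sub-aggregates.
average = total / count

# Holistic: the median needs the full data; combining partition medians is wrong.
true_median = median(all_values)
median_of_medians = median([median(part) for part in partitions])
```

Here `median_of_medians` differs from `true_median`, which illustrates why holistic measures are costly to maintain in a precomputed cube.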
Data Cube
The sales data warehouse includes the sales amount in rupees and
the total number of units sold. Note that we can have more than
one numeric measure. The figure shows the multidimensional model
for such situations.
We do not take all the subdivisions of the dimensions into account. The
dimension hierarchies considered for the data cube are:-

Time: (month < quarter < year);
Location: (city < province < country);
and product.

In the given dimensional hierarchies,
 The base cuboid of the lattice is C[month, city, product].
 C[quarter, city, product] is an intermediate cuboid, obtained from the base cuboid by aggregating months into quarters.
 The apex cuboid is C[year, country, product].
 The other intermediate cuboids in the lattice are
a) C[quarter, province, product],
b) C[quarter, country, product],
c) C[month, province, product],
d) C[month, country, product],
e) C[year, city, product],
f) C[year, province, product].
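As a quick check of the list above, the cuboids can be enumerated by picking one level from each dimension hierarchy. This is a minimal sketch; the dictionary layout is an assumption made for illustration.

```python
from itertools import product

# One list of levels per dimension, ordered from finest to coarsest.
hierarchies = {
    "time": ["month", "quarter", "year"],
    "location": ["city", "province", "country"],
    "product": ["product"],
}

# Each cuboid picks exactly one level from every dimension.
cuboids = list(product(*hierarchies.values()))

base = cuboids[0]    # finest level in every dimension
apex = cuboids[-1]   # coarsest level in every dimension
intermediate = cuboids[1:-1]
```

The enumeration yields 3 x 3 x 1 = 9 cuboids: the base cuboid, the apex cuboid, and the seven intermediate ones, including C[quarter, city, product].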
 Data analysis tools are called OLAP (Online Analytical Processing) tools.
 OLAP is based on the multidimensional data model.
 OLAP enables a user to easily and selectively extract and view data
from different points of view.
 OLAP is mainly used to access live data online and to analyze it.
 OLAP provides a user-friendly environment for interactive data
analysis.
 OLAP provides the means to analyze those data in an
application-oriented manner.
The Basic OLAP operations for a
multidimensional model:-
1) Slicing:- The slice operation performs a selection on one dimension of the
given cube, resulting in a subcube.
The figure shows a slice operation where the sales data are selected from the central cube
for the dimension time, using the criterion time = "Q2".
slice(time = "Q2") C[quarter, city, product] = C[city, product]
2) Dicing:- The dice operation defines a subcube by performing a selection on two
or more dimensions.
• The figure shows a dice operation on the central cube based on the
following selection criteria, which involve two dimensions:
(location = "Mumbai" or "Pune") and (time = "Q1" or "Q2").
dice(time = "Q1" or "Q2" and location = "Mumbai" or "Pune")
C[quarter, city, product] = C[quarter', city', product]
where quarter' and city' have truncated domains, {Q1, Q2} and {Mumbai,
Pune} respectively.
3) Drilling:- This operation is meant for moving up and down along classification
hierarchies.
a) Drill-up:- The drill-up operation is also known as the roll-up operation.
This operation deals with switching from a detailed to an aggregated level
within the same classification hierarchy. The drill-up operation performs
aggregation on a data cube, either by climbing up a dimension hierarchy or
by dimension reduction.
For example, a roll-up can aggregate the data by ascending the location hierarchy
from the level of the city to the level of the state. In other words, rather than
grouping the data by city, the resulting cube groups the data by state.
roll-up(location) C[quarter, city, product] = C[quarter, state, product]
Here, each data cell of the resulting cuboid is the aggregation of the data cells that
are merged due to the roll-up operation. In other words, the measures stored
in the data cells C[Q1, Mumbai, computer] and C[Q1, Pune, computer] are
added to determine the measure to be stored at C[Q1, Maharashtra, computer].
When roll-up is performed by dimension reduction, one or more dimensions
are removed from the given cube.
b) Drill-down:-
This operation is concerned with switching from an aggregated to a more detailed
level within the same classification hierarchy. Drill-down is the reverse of
roll-up.
c) Drill-within:-
It is switching from one classification to a different one within the same
dimension.
d) Drill-across:-
It means switching from a classification in one dimension to a different
classification in a different dimension.
4) Pivot (rotate):-
It is a visualization operation which rotates the data axes in order to provide an
alternative presentation of the same data.
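The operations above can be sketched on a tiny in-memory cube. The following Python example is illustrative only: the cell values, the city-to-state mapping and the function names are all assumptions, and the cube is a plain dictionary keyed by (quarter, city, product).

```python
from collections import defaultdict

# Hypothetical sales cube C[quarter, city, product] -> units sold.
cube = {
    ("Q1", "Mumbai", "computer"): 10,
    ("Q1", "Pune", "computer"): 5,
    ("Q1", "Delhi", "computer"): 7,
    ("Q2", "Mumbai", "computer"): 12,
    ("Q2", "Pune", "printer"): 3,
    ("Q3", "Delhi", "printer"): 4,
}
CITY_TO_STATE = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Delhi": "Delhi"}

def slice_cube(cube, quarter):
    """Fix the time dimension to one value and drop it, giving C[city, product]."""
    return {(c, p): v for (q, c, p), v in cube.items() if q == quarter}

def dice_cube(cube, quarters, cities):
    """Keep all three dimensions but restrict time and location to subsets."""
    return {k: v for k, v in cube.items() if k[0] in quarters and k[1] in cities}

def roll_up_location(cube):
    """Climb the location hierarchy from city to state, summing merged cells."""
    result = defaultdict(int)
    for (q, c, p), v in cube.items():
        result[(q, CITY_TO_STATE[c], p)] += v
    return dict(result)

def pivot(cube2d):
    """Swap the two axes of a 2-D view; the stored values are unchanged."""
    return {(b, a): v for (a, b), v in cube2d.items()}

q2 = slice_cube(cube, "Q2")
diced = dice_cube(cube, {"Q1", "Q2"}, {"Mumbai", "Pune"})
by_state = roll_up_location(cube)
rotated = pivot(q2)
```

Note that drill-down cannot be computed from `by_state` alone: the city-level detail has been summed away, so a drill-down has to go back to the more detailed cuboid.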
WAREHOUSE SCHEMA
A multidimensional data model identifies the dimensions, their
hierarchies, the measure functions, etc. for the design of a
data cube. The common warehouse schemas are:

1. Star Schema
2. Snowflake Schema
3. Fact Constellation
STAR SCHEMA
 It is a modelling paradigm in which the data warehouse
contains a single large central fact table and a set of smaller
dimension tables, one for each dimension.
 The fact table contains detailed summary data.
 Its primary key has one key per dimension.
 Every tuple in the fact table consists of the fact or subject of
interest, and the dimensions that provide that fact.
 Each dimension is a single, highly denormalized table.
 Each tuple of the fact table consists of a key pointing to each
dimension table that provides its multidimensional coordinates.
 Each dimension table consists of columns corresponding to
the attributes of the dimension.
 There exists a 1:n relationship between each dimension table and
the fact table: one dimension row is referenced by many fact rows.
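A star schema of this shape can be sketched with Python's built-in sqlite3 module. The table names, column names and rows below are hypothetical and chosen only to mirror the sales example; a real warehouse would use its own naming and a far larger fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one denormalized table per dimension.
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);

-- Fact table: its primary key has one key per dimension, plus the measure.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    city_id    INTEGER REFERENCES dim_city(city_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER,
    PRIMARY KEY (time_id, city_id, product_id)
);

INSERT INTO dim_time VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO dim_city VALUES (1, 'Mumbai', 'Maharashtra'), (2, 'Pune', 'Maharashtra');
INSERT INTO dim_product VALUES (1, 'computer');
INSERT INTO fact_sales VALUES (1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 1, 12);
""")

# One join per dimension used; the state comes straight from the city dimension.
rows = conn.execute("""
    SELECT t.quarter, c.state, SUM(f.units)
    FROM fact_sales f
    JOIN dim_time t ON f.time_id = t.time_id
    JOIN dim_city c ON f.city_id = c.city_id
    GROUP BY t.quarter, c.state
    ORDER BY t.quarter
""").fetchall()
```

Because the city dimension is denormalized, rolling up from city to state needs no extra join beyond the one to `dim_city`.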
ADVANTAGES OF STAR SCHEMA
 Easy to understand
 Easy to define hierarchies
 Reduces the number of physical joins
 Low maintenance
 Simple meta data
SNOWFLAKE SCHEMA
 The dimension tables are normalized here.
 There is a single fact table and multiple dimensional tables.
 Like the Star Schema, each tuple of the fact table consists of a
(foreign) key pointing to each of the dimension tables that
provide its multidimensional coordinates.
 It also stores numerical values (non-dimensional attributes,
and results of statistical functions) for those coordinates.
ADVANTAGES OF SNOWFLAKE SCHEMA
 Normalized tables are easy to maintain
 Saves storage space as it contains non-redundant data

Limitations
 Navigation across tables is complex.
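The trade-off between the two schemas can be illustrated with a small Python sketch. The dictionaries below stand in for dimension tables and are purely hypothetical; they show the redundancy removed by normalization and the extra lookups it costs.

```python
# Star: one denormalized row per city repeats the state and country.
star_dim_city = {
    1: {"city": "Mumbai", "state": "Maharashtra", "country": "India"},
    2: {"city": "Pune", "state": "Maharashtra", "country": "India"},
}

# Snowflake: the same dimension normalized into three linked tables.
snow_city = {1: {"city": "Mumbai", "state_id": 10}, 2: {"city": "Pune", "state_id": 10}}
snow_state = {10: {"state": "Maharashtra", "country_id": 100}}
snow_country = {100: {"country": "India"}}

def country_star(city_id):
    """One lookup: the country is stored on the city row itself."""
    return star_dim_city[city_id]["country"]

def country_snowflake(city_id):
    """Three lookups: city -> state -> country, mirroring the extra joins."""
    state = snow_state[snow_city[city_id]["state_id"]]
    return snow_country[state["country_id"]]["country"]
```

In the snowflake form "Maharashtra" and "India" are each stored once, at the price of navigating two more tables for every query that needs the country.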
FACT CONSTELLATION SCHEMA
A fact constellation is a kind of schema in which more
than one fact table shares some dimension
tables. It is also called a galaxy schema.
For example, let us assume that Deccan Electronics Company
would like to have a supply and delivery fact table. It may
contain five dimensions, or keys: time, item, delivery-agent,
origin and destination, along with the numeric measures: the
number of units supplied and the cost of delivery.
DATA WAREHOUSE ARCHITECTURE
TIER-1
 Tier-1 is the bottom tier or physical layer.
 Tier-1 contains the main data warehouse.
 It is essentially the warehouse server, which can follow three models.
 Tier-1 includes metadata, data marts, the data warehouse,
monitoring and administration.
 Backend tools and utilities first perform extract, clean,
transform and load functions on the data.
TIER-2
 Tier-2 is the OLAP engine for analytical processing.
 The OLAP engine options include ROLAP, MOLAP and
extended SQL OLAP.
TIER-3
 This tier is the front-end client layer.
 This layer holds the query and reporting tools, analysis
tools, visualization tools and data mining tools.
 The backend process is concerned with extracting data
from multiple operational databases and from external
sources, and with cleaning, transforming and integrating the data
for loading into the data warehouse server.
WAREHOUSE SERVER
1) Enterprise Warehouse:-
This model collects all the information about the subjects. It
provides corporate-wide data integration, usually from one or
more operational systems. An enterprise data warehouse typically requires a
traditional mainframe.
2) Data Mart:-
Data marts are partitions of the overall data warehouse. A data mart
is a subset of the data warehouse built specifically for a
department. Data marts may contain some overlapping data.
The industry is moving away from a single, physical data
warehouse towards a set of smaller, more manageable
databases called data marts.
 Stand-Alone Data Mart:-
This approach enables a department or work group to implement
a data mart with minimal or no impact on the enterprise's
operational database.
 Dependent Data Mart:-
This approach is similar to the stand-alone data mart, except that
management of the data sources by the enterprise database is
required. These data sources include operational databases
and external sources of data.
3) Virtual Data Warehouse:-
It is a set of views over operational databases. A virtual
warehouse is easy to build but requires excess capacity on
operational database servers.
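The idea of a virtual warehouse as views over operational data can be sketched with Python's built-in sqlite3 module. The `orders` table and the `sales_by_city` view are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational table, updated by day-to-day transactions.
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, city TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'Mumbai', 100.0), (2, 'Pune', 40.0), (3, 'Mumbai', 60.0);

-- The "virtual warehouse": a summary view, with no data copied or stored.
CREATE VIEW sales_by_city AS
    SELECT city, SUM(amount) AS total FROM orders GROUP BY city;
""")

before = dict(conn.execute("SELECT city, total FROM sales_by_city").fetchall())

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'Pune', 10.0)")
after = dict(conn.execute("SELECT city, total FROM sales_by_city").fetchall())
```

Every query on the view is recomputed against the operational table, which is why this approach needs spare capacity on the operational servers.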
META DATA
 Metadata is data about the data.
 It serves to identify the contents and location of data in the
warehouse.
 Metadata is a bridge between the data warehouse and the
decision support application.
 In addition to providing a logical linkage between data and
application, it can be used to access information across the data warehouse.
Types of Meta Data:-
1. Build-Time Metadata
2. Usage Metadata
3. Control Metadata
1. Build-time Metadata:-
 Whenever a warehouse is designed and built, the metadata
thus generated is known as build-time metadata.
 This metadata links business and warehouse terminology and
describes the data's technical structure.
 It is the most detailed and exact type of metadata and is used
extensively by warehouse designers, developers, and
administrators.
 It is the primary source of most of the metadata used in the
warehouse.
2. Usage Metadata:-
This metadata is derived from build-time metadata. It is an
important tool for users and data administrators.

3. Control Metadata:-
Control metadata provides vital information about the
timeliness of warehouse data and helps users keep track of
the sequence and timing of warehouse events.
It is mostly used by system programmers.
However, one subset, which is generated and used by the tools
that populate the warehouse, is of considerable interest to
users and data warehouse administrators.
OLAP ENGINE
The main functions of the OLAP engine are to present the user with a
multidimensional view of the data warehouse and to provide
tools for OLAP operations.

There are three options for the OLAP engine:-
 Specialized SQL Server
 Relational OLAP Server (ROLAP)
 Multidimensional OLAP (MOLAP)
1. Specialized SQL Server:-
 This model assumes that the warehouse organizes data in a
relational structure and the engine provides an SQL-like
environment for OLAP tools.
 The main idea is to exploit the capabilities of SQL.
 Standard SQL is not well suited to OLAP
operations.
 However, some researchers (and some vendors) are
attempting to extend SQL to provide OLAP
operations.
 This is relevant when the data warehouse is available in a
relational structure.
2. Relational OLAP Server (ROLAP)
 A scalable, parallel, relational database provides the storage and
high-speed access to this underlying data.
 A middle analysis tier provides a multidimensional conceptual view
of the data and an extended analytical functionality which are not
available in the underlying relational server.
 The presentation tier delivers the results to the users.
 ROLAP systems provide the benefit of full analytical functionality,
while maintaining the advantage of relational data.
 ROLAP depends on a specialized schema design and its technology
is limited by its non-integrated, disparate tier architecture.
 The problem is that the data is physically separated from the
analytical processing.
 For many queries this is not a major problem, but it limits the scope
of analysis.
Two important features of ROLAP are
• Data warehouse and relational database are inseparable
• Any change in the dimensional structure requires a physical
reorganization of the database, which is too time consuming.
3. Multidimensional OLAP (MOLAP)
 A data warehouse built on a multidimensional data model
uses a Multidimensional OLAP (MOLAP) server for analysis.
 MOLAP servers support multidimensional views of data
through array-based data warehouse servers.
 They map multidimensional views of a data cube to array
structures.
 The advantage of using a data cube is that it allows fast
indexing to precomputed summarized data.
 With a multidimensional data store, storage utilization may be
low when the data set is sparse, so sparse matrix techniques are used in such cases.
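How a cube cell maps to an array position can be sketched as follows. This is a simplified illustration, not how any particular MOLAP product stores data: the dimension values are hypothetical and the layout is plain row-major order in a flat Python list.

```python
# Hypothetical dimension domains; the order of each list fixes the coordinates.
quarters = ["Q1", "Q2", "Q3", "Q4"]
cities = ["Mumbai", "Pune"]
products = ["computer", "printer", "scanner"]

def offset(quarter, city, product):
    """Row-major position of cell C[quarter, city, product] in the flat array."""
    q = quarters.index(quarter)
    c = cities.index(city)
    p = products.index(product)
    return (q * len(cities) + c) * len(products) + p

# Dense storage: one slot per possible cell, whether or not it holds data.
cells = [0] * (len(quarters) * len(cities) * len(products))
cells[offset("Q2", "Pune", "printer")] = 3

empty_slots = cells.count(0)
```

Cell lookup is a direct index computation with no join, which explains the fast access; the many unused slots in `cells` show why sparse cubes waste space in a dense array.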
DIFFERENCE BETWEEN ROLAP & MOLAP

ROLAP                                  MOLAP
• It stands for Relational Online      • It stands for Multidimensional
  Analytical Processing.                 Online Analytical Processing.
• Data stored in a relational          • Data stored in a
  database.                              multidimensional cube.
• Data retrieved via SQL from the      • Data retrieved directly from the
  database for analysis.                 cube for analysis.
• It is used for large volumes of      • It is used for less/limited
  data.                                  volumes of data.
• The access time in ROLAP is          • The access time is quick in
  slow.                                  MOLAP.
• It uses complex SQL queries.         • A sparse matrix is used in
                                         MOLAP.
DATA WAREHOUSE BACKEND
PROCESS

The data warehouse server uses backend utilities to populate
and refresh its data. The various functions that take place in the
backend process are
 Data Extraction
 Data Cleaning
 Data Transformation
 Data Loading
 Data Refreshing
Data Extraction: This function gathers data from multiple
heterogeneous sources. The data may be collected from
production data, legacy data, internal office systems,
external systems and metadata.

Data Cleaning: The data in the data warehouse should be correct
and accurate. This function deals with detecting errors in the
data and rectifying them whenever possible. Data is extracted
from multiple sources, so there is a high possibility of
errors in the data. Hence cleaning is essential in constructing
quality data. Data cleaning techniques involve the use of
transformation rules, domain-specific knowledge and auditing.
Data Transformation: This function converts the data from
legacy or host format into data warehouse format. The data
from the heterogeneous sources are transformed into a uniform
structure so that they can be combined and integrated.

Loading: The volume of data to be loaded into the data warehouse is
large. This function sorts, summarizes, consolidates,
computes and partitions data. The different loading strategies
are batch loading, sequential loading and incremental loading.

Refresh: This function propagates updates from the data sources to the warehouse.
When the source data is updated, the warehouse needs to be updated
as well. This process is called refreshing. One method
is to refresh the data on every update, but this method is expensive.
Hence data is usually refreshed periodically, at regular intervals of
time.
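The five backend functions can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the record layout, the source names `crm` and `billing`, the cleaning rule (reject missing or non-positive amounts) and the use of record ids to detect new data.

```python
def extract(sources):
    """Gather raw records from several heterogeneous sources into one list."""
    return [record for source in sources for record in source]

def clean(records):
    """Drop records whose amount is missing, unparseable or not positive."""
    cleaned = []
    for r in records:
        try:
            amount = float(r["amount"])
        except (KeyError, TypeError, ValueError):
            continue
        if amount > 0:
            cleaned.append({**r, "amount": amount})
    return cleaned

def transform(records):
    """Convert records to one uniform structure (normalized city, float amount)."""
    return [{"city": r["city"].strip().title(), "amount": r["amount"]} for r in records]

def load(warehouse, records):
    """Append to the warehouse; existing rows are never modified."""
    warehouse.extend(records)
    return warehouse

def refresh(warehouse, sources, seen_ids):
    """Incremental refresh: process only records not seen in earlier runs."""
    new = [r for r in extract(sources) if r["id"] not in seen_ids]
    seen_ids.update(r["id"] for r in new)  # rejected records are not retried
    return load(warehouse, transform(clean(new)))

crm = [{"id": 1, "city": " mumbai ", "amount": "100"},
       {"id": 2, "city": "Pune", "amount": None}]
billing = [{"id": 3, "city": "PUNE", "amount": 40}]

warehouse, seen = [], set()
refresh(warehouse, [crm, billing], seen)                 # initial load
billing.append({"id": 4, "city": "pune", "amount": 10})  # source is updated
refresh(warehouse, [crm, billing], seen)                 # periodic refresh
```

The second call loads only the new billing record, which corresponds to the incremental loading strategy; running `refresh` on a schedule rather than on every source update matches the periodic refresh described above.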
OTHER FEATURES
 Warehouse Management Tools
 Data Warehouse Usage
 The Warehouse Atlas
 The misunderstood OLAP Engine

 Warehouse Management Tools
Data warehouse architecture usually provides a set of
management tools, which include a load manager, a warehouse
manager and a query manager. In addition, a data warehouse must
be supported by other management tools such as a server
manager and a network manager.
 Data Warehouse Usage:-
Initially, the data warehouse is mainly used for generating reports
and answering predefined queries. Eventually, it is used to
analyze summarized and detailed data. In the next phase, the
data warehouses are used for strategic purposes, performing
multidimensional analysis and sophisticated slice-and-dice
operations. Finally, the data warehouse may be employed for
knowledge discovery and strategic decision making using data
mining tools.
 The Warehouse Atlas
In the data warehouse atlas, metadata provides a variety of high-
level views as starting points of data warehouse exploration
for various users. It provides views for executive users, for
management, a physical features view for data administrators
and key business users, and a searchable, no-nonsense index
for everyday users. A warehouse atlas provides two detailed
views-one for end users and one for builders.
 The Misunderstood OLAP Engine
There are three fundamental misconceptions about OLAP engines:
1. OLAP servers can perform data warehousing functions.
OLAP engines build relational cubes that provide the ability to perform
multidimensional analysis on a given data set. They are completely
inadequate for many tasks commonly associated with data
warehouses, such as historical archiving.
2. OLAP engines can cleanse and manipulate data being loaded.
OLAP servers focus on providing multidimensional analysis. Most
available products emphasize the OLAP functionality and leave the
data preparation to the user.
3. OLAP engines store the data in a format open to other tools.
There is nothing "open" about OLAP data stores.
In order to perform effective roll-up, drill-down, or data pivoting,
OLAP servers store their cubes in proprietary formats, if not
proprietary file managers. If other application tools have access to
that data, it is simply because they have written custom drivers to
accommodate the format.
