Lect 5 Data Warehousing I_240924_033406

Uploaded by

lilyshaa04
Copyright
© © All Rights Reserved
Data Warehousing - I

3 Credit Lecture Note 05


Chapter Contents

q What is Data Warehousing?

q DWH Architecture
What is a Data Warehouse?

q A data warehouse is a central repository and data management system that collects and
manages data from various sources. It is designed to enable and support business intelligence
(BI) activities, especially analytics, and it helps the organization make decisions.
ü A data warehouse centralizes and consolidates large amounts of data from
multiple sources.
ü It stores current and historical data in one place.
ü Data in the data warehouse must have strong analytical characteristics.
ü To be suitable for analysis, data must be subject-oriented, integrated,
time-referenced (time-variant), and non-volatile.
ü It supports information processing by providing a solid platform
of consolidated, historical data for analysis.
q A traditional database is designed for transaction processing, whereas a data warehouse
is a relational database designed for query and analysis.
q Data warehousing is a collection of methods, techniques, and tools that support knowledge
workers in conducting data analyses that aid decision-making and improve
information resources.
Features of Data Warehouses
q Subject Oriented −
• A data warehouse is subject oriented because it provides information around
a subject rather than the organization's ongoing operations.
• These subjects can be products, customers, suppliers, sales, revenue, etc.
• A data warehouse does not focus on ongoing operations; rather, it focuses
on modelling and analysing data for decision making.

q Integrated −
• A data warehouse is constructed by integrating data from heterogeneous
sources, such as relational databases and flat files, into one consistent
database.
• Data cleaning and data integration techniques are applied when data is
moved to the warehouse.
• This ensures consistency in naming conventions, encoding structures, attribute
measures, etc. among the different data sources.
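
The integration step described above can be illustrated with a minimal sketch. The two source systems, their field names, the gender encoding, and the unit choices are all hypothetical, chosen only to show naming conventions, encodings, and attribute measures being unified.

```python
# Hypothetical rows from two sources that disagree on names, codes, and units.
crm_rows = [{"cust_id": 1, "gender": "M", "height_cm": 180.0}]
billing_rows = [{"customerId": 2, "sex": 0, "height_in": 65.0}]

# Assumed encoding used by the billing system (0 = female, 1 = male).
GENDER_FROM_BILLING = {0: "F", 1: "M"}
CM_PER_INCH = 2.54


def from_crm(row):
    # CRM already uses the target encoding and unit; only the key name changes.
    return {
        "customer_id": row["cust_id"],
        "gender": row["gender"],
        "height_cm": row["height_cm"],
    }


def from_billing(row):
    # Unify the naming convention, the gender encoding, and the unit of measure.
    return {
        "customer_id": row["customerId"],
        "gender": GENDER_FROM_BILLING[row["sex"]],
        "height_cm": round(row["height_in"] * CM_PER_INCH, 1),
    }


# One consistent set of records, regardless of the originating system.
integrated = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

After this step every record has the same field names, the same gender codes, and heights in the same unit.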
q Time Variant −
• The data collected in a data warehouse is identified with a particular time period.
• The data in a data warehouse provides information from the historical point of
view.
• Every key structure in the data warehouse contains an element of time,
explicitly or implicitly.
q Non-volatile −
• Non-volatile means that previous data is not erased when new data is added.
• A data warehouse is kept separate from the operational database, so
frequent changes in the operational database are not reflected in the data warehouse.
• Data is never deleted from a data warehouse, and updates are normally carried
out while the warehouse is offline. A data warehouse can therefore be
viewed essentially as a read-only database.
• A data warehouse does not require transaction processing, recovery, or
concurrency control mechanisms.
• It requires only two data-access operations: the initial loading of data and
the access (reading) of data.
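
As a rough illustration of the time-variant and non-volatile properties, the toy store below supports only the two operations named above (load and read) and stamps every row with the period it describes. The class name, field names, and sample prices are assumptions made for this sketch.

```python
from datetime import date


class AppendOnlyWarehouse:
    """Toy store with only two operations: load and read."""

    def __init__(self):
        self._rows = []  # rows are only ever appended, never updated or deleted

    def load(self, snapshot_date, records):
        # Time variant: every record is stamped with the period it describes.
        for record in records:
            self._rows.append({**record, "snapshot_date": snapshot_date})

    def read(self, as_of):
        # Return all rows loaded for periods up to and including `as_of`.
        return [r for r in self._rows if r["snapshot_date"] <= as_of]


wh = AppendOnlyWarehouse()
wh.load(date(2024, 1, 31), [{"product": "pen", "price": 1.00}])
# A new price arrives later; the January row is kept, not overwritten.
wh.load(date(2024, 2, 29), [{"product": "pen", "price": 1.20}])

history = wh.read(as_of=date(2024, 2, 29))
```

Because nothing is overwritten, `history` holds both prices, which is what makes analysis from a historical point of view possible.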
Data Warehouses vs Operational Database Systems - Functional Point of View
Key | Data Warehouse | Operational Database
Basic | A repository for structured, filtered data that has already been processed for a specific purpose | A database where data changes frequently
Data Structure | Denormalized schema | Normalized schema
Transaction Optimization | Optimized for bulk loads and large, complex, unpredictable queries | Optimized for a common and known set of transactions
Performance | Fast for analysis queries | Slow for analytics queries
Type of Data | Focuses on historical data | Focuses on current transactional data
Use Case | Online analytical processing (OLAP) | Online transaction processing (OLTP)
Data Updates | Batch updates | Continuous updates
Query Handling | Usually very complex queries | Simple to complex queries
Types of Data Warehouse Applications
ü Information processing, analytical processing, and data mining are the three
types of data warehouse applications that are discussed below −

q Information Processing − A data warehouse allows the data stored in it to be processed.
The data can be processed by means of querying, basic statistical analysis, and reporting
using crosstabs, tables, charts, or graphs.

q Analytical Processing − A data warehouse supports analytical processing of the
information stored in it. The data can be analyzed by means of basic OLAP
operations, including slice-and-dice, drill down, drill up, and pivoting.

q Data Mining − Data mining supports knowledge discovery by finding hidden
patterns and associations, constructing analytical models, and performing classification
and prediction. The mining results can be presented using visualization tools.
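
The basic OLAP operations named under Analytical Processing can be sketched in plain Python. This is only an illustration over a handful of made-up sales rows; the function names and data are assumptions, and a real OLAP server would do this at far larger scale.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, product, units)
FIELDS = ("year", "quarter", "region", "product", "units")
sales = [
    (2023, "Q1", "North", "pen", 10),
    (2023, "Q1", "South", "pen", 5),
    (2023, "Q2", "North", "ink", 7),
    (2024, "Q1", "North", "pen", 12),
    (2024, "Q1", "South", "ink", 3),
]
rows = [dict(zip(FIELDS, s)) for s in sales]


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose value lies in the allowed set of each named dimension."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def aggregate(rows, dims):
    """Sum units per combination of `dims`.

    Passing fewer dimensions rolls up (drill up); passing more drills down.
    """
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["units"]
    return dict(totals)


def pivot(rows, row_dim, col_dim):
    """Pivot: cross-tabulate units by two dimensions."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r["units"]
    return {k: dict(v) for k, v in table.items()}


by_year = aggregate(rows, ["year"])                     # drilled up
by_year_quarter = aggregate(rows, ["year", "quarter"])  # drilled down
north_pens = dice(rows, region={"North"}, product={"pen"})
crosstab = pivot(rows, "region", "product")
```

Here `by_year` is the rolled-up view, `by_year_quarter` the drilled-down view of the same facts, and `crosstab` is the kind of table a crosstab report would show.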
Applications of Data Warehousing
Sector | Usage
Airline | Supports airline system management operations such as crew assignment, route analysis, and frequent-flyer discount schemes for passengers.
Banking | Used in the banking sector to manage the resources available on the desk effectively.
Healthcare | Used to strategize and predict outcomes and to create patients' treatment reports. Data warehouse systems enabled with advanced machine learning and big data can predict ailments.
Insurance | Used to analyze data patterns and customer trends and to track market movements quickly.
Retail chain | Helps track items, identify customers' buying patterns, plan promotions, and determine pricing policy.
Telecommunication | Used for product promotions, sales decisions, and distribution decisions.
Database vs Data Warehousing
Parameter | Database | Data Warehouse
Purpose | Designed to record data | Designed to analyze data
Processing method | Uses Online Transactional Processing (OLTP) | Uses Online Analytical Processing (OLAP)
Usage | Helps perform the fundamental operations of the business | Allows you to analyze the business
Tables and joins | Tables and joins are complex because they are normalized | Tables and joins are simple because they are denormalized
Orientation | An application-oriented collection of data | A subject-oriented collection of data
Storage limit | Generally limited to a single application | Stores data from any number of applications
Availability | Data is available in real time | Data is refreshed from source systems as and when needed
Design technique | ER modeling techniques are used for design | Data modeling techniques are used for design
Technique | Captures data | Analyzes data
Data type | Stored data is up to date | Stores current and historical data, which may not be up to date
Storage of data | A flat relational approach is used for data storage | Uses a dimensional and normalized approach for the data structure, e.g. star and snowflake schemas
Query type | Simple transaction queries | Complex queries for analysis purposes
Data summary | Stores detailed data | Stores highly summarized data
Types of Data Warehouse Architectures
Single-Tier Architecture
ü In the single-tier architecture, only the source layer is physical. The data warehouse
layer is virtual and provides data in a multidimensional view, created by an
intermediate processing layer.
ü The single-tier data warehouse architecture reduces the amount of data stored by
building a more compact data set.
ü Its purpose is to minimize the amount of data stored; to reach this goal, it removes data
redundancies.
ü Analysis queries are submitted to operational data after the middleware interprets them. In
this way, queries affect transactional workloads.

ü The single-tier architecture has three layers:
1. A source layer
2. A data warehouse layer
3. An analysis layer
ü One drawback of the single-tier architecture is the lack of separation between
analytical and transactional processing, which is why this type of data
warehouse architecture is not used frequently.

Two-Tier Architecture
ü Unlike the single-tier architecture, the two-tier architecture contains a data
staging area that ensures any data you load into the warehouse is cleansed and in
the right format.
ü It sits between the source layer and the data warehouse layer, as depicted in
the figure below.

[Figure: two-tier architecture]
ü Most businesses that use data marts as a server use the two-tier data
warehouse architecture, which is made up of two tiers:
1. The Data Tier
ü This is where the actual data is stored after ETL processes have
loaded it into the data warehouse.
ü It is made up of three layers:
1. A source layer, 2. A data staging layer, 3. A data warehouse layer
2. The Client Tier
ü This is where clients use the data stored in the data warehouse to generate
insights for making informed, data-driven decisions. You can modify or
transform this layer based on the data trends that you discover from your analysis
reports.
ü It is made up of a single layer: an analysis layer.
ü Disadvantages of the two-tier architecture are that it is not scalable, has
network limitations, and supports only a small number of users.

Three-Tier Architecture
ü The three-tier architecture is what most organisations go for when building a data
warehouse system. It solves the connectivity problems that the two-tier architecture
commonly faces.
ü The three-tier architecture is made up of:
1. A source layer, 2. A reconciled layer, 3. A data warehouse layer
ü The reconciled layer materializes operational data obtained after integrating and
cleansing source data. As a result, those data are integrated, consistent, correct,
current, and detailed.
ü The main advantage of the reconciled data layer is that it creates a common
reference data model for the whole enterprise. At the same time, it sharply separates
the problems of source data extraction and integration from those of data warehouse
population.
ü However, reconciled data leads to more redundancy of operational source data.
ü The three-tier architecture is useful for extensive, enterprise-wide systems. Its
disadvantage is the additional storage space used by the redundant
reconciled layer.
Three-Tier Architecture
1. Bottom Tier
ü The bottom tier, also called the data warehouse layer, is where data
is extracted, transformed and loaded into the data repository using backend tools.

2. Middle Tier
ü The middle tier is responsible for arranging data into a more suitable structure for
complex querying and analysis.
ü This is done with an Online Analytical Processing (OLAP) server, which is
implemented using one of two models:
ü The Relational OLAP (ROLAP) model, which maps operations on multidimensional
data to standard relational operations.
ü The Multidimensional OLAP (MOLAP) model, which directly implements
multidimensional data and operations.

3. Top Tier
ü The top-tier is basically the front-end layer that houses various tools and
APIs (Application Programming Interfaces) you can use for high-level data analysis,
querying, reporting and data mining.
ü It is where end users can access, interact with, and extract data from the warehouse.
Three-tier Data Warehousing Architecture

Metadata
Metadata, in simple words, is “data about data”. Its function is to describe the
structure of data in a warehouse and how it’s related to other data in the warehouse.
Extraction, Transformation, and Loading (ETL)
q Data extraction
Ø get data from multiple, heterogeneous, and external sources
q Data cleaning
Ø detect errors in the data and rectify them when possible
q Data transformation
Ø convert data from legacy or host format to warehouse format
q Load
Ø sort, summarize, consolidate, compute views, check integrity, and build
indices and partitions
q Refresh
Ø propagate the updates from the data sources to the warehouse

Data Staging & ETL
ü The data staging layer hosts the ETL processes that extract, integrate, and clean
data from operational sources to feed the data warehouse layer.
ü ETL takes place once when a data warehouse is populated for the first time, and then
again every time the data warehouse is regularly updated.

q Extraction:
ü This phase gathers data from multiple, heterogeneous, and
external sources.
ü Static extraction is used when a data warehouse needs to be
populated for the first time.
ü Incremental extraction is used to update data warehouses regularly.
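
The difference between static and incremental extraction can be sketched as follows. The `updated_at` column and the watermark approach are assumptions for illustration; real sources may instead expose change logs, triggers, or timestamps of a different kind.

```python
# Hypothetical operational table; "updated_at" is an assumed change-tracking column.
source = [
    {"id": 1, "amount": 100, "updated_at": 1},
    {"id": 2, "amount": 250, "updated_at": 2},
]


def static_extract(table):
    """Static extraction: a full snapshot, used for the first population."""
    return list(table)


def incremental_extract(table, last_extracted_at):
    """Incremental extraction: only rows changed since the previous run."""
    return [row for row in table if row["updated_at"] > last_extracted_at]


# First population of the warehouse.
initial_load = static_extract(source)
watermark = max(row["updated_at"] for row in initial_load)

# Later, the operational system changes one row and adds another.
source[0] = {**source[0], "amount": 120, "updated_at": 6}
source.append({"id": 3, "amount": 75, "updated_at": 5})

# The regular update only picks up what changed after the watermark.
delta = incremental_extract(source, watermark)
```

Only rows 1 and 3 appear in `delta`; row 2 is untouched and is not extracted again.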
q Cleansing:
ü This phase detects errors in the data and rectifies them when possible
ü The most frequent errors and inconsistencies that make data unclean:
1. Duplicate data
2. Missing data
3. Impossible or Wrong values
4. Inconsistent values
q Transformation
ü This phase converts data from its operational source format (legacy or host
format) into a specific data warehouse format.
ü It is the core of the reconciliation phase.
ü The main transformation processes are:
ü Conversion and normalization, which operate on both storage formats
and units of measure to make data uniform
ü Matching, which associates equivalent fields in different sources
ü Selection, which reduces the number of source fields and records
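
A small sketch of the cleansing and transformation phases described above. The sample rows, the country alias table, the age range of 0 to 130, and the target field names are all assumptions made for this example, not rules from the lecture.

```python
KG_PER_LB = 0.45359237
# Assumed mapping of spelling variants to one code (fixes inconsistent values).
COUNTRY_ALIASES = {"USA": "US", "U.S.A.": "US", "United States": "US"}

raw = [
    {"id": 1, "country": "USA", "age": 34, "weight_lb": 150.0, "fax": "n/a"},
    {"id": 1, "country": "USA", "age": 34, "weight_lb": 150.0, "fax": "n/a"},  # duplicate
    {"id": 2, "country": "United States", "age": 51, "weight_lb": 200.0, "fax": "n/a"},
    {"id": 3, "country": "USA", "age": None, "weight_lb": 170.0, "fax": "n/a"},  # missing
    {"id": 4, "country": "U.S.A.", "age": 240, "weight_lb": 180.0, "fax": "n/a"},  # impossible
]


def cleanse(rows):
    seen_ids, clean, rejected = set(), [], []
    for row in rows:
        if row["id"] in seen_ids:  # duplicate data: skip repeats
            continue
        seen_ids.add(row["id"])
        age = row["age"]
        if age is None or not (0 <= age <= 130):  # missing or impossible value
            rejected.append(row)  # cannot be rectified here, so set aside
            continue
        country = COUNTRY_ALIASES.get(row["country"], row["country"])
        clean.append({**row, "country": country})
    return clean, rejected


def transform(row):
    # Selection: the "fax" field is deliberately not carried over.
    return {
        "customer_id": row["id"],  # matching: source field to warehouse field
        "country": row["country"],
        "age": row["age"],
        "weight_kg": round(row["weight_lb"] * KG_PER_LB, 2),  # unit conversion
    }


clean, rejected = cleanse(raw)
warehouse_rows = [transform(r) for r in clean]
```

Rows that cannot be rectified are set aside in `rejected` rather than silently dropped, so they can be inspected later.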

q Loading
ü This phase sorts and consolidates the data, checks integrity, and builds indices and partitions.
ü It can be carried out in two ways:
ü Refresh: Data warehouse data is completely rewritten, so
older data is replaced. Refresh is normally used in combination with static
extraction to initially populate a data warehouse.
ü Update: Only the changes applied to source data are added to the data
warehouse. Update is typically carried out without deleting or modifying
preexisting data. This technique is used in combination with incremental
extraction to update data warehouses regularly.
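
The two loading modes can be contrasted in a few lines. This is a sketch under the assumption that the warehouse is a simple list of rows and that each row is stamped with a hypothetical `loaded_at` value; it ignores sorting, integrity checks, and index building.

```python
def refresh(warehouse, snapshot, load_time):
    """Refresh: rewrite the warehouse completely from a full (static) extract."""
    warehouse.clear()
    warehouse.extend({**row, "loaded_at": load_time} for row in snapshot)


def update(warehouse, delta, load_time):
    """Update: append only the changes; existing rows are not modified or deleted."""
    for row in delta:
        warehouse.append({**row, "loaded_at": load_time})


warehouse = []
# Initial population from a static extraction.
refresh(warehouse, [{"id": 1, "amount": 100}], load_time=1)
# Regular update from an incremental extraction.
update(warehouse, [{"id": 1, "amount": 120}, {"id": 2, "amount": 50}], load_time=2)

history_of_id_1 = [r["amount"] for r in warehouse if r["id"] == 1]
```

After the update, both versions of row 1 are present, which matches the statement that an update does not delete or modify preexisting data.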

Types of Data Warehouse Models
Enterprise Warehouse
• An enterprise warehouse collects all of the information about subjects
spanning the entire organization.
• It supports corporate-wide data integration, usually from one or more
operational systems or external data providers, and it is cross-functional
in scope.
• It generally contains both detailed and summarized
data and can range in size from a few gigabytes to hundreds
of gigabytes, terabytes, or beyond.
• An enterprise data warehouse may be implemented on traditional
mainframes, UNIX super servers, or parallel architecture platforms.
• It requires extensive business modeling and may take years to develop
and build.

Data Marts
• A data mart contains a subset of organization-wide data that is important to a specific
group within an organization.
• The scope is limited to specific selected subjects, e.g. a marketing data mart may limit
its subjects to customers, goods, and sales.
• Data marts focus on one area and draw data from a limited number of
sources.
• A data mart takes much less time to build than a
data warehouse.
• A data mart improves user response time because it holds a smaller volume of data.
• The data contained in data marts is summarized, small in size, and flexible.

• Data marts are divided into three types:
• Independent data mart: sourced from data captured
from one or more operational systems or external data providers, or from data
generated locally within a particular department or geographic area.
• Dependent data mart: sourced directly from
enterprise data warehouses.
• Hybrid data mart: can take data from either a data warehouse or operational systems.

Advantages of Data Mart
• Improve end-user response time
• Lower implementation cost
• Fast, easy access to data
• Frequently requested data is provided quickly to the end user
• A data mart stores data for only a single subject area

Virtual Warehouse
• A virtual warehouse is a set of views over operational
databases.
• For efficient query processing, only some of the possible summary
views may be materialized (physically stored).
• A virtual warehouse is easy to build, but it requires additional
capacity on operational database servers.
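
A virtual warehouse can be imitated with an ordinary database view. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `orders` table and the `sales_by_product` view are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational table (hypothetical schema).
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        product  TEXT,
        region   TEXT,
        amount   REAL
    );
    INSERT INTO orders (product, region, amount) VALUES
        ('pen', 'North', 10.0),
        ('pen', 'South', 5.0),
        ('ink', 'North', 7.5);

    -- The "virtual warehouse": a summary view computed on demand.
    CREATE VIEW sales_by_product AS
        SELECT product, SUM(amount) AS total_amount
        FROM orders
        GROUP BY product;
    """
)

query = "SELECT product, total_amount FROM sales_by_product ORDER BY product"
summary = conn.execute(query).fetchall()

# A new operational row is visible through the view immediately,
# because nothing is stored separately.
conn.execute("INSERT INTO orders (product, region, amount) VALUES ('ink', 'South', 2.5)")
refreshed = conn.execute(query).fetchall()
```

Every query against the view is executed on the operational table itself, which is why this approach needs spare capacity on the operational server.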

Multidimensional Model
• Data is divided into Dimensions and Facts
• Dimensions
• Dimensions are the perspectives or entities with respect to which an
organization wants to keep records
• Ex: Sales data warehouse may keep records of the store’s sales w.r.t. the
dimensions time, item, branch and location
• Each dimension may have a table associated with it, called a dimension
table
• Facts
• Facts are numerical measures which are used to analyse the relationship
between dimensions
• Ex: Facts for a Sales data warehouse include dollars_sold, units_sold
• The fact table contains the names of the facts or measures as well as keys
to each of the related dimension tables
• Dimensions describe facts
• Facts have measures that can be aggregated, e.g. sales price
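
To make the fact and dimension tables concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The measures `dollars_sold` and `units_sold` come from the example above; the table names, key columns, and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the context of each fact.
    CREATE TABLE dim_item   (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);

    -- The fact table holds the measures plus a key to each dimension table.
    CREATE TABLE fact_sales (
        item_key     INTEGER REFERENCES dim_item(item_key),
        branch_key   INTEGER REFERENCES dim_branch(branch_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );

    INSERT INTO dim_item   VALUES (1, 'pen', 'Acme'), (2, 'ink', 'Acme');
    INSERT INTO dim_branch VALUES (1, 'Downtown'), (2, 'Airport');
    INSERT INTO fact_sales VALUES (1, 1, 20.0, 10), (1, 2, 10.0, 5), (2, 1, 15.0, 3);
    """
)

# Aggregate the measures, using a dimension table to label the result.
sales_per_item = conn.execute(
    """
    SELECT i.item_name, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM fact_sales AS f
    JOIN dim_item AS i ON i.item_key = f.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
    """
).fetchall()
```

The fact table stays narrow (keys and numbers only), while descriptive attributes such as `item_name` live in the dimension tables.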

• Goal for dimensional modeling:
• Surround facts with as much context (dimensions) as possible
• A data warehouse is based on a multidimensional data model which views data in the
form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)
• Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables

• ER models describe entities and relationships, whereas dimensional models describe
measures and dimensions.
• Each dimension is associated with a hierarchy of aggregation levels, called a roll-up
hierarchy.

• Hierarchies consist of levels called dimensional attributes
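
A roll-up hierarchy can be sketched for the time dimension, using the levels day, month, quarter, and year. The daily figures below are made up, and the string keys used for each level are just one possible convention.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily facts: (day, units_sold)
daily = [
    (date(2024, 1, 15), 4),
    (date(2024, 2, 3), 6),
    (date(2024, 4, 20), 5),
    (date(2025, 1, 2), 1),
]

# Each level of the hierarchy maps a day to a coarser key.
LEVELS = {
    "day": lambda d: d.isoformat(),
    "month": lambda d: f"{d.year}-{d.month:02d}",
    "quarter": lambda d: f"{d.year}-Q{(d.month - 1) // 3 + 1}",
    "year": lambda d: str(d.year),
}


def roll_up(facts, level):
    """Aggregate the measure at the chosen level of the time hierarchy."""
    totals = defaultdict(int)
    for day, units in facts:
        totals[LEVELS[level](day)] += units
    return dict(totals)


by_quarter = roll_up(daily, "quarter")
by_year = roll_up(daily, "year")
```

Moving from `"day"` towards `"year"` is a roll-up; moving in the other direction is a drill-down over the same facts.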

• Sales volume as a function of product, month, and region
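
Read literally, the statement above says that each cell of the cube is addressed by a (product, month, region) triple. A minimal sketch with a dictionary as the cube; the cell values are invented and empty cells are assumed to mean zero sales.

```python
AXES = ("product", "month", "region")

# cube[(product, month, region)] = sales volume (hypothetical values)
cube = {
    ("pen", "Jan", "North"): 10,
    ("pen", "Jan", "South"): 5,
    ("pen", "Feb", "North"): 8,
    ("ink", "Jan", "North"): 7,
}


def sales_volume(product, month, region):
    """Look up one cell of the cube; empty cells count as 0."""
    return cube.get((product, month, region), 0)


def marginal(axis):
    """Collapse the other two axes by summing over them."""
    i = AXES.index(axis)
    totals = {}
    for key, volume in cube.items():
        totals[key[i]] = totals.get(key[i], 0) + volume
    return totals


per_product = marginal("product")
per_month = marginal("month")
```

Summing along two axes, as `marginal` does, gives the totals along the remaining edge of the cube.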


• Each cube axis shows a possible analysis dimension


• Each dimension can be analyzed at different detail levels specified by hierarchically structured
attributes

Meta Data
• Metadata specifies the source, values, usage, and features of data warehouse data and defines
how data can be changed and processed at every architecture layer.
• Applications use it intensively to carry out data-staging and analysis tasks.
• In a data warehouse, metadata is used for building, maintaining, managing, and using
the warehouse. Metadata helps users understand the content of the data warehouse and
find and access data in it.
• Metadata includes the following:
• The location and descriptions of warehouse systems and components
• Names, definitions, structures, and content of the data warehouse and end-user
views
• Integration and transformation rules used to populate the data warehouse and to deliver
information to end-user analytical tools
• Metrics used to analyze warehouse usage and performance
• Security authorizations, access control lists, etc.

• Internal Meta-Data
• It defines sources, transformation processes, population policies, logical and
physical schemas, constraints, and user profiles.
• It is of interest to system administrators.

• External Meta-Data
• It covers definitions, quality standards, units of measure, and relevant
aggregations.
• It is relevant to end users.

Recommended Text and Reference Books
q Text Book:
Ø J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 3rd ed., 2011
q Reference Books:
Ø M. H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson
Education, 2006.
Ø I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools
and Techniques. Morgan Kaufmann, 2000.
Ø D. Hand, H. Mannila and P. Smyth. Principles of Data Mining. Prentice-Hall,
2001.
