Lect 5 Data Warehousing I_240924_033406
• DWH Architecture
What is Data Warehouse
• Integrated
  • A data warehouse is constructed by integrating data from heterogeneous
    sources, such as relational databases and flat files, into one consistent
    database.
  • Data cleaning and data integration techniques are applied when data is
    moved to the warehouse.
  • Integration ensures consistency in naming conventions, encoding structures,
    attribute measures, etc. among the different data sources.
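The integration step can be illustrated with a small sketch. It assumes two hypothetical sources (a CRM export and a billing export) that name and encode the same attributes differently; every field name, code table, and value here is invented for illustration and is not taken from the lecture.

```python
# Hypothetical rows from two heterogeneous sources (all names are illustrative).
crm_rows = [{"cust_id": 1, "gender": "M", "height_cm": 180.0}]
billing_rows = [{"CustomerID": 2, "sex": "female", "height_in": 65.0}]

# One encoding for the warehouse, whatever the source used.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_crm(row):
    """Map a CRM row onto the (assumed) warehouse schema."""
    return {
        "customer_id": row["cust_id"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "height_cm": row["height_cm"],
    }


def from_billing(row):
    """Rename attributes, recode gender, and convert inches to centimetres."""
    return {
        "customer_id": row["CustomerID"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "height_cm": round(row["height_in"] * 2.54, 1),
    }


# After integration, every row follows one naming, encoding, and unit convention.
warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

Each source gets its own mapping function, so a new source only needs one more adapter onto the shared schema.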
Features of Data Warehouses
• Time Variant
  • The data collected in a data warehouse is identified with a particular time period.
  • The data in a data warehouse provides information from a historical point of
    view.
  • Every key structure in the data warehouse contains an element of time,
    explicitly or implicitly.
• Nonvolatile
  • Nonvolatile means that previous data is not erased when new data is added.
  • A data warehouse is kept separate from the operational database, so
    frequent changes in the operational database are not reflected in the data warehouse.
  • Data is never deleted from a data warehouse, and updates are normally carried
    out while the warehouse is offline. A data warehouse can therefore be
    viewed essentially as a read-only database.
  • A data warehouse does not require transaction processing, recovery, or
    concurrency control mechanisms.
  • It requires only two data access operations: the initial loading of data and
    access to data.
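The time-variant and nonvolatile properties can be sketched together as a toy store whose keys carry a time element and which offers only the two operations named above, loading and reading. The class name, method names, and sample records are assumptions made for this sketch; a real warehouse would be far more involved.

```python
from datetime import date


class AppendOnlyWarehouse:
    """Toy store with only two operations: load a snapshot and read history."""

    def __init__(self):
        # Key = (business key, snapshot date): every key includes a time element.
        self._rows = {}

    def load(self, business_key, snapshot_date, record):
        key = (business_key, snapshot_date)
        if key in self._rows:
            # Nonvolatile: an existing snapshot is never overwritten.
            raise ValueError(f"snapshot {key} is already loaded")
        self._rows[key] = dict(record)

    def history(self, business_key):
        """Return (date, record) pairs for one entity, oldest first."""
        snapshots = [
            (d, rec) for (k, d), rec in self._rows.items() if k == business_key
        ]
        return sorted(snapshots, key=lambda pair: pair[0])


wh = AppendOnlyWarehouse()
wh.load("cust-1", date(2024, 1, 31), {"city": "Lahore"})
wh.load("cust-1", date(2024, 2, 29), {"city": "Karachi"})
```

Calling `wh.history("cust-1")` returns both snapshots, so the earlier city is still available after the later one is loaded.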
Data Warehouses vs. Operational Database Systems: Functional Point of View

Key                      | Data warehouse                              | Operational database
-------------------------+---------------------------------------------+---------------------------------------------
Basic                    | A repository for structured, filtered data  | A database in which data changes frequently
                         | that has already been processed for a       |
                         | specific purpose                            |
Transaction optimization | Optimized for bulk loads and large,         | Optimized for a common and known set of
                         | complex, unpredictable queries              | transactions
Use case                 | Online analytical processing (OLAP)         | Online transaction processing (OLTP)
Data updates             | Batch updates                               | Continuous updates
• Information Processing: A data warehouse allows the data stored in it to be processed
  by means of querying, basic statistical analysis, and reporting using crosstabs,
  tables, charts, or graphs.
Single-Tier Architecture
• In the single-tier architecture, only the source layer is physical. The data warehouse
  layer is virtual and provides data in a multidimensional view, created by an
  intermediate processing layer (middleware).
• The goal of the single-tier architecture is to minimize the amount of data stored;
  to reach this goal, it removes data redundancies and builds a more compact data set.
• Analysis queries are submitted to operational data after the middleware interprets them.
  As a result, analysis queries affect transactional workloads.
• The single-tier architecture has three layers:
  1. A source layer
  2. A data warehouse layer
  3. An analysis layer
• One drawback of the single-tier architecture is the lack of separation between
  analytical and transactional processing, which is why this type of data
  warehouse architecture is not used frequently.
Two-Tier Architecture
• Unlike the single-tier architecture, the two-tier architecture contains a data
  staging area, which ensures that any data loaded into the warehouse is cleansed and in
  the right format.
• The staging area sits between the source layer and the data warehouse layer, as depicted in
  the image below.
[Figure: two-tier architecture]
• Most businesses that use data marts as a server make use of the two-tier data
  warehouse architecture, which is made up of two tiers:
1. The Data Tier
  • This is the tier where the actual data is stored after ETL processes have
    loaded it into the data warehouse.
  • It is made up of three layers: a source layer, a data staging layer, and a
    data warehouse layer.
2. The Client Tier
  • This tier is where clients use the data stored in the data warehouse to generate
    insights for making informed, data-driven decisions. You can modify or
    transform this tier based on the data trends that you discover from your analysis
    reports.
  • It is made up of a single layer: an analysis layer.
• Disadvantages of the two-tier architecture are that it is not scalable, has
  network limitations, and supports only a small number of users.
Three-Tier Architecture
• The three-tier architecture is what most organisations choose when building a data
  warehouse system. It solves the connectivity problems that the two-tier architecture
  commonly faces.
• The three-tier architecture is made up of:
  1. A source layer
  2. A reconciled layer
  3. A data warehouse layer
• The reconciled layer materializes the operational data obtained after integrating and
  cleansing source data. As a result, those data are integrated, consistent, correct,
  current, and detailed.
• The main advantage of the reconciled layer is that it creates a common
  reference data model for the whole enterprise. At the same time, it sharply separates
  the problems of source data extraction and integration from those of data warehouse
  population.
• The three-tier architecture is useful for extensive, enterprise-wide systems. Its
  disadvantage is the additional storage space used by the reconciled layer, which
  duplicates operational source data.
1. Bottom Tier
  • The bottom tier, also called the data warehouse layer, is where data
    is extracted, transformed, and loaded into the data repository using back-end tools.
2. Middle Tier
  • The middle tier is responsible for arranging data into a structure more suitable for
    complex querying and analysis.
  • This is done with an Online Analytical Processing (OLAP) server, which is
    implemented using one of two models:
    • The Relational OLAP (ROLAP) model, which maps operations on multidimensional
      data to standard relational operations.
    • The Multidimensional OLAP (MOLAP) model, which directly implements
      multidimensional data and operations.
3. Top Tier
  • The top tier is the front-end layer that houses the tools and
    APIs (Application Programming Interfaces) you can use for high-level data analysis,
    querying, reporting, and data mining.
  • It is where end users access, interact with, and extract data from the warehouse.
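The two OLAP models named for the middle tier can be contrasted with a minimal sketch. It assumes a tiny made-up sales data set, uses an in-memory SQLite table to stand in for the relational engine, and uses a nested list to stand in for a multidimensional array store. Real OLAP servers are much more elaborate; the table and variable names are illustrative only.

```python
import sqlite3

# (item, region, units_sold): made-up rows for illustration.
sales = [
    ("pen", "north", 10),
    ("pen", "south", 5),
    ("ink", "north", 7),
]

# ROLAP style: the multidimensional request becomes a relational GROUP BY.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (item TEXT, region TEXT, units_sold INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales)
rolap_by_item = dict(
    con.execute("SELECT item, SUM(units_sold) FROM sales GROUP BY item")
)

# MOLAP style: cells live in an array addressed by dimension positions.
items = ["ink", "pen"]
regions = ["north", "south"]
cube = [[0] * len(regions) for _ in items]
for item, region, units in sales:
    cube[items.index(item)][regions.index(region)] += units
molap_by_item = {item: sum(cube[i]) for i, item in enumerate(items)}
```

Both paths answer the same question (units sold per item); they differ in whether the engine works on relational tables or on a multidimensional structure.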
Three-tier Data Warehousing Architecture
Metadata
Metadata, in simple words, is “data about data”. Its function is to describe the
structure of data in a warehouse and how it’s related to other data in the warehouse.
Extraction, Transformation, and Loading (ETL)
• Data extraction
  • Get data from multiple, heterogeneous, and external sources.
• Data cleaning
  • Detect errors in the data and rectify them when possible.
• Data transformation
  • Convert data from legacy or host format to warehouse format.
• Load
  • Sort, summarize, consolidate, compute views, check integrity, and build
    indices and partitions.
• Refresh
  • Propagate the updates from the data sources to the warehouse.
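The extraction, cleaning, transformation, and load steps listed above can be sketched as a chain of small functions. This is only a sketch under stated assumptions: the row layout (`name`, `amount`), the cleaning rule, and the target format (upper-case names, integer cents) are all hypothetical, and the refresh step is left out.

```python
def extract(sources):
    """Gather raw rows from several sources into one list."""
    return [row for source in sources for row in source]


def clean(rows):
    """Drop rows with a missing amount and strip stray whitespace from names."""
    return [
        {**row, "name": row["name"].strip()}
        for row in rows
        if row.get("amount") is not None
    ]


def transform(rows):
    """Convert to the assumed warehouse format: upper-case names, integer cents."""
    return [
        {"name": r["name"].upper(), "amount_cents": round(float(r["amount"]) * 100)}
        for r in rows
    ]


def load(warehouse, rows):
    """Sort the rows, then consolidate them into one total per name."""
    for r in sorted(rows, key=lambda r: r["name"]):
        warehouse[r["name"]] = warehouse.get(r["name"], 0) + r["amount_cents"]
    return warehouse


# Two hypothetical sources with inconsistent formats and one unusable row.
source_a = [{"name": " alice ", "amount": "12.50"}, {"name": "bob", "amount": None}]
source_b = [{"name": "Alice", "amount": 7.5}]
warehouse = load({}, transform(clean(extract([source_a, source_b]))))
```

Keeping each phase as its own function mirrors the list above and lets any one step be replaced without touching the others.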
Data Staging & ETL
• The data staging layer hosts the ETL processes that extract, integrate, and clean
  data from operational sources to feed the data warehouse layer.
• ETL takes place once when a data warehouse is populated for the first time, and
  then again every time the data warehouse is updated.
• Loading
  • This phase sorts, consolidates, checks integrity, and builds indices and partitions.
  • It can be carried out in two ways:
    • Refresh: The data warehouse is completely rewritten, which means that
      older data is replaced. Refresh is normally used in combination with static
      extraction to populate a data warehouse initially.
    • Update: Only the changes applied to source data are added to the data
      warehouse. Update is typically carried out without deleting or modifying
      preexisting data. This technique is used in combination with incremental
      extraction to update data warehouses regularly.
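The difference between the two loading modes can be shown with a short sketch in which the warehouse is just a Python list. The function names and sample rows are assumptions made for illustration.

```python
def refresh(warehouse, source_rows):
    """Rewrite the warehouse completely from a static (full) extraction."""
    warehouse.clear()
    warehouse.extend(source_rows)


def update(warehouse, changed_rows):
    """Append rows from an incremental extraction; existing rows are untouched."""
    warehouse.extend(changed_rows)


warehouse = []
# Initial population: refresh combined with static extraction.
refresh(warehouse, [{"id": 1, "qty": 3}, {"id": 2, "qty": 5}])
# Regular maintenance: update combined with incremental extraction.
update(warehouse, [{"id": 3, "qty": 4}])
```

After these two calls the warehouse holds rows 1 to 3; a later `refresh` would discard all of them and start again from whatever the full extraction returns.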
Types of Data Warehouse Models
Enterprise Warehouse
• An Enterprise warehouse collects all of the records about subjects
spanning the entire organization.
Data Marts
• A data mart contains a subset of organization-wide data that is important to a specific
  group within the organization.
• The scope is limited to specific selected subjects. For example, a marketing data mart may limit
  its subjects to customers, goods, and sales.
• Data marts focus on one area and draw data from a limited number of
  sources.
• The time taken to build a data mart is very low compared with the time taken to build a
  data warehouse.
• A data mart improves user response time because the volume of data is reduced.
• The data contained in data marts is summarized, small in size, and flexible.
Advantages of Data Marts
• Improved end-user response time
• Lower implementation cost
• Fast, easy access to data
• Frequently requested data is provided to the end user quickly
• A data mart stores data for a single subject area only
Virtual Warehouse
• A virtual warehouse is a set of views over an operational
  database.
• For efficient query processing, only some of the possible summary
  views may be materialized (physically stored).
• Creating a virtual warehouse is easy, but it requires additional
  capacity on the operational database servers.
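A virtual warehouse can be sketched with an ordinary SQL view. The example below uses an in-memory SQLite database as a stand-in for the operational database; the `orders` table, the `sales_by_branch` view, and the data are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, branch TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 20.0), (2, "east", 30.0), (3, "west", 15.0)],
)

# The "warehouse" is only a view: nothing is copied out of the operational table.
con.execute(
    "CREATE VIEW sales_by_branch AS "
    "SELECT branch, SUM(amount) AS total FROM orders GROUP BY branch"
)

# A new operational row is visible through the view immediately.
con.execute("INSERT INTO orders VALUES (4, 'west', 5.0)")
totals = dict(con.execute("SELECT branch, total FROM sales_by_branch"))
```

Because the view is recomputed against `orders` on every query, analytical reads compete with transactional work on the same server, which is the extra capacity cost noted above.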
Multidimensional Model
• Data is divided into dimensions and facts
• Dimensions
  • Dimensions are the perspectives or entities with respect to which an
    organization wants to keep records
  • Ex: A sales data warehouse may keep records of the store's sales with respect to the
    dimensions time, item, branch, and location
  • Each dimension may have a table associated with it, called a dimension
    table
• Facts
  • Facts are numerical measures used to analyse the relationships
    between dimensions
  • Ex: Facts for a sales data warehouse include dollars_sold and units_sold
  • The fact table contains the names of the facts (measures) as well as keys
    to each of the related dimension tables
• Dimensions describe facts
• Facts have measures that can be aggregated, e.g. sales price
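The relationship between a fact table and its dimension tables can be sketched in SQLite. The measures `dollars_sold` and `units_sold` and the dimensions item and branch come from the slide; the table names, key columns, and rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_item   (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
-- The fact table holds the measures plus one key per related dimension table.
CREATE TABLE fact_sales (
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO dim_item   VALUES (1, 'pen', 'Acme'), (2, 'ink', 'Acme');
INSERT INTO dim_branch VALUES (1, 'B1'), (2, 'B2');
INSERT INTO fact_sales VALUES (1, 1, 10.0, 5), (1, 2, 4.0, 2), (2, 1, 9.0, 3);
""")

# Dimensions describe the facts: join to a dimension, then aggregate a measure.
units_by_item = dict(con.execute("""
    SELECT d.item_name, SUM(f.units_sold)
    FROM fact_sales AS f JOIN dim_item AS d USING (item_key)
    GROUP BY d.item_name
"""))
```

The same join pattern works for any other dimension; swapping `dim_item` for `dim_branch` gives units sold per branch.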
• Goal of dimensional modeling:
  • Surround facts with as much context (dimensions) as possible
• A data warehouse is based on a multidimensional data model, which views data in the
  form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
• Dimension tables describe the dimensions, such as item (item_name, brand, type) or
  time (day, week, month, quarter, year)
• The fact table contains measures (such as dollars_sold) and keys to each of the related
  dimension tables
• Hierarchies consist of levels called dimensional attributes
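As a rough illustration of hierarchy levels, the sketch below rolls daily sales up through a time hierarchy (day, month, quarter, year). The level functions and the sample data are assumptions; the week level is left out because weeks do not nest cleanly inside months.

```python
from collections import defaultdict
from datetime import date

# Levels of a hypothetical time hierarchy, from finest to coarsest.
TIME_LEVELS = {
    "day": lambda d: d.isoformat(),
    "month": lambda d: f"{d.year}-{d.month:02d}",
    "quarter": lambda d: f"{d.year}-Q{(d.month - 1) // 3 + 1}",
    "year": lambda d: str(d.year),
}


def roll_up(daily_sales, level):
    """Aggregate (date, units) pairs to the requested level of the hierarchy."""
    totals = defaultdict(int)
    for day, units in daily_sales:
        totals[TIME_LEVELS[level](day)] += units
    return dict(totals)


daily_sales = [(date(2024, 1, 15), 3), (date(2024, 2, 1), 4), (date(2024, 4, 9), 5)]
by_quarter = roll_up(daily_sales, "quarter")
by_year = roll_up(daily_sales, "year")
```

Moving from `"day"` to `"year"` gives progressively coarser totals over the same underlying facts.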
• Sales volume as a function of product, month, and region
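Treating sales volume as a function of product, month, and region can be sketched as a small data cube. The code below aggregates a made-up measure for every subset of the three dimensions; the rows and the `build_cube` helper are illustrative assumptions, not a production cube engine.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "month", "region")
# (product, month, region, units_sold): made-up values for illustration.
rows = [
    ("pen", "Jan", "north", 10),
    ("pen", "Feb", "north", 4),
    ("ink", "Jan", "south", 6),
]


def build_cube(rows):
    """Aggregate units_sold for every subset of the dimensions."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), k):
            cells = defaultdict(int)
            for row in rows:
                # The cell address is the row's values on the chosen dimensions.
                cells[tuple(row[i] for i in dims)] += row[-1]
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(cells)
    return cube


cube = build_cube(rows)
```

`cube[("product", "month", "region")]` holds the most detailed cells, `cube[("product",)]` holds totals per product, and `cube[()]` holds the grand total.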
Meta Data
• Metadata specifies the source, values, usage, and features of data warehouse data and defines
  how data can be changed and processed at every architecture layer
• Applications use it intensively to carry out data staging and analysis tasks
• In a data warehouse, metadata is used for building, maintaining, managing, and using
  the warehouse. Metadata helps users to find, access, and understand the content of
  the data warehouse
• Metadata includes the following:
  • The location and descriptions of warehouse systems and components
  • Names, definitions, structures, and content of the data warehouse and end-user
    views
  • Integration and transformation rules used to populate data and to deliver
    information to end-user analytical tools
  • Metrics used to analyze warehouse usage and performance
  • Security authorizations, access control lists, etc.
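Some of the kinds of metadata listed above can be pictured as one catalog entry per warehouse table. The sketch below is only an assumption about how such an entry might look; the `TableMetadata` class, its fields, and the sample values are hypothetical and do not reflect any particular metadata repository.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical catalog entry describing one warehouse table."""
    name: str                    # name of the warehouse structure
    source: str                  # where the data comes from
    transformation_rule: str     # how source data is turned into this table
    columns: dict                # column name -> type (structure)
    allowed_roles: set = field(default_factory=set)  # security authorizations

    def can_read(self, role):
        """Check a role against the access control list."""
        return role in self.allowed_roles


catalog = {
    "fact_sales": TableMetadata(
        name="fact_sales",
        source="orders table in the operational database",
        transformation_rule="sum order lines per item, branch, and day",
        columns={"dollars_sold": "REAL", "units_sold": "INTEGER"},
        allowed_roles={"analyst"},
    )
}
```

With such a catalog, a tool can look up `catalog["fact_sales"]` to learn a table's origin and structure and to check access before running a query.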
• Internal Metadata
  • It defines sources, transformation processes, population policies, logical and
    physical schemas, constraints, and user profiles
  • It is of interest to system administrators
• External Metadata
  • It covers definitions, quality standards, units of measure, and relevant
    aggregations
  • It is relevant to end users
Recommended Text and Reference Books
• Text Book:
  • J. Han and M. Kamber. Data Mining: Concepts and Techniques. 3rd ed. Morgan
    Kaufmann, 2011.
• Reference Books:
  • M. H. Dunham. Data Mining: Introductory and Advanced Topics. Pearson
    Education, 2006.
  • I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools
    and Techniques. Morgan Kaufmann, 2000.
  • D. Hand, H. Mannila and P. Smyth. Principles of Data Mining. Prentice-Hall,
    2001.