Chapter 2

Uploaded by

Damni Mukhi

Fair Use Notice

The material used in this presentation (pictures, graphs, text, etc.) is intended
solely for educational and teaching purposes. It is offered free of cost to students
for use under the special circumstances of online education during the COVID-19
lockdown, and may include copyrighted material whose use has not been specifically
authorised by the copyright owners. Its application constitutes fair use of any such
copyrighted material, as provided for in the laws of many countries. The contents of
this presentation are intended only for the attendees of the class being conducted
by the presenter.
Data Warehouse: The Building Blocks

Disclaimer: The contents of this presentation have been taken
from multiple resources available on the internet, including
books, notes, reports, websites and presentations.
Chapter Objectives
• Review formal definitions of a data warehouse

• Discuss the defining features

• Distinguish between data warehouses and data marts

• Study each component or building block that makes up a
data warehouse

• Introduce metadata and highlight its significance


Major Defining Features
• The most popular definition came from Bill Inmon, who
provided the following:

“A data warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of
management's decision making process”
Defining Features
• The data in the data warehouse is:
• Separate
• Available
• Accessible
• Subject oriented
• Integrated
• Time-variant
• Non-volatile
Subject oriented
• In operational systems, we store data by individual applications. For example:
• Order processing application
• Consumer loan application

• A data warehouse focuses on the modeling and analysis of data for decision
makers. Therefore, data warehouses typically provide a concise and
straightforward view of a particular subject, such as customer, product,
or sales, rather than of the organization's ongoing operations as a whole.

• In the data warehouse, data is stored by subjects. For example:


• Sales
• Shipments
• Inventory
Integrated
• The data in the data warehouse comes from several OLTP
systems.
• These source systems may differ in their:
• Operational platforms
• Operating systems
• File layouts
• Character code representations
• Field naming conventions
Integrated

• In addition, for many enterprises, outside sources are also important.
• You get a mix of source data for a data warehouse (e.g. OLTP systems, flat files,
XML documents, DB objects, CSV files, CRM/ERP systems).
Integrated
• In Application A, the gender field stores logical values such as M or F.
• In Application B, the gender field is a numerical value.
• In Application C, the gender field is stored as a character value. The same is the case with date
and balance.
However, after the transformation and cleaning process, all this data is stored in a common format in the data
warehouse. [Source: https://www.guru99.com/data-warehouse-architecture.html]
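As a rough illustration of the gender-field example above, the sketch below maps three source encodings to one common code. The concrete source values (1/0 for Application B, "male"/"female" for Application C), the names `GENDER_MAPS` and `standardize_gender`, and the "U" code for unknown values are all assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical source encodings; the slides do not give the actual values.
GENDER_MAPS = {
    "app_a": {"M": "M", "F": "F"},          # letter codes
    "app_b": {1: "M", 0: "F"},              # numeric codes (assumed 1/0)
    "app_c": {"male": "M", "female": "F"},  # character strings (assumed)
}


def standardize_gender(source, value):
    """Map a source-specific gender value to the warehouse's common code."""
    if isinstance(value, str):
        value = value.strip()
        # App A uses upper-case letters, App C lower-case words.
        value = value.upper() if source == "app_a" else value.lower()
    return GENDER_MAPS[source].get(value, "U")  # "U" = unknown (assumed)


rows = [("app_a", "m"), ("app_b", 1), ("app_c", "Female"), ("app_b", 7)]
standardized = [standardize_gender(src, val) for src, val in rows]
print(standardized)
```

Keeping the mapping in one table per source system means a new source can be integrated by adding an entry rather than changing the function.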
Time-variant
• For an operational system, the stored data contain current values. For
example:
• Balance in the customer’s account
• Order status
• Balance amount in loans application

• Data in the data warehouse is for analysis and decision making.


Time-variant
• For example:
• Buying pattern of a customer
• Decline in sales
• Data is stored as snapshots of past and current periods.
• The time element is important.
• The time-variant nature of the data allows analysis of the past, relates information to
the present, and supports forecasts for the future.
Non-volatile
• A data warehouse is also non-volatile, meaning that previous data is not
erased when new data is entered. Data is read-only and
periodically refreshed.

• Data from operational systems is moved into the data warehouse at specific
time intervals.

• Data movements to different data sets may take place at different
frequencies.

• Not every business transaction updates the data in the data
warehouse.
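The time-variant and non-volatile properties can be sketched together as follows. This is a minimal illustration under assumed names and dates (`operational_balances`, `warehouse_snapshots`, `load_snapshot`), not a description of any particular product: the operational store overwrites the current value, while the warehouse only appends dated snapshots at load time.

```python
from datetime import date

# Operational system: holds only the current value per account.
operational_balances = {"ACC-1": 500.0}

# Warehouse: an append-only list of dated snapshots (never updated in place).
warehouse_snapshots = []


def load_snapshot(as_of):
    """Copy the current operational values into the warehouse as a new snapshot."""
    for account, balance in operational_balances.items():
        warehouse_snapshots.append(
            {"account": account, "balance": balance, "as_of": as_of}
        )


load_snapshot(date(2024, 1, 31))
operational_balances["ACC-1"] = 350.0   # transaction 1 overwrites the value
operational_balances["ACC-1"] = 420.0   # transaction 2 overwrites it again
load_snapshot(date(2024, 2, 29))

history = [(s["as_of"].isoformat(), s["balance"]) for s in warehouse_snapshots]
print(history)
```

Two transactions occurred between the loads, but the warehouse gained only one new row: the earlier snapshot is still there, and the intermediate value of 350.0 never reached the warehouse.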
Data Warehouse Components
Overview of the components
• When we build an operational system, we put several components together to
make up the system.

• Similarly, you build a data warehouse with hardware and software
components.

• Architecture is the proper arrangement of these components.

• To get the maximum benefit, you arrange the components in the most effective way.


Overview of the components
• Source data component
• Production data
• Internal data
• Archived data
• External data
Production Data
• This category of data comes from the various operational systems of
the enterprise.
• You may come across variations in
• data formats
• Hardware platforms
• Database structures
• Operating systems etc.
Production Data
• In operational systems, queries are narrow and predictable.
• There is no conformance of data across the various operational systems.
• Your great challenge is to:
• Standardize and transform the data
• Convert the data
• Integrate the pieces into useful data
Internal Data
• Internal data includes private spreadsheets, documents, customer profiles, marketing
budgets, profit-and-loss statements, etc.

• You cannot ignore the internal data held in private files.

• On the basis of collective judgment, you decide how much internal
data is to be included in the data warehouse.
Internal Data
• Internal data adds additional complexity to the process of
transforming and integrating the data.
• You may schedule the acquisition of internal data.
Archived Data
• OLTP systems periodically take the “old data” and store it in archived
files.
• Many archiving methods exist.
• For getting historical snapshots of data, you look into archived data
sets.
• This data is useful for discerning patterns and analyzing trends.
External Data
• Most executives depend on data from external sources. This
data comes from the market, including customers and competitors. For example:
• Statistics produced by external agencies
• Market share data of competitors
• Standard values of financial indicators, etc.
External Data (example)
• The UK-based supermarket chain Tesco, for example, is renowned for its use of
weather data to drive richer insights that help it predict sales and stock
requirements.
• Tesco reported in 2013 that it had managed to save £6m ($7.5m) per year
and to reduce out-of-stock by 30% on special offers.
• In fact, in a recent survey of supply chain professionals by the UK Met Office,
47% cited weather as one of the top three factors external to their business
that drive consumer demand.

Reference: https://channels.theinnovationenterprise.com/articles/are-companies-using-their-external-data
External Data
• Data from outside sources does not conform to your formats.
• You have to organize the transmission of data from the external sources.
• You have to convert the data into your internal formats and types.
Data Staging Component
• After extracting data from various sources, you have to prepare the
data for storing in a data warehouse.
• Three major functions need to be performed:
• Extraction
• Transformation
• Loading
Data Staging Component
• Why do you need a separate place or component to perform data
preparation?

• In a data warehouse, you pull in data from many source systems.

• A separate staging area is therefore necessary to prepare data for the
data warehouse.
Data Extraction
• This function has to deal with numerous data sources.

• Source data may be in different formats.

• Data extraction may become quite complex.

• Tools are available on the market for data extraction.


Data Extraction
• Purchasing tools may entail a high initial cost.

• In-house programs may entail ongoing costs for development and
maintenance.

• After you extract the data, you may keep it in a separate physical
environment for further preparation.
Data Transformation
• In every system implementation, data transformation is an important
function.

• If data extraction poses great challenges, data transformation may
pose even greater challenges.

• The data feed into a data warehouse is not just an initial load.


Data Transformation
• You perform a number of individual tasks in transformation.
• First, clean the extracted data.
• Correction of misspellings
• Resolution of conflicts
• Providing default values
• Eliminating duplicates and so on.
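Three of the cleaning tasks listed above (correcting misspellings, providing default values, eliminating duplicates) might look roughly like this in a staging program. The record layout, the `SPELLING_FIXES` table and the `DEFAULT_CITY` value are hypothetical; conflict resolution is not shown because it depends on business rules the slides do not give.

```python
# Hypothetical spelling corrections and default value.
SPELLING_FIXES = {"Karahci": "Karachi", "Lahor": "Lahore"}
DEFAULT_CITY = "UNKNOWN"


def clean(records):
    """Correct misspellings, fill defaults, and drop duplicate customer rows."""
    seen = set()
    cleaned = []
    for rec in records:
        city = (rec.get("city") or "").strip()
        city = SPELLING_FIXES.get(city, city) or DEFAULT_CITY
        key = (rec["customer_id"], city)
        if key in seen:          # eliminate duplicates
            continue
        seen.add(key)
        cleaned.append({"customer_id": rec["customer_id"], "city": city})
    return cleaned


extracted = [
    {"customer_id": 1, "city": "Karahci"},
    {"customer_id": 1, "city": "Karachi"},   # duplicate once the spelling is fixed
    {"customer_id": 2, "city": None},        # missing value gets the default
]
cleaned = clean(extracted)
print(cleaned)
```

Note the order of operations: duplicates are detected only after spelling correction, otherwise the two rows for customer 1 would not match.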
Data Transformation
• Standardization of data elements is another important part of transformation.

• Standardize data formats and lengths.

• Semantic standardization is important. (One persistent misunderstanding recurs in
discussions of semantics: the confusion of words and meanings.)

• Resolve synonyms and homonyms.
• Synonyms: multiple words that refer to the same concept.
• Homonyms: two or more words that have the same spelling or pronunciation but different
meanings and origins, e.g. mail/male, mean/mean.
Data Transformation
• It involves combining pieces of data.
• It also involves purging source data that is not useful.
• Sorting and merging of data also take place on a large scale in the data
staging area.
• Data transformation also includes the assignment of surrogate keys.
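The slides do not define surrogate keys. A surrogate key is generally understood to be an identifier generated by the warehouse itself, independent of the keys used in the source systems. The sketch below shows one simple way to assign them; the class name, source system names and key values are hypothetical.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assign a warehouse-generated integer key to each (source, natural key) pair."""

    def __init__(self):
        self._next = count(1)   # 1, 2, 3, ...
        self._keys = {}

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._keys:
            self._keys[pair] = next(self._next)
        return self._keys[pair]


gen = SurrogateKeyGenerator()
k1 = gen.key_for("orders_app", "C-100")
k2 = gen.key_for("loans_app", "C-100")   # same natural key, different source
k3 = gen.key_for("orders_app", "C-100")  # repeated lookup returns the same key
print(k1, k2, k3)
```

Because the key is derived from the pair of source system and natural key, identical key values coming from two different applications do not collide in the warehouse.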
Data Transformation
• Summarization of data is also a very important function of data
transformation.
• When the data transformation function ends, you have a collection of
integrated data, that is:
• Cleaned
• Standardized and
• Summarized
Data Loading
• Data loading involves two distinct groups of tasks:
• When you go live for the first time, you do the initial load.
• The initial load moves large volumes of data.
• A substantial amount of time is required.
• After that, you continue to extract changes to the source data and feed the incremental
data revisions into the data warehouse on a regular basis.
Data storage Component
• The data storage for the data warehouse is a separate repository.

• The data repositories of OLTP systems contain:
• Current data
• Highly normalized data

• For a data warehouse, you need large volumes of data.


Data storage Component
• The data in the operational databases could change from moment to
moment.
• For a data warehouse, you need stable data.
• The data storage must not be in a state of continual updating.
• To support this component, you may use tools from multiple
vendors.
Data storage Component
• Many data warehouses use multidimensional database
management systems (MDDBs).
Information Delivery Component
• Consider users who need information from the data warehouse.

• For example:
• Novice users
• Casual users
• Business analysts
• Power users, etc.

• The information delivery component includes the methods of information
delivery.
Information Delivery Component
• Some data warehouses also provide data to data mining applications.

• Data mining applications are knowledge discovery applications.

• You may include various information delivery methods, depending
on the requirements.
Metadata component
• Metadata in a data warehouse is similar to the data dictionary in a
database, which holds information about:
• Logical data structures
• Files and addresses
• Indexes, etc.

• However, metadata in a data warehouse is much more than the data dictionary
in a database.
Metadata component
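• Metadata is similar to the yellow pages: a directory of what the data warehouse contains.
• Metadata is a key architectural component of a data warehouse.
• Users of the data warehouse ask questions such as: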
• Are there any predefined queries I can look at?
• What are the various elements of data in the data warehouse?
• Is there any information about unit sales and unit cost by product?
• How old is the data in the warehouse?
Metadata component
• When was the last time fresh data was brought in?
• Are there any summaries by month and product?
• The metadata repository answers all these questions.
• Types of metadata
• Operational metadata
• Extraction and transformation metadata
• End-user metadata.
Metadata component
• Operational metadata
• Data for the data warehouse comes from several operational systems of the
enterprise.
• Operational metadata contains all the information about these sources.
• Operational metadata is primarily for use by developers and
programmers.
• Extraction and Transformation metadata.
• Extraction frequencies.
• Extraction methods
• Business rules for data extraction and so on.
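To make the idea of extraction and transformation metadata more concrete, here is a small sketch of what one entry in a metadata repository could hold and how it can answer a question such as "How old is the data in the warehouse?". The field names, the `repository` dictionary and all values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExtractionMetadata:
    source_system: str
    extraction_method: str      # e.g. "full" or "incremental"
    frequency: str              # e.g. "daily"
    business_rule: str
    last_extracted: date


# Hypothetical repository with a single entry for the "sales" subject.
repository = {
    "sales": ExtractionMetadata(
        "order_processing", "incremental", "daily",
        "exclude cancelled orders", date(2024, 3, 1),
    ),
}


def data_age_in_days(subject, today):
    """Answer 'how old is the data?' from the metadata repository."""
    return (today - repository[subject].last_extracted).days


age = data_age_in_days("sales", date(2024, 3, 4))
print(age)
```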
Metadata component
• End-User Metadata
• It is a navigational map of the data warehouse.
Metadata component
• Why is metadata especially important in a data warehouse?
• First, it acts as the glue that connects all the parts of a data warehouse.
• Next, it provides information about the contents and structures to the
developers.
• Finally, it opens the door to the end users and makes the contents recognizable
in their own terms.
Management and Control component
• This component sits on top of all the other components.
• Coordinates the services and activities.
• Controls the data transformation.
• Controls storage of data.
Data Granularity
• When designing a data warehouse, one of the most basic concepts
is data granularity.
• It is important to determine the proper level of granularity.
• When the level of granularity is properly set, the remaining aspects of
design and implementation flow smoothly.
• Consider the following example:
• In an operational system, data is usually kept at the lowest level of detail.

• In a point-of-sale system for a grocery store, the units of sale are captured and
stored at the level of units of a product per transaction at the check-out counter.

• In an order entry system, the quantity ordered is captured and stored at the
level of units of a product per order received from the customer.

• Whenever you need summary data, you add up the individual transactions.
• For example, if you are looking for the units of a product ordered this month, you
read all the orders entered during the month for that product and add them up.

• You do not usually keep summary data in an operational system.


• When a user queries the data warehouse for analysis, he or she usually starts
by looking at summary data.
• For example, the user may start with the total sale units of a product in an entire
region. Then the user may want to look at the breakdown by states in the
region. The next step may be to examine sale units at the next level,
individual stores.
• Frequently, the analysis begins at a high level and moves down to lower levels
of detail.
• In a data warehouse, therefore, you find it efficient to keep data summarized
at different levels.

• The term data granularity in a data warehouse refers to the level of detail.
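The region, state and store example above can be sketched as the same detail rows summarized at three levels. The rows, names and unit counts below are made up for illustration only.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, state, store, units sold)
sales = [
    ("East", "State-A", "Store-1", 40),
    ("East", "State-A", "Store-2", 25),
    ("East", "State-B", "Store-3", 10),
    ("West", "State-C", "Store-4", 30),
]


def summarize(rows, level):
    """Total units at a given level: 1 = region, 2 = state, 3 = store."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[:level]] += row[3]
    return dict(totals)


by_region = summarize(sales, 1)
by_state = summarize(sales, 2)
by_store = summarize(sales, 3)
print(by_region)
print(by_state)
```

An analyst would start with `by_region`, then move down to `by_state` and `by_store`. Storing the summaries in the warehouse avoids adding up the detail rows for every query.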


Data Granularity
• Granularity refers to “the level of detail or summarization of the units of data in the
data warehouse”. A low level of granularity means a high level of detail, and a high
level of granularity means a low level of detail.

• Granularity means the level of detail of your data within the data structure. In a
typical data warehouse, one might find very detailed data (such as
seconds, a single product, one specific attribute) and aggregated data (such
as totals, monthly orders, all products).
• The lower the level of granularity (the finer the grain) of a fact table, the more data you will have. The
granularity of your data also determines what kind of information you can get out of the
stored data.

• It determines the types of analysis that can be done.

• Granularity indicates the level or grain of data.


• The primary issue of granularity is that of getting it at the right level. The
level of granularity needs to be neither too high nor too low.

• For summary data, you add up individual transactions.

• Users of a data warehouse usually start by looking at summary data.

• Analysis begins at a high level and moves down to lower levels of detail.
Exercise
• Identify an organization whose business needs cannot be fulfilled by
its existing operational database systems and that requires a data
warehouse solution. List the issues that cannot be resolved
by operational databases for this organization and explain how a
data warehouse would help. Also identify the required levels of
granularity (be precise).
Approaches for building a data warehouse
Data Warehouses & Data Marts
• Bill Inmon stated in a magazine article in 1998:
• “The single most important issue facing the IT manager this year is
whether to build the data warehouse or the data mart first.”

• This is still true today.


Data Marts
• A data mart is a subset of the data warehouse (DWH) that supports the requirements of a
particular department or business function.

• Individual data marts are targeted to particular business groups in the enterprise.

• The collection of all the data marts forms an integrated whole, called the enterprise data
warehouse.

• A data warehouse has enterprise-wide scope, whereas the information in a data mart pertains
to a single department.

• A data mart is a condensed version of the data warehouse and is designed for use by a
specific department, unit or set of users in an organization, e.g. marketing, sales, HR or
finance.
Basic Data Warehouse Architecture
[Figure: source OLTP systems feed the enterprise data warehouse (one version of the truth), from which subset data marts are drawn.]
Differences Between a Data Warehouse and a Data Mart

Category              Data Warehouse       Data Mart
Scope                 Corporate (broad)    Line of business (LOB) (focused)
Subject               Multiple             Single subject
Data Sources          Many                 Few
Size (typical)        100 GB-TB+           < 100 GB
Implementation Time   Months to years      Months
Types of Data Marts
• There are two basic types of data marts:
• Dependent data marts
• Independent data marts
Independent Data Marts
• Independent data marts are created directly from operational
systems, just as a data warehouse is.

• In the data mart, the data is usually transformed as part of the load
process (ETL).

• Data might be aggregated, dimensionalized or summarized
historically, as the requirements of the data mart dictate.
[Figure: Independent data mart architecture. Data marts are mini-warehouses, limited in scope. A separate ETL process is needed for each independent data mart, and data access is complex because of the multiple data marts.]
Dependent Data Marts
• Dependent data marts are created from the detail
data in the data warehouse.

• Dependent data marts draw data from a central data
warehouse that has already been created.

• Although this approach still requires the movement
and transformation of data, it may provide a better
vehicle for performance-critical user queries.
[Figure: Dependent data mart with operational data store (ODS). The ODS provides an option for obtaining current data. There is a single ETL process for the enterprise data warehouse (EDW), and the dependent data marts are loaded from the EDW.]
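A dependent data mart, as described above, is derived from detail data that already sits in the warehouse. The sketch below selects one department's rows and summarizes them; the row layout, department names and the function `build_dependent_mart` are hypothetical.

```python
# Hypothetical detail rows held in the enterprise data warehouse.
warehouse_sales = [
    {"dept": "marketing", "product": "A", "month": "2024-01", "units": 5},
    {"dept": "marketing", "product": "A", "month": "2024-01", "units": 7},
    {"dept": "finance",   "product": "B", "month": "2024-01", "units": 3},
]


def build_dependent_mart(detail_rows, dept):
    """Select one department's rows and summarize units by product and month."""
    mart = {}
    for row in detail_rows:
        if row["dept"] != dept:
            continue
        key = (row["product"], row["month"])
        mart[key] = mart.get(key, 0) + row["units"]
    return mart


marketing_mart = build_dependent_mart(warehouse_sales, "marketing")
print(marketing_mart)
```

No extraction from the operational systems happens here: the mart reuses data that the single warehouse ETL process has already cleaned and integrated.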
Reasons for creating Data Marts
• The motivations behind the creation of these two types of data marts
are also typically different.

• Dependent data marts are usually built to achieve improved
performance and availability, better control, and lower
telecommunication costs resulting from local access to the data relevant to
a specific department.

• The creation of independent data marts is often driven by the need to
have a solution within a shorter time.
Reasons/Benefits for creating Data Marts
• Easy access to frequently needed data.
• Creates a collective view for a group of users.
• Improves end-user response time because of the reduced
volume of data.
• Ease of creation.
• Lower cost than implementing a full data warehouse.
• Potential users are more clearly defined than in a full data warehouse.
Disadvantages

• Enterprises often create too many disparate and unrelated
data marts without much benefit, and these can become a big burden to
maintain.
• A data mart cannot provide company-wide data analysis because its data
set is limited.
Approaches for building a data warehouse
• Before deciding to build a data warehouse for your organization, you
need to ask the following fundamental questions and address the
relevant issues:

• Top-down or bottom-up approach?


• Enterprise-wide or departmental?
• Which first – data warehouse or data mart?
• Build a pilot or go with a full-fledged implementation?
• Dependent or independent data marts?
Top-Down versus Bottom-Up Approach
• Inmon’s Top-Down Approach
• The advantages of this approach are:
• An enterprise view of data.
• Inherently architected, not a union of disparate data marts.
• Single, central storage of data about the content.
• Centralized rules and control.
• The disadvantages are:
• Takes longer to build.
• High risk of failure.
• Needs a high level of cross-functional skills.
Top-Down versus Bottom-Up Approach
• Kimball’s Bottom-Up Approach
• The advantages of this approach are:
• Faster and easier implementation of manageable pieces
• Favorable ROI
• Less risk of failure
• Inherently incremental
• Allows project team to learn and grow.
• The disadvantages are:
• Spreads redundant data across every data mart.
• Perpetuates inconsistent data.
• Each data mart has its own narrow view of data.
A Practical Approach
• Accommodating both views appears to be practical.

• The steps in this practical approach are as follows:


• Plan and define requirements at the overall corporate level.
• Create a surrounding architecture for a complete warehouse.
• Conform and standardize the data content.
• Implement the data warehouse as a series of supermarts (carefully
architected data marts), one at a time.
Exercise
• A data warehouse is subject-oriented. What would be the major critical
business subjects for the following companies?
a. an international manufacturing company
b. a local community bank
c. a domestic hotel chain
• For an airline company, identify three operational applications that
would feed into the data warehouse. What would be the data load
and refresh cycles?
