Introduction:
Our capabilities of both generating and collecting data have been increasing rapidly in the
last several decades.
Contributing factors include the widespread use of bar codes for most commercial
products, the computerization of many business, scientific and government transactions,
and advances in data collection tools ranging from scanned text and image platforms to
satellite remote sensing systems.
The popular use of the World Wide Web as a global information system has flooded us with a tremendous amount of data and information.
This explosive growth in stored data has generated an urgent need for new techniques
and automated tools that can intelligently assist us in transforming the vast amounts of data
into useful information and knowledge.
Management of data is one of the important objectives of computer science.
Data warehousing helps in this respect by storing data along multiple dimensions.
• Definition:
1. A Data Warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
• Accessible:
The primary purpose of a data warehouse is to provide readily accessible information to end users.
• Process oriented:
It is important to view data warehousing as a process for the delivery of information.
The maintenance of a DW is ongoing and iterative in nature.
Characteristics:
• Smaller number of users.
• Instant response is less important (needed only when interactively composing reports).
• Read-only access by users.
• Most data access will be targeted at a small partition of the data: the last
month or quarter.
• Database access is less frequent, but queries are large and complicated and access many rows per table.
• An irregular workload of primarily long-running, complex read-only transactions instead of a constant, high transaction rate.
• Load from operational data store will only insert new records, existing ones
do not get changed (updated).
• Bulk load from operational data store, no single-record inserts (at most once
daily).
• Database design partly de-normalized and redundant for better performance,
using a star or snowflake schema. Database design is data-driven, not
workflow-driven.
• Large storage capacity for historical data.
• May also contain aggregate data.
Benefits of data warehousing
Some of the benefits that a data warehouse provides are as follows:
• A data warehouse provides a common data model for all data of interest
regardless of the data's source.
• DW makes it easier to report and analyze information than it would be if
multiple data models were used to retrieve information such as sales invoices,
order receipts, general ledger charges, etc.
• Prior to loading data into the data warehouse, inconsistencies are identified and
resolved. This greatly simplifies reporting and analysis.
• Information in the data warehouse is under the control of data warehouse users
so that, even if the source system data is purged over time, the information in
the warehouse can be stored safely for extended periods of time.
• Because they are separate from operational systems, data warehouses provide
retrieval of data without slowing down operational systems.
• Data warehouses can work in conjunction with and, hence, enhance the value of
operational business applications, notably customer relationship management
(CRM) systems.
• Data warehouses facilitate decision support system applications such as trend
reports (e.g., the items with the most sales in a particular area within the last
two years), exception reports, and reports that show actual performance versus
goals.
Data Warehousing:
• Data warehousing is a process of constructing and using
data warehouses.
• The classic definition of the data warehouse focuses on
data storage.
• However, the means to retrieve and analyze data, to
extract, transform and load data, and to manage the
data dictionary are also considered essential
components of a data warehousing system.
• Many references to data warehousing use this broader
context.
• Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
Extract, Transform, and Load (ETL) is a process in data
warehousing that involves:
• extracting data from outside sources,
• transforming it to fit business needs
• loading it into the end target, i.e. the data warehouse.
1) Extract:
– The first part of an ETL process is to extract the data from the source
systems.
– Most data warehousing projects consolidate data from different source
systems.
– Each separate system may also use a different data organization
format.
– Common data source formats are relational databases and flat files.
– Extraction converts the data into a format for transformation
processing.
• An intrinsic part of extraction is parsing the extracted data to check whether it meets an expected pattern or structure; if not, the data may be rejected entirely.
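A minimal Python sketch of such a validation check during extraction; the field layout and the ID pattern below are made-up assumptions, not part of any particular tool:

```python
import re

# Hypothetical illustration: validate extracted rows against an expected
# pattern/structure before passing them on to the transform stage.
EXPECTED_FIELDS = 3                      # e.g. customer_id, gender_code, amount
ID_PATTERN = re.compile(r"^\d+$")        # customer_id must be all digits

def extract(lines):
    """Parse raw CSV-like lines; reject rows that do not match the expected structure."""
    accepted, rejected = [], []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) == EXPECTED_FIELDS and ID_PATTERN.match(fields[0]):
            accepted.append(fields)
        else:
            rejected.append(line)        # rejected entirely, as described above
    return accepted, rejected

rows, bad = extract(["101,1,250.00", "oops;malformed", "102,2,99.50"])
print(rows)   # [['101', '1', '250.00'], ['102', '2', '99.50']]
print(bad)    # ['oops;malformed']
```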
2) Transform:
• The transform stage applies a series of rules or functions to the extracted data.
• Some data sources will require very little or even no manipulation of the data.
• In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target (a small sketch follows this list):
– Selecting only certain columns to load (or selecting null columns not to
load).
– Translating coded values (e.g., if the source system stores 1 for male and 2
for female, but the warehouse stores M for male and F for female) .
– Encoding free-form values (e.g., mapping "Male" to "M" and "Mr" to "M")
– Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
– Filtering
– Sorting
– Joining together data from multiple sources.
– Aggregation.
– Transposing or pivoting (turning multiple columns into multiple rows or vice
versa)
– Splitting a column into multiple columns (e.g., putting a comma-separated
list specified as a string in one column as individual values in different
columns)
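A small Python sketch of a few of the transformation types listed above (translating coded values, deriving sale_amount, splitting a comma-separated column); all field names and mappings are illustrative assumptions:

```python
# Hypothetical illustration of a few transformation types from the list above.
GENDER_CODES = {"1": "M", "2": "F"}      # translate coded values (source 1/2 -> warehouse M/F)

def transform(row):
    """row: dict extracted from a source system (the field names are made up)."""
    out = {}
    out["customer_id"] = row["customer_id"]               # select only certain columns to load
    out["gender"] = GENDER_CODES[row["gender_code"]]      # translate coded values
    out["sale_amount"] = row["qty"] * row["unit_price"]   # derive a new calculated value
    # split a comma-separated list held in one column into individual columns
    city, country = [p.strip() for p in row["location"].split(",")]
    out["city"], out["country"] = city, country
    return out

print(transform({"customer_id": "101", "gender_code": "1",
                 "qty": 3, "unit_price": 9.5, "location": "Vancouver, Canada"}))
# {'customer_id': '101', 'gender': 'M', 'sale_amount': 28.5, 'city': 'Vancouver', 'country': 'Canada'}
```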
3) Load:
• The load phase loads the data into the end target, usually
being the data warehouse.
• Depending on the requirements of the organization, this process varies widely. Some data warehouses might weekly overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW) might add new data in a historized form, e.g. hourly.
• As the load phase interacts with a database, the
constraints defined in the database schema as well as in
triggers activated upon data load apply (e.g. uniqueness,
referential integrity, mandatory fields), which also
contribute to the overall data quality performance of the
ETL process.
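A minimal sketch of the load phase using Python's built-in sqlite3 module as a stand-in warehouse database; the table, columns and rows are made up, but it shows how constraints defined in the schema (uniqueness, mandatory fields) reject bad rows during the load:

```python
import sqlite3

# Stand-in for the warehouse database; the table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_fact (
                    sale_id   INTEGER PRIMARY KEY,   -- uniqueness constraint
                    customer  TEXT    NOT NULL,      -- mandatory field
                    amount    REAL    NOT NULL)""")

def load(rows):
    """Bulk-load transformed rows; rows violating schema constraints are reported."""
    for row in rows:
        try:
            conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)", row)
        except sqlite3.IntegrityError as err:
            print("rejected", row, "->", err)   # contributes to overall data quality
    conn.commit()

load([(1, "Alice", 250.0),
      (1, "Bob", 99.5),        # duplicate key -> rejected
      (2, None, 10.0),         # missing mandatory field -> rejected
      (3, "Carol", 75.0)])
print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])   # 2
```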
OLAP & OLTP
OLTP: Online transaction processing.
OLTP refers to a class of systems that facilitate and manage
transaction-oriented applications, typically for data entry and
retrieval transaction processing.
OLTP is used to refer to processing in which the system responds
immediately to user requests.
The major task of OLTP is to perform online transaction and query
processing.
They cover most of the day to day operations of an organization
such as purchasing, inventory, manufacturing, banking, payroll,
registration and accounting.
An automatic teller machine (ATM) for a bank is an example of a
commercial transaction processing application.
• Benefits
• Online Transaction Processing has two key benefits:
simplicity and efficiency.
• Reduced paper trails and faster, more accurate forecasts of revenues and expenses are both examples of how OLTP makes things simpler for businesses.
• It also provides a concrete foundation for a stable organization because data is updated in a timely manner.
• OLTP has proven efficient because it vastly broadens the consumer base for an organization and makes individual processes faster.
• Disadvantages
• It is a great tool for any organization, but in using OLTP,
there are a few things to be wary of: the security issues
and economic costs.
• One of the benefits of OLTP also gives rise to a potential problem: the worldwide availability that the system provides makes companies' databases much more susceptible to intruders and hackers.
• Another cost is the potential for server failures, which can cause delays or even wipe out immeasurable amounts of data.
• OLAP: Online Analytical Processing.
• Online Analytical Processing, or OLAP, is an approach to quickly
provide answers to analytical queries that are multi-dimensional
in nature.
• OLAP organizes and presents data in various formats in order to
accommodate the diverse needs of the different users.
• It serves users or knowledge workers in the role of data analysis
and decision making.
• The typical applications of OLAP are in business reporting for
sales, marketing, management reporting, business process
management (BPM), budgeting and forecasting, financial
reporting and similar areas.
• Databases configured for OLAP employ a multidimensional data
model, allowing for complex analytical and ad-hoc queries with a
rapid execution time.
The major distinguishing features between OLTP & OLAP are:
• Users & System Orientation:
OLTP: is customer- oriented and is used for transaction
processing.
OLAP: is market oriented and is used for data
analysis .
• Data contents:
OLTP: manages current data at a very detailed level.
OLAP: manages large amounts of historical data, provides
facility for summarization & aggregation.
• Database Design:
OLTP: adopts Entity Relationship data model.
OLAP: adopts either a star or snowflake model.
• View:
OLTP: focuses mainly on the current data within an enterprise or
department.
OLAP: focuses on historical data.
• Access Patterns:
OLTP: consists of short, atomic transactions.
Requires concurrency and recovery
mechanisms.
OLAP: are mostly read only operations.
Feature            | OLTP                       | OLAP
-------------------|----------------------------|------------------------
Characteristics    | operational processing     | information processing
Orientation        | transaction                | analysis
Users              | clerk, IT professional     | knowledge worker
Function           | day-to-day operations      | decision support
DB design          | application-oriented       | subject-oriented
Schema             | ER based                   | star/snowflake
Data               | current, up-to-date        | historical
View               | detailed                   | summarized
Focus              | data in                    | information out
Usage              | repetitive                 | ad hoc
Access             | read/write                 | lots of scans
Unit of work       | short, simple transaction  | complex query
Records accessed   | tens                       | millions
Number of users    | thousands                  | hundreds
DB size            | 100 MB to GB               | 100 GB to TB
Multidimensional Data Model:
• A data model is a way to describe data and to issue queries
against it.
• DW & OLAP tools are based on a multi-dimensional data
model.
• This model views data in the form of a data cube.
Data Cube:
• Data cube allows data to be modeled and viewed in multiple
dimensions.
• It is defined by dimensions and facts.
Dimensions:
• Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
• For e.g., a sales data warehouse keeps records of the store's sales with respect to the dimensions time, item, branch, and location.
• Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.
• For e.g., a dimension table for item may contain the attributes item_name, brand, type, etc.
Facts:
• The multidimensional data model is organized around a central theme.
• Eg. Sales
• This theme is represented by a fact table.
• Facts are numerical measures.
• They are the quantities by which we want to analyze
relationships between dimensions.
• Eg. Facts for a sales DW include dollars_sold, units_sold,
amt_budgeted.
• The fact table contains the names of the facts (measures), as well as keys to each of the related dimension tables.
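A tiny Python sketch of a fact table held in memory: each fact row carries keys for the time, item, branch and location dimensions plus the numerical measures, and a measure is aggregated with respect to chosen dimensions; the rows and values are made up:

```python
from collections import defaultdict

# Made-up fact rows: (time, item, branch, location) dimension keys plus measures.
facts = [
    {"time": "Q1", "item": "phone",    "branch": "B1", "location": "Vancouver",
     "dollars_sold": 14.0, "units_sold": 2},
    {"time": "Q1", "item": "computer", "branch": "B1", "location": "Vancouver",
     "dollars_sold": 825.0, "units_sold": 1},
    {"time": "Q2", "item": "computer", "branch": "B2", "location": "Toronto",
     "dollars_sold": 300.0, "units_sold": 1},
]

# Analyze a measure with respect to chosen dimensions, e.g. dollars_sold per (time, location).
totals = defaultdict(float)
for f in facts:
    totals[(f["time"], f["location"])] += f["dollars_sold"]
print(dict(totals))
# {('Q1', 'Vancouver'): 839.0, ('Q2', 'Toronto'): 300.0}
```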
2-D View
[Figure: a 2-D view of sales data for location = "Vancouver", with time (quarter) as rows and item (type) as columns: home entertainment, computer, phone, security; a further view extends this to other cities such as New York.]
• In data warehousing literature, a data cube such as the ones above is referred to as a cuboid.
• The cuboid that holds the lowest level of summarization is called the base cuboid.
• The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
Cube: A Lattice of Cuboids
[Figure: the lattice of cuboids for the dimensions time, item, location and supplier, from the 0-D (apex) cuboid "all" at the top, through the 1-D, 2-D and 3-D cuboids (e.g. time,item,location; time,item,supplier; time,location,supplier; item,location,supplier), down to the 4-D (base) cuboid (time, item, location, supplier).]
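A short Python sketch that enumerates the lattice of cuboids for the four dimensions above; every subset of the base dimensions defines one cuboid, from the 0-D apex to the 4-D base cuboid:

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Every subset of the base dimensions defines one cuboid of the lattice:
# the empty subset is the 0-D apex cuboid ("all"), the full set the 4-D base cuboid.
for k in range(len(dimensions) + 1):
    for cuboid in combinations(dimensions, k):
        label = ",".join(cuboid) if cuboid else "all (apex)"
        print(f"{k}-D cuboid: {label}")
print("total cuboids:", 2 ** len(dimensions))   # 16
```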
Schemas for Multidimensional Database
• A database schema consists of a set of entities and the relationships between them.
• The entity-relationship data model is commonly used in the design of relational databases.
• Such a data model is appropriate for on-line transaction processing.
• A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis.
• The most popular data model for a data warehouse is a multidimensional model.
• Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Star Schema:
• It is the most common modeling paradigm.
• In the star schema, the DW contains:
• A large central table (fact table) containing the bulk of the data, with no redundancy.
• Facts are numerical measures.
• A set of smaller attendant tables (dimension tables), one for each dimension.
• Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
Example of Star Schema
[Figure: a star schema for sales. The central Sales Fact Table holds the dimension keys time_key, item_key, branch_key, location_key and the measures units_sold, dollars_sold, avg_sales. The dimension tables are:
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city, province_or_state, country)]
• Sales are considered along four dimensions i.e. time, item, branch,
location.
• The schema contains a central fact table for sales that contains
keys to each of the four dimensions, along with measures:
dollars_sold and units_sold, avg_sales.
• In star schema each dimension is represented by only one table,
and each table contains a set of attributes.
• For e.g., the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}.
• This constraint may introduce some redundancy.
• For e.g., "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia.
• Entries for such cities in the location dimension table will create
redundancy among attributes province_or_state and country.
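A minimal sketch of the star schema above using Python's built-in sqlite3 module (the branch dimension is omitted for brevity and the sample rows are made up), together with a typical join-and-aggregate query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (star schema: one denormalized table per dimension).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INT, month INT, quarter TEXT, year INT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);

-- Central fact table: keys to each dimension plus numerical measures.
CREATE TABLE sales_fact (
    time_key INT, item_key INT, location_key INT,
    units_sold INT, dollars_sold REAL);
""")

conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?)",
                 [(1, 5, 1, "Q1", 2023), (2, 20, 7, "Q3", 2023)])
conn.execute("INSERT INTO item_dim VALUES (1, 'laptop', 'Acme', 'computer', 'wholesale')")
conn.execute("INSERT INTO location_dim VALUES (1, 'Main St', 'Vancouver', 'British Columbia', 'Canada')")
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [(1, 1, 1, 3, 2400.0), (2, 1, 1, 1, 800.0)])

# Typical analytical query: total dollars_sold per quarter and city.
for row in conn.execute("""
        SELECT t.quarter, l.city, SUM(f.dollars_sold)
        FROM sales_fact f
        JOIN time_dim t     ON f.time_key = t.time_key
        JOIN location_dim l ON f.location_key = l.location_key
        GROUP BY t.quarter, l.city"""):
    print(row)   # ('Q1', 'Vancouver', 2400.0), ('Q3', 'Vancouver', 800.0)
```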
Snowflake schema:
• It is a variant of the star schema model.
• Here dimension tables are normalized thereby further splitting the
data into additional tables.
• The resulting schema graph forms a shape similar to a snowflake.
• The dimension tables of the snowflake model may be kept in
normalized form to reduce redundancies.
• Such tables are easy to maintain.
• However, the snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query.
• System performance may be adversely impacted.
Example of Snowflake Schema
[Figure: a snowflake schema for sales. The Sales Fact Table is the same as in the star schema (keys time_key, item_key, branch_key, location_key; measures units_sold, dollars_sold, avg_sales). The dimension tables are normalized:
• time (time_key, day, day_of_the_week, month, quarter, year)
• item (item_key, item_name, brand, type, supplier_key), with supplier (supplier_key, supplier_type)
• branch (branch_key, branch_name, branch_type)
• location (location_key, street, city_key), with city (city_key, city, province_or_state, country)]
• The sales fact table is identical to that of the star
schema.
• The main difference between these two schemas is in
the definition of dimension tables.
• The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables.
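A short sqlite3 sketch (Python) of the normalization described above: the item dimension of the star schema is split into item and supplier tables, at the cost of an extra join; the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked item dimension: supplier attributes are split into their own table.
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                           type TEXT, supplier_key INT REFERENCES supplier_dim(supplier_key));
""")
conn.execute("INSERT INTO supplier_dim VALUES (1, 'wholesale')")
conn.execute("INSERT INTO item_dim VALUES (1, 'laptop', 'Acme', 'computer', 1)")

# The extra join is the price paid for the reduced redundancy.
print(conn.execute("""SELECT i.item_name, s.supplier_type
                      FROM item_dim i JOIN supplier_dim s
                        ON i.supplier_key = s.supplier_key""").fetchall())
# [('laptop', 'wholesale')]
```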
Fact Constellation:
• Sophisticated applications may require multiple fact
tables to share dimension tables.
• This kind of schema can be viewed as a collection of
stars, and hence is called a galaxy schema or a fact
constellation.
Example of Fact Constellation
[Figure: a fact constellation for sales and shipping. The Sales Fact Table is the same as in the star schema and shares the time and item dimension tables with a second Shipping Fact Table, whose keys include time_key, item_key, shipper_key and from_location.]
[Figure: concept hierarchies for the sales cube dimensions: for location, street < city < province_or_state (e.g. Illinois, New York, British Columbia, Ontario) < country; for time, day < month < quarter < year, with week as an alternative path from day to year.]
[Figure: a sample sales data cube showing units sold by city (Chicago, New York, Toronto, Vancouver), quarter (Q1-Q4) and item type.]
Roll Up:
• It is also called the drill-up operation.
• It performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction (an example figure and a small sketch follow).
Roll Up
[Figure: roll-up on location from cities to countries: the city-level sales per quarter are aggregated into country-level totals, e.g. USA 2000 and Canada 1000 for Q1.]
[Figure: going the other way, on time from quarters to months, each quarter is expanded into its monthly values (Jan ... Dec) per city; this is the reverse operation (drill-down).]
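A small Python sketch of the roll-up on location from cities to countries, using a city-to-country concept hierarchy; the illustrative city values are chosen so that they roll up to USA 2000 and Canada 1000, as in the figure above:

```python
from collections import defaultdict

# Concept hierarchy for the location dimension: city -> country.
city_to_country = {"Chicago": "USA", "New York": "USA",
                   "Toronto": "Canada", "Vancouver": "Canada"}

# Sales (units_sold) at the (quarter, city) level -- illustrative numbers only.
sales = {("Q1", "Chicago"): 440, ("Q1", "New York"): 1560,
         ("Q1", "Toronto"): 395, ("Q1", "Vancouver"): 605}

# Roll up on location: climb the hierarchy from city to country and re-aggregate.
rolled_up = defaultdict(int)
for (quarter, city), units in sales.items():
    rolled_up[(quarter, city_to_country[city])] += units
print(dict(rolled_up))   # {('Q1', 'USA'): 2000, ('Q1', 'Canada'): 1000}
```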
[Figure: a 2-D slice of the sales cube for Q1, with location (cities: Chicago, New York, Toronto, Vancouver) on one axis and item (type) on the other.]
Pivot:
• Pivot is also called rotate.
• It is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data.
• The figures above and below show a pivot operation in which the item and location axes of a 2-D slice are rotated (a small sketch follows the figures).
[Figure: the result of the pivot: item types (home ent. 605, computer 825, phone 14, security 400) now run along one axis and location (cities) along the other.]
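A tiny Python sketch of the pivot (rotate): a 2-D slice stored with cities as rows and item types as columns is transposed so that item types become the rows; the Vancouver row follows the figures above, the Toronto values are made up:

```python
# A 2-D slice (Q1 sales): rows = location (city), columns = item (type).
slice_q1 = {
    "Vancouver": {"home ent.": 605, "computer": 825, "phone": 14, "security": 400},
    "Toronto":   {"home ent.": 395, "computer": 512, "phone": 20, "security": 150},
}

# Pivot (rotate): swap the item and location axes so item types become the rows.
pivoted = {}
for city, by_item in slice_q1.items():
    for item, units in by_item.items():
        pivoted.setdefault(item, {})[city] = units

print(pivoted["computer"])   # {'Vancouver': 825, 'Toronto': 512}
```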
Other OLAP operations:
• Drill across:- Executes queries involving more than one
fact table.
• Drill through:- Makes use of relational SQL facilities to
drill through the bottom level of a data cube down to its
back-end relational tables.
• Ranking the top N or bottom N items in lists.
• Computing moving averages, growth rates, interest, internal rates of return, depreciation, currency conversions, and statistical functions (a small sketch follows this list).
• OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios, variances, etc.
• It can generate summarizations, aggregations etc.
• OLAP supports functional models for forecasting, trend
analysis, and statistical analysis.
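A small Python sketch of two of the computations mentioned above, a trailing moving average and period-over-period growth rates, over a made-up monthly sales series:

```python
# Made-up monthly dollars_sold values.
monthly_sales = [100.0, 120.0, 90.0, 150.0, 180.0, 160.0]

def moving_average(values, window=3):
    """Simple trailing moving average over the given window."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def growth_rates(values):
    """Period-over-period growth rate, e.g. 0.2 means +20% versus the previous period."""
    return [(curr - prev) / prev for prev, curr in zip(values, values[1:])]

print(moving_average(monthly_sales))   # [103.33..., 120.0, 140.0, 163.33...]
print(growth_rates(monthly_sales))     # [0.2, -0.25, 0.666..., 0.2, -0.111...]
```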
Data Warehouse Architecture
[Figure: a three-tier data warehouse architecture. Bottom tier: data from operational databases and external sources is extracted, transformed, loaded and refreshed into the data warehouse and data marts, supported by a metadata repository and a monitor/integrator component. Middle tier: OLAP servers. Top tier: front-end query/report, analysis and data mining tools.]
Top Tier:
• The top tier is a front-end client layer that contains query and reporting tools, analysis tools, and data mining tools (e.g. trend analysis, prediction).
Data Warehouse Models:
• There are three data warehouse models:
1. Enterprise warehouse:
• It collects all of the information about subjects spanning the entire
organization.
• It provides corporate-wide data integration from one or more operational systems or external sources.
• It is cross functional in scope.
• It contains detailed data as well as summarized data.
• It requires extensive business modeling and may take years to
design and build.
2. Data Marts:
• It contains a subset of corporate-wide data that is of value to a specific group of users.
• The scope is limited to specific selected subjects.
• E.g. Marketing data mart may confine its subjects to customer,
item, sales etc.
• The data in a data mart is typically summarized.
• Depending on the source of data, data marts can be categorized
as:
A. Independent:
• These are sourced from data captured from one or more
operational systems or external sources, or from data generated
locally within a particular department.
B. Dependent:
• These are sourced directly from the enterprise data warehouse.
3. Virtual warehouse:
• It is a set of views over operational databases; for efficient query processing, only some of the possible summary views may be materialized.
OLAP SERVER
ROLAP (Relational OLAP):
[Figure: a ROLAP server sits on top of a relational DBMS and serves front-end tools and utilities.]
Advantages:
1. Can handle large amounts of data:
The data size limitation of ROLAP technology is that of the underlying relational database. In other words, ROLAP itself places no limitation on the amount of data.
MOLAP (Multidimensional OLAP):
[Figure: a MOLAP server stores the data in a multidimensional array, e.g. a cube of product (milk, soda, eggs, soap) × date (1-4) × city, accessed through multidimensional front-end tools and utilities.]
Advantages:
1. Excellent performance:
MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.