Unit 3 Notes

The document provides an introduction to data warehouses. It discusses how a data warehouse stores integrated information from across an organization to support decision making. It describes the key components of a data warehouse including the central data warehouse, data marts, and legacy systems. It also discusses dimensional modeling techniques used to design relational data warehouses and categorizes different types of data warehouse users.

UNIT 3 NOTES

DATA WAREHOUSE – INTRODUCTION

A data warehouse (DW) is a database that stores information oriented toward satisfying
decision-making requests. A very frequent problem in enterprises is the inability to access
complete, integrated corporate information that can satisfy those requests.

A paradox occurs: data exists, but information cannot be obtained. In general, a DW is
constructed with the goal of storing and providing all the relevant information that is generated
across the different databases of an enterprise.

A DW is a database with particular features. Concerning the data it contains, it is the result of
the transformation, quality improvement and integration of data that comes from operational
databases. It also includes indicators that are derived from operational data and give it
additional value. Concerning its use, it is expected to support complex queries (summarization,
aggregates, cross-referencing of data), while its maintenance does not involve a transactional
load.

In addition, in a DW environment end users make queries directly against the DW through user-
friendly query tools, instead of accessing information through reports generated by specialists.

Building and maintaining a DW requires solving problems of many different kinds. In this chapter
we concentrate on DW design.

A data warehouse has three main components:

1. A “Central Data Warehouse” or “Operational Data Store (ODS)”, which is a database
organized according to the corporate data model.

2. One or more “data marts”—extracts from the central data warehouse that are organized
according to the particular retrieval requirements of individual users.

3. The “legacy systems” where an enterprise’s data are currently kept.

THE CENTRAL DATA WAREHOUSE

The Central Data Warehouse is just that: a warehouse. All the enterprise’s data are stored
there, “normalized” in order to minimize redundancy and so that each item may be found easily.
This is accomplished by organizing the data according to the enterprise’s corporate data model.
Think of it as a giant grocery store warehouse where the chocolates are kept in one section, the
T-shirts are in another, and the CDs are in a third.

Broadly, the literature describes two different approaches to relational DW design: one that
applies dimensional modeling techniques, and another that is based mainly on the concept of
materialized views.

Dimensional models represent data with a “cube” structure, making the logical data
representation more compatible with OLAP data management. The objectives of dimensional
modeling are:

(i) To produce database structures that are easy for end-users to understand and write queries
against,

(ii) To maximize the efficiency of queries

It achieves these objectives by minimizing the number of tables and relationships between them.
Normalized databases have some characteristics that are appropriate for OLTP systems, but not
for DWs:

1. Their structure is not easy for end users to understand and use. In OLTP systems this is not a
problem because end users usually interact with the database through a layer of software.

2. Data redundancy is minimized. This maximizes efficiency of updates, but tends to penalize
retrievals. Data redundancy is not a problem in DWs because data is not updated on-line.

The basic concepts of dimensional modeling are facts, dimensions and measures:

• A fact is a collection of related data items, consisting of measures and context data. It
  typically represents business items or business transactions.
• A dimension is a collection of data that describes one business dimension. Dimensions
  determine the contextual background for the facts; they are the parameters over which we
  want to perform OLAP.
• A measure is a numeric attribute of a fact, representing the performance or behavior of
  the business relative to the dimensions.
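
As a rough illustration of these three concepts, the sketch below models a sale as a fact whose
context is given by dimension members and whose measures are numeric. The names `SaleFact`,
`product`, `region`, `amount`, `quantity` and `total_by` are hypothetical and chosen only for
this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SaleFact:
    # Context data: the dimension members this fact belongs to.
    product: str
    region: str
    # Measures: numeric attributes of the fact.
    amount: float
    quantity: int


def total_by(facts, dimension, measure):
    """Sum one measure over the members of one dimension."""
    totals = defaultdict(float)
    for fact in facts:
        totals[getattr(fact, dimension)] += getattr(fact, measure)
    return dict(totals)


facts = [
    SaleFact("chocolate", "north", 10.0, 2),
    SaleFact("chocolate", "south", 5.0, 1),
    SaleFact("t-shirt", "north", 20.0, 1),
]
amount_by_product = total_by(facts, "product", "amount")
quantity_by_region = total_by(facts, "region", "quantity")
```

The same facts can be summarized along any dimension, which is the sense in which dimensions
are "the parameters over which we want to perform OLAP".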

GOALS OF DATA WAREHOUSE ARCHITECTURE

A data warehouse exists to serve its users—analysts and decision makers. A data warehouse
must be designed to satisfy the following requirements:

1. Deliver a great user experience—user acceptance is the measure of success.

2. Function without interfering with OLTP systems.

3. Provide a central repository of consistent data.

4. Answer complex queries quickly.

5. Provide a variety of powerful analytical tools such as OLAP and data mining.

Most successful data warehouses that meet these requirements have these common
characteristics:

1. Based on a dimensional model

2. Contain historical data

3. Include both detailed and summarized data

4. Consolidate disparate data from multiple sources while retaining consistency

5. Focus on a single subject such as sales, inventory, or finance

DATA WAREHOUSE USERS

The success of a data warehouse is measured solely by its acceptance by users. Without users,
historical data might as well be archived to magnetic tape and stored in the basement. Successful
data warehouse design starts with understanding the users and their needs.

Data warehouse users can be divided into four categories:

• Statisticians
• Knowledge workers
• Information consumers
• Executives

Each type makes up a portion of the user population as illustrated in this diagram

Statisticians

• There are typically only a handful of statisticians and operations research types in any
  organization.
• Their work can contribute to closed-loop systems that deeply influence the operations and
  profitability of the company.

Knowledge Workers

• A relatively small number of analysts perform the bulk of new queries and analyses
  against the data warehouse.
• These are the users who get the Designer or Analyst versions of user access tools. They
  will figure out how to quantify a subject area. After a few iterations, their queries and
  reports typically get published for the benefit of the Information Consumers.
• Knowledge Workers are often deeply engaged with the data warehouse design and place
  the greatest demands on the ongoing data warehouse operations team for training and
  support.

Information Consumers

• Most users of the data warehouse are Information Consumers; they will probably never
  compose a true ad hoc query.
• They use static or simple interactive reports that others have developed. They usually
  interact with the data warehouse only through the work product of others.
• This group includes a large number of people, and published reports are highly visible.
  Set up a great communication infrastructure for distributing information widely, and
  gather feedback from these users to improve the information sites over time.

Executives

• Executives are a special case of the Information Consumers group.

Process Managers

Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers:

1. Load manager

2. Warehouse manager

3. Query manager

1. Load Manager

• The load manager performs the operations required to extract and load the data into the
  database.
• The size and complexity of a load manager varies between specific solutions, from one
  data warehouse to another.
• The load manager performs the following functions:
  ➢ Extract data from the source system.
  ➢ Fast-load the extracted data into a temporary data store.
  ➢ Perform simple transformations into a structure similar to the one in the data warehouse.

EXTRACT DATA FROM SOURCE

• The data is extracted from the operational databases or the external information
  providers.
• Gateways are the application programs that are used to extract data. A gateway is supported
  by the underlying DBMS and allows the client program to generate SQL to be executed at a
  server.
• Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples
  of gateways.

FAST LOAD

• In order to minimize the total load window, the data needs to be loaded into the
  warehouse in the fastest possible time.
• Transformations affect the speed of data processing.
• It is more effective to load the data into a relational database prior to applying
  transformations and checks.
• Gateway technology is not suitable, since gateways are inefficient when large data volumes
  are involved.

SIMPLE TRANSFORMATIONS

While loading, it may be necessary to perform simple transformations. After completing the
simple transformations, we can do complex checks.

Suppose we are loading EPOS (electronic point of sale) sales transactions. We need to perform
the following steps:

➢ Strip out all the columns that are not required within the warehouse.
➢ Convert all the values to the required data types.
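
The three load manager functions (extract, fast load, simple transformation) can be sketched as
follows. This is only an illustration under stated assumptions: an in-memory SQLite database
stands in for the warehouse DBMS, a small CSV string stands in for the EPOS source system, and
the names `staging_sales`, `sales_clean` and `till_operator` are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical EPOS extract; 'till_operator' is not needed in the warehouse.
SOURCE_CSV = """prod_id,cust_id,time_id,till_operator,quantity_sold,amount
101,7,2024-01-05,anna,2,19.98
102,7,2024-01-05,anna,1,5.00
"""


def extract(source_text):
    """Extract: read raw rows from the source system (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(source_text)))


def fast_load(conn, rows):
    """Fast load: copy rows unchanged, as text, into a temporary store."""
    conn.execute(
        "CREATE TABLE staging_sales (prod_id TEXT, cust_id TEXT, time_id TEXT,"
        " till_operator TEXT, quantity_sold TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO staging_sales VALUES (:prod_id, :cust_id, :time_id,"
        " :till_operator, :quantity_sold, :amount)",
        rows,
    )


def simple_transform(conn):
    """Strip unneeded columns and convert values to the required types."""
    conn.execute(
        "CREATE TABLE sales_clean AS SELECT"
        " CAST(prod_id AS INTEGER) AS prod_id,"
        " CAST(cust_id AS INTEGER) AS cust_id,"
        " time_id,"
        " CAST(quantity_sold AS INTEGER) AS quantity_sold,"
        " CAST(amount AS REAL) AS amount"
        " FROM staging_sales"
    )


conn = sqlite3.connect(":memory:")
fast_load(conn, extract(SOURCE_CSV))
simple_transform(conn)
clean_rows = conn.execute("SELECT * FROM sales_clean ORDER BY prod_id").fetchall()
```

Loading the raw text first and transforming afterwards mirrors the advice above to load into a
relational store before applying transformations and checks.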

2. Warehouse Manager

The warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts.

The size and complexity of a warehouse manager varies between specific solutions.

WAREHOUSE MANAGER ARCHITECTURE

A warehouse manager includes the following:

➢ The controlling process
➢ Stored procedures or C with SQL
➢ Backup/Recovery tool
➢ SQL scripts

FUNCTIONS OF WAREHOUSE MANAGER

A warehouse manager performs the following functions:

➢ Analyzes the data to perform consistency and referential integrity checks.
➢ Creates indexes, business views, and partition views against the base data.
➢ Generates new aggregations and updates the existing aggregations.
➢ Generates normalizations.
➢ Transforms and merges the source data of the temporary store into the published data
  warehouse.
➢ Backs up the data in the data warehouse.
➢ Archives the data that has reached the end of its captured life.

Note: A warehouse manager analyzes query profiles to determine whether the indexes and
aggregations are appropriate.
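
Three of these functions (the referential integrity check, the merge from the temporary store,
and the generation of an aggregation) might look like the sketch below. It assumes an in-memory
SQLite database in place of the real warehouse; the table names `staging_sales`, `sales`,
`products` and `sales_by_product` and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_name TEXT);
    CREATE TABLE staging_sales (prod_id INTEGER, time_id TEXT, amount REAL);
    CREATE TABLE sales (prod_id INTEGER, time_id TEXT, amount REAL);
    INSERT INTO products VALUES (1, 'chocolate'), (2, 't-shirt');
    INSERT INTO staging_sales VALUES
        (1, '2024-01-05', 10.0),
        (2, '2024-01-05', 20.0),
        (1, '2024-01-06', 5.0),
        (9, '2024-01-06', 99.0);  -- no matching product
""")


def orphan_rows(conn):
    """Referential integrity check: staging rows with no matching product."""
    return conn.execute(
        "SELECT s.prod_id, s.time_id, s.amount FROM staging_sales s"
        " LEFT JOIN products p ON p.prod_id = s.prod_id"
        " WHERE p.prod_id IS NULL"
    ).fetchall()


def publish(conn):
    """Merge staging rows that pass the check into the published table."""
    conn.execute(
        "INSERT INTO sales SELECT s.* FROM staging_sales s"
        " JOIN products p ON p.prod_id = s.prod_id"
    )


def rebuild_aggregation(conn):
    """Regenerate a summary table from the published detail rows."""
    conn.execute("DROP TABLE IF EXISTS sales_by_product")
    conn.execute(
        "CREATE TABLE sales_by_product AS"
        " SELECT prod_id, SUM(amount) AS total_amount FROM sales GROUP BY prod_id"
    )


rejected = orphan_rows(conn)
publish(conn)
rebuild_aggregation(conn)
summary = conn.execute("SELECT * FROM sales_by_product ORDER BY prod_id").fetchall()
```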

3. Query Manager

The query manager is responsible for directing the queries to suitable tables. By directing the
queries to appropriate tables, it speeds up the query request and response process.

In addition, the query manager is responsible for scheduling the execution of the queries posted
by the user.

QUERY MANAGER ARCHITECTURE

A query manager includes the following components:

➢ Query redirection via C tool or RDBMS
➢ Stored procedures
➢ Query management tool
➢ Query scheduling via C tool or RDBMS
➢ Query scheduling via third-party software

FUNCTIONS OF QUERY MANAGER

➢ It presents the data to the user in a form they understand.
➢ It schedules the execution of the queries posted by the end user.
➢ It stores query profiles to allow the warehouse manager to determine which indexes and
  aggregations are appropriate.
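
One way to picture query redirection and query profiling is the sketch below: a query is sent
to the smallest table that contains every column it needs, and each request is counted so the
warehouse manager could review it later. The catalog contents, the table names and the `route`
function are assumptions made for this example, not part of any particular product.

```python
from collections import Counter

# Hypothetical catalog: the columns each table can answer on, plus an
# approximate row count used to prefer smaller (aggregated) tables.
CATALOG = {
    "sales": {
        "columns": {"prod_id", "cust_id", "time_id", "amount"},
        "rows": 100_000_000,
    },
    "sales_by_product_month": {
        "columns": {"prod_id", "month", "amount"},
        "rows": 50_000,
    },
    "sales_by_month": {
        "columns": {"month", "amount"},
        "rows": 120,
    },
}

# Query profiles: how often each combination of columns was requested.
query_profiles = Counter()


def route(needed_columns):
    """Return the smallest table that contains every needed column."""
    needed = set(needed_columns)
    candidates = [
        name for name, info in CATALOG.items() if needed <= info["columns"]
    ]
    if not candidates:
        raise LookupError(f"no table covers {sorted(needed)}")
    target = min(candidates, key=lambda name: CATALOG[name]["rows"])
    query_profiles[frozenset(needed)] += 1
    return target


monthly_target = route(["month", "amount"])
detail_target = route(["cust_id", "amount"])
```

A monthly total is answered from the 120-row summary rather than the detail table, which is
the speed-up that redirection is meant to provide.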

DATA WAREHOUSING OBJECTS

The following types of objects are commonly used in dimensional data warehouse schemas:

FACT TABLES

• Fact tables are the large tables in your warehouse schema that store business
  measurements.
• Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables
  represent data, usually numeric and additive, that can be analyzed and examined.
  Examples include sales, cost, and profit.

DIMENSION TABLES

• Dimension tables, also known as lookup or reference tables, contain the relatively static
  data in the warehouse.
• Dimension tables store the information you normally use to constrain queries.
• Dimension tables are usually textual and descriptive, and you can use them as the row
  headers of the result set.
• Examples are Customers, Location, Time, Suppliers or Products.

Fact Tables

• A fact table typically has two types of columns: those that contain numeric facts (often
  called measurements), and those that are foreign keys to dimension tables.
• A fact table contains either detail-level facts or facts that have been aggregated. Fact
  tables that contain aggregated facts are often called SUMMARY TABLES.
• A fact table usually contains facts with the same level of aggregation.
• Though most facts are additive, they can also be semi-additive or non-additive.
• Additive facts can be aggregated by simple arithmetical addition. A common example of this
  is sales.
• Non-additive facts cannot be added at all. An example of this is averages.
• Semi-additive facts can be aggregated along some of the dimensions and not along others.
  An example of this is inventory levels, where you cannot tell what a level means simply
  by looking at it.
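
The difference between the three kinds of facts can be seen with a few made-up rows. In this
sketch the tuple layout and all values are hypothetical; taking the latest stock level is just
one possible rule for combining a semi-additive fact over time.

```python
# Hypothetical rows: (day, product, sales_amount, units_in_stock)
rows = [
    ("mon", "chocolate", 10.0, 100),
    ("tue", "chocolate", 30.0, 90),
    ("tue", "t-shirt", 20.0, 40),
    ("tue", "cd", 40.0, 25),
]

# Additive: sales can be summed across every dimension.
total_sales = sum(r[2] for r in rows)

# Semi-additive: stock can be summed across products for one day...
stock_on_tue = sum(r[3] for r in rows if r[0] == "tue")
# ...but summing one product's stock across days counts the same items
# repeatedly, so another rule (here: the latest level) is used over time.
chocolate_stock_latest = [r[3] for r in rows if r[1] == "chocolate"][-1]

# Non-additive: averages cannot be added or averaged again; they are
# recomputed from additive parts (a sum and a count).
daily_avg = {}
for day in ("mon", "tue"):
    amounts = [r[2] for r in rows if r[0] == day]
    daily_avg[day] = sum(amounts) / len(amounts)
average_of_averages = sum(daily_avg.values()) / len(daily_avg)
overall_avg = total_sales / len(rows)
```

Here the average of the daily averages differs from the true overall average, which is why
averages are treated as non-additive.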

Creating a new fact table

• You must define a fact table for each star schema.
• From a modeling standpoint, the primary key of the fact table is usually a composite key
  that is made up of all of its foreign keys.
• Fact tables contain business event details for summarization. Fact tables are often very
  large, containing hundreds of millions of rows and consuming hundreds of gigabytes or
  multiple terabytes of storage.
• Because dimension tables contain records that describe facts, the fact table can be
  reduced to columns for dimension foreign keys and numeric fact values. Text, BLOBs,
  and denormalized data are typically not stored in the fact table.

The definition of the ‘sales’ fact table follows:

CREATE TABLE sales (
  -- Key columns
  prod_id       NUMBER(7)    CONSTRAINT sales_product_nn  NOT NULL,
  cust_id       NUMBER       CONSTRAINT sales_customer_nn NOT NULL,
  time_id       DATE         CONSTRAINT sales_time_nn     NOT NULL,
  ad_id         NUMBER(7),
  -- Numeric facts (measures)
  quantity_sold NUMBER(4)    CONSTRAINT sales_quantity_nn NOT NULL,
  amount        NUMBER(10,2) CONSTRAINT sales_amount_nn   NOT NULL,
  cost          NUMBER(10,2) CONSTRAINT sales_cost_nn     NOT NULL
);

Multiple Fact Tables

• Multiple fact tables are used in data warehouses that address multiple business functions,
  such as sales, inventory, and finance.
• Each business function should have its own fact table and will probably have some
  unique dimension tables.
• Any dimensions that are common across the business functions must represent the
  dimension information in the same way, as discussed earlier in “Dimension Tables.”
• Each business function will typically have its own schema that contains a fact table,
  several conforming dimension tables, and some dimension tables unique to the specific
  business function.
• Such business-specific schemas may be part of the central data warehouse or
  implemented as data marts. Very large fact tables may be physically partitioned for
  implementation and maintenance design considerations.
• The partition divisions are almost always along a single dimension, and the time
  dimension is the most common one to use because of the historical nature of most data
  warehouse data.
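
Partitioning along the time dimension can be sketched as below. Real systems partition inside
the DBMS; this example only shows the idea of grouping fact rows by month so that old data can
be dropped a whole partition at a time. The naming scheme `sales_YYYY_MM` and the helper
functions are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def partition_key(time_id):
    """Name the monthly partition a fact row belongs to."""
    return f"sales_{time_id.year:04d}_{time_id.month:02d}"


def partition_rows(rows):
    """Group fact rows into monthly partitions by their time_id."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["time_id"])].append(row)
    return dict(partitions)


def drop_before(partitions, cutoff_key):
    """Keep only partitions at or after the cutoff (zero-padded keys sort by date)."""
    return {key: rows for key, rows in partitions.items() if key >= cutoff_key}


rows = [
    {"time_id": date(2023, 12, 31), "amount": 5.0},
    {"time_id": date(2024, 1, 5), "amount": 10.0},
    {"time_id": date(2024, 1, 20), "amount": 20.0},
]
partitions = partition_rows(rows)
recent = drop_before(partitions, "sales_2024_01")
```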

Dimension Tables

• A dimension is a structure, often composed of one or more hierarchies, that categorizes
  data.
• Dimensional attributes help to describe the dimensional value. They are normally
  descriptive, textual values. Several distinct dimensions, combined with facts, enable you
  to answer business questions.
• Commonly used dimensions are customers, products, and time. Dimension data is
  typically collected at the lowest level of detail and then aggregated into higher-level
  totals that are more useful for analysis.
• These natural rollups or aggregations within a dimension table are called hierarchies.
• A dimension table may be used in multiple places if the data warehouse contains multiple
  fact tables or contributes data to data marts.
• A dimension such as customer, time, or product that is used in multiple schemas is called
  a conforming dimension if all copies of the dimension are the same. Summarization data
  and reports will not correspond if different schemas use different versions of a dimension
  table.

The definition of the ‘customers’ dimension table follows:

CREATE TABLE customers (
  cust_id             NUMBER,
  cust_first_name     VARCHAR2(20) CONSTRAINT customer_fname_nn      NOT NULL,
  cust_last_name      VARCHAR2(40) CONSTRAINT customer_lname_nn      NOT NULL,
  cust_sex            CHAR(1),
  cust_year_of_birth  NUMBER(4),
  cust_marital_status VARCHAR2(20),
  cust_street_address VARCHAR2(40) CONSTRAINT customer_st_addr_nn    NOT NULL,
  cust_postal_code    VARCHAR2(10) CONSTRAINT customer_pcode_nn      NOT NULL,
  cust_city           VARCHAR2(30) CONSTRAINT customer_city_nn       NOT NULL,
  cust_state_district VARCHAR2(40),
  country_id          CHAR(2)      CONSTRAINT customer_country_id_nn NOT NULL,
  cust_phone_number   VARCHAR2(25),
  cust_income_level   VARCHAR2(30),
  cust_credit_limit   NUMBER,
  cust_email          VARCHAR2(30)
);

CREATE DIMENSION products_dim
  LEVEL product     IS (products.prod_id)
  LEVEL subcategory IS (products.prod_subcategory)
  LEVEL category    IS (products.prod_category)
  HIERARCHY prod_rollup (
    product     CHILD OF
    subcategory CHILD OF
    category
  )
  ATTRIBUTE product     DETERMINES products.prod_name
  ATTRIBUTE product     DETERMINES products.prod_desc
  ATTRIBUTE subcategory DETERMINES products.prod_subcat_desc
  ATTRIBUTE category    DETERMINES products.prod_cat_desc;


• The records in a dimension table establish one-to-many relationships with the fact table.
• For example, there may be a number of sales to a single customer, or a number of sales of
  a single product.
• The dimension table contains attributes associated with the dimension entry; these
  attributes are rich and user-oriented textual details, such as product name or customer
  name and address.
• Attributes serve as report labels and query constraints. Attributes that are coded in an
  OLTP database should be decoded into descriptions.
• For example, product category may exist as a simple integer in the OLTP database, but
  the dimension table should contain the actual text for the category.
• The code may also be carried in the dimension table if needed for maintenance. This
  denormalization simplifies and improves the efficiency of queries and simplifies user
  query tools.
• However, if a dimension attribute changes frequently, maintenance may be easier if the
  attribute is assigned to its own table to create a snowflake dimension.

Hierarchies

• The data in a dimension is usually hierarchical in nature. Hierarchies are determined by
  the business need to group and summarize data into usable information. For example, a
  time dimension often contains the hierarchy elements: (all time), Year, Quarter, Month,
  Day or Week.
• A dimension may contain multiple hierarchies – a time dimension often contains both
  calendar and fiscal year hierarchies.
• Geography is seldom a dimension of its own; it is usually a hierarchy that imposes a
  structure on sales points, customers, or other geographically distributed dimensions.
• An example geography hierarchy for sales points is: (all), country, region, state or
  district, city, store.
• Level relationships specify top-to-bottom ordering of levels from most general (the root)
  to most specific information.
• They define the parent-child relationship between the levels in a hierarchy. Hierarchies
  are also essential components in enabling more complex query rewrites.
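
Rolling a measure up a hierarchy can be sketched as follows, using a calendar hierarchy of day,
month, year and (all time). The `LEVELS` mapping and the `roll_up` function are hypothetical
names for this example, and the sales figures are invented.

```python
from collections import defaultdict
from datetime import date

# Calendar hierarchy, from most specific to most general. Each level maps
# a day to the label of its ancestor at that level.
LEVELS = {
    "day": lambda d: d.isoformat(),
    "month": lambda d: f"{d.year:04d}-{d.month:02d}",
    "year": lambda d: f"{d.year:04d}",
    "all": lambda d: "(all time)",
}


def roll_up(daily_sales, level):
    """Aggregate day-level sales to the requested hierarchy level."""
    totals = defaultdict(float)
    for day, amount in daily_sales.items():
        totals[LEVELS[level](day)] += amount
    return dict(totals)


daily_sales = {
    date(2023, 12, 31): 5.0,
    date(2024, 1, 1): 10.0,
    date(2024, 1, 2): 20.0,
    date(2024, 2, 1): 40.0,
}
by_month = roll_up(daily_sales, "month")
by_year = roll_up(daily_sales, "year")
```

Because data is collected at the lowest level (the day), every higher level can be derived
from it, while the reverse is not possible.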

Multi-use dimensions

• Sometimes data warehouse design can be simplified by combining a number of small,
  unrelated dimensions into a single physical dimension, often called a junk dimension.
• This can greatly reduce the size of the fact table by reducing the number of foreign keys
  in fact table records. Often the combined dimension will be prepopulated with the
  Cartesian product of all dimension values.
• If the number of discrete values creates a very large table of all possible value
  combinations, the table can be populated with value combinations as they are
  encountered during the load or update process.
• A common example of a multi-use dimension is a dimension that contains customer
  demographics selected for reporting standardization.
• Another multi-use dimension might contain useful textual comments that occur
  infrequently in the source data records; collecting these comments in a single dimension
  removes a sparse text field from the fact table and replaces it with a compact foreign key.
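
Both ways of populating a junk dimension can be sketched briefly: prepopulating it with the
Cartesian product of all values, and adding combinations only as they are encountered. The
three flag sets and the names `junk_dimension`, `junk_key` and `LazyJunkDimension` are
hypothetical.

```python
from itertools import product

# Hypothetical low-cardinality attributes that would otherwise each need
# their own foreign key in the fact table.
PAYMENT_TYPES = ("cash", "card")
GIFT_WRAPPED = (False, True)
CHANNELS = ("store", "web", "phone")

# Option 1: prepopulate with the Cartesian product of all values.
combos = product(PAYMENT_TYPES, GIFT_WRAPPED, CHANNELS)
junk_dimension = {combo: key for key, combo in enumerate(combos, start=1)}


def junk_key(payment_type, gift_wrapped, channel):
    """Look up the single surrogate key that replaces three foreign keys."""
    return junk_dimension[(payment_type, gift_wrapped, channel)]


# Option 2: for large value spaces, add combinations as they are encountered.
class LazyJunkDimension:
    def __init__(self):
        self.keys = {}

    def key_for(self, *combo):
        # A new surrogate key is assigned only to combinations not seen before.
        return self.keys.setdefault(combo, len(self.keys) + 1)


first_key = junk_key("cash", False, "store")
last_key = junk_key("card", True, "phone")
```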

DATA WAREHOUSING SCHEMAS

• A schema is a collection of database objects, including tables, views, indexes, and
  synonyms.
• You can arrange schema objects in the schema models designed for data warehousing in
  a variety of ways.
• Most data warehouses use a dimensional model. The model of your source data and the
  requirements of your users help you design the data warehouse schema.
• You can sometimes get the source model from your company’s enterprise data model and
  reverse-engineer the logical data model for the data warehouse from this.
• The physical implementation of the logical data warehouse model may require some
  changes to adapt it to your system parameters: size of machine, number of users, storage
  capacity, type of network, and software.

Dimensional Model Schemas

• The principal characteristic of a dimensional model is a set of detailed business facts
  surrounded by multiple dimensions that describe those facts.
• When realized in a database, the schema for a dimensional model contains a central fact
  table and multiple dimension tables.
• A dimensional model may produce a star schema or a snowflake schema.

Star Schemas

• A schema is called a star schema if all dimension tables can be joined directly to the fact
  table.
• The following diagram shows a classic star schema. In the star schema design, a single
  object (the fact table) sits in the middle and is radially connected to other surrounding
  objects (dimension lookup tables) like a star.
• A star schema can be simple or complex. A simple star consists of one fact table; a
  complex star can have more than one fact table.

Steps in Designing Star Schema

• Identify a business process for analysis (like sales).
• Identify measures or facts (sales dollar).
• Identify dimensions for facts (product dimension, location dimension, time dimension,
  organization dimension).
• List the columns that describe each dimension (region name, branch name, subregion
  name).
• Determine the lowest level of summary in a fact table (sales dollar).

Figure: Star schema with time dimension
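
A small star schema for a sales process, with a product dimension and a time dimension, might
be sketched as follows. An in-memory SQLite database is assumed here purely so the example is
runnable; the table layouts and data are hypothetical and much simpler than the Oracle
definitions shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one row per member.
    CREATE TABLE products (prod_id INTEGER PRIMARY KEY, prod_name TEXT,
                           prod_category TEXT);
    CREATE TABLE times (time_id TEXT PRIMARY KEY, month TEXT, year INTEGER);
    -- Fact table: a foreign key to each dimension plus a numeric measure.
    CREATE TABLE sales (
        prod_id INTEGER NOT NULL REFERENCES products (prod_id),
        time_id TEXT NOT NULL REFERENCES times (time_id),
        amount REAL NOT NULL
    );
    INSERT INTO products VALUES (1, 'dark bar', 'chocolate'),
                                (2, 'milk bar', 'chocolate'),
                                (3, 'band tee', 't-shirt');
    INSERT INTO times VALUES ('2024-01-05', '2024-01', 2024),
                             ('2024-02-10', '2024-02', 2024);
    INSERT INTO sales VALUES (1, '2024-01-05', 10.0), (2, '2024-01-05', 5.0),
                             (3, '2024-01-05', 20.0), (1, '2024-02-10', 12.0);
""")

# In a star schema every dimension joins directly to the fact table.
sales_by_category_month = conn.execute("""
    SELECT p.prod_category, t.month, SUM(s.amount)
    FROM sales s
    JOIN products p ON p.prod_id = s.prod_id
    JOIN times t ON t.time_id = s.time_id
    GROUP BY p.prod_category, t.month
    ORDER BY p.prod_category, t.month
""").fetchall()
```

The query constrains and groups by dimension attributes and sums the measure from the fact
table, using one join per dimension.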

Snowflake Schemas

• A schema is called a snowflake schema if one or more dimension tables do not join
  directly to the fact table but must join through other dimension tables.
• For example, a dimension that describes products may be separated into three tables
  (snowflaked).
• The snowflake schema is an extension of the star schema where each point of the star
  explodes into more points.
• The main advantage of the snowflake schema is the improvement in query performance
  that can come from minimized disk storage requirements and joining smaller lookup
  tables, although the additional joins can also slow some queries down.
• The main disadvantage of the snowflake schema is the additional maintenance effort
  needed due to the increased number of lookup tables.

Important Aspects of Star Schema & Snowflake Schema

• In a star schema, every dimension will have a primary key.
• In a star schema, a dimension table will not have any parent table.
• In a snowflake schema, by contrast, a dimension table will have one or more parent tables.
• In a star schema, hierarchies for the dimensions are stored in the dimension table itself.
• In a snowflake schema, hierarchies are broken into separate tables. These
  hierarchies help to drill down the data from the topmost hierarchies to the lowermost
  hierarchies.
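
The contrast can be made concrete with a product dimension snowflaked into three tables, as in
the example mentioned above. This sketch again assumes an in-memory SQLite database, and the
tables `categories`, `subcategories`, `products` and `sales` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The product hierarchy is broken into separate tables; each child
    -- table points at its parent table.
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE subcategories (
        subcategory_id INTEGER PRIMARY KEY, subcategory_name TEXT,
        category_id INTEGER REFERENCES categories (category_id));
    CREATE TABLE products (
        prod_id INTEGER PRIMARY KEY, prod_name TEXT,
        subcategory_id INTEGER REFERENCES subcategories (subcategory_id));
    CREATE TABLE sales (prod_id INTEGER REFERENCES products (prod_id), amount REAL);
    INSERT INTO categories VALUES (1, 'food'), (2, 'clothing');
    INSERT INTO subcategories VALUES (10, 'chocolate', 1), (20, 't-shirts', 2);
    INSERT INTO products VALUES (100, 'dark bar', 10), (200, 'band tee', 20);
    INSERT INTO sales VALUES (100, 10.0), (100, 5.0), (200, 20.0);
""")

# Only 'products' joins directly to the fact table; reaching the category
# level means joining through the parent tables.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(s.amount)
    FROM sales s
    JOIN products p ON p.prod_id = s.prod_id
    JOIN subcategories sc ON sc.subcategory_id = p.subcategory_id
    JOIN categories c ON c.category_id = sc.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```

In a star schema the category name would be a column of the product dimension itself, so the
same question would need a single join.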
