Unit 3 Notes
A Data Warehouse (DW) is a database that stores information organized to satisfy decision-making
requests. A very frequent problem in enterprises is the inability to access complete, integrated,
corporate-wide information that can satisfy those requests.
A DW is a database with particular features. Concerning the data it contains, it is the result of
transformations, quality improvement and integration of data that comes from operational bases.
Besides, it includes indicators that are derived from operational data and give it additional value.
Concerning its utilization, it is supposed to support complex queries (summarization, aggregates,
crossing of data), while its maintenance does not involve a transactional load.
In addition, in a DW environment end users make queries directly against the DW through user-
friendly query tools, instead of accessing information through reports generated by specialists.
Building and maintaining a DW requires solving problems of many different kinds. In this chapter
we concentrate on DW design.
2. One or more “data marts”—extracts from the central data warehouse that are organized
according to the particular retrieval requirements of individual users.
The Central Data Warehouse is just that: a warehouse. All the enterprise’s data are stored
there, “normalized”, in order to minimize redundancy and so that each item may be found easily.
This is accomplished by organizing the data according to the enterprise’s corporate data model.
Think of it as a giant grocery store warehouse where the chocolates are kept in one section, the
T-shirts are in another, and the CDs are in a third.
The literature describes two broad approaches to relational DW design:
one that applies dimensional modeling techniques, and another that is based mainly on the concept
of materialized views.
Dimensional models represent data with a “cube” structure, making the logical data
representation more compatible with OLAP data management. The objectives of dimensional
modeling are:
(i) To produce database structures that are easy for end-users to understand and write queries
against,
It achieves these objectives by minimizing the number of tables and relationships between them.
Normalized databases have some characteristics that are appropriate for OLTP systems, but not
for DWs:
1. Their structure is not easy for end-users to understand and use. In OLTP systems this is not a
problem because end-users usually interact with the database through a layer of software.
2. Data redundancy is minimized. This maximizes efficiency of updates, but tends to penalize
retrievals. Data redundancy is not a problem in DWs because data is not updated on-line.
The basic concepts of dimensional modeling are facts, dimensions, and measures.
A fact is a collection of related data items, consisting of measures and context data. It
typically represents business items or business transactions.
A dimension is a collection of data that describes one business dimension. Dimensions
determine the contextual background for the facts; they are the parameters over which we
want to perform OLAP.
A measure is a numeric attribute of a fact, representing the performance or behavior of
the business relative to the dimensions.
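The three concepts can be made concrete with a small sketch. The record layout below (a sale with product, region, and month as dimensions, and units and revenue as measures) is an assumption chosen for illustration, not a schema taken from these notes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SaleFact:
    """One fact: context data (dimension values) plus numeric measures."""
    product: str    # dimension value
    region: str     # dimension value
    month: str      # dimension value
    units: int      # measure
    revenue: float  # measure


def total_by(facts, dimension, measure):
    """Sum one measure, grouped by the values of one dimension."""
    totals = defaultdict(float)
    for fact in facts:
        totals[getattr(fact, dimension)] += getattr(fact, measure)
    return dict(totals)


# Hypothetical sample facts.
facts = [
    SaleFact("chocolate", "north", "2024-01", 10, 25.0),
    SaleFact("chocolate", "south", "2024-01", 4, 10.0),
    SaleFact("t-shirt", "north", "2024-02", 2, 30.0),
]

revenue_by_product = total_by(facts, "product", "revenue")
units_by_region = total_by(facts, "region", "units")
```

The dimensions act as the parameters of the analysis: changing the `dimension` argument changes the context over which the same measure is summarized.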
A data warehouse exists to serve its users—analysts and decision makers. A data warehouse
must be designed to satisfy the following requirements:
5. Provide a variety of powerful analytical tools such as OLAP and data mining.
Most successful data warehouses that meet these requirements have these common
characteristics:
The success of a data warehouse is measured solely by its acceptance by users. Without users,
historical data might as well be archived to magnetic tape and stored in the basement. Successful
data warehouse design starts with understanding the users and their needs.
Each type makes up a portion of the user population as illustrated in this diagram
Statisticians
There are typically only a handful of statisticians and operations research types in any
organization.
Their work can contribute to closed loop systems that deeply influence the operations and
profitability of the company.
Knowledge Workers
A relatively small number of analysts perform the bulk of new queries and analyses
against the data warehouse.
These are the users who get the Designer or Analyst versions of user access tools. They
will figure out how to quantify a subject area. After a few iterations, their queries and
reports typically get published for the benefit of the Information Consumers.
Knowledge Workers are often deeply engaged with the data warehouse design and place
the greatest demands on the ongoing data warehouse operations team for training and
support.
Information Consumers
Most users of the data warehouse are Information Consumers; they will probably never
compose a true ad hoc query.
They use static or simple interactive reports that others have developed. They usually
interact with the data warehouse only through the work product of others.
This group includes a large number of people, and published reports are highly visible.
Set up a great communication infrastructure for distributing information widely, and
gather feedback from these users to improve the information sites over time.
Executives:
Process Managers
Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers −
1. Load manager
2. Warehouse manager
3. Query manager
1. Load Manager
Load manager performs the operations required to extract and load the data into the
database.
The size and complexity of a load manager varies between specific solutions from one
data warehouse to another.
The load manager performs the following functions −
Extract data from the source system.
Fast load the extracted data into a temporary data store.
Perform simple transformations into a structure similar to the one in the data warehouse.
The data is extracted from the operational databases or the external information
providers.
Gateways are the application programs that are used to extract data.
A gateway is supported by the underlying DBMS and allows the client program to generate SQL to be
executed at a server.
Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples
of gateways.
FAST LOAD
In order to minimize the total load window, the data needs to be loaded into the
warehouse in the fastest possible time.
Transformations affect the speed of data processing.
It is more effective to load the data into a relational database prior to applying
transformations and checks.
Gateway technology is not suitable, since it is inefficient when large data volumes
are involved.
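The load-first, transform-later idea can be sketched with an in-memory SQLite database. This is a minimal illustration under stated assumptions: the table names (staging_sales, sales), the column layout, and the sample rows are all hypothetical, and a production load manager would normally use the bulk loader of its own DBMS.

```python
import sqlite3

# Hypothetical raw extract: every field arrives as unchecked text.
raw_rows = [
    ("1001", "2024-01-05", "chocolate", "3", "7.50"),
    ("1002", "2024-01-05", "t-shirt", "1", "15.00"),
    ("1003", "2024-01-06", "cd", "oops", "9.99"),  # bad quantity
]

conn = sqlite3.connect(":memory:")
# Staging table: no constraints or type checks, so the load itself stays cheap.
conn.execute(
    "CREATE TABLE staging_sales "
    "(txn_id TEXT, sale_date TEXT, product TEXT, qty TEXT, amount TEXT)"
)
conn.execute(
    "CREATE TABLE sales (txn_id INTEGER PRIMARY KEY, sale_date TEXT NOT NULL, "
    "product TEXT NOT NULL, qty INTEGER NOT NULL, amount REAL NOT NULL)"
)

# Step 1: bulk load in a single transaction, with no per-row transformation.
with conn:
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?, ?, ?)", raw_rows)

# Step 2: transform and check inside the database, after the load has finished.
with conn:
    conn.execute(
        """
        INSERT INTO sales
        SELECT CAST(txn_id AS INTEGER), sale_date, product,
               CAST(qty AS INTEGER), CAST(amount AS REAL)
        FROM staging_sales
        WHERE qty <> '' AND qty NOT GLOB '*[^0-9]*'  -- keep all-digit quantities only
        """
    )

loaded = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
accepted = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

All three rows reach the staging table quickly; the row with the malformed quantity is filtered out only in the second step, when the checks run as a set-based operation inside the database.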
SIMPLE TRANSFORMATIONS
While loading, it may be required to perform simple transformations. After completing simple
transformations, we can do complex checks.
Suppose we are loading EPOS sales transactions. We need to perform the following checks −
Strip out all the columns that are not required within the warehouse.
Convert all the values to required data types.
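The two checks listed above can be sketched as a single function. The column names and target types below are assumptions made for illustration; a real EPOS feed would have its own layout.

```python
from datetime import date
from decimal import Decimal

# Hypothetical target layout: column name -> converter to the required type.
WAREHOUSE_COLUMNS = {
    "txn_id": int,
    "sale_date": date.fromisoformat,
    "store_id": int,
    "amount": Decimal,
}


def simple_transform(record):
    """Drop columns the warehouse does not need and convert the rest."""
    return {
        name: convert(record[name])
        for name, convert in WAREHOUSE_COLUMNS.items()
    }


# Hypothetical raw EPOS record, with every value still a string.
epos_record = {
    "txn_id": "1001",
    "sale_date": "2024-01-05",
    "store_id": "17",
    "amount": "7.50",
    "till_operator": "jsmith",      # not required within the warehouse
    "receipt_footer": "Thank you",  # not required within the warehouse
}

clean = simple_transform(epos_record)
```

Because the function iterates over the target columns rather than the input columns, any field not listed in `WAREHOUSE_COLUMNS` is stripped automatically, and a value that cannot be converted raises an error instead of being loaded silently.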
2. Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of
third-party system software, C programs, and shell scripts.
The size and complexity of a warehouse manager varies between specific solutions.
Note: A warehouse manager analyzes query profiles to determine whether the indexes and
aggregations are appropriate.
3. Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the
queries to appropriate tables, it speeds up the query request and response process.
In addition, the query manager is responsible for scheduling the execution of the queries posted
by the user.
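One way to picture the routing step is a lookup that picks the smallest table able to answer a query. This is only a sketch: the table names and the dimension sets in the catalogue are hypothetical, and scheduling is left out.

```python
# Hypothetical catalogue: each table and the dimensions it can group by.
# Tables are listed from most aggregated (smallest) to most detailed.
TABLES = [
    ("sales_by_month", {"month"}),
    ("sales_by_month_region", {"month", "region"}),
    ("sales_detail", {"day", "month", "region", "product", "customer"}),
]


def route_query(group_by):
    """Return the first (smallest) table that covers every requested dimension."""
    wanted = set(group_by)
    for table, dimensions in TABLES:
        if wanted <= dimensions:
            return table
    raise LookupError(f"no table can answer a query grouped by {sorted(wanted)}")
```

A query grouped by month alone is sent to the small summary table, while a query that needs product detail falls through to the detail-level table.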
The following types of objects are commonly used in dimensional data warehouse schemas:
FACT TABLES
Fact tables are the large tables in your warehouse schema that store business
measurements.
Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables
represent data, usually numeric and additive, that can be analyzed and examined.
Examples include sales, cost, and profit.
DIMENSION TABLES
Dimension tables, also known as lookup or reference tables, contain the relatively static
data in the warehouse.
Dimension tables store the information you normally use to constrain queries.
Dimension tables are usually textual and descriptive and you can use them as the row
headers of the result set.
Examples are customers, locations, time, suppliers, or products.
Fact Tables
A fact table typically has two types of columns: those that contain numeric facts (often
called measurements), and those that are foreign keys to dimension tables.
A fact table contains either detail-level facts or facts that have been aggregated. Fact
tables that contain aggregated facts are often called SUMMARY TABLES.
A fact table usually contains facts with the same level of aggregation.
Though most facts are additive, they can also be semi-additive or non-additive. Additive
facts can be aggregated by simple arithmetical addition.
A common example of this is sales. Non-additive facts cannot be added at all.
An example of this is averages. Semi-additive facts can be aggregated along some of the
dimensions and not along others.
An example of this is inventory levels, which can be summed across stores or products but
not across time periods.
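The difference between the three kinds of fact shows up as soon as the numbers are aggregated. The rows below are hypothetical daily snapshots, used only to illustrate the distinction.

```python
# Hypothetical rows: (day, store, units_sold, stock_on_hand, unit_price).
rows = [
    ("mon", "A", 5, 100, 2.0),
    ("mon", "B", 3, 50, 4.0),
    ("tue", "A", 7, 93, 2.0),
    ("tue", "B", 1, 49, 4.0),
]

# Additive: units sold can be summed across every dimension.
total_units = sum(r[2] for r in rows)

# Semi-additive: stock can be summed across stores for a single day...
stock_on_tue = sum(r[3] for r in rows if r[0] == "tue")
# ...but summing one store's stock across days counts the same goods twice.
misleading_stock_a = sum(r[3] for r in rows if r[1] == "A")

# Non-additive: an average price cannot be added up. It has to be
# recomputed from additive parts (revenue and units) at each level.
revenue = sum(r[2] * r[4] for r in rows)
avg_price = revenue / total_units
```

Here `misleading_stock_a` is 193 even though store A never held more than 100 units, which is why a semi-additive fact is usually reported across time with a different function, such as the closing value or an average.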
Creating a new fact table
NOT NULL,
Multiple fact tables are used in data warehouses that address multiple business functions,
such as sales, inventory, and finance.
Each business function should have its own fact table and will probably have some
unique dimension tables.
Any dimensions that are common across the business functions must represent the
dimension information in the same way, as discussed earlier in “Dimension Tables.”
Each business function will typically have its own schema that contains a fact table,
several conforming dimension tables, and some dimension tables unique to the specific
business function.
Such business-specific schemas may be part of the central data warehouse or
implemented as data marts. Very large fact tables may be physically partitioned for
implementation and maintenance design considerations.
The partition divisions are almost always along a single dimension, and the time
dimension is the most common one to use because of the historical nature of most data
warehouse data.
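Partitioning along the time dimension can be pictured as follows. This sketch keeps the partitions in a Python dictionary purely for illustration; the field names are assumptions, and a real warehouse would rely on the partitioning feature of its DBMS.

```python
from collections import defaultdict


def partition_by_month(fact_rows):
    """Group fact rows into partitions keyed by the 'YYYY-MM' prefix of sale_date."""
    partitions = defaultdict(list)
    for row in fact_rows:
        partitions[row["sale_date"][:7]].append(row)
    return dict(partitions)


def total_for_months(partitions, months):
    """Scan only the partitions for the requested months, skipping all others."""
    return sum(
        row["amount"]
        for month in months
        for row in partitions.get(month, [])
    )


# Hypothetical fact rows.
fact_rows = [
    {"sale_date": "2024-01-05", "amount": 10},
    {"sale_date": "2024-01-20", "amount": 5},
    {"sale_date": "2024-02-02", "amount": 7},
]
partitions = partition_by_month(fact_rows)
```

A query restricted to one month reads a single partition, and an old month can be archived or dropped as a unit without touching the rest of the fact table.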
Dimension Tables
HIERARCHY prod_rollup (
product CHILD OF
subcategory CHILD OF
category
Hierarchies:
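The product, subcategory, category chain named in the prod_rollup fragment above is a hierarchy: each level rolls up into the next. A minimal sketch of such a roll-up follows, with hypothetical product names and sales figures.

```python
from collections import defaultdict

# Hypothetical dimension rows: each product names its parent levels.
PRODUCT_DIM = {
    "dark chocolate": {"subcategory": "chocolate", "category": "food"},
    "milk chocolate": {"subcategory": "chocolate", "category": "food"},
    "rye bread": {"subcategory": "bakery", "category": "food"},
    "t-shirt": {"subcategory": "tops", "category": "clothing"},
}

# Hypothetical units sold per product.
sales_by_product = {
    "dark chocolate": 4,
    "milk chocolate": 6,
    "rye bread": 3,
    "t-shirt": 2,
}


def roll_up(sales, level):
    """Aggregate product-level sales to a higher level of the hierarchy."""
    totals = defaultdict(int)
    for product, units in sales.items():
        totals[PRODUCT_DIM[product][level]] += units
    return dict(totals)
```

Calling `roll_up(sales_by_product, "subcategory")` and then `roll_up(sales_by_product, "category")` moves the same figures one and two levels up the hierarchy.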
Star Schemas
A schema is called a star schema if all dimension tables can be joined directly to the fact
table.
The following diagram shows a classic star schema. In the star schema design, a single
object (the fact table) sits in the middle and is radially connected to other surrounding
objects (dimension lookup tables) like a star.
A star schema can be simple or complex. A simple star consists of one fact table; a
complex star can have more than one fact table.
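A simple star can be sketched with one fact table and two dimension tables in an in-memory SQLite database. The table and column names (fact_sales, dim_product, dim_time) and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT NOT NULL, category TEXT NOT NULL);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, month TEXT NOT NULL);
    -- The fact table holds the measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        time_id    INTEGER NOT NULL REFERENCES dim_time(time_id),
        units      INTEGER NOT NULL,
        revenue    REAL    NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'chocolate', 'food'), (2, 't-shirt', 'clothing');
    INSERT INTO dim_time    VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_sales  VALUES (1, 1, 10, 25.0), (1, 2, 4, 10.0), (2, 1, 2, 30.0);
    """
)

# Every dimension joins directly to the fact table: one join per dimension.
star_result = conn.execute(
    """
    SELECT p.category, t.month, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time t    ON t.time_id = f.time_id
    GROUP BY p.category, t.month
    ORDER BY p.category, t.month
    """
).fetchall()
```

The dimension tables supply the grouping columns (category and month), and the fact table supplies the measure being summed.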
Snowflake Schemas
A schema is called a snowflake schema if one or more dimension tables do not join
directly to the fact table but must join through other dimension tables.
For example, a dimension that describes products may be separated into three tables
(snowflaked).
The snowflake schema is an extension of the star schema where each point of the star
explodes into more points.
The main advantage of the snowflake schema is the improvement in query performance
due to minimized disk storage requirements and joining smaller lookup tables.
The main disadvantage of the snowflake schema is the additional maintenance effort
needed due to the increased number of lookup tables.
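The product example above, split into three tables, can be sketched in an in-memory SQLite database. The table names, keys, and sample rows are hypothetical.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript(
    """
    -- The product dimension is split into three tables (snowflaked).
    CREATE TABLE dim_category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE dim_subcategory (
        subcategory_id INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        category_id    INTEGER NOT NULL REFERENCES dim_category(category_id)
    );
    CREATE TABLE dim_product (
        product_id     INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        subcategory_id INTEGER NOT NULL REFERENCES dim_subcategory(subcategory_id)
    );
    CREATE TABLE fact_sales (
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        units      INTEGER NOT NULL
    );
    INSERT INTO dim_category    VALUES (1, 'food'), (2, 'clothing');
    INSERT INTO dim_subcategory VALUES (10, 'chocolate', 1), (20, 'tops', 2);
    INSERT INTO dim_product     VALUES (100, 'dark chocolate', 10),
                                       (101, 'milk chocolate', 10),
                                       (200, 't-shirt', 20);
    INSERT INTO fact_sales      VALUES (100, 4), (101, 6), (200, 2);
    """
)

# Category is reached from the fact table only through product and subcategory.
units_by_category = snow.execute(
    """
    SELECT c.name, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product p     ON p.product_id = f.product_id
    JOIN dim_subcategory s ON s.subcategory_id = p.subcategory_id
    JOIN dim_category c    ON c.category_id = s.category_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```

Each category name is stored once in `dim_category` instead of being repeated on every product row, and the price is that a category-level query has to pass through the intermediate lookup tables.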
Important Aspects of Star Schema & Snowflake Schema