Datawarehousing Concepts
BASIC DEFINITIONS
Datawarehousing :
A data warehouse (DWH) is a repository of integrated information, specifically structured for
queries and analysis. Data and information are extracted from heterogeneous sources as they are
generated. This makes it much easier and more efficient to run queries over data that originally
came from different sources.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision making process”.
It usually requires two operations: initial loading of data and access of data.
Data Mart :
It is a collection of subject areas organized for decision support based on the needs of
a given department, e.g., sales or marketing. A data mart is usually much less granular
than the warehouse data.
Data Mart is
OLTP :
OLTP is Online Transaction Processing. It uses a standard, normalized database
structure. OLTP is designed for transactions, which means that inserts, updates and
deletes must be fast.
OLAP :
OLAP is Online Analytical Processing. It works on read-only, historical, aggregated data.
Difference between OLTP and OLAP :
OLTP                  OLAP
Current data          Current and historical data
Fact Table :
It contains the quantitative measures about the business.
Fact tables that contain aggregated facts are often called summary tables.
Dimension Table :
It contains descriptive data about the facts (business).
Aggregate tables :
Aggregate tables are pre-stored summarized tables. Using aggregates can improve
query performance several times over.
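As a rough illustration of a pre-stored summary, the sketch below builds a monthly aggregate table from a detail-level fact table using Python's built-in sqlite3 module. The table and column names (sales_fact, sales_monthly_agg, and so on) are hypothetical and chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2024-01-05', 'widget', 10.0),
        ('2024-01-20', 'widget', 15.0),
        ('2024-02-03', 'widget', 20.0),
        ('2024-02-11', 'gadget', 5.0);

    -- Pre-stored summary: one row per month and product.
    CREATE TABLE sales_monthly_agg AS
        SELECT substr(sale_date, 1, 7) AS month,
               product,
               SUM(amount) AS total_amount
        FROM sales_fact
        GROUP BY substr(sale_date, 1, 7), product;
""")

# Reports read the small summary table instead of scanning the detail rows.
monthly_totals = conn.execute(
    "SELECT month, product, total_amount "
    "FROM sales_monthly_agg ORDER BY month, product"
).fetchall()
print(monthly_totals)
```

The summary has to be refreshed whenever the detail table is loaded, which is the usual trade-off for the faster reads.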
Conformed dimensions :
Conformed dimensions are dimension tables shared by multiple fact tables. These tables
connect separate star schemas into an enterprise star schema.
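The idea of one dimension serving several fact tables can be sketched as follows. This is only an illustration with hypothetical table names (dim_product, fact_sales, fact_inventory); each fact table is summarized separately and the results are lined up through the shared dimension key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, units_sold INTEGER);
    CREATE TABLE fact_inventory (product_key INTEGER, units_on_hand INTEGER);

    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 3), (1, 4), (2, 10);
    INSERT INTO fact_inventory VALUES (1, 50), (2, 20);
""")

# Both fact tables use the same product_key, so their summaries can be
# combined in one report through the shared dimension.
rows = conn.execute("""
    SELECT p.name, s.sold, i.on_hand
    FROM dim_product p
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
          FROM fact_inventory GROUP BY product_key) i
      ON i.product_key = p.product_key
    ORDER BY p.name
""").fetchall()
print(rows)
```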
Schema :
Star Schema :
Star Schema is a set of tables comprised of a single, central fact table surrounded by
de-normalized dimensions. Star schemas implement dimensional data structures with
de-normalized dimensions.
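A minimal sketch of a star schema, assuming hypothetical tables dim_date, dim_product and fact_sales, is shown here with Python's sqlite3 module. Note that the product dimension keeps the category as a plain column (de-normalized), so a report needs only one join per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    -- De-normalized dimension: category is stored directly on the product row.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Central fact table: foreign keys to each dimension plus the measure.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        amount REAL
    );
    INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'ebook', 'media');
    INSERT INTO fact_sales VALUES
        (20240101, 1, 10.0), (20240101, 2, 4.0), (20240201, 1, 6.0);
""")

sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```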
Snow Flake Schema:
Snowflake Schema is a set of tables comprised of a single, central fact table
surrounded by normalized dimension hierarchies. Snowflake schemas implement
dimensional data structures with fully normalized dimensions.
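For illustration, the sketch below normalizes a product dimension into a separate category table, which is the defining trait of a snowflake schema. The table names (dim_category, dim_product, fact_sales) are assumptions made for this example; the cost of normalization shows up as the extra join in the report query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized hierarchy: product -> category lives in its own table.
    CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        amount REAL
    );
    INSERT INTO dim_category VALUES (1, 'hardware'), (2, 'media');
    INSERT INTO dim_product VALUES (10, 'widget', 1), (11, 'gadget', 1), (12, 'ebook', 2);
    INSERT INTO fact_sales VALUES (10, 10.0), (11, 6.0), (12, 4.0);
""")

# Two joins are needed to reach the category name from the fact table.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(sales_by_category)
```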
Queries :
The DWH handles two types of queries:
Fixed queries that are clearly defined and well understood, such as regular
reports.
Ad hoc queries that are unpredictable, both in quantity and frequency. The ad hoc
query is the starting point for any analysis into a database. The ability to run
any query when desired and expect a reasonable response is what makes the data
warehouse worthwhile and makes the design such a significant challenge.
The end-user access tools are capable of automatically generating the database
query that answers any question posed by the user.
Canned queries are pre-defined queries. They contain prompts that allow you to
customize the query for your specific needs.
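A canned query with prompts can be thought of as a fixed SQL statement with named parameters. The sketch below is one possible way to model that in Python with sqlite3; the query text, the sales table and the run_canned_query helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, year INTEGER, amount REAL);
    INSERT INTO sales VALUES
        ('east', 2023, 100.0), ('east', 2024, 120.0), ('west', 2024, 80.0);
""")

# The statement is fixed in advance; only the prompted values change.
CANNED_SALES_BY_REGION = (
    "SELECT SUM(amount) FROM sales WHERE region = :region AND year = :year"
)

def run_canned_query(connection, region, year):
    """Run the pre-defined query with the values supplied at the prompts."""
    row = connection.execute(
        CANNED_SALES_BY_REGION, {"region": region, "year": year}
    ).fetchone()
    return row[0]

east_2024 = run_canned_query(conn, "east", 2024)
print(east_2024)
```

Binding the prompted values as parameters, rather than pasting them into the SQL text, keeps the statement itself unchanged between runs.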
Bottom up: According to Ralph Kimball, when you plan to design analytical solutions
for an enterprise, start by building data marts. Once you have three or four such data
marts, you effectively have an enterprise-wide data warehouse without having spent
time and effort exclusively on building the EDWH, because a data mart takes less
time to build than an EDWH.
Top down: According to Bill Inmon, build an enterprise-wide data warehouse first, so
that all the data marts are subsets of the EDWH. In his view, independent data marts
cannot make up an enterprise data warehouse under any circumstance; they
remain isolated pieces of information (stovepipes).
ER Diagram :
The ER model is a conceptual data model that views the real world as entities and
relationships. A basic component of the model is the Entity-Relationship diagram,
which is used to visually represent data objects.
ETL :
ETL stands for Extract, Transform, Load. ETL tools on the market include, e.g.,
Informatica, Ascential DataStage, Acta, and Oracle Warehouse Builder (OWB).
Staging Area :
It is the workplace where raw data is brought in, cleaned, combined, archived and
exported to one or more data marts. The purpose of the data staging area is to get data
ready for loading into a presentation layer.
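The clean-and-combine step of a staging area can be sketched in plain Python. This is a toy illustration under assumed record layouts (a customer name and an amount); the source names crm_rows and billing_rows and the helper functions are hypothetical.

```python
def clean(record):
    """Trim whitespace, normalize name casing, and convert the amount to a float."""
    return {
        "customer": record["customer"].strip().title(),
        "amount": float(record["amount"]),
    }

def stage(*sources):
    """Combine raw records from several sources, skipping rows that fail cleaning."""
    staged = []
    for source in sources:
        for record in source:
            try:
                staged.append(clean(record))
            except (KeyError, ValueError):
                continue  # reject rows that cannot be cleaned
    return staged

# Raw extracts from two hypothetical operational systems.
crm_rows = [{"customer": "  alice smith ", "amount": "10.50"}]
billing_rows = [
    {"customer": "BOB JONES", "amount": "7"},
    {"customer": "carol", "amount": "n/a"},  # bad amount, will be rejected
]

staged_rows = stage(crm_rows, billing_rows)
print(staged_rows)
```

A real staging area would normally log or archive the rejected rows instead of silently dropping them.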
Dimensions are said to be slowly changing dimensions when their attributes remain
almost constant, requiring only minor alterations over time.
E.g., marital status.
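One common way to record such a change while keeping history is to close the current dimension row and add a new one (often called a Type 2 approach; the text above does not prescribe any particular technique). The sketch below assumes a hypothetical customer dimension with start and end dates and a surrogate key.

```python
from datetime import date

def apply_change(dimension_rows, customer_id, new_status, change_date):
    """Close the current row for the customer and append a new current row."""
    for row in dimension_rows:
        is_current = row["customer_id"] == customer_id and row["end_date"] is None
        if is_current:
            if row["marital_status"] == new_status:
                return  # attribute unchanged, nothing to record
            row["end_date"] = change_date
    next_key = max((r["surrogate_key"] for r in dimension_rows), default=0) + 1
    dimension_rows.append({
        "surrogate_key": next_key,
        "customer_id": customer_id,
        "marital_status": new_status,
        "start_date": change_date,
        "end_date": None,  # None marks the current version
    })

dim_customer = [{
    "surrogate_key": 1,
    "customer_id": 42,
    "marital_status": "single",
    "start_date": date(2020, 1, 1),
    "end_date": None,
}]

apply_change(dim_customer, 42, "married", date(2024, 6, 1))
for row in dim_customer:
    print(row)
```

Facts loaded before the change keep pointing at surrogate key 1, so historical reports still show the old status.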
Bitmap indexes and B-tree indexes are the indexing mechanisms used in a typical data
warehouse.
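To give a feel for why bitmap indexes suit low-cardinality warehouse columns, the toy sketch below stores one bitmap per distinct value as a Python integer and answers a two-condition filter with a single bitwise AND. It is a conceptual illustration only, not how any particular database implements its indexes; the column data and function names are made up.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index

def matching_rows(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two low-cardinality columns of a hypothetical customer table.
gender = ["M", "F", "F", "M", "F"]
status = ["single", "married", "single", "married", "married"]

gender_index = build_bitmap_index(gender)
status_index = build_bitmap_index(status)

# Rows where gender = 'F' AND status = 'married': one bitwise AND.
married_women = matching_rows(gender_index["F"] & status_index["married"])
print(married_women)
```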
OLAP tools on the market include, e.g., Business Objects, Brio, Cognos, MicroStrategy,
Alphablox, and Crystal Reports.
ROLAP: Relational OLAP. The users see cubes, but under the hood it is
pure relational tables. MicroStrategy is a ROLAP product.
MOLAP: Multidimensional OLAP. The users see cubes, and under the hood
there is one big cube. Oracle Express used to be a MOLAP product.
DOLAP: Desktop OLAP. The users see many cubes, and under the hood there
are many small cubes, e.g., Cognos PowerPlay.
HOLAP: Hybrid OLAP. It combines MOLAP and ROLAP, e.g., Essbase.
Types of Facts:
Additive
Nonadditive
Semi Additive
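The list above gives no definitions, so the sketch below uses the commonly accepted ones as an assumption: additive facts can be summed across every dimension, semi-additive facts only across some (an account balance cannot be summed over time), and non-additive facts (such as ratios) cannot be summed at all. All data and names are invented for illustration.

```python
# Daily account snapshots: (day, account, deposits, balance)
snapshots = [
    ("mon", "A", 100.0, 100.0),
    ("mon", "B", 50.0, 50.0),
    ("tue", "A", 20.0, 120.0),
    ("tue", "B", 0.0, 50.0),
]

# Additive: deposits can be summed across days and accounts alike.
total_deposits = sum(row[2] for row in snapshots)

# Semi-additive: balances add up across accounts for ONE day,
# but summing them across days would double count money.
tuesday_balance = sum(row[3] for row in snapshots if row[0] == "tue")

# Non-additive: a margin ratio cannot be summed; it is recomputed
# from its additive components (profit and revenue).
sales = [(100.0, 20.0), (300.0, 30.0)]  # (revenue, profit)
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)

print(total_deposits, tuesday_balance, overall_margin)
```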
Attributes :
Business intelligence is actually an environment in which business users receive data
that is reliable, consistent, understandable, easily manipulated and timely. With this
data, business users are able to conduct analyses that yield overall understanding of
where the business has been, where it is now and where it will be in the near future.
Business intelligence serves two main purposes. It monitors the financial and
operational health of the organization (reports, alerts, alarms, analysis tools, key
performance indicators and dashboards). It also regulates the operation of the
organization by providing two-way integration with operational systems and information
feedback analysis.
Data Integration :
Data integration means pulling together and reconciling, for analytic purposes, dispersed
data that organizations have maintained in multiple, heterogeneous systems. Data needs to be
accessed and extracted, moved and loaded, validated and cleaned, and standardized
and transformed.
Data Mapping :
Data Mining :
A technique using software tools geared for the user who typically does not know
exactly what they are searching for, but is looking for particular patterns or trends. Data
mining is the process of sifting through large amounts of data to produce data
content relationships. It can predict future trends and behaviors, allowing businesses
to make proactive, knowledge-driven decisions. This is also known as data surfing.
Data Modeling :
A method used to define and analyze data requirements needed to support the business
functions of an enterprise. These data requirements are recorded as a conceptual data
model with associated data definitions. Data modeling defines the relationships
between data elements and structures.
Drill Down:
A method of exploring detailed data that was used in creating a summary level of
data. Drill down levels depend on the granularity of the data in the data warehouse.
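Drill down can be pictured as re-summarizing the same detail rows at a finer level. The sketch below assumes a small, made-up sales list with year, quarter and month attributes; the summarize helper is hypothetical. The lowest level you can reach is whatever the stored rows contain, which is the granularity point made above.

```python
from collections import defaultdict

sales = [
    {"year": 2024, "quarter": "Q1", "month": "Jan", "amount": 10},
    {"year": 2024, "quarter": "Q1", "month": "Feb", "amount": 20},
    {"year": 2024, "quarter": "Q1", "month": "Mar", "amount": 5},
    {"year": 2024, "quarter": "Q2", "month": "Apr", "amount": 7},
]

def summarize(rows, level):
    """Total the amount for each distinct value of the given level."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row["amount"]
    return dict(totals)

# Summary level: totals per quarter.
by_quarter = summarize(sales, "quarter")

# Drill down into Q1: the monthly detail behind the Q1 total.
q1_by_month = summarize([r for r in sales if r["quarter"] == "Q1"], "month")

print(by_quarter)
print(q1_by_month)
```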
Meta Data:
Meta data is data that expresses the context or relativity of data. Examples of meta
data include data element descriptions, data type descriptions, attribute/property
descriptions, range/domain descriptions and process/method descriptions. The
repository environment encompasses all corporate meta data resources: database
catalogs, data dictionaries and navigation services. Meta data includes name, length,
valid values and description of a data element. Meta data is stored in a data dictionary
and repository. It insulates the data warehouse from changes in the schema of
operational systems.
Normalization:
The process of reducing a complex data structure into its simplest, most stable
structure. In general, the process entails the removal of redundant attributes, keys, and
relationships from a conceptual data model.
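As a small illustration of removing a redundant attribute, the sketch below splits a flat order list, where the customer's city is repeated on every order, into a customer structure and an order structure. The field names are hypothetical.

```python
# Flat structure: customer_city is repeated for every order of a customer.
flat_orders = [
    {"order_id": 1, "customer_id": 7, "customer_city": "Pune", "amount": 30},
    {"order_id": 2, "customer_id": 7, "customer_city": "Pune", "amount": 12},
    {"order_id": 3, "customer_id": 9, "customer_city": "Oslo", "amount": 5},
]

# Customer attributes are stored once per customer.
customers = {
    row["customer_id"]: {"city": row["customer_city"]} for row in flat_orders
}

# Orders keep only the key that points to the customer.
orders = [
    {"order_id": r["order_id"], "customer_id": r["customer_id"], "amount": r["amount"]}
    for r in flat_orders
]

print(customers)
print(orders)
```

After the split, a change of city is made in one place instead of on every order row.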
Surrogate Key:
In the OLAP world, there are mainly two different types: Multidimensional OLAP
(MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to
technologies that combine MOLAP and ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.
Advantages:
• Excellent performance: MOLAP cubes are built for fast data retrieval and are
optimal for slicing and dicing operations.
• Can perform complex calculations: All calculations have been pre-generated when
the cube is created. Hence, complex calculations are not only doable, but they
return quickly.
Disadvantages:
• Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of
data in the cube itself. This is not to say that the data in the cube cannot be derived
from a large amount of data. Indeed, this is possible. But in this case, only
summary-level information will be included in the cube itself.
• Requires additional investment: Cube technologies are often proprietary and do not
already exist in the organization. Therefore, to adopt MOLAP technology, chances
are additional investments in human and capital resources are needed.
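The trade-off described above (fast answers, but everything computed at build time) can be sketched with a toy in-memory cube. This is a conceptual illustration only and does not reflect any vendor's storage format; the dimensions, the ALL marker and the build_cube function are assumptions for the example.

```python
from itertools import product as cartesian

# Detail rows: (region, item, amount)
facts = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
]

ALL = "*"  # marker for "rolled up over this dimension"

def build_cube(rows):
    """Pre-aggregate every (region, item) combination, including roll-ups."""
    cube = {}
    for region, item, amount in rows:
        # Each detail row contributes to four cells:
        # (region, item), (region, *), (*, item) and (*, *).
        for key in cartesian((region, ALL), (item, ALL)):
            cube[key] = cube.get(key, 0) + amount
    return cube

cube = build_cube(facts)

# Queries are plain lookups; no aggregation happens at query time.
print(cube[("east", ALL)])
print(cube[(ALL, "widget")])
print(cube[(ALL, ALL)])
```

With two dimensions each row feeds four cells; with n dimensions it feeds 2^n, which hints at why cube size limits the amount of detail that can be held.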
ROLAP
This methodology relies on manipulating the data stored in the relational database to give
the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL
statement.
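The point that each slice or dice adds a WHERE condition can be sketched as a small query generator. The example uses Python's sqlite3 module; the fact_sales table, the ALLOWED_COLUMNS set and the build_slice_query function are hypothetical names chosen for illustration.

```python
import sqlite3

ALLOWED_COLUMNS = {"region", "product"}  # dimensions the user may filter on

def build_slice_query(filters):
    """Turn slice/dice selections into one SQL statement with a WHERE clause."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    sql = "SELECT SUM(amount) FROM fact_sales"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    return sql, list(filters.values())

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('east', 'widget', 10.0), ('east', 'gadget', 5.0), ('west', 'widget', 7.0);
""")

# Slice: fix one dimension.
slice_sql, slice_params = build_slice_query({"region": "east"})
slice_total = conn.execute(slice_sql, slice_params).fetchone()[0]

# Dice: fix two dimensions, which simply adds another condition.
dice_sql, dice_params = build_slice_query({"region": "east", "product": "widget"})
dice_total = conn.execute(dice_sql, dice_params).fetchone()[0]

print(slice_sql, slice_total)
print(dice_sql, dice_total)
```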
Advantages:
• Can handle large amounts of data: The data size limitation of ROLAP technology
is the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
• Can leverage functionalities inherent in the relational database: Often, the relational
database already comes with a host of functionalities. ROLAP technologies, since
they sit on top of the relational database, can therefore leverage these
functionalities.
Disadvantages:
• Performance can be slow: Because each ROLAP report is essentially a SQL query
(or multiple SQL queries) in the relational database, the query time can be long if
the underlying data size is large.
• Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL statements
do not fit all needs (for example, it is difficult to perform complex calculations
using SQL), ROLAP technologies are therefore traditionally limited by what SQL
can do. ROLAP vendors have mitigated this risk by building into the tool
out-of-the-box complex functions as well as the ability to allow users to define their own
functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For
summary-type information, HOLAP leverages cube technology for faster performance.
When detail information is needed, HOLAP can "drill through" from the cube into the
underlying relational data.
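A rough sketch of the hybrid idea: summary questions are answered from a pre-aggregated structure, and detail questions go back to the relational table. The in-memory dictionary standing in for the cube, the fact_sales table and both helper functions are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('east', 'widget', 10.0), ('east', 'gadget', 5.0), ('west', 'widget', 7.0);
""")

# Summary level: totals per region, computed once and kept like a small cube.
summary_cube = dict(
    conn.execute("SELECT region, SUM(amount) FROM fact_sales GROUP BY region")
)

def total_for(region):
    """Answer a summary question from the pre-aggregated cube."""
    return summary_cube[region]

def drill_through(region):
    """Fetch the detail rows behind a summary cell from the relational table."""
    return conn.execute(
        "SELECT product, amount FROM fact_sales WHERE region = ? ORDER BY product",
        (region,),
    ).fetchall()

print(total_for("east"))
print(drill_through("east"))
```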