Multidimensional Data Model: Characteristics of Data Warehouse

The document discusses the key characteristics of a data warehouse. It states that a data warehouse is subject-oriented rather than transaction-oriented, stores data in a consistent format, retains data for an extended time period, and does not overwrite previous data when new data is added. It also compares some of the key differences between databases and data warehouses, such as purpose, processing methods, usage, data modeling techniques, and data types stored. Finally, it provides examples of how databases are used in different business sectors like banking, airlines, education, and more.

Multidimensional Data Model:-

Characteristics of Data Warehouse


 A data warehouse is subject-oriented: it presents information organized around themes rather than the company's ongoing operations.
 Data must be stored in the data warehouse in a common, universally accepted format.
 The time horizon of a data warehouse is relatively long compared with that of operational systems.
 A data warehouse is non-volatile: previous data is not erased when new information is entered.

Parameter         | Database                                                        | Data Warehouse
Purpose           | Designed to record transactions                                 | Designed to analyze data
Processing method | Uses Online Transaction Processing (OLTP)                       | Uses Online Analytical Processing (OLAP)
Usage             | Supports the fundamental day-to-day operations of the business  | Allows you to analyze your business
Tables and joins  | Tables and joins are complex because they are normalized        | Tables and joins are simple because they are denormalized
Orientation       | An application-oriented collection of data                      | A subject-oriented collection of data
Storage limit     | Generally limited to a single application                       | Stores data from any number of applications
Availability      | Data is available in real time                                  | Data is refreshed from source systems as and when needed
Modeling          | ER modeling techniques are used for design                      | Dimensional modeling techniques are used for design
Technique         | Captures data                                                   | Analyzes data
Data type         | Stores current, up-to-date data                                 | Stores current and historical data, which may not be up to date
Storage of data   | A flat relational approach is used                              | A dimensional approach is used (e.g., star and snowflake schemas)
Query type        | Simple transaction queries are used                             | Complex queries are used for analysis
Data summary      | Stores detailed data                                            | Stores highly summarized data

Applications of Database

Sector             | Usage
Banking            | Customer information, account-related activities, payments, deposits, loans, credit cards, etc.
Airlines           | Reservations and schedule information.
Universities       | Student information, course registrations, colleges, and results.
Telecommunication  | Call records, monthly bills, balance maintenance, etc.
Finance            | Information related to stocks, and to sales and purchases of stocks and bonds.
Sales & Production | Customer, product, and sales details.
Manufacturing      | Supply chain data management and tracking of production, items, and inventory status.
HR Management      | Details of employees' salaries, deductions, generation of paychecks, etc.

Metadata:-
(Posted by: Margaret Rouse, WhatIs.com)
Metadata is data that describes other data. "Meta" is a prefix that, in most information technology usages, means "an underlying definition or description."

Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata. Having the ability to filter through that metadata makes it much easier for someone to locate a specific document.

In addition to document files, metadata is used for images, videos, spreadsheets, and web pages. The use of metadata on web pages can be very important. Metadata for web pages contains descriptions of the page's contents, as well as keywords linked to the content, usually expressed in the form of metatags. The metadata containing the web page's description and summary is often displayed in search results by search engines, making its accuracy and detail very important, since it can determine whether a user decides to visit the site or not. Metatags are often evaluated by search engines to help decide a web page's relevance, and they were used as the key factor in determining position in a search until the late 1990s. The rise of search engine optimization (SEO) toward the end of the 1990s led many websites to "keyword stuff" their metadata to trick search engines into treating their sites as more relevant than others. Since then, search engines have reduced their reliance on metatags, though metatags are still factored in when indexing pages. Many search engines also try to stop web pages from gaming their systems by regularly changing their ranking criteria, with Google being notorious for frequently changing its largely undisclosed ranking algorithms.

Metadata can be created manually or by automated information processing. Manual creation tends to be more accurate, allowing the user to input any information they feel is relevant or needed to describe the file. Automated metadata creation is usually more elementary, typically recording only information such as file size, file extension, when the file was created, and who created the file.
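The kind of elementary, automated metadata described above can be collected with a few lines of Python using the standard library. This is only an illustrative sketch; the temporary file and the chosen fields are made up for the demo:

```python
import os
import tempfile

def basic_metadata(path):
    """Collect the kind of elementary metadata an automated tool records."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,           # file size in bytes
        "modified": st.st_mtime,            # last-modified timestamp
        "extension": os.path.splitext(path)[1],
    }

# Demo on a throwaway file containing 14 bytes
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(b"hello metadata")
    path = f.name

meta = basic_metadata(path)
print(meta["size_bytes"], meta["extension"])  # 14 .txt
os.remove(path)
```

Richer metadata (author, creation date, embedded keywords) would require format-specific parsers, which is why automated extraction tends to stay this elementary.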

Data Mining:-

Data mining is the process of sorting through large data sets to identify patterns and establish relationships in order to solve problems through data analysis. Data mining tools allow enterprises to predict future trends.

Data mining is considered an interdisciplinary field that joins techniques from computer science and statistics. Note that the term "data mining" is a misnomer: it is primarily concerned with discovering patterns and anomalies within data sets, not with the extraction of the data itself.

Data mining uncovers patterns, anomalies, and relationships in large data sets that can be used to make predictions about future trends. Its main purpose is extracting valuable information from available data.

Applications of Data Mining


Data mining offers many applications in business. For example, the establishment of proper data (mining)
processes can help a company to decrease its costs, increase revenues, or derive insights from the behavior and
practices of its customers. Certainly, it plays a vital role in the business decision-making process nowadays.

Data mining is also actively utilized in finance. For instance, relevant techniques allow users to determine and
assess the factors that influence the price fluctuations of financial securities.

The field is rapidly evolving. New data emerges at enormous speed, while technological advancements allow for more efficient ways to solve existing problems. In addition, developments in artificial intelligence and machine learning provide new paths to precision and efficiency in the field.

Data Mining Process


Generally, the process can be divided into the following steps:

1. Define the problem: Determine the scope of the business problem and objectives of the data
exploration project.
2. Explore the data: This step includes the exploration and collection of data that will help solve the stated business problem.
3. Prepare the data: Clean and organize the collected data to prepare it for further modeling procedures.
4. Modeling: Create a model using data mining techniques that will help solve the stated problem.
5. Interpretation and evaluation of results: Draw conclusions from the data model and assess its validity. Translate the results into a business decision.

Data Mining Techniques


The most commonly used techniques in the field include:

1. Anomaly detection: Identifying unusual values in a dataset.
2. Dependency modeling: Discovering existing relationships within a dataset. It frequently involves regression analysis.
3. Clustering: Identifying structures (clusters) in unstructured data.
4. Classification: Generalizing a known structure and applying it to new data.
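A minimal way to implement the first of these techniques, anomaly detection, is a z-score test: flag any value that lies more than a chosen number of standard deviations from the mean. This is a generic sketch, not a method prescribed by the text; the sensor readings and threshold are invented:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one corrupted value
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 55.0]
print(zscore_outliers(readings, threshold=2.0))  # → [55.0]
```

Real anomaly detectors use more robust statistics (median, interquartile range) because, as here, a large outlier inflates the mean and standard deviation it is tested against.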

Difference Between Data Mining and DBMS

DBMS: A Database Management System is a product or piece of software used to manage data, e.g., SQL Server.

Data Mining: Data mining is the procedure of extracting information from huge amounts of raw data, which can be used to make business decisions or to decide future strategies.

The key to the distinction is the verbs: manage and mine. Data management (and the system that supports it) treats data as the object of concern; topics like design, storage, performance, integrity, and security are central. Data mining, by contrast, focuses on using the data to identify patterns that affect business concerns; data becomes the tool rather than the subject.

A DBMS is a product. Data mining is a process.

A DBMS is a complete system for managing digital databases that allows storage of database content, creation and maintenance of data, search, and other functionality. Data mining, on the other hand, is a field of computer science that deals with extracting previously unknown and interesting information from raw data. Usually, the data used as input for the data mining process is stored in databases. Users who are inclined toward statistics use data mining: they apply statistical models to look for hidden patterns in data, seeking useful relationships between data elements that are ultimately profitable for businesses.

A DBMS is a full-fledged system for housing and managing a set of digital databases, whereas data mining is a technique or concept in computer science that deals with extracting useful and previously unknown information from raw data. Most of the time, this raw data is stored in very large databases. Data miners therefore use the existing functionality of a DBMS to handle, manage, and even preprocess raw data before and during the mining process. A DBMS alone, however, cannot be used to analyze data, although some present-day DBMSs have built-in data analysis tools or capabilities.

Difference between Data Warehousing and Data Mining


A data warehouse is built to support management functions whereas data mining is used to extract useful information and
patterns from data. Data warehousing is the process of compiling information into a data warehouse.
Data Warehousing:
It is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed, rather than used for transaction processing. A data warehouse is designed to support the management decision-making process by providing a platform for data cleaning, data integration, and data consolidation. A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data.
A data warehouse consolidates data from many sources while ensuring data quality, consistency, and accuracy. It improves system performance by separating analytical processing from transactional databases. Data flows into the warehouse from the various databases. A data warehouse works by organizing data into a schema that describes the layout and type of the data; query tools then analyze the data tables using that schema.

Figure – Data Warehousing process

Data Mining:
It is the process of finding patterns and correlations within large data sets to identify relationships between data. Data mining
tools allow a business organization to predict customer behavior. Data mining tools are used to build risk models and detect
fraud. Data mining is used in market analysis and management, fraud detection, corporate analysis and risk management.

Figure – Data Mining process

Comparison between data mining and data warehousing:


DATA WAREHOUSING                                                                  | DATA MINING
A data warehouse is a database system designed for analytical rather than transactional work. | Data mining is the process of analyzing data patterns.
Data is stored periodically.                                                      | Data is analyzed regularly.
Data warehousing is the process of extracting and storing data to allow easier reporting. | Data mining is the use of pattern-recognition logic to identify patterns.
Data warehousing is carried out solely by engineers.                              | Data mining is carried out by business users with the help of engineers.
Data warehousing is the process of pooling all relevant data together.            | Data mining is considered a process of extracting information from large data sets.

Knowledge Discovery in Databases (KDD):-
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge about the data sets, and interpreting accurate solutions from the observed results.
Major KDD application areas include marketing, fraud detection, telecommunication and manufacturing.

Traditionally, data mining and knowledge discovery were performed manually. As time passed, the amount of data in many systems grew beyond terabyte scale and could no longer be maintained manually. Moreover, discovering underlying patterns in data came to be considered essential for the successful existence of any business. As a result, several software tools were developed to discover hidden patterns and make inferences, forming a part of artificial intelligence.
The KDD process has reached its peak in the last 10 years. It now houses many different approaches to discovery, which
includes inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems and
information theory. The ultimate goal is to extract high-level knowledge from low-level data.
KDD includes multidisciplinary activities. This encompasses data storage and access, scaling algorithms to massive data
sets and interpreting results. The data cleansing and data access process included in data warehousing facilitate the KDD
process. Artificial intelligence also supports KDD by discovering empirical laws from experimentation and observations. The
patterns recognized in the data must be valid on new data, and possess some degree of certainty. These patterns are
considered new knowledge. Steps involved in the entire KDD process are:

1. Identify the goal of the KDD process from the customer's perspective.
2. Understand the application domains involved and the knowledge that is required.
3. Select a target data set, or a subset of data samples, on which discovery is to be performed.
4. Cleanse and preprocess the data by deciding on strategies to handle missing fields, and alter the data as per the requirements.
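Steps 3 and 4 above (selecting a target subset of attributes and handling missing fields) can be sketched as two small Python functions. The record fields and the fill-with-default strategy are hypothetical, chosen only to make the steps concrete:

```python
def select_target(records, fields):
    """Step 3: keep only the attributes relevant to the discovery goal."""
    return [{k: r[k] for k in fields if k in r} for r in records]

def cleanse(records, defaults):
    """Step 4: replace missing (None) fields using a chosen default strategy."""
    return [{**defaults, **{k: v for k, v in r.items() if v is not None}}
            for r in records]

raw = [{"age": 34, "income": 52000}, {"age": None, "income": 61000}]
subset = select_target(raw, ["age", "income"])
clean = cleanse(subset, {"age": 0, "income": 0})
print(clean)  # the missing age is replaced by the default
```

In practice the "default strategy" is a real decision point: dropping the record, imputing a mean, or carrying a sentinel value each biases the later mining phase differently.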

Difference Between Data Mining and KDD

Data, in its raw form, is just a collection of things from which little information can be derived. With the development of information discovery methods (data mining and KDD), the value of the data is significantly improved.

Data mining is one of the steps of Knowledge Discovery in Databases (KDD): it is the pattern-extraction phase. KDD as a whole is a multi-step process that drives the conversion of data into useful information. Data mining can take several forms, with the choice influenced by the desired outcomes.

KDD has the following steps:

Data Selection

KDD cannot be performed without human interaction. Choosing the data set, and a subset of it, requires knowledge of the domain from which the data is taken. Removing unrelated data elements from the data set reduces the search space during the data mining phase. The sample size and structure are also established at this point, if the data set can be assessed using a sample of the data.

Pre-processing

Databases can contain incorrect or missing data. During the pre-processing phase, this data is cleaned. Cleaning warrants the removal of outliers, where appropriate; choosing strategies for handling missing data fields; accounting for time-sequence information; and applying appropriate normalization of the data.

Transformation

The transformation phase attempts to reduce the number of data elements under consideration while preserving the quality of the data. During this stage, data is organized, converted from one type to another (e.g., nominal to numeric), and new or "derived" attributes are defined.

Data mining

Now the data is subjected to one or several data mining methods, such as regression, classification, or clustering. The data mining part of KDD usually requires repeated, iterative application of particular methods. Different data mining techniques or models can be used depending on the expected outcome.

Evaluation

The final step is the documentation and interpretation of the outcomes from the previous steps. This may involve returning to a previous step of the KDD process to refine the acquired knowledge, or converting the knowledge into a form understandable to the user. In this stage, the extracted data patterns are visualized for further review.

Data Mining Techniques

There are four main operations associated with data mining techniques:
• Predictive modeling
• Database segmentation
• Link analysis
• Deviation detection

Techniques are specific implementations of the data mining operations, and each operation has its own strengths and weaknesses. With this in mind, data mining tools sometimes offer a choice of operations to implement a technique.

Predictive Modeling
Predictive modeling is designed on a pattern similar to the human learning experience, using observations to form a model of the important characteristics of some task; the model corresponds to the 'real world'. It is developed using a supervised learning approach, which has two phases: training and testing. The training phase is based on a large sample of historical data called a training set, while testing involves trying out the model on new, previously unseen data to determine its accuracy and performance characteristics.
Predictive modeling is commonly used in customer retention management, credit approval, cross-selling, and direct marketing. There are two techniques associated with predictive modeling:
• Classification
• Value prediction
Classification
Classification is used to assign records to a finite set of possible class values. There are two specializations of classification: tree induction and neural induction. An example of classification using tree induction is shown in Figure.

In this example, we are interested in predicting whether a customer who is currently renting property is likely to be interested in buying property. A predictive model has determined that only two variables are of interest: the length of time the customer has rented property and the age of the customer. The model predicts that customers who have rented for more than two years and are over 25 years old are the most likely to be interested in buying property. An example of classification using neural induction is shown in Figure.
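The tree-induction example above boils down to a two-variable decision rule, which might be expressed as follows. The function name, thresholds, and customer tuples are taken from or invented around the example, not from any real system:

```python
def likely_buyer(years_rented, age):
    """Decision rule induced in the example: rented > 2 years AND age > 25."""
    return years_rented > 2 and age > 25

# (years_rented, age) for three hypothetical customers
customers = [(3, 30), (1, 40), (5, 22)]
print([likely_buyer(y, a) for y, a in customers])  # [True, False, False]
```

A real induced tree would contain many such threshold tests nested along paths from root to leaf; this rule corresponds to a single root-to-leaf path.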

A neural network contains collections of connected nodes with input, output, and processing at each node. Between the visible input and output layers there may be a number of hidden processing layers. Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value expressing the strength of the relationship. This approach is an attempt to mimic the way the human brain works in recognizing patterns, by arithmetically combining all the variables associated with a given data point.
Value prediction
It uses the traditional statistical techniques of linear regression and nonlinear
regression. These techniques are easy to use and understand. Linear regression
attempts to fit a straight line through a plot of the data, such that the line is the best
representation of the average of all observations at that point in the plot. The problem
with linear regression is that the technique only works well with linear data and is
sensitive to those data values which do not conform to the expected norm. Although
nonlinear regression avoids the main problems of linear regression, it is still not flexible
enough to handle all possible shapes of the data plot. This is where the traditional
statistical analysis methods and data mining methods begin to diverge. Applications of
value prediction include credit card fraud detection and target mailing list identification.
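The straight-line fit that linear regression performs can be written out directly with the ordinary least-squares formulas. A plain-Python sketch; the sample points are made up:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (the straight-line fit described above)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx  # slope, intercept

xs = [1, 2, 3, 4]
ys = [2.1, 4.0, 6.1, 7.9]
slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))
```

The sensitivity to outliers mentioned in the text is visible in the formula: a single extreme (x, y) pair contributes quadratically to both sums and can swing the fitted slope badly.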
Database Segmentation
A segment is a group of similar records that share a number of properties. The aim of database segmentation is to partition a database into an unknown number of segments, or clusters. This approach uses unsupervised learning to discover homogeneous sub-populations in a database and so improve the accuracy of the profiles. Applications of database segmentation include customer profiling, direct marketing, and cross-selling.
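Database segmentation is typically implemented with a clustering algorithm such as k-means. This is a tiny one-dimensional sketch under assumed inputs; the spend figures and initial centers are hypothetical, and production clustering works on many attributes at once:

```python
def kmeans(points, centers, iters=10):
    """Tiny 1-D k-means: assign each point to its nearest center, then re-average."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the nearest current center
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two customer segments by annual spend (hypothetical figures)
spend = [120, 130, 110, 900, 950, 880]
print(sorted(kmeans(spend, centers=[100, 1000])))  # [120.0, 910.0]
```

The two returned centers are the "profiles" of the discovered sub-populations: a low-spend segment and a high-spend segment, found without any labels (unsupervised learning).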
As shown in the figure, using database segmentation we identify the clusters that correspond to legal tender and forgeries. Note that there are two clusters of forgeries, which is attributed to at least two gangs of forgers working on falsifying the banknotes.
Link Analysis
Link analysis aims to establish links, called associations, between individual records, or sets of records, in a database. There are three specializations of link analysis:
• Associations discovery
• Sequential pattern discovery
• Similar time sequence discovery
Associations discovery finds items that imply the presence of other items in the same event. Association rules are used to define these associations. For example: 'when a customer rents property for more than two years and is more than 25 years old, in 40% of cases the customer will buy a property. This association holds for 35% of all customers who rent properties.'
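The 40% and 35% figures in the example are, respectively, the rule's confidence and support. Both can be computed from a list of transactions; a sketch with invented shopping baskets rather than the text's rental data:

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of 'antecedent => consequent' over a transaction list."""
    has_a = [t for t in transactions if antecedent <= t]          # antecedent holds
    has_both = [t for t in has_a if consequent <= t]              # rule fires
    support = len(has_both) / len(transactions)                   # fraction of all transactions
    confidence = len(has_both) / len(has_a)                       # fraction of antecedent cases
    return support, confidence

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
print(rule_stats(baskets, {"bread"}, {"milk"}))  # (0.5, 0.6666666666666666)
```

Association-discovery algorithms such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds, rather than evaluating one rule at a time as here.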
Sequential pattern discovery finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior.
Similar time sequence discovery is used to find links between two sets of data that are time-dependent. For example, within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.
Applications of link analysis include product affinity analysis, direct marketing, and stock price movement.
Deviation Detection
Deviation detection is a relatively new technique in terms of commercially available data mining tools. However, deviation detection is often a source of true discovery because it identifies outliers, which express deviation from some previously known expectation and norm. This operation can be performed using statistics and visualization techniques. Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defect tracing.

Data Preprocessing:-
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues, and it prepares raw data for further processing.

Data preprocessing is used in database-driven applications such as customer relationship management, and in rule-based applications (like neural networks).

Data goes through a series of steps during preprocessing:

 Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or
resolving the inconsistencies in the data.
 Data Integration: Data with different representations are put together and conflicts within the data are resolved.
 Data Transformation: Data is normalized, aggregated and generalized.
 Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.
 Data Discretization: Involves reducing the number of values of a continuous attribute by dividing the range of the attribute into intervals.
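The cleaning and transformation steps above might look like this in code. This sketch assumes two specific strategies, mean-imputation for missing values and min-max normalization; the ages list is made up:

```python
def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Data transformation: normalize values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, None, 40, 60]
cleaned = fill_missing(ages)    # [20, 40.0, 40, 60]
print(min_max(cleaned))         # [0.0, 0.5, 0.5, 1.0]
```

Discretization would be one step further: mapping each normalized value into a small number of interval labels (e.g., low/medium/high) instead of keeping the continuous value.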
Multidimensional Data Model:-

A multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes.

A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store's sales records, dimensions allow the store to keep track of things like monthly sales of items across branches and locations.

A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately.

Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model. OLAP in data warehousing enables users to view data from different angles and dimensions.

Data Cubes:-
A data cube is a three-dimensional (3-D) or higher range of values that is generally used to explain the time sequence of an image's data. It is a data abstraction used to evaluate aggregated data from a variety of viewpoints. It is also useful for imaging spectroscopy, where a spectrally resolved image is depicted as a 3-D volume.

A data cube can also be described as the multidimensional extension of a two-dimensional table. It can be viewed as a collection of identical 2-D tables stacked upon one another. Data cubes are used to represent data that is too complex to be described by a single table of columns and rows; as such, data cubes can go far beyond 3-D to include many more dimensions.

A data cube is generally used to make data easy to interpret. It is especially useful when representing data together with dimensions as certain measures of business requirements. Each dimension of the cube represents a certain characteristic of the database, for example daily, monthly, or yearly sales. The data included in a data cube makes it possible to analyze almost all the figures for virtually any or all customers, sales agents, products, and much more. Thus, a data cube can help to establish trends and analyze performance.
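A data cube can be sketched in plain Python as the set of all aggregations (cuboids) over subsets of the dimensions. The fact rows below are hypothetical sales records, and summing the last column stands in for whatever measure the warehouse tracks:

```python
from collections import defaultdict
from itertools import combinations

# Fact rows: (product, city, month, units_sold) - hypothetical sales data
facts = [
    ("tv",    "delhi",  "jan", 5),
    ("tv",    "mumbai", "jan", 3),
    ("radio", "delhi",  "feb", 7),
    ("tv",    "delhi",  "feb", 2),
]

def cube(rows, dims):
    """Aggregate the measure over every subset of the dimensions (the cuboids)."""
    result = {}
    for size in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), size):
            agg = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in subset)
                agg[key] += row[-1]  # sum the measure in the last column
            result[tuple(dims[i] for i in subset)] = dict(agg)
    return result

c = cube(facts, ["product", "city", "month"])
print(c[("product",)])   # totals per product
print(c[()])             # grand-total cuboid
```

The sparsity problem discussed below for MOLAP is visible even here: with three dimensions there are 2^3 = 8 cuboids, and most (product, city, month) combinations hold no data at all.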

Data cubes are mainly categorized into two categories:

 Multidimensional Data Cube: Most OLAP products are developed based on a structure in which the cube is patterned as a multidimensional array. These multidimensional OLAP (MOLAP) products usually offer improved performance compared with other approaches, mainly because they can index directly into the structure of the data cube to gather subsets of data. As the number of dimensions grows, the cube becomes sparser: many cells that represent particular attribute combinations will not contain any aggregated data. This in turn boosts the storage requirements, which may reach undesirable levels, making the MOLAP solution untenable for huge data sets with many dimensions. Compression techniques might help, but their use can damage the natural indexing of MOLAP.

 Relational OLAP: Relational OLAP (ROLAP) makes use of the relational database model. The ROLAP data cube is implemented as a collection of relational tables (approximately twice as many as the number of dimensions) instead of a multidimensional array. Each of these tables, known as a cuboid, represents a particular view.

What is a Multidimensional Schema?
A multidimensional schema is specially designed to model data warehouse systems. Such schemas are designed to address the unique needs of very large databases built for analytical purposes (OLAP).

Types of Data Warehouse Schema:

Following are the three chief types of multidimensional schemas, each with its unique advantages.

 Star Schema
 Snowflake Schema
 Galaxy Schema

What is a Star Schema?

The star schema is the simplest type of data warehouse schema. It is known as a star schema because its structure resembles a star. In the star schema, the center of the star can have one fact table and a number of associated dimension tables. It is also known as the Star Join Schema and is optimized for querying large data sets.

For example, as you can see in the above-given image, the fact table is at the center and contains keys to every dimension table, like Deal_ID, Model_ID, Date_ID, Product_ID, and Branch_ID, along with other attributes like units sold and revenue.

Characteristics of a Star Schema:

 Every dimension in a star schema is represented by only one dimension table.
 The dimension table contains the set of attributes.
 The dimension table is joined to the fact table using a foreign key.
 The dimension tables are not joined to each other.
 The fact table contains keys and measures.
 The star schema is easy to understand and provides optimal disk usage.
 The dimension tables are not normalized. For instance, in the above figure, Country_ID does not have a Country lookup table as an OLTP design would have.
 The schema is widely supported by BI tools.
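A star-schema join of the kind described above can be demonstrated with an in-memory SQLite database. The table and column names here are illustrative, not taken from the figure:

```python
import sqlite3

# In-memory sketch of a star schema: one fact table with foreign keys
# into denormalized dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_branch  (branch_id  INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, branch_id INTEGER,
                          units_sold INTEGER, revenue REAL);
INSERT INTO dim_product VALUES (1, 'TV'), (2, 'Radio');
INSERT INTO dim_branch  VALUES (10, 'Delhi'), (20, 'Mumbai');
INSERT INTO fact_sales  VALUES (1, 10, 5, 500.0), (1, 20, 3, 300.0),
                               (2, 10, 7, 210.0);
""")

# A typical star join: the fact table joined to a dimension directly,
# with a single join per dimension used.
rows = db.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)   # [('Radio', 210.0), ('TV', 800.0)]
```

Because the dimensions are denormalized, every attribute a query needs is one join away from the fact table; in the snowflake variant below, the same query might need further joins into the normalized lookup tables.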

What is a Snowflake Schema?


A snowflake schema is an extension of a star schema in which the dimension tables are normalized, splitting the data into additional tables. It is called a snowflake schema because its diagram resembles a snowflake. In the following example, Country is further normalized into an individual table.
Characteristics of a Snowflake Schema:

 The main benefit of the snowflake schema is that it uses smaller disk space.
 It is easier to implement when a dimension is added to the schema.
 Query performance is reduced because of the multiple tables.
 The primary challenge of the snowflake schema is that the additional lookup tables require more maintenance effort.

Star Vs Snowflake Schema: Key Differences

Star Schema                                                        | Snowflake Schema
Hierarchies for the dimensions are stored in the dimension tables. | Hierarchies are divided into separate tables.
A fact table is surrounded by dimension tables.                    | A fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.
A single join creates the relationship between the fact table and any dimension table. | Many joins are required to fetch the data.
Simple database design.                                            | Very complex database design.
Denormalized data structure; queries also run faster.              | Normalized data structure.
High level of data redundancy.                                     | Very low level of data redundancy.
A single dimension table contains aggregated data.                 | Data is split into different dimension tables.
Cube processing is faster.                                         | Cube processing might be slow because of the complex joins.
Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. | Represented by a centralized fact table which is unlikely to be connected with multiple dimensions.

What is a Galaxy schema?


A Galaxy Schema contains two or more fact tables that share dimension tables. It is also called a Fact
Constellation Schema. The schema is viewed as a collection of stars, hence the name Galaxy Schema.

As you can see in the above figure, there are two fact tables:

1. Revenue
2. Product.

In a Galaxy schema, the shared dimensions are called conformed dimensions.

Characteristics of Galaxy Schema:

 The dimensions in this schema are separated into separate tables based on the various levels of
hierarchy.
 For example, if geography has four levels of hierarchy (region, country, state, and city), then the
Galaxy schema should have four dimension tables.
 Moreover, it is possible to build this type of schema by splitting a single star schema into multiple
star schemas.
 The dimension tables in this schema are large, as they are built based on the levels of hierarchy.
 This schema is helpful for aggregating fact tables for better understanding.
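A minimal sketch of a galaxy schema follows, assuming two hypothetical fact tables that share a conformed Dim_Date dimension. The text names its fact tables Revenue and Product; Fact_Shipping below is a stand-in chosen for the sketch:

```python
import sqlite3

# Two fact tables share one conformed date dimension. The text names its
# fact tables Revenue and Product; Fact_Shipping here is a stand-in.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim_Date      (Date_ID INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Fact_Revenue  (Date_ID INTEGER, Revenue REAL);
CREATE TABLE Fact_Shipping (Date_ID INTEGER, Units INTEGER);
INSERT INTO Dim_Date      VALUES (1, 2023), (2, 2024);
INSERT INTO Fact_Revenue  VALUES (1, 500.0), (2, 800.0);
INSERT INTO Fact_Shipping VALUES (1, 12), (2, 20);
""")

# Because Dim_Date is conformed, measures from both fact tables line up.
rows = con.execute("""
    SELECT d.Year, r.Revenue, s.Units
    FROM Dim_Date d
    JOIN Fact_Revenue  r ON r.Date_ID = d.Date_ID
    JOIN Fact_Shipping s ON s.Date_ID = d.Date_ID
    ORDER BY d.Year
""").fetchall()
print(rows)  # [(2023, 500.0, 12), (2024, 800.0, 20)]
```

Because both fact tables reference the same dimension rows, their measures can be reported side by side, which is the point of conforming dimensions.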

What is Star Cluster Schema?



A snowflake schema contains fully expanded hierarchies. However, this adds complexity to the schema and
requires extra joins. On the other hand, a star schema contains fully collapsed hierarchies, which may lead to
redundancy. So, the best solution may be a balance between these two schemas, which is the star cluster
schema design.

Overlapping dimensions can be found as forks in hierarchies. A fork happens when an entity acts as a parent in
two different dimensional hierarchies. Fork entities are then identified as classification entities with one-to-many
relationships.

Summary:
 Multidimensional schema is especially designed to model data warehouse systems
 The star schema is the simplest type of Data Warehouse schema. It is known as star schema as its
structure resembles a star.
 A Snowflake Schema is an extension of a Star Schema in which additional tables are added by
normalizing the dimensions. It is called snowflake because its diagram resembles a snowflake.
 In a star schema, only single join creates the relationship between the fact table and any dimension
tables.
 Star schema contains a fact table surrounded by dimension tables.
 In a snowflake schema, the fact table is surrounded by dimension tables which are in turn surrounded by further dimension tables.
 A snowflake schema requires many joins to fetch the data.
 A Galaxy Schema contains two or more fact tables that share dimension tables. It is also called a Fact
Constellation Schema.
 A star cluster schema combines attributes of the star schema and the snowflake schema.


Data Warehouse Process and Architecture
What is Data warehouse?
A data warehouse is an information system that contains historical and cumulative data from single or multiple
sources. It simplifies the reporting and analysis processes of the organization.

It is also a single version of truth for any company for decision making and forecasting.

Characteristics of Data warehouse


A data warehouse has following characteristics:

 Subject-Oriented
 Integrated
 Time-variant
 Non-volatile

Subject-Oriented

A data warehouse is subject oriented as it offers information regarding a theme instead of companies' ongoing
operations. These subjects can be sales, marketing, distributions, etc.

A data warehouse never focuses on the ongoing operations. Instead, it puts emphasis on modeling and analysis
of data for decision making. It also provides a simple and concise view of the specific subject by excluding
data that is not helpful to the decision process.

Integrated

In a data warehouse, integration means the establishment of a common unit of measure for all similar data from
dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally
acceptable manner.

A data warehouse is developed by integrating data from varied sources like a mainframe, relational databases,
flat files, etc. Moreover, it must keep consistent naming conventions, format, and coding.

This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures,
encoding structure etc. have to be ensured. Consider the following example:

In the above example, there are three different applications labeled A, B, and C. The information stored in these
applications is Gender, Date, and Balance. However, each application stores its data in a different way.

 In Application A, the gender field stores logical values such as M or F.
 In Application B, the gender field is a numerical value.
 In Application C, the gender field is stored as a character value.
 The same is the case with Date and Balance.

However, after the transformation and cleaning process, all this data is stored in a common format in the data
warehouse.
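One way to sketch that transformation step is a per-application mapping onto a single warehouse encoding. The concrete codes below (1/0 in Application B, 'male'/'female' in Application C) are assumptions for illustration; the source only says the encodings differ:

```python
# Map each application's local gender encoding onto the common warehouse
# format. The concrete codes for Applications B and C are assumptions.
GENDER_MAPS = {
    "A": {"M": "M", "F": "F"},          # Application A: logical values M/F
    "B": {1: "M", 0: "F"},              # Application B: numerical codes
    "C": {"male": "M", "female": "F"},  # Application C: character values
}

def to_warehouse_gender(app: str, value) -> str:
    """Translate an application-specific gender code to the common format."""
    return GENDER_MAPS[app][value]

records = [("A", "M"), ("B", 0), ("C", "female")]
print([to_warehouse_gender(app, v) for app, v in records])  # ['M', 'F', 'F']
```

Real ETL tools generalize this idea to dates, balances, and every other field whose encoding varies across sources.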

Time-Variant

The time horizon for data warehouse is quite extensive compared with operational systems. The data collected
in a data warehouse is recognized with a particular period and offers information from the historical point of
view. It contains an element of time, explicitly or implicitly.

One place where data warehouse data displays time variance is in the structure of the record key. Every
primary key contained within the DW should have, either implicitly or explicitly, an element of time, such as
the day, week, or month.

Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
Non-volatile

A data warehouse is also non-volatile, meaning that previous data is not erased when new data is entered.

Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what
happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms.

Activities like delete, update, and insert, which are performed in an operational application environment, are
omitted in the data warehouse environment. Only two types of data operations are performed in the data
warehouse:

1. Data loading
2. Data access

Here, are some major differences between Application and Data Warehouse

Operational Application vs. Data Warehouse:

 Operational Application: Complex programs must be coded to make sure that data upgrade processes
maintain high integrity of the final product. Data Warehouse: This kind of issue does not arise because
data updates are not performed.
 Operational Application: Data is placed in a normalized form to ensure minimal redundancy. Data
Warehouse: Data is not stored in normalized form.
 Operational Application: The technology needed to support transactions, data recovery, rollback, and
deadlock resolution is quite complex. Data Warehouse: It offers relative simplicity in technology.

Data Warehouse Architectures


There are mainly three types of Datawarehouse Architectures: -

Single-tier architecture

The objective of a single layer is to minimize the amount of data stored by removing data redundancy. This
architecture is not frequently used in practice.

Two-tier architecture

Two-layer architecture physically separates the available sources from the data warehouse. This architecture is
not expandable and does not support a large number of end-users. It also has connectivity problems because of
network limitations.

Three-tier architecture

This is the most widely used architecture.

It consists of the Top, Middle and Bottom Tier.


1. Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational
database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
2. Middle Tier: The middle tier in Data warehouse is an OLAP server which is implemented using either
ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database.
This layer also acts as a mediator between the end-user and the database.
3. Top-Tier: The top tier is a front-end client layer. It holds the tools and APIs that you use to connect to
the data warehouse and get data out of it, such as query tools, reporting tools, managed query tools,
analysis tools, and data mining tools.

Datawarehouse Components
The data warehouse is based on an RDBMS server, a central information repository that is surrounded by
some key components that make the entire environment functional, manageable, and accessible.

There are mainly five components of Data Warehouse:

Data Warehouse Database

The central database is the foundation of the data warehousing environment. This database is implemented on
RDBMS technology. However, this kind of implementation is constrained by the fact that a traditional RDBMS
is optimized for transactional processing, not for data warehousing. For instance, ad-hoc queries, multi-table
joins, and aggregates are resource intensive and slow down performance.

Hence, alternative approaches to Database are used as listed below-

 In a datawarehouse, relational databases are deployed in parallel to allow for scalability. Parallel
relational databases also allow shared memory or shared nothing model on various multiprocessor
configurations or massively parallel processors.
 New index structures are used to bypass relational table scan and improve speed.
 Multidimensional databases (MDDBs) are used to overcome limitations imposed by the relational data
model. Example: Essbase from Oracle.

Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)

The data sourcing, transformation, and migration tools are used for performing all the conversions,
summarizations, and all the changes needed to transform data into a unified format in the datawarehouse.
They are also called Extract, Transform and Load (ETL) Tools.

Their functionality includes:

 Anonymizing data as per regulatory stipulations.
 Eliminating unwanted data in operational databases from loading into the data warehouse.
 Searching and replacing common names and definitions for data arriving from different sources.
 Calculating summaries and derived data.
 Populating missing data with defaults.
 De-duplicating repeated data arriving from multiple data sources.
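Two of those functions, de-duplication and default filling, can be sketched in a few lines. The record layout (a customer ID plus a possibly missing balance) and the default value are assumptions for illustration:

```python
def clean(rows, default_balance=0.0):
    """De-duplicate records from multiple sources and populate missing
    balances with a default. Record layout and rules are illustrative."""
    seen, out = set(), []
    for cust_id, balance in rows:
        if cust_id in seen:  # drop repeated data from another source
            continue
        seen.add(cust_id)
        out.append((cust_id, balance if balance is not None else default_balance))
    return out

rows = [(1, 100.0), (2, None), (1, 100.0), (3, 55.5)]
print(clean(rows))  # [(1, 100.0), (2, 0.0), (3, 55.5)]
```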

These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL programs, shell
scripts, etc. that regularly update data in the data warehouse. These tools are also helpful for maintaining the metadata.

These ETL Tools have to deal with challenges of Database & Data heterogeneity.

Metadata

The name metadata suggests some high-level technological concept. However, it is quite simple. Metadata is
data about data that defines the data warehouse. It is used for building, maintaining, and managing the data
warehouse.

In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source, usage,
values, and features of data warehouse data. It also defines how data can be changed and processed. It is
closely connected to the data warehouse.

For example, a line in sales database may contain:


4030 KJ732 299.90

This data is meaningless until we consult the metadata, which tells us it represents:

 Model number: 4030


 Sales Agent ID: KJ732
 Total sales amount of $299.90

Therefore, metadata is an essential ingredient in the transformation of data into knowledge.
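The sales-line example can be made concrete: metadata supplies the column names and types that turn the raw line into meaningful fields. The layout below (three whitespace-separated fields) is an assumption for illustration:

```python
# Metadata names and types each field of the raw sales line. The layout
# (three whitespace-separated fields) is assumed for illustration.
METADATA = [
    ("model_number", int),     # e.g. 4030
    ("sales_agent_id", str),   # e.g. KJ732
    ("total_sales", float),    # e.g. 299.90
]

def describe(raw_line: str) -> dict:
    """Pair each raw field with the column name the metadata defines."""
    fields = raw_line.split()
    return {name: parse(value) for (name, parse), value in zip(METADATA, fields)}

print(describe("4030 KJ732 299.90"))
# {'model_number': 4030, 'sales_agent_id': 'KJ732', 'total_sales': 299.9}
```

Without the METADATA table the line is just three tokens; with it, each token gains a name and a type.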

Metadata helps to answer the following questions

 What tables, attributes, and keys does the Data Warehouse contain?
 Where did the data come from?
 How many times has the data been reloaded?
 What transformations were applied with cleansing?

Metadata can be classified into following categories:

1. Technical Meta Data: This kind of Metadata contains information about warehouse which is used by
Data warehouse designers and administrators.
2. Business Meta Data: This kind of metadata contains detail that gives end-users an easy way to
understand the information stored in the data warehouse.

Query Tools
One of the primary objectives of data warehousing is to provide information to businesses for making strategic
decisions. Query tools allow users to interact with the data warehouse system.

These tools fall into four different categories:

1. Query and reporting tools


2. Application Development tools
3. Data mining tools
4. OLAP tools

1. Query and reporting tools:

Query and reporting tools can be further divided into

 Reporting tools
 Managed query tools

Reporting tools: Reporting tools can be further divided into production reporting tools and desktop report
writers.

1. Report writers: These are tools designed for end-users for their own analysis.
2. Production reporting: These tools allow organizations to generate regular operational reports. They
also support high-volume batch jobs such as printing and calculating. Some popular reporting tools are
Brio, Business Objects, Oracle, PowerSoft, and SAS Institute.

Managed query tools:


These access tools help end-users avoid snags in SQL and database structure by inserting a meta-layer
between users and the database.

2. Application development tools:

Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization. In such
cases, custom reports are developed using Application development tools.

3. Data mining tools:

Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large
amounts of data. Data mining tools are used to automate this process.

4. OLAP tools:

These tools are based on the concept of a multidimensional database. They allow users to analyze the data
using elaborate and complex multidimensional views.
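A toy illustration of a multidimensional view: the same fact rows can be rolled up along different dimensions to answer different questions. The region/product sample data below is invented for the sketch:

```python
from collections import defaultdict

# Roll up the same fact rows along different dimensions, giving two views
# of one small cube. The (region, product, units) sample data is invented.
facts = [
    ("North", "Sedan", 100), ("North", "SUV", 80),
    ("South", "Sedan", 60),  ("South", "SUV", 40),
]

def roll_up(facts, axis):
    """Sum the units measure along one dimension: 0 = region, 1 = product."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[axis]] += row[2]
    return dict(totals)

print(roll_up(facts, 0))  # {'North': 180, 'South': 100}
print(roll_up(facts, 1))  # {'Sedan': 160, 'SUV': 120}
```

OLAP servers precompute and index such aggregations across many dimensions at once; this sketch only shows the idea of viewing one set of facts along multiple axes.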

Data warehouse Bus Architecture


Data warehouse Bus determines the flow of data in your warehouse. The data flow in a data warehouse can be
categorized as Inflow, Upflow, Downflow, Outflow and Meta flow.

While designing a data bus, one needs to consider the shared dimensions and facts across data marts.

Data Marts

A data mart is an access layer used to get data out to the users. It is presented as an option to a large-size
data warehouse because it takes less time and money to build. However, there is no standard definition of a
data mart; it differs from person to person.

In simple words, a data mart is a subsidiary of a data warehouse. A data mart is a partition of data created for
a specific group of users.

Data marts could be created in the same database as the Datawarehouse or a physically separate Database.
