Multidimensional Data Model: Characteristics of Data Warehouse
Processing Method
  Database: uses Online Transactional Processing (OLTP).
  Data warehouse: uses Online Analytical Processing (OLAP).
Usage
  Database: helps to perform fundamental operations for your business.
  Data warehouse: allows you to analyze your business.
Tables and Joins
  Database: tables and joins are complex as they are normalized.
  Data warehouse: tables and joins are simple because they are denormalized.
Storage Limit
  Database: generally limited to a single application.
  Data warehouse: stores data from any number of applications.
Modeling
  Database: ER modeling techniques are used for designing.
  Data warehouse: data modeling techniques are used for designing.
Data Type
  Database: data stored is up to date.
  Data warehouse: current and historical data is stored; it may not be up to date.
Storage of Data
  Database: a flat relational approach is used for data storage.
  Data warehouse: uses a dimensional and normalized approach for the data structure. Example: star and snowflake schema.
Query Type
  Database: simple transaction queries are used.
  Data warehouse: complex queries are used for analysis purposes.
Applications of Database
Sales & Production: used for storing customer, product, and sales details.
metadata:-
(Source: Margaret Rouse, WhatIs.com)
Metadata is data that describes other data. Meta is a prefix that in most information technology usages means "an underlying definition or description."
Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata. Having the ability to filter through that metadata makes it much easier for someone to locate a specific document.
In addition to document files, metadata is used for images, videos, spreadsheets and web pages. The use of
metadata on web pages can be very important. Metadata for web pages contains descriptions of the page’s
contents, as well as keywords linked to the content. These are usually expressed in the form of metatags. The
metadata containing the web page’s description and summary is often displayed in search results by search
engines, making its accuracy and details very important since it can determine whether a user decides to visit
the site or not. Metatags are often evaluated by search engines to help decide a web page’s relevance, and
were used as the key factor in determining position in a search until the late 1990s. The increase in search
engine optimization (SEO) towards the end of the 1990s led to many websites “keyword stuffing” their
metadata to trick search engines, making their websites seem more relevant than others. Since then search
engines have reduced their reliance on metatags, though they are still factored in when indexing pages. Many
search engines also try to halt web pages’ ability to thwart their system by regularly changing their criteria for
rankings, with Google being notorious for frequently changing its closely guarded ranking algorithms.
Metadata can be created manually, or by automated information processing. Manual creation tends to be more
accurate, allowing the user to input any information they feel is relevant or needed to help describe the file.
Automated metadata creation can be much more elementary, usually only displaying information such as file
size, file extension, when the file was created and who created the file.
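To make the idea concrete, here is a small Python sketch of automated metadata extraction; the file name is hypothetical, and the exact fields available depend on the operating system:

    import os
    import datetime

    path = "example.txt"  # hypothetical file used for illustration
    stats = os.stat(path)

    # Basic metadata maintained automatically by the file system.
    print("Size in bytes:", stats.st_size)
    print("Last modified:", datetime.datetime.fromtimestamp(stats.st_mtime))
    print("Extension:", os.path.splitext(path)[1])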
data mining:-
Data mining is the process of sorting through large data sets to identify patterns and establish relationships to
solve problems through data analysis. Data mining tools allow enterprises to predict future trends.
Data mining is considered an interdisciplinary field that joins the techniques of computer science and statistics.
Note that the term “data mining” is a misnomer: it is primarily concerned with discovering patterns and anomalies within datasets, not with the extraction of the data itself.
Data mining is the process of uncovering patterns and finding anomalies and relationships in large datasets that can be used to make predictions about future trends. The main purpose of data mining is extracting valuable information from the data.
Data mining is also actively utilized in finance. For instance, relevant techniques allow users to determine and
assess the factors that influence the price fluctuations of financial securities.
The field is rapidly evolving. New data emerges at enormously fast speeds while technological advancements
allow for more efficient ways to solve existing problems. In addition, developments in the areas of artificial intelligence and machine learning provide new paths to improving the precision and efficiency of the field.
1. Define the problem: Determine the scope of the business problem and objectives of the data
exploration project.
2. Explore the data: This step includes the exploration and collection of data that will help solve the
stated business problem.
3. Prepare the data: Clean and organize the collected data to prepare it for further modeling procedures.
4. Modeling: Create the model using data mining techniques that will help solve the stated problem.
5. Interpretation and evaluation of results: Draw conclusions from the data model and assess its
validity. Translate the results into a business decision.
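As a rough, hedged sketch of steps 3 to 5 in Python with scikit-learn (the data here is synthetic, not a real business dataset):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Steps 1-2 (define and explore) are human activities; start from data.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)

    # Step 3: prepare the data (here, a simple train/test split).
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Step 4: modeling.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Step 5: interpretation and evaluation on unseen data.
    print("Holdout accuracy:", model.score(X_test, y_test))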
Data Mining: Data Mining is the procedure to extract information from huge amounts of raw data, which can be used to take business decisions or to decide future strategies.
dbms:- The key to the difference between data management and data mining is looking at the verbs: manage and mine. Data management (and the system that supports it) is focused on
data as the object of concern. Topics like design, storage, performance, integrity, and security are central.
Data mining is focused on using the data to identify patterns that impact business concerns. Data becomes the tool rather than the
subject.
A DBMS is a product.
A DBMS (Database Management System) is a complete system used for managing digital databases that allows storage of
database content, creation/maintenance of data, search and other functionalities. On the other hand, Data Mining is a field in
computer science, which deals with the extraction of previously unknown and interesting information from raw data. Usually, the
data used as the input for the Data mining process is stored in databases. Users who are inclined toward statistics use Data Mining.
They utilize statistical models to look for hidden patterns in data. Data miners are interested in finding useful relationships between
different data elements, which is ultimately profitable for businesses.
A DBMS is a full-fledged system for housing and managing a set of digital databases, whereas Data Mining is a technique or concept in computer science that deals with extracting useful and previously unknown information from raw data. Most of the time, these raw data are stored in very large databases, so data miners use the existing functionalities of the DBMS to handle, manage, and even preprocess raw data before and during the data mining process. However, a DBMS alone cannot be used to analyze data, although some DBMSs now have built-in data analysis tools or capabilities.
Data Mining:
It is the process of finding patterns and correlations within large data sets to identify relationships between data. Data mining
tools allow a business organization to predict customer behavior. Data mining tools are used to build risk models and detect
fraud. Data mining is used in market analysis and management, fraud detection, corporate analysis and risk management.
Data warehousing is the process of extracting and storing data to allow easier reporting, while data mining is the use of pattern recognition logic to identify trends within the data. Likewise, data warehousing is the process of pooling all relevant data together, whereas data mining is the process of extracting useful data from large data sets.
Traditionally, data mining and knowledge discovery was performed manually. As time passed, the amount of data in many
systems grew to larger than terabyte size, and could no longer be maintained manually. Moreover, for the successful
existence of any business, discovering underlying patterns in data is considered essential. As a result, several software tools
were developed to discover hidden data and make assumptions, which formed a part of artificial intelligence.
The KDD process has reached its peak in the last 10 years. It now houses many different approaches to discovery, which
includes inductive learning, Bayesian statistics, semantic query optimization, knowledge acquisition for expert systems and
information theory. The ultimate goal is to extract high-level knowledge from low-level data.
KDD includes multidisciplinary activities. This encompasses data storage and access, scaling algorithms to massive data
sets and interpreting results. The data cleansing and data access process included in data warehousing facilitate the KDD
process. Artificial intelligence also supports KDD by discovering empirical laws from experimentation and observations. The
patterns recognized in the data must be valid on new data, and possess some degree of certainty. These patterns are
considered new knowledge. Steps involved in the entire KDD process are:
1. Identify the goal of the KDD process from the customer’s perspective.
2. Understand application domains involved and the knowledge that is required.
3. Select a target data set or subset of data samples on which discovery is to be performed.
4. Cleanse and preprocess data by deciding strategies to handle missing fields and alter the data as per the
requirements.
Data, in its raw form, is just a collection of things, from which little information might be derived. Together with the development of information discovery methods (data mining and KDD), the value of the data is significantly improved.
KDD is a multi-step process that encourages the conversion of data to useful information. Data mining is the pattern extraction phase of KDD, and it can take on several types, the option influenced by the desired outcomes.
Data Selection
KDD is not performed without human interaction. The choice of the data set, and of the subset of data samples, requires knowledge of the domain from which the data is to be taken. Removing non-related information elements from the dataset reduces the search space during the data mining phase of KDD. The sample size and structure are established at this point, if the dataset can be assessed using a sampling of the data.
Pre-processing
Databases may contain incorrect or missing data. During the pre-processing phase, the information is cleaned. This warrants the removal of “outliers” where appropriate; choosing strategies for handling missing data fields; accounting for time-sequence information; and applicable normalization of data.
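A minimal pandas sketch of this phase, assuming a hypothetical numeric "amount" column (the values are invented):

    import pandas as pd

    # Raw data with a missing value and an obvious outlier.
    df = pd.DataFrame({"amount": [10.0, 12.0, None, 11.0, 9000.0]})

    # Handle the missing field by imputing the median.
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Remove outliers outside 1.5 * IQR of the quartiles.
    q1, q3 = df["amount"].quantile(0.25), df["amount"].quantile(0.75)
    iqr = q3 - q1
    df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Min-max normalization to the [0, 1] range.
    rng = df["amount"].max() - df["amount"].min()
    df["amount"] = (df["amount"] - df["amount"].min()) / rng
    print(df)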
Transformation
In the transformation phase, attempts to reduce the number of data elements can be assessed while preserving the quality of the information. During this stage, information is organized, converted from one type to another (i.e. changing nominal to numeric), and new or “derived” attributes are defined.
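For instance, a nominal-to-numeric conversion might look like this pandas sketch (the column and category names are made up):

    import pandas as pd

    df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

    # One-hot encoding: each category becomes a 0/1 indicator column.
    encoded = pd.get_dummies(df, columns=["size"])

    # Derived attribute: ordinal codes when categories have a natural order.
    order = ["small", "medium", "large"]
    df["size_code"] = df["size"].map({cat: i for i, cat in enumerate(order)})
    print(encoded)
    print(df)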
Data mining
Now the data is subjected to one or several data-mining methods such as regression, classification, or clustering. The data mining part of KDD usually requires repeated iterative application of particular data mining methods. Different data-mining techniques or models can be used depending on the expected outcome.
Evaluation
The final step is documentation and interpretation of the outcomes from the previous steps. Steps during this period might consist of returning to a previous step in the KDD process to help refine the acquired knowledge, or converting the knowledge into a form understandable to the user. In this stage the extracted data patterns are visualized for further review.
Predictive Modeling
Predictive modeling is designed on a pattern similar to the human learning experience, in which observations are used to form a model of the important characteristics of some task; the model corresponds to the 'real world'. It is developed using a supervised learning approach, which has two phases: training and testing. The training phase is based on a large sample of historical data called a training set, while testing involves trying out the model on new, previously unseen data to determine its accuracy and performance characteristics.
It is commonly used in customer retention management, credit approval, cross-selling,
and direct marketing. There are two techniques associated with predictive modeling.
These are:
• Classification
• Value prediction
Classification
Classification is used to assign records to one of a finite set of possible class values. There are two specializations of classification: tree induction and neural induction.
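As a loose illustration of tree induction (this toy credit example is invented, not the figure from the original text):

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Toy training set: [age, income] -> credit approved (1) or not (0).
    X = [[25, 20000], [40, 65000], [35, 30000],
         [50, 80000], [23, 18000], [45, 70000]]
    y = [0, 1, 0, 1, 0, 1]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # Inspect the induced rules and classify a new record.
    print(export_text(tree, feature_names=["age", "income"]))
    print(tree.predict([[30, 60000]]))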
A neural network contains collections of connected nodes with input, output, and
processing at each node. Between the visible input and output layers may be a number
of hidden processing layers. Each processing unit (circle) in one layer is connected to
each processing unit in the next layer by a weighted value, expressing the strength of the
relationship. This approach is an attempt to copy the way the human brain works in
recognizing patterns by arithmetically combining all the variables associated with a
given data point.
Value prediction
It uses the traditional statistical techniques of linear regression and nonlinear
regression. These techniques are easy to use and understand. Linear regression
attempts to fit a straight line through a plot of the data, such that the line is the best
representation of the average of all observations at that point in the plot. The problem
with linear regression is that the technique only works well with linear data and is
sensitive to those data values which do not conform to the expected norm. Although
nonlinear regression avoids the main problems of linear regression, it is still not flexible
enough to handle all possible shapes of the data plot. This is where the traditional
statistical analysis methods and data mining methods begin to diverge. Applications of
value prediction include credit card fraud detection and target mailing list identification.
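A minimal sketch of linear regression with NumPy; the observations are synthetic:

    import numpy as np

    # Synthetic data with a roughly linear trend.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Fit a straight line y = slope * x + intercept (degree-1 polynomial).
    slope, intercept = np.polyfit(x, y, deg=1)
    print(f"y = {slope:.2f}x + {intercept:.2f}")

A single extreme outlier added to these points would noticeably shift the fitted line, which illustrates the sensitivity to non-conforming values discussed above.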
Database Segmentation
A segment is a group of similar records that share a number of properties. The aim of database segmentation is to partition a database into an unknown number of segments, or clusters.
This approach uses unsupervised learning to discover homogeneous sub-populations in
a database to improve the accuracy of the profiles. Applications of database
segmentation include customer profiling, direct marketing, and cross-selling.
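A hedged sketch of unsupervised segmentation with k-means clustering (the customer fields are invented for illustration):

    from sklearn.cluster import KMeans

    # Toy customer records: [annual spend, visits per month].
    customers = [[200, 2], [220, 3], [2100, 18],
                 [1900, 20], [250, 2], [2000, 19]]

    # Partition into two clusters without any labels (unsupervised).
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(km.labels_)           # cluster assignment per customer
    print(km.cluster_centers_)  # profile of each discovered segment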
data preprocessing:-
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many
errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further
processing.
Data preprocessing is used in database-driven applications such as customer relationship management, and in rule-based applications (like neural networks).
Data Cleaning: Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or
resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts within the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data Reduction: This step aims to present a reduced representation of the data in a data warehouse.
Data Discretization: Involves the reduction of the number of values of a continuous attribute by dividing its range into intervals.
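As an illustration of the discretization step, a small pandas sketch (the attribute and interval labels are hypothetical):

    import pandas as pd

    ages = pd.Series([5, 17, 25, 42, 63, 80])

    # Reduce the continuous attribute to a few labeled intervals.
    bins = pd.cut(ages, bins=[0, 18, 40, 65, 120],
                  labels=["minor", "young adult", "middle-aged", "senior"])
    print(bins)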
multidimensional data mode:-
The multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes, though cubes can have many more dimensions.
A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store sales record, dimensions allow the store to keep track of things like monthly sales of items, and the branches and locations.
A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately.
Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model. OLAP in data warehousing enables users to view data from different angles and dimensions.
data cubes:- A data cube is a three-dimensional (3D) (or higher) range of values that is generally
used to explain the time sequence of an image's data. It is a data abstraction to evaluate aggregated data from a variety of
viewpoints. It is also useful for imaging spectroscopy as a spectrally-resolved image is depicted as a 3-D volume.
A data cube can also be described as the multidimensional extensions of two-dimensional tables. It can be viewed as a
collection of identical 2-D tables stacked upon one another. Data cubes are used to represent data that is too complex to be
described by a table of columns and rows. As such, data cubes can go far beyond 3-D to include many more dimensions.
A data cube is generally used to easily interpret data. It is especially useful when representing data together with dimensions as certain measures of business requirements. Every dimension of a cube represents a certain characteristic of the database, for example, daily, monthly or yearly sales. The data included inside a data cube makes it possible to analyze almost all the figures for virtually any or all customers, sales agents, products, and much more. Thus, a data cube can help to establish trends and analyze performance.
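A hedged sketch of the idea, using a pandas pivot table as a tiny two-dimensional "cube" (the sales figures are invented):

    import pandas as pd

    sales = pd.DataFrame({
        "product": ["pen", "pen", "book", "book"],
        "region":  ["east", "west", "east", "west"],
        "amount":  [100, 150, 200, 250],
    })

    # Aggregate the 'amount' measure along product and region dimensions.
    cube = sales.pivot_table(values="amount", index="product",
                             columns="region", aggfunc="sum")
    print(cube)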
Multidimensional Data Cube: Most OLAP products are developed based on a structure where the cube is
patterned as a multidimensional array. These multidimensional OLAP (MOLAP) products usually offer improved performance when compared to other approaches, mainly because they can be indexed directly into the structure
of the data cube to gather subsets of data. When the number of dimensions is greater, the cube becomes sparser.
That means that several cells that represent particular attribute combinations will not contain any aggregated data.
This in turn boosts the storage requirements, which may reach undesirable levels at times, making the MOLAP
solution untenable for huge data sets with many dimensions. Compression techniques might help; however, their
use can damage the natural indexing of MOLAP.
Relational OLAP: Relational OLAP (ROLAP) makes use of the relational database model. The ROLAP data cube is implemented as a collection of relational tables (approximately twice as many as the number of dimensions) rather than as a multidimensional array. Each of these tables, known as a cuboid, represents a particular view.
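As a loose illustration of cuboids, each aggregation below corresponds to one relational table (one view) in a ROLAP cube; the data reuses the hypothetical sales example above:

    import pandas as pd

    sales = pd.DataFrame({
        "product": ["pen", "pen", "book", "book"],
        "region":  ["east", "west", "east", "west"],
        "amount":  [100, 150, 200, 250],
    })

    # Each grouping is one cuboid: a specific view of the same facts.
    by_product = sales.groupby("product")["amount"].sum()
    by_region = sales.groupby("region")["amount"].sum()
    by_both = sales.groupby(["product", "region"])["amount"].sum()
    apex = sales["amount"].sum()  # apex cuboid: total over all dimensions
    print(by_product, by_region, by_both, apex, sep="\n")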
What is Multidimensional schemas?
Multidimensional schema is especially designed to model data warehouse systems. The schemas are
designed to address the unique needs of very large databases designed for the analytical purpose (OLAP).
Following are the three chief types of multidimensional schemas, each having its unique advantages.
Star Schema
Snowflake Schema
Galaxy Schema
Every dimension in a star schema is represented with only one dimension table.
The dimension table should contain the set of attributes.
The dimension table is joined to the fact table using a foreign key.
The dimension tables are not joined to each other.
The fact table contains a key and measures.
The star schema is easy to understand and provides optimal disk usage.
The dimension tables are not normalized. For instance, Country_ID would not have a Country lookup table as an OLTP design would have.
The schema is widely supported by BI tools.
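A minimal sketch of the single-join star pattern, emulated with pandas merges (all table and column names are invented):

    import pandas as pd

    # One dimension table and one fact table linked by a foreign key.
    dim_product = pd.DataFrame({"product_id": [1, 2],
                                "name": ["pen", "book"]})
    fact_sales = pd.DataFrame({"product_id": [1, 1, 2],
                               "amount": [100, 150, 200]})

    # A single join connects the fact table to the dimension table.
    report = fact_sales.merge(dim_product, on="product_id")
    print(report.groupby("name")["amount"].sum())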
The dimension tables are normalized, which splits data into additional tables. For example, a Country attribute can be further normalized into an individual table.
Characteristics of Snowflake Schema:
The main benefit of the snowflake schema is that it uses smaller disk space.
It is easier to implement when a dimension is added to the schema.
Query performance is reduced because of the additional joins across multiple tables.
The primary challenge that you will face while using the snowflake schema is that you need to perform more maintenance because of the additional lookup tables.
Star Schema: Hierarchies for the dimensions are stored in the dimension table. It offers higher-performing queries using Star Join Query Optimization, and tables may be connected with multiple dimensions.
Snowflake Schema: Hierarchies are divided into separate tables. It is represented by a centralized fact table which is unlikely to be connected with multiple dimensions directly.
A Galaxy Schema contains two fact tables, for example:
1. Revenue
2. Product
The dimensions in this schema are split into separate dimensions based on the various levels of
hierarchy.
For example, if geography has four levels of hierarchy like region, country, state, and city then Galaxy
schema should have four dimensions.
Moreover, it is possible to build this type of schema by splitting the one star schema into more star schemas.
The dimensions in this schema are large, since they need to be built based on the levels of hierarchy.
This schema is helpful for aggregating fact tables for better understanding.
Snowflake schema contains fully expanded hierarchies. However, this can add complexity to the Schema and
requires extra joins. On the other hand, star schema contains fully collapsed hierarchies, which may lead to
redundancy. So, the best solution may be a balance between these two schemas which is star cluster schema
design.
Overlapping dimensions can be found as forks in hierarchies. A fork happens when an entity acts as a parent in two different dimensional hierarchies. Fork entities are then identified as classification entities with one-to-many relationships.
Summary:
Multidimensional schema is especially designed to model data warehouse systems
The star schema is the simplest type of Data Warehouse schema. It is known as star schema as its
structure resembles a star.
A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. It is called
snowflake because its diagram resembles a Snowflake.
In a star schema, only a single join creates the relationship between the fact table and any dimension tables.
Star schema contains a fact table surrounded by dimension tables.
In a snowflake schema, the fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.
A snowflake schema requires many joins to fetch the data.
A Galaxy Schema contains two fact tables that share dimension tables. It is also called a Fact Constellation Schema.
A star cluster schema contains attributes of both the star schema and the snowflake schema.
characteristics of data warehouse:-
A data warehouse serves as a single version of truth for any company, for decision making and forecasting. Its key characteristics are:
Subject-Oriented
Integrated
Time-variant
Non-volatile
Subject-Oriented
A data warehouse is subject-oriented, as it offers information regarding a theme instead of the company's ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on ongoing operations. Instead, it puts emphasis on modeling and analysis of data for decision making. It also provides a simple and concise view around a specific subject by excluding data which is not helpful to support the decision process.
Integrated
In a data warehouse, integration means the establishment of a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally acceptable manner.
A data warehouse is developed by integrating data from varied sources like a mainframe, relational databases,
flat files, etc. Moreover, it must keep consistent naming conventions, format, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures,
encoding structure etc. have to be ensured. Consider the following example:
There are three different applications labeled A, B, and C, each storing gender, date, and balance information, but each storing it in a different way. After the transformation and cleaning process, all this data is stored in a common format in the data warehouse.
Time-Variant
The time horizon for data warehouse is quite extensive compared with operational systems. The data collected
in a data warehouse is recognized with a particular period and offers information from the historical point of
view. It contains an element of time, explicitly or implicitly.
One place where data warehouse data displays time variance is in the structure of the record key. Every primary key contained in the DW should have, either implicitly or explicitly, an element of time, such as the day, week, or month.
Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
multidimensional data mode:-
Non-volatile
A data warehouse is also non-volatile, meaning the previous data is not erased when new data is entered into it.
Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms.
Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in the data warehouse environment. Only two types of data operations are performed in data warehousing:
1. Data loading
2. Data access
Here are some major differences between an operational application and a data warehouse:
Operational application: complex programs must be coded to make sure that data upgrade processes maintain high integrity of the final product.
Data warehouse: this kind of issue does not happen because data updates are not performed.
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored; this is achieved by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-layer architecture physically separates the available sources from the data warehouse itself. This architecture is not expandable and does not support a large number of end-users. It also has connectivity problems because of network limitations.
Three-tier architecture
This is the most widely used architecture. It consists of a bottom tier (the data warehouse database server), a middle tier (an OLAP server that presents a multidimensional view of the data), and a top tier (front-end client tools for querying, reporting, and analysis).
Datawarehouse Components
The data warehouse is based on an RDBMS server, which is a central information repository surrounded by some key components that make the entire environment functional, manageable, and accessible.
The central database is the foundation of the data warehousing environment. This database is implemented on RDBMS technology. However, this kind of implementation is constrained by the fact that a traditional RDBMS is optimized for transactional database processing, not for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are resource intensive and slow down performance. To address this, several approaches are used:
In a datawarehouse, relational databases are deployed in parallel to allow for scalability. Parallel
relational databases also allow shared-memory or shared-nothing models on various multiprocessor configurations or massively parallel processors.
New index structures are used to bypass relational table scan and improve speed.
Use of multidimensional databases (MDDBs) to overcome any limitations placed by the relational data model. Example: Essbase from Oracle.
The data sourcing, transformation, and migration tools are used for performing all the conversions,
summarizations, and all the changes needed to transform data into a unified format in the datawarehouse.
They are also called Extract, Transform and Load (ETL) Tools.
These Extract, Transform, and Load tools may generate cron jobs, background jobs, Cobol programs, shell
scripts, etc. that regularly update data in datawarehouse. These tools are also helpful to maintain the Metadata.
These ETL tools have to deal with the challenges of database and data heterogeneity.
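A toy ETL sketch using Python's standard library (the file names, columns, and transformation rules are all invented; real ETL tools are far more elaborate):

    import csv

    # Extract: read raw operational records from a hypothetical CSV export.
    with open("orders_raw.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: unify the date format and compute a derived total.
    for row in rows:
        row["order_date"] = row["order_date"].replace("/", "-")
        row["total"] = float(row["qty"]) * float(row["unit_price"])

    # Load: write the cleaned records to a warehouse staging file.
    with open("orders_staged.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)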
Metadata
The name metadata suggests some high-level technological concept; however, it is quite simple. Metadata is data about data which defines the data warehouse. It is used for building, maintaining, and managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source, usage,
values, and features of data warehouse data. It also defines how data can be changed and processed. It is
closely connected to the data warehouse.
A raw data record is meaningless until we consult the metadata that tells us what each value represents. Therefore, metadata is an essential ingredient in the transformation of data into knowledge. Metadata answers questions such as:
What tables, attributes, and keys does the Data Warehouse contain?
Where did the data come from?
How many times do data get reloaded?
What transformations were applied with cleansing?
1. Technical Meta Data: This kind of metadata contains information about the warehouse which is used by data warehouse designers and administrators.
2. Business Meta Data: This kind of metadata contains detail that gives end-users an easy way to understand the information stored in the data warehouse.
Query Tools
One of the primary objectives of data warehousing is to provide information to businesses for making strategic decisions. Query tools allow users to interact with the data warehouse system. These tools fall into categories such as:
Reporting tools
Managed query tools
Reporting tools: Reporting tools can be further divided into production reporting tools and desktop report writers.
1. Report writers: These are reporting tools designed for end-users to do their own analysis.
2. Production reporting: These tools allow organizations to generate regular operational reports. They also support high-volume batch jobs like printing and calculating. Some popular reporting tools are Brio, Business Objects, Oracle, PowerSoft, and SAS Institute.
Application development tools: Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization. In such cases, custom reports are developed using application development tools.
Data mining tools: Data mining is a process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data. Data mining tools are used to make this process automatic.
OLAP tools: These tools are based on concepts of a multidimensional database. They allow users to analyze the data using elaborate and complex multidimensional views.
While designing a data bus, one needs to consider the shared dimensions and facts across data marts.
Data Marts
A data mart is an access layer which is used to get data out to the users. It is presented as an option to a large data warehouse as it takes less time and money to build. However, there is no standard definition of a data mart; it differs from person to person.
In simple words, a data mart is a subsidiary of a data warehouse. The data mart is a partition of data created for a specific group of users.
Data marts can be created in the same database as the data warehouse or in a physically separate database.