COURSE OBJECTIVES
➢ Identify the scope and necessity of data mining and data warehousing for society.
➢ Describe the various data models and design methodologies of data warehousing aimed at solving the root problems.
➢ Understand various data mining tools and their techniques for solving real-time problems.
➢ Learn how to analyze the data, identify the problems, and choose the relevant algorithms to apply.
➢ Assess the pros and cons of various algorithms and analyze their behaviour on real datasets.
B.E /B.TECH REGULAR REGULATION 2019
SCSA3001 Data Mining And Data Warehousing
SCHOOL OF COMPUTING
DATA MINING
Introduction - Steps in KDD - System Architecture - Types of Data - Data Mining Functionalities - Classification of Data Mining Systems - Integration of a Data Mining System with a Data Warehouse - Issues - Data Preprocessing - Data Mining Applications.
INTRODUCTION
What is Data?
• Data is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
Data sets are made up of data objects. A data object represents an entity—in a sales database, the
objects may be customers, store items, and sales; in a medical database, the objects may be
patients; in a university database, the objects may be students, professors, and courses. Data
objects are typically described by attributes. Data objects can also be referred to as samples,
examples, instances, data points, or objects. If the data objects are stored in a database, they
are data tuples. That is, the rows of a database correspond to the data objects, and the columns
correspond to the attributes.
Attribute:
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc.
We can say that a set of attributes used to describe a given object is known as an attribute vector (or feature vector).
Type of attributes:
This is the first step of data preprocessing. We differentiate between the different types of attributes and then preprocess the data accordingly. The attribute types are:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B)).
2. Quantitative (Discrete, Continuous)
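The two families above can be made concrete with a small sketch; all attribute names and values below are hypothetical illustrations, not data from the text:

```python
# Qualitative attributes name or rank categories; quantitative ones measure amounts.
attributes = {
    "eye_color":   {"kind": "qualitative",  "subtype": "nominal",    "values": ["brown", "blue", "green"]},
    "smoker":      {"kind": "qualitative",  "subtype": "binary",     "values": [True, False]},
    "shirt_size":  {"kind": "qualitative",  "subtype": "ordinal",    "values": ["S", "M", "L", "XL"]},
    "num_courses": {"kind": "quantitative", "subtype": "discrete",   "values": [0, 1, 2, 3]},
    "temperature": {"kind": "quantitative", "subtype": "continuous", "values": [36.4, 37.1, 38.9]},
}

def is_orderable(name):
    """Ordinal and quantitative attributes have a meaningful order; nominal and binary do not."""
    return attributes[name]["subtype"] in ("ordinal", "discrete", "continuous")
```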
Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order of the values shows what is important but does not indicate how important it is.
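Because ordinal values carry order but not magnitude, a common trick is to map them to integer ranks and use those ranks only for comparison. A minimal sketch (the grade scale is a hypothetical example):

```python
# Ranks preserve order, but differences between ranks are not meaningful distances.
GRADE_ORDER = ["poor", "fair", "good", "excellent"]  # hypothetical ordinal scale
rank = {value: i for i, value in enumerate(GRADE_ORDER)}

def better(a, b):
    """True if grade a ranks above grade b on the ordinal scale."""
    return rank[a] > rank[b]
```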
Data mining, also known as knowledge discovery in databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection.
Data selection using neural networks.
Data selection using decision trees.
Data selection using Naive Bayes.
Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure.
Data transformation is a two-step process:
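As an illustration of a typical transformation applied at this stage, here is a minimal min-max normalization sketch; the choice of normalization is an assumption for illustration, not necessarily one of the two steps referred to above:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values into [new_min, new_max], a common mining-ready form."""
    lo, hi = min(values), max(values)
    if hi == lo:  # degenerate case: all values are equal
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]
```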
SYSTEM ARCHITECTURE
Data mining is a very important process where potentially useful and previously unknown
information is extracted from large volumes of data. There are a number of components involved
in the data mining process. These components constitute the architecture of a data mining system.
Data Mining Architecture
The major components of any data mining system are data source, data warehouse server, data
mining engine, pattern evaluation module, graphical user interface and knowledge base.
a) Data Sources
The data from the various data sources needs to be cleaned and integrated. Again, more data than required will be collected from the different data sources, and only the data of interest needs to be selected and passed to the server. These processes are not as simple as they sound; a number of techniques may be performed on the data as part of cleaning, integration, and selection.
b) Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data based on the data mining request
of the user.
c) Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of a number
of modules for performing data mining tasks including association, classification,
characterization, clustering, prediction, time-series analysis etc.
d) Pattern Evaluation Module
The pattern evaluation module is mainly responsible for measuring the interestingness of the patterns using a threshold value. It interacts with the data mining engine to focus the search towards interesting patterns.
e) Graphical User Interface
The graphical user interface module communicates between the user and the data mining system.
This module helps the user use the system easily and efficiently without knowing the real
complexity behind the process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the knowledge base to make the result
more accurate and reliable. The pattern evaluation module interacts with the knowledge base on a
regular basis to get inputs and also to update it.
Summary
Each component of the data mining system has its own role and importance in completing data mining efficiently.
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
DATA MINING SYSTEM CLASSIFICATION
A data mining system can be classified according to the following criteria −
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases
mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database system
can be classified according to different criteria such as data models, types of data, etc. And the
data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a
relational, transactional, object-relational, or data warehouse mining system.
dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. User beliefs regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems. Having a data mining query language provides a foundation on which user-friendly graphical interfaces can be built.
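The support and confidence measures mentioned above can be computed directly. A minimal sketch over a small hypothetical transaction database:

```python
# Each transaction is the set of items bought together (hypothetical data).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support of the combined itemset over support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)
```

A rule such as {milk} -> {bread} would be kept only if both values clear the user-specified thresholds.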
data into partitions, which are further processed in parallel. Then the results from the partitions are merged. Incremental algorithms update databases without mining the data again from scratch.
Diverse Data Types Issues:
Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
DATA PREPROCESSING
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method
of resolving such issues. Data preprocessing prepares raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship management, as well as in rule-based applications.
Data goes through a series of steps during preprocessing:
Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts within
the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.
Data Discretization: Involves reducing the number of values of a continuous attribute by dividing the range of the attribute into intervals.
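A minimal sketch of two of the steps above, cleaning (filling missing values) and discretization, on hypothetical sensor readings; the bin boundaries are illustrative assumptions:

```python
raw = [12.0, None, 15.0, 14.0, None, 90.0]  # None marks a missing value

# Data cleaning: fill missing values with the mean of the observed ones.
observed = [v for v in raw if v is not None]
mean = sum(observed) / len(observed)
cleaned = [v if v is not None else mean for v in raw]

# Data discretization: reduce the continuous attribute to labeled bins.
def discretize(v, low=20.0, high=60.0):
    """Map a reading into one of three interval labels (boundaries are hypothetical)."""
    return "low" if v < low else "high" if v >= high else "medium"

labels = [discretize(v) for v in cleaned]
```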
Integration of a data mining system with a data warehouse:
When integrating a data mining (DM) system with database (DB) and data warehouse (DW) systems, possible integration schemes include no coupling, loose coupling, semi-tight coupling, and tight coupling. We examine each of these schemes, as follows:
1. No coupling: No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process data using
some data mining algorithms, and then store the mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or
DW system, fetching data from a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file or in a designated place in a database or
data Warehouse. Loose coupling is better than no coupling because it can fetch any portion of data
stored in databases or data warehouses by using query processing, indexing, and other system
facilities.
However, many loosely coupled mining systems are main memory-based. Because mining does
not explore data structures and query optimization methods provided by DB or DW systems, it is
difficult for loose coupling to achieve high scalability and good performance with large data sets.
3. Semi-tight coupling: Semi-tight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data mining primitives (identified by
the analysis of frequently encountered data mining functions) can be provided in the DB/DW
system. These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as sum, count, max, min, and standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of the information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system.
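Loose coupling can be sketched with Python's built-in sqlite3 standing in for the DB system: the mining side asks the database to fetch just the relevant rows via query processing, then works on them in main memory. The table and values are hypothetical:

```python
import sqlite3

# The "DB system": an in-memory relational database with a toy sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("pen", 2.0), ("pen", 3.0), ("book", 10.0)])

# Loose coupling: fetch only the portion of data we need via SQL...
rows = conn.execute("SELECT amount FROM sales WHERE item = 'pen'").fetchall()

# ...then perform the "mining" step (here just an average) in main memory.
avg_pen = sum(a for (a,) in rows) / len(rows)
```

This mirrors the text's point: the DB does retrieval, but the analysis itself runs outside it, which is why loose coupling struggles to scale to very large data sets.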
Retail Industry
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends, which leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important in helping to understand the business.
Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving quality of service. Here is a list of examples for which data mining improves telecommunication services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very
important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis −
Semantic integration of heterogeneous, distributed genomic and proteomic databases.
Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences, astronomy, etc. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modelling, chemical engineering, fluid dynamics, etc. The following are applications of data mining in the field of scientific applications −
Data Warehouses and data preprocessing.
Graph-based mining.
Visualization and domain specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become a major issue. The increased usage of the internet and the availability of tools and tricks for intruding and attacking networks have prompted intrusion detection to become a critical component of network administration.
Here is a list of areas in which data mining technology may be applied for intrusion detection −
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build discriminating
attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
There are many data mining system products and domain-specific data mining applications. New data mining systems and applications are being added to the previous systems. Also, efforts are being made to standardize data mining languages.
Choosing a Data Mining System
The selection of a data mining system depends on the following features −
Data Types − The data mining system may handle formatted text, record-based data, and relational data. The data could also be in ASCII text, relational database, or data warehouse format. Therefore, we should check the exact format the data mining system can handle.
System Issues − We must consider the compatibility of a data mining system with different
operating systems. One data mining system may run on only one operating system or on several.
There are also data mining systems that provide web-based user interfaces and allow XML data as
input.
Data Sources − Data sources refer to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files, while others work on multiple relational sources. The data mining system should also support ODBC connections or OLE DB for ODBC connections.
Data Mining functions and methodologies − Some data mining systems provide only one data mining function, such as classification, while others provide multiple data mining functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis, similarity search, etc.
Coupling data mining with databases or data warehouse systems − Data mining systems need
to be coupled with a database or a data warehouse system. The coupled components are integrated
into a uniform information processing environment. Here are the types of coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
Scalability − There are two scalability issues in data mining −
o Row (Database size) Scalability − A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute a query.
o Column (Dimension) Scalability − A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
Visualization Tools − Visualization in data mining can be categorized as follows −
o Data Visualization
o Mining Results Visualization
o Mining process visualization
o Visual data mining
Data Mining query language and graphical user interface − An easy-to-use graphical user interface is important to promote user-guided, interactive data mining. Unlike relational database systems, data mining systems do not share an underlying data mining query language.
Trends in Data Mining
Data mining concepts are still evolving, and here are the latest trends that we get to see in this field −
Application Exploration.
Scalable and interactive data mining methods.
Integration of data mining with database systems, data warehouse systems and web database
systems.
Standardization of data mining query language.
Visual data mining.
New methods for mining complex types of data.
Biological data mining.
Data mining and software engineering.
Web mining.
Distributed data mining.
Real time data mining.
Multi database data mining.
Privacy protection and information security in data mining
PART-A
1. Define data mining. List out the steps in data mining. (Remember, BTL-1)
7. Define an efficient procedure for cleaning the noisy data. (Remember, BTL-1)
PART-B
1. ii) Describe in detail the applications of data mining. (6) (Remember, BTL-1)
2. i) State and explain the various classifications of data mining systems with examples. (7)
ii) Explain the various data mining functionalities in detail. (6) (Analyze, BTL-4)
3. i) Describe the steps involved in knowledge discovery in databases (KDD). (7)
ii) Draw the diagram and describe the architecture of a data mining system. (6) (Remember, BTL-1)
3. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Education, 2007.
4. K.P. Soman, Shyam Diwakar and V. Ajay, "Insight into Data Mining: Theory and Practice", Easter Economy Edition, Prentice Hall of India, 2006.
5. G. K. Gupta, "Introduction to Data Mining with Case Studies", Easter Economy Edition, Prentice Hall of India, 2006.
6. Daniel T. Larose, "Data Mining Methods and Models", Wiley-Interscience, 2006.
DATA WAREHOUSING
This integration helps in effective analysis of data. Consistency in naming conventions, attribute
measures, encoding structure etc. has to be ensured.
Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is associated with a particular period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly. One place where data warehouse data displays time variance is in the structure of the record key. Every primary key contained within the DW should have, either implicitly or explicitly, an element of time, such as the day, week, or month. Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
Non-volatile
A data warehouse is also non-volatile, meaning the previous data is not erased when new data is entered. Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms.
Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in the data warehouse environment. Only two types of data operations are performed in data warehousing:
1. Data loading
2. Data access
Data Warehouse Architectures
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored; this is achieved by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-layer architecture physically separates the available sources from the data warehouse. This architecture is not expandable and does not support a large number of end-users. It also has connectivity problems because of network limitations.
Three-tier architecture
This is the most widely used architecture.
It consists of the Top, Middle and Bottom Tier.
1. Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.
2. Middle-Tier: The middle tier in Data warehouse is an OLAP server which is implemented
using either ROLAP or MOLAP model. For a user, this application tier presents an abstracted
view of the database. This layer also acts as a mediator between the end-user and the database.
3. Top-Tier: The top tier is a front-end client layer. It consists of the tools and APIs you use to connect to and get data out of the data warehouse: query tools, reporting tools, managed query tools, analysis tools, and data mining tools.
DATA WAREHOUSE COMPONENTS
for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are resource-intensive and slow down performance.
Hence, the alternative approaches to the database listed below are used −
In a data warehouse, relational databases are deployed in parallel to allow for scalability.
Parallel relational databases also allow shared memory or shared nothing model on various
multiprocessor configurations or massively parallel processors.
New index structures are used to bypass relational table scan and improve speed.
Use of multidimensional databases (MDDBs) to overcome any limitations placed by the relational data model. Example: Essbase from Oracle.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
The data sourcing, transformation, and migration tools are used for performing all the
conversions, summarizations, and all the changes needed to transform data into a unified format in
the data warehouse. They are also called Extract, Transform and Load (ETL) Tools.
Their functionality includes:
Anonymize data as per regulatory stipulations.
Eliminate unwanted data in operational databases from loading into the data warehouse.
Search and replace common names and definitions for data arriving from different sources.
Calculate summaries and derived data.
In case of missing data, populate them with defaults.
De-duplicate repeated data arriving from multiple data sources.
These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly update data in the data warehouse. These tools are also helpful in maintaining the metadata.
These ETL tools have to deal with the challenges of database and data heterogeneity.
Metadata
The name metadata suggests some high-level technological concept, but it is quite simple: metadata is data about data, and it defines the data warehouse. It is used for building, maintaining, and managing the data warehouse.
In the data warehouse architecture, metadata plays an important role, as it specifies the source, usage, values, and features of data warehouse data. It also defines how data can be changed and processed. It is closely connected to the data warehouse.
Consider implementing an ODS model when the information retrieval need is near the bottom of the data abstraction pyramid or when multiple operational sources need to be accessed.
One should make sure that the data model is integrated and not just consolidated; in that case, you should consider a 3NF data model. It is also ideal for acquiring ETL and data cleansing tools.
Summary:
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources.
A data warehouse is subject-oriented, as it offers information regarding a subject instead of an organization's ongoing operations.
In a data warehouse, integration means the establishment of a common unit of measure for all similar data from the different databases.
A data warehouse is also non-volatile, meaning the previous data is not erased when new data is entered.
A data warehouse is time-variant, as the data in a DW has a high shelf life.
There are 5 main components of a data warehouse: 1) Database 2) ETL Tools 3) Metadata 4) Query Tools 5) Data Marts.
There are four main categories of query tools: 1. Query and reporting tools 2. Application development tools 3. Data mining tools 4. OLAP tools.
The data sourcing, transformation, and migration tools are used for performing all the conversions and summarizations.
In the data warehouse architecture, metadata plays an important role, as it specifies the source, usage, values, and features of data warehouse data.
BUILDING A DATA WAREHOUSE
In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area
2. Transforming the transactional data
3. Loading the transformed data into a dimensional database
4. Building pre-calculated summary values to speed up report generation
5. Building (or purchasing) a front-end reporting tool
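The steps above can be sketched as a toy, in-memory walk-through of steps 1 through 4; all field names and values are hypothetical, and a real warehouse would of course use a database and ETL tooling rather than Python lists:

```python
# 1. Extract: transactional data lands in a staging area.
staging = [
    {"store": "north", "date": "2019-01-05", "amount": "100"},
    {"store": "north", "date": "2019-02-11", "amount": "250"},
    {"store": "south", "date": "2019-01-20", "amount": "80"},
]

# 2. Transform: fix types and derive a month attribute for the time dimension.
for row in staging:
    row["amount"] = float(row["amount"])
    row["month"] = row["date"][:7]  # "YYYY-MM"

# 3. Load: a fact list keyed by dimensions (store, month) with one measure.
facts = [(r["store"], r["month"], r["amount"]) for r in staging]

# 4. Precalculate a summary value per store to speed up report generation.
summary = {}
for store, _, amount in facts:
    summary[store] = summary.get(store, 0.0) + amount
```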
parts produced per hour or the number of cars rented per day). Dimensions, on the other hand, are
what your business users expect in the reports—the details about the measures. For example, the
time dimension tells the user that 2000 parts were produced between 7 a.m. and 7 p.m. on the
specific day; the plant dimension specifies that these parts were produced by the Northern plant.
Just like any modeling exercise, dimensional modeling is not to be taken lightly. Figuring out
the needed dimensions is a matter of discussing the business requirements with your users over
and over again. When you first talk to the users, they have very minimal requirements: "Just give
me those reports that show me how each portion of the company performs." Figuring out what
"each portion of the company" means is your job as a DW architect. The company may consist of
regions, each of which reports to a different vice president of operations. Each region, on the other
hand, might consist of areas, which in turn might consist of individual stores. Each store could
have several departments. When the DW is complete, splitting the revenue among the regions
won't be enough. That's when your users will demand more features and additional drill-down
capabilities. Instead of waiting for that to happen, an architect should take proactive measures to
get all the necessary requirements ahead of time.
It's also important to realize that not every field you import from each data source may fit into the
dimensional model. Indeed, if you have a sequential key on a mainframe system, it won't have
much meaning to your business users. Other columns might have had significance eons ago when
the system was built. Since then, the management might have changed its mind about the
relevance of such columns. So don't worry if not all of the columns you imported are part of your dimensional model.
Loading the Data:
After you've built a dimensional model, it's time to populate it with the data in the staging
database. This step only sounds trivial. It might involve combining several columns together or
splitting one field into several columns. You might have to perform several lookups before
calculating certain values for your dimensional model.
Keep in mind that such data transformations can be performed at either of the two stages: while
extracting the data from their origins or while loading data into the dimensional model. I wouldn't
recommend one way over the other—make a decision depending on the project. If your users need
to be sure that they can extract all the data first, wait until all data is extracted prior to
transforming it. If the dimensions are known prior to extraction, go on and transform the data
while extracting it.
Generating Precalculated Summary Values:
The next step is generating the precalculated summary values, commonly referred to
as aggregations. This step has been tremendously simplified by SQL Server Analysis Services (or
OLAP Services, as it is referred to in SQL Server 7.0). After you have populated your
dimensional database, SQL Server Analysis Services does all the aggregate generation work for
you. However, remember that depending on the number of dimensions you have in your DW,
building aggregations can take a long time. As a rule of thumb, the more dimensions you have, the
more time it'll take to build aggregations. However, the size of each dimension also plays a
significant role.
Prior to generating aggregations, you need to make an important choice about which dimensional
model to use: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), or HOLAP
(Hybrid OLAP). The ROLAP model builds additional tables for storing the aggregates, but this
takes much more storage space than a dimensional database, so be careful! The MOLAP model
stores the aggregations as well as the data in multidimensional format, which is far more efficient
than ROLAP. The HOLAP approach keeps the data in the relational format, but builds
aggregations in multidimensional format, so it's a combination of ROLAP and MOLAP.
Regardless of which dimensional model you choose, ensure that SQL Server has as much memory
as possible. Building aggregations is a memory-intensive operation, and the more memory you
provide, the less time it will take to build aggregate values.
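At its core, an aggregation is just a GROUP BY result stored ahead of time so queries don't recompute it. A minimal sketch of the idea, using Python's built-in sqlite3 and an invented sales table (Analysis Services automates the same idea at far larger scale):

```python
# Precalculated summary values are essentially GROUP BY results stored
# ahead of time. The table and figures below are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('East', 'Widget', 100.0), ('East', 'Gadget', 50.0),
        ('West', 'Widget', 75.0);
    -- The 'aggregation': a summary table precomputed from the detail rows.
    CREATE TABLE sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
print(rows)  # [('East', 150.0), ('West', 75.0)]
```

Queries against the summary table never touch the detail rows, which is exactly why building aggregations up front pays off at query time.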
Building (or Purchasing) a Front-End Reporting Tool
After you've built the dimensional database and the aggregations you can decide how
sophisticated your reporting tools need to be. If you just need the drill-down capabilities, and
your users have Microsoft Office 2000 on their desktops, the Pivot Table Service of Microsoft
Excel 2000 will do the job. If the reporting needs are more than what Excel can offer, you'll have
to investigate the alternative of building or purchasing a reporting tool. The cost of building a
custom reporting (and OLAP) tool will usually outweigh the purchase price of a third-party tool.
That is not to say that OLAP tools are cheap.
There are several major vendors on the market that have top-notch analytical tools. In addition to
the third-party tools, Microsoft has just released its own tool, Data Analyzer, which can be a cost-effective alternative. Consider purchasing one of these suites before delving into the process of
developing your own software because reinventing the wheel is not always beneficial or
affordable. Building OLAP tools is not a trivial exercise by any means.
MULTIDIMENSIONAL DATA MODEL
The multidimensional data model stores data in the form of a data cube. A data cube is not limited
to two or three dimensions, although two- and three-dimensional cubes are the easiest to visualize.
A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect
to which an organization wants to keep records. For example, in a store's sales records, dimensions
allow the store to keep track of things such as monthly sales of items and the branches and
locations at which they were sold.
A multidimensional database helps to provide data-related answers to complex business queries
quickly and accurately. Data warehouses and Online Analytical Processing (OLAP) tools are
based on a multidimensional data model. OLAP in data warehousing enables users to view data
from different angles and dimensions
The multidimensional data model is a method for organizing data in a database, with a good
arrangement and assembly of the database's contents.
The multidimensional data model allows customers to pose analytical questions associated with
market or business trends, unlike relational databases, which allow customers to access data only
in the form of row-oriented queries. It lets users receive answers to such requests rapidly, because
the data is arranged for fast comparison and examination.
OLAP (online analytical processing) and data warehousing use multidimensional databases to
show multiple dimensions of the data to users.
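The idea of viewing the same data "from different angles" can be made concrete with a toy cube. Cells are keyed by (month, item, branch) coordinates; all names and figures below are illustrative:

```python
# A toy data cube: cells keyed by (month, item, branch) coordinates.
# Viewing the data from different angles means fixing some dimensions
# and summing over the rest.
cube = {
    ("Jan", "Phone", "Delhi"): 120,
    ("Jan", "Phone", "Mumbai"): 80,
    ("Feb", "Phone", "Delhi"): 150,
    ("Jan", "Laptop", "Delhi"): 60,
}

def total(month=None, item=None, branch=None):
    """Sum the cells matching whichever dimensions are fixed."""
    return sum(v for (m, i, b), v in cube.items()
               if (month is None or m == month)
               and (item is None or i == item)
               and (branch is None or b == branch))

print(total(item="Phone"))                 # all phone sales: 350
print(total(month="Jan", branch="Delhi"))  # January sales in Delhi: 180
```

Each call to `total` corresponds to one "angle" on the cube; OLAP servers precompute and index such sums rather than scanning every cell.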
Working on a Multidimensional Data Model
The following stages should be followed by every project for building a Multi-Dimensional Data
Model:
Stage 1: Assembling data from the client: In the first stage, correct data is collected from the
client. Software professionals explain to the client what range of data can be obtained with the
selected technology, and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all of the data is
recognized and classified into the respective sections it belongs to, which also makes the model
problem-free to apply step by step.
Stage 3: Noticing the different proportions: The third stage forms the basis on which the design
of the system rests. In this stage, the main factors are recognized according to the user's point
of view. These factors are known as "Dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the
factors recognized in the previous step are used to identify their related qualities. These
qualities are known as "attributes" in the database.
Stage 5: Finding the actuality of factors which are listed previously and their qualities: In the
fifth stage, the facts are separated and differentiated from the dimensions collected so far. These
facts play a significant role in the arrangement of a multidimensional data model.
Stage 6: Building the Schema to place the data, with respect to the information collected from
the steps above: In the sixth stage, a schema is built on the basis of the data collected in the
previous stages.
For Example:
1. Let us take the example of a firm. The revenue of a firm can be analyzed on the basis of
different factors such as the geographical location of the firm's workplaces, the firm's products,
the advertisements done, the time utilized to develop a product, etc.
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP
operations in multidimensional data.
Here is the list of OLAP operations −
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works.
Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of
city to the level of country.
The data is grouped into countries rather than cities.
When roll-up is performed, one or more dimensions from the data cube are removed.
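Climbing the concept hierarchy can be sketched in a few lines: sales recorded at the city level are aggregated up to the country level. The city-to-country mapping and the sales figures below are illustrative:

```python
# Roll-up by climbing a concept hierarchy: sales recorded per city are
# aggregated to the country level. Mapping and figures are illustrative.
from collections import defaultdict

city_to_country = {"Chicago": "USA", "New York": "USA",
                   "Toronto": "Canada", "Vancouver": "Canada"}
sales_by_city = {"Chicago": 440, "New York": 1560,
                 "Toronto": 395, "Vancouver": 605}

sales_by_country = defaultdict(int)
for city, amount in sales_by_city.items():
    # Ascend the location hierarchy from city to country.
    sales_by_country[city_to_country[city]] += amount

print(dict(sales_by_country))  # {'USA': 2000, 'Canada': 1000}
```

Drill-down is simply the inverse operation: descending the hierarchy from country back to the more detailed city level.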
Generally, a data warehouse adopts a three-tier architecture. The following are the three tiers of
the data warehouse architecture.
These 3 tiers are:
1. Bottom Tier (Data warehouse server)
2. Middle Tier (OLAP server)
3. Top Tier (Front end tools)
Middle Tier − In the middle tier, we have the OLAP server, which can be implemented in either
of the following ways.
By Relational OLAP (ROLAP), which is an extended relational database management
system. ROLAP maps the operations on multidimensional data to standard relational
operations.
By the Multidimensional OLAP (MOLAP) model, which directly implements
multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of a data warehouse −
Data Warehouse Models
From the perspective of data warehouse architecture, we have the following data warehouse
models
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build
a virtual warehouse. Building a virtual warehouse requires excess capacity on operational
database servers.
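A virtual warehouse is, in essence, a set of views defined over the operational tables: no data is copied, so building one is easy, but every analytical query consumes capacity on the operational server. A minimal sketch with sqlite3 and an invented orders table:

```python
# A virtual warehouse as a view over an operational table: no data is
# copied; queries against the view run on the operational server.
# Table layout and figures are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_date TEXT, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('2024-01-05', 'Acme', 200.0), ('2024-01-20', 'Acme', 300.0),
        ('2024-02-02', 'Zenith', 150.0);
    -- The 'warehouse' is just a view over the operational data.
    CREATE VIEW monthly_revenue AS
        SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS revenue
        FROM orders GROUP BY month;
""")
rows = conn.execute(
    "SELECT month, revenue FROM monthly_revenue ORDER BY month").fetchall()
print(rows)  # [('2024-01', 500.0), ('2024-02', 150.0)]
```

This is why excess capacity on the operational servers is a prerequisite: the summarization work happens at query time, not load time.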
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific
groups of an organization.
In other words, we can say that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales. Data
marts are confined to subjects.
Points to remember about data marts −
Windows-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are
not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all of the information and subjects spanning an entire
organization.
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or
beyond.
SCHEMAS FOR MULTI-DIMENSIONAL DATA MODEL
Schema is a logical description of the entire database. It includes the name and description of
records of all record types, including all associated data items and aggregates. Much like a
database, a data warehouse also requires a schema to be maintained. A database uses the
relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.
In this chapter, we will discuss the schemas used in a data warehouse.
Star Schema
Snowflake Schema
Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note − Due to the normalization in the Snowflake schema, redundancy is reduced; therefore, it
becomes easier to maintain and saves storage space.
Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.
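The defining feature of a fact constellation is that the fact tables share dimension tables. A simplified sketch in sqlite3, with invented table layouts and figures, shows sales and shipping facts joined through their shared dimensions:

```python
# A fact constellation: two fact tables (sales and shipping) sharing
# the same dimension tables. Layouts and figures are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
    -- Two fact tables referencing the SAME dimension tables:
    CREATE TABLE sales_fact (time_key INTEGER, item_key INTEGER,
                             units_sold INTEGER);
    CREATE TABLE shipping_fact (time_key INTEGER, item_key INTEGER,
                                units_shipped INTEGER);
    INSERT INTO time_dim VALUES (1, 2024);
    INSERT INTO item_dim VALUES (10, 'Widget');
    INSERT INTO sales_fact VALUES (1, 10, 500);
    INSERT INTO shipping_fact VALUES (1, 10, 480);
""")
# The shared dimensions let us relate the two facts in one query:
row = conn.execute("""
    SELECT i.item_name, s.units_sold, sh.units_shipped
    FROM sales_fact s
    JOIN shipping_fact sh ON s.time_key = sh.time_key
                         AND s.item_key = sh.item_key
    JOIN item_dim i ON i.item_key = s.item_key
""").fetchone()
print(row)  # ('Widget', 500, 480)
```

Because the dimension tables are shared, questions spanning both business processes (here, units sold versus units shipped) need no extra reconciliation logic.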
Relational OLAP
ROLAP servers are placed between relational back-end server and client front-end tools. To store
and manage warehouse data, ROLAP uses relational or extended-relational DBMS.
ROLAP includes the following −
Implementation of aggregation navigation logic.
Optimization for each DBMS back end.
Additional tools and services.
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense
and sparse data sets.
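The sparsity problem is easy to quantify. In a dense array store, every possible cell reserves space whether or not it holds data; a coordinate-keyed map stores only the non-empty cells. The dimension sizes and cell values below are illustrative:

```python
# Why sparse data wastes space in a dense array store: with
# 3 cities x 3 items x 3 months there are 27 possible cells, but only
# a few hold data. A dict keyed by coordinates stores just the
# non-empty cells. All numbers here are illustrative.
dims = (3, 3, 3)  # city x item x month
dense_cells = dims[0] * dims[1] * dims[2]

# Only two cells actually contain measures:
sparse = {(0, 1, 2): 40, (2, 0, 1): 15}
fill_ratio = len(sparse) / dense_cells

print(dense_cells, len(sparse), round(fill_ratio, 3))  # 27 2 0.074
```

A two-level scheme applies the dense representation to well-populated subcubes and a sparse representation like this to the rest.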
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed
data to be stored in a relational store, while the aggregations are stored separately in a MOLAP
store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.
INTEGRATED OLAP AND OLAM ARCHITECTURE
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data
mining and the mining of knowledge in multidimensional databases. Here is the diagram that
shows the integration of both OLAP and OLAM.
OLAM is important for the following reasons −
High quality of data in data warehouses − The data mining tools are required to work on
integrated, consistent, and cleaned data. These steps are very costly in the preprocessing of data.
The data warehouses constructed by such preprocessing are valuable sources of high quality data
for OLAP and data mining as well.
Available information processing infrastructure surrounding data warehouses − Information
processing infrastructure refers to the accessing, integration, consolidation, and transformation of
multiple heterogeneous databases, along with web-access and service facilities, and reporting and
OLAP analysis tools.
Online selection of data mining functions − Integrating OLAP with multiple data mining
functions and online analytical mining provides users with the flexibility to select desired data
mining functions and to swap data mining tasks dynamically.
Features of OLTP and OLAP:
The major distinguishing features between OLTP and OLAP are summarized as follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals. An
OLAP system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and
an application oriented database design. An OLAP system typically adopts either a star or
snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP system
often spans multiple versions of a database schema. OLAP systems also deal with information that
originates from different organizations, integrating information from many data stores. Because of
their huge volume, OLAP data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations, although many could be complex
queries.
PART-A
6. How would you evaluate the goals of data mining? Evaluate BTL-5
7. Can you list the categories of tools in business analysis? Remember BTL-1
PART-B